test-effectiveness-auditor
Install: claude install-skill wan-huiyan/claude-ecosystem-hygiene
# Test Effectiveness Auditor v1.0
Answers the question: **how helpful are our automated tests at catching bugs?** Not by proxy metrics like coverage percent or test count, but by replaying real bugs that already happened and checking whether the test suite, as it stood just before the fix, actually failed on the buggy commit.
The honest baseline for "are tests worth it" is historical: bugs that made it to production despite the tests are the direct evidence of gaps; CI failures that forced a code change before merge are the direct evidence of catches. Everything else is speculation.
## Why this matters
Most teams measure test health by coverage percentage (e.g. `pytest --cov`). Coverage tells you which lines executed, not whether any assertion would have *failed* when the behavior was wrong. A line can be fully covered by a test that still passes with the bug present. This audit inverts the question: take known bugs, rewind to the pre-fix commit, and observe whether the suite catches them.
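To make the gap between coverage and bug detection concrete, here is a small illustrative sketch. The function `apply_discount` and both checks are hypothetical and do not come from this skill; in a real suite the checks would carry pytest's `test_` prefix.

```python
def apply_discount(price: float, percent: float) -> float:
    # Hypothetical buggy implementation: the intended formula is
    # price * (1 - percent / 100), but the division by 100 is missing.
    return price * (1 - percent)


def weak_test_discount() -> None:
    # Executes every line of apply_discount, so a coverage tool would
    # report it as fully covered. It only checks the return type, so it
    # passes even though the result is wrong.
    assert isinstance(apply_discount(100.0, 10.0), float)


def strong_test_discount() -> None:
    # Same coverage as the weak check, but it asserts on the behavior,
    # so it fails while the bug is present.
    assert apply_discount(100.0, 10.0) == 90.0
```

Both checks produce identical coverage numbers for `apply_discount`. Only the second one would show up as a catch when the suite is replayed against the buggy commit.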
Two methods, in priority order:
1. **Historical incident replay** (primary signal) — for each documented bug, check out the pre-fix SHA in a worktree, run the suite, observe pass/fail, and classify.
2. **CI history analysis** (secondary signal) — pull CI runs that forced a pre-merge change, classify by whether the failure represented a real logic/data/integration catch vs. noise (lint, formatting, flaky).
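The replay step in method 1 could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the skill's actual implementation: the function names, the verdict labels (`caught`, `missed`, `inconclusive`), and the default test command are all hypothetical. It relies only on `git worktree add` / `git worktree remove` and on pytest's documented exit codes (0 means all tests passed, 1 means some tests failed).

```python
import subprocess
import tempfile
from pathlib import Path


def classify_replay(exit_code: int) -> str:
    """Map a pytest exit code from the pre-fix commit to a verdict.

    The verdict labels are illustrative, not defined by the skill.
    """
    if exit_code == 0:
        return "missed"  # suite passed with the bug present
    if exit_code == 1:
        return "caught"  # at least one test failed on the buggy code
    # Other codes (interrupted, internal error, usage error, no tests
    # collected) say nothing about the bug itself.
    return "inconclusive"


def replay_incident(
    repo: Path,
    pre_fix_sha: str,
    test_cmd: tuple[str, ...] = ("python", "-m", "pytest", "-q"),
) -> str:
    """Run the suite at pre_fix_sha in a throwaway worktree."""
    with tempfile.TemporaryDirectory() as tmp:
        worktree = Path(tmp) / "replay"
        # A detached worktree leaves the main checkout untouched.
        subprocess.run(
            ["git", "-C", str(repo), "worktree", "add", "--detach",
             str(worktree), pre_fix_sha],
            check=True,
            capture_output=True,
        )
        try:
            result = subprocess.run(
                list(test_cmd), cwd=worktree, capture_output=True
            )
            return classify_replay(result.returncode)
        finally:
            # Deregister the worktree even if the test run raised.
            subprocess.run(
                ["git", "-C", str(repo), "worktree", "remove", "--force",
                 str(worktree)],
                check=False,
                capture_output=True,
            )
```

A `caught` verdict from this sketch only means that something failed at the pre-fix commit. Confirming that the failing test actually exercises the bug, rather than failing for an unrelated reason, still requires reading the test output.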
Mutation testing, test-layer ablation, and coverage-delta analysis are deliberately NOT in scope for