benchmarklisted
Install: claude install-skill allsmog/kuzushi-security-plugin
# Benchmark
You can't call bug-finding "world-class" — or catch a regression in it — without a
number. `/benchmark` scores findings against ground truth and reports the three metrics
that matter: **recall** (are we missing bugs?), **precision** (do we cry wolf?), and
**falseProofRate** (did we *prove* a non-bug? — the soundness failure differential
testing guards).
## Run it
- **Bundled corpus (regression):**
`node "${CLAUDE_PLUGIN_ROOT}/scripts/cmd/benchmark.mjs"`
scores every case under `bench/cases/` using its recorded `findings.json`. Add
`--case <name>` for one case.
- **A live run:**
`node "${CLAUDE_PLUGIN_ROOT}/scripts/cmd/benchmark.mjs" --target "<repo>" --ground-truth "<manifest.json>"`
scores `<repo>/.kuzushi/findings.json` after you've run the pipeline.
Flags: `--strict` (an active finding matching no expectation counts as a false positive —
only fair when the manifest is exhaustive), `--line-tolerance N` (default 5),
`--no-match-cwe` (match on file+line only).
## Ground-truth manifest
`{ "expectations": [ { "id", "kind": "vuln" | "safe", "cwe", "filePath", "line" } ] }`.
A `vuln` is a real bug the tool **should** find; a `safe` is a decoy that looks like one
and **must not** be flagged. A decoy that gets an active finding is a false positive; a
decoy that gets a *proven* finding is a false proof. Author manifests from confirmed bugs
(and their guarded siblings) so the corpus encodes both recall and precision pressure.
## Reading the result
`corpu