Install: claude install-skill ngocsangyem/MeowKit
# mk:benchmark — Harness Canary Suite
Measures harness performance against a small set of ground-truth tasks. Provides the empirical signal that the dead-weight audit (per `.claude/rules/dead-weight-audit-rules.md`) consumes to make load-bearing decisions about each harness component.
## When to Use
Activate when:
- User runs `/mk:benchmark run` (default = quick tier, 5 tasks, ≤$5)
- User runs `/mk:benchmark run --full` (quick tier + 1 heavy task, ≤$30)
- User runs `/mk:benchmark compare <run-id-a> <run-id-b>` (delta table)
- Before applying a harness change (baseline)
- After applying a harness change (verify delta)
- During the dead-weight audit playbook (component enable/disable cycles)
Skip when:
- The harness has been run end-to-end manually within the last hour (use that data instead)
- The budget cap is hit before the suite finishes (record the partial result and raise an alert)
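
The baseline-then-verify workflow above comes down to comparing per-task results between two runs. The following is a minimal sketch of such a delta table, not the skill's actual implementation: the record fields (`passed`, `cost_usd`), the task ids, and the `delta_table` helper are all hypothetical, since the real result schema is not described in this section.

```python
from typing import Dict, List, Optional


def delta_table(run_a: Dict[str, dict], run_b: Dict[str, dict]) -> List[dict]:
    """Build one row per task comparing run A (baseline) to run B.

    Both inputs map a task id to a result dict with the hypothetical
    keys "passed" (bool) and "cost_usd" (float).
    """
    rows = []
    for task in sorted(set(run_a) | set(run_b)):
        a: Optional[dict] = run_a.get(task)
        b: Optional[dict] = run_b.get(task)
        rows.append({
            "task": task,
            "passed_a": a["passed"] if a else None,
            "passed_b": b["passed"] if b else None,
            # A cost delta is only meaningful when the task ran in both runs.
            "cost_delta_usd": round(b["cost_usd"] - a["cost_usd"], 4) if a and b else None,
        })
    return rows


# Hypothetical data: a baseline run and a run after a harness change.
baseline = {"01-task": {"passed": True, "cost_usd": 0.50}}
candidate = {
    "01-task": {"passed": False, "cost_usd": 0.75},
    "02-task": {"passed": True, "cost_usd": 0.10},
}
rows = delta_table(baseline, candidate)
```

A task that appears in only one run keeps `None` in the missing columns, so a partial run (for example, one cut short by the budget cap) can still be compared.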
## Hard Constraints
1. **Quick tier ≤$5 total cost.** Hard block if the projected cost exceeds the cap.
2. **Full tier ≤$30 total cost.** Hard block if the projected cost exceeds the cap.
3. **`--full` is opt-in.** The heavy task (`06-small-app-build`) requires the explicit `--full` flag because it triggers `mk:autobuild`, which can run for hours. The suite refuses to run this task without the flag.
4. **NOT a replacement for unit tests.** This is harness-level measurement only.
5. **Results are recorded in `trace-log.jsonl`** as `event=benchmark_result` records, tagged with `benchmark_version`, `harness_version`, and `model_version`.
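
As an illustration of constraints 1, 2, 3, and 5, the sketch below shows one way a budget gate and a result record could look. It is written under assumptions: the dollar caps, the `--full` opt-in, the event name, and the three version tags come from the list above, while the function names, the `BudgetExceeded` exception, the extra `run_id` field, and the version strings are hypothetical.

```python
import json

# Caps taken from the constraints above (USD, total per run).
TIER_CAPS_USD = {"quick": 5.0, "full": 30.0}


class BudgetExceeded(Exception):
    """Hypothetical error type used to hard-block an over-budget run."""


def select_tier(full_flag: bool) -> str:
    # The heavy task is opt-in: without --full only the quick tier runs.
    return "full" if full_flag else "quick"


def check_budget(tier: str, projected_cost_usd: float) -> None:
    """Raise if the projected cost is above the tier cap; the cap itself is allowed."""
    cap = TIER_CAPS_USD[tier]
    if projected_cost_usd > cap:
        raise BudgetExceeded(
            f"{tier} tier projected ${projected_cost_usd:.2f} exceeds cap ${cap:.2f}"
        )


def benchmark_record(benchmark_version: str, harness_version: str,
                     model_version: str, **fields) -> str:
    """Serialize one JSONL line; extra fields are hypothetical payload."""
    record = dict(fields)
    # Set the required tags last so extra fields cannot overwrite them.
    record.update({
        "event": "benchmark_result",
        "benchmark_version": benchmark_version,
        "harness_version": harness_version,
        "model_version": model_version,
    })
    return json.dumps(record)


tier = select_tier(full_flag=False)
check_budget(tier, projected_cost_usd=4.20)  # within the quick cap, no error
line = benchmark_record("b1", "h1", "m1", run_id="example-run")
```

Checking the projection before any task starts keeps the block "hard": nothing is spent on a run that is already expected to exceed its cap.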
## Subcommands
| Subcommand