← ClaudeAtlas

mkbenchmarklisted

Use when measuring harness changes against ground truth — runs a small canary suite (5 quick tasks, 6 with --full) and records scores in trace-log.jsonl. Backs the dead-weight audit with measured deltas. Triggers on /mk:benchmark, "run benchmark", "measure harness", or before/after a harness change.
ngocsangyem/MeowKit · ★ 15 · AI & Automation · score 86
Install: claude install-skill ngocsangyem/MeowKit
# mk:benchmark — Harness Canary Suite Measures harness performance against a small set of ground-truth tasks. Provides the empirical signal that the dead-weight audit (per `.claude/rules/dead-weight-audit-rules.md`) consumes to make load-bearing decisions about each harness component. ## When to Use Activate when: - User runs `/mk:benchmark run` (default = quick tier, 5 tasks, ≤$5) - User runs `/mk:benchmark run --full` (quick tier + 1 heavy task, ≤$30) - User runs `/mk:benchmark compare <run-id-a> <run-id-b>` (delta table) - Before applying a harness change (baseline) - After applying a harness change (verify delta) - During the dead-weight audit playbook (component enable/disable cycles) Skip when: - The harness has been run end-to-end manually within the last hour (use that data instead) - Budget cap is hit before the suite finishes (record partial result, alert) ## Hard Constraints 1. **Quick tier ≤$5 total cost.** Hard block if projected cost exceeds. 2. **Full tier ≤$30 total cost.** Hard block if projected cost exceeds. 3. **`--full` is opt-in.** The heavy task (`06-small-app-build`) requires explicit `--full` flag because it triggers `mk:autobuild` which can run for hours. Refuses to run without the flag. 4. **NOT a replacement for unit tests.** This is harness-level measurement only. 5. **Results recorded in trace-log.jsonl** as `event=benchmark_result` records, tagged with `benchmark_version` + `harness_version` + `model_version`. ## Subcommands | Subcommand