ln-840-benchmark-compare
SolidRuns a canonical built-in vs hex-line benchmark with scenario manifests, activation checks, and diff-based correctness. Use when measuring hex-line MCP performance against Claude built-in tools or when preparing the canonical suite for later external baselines.
Install
Quality Score: 94/100
Skill Content
Details
- Author
- levnikolaevich
- Repository
- levnikolaevich/claude-code-skills
- Created
- 7 months ago
- Last Updated
- yesterday
- Language
- JavaScript
- License
- MIT
Integrates with
Similar Skills
Semantically similar based on skill content — not just same category
benchmark
Compare Claude Code output with full config vs minimal config using standardized tasks per stack.
skill-benchmarking
Run skill benchmarks with discriminating-only assertions against evals.json for any model and any AI agent. Use when benchmarking a skill against a model not yet tested, running with_skill/without_skill eval pairs, producing benchmark-<model>.json, re-grading an existing run, adding Phase 2 model comparison results, reviewing results in the eval viewer, updating README benchmark tables, or cleaning non-discriminating assertions from evals.json. Enforces strict grader isolation (the context that generates responses never grades them) and evidence-only passing (assertions pass only on explicit content, never on implication or charity). Works with Claude Code, Gemini CLI, GitHub Copilot, Cursor, and any other AI coding assistant.
mkbenchmark
Use when measuring harness changes against ground truth — runs a small canary suite (5 quick tasks, 6 with --full) and records scores in trace-log.jsonl. Backs the dead-weight audit with measured deltas. Triggers on /mk:benchmark, "run benchmark", "measure harness", or before/after a harness change.