eval-harness

Formal evaluation framework implementing eval-driven development (EDD) — pass/fail criteria, code/rule/model/human graders, pass@k and pass^k reliability metrics, and regression suites for LLM and agent behavior.
SilantevBitcoin/Base-system-Claude · ★ 1 · AI & Automation · score 74
Install: claude install-skill SilantevBitcoin/Base-system-Claude
# Eval Harness Skill

A formal evaluation framework for AI/LLM systems, implementing eval-driven development (EDD) principles.

## When to Activate

- Setting up eval-driven development (EDD) for AI/LLM workflows
- Defining pass/fail criteria for model or agent task completion
- Measuring reliability with pass@k metrics
- Creating regression test suites for prompt, model, or agent changes
- Benchmarking performance across model versions

## Philosophy

Eval-Driven Development treats evals as the "unit tests of AI development":

- Define expected behavior BEFORE implementation
- Run evals continuously during development
- Track regressions with each change
- Use pass@k metrics for reliability measurement

## Eval Types

### Capability Evals

Test if the system can do something it couldn't before:

```markdown
[CAPABILITY EVAL: feature-name]
Task: Description of what the system should accomplish
Success Criteria:
  - [ ] Criterion 1
  - [ ] Criterion 2
  - [ ] Criterion 3
Expected Output: Description of expected result
```

### Regression Evals

Ensure changes don't break existing functionality:

```markdown
[REGRESSION EVAL: feature-name]
Baseline: SHA or checkpoint name
Tests:
  - existing-test-1: PASS/FAIL
  - existing-test-2: PASS/FAIL
  - existing-test-3: PASS/FAIL
Result: X/Y passed (previously Y/Y)
```

## Grader Types

### 1. Code-Based Grader

Deterministic checks using code:

```bash
# Check if file contains expected pattern
grep -q "export function handleAuth" src/auth.ts &