eval-harness

Solid

Formal evaluation framework for Claude Code sessions implementing eval-driven development (EDD) principles

AI & Automation 199,470 stars 30,623 forks Updated yesterday MIT

Quality Score: 96/100

Stars (20%): 100
Recency (20%): 100
Frontmatter (20%)
Documentation (15%): 100
Issue Health (10%)
License (10%): 100
Description (5%): 100

Skill Content

# Eval Harness Skill

A formal evaluation framework for Claude Code sessions, implementing eval-driven development (EDD) principles.

## When to Activate

- Setting up eval-driven development (EDD) for AI-assisted workflows
- Defining pass/fail criteria for Claude Code task completion
- Measuring agent reliability with pass@k metrics
- Creating regression test suites for prompt or agent changes
- Benchmarking agent performance across model versions

## Philosophy

Eval-Driven Development treats evals as the "unit tests of AI development":

- Define expected behavior BEFORE implementation
- Run evals continuously during development
- Track regressions with each change
- Use pass@k metrics for reliability measurement

## Eval Types

### Capability Evals

Test if Claude can do something it couldn't before:

```markdown
[CAPABILITY EVAL: feature-name]
Task: Description of what Claude should accomplish
Success Criteria:
- [ ] Criterion 1
- [ ] Criterion 2
- [ ] Criterion 3
Expected Output: Description of expected result
```

### Regression Evals

Ensure changes don't break existing functionality:

```markdown
[REGRESSION EVAL: feature-name]
Baseline: SHA or checkpoint name
Tests:
- existing-test-1: PASS/FAIL
- existing-test-2: PASS/FAIL
- existing-test-3: PASS/FAIL
Result: X/Y passed (previously Y/Y)
```

## Grader Types

### 1. Code-Based Grader

Deterministic checks using code:

```bash
# Check if file contains expected pattern
grep -q "export function handleAuth" src/auth....
```
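The preview above names pass@k as its reliability metric but is cut off before defining it. As a rough illustration only, the sketch below uses the widely used combinatorial estimator: given `n` recorded attempts of which `c` passed, it estimates the chance that at least one of `k` randomly chosen attempts passes. The function name `pass_at_k` and its parameters are hypothetical and are not taken from the skill itself, which may define the metric differently.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n recorded attempts, c of which passed.

    Hypothetical helper (not part of the eval-harness skill): returns the
    probability that a random subset of k attempts contains at least one pass.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) counts the k-subsets made up entirely of failures;
    # it is 0 when there are fewer than k failures, giving pass@k = 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # 10 recorded attempts, 3 of which passed.
    for k in (1, 3, 5):
        print(f"pass@{k} = {pass_at_k(n=10, c=3, k=k):.3f}")
```

With 3 passes out of 10 attempts, this gives a pass@1 of 0.3 and a pass@3 of roughly 0.708, which shows how the metric rewards allowing more attempts per task.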

Details

Author: affaan-m
Repository: affaan-m/ECC
Created: 4 months ago
Last Updated: yesterday
Language: JavaScript
License: MIT

Integrates with

Anthropic · AI

Similar Skills

Semantically similar based on skill content — not just same category

AI & Automation Listed

eval-harness

Formal evaluation framework for Claude Code sessions implementing eval-driven development (EDD) principles

0 Updated yesterday

uzysjung

AI & Automation Listed

eval-harness

Formal evaluation framework for Claude Code sessions implementing eval-driven development (EDD) principles

4 Updated today

immacualate

AI & Automation Solid

eval-harness

Formal evaluation framework for Claude Code sessions implementing eval-driven development (EDD) principles

54 Updated today

arabicapp

AI & Automation Solid

eval-harness

Formal evaluation framework for Claude Code sessions implementing eval-driven development (EDD) principles

496 Updated 1 month ago

vibeeval

AI & Automation Solid

eval-harness

Evaluation harness for testing agent and skill quality through structured benchmarks, regression tests, and quality scoring.

1,160 Updated today

a5c-ai