Install: claude install-skill jacob-balslev/skills
# Eval-Driven Development
## Coverage
The practice of building language-model-integrated systems by writing evaluations before and alongside the system, and using the eval suite's aggregated pass-rate signal to gate every change. Covers the statistical (not binary) nature of LLM evaluation, the five primitives (dataset, evaluation function, aggregation, iteration loop, regression budget), the judgment-mechanism taxonomy (programmatic / model-graded / human-graded / hybrid), the distinction between system-specific evals and canonical public benchmarks (MMLU, HumanEval, BIG-bench, GAIA, MT-Bench), why higher scores are not always improvements (Goodhart's Law), the difference between offline evals and production telemetry, and the eval-lifecycle archetypes (acceptance, regression, calibration, red-team, cross-model).
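The five primitives can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the names `Case`, `exact_match`, `pass_rate`, and `gate`, the toy dataset, and the stub systems are all hypothetical, and the regression budget is interpreted here as the largest drop in aggregate pass rate a change may cause and still be accepted.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Case:
    """One dataset row: an input and the answer the grader expects."""
    prompt: str
    expected: str


# Primitive 1: dataset (a toy stand-in; a real one is sampled from production).
DATASET = [
    Case("2+2", "4"),
    Case("capital of France", "Paris"),
    Case("3*3", "9"),
    Case("opposite of hot", "cold"),
]


# Primitive 2: evaluation function (a programmatic grader in this sketch).
def exact_match(output: str, case: Case) -> bool:
    return output.strip().lower() == case.expected.strip().lower()


# Primitive 3: aggregation (fraction of graded trials that pass).
def pass_rate(
    system: Callable[[str], str],
    dataset: Sequence[Case],
    grader: Callable[[str, Case], bool],
    trials: int = 1,
) -> float:
    results = [
        grader(system(case.prompt), case)
        for case in dataset
        for _ in range(trials)  # repeat each case because outputs are stochastic
    ]
    return sum(results) / len(results)


# Primitive 5: regression budget (assumed meaning: maximum tolerated drop).
def gate(baseline: float, candidate: float, regression_budget: float = 0.02) -> bool:
    return candidate >= baseline - regression_budget


# Stub systems standing in for two versions of a prompt or model.
def baseline_system(prompt: str) -> str:
    answers = {"2+2": "4", "capital of France": "Paris", "3*3": "9"}
    return answers.get(prompt, "")


def candidate_system(prompt: str) -> str:
    answers = {"2+2": "4", "capital of France": "paris",
               "3*3": "9", "opposite of hot": "cold"}
    return answers.get(prompt, "")


if __name__ == "__main__":
    # Primitive 4: iteration loop (change the system, re-run, compare, gate).
    base = pass_rate(baseline_system, DATASET, exact_match)
    cand = pass_rate(candidate_system, DATASET, exact_match)
    print(f"baseline={base:.2f} candidate={cand:.2f} ship={gate(base, cand)}")
```

The gate compares aggregate rates rather than individual cases, which is the statistical framing described above: a change may flip some cases from pass to fail and still ship, as long as the aggregate stays within the budget.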
## Philosophy
Building LLM-integrated systems without evals is like shipping airplanes based on how good the model feels at the desk. The system's behavior is stochastic, the input space is open-ended, and the developer's pet examples are not a representative sample of what users will throw at it. An eval suite is the empirical measurement instrument that lets a team distinguish "the new prompt works better" from "the new prompt works better on the five examples I happened to try."
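One way to see why five hand-picked examples are weak evidence is to put a confidence interval around the observed pass rate. The sketch below uses the Wilson score interval, which is a choice made for this illustration and not a method the text prescribes; the function name `wilson_interval` and the default `z = 1.96` (roughly a 95% interval) are assumptions. It also treats trials as independent draws from production traffic, which hand-picked examples are not, so the real uncertainty is larger still.

```python
import math


def wilson_interval(passes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate observed as `passes` out of `trials`."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = passes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    for passes, trials in [(5, 5), (50, 50), (500, 500)]:
        low, high = wilson_interval(passes, trials)
        print(f"{passes}/{trials}: plausible pass rate between {low:.3f} and {high:.3f}")
```

Under these assumptions, a perfect 5 out of 5 is still consistent with a true pass rate of roughly 57%, while 500 out of 500 narrows the lower bound to above 99%. The same observed score carries very different weight depending on how many cases sit behind it.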
The discipline's hard part is not writing evals. It is choosing what to measure, encoding the choice into a grader the team agrees with, sampling a dataset that represents production, and