Install: claude install-skill AlexDuchDev/agentic-product-standard
# Eval-Driven Development
Evaluation is the most critical and most under-invested practice in building agentic products. Hamel Husain and Shreya Shankar have codified the discipline; following it separates teams that ship from teams that demo.
The core insight from Husain's "Field Guide to Rapidly Improving AI Products" (drawn from his work on 30+ AI products): **the teams who succeed barely talk about tools. They obsess over measurement and iteration.**
## First principle: error analysis before infrastructure
Most teams reach for eval infrastructure (Braintrust, LangSmith, etc.) before they know what to measure. This is backwards.
**Start by reading production traces.** Read 20–50 real outputs manually after each meaningful change. Write down what went wrong in plain language. Cluster the failure modes into 5–10 named buckets. These named buckets are your eval categories — generic "helpfulness" never catches them.
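The bookkeeping for this step can stay very small. The sketch below is one possible way to tally hand-written review notes into buckets; the trace IDs, the `reviewed` list, and both helper functions are hypothetical names chosen for illustration, not part of any tool mentioned here.

```python
from collections import Counter

# Hypothetical review notes: (trace_id, bucket name), or None when the output was fine.
reviewed = [
    ("t-001", "Missed human handoff"),
    ("t-002", None),
    ("t-003", "Hallucinated citation"),
    ("t-004", "Missed human handoff"),
    ("t-005", "Stale information"),
    ("t-006", None),
]


def bucket_counts(notes):
    """Count failures per named bucket, most frequent first."""
    counts = Counter(bucket for _, bucket in notes if bucket is not None)
    return counts.most_common()


def failure_rate(notes):
    """Fraction of reviewed traces that landed in any failure bucket."""
    failed = sum(1 for _, bucket in notes if bucket is not None)
    return failed / len(notes)


if __name__ == "__main__":
    for bucket, n in bucket_counts(reviewed):
        print(f"{n:>3}  {bucket}")
    print(f"failure rate: {failure_rate(reviewed):.0%}")
```

Sorting by frequency is a deliberate choice: the largest bucket is usually the first eval worth writing.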
Common failure mode buckets (yours will be different and product-specific):
- "Missed human handoff" (agent should have escalated, didn't)
- "Wrong tool selection" (chose web_search when it should have used internal docs)
- "Stale information" (used cached/old data when fresh was required)
- "Lost context across compaction" (forgot user's earlier constraint)
- "Hallucinated citation" (made up a source URL)
Each named failure mode becomes its own eval. Generic categories such as "helpfulness" do not earn one.
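As an illustration of turning one bucket into a check, here is a minimal sketch of an eval for "Hallucinated citation". It assumes a trace is a dict with an `answer` string and a `retrieved_urls` list; that shape, the regex, and the function names are assumptions for this example, so adapt them to whatever your traces actually record.

```python
import re

# Rough URL matcher for this sketch; it stops at whitespace and common closing punctuation.
URL_PATTERN = re.compile(r"https?://[^\s)\]>\"']+")


def hallucinated_citations(answer: str, retrieved_urls: set[str]) -> list[str]:
    """Return URLs cited in the answer that were never retrieved."""
    cited = [url.rstrip(".,;:") for url in URL_PATTERN.findall(answer)]
    return [url for url in cited if url not in retrieved_urls]


def eval_hallucinated_citation(trace: dict) -> bool:
    """Pass (True) when every cited URL appears in the trace's retrieved sources."""
    return not hallucinated_citations(trace["answer"], set(trace["retrieved_urls"]))


if __name__ == "__main__":
    # Hypothetical trace: one real source, one invented one.
    trace = {
        "answer": "See https://example.com/a and https://example.com/made-up.",
        "retrieved_urls": ["https://example.com/a"],
    }
    print(eval_hallucinated_citation(trace))
    print(hallucinated_citations(trace["answer"], set(trace["retrieved_urls"])))
```

A check like this only covers citations that appear as literal URLs; other buckets, such as "Missed human handoff", will likely need a different signal.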
## The three-level eval pyramid
```
▲
╱ ╲ Level 3: Human Review
╱ ╲ Major