Install: `claude install-skill pebeto/agent-stdlib`
# Build agent evals
Source: [Demystifying evals for AI agents](https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents). A standalone gist of this material exists but is undiscoverable; this skill packages it and adds the runnable metric script.
An eval tells you whether a change to an agent made it better or worse. Without one you are guessing from vibes, and vibes miss regressions that only show up on the tenth run. Treat the eval suite the way you treat a unit-test suite: it has an owner, it grows when bugs slip through, and it fails loudly.
## Start from real failures
Collect 20 to 50 tasks before writing any grader. The best sources are bugs your agent already produced, support tickets, and manual test cases you keep rerunning by hand. Write each task so two experts reading it reach the same verdict on pass or fail. If you cannot decide whether an output passed, the task is underspecified and will poison every measurement built on it.
Include a reference solution for each task to prove it is solvable, and build both positive cases (the agent should do X) and negative cases (the agent should refuse, or should not touch Y). A suite made only of positive cases optimizes toward an agent that does too much.
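One way to keep these requirements honest is to give each task a fixed shape and lint the suite before any grader exists. The sketch below is a minimal illustration, not part of the source article or this skill's script: the `EvalTask` fields, the `validate_suite` helper, and the `min_tasks` default are all assumed names chosen for the example.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class EvalTask:
    """One eval task. Every field name here is a hypothetical choice."""
    task_id: str
    prompt: str              # what the agent is asked to do
    pass_criteria: str       # rule that two experts would apply identically
    reference_solution: str  # evidence that the task is solvable
    kind: Literal["positive", "negative"]
    source: str = "manual"   # e.g. "bug", "support_ticket", "manual"


def validate_suite(tasks: list[EvalTask], min_tasks: int = 20) -> list[str]:
    """Return a list of problems; an empty list means no issue was found."""
    problems = []
    if len(tasks) < min_tasks:
        problems.append(f"only {len(tasks)} tasks, want at least {min_tasks}")
    for task in tasks:
        if not task.pass_criteria.strip():
            problems.append(f"{task.task_id}: underspecified, no pass criteria")
        if not task.reference_solution.strip():
            problems.append(f"{task.task_id}: no reference solution")
    kinds = {task.kind for task in tasks}
    for required in ("positive", "negative"):
        if required not in kinds:
            problems.append(f"suite has no {required} cases")
    return problems
```

Running `validate_suite` in CI before the graders run means an underspecified task is caught as a suite defect rather than showing up later as a noisy pass rate.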
## Choose the grader to match the task
Grade what the agent produced, not the path it took. An agent that reaches the right end state by an unusual route still passed.
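As a rough illustration of outcome-based grading, the sketch below compares only the final state against the expected state and never reads the list of steps. The `grade_end_state` function and the shape of the run records are assumptions made for this example, not an API from the source article.

```python
def grade_end_state(final_state: dict, expected: dict) -> bool:
    """Pass if every expected key is present with the expected value.

    Extra keys in final_state are ignored, and the trajectory is never read.
    """
    return all(
        key in final_state and final_state[key] == value
        for key, value in expected.items()
    )


expected = {"tests_pass": True, "config_updated": True}

# Two hypothetical runs that reach the same end state by different routes.
run_direct = {
    "steps": ["edit_config"],
    "final_state": {"tests_pass": True, "config_updated": True},
}
run_detour = {
    "steps": ["search", "read_file", "edit_config", "run_tests"],
    "final_state": {"tests_pass": True, "config_updated": True, "notes": "x"},
}
run_failed = {
    "steps": ["edit_config"],
    "final_state": {"tests_pass": False, "config_updated": True},
}

results = {
    name: grade_end_state(run["final_state"], expected)
    for name, run in [("direct", run_direct), ("detour", run_detour), ("failed", run_failed)]
}
```

Here `results` marks both the direct and the detour run as passing, because the grader has no access to `steps` at all; only the run whose end state is wrong fails.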
- **Code-based grader.** String match, schema validation, a state check against a