build-agent-evals

Build automated evaluations for an AI agent from scratch: collecting tasks from real failures, choosing code/model/human graders, picking pass@k vs pass^k, building an isolated harness, and keeping the suite honest over time. Use this whenever someone wants to measure, benchmark, or regression-test an agent, write an eval harness for an LLM agent, decide how to grade non-deterministic output, set up an LLM-as-judge, or asks any version of "how do I know if my agent is actually getting better." Trigger even when they say "tests for my agent," "eval set," or "agent benchmark" rather than the word "evals." Not for container or resource limits making scores flaky across runs; that's calibrate-eval-infrastructure.
pebeto/agent-stdlib · ★ 0 · AI & Automation · score 70

Install: claude install-skill pebeto/agent-stdlib

# Build agent evals

Source: [Demystifying evals for AI agents](https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents). A standalone gist of this material exists but is undiscoverable; this skill packages it and adds the runnable metric script.

An eval tells you whether a change to an agent made it better or worse. Without one you are guessing from vibes, and vibes miss regressions that only show up on the tenth run. Treat the eval suite the way you treat a unit-test suite: it has an owner, it grows when bugs slip through, and it fails loudly.

## Start from real failures

Collect 20 to 50 tasks before writing any grader. The best sources are bugs your agent already produced, support tickets, and manual test cases you keep rerunning by hand.

Write each task so two experts reading it reach the same verdict on pass or fail. If you cannot decide whether an output passed, the task is underspecified and will poison every measurement built on it.

Include a reference solution for each task to prove it is solvable, and build both positive cases (the agent should do X) and negative cases (the agent should refuse, or should not touch Y). A suite made only of positive cases optimizes toward an agent that does too much.

## Choose the grader to match the task

Grade what the agent produced, not the path it took. An agent that reaches the right end state by an unusual route still passed.

- **Code-based grader.** String match, schema validation, a state check against a