eval-setlisted
Install: claude install-skill robdasi/skills
# Eval Set
"It worked when I tried it" is not a test. The model that passed your one manual check yesterday gets upgraded next week and quietly starts doing the wrong thing, and nothing tells you.
But you almost certainly don't have, and don't need, a pytest suite over a Claude pipeline. The verification that actually holds in production is simpler and runs inline before anything ships: the output's shape is a contract, a short list of invariants must be true, and a judge scores whether the work is complete. This skill writes those three things for your case.
Build the checks, then stop.
## Inputs (ask for whatever is missing)
- **What's being checked** (required): the skill, agent, or automation, and what it produces.
- **Its contract**: how you'd know it worked. If a `build-spec` exists, pull the acceptance criteria. If not, ask "what must be true about the output every time?" and write that down first.
- *Optional, makes it sharper:* the output's data shape, the worst output you've seen, and any bug it shipped before.
## The method
1. **Make the shape the first check.** Write (or point at) the schema the output must satisfy — the fields, the types, the allowed enum values. An output that fails the schema fails, full stop, and you catch it at build/run time instead of in front of a client. Most silent breakage is a shape violation: an invented enum label, a missing field, a null where a value was promised.
2. **Write the pre-ship invariant checklist.** The short lis