eval-setlisted

Decide whether an automation, agent, or skill actually works, without pretending you run a test suite. Turns the output's contract into a schema, a pre-ship checklist of must-be-true invariants, and a completeness judge with a numeric gate, plus one regression case per bug already found. Use this before you trust a build, before a model upgrade silently breaks it, or before each unattended run. Produces a checks file and stops.
robdasi/skills · ★ 0 · AI & Automation · score 75

Install: claude install-skill robdasi/skills

# Eval Set "It worked when I tried it" is not a test. The model that passed your one manual check yesterday gets upgraded next week and quietly starts doing the wrong thing, and nothing tells you. But you almost certainly don't have, and don't need, a pytest suite over a Claude pipeline. The verification that actually holds in production is simpler and runs inline before anything ships: the output's shape is a contract, a short list of invariants must be true, and a judge scores whether the work is complete. This skill writes those three things for your case. Build the checks, then stop. ## Inputs (ask for whatever is missing) - **What's being checked** (required): the skill, agent, or automation, and what it produces. - **Its contract**: how you'd know it worked. If a `build-spec` exists, pull the acceptance criteria. If not, ask "what must be true about the output every time?" and write that down first. - *Optional, makes it sharper:* the output's data shape, the worst output you've seen, and any bug it shipped before. ## The method 1. **Make the shape the first check.** Write (or point at) the schema the output must satisfy — the fields, the types, the allowed enum values. An output that fails the schema fails, full stop, and you catch it at build/run time instead of in front of a client. Most silent breakage is a shape violation: an invented enum label, a missing field, a null where a value was promised. 2. **Write the pre-ship invariant checklist.** The short lis