Install: `claude install-skill RBraga01/builder-ai`
# Eval Before Ship
## The Law
```
AN LLM FEATURE IS NOT READY UNTIL NUMBERS EXIST.
"It looked good when I tested it" is not an eval.
"I ran a few examples and it worked" is not an eval.
A named suite, a defined metric, a pass rate, a failure analysis,
and a baseline comparison IS an eval. All five. Not four.
```
## When to Use
Trigger before any of these:
- Merging a PR that adds or modifies a prompt
- Deploying an LLM feature to any environment users can reach
- Switching models or providers on an existing feature
- Changing retrieval logic, rerankers, or chunk strategy in a RAG pipeline
- Updating few-shot examples or system prompt structure
## When NOT to Use
- Exploratory prototypes that will not reach users (flag them: "eval required before production")
- Config-only changes that cannot affect model output (timeouts, logging, env vars)
## What Counts as an Eval
An eval must have all five components:
| Component | What It Means | What Does NOT Count |
|---|---|---|
| Named test suite | File with labelled examples in `evals/` | "I tested it manually" |
| Defined metric | Accuracy %, faithfulness score, task pass rate | "It seemed accurate" |
| Pass threshold | Explicit minimum (e.g., ≥ 85%) | No threshold = no standard |
| Failure analysis | ≥ 5 failures examined and categorised | "There were a few errors" |
| Baseline comparison | This version vs. previous version or control | No comparison at all (first release is exempt; every later release requires one) |
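
The five-component rule in the table can be checked mechanically. The following is a minimal sketch, not part of the skill itself: the `EvalReport` record, its field names, and the `missing_components` helper are all hypothetical, chosen only to mirror the table rows. It applies the "≥ 5 failures" rule literally, since the table does not say what happens when a run produces fewer than five failures.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EvalReport:
    """Hypothetical record of one eval run; field names are illustrative."""
    suite_path: Optional[str] = None        # labelled examples, e.g. a file in evals/
    metric_name: Optional[str] = None       # e.g. "accuracy"
    score: Optional[float] = None           # measured value between 0.0 and 1.0
    threshold: Optional[float] = None       # explicit minimum, e.g. 0.85
    # Category name -> number of failures examined and placed in that category.
    failure_categories: dict[str, int] = field(default_factory=dict)
    baseline_score: Optional[float] = None  # previous version or control
    first_release: bool = False             # first release is exempt from baseline


def missing_components(report: EvalReport) -> list[str]:
    """Return the names of the five components that the report lacks."""
    missing = []
    if not report.suite_path:
        missing.append("named test suite")
    if not report.metric_name or report.score is None:
        missing.append("defined metric")
    if report.threshold is None:
        missing.append("pass threshold")
    # Assumption: the rule is applied literally, so fewer than five
    # categorised failures always counts as a missing failure analysis.
    if sum(report.failure_categories.values()) < 5:
        missing.append("failure analysis")
    if report.baseline_score is None and not report.first_release:
        missing.append("baseline comparison")
    return missing


# Example: four of five failures categorised, no baseline recorded.
report = EvalReport(
    suite_path="evals/example_suite.jsonl",  # hypothetical path
    metric_name="accuracy",
    score=0.88,
    threshold=0.85,
    failure_categories={"hallucinated fact": 3, "truncated output": 1},
)
print(missing_components(report))  # ['failure analysis', 'baseline comparison']
```

An empty list means all five components are present. It does not mean the feature passed: comparing `score` against `threshold` and against `baseline_score` is a separate decision.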
## The Process
### Step 1 — Define th