← ClaudeAtlas

ai-reliability-evallisted

Measures AI system reliability over time by defining pass/fail criteria before implementation, running capability checks, and tracking regression via pass@k metrics. Trigger for 'how reliable is this', 'did my changes break anything', 'measure AI performance', 'define success criteria', 'eval this feature', 'check skill regression'. Not for code correctness; use /ai-test instead. Not for quality gates; use /ai-verify instead — evals measure AI task completion consistency.
arcasilesgroup/ai-engineering · ★ 49 · AI & Automation · score 84
Install: claude install-skill arcasilesgroup/ai-engineering
# Reliability Eval ## Purpose Eval-Driven Development (EDD) treats evals as the unit tests of AI development. Define pass/fail criteria before writing code. Measure AI reliability with pass@k metrics. Track regressions across prompt, agent, and model changes. Evals answer the question: "Can the AI do this reliably?" **Key distinction**: `ai-verify` checks current code quality (linting, coverage, security). `ai-reliability-eval` measures AI reliability over time (can the agent complete this task consistently?). ## When to Use - `define`: defining pass/fail criteria before implementation (EDD principle) - `check`: running current evals and reporting status mid-implementation - `report`: generating full eval report after implementation - `regression`: ensuring changes to prompts, agents, or models don't break existing capabilities - `--skill-set`: skill-set mode — runs the optimizer over each skill's eval corpus under `.ai-engineering/evals/<skill>.jsonl` and gates pass@1 vs `.ai-engineering/evals/baseline.json`. Combine with `--regression` to fail on >5 pp pass@1 drop (sub-007 M6, D-127-07). Wired into `.github/workflows/skill-evals.yml` on PRs touching `.agents/skills/**`. ## Process ### Mode: define (Before Coding) 1. Identify the capability being built or changed 2. Write capability evals (can the AI do this new thing?) 3. Write regression evals (do existing things still work?) 4. Set success metrics (pass@k targets) 5. Store eval definition at `.ai-engineering/evals