lfe-skill-eval
Install: claude install-skill StChiotis/Claude-LFE
# LFE Skill-Eval Runner — Measured Catch-Rate for the Reasoning Skills
## Position in Pipeline
- **Phase**: 5 (Hygiene sub-pipeline — every 3rd sweep ≈ every 15 sessions) / on-demand
- **Persona**: any (read-only on the corpus + skills; writes the scorecard + the results record only)
- **Trigger**: dispatched from the Hygiene sub-pipeline, or on demand when measuring the reasoning skills' accuracy
- **Output**: `.docs/quality/skill-eval-scorecard.md` (human) + `.claude/lib/__eval__/results.json` (machine — the pre-commit gate reads it)
## Mission
LFE leans on five prompt-based reasoning skills; this runner is the first to measure their real catch-rate. It runs each skill's exact canonical prompt against a corpus of planted-defect (known-bad) and clean (known-good) fixtures, repeats each run k times for a consistency rate, grades every output deterministically, and reports a per-skill catch-rate, false-positive rate, and saturation flag.
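As a rough illustration of the per-skill numbers described above, the sketch below aggregates already-graded runs into a catch-rate, a false-positive rate, a consistency rate, and a saturation flag. Everything in it is hypothetical and not taken from the runner itself: the `Run` record, the `score_skill` function, the fixture names, the reading of "consistency" as the share of fixtures whose k runs all agree, and the reading of "saturated" as a perfect catch-rate with zero false positives.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One graded run of a skill against one fixture (hypothetical shape)."""
    skill: str
    fixture: str
    has_defect: bool  # True for a planted-defect fixture, False for a clean one
    flagged: bool     # deterministic grader verdict: did the skill report a defect?


def score_skill(runs, saturation_threshold=1.0):
    """Aggregate graded runs for a single skill into scorecard metrics."""
    defect_runs = [r for r in runs if r.has_defect]
    clean_runs = [r for r in runs if not r.has_defect]

    # Group the k repeated verdicts per fixture to check agreement.
    verdicts = defaultdict(list)
    for r in runs:
        verdicts[r.fixture].append(r.flagged)
    consistent = sum(1 for v in verdicts.values() if len(set(v)) == 1)

    catch_rate = (
        sum(r.flagged for r in defect_runs) / len(defect_runs) if defect_runs else 0.0
    )
    fp_rate = (
        sum(r.flagged for r in clean_runs) / len(clean_runs) if clean_runs else 0.0
    )
    return {
        "catch_rate": catch_rate,
        "false_positive_rate": fp_rate,
        "consistency": consistent / len(verdicts) if verdicts else 0.0,
        # Assumed definition: the corpus no longer discriminates for this skill.
        "saturated": catch_rate >= saturation_threshold and fp_rate == 0.0,
    }


# Example with k = 2 runs per fixture (made-up data, for illustration only).
runs = [
    Run("demo-skill", "bad-1", True, True),
    Run("demo-skill", "bad-1", True, True),
    Run("demo-skill", "bad-2", True, True),
    Run("demo-skill", "bad-2", True, False),
    Run("demo-skill", "good-1", False, False),
    Run("demo-skill", "good-1", False, False),
]
card = score_skill(runs)
```

In this made-up example the skill catches three of four planted-defect runs, never flags the clean fixture, and disagrees with itself on one of three fixtures, so it would not be marked saturated under the assumed definition.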
## Why isolated subagents (ADR 98)
Each run executes in a fresh, isolated subagent context (the general-purpose Agent/Task tool). Isolation is the precondition for an honest consistency and saturation measurement: independent contexts keep each fixture's reasoning self-contained, so one fixture's output cannot leak into the next run and skew the rates. This uses the built-in general-purpose Agent tool, distinct from the project-registered specialist agents that ADR 93 found unreliable in this repo, and distinct from the in-chat sub-skill di