lfe-skill-eval

Skill-accuracy eval runner. Executes each LFE reasoning skill (security / perf / complexity / mutation + plan-critique) against the _evals corpus in isolated subagents, k times each, grades every run with the deterministic grader, and renders the skill-accuracy scorecard + machine results record. Dispatched from the Hygiene sub-pipeline or on demand; framework-dispatched (agent-only, outside the Brain-typeable set).
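To make the description above concrete, here is a minimal Python sketch of how k graded runs per fixture could be rolled up into per-skill rates. It is not the runner's actual code: the `Run` record, its field names, the `summarize` helper, and the exact rate definitions (catch-rate as the flagged share of known-bad runs, false-positive rate as the flagged share of known-good runs, consistency as the share of fixtures whose k runs all agree) are assumptions made for illustration. The saturation flag is left out because its rule is not stated here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One graded run (hypothetical record shape, not the real results schema)."""
    skill: str        # e.g. "security"
    fixture: str      # fixture identifier (illustrative naming)
    known_bad: bool   # True for planted-defect fixtures, False for clean ones
    flagged: bool     # grader verdict: did the skill report a defect?


def summarize(runs):
    """Aggregate graded runs into assumed per-skill rates."""
    by_skill = {}
    for run in runs:
        by_skill.setdefault(run.skill, []).append(run)

    summary = {}
    for skill, skill_runs in by_skill.items():
        bad = [r for r in skill_runs if r.known_bad]
        good = [r for r in skill_runs if not r.known_bad]

        # Collect the set of distinct verdicts seen for each fixture;
        # a fixture is "consistent" when all of its k runs agree.
        verdicts = {}
        for r in skill_runs:
            verdicts.setdefault(r.fixture, set()).add(r.flagged)
        consistent = sum(1 for seen in verdicts.values() if len(seen) == 1)

        summary[skill] = {
            "catch_rate": sum(r.flagged for r in bad) / len(bad) if bad else None,
            "false_positive_rate": sum(r.flagged for r in good) / len(good) if good else None,
            "consistency_rate": consistent / len(verdicts),
        }
    return summary
```

Under these assumed definitions, a skill that flags three of four known-bad runs and none of its clean runs would score a catch-rate of 0.75 and a false-positive rate of 0.0.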
StChiotis/Claude-LFE · ★ 2 · AI & Automation · score 68

Install: claude install-skill StChiotis/Claude-LFE

# LFE Skill-Eval Runner: Measured Catch-Rate for the Reasoning Skills

## Position in Pipeline

- **Phase**: 5 (Hygiene sub-pipeline, every 3rd sweep ≈ every 15 sessions) / on-demand
- **Persona**: any (read-only on the corpus + skills; writes the scorecard + the results record only)
- **Trigger**: dispatched from the Hygiene sub-pipeline, or on demand when measuring the reasoning skills' accuracy
- **Output**: `.docs/quality/skill-eval-scorecard.md` (human) + `.claude/lib/__eval__/results.json` (machine; the pre-commit gate reads it)

## Mission

LFE leans on five prompt-based reasoning skills; this runner is the first to measure their real catch-rate. It runs each skill's exact canonical prompt against a corpus of planted-defect (known-bad) and clean (known-good) fixtures, repeats each run k times for a consistency rate, grades every output deterministically, and reports a per-skill catch-rate, false-positive rate, and saturation flag.

## Why isolated subagents (ADR 98)

Each run executes in a fresh, isolated subagent context (the general-purpose Agent/Task tool). Isolation is the precondition for an honest consistency + saturation measurement: independent contexts keep each fixture's reasoning self-contained, so one fixture's output stays clear of the next and the rate stays honest. This uses the built-in general-purpose Agent tool, distinct from the project-registered specialist agents that ADR 93 found unreliable in this repo, and distinct from the in-chat sub-skill di