os-eval-runner

Stateless evaluation engine that scores and gates skill improvement iterations using headless Python evaluation scripts. Use when the user says "evaluate this skill", "run autoresearch loop on", "optimize this skill", "run the eval loop", or when another agent proposes a change and needs validation.
richfrem/agent-plugins-skills · ★ 4 · AI & Automation · score 71

Install: claude install-skill richfrem/agent-plugins-skills

# Skill Improvement Evaluator

Stateless evaluation engine that scores and gates skill improvement iterations using headless Python evaluation scripts.

---

## Ownership Boundary (Critical)

### What os-eval-runner owns (permanent, version-controlled with this skill)

- Scoring scripts: `./scripts/evaluate.py`, `./scripts/eval_runner.py`
- Scaffold script: `./scripts/init_autoresearch.py`
- Templates: `./assets/templates/autoresearch/` (program, evals, results, proposer prompt)

### What lives with the target (deployed per experiment)

All experiment state deploys alongside the target (e.g. `<experiment-dir>/references/program.md`, `<experiment-dir>/evals/evals.json`, `<experiment-dir>/evals/results.tsv`). You MUST read the spec from `<experiment-dir>/references/program.md` and NOT fall back to engine-local config templates.

---

## Phase 0: Intake Interview

Run this interview before starting any loop or evaluation. If enough information is provided in the initial prompt, skip the redundant questions.

1. **Q1: What target skill are you evaluating?** (Provide path to skill folder)
2. **Q2: Where should the experiment files live?** (Defaults to target skill directory)
3. **Q2b: What metric are you optimizing?** (quality_score, f1, precision, recall, or heuristic)
4. **Q3: What mode?** (Loop mode for autonomous improvement vs QA mode for single diff validation)
5. **Q4: (Loop mode) How many iterations?** (Default: NEVER STOP)
6. **Q5: Does evals.json exist?** (If missin