evallisted

Quantitatively evaluate the search quality and efficiency of the Rill knowledge base via natural exploration vs Deep Path ground truth (ADR-050). Run monthly. Computes precision, recall, metadata_contribution, token cost. Use when the user wants to verify whether note structure actually works in practice.
rillmd/rill · ★ 6 · AI & Automation · score 78

Install: claude install-skill rillmd/rill

# /eval — Exploration Benchmark (monthly. Verify that note structure actually works in practice) **Conduct ALL conversation with the user in the language defined by `.claude/rules/personal-language.md`** (or the user's input language if absent). The English instructions below are for skill clarity, not for output style. Exceptions (only): tokens inside backticks or code blocks, proper nouns, ASCII acronyms. > **Tool references in this skill** (`sub-agent`, `shell`, `Read`, `Edit`, `Glob`, `Grep`) describe **intent**, not Claude-specific tool calls. Each harness should map them to its native equivalent — Claude Code uses its built-in Agent / Bash / Read / Edit tools as named; Codex CLI uses CSV fan-out / shell / `apply_patch` etc. as appropriate. > Workflow: `/inspect` -> `/repair` -> `/eval` (inspect -> repair -> verify) Quantitatively evaluate the search quality and efficiency of the knowledge base through the AI agent's natural exploration (ADR-050). See `eval/concept.md` for metric definitions. ## Arguments $ARGUMENTS — Optional. Can be combined. - `--reuse-deep YYYY-MM-DD` — Reuse Deep Path results from the specified date (`eval/results/YYYY-MM-DD-deep.yaml`) - `--refresh-deep` — Force regeneration of Deep Path for all queries (high cost) - `--baseline` — Also copy results to `eval/baseline.yaml` - `--compare YYYY-MM-DD` — Compare with results from the specified date and report differences - No arguments — Auto-detect the latest Deep Path. Incrementally update if n