eval-judge

Use this when you need to grade a single open-ended output against a rubric (LLM-as-judge) and get a PASS/FAIL verdict with a score: deterministic when offline, backed by a real model when live. Triggers on "LLM as judge", "grade this", "rubric", "score the answer", "is this output good", "judge".
Luis247911/universal-ai-workspace-foundation · ★ 0 · AI & Automation · score 78
Install: claude install-skill Luis247911/universal-ai-workspace-foundation
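The README describes the offline mode as grading by rubric-keyword overlap. A minimal sketch of that idea follows, assuming a simple stopword filter and an all-keywords-must-match threshold. The names `STOPWORDS`, `salient_keywords`, and `mock_judge` are hypothetical and are not taken from the skill's source; only the result keys `verdict`, `score`, `detail`, and `mock` come from the README.

```python
import json
import re

# Hypothetical stopword list: how the real skill decides which rubric
# words are "salient" is not documented, so this is an assumption.
STOPWORDS = {
    "a", "an", "the", "and", "or", "is", "are", "both",
    "of", "to", "in", "mentions",
}


def salient_keywords(rubric: str) -> set[str]:
    """Lowercase the rubric, split it into words, and drop stopwords."""
    words = re.findall(r"[a-z0-9]+", rubric.lower())
    return {w for w in words if w not in STOPWORDS}


def mock_judge(rubric: str, output: str, threshold: float = 1.0) -> dict:
    """Grade by keyword overlap: same inputs always give the same verdict."""
    keywords = salient_keywords(rubric)
    output_words = set(re.findall(r"[a-z0-9]+", output.lower()))
    hits = keywords & output_words
    # Score is the fraction of salient rubric keywords found in the output.
    score = len(hits) / len(keywords) if keywords else 0.0
    verdict = "PASS" if keywords and score >= threshold else "FAIL"
    return {
        "verdict": verdict,
        "score": score,
        "detail": f"matched {sorted(hits)} of {sorted(keywords)}",
        "mock": True,  # marks the result as coming from the offline path
    }


result = mock_judge(
    "mentions both cost and latency",
    "use caching to cut latency and cost",
)
print(json.dumps(result))
```

Because the sketch uses only set operations on normalized words, it needs no API key and is stable across runs, which is the property the README relies on for CI. It cannot judge tone or helpfulness; per the README, that is the job of the live path.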
# eval-judge

The shared **LLM-as-judge**. Given one output and a rubric, it returns PASS/FAIL with a score. Offline it grades deterministically by rubric-keyword overlap (so CI is green without a key); under `UAW_LLM=live` it asks a real model with a reason-then-decide prompt. It is the single judge that [[eval-loop-builder]] and [[orchestrator-patterns]] both rely on, not a second copy.

## When to use

- Grading a subjective quality that no `exact`/`regex` check can express (tone, completeness, helpfulness).
- The scoring step inside an evaluator-optimizer loop.
- A one-off "is this answer good enough?" check.

## Run it

```
python -m harness.eval judge --rubric "mentions both cost and latency" --output "use caching to cut latency and cost"
python -m harness.eval judge --rubric "is a polite refusal" --output-file reply.txt
python .claude/skills/eval-judge/scripts/run.py judge --rubric "..." --output "..."
```

Prints JSON `{verdict, score, detail, mock}` and exits non-zero on FAIL.

## How it judges

- **Offline (mock)**: PASS if salient rubric keywords appear in the output. Deterministic: same inputs, same verdict. Good enough to wire the plumbing and keep CI green.
- **Live (`UAW_LLM=live` + `[llm]` extra + key)**: a strict evaluator prompt that reasons internally, then answers exactly PASS or FAIL. The reasoning is discarded; only the verdict is scored.

## Judge design rules

1. Make the rubric **specific and checkable** ("mentions X and Y"), not vague ("is good").
2