llm-judge
Use when comparing two or more code implementations against a spec or requirements doc. Triggers on "which repo is better", "compare these implementations", "evaluate both solutions", "rank these codebases", or "judge which approach wins". Also covers choosing between competing PRs or vendor submissions solving the same problem. Does NOT review a single codebase for quality; use code review skills instead. Does NOT evaluate strategy docs; use strategy-review. Requires a spec file and 2+ repo paths.
Quality Score: 87/100
Details
- Author: existential-birds
- Repository: existential-birds/beagle
- Created: 5 months ago
- Last Updated: today
- Language: Shell
- License: Apache-2.0
Similar Skills
Ranked by semantic similarity of skill content, not just by shared category.
review-llm-artifacts
Detects common LLM coding agent artifacts by spawning four parallel subagents over the project or changed files. Scans files changed since main by default; use --all for full-project scan. Triggers on LLM cruft cleanup, agent-generated code review, dead code sweeps, test-quality passes, or when the user asks to scan the whole repo.
judge
Interactive judge of a staged git diff against the project's Accepted ADRs. Runs bin/adr-judge with the LLM pass (Claude Sonnet by default, since v0.13.0) — same engine the pre-commit hook uses, so verdicts are consistent. On violation, walks the user through three resolution paths (write a new ADR, supersede an existing ADR, fix the code). Pairs with the pre-commit hook — invoke before committing on important changes, or after the hook blocks you to drive the resolution.
evaluating-llms
Evaluate LLM systems using automated metrics, LLM-as-judge, and benchmarks. Use when testing prompt quality, validating RAG pipelines, measuring safety (hallucinations, bias), or comparing models for production deployment.
openjudge
Build custom LLM evaluation pipelines using the OpenJudge framework. Covers selecting and configuring graders (LLM-based, function-based, agentic), running batch evaluations with GradingRunner, combining scores with aggregators, applying evaluation strategies (voting, average), auto-generating graders from data, and analyzing results (pairwise win rates, statistics, validation metrics). Use when the user wants to evaluate LLM outputs, compare multiple models, design scoring criteria, or build an automated evaluation system.
judge
Research-supervisor review of program.md: validates experimental methodology (hypothesis clarity, measurement validity, control adequacy, scope, strategy fit) and emits an APPROVED / NEEDS-REVISION / BLOCKED verdict before the expensive run loop.