llm-judge

Solid

Use when comparing two or more code implementations against a spec or requirements doc. Triggers on "which repo is better", "compare these implementations", "evaluate both solutions", "rank these codebases", or "judge which approach wins". Also covers choosing between competing PRs or vendor submissions solving the same problem. Does NOT review a single codebase for quality — use code review skills instead. Does NOT evaluate strategy docs — use strategy-review. Requires a spec file and 2+ repo paths.

AI & Automation · 61 stars · 8 forks · Updated today · Apache-2.0

Quality Score: 87/100

Stars 20%
Recency 20%: 100
Frontmatter 20%
Documentation 15%: 100
Issue Health 10%
License 10%: 100
Description 5%: 100

Skill Content

# LLM Judge

Compare code implementations across multiple repositories using structured evaluation.

## Usage

```bash
/beagle-analysis:llm-judge <spec> <repo1> <repo2> [repo3...] [--labels=...] [--weights=...] [--branch=...]
```

## Arguments

| Argument | Required | Description |
|----------|----------|-------------|
| `spec` | Yes | Path to spec/requirements document |
| `repos` | Yes | 2+ paths to repositories to compare |
| `--labels` | No | Comma-separated labels (default: directory names) |
| `--weights` | No | Override weights, e.g. `functionality:40,security:30` |
| `--branch` | No | Branch to compare against main (default: `main`) |

## Workflow

1. Parse `$ARGUMENTS` into `spec_path`, `repo_paths`, `labels`, `weights`, and `branch`.
2. Validate the spec file, each repo path, and the minimum repo count.
3. Read the spec document into memory.
4. Load this skill and the supporting reference files.
5. Spawn one Phase 1 repo agent per repository to gather facts only.
6. Validate the repo-agent JSON results before proceeding.
7. Spawn one Phase 2 judge agent per dimension.
8. Aggregate scores, compute weighted totals, rank repos, and write the report.
9. Display the markdown summary and verify the JSON report.

## Hard gates

Sequenced workflow: **do not start the next phase until the current gate passes.** Each pass condition must be checkable (file on disk, non-empty content, or `json.load` succeeds), not “I reviewed internally.”

| Gate | Pass condition | Unblocks |
|-...
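
Workflow step 8 (aggregate scores, compute weighted totals, rank repos) could look roughly like the Python sketch below. It is not the skill's actual implementation. The function names, the score scale, the choice to normalize by the sum of the weights, and the treatment of a missing dimension as 0 are all assumptions made for illustration. Only the `functionality:40,security:30` weight format comes from the arguments table.

```python
from typing import Dict, List, Tuple


def parse_weights(flag: str) -> Dict[str, float]:
    """Parse a value such as 'functionality:40,security:30' into a dict.

    Hypothetical helper: the skill's real parser is not shown in the source.
    """
    weights: Dict[str, float] = {}
    for item in flag.split(","):
        name, sep, value = item.strip().partition(":")
        if not sep or not name.strip():
            raise ValueError(f"malformed weight entry: {item!r}")
        weights[name.strip()] = float(value)
    return weights


def weighted_total(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of per-dimension scores.

    Assumption: totals are normalized by the weight sum, and a dimension
    missing from `scores` counts as 0.
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores.get(dim, 0.0) * w for dim, w in weights.items()) / total_weight


def rank_repos(
    all_scores: Dict[str, Dict[str, float]], weights: Dict[str, float]
) -> List[Tuple[str, float]]:
    """Return (label, total) pairs sorted from highest to lowest total."""
    totals = [(label, weighted_total(s, weights)) for label, s in all_scores.items()]
    return sorted(totals, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Made-up scores for two hypothetical repos; not real evaluation output.
    weights = parse_weights("functionality:40,security:30")
    scores = {
        "repo-a": {"functionality": 4, "security": 2},
        "repo-b": {"functionality": 3, "security": 5},
    }
    for label, total in rank_repos(scores, weights):
        print(f"{label}: {total:.2f}")
```

Normalizing by the weight sum keeps totals on the same scale as the per-dimension scores even when a `--weights` override does not add up to 100.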

Details

Author: existential-birds
Repository: existential-birds/beagle
Created: 5 months ago
Last Updated: today
Language: Shell
License: Apache-2.0

Integrates with

Anthropic · AI

Similar Skills

Semantically similar based on skill content — not just same category

AI & Automation Solid

review-llm-artifacts

Detects common LLM coding agent artifacts by spawning four parallel subagents over the project or changed files. Scans files changed since main by default; use --all for full-project scan. Triggers on LLM cruft cleanup, agent-generated code review, dead code sweeps, test-quality passes, or when the user asks to scan the whole repo.

61 Updated today

existential-birds

AI & Automation Listed

judge

Interactive judge of a staged git diff against the project's Accepted ADRs. Runs bin/adr-judge with the LLM pass (Claude Sonnet by default, since v0.13.0) — same engine the pre-commit hook uses, so verdicts are consistent. On violation, walks the user through three resolution paths (write a new ADR, supersede an existing ADR, fix the code). Pairs with the pre-commit hook — invoke before committing on important changes, or after the hook blocks you to drive the resolution.

0 Updated yesterday

rvdbreemen

AI & Automation Listed

evaluating-llms

Evaluate LLM systems using automated metrics, LLM-as-judge, and benchmarks. Use when testing prompt quality, validating RAG pipelines, measuring safety (hallucinations, bias), or comparing models for production deployment.

368 Updated 5 months ago

ancoleman

AI & Automation Solid

openjudge

Build custom LLM evaluation pipelines using the OpenJudge framework. Covers selecting and configuring graders (LLM-based, function-based, agentic), running batch evaluations with GradingRunner, combining scores with aggregators, applying evaluation strategies (voting, average), auto-generating graders from data, and analyzing results (pairwise win rates, statistics, validation metrics). Use when the user wants to evaluate LLM outputs, compare multiple models, design scoring criteria, or build an automated evaluation system.

633 Updated 3 days ago

agentscope-ai

AI & Automation Listed

judge

Research-supervisor review of program.md — validates experimental methodology (hypothesis clarity, measurement validity, control adequacy, scope, strategy fit), emits APPROVED / NEEDS-REVISION / BLOCKED verdict before expensive run loop.

17 Updated 2 days ago

Borda