openjudge

Build custom LLM evaluation pipelines using the OpenJudge framework. Covers selecting and configuring graders (LLM-based, function-based, agentic), running batch evaluations with GradingRunner, combining scores with aggregators, applying evaluation strategies (voting, average), auto-generating graders from data, and analyzing results (pairwise win rates, statistics, validation metrics). Use when the user wants to evaluate LLM outputs, compare multiple models, design scoring criteria, or build an automated evaluation system.

AI & Automation · 625 stars · 54 forks · Updated 1 week ago · Apache-2.0

Quality Score: 90/100

Score weights (component score shown where the page lists one):

Stars: 20%
Recency: 20%
Frontmatter: 20%
Documentation: 15% (score 100)
Issue Health: 10%
License: 10% (score 100)
Description: 5% (score 100)

Skill Content

# OpenJudge Skill

Build evaluation pipelines for LLM applications using the `openjudge` library.

## When to Use This Skill

- User wants to evaluate LLM output quality (correctness, relevance, hallucination, etc.)
- User wants to compare two or more models and rank them
- User wants to design a scoring rubric and automate evaluation
- User wants to analyze evaluation results statistically
- User wants to build a reward model or quality filter

## Sub-documents: Read When Relevant

| Topic | File | Read when… |
|-------|------|------------|
| Grader selection & configuration | `graders.md` | User needs to pick or configure an evaluator |
| Batch evaluation pipeline | `pipeline.md` | User needs to run evaluation over a dataset |
| Auto-generate graders from data | `generator.md` | No rubric yet; generate from labeled examples |
| Analyze & compare results | `analyzer.md` | User wants win rates, statistics, or metrics |

Read the relevant sub-document **before** writing any code.

## Install

```bash
pip install py-openjudge
```

## Architecture Overview

```
Dataset (List[dict])
        │
        ▼
GradingRunner  ← orchestrates everything
        │
        ├─► Grader A ──► EvaluationStrategy ──► _aevaluate() ──► GraderScore / GraderRank
        ├─► Grader B ──► EvaluationStrategy ──► _aevaluate() ──► GraderScore / GraderRank
        └─► Grader C ...
        │
        ├─► Aggregator (optional)  ← combine multiple grader scores into one
        │
        └─► RunnerResult  ← {grader_nam...
```
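The data flow in the architecture overview (dataset, graders, an evaluation strategy, an optional aggregator, per-grader results) can be illustrated with a small library-independent sketch. It deliberately does not import or call `openjudge`: every name below (`exact_match`, `length_ok`, `average_strategy`, `weighted_aggregate`, `run`) is hypothetical, and the result shape is a choice made for the sketch, not the library's `RunnerResult`. Consult `graders.md` and `pipeline.md` for the real API.

```python
import asyncio
from statistics import mean
from typing import Awaitable, Callable, Dict, List

# Hypothetical grader type: an async callable mapping one sample to a score in [0, 1].
Grader = Callable[[dict], Awaitable[float]]


async def exact_match(sample: dict) -> float:
    """Function-based grader: 1.0 if the response equals the reference."""
    return float(sample["response"].strip() == sample["reference"].strip())


async def length_ok(sample: dict) -> float:
    """Function-based grader: 1.0 if the response is at most 20 characters."""
    return float(len(sample["response"]) <= 20)


async def average_strategy(grader: Grader, sample: dict, repeats: int = 3) -> float:
    """Evaluation strategy: call the grader several times and average the scores."""
    scores = await asyncio.gather(*(grader(sample) for _ in range(repeats)))
    return mean(scores)


def weighted_aggregate(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Aggregator: weighted mean of the per-grader scores for one sample."""
    total = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total


async def run(dataset: List[dict], graders: Dict[str, Grader]) -> Dict[str, List[float]]:
    """Runner: apply every grader to every sample; results are keyed by grader name."""
    result: Dict[str, List[float]] = {}
    for name, grader in graders.items():
        result[name] = list(
            await asyncio.gather(*(average_strategy(grader, s) for s in dataset))
        )
    return result


dataset = [
    {"response": "Paris", "reference": "Paris"},
    {"response": "The capital of France is Lyon.", "reference": "Paris"},
]
graders = {"exact_match": exact_match, "length_ok": length_ok}
results = asyncio.run(run(dataset, graders))

# Combine the per-grader scores into one score per sample (illustrative weights).
weights = {"exact_match": 0.75, "length_ok": 0.25}
combined = [
    weighted_aggregate({name: results[name][i] for name in graders}, weights)
    for i in range(len(dataset))
]
```

Repeating each grader call only matters for non-deterministic graders such as LLM judges; the two function-based graders here are deterministic, so averaging leaves their scores unchanged.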

Details

Author: agentscope-ai
Repository: agentscope-ai/OpenJudge
Created: 10 months ago
Last Updated: 1 week ago
Language: Python
License: Apache-2.0
