openjudge

Solid

Build custom LLM evaluation pipelines using the OpenJudge framework. Covers selecting and configuring graders (LLM-based, function-based, agentic), running batch evaluations with GradingRunner, combining scores with aggregators, applying evaluation strategies (voting, average), auto-generating graders from data, and analyzing results (pairwise win rates, statistics, validation metrics). Use when the user wants to evaluate LLM outputs, compare multiple models, design scoring criteria, or build an automated evaluation system.

AI & Automation 625 stars 54 forks Updated 1 week ago Apache-2.0


Quality Score: 90/100

Stars (20%): 93
Recency (20%): 90
Frontmatter (20%): 70
Documentation (15%): 100
Issue Health (10%): 50
License (10%): 100
Description (5%): 100

Skill Content

# OpenJudge Skill

Build evaluation pipelines for LLM applications using the `openjudge` library.

## When to Use This Skill

- User wants to evaluate LLM output quality (correctness, relevance, hallucination, etc.)
- User wants to compare two or more models and rank them
- User wants to design a scoring rubric and automate evaluation
- User wants to analyze evaluation results statistically
- User wants to build a reward model or quality filter

## Sub-documents: Read When Relevant

| Topic | File | Read when… |
|-------|------|------------|
| Grader selection & configuration | `graders.md` | User needs to pick or configure an evaluator |
| Batch evaluation pipeline | `pipeline.md` | User needs to run evaluation over a dataset |
| Auto-generate graders from data | `generator.md` | No rubric yet; generate from labeled examples |
| Analyze & compare results | `analyzer.md` | User wants win rates, statistics, or metrics |

Read the relevant sub-document **before** writing any code.

## Install

```bash
pip install py-openjudge
```

## Architecture Overview

```
Dataset (List[dict])
        │
        ▼
GradingRunner                  ← orchestrates everything
        │
        ├─► Grader A ──► EvaluationStrategy ──► _aevaluate() ──► GraderScore / GraderRank
        ├─► Grader B ──► EvaluationStrategy ──► _aevaluate() ──► GraderScore / GraderRank
        └─► Grader C ...
        │
        ├─► Aggregator (optional)  ← combine multiple grader scores into one
        │
        └─► RunnerResult           ← {grader_nam...
```
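
The data flow in the architecture overview (dataset, runner, graders, evaluation strategy, aggregator, result keyed by grader name) can be illustrated with a small plain-Python sketch. It deliberately does not import `openjudge`: every name below (`exact_match`, `length_check`, `average_strategy`, `weighted_sum`, `run`) is hypothetical and stands in for the concept only, so consult `graders.md` and `pipeline.md` for the library's actual classes and signatures.

```python
import asyncio
from statistics import mean
from typing import Awaitable, Callable, Dict, List

# Hypothetical grader type: an async callable mapping one sample to a score.
# This is NOT the openjudge API, only an illustration of the data flow.
Grader = Callable[[dict], Awaitable[float]]


async def exact_match(sample: dict) -> float:
    """Function-style grader: 1.0 when the response equals the reference."""
    return 1.0 if sample["response"].strip() == sample["reference"].strip() else 0.0


async def length_check(sample: dict) -> float:
    """Function-style grader: full score for responses of at most 20 characters."""
    return 1.0 if len(sample["response"]) <= 20 else 0.5


async def average_strategy(grader: Grader, sample: dict, n: int = 3) -> float:
    """Evaluation strategy: call the grader n times and average the scores."""
    scores = await asyncio.gather(*(grader(sample) for _ in range(n)))
    return mean(scores)


def weighted_sum(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Aggregator: combine per-grader scores into one normalized number."""
    total = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total


async def run(dataset: List[dict], graders: Dict[str, Grader]) -> Dict[str, List[float]]:
    """Runner: apply every grader to every sample; results are keyed by grader name."""
    results: Dict[str, List[float]] = {}
    for name, grader in graders.items():
        per_sample = await asyncio.gather(
            *(average_strategy(grader, sample) for sample in dataset)
        )
        results[name] = list(per_sample)
    return results


if __name__ == "__main__":
    dataset = [
        {"response": "Paris", "reference": "Paris"},
        {"response": "The capital of France is Lyon.", "reference": "Paris"},
    ]
    graders = {"exact_match": exact_match, "length": length_check}
    results = asyncio.run(run(dataset, graders))
    weights = {"exact_match": 0.75, "length": 0.25}
    for i in range(len(dataset)):
        per_grader = {name: results[name][i] for name in results}
        print(per_grader, weighted_sum(per_grader, weights))
```

Averaging repeated calls only matters for non-deterministic graders such as LLM judges; the deterministic graders here keep the sketch runnable without network access or API keys.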

Details

Author: agentscope-ai
Repository: agentscope-ai/OpenJudge
Created: 10 months ago
Last Updated: 1 week ago
Language: Python
License: Apache-2.0
