auto-arena
FeaturedAutomatically evaluate and compare multiple AI models or agents without pre-existing test data. Generates test queries from a task description, collects responses from all target endpoints, auto-generates evaluation rubrics, runs pairwise comparisons via a judge model, and produces win-rate rankings with reports and charts. Supports checkpoint resume, incremental endpoint addition, and judge model hot-swap. Use when the user asks to compare, benchmark, or rank multiple models or agents on a custom task, or run an arena-style evaluation.
Install
Quality Score: 92/100
Skill Content
Details
- Author
- agentscope-ai
- Repository
- agentscope-ai/OpenJudge
- Created
- 10 months ago
- Last Updated
- 5 days ago
- Language
- Python
- License
- Apache-2.0
Integrates with
Similar Skills
Semantically similar based on skill content — not just same category
autogpt-agents
Autonomous AI agent platform for building and deploying continuous agents. Use when creating visual workflow agents, deploying persistent autonomous agents, or building complex multi-step AI automation systems.
ai-pair
AI Pair Collaboration Skill. Coordinate multiple AI models to work together: one creates (Author/Developer), two others review (Codex + Gemini). Works for code, articles, video scripts, and any creative task. Trigger: /ai-pair, ai pair, dev-team, content-team, team-stop
writing
Iterative critique and improvement of long-form content (guidebooks, playbooks, essays). Launches parallel judge subagents for multi-dimensional critique, synthesizes findings, presents proposals for user approval. Never edits without consent.
test-harness-auditor
This skill should be used when auditing a repo's test, lint, type-check, static analysis, build, and debug infrastructure for AI coding agents. Use when entering a new repo, when asked to 'audit tests', 'audit harness', 'check test infrastructure', 'lint audit', 'what testing tools are configured', or when a repo has no .claude/lint-rules.json. Generates optimized configs for the lint-on-write hook.