agent-eval

Solid

Head-to-head comparison of coding agents (Claude Code, Aider, Codex, etc.) on custom tasks with pass rate, cost, time, and consistency metrics

AI & Automation 196,640 stars 30253 forks Updated 2 days ago MIT

Install

View on GitHub

Quality Score: 96/100

Stars 20%
100
Recency 20%
100
Frontmatter 20%
70
Documentation 15%
100
Issue Health 10%
50
License 10%
100
Description 5%
100

Skill Content

# Agent Eval Skill A lightweight CLI tool for comparing coding agents head-to-head on reproducible tasks. Every "which coding agent is best?" comparison runs on vibes — this tool systematizes it. ## When to Activate - Comparing coding agents (Claude Code, Aider, Codex, etc.) on your own codebase - Measuring agent performance before adopting a new tool or model - Running regression checks when an agent updates its model or tooling - Producing data-backed agent selection decisions for a team ## Installation > **Note:** Install agent-eval from its repository after reviewing the source. ## Core Concepts ### YAML Task Definitions Define tasks declaratively. Each task specifies what to do, which files to touch, and how to judge success: ```yaml name: add-retry-logic description: Add exponential backoff retry to the HTTP client repo: ./my-project files: - src/http_client.py prompt: | Add retry logic with exponential backoff to all HTTP requests. Max 3 retries. Initial delay 1s, max delay 30s. judge: - type: pytest command: pytest tests/test_http_client.py -v - type: grep pattern: "exponential_backoff|retry" files: src/http_client.py commit: "abc1234" # pin to specific commit for reproducibility ``` ### Git Worktree Isolation Each agent run gets its own git worktree — no Docker required. This provides reproducibility isolation so agents cannot interfere with each other or corrupt the base repo. ### Metrics Collected | Metric | What It Measures | |-...

Details

Author
affaan-m
Repository
affaan-m/everything-claude-code
Created
4 months ago
Last Updated
2 days ago
Language
JavaScript
License
MIT

Integrates with

Related Skills