agent-benchmark

Solid

Framework for measuring and tracking agent response quality over time. Detects regressions before they reach production. Use when evaluating agent changes, auditing quality, or establishing performance baselines.

AI & Automation · 495 stars · 41 forks · Updated 1 month ago · MIT

Quality Score: 86/100

Stars (20%): 90
Recency (20%): 75
Frontmatter (20%): 70
Documentation (15%): 100
Issue Health (10%): 50
License (10%): 100
Description (5%): 100

Skill Content

# Agent Benchmark Framework

Without benchmarks, we cannot know whether agent changes improve or degrade quality. This skill defines how to measure, track, and protect agent performance.

## When to Activate

- Before and after modifying any agent definition file
- When adding a new skill that an agent depends on
- Periodic quality audits (weekly/monthly)
- When a user reports degraded agent output
- Before promoting an agent from experimental to production

## Core Concepts

### Why Benchmarks Matter

Agent quality degrades silently. A prompt tweak that improves one response can break ten others. Without a baseline to compare against, every change is a guess. Benchmarks make quality visible and regressions detectable.

### Benchmark Types

| Type | Scope | Cost | Frequency |
|------|-------|------|-----------|
| Prompt Benchmark | Single agent, single task | Low | Every agent change |
| Task Benchmark | End-to-end scenario | Medium | Feature changes |
| Regression Suite | All critical agents | High | Weekly / before release |

## Directory Structure

```
~/.claude/benchmarks/
  fixtures/
    code-reviewer/
      missing-error-handling.ts   # Input: code with no try/catch
      sql-injection.py            # Input: unparameterized query
      clean-code.ts               # Input: code with no issues
    security-reviewer/
      hardcoded-secret.ts         # Input: API key in source
      parameterized-query.py      # Input: safe query (no findings expected)
    v...
```
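The baseline comparison described under "Why Benchmarks Matter" could be sketched roughly as follows. This is an illustration only, not the skill's actual implementation: the `find_regressions` function, the `Regression` record, the 0-to-1 score scale, the `tolerance` value, and all scores shown are hypothetical. Only the fixture names are taken from the directory listing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Regression:
    """One fixture whose score fell relative to the stored baseline."""
    fixture: str
    baseline: float
    current: float


def find_regressions(baseline, current, tolerance=0.05):
    """Return fixtures whose score dropped by more than `tolerance`.

    Both arguments map fixture name -> score in [0, 1] (an assumed scale).
    A fixture missing from `current` is scored 0.0, so it is reported
    as a regression rather than silently ignored.
    """
    regressions = []
    for fixture, old_score in sorted(baseline.items()):
        new_score = current.get(fixture, 0.0)
        if old_score - new_score > tolerance:
            regressions.append(Regression(fixture, old_score, new_score))
    return regressions


# Hypothetical scores recorded before and after an agent change.
baseline_scores = {
    "code-reviewer/missing-error-handling.ts": 0.90,
    "code-reviewer/sql-injection.py": 0.95,
    "code-reviewer/clean-code.ts": 1.00,
}
current_scores = {
    "code-reviewer/missing-error-handling.ts": 0.92,
    "code-reviewer/sql-injection.py": 0.60,
    "code-reviewer/clean-code.ts": 1.00,
}

for r in find_regressions(baseline_scores, current_scores):
    print(f"REGRESSION {r.fixture}: {r.baseline:.2f} -> {r.current:.2f}")
```

With these made-up numbers, only the `sql-injection.py` fixture is flagged: the small improvement on `missing-error-handling.ts` does not hide the larger drop elsewhere, which is the "improves one response, breaks others" case the section warns about. The tolerance is there to absorb run-to-run noise; a suitable value would depend on how scores are actually produced.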

Details

Author: vibeeval
Repository: vibeeval/vibecosystem
Created: 2 months ago
Last Updated: 1 month ago
Language: C#
License: MIT
