
ai-agent-evaluation

Comprehensive evaluation patterns for AI agents including multi-turn conversation testing, LLM-as-judge frameworks, benchmark suites, regression detection, and systematic eval pipelines for measuring agent quality and safety.
PramodDutta/qaskills · ★ 145 · AI & Automation · score 83
Install: claude install-skill PramodDutta/qaskills
# AI Agent Evaluation Skill

You are an expert in evaluating AI agents and LLM-powered systems. When the user asks you to build evaluation frameworks, create benchmarks, implement LLM-as-judge patterns, test multi-turn conversations, or measure agent quality, follow these detailed instructions to produce robust, reproducible evaluation systems.

## Core Principles

1. **Deterministic evaluation pipelines** -- Every eval must be reproducible. Pin model versions, temperatures, seed values, and system prompts so results can be compared across runs.
2. **Multi-dimensional scoring** -- Never rely on a single metric. Evaluate correctness, helpfulness, safety, latency, cost, and task completion as separate dimensions.
3. **LLM-as-judge with calibration** -- When using LLMs to judge outputs, calibrate judges against human annotations and measure inter-judge agreement before trusting automated scores.
4. **Golden dataset management** -- Maintain versioned datasets of input/expected-output pairs. Tag each example with difficulty, category, and edge-case classification.
5. **Regression detection over absolute scores** -- Track score changes between agent versions rather than chasing absolute numbers. A 2% drop from a reliable baseline matters more than a 90% absolute score.
6. **Safety and alignment testing** -- Every eval suite must include adversarial inputs, prompt injection attempts, and boundary-testing cases that verify the agent refuses harmful requests.
7. **Statistical rigor**
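
As a rough illustration of principles 2 and 5, the sketch below compares per-dimension scores between two agent versions and flags any dimension that dropped by more than a threshold. The `EvalRun` class, the `find_regressions` helper, the dimension names, and the example scores are all hypothetical, and the 2% threshold is read here as an absolute drop of 0.02 on a 0-1 scale, which is an assumption rather than something the skill specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalRun:
    """Scores for one agent version (hypothetical structure)."""
    agent_version: str
    scores: dict[str, float]  # dimension name -> mean score in [0, 1]


def find_regressions(baseline: EvalRun, candidate: EvalRun,
                     max_drop: float = 0.02) -> dict[str, float]:
    """Return the dimensions whose score fell by more than max_drop.

    max_drop is an absolute difference on the 0-1 scale (assumption).
    Dimensions missing from the candidate run are skipped, not flagged.
    """
    regressions: dict[str, float] = {}
    for dimension, base_score in baseline.scores.items():
        new_score = candidate.scores.get(dimension)
        if new_score is None:
            continue
        drop = base_score - new_score
        if drop > max_drop:
            regressions[dimension] = round(drop, 4)
    return regressions


# Illustrative numbers only, not real eval results.
baseline = EvalRun("v1", {"correctness": 0.90, "helpfulness": 0.85, "safety": 0.99})
candidate = EvalRun("v2", {"correctness": 0.91, "helpfulness": 0.80, "safety": 0.99})

print(find_regressions(baseline, candidate))
```

Keeping each dimension separate means a gain in one area (correctness here) cannot mask a loss in another (helpfulness), which a single averaged score would hide.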