agent-evaluation

This skill should be used when the user asks to "evaluate agent performance", "build test framework", "measure agent quality", "create evaluation rubrics", "implement LLM-as-judge", "compare model outputs", "mitigate evaluation bias", or mentions multi-dimensional evaluation, agent testing, quality gates, direct scoring, pairwise comparison, position bias, evaluation pipelines, or automated quality assessment for LLM agent systems. NOT for testing code or applications (use testing-framework), NOT for agent coordination or multi-agent design (use multi-agent-patterns).
viktorbezdek/skillstack · ★ 9 · AI & Automation · score 76
Install: claude install-skill viktorbezdek/skillstack
# Evaluating LLM Agent Systems

Agent evaluation requires fundamentally different approaches than traditional software testing. Agents make dynamic decisions, are non-deterministic, and often lack single correct answers. Effective evaluation must account for these characteristics while providing actionable feedback.

**Key insight**: LLM-as-a-Judge is not a single technique but a family of approaches, each suited to different evaluation contexts. Choosing the right approach and mitigating known biases is the core competency this skill develops.

## When to Activate

- Testing agent performance systematically
- Validating context engineering choices
- Measuring improvements or catching regressions over time
- Building quality gates for agent pipelines
- Comparing different agent configurations or model outputs
- Building automated evaluation pipelines for LLM outputs
- Designing A/B tests for prompt or model changes
- Debugging evaluation systems that show inconsistent results
- Analyzing correlation between automated and human judgments

## Decision Tree: Choosing an Evaluation Approach

```
What are you evaluating?
+-- Agent outputs against known correct answers?
|   +-- Yes --> Direct Scoring (factual accuracy, format compliance, instruction following)
|   +-- No --> Are you comparing two configurations?
|       +-- Yes --> Pairwise Comparison with position-swap protocol
|       |           Criteria: tone, style, persuasiveness, creativity
|       +-- No --> Do you have reference m