agent-evaluation

Solid

Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and production monitoring—where even top agents achieve less than 50% on real-world benchmarks

AI & Automation 131 stars 27 forks Updated 1 week ago MIT


Quality Score: 91/100

Stars (weight 20%): 71
Recency (weight 20%): 90
Frontmatter (weight 20%): 70
Documentation (weight 15%): 100
Issue Health (weight 10%): 50
License (weight 10%): 100
Description (weight 5%): 100

Skill Content

# Agent Evaluation

Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and production monitoring, where even top agents achieve less than 50% on real-world benchmarks

## Capabilities

- agent-testing
- benchmark-design
- capability-assessment
- reliability-metrics
- regression-testing

## Prerequisites

- Knowledge: Testing methodologies, Statistical analysis basics, LLM behavior patterns
- Skills_recommended: autonomous-agents, multi-agent-orchestration
- Required skills: testing-fundamentals, llm-fundamentals

## Scope

- Does_not_cover: Model training evaluation (loss, perplexity), Fairness and bias testing, User experience testing
- Boundaries: Focus is agent capability and reliability, Covers functional and behavioral testing

## Ecosystem

### Primary_tools

- AgentBench - Multi-environment benchmark for LLM agents (ICLR 2024)
- τ-bench (Tau-bench) - Sierra's real-world agent benchmark
- ToolEmu - Risky behavior detection for agent tool use
- LangSmith - LLM tracing and evaluation platform

### Alternatives

- Braintrust - LLM evaluation and monitoring. When: Need production monitoring integration
- PromptFoo - Prompt testing framework. When: Focus on prompt-level evaluation

### Deprecated

- Manual testing only

## Patterns

### Statistical Test Evaluation

Run tests multiple times and analyze result distributions

**When to use**: Evaluating stochastic agent behavior

`interface TestResult { testId: string; ...`
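The Statistical Test Evaluation pattern above (run a test many times, then look at the distribution of results) might be sketched roughly as follows. This is an illustration only, not the skill's own implementation: `PassRateSummary`, `summarizeTrials`, and `runTrials` are hypothetical names, and the sketch does not attempt to reconstruct the truncated `TestResult` interface. It assumes each trial reduces to a pass/fail boolean and uses a Wilson score interval as one reasonable way to express uncertainty in the pass rate.

```typescript
// Hypothetical sketch: names and shapes here are assumptions,
// not the skill's own TestResult interface.
interface PassRateSummary {
  trials: number;
  passes: number;
  passRate: number;
  ciLower: number; // lower bound of the Wilson score interval
  ciUpper: number; // upper bound of the Wilson score interval
}

// Summarize repeated pass/fail outcomes of one test case.
// z = 1.96 corresponds to an approximate 95% confidence level.
function summarizeTrials(outcomes: boolean[], z = 1.96): PassRateSummary {
  const n = outcomes.length;
  if (n === 0) {
    throw new Error("need at least one trial");
  }
  const passes = outcomes.filter(Boolean).length;
  const p = passes / n;

  // Wilson score interval: better behaved than p ± z*sqrt(p(1-p)/n)
  // when n is small or p is close to 0 or 1.
  const z2 = z * z;
  const denom = 1 + z2 / n;
  const center = (p + z2 / (2 * n)) / denom;
  const half = (z * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n))) / denom;

  return {
    trials: n,
    passes,
    passRate: p,
    ciLower: Math.max(0, center - half),
    ciUpper: Math.min(1, center + half),
  };
}

// Run a stochastic agent test `trials` times, one after another.
// `runOnce` is a caller-supplied function (assumed) that resolves to
// true when the agent's behavior passed the check.
async function runTrials(
  runOnce: () => Promise<boolean>,
  trials: number,
): Promise<PassRateSummary> {
  const outcomes: boolean[] = [];
  for (let i = 0; i < trials; i++) {
    outcomes.push(await runOnce());
  }
  return summarizeTrials(outcomes);
}
```

With this sketch, ten passes out of ten trials still yields a lower bound of roughly 0.72, which shows why a small number of clean runs should not be read as proof of reliability. A regression gate could compare `ciLower` against a threshold instead of the raw pass rate.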

Details

Author: lingxling
Repository: lingxling/awesome-skills-cn
Created: 3 months ago
Last Updated: 1 week ago
Language: Python
License: MIT
