agent-evaluation

Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and production monitoring—where even top agents achieve less than 50% on real-world benchmarks

AI & Automation · 213 stars · 41 forks · Updated 1 week ago · MIT

Quality Score: 93/100

Stars (20%)
Recency (20%)
Frontmatter (20%)
Documentation (15%): 100
Issue Health (10%)
License (10%): 100
Description (5%): 100

Skill Content

# Agent Evaluation

Testing and benchmarking LLM agents, including behavioral testing, capability assessment, reliability metrics, and production monitoring, where even top agents achieve less than 50% on real-world benchmarks

## Capabilities

- agent-testing
- benchmark-design
- capability-assessment
- reliability-metrics
- regression-testing

## Prerequisites

- Knowledge: Testing methodologies, Statistical analysis basics, LLM behavior patterns
- Skills_recommended: autonomous-agents, multi-agent-orchestration
- Required skills: testing-fundamentals, llm-fundamentals

## Scope

- Does_not_cover: Model training evaluation (loss, perplexity), Fairness and bias testing, User experience testing
- Boundaries: Focus is agent capability and reliability, Covers functional and behavioral testing

## Ecosystem

### Primary_tools

- AgentBench - Multi-environment benchmark for LLM agents (ICLR 2024)
- τ-bench (Tau-bench) - Sierra's real-world agent benchmark
- ToolEmu - Risky behavior detection for agent tool use
- LangSmith - LLM tracing and evaluation platform

### Alternatives

- Braintrust - LLM evaluation and monitoring
  - When: Need production monitoring integration
- PromptFoo - Prompt testing framework
  - When: Focus on prompt-level evaluation

### Deprecated

- Manual testing only

## Patterns

### Statistical Test Evaluation

Run tests multiple times and analyze result distributions

**When to use**: Evaluating stochastic agent behavior

```typescript
interface TestResult {
  testId: string;
  ...
```
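The skill's own code for this pattern is truncated above, so the following is only a minimal sketch of what "run tests multiple times and analyze result distributions" could look like. Every name here (`TrialSummary`, `runTrials`, `summarize`, `meetsThreshold`) is hypothetical and not taken from the skill, and the choice of a 95% Wilson score interval is an assumption about how one might quantify uncertainty in a pass rate.

```typescript
// Hypothetical sketch: none of these names come from the original skill.
interface TrialSummary {
  trials: number;
  passes: number;
  passRate: number;
  ciLow: number; // lower bound of the Wilson score interval
  ciHigh: number; // upper bound of the Wilson score interval
}

// Run one stochastic test `trials` times and record pass/fail for each run.
async function runTrials(
  test: () => Promise<boolean>,
  trials: number,
): Promise<boolean[]> {
  const outcomes: boolean[] = [];
  for (let i = 0; i < trials; i++) {
    try {
      outcomes.push(await test());
    } catch {
      outcomes.push(false); // assumption: a thrown error counts as a failed trial
    }
  }
  return outcomes;
}

// Summarize outcomes as a pass rate plus a Wilson score interval
// (z = 1.96 corresponds to roughly 95% confidence).
function summarize(outcomes: boolean[], z = 1.96): TrialSummary {
  const n = outcomes.length;
  if (n === 0) throw new Error("need at least one trial");
  const passes = outcomes.filter(Boolean).length;
  const p = passes / n;
  const denom = 1 + (z * z) / n;
  const center = (p + (z * z) / (2 * n)) / denom;
  const half =
    (z * Math.sqrt((p * (1 - p)) / n + (z * z) / (4 * n * n))) / denom;
  return {
    trials: n,
    passes,
    passRate: p,
    ciLow: Math.max(0, center - half),
    ciHigh: Math.min(1, center + half),
  };
}

// Gate on the interval's lower bound instead of the raw pass rate.
function meetsThreshold(summary: TrialSummary, minPassRate: number): boolean {
  return summary.ciLow >= minPassRate;
}
```

Gating on the lower bound rather than the observed rate is one possible design choice: with only a handful of trials the interval is wide, so a few lucky runs are less likely to pass the gate by accident.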

Details

Author: lingxling
Repository: lingxling/awesome-skills-cn
Created: 5 months ago
Last Updated: 1 week ago
Language: Python
License: MIT

Integrates with

OpenAI (AI) · Anthropic (AI) · Hugging Face (AI) · Vercel (Cloud)
