agent-evaluation

Solid

Testing and benchmarking LLM agents, including behavioral testing, capability assessment, reliability metrics, and production monitoring, in a setting where even top agents achieve less than 50% on real-world benchmarks

AI & Automation · 3 stars · 1 fork · Updated today · MIT

Quality Score: 82/100

Score components (weight: score where shown):
Stars (20%)
Recency (20%): 100
Frontmatter (20%)
Documentation (15%): 100
Issue Health (10%)
License (10%): 100
Description (5%): 100

Skill Content

# Agent Evaluation

Testing and benchmarking LLM agents, including behavioral testing, capability assessment, reliability metrics, and production monitoring, in a setting where even top agents achieve less than 50% on real-world benchmarks

## Capabilities

- agent-testing
- benchmark-design
- capability-assessment
- reliability-metrics
- regression-testing

## Prerequisites

- Knowledge: Testing methodologies, Statistical analysis basics, LLM behavior patterns
- Skills_recommended: autonomous-agents, multi-agent-orchestration
- Required skills: testing-fundamentals, llm-fundamentals

## Scope

- Does_not_cover: Model training evaluation (loss, perplexity), Fairness and bias testing, User experience testing
- Boundaries: Focus is agent capability and reliability, Covers functional and behavioral testing

## Ecosystem

### Primary_tools

- AgentBench - Multi-environment benchmark for LLM agents (ICLR 2024)
- τ-bench (Tau-bench) - Sierra's real-world agent benchmark
- ToolEmu - Risky behavior detection for agent tool use
- LangSmith - LLM tracing and evaluation platform

### Alternatives

- Braintrust - LLM evaluation and monitoring
  - When: Need production monitoring integration
- PromptFoo - Prompt testing framework
  - When: Focus on prompt-level evaluation

### Deprecated

- Manual testing only

## Patterns

### Statistical Test Evaluation

Run tests multiple times and analyze result distributions

**When to use**: Evaluating stochastic agent behavior

`interface TestResult { testId: string; ...`
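The Statistical Test Evaluation pattern is cut off in this listing, so the following is only a minimal sketch of the idea, not the skill's own code. It repeats one stochastic check and reports the pass rate with a 95% Wilson score interval. The names `TrialSummary`, `summarizeTrials`, and `runRepeated` are hypothetical and are not taken from the truncated `TestResult` interface. Reducing each run to a single pass/fail outcome is likewise an assumption.

```typescript
// Hypothetical summary type; NOT the skill's truncated TestResult interface.
interface TrialSummary {
  testId: string;
  trials: number;
  passes: number;
  passRate: number;
  ciLow: number;  // lower bound of the 95% Wilson score interval
  ciHigh: number; // upper bound of the 95% Wilson score interval
}

// Summarize the pass/fail outcomes collected from repeated runs of one test.
function summarizeTrials(testId: string, outcomes: boolean[]): TrialSummary {
  const n = outcomes.length;
  if (n === 0) {
    throw new Error("at least one trial is required");
  }
  const passes = outcomes.filter((ok) => ok).length;
  const p = passes / n;

  // Wilson score interval; it behaves better than the naive normal
  // approximation when n is small or p is close to 0 or 1.
  const z = 1.96; // normal quantile for a two-sided 95% interval
  const denom = 1 + (z * z) / n;
  const center = (p + (z * z) / (2 * n)) / denom;
  const half =
    (z * Math.sqrt((p * (1 - p)) / n + (z * z) / (4 * n * n))) / denom;

  return {
    testId,
    trials: n,
    passes,
    passRate: p,
    ciLow: Math.max(0, center - half),
    ciHigh: Math.min(1, center + half),
  };
}

// Run a stochastic check `trials` times in sequence, then summarize.
// `runOnce` is an assumed callback that resolves to true when the agent passes.
async function runRepeated(
  testId: string,
  runOnce: () => Promise<boolean>,
  trials: number,
): Promise<TrialSummary> {
  const outcomes: boolean[] = [];
  for (let i = 0; i < trials; i++) {
    outcomes.push(await runOnce());
  }
  return summarizeTrials(testId, outcomes);
}
```

With 6 passes in 10 trials, the interval runs from roughly 0.31 to 0.83. A single pass or fail therefore says little about a stochastic agent, and a regression gate may be better expressed as a threshold on `ciLow` than on any individual run.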

Details

Author: fabioc-aloha
Repository: fabioc-aloha/Alex_Skill_Mall
Created: 3 months ago
Last Updated: today
Language: Python
License: MIT

Integrates with

Azure · Cloud
