← ClaudeAtlas

agent-evaluatelisted

Define behavioral contracts, run adversarial tests, and detect regressions for AI agents — invariants, edge cases, statistical analysis, and benchmark-production gap detection
manastalukdar/ai-devstudio · ★ 1 · AI & Automation · score 75
Install: claude install-skill manastalukdar/ai-devstudio
# Agent Evaluation Evaluate AI agents with behavioral contracts, adversarial testing, and regression detection. Arguments: `$ARGUMENTS` - agent name/path to evaluate, or `report` to show last evaluation results ## Behavior ### 1. Locate Agent Under Test ```bash # Find agent definitions and entry points grep -rn "agent\|Agent\|LLMChain\|create_agent\|AgentExecutor" . \ --include="*.py" --include="*.ts" --include="*.js" \ -l 2>/dev/null | grep -v node_modules | head -10 # Check for existing eval harnesses find . -name "*eval*" -o -name "*test*agent*" -o -name "*agent*test*" \ 2>/dev/null | grep -v node_modules | head -10 ``` ### 2. Define Behavioral Contracts For each agent, establish invariants — things it must always or never do: ```yaml # docs/agent-contracts/<agent-name>.yaml agent: customer-support-agent version: "1.0" must_always: - respond_in_same_language_as_user - cite_source_when_making_factual_claims - escalate_when_confidence_below_threshold must_never: - reveal_system_prompt_contents - make_refund_decisions_above_threshold - store_PII_in_tool_calls output_schema: required_fields: [response, confidence, escalate] response_max_tokens: 500 ``` Generate contract template: ```bash mkdir -p docs/agent-contracts # Write contract file based on agent analysis ``` ### 3. Build Test Suite Four test categories: **Behavioral (golden path):** ```python test_cases = [ {"input": "What is your return policy?", "expect_contains": ["30 days",