Install: `claude install-skill varunk130/AI-Eval-Skills`
# Tool-Use Eval
Evaluation pattern for agents whose job is to invoke tools, functions, or MCP capabilities. Conversational-quality and task-completion evals miss the failure modes specific to tool use; this skill adds the four dimensions that predict whether a tool-using agent is production-ready.
## Core Principle
**A tool-using agent has four ways to fail that pure-conversation agents don't.** Generic quality scores can be high while every one of these is broken. This eval pattern separates them so improvements target the right failure mode.
## The Four Tool-Use Dimensions
| Dimension | What It Measures | Failure Example |
|-----------|------------------|-----------------|
| **Tool Selection Accuracy** | Did the agent pick the right tool for the request? | Used `search_web` when `lookup_internal_kb` was the right call |
| **Argument Correctness** | Were the arguments well-formed and schema-valid? | Required field missing, type mismatch, hallucinated parameter |
| **Error Recovery** | When a tool returned an error, did the agent recover sensibly? | Repeats the same failing call, or gives up silently |
| **Restraint** | Did the agent avoid unnecessary tool calls? | Calls 5 tools when 1 would do; calls a tool when no tool was needed |
The four dimensions are scored **independently**, because a change that helps one often hurts another (e.g., tightening argument validation can increase Restraint failures).
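
One way to keep the dimensions separate in practice is to record them as four distinct fields rather than a single number. The sketch below is illustrative only: `ToolUseScores`, `weakest_dimensions`, and `mean_score` are hypothetical names, not part of this skill, and the sample scores are invented to show how an average can stay flat while the failing dimension moves.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ToolUseScores:
    """Hypothetical per-run record: one 0-3 score per tool-use dimension."""
    tool_selection: int
    argument_correctness: int
    error_recovery: int
    restraint: int

    def __post_init__(self):
        # Reject anything outside the 0-3 scale used by this eval pattern.
        for f in fields(self):
            value = getattr(self, f.name)
            if not 0 <= value <= 3:
                raise ValueError(f"{f.name} must be in 0..3, got {value}")


def weakest_dimensions(scores):
    """Return the name(s) of the lowest-scoring dimension(s)."""
    values = {f.name: getattr(scores, f.name) for f in fields(scores)}
    lowest = min(values.values())
    return [name for name, value in values.items() if value == lowest]


def mean_score(scores):
    """Aggregate score, shown only to illustrate what it hides."""
    values = [getattr(scores, f.name) for f in fields(scores)]
    return sum(values) / len(values)


# Invented example: argument handling improves, Restraint regresses.
before = ToolUseScores(tool_selection=3, argument_correctness=1,
                       error_recovery=3, restraint=3)
after = ToolUseScores(tool_selection=3, argument_correctness=3,
                      error_recovery=3, restraint=1)

print(mean_score(before), weakest_dimensions(before))
print(mean_score(after), weakest_dimensions(after))
```

Both runs have the same mean, so an aggregate would report no change; the per-dimension view shows that the weak spot moved from `argument_correctness` to `restraint`.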
## Scoring Anchors (0-3 per dimension)
### Tool Selection Ac