skill-evaluation-workbench

Use when designing, running, debugging, or hardening deterministic eval suites for agent skills, prompts, tool workflows, or MCP-backed cases.
yeaight7/agent-powerups · ★ 7 · AI & Automation · score 78
Install: claude install-skill yeaight7/agent-powerups
# Skill Evaluation Workbench

## When To Use

- A skill or prompt needs repeatable quality checks across models or configurations.
- A workflow needs file-based graders, command traces, or local artifact checks.
- A tool or MCP skill needs a hidden service fixture or sandboxed test workspace.
- A previous agent attempt failed and you need trace-driven diagnosis before editing instructions.

## Requirements / Checks

- Confirm an eval runner exists locally before running anything. Do not install deps without approval.
- Prefer local deterministic graders over model-graded assertions.
- If Docker, remote models, API keys, or live services are involved, ask before execution.
- Treat traces, result files, preserved workspaces, and stdout as potentially sensitive.

## Minimal Suite Structure

Every suite should have at least three cases:

| Case | Purpose |
|---|---|
| Positive (golden path) | Skill handles the normal use case correctly |
| Edge case | Skill handles an important boundary condition |
| Control (no-tool-needed) | Skill does not over-trigger on a clearly unrelated input |

Place fixtures in `cases/`, skill/reference material in `references/`, and grader scripts in `graders/`.

## Grader Types

| Type | When to use | Deterministic? |
|---|---|---|
| File existence | Skill was supposed to create a file | Yes |
| File content match | Output matches expected text or schema | Yes |
| Command exit code | Script/tool succeeded | Yes |
| JSON schema | Output is valid structure