Install: claude install-skill yeaight7/agent-powerups
# Skill Evaluation Workbench
## When To Use
- A skill or prompt needs repeatable quality checks across models or configurations.
- A workflow needs file-based graders, command traces, or local artifact checks.
- A tool or MCP skill needs a hidden service fixture or sandboxed test workspace.
- A previous agent attempt failed and you need trace-driven diagnosis before editing instructions.
## Requirements / Checks
- Confirm that an eval runner exists locally before running anything. Do not install dependencies without approval.
- Prefer local deterministic graders over model-graded assertions.
- If Docker, remote models, API keys, or live services are involved, ask before execution.
- Treat traces, result files, preserved workspaces, and stdout as potentially sensitive.
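
The checks above can be sketched as a small preflight step. This is a minimal illustration, not part of the skill itself: the `preflight` function, the `APPROVAL_REQUIRED` labels, and the idea that the runner is an executable on `PATH` are all assumptions made for the example.

```python
import shutil

# Hypothetical labels for the things the checklist says need approval first.
APPROVAL_REQUIRED = {"docker", "remote_model", "api_key", "live_service", "install_deps"}


def preflight(runner_name, features):
    """Return a list of blocking issues; an empty list means nothing blocks the run.

    runner_name: executable to look up on PATH (assumed; a runner could also be a script).
    features: iterable of labels describing what the planned run would touch.
    """
    issues = []
    # Only look the runner up; never try to install it.
    if shutil.which(runner_name) is None:
        issues.append(f"runner not found on PATH: {runner_name}")
    # Anything in the approval set must be confirmed by the user before execution.
    for feature in sorted(set(features) & APPROVAL_REQUIRED):
        issues.append(f"needs approval before execution: {feature}")
    return issues
```

Returning a list of issues rather than raising keeps the decision with the caller, who can show every blocker to the user at once and ask for approval.
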
## Minimal Suite Structure
Every suite should have at least three cases:
| Case | Purpose |
|---|---|
| Positive (golden path) | Skill handles the normal use case correctly |
| Edge case | Skill handles an important boundary condition |
| Control (no-tool-needed) | Skill does not over-trigger on a clearly unrelated input |

Place fixtures in `cases/`, skill/reference material in `references/`, and grader scripts in `graders/`.
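
The layout described here can be checked with a short script before a suite is run. The sketch below is illustrative only: `check_suite_layout` is a hypothetical helper, and it assumes that each non-hidden entry directly under `cases/` counts as one case, which the source does not specify.

```python
from pathlib import Path

REQUIRED_DIRS = ("cases", "references", "graders")
MIN_CASES = 3  # positive, edge, and control


def check_suite_layout(suite_root):
    """Return a list of layout problems; an empty list means the layout looks complete."""
    root = Path(suite_root)
    problems = [
        f"missing directory: {name}/"
        for name in REQUIRED_DIRS
        if not (root / name).is_dir()
    ]
    cases_dir = root / "cases"
    if cases_dir.is_dir():
        # Assumption: one non-hidden file or directory under cases/ equals one case.
        case_count = sum(
            1 for entry in cases_dir.iterdir() if not entry.name.startswith(".")
        )
        if case_count < MIN_CASES:
            problems.append(
                f"found {case_count} case(s) in cases/, expected at least {MIN_CASES}"
            )
    return problems
```

The check only counts cases; it cannot tell whether the three required kinds (positive, edge, control) are actually present, so that still needs a manual review.
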
## Grader Types
| Type | When to use | Deterministic? |
|---|---|---|
| File existence | Skill was supposed to create a file | Yes |
| File content match | Output matches expected text or schema | Yes |
| Command exit code | Script/tool succeeded | Yes |
| JSON schema | Output is valid structure