← ClaudeAtlas

prompt-evaluation-runnerlisted

Use when evaluating prompts, LLM outputs, red-team suites, or model behavior with local eval configs and safe provider/cost controls.
yeaight7/agent-powerups · ★ 7 · AI & Automation · score 75
Install: claude install-skill yeaight7/agent-powerups
# Prompt Evaluation Runner ## When to use Use when you need to evaluate an LLM app, test a prompt systematically, or run red-team/vulnerability scans against a target model or application. ## Requirements / Checks 1. Check if an evaluation tool is defined in project deps, scripts, lockfiles, or local toolchain (e.g., `promptfoo`, `evals`, `braintrust`). 2. Do not run unvetted remote runners without checking the project's toolchain first (e.g., avoid `npx promptfoo@latest` if `promptfoo` is already installed locally). 3. If no runner exists, ask before adding a dev dependency or using an ephemeral runner. 4. Confirm expected cost, provider, API keys, and network target before any execution. ## Workflow 1. **Define risk** — state target behavior, failure mode, provider(s), and budget limits before writing any config. 2. **Choose assertions** — prefer deterministic checks first: | Assertion type | When to use | |---|---| | `contains` / `not-contains` | Output must include/exclude specific text | | `regex` | Structured output pattern (e.g., JSON key present) | | `json-schema` | Output must conform to a schema | | `cost` | Must stay under a token/dollar budget | | `latency` | Must respond within N ms | | `javascript` / `python` | Custom logic when simpler types don't fit | | Model grader | Last resort — only for subjective quality checks | 3. **Use model graders sparingly** — pin the grader model and provider explicitly; document the cost and no