Install: `claude install-skill Claudient/Claudient`
# LLM Eval Skill
## When to activate
- Building a test suite for an LLM-powered feature before shipping
- Choosing evaluation metrics for RAG, chat, summarisation, or extraction tasks
- Setting up LLM-as-judge to score model outputs automatically
- Detecting prompt or model version regressions in CI
- Monitoring production output quality over time
- Comparing two prompt versions or model versions systematically
## When NOT to use
- Unit testing application code (not LLM outputs) — use Jest/pytest
- A/B testing with real users — use the experiment-designer skill
- Security red-teaming — use the security-reviewer agent
- Choosing between LLMs for a new project — benchmarking is different from eval
## Instructions
### Evaluation dataset design
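As a rough illustration of what the prompt template in this section asks for, the sketch below models one dataset record and checks the case mix against the 60/30/10 target. The fields `input`, `expected_output`, and `category` come from the template's table. The `case_type` field, the helper names, and the 5-point tolerance are assumptions made for this example only.

```python
from collections import Counter
from dataclasses import dataclass

# Target mix from the template: 60% common, 30% edge, 10% adversarial.
TARGET_MIX = {"common": 0.60, "edge": 0.30, "adversarial": 0.10}


@dataclass
class EvalExample:
    input: str            # the prompt or question
    expected_output: str  # ground-truth answer, or grading criteria
    category: str         # factual / reasoning / format / refusal
    case_type: str        # hypothetical field: common / edge / adversarial


def case_mix(examples):
    """Return the observed share of each case type in the dataset."""
    if not examples:
        raise ValueError("dataset is empty")
    counts = Counter(ex.case_type for ex in examples)
    return {name: counts.get(name, 0) / len(examples) for name in TARGET_MIX}


def mix_deviations(examples, tolerance=0.05):
    """Return case types whose share misses the target by more than `tolerance`."""
    observed = case_mix(examples)
    return {
        name: round(observed[name] - target, 3)
        for name, target in TARGET_MIX.items()
        if abs(observed[name] - target) > tolerance
    }
```

An empty result from `mix_deviations` means the dataset is within tolerance of the target mix. A positive value means that case type is over-represented, a negative one that it is under-represented.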
```
Build an evaluation dataset for [LLM feature].
Feature: [describe — RAG Q&A / summarisation / extraction / classification / chat]
Scale: [20 / 50 / 200 examples]
Data sources: [real user queries / synthetic / domain expert created]
Dataset design principles:
Distribution: match your production input distribution
- Sample from real user queries where possible (anonymised)
- If synthetic: generate with Claude using diverse personas and intents
- Cover: common cases (60%), edge cases (30%), adversarial cases (10%)
For each example, record:
| Field | Description |
|---|---|
| input | The prompt or question |
| expected_output | Ground truth answer (or criteria) |
| category | Type of query (factual / reasoning / format / refusal) |