Install: claude install-skill shipshitdev/skills
# Advanced Evaluation
LLM-as-a-Judge techniques for evaluating AI outputs. LLM-as-a-Judge is not a single technique but a family of approaches; the core competency is choosing the right one and mitigating its biases.
## When to Activate
- Building automated evaluation pipelines for LLM outputs
- Comparing multiple model responses to select the best one
- Establishing consistent quality standards
- Debugging inconsistent evaluation results
- Designing A/B tests for prompt or model changes
- Creating rubrics for human or automated evaluation
## Core Concepts
### Evaluation Taxonomy
**Direct Scoring**: A single LLM rates one response on a defined scale.
- Best for: Objective criteria (factual accuracy, instruction following, toxicity)
- Reliability: Moderate to high for well-defined criteria
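A minimal sketch of direct scoring, assuming the judge is any callable that takes a prompt string and returns the model's text. The function names, the 1 to 5 scale, and the `Score: <n>` output convention are illustrative assumptions, not part of this skill.

```python
import re
from typing import Callable, Optional


def build_scoring_prompt(response: str, criterion: str,
                         scale_min: int = 1, scale_max: int = 5) -> str:
    """Build a judge prompt that asks for one integer score on one criterion."""
    return (
        f"Rate the response below on this criterion: {criterion}\n"
        f"Use an integer from {scale_min} to {scale_max}.\n"
        "Give a one-sentence justification, then end with a line "
        "'Score: <n>'.\n\n"
        f"Response:\n{response}"
    )


def parse_score(judge_output: str,
                scale_min: int = 1, scale_max: int = 5) -> Optional[int]:
    """Return the last 'Score: <n>' value, or None if missing or off-scale."""
    matches = re.findall(r"Score:\s*(-?\d+)", judge_output)
    if not matches:
        return None
    score = int(matches[-1])
    return score if scale_min <= score <= scale_max else None


def direct_score(response: str, criterion: str,
                 judge: Callable[[str], str]) -> Optional[int]:
    """Score one response with a hypothetical judge callable (prompt -> text)."""
    return parse_score(judge(build_scoring_prompt(response, criterion)))
```

Returning `None` for unparseable or out-of-range output keeps malformed judgments visible instead of silently counting them as a score.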
**Pairwise Comparison**: An LLM compares two responses and selects the better one.
- Best for: Subjective preferences (tone, style, persuasiveness)
- Reliability: Higher than direct scoring for preferences
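A pairwise comparison could be sketched as follows. The judge is again assumed to be a plain callable from prompt to text, and the `Winner: A|B|TIE` convention and all function names are hypothetical choices for illustration.

```python
import re
from typing import Callable, Optional


def build_pairwise_prompt(question: str, response_a: str,
                          response_b: str, criterion: str) -> str:
    """Build a judge prompt that asks which of two responses is better."""
    return (
        f"Question:\n{question}\n\n"
        f"Response A:\n{response_a}\n\n"
        f"Response B:\n{response_b}\n\n"
        f"Which response is better on this criterion: {criterion}?\n"
        "Explain briefly, then end with a line 'Winner: A', 'Winner: B', "
        "or 'Winner: TIE'."
    )


def parse_verdict(judge_output: str) -> Optional[str]:
    """Return 'A', 'B', or 'TIE', or None if no verdict line is found."""
    match = re.search(r"Winner:\s*(A|B|TIE)\b", judge_output, re.IGNORECASE)
    return match.group(1).upper() if match else None


def pairwise_compare(question: str, response_a: str, response_b: str,
                     criterion: str,
                     judge: Callable[[str], str]) -> Optional[str]:
    """Run one comparison with a hypothetical judge callable (prompt -> text)."""
    prompt = build_pairwise_prompt(question, response_a, response_b, criterion)
    return parse_verdict(judge(prompt))
```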
### Known Biases
| Bias | Description | Mitigation |
|------|-------------|------------|
| Position | First-position preference | Swap positions, check consistency |
| Length | Longer responses score higher | Explicit prompting, length-normalized scoring |
| Self-Enhancement | Models rate own outputs higher | Use different model for evaluation |
| Verbosity | Unnecessary detail rated higher | Criteria-specific rubrics |
| Authority | Confident tone rated higher | Require evidence citation |
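The position-bias mitigation in the table (swap positions, check consistency) might look like the sketch below. It assumes a hypothetical `compare` callable that takes the response shown as A and the response shown as B and returns `"A"`, `"B"`, or `"TIE"`; the `"INCONSISTENT"` label is an illustrative choice.

```python
from typing import Callable

# Translate a verdict from the swapped pass back to the original labels.
_FLIP = {"A": "B", "B": "A", "TIE": "TIE"}


def compare_with_swap(response_1: str, response_2: str,
                      compare: Callable[[str, str], str]) -> str:
    """Judge both orderings and accept the verdict only if they agree.

    Returns "A" if response_1 wins, "B" if response_2 wins, "TIE",
    or "INCONSISTENT" when the two passes disagree.
    """
    first = compare(response_1, response_2)   # response_1 shown as A
    second = compare(response_2, response_1)  # response_1 shown as B
    # Unknown verdicts map to None, which can never equal `first`.
    second_in_original_labels = _FLIP.get(second)
    if first == second_in_original_labels:
        return first
    return "INCONSISTENT"


def always_first(a: str, b: str) -> str:
    """Toy judge that always prefers position A (pure position bias)."""
    return "A"


def prefers_longer(a: str, b: str) -> str:
    """Toy judge whose verdict depends on content, not position."""
    if len(a) == len(b):
        return "TIE"
    return "A" if len(a) > len(b) else "B"


print(compare_with_swap("a detailed answer", "short", always_first))
print(compare_with_swap("a detailed answer", "short", prefers_longer))
```

The position-biased toy judge produces `INCONSISTENT`, while the content-based one produces the same winner in both orderings. Inconsistent pairs can then be treated as ties or sent for further review.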
### Decision Framework