advanced-evaluation

Master LLM-as-a-Judge evaluation techniques including direct scoring, pairwise comparison, rubric generation, and bias mitigation. Use when building evaluation systems, comparing model outputs, or establishing quality standards for AI-generated content.
shipshitdev/skills · ★ 27 · AI & Automation · score 72

Install: claude install-skill shipshitdev/skills

# Advanced Evaluation

LLM-as-a-Judge techniques for evaluating AI outputs. This is a family of approaches rather than a single technique; choosing the right one and mitigating its biases is the core competency.

## When to Activate

- Building automated evaluation pipelines for LLM outputs
- Comparing multiple model responses to select the best one
- Establishing consistent quality standards
- Debugging inconsistent evaluation results
- Designing A/B tests for prompt or model changes
- Creating rubrics for human or automated evaluation

## Core Concepts

### Evaluation Taxonomy

**Direct Scoring**: A single LLM rates one response on a defined scale.

- Best for: Objective criteria (factual accuracy, instruction following, toxicity)
- Reliability: Moderate to high for well-defined criteria

**Pairwise Comparison**: An LLM compares two responses and selects the better one.

- Best for: Subjective preferences (tone, style, persuasiveness)
- Reliability: Higher than direct scoring for preferences

### Known Biases

| Bias | Description | Mitigation |
|------|-------------|------------|
| Position | The response shown first is preferred | Swap positions, check consistency |
| Length | Longer responses receive higher scores | Explicit prompting, length-normalized scoring |
| Self-Enhancement | Models rate their own outputs higher | Use a different model for evaluation |
| Verbosity | Unnecessary detail is rated higher | Criteria-specific rubrics |
| Authority | A confident tone is rated higher | Require evidence citation |

### Decision Framework
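
When pairwise comparison is the chosen method, the position-bias mitigation from the Known Biases table (swap positions, check consistency) might look like the sketch below. The `Judge` callable, the `compare_with_swap` helper, and the two stand-in judges are hypothetical names introduced for illustration. The sketch assumes a judge returns exactly `"A"` or `"B"`; a real judge would wrap an LLM call.

```python
from typing import Callable

# Hypothetical judge interface (an assumption, not a real API): given the task
# prompt and two responses shown as "A" and "B", return the winner's label.
Judge = Callable[[str, str, str], str]


def compare_with_swap(judge: Judge, prompt: str, first: str, second: str) -> str:
    """Judge the pair in both orders and keep the verdict only if they agree.

    Returns "first", "second", or "inconsistent".
    Assumes the judge returns exactly "A" or "B".
    """
    original = judge(prompt, first, second)  # `first` shown in position A
    swapped = judge(prompt, second, first)   # `first` shown in position B

    first_won_original = original == "A"
    first_won_swapped = swapped == "B"

    if first_won_original and first_won_swapped:
        return "first"
    if not first_won_original and not first_won_swapped:
        return "second"
    # The verdict changed when only the order changed: likely position bias.
    return "inconsistent"


# Stand-in judges for demonstration only; a real judge would call an LLM.
def always_picks_a(prompt: str, a: str, b: str) -> str:
    """Simulates a fully position-biased judge."""
    return "A"


def prefers_longer(prompt: str, a: str, b: str) -> str:
    """Simulates a judge whose verdict does not depend on position."""
    return "A" if len(a) >= len(b) else "B"


print(compare_with_swap(always_picks_a, "Summarize X", "short", "a longer answer"))
print(compare_with_swap(prefers_longer, "Summarize X", "short", "a longer answer"))
```

The first call reports `inconsistent` because the verdict follows the position rather than the response. The second reports `second` because the same response wins in both orders. What to do with an inconsistent verdict (treat it as a tie, retry, or escalate to human review) is a design choice this sketch leaves open.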