advanced-evaluation

Featured

This skill should be used when the user asks to "implement LLM-as-judge", "compare model outputs", "create evaluation rubrics", "mitigate evaluation bias", or mentions direct scoring, pairwise comparison, position bias, evaluation pipelines, or automated quality assessment.

AI & Automation · 43,990 stars · 6,492 forks · Updated today · MIT

Quality Score: 99/100

Stars (20%): 100
Recency (20%): 100
Frontmatter (20%)
Documentation (15%): 100
Issue Health (10%)
License (10%): 100
Description (5%): 100

Skill Content

# Advanced Evaluation

This skill covers production-grade techniques for evaluating LLM outputs using LLMs as judges. It synthesizes research from academic papers, industry practices, and practical implementation experience into actionable patterns for building reliable evaluation systems.

**Key insight**: LLM-as-a-Judge is not a single technique but a family of approaches, each suited to different evaluation contexts. Choosing the right approach and mitigating known biases is the core competency this skill develops.

## When to Use

Activate this skill when:

- Building automated evaluation pipelines for LLM outputs
- Comparing multiple model responses to select the best one
- Establishing consistent quality standards across evaluation teams
- Debugging evaluation systems that show inconsistent results
- Designing A/B tests for prompt or model changes
- Creating rubrics for human or automated evaluation
- Analyzing correlation between automated and human judgments

## Core Concepts

### The Evaluation Taxonomy

Evaluation approaches fall into two primary categories with distinct reliability profiles:

**Direct Scoring**: A single LLM rates one response on a defined scale.

- Best for: Objective criteria (factual accuracy, instruction following, toxicity)
- Reliability: Moderate to high for well-defined criteria
- Failure mode: Score calibration drift, inconsistent scale interpretation

**Pairwise Comparison**: An LLM compares two responses and selects the better one.

- Best f...
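
The two categories in the taxonomy can be sketched in a few lines of Python. This is a minimal illustration, not the skill's own implementation: the `Judge` callable, the function names, and the prompt wording are all hypothetical, and the order-swapping step is a commonly used way to surface position bias rather than something taken from the excerpt above.

```python
from typing import Callable

# Hypothetical judge interface (an assumption for this sketch): it takes a
# prompt string and returns the judge model's raw text reply.
Judge = Callable[[str], str]


def direct_score(judge: Judge, response: str, criterion: str,
                 scale: tuple[int, int] = (1, 5)) -> int:
    """Direct scoring: one judge rates one response on a defined scale."""
    low, high = scale
    prompt = (
        f"Rate the response on '{criterion}' from {low} to {high}. "
        f"Reply with a single integer.\n\nResponse:\n{response}"
    )
    score = int(judge(prompt).strip())  # ValueError if the reply is not an integer
    if not low <= score <= high:
        raise ValueError(f"score {score} is outside the scale {low}-{high}")
    return score


def pairwise_compare(judge: Judge, first: str, second: str, criterion: str) -> str:
    """Pairwise comparison: returns 'A' if `first` wins, 'B' if `second` wins."""
    prompt = (
        f"Which response is better on '{criterion}'? Reply with exactly 'A' or 'B'."
        f"\n\nResponse A:\n{first}\n\nResponse B:\n{second}"
    )
    verdict = judge(prompt).strip().upper()
    if verdict not in {"A", "B"}:
        raise ValueError(f"unexpected verdict: {verdict!r}")
    return verdict


def pairwise_with_swap(judge: Judge, a: str, b: str, criterion: str) -> str:
    """Run the comparison in both orders; return 'a', 'b', or 'tie'.

    A verdict that flips when the order flips is treated as a tie, since it
    may reflect position bias rather than a real quality difference.
    """
    a_wins_forward = pairwise_compare(judge, a, b, criterion) == "A"   # a shown first
    a_wins_backward = pairwise_compare(judge, b, a, criterion) == "B"  # a shown second
    if a_wins_forward and a_wins_backward:
        return "a"
    if not a_wins_forward and not a_wins_backward:
        return "b"
    return "tie"
```

In use, `judge` would wrap whatever model client the pipeline already has. A judge that always answers `"A"` regardless of content makes `pairwise_with_swap` return `"tie"`, which is the signal that the verdict depended on position rather than on the responses.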

Details

Author: sickn33
Repository: sickn33/agentic-awesome-skills
Created: 6 months ago
Last Updated: today
Language: Python
License: MIT

Bundled in these plugins

agentic-awesome-skills
