advanced-evaluation

Featured

This skill should be used when the user asks to "implement LLM-as-judge", "compare model outputs", "create evaluation rubrics", "mitigate evaluation bias", or mentions direct scoring, pairwise comparison, position bias, evaluation pipelines, or automated quality assessment.

AI & Automation | 39,350 stars | 6,386 forks | Updated today | MIT

Quality Score: 99/100

Stars (20%): 100
Recency (20%): 100
Frontmatter (20%): 70
Documentation (15%): 100
Issue Health (10%): 50
License (10%): 100
Description (5%): 100

Skill Content

# Advanced Evaluation

This skill covers production-grade techniques for evaluating LLM outputs using LLMs as judges. It synthesizes research from academic papers, industry practices, and practical implementation experience into actionable patterns for building reliable evaluation systems.

**Key insight**: LLM-as-a-Judge is not a single technique but a family of approaches, each suited to different evaluation contexts. Choosing the right approach and mitigating known biases is the core competency this skill develops.

## When to Use

Activate this skill when:

- Building automated evaluation pipelines for LLM outputs
- Comparing multiple model responses to select the best one
- Establishing consistent quality standards across evaluation teams
- Debugging evaluation systems that show inconsistent results
- Designing A/B tests for prompt or model changes
- Creating rubrics for human or automated evaluation
- Analyzing correlation between automated and human judgments

## Core Concepts

### The Evaluation Taxonomy

Evaluation approaches fall into two primary categories with distinct reliability profiles:

**Direct Scoring**: A single LLM rates one response on a defined scale.

- Best for: Objective criteria (factual accuracy, instruction following, toxicity)
- Reliability: Moderate to high for well-defined criteria
- Failure mode: Score calibration drift, inconsistent scale interpretation

**Pairwise Comparison**: An LLM compares two responses and selects the better one.

- Best f...
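The two categories in the taxonomy can be illustrated with a minimal Python sketch. It is not the skill's own implementation: the `Judge` callable, the prompt wording, the 1 to 5 scale, and the function names `direct_score` and `pairwise_compare` are all assumptions made for illustration. The judge is any function that takes a prompt string and returns the model's raw text reply, so no particular LLM client is assumed.

```python
from typing import Callable, Tuple

# Hypothetical judge interface (an assumption, not part of the skill):
# takes a prompt string and returns the model's raw text reply.
Judge = Callable[[str], str]


def direct_score(
    judge: Judge,
    response: str,
    criterion: str,
    scale: Tuple[int, int] = (1, 5),
) -> int:
    """Direct scoring: one judge rates one response on a defined scale."""
    low, high = scale
    prompt = (
        f"Rate the response below for {criterion} on a scale from "
        f"{low} to {high}. Reply with a single integer only.\n\n"
        f"Response:\n{response}"
    )
    # int() raises ValueError if the reply is not a bare integer.
    score = int(judge(prompt).strip())
    if not low <= score <= high:
        raise ValueError(f"score {score} is outside the scale {low}-{high}")
    return score


def pairwise_compare(
    judge: Judge,
    response_a: str,
    response_b: str,
    criterion: str,
) -> str:
    """Pairwise comparison: the judge picks the better of two responses."""
    prompt = (
        f"Compare the two responses below for {criterion}. "
        "Reply with exactly one letter, A or B, naming the better one.\n\n"
        f"Response A:\n{response_a}\n\n"
        f"Response B:\n{response_b}"
    )
    verdict = judge(prompt).strip().upper()
    if verdict not in {"A", "B"}:
        raise ValueError(f"unexpected verdict: {verdict!r}")
    return verdict


if __name__ == "__main__":
    # Stub judges stand in for a real model call so the sketch runs offline.
    print(direct_score(lambda prompt: "4", "Paris is in France.", "factual accuracy"))
    print(pairwise_compare(lambda prompt: "B", "Short answer.", "Detailed answer.", "helpfulness"))
```

In practice `judge` would wrap a call to whichever model serves as the evaluator. Both functions reject replies they cannot parse, which makes inconsistent scale interpretation visible as an error instead of a silently wrong score.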

Details

Author: sickn33
Repository: sickn33/antigravity-awesome-skills
Created: 4 months ago
Last Updated: today
Language: Python
License: MIT
