llm-evaluation


Master comprehensive evaluation strategies for LLM applications, from automated metrics to human evaluation and A/B testing.

AI & Automation · 27,705 stars · 2,858 forks · Updated today · MIT


Quality Score: 99/100

Stars (20%): 100
Recency (20%): 100
Frontmatter (20%)
Documentation (15%): 100
Issue Health (10%)
License (10%): 100
Description (5%): 100

Skill Content

# LLM Evaluation

Master comprehensive evaluation strategies for LLM applications, from automated metrics to human evaluation and A/B testing.

## Do not use this skill when

- The task is unrelated to llm evaluation
- You need a different domain or tool outside this scope

## Instructions

- Clarify goals, constraints, and required inputs.
- Apply relevant best practices and validate outcomes.
- Provide actionable steps and verification.
- If detailed examples are required, open `resources/implementation-playbook.md`.

## Use this skill when

- Measuring LLM application performance systematically
- Comparing different models or prompts
- Detecting performance regressions before deployment
- Validating improvements from prompt changes
- Building confidence in production systems
- Establishing baselines and tracking progress over time
- Debugging unexpected model behavior

## Core Evaluation Types

### 1. Automated Metrics

Fast, repeatable, scalable evaluation using computed scores.

**Text Generation:**

- **BLEU**: N-gram overlap (translation)
- **ROUGE**: Recall-oriented (summarization)
- **METEOR**: Semantic similarity
- **BERTScore**: Embedding-based similarity
- **Perplexity**: Language model confidence

**Classification:**

- **Accuracy**: Percentage correct
- **Precision/Recall/F1**: Class-specific performance
- **Confusion Matrix**: Error patterns
- **AUC-ROC**: Ranking quality

**Retrieval (RAG):**

- **MRR**: Mean Reciprocal Rank
- **NDCG**: Normalized Discounted Cumul...
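Several of the automated metrics named above are simple enough to compute by hand, which can help when checking what a library reports. The following is a minimal sketch in plain Python of accuracy, per-class precision/recall/F1, and Mean Reciprocal Rank. The function names and the input shapes (parallel label lists, ranked document-ID lists with a set of relevant IDs per query) are assumptions made for illustration and are not taken from the skill itself.

```python
from typing import Hashable, Sequence, Set, Tuple


def accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Fraction of predictions that exactly match the reference label."""
    if not y_true:
        return 0.0
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall_f1(
    y_true: Sequence[Hashable],
    y_pred: Sequence[Hashable],
    positive: Hashable,
) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for one class treated as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against empty denominators (no predicted or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return precision, recall, f1


def mean_reciprocal_rank(
    ranked_results: Sequence[Sequence[Hashable]],
    relevant: Sequence[Set[Hashable]],
) -> float:
    """Average of 1/rank of the first relevant item per query (0 if none found)."""
    if not ranked_results:
        return 0.0
    total = 0.0
    for results, rel in zip(ranked_results, relevant):
        for rank, doc_id in enumerate(results, start=1):
            if doc_id in rel:
                total += 1.0 / rank
                break
    return total / len(ranked_results)


if __name__ == "__main__":
    # Hypothetical labels and retrieval results, for illustration only.
    y_true = ["pos", "neg", "pos", "neg"]
    y_pred = ["pos", "pos", "neg", "neg"]
    print(accuracy(y_true, y_pred))
    print(precision_recall_f1(y_true, y_pred, positive="pos"))
    print(mean_reciprocal_rank([["a", "b", "c"], ["x", "y", "z"]], [{"b"}, {"x"}]))
```

For production use, an established library is usually a safer choice than hand-rolled metrics. A sketch like this is mainly useful for unit-testing an evaluation harness or for understanding what a reported score means.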

Details

Author: davila7
Repository: davila7/claude-code-templates
Created: 11 months ago
Last Updated: today
Language: Python
License: MIT

Integrates with

Anthropic · AI

Similar Skills

Semantically similar based on skill content — not just same category

AI & Automation Featured

llm-evaluation

Master comprehensive evaluation strategies for LLM applications, from automated metrics to human evaluation and A/B testing.

39,350 Updated today

sickn33

AI & Automation Listed

llm-evaluation

Implement comprehensive evaluation strategies for LLM applications using automated metrics, human feedback, and benchmarking. Use when testing LLM performance, measuring AI application quality, or establishing evaluation frameworks.

0 Updated today

CodeWithBehnam

AI & Automation Solid