llm-evaluation

Solid

Implement comprehensive evaluation strategies for LLM applications using automated metrics, human feedback, and benchmarking. Use when testing LLM performance, measuring AI application quality, or establishing evaluation frameworks.

AI & Automation 36,222 stars 3928 forks Updated today MIT

Install

View on GitHub

Quality Score: 93/100

Stars 20%

100

Recency 20%

100

Frontmatter 20%

Documentation 15%

100

Issue Health 10%

License 10%

100

Description 5%

100

Skill Content

# LLM Evaluation Master comprehensive evaluation strategies for LLM applications, from automated metrics to human evaluation and A/B testing. ## When to Use This Skill - Measuring LLM application performance systematically - Comparing different models or prompts - Detecting performance regressions before deployment - Validating improvements from prompt changes - Building confidence in production systems - Establishing baselines and tracking progress over time - Debugging unexpected model behavior ## Core Evaluation Types ### 1. Automated Metrics Fast, repeatable, scalable evaluation using computed scores. **Text Generation:** - **BLEU**: N-gram overlap (translation) - **ROUGE**: Recall-oriented (summarization) - **METEOR**: Semantic similarity - **BERTScore**: Embedding-based similarity - **Perplexity**: Language model confidence **Classification:** - **Accuracy**: Percentage correct - **Precision/Recall/F1**: Class-specific performance - **Confusion Matrix**: Error patterns - **AUC-ROC**: Ranking quality **Retrieval (RAG):** - **MRR**: Mean Reciprocal Rank - **NDCG**: Normalized Discounted Cumulative Gain - **Precision@K**: Relevant in top K - **Recall@K**: Coverage in top K ### 2. Human Evaluation Manual assessment for quality aspects difficult to automate. **Dimensions:** - **Accuracy**: Factual correctness - **Coherence**: Logical flow - **Relevance**: Answers the question - **Fluency**: Natural language quality - **Safety**: No harmful content - **Helpful...

Details

Author: wshobson
Repository: wshobson/agents
Created: 10 months ago
Last Updated: today
Language: Python
License: MIT

Integrates with

Anthropic · AI

Similar Skills

Semantically similar based on skill content — not just same category

AI & Automation Listed

llm-evaluation

Master comprehensive evaluation strategies for LLM applications, from automated metrics to human evaluation and A/B testing.

39,350 Updated today

sickn33

AI & Automation Featured

llm-evaluation

Master comprehensive evaluation strategies for LLM applications, from automated metrics to human evaluation and A/B testing.

27,705 Updated today

davila7