llm-evaluation

Solid

Implement comprehensive evaluation strategies for LLM applications using automated metrics, human feedback, and benchmarking. Use when testing LLM performance, measuring AI application quality, or establishing evaluation frameworks.

AI & Automation | 36,222 stars | 3,928 forks | Updated today | MIT

Quality Score: 93/100

Stars (20%): 100
Recency (20%): 100
Frontmatter (20%): 70
Documentation (15%): 100
Issue Health (10%): 50
License (10%): 100
Description (5%): 100

Skill Content

# LLM Evaluation

Master comprehensive evaluation strategies for LLM applications, from automated metrics to human evaluation and A/B testing.

## When to Use This Skill

- Measuring LLM application performance systematically
- Comparing different models or prompts
- Detecting performance regressions before deployment
- Validating improvements from prompt changes
- Building confidence in production systems
- Establishing baselines and tracking progress over time
- Debugging unexpected model behavior

## Core Evaluation Types

### 1. Automated Metrics

Fast, repeatable, scalable evaluation using computed scores.

**Text Generation:**

- **BLEU**: N-gram overlap (translation)
- **ROUGE**: Recall-oriented (summarization)
- **METEOR**: Semantic similarity
- **BERTScore**: Embedding-based similarity
- **Perplexity**: Language model confidence

**Classification:**

- **Accuracy**: Percentage correct
- **Precision/Recall/F1**: Class-specific performance
- **Confusion Matrix**: Error patterns
- **AUC-ROC**: Ranking quality

**Retrieval (RAG):**

- **MRR**: Mean Reciprocal Rank
- **NDCG**: Normalized Discounted Cumulative Gain
- **Precision@K**: Relevant in top K
- **Recall@K**: Coverage in top K

### 2. Human Evaluation

Manual assessment for quality aspects difficult to automate.

**Dimensions:**

- **Accuracy**: Factual correctness
- **Coherence**: Logical flow
- **Relevance**: Answers the question
- **Fluency**: Natural language quality
- **Safety**: No harmful content
- **Helpful...
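
As an illustration of the retrieval metrics listed under Automated Metrics, here is a minimal sketch in plain Python. It assumes binary relevance (a document is either relevant or not) and that each query yields a ranked list of document IDs. The function names and the sample IDs are hypothetical and do not come from the skill itself.

```python
import math
from typing import Sequence, Set, Tuple


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for doc in ranked[:k] if doc in relevant) / len(relevant)


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is found."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Sequence[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average reciprocal rank over (ranked, relevant) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ranked, rel) for ranked, rel in runs) / len(runs)


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """NDCG with binary gains: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical example: one query, four retrieved documents, two relevant.
ranked_docs = ["d3", "d1", "d7", "d2"]
relevant_docs = {"d1", "d2"}

print(precision_at_k(ranked_docs, relevant_docs, 2))  # 0.5
print(recall_at_k(ranked_docs, relevant_docs, 2))     # 0.5
print(reciprocal_rank(ranked_docs, relevant_docs))    # 0.5
print(round(ndcg_at_k(ranked_docs, relevant_docs, 4), 4))
```

Graded relevance judgments (for example, 0 to 3) would require replacing the binary gain in `ndcg_at_k` with the grade itself; the binary form is used here only to keep the sketch short.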

Details

Author: wshobson
Repository: wshobson/agents
Created: 10 months ago
Last Updated: today
Language: Python
License: MIT
