← ClaudeAtlas

llm-evaluationlisted

Implement comprehensive evaluation strategies for LLM applications using automated metrics, human feedback, and benchmarking. Use when testing LLM performance, measuring AI application quality, or establishing evaluation frameworks.
aiskillstore/marketplace · ★ 334 · AI & Automation · score 80
Install: claude install-skill aiskillstore/marketplace
# LLM Evaluation Master comprehensive evaluation strategies for LLM applications, from automated metrics to human evaluation and A/B testing. ## When to Use This Skill - Measuring LLM application performance systematically - Comparing different models or prompts - Detecting performance regressions before deployment - Validating improvements from prompt changes - Building confidence in production systems - Establishing baselines and tracking progress over time - Debugging unexpected model behavior ## Core Evaluation Types ### 1. Automated Metrics Fast, repeatable, scalable evaluation using computed scores. **Text Generation:** - **BLEU**: N-gram overlap (translation) - **ROUGE**: Recall-oriented (summarization) - **METEOR**: Semantic similarity - **BERTScore**: Embedding-based similarity - **Perplexity**: Language model confidence **Classification:** - **Accuracy**: Percentage correct - **Precision/Recall/F1**: Class-specific performance - **Confusion Matrix**: Error patterns - **AUC-ROC**: Ranking quality **Retrieval (RAG):** - **MRR**: Mean Reciprocal Rank - **NDCG**: Normalized Discounted Cumulative Gain - **Precision@K**: Relevant in top K - **Recall@K**: Coverage in top K ### 2. Human Evaluation Manual assessment for quality aspects difficult to automate. **Dimensions:** - **Accuracy**: Factual correctness - **Coherence**: Logical flow - **Relevance**: Answers the question - **Fluency**: Natural language quality - **Safety**: No harmful content - **Helpfulness**