evaluating-llms

Evaluate LLM systems using automated metrics, LLM-as-judge, and benchmarks. Use when testing prompt quality, validating RAG pipelines, measuring safety (hallucinations, bias), or comparing models for production deployment.
ancoleman/ai-design-components · ★ 368 · AI & Automation · score 80
Install: claude install-skill ancoleman/ai-design-components
# LLM Evaluation

Evaluate Large Language Model (LLM) systems using automated metrics, LLM-as-judge patterns, and standardized benchmarks to ensure production quality and safety.

## When to Use This Skill

Apply this skill when:

- Testing individual prompts for correctness and formatting
- Validating RAG (Retrieval-Augmented Generation) pipeline quality
- Measuring hallucinations, bias, or toxicity in LLM outputs
- Comparing different models or prompt configurations (A/B testing)
- Running benchmark tests (MMLU, HumanEval) to assess model capabilities
- Setting up production monitoring for LLM applications
- Integrating LLM quality checks into CI/CD pipelines

Common triggers:

- "How do I test if my RAG system is working correctly?"
- "How can I measure hallucinations in LLM outputs?"
- "What metrics should I use to evaluate generation quality?"
- "How do I compare GPT-4 vs Claude for my use case?"
- "How do I detect bias in LLM responses?"

## Evaluation Strategy Selection

### Decision Framework: Which Evaluation Approach?

**By Task Type:**

| Task Type | Primary Approach | Metrics | Tools |
|-----------|------------------|---------|-------|
| **Classification** (sentiment, intent) | Automated metrics | Accuracy, Precision, Recall, F1 | scikit-learn |
| **Generation** (summaries, creative text) | LLM-as-judge + automated | BLEU, ROUGE, BERTScore, Quality rubric | GPT-4/Claude for judging |
| **Question Answering** | Exact match + semantic similarity | EM, F1, Cosine simi