llm-eval

LLM evaluation: build evaluation datasets, choose metrics (RAGAS, G-Eval, LLM-as-judge), run automated evals, monitor production quality, and detect regressions
Claudient/Claudient · ★ 4 · AI & Automation · score 65

Install: claude install-skill Claudient/Claudient

# LLM Eval Skill

## When to activate

- Building a test suite for an LLM-powered feature before shipping
- Choosing evaluation metrics for RAG, chat, summarisation, or extraction tasks
- Setting up LLM-as-judge to score model outputs automatically
- Detecting prompt or model version regressions in CI
- Monitoring production output quality over time
- Comparing two prompt versions or model versions systematically

## When NOT to use

- Unit testing application code (not LLM outputs): use Jest/pytest
- A/B testing with real users: use the experiment-designer skill
- Security red-teaming: use the security-reviewer agent
- Choosing between LLMs for a new project: benchmarking is different from eval

## Instructions

### Evaluation dataset design

```
Build an evaluation dataset for [LLM feature].

Feature: [describe: RAG Q&A / summarisation / extraction / classification / chat]
Scale: [20 / 50 / 200 examples]
Data sources: [real user queries / synthetic / domain expert created]

Dataset design principles:

Distribution: match your production input distribution
- Sample from real user queries where possible (anonymised)
- If synthetic: generate with Claude using diverse personas and intents
- Cover: common cases (60%), edge cases (30%), adversarial cases (10%)

For each example, record:
| Field | Description |
|---|---|
| input | The prompt or question |
| expected_output | Ground truth answer (or criteria) |
| category | Type of query (factual / reasoning / format / refusal)