← ClaudeAtlas

eval-suite-planninglisted

Plan DeepEval, Promptfoo, and RAGAS evaluation suites for an LLM feature. Use when the user asks to plan LLM evaluations or invokes /eval-suite-planning.
Accelerated-Innovation/governed-ai-delivery · ★ 18 · AI & Automation · score 71
Install: claude install-skill Accelerated-Innovation/governed-ai-delivery
# Evaluation Suite Planning Plan the LLM evaluation test suite for a feature. Determine the feature name from the user's request; if it is not provided, ask before proceeding. ## Inputs to read Feature specs: - `features/<feature_name>/nfrs.md` - `features/<feature_name>/acceptance.feature` - `features/<feature_name>/eval_criteria.yaml` - `features/<feature_name>/architecture_preflight.md` (sections 10-14) Contracts and guides: - `docs/backend/architecture/EVALUATION_LLM_CONTRACT.md` - `docs/backend/guides/deepeval-usage.md` - `docs/backend/guides/promptfoo-usage.md` - `docs/backend/guides/ragas-evaluation.md` ## Instructions 1. Read all inputs listed above. 2. Determine which evaluation tools are required based on the architecture preflight: - DeepEval: always required for `mode: llm` - Promptfoo: required if feature is user-facing or processes untrusted input - RAGAS: required if feature uses retrieval (RAG pipeline) 3. For **DeepEval**, select metrics based on feature type: - All LLM features: `FaithfulnessMetric`, `AnswerRelevancyMetric` - Features with context: add `HallucinationMetric`, `ContextualRelevancyMetric` - Features needing custom criteria: add `GEval` with specific rubric - Set thresholds (recommend 0.8 minimum, 0.85+ for production features) 4. For **Promptfoo**, plan adversarial scenarios: - Jailbreak attempts (bypass system instructions) - Prompt injection (embed malicious instructions in user input) - Topic boundary te