eval-suite-planning
Install: `claude install-skill Accelerated-Innovation/governed-ai-delivery`
# Evaluation Suite Planning
Plan the LLM evaluation test suite for a feature. Determine the feature name from the user's request; if it is not provided, ask before proceeding.
## Inputs to read
Feature specs:
- `features/<feature_name>/nfrs.md`
- `features/<feature_name>/acceptance.feature`
- `features/<feature_name>/eval_criteria.yaml`
- `features/<feature_name>/architecture_preflight.md` (sections 10-14)
Contracts and guides:
- `docs/backend/architecture/EVALUATION_LLM_CONTRACT.md`
- `docs/backend/guides/deepeval-usage.md`
- `docs/backend/guides/promptfoo-usage.md`
- `docs/backend/guides/ragas-evaluation.md`
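
Before planning, it can help to confirm that every input actually exists. The following is a minimal sketch, not part of the skill itself: the helper names `required_inputs` and `missing_inputs` and the `repo_root` parameter are hypothetical, and the sketch assumes the paths above are relative to the repository root.

```python
from pathlib import Path

# File names taken from the "Inputs to read" list above.
FEATURE_FILES = [
    "nfrs.md",
    "acceptance.feature",
    "eval_criteria.yaml",
    "architecture_preflight.md",
]
SHARED_DOCS = [
    "docs/backend/architecture/EVALUATION_LLM_CONTRACT.md",
    "docs/backend/guides/deepeval-usage.md",
    "docs/backend/guides/promptfoo-usage.md",
    "docs/backend/guides/ragas-evaluation.md",
]


def required_inputs(feature_name: str, repo_root: Path = Path(".")) -> list[Path]:
    """Return every path the planner is expected to read (hypothetical helper)."""
    feature_dir = repo_root / "features" / feature_name
    feature_paths = [feature_dir / name for name in FEATURE_FILES]
    shared_paths = [repo_root / doc for doc in SHARED_DOCS]
    return feature_paths + shared_paths


def missing_inputs(feature_name: str, repo_root: Path = Path(".")) -> list[Path]:
    """Return the expected inputs that are not present on disk."""
    return [p for p in required_inputs(feature_name, repo_root) if not p.is_file()]
```

If `missing_inputs` returns a non-empty list, one reasonable choice is to report those paths and stop rather than plan from incomplete specs.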
## Instructions
1. Read all inputs listed above.
2. Determine which evaluation tools are required based on the architecture preflight:
- DeepEval: always required for `mode: llm`
- Promptfoo: required if the feature is user-facing or processes untrusted input
- RAGAS: required if the feature uses retrieval (a RAG pipeline)
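
The tool selection rules above can be sketched as a small decision function. This is an illustration only: the function name and its boolean parameters (`user_facing`, `untrusted_input`, `uses_retrieval`) are hypothetical and stand in for whatever the architecture preflight actually records.

```python
def select_eval_tools(
    mode: str,
    user_facing: bool,
    untrusted_input: bool,
    uses_retrieval: bool,
) -> list[str]:
    """Map feature traits to the evaluation tools the plan must cover."""
    tools: list[str] = []
    # DeepEval is always required when the feature runs in `mode: llm`.
    if mode == "llm":
        tools.append("deepeval")
    # Promptfoo covers adversarial testing for exposed or untrusted inputs.
    if user_facing or untrusted_input:
        tools.append("promptfoo")
    # RAGAS applies only when a retrieval (RAG) pipeline is involved.
    if uses_retrieval:
        tools.append("ragas")
    return tools
```

For example, an internal LLM feature with retrieval but no untrusted input would yield `["deepeval", "ragas"]`.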
3. For **DeepEval**, select metrics based on feature type:
- All LLM features: `FaithfulnessMetric`, `AnswerRelevancyMetric`
- Features with context: add `HallucinationMetric`, `ContextualRelevancyMetric`
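- Features needing custom criteria: add `GEval` with a feature-specific rubric
- Set thresholds (recommend 0.8 minimum, 0.85+ for production features). Note that `HallucinationMetric` scores are lower-is-better, so its threshold acts as a maximum and should be set separately.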
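
As a rough sketch of the metric selection above, the plan can be captured as plain data before any DeepEval code is written. Everything here is illustrative: `PlannedMetric`, `plan_deepeval_metrics`, and the flag names are hypothetical, and 0.85 is used as the production value only because it is the lower bound of the "0.85+" recommendation. The sketch deliberately leaves the `HallucinationMetric` threshold unset, because that metric passes when its score is at or below the threshold, so the 0.8 floor does not carry over directly.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PlannedMetric:
    """One row of the evaluation plan (hypothetical structure)."""
    name: str
    threshold: Optional[float]  # None means "decide separately"
    note: str = ""


def plan_deepeval_metrics(
    has_context: bool,
    needs_custom_criteria: bool,
    production: bool,
) -> list[PlannedMetric]:
    """Select DeepEval metric names and thresholds from feature traits."""
    # 0.8 is the recommended minimum; production features use at least 0.85.
    threshold = 0.85 if production else 0.8

    # Baseline metrics for all LLM features.
    plan = [
        PlannedMetric("FaithfulnessMetric", threshold),
        PlannedMetric("AnswerRelevancyMetric", threshold),
    ]
    if has_context:
        # Lower is better for hallucination, so no minimum is assigned here.
        plan.append(
            PlannedMetric("HallucinationMetric", None, "lower is better; set a maximum")
        )
        plan.append(PlannedMetric("ContextualRelevancyMetric", threshold))
    if needs_custom_criteria:
        plan.append(PlannedMetric("GEval", threshold, "rubric still to be written"))
    return plan
```

Keeping the plan as data makes it easy to review against `eval_criteria.yaml` before wiring the chosen metrics into real test cases.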
4. For **Promptfoo**, plan adversarial scenarios:
- Jailbreak attempts (bypass system instructions)
- Prompt injection (embed malicious instructions in user input)
- Topic boundary te