rag-eval-guardrailslisted

Build a verified eval harness for a RAG/LLM feature plus PII/PHI-leakage guardrails, gated by checks that actually run. Scores a precomputed predictions file (so it runs with ZERO API access) on groundedness, citation validity, retrieval hit@k, answer F1/exact-match, refusal rate, and latency; compares to config thresholds and a baseline to catch regressions; and fails the build on PII/PHI leakage. Use when the user wants to evaluate or regression-test an AI/RAG feature, measure hallucination/groundedness, add an eval gate to CI, or scan prompts/answers/logs for leaked identifiers. Triggers: "RAG evaluation", "LLM eval", "eval harness", "hallucination", "groundedness", "PII/PHI leakage", "guardrails", "regression testing for AI features".
NeuralMedic-DE/claude-skills · ★ 0 · AI & Automation · score 73

Install: claude install-skill NeuralMedic-DE/claude-skills

# RAG eval & guardrails (verified, zero-API) Prove a RAG/LLM feature is good enough to ship — and isn't leaking identifiers — with checks that **run and exit non-zero on failure**, not assertions. The eval scores a precomputed predictions file, so the gate needs **no API access**. ## Core principle **Quality is measured, not claimed.** The loop is: eval against a frozen golden set → read the failures → fix the prompt / retrieval / data → re-eval, until thresholds are met **and** nothing regressed versus the baseline. A separate PII/PHI guard fails the build if structured identifiers leak. **Be honest about scope (this is the rule that keeps the skill correct):** the metrics here are **lexical proxies** — token overlap, containment, F1. They catch gross failures (hallucination, fabricated citations, retrieval misses) but do **not** measure truth. A correct paraphrase can score low; a wrong answer that reuses context words can score high. True faithfulness/correctness needs human review or an LLM-as-judge. The PII guard catches **structured** identifiers (emails, cards, SSNs…), not names or contextual PHI. Report **"thresholds met on lexical proxies; no regression; no structured-identifier leakage"** — never "the AI system is correct" or "compliant." → `references/01-eval-driven-rag.md`, `references/02-metrics.md` ## When to use vs. not - Use for: evaluating or regression-testing a RAG/LLM feature; measuring groundedness/hallucination, citation validity, retrieval hit@k