resilience-audit

Failure-mode audit (FMEA for software) — for each way the system can fail (network, storage, partial completion, crash, concurrency, bad input), check whether code DETECTS, HANDLES, RECOVERS, and COMMUNICATES it. Triggers on: "/resilience-audit", "resilience-audit", "FMEA audit". Use when touching network, storage, async, retry, or rollback paths. Flags data loss, silent-success-on-failure, missing rollback/retry/idempotency. Reports; does not fix unless asked.
HetCreep/CoalMine · ★ 2 · Code & Development · score 75
Install: claude install-skill HetCreep/CoalMine
# Resilience Audit

**Language:** Generate EVERYTHING at runtime in the user's language: questions, answer options, menu labels, recommendations, report narrative. Detect from their messages; never default to English just because this file is English. English is allowed only for technical terms: commands, paths, code identifiers, severity labels (CRITICAL/HIGH/MEDIUM/LOW), and tier names (Light/Standard/Heavy).

For every operation: **"what happens when this FAILS?"** Report; do NOT fix unless asked.

## Failure categories

1. **External I/O**: network down/slow, API 4xx/5xx/timeout, rate-limit. Retry w/ backoff? Timeout set? Clear error vs hang?
2. **Storage**: disk full, permission denied, partial write. Atomic write (temp+rename)? Cleanup on failure? Existing good copy untouched?
3. **Partial completion**: half-done op (extracted 50/100 files). Reported as FAILURE, never success.
4. **Crash / OOM**: killed mid-op. Idempotent restart? No orphaned half-state?
5. **Concurrency**: two instances, race, deadlock. Locking / idempotency / safe re-entry?
6. **Input / data**: malformed, null, truncated, huge. Validate at boundary? Fail-fast with clear error?
7. **Dependency down**: fallback/cache/graceful degrade? Clear error vs silent hang?
8. **Resource exhaustion**: bounded? Backpressure? Cleanup on error path?

Per-stack timeout/atomicity/idempotency patterns to grep: read `references/checks.md` before scanning.

## For each failure point, check 4 things

- **Detected?**