reliability · listed

This skill should be used when designing, planning, implementing, or reviewing any non-trivial change, or when the user asks to "add retries", "add error handling", "add circuit breaker", "handle failures" — enforces graceful degradation, proper error handling, retry strategies, and fault-tolerant patterns so systems stay up when things go wrong
alo-exp/silver-bullet · ★ 5 · AI & Automation · score 73
Install: claude install-skill alo-exp/silver-bullet
# /reliability — Reliable Design Enforcement

Every design, plan, and implementation MUST handle failure gracefully. Things WILL go wrong — networks fail, disks fill up, dependencies go down, inputs are invalid. The question is not "will it fail?" but "what happens when it does?"

**Why this matters:** Unreliable systems erode user trust faster than any other quality issue. A system that crashes on bad input, hangs when a dependency is slow, or loses data on failure is not production-ready — no matter how many features it has.

**When to invoke:** During PLANNING (after `/gsd:discuss-phase`, before `/gsd:plan-phase`) and during REVIEW (as part of code review criteria). This skill applies to both new code and modifications to existing code.

---

## The Rules

### Rule 1: Every External Call Can Fail

Every network call, database query, file operation, and external service invocation MUST handle failure:

| Failure mode | Required handling |
|--------------|-------------------|
| Timeout | Explicit timeout set. Don't wait forever. |
| Connection refused | Retry with backoff, then degrade gracefully. |
| 5xx response | Retry with backoff (idempotent ops only). |
| 4xx response | Don't retry. Log and handle based on status code. |
| Malformed response | Validate schema. Don't crash on unexpected shapes. |
| Partial failure | Handle incomplete writes. Don't leave data half-updated. |

**No external call without a timeout.** Default: 5s for API calls, 30s for long operations (with