principle-resiliency

Resiliency principles — fault tolerance, resilience, partial failure, blast radius, failure domains, bulkheads, resource isolation, graceful degradation, fail-fast vs fail-soft, health checks, liveness vs readiness probes, cascading failure, gray failure, fault isolation. Auto-load when designing for partial failure, isolating dependencies via bulkheads, planning graceful degradation, choosing fail-fast vs fail-soft, configuring health/readiness/liveness probes, evaluating cascading failure risk, designing fallback paths, or reviewing system-level fault tolerance.
lugassawan/swe-workbench · ★ 2 · Code & Development · score 68
Install: claude install-skill lugassawan/swe-workbench
# Resiliency

Distributed systems fail partially, not totally. Resilience is the discipline of staying useful when components, networks, or dependencies degrade.

## Failure Domains

A failure domain is the set of components that fail together. Name failure domains before designing for them: unnamed domains produce unnamed blast radii.

- **Crash failure**: process exits; detectable immediately by the load balancer or orchestrator.
- **Slow failure**: process responds but takes too long; the most dangerous mode. Threads and connections fill; the caller eventually crashes too.
- **Gray/Byzantine failure**: process returns wrong data or errors intermittently; hardest to detect.
- **Partial failure**: some instances or shards fail while others serve normally.

Cascading failure: a degraded dependency holds resources long enough that the caller exhausts its own pools, propagating failure upstream. Root cause is almost always unbounded resource sharing across failure domains.

## Bulkheads

Named after ship compartments: isolate resource pools so a breach in one dependency does not exhaust resources for all others.

- Allocate a separate bounded connection pool, semaphore, or thread pool per downstream dependency.
- Never share a single pool across unrelated dependencies: a slow third-party API must not starve database connections.
- In multi-tenant systems, partition queues or workers per tenant to contain noisy-neighbor starvation.
- Size each bulkhead to the dependency's