chaos-engineering

Provides chaos engineering best practices for resilience testing, fault injection, and game day planning. Use when designing resilience experiments, configuring chaos tools, planning game days, or when the user mentions 'chaos engineering', 'resilience', 'litmus', 'game day', 'fault injection', 'chaos monkey', 'blast radius', 'steady state', or 'failure mode'.
Tibsfox/gsd-skill-creator · ★ 61 · AI & Automation · score 74
Install: claude install-skill Tibsfox/gsd-skill-creator
# Chaos Engineering

Best practices for systematically injecting failures to discover weaknesses before they cause outages, using steady-state hypotheses, controlled experiments, and progressive blast radius expansion.

## Chaos Engineering Principles

Chaos engineering is not random destruction. It is disciplined experimentation on distributed systems to build confidence in their resilience.

```
Define Steady State --> Form Hypothesis --> Design Experiment --> Control Blast Radius --> Run --> Analyze --> Fix --> Repeat
```

| Principle | Description | Why It Matters |
|-----------|-------------|----------------|
| Define steady state | Identify measurable normal behavior (latency, error rate, throughput) | Without a baseline, you cannot detect degradation |
| Hypothesize around steady state | Predict the system will maintain steady state during fault | Forces explicit thinking about expected behavior |
| Vary real-world events | Inject failures that actually happen (network, disk, process, dependency) | Simulated failures must map to real failure modes |
| Run in production | Test where real complexity exists (with safeguards) | Staging rarely matches production topology |
| Minimize blast radius | Start small, expand gradually, have kill switches | Chaos should reveal problems, not cause outages |
| Automate experiments | Repeatable experiments run in CI/CD or on schedule | Manual experiments don't scale and introduce bias |
| Build a hypothesis backlog | Track what you wa