chaos-engineering

Solid

Design and run chaos experiments in Kubernetes (pod failures, network partitions, resource pressure) using LitmusChaos and manual chaos.

AI & Automation 14 stars 3 forks Updated 3 days ago MIT

Quality Score: 86/100

Stars (20%): 39
Recency (20%): 100
Frontmatter (20%): 70
Documentation (15%): 100
Issue Health (10%): 80
License (10%): 100
Description (5%): 100

Skill Content

# Skill: Chaos Engineering

> **Expertise:** LitmusChaos experiments, manual K8s chaos, network partition testing, graceful degradation validation.

## When to load

When designing chaos experiments, validating failover behavior, verifying SLO headroom, or onboarding a service to chaos testing.

## Chaos Experiment Design Principles

```
1. Define steady state first
   → What does "working" look like? (SLI baseline: error rate < 0.1%, p99 < 200ms)
2. Hypothesize
   → "If 1/3 of pods die, the service will continue serving with p99 < 500ms"
3. Blast radius control
   → Start with staging. Start with 1 pod. Increase gradually.
4. Abort conditions
   → Auto-stop if error rate > 1% or p99 > 1s for > 2 min
5. Document and act
   → Passed = evidence of resilience. Failed = fix + re-test. Never just accept failure.
```

## Manual Chaos (no tooling needed)

```bash
# ── Pod kill (test restart recovery) ──────────────────────────
kubectl delete pod <pod-name> -n production
# Watch: kubectl get pods -n production -l app=my-service -w
# Expected: new pod starts, readiness probe passes, 0 user-visible errors

# ── Kill all pods in deployment (test rolling restart recovery) ──
kubectl rollout restart deployment/my-service -n production
# Watch error rate during rollout

# ── Simulate OOMKill ──────────────────────────────────────────
kubectl exec -it <pod> -n production -- sh -c \
  "dd if=/dev/zero of=/dev/shm/blob bs=1M count=600"
# Expected: pod OOMKilled, restarted, alert fired, no...
```
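
The abort rule in design principle 4 (auto-stop if error rate > 1% or p99 > 1s for > 2 min) can be expressed as a small shell check. This is a minimal sketch, not part of the original skill: the function and variable names (`is_breaching`, `should_abort`, `MAX_*`) are hypothetical, and it assumes the "> 2 min" window applies to both conditions and that the caller supplies the current SLI values and how long the breach has lasted.

```bash
#!/usr/bin/env bash
# Sketch of the abort rule from principle 4. Thresholds come from the skill
# text; the function and variable names are illustrative assumptions.

MAX_ERROR_RATE_PCT=1      # abort when error rate > 1%
MAX_P99_MS=1000           # abort when p99 latency > 1 s
MAX_BREACH_SECONDS=120    # ...sustained for more than 2 minutes

# is_breaching ERROR_RATE_PCT P99_MS
# Exit status 0 when either SLI is strictly over its threshold.
# awk handles the comparison because the shell has no floating-point math.
is_breaching() {
  awk -v err="$1" -v p99="$2" \
      -v max_err="$MAX_ERROR_RATE_PCT" -v max_p99="$MAX_P99_MS" \
      'BEGIN { exit !(err > max_err || p99 > max_p99) }'
}

# should_abort ERROR_RATE_PCT P99_MS BREACH_SECONDS
# Exit status 0 when a breach is in progress and has lasted longer than
# the allowed window (BREACH_SECONDS is a whole number of seconds).
should_abort() {
  is_breaching "$1" "$2" && [ "$3" -gt "$MAX_BREACH_SECONDS" ]
}
```

A polling loop would feed `should_abort` with values read from whatever metrics backend is in use, track how long `is_breaching` has been true, and stop the experiment as soon as `should_abort` succeeds. Keeping the thresholds in named variables makes the abort criteria easy to review alongside the hypothesis before the experiment starts.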

Details

Author: sawrus
Repository: sawrus/agent-guides
Created: 3 months ago
Last Updated: 3 days ago
Language: Shell
License: MIT

Similar Skills

Semantically similar based on skill content — not just same category

AI & Automation Solid

chaos-engineer

Designs chaos experiments, creates failure injection frameworks, and facilitates game day exercises for distributed systems — producing runbooks, experiment manifests, rollback procedures, and post-mortem templates. Use when designing chaos experiments, implementing failure injection frameworks, or conducting game day exercises. Invoke for chaos experiments, resilience testing, blast radius control, game days, antifragile systems, fault injection, Chaos Monkey, Litmus Chaos.

9,509 Updated 1 week ago
Jeffallan

AI & Automation Solid

running-chaos-tests

Execute chaos engineering experiments to test system resilience. Use when performing specialized testing. Trigger with phrases like "run chaos tests", "test resilience", or "inject failures".

2,266 Updated today
jeremylongshore

AI & Automation Listed

chaos-experiment

Design and document chaos engineering experiments. Guide steady state baseline, hypothesis formation, failure injection plans, and results analysis. Use for resilience testing, game days, failure injection experiments, and building confidence in system stability.

33 Updated today
rjmurillo

AI & Automation Listed

chaos-engineering

Provides chaos engineering best practices for resilience testing, fault injection, and game day planning. Use when designing resilience experiments, configuring chaos tools, planning game days, or when user mentions 'chaos engineering', 'resilience', 'litmus', 'game day', 'fault injection', 'chaos monkey', 'blast radius', 'steady state', 'failure mode'.

62 Updated today
Tibsfox

AI & Automation Listed

chaos-engineer

Use when designing chaos experiments, implementing failure injection frameworks, or conducting game day exercises. Invoke for chaos experiments, resilience testing, blast radius control, game days, antifragile systems.

1 Updated today
zacklecon