chaos-engineering

Solid

Design and run chaos experiments in Kubernetes: pod failures, network partitions, and resource pressure, using LitmusChaos and manual chaos.

AI & Automation · 14 stars · 3 forks · Updated 3 days ago · MIT

Quality Score: 86/100

Weights: Stars 20% · Recency 20% · Frontmatter 20% · Documentation 15% · Issue Health 10% · License 10% · Description 5%

Skill Content

# Skill: Chaos Engineering

> **Expertise:** LitmusChaos experiments, manual K8s chaos, network partition testing, graceful degradation validation.

## When to load

When designing chaos experiments, validating failover behavior, verifying SLO headroom, or onboarding a service to chaos testing.

## Chaos Experiment Design Principles

```
1. Define steady state first
   → What does "working" look like? (SLI baseline: error rate < 0.1%, p99 < 200ms)
2. Hypothesize
   → "If 1/3 of pods die, the service will continue serving with p99 < 500ms"
3. Blast radius control
   → Start with staging. Start with 1 pod. Increase gradually.
4. Abort conditions
   → Auto-stop if error rate > 1% or p99 > 1s for > 2 min
5. Document and act
   → Passed = evidence of resilience. Failed = fix + re-test. Never just accept failure.
```

## Manual Chaos (no tooling needed)

```bash
# ── Pod kill (test restart recovery) ──────────────────────────
kubectl delete pod <pod-name> -n production
# Watch: kubectl get pods -n production -l app=my-service -w
# Expected: new pod starts, readiness probe passes, 0 user-visible errors

# ── Replace all pods in deployment (test rolling restart recovery) ──
# Note: rollout restart replaces pods gradually; it does not kill them all at once.
kubectl rollout restart deployment/my-service -n production
# Watch error rate during rollout

# ── Simulate OOMKill ──────────────────────────────────────────
# Note: /dev/shm is often capped at 64Mi by default; the write only triggers
# an OOMKill if the tmpfs can grow past the container memory limit.
kubectl exec -it <pod> -n production -- sh -c \
  "dd if=/dev/zero of=/dev/shm/blob bs=1M count=600"
# Expected: pod OOMKilled, restarted, alert fired, no...
```
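The abort conditions listed in the design principles (auto-stop if error rate > 1% or p99 > 1s for more than 2 minutes) can be expressed as a small shell check. The sketch below is illustrative only and not part of the original skill: the function names `breached`, `should_abort`, and `decide` are hypothetical, and it assumes you already obtain the current error rate (percent), p99 latency (milliseconds), and breach duration (seconds) from your own monitoring.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for deciding whether to abort a chaos experiment.
# Thresholds mirror the abort conditions in the design principles:
# error rate > 1% or p99 > 1s, sustained for more than 2 minutes.

ERROR_RATE_LIMIT_PCT=1   # abort above 1% errors
P99_LIMIT_MS=1000        # abort above 1s p99 latency
SUSTAIN_LIMIT_S=120      # breach must last longer than 2 minutes

# breached ERROR_RATE_PCT P99_MS
# Succeeds (exit 0) if either SLI is strictly over its limit.
# awk is used because the error rate may be fractional.
breached() {
  awk -v e="$1" -v p="$2" \
      -v el="$ERROR_RATE_LIMIT_PCT" -v pl="$P99_LIMIT_MS" \
      'BEGIN { exit !((e + 0 > el + 0) || (p + 0 > pl + 0)) }'
}

# should_abort ERROR_RATE_PCT P99_MS BREACH_SECONDS
# Succeeds only if a limit is breached AND the breach has lasted
# longer than SUSTAIN_LIMIT_S (whole seconds).
should_abort() {
  breached "$1" "$2" && [ "$3" -gt "$SUSTAIN_LIMIT_S" ]
}

# decide ERROR_RATE_PCT P99_MS BREACH_SECONDS
# Prints ABORT or CONTINUE so the result is easy to log or branch on.
decide() {
  if should_abort "$1" "$2" "$3"; then
    echo "ABORT"
  else
    echo "CONTINUE"
  fi
}
```

In an experiment loop, you would sample your metrics every few seconds, accumulate the breach duration while `breached` succeeds, reset it to zero otherwise, and roll back the injected fault as soon as `decide` prints `ABORT`.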

Details

Author: sawrus
Repository: sawrus/agent-guides
Created: 3 months ago
Last Updated: 3 days ago
Language: Shell
License: MIT

Integrates with

Kubernetes · Infrastructure

Similar Skills

Semantically similar based on skill content — not just same category

AI & Automation Solid

chaos-engineer

Designs chaos experiments, creates failure injection frameworks, and facilitates game day exercises for distributed systems — producing runbooks, experiment manifests, rollback procedures, and post-mortem templates. Use when designing chaos experiments, implementing failure injection frameworks, or conducting game day exercises. Invoke for chaos experiments, resilience testing, blast radius control, game days, antifragile systems, fault injection, Chaos Monkey, Litmus Chaos.

9,509 Updated 1 week ago

Jeffallan

AI & Automation Solid

running-chaos-tests

Execute chaos engineering experiments to test system resilience. Use when performing specialized testing. Trigger with phrases like "run chaos tests", "test resilience", or "inject failures".

2,266 Updated today

jeremylongshore

AI & Automation Listed

chaos-experiment

Design and document chaos engineering experiments. Guide steady state baseline, hypothesis formation, failure injection plans, and results analysis. Use for resilience testing, game days, failure injection experiments, and building confidence in system stability.

33 Updated today

rjmurillo

AI & Automation Listed