managing-incidents

Guide incident response from detection to post-mortem using SRE principles, severity classification, on-call management, blameless culture, and communication protocols. Use when setting up incident processes, designing escalation policies, or conducting post-mortems.
ancoleman/ai-design-components · ★ 368 · Web & Frontend · score 80
Install: claude install-skill ancoleman/ai-design-components
# Incident Management

Provide end-to-end incident management guidance covering detection, response, communication, and learning. Emphasizes SRE culture, blameless post-mortems, and structured processes for high-reliability operations.

## When to Use This Skill

Apply this skill when:

- Setting up incident response processes for a team
- Designing on-call rotations and escalation policies
- Creating runbooks for common failure scenarios
- Conducting blameless post-mortems after incidents
- Implementing incident communication protocols (internal and external)
- Choosing incident management tooling and platforms
- Improving MTTR and incident frequency metrics

## Core Principles

### Incident Management Philosophy

**Declare Early and Often:** Do not wait for certainty. Declaring an incident early enables coordination and prevents a delayed response, and the incident can be downgraded later if needed.

**Mitigation First, Root Cause Later:** Stop customer impact immediately (rollback, disable feature, failover). Debug and fix the root cause after stability is restored.

**Blameless Culture:** Assume good intentions. Focus on how systems failed, not who failed. Create psychological safety for honest learning.

**Clear Command Structure:** Assign an Incident Commander (IC) to own coordination. The IC delegates tasks but does not do hands-on debugging.

**Communication is Critical:** Coordinate internally via dedicated channels and maintain external transparency via status pages. Update stakeholders every 15-30 minutes during critical