
operating-production-services

SRE patterns for production service reliability: SLOs, error budgets, postmortems, and incident response. Use when defining reliability targets, writing postmortems, implementing SLO alerting, or establishing on-call practices. NOT for initial service development (use scaffolding skills instead).
aiskillstore/marketplace · ★ 329 · AI & Automation · score 79
Install: claude install-skill aiskillstore/marketplace
# Operating Production Services

Production reliability patterns: measure what matters, learn from failures, improve systematically.

## Quick Reference

| Need | Go To |
|------|-------|
| Define reliability targets | [SLOs & Error Budgets](#slos--error-budgets) |
| Write incident report | [Postmortem Templates](#postmortem-templates) |
| Set up SLO alerting | [references/slo-alerting.md](references/slo-alerting.md) |

---

## SLOs & Error Budgets

### The Hierarchy

```
SLA (Contract) → SLO (Target) → SLI (Measurement)
```

### Common SLIs

```promql
# Availability: successful requests / total requests
sum(rate(http_requests_total{status!~"5.."}[28d]))
/
sum(rate(http_requests_total[28d]))

# Latency: requests below threshold / total requests
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
/
sum(rate(http_request_duration_seconds_count[28d]))
```

### SLO Targets Reality Check

| SLO % | Downtime/Month | Downtime/Year |
|-------|----------------|---------------|
| 99% | 7.2 hours | 3.65 days |
| 99.9% | 43 minutes | 8.76 hours |
| 99.95% | 22 minutes | 4.38 hours |
| 99.99% | 4.3 minutes | 52 minutes |

**Don't aim for 100%.** Each nine costs exponentially more.

### Error Budget

```
Error Budget = 1 - SLO Target
```

**Example:** 99.9% SLO = 0.1% error budget = 43 minutes/month

**Policy:**

| Budget Remaining | Action |
|------------------|--------|
| > 50% | Normal velocity |
| 10-50% | Postpone ri