Install: claude install-skill aiskillstore/marketplace
# Operating Production Services
Production reliability patterns: measure what matters, learn from failures, improve systematically.
## Quick Reference
| Need | Go To |
|------|-------|
| Define reliability targets | [SLOs & Error Budgets](#slos--error-budgets) |
| Write incident report | [Postmortem Templates](#postmortem-templates) |
| Set up SLO alerting | [references/slo-alerting.md](references/slo-alerting.md) |
---
## SLOs & Error Budgets
### The Hierarchy
```
SLA (Contract) → SLO (Target) → SLI (Measurement)
```
### Common SLIs
```promql
# Availability: successful requests / total requests
sum(rate(http_requests_total{status!~"5.."}[28d]))
/
sum(rate(http_requests_total[28d]))

# Latency: requests served within 0.5s / total requests
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
/
sum(rate(http_request_duration_seconds_count[28d]))
```
### SLO Targets Reality Check
| SLO % | Allowed Downtime/Month (30 days) | Allowed Downtime/Year |
|-------|----------------|---------------|
| 99% | 7.2 hours | 3.65 days |
| 99.9% | 43 minutes | 8.76 hours |
| 99.95% | 22 minutes | 4.38 hours |
| 99.99% | 4.3 minutes | 52 minutes |
**Don't aim for 100%.** Each additional nine cuts the allowed downtime by 10x, and the cost of achieving it rises steeply.
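The figures in the table follow from simple arithmetic: allowed downtime is the failure fraction times the length of the period. A minimal sketch, assuming a 30-day month and a 365-day year; the function name `allowed_downtime_minutes` is illustrative, not part of any tool referenced here:

```python
def allowed_downtime_minutes(slo_percent: float, period_days: float = 30) -> float:
    """Minutes of full downtime an SLO permits over a period (assumes 30-day month by default)."""
    if not 0 <= slo_percent <= 100:
        raise ValueError("slo_percent must be between 0 and 100")
    error_fraction = 1 - slo_percent / 100  # share of the period allowed to fail
    return error_fraction * period_days * 24 * 60


if __name__ == "__main__":
    for slo in (99.0, 99.9, 99.95, 99.99):
        monthly = allowed_downtime_minutes(slo)
        yearly = allowed_downtime_minutes(slo, period_days=365)
        print(f"{slo}%: {monthly:.1f} min/month, {yearly / 60:.2f} h/year")
```

Running it for a candidate target before committing to it makes the operational cost of an extra nine concrete.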
### Error Budget
```
Error Budget = 1 - SLO Target
```
**Example:** a 99.9% SLO leaves a 0.1% error budget, which is about 43 minutes of downtime in a 30-day month.
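For request-based SLIs, the same formula can be used to track how much of the budget is left. The sketch below is one possible way to compute it, assuming you already have good and total event counts for the SLO window (for example from the availability query above); `error_budget_remaining` is a hypothetical helper name:

```python
def error_budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent.

    1.0 means untouched, 0.0 means fully spent, negative means overspent.
    slo_target is a fraction, e.g. 0.999 for a 99.9% SLO.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events == 0:
        return 1.0  # no traffic in the window, so nothing has been spent
    budget = 1 - slo_target  # Error Budget = 1 - SLO Target
    failure_rate = (total_events - good_events) / total_events
    return 1 - failure_rate / budget


if __name__ == "__main__":
    remaining = error_budget_remaining(0.999, good_events=999_500, total_events=1_000_000)
    print(f"{remaining:.0%} of the error budget remains")
```

The resulting fraction is the "Budget Remaining" value that the policy keys on.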
**Policy:**
| Budget Remaining | Action |
|------------------|--------|
| > 50% | Normal velocity |
| 10-50% | Postpone ri