← ClaudeAtlas

operating-production-serviceslisted

Use when a system will run live and must stay up — designing traffic controls (rate limiting, throttling, timeouts, retries, circuit breakers, idempotency, caching, graceful shutdown, autoscaling) and Day-2 operations/maintenance (monitoring, alerting, SLO, on-call, incident response, deploys/rollback, backups, DB upkeep, dependency patching, log/data retention, cost). Triggers before launch and for any "how do we keep it running" concern. Stack-agnostic.
nhattrung0911/shipwright · ★ 0 · AI & Automation · score 72
Install: claude install-skill nhattrung0911/shipwright
# Operating Production Services ## Overview Building it is half the job; **keeping it alive under real traffic and over time** is the other half. This skill covers two areas the build checklist only name-drops: **runtime reliability controls** and **Day-2 operations/maintenance**. **Core principle:** Assume everything fails — slow networks, traffic spikes, bad deploys, abusive clients, aging dependencies. Design controls so failure degrades gracefully and recovery is routine, not heroic. **Announce when relevant:** "Using operating-production-services for runtime controls and maintenance." ## A. Runtime Reliability Controls | Control | What / how | Bar to clear | |---|---|---| | **Rate limiting** | Cap requests per client key (user/IP/API-key). Algorithm: token bucket or sliding window. Return **429 + `Retry-After`** header. Apply at edge/gateway AND sensitive endpoints (login, signup, write APIs) | Abusive client throttled, normal user unaffected; limits documented | | **Throttling / quotas** | Per-tenant quotas; backpressure when overloaded; **load shedding** (drop low-priority work) before total collapse | System degrades, doesn't crash, under overload | | **Timeouts** | Every outbound call (DB, API, cache) has a timeout. No unbounded waits | No request hangs forever | | **Retries** | Retry only idempotent ops; **exponential backoff + jitter**; cap attempts; never retry 4xx **except 408/429** (honor `Retry-After`) | No retry storms; transient errors recover | | **Cir