operating-production-serviceslisted
Install: claude install-skill nhattrung0911/shipwright
# Operating Production Services
## Overview
Building it is half the job; **keeping it alive under real traffic and over time** is the other half. This skill covers two areas the build checklist only name-drops: **runtime reliability controls** and **Day-2 operations/maintenance**.
**Core principle:** Assume everything fails — slow networks, traffic spikes, bad deploys, abusive clients, aging dependencies. Design controls so failure degrades gracefully and recovery is routine, not heroic.
**Announce when relevant:** "Using operating-production-services for runtime controls and maintenance."
## A. Runtime Reliability Controls
| Control | What / how | Bar to clear |
|---|---|---|
| **Rate limiting** | Cap requests per client key (user/IP/API-key). Algorithm: token bucket or sliding window. Return **429 + `Retry-After`** header. Apply at edge/gateway AND sensitive endpoints (login, signup, write APIs) | Abusive client throttled, normal user unaffected; limits documented |
| **Throttling / quotas** | Per-tenant quotas; backpressure when overloaded; **load shedding** (drop low-priority work) before total collapse | System degrades, doesn't crash, under overload |
| **Timeouts** | Every outbound call (DB, API, cache) has a timeout. No unbounded waits | No request hangs forever |
| **Retries** | Retry only idempotent ops; **exponential backoff + jitter**; cap attempts; never retry 4xx **except 408/429** (honor `Retry-After`) | No retry storms; transient errors recover |
| **Cir