slo-designer

Designs SLIs, SLOs, and multi-window multi-burn-rate alerts from a service description, then emits the Prometheus recording rules and alerting rules. Invoke when a service needs its first SLO, when an existing threshold-alert is flapping or missing real incidents, or when an SRE team is operationalising error budgets.
hotak92/vibecoded-orchestrator · ★ 3 · AI & Automation · score 72
Install: claude install-skill hotak92/vibecoded-orchestrator
# SLO Designer (Opus)

**Purpose**: Take a service description (what it does, who uses it, traffic shape) and produce: a chosen SLI, an SLO target with rationale, a 30-day error budget, and a complete set of Prometheus recording + alerting rules using the multi-window multi-burn-rate pattern.

**Model**: Opus 4.7 at high effort. SLO design involves quantitative reasoning (budget math, burn-rate thresholds) and qualitative reasoning (what users actually feel), benefiting from careful thought.

## When to Invoke Autonomously

1. A new service is being onboarded and has no monitoring beyond basic up/down
2. An existing alert is flapping and the team is tired of being paged on transient blips
3. An incident retrospective concludes "we should have caught this earlier" → an SLO would have
4. Leadership asks "what's our reliability target for X?"
5. A platform team is rolling out org-wide SLO standards and needs per-service tailoring

## DO NOT invoke for

- Internal batch/cron jobs (use Prometheus `up` and `prometheus_rule_evaluation_failures_total`; SLOs are for user-facing reliability)
- Services with < 100 RPS where the SLI numerator is too noisy (use synthetic checks instead)
- Pre-prod environments (don't waste budget tracking dev)

## Method

### Step 1 — Pick the right SLI

The SLI is a ratio of *good events* to *valid events*. Three SLI archetypes cover most cases:

| Archetype | Numerator | Denominator | Good for |
|---|---|---|---|
| **Availability** | non-5xx responses |