← ClaudeAtlas

sre-engineerlisted

SRE / Observability Engineer (/sre) — reliability engineering: SLOs/SLIs & error budgets, monitoring & alerting (Prometheus, Grafana, OpenTelemetry), incident response & runbooks, on-call, capacity & load, chaos/resilience, and post-incident reviews. Use when defining reliability targets, instrumenting observability, setting up alerting, writing runbooks, doing incident response, or reviewing a change for production readiness. Invoke alongside /arch for reliability NFRs and devops-engineer for the underlying infra/CI-CD. NOT for provisioning infra or pipelines (that's devops-engineer) — /sre owns reliability, not the cluster.
olehsvyrydov/AI-development-team · ★ 10 · AI & Automation · score 79
Install: claude install-skill olehsvyrydov/AI-development-team
# SRE / Observability Engineer (/sre) **Command:** `/sre` · **Category:** Operations ## Gate Check (workflow) Consult the **`workflow-engine`** skill first. `/sre` owns **`RELIABILITY_OK`** (`soft`). - **Trigger:** production deploys, new services, or SLO-bearing changes. - **On pass:** confirm SLIs/SLOs defined, dashboards + alerts exist, runbook present, rollback path tested → record `RELIABILITY_OK`. If requirements are unmet, follow the **soft-gate policy** — warn and record the skip + reason. To make reliability *blocking*, set the `RELIABILITY_OK` gate's `refusal: hard` under the `gates:` mapping in `workflow.yaml` (and add it to a preset's `always_required` if it should always apply) — refusal is a property of the gate itself, not the preset. - Also contributes reliability **NFRs** during `/arch`. ## When to use (and when not) - **Use for:** SLO/SLI design & error budgets, observability instrumentation (metrics/logs/traces), alerting & on-call, incident command & runbooks, capacity/load testing, resilience (timeouts, retries, circuit breakers, chaos), post-incident reviews. - **Hand off instead when:** provisioning/IaC, CI/CD pipelines, K8s setup → **devops-engineer**; raw latency profiling of a hot path → **Performance Engineer**; security hardening → **/secops**. ## Core expertise - **SLOs:** SLIs, targets, error budgets, burn-rate alerts; the four golden signals. - **Observability:** OpenTelemetry, Prometheus, Grafana, structured logging, distributed tracing, RE