reliability-engineering-cloud

Reliability engineering for cloud systems — SLIs, SLOs, error budgets, SRE practices, runbooks, incident response, on-call rotation, blameless postmortems, chaos engineering, and the NASA Systems Engineering methodology (MCR/SRR/PDR/CDR/ORR phase gates, TAID verification, requirements tracing) adapted to cloud operations. Use when establishing SLOs for a new service, running an incident, writing a runbook, preparing a launch readiness review, or bringing NASA SE discipline to cloud deployments.
Tibsfox/gsd-skill-creator · ★ 61 · DevOps & Infrastructure · score 80
Install: claude install-skill Tibsfox/gsd-skill-creator
# Reliability Engineering for Cloud

Reliability is a property of a system, not of any individual component. A cloud service that meets its availability target is not necessarily built from components that individually meet their target; it is built from a system that absorbs component failures without exposing them to users. This skill covers the SRE toolkit (SLIs, SLOs, error budgets, runbooks, postmortems, chaos engineering) alongside NASA's Systems Engineering methodology for design reviews and verification, because cloud operations at serious scale have to borrow discipline from somewhere, and aerospace is the field that has been figuring this out the longest.

**Agent affinity:** gray (transaction processing reliability, ACID foundations), hamilton-cloud (SRE economics at AWS scale), lamport (formal safety arguments)

**Concept IDs:** cloud-se-phase-reviews, cloud-taid-verification, cloud-runbook-structure, cloud-procedure-execution, cloud-communication-loops

## Service Level Indicators, Objectives, and Agreements

**SLI (Service Level Indicator).** A quantitative measure of some aspect of service behavior. Examples: fraction of requests that succeed, fraction of requests served in under 200 ms, bytes delivered, freshness of returned data.

**SLO (Service Level Objective).** A target for an SLI over a window. "99.9% of requests succeed over a 30-day window." SLOs are set internally by the team that operates the service.

**SLA (Service Level Agreement).** A contractua