
sre-patterns

Provides Site Reliability Engineering best practices for SLOs, SLIs, SLAs, error budgets, toil reduction, reliability reviews, and capacity planning. Use when defining service objectives, measuring reliability, reducing toil, planning capacity, or when the user mentions 'SRE', 'SLO', 'SLI', 'SLA', 'error budget', 'toil', 'reliability', 'on-call', or 'capacity planning'.
Tibsfox/gsd-skill-creator · ★ 61 · AI & Automation · score 74
Install: claude install-skill Tibsfox/gsd-skill-creator
# SRE Patterns

Best practices for building and operating reliable systems using Site Reliability Engineering principles.

## SLO / SLI / SLA Definitions

These three concepts form the foundation of SRE. They are distinct and frequently confused.

| Concept | Definition | Owner | Example |
|---------|-----------|-------|---------|
| **SLI** (Service Level Indicator) | A quantitative measurement of a service attribute | Engineering | 99.2% of requests completed in < 300ms |
| **SLO** (Service Level Objective) | A target value or range for an SLI | Engineering + Product | 99.5% of requests must complete in < 300ms |
| **SLA** (Service Level Agreement) | A contract with consequences for missing an SLO | Business + Legal | 99.9% uptime or customer receives service credits |

### Relationship

```
SLI (what you measure)
  --> SLO (what you target, always stricter than SLA)
    --> SLA (what you promise externally, with penalties)
```

**Key rule:** SLO must be stricter than SLA. If your SLA promises 99.9% uptime, your internal SLO should target 99.95%. The gap is your safety margin.

## SLI Specification

SLIs must be precise, measurable, and tied to user experience. Vague indicators lead to meaningless objectives.

### SLI Types by Service Category

| Service Type | SLI Category | Good Event | Valid Event |
|-------------|-------------|------------|-------------|
| Request-driven | Availability | Response status < 500 | All HTTP requests |
| Request-driven | Latency | Response time