slo-sli-design

Define SLIs, SLOs, and error budgets; implement burn rate alerts; integrate with Prometheus.

AI & Automation 14 stars 3 forks Updated 3 days ago MIT


Quality Score: 86/100

Stars (weight 20%): 39
Recency (weight 20%): 100
Frontmatter (weight 20%): 70
Documentation (weight 15%): 100
Issue Health (weight 10%): 80
License (weight 10%): 100
Description (weight 5%): 100

Skill Content

# Skill: SLO/SLI Design

> **Expertise:** SLI selection, SLO target setting, error budget calculation, burn rate alerting, Sloth/Pyrra integration.

## When to load

When defining SLOs for a new service, setting up error budget tracking, or reviewing existing SLOs after an incident.

## SLI Selection Framework

```
Step 1: What does the user care about?
  → "The checkout completes successfully and quickly"

Step 2: What CAN we measure?
  → HTTP 2xx responses, p99 latency

Step 3: Define the SLI formula
  → Availability SLI: good_requests / total_requests
    where good = status < 500 AND latency < 500ms

Step 4: Pick SLO target (start conservative, tighten later)
  → 99.5% (don't chase 99.99% without data; a target stricter than
    the data supports wastes effort on caution)

Step 5: Calculate error budget
  → 100% - 99.5% = 0.5%
    over 28 days = 0.5% × 28 × 24 × 60 = 201.6 minutes
```

## Prometheus SLO Implementation (manual)

```yaml
# Recording rules for SLO tracking
groups:
  - name: slo.checkout-service
    interval: 30s
    rules:
      # Good requests (2xx, latency < 500ms)
      # Note: this matcher counts only 2xx as good, which is stricter than
      # the "status < 500" definition in Step 3; status!~"5.." would match it.
      # It also assumes http_requests_total carries a duration_bucket label.
      - record: slo:http_requests_good:rate5m
        expr: |
          sum(rate(http_requests_total{
            service="checkout-service",
            status=~"2..",
            duration_bucket="0.5"
          }[5m]))
      # Total requests
      - record: slo:http_requests_total:rate5m
        expr: |
          sum(rate(http_requests_total{service="checkout-service"}[5m]))
      # SLI = good / total
      - record: slo:http_availa...
```
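
The arithmetic in Steps 3 and 5 of the SLI Selection Framework can be checked with a few lines of Python. This is a minimal sketch, not part of the skill itself: the `Request` type and the `is_good`, `availability_sli`, and `error_budget_minutes` helpers are hypothetical names introduced here for illustration, and the 500 ms threshold and 28-day window are taken from the framework above.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Request:
    """Hypothetical record of one served request (illustration only)."""
    status: int
    latency_ms: float


def is_good(req: Request, latency_threshold_ms: float = 500.0) -> bool:
    # Step 3 definition: good = status < 500 AND latency < 500ms
    return req.status < 500 and req.latency_ms < latency_threshold_ms


def availability_sli(requests: Sequence[Request]) -> float:
    # SLI = good_requests / total_requests
    if not requests:
        raise ValueError("SLI is undefined for an empty request set")
    good = sum(1 for r in requests if is_good(r))
    return good / len(requests)


def error_budget_minutes(slo_target: float, window_days: int = 28) -> float:
    # Step 5: (100% - SLO) applied to the window, expressed in minutes
    return (1.0 - slo_target) * window_days * 24 * 60


if __name__ == "__main__":
    print(round(error_budget_minutes(0.995), 1))  # 201.6
```

Note that under the Step 3 definition a fast 4xx response counts as good, while the recording rule above matches only 2xx, so the two would report different SLI values for the same traffic.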

Details

Author: sawrus
Repository: sawrus/agent-guides
Created: 3 months ago
Last Updated: 3 days ago
Language: Shell
License: MIT
