sre-engineer

Solid

Defines service level objectives, creates error budget policies, designs incident response procedures, develops capacity models, and produces monitoring configurations and automation scripts for production systems. Use when defining SLIs/SLOs, managing error budgets, building reliable systems at scale, incident management, chaos engineering, toil reduction, or capacity planning.

AI & Automation | 9,846 stars | 859 forks | Updated 3 weeks ago | MIT

Quality Score: 94/100

Stars (20%): 100
Recency (20%): 90
Frontmatter (20%): 70
Documentation (15%): 100
Issue Health (10%): 50
License (10%): 100
Description (5%): 100

Skill Content

# SRE Engineer

## Core Workflow

1. **Assess reliability** - Review architecture, SLOs, incidents, toil levels
2. **Define SLOs** - Identify meaningful SLIs and set appropriate targets
3. **Verify alignment** - Confirm SLO targets reflect user expectations before proceeding
4. **Implement monitoring** - Build golden signal dashboards and alerting
5. **Automate toil** - Identify repetitive tasks and build automation
6. **Test resilience** - Design and execute chaos experiments; verify recovery meets RTO/RPO targets before marking the experiment complete; validate recovery behavior end-to-end

## Reference Guide

Load detailed guidance based on context:

| Topic | Reference | Load When |
|-------|-----------|-----------|
| SLO/SLI | `references/slo-sli-management.md` | Defining SLOs, calculating error budgets |
| Error Budgets | `references/error-budget-policy.md` | Managing budgets, burn rates, policies |
| Monitoring | `references/monitoring-alerting.md` | Golden signals, alert design, dashboards |
| Automation | `references/automation-toil.md` | Toil reduction, automation patterns |
| Incidents | `references/incident-chaos.md` | Incident response, chaos engineering |

## Constraints

### MUST DO

- Define quantitative SLOs (e.g., 99.9% availability)
- Calculate error budgets from SLO targets
- Monitor golden signals (latency, traffic, errors, saturation)
- Write blameless postmortems for all incidents
- Measure toil and track reduction progress
- Automate repetitive operatio...
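The skill's requirement to calculate error budgets from SLO targets can be illustrated with a little arithmetic. The following is a minimal sketch and is not taken from the skill's reference files: the function names, the 30-day window, and the sample error ratio are all assumptions chosen for illustration.

```python
MINUTES_PER_DAY = 24 * 60


def error_budget_fraction(slo_target: float) -> float:
    """Fraction of requests (or time) allowed to fail under the SLO.

    slo_target is a ratio such as 0.999 for "99.9% availability".
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return 1.0 - slo_target


def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Total downtime the budget permits over the window (30 days is an assumption)."""
    return error_budget_fraction(slo_target) * window_days * MINUTES_PER_DAY


def burn_rate(observed_error_ratio: float, slo_target: float) -> float:
    """How fast the budget is being spent.

    A value of 1.0 means the budget would be used up exactly at the end of
    the window; values above 1.0 mean it runs out early.
    """
    return observed_error_ratio / error_budget_fraction(slo_target)


if __name__ == "__main__":
    slo = 0.999  # the 99.9% availability example from the constraints list
    print(f"Budget: {error_budget_fraction(slo):.4%} of requests")
    print(f"Downtime allowed in 30 days: {allowed_downtime_minutes(slo):.1f} min")
    # Hypothetical observation: 0.5% of requests currently failing.
    print(f"Burn rate at 0.5% errors: {burn_rate(0.005, slo):.1f}x")
```

For a 99.9% target this gives a budget of roughly 43.2 minutes of downtime per 30-day window. An observed error ratio of 0.5% corresponds to a burn rate of about 5, meaning the budget would run out in about a fifth of the window if that rate continued.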

Details

Author: Jeffallan
Repository: Jeffallan/claude-skills
Created: 7 months ago
Last Updated: 3 weeks ago
Language: Python
License: MIT

Similar Skills

Semantically similar based on skill content — not just same category

AI & Automation Listed

sre-engineer

SRE / Observability Engineer (/sre) — reliability engineering: SLOs/SLIs & error budgets, monitoring & alerting (Prometheus, Grafana, OpenTelemetry), incident response & runbooks, on-call, capacity & load, chaos/resilience, and post-incident reviews. Use when defining reliability targets, instrumenting observability, setting up alerting, writing runbooks, doing incident response, or reviewing a change for production readiness. Invoke alongside /arch for reliability NFRs and devops-engineer for the underlying infra/CI-CD. NOT for provisioning infra or pipelines (that's devops-engineer) — /sre owns reliability, not the cluster.

10 Updated today
olehsvyrydov
AI & Automation Listed

sre-patterns

Provides Site Reliability Engineering best practices for SLOs, SLIs, SLAs, error budgets, toil reduction, reliability reviews, and capacity planning. Use when defining service objectives, measuring reliability, reducing toil, planning capacity, or when user mentions 'SRE', 'SLO', 'SLI', 'SLA', 'error budget', 'toil', 'reliability', 'on-call', 'capacity planning'.

65 Updated today
Tibsfox
AI & Automation Listed

operating-production-services

SRE patterns for production service reliability: SLOs, error budgets, postmortems, and incident response. Use when defining reliability targets, writing postmortems, implementing SLO alerting, or establishing on-call practices. NOT for initial service development (use scaffolding skills instead).

353 Updated today
aiskillstore