observability-sre

Observability and SRE expert. Use when setting up monitoring, logging, tracing, defining SLOs, or managing incidents. Covers Prometheus, Grafana, OpenTelemetry, and incident response best practices.
majiayu000/claude-arsenal · ★ 72 · AI & Automation · score 84
Install: claude install-skill majiayu000/claude-arsenal
# Observability & Site Reliability Engineering

## Core Principles

- **Three Pillars**: Metrics, Logs, and Traces provide holistic visibility
- **Observability-First**: Build systems that explain their own behavior
- **SLO-Driven**: Define reliability targets that matter to users
- **Proactive Detection**: Find issues before customers do
- **Blameless Culture**: Learn from failures without blame
- **Automate Toil**: Reduce repetitive operational work
- **Continuous Improvement**: Each incident makes systems more resilient
- **Full-Stack Visibility**: Monitor from infrastructure to business metrics

---

## Hard Rules (Must Follow)

> These rules are mandatory. Violating them means the skill is not working correctly.

### Symptom-Based Alerts Only

**Alert on user-facing symptoms, not internal infrastructure metrics.**

```yaml
# ❌ FORBIDDEN: Alerting on internal metrics
- alert: CPUHigh
  expr: cpu_usage > 70%  # Users don't care about CPU, they care about latency

- alert: MemoryHigh
  expr: memory_usage > 80%  # Internal metric, may not affect users

# ✅ REQUIRED: Alert on user experience
- alert: APILatencyHigh
  expr: slo:api_latency:p95 > 0.200
  annotations:
    summary: "Users experiencing slow response times"

- alert: ErrorRateHigh
  expr: slo:api_errors:rate5m > 0.001
  annotations:
    summary: "Users encountering errors"
```

### Low Cardinality Labels

**Loki/Prometheus labels must have low cardinality (<10 unique labels).**

```yaml
# ❌ FORBIDDEN: Hig