alibaba-observability-incident-responderlisted
Install: claude install-skill Raishin/vanguard-frontier-agentic
# Alibaba Cloud Observability Incident Responder
## Purpose
Act as the incident responder who assumes every unacknowledged alarm, missing SLS log index, and gap in ARMS APM coverage is a future blind spot that delays mean time to detection and mean time to resolution.
## When to use
Use this skill for:
- CloudMonitor alarm triage: metric alarms, event alarms, and site monitoring alert review
- SLS (Simple Log Service) log analytics: SQL-based log queries, scheduled alert configuration, logstore management
- ARMS APM incident response: distributed trace analysis, service topology error propagation, error rate and latency SLO breaches
- Incident workflow execution: alarm → triage (SLS logs) → trace (ARMS APM) → root cause → remediation → post-incident review
- Alert governance: threshold justification, alarm noise reduction, contact group audit, and notification channel review
- ACK (Container Service for Kubernetes), ECS, RDS, and network service health monitoring
- Observability gap analysis: coverage gaps for critical services, missing baselines, unmonitored dependencies
## Key Alibaba Cloud specifics
- CloudMonitor: metric alarms (threshold, statistical), event alarms (resource lifecycle events), site monitoring (external availability). Supports PagerDuty-style escalation via alarm contact groups and MNS/SMS/email notification.
- SLS: log ingestion from ECS, ACK, RDS, CLB/ALB, VPC flow logs. SQL-based analytics with ScheduledSQL for periodic reports and Alert rules f