snowflake-incident-runbook


Execute Snowflake incident response with triage, rollback, and postmortem using real SQL diagnostics. Use when responding to Snowflake outages, investigating query failures, or running post-incident reviews for pipeline failures. Trigger with phrases like "snowflake incident", "snowflake outage", "snowflake down", "snowflake on-call", "snowflake emergency".

AI & Automation 2,266 stars 315 forks Updated today MIT


Skill Content

# Snowflake Incident Runbook

## Overview

Rapid incident response procedures for Snowflake infrastructure, pipeline failures, and query issues.

## Severity Levels

| Level | Definition | Response Time | Examples |
|-------|------------|---------------|----------|
| P1 | Complete outage | < 15 min | All queries failing, auth broken |
| P2 | Degraded service | < 1 hour | High latency, task failures |
| P3 | Minor impact | < 4 hours | Snowpipe delays, non-critical errors |
| P4 | No user impact | Next business day | Monitoring gaps, cost anomalies |

## Quick Triage (First 5 Minutes)

### Step 1: Is Snowflake Itself Down?

```bash
# Check Snowflake status page
curl -s https://status.snowflake.com/api/v2/summary.json | python3 -c "
import sys, json
data = json.load(sys.stdin)
print(f\"Status: {data['status']['description']}\")
for c in data['components']:
    if c['status'] != 'operational':
        print(f\"  DEGRADED: {c['name']} - {c['status']}\")
"
```

### Step 2: Can We Connect?

```sql
-- Quick connectivity test
SELECT CURRENT_TIMESTAMP(), CURRENT_ACCOUNT(), CURRENT_REGION();
-- If this fails, the issue is connectivity/auth, not query logic
```

### Step 3: What's Failing?

```sql
-- Recent failures (last 30 minutes)
-- Note: ACCOUNT_USAGE views lag real time (QUERY_HISTORY by up to about 45 minutes),
-- so the most recent failures may not appear here yet.
SELECT error_code, error_message, COUNT(*) AS occurrences,
       MIN(start_time) AS first_seen, MAX(start_time) AS last_seen
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE execution_status = 'FAIL'
  AND start_time >= DATEADD(minutes, -30, CURRENT_TIMEST...
```
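The Step 1 status check can also be kept as a small standalone Python module, which is easier to reuse from an on-call script than an inline `python3 -c` string. The following is a minimal sketch, not part of the original runbook: the function names `fetch_summary` and `summarize_status` are hypothetical, and it assumes the status endpoint returns the same `status.description` and `components[].status` fields that the shell one-liner in Step 1 reads.

```python
import json
import urllib.request

# URL taken from Step 1 of the runbook.
STATUS_URL = "https://status.snowflake.com/api/v2/summary.json"


def fetch_summary(url: str = STATUS_URL, timeout: float = 10.0) -> dict:
    """Download and decode the status summary (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def summarize_status(summary: dict) -> tuple[str, list[str]]:
    """Return the overall status description and any non-operational components.

    Assumes the payload shape used in Step 1: a 'status' object with a
    'description' field and a 'components' list whose items carry
    'name' and 'status'. Missing fields fall back to placeholders
    instead of raising, so a partial payload still yields a report.
    """
    description = summary.get("status", {}).get("description", "unknown")
    degraded = [
        f"{c.get('name', '?')} - {c.get('status', '?')}"
        for c in summary.get("components", [])
        if c.get("status") != "operational"
    ]
    return description, degraded
```

Calling `summarize_status(fetch_summary())` would print-ready the same information as the shell version: the overall description plus one entry per component whose status is not `operational`. Separating the network call from the parsing keeps the parsing testable against a saved payload during a postmortem.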

Details

Author: jeremylongshore
Repository: jeremylongshore/claude-code-plugins-plus-skills
Created: 7 months ago
Last Updated: today
Language: Python
License: MIT
