databricks-incident-runbook

Featured

Execute Databricks incident response procedures with triage, mitigation, and postmortem. Use when responding to Databricks-related outages, investigating job failures, or running post-incident reviews for pipeline failures. Trigger with phrases like "databricks incident", "databricks outage", "databricks down", "databricks on-call", "databricks emergency", "job failed".

AI & Automation 2,266 stars 315 forks Updated today MIT

Install

View on GitHub

Quality Score: 99/100

Stars 20%
100
Recency 20%
100
Frontmatter 20%
70
Documentation 15%
100
Issue Health 10%
50
License 10%
100
Description 5%
100

Skill Content

# Databricks Incident Runbook ## Overview Rapid incident response for Databricks: triage script, decision tree, immediate actions by error type, communication templates, evidence collection, and postmortem template. Designed for on-call engineers to follow during live incidents. ## Severity Levels | Level | Definition | Response Time | Examples | |-------|------------|---------------|----------| | P1 | Production pipeline down | < 15 min | Critical ETL failed, data not updating | | P2 | Degraded performance | < 1 hour | Slow queries, partial failures, stale data | | P3 | Non-critical issues | < 4 hours | Dev cluster issues, non-critical job delays | | P4 | No user impact | Next business day | Monitoring gaps, cleanup needed | ## Instructions ### Step 1: Quick Triage (Run First) ```bash #!/bin/bash set -euo pipefail echo "=== DATABRICKS TRIAGE $(date -u +%H:%M:%S\ UTC) ===" # 1. Is Databricks itself down? echo "--- Platform Status ---" curl -s https://status.databricks.com/api/v2/status.json | \ jq -r '.status.description // "UNKNOWN"' # 2. Can we reach the workspace? echo "--- Workspace ---" if databricks current-user me --output json 2>/dev/null | jq -r .userName; then echo "API: CONNECTED" else echo "API: UNREACHABLE — check VPN/firewall/token" fi # 3. Recent failures echo "--- Failed Runs (last 1h) ---" databricks runs list --limit 20 --output json 2>/dev/null | \ jq -r '.runs[]? | select(.state.result_state == "FAILED") | "\(.run_id): \(.run_name /...

Details

Author
jeremylongshore
Repository
jeremylongshore/claude-code-plugins-plus-skills
Created
7 months ago
Last Updated
today
Language
Python
License
MIT

Integrates with

Similar Skills

Semantically similar based on skill content — not just same category

AI & Automation Featured

replit-incident-runbook

Execute Replit incident response: triage deployment failures, database issues, and platform outages. Use when responding to Replit-related outages, investigating deployment crashes, or running post-incident reviews for Replit app failures. Trigger with phrases like "replit incident", "replit outage", "replit down", "replit emergency", "replit broken", "replit crash".

2,266 Updated today
jeremylongshore
AI & Automation Featured

fireflies-incident-runbook

Execute Fireflies.ai incident response with triage, remediation, and postmortem. Use when responding to Fireflies.ai API outages, auth failures, or webhook delivery problems. Trigger with phrases like "fireflies incident", "fireflies outage", "fireflies down", "fireflies on-call", "fireflies emergency", "fireflies broken".

2,266 Updated today
jeremylongshore
AI & Automation Featured

klaviyo-incident-runbook

Execute Klaviyo incident response procedures with triage, mitigation, and postmortem. Use when responding to Klaviyo-related outages, investigating API errors, or running post-incident reviews for Klaviyo integration failures. Trigger with phrases like "klaviyo incident", "klaviyo outage", "klaviyo down", "klaviyo on-call", "klaviyo emergency", "klaviyo broken".

2,266 Updated today
jeremylongshore
AI & Automation Featured

hubspot-incident-runbook

Execute HubSpot incident response with triage, mitigation, and postmortem. Use when responding to HubSpot API outages, investigating CRM errors, or running post-incident reviews for HubSpot integration failures. Trigger with phrases like "hubspot incident", "hubspot outage", "hubspot down", "hubspot on-call", "hubspot emergency", "hubspot broken".

2,266 Updated today
jeremylongshore
AI & Automation Featured

snowflake-incident-runbook

Execute Snowflake incident response with triage, rollback, and postmortem using real SQL diagnostics. Use when responding to Snowflake outages, investigating query failures, or running post-incident reviews for pipeline failures. Trigger with phrases like "snowflake incident", "snowflake outage", "snowflake down", "snowflake on-call", "snowflake emergency".

2,266 Updated today
jeremylongshore