clickhouse-incident-runbook

Featured

ClickHouse incident response — triage, diagnose, and remediate server issues using system tables, kill stuck queries, and execute recovery procedures. Use when ClickHouse is slow, unresponsive, or producing errors in production. Trigger: "clickhouse incident", "clickhouse outage", "clickhouse down", "clickhouse emergency", "clickhouse on-call", "clickhouse broken".

AI & Automation 2,266 stars 315 forks Updated today MIT

Install

View on GitHub

Quality Score: 99/100

Stars 20%
100
Recency 20%
100
Frontmatter 20%
70
Documentation 15%
100
Issue Health 10%
50
License 10%
100
Description 5%
100

Skill Content

# ClickHouse Incident Runbook ## Overview Step-by-step procedures for triaging and resolving ClickHouse incidents using built-in system tables and SQL commands. ## Severity Levels | Level | Definition | Response | Examples | |-------|------------|----------|----------| | P1 | ClickHouse unreachable / all queries failing | < 15 min | Server down, OOM, disk full | | P2 | Degraded performance / partial failures | < 1 hour | Slow queries, merge backlog | | P3 | Minor impact / non-critical errors | < 4 hours | Single table issue, warnings | | P4 | No user impact | Next business day | Monitoring gaps, optimization | ## Quick Triage (Run First) ```bash # 1. Is ClickHouse alive? curl -sf 'http://localhost:8123/ping' && echo "UP" || echo "DOWN" # 2. Can it answer a query? curl -sf 'http://localhost:8123/?query=SELECT+1' && echo "OK" || echo "QUERY FAILED" # 3. Check ClickHouse Cloud status curl -sf 'https://status.clickhouse.cloud' | head -5 ``` ```sql -- 4. Server health snapshot (run if server responds) SELECT version() AS version, formatReadableTimeDelta(uptime()) AS uptime, (SELECT count() FROM system.processes) AS running_queries, (SELECT value FROM system.metrics WHERE metric = 'MemoryTracking') AS memory_bytes, (SELECT count() FROM system.merges) AS active_merges; -- 5. Recent errors SELECT event_time, exception_code, exception, substring(query, 1, 200) AS q FROM system.query_log WHERE type = 'ExceptionWhileProcessi...

Details

Author
jeremylongshore
Repository
jeremylongshore/claude-code-plugins-plus-skills
Created
7 months ago
Last Updated
today
Language
Python
License
MIT

Integrates with

Similar Skills

Semantically similar based on skill content — not just same category

AI & Automation Featured

clickup-incident-runbook

Execute ClickUp API incident response: triage, diagnosis, mitigation, and postmortem for API failures and integration outages. Trigger: "clickup incident", "clickup outage", "clickup down", "clickup on-call", "clickup emergency", "clickup API broken".

2,266 Updated today
jeremylongshore
AI & Automation Featured

databricks-incident-runbook

Execute Databricks incident response procedures with triage, mitigation, and postmortem. Use when responding to Databricks-related outages, investigating job failures, or running post-incident reviews for pipeline failures. Trigger with phrases like "databricks incident", "databricks outage", "databricks down", "databricks on-call", "databricks emergency", "job failed".

2,266 Updated today
jeremylongshore
AI & Automation Featured

clickhouse-debug-bundle

Collect ClickHouse diagnostic data — system tables, query logs, merge status, and server metrics for support tickets and troubleshooting. Use when investigating persistent issues, preparing debug artifacts, or collecting evidence for ClickHouse support. Trigger: "clickhouse debug", "clickhouse diagnostics", "clickhouse support bundle", "collect clickhouse logs", "clickhouse system tables".

2,266 Updated today
jeremylongshore
AI & Automation Featured

hubspot-incident-runbook

Execute HubSpot incident response with triage, mitigation, and postmortem. Use when responding to HubSpot API outages, investigating CRM errors, or running post-incident reviews for HubSpot integration failures. Trigger with phrases like "hubspot incident", "hubspot outage", "hubspot down", "hubspot on-call", "hubspot emergency", "hubspot broken".

2,266 Updated today
jeremylongshore
AI & Automation Featured

fireflies-incident-runbook

Execute Fireflies.ai incident response with triage, remediation, and postmortem. Use when responding to Fireflies.ai API outages, auth failures, or webhook delivery problems. Trigger with phrases like "fireflies incident", "fireflies outage", "fireflies down", "fireflies on-call", "fireflies emergency", "fireflies broken".

2,266 Updated today
jeremylongshore