databricks-observability

Featured

Set up comprehensive observability for Databricks with metrics, traces, and alerts. Use when implementing monitoring for Databricks jobs, setting up dashboards, or configuring alerting for pipeline health. Trigger with phrases like "databricks monitoring", "databricks metrics", "databricks observability", "monitor databricks", "databricks alerts", "databricks logging".

AI & Automation | 2,266 stars | 315 forks | Updated today | MIT

Quality Score: 99/100

Stars (20%): 100
Recency (20%): 100
Frontmatter (20%): 70
Documentation (15%): 100
Issue Health (10%): 50
License (10%): 100
Description (5%): 100

Skill Content

# Databricks Observability

## Overview

Monitor Databricks jobs, clusters, SQL warehouses, and costs using system tables in the `system` catalog. System tables provide queryable observability data: `system.lakeflow` (job runs), `system.billing` (costs), `system.query` (SQL history), `system.access` (audit logs), and `system.compute` (cluster metrics). The data is refreshed throughout the day, not in real time.

## Prerequisites

- Databricks Premium or Enterprise with Unity Catalog enabled
- Access to the `system.billing`, `system.lakeflow`, `system.query`, and `system.access` schemas
- A SQL warehouse for running monitoring queries

## Instructions

### Step 1: Job Health Monitoring

```sql
-- Job success/failure over last 24 hours
-- Columns per the job_run_timeline system table schema.
-- Note: runs longer than an hour are split across multiple rows,
-- so counts and durations here are per row.
SELECT
  COUNT(CASE WHEN result_state = 'SUCCEEDED' THEN 1 END) AS succeeded,
  COUNT(CASE WHEN result_state = 'FAILED' THEN 1 END) AS failed,
  COUNT(CASE WHEN result_state = 'TIMED_OUT' THEN 1 END) AS timed_out,
  ROUND(100.0 * COUNT(CASE WHEN result_state = 'SUCCEEDED' THEN 1 END) / COUNT(*), 1) AS success_rate_pct,
  ROUND(AVG(TIMESTAMPDIFF(MINUTE, period_start_time, period_end_time)), 1) AS avg_duration_min
FROM system.lakeflow.job_run_timeline
WHERE period_start_time > current_timestamp() - INTERVAL 24 HOURS;

-- Failed jobs with error details
SELECT
  job_id,
  run_name,
  result_state,
  period_start_time,
  period_end_time,
  TIMESTAMPDIFF(MINUTE, period_start_time, period_end_time) AS duration_min,
  termination_code  -- why the run terminated
FROM system.lakeflow.job_run_timeline
WHERE result_state = 'FAILED'
  AND period_start_time >...
```
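
The summary row returned by the 24-hour job health query can feed a simple alert check. The following is a minimal Python sketch and is not part of the original skill: `JobHealthSummary`, `evaluate_job_health`, and both default thresholds are hypothetical names and values chosen for illustration, and the row is assumed to have already been fetched from a SQL warehouse by whatever client is in use.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class JobHealthSummary:
    """Subset of the columns returned by the 24-hour job health query."""
    succeeded: int
    failed: int
    timed_out: int
    success_rate_pct: float


def evaluate_job_health(
    summary: JobHealthSummary,
    min_success_rate_pct: float = 95.0,  # hypothetical threshold
    max_timed_out: int = 0,              # hypothetical threshold
) -> List[str]:
    """Return human-readable alert messages; an empty list means healthy."""
    alerts: List[str] = []
    if summary.success_rate_pct < min_success_rate_pct:
        alerts.append(
            f"Success rate {summary.success_rate_pct}% is below "
            f"{min_success_rate_pct}% ({summary.failed} failed)"
        )
    if summary.timed_out > max_timed_out:
        alerts.append(f"{summary.timed_out} run(s) timed out")
    return alerts
```

Calling `evaluate_job_health(JobHealthSummary(45, 4, 1, 90.0))` yields one message for the low success rate and one for the timed-out run, while a summary that meets both thresholds yields an empty list. Keeping the thresholds as parameters lets the same check be reused per workspace or per job group.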

Details

Author: jeremylongshore
Repository: jeremylongshore/claude-code-plugins-plus-skills
Created: 7 months ago
Last Updated: today
Language: Python
License: MIT

Similar Skills

Semantically similar based on skill content — not just same category

AI & Automation Solid

snowflake-observability

Set up Snowflake observability using ACCOUNT_USAGE views, alerts, and external monitoring. Use when implementing Snowflake monitoring dashboards, setting up query performance tracking, or configuring alerting for warehouse and pipeline health. Trigger with phrases like "snowflake monitoring", "snowflake metrics", "snowflake observability", "snowflake dashboard", "snowflake alerts".

2,266 Updated today
jeremylongshore

AI & Automation Featured

databricks-cost-tuning

Optimize Databricks costs with cluster policies, spot instances, and monitoring. Use when reducing cloud spend, implementing cost controls, or analyzing Databricks usage costs. Trigger with phrases like "databricks cost", "reduce databricks spend", "databricks billing", "databricks cost optimization", "cluster cost".

2,266 Updated today
jeremylongshore

AI & Automation Featured

databricks-webhooks-events

Configure Databricks job notifications, webhooks, and event handling. Use when setting up Slack/Teams notifications, configuring alerts, or integrating Databricks events with external systems. Trigger with phrases like "databricks webhook", "databricks notifications", "databricks alerts", "job failure notification", "databricks slack".

2,266 Updated today
jeremylongshore

AI & Automation Featured

clickhouse-observability

Monitor ClickHouse with Prometheus metrics, Grafana dashboards, system table queries, and alerting for query performance, merge health, and resource usage. Use when setting up ClickHouse monitoring, building Grafana dashboards, or configuring alerts for production ClickHouse deployments. Trigger: "clickhouse monitoring", "clickhouse metrics", "clickhouse Grafana", "clickhouse observability", "monitor clickhouse", "clickhouse Prometheus".

2,266 Updated today
jeremylongshore

Data & Documents Listed

databricks-core

Databricks CLI operations: auth, profiles, data exploration, and bundles. Contains up-to-date guidelines for Databricks-related CLI tasks.

0 Updated 2 days ago
pgoell