coreweave-observability

Featured

Set up GPU monitoring and observability for CoreWeave workloads. Use when implementing GPU metrics dashboards, configuring alerts, or tracking inference latency and throughput. Trigger with phrases like "coreweave monitoring", "coreweave observability", "coreweave gpu metrics", "coreweave grafana".

AI & Automation 2,266 stars 315 forks Updated today MIT

Install

View on GitHub

Quality Score: 99/100

Stars 20%

100

Recency 20%

100

Frontmatter 20%

Documentation 15%

100

Issue Health 10%

License 10%

100

Description 5%

100

Skill Content

# CoreWeave Observability ## Overview CoreWeave runs GPU-intensive workloads on Kubernetes where hardware failures, memory exhaustion, and underutilization directly impact cost and reliability. Observability must cover DCGM GPU metrics, Kubernetes pod health, inference latency, and job completion rates. Proactive monitoring prevents wasted spend on idle GPUs and catches OOM conditions before they cascade. ## Key Metrics | Metric | Type | Target | Alert Threshold | |--------|------|--------|-----------------| | GPU utilization | Gauge | > 60% | < 20% for 30m | | GPU memory usage | Gauge | < 85% | > 95% for 5m | | Inference latency p99 | Histogram | < 200ms | > 500ms | | Job completion rate | Counter | > 99% | < 95% per hour | | Pod restart count | Counter | 0 | > 3 in 15m | | Node GPU temperature | Gauge | < 80C | > 85C for 10m | ## Instrumentation ```typescript async function trackInference(model: string, fn: () => Promise<any>) { const start = Date.now(); try { const result = await fn(); metrics.record('coreweave.inference.latency', Date.now() - start, { model, status: 'ok' }); metrics.increment('coreweave.inference.completed', { model }); return result; } catch (err) { metrics.increment('coreweave.inference.errors', { model, error: err.code }); throw err; } } ``` ## Health Check Dashboard ```typescript async function coreweaveHealth(): Promise<Record<string, string>> { const gpu = await queryPrometheus('avg(DCGM_FI_DEV_GPU_UTIL)'); ...

Details

Author: jeremylongshore
Repository: jeremylongshore/claude-code-plugins-plus-skills
Created: 7 months ago
Last Updated: today
Language: Python
License: MIT

coreweave-common-errors

Diagnose and fix CoreWeave GPU scheduling, pod, and networking errors. Use when pods are stuck Pending, GPUs are not allocated, or experiencing CUDA and NCCL errors. Trigger with phrases like "coreweave error", "coreweave pod pending", "coreweave gpu not found", "coreweave debug", "fix coreweave".

2,266 Updated today

jeremylongshore