coreweave-observability

Featured

Set up GPU monitoring and observability for CoreWeave workloads. Use when implementing GPU metrics dashboards, configuring alerts, or tracking inference latency and throughput. Trigger with phrases like "coreweave monitoring", "coreweave observability", "coreweave gpu metrics", "coreweave grafana".

AI & Automation 2,266 stars 315 forks Updated today MIT

Skill Content

# CoreWeave Observability

## Overview

CoreWeave runs GPU-intensive workloads on Kubernetes, where hardware failures, memory exhaustion, and underutilization directly affect cost and reliability. Observability must cover DCGM GPU metrics, Kubernetes pod health, inference latency, and job completion rates. Proactive monitoring prevents wasted spend on idle GPUs and catches OOM conditions before they cascade.

## Key Metrics

| Metric | Type | Target | Alert Threshold |
|--------|------|--------|-----------------|
| GPU utilization | Gauge | > 60% | < 20% for 30m |
| GPU memory usage | Gauge | < 85% | > 95% for 5m |
| Inference latency p99 | Histogram | < 200 ms | > 500 ms |
| Job completion rate | Counter | > 99% | < 95% per hour |
| Pod restart count | Counter | 0 | > 3 in 15m |
| Node GPU temperature | Gauge | < 80 °C | > 85 °C for 10m |

## Instrumentation

```typescript
// Assumes an application-provided `metrics` client exposing record() and increment().
async function trackInference<T>(model: string, fn: () => Promise<T>): Promise<T> {
  const start = Date.now();
  try {
    const result = await fn();
    // Record latency in milliseconds for successful calls only.
    metrics.record('coreweave.inference.latency', Date.now() - start, { model, status: 'ok' });
    metrics.increment('coreweave.inference.completed', { model });
    return result;
  } catch (err) {
    // `err` is `unknown` under strict TypeScript, so read `code` defensively.
    const code = (err as { code?: string } | null)?.code ?? 'unknown';
    metrics.increment('coreweave.inference.errors', { model, error: code });
    throw err;
  }
}
```

## Health Check Dashboard

```typescript
async function coreweaveHealth(): Promise<Record<string, string>> {
  const gpu = await queryPrometheus('avg(DCGM_FI_DEV_GPU_UTIL)');
  ...
```
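The alert thresholds in the Key Metrics table can be expressed as data and checked against the latest sample of each metric. The block below is a minimal sketch of that idea, not part of the skill itself: the `AlertRule` shape, the metric keys, and the `evaluateRules` function are all hypothetical names chosen for illustration. The "for 30m" style durations from the table are deliberately left out, since a single sample cannot express them; in practice that windowing would probably live in the alerting system.

```typescript
// Hypothetical rule shape; names are illustrative and not taken from the skill.
type Comparison = 'above' | 'below';

interface AlertRule {
  metric: string;         // key used to look up the latest sample
  comparison: Comparison; // alert when the value is above or below the threshold
  threshold: number;
  unit: string;
}

// Thresholds copied from the Key Metrics table (durations omitted).
const RULES: AlertRule[] = [
  { metric: 'gpu_utilization', comparison: 'below', threshold: 20, unit: '%' },
  { metric: 'gpu_memory_usage', comparison: 'above', threshold: 95, unit: '%' },
  { metric: 'inference_latency_p99', comparison: 'above', threshold: 500, unit: 'ms' },
  { metric: 'job_completion_rate', comparison: 'below', threshold: 95, unit: '%' },
  { metric: 'pod_restart_count', comparison: 'above', threshold: 3, unit: '' },
  { metric: 'gpu_temperature', comparison: 'above', threshold: 85, unit: 'C' },
];

// Classify each rule as 'alert', 'ok', or 'unknown' (no sample available).
function evaluateRules(
  samples: Record<string, number>,
  rules: AlertRule[] = RULES,
): Record<string, string> {
  const status: Record<string, string> = {};
  for (const rule of rules) {
    const value = samples[rule.metric];
    if (value === undefined) {
      status[rule.metric] = 'unknown';
      continue;
    }
    // Comparisons are strict, matching the ">" and "<" used in the table.
    const firing =
      rule.comparison === 'above' ? value > rule.threshold : value < rule.threshold;
    status[rule.metric] = firing ? 'alert' : 'ok';
  }
  return status;
}
```

Keeping the rules as plain data means the same table can drive both a dashboard status map and generated alert definitions. A call such as `evaluateRules({ gpu_utilization: 12, gpu_memory_usage: 90 })` would flag GPU utilization, leave memory usage as healthy, and mark the metrics without samples as `unknown`.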

Details

Author
jeremylongshore
Repository
jeremylongshore/claude-code-plugins-plus-skills
Created
7 months ago
Last Updated
today
Language
Python
License
MIT

Similar Skills

Semantically similar based on skill content — not just same category

AI & Automation Featured

coreweave-deploy-integration

Deploy inference services on CoreWeave with Helm charts and Kustomize. Use when deploying multi-model inference, managing GPU deployments at scale, or templating CoreWeave manifests. Trigger with phrases like "deploy coreweave", "coreweave helm", "coreweave kustomize", "coreweave deployment patterns".

2,266 Updated today
jeremylongshore
AI & Automation Featured

coreweave-webhooks-events

Monitor CoreWeave cluster events and GPU workload status. Use when tracking pod lifecycle events, monitoring GPU utilization, or alerting on inference service health changes. Trigger with phrases like "coreweave events", "coreweave monitoring", "coreweave pod alerts", "coreweave gpu monitoring".

2,266 Updated today
jeremylongshore
AI & Automation Solid

coreweave-prod-checklist

Production readiness checklist for CoreWeave GPU workloads. Use when launching inference services, preparing GPU training for production, or validating deployment configurations. Trigger with phrases like "coreweave production", "coreweave go-live", "coreweave checklist", "coreweave launch".

2,266 Updated today
jeremylongshore
AI & Automation Solid

coreweave-performance-tuning

Optimize CoreWeave GPU inference latency and throughput. Use when reducing inference latency, maximizing GPU utilization, or tuning batch sizes and concurrency. Trigger with phrases like "coreweave performance", "coreweave latency", "coreweave throughput", "optimize coreweave inference".

2,266 Updated today
jeremylongshore
AI & Automation Solid

coreweave-common-errors

Diagnose and fix CoreWeave GPU scheduling, pod, and networking errors. Use when pods are stuck Pending, GPUs are not allocated, or experiencing CUDA and NCCL errors. Trigger with phrases like "coreweave error", "coreweave pod pending", "coreweave gpu not found", "coreweave debug", "fix coreweave".

2,266 Updated today
jeremylongshore