coreweave-incident-runbook


Incident response runbook for CoreWeave GPU workload failures. Use when inference services are down, GPUs are unavailable, or responding to production incidents on CoreWeave. Trigger with phrases like "coreweave incident", "coreweave outage", "coreweave runbook", "coreweave service down".

AI & Automation 2,266 stars 315 forks Updated today MIT


Quality Score: 97/100


Skill Content

# CoreWeave Incident Runbook

## Triage Steps

```bash
# 1. Check pod status
kubectl get pods -l app=inference -o wide

# 2. Check recent events
kubectl get events --sort-by=.lastTimestamp | tail -20

# 3. Check node status
kubectl get nodes -l gpu.nvidia.com/class -o wide

# 4. Check GPU health
kubectl exec -it $(kubectl get pod -l app=inference -o name | head -1) -- nvidia-smi
```

## Common Incidents

### Inference Service Down

1. Check pod status and events
2. If OOMKilled: reduce batch size or upgrade GPU
3. If ImagePullBackOff: check registry credentials
4. If Pending: check GPU quota and availability

### GPU Node Failure

1. Pods will be rescheduled automatically
2. If no capacity: scale down non-critical workloads
3. Contact CoreWeave support for extended outages

### Model Loading Failure

1. Check HuggingFace token secret exists
2. Verify model name spelling
3. Check PVC has sufficient storage
4. Review container logs for download errors

## Rollback

```bash
kubectl rollout undo deployment/inference
```

## Resources

- [CoreWeave Support](https://www.coreweave.com/support)
- [CoreWeave Status](https://status.coreweave.com)

## Next Steps

For data handling, see `coreweave-data-handling`.
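
The "Inference Service Down" checklist in the runbook is essentially a lookup from pod status to a first action, so it can be sketched as a small shell helper. This is a minimal illustration, not part of the original skill: the function name `remediation_for` is hypothetical, and the commented-out wiring assumes the `app=inference` label from the triage steps and that the third column of `kubectl get pods --no-headers` is the pod STATUS.

```bash
#!/bin/sh
# Hypothetical helper (an assumption, not part of the original runbook):
# map a pod STATUS value to the first action listed under
# "Inference Service Down".
remediation_for() {
  case "$1" in
    OOMKilled)        echo "reduce batch size or upgrade GPU" ;;
    ImagePullBackOff) echo "check registry credentials" ;;
    Pending)          echo "check GPU quota and availability" ;;
    # Any other status falls back to step 1 of the checklist.
    *)                echo "check pod status and events" ;;
  esac
}

# Example wiring, left commented out because it needs cluster access.
# It prints each inference pod, its status, and the suggested action:
#
# kubectl get pods -l app=inference --no-headers |
#   while read -r name ready status rest; do
#     printf '%s\t%s\t%s\n' "$name" "$status" "$(remediation_for "$status")"
#   done
```

Statuses the checklist does not name (for example `CrashLoopBackOff`) deliberately fall through to the generic first step, so the helper never suggests an action the runbook does not list.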

Details

Author: jeremylongshore
Repository: jeremylongshore/claude-code-plugins-plus-skills
Created: 7 months ago
Last Updated: today
Language: Python
License: MIT

Integrates with

Anthropic (AI) · Hugging Face (AI) · Kubernetes (Infrastructure)

Similar Skills

Semantically similar based on skill content — not just same category

AI & Automation Solid

coreweave-prod-checklist

Production readiness checklist for CoreWeave GPU workloads. Use when launching inference services, preparing GPU training for production, or validating deployment configurations. Trigger with phrases like "coreweave production", "coreweave go-live", "coreweave checklist", "coreweave launch".

2,266 Updated today

jeremylongshore

AI & Automation Solid

coreweave-common-errors

Diagnose and fix CoreWeave GPU scheduling, pod, and networking errors. Use when pods are stuck Pending, GPUs are not allocated, or experiencing CUDA and NCCL errors. Trigger with phrases like "coreweave error", "coreweave pod pending", "coreweave gpu not found", "coreweave debug", "fix coreweave".

2,266 Updated today

jeremylongshore

AI & Automation Featured

coreweave-observability

Set up GPU monitoring and observability for CoreWeave workloads. Use when implementing GPU metrics dashboards, configuring alerts, or tracking inference latency and throughput. Trigger with phrases like "coreweave monitoring", "coreweave observability", "coreweave gpu metrics", "coreweave grafana".

2,266 Updated today

jeremylongshore

AI & Automation Featured

coreweave-deploy-integration

Deploy inference services on CoreWeave with Helm charts and Kustomize. Use when deploying multi-model inference, managing GPU deployments at scale, or templating CoreWeave manifests. Trigger with phrases like "deploy coreweave", "coreweave helm", "coreweave kustomize", "coreweave deployment patterns".

2,266 Updated today

jeremylongshore

DevOps & Infrastructure Featured

coreweave-ci-integration

Integrate CoreWeave deployments into CI/CD pipelines with GitHub Actions. Use when automating container builds, deploying inference services from CI, or validating GPU manifests in pull requests. Trigger with phrases like "coreweave CI", "coreweave github actions", "coreweave pipeline", "automate coreweave deploy".

2,266 Updated today

jeremylongshore