coreweave-incident-runbook

Incident response runbook for CoreWeave GPU workload failures. Use when inference services are down, GPUs are unavailable, or you are responding to a production incident on CoreWeave. Trigger with phrases like "coreweave incident", "coreweave outage", "coreweave runbook", "coreweave service down".

AI & Automation 2,266 stars 315 forks Updated today MIT

Skill Content

# CoreWeave Incident Runbook

## Triage Steps

```bash
# 1. Check pod status
kubectl get pods -l app=inference -o wide

# 2. Check recent events
kubectl get events --sort-by=.lastTimestamp | tail -20

# 3. Check node status
kubectl get nodes -l gpu.nvidia.com/class -o wide

# 4. Check GPU health
kubectl exec -it $(kubectl get pod -l app=inference -o name | head -1) -- nvidia-smi
```

## Common Incidents

### Inference Service Down

1. Check pod status and events
2. If OOMKilled: reduce batch size or upgrade GPU
3. If ImagePullBackOff: check registry credentials
4. If Pending: check GPU quota and availability

### GPU Node Failure

1. Pods will be rescheduled automatically
2. If no capacity: scale down non-critical workloads
3. Contact CoreWeave support for extended outages

### Model Loading Failure

1. Check HuggingFace token secret exists
2. Verify model name spelling
3. Check PVC has sufficient storage
4. Review container logs for download errors

## Rollback

```bash
kubectl rollout undo deployment/inference
```

## Resources

- [CoreWeave Support](https://www.coreweave.com/support)
- [CoreWeave Status](https://status.coreweave.com)

## Next Steps

For data handling, see `coreweave-data-handling`.
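The "Inference Service Down" steps in the runbook above amount to a lookup from a pod's failure reason to a first action. A minimal sketch of that lookup follows. The helper names `suggest_action` and `triage_pods` are hypothetical and not part of the skill. The sketch assumes the `app=inference` label used in the triage commands and falls back to step 1 for any reason the runbook does not list.

```bash
#!/usr/bin/env bash

# Hypothetical helper: map a pod failure reason to the runbook's first action.
# Only the three reasons named in the runbook are handled explicitly.
suggest_action() {
  case "$1" in
    OOMKilled)        echo "reduce batch size or upgrade GPU" ;;
    ImagePullBackOff) echo "check registry credentials" ;;
    Pending)          echo "check GPU quota and availability" ;;
    *)                echo "check pod status and events" ;;
  esac
}

# Hypothetical helper: print "<pod> <reason> <suggested action>" per pod.
# Assumes pods carry the app=inference label, as in the triage commands.
triage_pods() {
  kubectl get pods -l app=inference \
    -o jsonpath='{range .items[*]}{.metadata.name}{"|"}{.status.phase}{"|"}{.status.containerStatuses[*].state.waiting.reason}{"|"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' |
  while IFS='|' read -r name phase waiting terminated; do
    # Prefer the last termination reason (catches OOMKilled behind a
    # CrashLoopBackOff), then the waiting reason, then the pod phase.
    # With several containers, only the first reported reason is used.
    reason="${terminated%% *}"
    [ -z "$reason" ] && reason="${waiting%% *}"
    [ -z "$reason" ] && reason="$phase"
    printf '%s\t%s\t%s\n' "$name" "$reason" "$(suggest_action "$reason")"
  done
}
```

After sourcing the file, running `triage_pods` against the affected namespace prints one line per inference pod. Healthy pods report their phase (for example `Running`) and get the default action, so treat the output as a starting point for triage rather than a diagnosis.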

Details

Author: jeremylongshore
Repository: jeremylongshore/claude-code-plugins-plus-skills
Created: 7 months ago
Last Updated: today
Language: Python
License: MIT
