coreweave-common-errors

Solid

Diagnose and fix CoreWeave GPU scheduling, pod, and networking errors. Use when pods are stuck Pending, GPUs are not allocated, or workloads hit CUDA or NCCL errors. Trigger with phrases like "coreweave error", "coreweave pod pending", "coreweave gpu not found", "coreweave debug", "fix coreweave".

AI & Automation 2,266 stars 315 forks Updated today MIT



Skill Content

# CoreWeave Common Errors

## Error Reference

### 1. Pod Stuck Pending -- No GPU Available

```bash
kubectl describe pod <pod-name> | grep -A5 Events
# "0/N nodes are available: insufficient nvidia.com/gpu"
```

**Fix**: Check GPU availability: `kubectl get nodes -l gpu.nvidia.com/class=A100_PCIE_80GB`. Try a different GPU type or region.

### 2. CUDA Out of Memory

```
torch.cuda.OutOfMemoryError: CUDA out of memory
```

**Fix**: Reduce batch size, enable gradient checkpointing, or use a larger GPU (A100-80GB instead of 40GB).

### 3. Image Pull BackOff

**Fix**: Create an imagePullSecret:

```bash
kubectl create secret docker-registry regcred \
  --docker-server=ghcr.io \
  --docker-username=$GH_USER \
  --docker-password=$GH_TOKEN
```

### 4. NCCL Timeout (Multi-GPU)

```
NCCL error: unhandled system error
```

**Fix**: Ensure all GPUs are on the same node (NVLink). For multi-node, use InfiniBand-connected nodes.

### 5. PVC Not Mounting

**Fix**: Check storage class availability: `kubectl get sc`. Use CoreWeave storage classes like `shared-hdd-ord1` or `shared-ssd-ord1`.

### 6. Node Affinity Mismatch

**Fix**: List valid GPU class labels:

```bash
kubectl get nodes -o json | jq -r '.items[].metadata.labels["gpu.nvidia.com/class"]' | sort -u
```

### 7. Service Not Reachable

**Fix**: Check Service and Endpoints:

```bash
kubectl get svc,endpoints <service-name>
```

## Resources

- [CoreWeave Documentation](https://docs.coreweave.com)
- [GPU Instance Types](https://docs.coreweave.co...
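
The first four entries in the error reference are each identified by a characteristic message, so a first triage pass can be scripted. The sketch below is a hypothetical helper, not part of the original skill: the function name `classify_coreweave_error` is made up for illustration, and the match strings are assumptions based on the messages quoted in the reference plus the standard Kubernetes `ImagePullBackOff` / `ErrImagePull` status names. It only maps one line of text to an entry title; it does not call `kubectl` itself.

```bash
#!/usr/bin/env bash
# Hypothetical triage helper (illustrative only, not part of the skill).
# Maps a single line of `kubectl describe` or pod-log output to the
# matching entry title in the error reference, or prints "unclassified".
classify_coreweave_error() {
  local line
  # Lowercase the input so matching is case-insensitive.
  line=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$line" in
    *"insufficient nvidia.com/gpu"*)
      echo "1. Pod Stuck Pending -- No GPU Available" ;;
    *"cuda out of memory"*)
      echo "2. CUDA Out of Memory" ;;
    *"imagepullbackoff"*|*"errimagepull"*)
      echo "3. Image Pull BackOff" ;;
    *"nccl error"*)
      echo "4. NCCL Timeout (Multi-GPU)" ;;
    *)
      echo "unclassified" ;;
  esac
}
```

One possible way to use it is to feed it the output of the describe command line by line, for example `kubectl describe pod <pod-name> | while IFS= read -r l; do classify_coreweave_error "$l"; done | grep -v unclassified | sort -u`. Entries 5 to 7 have no single signature message in the reference, so the helper leaves them to the manual checks listed under each entry.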

Details

Author
jeremylongshore
Repository
jeremylongshore/claude-code-plugins-plus-skills
Created
7 months ago
Last Updated
today
Language
Python
License
MIT
