coreweave-prod-checklist

Solid

Production readiness checklist for CoreWeave GPU workloads. Use when launching inference services, preparing GPU training for production, or validating deployment configurations. Trigger with phrases like "coreweave production", "coreweave go-live", "coreweave checklist", "coreweave launch".

AI & Automation 2,266 stars 315 forks Updated today MIT

Install

View on GitHub

Quality Score: 97/100

Stars 20%
100
Recency 20%
100
Frontmatter 20%
70
Documentation 15%
55
Issue Health 10%
50
License 10%
100
Description 5%
100

Skill Content

# CoreWeave Production Checklist ## Inference Services - [ ] GPU type and count validated for model size - [ ] Autoscaling configured (KServe or HPA) - [ ] Health and readiness probes set - [ ] Resource requests AND limits specified - [ ] Node affinity targeting correct GPU class - [ ] `minReplicas >= 1` for production (no cold starts) ## Storage - [ ] Model weights in PVC (not downloaded at startup) - [ ] Checkpoints saved to persistent storage - [ ] Storage class appropriate (SSD for inference, HDD for archival) ## Security - [ ] Secrets for model tokens and registry access - [ ] Network policies applied - [ ] Container images from trusted registries ## Monitoring - [ ] GPU utilization metrics collected - [ ] Inference latency and throughput tracked - [ ] Alert on pod restarts and OOM events - [ ] Log aggregation configured ## Rollback ```bash kubectl rollout undo deployment/my-inference kubectl rollout status deployment/my-inference ``` ## Resources - [CoreWeave CKS](https://docs.coreweave.com/docs/products/cks) ## Next Steps For upgrades, see `coreweave-upgrade-migration`.

Details

Author
jeremylongshore
Repository
jeremylongshore/claude-code-plugins-plus-skills
Created
7 months ago
Last Updated
today
Language
Python
License
MIT

Integrates with

Similar Skills

Semantically similar based on skill content — not just same category

AI & Automation Solid

coreweave-incident-runbook

Incident response runbook for CoreWeave GPU workload failures. Use when inference services are down, GPUs are unavailable, or responding to production incidents on CoreWeave. Trigger with phrases like "coreweave incident", "coreweave outage", "coreweave runbook", "coreweave service down".

2,266 Updated today
jeremylongshore
AI & Automation Featured

coreweave-deploy-integration

Deploy inference services on CoreWeave with Helm charts and Kustomize. Use when deploying multi-model inference, managing GPU deployments at scale, or templating CoreWeave manifests. Trigger with phrases like "deploy coreweave", "coreweave helm", "coreweave kustomize", "coreweave deployment patterns".

2,266 Updated today
jeremylongshore
AI & Automation Featured

coreweave-observability

Set up GPU monitoring and observability for CoreWeave workloads. Use when implementing GPU metrics dashboards, configuring alerts, or tracking inference latency and throughput. Trigger with phrases like "coreweave monitoring", "coreweave observability", "coreweave gpu metrics", "coreweave grafana".

2,266 Updated today
jeremylongshore
AI & Automation Featured

coreweave-local-dev-loop

Set up local development workflow for CoreWeave GPU deployments. Use when building containers locally, testing YAML manifests, or iterating on model serving configurations before deploying. Trigger with phrases like "coreweave dev setup", "coreweave local testing", "develop for coreweave", "coreweave container build".

2,266 Updated today
jeremylongshore
AI & Automation Solid

coreweave-common-errors

Diagnose and fix CoreWeave GPU scheduling, pod, and networking errors. Use when pods are stuck Pending, GPUs are not allocated, or experiencing CUDA and NCCL errors. Trigger with phrases like "coreweave error", "coreweave pod pending", "coreweave gpu not found", "coreweave debug", "fix coreweave".

2,266 Updated today
jeremylongshore