coreweave-core-workflow-b

Featured

Run distributed GPU training jobs on CoreWeave with multi-node PyTorch. Use when training models across multiple GPUs, setting up distributed training, or running fine-tuning jobs on CoreWeave H100 clusters. Trigger with phrases like "coreweave training", "coreweave multi-gpu", "distributed training coreweave", "fine-tune on coreweave".

AI & Automation 2,266 stars 315 forks Updated today MIT

Install

View on GitHub

Quality Score: 99/100

Stars 20%
100
Recency 20%
100
Frontmatter 20%
70
Documentation 15%
100
Issue Health 10%
50
License 10%
100
Description 5%
100

Skill Content

# CoreWeave Core Workflow: GPU Training ## Overview Run distributed GPU training on CoreWeave: single-node multi-GPU and multi-node training with PyTorch DDP, Slurm-on-Kubernetes, and shared storage. ## Prerequisites - CKS cluster with multi-GPU node pools (8xA100 or 8xH100) - Shared storage (CoreWeave PVC or NFS) - Training container with PyTorch and NCCL ## Instructions ### Step 1: Single-Node Multi-GPU Training ```yaml # training-job.yaml apiVersion: batch/v1 kind: Job metadata: name: llm-finetune spec: template: spec: restartPolicy: Never containers: - name: trainer image: ghcr.io/myorg/trainer:latest command: ["torchrun"] args: - "--nproc_per_node=8" - "train.py" - "--model_name=meta-llama/Llama-3.1-8B" - "--batch_size=4" - "--epochs=3" resources: limits: nvidia.com/gpu: "8" memory: 512Gi cpu: "64" volumeMounts: - name: data mountPath: /data - name: checkpoints mountPath: /checkpoints volumes: - name: data persistentVolumeClaim: claimName: training-data - name: checkpoints persistentVolumeClaim: claimName: model-checkpoints affinity: nodeAffinity: requiredDuringSchedulingIgnoredDuringExecution: nodeSelectorTerms: ...

Details

Author
jeremylongshore
Repository
jeremylongshore/claude-code-plugins-plus-skills
Created
7 months ago
Last Updated
today
Language
Python
License
MIT

Integrates with

Similar Skills

Semantically similar based on skill content — not just same category

AI & Automation Featured

coreweave-hello-world

Deploy a GPU workload on CoreWeave with kubectl. Use when running your first GPU job, testing inference, or verifying CoreWeave cluster access. Trigger with phrases like "coreweave hello world", "coreweave first deploy", "coreweave gpu test", "run on coreweave".

2,266 Updated today
jeremylongshore
AI & Automation Featured

coreweave-deploy-integration

Deploy inference services on CoreWeave with Helm charts and Kustomize. Use when deploying multi-model inference, managing GPU deployments at scale, or templating CoreWeave manifests. Trigger with phrases like "deploy coreweave", "coreweave helm", "coreweave kustomize", "coreweave deployment patterns".

2,266 Updated today
jeremylongshore
AI & Automation Solid

coreweave-performance-tuning

Optimize CoreWeave GPU inference latency and throughput. Use when reducing inference latency, maximizing GPU utilization, or tuning batch sizes and concurrency. Trigger with phrases like "coreweave performance", "coreweave latency", "coreweave throughput", "optimize coreweave inference".

2,266 Updated today
jeremylongshore
AI & Automation Solid

vastai-core-workflow-b

Execute Vast.ai secondary workflow: multi-instance orchestration, spot recovery, and cost optimization. Use when running distributed training, handling spot preemption, or optimizing GPU spend across multiple instances. Trigger with phrases like "vastai distributed training", "vastai spot recovery", "vastai multi-gpu", "vastai cost optimization".

2,266 Updated today
jeremylongshore
AI & Automation Solid

coreweave-cost-tuning

Optimize CoreWeave GPU cloud costs with right-sizing and scheduling. Use when reducing GPU spend, selecting cost-effective instances, or implementing scale-to-zero for dev workloads. Trigger with phrases like "coreweave cost", "coreweave pricing", "reduce coreweave spend", "coreweave budget".

2,266 Updated today
jeremylongshore