coreweave-core-workflow-b


Run distributed GPU training jobs on CoreWeave with multi-node PyTorch. Use when training models across multiple GPUs, setting up distributed training, or running fine-tuning jobs on CoreWeave H100 clusters. Trigger with phrases like "coreweave training", "coreweave multi-gpu", "distributed training coreweave", "fine-tune on coreweave".

AI & Automation 2,266 stars 315 forks Updated today MIT



Skill Content

# CoreWeave Core Workflow: GPU Training

## Overview

Run distributed GPU training on CoreWeave: single-node multi-GPU and multi-node training with PyTorch DDP, Slurm-on-Kubernetes, and shared storage.

## Prerequisites

- CKS cluster with multi-GPU node pools (8xA100 or 8xH100)
- Shared storage (CoreWeave PVC or NFS)
- Training container with PyTorch and NCCL

## Instructions

### Step 1: Single-Node Multi-GPU Training

```yaml
# training-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: llm-finetune
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: trainer
          image: ghcr.io/myorg/trainer:latest
          command: ["torchrun"]
          args:
            - "--nproc_per_node=8"
            - "train.py"
            - "--model_name=meta-llama/Llama-3.1-8B"
            - "--batch_size=4"
            - "--epochs=3"
          resources:
            limits:
              nvidia.com/gpu: "8"
              memory: 512Gi
              cpu: "64"
          volumeMounts:
            - name: data
              mountPath: /data
            - name: checkpoints
              mountPath: /checkpoints
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: training-data
        - name: checkpoints
          persistentVolumeClaim:
            claimName: model-checkpoints
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              ...
```
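
The manifest launches `train.py` through `torchrun`, but the script itself is not shown here. As a rough sketch of what its entry point might do first, the snippet below reads the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables that `torchrun` sets for each worker process and derives the effective global batch size. The names `DistEnv`, `read_dist_env`, and `global_batch_size` are hypothetical helpers, not part of the skill, and treating `--batch_size` as a per-process value is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class DistEnv:
    """Process layout as reported by torchrun (hypothetical helper type)."""
    rank: int        # global index of this process across all nodes
    local_rank: int  # index of this process on its node; usually selects the GPU
    world_size: int  # total number of worker processes


def read_dist_env(environ: Optional[Mapping[str, str]] = None) -> DistEnv:
    """Read torchrun's environment variables, falling back to a single process."""
    env = os.environ if environ is None else environ
    return DistEnv(
        rank=int(env.get("RANK", "0")),
        local_rank=int(env.get("LOCAL_RANK", "0")),
        world_size=int(env.get("WORLD_SIZE", "1")),
    )


def global_batch_size(per_process_batch: int, world_size: int,
                      grad_accum_steps: int = 1) -> int:
    """Samples consumed per optimizer step across all processes.

    Assumes each process draws its own batch of `per_process_batch` samples.
    """
    return per_process_batch * world_size * grad_accum_steps


if __name__ == "__main__":
    dist = read_dist_env()
    if dist.rank == 0:  # log once rather than from every worker
        print(f"world_size={dist.world_size}, "
              f"global batch={global_batch_size(4, dist.world_size)}")
```

With the manifest's `--nproc_per_node=8` and `--batch_size=4`, and under the per-process assumption above, each optimizer step would see 32 samples. Learning-rate and warmup settings tuned on a single GPU may need revisiting at that scale.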

Details

Author: jeremylongshore
Repository: jeremylongshore/claude-code-plugins-plus-skills
Created: 7 months ago
Last Updated: today
Language: Python
License: MIT

Integrates with

Anthropic · AI Kubernetes · Infrastructure
