vastai-core-workflow-b

Solid

Execute Vast.ai secondary workflow: multi-instance orchestration, spot recovery, and cost optimization. Use when running distributed training, handling spot preemption, or optimizing GPU spend across multiple instances. Trigger with phrases like "vastai distributed training", "vastai spot recovery", "vastai multi-gpu", "vastai cost optimization".

AI & Automation 2,266 stars 315 forks Updated today MIT


Quality Score: 99/100

Stars (20%): 100
Recency (20%): 100
Frontmatter (20%): 70
Documentation (15%): 100
Issue Health (10%): 50
License (10%): 100
Description (5%): 100

Skill Content

# Vast.ai Core Workflow B: Multi-Instance & Cost Optimization

## Overview

Secondary workflow for Vast.ai: orchestrate multiple GPU instances for distributed training, implement automatic spot interruption recovery with checkpoint-based resume, and analyze spending to reduce per-job cost.

## Prerequisites

- Completed `vastai-core-workflow-a`
- Understanding of distributed training (PyTorch DDP, DeepSpeed)
- Checkpoint-based training pipeline

## Instructions

### Step 1: Multi-Instance Provisioning

```python
import subprocess, json, time
from concurrent.futures import ThreadPoolExecutor

def provision_cluster(num_nodes, gpu_name="A100", min_vram=80, image=""):
    """Provision multiple GPU instances for distributed training."""
    # Search for matching offers
    query = (f"num_gpus=1 gpu_name={gpu_name} gpu_ram>={min_vram} "
             f"reliability>0.98 inet_down>500 rentable=true")
    result = subprocess.run(
        ["vastai", "search", "offers", query, "--order", "dph_total",
         "--raw", "--limit", str(num_nodes * 3)],
        capture_output=True, text=True, check=True,
    )
    offers = json.loads(result.stdout)
    if len(offers) < num_nodes:
        raise RuntimeError(f"Only {len(offers)} offers, need {num_nodes}")

    # Provision nodes in parallel
    instances = []
    for i, offer in enumerate(offers[:num_nodes]):
        inst_id = provision_single(offer["id"], image, rank=i)
        instances.append({"id": inst_id, "rank": i, "offer": offer})
    # ...
```
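Step 1 calls `provision_single`, which does not appear in the excerpt shown here. The following is a minimal sketch of what such a helper might look like, not the skill author's implementation. It assumes that `vastai create instance` with `--raw` prints JSON containing `success` and `new_contract` (the new instance ID). The `disk_gb` default, the `ddp-rank-N` label format, and the injectable `run` parameter are hypothetical choices made for illustration and testing.

```python
import json
import subprocess


def provision_single(offer_id, image, rank, disk_gb=50, run=subprocess.run):
    """Hypothetical helper: rent one offer and return the new instance ID.

    `run` defaults to subprocess.run and can be swapped for a stub in tests.
    """
    cmd = [
        "vastai", "create", "instance", str(offer_id),
        "--image", image,
        "--disk", str(disk_gb),
        # Label the node with its rank so it can be identified later.
        "--label", f"ddp-rank-{rank}",
        "--raw",
    ]
    result = run(cmd, capture_output=True, text=True, check=True)
    response = json.loads(result.stdout)

    # Assumption: the raw response reports the instance ID as "new_contract".
    if not response.get("success") or "new_contract" not in response:
        raise RuntimeError(f"Offer {offer_id} was not provisioned: {response}")
    return response["new_contract"]
```

Passing the command runner as a parameter keeps the helper testable without renting hardware: a stub that returns canned JSON can stand in for the CLI.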

Details

Author: jeremylongshore
Repository: jeremylongshore/claude-code-plugins-plus-skills
Created: 7 months ago
Last Updated: today
Language: Python
License: MIT

Similar Skills

Semantically similar based on skill content — not just same category

AI & Automation Solid

vastai-core-workflow-a

Execute Vast.ai primary workflow: GPU instance provisioning and job execution. Use when renting GPUs for training, searching offers by price and specs, or managing the full instance lifecycle from search to teardown. Trigger with phrases like "vastai rent gpu", "vastai training job", "vastai provision instance", "run job on vastai".

2,266 Updated today
jeremylongshore
AI & Automation Solid

vastai-performance-tuning

Optimize Vast.ai GPU instance selection, startup time, and training throughput. Use when optimizing instance selection, reducing startup latency, or maximizing GPU utilization on rented hardware. Trigger with phrases like "vastai performance", "optimize vastai", "vastai slow", "vastai gpu utilization", "vastai throughput".

2,266 Updated today
jeremylongshore
AI & Automation Solid

vastai-cost-tuning

Optimize Vast.ai GPU cloud costs through smart instance selection and lifecycle management. Use when analyzing GPU spending, reducing training costs, or implementing budget controls for Vast.ai workloads. Trigger with phrases like "vastai cost", "vastai billing", "reduce vastai costs", "vastai pricing", "vastai budget".

2,266 Updated today
jeremylongshore
AI & Automation Featured

coreweave-core-workflow-b

Run distributed GPU training jobs on CoreWeave with multi-node PyTorch. Use when training models across multiple GPUs, setting up distributed training, or running fine-tuning jobs on CoreWeave H100 clusters. Trigger with phrases like "coreweave training", "coreweave multi-gpu", "distributed training coreweave", "fine-tune on coreweave".

2,266 Updated today
jeremylongshore
AI & Automation Solid

vastai-prod-checklist

Execute Vast.ai production deployment checklist for GPU workloads. Use when deploying training pipelines to production, preparing for large-scale GPU jobs, or auditing production readiness. Trigger with phrases like "vastai production", "deploy vastai", "vastai go-live", "vastai launch checklist".

2,266 Updated today
jeremylongshore