vastai-prod-checklist

Solid

Execute Vast.ai production deployment checklist for GPU workloads. Use when deploying training pipelines to production, preparing for large-scale GPU jobs, or auditing production readiness. Trigger with phrases like "vastai production", "deploy vastai", "vastai go-live", "vastai launch checklist".

AI & Automation 2,266 stars 315 forks Updated today MIT

Quality Score: 99/100

Stars (20%): 100
Recency (20%): 100
Frontmatter (20%): 70
Documentation (15%): 100
Issue Health (10%): 50
License (10%): 100
Description (5%): 100

Skill Content

# Vast.ai Production Checklist

## Overview

Complete checklist for running production GPU workloads on Vast.ai, covering account setup, instance selection, data safety, monitoring, and cost controls.

## Prerequisites

- Vast.ai account with sufficient credits
- Docker images tested and published to registry
- Checkpoint-based training pipeline

## Instructions

### Account & Authentication

- [ ] API key stored in secrets manager (not in code or env files)
- [ ] Dedicated SSH key pair for Vast.ai (not shared with other services)
- [ ] Account balance sufficient for planned workload duration + 50% buffer
- [ ] Billing alerts configured at cloud.vast.ai

### Instance Selection

- [ ] GPU type validated for workload (VRAM, compute capability)
- [ ] Reliability filter set to `>= 0.98` for production jobs
- [ ] Internet speed filter set to `inet_down >= 200` for data transfer
- [ ] Disk allocation includes room for checkpoints + data + 20% overhead
- [ ] CUDA version on host matches Docker image requirements

### Data Safety

- [ ] Training data encrypted before upload to instances
- [ ] Checkpoint saving every N steps (not just per epoch)
- [ ] Checkpoints uploaded to persistent storage (S3/GCS) periodically
- [ ] Instance cleanup script removes data before destruction
- [ ] No sensitive data (API keys, PII) embedded in Docker images

### Spot Instance Protection

- [ ] Spot preemption handler implemented (save checkpoint on SIGTERM)
- [ ] Auto-recovery: detect destroyed instance, pr...
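
Two of the items above, "Checkpoint saving every N steps" and "save checkpoint on SIGTERM", can be sketched together in plain Python. This is a minimal illustration, not the skill's own implementation: the class name `CheckpointingLoop`, the `install_sigterm_handler` helper, the pickle-based state, and the file path are all hypothetical, and it assumes the container actually receives SIGTERM before the instance is stopped, as the checklist implies. Uploading the checkpoint to S3/GCS is left out.

```python
import os
import pickle
import signal
import tempfile


class CheckpointingLoop:
    """Toy training loop (hypothetical) that saves every N steps and on stop."""

    def __init__(self, checkpoint_path, save_every):
        self.checkpoint_path = checkpoint_path
        self.save_every = save_every
        self.stop_requested = False
        self.state = {"step": 0}  # stand-in for model/optimizer state

    def request_stop(self, signum=None, frame=None):
        # Keep the signal handler tiny: set a flag and let the loop
        # finish the current step before saving.
        self.stop_requested = True

    def save(self):
        # Write to a temp file in the same directory, then rename, so a
        # kill in the middle of a write cannot leave a truncated checkpoint.
        directory = os.path.dirname(os.path.abspath(self.checkpoint_path))
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "wb") as f:
            pickle.dump(self.state, f)
        os.replace(tmp_path, self.checkpoint_path)

    def load(self):
        # Resume from the last checkpoint if one exists.
        if os.path.exists(self.checkpoint_path):
            with open(self.checkpoint_path, "rb") as f:
                self.state = pickle.load(f)

    def run(self, total_steps, train_step=lambda step: None):
        self.load()
        while self.state["step"] < total_steps and not self.stop_requested:
            train_step(self.state["step"])
            self.state["step"] += 1
            if self.state["step"] % self.save_every == 0:
                self.save()
        self.save()  # final save, whether finished or interrupted
        return self.state["step"]


def install_sigterm_handler(loop):
    # Assumption: preemption is delivered to this process as SIGTERM.
    signal.signal(signal.SIGTERM, loop.request_stop)
```

A job would typically create the loop, call `install_sigterm_handler(loop)` from the main thread, and then call `loop.run(total_steps)`. After a restart on a new instance, the same call resumes from the last saved step, provided the checkpoint file has been restored from persistent storage first.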

Details

Author: jeremylongshore
Repository: jeremylongshore/claude-code-plugins-plus-skills
Created: 7 months ago
Last Updated: today
Language: Python
License: MIT
