vastai-incident-runbook

Execute Vast.ai incident response for GPU instance failures and outages. Use when responding to instance failures, investigating training crashes, or handling spot preemption emergencies. Trigger with phrases like "vastai incident", "vastai outage", "vastai down", "vastai emergency", "vastai instance failed".

AI & Automation 2,359 stars 334 forks Updated today MIT

Skill Content

# Vast.ai Incident Runbook

## Overview

Rapid incident response procedures for Vast.ai GPU instance failures. Covers triage, mitigation, recovery, and postmortem for common incident types: spot preemption, instance crashes, GPU failures, and billing issues.

## Prerequisites

- Vast.ai CLI access
- SSH access to instances (if still running)
- Checkpoint storage accessible (S3/GCS)

## Instructions

### Triage: Assess Impact (< 2 minutes)

```bash
#!/bin/bash
set -euo pipefail

echo "=== INCIDENT TRIAGE ==="
echo "Time: $(date -u)"

# 1. Check all instances
echo -e "\n--- Instance Status ---"
vastai show instances --raw | python3 -c "
import sys, json
for inst in json.load(sys.stdin):
    status = inst.get('actual_status', '?')
    flag = 'ALERT' if status in ('error', 'exited', 'offline') else 'OK'
    print(f' [{flag}] ID:{inst[\"id\"]} Status:{status} '
          f'GPU:{inst.get(\"gpu_name\",\"?\")} \${inst.get(\"dph_total\",0):.3f}/hr')
"

# 2. Check if affected instance has recent logs
echo -e "\n--- Recent Logs (last 20 lines) ---"
vastai logs ${INSTANCE_ID:-0} --tail 20 2>/dev/null || echo "No logs available"

# 3. Check account balance
echo -e "\n--- Account ---"
vastai show user --raw | python3 -c "import sys,json; u=json.load(sys.stdin); print(f'Balance: \${u.get(\"balance\",0):.2f}')"
```

### Incident Type 1: Spot Preemption

**Symptoms**: Instance status changes from `running` to `exited` or `offline` without user action.

```bash
# 1. Verify preemption (not user e...
```
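The symptom described for spot preemption (a move from `running` to `exited` or `offline`) can be checked mechanically by comparing two snapshots of instance status. The following is a minimal sketch, not part of the original runbook: `find_suspected_preemptions` is a hypothetical helper, and it assumes both snapshots are lists shaped like the JSON from `vastai show instances --raw`, using only the `id` and `actual_status` fields seen in the triage script. It cannot tell a preemption from a user-initiated stop, so treat its output as candidates to verify.

```python
def find_suspected_preemptions(previous, current):
    """Return sorted IDs of instances that were running and are now stopped.

    Hypothetical helper: `previous` and `current` are lists of dicts with
    `id` and `actual_status` keys, taken at two different times.
    """
    # Instances that were healthy in the earlier snapshot.
    was_running = {
        inst["id"] for inst in previous
        if inst.get("actual_status") == "running"
    }
    # Latest known status per instance ID.
    now = {inst["id"]: inst.get("actual_status", "?") for inst in current}
    # Flag only instances still listed whose status dropped to exited/offline;
    # instances missing from `current` are ignored (assumed destroyed on purpose).
    return sorted(
        inst_id for inst_id in was_running
        if now.get(inst_id) in ("exited", "offline")
    )


if __name__ == "__main__":
    before = [
        {"id": 101, "actual_status": "running"},
        {"id": 102, "actual_status": "running"},
    ]
    after = [
        {"id": 101, "actual_status": "exited"},
        {"id": 102, "actual_status": "running"},
    ]
    print("Suspected preemptions:", find_suspected_preemptions(before, after))
```

In practice the two snapshots might be saved a few minutes apart by a monitoring loop; how often to poll is a judgment call that depends on how quickly a training job needs to resume from its checkpoint.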

Details

Author
jeremylongshore
Repository
jeremylongshore/claude-code-plugins-plus-skills
Created
8 months ago
Last Updated
today
Language
Python
License
MIT
