training-check
Periodically check WandB metrics during training to catch problems early (NaN, loss divergence, idle GPUs). Avoids wasting GPU hours on broken runs. Use when training is running and you want automated health checks.
AI & Automation 11,977 stars
1099 forks Updated yesterday MIT
Skill Content
# Training Check
Periodically read WandB metrics during training to catch problems early. Do not wait until training finishes to discover it was a waste of GPU time.
## Context: $ARGUMENTS
## Constants
- WANDB_ENTITY and WANDB_PROJECT: read from CLAUDE.md or passed as argument (format: `entity/project/run_id`)
- CHECK_INTERVAL: starts at 10 minutes, then gradually increases if consistently healthy: 10 min → 20 min → 30 min → 60 min (cap)
- REVIEWER_MODEL = `gpt-5.4` — used via Codex MCP for ambiguous cases only
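The CHECK_INTERVAL schedule above can be sketched as a small helper. This is only an illustration: the name `next_interval`, the choice to advance one step per healthy check, and the reset to 10 minutes after an unhealthy check are assumptions, since the skill does not say how many healthy checks count as "consistently healthy" or what happens after a failure.

```python
# Schedule from the Constants section, in minutes; the last entry is the cap.
SCHEDULE_MIN = [10, 20, 30, 60]


def next_interval(current_min: int, healthy: bool) -> int:
    """Return the next check interval in minutes.

    Assumptions (not stated in the skill): each healthy check advances one
    step in the schedule, and an unhealthy check resets to the first entry.
    """
    if not healthy:
        return SCHEDULE_MIN[0]
    for step in SCHEDULE_MIN:
        if step > current_min:
            return step
    # Already at or beyond the cap: stay at the cap.
    return SCHEDULE_MIN[-1]
```

Starting from 10 minutes, repeated healthy checks walk through 20, 30, and 60, then stay at 60.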
## When to Use
- After training is confirmed running (session alive, loss decreasing for first few steps)
- Set up via CronCreate to fire periodically during training
- **This skill checks training QUALITY, not process HEALTH.** Process health (session alive, GPU utilization) is [watchdog.py](../../tools/watchdog.py)'s job.
## Workflow
### Step 1: Read WandB Metrics
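```python
import wandb

api = wandb.Api()
# Replace the placeholders with the run path (entity/project/run_id).
run = api.run("<entity>/<project>/<run_id>")
# history() returns a sampled DataFrame (500 rows by default), not every logged step.
history = run.history()
```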
If WandB is unreachable (API error, network issue), fall back to reading the log file directly via SSH:
```bash
ssh server "tail -100 /path/to/training.log"
```
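The WandB-first, log-file-second order described above could be wired together roughly as follows. This is a sketch under assumptions: the function names, the `(source, payload)` return shape, the 60-second timeout, and the broad `except Exception` are illustrative choices, and `host` and `log_path` stand in for the `server` and `/path/to/training.log` placeholders.

```python
import shlex
import subprocess
from typing import Any, Callable, Tuple


def read_wandb_history(run_path: str) -> Any:
    """Fetch run history from WandB. The import is lazy so that a missing
    wandb install is treated like any other failure by the caller."""
    import wandb
    return wandb.Api().run(run_path).history()


def tail_remote_log(host: str, log_path: str, lines: int = 100) -> str:
    """Return the last `lines` lines of a remote log file via ssh."""
    result = subprocess.run(
        ["ssh", host, f"tail -n {lines} {shlex.quote(log_path)}"],
        capture_output=True, text=True, timeout=60, check=True,
    )
    return result.stdout


def read_metrics(
    run_path: str,
    host: str,
    log_path: str,
    wandb_reader: Callable[[str], Any] = read_wandb_history,
    log_reader: Callable[[str, str], str] = tail_remote_log,
) -> Tuple[str, Any]:
    """Try WandB first; on any error fall back to the raw log tail.

    Returns (source, payload) so the caller knows which path was taken.
    """
    try:
        return "wandb", wandb_reader(run_path)
    except Exception as exc:  # broad on purpose: API, network, import errors
        print(f"WandB unavailable ({exc!r}); falling back to log file")
        return "log", log_reader(host, log_path)
```

Passing the readers as parameters keeps the fallback logic testable without network access; in normal use the defaults are enough.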
Check these signals:
- **Loss trend**: Is training loss decreasing over the last N steps?
- **Eval metrics**: Are evaluation metrics improving (or at least not degrading)?
- **NaN / Inf**: Any NaN or Inf values in loss or gradients?
- **Spikes**: Sudden large jumps in loss (>10x normal variance)?
- **Learning rate**: Is the sc...
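A minimal sketch of how the loss-related signals in this list (NaN / Inf, spikes, loss trend) might be computed from a plain list of loss values. The function name, the window size, and the half-versus-half trend test are assumptions made for illustration. "10x normal variance" is interpreted here as a step-to-step increase larger than 10 times the standard deviation of the earlier step-to-step changes, which is one reading among several.

```python
import math
import statistics
from typing import Dict, List


def check_loss(losses: List[float], window: int = 20,
               spike_factor: float = 10.0) -> Dict[str, bool]:
    """Flag non-finite values, upward spikes, and a non-decreasing trend.

    Only the last `window` values are inspected (illustrative default).
    """
    recent = losses[-window:]
    non_finite = any(not math.isfinite(x) for x in recent)
    finite = [x for x in recent if math.isfinite(x)]

    # Spike: an increase much larger than the spread of the earlier changes.
    deltas = [b - a for a, b in zip(finite, finite[1:])]
    spike = False
    for i in range(2, len(deltas)):
        baseline = statistics.pstdev(deltas[:i])
        if baseline > 0 and deltas[i] > spike_factor * baseline:
            spike = True
            break

    # Trend: mean of the newer half should be below the mean of the older half.
    # With fewer than 4 points there is too little data, so assume healthy.
    decreasing = True
    if len(finite) >= 4:
        half = len(finite) // 2
        decreasing = statistics.fmean(finite[half:]) < statistics.fmean(finite[:half])

    return {"non_finite": non_finite, "spike": spike, "decreasing": decreasing}
```

With a WandB history DataFrame, the input could be something like `history["loss"].tolist()`, where the column name `loss` is hypothetical and depends on what the training script logs.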
Details
- Author: wanshuiyin
- Repository: wanshuiyin/Auto-claude-code-research-in-sleep
- Created: 3 months ago
- Last Updated: yesterday
- Language: Python
- License: MIT