ml-training-debugger

aiskillstore/marketplace · ★ 329 · AI & Automation · score 79

Install: claude install-skill aiskillstore/marketplace

# ML Training Debugger

**Version**: 1.0.0
**Type**: Agent-based skill with SDK implementation
**Domain**: Machine learning training diagnostics

## Description

Diagnose machine learning training failures including loss divergence, mode collapse, gradient issues, architecture problems, and optimization failures. This skill spawns a specialist ML debugging agent that systematically analyzes training artifacts to identify root causes and propose evidence-based fixes.

Use this skill when encountering training failures, when loss curves exhibit pathological behavior, when models produce degenerate outputs, when experiencing GPU memory issues, or when hyperparameter tuning produces inconsistent results.

## Triggers

This skill activates when users request:

- "Debug my training run"
- "Why is my loss diverging?"
- "Model outputs are all the same token"
- "Training failed at epoch X"
- "Help diagnose mode collapse"
- "Why are gradients exploding/vanishing?"
- "Model not learning anything"

## Skill Architecture

### Skill Layer (Lightweight)

The skill handles:

1. **Detection**: Identify ML training debugging requests
2. **Context Gathering**: Collect training logs, loss curves, model code
3. **Agent Spawning**: Invoke ML debugging specialist with context
4. **Result Processing**: Format diagnosis and fixes for user

### Agent Layer (Specialist)

The ML debugging agent handles:

1. **Systematic Analysis**: Apply debugging methodology to artifacts
2. **Root Cause Identification**: Di