when-debugging-ml-training-use-ml-training-debuggerlisted
Install: claude install-skill aiskillstore/marketplace
# ML Training Debugger - Diagnose and Fix Training Issues
## Overview
Systematic debugging workflow for ML training issues including loss divergence, overfitting, slow convergence, gradient problems, and performance optimization.
## When to Use
- Training loss becomes NaN or infinite
- Severe overfitting (train >> val performance)
- Training not converging
- Gradient vanishing/exploding
- Poor validation accuracy
- Training too slow
## Phase 1: Diagnose Issue (8 min)
### Objective
Identify the specific training problem
### Agent: ML-Developer
**Step 1.1: Analyze Training Curves**
```python
import json
import numpy as np
# Load training history
with open('training_history.json', 'r') as f:
history = json.load(f)
# Diagnose issues
diagnosis = {
'loss_divergence': check_loss_divergence(history['loss']),
'overfitting': check_overfitting(history['loss'], history['val_loss']),
'slow_convergence': check_convergence_rate(history['loss']),
'gradient_issues': check_gradient_health(history),
'nan_values': any(np.isnan(history['loss']))
}
def check_loss_divergence(losses):
# Loss increasing over time
if len(losses) > 10:
recent_trend = np.mean(losses[-5:]) > np.mean(losses[-10:-5])
return recent_trend
def check_overfitting(train_loss, val_loss):
# Val loss diverging from train loss
if len(train_loss) > 10:
gap = np.mean(val_loss[-5:]) - np.mean(train_loss[-5:])
return gap > 0.5 # Significant gap
def che