Install: claude install-skill Methasit-Pun/data_engineer_claude_skills
# ML Feature Engineering Patterns
## The Data Engineering / ML Boundary
Data engineers own the data. ML engineers own the models. Feature engineering sits at the boundary — and when it's designed poorly, both sides pay for it. The most common failures:
1. **Training/serving skew** — features computed differently at training time vs. prediction time → model performs worse in production than in validation
2. **Label leakage** — features computed using data from the future, making the training set artificially easy → model fails in production
3. **Feature duplication** — each ML team recomputes the same features independently → inconsistent definitions, wasted compute
The patterns in this skill address all three.
---
## Point-in-Time Correct Joins
This is the most important concept in ML feature engineering. A model trained to predict churn should only "see" data that would have been available at the time of the prediction — not data from the future.
### The wrong way
```sql
-- BAD: This leaks future data into training
-- If we're predicting churn as of 2024-01-15, we shouldn't know about
-- events that happened on 2024-01-20
SELECT
    u.user_id,
    u.subscription_tier,
    COUNT(e.event_id) AS events_last_30d,  -- counts events AFTER the label date!
    u.churned AS label
FROM users u
JOIN events e
    ON u.user_id = e.user_id
    AND e.event_date >= DATEADD(day, -30, CURRENT_DATE)  -- wrong: uses today's date
WHERE u.label_date = '2024-01-15'
GROUP BY 1, 2, 4;
```
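To make the leak concrete, here is a minimal sketch that runs both window definitions against a toy dataset using Python's built-in `sqlite3`. Everything in it is an assumption for illustration: the four event rows are made up, SQLite's `date()` stands in for `DATEADD`, "today" is pinned to a hypothetical `2024-02-01` so the output is reproducible, and the cutoff is taken to be strictly before `label_date`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (user_id INTEGER, subscription_tier TEXT,
                    label_date TEXT, churned INTEGER);
CREATE TABLE events (event_id INTEGER, user_id INTEGER, event_date TEXT);

-- Toy rows, invented for this example
INSERT INTO users VALUES (1, 'pro', '2024-01-15', 1);
INSERT INTO events VALUES
  (10, 1, '2023-12-01'),  -- more than 30 days before the label date
  (11, 1, '2024-01-05'),  -- known at prediction time
  (12, 1, '2024-01-14'),  -- known at prediction time
  (13, 1, '2024-01-20');  -- happens AFTER the label date
""")

# Leaky window: anchored to the day the query runs, as in the query above.
LEAKY_SQL = """
SELECT u.user_id, COUNT(e.event_id) AS events_last_30d
FROM users u
LEFT JOIN events e
  ON e.user_id = u.user_id
 AND e.event_date >= date(:today, '-30 days')
GROUP BY u.user_id
"""

# Point-in-time window: anchored to each row's own label_date,
# and closed before it so nothing from the future is counted.
POINT_IN_TIME_SQL = """
SELECT u.user_id, COUNT(e.event_id) AS events_last_30d
FROM users u
LEFT JOIN events e
  ON e.user_id = u.user_id
 AND e.event_date >= date(u.label_date, '-30 days')
 AND e.event_date <  u.label_date
GROUP BY u.user_id
"""

leaky_count = conn.execute(LEAKY_SQL, {"today": "2024-02-01"}).fetchone()[1]
pit_count = conn.execute(POINT_IN_TIME_SQL).fetchone()[1]

print("leaky window:        ", leaky_count)  # 3, includes the 2024-01-20 event
print("point-in-time window:", pit_count)    # 2, only events before 2024-01-15
```

The leaky count also changes every time the query is re-run on a later day, so the same training row can end up with different feature values. The label-anchored version returns the same result no matter when it is run.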
###