Install: claude install-skill Methasit-Pun/data_engineer_claude_skills
# Python Data Patterns
## The Root Cause of Most Python Data Performance Problems
Row-by-row iteration (`for index, row in df.iterrows()`) is almost always the culprit. DataFrames are columnar data structures: they are designed for batch operations on whole columns, not for Python loops that build a new object for every row. A 1M-row DataFrame that takes 10 minutes with `iterrows` typically runs in under a second with a vectorized equivalent.
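To make the gap concrete, here is a minimal timing sketch. The synthetic `revenue` and `cost` columns, the row count, and the helper names `margin_iterrows` and `margin_vectorized` are assumptions chosen for illustration; actual timings depend on hardware and pandas version.

```python
import time

import numpy as np
import pandas as pd

# Synthetic data; column names and row count are illustrative assumptions.
rng = np.random.default_rng(0)
n_rows = 20_000
df = pd.DataFrame({
    "revenue": rng.uniform(100, 200, n_rows),
    "cost": rng.uniform(10, 90, n_rows),
})


def margin_iterrows(frame: pd.DataFrame) -> pd.Series:
    """Row-by-row version: every row is materialized as a Series in Python."""
    values = [row["revenue"] - row["cost"] for _, row in frame.iterrows()]
    return pd.Series(values, index=frame.index, name="margin")


def margin_vectorized(frame: pd.DataFrame) -> pd.Series:
    """Column-wise version: a single subtraction over whole arrays."""
    return (frame["revenue"] - frame["cost"]).rename("margin")


start = time.perf_counter()
slow = margin_iterrows(df)
slow_seconds = time.perf_counter() - start

start = time.perf_counter()
fast = margin_vectorized(df)
fast_seconds = time.perf_counter() - start

# Both approaches produce the same values; only the runtime differs.
pd.testing.assert_series_equal(slow, fast)
print(f"iterrows:   {slow_seconds:.4f} s")
print(f"vectorized: {fast_seconds:.4f} s")
```

Raising `n_rows` should widen the gap, since the per-row overhead of `iterrows` grows with the number of rows.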
---
## Pandas
### Vectorized operations — always prefer over loops
```python
import numpy as np

# Bad: iterrows is 100-1000x slower
for i, row in df.iterrows():
    df.at[i, "margin"] = row["revenue"] - row["cost"]

# Good: vectorized
df["margin"] = df["revenue"] - df["cost"]

# Good: apply only when vectorized isn't possible
df["label"] = df["score"].apply(lambda x: "high" if x > 0.8 else "low")

# Better: use np.where for simple conditionals
df["label"] = np.where(df["score"] > 0.8, "high", "low")

# Best for complex conditionals: np.select
conditions = [df["score"] > 0.8, df["score"] > 0.5]
choices = ["high", "medium"]
df["label"] = np.select(conditions, choices, default="low")
```
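One detail about `np.select` is worth checking on a small input: conditions are evaluated in order, and the first one that matches determines the result, so narrower conditions should come before broader ones. The sketch below uses a hypothetical three-row `scores` frame to show what happens when the order is reversed.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data for illustration only.
scores = pd.DataFrame({"score": [0.95, 0.65, 0.30]})

# Conditions are checked in order; the first True one determines the choice.
conditions = [scores["score"] > 0.8, scores["score"] > 0.5]
choices = ["high", "medium"]
scores["label"] = np.select(conditions, choices, default="low")

# Reversing the order lets the broader condition shadow the narrower one.
shadowed = np.select(conditions[::-1], choices[::-1], default="low")

print(scores["label"].tolist())  # ['high', 'medium', 'low']
print(shadowed.tolist())         # ['medium', 'medium', 'low']
```

In the reversed version no row is ever labeled `"high"`, because `score > 0.5` already matches every row that `score > 0.8` would.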
### Memory optimization — reduce types early
```python
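import pandas as pd


def optimize_dtypes(df: pd.DataFrame) -> pd.DataFrame:
    # Note: columns are converted on the DataFrame passed in, not on a copy.
    for col in df.select_dtypes("object"):
        if df[col].nunique() / len(df) < 0.5:  # low cardinality → category
            df[col] = df[col].astype("category")
    # Shrink int64 columns to the smallest integer type that holds their values.
    for col in df.select_dtypes("int64"):
        df[col] = pd.to_numeric(df[col], downcast="integer")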
fo