python-data-patterns

Pandas, Polars, and PySpark idioms for production data engineering — chunked reads, memory-safe transforms, vectorized operations, type optimization, and performance patterns. Use this skill whenever the user is writing a Python data transformation script and running into memory issues, slow performance, or correctness bugs with large datasets. Also trigger when the user asks how to handle large CSV/Parquet files, process data in batches, use Polars instead of Pandas, optimize a PySpark job, or reduce DataFrame memory usage. If you see someone iterating row-by-row over a DataFrame, this skill should trigger immediately.
Methasit-Pun/data_engineer_claude_skills · Data & Documents

Install: claude install-skill Methasit-Pun/data_engineer_claude_skills

# Python Data Patterns

## The Root Cause of Most Python Data Performance Problems

Row-by-row iteration (`for index, row in df.iterrows()`) is almost always the culprit. DataFrames are columnar data structures: they're designed for batch column operations, not row-by-row Python loops. A 1M-row DataFrame that takes 10 minutes with `iterrows` typically runs in under a second with a vectorized equivalent.

---

## Pandas

### Vectorized operations: always prefer over loops

```python
import numpy as np

# Bad: iterrows is 100-1000x slower
for i, row in df.iterrows():
    df.at[i, "margin"] = row["revenue"] - row["cost"]

# Good: vectorized
df["margin"] = df["revenue"] - df["cost"]

# Good: apply only when vectorized isn't possible
df["label"] = df["score"].apply(lambda x: "high" if x > 0.8 else "low")

# Better: use np.where for simple conditionals
df["label"] = np.where(df["score"] > 0.8, "high", "low")

# Best for complex conditionals: np.select
conditions = [df["score"] > 0.8, df["score"] > 0.5]
choices = ["high", "medium"]
df["label"] = np.select(conditions, choices, default="low")
```

### Memory optimization: reduce types early

```python
import pandas as pd

def optimize_dtypes(df: pd.DataFrame) -> pd.DataFrame:
    for col in df.select_dtypes("object"):
        if df[col].nunique() / len(df) < 0.5:  # low cardinality → category
            df[col] = df[col].astype("category")
    for col in df.select_dtypes("int64"):
        df[col] = pd.to_numeric(df[col], downcast="integer")
    fo