python-pipeline

Solid

Python data processing pipelines with modular architecture. Use when building content processing workflows, implementing dispatcher patterns, integrating Google Sheets/Drive APIs, or creating batch processing systems. Covers patterns from rosen-scraper, image-analyzer, and social-scraper projects.

Data & Documents 233 stars 44 forks Updated today MIT

Install

View on GitHub

Quality Score: 89/100

Stars 20%
79
Recency 20%
100
Frontmatter 20%
70
Documentation 15%
100
Issue Health 10%
50
License 10%
100
Description 5%
100

Skill Content

# Python data pipeline development Patterns for building production-quality data processing pipelines with Python. **Targeted at Python 3.11+** for `asyncio.TaskGroup` and exception groups; Python 3.12+ for the lighter `type X = ...` syntax. Pin a 3.13+ runtime if you want the JIT or experimental free-threading; the patterns here don't depend on either. ## Choosing a DataFrame engine: pandas vs polars vs DuckDB For a long time pandas was the default for any tabular work in Python. As of 2026 the default has shifted: **polars** is the right pick for multi-GB pipelines on a single machine, **DuckDB** is the right pick when SQL or larger-than-RAM scans are involved, and **pandas** stays useful for small data and the ML/notebook ecosystem (scikit-learn, statsmodels, plotnine all speak it natively). | Tool | When | Why | |---|---|---| | pandas | < ~1 GB data, ML interop, single-threaded familiarity | Mature, ubiquitous, eager DataFrame model. Slowest in benchmarks but most ecosystem support. | | polars | 1 GB - tens of GB on one box, performance-critical pipelines | Multithreaded by default, lazy query engine, Arrow-native. ~5x speedup over pandas on filter / aggregate at 100M rows. | | DuckDB | SQL workflows, larger-than-RAM, parquet/CSV scanning, joins across many files | Vectorized + pipelined execution, cost-based optimizer, streaming scans. Works great as a thin wrapper over a directory of parquet files. | All three speak Apache Arrow, so zero-copy interop between them ...

Details

Author
jamditis
Repository
jamditis/claude-skills-journalism
Created
5 months ago
Last Updated
today
Language
HTML
License
MIT

Similar Skills

Semantically similar based on skill content — not just same category

Data & Documents Listed

python-data-patterns

Pandas, Polars, and PySpark idioms for production data engineering — chunked reads, memory-safe transforms, vectorized operations, type optimization, and performance patterns. Use this skill whenever the user is writing a Python data transformation script and running into memory issues, slow performance, or correctness bugs with large datasets. Also trigger when the user asks how to handle large CSV/Parquet files, process data in batches, use Polars instead of Pandas, optimize a PySpark job, or reduce DataFrame memory usage. If you see someone iterating row-by-row over a DataFrame, this skill should trigger immediately.

0 Updated 4 days ago
Methasit-Pun
Data & Documents Listed

bigdata-processing

Core big data processing toolkit for data teams. Includes Polars, Dask, Vaex for large-scale data processing, ETL pipelines, and distributed computing. Use when working with datasets larger than memory, building data pipelines, or optimizing data processing performance.

1 Updated today
MARUCIE
Data & Documents Listed

pipeline-architect

Designs and implements data pipelines: ETL/ELT, streaming, batch processing, schema migrations, and data warehouse architecture. Covers Kafka, Airflow, dbt, Spark, ClickHouse, BigQuery, Snowflake, Redis Streams, and more. Use this skill when the user asks about data pipelines, ETL jobs, data transformation, streaming setup, data warehouse design, CDC, schema migrations, data quality checks, or anything involving moving data from source to target. Also triggers on "build a pipeline," "migrate data from X to Y," "set up streaming," "design my data warehouse," or "data quality is bad, help me fix it."

1 Updated 2 days ago
mturac
Data & Documents Listed

transforming-data

Transform raw data into analytical assets using ETL/ELT patterns, SQL (dbt), Python (pandas/polars/PySpark), and orchestration (Airflow). Use when building data pipelines, implementing incremental models, migrating from pandas to polars, or orchestrating multi-step transformations with testing and quality checks.

368 Updated 5 months ago
ancoleman
Data & Documents Listed

pipeline-design

Design ETL/ELT pipelines end-to-end — source connectors, extraction strategies, transform logic, load patterns, idempotency, scheduling, and error handling. Use this skill whenever the user is starting a new ingestion job, planning how data moves from a source (REST API, database, file, webhook, message queue) into a data warehouse or data lake. Also trigger when the user asks about pipeline architecture, incremental vs. full loads, backfill strategies, CDC, retry logic, or orchestration choices (Airflow, Prefect, dbt). This skill should feel like pairing with a senior data engineer on day one of a new pipeline project.

0 Updated 4 days ago
Methasit-Pun