spark-engineer

Solid

Use when writing Spark jobs, debugging performance issues, or configuring cluster settings for Apache Spark applications, distributed data processing pipelines, or big data workloads. Invoke to write DataFrame transformations, optimize Spark SQL queries, implement RDD pipelines, tune shuffle operations, configure executor memory, process .parquet files, handle data partitioning, or build structured streaming analytics.

Data & Documents 9,537 stars 808 forks Updated 1 weeks ago MIT

Install

View on GitHub

Quality Score: 94/100

Stars 20%
100
Recency 20%
90
Frontmatter 20%
70
Documentation 15%
100
Issue Health 10%
50
License 10%
100
Description 5%
100

Skill Content

# Spark Engineer Senior Apache Spark engineer specializing in high-performance distributed data processing, optimizing large-scale ETL pipelines, and building production-grade Spark applications. ## Core Workflow 1. **Analyze requirements** - Understand data volume, transformations, latency requirements, cluster resources 2. **Design pipeline** - Choose DataFrame vs RDD, plan partitioning strategy, identify broadcast opportunities 3. **Implement** - Write Spark code with optimized transformations, appropriate caching, proper error handling 4. **Optimize** - Analyze Spark UI, tune shuffle partitions, eliminate skew, optimize joins and aggregations 5. **Validate** - Check Spark UI for shuffle spill before proceeding; verify partition count with `df.rdd.getNumPartitions()`; if spill or skew detected, return to step 4; test with production-scale data, monitor resource usage, verify performance targets ## Reference Guide Load detailed guidance based on context: | Topic | Reference | Load When | |-------|-----------|-----------| | Spark SQL & DataFrames | `references/spark-sql-dataframes.md` | DataFrame API, Spark SQL, schemas, joins, aggregations | | RDD Operations | `references/rdd-operations.md` | Transformations, actions, pair RDDs, custom partitioners | | Partitioning & Caching | `references/partitioning-caching.md` | Data partitioning, persistence levels, broadcast variables | | Performance Tuning | `references/performance-tuning.md` | Configuration, memory tuning, shuf...

Details

Author
Jeffallan
Repository
Jeffallan/claude-skills
Created
7 months ago
Last Updated
1 weeks ago
Language
Python
License
MIT

Similar Skills

Semantically similar based on skill content — not just same category

Data & Documents Featured

data-engineer

Build scalable data pipelines, modern data warehouses, and real-time streaming architectures. Implements Apache Spark, dbt, Airflow, and cloud-native data platforms.

39,350 Updated today
sickn33
Data & Documents Featured

data-engineer

Build scalable data pipelines, modern data warehouses, and real-time streaming architectures. Implements Apache Spark, dbt, Airflow, and cloud-native data platforms.

27,705 Updated today
davila7
Data & Documents Solid

senior-data-engineer

World-class data engineering skill for building scalable data pipelines, ETL/ELT systems, and data infrastructure. Expertise in Python, SQL, Spark, Airflow, dbt, Kafka, and modern data stack. Includes data modeling, pipeline orchestration, data quality, and DataOps. Use when designing data architectures, building data pipelines, optimizing data workflows, or implementing data governance.

27,705 Updated today
davila7
Data & Documents Listed

senior-data-engineer

World-class data engineering skill for building scalable data pipelines, ETL/ELT systems, and data infrastructure. Expertise in Python, SQL, Spark, Airflow, dbt, Kafka, and modern data stack. Includes data modeling, pipeline orchestration, data quality, and DataOps. Use when designing data architectures, building data pipelines, optimizing data workflows, or implementing data governance.

335 Updated today
aiskillstore
Data & Documents Listed

data-engineer

Build scalable data pipelines, modern data warehouses, and real-time streaming architectures. Implements Apache Spark, dbt, Airflow, and cloud-native data platforms. Use PROACTIVELY for data pipeline design, analytics infrastructure, or modern data stack implementation.

335 Updated today
aiskillstore