spark-engineer

Solid

Use when writing Spark jobs, debugging performance issues, or configuring cluster settings for Apache Spark applications, distributed data processing pipelines, or big data workloads. Invoke to write DataFrame transformations, optimize Spark SQL queries, implement RDD pipelines, tune shuffle operations, configure executor memory, process .parquet files, handle data partitioning, or build structured streaming analytics.

Data & Documents 9,537 stars 808 forks Updated 1 week ago MIT

Quality Score: 94/100

Stars (20%): 100
Recency (20%)
Frontmatter (20%)
Documentation (15%): 100
Issue Health (10%)
License (10%): 100
Description (5%): 100

Skill Content

# Spark Engineer

Senior Apache Spark engineer specializing in high-performance distributed data processing, optimizing large-scale ETL pipelines, and building production-grade Spark applications.

## Core Workflow

1. **Analyze requirements** - Understand data volume, transformations, latency requirements, cluster resources
2. **Design pipeline** - Choose DataFrame vs RDD, plan partitioning strategy, identify broadcast opportunities
3. **Implement** - Write Spark code with optimized transformations, appropriate caching, proper error handling
4. **Optimize** - Analyze Spark UI, tune shuffle partitions, eliminate skew, optimize joins and aggregations
5. **Validate** - Check Spark UI for shuffle spill before proceeding; verify partition count with `df.rdd.getNumPartitions()`; if spill or skew detected, return to step 4; test with production-scale data, monitor resource usage, verify performance targets

## Reference Guide

Load detailed guidance based on context:

| Topic | Reference | Load When |
|-------|-----------|-----------|
| Spark SQL & DataFrames | `references/spark-sql-dataframes.md` | DataFrame API, Spark SQL, schemas, joins, aggregations |
| RDD Operations | `references/rdd-operations.md` | Transformations, actions, pair RDDs, custom partitioners |
| Partitioning & Caching | `references/partitioning-caching.md` | Data partitioning, persistence levels, broadcast variables |
| Performance Tuning | `references/performance-tuning.md` | Configuration, memory tuning, shuf...
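
The skew check in step 5 of the Core Workflow could be approximated with a small helper like the one below. This is a minimal sketch, not part of the skill itself: the function name `skew_report` and the 5x max-to-median threshold are illustrative assumptions, and it expects per-partition record counts that have already been collected from the job.

```python
from statistics import median


def skew_report(partition_sizes, ratio_threshold=5.0):
    """Summarize partition balance from per-partition record counts.

    Hypothetical helper: the threshold is an assumed rule of thumb,
    not a Spark default.
    """
    if not partition_sizes:
        raise ValueError("partition_sizes must not be empty")

    med = median(partition_sizes)
    largest = max(partition_sizes)

    # Compare the largest partition to the median; treat a zero median
    # with any non-empty partition as unbounded skew.
    if med > 0:
        ratio = largest / med
    elif largest > 0:
        ratio = float("inf")
    else:
        ratio = 1.0

    return {
        "num_partitions": len(partition_sizes),
        "median": med,
        "max": largest,
        "max_to_median": ratio,
        "skewed": ratio > ratio_threshold,
    }


# In PySpark, the counts might be gathered with something like:
#   sizes = df.rdd.mapPartitions(lambda rows: [sum(1 for _ in rows)]).collect()
# which scans the whole dataset, so it is best run on a sample or in testing.
if __name__ == "__main__":
    print(skew_report([100, 100, 100, 1000]))
```

If `skewed` comes back true, the workflow above says to return to step 4 before testing at production scale.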

Details

Author: Jeffallan
Repository: Jeffallan/claude-skills
Created: 7 months ago
Last Updated: 1 week ago
Language: Python
License: MIT

Similar Skills

Semantically similar based on skill content — not just same category

Data & Documents Featured

data-engineer

Build scalable data pipelines, modern data warehouses, and real-time streaming architectures. Implements Apache Spark, dbt, Airflow, and cloud-native data platforms.

39,350 Updated today

sickn33

Data & Documents Featured

data-engineer

Build scalable data pipelines, modern data warehouses, and real-time streaming architectures. Implements Apache Spark, dbt, Airflow, and cloud-native data platforms.

27,705 Updated today

davila7

Data & Documents Solid

senior-data-engineer

World-class data engineering skill for building scalable data pipelines, ETL/ELT systems, and data infrastructure. Expertise in Python, SQL, Spark, Airflow, dbt, Kafka, and modern data stack. Includes data modeling, pipeline orchestration, data quality, and DataOps. Use when designing data architectures, building data pipelines, optimizing data workflows, or implementing data governance.

27,705 Updated today

davila7

Data & Documents Listed

senior-data-engineer

335 Updated today

aiskillstore

Data & Documents Listed

data-engineer

Build scalable data pipelines, modern data warehouses, and real-time streaming architectures. Implements Apache Spark, dbt, Airflow, and cloud-native data platforms. Use PROACTIVELY for data pipeline design, analytics infrastructure, or modern data stack implementation.

335 Updated today

aiskillstore