streaming-patterns

Kafka, Flink, Kinesis, and Spark Structured Streaming design — consumer groups, partitioning, exactly-once semantics, lag monitoring, windowing, and late-arriving data. Use this skill whenever the user needs real-time or near-real-time data processing, is redesigning a batch pipeline into streaming, asks about event-driven architectures, or mentions Kafka topics, consumer lag, checkpointing, watermarks, or stream-table joins. Also trigger when the user says batch is "too slow", stakeholders want "live" dashboards, or the pipeline needs to react to events as they happen rather than on a schedule. If latency requirements are under a few minutes, this skill should be active.
Methasit-Pun/data_engineer_claude_skills · ★ 1 · Data & Documents · score 62

Install: claude install-skill Methasit-Pun/data_engineer_claude_skills

# Streaming Patterns for Data Pipelines

## When to stream vs. batch

Streaming adds real complexity: consumer groups, offset management, exactly-once semantics, and late data handling. Before committing, verify that the business actually needs it:

| Need | Right choice |
|---|---|
| Latency < 1 minute | Streaming |
| Latency 1–15 minutes | Micro-batch (Spark Structured Streaming, Flink in batch mode) |
| Latency > 15 minutes | Batch is simpler and cheaper |
| React to individual events (fraud, alerts) | Streaming |
| Aggregate for dashboards | Micro-batch is often fine |

Ask "what are stakeholders actually doing with this data at 2am?" and you usually find that the true latency requirement is much looser than the initial ask.

---

## Kafka Fundamentals

### Partitioning strategy

Partitions are the unit of parallelism. The choice of partition key determines both throughput and ordering guarantees.

```python
# Assumes the kafka-python client and a local broker (neither is named in the original);
# user_id and event_json are strings supplied by the caller.
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

# Producers: choose the partition key carefully
producer.send(
    topic="user_events",
    key=user_id.encode(),  # all events for a user go to the same partition → ordered per user
    value=event_json.encode(),
)
```

- Partition by entity key (user_id, order_id) when per-entity ordering matters
- Partition by random/null key when you just want throughput and don't need ordering
- Avoid low-cardinality keys (e.g., `event_type`); they create hot partitions

### Consumer groups

Each consumer group maintains its own offsets, so multiple applications can read the same topic independentl