Install: claude install-skill Methasit-Pun/data_engineer_claude_skills
# Streaming Patterns for Data Pipelines
## When to stream vs. batch
Streaming adds real complexity — consumer groups, offset management, exactly-once semantics, late data handling. Before committing, verify the business actually needs it:
| Need | Right choice |
|---|---|
| Latency < 1 minute | Streaming |
| Latency 1–15 minutes | Micro-batch (e.g., Spark Structured Streaming with a processing-time trigger) |
| Latency > 15 minutes | Batch is simpler and cheaper |
| React to individual events (fraud, alerts) | Streaming |
| Aggregate for dashboards | Micro-batch is often fine |
Ask "what are stakeholders actually doing with this data at 2am?" and you will usually find that the true latency requirement is much looser than the initial ask.
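The table above can be read as a small decision rule. The sketch below is one way to encode it; the function name `choose_processing_mode` and its parameters are illustrative assumptions, and the thresholds are simply the ones from the table.

```python
def choose_processing_mode(max_latency_seconds: float,
                           reacts_to_single_events: bool = False) -> str:
    """Map a latency requirement to a processing style (thresholds from the table)."""
    # Per-event reactions (fraud, alerts) need streaming regardless of stated latency
    if reacts_to_single_events or max_latency_seconds < 60:
        return "streaming"
    # 1 to 15 minutes: micro-batch is usually enough
    if max_latency_seconds <= 15 * 60:
        return "micro-batch"
    # Anything looser: batch is simpler and cheaper
    return "batch"


if __name__ == "__main__":
    for seconds in (30, 300, 3600):
        print(seconds, "->", choose_processing_mode(seconds))
```

Treat the result as a starting point for the conversation with stakeholders, not a verdict: a dashboard with a nominal 30-second requirement may turn out to tolerate five minutes once someone asks how it is actually used.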
---
## Kafka Fundamentals
### Partitioning strategy
Partitions are the unit of parallelism. The right partition key determines throughput AND ordering guarantees.
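```python
from kafka import KafkaProducer  # assumes the kafka-python client

producer = KafkaProducer(bootstrap_servers="localhost:9092")  # placeholder broker
user_id, event_json = "user-123", '{"event_type": "click"}'  # sample values

# Producers: choose the partition key carefully
producer.send(
    topic="user_events",
    key=user_id.encode(),  # all events for a user go to the same partition → ordered per user
    value=event_json.encode(),
)
producer.flush()  # block until buffered records are sent
```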
- Partition by entity key (user_id, order_id) when per-entity ordering matters
- Use a null (or random) key when you only need throughput and not ordering; records are then spread across partitions
- Avoid low-cardinality keys (e.g., `event_type`) — they create hot partitions
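To see why key cardinality matters, the following sketch simulates key-to-partition assignment without a broker. It is an illustration under stated assumptions: `zlib.crc32` stands in for the real hash (the Java client's default partitioner uses murmur2), and the partition count, key names, and helper functions are hypothetical.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 6  # hypothetical partition count for the topic


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # Stand-in hash: the same key always maps to the same partition,
    # which is what gives per-key ordering
    return zlib.crc32(key.encode()) % num_partitions


def partition_load(keys) -> Counter:
    # Count how many records land on each partition
    return Counter(partition_for(k) for k in keys)


# High-cardinality key: 10,000 records, one per distinct user
user_keys = [f"user-{i}" for i in range(10_000)]

# Low-cardinality key: 10,000 records spread over only three event types
event_types = ["click", "view", "purchase"]
type_keys = [event_types[i % 3] for i in range(10_000)]

user_load = partition_load(user_keys)
type_load = partition_load(type_keys)

print("partitions used with user_id keys:   ", len(user_load))
print("partitions used with event_type keys:", len(type_load))
print("busiest partition with event_type keys:", max(type_load.values()), "records")
```

With three distinct keys, at most three partitions ever receive data no matter how many partitions the topic has, so adding consumers beyond that point does not add parallelism.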
### Consumer groups
Each consumer group maintains its own offsets, so multiple applications can read the same topic independently.