Chapter 11: Stream Processing

The Big Idea

Batch processing has one problem: latency. A job that runs once a day means insights that are 24 hours stale. Stream processing is like a continuous batch job — instead of processing a bounded dataset at scheduled intervals, you process an unbounded, continuously arriving stream of data in near-real-time.

Stream = data that is produced over time, continuously

This chapter covers how to transmit, process, and derive insights from streams.

📨 Transmitting Event Streams

Events and Streams

An event is a small, self-contained, immutable record of something that happened at a specific time. Examples:

User clicked a button
Sensor reported a temperature
A service processed a payment

A stream is a sequence of related events. The producer encodes the event; consumers decode and process it.

Direct Messaging vs Message Brokers

Direct messaging (UDP multicast, HTTP webhooks, ZeroMQ):

Low latency
❌ If consumer is down, messages are lost
❌ Producer must know about all consumers

Message brokers (Kafka, RabbitMQ, Kinesis):

Producers send to the broker; consumers receive from the broker
✅ Durability: messages stored until consumed (or expired)
✅ Decoupling: producers don't know about consumers
✅ Backpressure: if consumer is slow, messages queue up

Message Brokers vs Databases

Aspect	Message Broker	Database
Retention	Messages deleted after consumption	Data stored indefinitely
Querying	Subscribe to new messages	Ad-hoc queries over all data
Secondary indexes	No	Yes
Ordering	Within partition	Depends on schema

Log-based message brokers (Kafka): A hybrid — messages are stored on disk durably, consumers track their position (offset). Messages are not deleted after consumption — multiple consumers can read independently. Old messages expire based on retention policy.

🚀 Apache Kafka Deep Dive

Kafka is the dominant log-based message broker. Key design:

Producer → Topic → [Partition 0: msg0, msg1, msg2, ...]
                 → [Partition 1: msg0, msg1, msg2, ...]
                 → [Partition 2: msg0, msg1, msg2, ...]
                        ↓
Consumer Group: [Consumer A reads P0, Consumer B reads P1, Consumer C reads P2]

A topic is split into partitions for parallelism
Each partition is an append-only, ordered log
Each consumer tracks its offset (which message it processed last)
Messages are retained for a configurable period (days or weeks), regardless of consumption
Multiple independent consumer groups can read the same topic

Why this is powerful:

Adding consumers doesn't affect producers
Replay: reset your offset to re-read historical messages (e.g., after fixing a bug in your consumer)
Fan-out: multiple services subscribe to the same events independently
Exactly-once and at-least-once delivery semantics (with careful consumer design)

⚙️ Stream Processing Patterns

1. Complex Event Processing (CEP)

Search for event patterns across a stream, similar to a regex matching on a string.

Example: Detect credit card fraud — find a pattern like "purchase in New York, then purchase in London, within 5 minutes."

Frameworks: Esper, Siddhi. CEP systems reverse the normal model — queries are stored, data flows through them.

2. Stream Analytics

Aggregate metrics over time windows. Unlike CEP, you don't care about individual event sequences — you want statistics:

Request count per minute
95th percentile latency per hour
Unique user count per day

Frameworks: Apache Flink, Apache Spark Streaming, Apache Storm, Kafka Streams.

3. Maintaining Materialized Views

Keep a derived view of data up-to-date in near-real-time. Every change event to the source is applied to the materialized view.

Example: Keep a search index (Elasticsearch) synchronized with a database (PostgreSQL) via a CDC stream.

⏱️ Time in Stream Processing

Time is surprisingly tricky in stream processing:

Event Time vs Processing Time

Event time: When the event actually occurred (timestamp in the event)
Processing time: When the event arrived at the stream processor

These diverge because of:

Network delays
Mobile apps that buffer events when offline (then flush when connected)
Events arriving out of order

Example: A mobile app sends events. If the user is on a plane for 8 hours, all events arrive at processing time T, but event times span T-8 hours to T.

Windowing

Group events into time windows for aggregation:

Window Type	Description	Example
Tumbling	Fixed-size, non-overlapping	Events from 10:00–10:01, 10:01–10:02
Hopping	Fixed-size, overlapping	5-min window, every 1 min
Sliding	Window per event (last N seconds)	"events in the last 5 minutes"
Session	Variable-size, inactivity-gap-bounded	User session (gap > 30 min = new session)

Handling Late-Arriving Events

If you process based on event time and events arrive late, you need to decide:

Watermarks: Declare "I believe all events up to time T have arrived" → close the window and emit results. Risk: some late events exceed the watermark → ignored.
Allow lateness: Keep windows open for a grace period. When a late event arrives within grace period, update and re-emit the window result.
Retract and re-emit: Emit a preliminary result, then emit a correction when late events arrive.

Apache Flink provides all three mechanisms.

🔗 Joins in Stream Processing

Stream-Stream Join (Windowed Join)

Both sides are streams. Join events that occur within the same time window.

Example: Match search queries with click events. A user searches, then clicks a result within 1 hour — what search query led to which click?

Both streams must be buffered. Events expire from the buffer after the window.

Stream-Table Join (Enrichment)

One side is a stream, the other is a slowly-changing database table.

Example: Enrich activity events with the user's current profile data.

The database table must be loaded into memory (or efficiently queried). When the table updates, the stream processor must pick up the changes — this is often done via CDC (Change Data Capture) on the database.

Table-Table Join (Materialized View Maintenance)

Both sides are streams of database changes. Maintain a derived table that joins them.

Example: Maintain a user's timeline = join "who they follow" changes with "posts" changes.

⚡ Fault Tolerance in Stream Processing

Batch jobs are easy to retry — just re-run from the beginning. Streams are infinite — you can't "re-run from the beginning" indefinitely.

Microbatching and Checkpointing

Apache Spark Streaming: Process data in small batches (every 0.5–5 seconds). Each microbatch is a mini-batch job — exactly-once semantics via batch recomputation on failure.

Apache Flink: Continuously checkpoint state to durable storage (HDFS/S3) at regular intervals. On failure, restart from the last checkpoint + replay events since then.

Idempotent Writes

If processing fails and restarts, you might process the same event twice → duplicate writes. Make writes idempotent — processing the same event twice has the same effect as processing it once.

Techniques:

Include a unique event ID in writes; use upsert/ignore-if-exists semantics
Use a key derived from the event content (content-addressed storage)

Exactly-Once Semantics

The holy grail. Achieved by combining:

Durable, replayable message log (Kafka offset tracking)
Idempotent writes (or two-phase commit to output systems)
Checkpointed processing state (Flink's distributed snapshots)

Apache Flink + Kafka provides end-to-end exactly-once, with caveats.

🔄 Stream Processing vs Batch Processing

Aspect	Batch	Stream
Input	Bounded (finite dataset)	Unbounded (infinite stream)
Latency	Minutes to hours	Milliseconds to seconds
Fault tolerance	Re-run from scratch	Checkpoint + replay
State	No persistent state (functional)	Stateful operators
Output	Written on completion	Continuously updated

The Lambda Architecture (formerly popular) ran batch + stream in parallel and merged results. This is now largely replaced by systems like Flink that handle both with one system.

Summary

Stream processing enables:

Real-time dashboards and monitoring
Fraud detection and alerting
Near-real-time search index updates (CDC → Kafka → Elasticsearch)
Event-driven microservices (choreography via shared event log)
Continuous ETL pipelines

The key concepts are: event logs as the source of truth, time windowing, stateful operators, and fault tolerance via checkpointing and replay. These make stream processing as reliable and deterministic as batch processing — just much faster.

The Big Idea​

📨 Transmitting Event Streams​

Events and Streams​

Direct Messaging vs Message Brokers​

Message Brokers vs Databases​

🚀 Apache Kafka Deep Dive​

⚙️ Stream Processing Patterns​

1. Complex Event Processing (CEP)​

2. Stream Analytics​

3. Maintaining Materialized Views​

⏱️ Time in Stream Processing​

Event Time vs Processing Time​

Windowing​

Handling Late-Arriving Events​

🔗 Joins in Stream Processing​

Stream-Stream Join (Windowed Join)​

Stream-Table Join (Enrichment)​

Table-Table Join (Materialized View Maintenance)​

⚡ Fault Tolerance in Stream Processing​

Microbatching and Checkpointing​

Idempotent Writes​

Exactly-Once Semantics​

🔄 Stream Processing vs Batch Processing​

Summary​