Chapter 10: Batch Processing

The Big Idea

So far the book has focused on systems that handle requests as they arrive (OLTP) or read/write in real-time. But some of the most important data processing happens offline, in bulk — batch processing. This chapter explores batch processing systems, culminating in the MapReduce model and the dataflow systems that succeeded it.

Three types of systems by response time:

Type	Response Time	Examples
Services (online)	Milliseconds	Web services, databases
Batch processing (offline)	Minutes to hours	MapReduce jobs, ETL pipelines
Stream processing (near-real-time)	Seconds	Flink, Kafka Streams

🗂️ Batch Processing with Unix Tools

The Unix philosophy provides surprisingly powerful batch processing primitives. Consider counting the 5 most visited URLs from a web server log:

cat /var/log/nginx/access.log |
  awk '{print $7}' |          # extract URL field
  sort |                       # sort alphabetically
  uniq -c |                    # count occurrences
  sort -r -n |                 # sort by count (descending)
  head -n 5                    # top 5

This processes gigabytes of data in seconds on a single machine. The Unix philosophy makes this work:

Each program does one thing well
Programs communicate via text streams (stdin/stdout)
Programs are composable — the output of one is the input of the next
Immutable inputs — each tool reads from its input without modifying it

The | operator connects programs via in-memory buffers, not files. The OS schedules them to run concurrently — this is a form of stream processing!

The Uniform Interface

The reason Unix pipes work: every program reads from stdin and writes to stdout, using the same format (newline-delimited text). This uniform interface is what enables composition.

The equivalent in databases: SQL is the uniform interface that lets you combine tables, views, and queries.

🗺️ MapReduce and Distributed Filesystems

When data doesn't fit on one machine, you need distributed batch processing. MapReduce (Google, 2004) is the foundational model.

HDFS (Hadoop Distributed File System)

Built on commodity hardware. Large files are split into blocks (typically 128 MB), replicated across multiple machines for fault tolerance.

Unlike network-attached storage (NAS), HDFS co-locates computation with storage — bring the computation to the data, not the data to the computation. This avoids network bottlenecks.

MapReduce Job Execution

A MapReduce job has two phases:

Map phase: Run a mapper function on each input record. Output key-value pairs.

Input:  "2016-05-01 GET /api/users 200"
Output: ("/api/users", 1)

Reduce phase: Group all values by key, then run a reducer function on each group.

Input:  [("/api/users", [1, 1, 1, 1, ...]), ("/api/posts", [1, 1, ...])]
Output: [("/api/users", 4821), ("/api/posts", 312)]

The shuffle: Between map and reduce, the framework:

Sorts all mapper outputs by key
Sends all values for the same key to the same reducer
Merges sorted output files from multiple mappers

This is the most expensive part (network transfer + disk I/O).

Distributed Grep (Word Count Example)

# Mapper
def map(line):
    for word in line.split():
        emit(word, 1)

# Reducer
def reduce(word, counts):
    emit(word, sum(counts))

Every MapReduce tutorial starts with word count. But real-world jobs chain many MapReduce steps (job chains).

🔗 MapReduce Workflows

Real pipelines chain multiple MapReduce jobs. The output of job 1 is the input of job 2.

Tools like Pig (scripting language for Pig Latin → MapReduce) and Hive (SQL → MapReduce) made this more ergonomic, but the underlying model was the same.

Joins in MapReduce

Joins are tricky in MapReduce. Two approaches:

Sort-Merge Join:

Mapper outputs (key, {tag: "user", data: ...}) or (key, {tag: "event", data: ...})
Reducer receives all data for one key from both datasets
Reducer joins matching records

Broadcast Hash Join:

One dataset is small enough to fit in memory
Load the small dataset into a hash map on each mapper
For each record in the large dataset, look up in the hash map
No reduce phase needed → much faster

Partitioned Hash Join (Bucketed Map Join):

Both datasets partitioned by the same key
Each mapper only needs to load its partition's small-dataset shard

Handling Skew

Hot keys (e.g., a celebrity user ID that appears in millions of records) cause one reducer to get all the work — the straggler problem.

Solutions:

Sample hot keys first; replicate small dataset for those keys
Split the hot key reducer into multiple reducers with random suffixes

🔄 Beyond MapReduce: Dataflow Engines

MapReduce has significant overhead:

Every step writes to HDFS (fault tolerance via materialization)
Many jobs re-read data written by the previous job
The sort/shuffle in every step is expensive even when unnecessary

Dataflow engines (Spark, Tez, Flink) improve on this:

Model the entire job as one DAG (Directed Acyclic Graph) of operators
Data flows between operators in memory (no HDFS between steps)
Only materialize to disk when the operator graph requires it (e.g., sorting for a join)
Fault tolerance via re-computation from a checkpoint, not by materializing everything

MapReduce workflow:
  read HDFS → Map → write HDFS → read HDFS → Reduce → write HDFS

Spark DAG:
  read HDFS → Map → Reduce → write HDFS
               ↑
           in-memory pipeline, no intermediate writes

Result: Spark jobs are typically 10-100x faster than equivalent MapReduce jobs for iterative algorithms (machine learning) and multi-step pipelines.

📊 High-Level Batch APIs

Modern batch frameworks support SQL-like operations natively:

Apache Spark SQL / DataFrames — SQL on distributed datasets
Apache Flink — unified batch and stream processing
Apache Beam — portable API that runs on Spark, Flink, or Google Dataflow

These frameworks can optimize queries across the entire dataflow graph (predicate pushdown, join reordering, column pruning) — similar to a query optimizer in a relational database.

🏗️ The Output of Batch Jobs

What does a batch job produce?

Reports/aggregations → write to a database, data warehouse
Search indexes → Lucene index files (bulk-loaded into Elasticsearch)
Machine learning models → trained model files
Key-value stores → pre-computed results for fast serving (e.g., recommendations)

Building databases in batch jobs: Instead of writing records one by one, build an SSTable/sorted file offline and bulk-load it into the serving layer. This is orders of magnitude faster than individual inserts and doesn't impact production.

Summary

Batch processing draws a clean line between:

Input data (immutable, read-only)
         ↓
Transformation (pure functions, no side effects)
         ↓
Output data (new datasets)

This functional approach — treating data as immutable and transformations as pure — is what makes batch jobs easy to reason about, retry, and debug. The same data can be reprocessed as requirements change (you keep the raw data forever, regenerate derived views).

The Unix philosophy, MapReduce, and modern dataflow engines all share this same insight: separate the logic from the execution, and make reprocessing cheap.

The Big Idea​

🗂️ Batch Processing with Unix Tools​

The Uniform Interface​

🗺️ MapReduce and Distributed Filesystems​

HDFS (Hadoop Distributed File System)​

MapReduce Job Execution​

Distributed Grep (Word Count Example)​

🔗 MapReduce Workflows​

Joins in MapReduce​

Handling Skew​

🔄 Beyond MapReduce: Dataflow Engines​

📊 High-Level Batch APIs​

🏗️ The Output of Batch Jobs​

Summary​