Chapter 6: Partitioning

The Big Idea

For very large datasets or very high query throughput, a single machine is not enough. Partitioning (also called sharding) breaks the data into partitions — each partition is assigned to a different node. This enables horizontal scaling of both storage and query throughput.

The challenge is doing this in a way that distributes load evenly and supports efficient queries.

🗂️ Partitioning and Replication

Partitioning and replication are often combined. Each partition is replicated across several nodes for fault tolerance. A node may hold multiple partitions.

Node 1: Partition 1 (leader), Partition 3 (follower)
Node 2: Partition 2 (leader), Partition 1 (follower)
Node 3: Partition 3 (leader), Partition 2 (follower)

Partitioning strategy determines which node holds which data. Replication strategy determines how many copies and which nodes hold them.

🔢 Partitioning of Key-Value Data

Goal: spread data and query load evenly (skew means uneven distribution; a heavily loaded partition is a hot spot).

Partitioning by Key Range

Assign a continuous range of keys to each partition (like encyclopedia volumes: A-D, E-H, ...).

Partition 1: keys [a — cart]
Partition 2: keys [cart — menu]
Partition 3: keys [menu — zzz]

✅ Efficient range queries (e.g., all timestamps in January) ❌ Risk of hot spots — e.g., all writes today go to the "today's date" partition

Used by: HBase, Bigtable, MongoDB (range sharding), RethinkDB.

Fix for timestamp hot spot: Prefix the key with something other than the timestamp, e.g., {sensor_id}_{timestamp} instead of {timestamp}_{sensor_id}. Now you can still range-query one sensor's data, and writes are spread across partitions.

Partitioning by Hash of Key

Use a hash function to determine which partition a key goes to.

hash("user_123") % 10 = 7  →  Partition 7

✅ Distributes load evenly (good hash function = uniform distribution) ❌ Range queries no longer work (adjacent keys are scattered across partitions)

Used by: Cassandra, MongoDB (hash sharding), Riak, Voldemort.

Consistent hashing: A technique where adding/removing nodes only requires moving a fraction of the keys (not rehashing everything).

Cassandra's Compound Primary Key

Cassandra uses a hybrid: the first column is hashed (to select partition), the remaining columns form an SSTable-style sort key within the partition. This allows efficient range queries within one user's data:

PARTITION KEY: (user_id)          ← hashed to select node
CLUSTERING KEY: (update_timestamp) ← sorted within partition

Query: "Give me the last 10 posts for user 42" — goes to one partition and does a range scan. Efficient!

🔥 Skewed Workloads and Hot Spots

Even with hash partitioning, extreme skew can occur. Example: a celebrity Twitter account — all of their followers reading and writing generates traffic that hashes to one key.

Mitigation: Add a random prefix (e.g., 2-digit suffix) to the celebrity's key, spreading writes across 100 partitions. Reads must then query all 100 partitions and combine results — a trade-off the application must manage manually.

No database handles this automatically — it requires application-level logic.

🔍 Partitioning and Secondary Indexes

Secondary indexes don't map neatly to partitions. Two main approaches:

Document-Based Partitioning (Local Indexes)

Each partition maintains its own secondary index — only covering documents in that partition.

Partition 0 (cars with id 0-499):
  color:red → [id:12, id:58, id:100]
  make:honda → [id:22, id:199]

Partition 1 (cars with id 500-999):
  color:red → [id:512, id:788]
  make:honda → [id:600]

✅ Simple writes — only update the local index ❌ Scatter/gather reads — a query for "red cars" must go to ALL partitions and merge results; prone to tail latency amplification

Used by: MongoDB, Cassandra, Riak, Elasticsearch, SolrCloud.

Term-Based Partitioning (Global Indexes)

One global index, also partitioned — but by the index term, not by document ID.

Index for 'color':
  Partition 0: color:black → [car:id:523, car:id:904, ...]
  Partition 1: color:red   → [car:id:12, car:id:512, ...]
  Partition 2: color:white → [car:id:77, ...]

✅ Reads are efficient — query goes to just the index partition(s) for the term ❌ Writes are slower — a single document write may update multiple index partitions

Used by: Amazon DynamoDB Global Secondary Indexes, Riak Search.

Global index updates are often asynchronous — meaning reads may see stale index data for a short window.

🔄 Rebalancing Partitions

Partitions need to be moved between nodes when:

Query throughput increases → add more CPUs
Dataset grows → add more disks
A node fails → its partitions must be served by other nodes

Requirements for rebalancing:

Load should be evenly distributed after rebalancing
Database should continue serving reads/writes during rebalancing
Minimize data moved across the network

Bad Approach: `hash mod N`

Using hash(key) % N (where N = number of nodes) means that when N changes, almost all keys need to be moved. Too disruptive.

Fixed Number of Partitions

Create many more partitions than nodes (e.g., 1000 partitions for 10 nodes → 100 partitions/node). When adding a node, steal a few partitions from each existing node. Only partition-to-node assignment changes, not the key-to-partition mapping.

Used by: Riak, Elasticsearch, Couchbase, Voldemort.

Downside: You must choose the total partition count upfront. Too low = large partitions (expensive rebalancing). Too high = overhead.

Dynamic Partitioning

Partitions split when they grow beyond a threshold (e.g., 10 GB), and merge when they shrink below a threshold. Total partition count adapts to data volume.

Used by: HBase, MongoDB, RethinkDB.

Partitioning Proportionally to Nodes

Fixed number of partitions per node (Cassandra, Ketama). Adding a node = splitting some existing partitions. Partition size scales with dataset; partition count scales with node count.

🧭 Request Routing

How does a client know which node to connect to for a given key?

Three approaches:

Client connects to any node, which forwards if needed (gossip protocol approach — Cassandra, Riak)
Routing tier (partition-aware load balancer) — all clients connect to a router that tracks partition-to-node mapping (ZooKeeper-based)
Client tracks partition assignment — client knows directly which node to connect to

ZooKeeper is commonly used to track cluster state. Nodes register themselves; the routing tier subscribes to ZooKeeper and gets notified of changes.

Summary

Aspect	By Key Range	By Hash	Hash + Compound Key
Range queries	✅ Efficient	❌ Inefficient	⚠️ Within-partition only
Load distribution	⚠️ Risk of hot spots	✅ Even	✅ Even per partition
Used by	HBase, MongoDB	Cassandra, Dynamo	Cassandra

Partitioning enables scaling beyond a single machine, but it adds complexity around secondary indexes, query routing, and rebalancing. The right strategy depends on your query patterns.

The Big Idea​

🗂️ Partitioning and Replication​

🔢 Partitioning of Key-Value Data​

Partitioning by Key Range​

Partitioning by Hash of Key​

Cassandra's Compound Primary Key​

🔥 Skewed Workloads and Hot Spots​

🔍 Partitioning and Secondary Indexes​

Document-Based Partitioning (Local Indexes)​

Term-Based Partitioning (Global Indexes)​

🔄 Rebalancing Partitions​

Bad Approach: hash mod N​

Fixed Number of Partitions​

Dynamic Partitioning​

Partitioning Proportionally to Nodes​

🧭 Request Routing​

Summary​