Chapter 6: Partitioning
The Big Ideaโ
For very large datasets or very high query throughput, a single machine is not enough. Partitioning (also called sharding) breaks the data into partitions โ each partition is assigned to a different node. This enables horizontal scaling of both storage and query throughput.
The challenge is doing this in a way that distributes load evenly and supports efficient queries.
๐๏ธ Partitioning and Replicationโ
Partitioning and replication are often combined. Each partition is replicated across several nodes for fault tolerance. A node may hold multiple partitions.
Node 1: Partition 1 (leader), Partition 3 (follower)
Node 2: Partition 2 (leader), Partition 1 (follower)
Node 3: Partition 3 (leader), Partition 2 (follower)
Partitioning strategy determines which node holds which data. Replication strategy determines how many copies and which nodes hold them.
๐ข Partitioning of Key-Value Dataโ
Goal: spread data and query load evenly (skew means uneven distribution; a heavily loaded partition is a hot spot).
Partitioning by Key Rangeโ
Assign a continuous range of keys to each partition (like encyclopedia volumes: A-D, E-H, ...).
Partition 1: keys [a โ cart]
Partition 2: keys [cart โ menu]
Partition 3: keys [menu โ zzz]
โ Efficient range queries (e.g., all timestamps in January) โ Risk of hot spots โ e.g., all writes today go to the "today's date" partition
Used by: HBase, Bigtable, MongoDB (range sharding), RethinkDB.
Fix for timestamp hot spot: Prefix the key with something other than the timestamp, e.g., {sensor_id}_{timestamp} instead of {timestamp}_{sensor_id}. Now you can still range-query one sensor's data, and writes are spread across partitions.
Partitioning by Hash of Keyโ
Use a hash function to determine which partition a key goes to.
hash("user_123") % 10 = 7 โ Partition 7
โ Distributes load evenly (good hash function = uniform distribution) โ Range queries no longer work (adjacent keys are scattered across partitions)
Used by: Cassandra, MongoDB (hash sharding), Riak, Voldemort.
Consistent hashing: A technique where adding/removing nodes only requires moving a fraction of the keys (not rehashing everything).
Cassandra's Compound Primary Keyโ
Cassandra uses a hybrid: the first column is hashed (to select partition), the remaining columns form an SSTable-style sort key within the partition. This allows efficient range queries within one user's data:
PARTITION KEY: (user_id) โ hashed to select node
CLUSTERING KEY: (update_timestamp) โ sorted within partition
Query: "Give me the last 10 posts for user 42" โ goes to one partition and does a range scan. Efficient!
๐ฅ Skewed Workloads and Hot Spotsโ
Even with hash partitioning, extreme skew can occur. Example: a celebrity Twitter account โ all of their followers reading and writing generates traffic that hashes to one key.
Mitigation: Add a random prefix (e.g., 2-digit suffix) to the celebrity's key, spreading writes across 100 partitions. Reads must then query all 100 partitions and combine results โ a trade-off the application must manage manually.
No database handles this automatically โ it requires application-level logic.
๐ Partitioning and Secondary Indexesโ
Secondary indexes don't map neatly to partitions. Two main approaches:
Document-Based Partitioning (Local Indexes)โ
Each partition maintains its own secondary index โ only covering documents in that partition.
Partition 0 (cars with id 0-499):
color:red โ [id:12, id:58, id:100]
make:honda โ [id:22, id:199]
Partition 1 (cars with id 500-999):
color:red โ [id:512, id:788]
make:honda โ [id:600]
โ Simple writes โ only update the local index โ Scatter/gather reads โ a query for "red cars" must go to ALL partitions and merge results; prone to tail latency amplification
Used by: MongoDB, Cassandra, Riak, Elasticsearch, SolrCloud.
Term-Based Partitioning (Global Indexes)โ
One global index, also partitioned โ but by the index term, not by document ID.
Index for 'color':
Partition 0: color:black โ [car:id:523, car:id:904, ...]
Partition 1: color:red โ [car:id:12, car:id:512, ...]
Partition 2: color:white โ [car:id:77, ...]
โ Reads are efficient โ query goes to just the index partition(s) for the term โ Writes are slower โ a single document write may update multiple index partitions
Used by: Amazon DynamoDB Global Secondary Indexes, Riak Search.
Global index updates are often asynchronous โ meaning reads may see stale index data for a short window.
๐ Rebalancing Partitionsโ
Partitions need to be moved between nodes when:
- Query throughput increases โ add more CPUs
- Dataset grows โ add more disks
- A node fails โ its partitions must be served by other nodes
Requirements for rebalancing:
- Load should be evenly distributed after rebalancing
- Database should continue serving reads/writes during rebalancing
- Minimize data moved across the network
Bad Approach: hash mod Nโ
Using hash(key) % N (where N = number of nodes) means that when N changes, almost all keys need to be moved. Too disruptive.
Fixed Number of Partitionsโ
Create many more partitions than nodes (e.g., 1000 partitions for 10 nodes โ 100 partitions/node). When adding a node, steal a few partitions from each existing node. Only partition-to-node assignment changes, not the key-to-partition mapping.
Used by: Riak, Elasticsearch, Couchbase, Voldemort.
Downside: You must choose the total partition count upfront. Too low = large partitions (expensive rebalancing). Too high = overhead.
Dynamic Partitioningโ
Partitions split when they grow beyond a threshold (e.g., 10 GB), and merge when they shrink below a threshold. Total partition count adapts to data volume.
Used by: HBase, MongoDB, RethinkDB.
Partitioning Proportionally to Nodesโ
Fixed number of partitions per node (Cassandra, Ketama). Adding a node = splitting some existing partitions. Partition size scales with dataset; partition count scales with node count.
๐งญ Request Routingโ
How does a client know which node to connect to for a given key?
Three approaches:
- Client connects to any node, which forwards if needed (gossip protocol approach โ Cassandra, Riak)
- Routing tier (partition-aware load balancer) โ all clients connect to a router that tracks partition-to-node mapping (ZooKeeper-based)
- Client tracks partition assignment โ client knows directly which node to connect to
ZooKeeper is commonly used to track cluster state. Nodes register themselves; the routing tier subscribes to ZooKeeper and gets notified of changes.
Summaryโ
| Aspect | By Key Range | By Hash | Hash + Compound Key |
|---|---|---|---|
| Range queries | โ Efficient | โ Inefficient | โ ๏ธ Within-partition only |
| Load distribution | โ ๏ธ Risk of hot spots | โ Even | โ Even per partition |
| Used by | HBase, MongoDB | Cassandra, Dynamo | Cassandra |
Partitioning enables scaling beyond a single machine, but it adds complexity around secondary indexes, query routing, and rebalancing. The right strategy depends on your query patterns.