Skip to main content

Chapter 6: Partitioning

The Big Ideaโ€‹

For very large datasets or very high query throughput, a single machine is not enough. Partitioning (also called sharding) breaks the data into partitions โ€” each partition is assigned to a different node. This enables horizontal scaling of both storage and query throughput.

The challenge is doing this in a way that distributes load evenly and supports efficient queries.


๐Ÿ—‚๏ธ Partitioning and Replicationโ€‹

Partitioning and replication are often combined. Each partition is replicated across several nodes for fault tolerance. A node may hold multiple partitions.

Node 1: Partition 1 (leader), Partition 3 (follower)
Node 2: Partition 2 (leader), Partition 1 (follower)
Node 3: Partition 3 (leader), Partition 2 (follower)

Partitioning strategy determines which node holds which data. Replication strategy determines how many copies and which nodes hold them.


๐Ÿ”ข Partitioning of Key-Value Dataโ€‹

Goal: spread data and query load evenly (skew means uneven distribution; a heavily loaded partition is a hot spot).

Partitioning by Key Rangeโ€‹

Assign a continuous range of keys to each partition (like encyclopedia volumes: A-D, E-H, ...).

Partition 1: keys [a โ€” cart]
Partition 2: keys [cart โ€” menu]
Partition 3: keys [menu โ€” zzz]

โœ… Efficient range queries (e.g., all timestamps in January) โŒ Risk of hot spots โ€” e.g., all writes today go to the "today's date" partition

Used by: HBase, Bigtable, MongoDB (range sharding), RethinkDB.

Fix for timestamp hot spot: Prefix the key with something other than the timestamp, e.g., {sensor_id}_{timestamp} instead of {timestamp}_{sensor_id}. Now you can still range-query one sensor's data, and writes are spread across partitions.

Partitioning by Hash of Keyโ€‹

Use a hash function to determine which partition a key goes to.

hash("user_123") % 10 = 7 โ†’ Partition 7

โœ… Distributes load evenly (good hash function = uniform distribution) โŒ Range queries no longer work (adjacent keys are scattered across partitions)

Used by: Cassandra, MongoDB (hash sharding), Riak, Voldemort.

Consistent hashing: A technique where adding/removing nodes only requires moving a fraction of the keys (not rehashing everything).

Cassandra's Compound Primary Keyโ€‹

Cassandra uses a hybrid: the first column is hashed (to select partition), the remaining columns form an SSTable-style sort key within the partition. This allows efficient range queries within one user's data:

PARTITION KEY: (user_id) โ† hashed to select node
CLUSTERING KEY: (update_timestamp) โ† sorted within partition

Query: "Give me the last 10 posts for user 42" โ€” goes to one partition and does a range scan. Efficient!


๐Ÿ”ฅ Skewed Workloads and Hot Spotsโ€‹

Even with hash partitioning, extreme skew can occur. Example: a celebrity Twitter account โ€” all of their followers reading and writing generates traffic that hashes to one key.

Mitigation: Add a random prefix (e.g., 2-digit suffix) to the celebrity's key, spreading writes across 100 partitions. Reads must then query all 100 partitions and combine results โ€” a trade-off the application must manage manually.

No database handles this automatically โ€” it requires application-level logic.


๐Ÿ” Partitioning and Secondary Indexesโ€‹

Secondary indexes don't map neatly to partitions. Two main approaches:

Document-Based Partitioning (Local Indexes)โ€‹

Each partition maintains its own secondary index โ€” only covering documents in that partition.

Partition 0 (cars with id 0-499):
color:red โ†’ [id:12, id:58, id:100]
make:honda โ†’ [id:22, id:199]

Partition 1 (cars with id 500-999):
color:red โ†’ [id:512, id:788]
make:honda โ†’ [id:600]

โœ… Simple writes โ€” only update the local index โŒ Scatter/gather reads โ€” a query for "red cars" must go to ALL partitions and merge results; prone to tail latency amplification

Used by: MongoDB, Cassandra, Riak, Elasticsearch, SolrCloud.

Term-Based Partitioning (Global Indexes)โ€‹

One global index, also partitioned โ€” but by the index term, not by document ID.

Index for 'color':
Partition 0: color:black โ†’ [car:id:523, car:id:904, ...]
Partition 1: color:red โ†’ [car:id:12, car:id:512, ...]
Partition 2: color:white โ†’ [car:id:77, ...]

โœ… Reads are efficient โ€” query goes to just the index partition(s) for the term โŒ Writes are slower โ€” a single document write may update multiple index partitions

Used by: Amazon DynamoDB Global Secondary Indexes, Riak Search.

Global index updates are often asynchronous โ€” meaning reads may see stale index data for a short window.


๐Ÿ”„ Rebalancing Partitionsโ€‹

Partitions need to be moved between nodes when:

  • Query throughput increases โ†’ add more CPUs
  • Dataset grows โ†’ add more disks
  • A node fails โ†’ its partitions must be served by other nodes

Requirements for rebalancing:

  • Load should be evenly distributed after rebalancing
  • Database should continue serving reads/writes during rebalancing
  • Minimize data moved across the network

Bad Approach: hash mod Nโ€‹

Using hash(key) % N (where N = number of nodes) means that when N changes, almost all keys need to be moved. Too disruptive.

Fixed Number of Partitionsโ€‹

Create many more partitions than nodes (e.g., 1000 partitions for 10 nodes โ†’ 100 partitions/node). When adding a node, steal a few partitions from each existing node. Only partition-to-node assignment changes, not the key-to-partition mapping.

Used by: Riak, Elasticsearch, Couchbase, Voldemort.

Downside: You must choose the total partition count upfront. Too low = large partitions (expensive rebalancing). Too high = overhead.

Dynamic Partitioningโ€‹

Partitions split when they grow beyond a threshold (e.g., 10 GB), and merge when they shrink below a threshold. Total partition count adapts to data volume.

Used by: HBase, MongoDB, RethinkDB.

Partitioning Proportionally to Nodesโ€‹

Fixed number of partitions per node (Cassandra, Ketama). Adding a node = splitting some existing partitions. Partition size scales with dataset; partition count scales with node count.


๐Ÿงญ Request Routingโ€‹

How does a client know which node to connect to for a given key?

Three approaches:

  1. Client connects to any node, which forwards if needed (gossip protocol approach โ€” Cassandra, Riak)
  2. Routing tier (partition-aware load balancer) โ€” all clients connect to a router that tracks partition-to-node mapping (ZooKeeper-based)
  3. Client tracks partition assignment โ€” client knows directly which node to connect to

ZooKeeper is commonly used to track cluster state. Nodes register themselves; the routing tier subscribes to ZooKeeper and gets notified of changes.


Summaryโ€‹

AspectBy Key RangeBy HashHash + Compound Key
Range queriesโœ… EfficientโŒ Inefficientโš ๏ธ Within-partition only
Load distributionโš ๏ธ Risk of hot spotsโœ… Evenโœ… Even per partition
Used byHBase, MongoDBCassandra, DynamoCassandra

Partitioning enables scaling beyond a single machine, but it adds complexity around secondary indexes, query routing, and rebalancing. The right strategy depends on your query patterns.