Introduction
In Chapter 5, we discussed replication - keeping copies of the same data on multiple machines for redundancy and performance. But what if your dataset is so large that it doesn't fit on a single machine? Or what if a single machine cannot handle all the read and write requests? This is where partitioning (also called sharding) comes in. Partitioning is the technique of breaking up a large database into smaller pieces, called partitions, and distributing them across multiple machines.
What is partitioning?
Partitioning is the process of dividing a dataset into subsets, where each partition is a small database of its own. Each partition can be treated largely independently, although some queries may need to access several partitions.
Key terminology:
- Partition: A subset of the data (also called shard in MongoDB, region in HBase, vnode in Cassandra, vBucket in Couchbase)
- Partitioning: The process of dividing data (also called sharding)
- Partition key: The value used to determine which partition a piece of data belongs to (also called shard key)
Why partition data?
Partitioning is necessary when data grows beyond what a single machine can handle. The main reasons are:
1. Scalability - handle more data
- A single machine has limited disk capacity (maybe 1-10TB)
- With partitioning, the dataset can grow to petabytes by adding more machines
- Example: Facebook has petabytes of user data, impossible to store on one machine
2. Performance - handle more requests
- A single machine has limited CPU and memory
- Query throughput is limited by a single machine (e.g., 10,000 queries/second)
- With partitioning, queries are distributed across many machines
- With 10 partitions, theoretically 10x throughput
- Example: Twitter handles enormous request volumes by partitioning across thousands of machines
3. Parallel query processing
- Large queries can be parallelized across multiple partitions
- Each partition processes its subset of data independently
- Results are combined at the end
- Example: the analytics query "count users by country" - each partition counts its own users, then the results are aggregated
Partitioning vs. replication
Partitioning and replication are often used together:
- Partitioning: Divide data into subsets
- Replication: Keep multiple copies of each partition for fault tolerance
Partitioning of key-value data
The fundamental question in partitioning: how do you decide which records to store on which nodes? The goal is to spread the data evenly. If the partitioning is unfair, you could end up with most of the data and requests going to one partition (a hot spot), making partitioning useless. Let's explore different partitioning strategies.
Partitioning by key range
One approach is to assign a continuous range of keys to each partition, like the volumes of an encyclopedia - A-B in volume 1, C-D in volume 2, and so on. Advantages:
- Range queries are efficient
- If you want all users from Alice to Charlie, you know they're all in Partition 1
- No need to query all partitions
- Example: "Get all temperature readings between timestamp 2024-01-01 and 2024-01-31"
- Keys are stored in sorted order within a partition
- Easier to scan through records in order
- Good for applications that need sorted data
Disadvantage:
- Certain access patterns create hot spots - e.g., with timestamp keys, all of today's writes land in the same partition
Used by:
- BigTable/HBase: Keys stored in sorted order within each partition
- MongoDB (before 2.4): Could partition by range (now recommends hash-based)
- RethinkDB: Supports range-based sharding
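As a sketch, key-range partitioning boils down to keeping a sorted list of boundary keys and binary-searching it. The boundary values below are made up purely for illustration:

```python
import bisect

# Hypothetical boundary keys: partition 0 holds keys < "f", partition 1
# holds "f" <= key < "m", and so on. The values are illustrative only.
BOUNDARIES = ["f", "m", "t"]  # 3 boundaries -> 4 partitions

def partition_for(key: str) -> int:
    """Binary-search the sorted boundaries to find the owning partition."""
    return bisect.bisect_right(BOUNDARIES, key)

def partitions_for_range(start: str, end: str) -> list[int]:
    """A range query touches only the contiguous run of covering partitions."""
    return list(range(partition_for(start), partition_for(end) + 1))

print(partition_for("alice"))                    # 0
print(partition_for("zoe"))                      # 3
print(partitions_for_range("alice", "charlie"))  # [0] - one partition, no scatter/gather
```

Note how the "Alice to Charlie" query from above resolves to a single partition, which is exactly why range scans are cheap under this scheme.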
Partitioning by hash of key
To avoid hot spots and distribute data more evenly, many systems use a hash function to determine the partition. How it works:
- A good hash function takes skewed data and makes it uniformly distributed
- Same input always produces the same output (deterministic)
- No need for a cryptographically strong hash (MD5, SHA-256 are overkill)
- Common choices: MurmurHash, CityHash, FNV
Advantages:
- Even distribution of data
- The hash function uniformly distributes keys across partitions
- Reduces the risk of hot spots
- Example: hashed user IDs - users spread evenly regardless of naming patterns
- Automatic load balancing
- No manual adjustment of partition boundaries needed
- Works well even if data access patterns change
Disadvantages:
- Range queries are inefficient
- Adjacent keys in the key space end up in different partitions
- Example: "Get all users from Alice to Charlie" requires querying ALL partitions
- Loss of ordering
- Hashing destroys the natural ordering of keys
- Can't efficiently iterate through keys in sorted order
Used by:
- Cassandra: Uses consistent hashing (hash of key determines partition)
- MongoDB: Supports hashed sharding (hash of shard key)
- Riak: Uses consistent hashing
- DynamoDB: Hash partitioning
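A minimal sketch of hash partitioning, with `crc32` standing in for MurmurHash or CityHash and an arbitrary partition count:

```python
import zlib

NUM_PARTITIONS = 8  # arbitrary, for illustration

def hash_partition(key: str) -> int:
    # crc32 stands in for MurmurHash/CityHash: any fast, deterministic,
    # well-mixing hash will do. Python's built-in hash() is randomized
    # per process, so it must NOT be used for partitioning.
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS

# Deterministic: the same key always lands on the same partition.
assert hash_partition("user:42") == hash_partition("user:42")

# Even skewed key names (shared prefixes) spread across partitions.
print({f"user:{i}": hash_partition(f"user:{i}") for i in range(5)})
```

The scatter in the printed mapping is the point: sequential keys end up on different partitions, which evens out load but destroys range locality.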
Consistent hashing
The problem with simple hash partitioning: when you add or remove partitions (nodes), most keys need to move to different partitions. Consistent hashing avoids this. How it works:
- The hash output space forms a ring (e.g., 0 to 2^32-1, wrapping around)
- Each partition is assigned a position on the ring
- Each key is hashed to a position on the ring
- A key belongs to the next partition clockwise on the ring
When a partition is added:
- Only the keys between the new partition and the previous partition need to move
- All other keys stay in the same partition
The term “consistent hashing” is often used loosely. Some databases (like Cassandra) use a variation called “virtual nodes” (vnodes) to improve load distribution.
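The ring described above can be sketched in a few lines of Python. Here md5 serves only as a convenient uniform hash, and each node gets a single ring position (no vnodes) to keep the sketch minimal:

```python
import bisect
import hashlib

RING_SIZE = 2**32

def ring_pos(value: str) -> int:
    # md5 is used purely for uniform mixing, not for security.
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % RING_SIZE

class HashRing:
    def __init__(self, nodes):
        pairs = sorted((ring_pos(n), n) for n in nodes)
        self.positions = [p for p, _ in pairs]
        self.nodes = [n for _, n in pairs]

    def node_for(self, key: str) -> str:
        # The key belongs to the next node clockwise (wrapping around).
        i = bisect.bisect(self.positions, ring_pos(key))
        return self.nodes[i % len(self.nodes)]

keys = [f"key{i}" for i in range(1000)]
before = HashRing(["node-a", "node-b", "node-c"])
after = HashRing(["node-a", "node-b", "node-c", "node-d"])

# Only keys whose ring position now falls to node-d move; all other
# keys keep their assignment - unlike mod-N, which moves most keys.
moved = [k for k in keys if before.node_for(k) != after.node_for(k)]
print(f"{len(moved)}/1000 keys moved after adding node-d")
```

With single-token nodes the arc sizes (and thus load) vary a lot, which is exactly the imbalance that vnodes are meant to smooth out.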
Skewed workloads and hot spots
Even with hash partitioning, you can still get hot spots in extreme cases: all the activity may relate to a single key, such as the user ID of a celebrity with millions of followers. Hashing doesn't help, because identical keys always hash to the same partition. Possible mitigations:
Application-level key splitting:
- Append a small random number to the hot key so its writes spread across many keys (and thus partitions)
- Reads then have to fetch all the sub-keys and merge the results
Caching layer:
- Put a cache (Redis, Memcached) in front of hot partitions
- The cache absorbs read load
- The database partition sees less traffic
Read replicas for the hot partition:
- Create more replicas of the hot partition
- Distribute reads across the replicas
Real-world example: Twitter’s Justin Bieber problem
- When Justin Bieber tweets, millions of followers’ timelines need updates
- Twitter had to build special infrastructure to handle celebrity accounts
- Normal partitioning insufficient for such extreme skew
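The key-splitting mitigation can be sketched as follows. This is an application-level pattern, not a feature of any particular database; the split factor and key format are illustrative:

```python
import random
import zlib

NUM_PARTITIONS = 16
SPLIT_FACTOR = 100  # spread a hot key over 100 sub-keys (illustrative)

def partition_of(key: str) -> int:
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

def write_key(key: str, is_hot: bool) -> str:
    # Writes to a known-hot key pick a random two-digit suffix, so the
    # write load spreads across up to SPLIT_FACTOR keys and partitions.
    if is_hot:
        return f"{key}#{random.randrange(SPLIT_FACTOR):02d}"
    return key

def read_keys(key: str, is_hot: bool) -> list[str]:
    # The price: reads of a hot key must fetch every sub-key and merge.
    if is_hot:
        return [f"{key}#{i:02d}" for i in range(SPLIT_FACTOR)]
    return [key]

touched = {partition_of(k) for k in read_keys("celebrity", is_hot=True)}
print(f"hot key spread over {len(touched)} of {NUM_PARTITIONS} partitions")
```

The trade-off is explicit in the code: writes get cheap and spread out, while reads fan out over many sub-keys, so this only pays off for write-heavy hot keys that the application can identify.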
Partitioning and secondary indexes
So far we've discussed partitioning by primary key. But what if you want to query by something other than the primary key? Example - a car sales database partitioned by car_id: how do you efficiently find all red cars?
This is the problem of secondary indexes. Secondary indexes are the bread and butter of relational databases, but they complicate partitioning.
There are two main approaches:
Document-based partitioning (local indexes)
Each partition maintains its own secondary indexes, covering only the documents in that partition. Advantages:
- Fast writes: Only one partition needs to be updated
- Simple: Each partition is independent
Disadvantages:
- Slow reads: Need to query all partitions (scatter/gather)
- Tail latency problem: The read is as slow as the slowest partition
- If one partition is slow or unavailable, the entire query is affected
Used by:
- MongoDB: Local secondary indexes
- Elasticsearch: Each shard has its own indexes
- Cassandra: Local secondary indexes
- Riak: Search (based on Solr) uses local indexes
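A toy model of document-partitioned (local) indexes, assuming a hypothetical three-partition cluster indexed on a `color` field:

```python
from collections import defaultdict

class Partition:
    """Toy partition holding documents plus a local index on 'color'."""
    def __init__(self):
        self.docs = {}
        self.color_index = defaultdict(set)  # covers this partition only

    def put(self, doc_id, doc):
        self.docs[doc_id] = doc
        self.color_index[doc["color"]].add(doc_id)

    def query_color(self, color):
        return [self.docs[i] for i in sorted(self.color_index[color])]

partitions = [Partition() for _ in range(3)]

def put(doc_id, doc):
    # Writes touch exactly one partition (chosen by primary key).
    partitions[doc_id % 3].put(doc_id, doc)

def find_by_color(color):
    # Reads must scatter to every partition and gather the results -
    # the query is as slow as the slowest partition.
    return [d for p in partitions for d in p.query_color(color)]

put(1, {"make": "Tesla", "color": "red"})
put(2, {"make": "Audi", "color": "blue"})
put(5, {"make": "BMW", "color": "red"})
print(find_by_color("red"))
```

Even though both red cars live on different partitions, the write path for each document touched only one partition; all the cost landed on the read side.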
Term-based partitioning (global indexes)
Instead of each partition having its own local index, we construct a global index that covers all partitions. The global index itself is partitioned, but differently from the primary key. How it works:
- The global index is partitioned by the indexed field (e.g., color)
- color: a-m → Index Partition 0
- color: n-z → Index Partition 1
- All red cars (regardless of car_id) → same index partition
Advantages:
- Fast reads: Only one index partition needs to be queried
- More efficient for read-heavy workloads
- Better performance for queries on secondary attributes
Disadvantages:
- Slow writes: A single write may need to update multiple partitions (data + indexes)
- Complex distributed transactions needed for synchronous updates
- In practice, often updated asynchronously (eventual consistency)
Used by:
- DynamoDB: Global Secondary Indexes (GSIs)
- Riak Search: Can use global indexes
- Oracle: Global indexes in partitioned tables
With asynchronous index updates (as in DynamoDB GSIs):
- Writes are faster (don't wait for index updates)
- But: Reads may not immediately see new data
- Eventually consistent (the index catches up)
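The same toy car database with a term-partitioned (global) index. The a-m / n-z split mirrors the example above; the synchronous double-write here stands in for what real systems often do asynchronously:

```python
from collections import defaultdict

NUM_DATA_PARTITIONS = 4

data_partitions = [dict() for _ in range(NUM_DATA_PARTITIONS)]
index_partitions = [defaultdict(set), defaultdict(set)]  # a-m, n-z

def index_partition_for(color: str) -> int:
    # The global index is partitioned by the indexed term itself.
    return 0 if color[0] <= "m" else 1

def put(car_id: int, doc: dict):
    # Write 1: the document, placed by primary key.
    data_partitions[car_id % NUM_DATA_PARTITIONS][car_id] = doc
    # Write 2: the global index entry - possibly on a different node.
    # Real systems often apply this second write asynchronously.
    index_partitions[index_partition_for(doc["color"])][doc["color"]].add(car_id)

def find_ids_by_color(color: str) -> list[int]:
    # Reads consult exactly one index partition - no scatter/gather.
    return sorted(index_partitions[index_partition_for(color)][color])

put(191, {"make": "Tesla", "color": "red"})
put(306, {"make": "Audi", "color": "red"})
put(515, {"make": "BMW", "color": "silver"})
print(find_ids_by_color("red"))     # [191, 306]
print(find_ids_by_color("silver"))  # [515]
```

Compared with the local-index sketch, the cost has moved from reads to writes: every `put` now touches a data partition and an index partition.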
Comparison: local vs. global indexes
| Aspect | Document-Based (Local) | Term-Based (Global) |
|---|---|---|
| Write speed | Fast (single partition) | Slow (multiple partitions) |
| Read speed | Slow (query all partitions) | Fast (query specific partition) |
| Consistency | Immediate | Often eventual |
| Complexity | Simple | Complex |
| Best for | Write-heavy workloads | Read-heavy workloads |
| Examples | MongoDB, Elasticsearch | DynamoDB GSI |
Rebalancing partitions
Over time, things change in a database cluster:
- Data size increases: The data no longer fits on the existing machines
- Query load increases: Need more machines to handle traffic
- Machines fail: Need to redistribute load to surviving machines
- New machines added: Want to take advantage of additional resources
Requirements for rebalancing
No matter which strategy we use, rebalancing should meet these requirements:
1. After rebalancing, load should be shared fairly between nodes
- Data and query load distributed evenly
- No hot spots created
2. The database should continue accepting reads and writes during rebalancing
- No downtime
- Minimal performance impact
3. No more data than necessary should be moved between nodes
- Moving data is expensive (network bandwidth, disk I/O)
- Minimize disruption
Strategies for rebalancing
Don’t hash mod N (bad approach)
A naive approach is to use hash(key) % N, where N is the number of nodes. The problem: when N changes (a node is added or removed), almost every key maps to a different partition, so nearly all the data has to move. This is terrible for rebalancing!
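A quick simulation (with arbitrary key names) shows why: growing from 10 to 11 nodes reassigns the vast majority of keys.

```python
import zlib

def partition(key: str, n: int) -> int:
    # Naive mod-N placement, purely to demonstrate the problem.
    return zlib.crc32(key.encode()) % n

keys = [f"user:{i}" for i in range(10_000)]

# Count keys whose partition changes when N goes from 10 to 11.
moved = sum(partition(k, 10) != partition(k, 11) for k in keys)
print(f"{moved / len(keys):.0%} of keys must move")  # roughly 10 of every 11
```

Contrast this with consistent hashing or fixed-partition schemes, where adding one node moves only about 1/(N+1) of the data.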
Fixed number of partitions
Create many more partitions than there are nodes, then assign multiple partitions to each node. How it works:
- Create a fixed number of partitions upfront (e.g., 1000 partitions)
- Each node owns several partitions
- When a new node is added: it steals a few partitions from existing nodes
- The partitions themselves don't change; they are just reassigned to different nodes
Advantages:
- Only entire partitions are moved (clear boundaries)
- A partition can keep serving reads/writes while being moved (a replica takes over)
- The key → partition mapping never changes
Used by:
- Riak: Fixed ring size (default 64 partitions), divided among the nodes
- Elasticsearch: Shards created upfront, can't be changed without reindexing
- Couchbase: 1024 vBuckets per bucket
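A sketch of the fixed-partition scheme: the key→partition mapping is permanent, and adding a node reassigns only whole partitions. The round-robin assignment and the steal policy are illustrative, not any particular database's algorithm:

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 1000  # chosen once, never changes afterwards

def partition_of(key: str) -> int:
    # This mapping is permanent: rebalancing never rehashes keys.
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

def initial_assignment(nodes):
    # Round-robin partition -> node assignment (illustrative policy).
    return {p: nodes[p % len(nodes)] for p in range(NUM_PARTITIONS)}

def add_node(assignment, new_node):
    # The new node steals whole partitions from over-loaded nodes
    # until it holds roughly a fair share of the total.
    counts = Counter(assignment.values())
    fair_share = NUM_PARTITIONS // (len(counts) + 1)
    for p in sorted(assignment):
        if counts[new_node] >= fair_share:
            break
        donor = assignment[p]
        if donor != new_node and counts[donor] > fair_share:
            assignment[p] = new_node
            counts[donor] -= 1
            counts[new_node] += 1

assignment = initial_assignment(["n0", "n1", "n2", "n3"])
before = dict(assignment)
add_node(assignment, "n4")
moved = sum(before[p] != assignment[p] for p in before)
print(f"{moved} of {NUM_PARTITIONS} partitions moved to n4")
```

Only a fifth of the partitions change owner, and no key ever changes partition, so the data movement is as small as the fairness requirement allows.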
Dynamic partitioning
Create partitions dynamically based on data size: when a partition grows too large, split it; when it shrinks too small, merge it with a neighbor. Advantages:
- The number of partitions adapts to the data volume
- Works well with both key-range and hash partitioning
Disadvantage:
- An empty database starts with a single partition → all writes go to one node (hot spot)
- Mitigation - pre-splitting: Some databases allow configuring an initial set of partitions
Used by:
- HBase: Automatic partition splitting
- MongoDB: Automatic chunk splitting (64MB chunks by default)
- RethinkDB: Dynamic sharding
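A sketch of size-based splitting for key-range partitions, with a deliberately tiny threshold (real systems split at tens of megabytes or gigabytes, not four keys):

```python
import bisect

MAX_KEYS = 4   # absurdly small split threshold, for illustration only

partitions = [dict()]  # an empty database starts as a single partition
boundaries = []        # sorted split keys; partition i holds keys in
                       # [boundaries[i-1], boundaries[i])

def partition_index(key: str) -> int:
    return bisect.bisect_right(boundaries, key)

def put(key: str, value):
    i = partition_index(key)
    partitions[i][key] = value
    if len(partitions[i]) > MAX_KEYS:
        split(i)

def split(i: int):
    # Split the oversized partition at its median key, so each half
    # gets roughly half the data.
    keys = sorted(partitions[i])
    mid = keys[len(keys) // 2]
    low = {k: v for k, v in partitions[i].items() if k < mid}
    high = {k: v for k, v in partitions[i].items() if k >= mid}
    partitions[i:i + 1] = [low, high]
    boundaries.insert(i, mid)

for n in range(10):
    put(f"key{n:02d}", n)
print(f"{len(partitions)} partitions after 10 writes")
```

The single starting partition also demonstrates the hot-spot problem above: every one of the first five writes hits the same partition until the first split.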
Partitioning proportionally to nodes
Make the number of partitions proportional to the number of nodes - a fixed number of partitions per node. Advantages:
- Automatically adapts to cluster size
- Each new node takes a fair share of the load
Used by:
- Cassandra: Uses vnodes (virtual nodes), default 256 per physical node
- Riak: Also uses vnodes
Comparison of rebalancing strategies
| Strategy | Partition Count | When to Split | Best For | Examples |
|---|---|---|---|---|
| Fixed partitions | Fixed upfront | Never | Known dataset size | Riak, Elasticsearch |
| Dynamic partitions | Changes with data size | Partition too large/small | Unknown growth | HBase, MongoDB |
| Proportional to nodes | Changes with cluster size | Add/remove node | Variable cluster size | Cassandra, Riak vnodes |
Automatic vs. manual rebalancing
Automatic rebalancing:
- The system automatically decides when and how to move partitions
- Convenient, less operational burden
- Risk: Can go wrong (move too much data, overwhelm the network, cause cascading failures)
Manual rebalancing:
- A human administrator decides the partition assignment
- The system executes the move
- More control, prevents surprises
- Requires more operational effort
Request routing
We've discussed how data is partitioned across nodes. Now: when a client wants to make a request, how does it know which node to connect to? This is an instance of a more general problem called service discovery.
Approaches to request routing
There are three main approaches.
Approach 1: Client contacts any node
The client sends its request to any node (e.g., via a round-robin load balancer). If that node owns the partition, it handles the request; otherwise, it forwards the request to the correct node. Advantages:
- Client doesn't need to know the cluster topology
- Simple client logic
- Any node can handle any request (after forwarding)
Disadvantages:
- Extra network hop if the first node doesn't have the data
- All nodes need a routing table
Used by:
- Cassandra: Uses this approach (a gossip protocol shares routing info)
- Riak: Similar approach
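A sketch of approach 1, with every node holding a shared routing table (here just a Python dict, standing in for gossip-distributed state) so it can forward requests it doesn't own:

```python
import zlib

NUM_PARTITIONS = 8

class Node:
    def __init__(self, name: str, owned: set):
        self.name = name
        self.owned = owned      # partitions stored on this node
        self.routing = {}       # partition -> Node, shared by all nodes
        self.store = {}

    def put(self, key: str, value):
        p = zlib.crc32(key.encode()) % NUM_PARTITIONS
        if p in self.owned:
            self.store[key] = value           # served locally
        else:
            self.routing[p].put(key, value)   # one extra hop: forward

    def get(self, key: str):
        p = zlib.crc32(key.encode()) % NUM_PARTITIONS
        if p in self.owned:
            return self.store.get(key)
        return self.routing[p].get(key)

# Two nodes splitting eight partitions; every node knows the full table.
a = Node("a", {0, 1, 2, 3})
b = Node("b", {4, 5, 6, 7})
table = {p: (a if p < 4 else b) for p in range(NUM_PARTITIONS)}
a.routing = b.routing = table

a.put("hello", "world")   # may be forwarded to b transparently
print(b.get("hello"))     # either node can answer: "world"
```

The client never learns which node actually stores the key; the cost is that a request may take one extra hop inside the cluster.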
Approach 2: Routing tier
The client contacts a routing tier (a partition-aware load balancer), which determines the correct node and forwards the request. Advantages:
- Client is completely unaware of partitioning
- Routing logic is centralized (easier to update)
- Nodes don't need to forward requests
Disadvantages:
- The routing tier is an additional component (a single point of failure unless replicated)
- Extra network hop
Used by:
- MongoDB: Uses mongos routers (routing tier)
Approach 3: Client-side routing
The client is aware of the partitioning scheme and directly contacts the correct node. Advantages:
- Most efficient (no extra hops)
- No additional routing infrastructure needed
Disadvantages:
- Client needs to be partition-aware
- Client needs to track cluster changes
- More complex client logic
How does routing learn about partition changes?
When partitions are rebalanced, routing decisions need to change. How do the components learn about these changes?
Approach: Coordination service (e.g., ZooKeeper)
Many distributed systems use a separate coordination service such as ZooKeeper to track cluster metadata. How it works:
- Each node registers itself in ZooKeeper along with its partition assignments
- Routing tier (or client) subscribes to ZooKeeper for updates
- When partitions reassigned, ZooKeeper notifies subscribers
- Routing tier updates its routing table
Used by:
- HBase: Uses ZooKeeper for metadata
- Kafka: Uses ZooKeeper (moving away from it in newer versions)
- SolrCloud: Uses ZooKeeper
Approach: Gossip protocol
Nodes communicate directly with each other to share cluster state (no external coordination service). How gossip works:
- Every node periodically picks a random node to share state with
- Information spreads through cluster exponentially fast
- Eventually all nodes have consistent view of cluster
Used by:
- Cassandra: Uses gossip protocol
- Riak: Also uses gossip
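A toy simulation of epidemic spread (not any real gossip wire protocol): one node learns a fact, each informed node tells one random peer per round, and the whole cluster converges in a number of rounds that grows only logarithmically with cluster size:

```python
import random

def rounds_to_converge(n_nodes: int, seed: int = 1) -> int:
    rng = random.Random(seed)   # seeded for reproducibility
    informed = {0}              # node 0 learns a piece of cluster state
    rounds = 0
    while len(informed) < n_nodes:
        # Each currently informed node gossips with one random peer.
        for _ in range(len(informed)):
            informed.add(rng.randrange(n_nodes))
        rounds += 1
    return rounds

print(rounds_to_converge(1000))  # small: the informed set roughly doubles each round
```

The exponential fan-out is why gossip scales: doubling the cluster adds only about one round to convergence.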
Summary
Partitioning is essential for scaling beyond what a single machine can handle. Key takeaways:
Partitioning strategies:
- Key range: Efficient range queries, but risk of hot spots
- Hash: Even distribution, but inefficient range queries
- Consistent hashing: Minimal rebalancing when nodes are added/removed
Secondary indexes:
- Local (document-based): Fast writes, slow reads (scatter/gather)
- Global (term-based): Fast reads, slow writes (distributed updates)
Rebalancing:
- Fixed partitions: Simple, but the number must be chosen upfront
- Dynamic partitions: Adapts to data size, but complex
- Proportional to nodes: Adapts to cluster size
Request routing:
- Client to any node (with forwarding)
- Routing tier (partition-aware load balancer)
- Client-side routing (partition-aware client)
- Use ZooKeeper or gossip for coordination