Replication Fundamentals
CockroachDB divides your data into ranges—contiguous chunks of the key-value space, capped at 512 MiB by default. Each range is replicated across multiple nodes to ensure survivability.

From a SQL perspective, a table starts as a single range. As the table grows beyond the maximum range size, it automatically splits into multiple ranges, each of which is then independently replicated and distributed.
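You can observe range boundaries and tune the split threshold directly. The sketch below assumes a table named `users` (illustrative); `range_max_bytes` is the zone-configuration variable behind the 512 MiB default:

```sql
-- List the ranges backing a table (output columns vary by version):
SHOW RANGES FROM TABLE users;

-- Tune the split threshold for this table; 536870912 bytes = 512 MiB:
ALTER TABLE users CONFIGURE ZONE USING range_max_bytes = 536870912;
```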
Default Replication
By default, CockroachDB uses 3x replication:

- Each range is copied to 3 different nodes
- Changes must be acknowledged by a majority (2 out of 3)
- The cluster can tolerate 1 node failure
- This provides the minimum configuration for high availability
| Replication Factor | Failures Tolerated | Minimum Nodes |
|---|---|---|
| 3 | 1 | 3 |
| 5 | 2 | 5 |
| 7 | 3 | 7 |
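To move from one row of this table to another cluster-wide, you can change the default zone configuration. A minimal sketch, assuming you have at least 5 nodes available:

```sql
-- Raise the default replication factor for all data to 5x,
-- allowing the cluster to tolerate 2 simultaneous node failures:
ALTER RANGE default CONFIGURE ZONE USING num_replicas = 5;
```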
Raft Consensus Protocol
CockroachDB uses the Raft consensus algorithm to ensure all replicas agree on the current state of the data.

Raft Groups and Roles
Each range forms a Raft group whose replicas have specific roles:

Leader
One replica is elected as the Raft leader. The leader:
- Proposes all writes to the group
- Coordinates replication to followers
- Maintains the Raft log
- Is always co-located with the leaseholder (via Leader leases)
Followers
The remaining voting replicas act as followers. They:
- Receive and acknowledge proposed writes
- Maintain copies of the Raft log
- Can become leader if the current leader fails
- Participate in leader elections
How Consensus Works
When you write data:

- The client sends the write to any node (the gateway)
- The gateway routes it to the range’s leaseholder
- The leaseholder (also the Raft leader) proposes the write
- Followers acknowledge the proposed write
- Once a majority acknowledges, the write is committed
- The committed write is applied to the storage engine
A majority (quorum) is required for writes to commit. With 3x replication, this means 2 out of 3 replicas must acknowledge. This ensures the write survives even if one node fails immediately after.
Leases and Read Performance
Leaseholder Replicas
Each range has one leaseholder—a special replica that:

- Serves all reads for the range without going through Raft consensus
- Coordinates all writes as the Raft leader
- Can serve reads with minimal latency
- Maintains the timestamp cache to ensure serializability
Leaseholders can serve strongly-consistent reads without Raft because they must have participated in the consensus that committed the data in the first place. This eliminates unnecessary network round trips.
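You can inspect which node holds the lease for the range containing a given row. A sketch, assuming a table `users` with an integer primary key (both illustrative):

```sql
-- Show the range containing the row with primary key 42,
-- including its leaseholder and replica locations:
SHOW RANGE FROM TABLE users FOR ROW (42);
```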
Leader Leases
CockroachDB uses Leader leases to ensure the Raft leader and leaseholder are always the same replica:

- Leases are based on store-level liveness detection
- A replica must be fortified by a quorum of stores to become leader
- The fortified leader automatically gets the lease
- This eliminates the single point of failure from the previous node liveness system
- Network partitions heal in less than 20 seconds
- Liveness failures are resolved in less than 1 second
- No dependency on a single node liveness range
- Equivalent performance to the previous epoch-based lease system
Follower Reads
Non-leaseholder replicas can serve follower reads (also called stale reads):

- Use the `AS OF SYSTEM TIME` clause to read historical data
- Reads are served at a timestamp below the range’s closed timestamp
- No need to contact the leaseholder, reducing latency
- Perfect for geo-distributed reads in multi-region deployments
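The bullets above translate into queries like the following. The table name `users` is illustrative; `follower_read_timestamp()` picks the most recent timestamp guaranteed to be servable by followers:

```sql
-- Read slightly stale data from the nearest replica,
-- without a round trip to the leaseholder:
SELECT * FROM users AS OF SYSTEM TIME follower_read_timestamp();

-- Or pin the read to an explicit historical timestamp:
SELECT * FROM users AS OF SYSTEM TIME '-10s';
```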
Rebalancing and Range Distribution
CockroachDB automatically rebalances data across your cluster to optimize performance and survivability.

When Rebalancing Occurs
New Nodes Join
When you add nodes to your cluster:
- New nodes advertise their available capacity
- The cluster identifies ranges to rebalance
- Snapshots of ranges are sent to new nodes
- New replicas join their Raft groups
- Old replicas are removed once new ones are up-to-date
Nodes Go Offline
When nodes become unavailable:
- After a timeout, the cluster considers the node dead
- Ranges with replicas on the dead node are under-replicated
- New replicas are created on available nodes
- Once replication is restored, the cluster is healthy again
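The timeout mentioned in the first step is a cluster setting. A sketch of inspecting and tuning it (the value shown is illustrative):

```sql
-- How long before an unresponsive store is considered dead (default 5m):
SHOW CLUSTER SETTING server.time_until_store_dead;

-- Lower it so re-replication of under-replicated ranges begins sooner:
SET CLUSTER SETTING server.time_until_store_dead = '2m30s';
```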
Load-Based Rebalancing
The cluster continuously monitors:
- Replica count per node
- CPU usage per node (configurable)
- Query patterns and hotspots
Snapshots and Replica Creation
To create a new replica:

- The leaseholder creates a snapshot of the range’s data
- The snapshot is sent to the target node (possibly via delegated snapshots)
- The receiving node loads the snapshot into its storage engine
- The new replica replays the Raft log from the snapshot timestamp to present
- The replica joins the Raft group and begins participating
Delegated snapshots allow a follower close to the recipient to send the snapshot instead of the leader. This reduces cross-region bandwidth costs and speeds up rebalancing in multi-region deployments.
Replication Zones
You control replication behavior through replication zones—configurations that specify:

- Number of replicas (replication factor)
- Number of voting vs. non-voting replicas
- Placement constraints (e.g., “must be in EU”)
- Leaseholder preferences (e.g., “prefer region=us-east”)
- Range size limits
- Garbage collection TTL
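Several of the settings listed above can be combined in a single zone configuration. A sketch, assuming a table named `payments` (illustrative):

```sql
ALTER TABLE payments CONFIGURE ZONE USING
    num_replicas = 5,              -- replication factor
    gc.ttlseconds = 600,           -- garbage collection TTL
    range_max_bytes = 268435456;   -- 256 MiB range size limit

-- Inspect the resulting configuration:
SHOW ZONE CONFIGURATION FROM TABLE payments;
```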
Zone Configuration Levels
Replication zones can be configured at multiple levels:

| Level | Scope | Example Use Case |
|---|---|---|
| Cluster | All data by default | Set 5x replication globally |
| Database | All tables in a database | Keep eu_db data in Europe |
| Table | Specific table | High replication for critical payments table |
| Index | Specific index | Co-locate index with related data |
| Partition | Specific rows | Pin user data to their home region |
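The database- and table-level rows of this table might look as follows in practice. The database, table, and region names are illustrative:

```sql
-- Database level: keep all of eu_db's data in Europe
ALTER DATABASE eu_db CONFIGURE ZONE USING
    constraints = '[+region=europe-west1]';

-- Table level: higher replication for a critical table
ALTER TABLE payments CONFIGURE ZONE USING num_replicas = 5;
```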
Example: Pin Data to a Region
- All replicas are in `us-east`
- The leaseholder is preferably in `us-east`
- Reads and writes are low-latency for `us-east` users
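A zone configuration along these lines achieves that outcome. The table name `users` is illustrative; the constraints keep replicas in the region, and the lease preference steers the leaseholder there:

```sql
ALTER TABLE users CONFIGURE ZONE USING
    num_replicas = 3,
    constraints = '[+region=us-east]',           -- all replicas in us-east
    lease_preferences = '[[+region=us-east]]';   -- leaseholder preferably there
```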
Leaseholder Rebalancing (Follow-the-Workload)
CockroachDB can automatically move leaseholders closer to the primary source of traffic.

How It Works
- Each leaseholder tracks which localities are requesting the range
- Every 10 minutes (by default), it evaluates whether to transfer the lease
- It considers:
- Request counts from each locality
- Latency between localities
- Current lease distribution
- If transferring improves latency, the lease moves
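This behavior is governed by a cluster setting; the sketch below assumes the `kv.allocator.load_based_lease_rebalancing.enabled` setting, which is on by default:

```sql
-- Check whether follow-the-workload lease movement is enabled:
SHOW CLUSTER SETTING kv.allocator.load_based_lease_rebalancing.enabled;
```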
Intra-Locality vs. Inter-Locality
Same Locality
When all replicas are in the same locality:
- Leases are balanced evenly across nodes
- Small deviations are tolerated to prevent thrashing
- Based purely on lease count per node
Different Localities
When replicas span localities:
- CockroachDB calculates which locality is requesting most
- It correlates request localities with replica localities
- Leases move to replicas in localities with the most requests
- This optimizes for the primary traffic source
Handling Failures
Automatic Failure Detection
CockroachDB detects failures through:

- Store liveness: Continuous heartbeating between stores
- Raft heartbeats: Leaders heartbeat followers periodically
- Circuit breakers: Per-replica timeouts (60 seconds by default)
Recovery Process
When a node fails:

Cluster Operations Resume
Requests continue with the remaining replicas. The cluster is temporarily at reduced replication (e.g., 2 of 3 replicas).
The entire recovery process—from failure detection to serving requests again—typically completes in seconds, not minutes.
See Also
- Distributed Transactions - How transactions work across replicas
- Multi-Region Deployments - Optimizing replication for geo-distributed data
- Resilience and Recovery - Backup strategies and disaster recovery