Why Raft?
CockroachDB chose Raft over alternatives like Paxos for several reasons:

Understandability
Raft is designed to be easier to understand than Paxos, with clear separation of leader election, log replication, and safety.
Reference Implementation
Raft includes a reference implementation and covers important practical details often omitted from theoretical descriptions.
Strong Consistency
Guarantees that replicas maintain identical logs and provides ACID semantics for operations.
Production-Proven
Successfully used in distributed systems like etcd, Consul, and CockroachDB itself.
CockroachDB’s Raft implementation was developed in collaboration with CoreOS (now part of Red Hat) and includes optimizations for managing millions of consensus groups.
Raft Basics
Core Components
From docs/design.md:
Raft elects a relatively long-lived leader which must be involved to propose commands. It heartbeats followers periodically and keeps their logs replicated. In the absence of heartbeats, followers become candidates after randomized election timeouts and proceed to hold new leader elections.
Raft Roles
Leader
- Handles all client requests (reads and writes)
- Replicates log entries to followers
- Sends periodic heartbeats
- Only one leader per term

Follower
- Responds to leader requests
- Forwards client requests to leader
- Becomes candidate if heartbeats stop

Candidate
- Temporary state during election
- Requests votes from other replicas
- Either becomes leader or reverts to follower
Range Replicas and Raft Groups
Each range in CockroachDB is backed by its own Raft consensus group consisting of 3 or more replicas (configurable via zone config).
Range Replica Structure
The color coding shows associated range replicas. Two replicas of the same range are never placed on the same store, or even on the same node.
Command Execution Flow
Raft Log
Each replica maintains:

Replicated Raft log (stored in RocksDB):
- Sequence of proposed commands
- Each entry has: term number, index, command
- Guarantees all replicas have identical logs

Durable Raft state:
- Current term
- Voted for which candidate
- Commit index
- Last applied index
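These per-replica structures can be sketched as follows. The field names loosely follow etcd/raft's conventions but are illustrative, not CockroachDB's actual schema; `appendEntry` enforces the invariant (dense indexes, non-decreasing terms) that keeps replica logs identical:

```go
package main

import "fmt"

// LogEntry mirrors the per-entry fields listed above: term, index,
// and the proposed command (opaque bytes in this sketch).
type LogEntry struct {
	Term    uint64
	Index   uint64
	Command []byte
}

// HardState holds the durable per-replica Raft state listed above.
type HardState struct {
	Term     uint64 // current term
	VotedFor uint64 // candidate voted for in this term (0 = none)
	Commit   uint64 // highest log index known committed
	Applied  uint64 // highest log index applied to the state machine
}

// appendEntry rejects entries that would create a gap or a term
// regression, the local check behind "identical logs everywhere".
func appendEntry(log []LogEntry, e LogEntry) ([]LogEntry, error) {
	if n := len(log); n > 0 {
		last := log[n-1]
		if e.Index != last.Index+1 || e.Term < last.Term {
			return log, fmt.Errorf("entry %d/%d does not follow %d/%d",
				e.Term, e.Index, last.Term, last.Index)
		}
	}
	return append(log, e), nil
}

func main() {
	log, err := appendEntry(nil, LogEntry{Term: 1, Index: 1, Command: []byte("put")})
	fmt.Println(len(log), err)
}
```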
See pkg/storage/ for the key storage structures.
CockroachDB Raft Optimizations
Coalesced Heartbeats
Problem
With standard Raft, each range sends independent heartbeats. With millions of ranges, this creates enormous overhead.
Solution
CockroachDB coalesces heartbeats so the number of heartbeat messages is proportional to the number of nodes, not the number of ranges.
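The idea can be sketched in a few lines: group the per-range heartbeats by destination node, then send one message per node. This is a toy illustration of the grouping (hypothetical types, not CockroachDB's wire format):

```go
package main

import "fmt"

// Heartbeat is what plain Raft would send: one message per range,
// addressed to a replica on some node.
type Heartbeat struct {
	RangeID  int64
	ToNodeID int32
}

// coalesce groups per-range heartbeats by destination node, so the
// number of outgoing messages scales with nodes, not ranges.
func coalesce(hbs []Heartbeat) map[int32][]int64 {
	byNode := make(map[int32][]int64)
	for _, hb := range hbs {
		byNode[hb.ToNodeID] = append(byNode[hb.ToNodeID], hb.RangeID)
	}
	return byNode
}

func main() {
	// Three range heartbeats, but only two destination nodes,
	// so only two messages go out.
	msgs := coalesce([]Heartbeat{{1, 2}, {2, 2}, {3, 3}})
	fmt.Println(len(msgs))
}
```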
Batch Processing
Raft operations are batched:
- Multiple proposals combined into a single RPC
- Append entries sent in batches
- Reduces RPC overhead dramatically
Optimized Election Timeouts
From the design document:

Cockroach weights random timeouts such that the replicas with shorter round trip times to peers are more likely to hold elections first.

This ensures leaders are elected from well-connected replicas, improving performance.
Future Optimizations
Range Leases
While Raft ensures consistency, CockroachDB adds range leases for efficiency. A range lease is a time-bound exclusive right held by one replica, allowing it to serve reads and coordinate writes without going through Raft for every operation.
Why Leases?
Without leases:
- Every read requires Raft consensus (expensive)
- No single source of truth for current state
- Higher latency for common operations

With leases:
- Leaseholder serves reads locally (fast)
- Single replica coordinates range operations
- Other replicas redirect to leaseholder
Lease Acquisition
From docs/design.md:
A replica establishes itself as owning the lease on a range by committing a special lease acquisition log entry through Raft. The log entry contains the replica node’s epoch from the node liveness table.
Lease Properties
Time-based validity:
- Based on node liveness table
- Contains node epoch + expiration time
- Nodes heartbeat liveness table continuously

Safe acquisition:
- Lease acquisition includes copy of current lease
- Only granted if current lease invalid or expired
- Relies on node epoch incrementing on failure

Transferability:
- Leases can be transferred between replicas
- Useful for load balancing
- Common during rebalancing
Lease vs. Raft Leadership
Why separate?
- Raft leadership: consensus algorithm requirement
- Range lease: CockroachDB-specific optimization
- Different lifetime and transfer semantics

When the leaseholder is also the Raft leader:
- Leaseholder doesn't forward proposals to the leader
- Reduces RPC round-trips
- Better performance

In practice the mismatch between leaseholder and Raft leader is rare and self-corrects quickly: each lease renewal or transfer also attempts to colocate them.
Read and Write Paths
Write Path
Consistent Read Path
Leaseholder Read (Fast Path)
The leaseholder can serve reads locally:
1. Verify the lease is still valid
2. Check that no pending writes overlap the read
3. Read from local RocksDB
4. Return the result (no Raft required)
Follower Read (Slow Path)
A non-leaseholder replica:
1. Detects that it doesn't hold the lease
2. Returns NotLeaseholderError with a hint
3. The client retries at the actual leaseholder
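From the client's side, the slow path is a short retry loop that follows leaseholder hints. The error type and `read` helper below are hypothetical shapes for illustration, not CockroachDB's client API:

```go
package main

import (
	"errors"
	"fmt"
)

// NotLeaseholderError carries a hint naming the likely leaseholder.
type NotLeaseholderError struct {
	Leaseholder int32
}

func (e *NotLeaseholderError) Error() string {
	return fmt.Sprintf("not leaseholder; try replica %d", e.Leaseholder)
}

// read tries a replica and follows leaseholder hints until the read
// succeeds, with a bound on retries. Each replica is modeled as a
// function that either serves the read or returns a hint.
func read(replicas map[int32]func() (string, error), start int32) (string, error) {
	target := start
	for i := 0; i < len(replicas)+1; i++ {
		v, err := replicas[target]()
		var nle *NotLeaseholderError
		if errors.As(err, &nle) {
			target = nle.Leaseholder // retry at the hinted replica
			continue
		}
		return v, err
	}
	return "", fmt.Errorf("no leaseholder found")
}

func main() {
	replicas := map[int32]func() (string, error){
		1: func() (string, error) { return "", &NotLeaseholderError{Leaseholder: 3} },
		3: func() (string, error) { return "value", nil },
	}
	v, err := read(replicas, 1) // starts at a follower, ends at the leaseholder
	fmt.Println(v, err)
}
```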
Inconsistent Read Path
Fault Tolerance
Failure Scenarios
Follower Failure (F < N/2):
- Leader continues operating
- Quorum still achievable
- Failed follower catches up when recovered

Leader Failure:
- Followers detect missing heartbeats
- Election triggered after timeout
- New leader elected from followers
- Operations resume with new leader

Majority Failure (F >= N/2):
- Range becomes unavailable
- Cannot achieve quorum
- Requires manual intervention or recovery
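The quorum arithmetic behind these scenarios is worth making concrete: a range with N replicas needs a majority of N to make progress, so it tolerates (N-1)/2 failures.

```go
package main

import "fmt"

// quorum returns the majority size for n replicas.
func quorum(n int) int { return n/2 + 1 }

// tolerated returns how many replica failures a range of n
// replicas survives while still reaching quorum.
func tolerated(n int) int { return n - quorum(n) }

func main() {
	fmt.Println(quorum(3), tolerated(3)) // 2 1
	fmt.Println(quorum(5), tolerated(5)) // 3 2
}
```

This is why the default of 3 replicas survives one failure, and why 5 replicas (not 4) is the next useful step: 4 replicas need a quorum of 3 and still tolerate only one failure.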
Self-Repair
From docs/design.md:
If a store has not been heard from (gossiped their descriptors) in some time, the default setting being 5 minutes, the cluster will consider this store to be dead.
Splitting and Merging
Range Splits
When a range exceeds the size threshold (~64MB):
- Range [A-C) becomes [A-B) and [B-C)
- Each new range has same replica set initially
- Ranges can then rebalance independently
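A split is just key-space arithmetic: a range covering [Start, End) is cut at a key k into [Start, k) and [k, End). A minimal sketch (types and names are illustrative):

```go
package main

import "fmt"

// Range covers the half-open key span [Start, End).
type Range struct{ Start, End string }

// split cuts r at key k into [Start, k) and [k, End); both halves
// initially keep the same replica set and can rebalance later.
func split(r Range, k string) (Range, Range, error) {
	if k <= r.Start || k >= r.End {
		return r, Range{}, fmt.Errorf("split key %q outside (%q, %q)", k, r.Start, r.End)
	}
	return Range{r.Start, k}, Range{k, r.End}, nil
}

func main() {
	left, right, _ := split(Range{"a", "c"}, "b")
	fmt.Println(left, right) // {a b} {b c}
}
```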
Range Merges
When a range falls below the minimum size threshold, it can be merged with an adjacent range.

Rebalancing
CockroachDB automatically rebalances replicas based on:

Replica Count
Number of replicas per store
Data Size
Total bytes of data per store
Free Space
Available disk space per store
Load
CPU, network, and I/O load per store
The rebalancing process:
1. Add a new replica to the target node
2. Bring the new replica up-to-date via Raft
3. Update the range metadata
4. Remove the old replica
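As a toy version of the target-selection step, the criteria above can be reduced to a scoring function over candidate stores. The real allocator weighs all four signals; this sketch (hypothetical `Store` and `pickTarget`) uses only replica count with free space as a tiebreaker:

```go
package main

import "fmt"

// Store carries a simplified subset of the per-store signals
// listed above.
type Store struct {
	ID        int32
	Replicas  int
	FreeBytes int64
}

// pickTarget returns the store with the fewest replicas, breaking
// ties by free disk space.
func pickTarget(stores []Store) Store {
	best := stores[0]
	for _, s := range stores[1:] {
		if s.Replicas < best.Replicas ||
			(s.Replicas == best.Replicas && s.FreeBytes > best.FreeBytes) {
			best = s
		}
	}
	return best
}

func main() {
	stores := []Store{{1, 10, 100}, {2, 4, 50}, {3, 4, 500}}
	fmt.Println(pickTarget(stores).ID) // 3: fewest replicas, most free space
}
```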
Implementation Details
Key source files:

Raft implementation:
- pkg/raft/: Core Raft algorithm (from etcd/raft)
- pkg/kv/kvserver/replica_raft.go: Integration with CockroachDB

Leases:
- pkg/kv/kvserver/replica_lease.go
- pkg/kv/kvserver/node_liveness.go

Replica and proposals:
- pkg/kv/kvserver/replica.go
- pkg/kv/kvserver/replica_proposal.go
Further Reading
Transaction Layer
How transactions work with Raft
Storage Layer
RocksDB and Raft log storage