Raft Consensus Protocol
Raft is a consensus algorithm designed to be understandable while providing the same guarantees as Paxos. The name “Raft” comes from thinking about logs and what can be built using them, and how to escape the island of Paxos complexity.

YugabyteDB’s Raft implementation is based on Apache Kudu’s implementation, with enhancements including leader leases, group commits, and pre-voting during learner mode.
Why Raft?
- Understandable: Designed for clarity and ease of understanding
- Proven: Strong theoretical foundation and wide adoption
- Efficient: Optimized for common cases with leader-based replication
- Safe: Guarantees consistency even during failures and network partitions
Replication Architecture
Tablet Peers and Raft Groups
Each tablet is replicated across multiple nodes as tablet peers. These peers form a Raft group:
Tablet Leader
Handles all reads and writes, replicates to followers
Tablet Followers
Replicate data, participate in consensus, hot standbys
Raft Group
All peers of a tablet forming a consensus group
Replication Factor
The replication factor (RF) is the number of copies of data maintained by the cluster.
Common Configurations
| RF | Fault Tolerance | Can Survive | Use Case |
|---|---|---|---|
| 1 | 0 | 0 failures | Development only |
| 3 | 1 | 1 node failure | Production (most common) |
| 5 | 2 | 2 node failures | Mission-critical |
| 7 | 3 | 3 node failures | Maximum availability |
To tolerate f failures, you need an RF of 2f + 1.
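The 2f + 1 relationship and the table above can be checked with a small helper (an illustrative sketch, not YugabyteDB code):

```python
# Quorum size and fault tolerance for a given replication factor (RF).
# To tolerate f simultaneous failures, Raft needs RF = 2f + 1 replicas.

def quorum_size(rf: int) -> int:
    """Minimum number of replicas that must acknowledge a write (a strict majority)."""
    return rf // 2 + 1

def fault_tolerance(rf: int) -> int:
    """Number of simultaneous replica failures that can be survived."""
    return (rf - 1) // 2

for rf in (1, 3, 5, 7):
    print(f"RF={rf}: quorum={quorum_size(rf)}, tolerates {fault_tolerance(rf)} failure(s)")
```

Running this reproduces the table: RF=3 needs a quorum of 2 and tolerates 1 failure, RF=5 tolerates 2, and so on.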
Setting Replication Factor
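As a sketch: the replication factor is usually fixed when the cluster is created, for example via a flag on each master process. The flag and placeholder values below are illustrative; verify them against your YugabyteDB version’s documentation.

```sh
# Illustrative only; verify flags against your YugabyteDB version's docs.
# Set the default replication factor when starting each master process:
yb-master --replication_factor=3 --master_addresses=<masters> --fs_data_dirs=<dir>
```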
Raft Roles
Nodes in a Raft group can take on three roles:
Leader
- Elected by majority vote from followers
- Handles all writes by appending to Raft log
- Serves consistent reads (by default)
- Sends heartbeats to maintain leadership
- Coordinates replication to followers
Follower
- Replicates log entries from leader
- Participates in elections by voting for candidates
- Can serve reads with follower reads enabled
- Becomes candidate if no heartbeat received
- Applies committed entries to local storage
Candidate
- Transitional role during leader election
- Requests votes from other peers
- Becomes leader if receives majority votes
- Returns to follower if election fails
Initially, all nodes start as followers. After an election timeout without hearing from a leader, a follower transitions to candidate and starts an election.
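The role transitions described above can be sketched as a small state machine (a teaching model, not YugabyteDB code):

```python
# Sketch of Raft role transitions: followers become candidates on election
# timeout; candidates become leaders on a majority vote, or step back down.
from enum import Enum

class Role(Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"

def next_role(role: Role, event: str) -> Role:
    """Return the role a peer moves to when the given event occurs."""
    transitions = {
        (Role.FOLLOWER, "election_timeout"): Role.CANDIDATE,
        (Role.CANDIDATE, "won_majority"): Role.LEADER,
        (Role.CANDIDATE, "election_timeout"): Role.CANDIDATE,  # retry in a new term
        (Role.CANDIDATE, "saw_current_leader"): Role.FOLLOWER,
        (Role.LEADER, "higher_term_seen"): Role.FOLLOWER,
    }
    return transitions.get((role, event), role)  # other events: no change

role = next_role(Role.FOLLOWER, "election_timeout")  # follower -> candidate
role = next_role(role, "won_majority")               # candidate -> leader
print(role)  # Role.LEADER
```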
Write Replication Process
Here’s how a write operation is replicated:
1. The client sends the write to the tablet leader.
2. The leader appends the operation to its Raft log.
3. The leader replicates the entry to its followers in parallel.
4. Once a majority of peers (leader included) have persisted the entry, it is committed.
5. The leader applies the write and acknowledges the client; followers apply committed entries asynchronously.
Leader Election
When a leader fails or a new Raft group is formed, peers elect a new leader.
Election Process
1. Detect leader failure: a follower doesn’t receive a heartbeat within the election timeout (typically 2-3 seconds)
2. Become candidate: the follower increments its term and votes for itself
3. Request votes: the candidate asks all other peers to vote for it in the new term
4. Win election: on receiving a majority of votes, the candidate becomes leader and starts sending heartbeats
Election Terms
Each election increments the term number:
- Terms are monotonically increasing
- Requests from leaders with stale terms are rejected
- Helps detect outdated leaders after network partitions
If no candidate wins a majority (split vote), a new election starts for the next term after another timeout.
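The majority check and split-vote retry can be sketched as follows (illustrative model, not YugabyteDB code):

```python
# One round of vote counting in a Raft election (teaching sketch).

def handle_vote_result(term: int, votes: int, cluster_size: int):
    """A candidate becomes leader with a strict majority of votes for its term;
    otherwise it retries in the next term after a randomized timeout."""
    majority = cluster_size // 2 + 1
    if votes >= majority:
        return ("leader", term)
    # Split vote: no winner, so a new election starts for term + 1.
    # Randomized election timeouts make repeated ties unlikely.
    return ("candidate", term + 1)

print(handle_vote_result(term=5, votes=2, cluster_size=3))  # ('leader', 5)
print(handle_vote_result(term=5, votes=1, cluster_size=3))  # ('candidate', 6)
```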
Raft Enhancements in YugabyteDB
YugabyteDB includes several optimizations beyond basic Raft:
Leader Leases
Leader leases eliminate the need for read-time heartbeats, reducing read latency. How it works:
- Leader computes lease interval (default: 2 seconds)
- Leader sends lease interval to followers via Raft replication
- Followers start countdown after RPC delay
- Leader can serve reads without heartbeat during lease
- Old leader steps down when lease expires
- New leader waits out old lease before serving
Benefits:
- Significantly lower read latency (no network round-trip)
- No heartbeat required for each read
- Still maintains safety during failures
Trade-off:
- Small unavailability window during failover (~lease duration)
- Window bounded by: `max_clock_drift * lease_interval + rpc_delay`
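The read-path lease check can be sketched like this (an illustrative model on a local monotonic clock; the class and method names are hypothetical, not YugabyteDB internals):

```python
# Sketch of the leader-lease check: the leader serves reads without a
# heartbeat round-trip only while its lease is still valid.
import time

class LeaseHolder:
    LEASE_SECONDS = 2.0  # default lease interval mentioned in the text

    def __init__(self):
        self.lease_expiry = None  # no lease held yet

    def on_replicated_ack(self):
        # A majority acknowledged the lease extension via Raft replication:
        # restart the countdown on the leader's local monotonic clock.
        self.lease_expiry = time.monotonic() + self.LEASE_SECONDS

    def can_serve_read(self) -> bool:
        # Without a valid lease, the leader must fall back to quorum checks.
        return self.lease_expiry is not None and time.monotonic() < self.lease_expiry

leader = LeaseHolder()
leader.on_replicated_ack()
print(leader.can_serve_read())  # True while the lease is unexpired
```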
Group Commits
Batching multiple client operations into a single Raft log entry.
Benefits:
- Higher write throughput (fewer Raft rounds)
- Better network utilization (larger batches)
- Reduced per-operation overhead
How it works:
- Operations ordered by (term, index, op_id)
- Incoming operations grouped while one batch replicates
- Next batch formed from pending operations
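The batching behavior above can be sketched as a simple queue (illustrative, not YugabyteDB’s implementation):

```python
# Sketch of group commit: operations that arrive while one batch is
# replicating are queued, then drained together into the next Raft entry.
from collections import deque

class GroupCommitQueue:
    def __init__(self):
        self.pending = deque()  # operations waiting for the next batch

    def submit(self, op):
        # Called for each incoming write while a batch is in flight.
        self.pending.append(op)

    def next_batch(self):
        """Called when the previous batch finishes replicating: drain
        everything queued in the meantime into one Raft log entry."""
        batch = list(self.pending)
        self.pending.clear()
        return batch

q = GroupCommitQueue()
for op in ("w1", "w2", "w3"):
    q.submit(op)          # arrived while a batch was replicating
print(q.next_batch())     # ['w1', 'w2', 'w3'] -> one Raft log entry
```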
Leader Balancing
YugabyteDB automatically balances tablet leaders across nodes:
- Distributes read/write load evenly
- Throttled to avoid impacting foreground operations
- Considers zone/region placement for geo-distribution
Affinitized Leaders (Preferred Regions)
Pin tablet leaders to specific regions for lower write latency:
- All writes go to preferred region (low latency)
- Synchronous replication to other regions (durability)
- Follower reads in non-preferred regions (read scaling)
Multi-Zone Deployment
A typical 3-zone deployment uses RF=3 with one replica of each tablet per zone, so any single zone can fail without losing quorum.
Zone Outage Handling
- RPO (Recovery Point Objective): Zero data loss due to synchronous replication
- RTO (Recovery Time Objective): ~3 seconds for automatic failover
Follower Reads
By default, only leaders serve reads for linearizability. Follower reads provide:
- Lower latency: Read from geographically closer followers
- Higher throughput: Distribute reads across all replicas
- Timeline consistency: Monotonic, causally-consistent reads
- Bounded staleness: Configurable maximum staleness
Enabling Follower Reads
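As a sketch, follower reads are enabled per session in YSQL. The setting names below reflect YugabyteDB’s YSQL interface but should be verified against your version’s documentation; the `orders` table is a hypothetical example.

```sql
-- Illustrative YSQL session settings (verify against your version's docs).
SET yb_read_from_followers = true;          -- allow reads from followers
SET default_transaction_read_only = true;   -- follower reads apply to read-only transactions
SELECT * FROM orders WHERE id = 1;          -- may now be served by a nearby follower
```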
Read Replicas
Read replicas are async replicas that don’t participate in Raft consensus.
Use cases:
- Serve reads in distant regions with lower latency
- Analytics workloads on separate infrastructure
- Compliance (data locality requirements)
- Disaster recovery (asynchronous replication)
Characteristics:
- Do NOT participate in primary cluster voting
- Asynchronously replicate from primary via Raft observer
- Can have their own RF for local availability
- Always require follower reads to be enabled
Monitoring Replication
Key metrics to monitor:
Replication Performance
Latency Impact
- Single-zone (same DC): 1-3ms replication latency
- Multi-zone (same region): 5-10ms replication latency
- Multi-region: 50-300ms replication latency (depends on distance)
- Majority required: Write latency determined by 2nd fastest replica (RF=3)
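The “majority required” point can be illustrated with a small sketch: a write commits when the majority-th fastest replica (leader included) has persisted it, so one slow replica does not gate latency. The latency numbers below are hypothetical.

```python
# Sketch: commit latency is set by the majority-th fastest replica ack,
# not the slowest replica (illustrative, with made-up latencies).

def commit_latency_ms(replica_latencies_ms, rf=None):
    """Latency at which a majority of RF replicas has persisted the write."""
    rf = rf or len(replica_latencies_ms)
    majority = rf // 2 + 1
    return sorted(replica_latencies_ms)[majority - 1]

# RF=3: one replica is a slow outlier (80 ms), but the write commits
# as soon as the 2nd fastest replica acks.
print(commit_latency_ms([2, 5, 80]))  # 5
```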
Throughput Scaling
- Writes: Limited by leader capacity, scale by adding tablets
- Reads: Scale linearly with followers (if follower reads enabled)
- Network: Replication consumes bandwidth (RF-1 copies per write)
- Group commits: Batch operations for higher throughput
Failure Recovery
- Leader failure: 2-3 second failover (election timeout)
- Follower failure: No impact on availability or latency
- Re-replication: Automatic when node permanently fails
- Tablet splitting: Load automatically balanced to new tablets
Best Practices
Choose Appropriate RF
Use RF=3 for production and RF=5 for mission-critical workloads. Avoid RF=2: a majority of 2 replicas is 2, so RF=2 tolerates no failures and offers no fault-tolerance benefit over RF=1 while doubling storage.
Distribute Across Zones
Place replicas in different zones/regions for true fault tolerance
Use Follower Reads
Enable for read-heavy workloads and analytics to reduce leader load
Monitor Replication Lag
Alert on high lag indicating slow followers or network issues
Set Preferred Regions
Pin leaders near application servers for lower write latency
Plan for Failures
Design for graceful degradation when zones fail
Replication vs xCluster
| Feature | Raft Replication | xCluster Replication |
|---|---|---|
| Consistency | Synchronous (strong) | Asynchronous (eventual) |
| Scope | Within universe | Between universes |
| Use case | HA, fault tolerance | DR, multi-DC active-active |
| Latency impact | Yes (majority quorum) | No (async) |
| Conflict resolution | No conflicts (single source) | Application-defined |
Next Steps
Consistency Model
Understand how replication ensures consistency
Distributed Transactions
Learn how transactions work across replicas
Architecture
Review the overall system architecture
Data Model
See how data is stored in each replica

