YugabyteDB uses the Raft consensus protocol to automatically replicate data across multiple nodes, ensuring high availability, fault tolerance, and data consistency without requiring operator intervention.

Raft Consensus Protocol

Raft is a consensus algorithm designed to be understandable while providing the same guarantees as Paxos. The name “Raft” comes from thinking about logs and what can be built using them - and how to escape the island of Paxos complexity.
YugabyteDB’s Raft implementation is based on Apache Kudu’s implementation, with enhancements including leader leases, group commits, and pre-voting during learner mode.

Why Raft?

  • Understandable: Designed for clarity and ease of understanding
  • Proven: Strong theoretical foundation and wide adoption
  • Efficient: Optimized for common cases with leader-based replication
  • Safe: Guarantees consistency even during failures and network partitions

Replication Architecture

Tablet Peers and Raft Groups

Each tablet is replicated across multiple nodes as tablet peers. These peers form a Raft group:
Tablet 1 (RF=3)
├─ Peer A (Leader)   - Node 1, Zone A
├─ Peer B (Follower) - Node 2, Zone B  
└─ Peer C (Follower) - Node 3, Zone C

  • Tablet leader: handles all reads and writes, and replicates them to followers
  • Tablet followers: replicate data, participate in consensus, and act as hot standbys
  • Raft group: all peers of a tablet, together forming a consensus group

Replication Factor

The replication factor (RF) is the number of copies of data maintained by the cluster.

Common Configurations

RF   Fault tolerance (f)   Can survive       Use case
1    0                     0 failures        Development only
3    1                     1 node failure    Production (most common)
5    2                     2 node failures   Mission-critical
7    3                     3 node failures   Maximum availability
Formula: To tolerate f failures, you need RF of 2f + 1
RF=3 is recommended for production. It provides fault tolerance for 1 failure while maintaining a quorum with the remaining 2 nodes.
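
The quorum arithmetic above can be sketched in a few lines of Python (a toy calculation, not YugabyteDB code):

```python
def fault_tolerance(rf: int) -> int:
    """Failures a Raft group of size rf can survive: f = (rf - 1) // 2."""
    return (rf - 1) // 2

def min_rf(f: int) -> int:
    """Replicas needed to tolerate f failures: RF = 2f + 1."""
    return 2 * f + 1

def quorum(rf: int) -> int:
    """Smallest majority of rf peers; writes commit once this many have the entry."""
    return rf // 2 + 1
```

For example, quorum(3) is 2, which is why an RF=3 cluster keeps accepting writes with one node down.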

Setting Replication Factor

# At cluster creation
yb-ctl create --rf 3

# Modify existing universe (via YBA or yb-admin)
yb-admin modify_placement_info \
    cloud.region.zone1,cloud.region.zone2,cloud.region.zone3 3

Raft Roles

Nodes in a Raft group can take on three roles:

Leader

  • Elected by majority vote from followers
  • Handles all writes by appending to Raft log
  • Serves consistent reads (by default)
  • Sends heartbeats to maintain leadership
  • Coordinates replication to followers

Follower

  • Replicates log entries from leader
  • Participates in elections by voting for candidates
  • Can serve reads with follower reads enabled
  • Becomes candidate if no heartbeat received
  • Applies committed entries to local storage

Candidate

  • Transitional role during leader election
  • Requests votes from other peers
  • Becomes leader if receives majority votes
  • Returns to follower if election fails
Initially, all nodes start as followers. After an election timeout without hearing from a leader, a follower transitions to candidate and starts an election.

Write Replication Process

Here’s how a write operation is replicated:
  1. Client sends write: the client connects to any YB-TServer and sends the write request.
  2. Leader acquires locks: the tablet leader acquires in-memory locks on the affected keys.
  3. Prepare write batch: the leader prepares a batch of DocDB changes (which can include multiple operations).
  4. Assign hybrid timestamp: the leader assigns a hybrid timestamp from its HLC.
  5. Append to Raft log: the leader appends the entry to its Raft log and sends it to the followers.
  6. Followers append: the followers append the entry to their Raft logs and send acknowledgments.
  7. Achieve majority: once a majority of peers (including the leader) has appended the entry, it is committed.
  8. Apply to DocDB: the leader applies the write batch to its local RocksDB.
  9. Return success: the leader returns success to the client (the write is now durable).
  10. Followers apply: the followers asynchronously apply the committed entry to their RocksDB.
A write is considered committed and durable once it’s replicated to a majority of peers, even if not all followers have applied it yet.
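
The commit rule in steps 5-7 can be modeled minimally (an illustrative sketch, not YugabyteDB's implementation):

```python
def quorum_size(rf: int) -> int:
    # smallest majority of a Raft group with rf peers
    return rf // 2 + 1

def is_committed(ack_count: int, rf: int) -> bool:
    """An entry is committed once a majority of peers, counting the
    leader's own append as one ack, have it in their Raft logs."""
    return ack_count >= quorum_size(rf)
```

With RF=3, the leader's append plus a single follower ack (2 of 3) is enough to commit; the slowest follower never gates the write.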

Leader Election

When a leader fails or a new Raft group is formed, peers elect a new leader.

Election Process

  1. Detect leader failure: a follower does not receive a heartbeat within the election timeout (typically 2-3 seconds).
  2. Become candidate: the follower increments its term number and transitions to the candidate role.
  3. Request votes: the candidate votes for itself and sends RequestVote RPCs to all peers.
  4. Peers vote: each peer votes for at most one candidate per term (first-come-first-served).
  5. Win election: the candidate that receives a majority of votes becomes the new leader.
  6. Send heartbeats: the new leader immediately sends heartbeats to all followers.
  7. Wait out lease: the new leader waits for the old leader's lease to expire before serving requests.

Election Terms

Each election increments the term number:
Term 1: Leader A elected, serves requests
Term 2: Leader A fails, Leader B elected
Term 3: Network partition, election fails, retry
Term 4: Leader B re-elected
  • Terms are monotonically increasing
  • Requests from leaders with stale terms are rejected
  • Helps detect outdated leaders after network partitions
If no candidate wins a majority (split vote), a new election starts for the next term after another timeout.
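
The voting rules above (at most one vote per term, stale terms rejected, a higher term resets state) can be sketched as follows. This toy model omits the log up-to-dateness check that full Raft also applies before granting a vote:

```python
class Peer:
    """Minimal model of a Raft peer's vote-handling state (illustrative only)."""

    def __init__(self):
        self.current_term = 0
        self.voted_for = None  # candidate this peer voted for in current_term

    def handle_request_vote(self, term: int, candidate_id: str) -> bool:
        # Reject candidates with stale terms.
        if term < self.current_term:
            return False
        # A higher term means a new election round: adopt it and clear our vote.
        if term > self.current_term:
            self.current_term = term
            self.voted_for = None
        # Grant at most one vote per term, first-come-first-served.
        if self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id
            return True
        return False
```

Because each peer grants one vote per term, two candidates can never both win a majority in the same term.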

Raft Enhancements in YugabyteDB

YugabyteDB includes several optimizations beyond basic Raft:

Leader Leases

Leader leases eliminate the need for read-time heartbeats, reducing read latency. How it works:
  1. Leader computes lease interval (default: 2 seconds)
  2. Leader sends lease interval to followers via Raft replication
  3. Followers start countdown after RPC delay
  4. Leader can serve reads without heartbeat during lease
  5. Old leader steps down when lease expires
  6. New leader waits out old lease before serving
Benefits:
  • Significantly lower read latency (no network round-trip)
  • No heartbeat required for each read
  • Still maintains safety during failures
Tradeoff:
  • Small unavailability window during failover (~lease duration)
  • Window bounded by: max_clock_drift * lease_interval + rpc_delay
Leader leases are critical for performance in geo-distributed deployments where heartbeat latency would be high (100ms+).
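
A toy model of the lease bookkeeping (illustrative only; YugabyteDB's actual implementation differs): the lease is anchored to the time the replication RPC was sent, so follower-side delays can only shorten the effective lease, never extend it.

```python
class LeaderLease:
    """Minimal leader-lease model; times are plain floats (seconds) for testability."""

    def __init__(self, lease_interval: float = 2.0):
        self.lease_interval = lease_interval
        self.lease_expiry = 0.0  # no lease held initially

    def on_majority_ack(self, sent_at: float) -> None:
        # Anchor the lease to when the replication RPC was *sent*:
        # any RPC delay eats into the lease rather than extending it.
        self.lease_expiry = max(self.lease_expiry, sent_at + self.lease_interval)

    def can_serve_reads(self, now: float) -> bool:
        # While the lease holds, reads need no extra heartbeat round-trip.
        return now < self.lease_expiry
```

A newly elected leader would wait until the old leader's lease_expiry has passed before serving, which is the small failover unavailability window described above.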

Group Commits

Batching multiple client operations into a single Raft log entry:
Without group commit:
Op1 -> Raft Entry 1 -> Replicate
Op2 -> Raft Entry 2 -> Replicate  
Op3 -> Raft Entry 3 -> Replicate

With group commit:
Op1 + Op2 + Op3 -> Single Raft Entry -> Replicate once
Benefits:
  • Higher write throughput (fewer Raft rounds)
  • Better network utilization (larger batches)
  • Reduced per-operation overhead
Implementation:
  • Operations ordered by (term, index, op_id)
  • Incoming operations grouped while one batch replicates
  • Next batch formed from pending operations
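
The batching behavior can be sketched as a simple queue (illustrative, not YugabyteDB code): operations that arrive while one batch is replicating are drained together into the next Raft entry.

```python
from collections import deque

class GroupCommitQueue:
    """Toy group-commit queue: many submitted ops, one Raft entry per batch."""

    def __init__(self):
        self.pending = deque()   # ops waiting for the next batch
        self.in_flight = None    # batch currently replicating, if any

    def submit(self, op) -> None:
        # Clients keep submitting while a previous batch replicates.
        self.pending.append(op)

    def next_batch(self):
        # Drain everything accumulated so far into a single Raft entry.
        if not self.pending:
            return None
        batch = list(self.pending)
        self.pending.clear()
        self.in_flight = batch
        return batch
```

Three operations submitted during one replication round become a single entry, so the group pays one Raft round-trip instead of three.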

Leader Balancing

YugabyteDB automatically balances tablet leaders across nodes:
Before balancing:
Node 1: 100 leaders, 50 followers
Node 2: 20 leaders, 80 followers  
Node 3: 30 leaders, 120 followers

After balancing:
Node 1: 50 leaders, 100 followers
Node 2: 50 leaders, 100 followers
Node 3: 50 leaders, 100 followers
  • Distributes read/write load evenly
  • Throttled to avoid impacting foreground operations
  • Considers zone/region placement for geo-distribution

Affinitized Leaders (Preferred Regions)

Pin tablet leaders to specific regions for lower write latency:
# Preferred region: us-west (priority 1)
cloud.region.us-west.zone.a: priority=1
cloud.region.us-west.zone.b: priority=1
cloud.region.us-east.zone.a: priority=2
cloud.region.eu-west.zone.a: priority=3
  • All writes go to preferred region (low latency)
  • Synchronous replication to other regions (durability)
  • Follower reads in non-preferred regions (read scaling)

Multi-Zone Deployment

Typical 3-zone deployment with RF=3:
Zone A          Zone B          Zone C
┌────────┐      ┌────────┐      ┌────────┐
│ Node 1 │      │ Node 2 │      │ Node 3 │
├────────┤      ├────────┤      ├────────┤
│ T1-L   │      │ T1-F   │      │ T1-F   │
│ T2-F   │      │ T2-L   │      │ T2-F   │  
│ T3-F   │      │ T3-F   │      │ T3-L   │
│ T4-L   │      │ T4-F   │      │ T4-F   │
└────────┘      └────────┘      └────────┘

T1-L = Tablet 1 Leader
T1-F = Tablet 1 Follower

Zone Outage Handling

  1. Zone fails: an entire zone becomes unreachable (e.g., Zone C fails).
  2. Leaders fail over: tablets whose leaders were in Zone C elect new leaders in Zones A and B.
  3. Quorum maintained: Zones A and B form a majority (2 out of 3), so the cluster remains operational.
  4. Balanced redistribution: new leaders are distributed evenly across the surviving zones.
  5. RPO = 0: no data is lost, because every committed write was already on a majority of replicas.
  6. RTO ≈ 3s: recovery takes roughly 3 seconds (the leader election timeout).
RPO (Recovery Point Objective): zero data loss due to synchronous replication. RTO (Recovery Time Objective): ~3 seconds for automatic failover.

Follower Reads

By default, only leaders serve reads for linearizability. Follower reads provide:
  • Lower latency: Read from geographically closer followers
  • Higher throughput: Distribute reads across all replicas
  • Timeline consistency: Monotonic, causally-consistent reads
  • Bounded staleness: Configurable maximum staleness

Enabling Follower Reads

-- YSQL
SET yb_read_from_followers = true;
SET yb_follower_read_staleness_ms = 30000; -- 30 seconds max staleness

SELECT * FROM products WHERE category = 'electronics';
-- May read from local follower if within staleness bound
-- YCQL
SELECT * FROM products WHERE category = 'electronics' 
USING FOLLOWER READ;
Follower reads provide timeline consistency, not linearizability. Don’t use for operations requiring absolute latest data (e.g., read-modify-write).

Read Replicas

Read replicas are async replicas that don’t participate in Raft consensus:
Primary Cluster (RF=3)    Read Replica Cluster (RF=1)
┌──────────────────┐      ┌──────────────────┐
│ Zone A, B, C     │──────>│ Remote Region    │
│ (Sync Raft)      │ Async │ (Raft Observer)  │
└──────────────────┘      └──────────────────┘
Use cases:
  • Serve reads in distant regions with lower latency
  • Analytics workloads on separate infrastructure
  • Compliance (data locality requirements)
  • Disaster recovery (asynchronous replication)
Characteristics:
  • Do NOT participate in primary cluster voting
  • Asynchronously replicate from primary via Raft observer
  • Can have their own RF for local availability
  • Always require follower reads to be enabled

Monitoring Replication

Key metrics to monitor:
-- Check Raft lag per tablet
SELECT tablet_id, 
       majority_replicated_op_id,
       committed_op_id,
       majority_replicated_op_id - committed_op_id AS lag
FROM yb_local_tablets;

-- Check follower lag in milliseconds  
SELECT tablet_id,
       follower_lag_ms,
       is_leader
FROM yb_replication_status
WHERE follower_lag_ms > 1000; -- Followers lagging >1s

-- Monitor under-replicated tablets
SELECT COUNT(*) 
FROM yb_local_tablets
WHERE num_replicas < replication_factor;

Replication Performance

  • Single-zone (same DC): 1-3ms replication latency
  • Multi-zone (same region): 5-10ms replication latency
  • Multi-region: 50-300ms replication latency (depends on distance)
  • Majority required: Write latency determined by 2nd fastest replica (RF=3)
  • Writes: Limited by leader capacity, scale by adding tablets
  • Reads: Scale linearly with followers (if follower reads enabled)
  • Network: Replication consumes bandwidth (RF-1 copies per write)
  • Group commits: Batch operations for higher throughput
  • Leader failure: 2-3 second failover (election timeout)
  • Follower failure: No impact on availability or latency
  • Re-replication: Automatic when node permanently fails
  • Tablet splitting: Load automatically balanced to new tablets
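
The point that write latency is set by the 2nd fastest replica can be checked with a small model (the leader's own append is treated as instantaneous; illustrative only):

```python
def commit_latency_ms(follower_ack_ms: list) -> float:
    """Latency at which a write commits, given each follower's ack latency.
    The leader's own append counts as the first ack and is modeled as free."""
    rf = len(follower_ack_ms) + 1        # leader + followers
    acks_needed = rf // 2 + 1 - 1        # follower acks needed beyond the leader's own
    # the batch commits when the acks_needed-th fastest follower responds
    return sorted(follower_ack_ms)[acks_needed - 1]
```

With RF=3 and follower latencies of 5 ms and 50 ms, the write commits at 5 ms: the slow replica (often the cross-region one) does not gate the write.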

Best Practices

  • Choose an appropriate RF: use RF=3 for production and RF=5 for mission-critical workloads; never use RF=2 (a two-node group cannot survive any failure, so it adds storage cost with no fault-tolerance benefit).
  • Distribute across zones: place replicas in different zones/regions for true fault tolerance.
  • Use follower reads: enable them for read-heavy workloads and analytics to reduce leader load.
  • Monitor replication lag: alert on high lag, which indicates slow followers or network issues.
  • Set preferred regions: pin leaders near application servers for lower write latency.
  • Plan for failures: design for graceful degradation when zones fail.

Replication vs xCluster

Feature               Raft replication               xCluster replication
Consistency           Synchronous (strong)           Asynchronous (eventual)
Scope                 Within a universe              Between universes
Use case              HA, fault tolerance            DR, multi-DC active-active
Latency impact        Yes (majority quorum)          No (async)
Conflict resolution   No conflicts (single source)   Application-defined

Next Steps

  • Consistency Model: understand how replication ensures consistency
  • Distributed Transactions: learn how transactions work across replicas
  • Architecture: review the overall system architecture
  • Data Model: see how data is stored in each replica
