Chapter 5: Replication
Introduction
Replication means keeping a copy of the same data on multiple machines (nodes) connected via a network. This is one of the most fundamental concepts in distributed systems and forms the backbone of how modern applications achieve scale and reliability.
Why replicate data?
Replication serves several critical purposes:
- High availability: Keep the system running even if some parts fail
- If one machine fails, the system can continue operating using other replicas
- Critical for services that need 99.9% or higher uptime
- Example: If Amazon’s database goes down in one region, replicas in other regions keep the site running
- Reduced latency: Keep data geographically close to users
- Users in Tokyo can read from a replica in Asia rather than waiting for data from a server in the US
- This can reduce latency from 200ms to 20ms or less
- Example: Netflix stores copies of popular movies in servers close to major population centers
- Increased read throughput: Scale out the number of machines that can serve read queries
- One machine might handle 1,000 queries/second, but 10 replicas can handle 10,000 queries/second
- Read-heavy applications (like social media feeds) can distribute load across many replicas
- Example: Twitter uses thousands of replicas to serve billions of timeline reads per day
- Disaster recovery: Protect against catastrophic failures
- If an entire datacenter is destroyed (fire, flood, power outage), data survives in other locations
- Regulatory compliance may require data to be stored in multiple geographic locations
The fundamental challenge
While replication sounds simple in theory, it introduces significant complexity in practice. The core challenge is: How do we keep all replicas in sync when data changes? This seemingly simple question leads to complex trade-offs between consistency, availability, and performance that we’ll explore throughout this chapter.
Leaders and followers (single-leader replication)
How it works
The most common approach to replication is leader-based replication (also known as master-slave or active-passive replication). This pattern is used by most relational databases (PostgreSQL, MySQL, Oracle, SQL Server) and many NoSQL databases (MongoDB, RethinkDB, Espresso). Key principles:
- One replica is designated as the leader (master/primary)
- The leader is the authoritative source of truth
- Only the leader accepts write operations
- All writes go to the leader first
- When a client wants to write to the database, it must send the request to the leader
- The leader writes the new data to its local storage
- The leader sends data changes to all followers (replicas/slaves/secondaries)
- Changes are sent as part of a replication log or change stream
- This happens either synchronously or asynchronously (see below)
- Followers apply the changes in the same order as they were processed on the leader
- This ensures that followers eventually have the same data as the leader
- Order is critical: applying changes in the wrong order could lead to an incorrect state
- Reads can be served from the leader or any follower
- This allows read scaling - add more followers to handle more read queries
- Reading from followers may return slightly stale data (eventual consistency)
Real-world example: Imagine a social media platform like Instagram:
- When you post a photo, it writes to the leader database
- The leader replicates this data to followers in different geographic regions
- When users around the world view your photo, they read from the nearest follower
- This gives fast reads globally while ensuring all writes go through a single point of coordination
Example: PostgreSQL streaming replication, in which the primary streams WAL records to standby servers over a replication connection.
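The mechanics above can be illustrated with a minimal in-memory sketch (the Leader/Follower classes are illustrative, not how PostgreSQL actually implements replication): the leader applies each write locally, appends it to a replication log, and streams it to followers, which apply changes in leader order.

```python
# Minimal sketch of single-leader replication: the leader appends every
# write to a replication log; followers apply log entries in order.
class Leader:
    def __init__(self):
        self.data = {}          # key-value storage
        self.log = []           # replication log: ordered list of (key, value)
        self.followers = []

    def write(self, key, value):
        self.data[key] = value              # apply locally first
        self.log.append((key, value))       # record the change
        for f in self.followers:            # stream the change to followers
            f.apply(key, value)

class Follower:
    def __init__(self, leader):
        self.data = {}
        leader.followers.append(self)

    def apply(self, key, value):
        self.data[key] = value              # apply changes in leader order

    def read(self, key):
        return self.data.get(key)           # may be stale under async replication

leader = Leader()
follower = Follower(leader)
leader.write("user:1", "alice")
print(follower.read("user:1"))  # the write is now visible on the follower
```

Because the log is an ordered list and followers apply entries in that order, every follower converges to the same state as the leader.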
Synchronous vs. asynchronous replication
One of the most important questions in replication is: Should the leader wait for followers to confirm they’ve received the write before telling the client the write was successful?
Synchronous replication
In synchronous replication, the leader waits for confirmation from follower(s) before reporting success to the client. How it works:
- Client sends write to leader
- Leader writes to its local storage
- Leader sends change to followers
- Leader waits for acknowledgment from follower(s)
- Only after receiving acknowledgment, leader confirms success to client
Advantages:
- Durability guarantee: Follower guaranteed to have up-to-date copy consistent with leader
- If leader fails immediately after acknowledging write, the data is safe on follower
- Strong consistency for reads from synchronous followers
Disadvantages:
- Latency: Write is slower (network round trip to follower)
- Availability: Write is blocked if synchronous follower doesn’t respond
- If follower crashes or network fails, writes cannot be processed
- System becomes unavailable for writes until follower recovers
- Not practical to make all followers synchronous (one slow node blocks all writes)
Real-world example: Google Cloud SQL offers synchronous replication to one replica in a different zone within the same region for high durability while maintaining reasonable latency.
Asynchronous replication
In asynchronous replication, the leader sends changes but doesn’t wait for confirmation from followers. How it works:
- Client sends write to leader
- Leader writes to its local storage
- Leader immediately confirms success to client
- Leader sends change to followers in the background (fire and forget)
- Followers apply changes when they receive them
Advantages:
- Performance: Write latency is minimal (only limited by leader’s local write)
- Availability: Leader can continue processing writes even if all followers are down
- Can have many followers without impacting write performance
- Common choice for read-heavy workloads where eventual consistency is acceptable
Disadvantages:
- Durability risk: Not all followers guaranteed to have latest data if leader fails
- If leader fails before followers receive recent writes, those writes are lost
- Could lose seconds or minutes of data
- Replication lag: Followers may be seconds, minutes, or even hours behind
- Reads from followers return stale data
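The trade-off between the two modes can be sketched with a toy model (the class names and the 10 ms simulated network delay are illustrative): in synchronous mode the leader blocks until the follower acknowledges; in asynchronous mode it reports success immediately.

```python
import queue
import threading
import time

# Sketch of the sync/async trade-off: a leader that either waits for the
# follower's acknowledgment (synchronous) or returns immediately (asynchronous).
class Follower:
    def __init__(self):
        self.data = {}
        self.inbox = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            key, value, ack = self.inbox.get()
            time.sleep(0.01)           # simulated network + apply delay
            self.data[key] = value
            ack.set()                  # acknowledge receipt to the leader

class Leader:
    def __init__(self, follower, synchronous):
        self.data = {}
        self.follower = follower
        self.synchronous = synchronous

    def write(self, key, value):
        self.data[key] = value
        ack = threading.Event()
        self.follower.inbox.put((key, value, ack))
        if self.synchronous:
            ack.wait()                 # block until the follower confirms
        # asynchronous mode: report success without waiting

follower = Follower()
leader = Leader(follower, synchronous=True)
leader.write("k", "v")
# In synchronous mode the follower is guaranteed to be up to date here;
# in asynchronous mode it might still be applying the change.
print(follower.data["k"])
```

Flipping `synchronous=False` removes the `ack.wait()` round trip, which is exactly the latency/durability trade-off described above.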
Semi-synchronous replication (best of both worlds)
In practice, making all followers synchronous is impractical (one slow node would block all writes), while fully asynchronous replication risks data loss. The solution is semi-synchronous replication:
- One follower is synchronous (often called the “synchronous standby”)
- All other followers are asynchronous
- If the synchronous follower becomes unavailable, one of the asynchronous followers is promoted to synchronous
- Guarantees at least two up-to-date copies (leader + one synchronous follower)
- Minimal data loss risk (at most one node’s worth of recent writes)
- Better performance than fully synchronous (not waiting for all followers)
- Better durability than fully asynchronous (one follower always up-to-date)
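The promotion rule can be sketched as follows (the follower records and the `log_position` field are hypothetical bookkeeping, not any particular database's API): when the synchronous standby fails, the most caught-up asynchronous follower takes over its role.

```python
# Sketch of semi-synchronous follower management: one synchronous standby;
# if it fails, an asynchronous follower is promoted to synchronous.
def handle_failure(sync_follower, async_followers, failed):
    """Return the new (sync, async) configuration after `failed` goes down."""
    if failed is sync_follower and async_followers:
        # promote the most up-to-date asynchronous follower
        new_sync = max(async_followers, key=lambda f: f["log_position"])
        remaining = [f for f in async_followers if f is not new_sync]
        return new_sync, remaining
    return sync_follower, [f for f in async_followers if f is not failed]

sync = {"name": "B", "log_position": 100}
asyncs = [{"name": "C", "log_position": 98}, {"name": "D", "log_position": 99}]
new_sync, new_asyncs = handle_failure(sync, asyncs, failed=sync)
print(new_sync["name"])  # the most caught-up async follower takes over
```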
Setting up new followers
A common operational task is adding a new follower to the system - perhaps to replace a failed node, add read capacity, or create a new replica in a different geographic region. How do we add a new follower without downtime? Simply copying data files is insufficient because data is constantly being written. The process:
- Take a consistent snapshot of the leader’s database at some point in time
- Most databases have built-in snapshot features
- The snapshot must be consistent (represent the database at a single point in time, not halfway through a write)
- Example: PostgreSQL pg_basebackup, MySQL mysqldump, MongoDB mongodump
- Copy the snapshot to the new follower node
- Transfer the snapshot file(s) over the network
- This may take hours or days for large databases
- During this time, the leader continues accepting writes
- Follower connects to leader and requests all changes since the snapshot
- The snapshot is associated with an exact position in the leader’s replication log
- Position identified by log sequence number (LSN), binlog coordinates, or replica set optime
- Follower says: “Give me all changes since position X”
- Follower processes the backlog of changes
- Applies all writes that happened since the snapshot
- “Catches up” to the leader’s current state
- This is called catch-up recovery
- Follower is now ready to serve reads
- Once caught up, continues to receive and apply ongoing changes
- Can now handle read queries
Real-world example: Adding a read replica to a production database:
- The process must be entirely automatic (can’t afford hours of human intervention)
- Snapshot + catch-up can take a long time for multi-terabyte databases
- Some systems support “online” or “hot” backup (snapshot while database is running)
- Network bandwidth between leader and follower impacts catch-up time
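The snapshot-plus-catch-up process above can be sketched with an in-memory model (real systems use on-disk snapshots and log sequence numbers; the list index standing in for an LSN here is purely illustrative):

```python
# Sketch of setting up a new follower: take a consistent snapshot tied to
# a log position, then catch up from that position while the leader keeps
# accepting writes.
class Leader:
    def __init__(self):
        self.data = {}
        self.log = []                      # list of (key, value); index = log position

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))

    def snapshot(self):
        # a consistent snapshot plus the log position it corresponds to
        return dict(self.data), len(self.log)

    def changes_since(self, position):
        return self.log[position:]

leader = Leader()
leader.write("a", 1)
snap, pos = leader.snapshot()              # step 1: consistent snapshot
leader.write("b", 2)                       # leader keeps accepting writes

follower_data = dict(snap)                 # step 2: copy snapshot to the follower
for key, value in leader.changes_since(pos):   # steps 3-4: catch-up recovery
    follower_data[key] = value

print(follower_data == leader.data)        # follower has caught up
```

The key detail is that the snapshot is tied to an exact log position, so the follower knows precisely which changes it still needs.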
Handling node outages
In any system with multiple machines, individual nodes can and will fail. The question is not if they’ll fail, but when. Our replication setup should handle these failures gracefully, ideally with minimal or no downtime.
Follower failure: Catch-up recovery
Follower failures are relatively straightforward to handle because they don’t affect the ability to process writes. What happens:
- Each follower keeps a log of data changes received from the leader on local disk
- When a follower crashes and restarts (or network is restored after interruption), it knows the last transaction it processed
- It connects to the leader and requests all changes since that point
- It applies the backlog of changes (catch-up recovery)
- Once caught up, it continues to receive ongoing changes
Example:
- It requests: “Give me all changes since transaction #1000”
- Leader sends transactions #1001 through #1500 (current position)
- Follower applies these 500 transactions
- Now caught up, continues normal operation
Leader failure: Failover
Leader failure is much more complex and problematic because the leader is the only node accepting writes. If the leader fails, one of the followers needs to be promoted to be the new leader. This process is called failover. Failover can happen in two ways:
- Manually: Administrator decides which follower becomes new leader and reconfigures the system
- Automatically: System detects failure and performs failover automatically
An automatic failover process consists of these steps:
- Detecting that the leader has failed
- No foolproof way to detect failure (cannot distinguish between slow response and crash)
- Typically use a timeout-based approach: nodes exchange heartbeat messages; if no response for X seconds (e.g., 30s), the leader is considered dead
- Must be careful: too short timeout causes unnecessary failovers, too long means longer downtime
- Choosing a new leader
- Could be done through election (remaining replicas elect a leader)
- Or use a previously elected “standby” leader
- Best candidate is usually the replica with the most up-to-date data changes from old leader
- Getting consensus on new leader is a problem in itself (see Chapter 9 on consensus)
- Reconfiguring the system to use the new leader
- Clients need to send writes to the new leader
- If old leader comes back, it must recognize that it’s no longer the leader (become a follower)
- System needs to ensure old leader doesn’t believe it’s still the leader (split brain problem)
- Choosing the right timeout
- Too short: Unnecessary failovers from temporary network glitches or load spikes
- Failover causes disruption (clients need to reconnect, system reconfiguration)
- Can make problem worse if system is already overloaded
- Too long: Longer recovery time, more downtime
- Users see errors for duration of timeout
- May violate SLA (Service Level Agreement)
- Financial systems: Very short (seconds)
- Less critical systems: Longer (minutes)
- Scenarios leading to false positives
- Sudden load spike causes leader to respond slowly
- Network congestion delays heartbeat messages
- Garbage collection pause in Java/JVM-based systems (can be 10+ seconds)
- These aren’t failures, but can trigger failover
For these reasons, many operations teams prefer manual failover:
- Allows human judgment (is this really a failure or just a slow response?)
- Prevents cascade failures from automatic actions
- Can verify data consistency before promoting follower
- Less risk of split brain scenarios
Conclusion: Failover is fraught with challenges. While it’s essential for high availability, it must be implemented carefully with proper monitoring, testing, and safeguards.
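The detection and promotion steps can be sketched as follows (the 30-second timeout mirrors the example in the text; the follower records and `log_position` field are illustrative):

```python
# Sketch of timeout-based leader failure detection and follower promotion:
# if no heartbeat arrives within the timeout, declare the leader dead and
# promote the most up-to-date follower.
def leader_is_dead(last_heartbeat, now, timeout=30.0):
    # no foolproof detection: this only says "no heartbeat for too long"
    return (now - last_heartbeat) > timeout

def choose_new_leader(followers):
    # best candidate: the replica with the most up-to-date data changes
    return max(followers, key=lambda f: f["log_position"])

followers = [
    {"name": "B", "log_position": 1490},
    {"name": "C", "log_position": 1500},
]
if leader_is_dead(last_heartbeat=100.0, now=140.0):
    new_leader = choose_new_leader(followers)
    print(new_leader["name"])  # the most caught-up follower is promoted
```

Note that this sketch reproduces the false-positive problem described above: a leader paused by garbage collection for longer than the timeout would be declared dead even though it is still alive.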
Replication logs implementation
Now that we understand the basic principle of leader-based replication, let’s examine how replication is actually implemented. The key question is: What exactly does the leader send to the followers? There are several different approaches, each with its own trade-offs.
Statement-based replication
The leader logs every write request (INSERT, UPDATE, DELETE statement) and sends that statement to its followers. Each follower parses and executes the SQL statement as if it had been received from a client. Problems with this approach:
- Non-deterministic functions (such as NOW() or RAND()) produce different values on each replica
Solution: Leader can replace non-deterministic functions with fixed values before sending to followers
- Statements with auto-incrementing columns
Solution: Strict ordering requirement - followers must execute in exact same order as leader
- Side effects (triggers, stored procedures, user-defined functions) may occur differently on each replica
Write-ahead log (WAL) shipping
Most databases write to an append-only log before applying changes to the storage engine (for crash recovery). This log is called a Write-Ahead Log (WAL) or commit log. The leader can send this exact same log to followers. How it works: The WAL contains low-level descriptions of which bytes were changed in which disk blocks. For example:
- “In disk block 123, at offset 456, write these bytes: 0x4A6F686E…”
- Very detailed, byte-level modifications
Advantages:
- Exact replica: followers’ storage is byte-for-byte identical to leader
- No ambiguity about non-deterministic functions
Disadvantages:
- Tightly coupled to storage engine: WAL describes data at a very low level (disk blocks, offsets)
- If storage engine format changes between versions, can’t replicate between different software versions
- Makes zero-downtime upgrades difficult (can’t have leader on v1 and follower on v2)
- Upgrading the database software (e.g. from PostgreSQL 10 to 11) often requires downtime
Logical (row-based) log replication
An alternative is to use a different log format for replication, decoupled from the storage engine internals. This is called a logical log (as opposed to a physical log like the WAL). A logical log for a relational database is usually a sequence of records describing writes to database tables at the row level:
- For an inserted row: log contains new values of all columns
- For a deleted row: log contains information to uniquely identify the row (typically primary key)
- For an updated row: log contains information to identify the row + new values of all (or changed) columns
Advantages:
- Decoupled from storage engine: Leader and followers can run different storage engines or even different database systems
- Backward compatible: Easier to make replication compatible across different software versions
- External systems can parse it: Data warehouses, caches, search indexes can consume the log
- Example: Debezium, Maxwell’s Daemon capture MySQL binlog and stream to Kafka
- Example: LinkedIn’s Databus captures Oracle’s logical log
Real-world example: Change Data Capture (CDC) - tools read the database’s logical replication log and stream row-level changes to external systems such as Kafka, data warehouses, caches, and search indexes.
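Logical log records can be sketched as row-level dictionaries (the record shapes here are illustrative, loosely resembling the JSON change events that CDC tools emit; they are not any tool's actual schema):

```python
# Sketch of logical (row-based) log records: each write is described at
# the row level, independent of the storage engine's on-disk format.
def insert_record(table, row):
    return {"op": "insert", "table": table, "row": row}

def update_record(table, key, changes):
    return {"op": "update", "table": table, "key": key, "changes": changes}

def delete_record(table, key):
    return {"op": "delete", "table": table, "key": key}

def apply(tables, record):
    t = tables.setdefault(record["table"], {})
    if record["op"] == "insert":
        t[record["row"]["id"]] = dict(record["row"])
    elif record["op"] == "update":
        t[record["key"]].update(record["changes"])
    elif record["op"] == "delete":
        del t[record["key"]]

log = [
    insert_record("users", {"id": 1, "name": "Ada"}),
    update_record("users", 1, {"name": "Ada Lovelace"}),
]
tables = {}
for record in log:             # a follower (or a CDC consumer) replays the log
    apply(tables, record)
print(tables["users"][1])
```

Because the records talk about tables and rows rather than disk blocks, any consumer that understands the record format can replay them - which is exactly why external systems can subscribe to a logical log.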
Trigger-based replication
The previous approaches are implemented by the database system. Sometimes you need more flexibility - for example, only replicate a subset of data, or replicate to a different kind of database. For these cases, you can use trigger-based replication. How it works: Application code registers custom triggers that log changes to a separate table. An external process reads this changelog and replicates it.
Advantages:
- Very flexible: Can replicate subset of data, transform data, replicate to different databases
- Application-level control: Can add custom logic (validation, filtering, transformation)
Disadvantages:
- Greater overhead: Triggers run for every write, slowing down writes
- More error-prone: Application code is more likely to have bugs than database built-in replication
- Complexity: Need to maintain trigger code and replication process
Example use case: Replicating from OLTP database to data warehouse
Problems with replication lag
With asynchronous replication, followers may lag behind the leader. This creates consistency issues. In an ideal world, replication would be instantaneous - every write made to the leader would immediately be visible on all followers. In reality, there’s a delay, called replication lag. The lag might be a fraction of a second (typical in normal operation), but can grow to several seconds or even minutes when:
- The system is operating near capacity
- There’s a network problem
- A follower is recovering from a failure
Read-after-write consistency (read-your-writes consistency)
The problem: User makes a write, then immediately reads that data back. With asynchronous replication, the read might go to a follower that hasn’t yet received the write. From the user’s perspective, it looks like their data was lost! Solutions:
- Always read user’s own writes from the leader
- Simple and effective for user-generated content
- Problem: If most things are editable by users, most reads go to leader (negates benefit of read replicas)
- Track timestamp of last update; for some time after update, read from leader
- Good balance: reads from leader only when necessary
- Requires tracking last write time (can store in client session or cookie)
- Client remembers timestamp of most recent write
- Most sophisticated approach
- System ensures replica serving read has updates ≥ timestamp of user’s last write
- Can use logical timestamps (like log sequence numbers) rather than wall-clock time
- If replicas are distributed across datacenters, additional complexity:
- User might be routed to different datacenters for different requests
- Any replica that serves user reads must be in same datacenter as leader, or must reliably route to leader
- Example: User in Europe writes (goes to Europe leader), then mobile app reads (might route to US datacenter)
- Approaches that rely on the client remembering its last write don’t work across devices (one device doesn’t know about writes made on another device)
- Metadata for last write timestamp must be centralized
- If replicas are in different datacenters, harder to guarantee routing to same datacenter from different devices
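The "read from the leader for some time after an update" approach can be sketched as follows (the 60-second window and the function names are illustrative choices, not a standard; the window should exceed the worst replication lag you expect):

```python
import time

# Sketch of read-after-write routing: after a user writes, route that
# user's reads to the leader for a fixed window, so they always see
# their own writes even if followers are lagging.
LEADER_WINDOW = 60.0               # assumed upper bound on replication lag
last_write_at = {}                 # user_id -> timestamp of last write

def record_write(user_id, now=None):
    last_write_at[user_id] = now if now is not None else time.time()

def choose_replica(user_id, now=None):
    now = now if now is not None else time.time()
    if now - last_write_at.get(user_id, 0.0) < LEADER_WINDOW:
        return "leader"            # recent writer: must see their own writes
    return "follower"              # everyone else gets fast, possibly stale reads

record_write("alice", now=1000.0)
print(choose_replica("alice", now=1030.0))   # within the window -> leader
print(choose_replica("bob", now=1030.0))     # no recent writes -> follower
```

This keeps most read traffic on the followers while guaranteeing that a user who just wrote reads their own data from the leader.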
Monotonic reads
The problem: With asynchronous followers, it’s possible for a user to see things move backward in time. This can happen if a user makes several reads from different replicas with different replication lag.
Solution: Monotonic reads guarantee that if a user reads data at time t1, then reads again at t2 (where t2 > t1), they will not see an older version of the data. In other words, time doesn’t move backward for that user.
Implementation: Each user always reads from the same replica
- Simple to implement (just hash user ID)
- Each user’s view of data is consistent (never goes backward)
- Load is distributed across replicas
- If a replica fails, users assigned to it must be rerouted (temporarily see older data)
- Not as strong as “read after write” consistency, but weaker guarantee is often sufficient
Weaker than read-after-write: User might still see stale data (their replica might be lagging), but at least data doesn’t go backward on subsequent reads.
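The hash-based sticky routing can be sketched as follows (the replica names are illustrative):

```python
import hashlib

# Sketch of monotonic reads via sticky routing: hash the user ID to pick
# a replica, so the same user always reads from the same replica and
# never sees its view of the data move backward in time.
def replica_for(user_id, replicas):
    digest = hashlib.sha256(user_id.encode()).digest()
    index = int.from_bytes(digest[:4], "big") % len(replicas)
    return replicas[index]

replicas = ["replica-1", "replica-2", "replica-3"]
first = replica_for("alice", replicas)
# every subsequent read by the same user hits the same replica:
assert all(replica_for("alice", replicas) == first for _ in range(100))
print(first)
```

Note the failure-handling caveat from above: if `first` goes down, "alice" must be rerouted to another replica, and that replica may briefly show her older data.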
Consistent prefix reads
The problem: This anomaly concerns causality violations - seeing an effect before its cause. It is particularly problematic in partitioned (sharded) databases where different partitions operate independently. Imagine a conversation between two people: one asks a question and the other answers it. If the question and the answer are written to different partitions with different replication lag, an observer may see the answer before the question. Why it happens:
- Different partitions operate independently with no ordering guarantee across partitions
- Replication lag varies between partitions
- Some partitions might be slower (more load, slower disk, network issues)
Solutions:
- Each message includes vector of “happened before” relationships
- System ensures messages aren’t shown until dependencies are met
- Some systems (like Riak) provide causal consistency by tracking causal relationships
- More complex but provides stronger guarantees
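Dependency-based delivery can be sketched as follows (the message format with an explicit `depends_on` list is illustrative; real systems typically derive these dependencies from version vectors rather than listing them by hand):

```python
# Sketch of a consistent prefix for a conversation: a message is only
# shown once every message it causally depends on has been delivered.
def deliverable(message, delivered_ids):
    return all(dep in delivered_ids for dep in message["depends_on"])

def deliver_in_causal_order(messages):
    delivered, pending = [], list(messages)
    while pending:
        progress = False
        for m in list(pending):
            if deliverable(m, {d["id"] for d in delivered}):
                delivered.append(m)
                pending.remove(m)
                progress = True
        if not progress:
            break                  # remaining messages wait for their causes
    return delivered

# the answer arrives from a fast partition before the question it replies to
messages = [
    {"id": "answer", "text": "About ten seconds", "depends_on": ["question"]},
    {"id": "question", "text": "How far into the future can you see?", "depends_on": []},
]
ordered = deliver_in_causal_order(messages)
print([m["id"] for m in ordered])  # the question is shown before the answer
```

Holding back the answer until its cause has been delivered is exactly the "dependencies must be met" rule described above.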
Summary of consistency guarantees
| Guarantee | What it prevents | Strength |
|---|---|---|
| Eventual Consistency | Nothing in short term - data eventually becomes consistent | Weakest |
| Monotonic Reads | Time moving backward for a user | Weak |
| Consistent Prefix Reads | Causality violations (seeing effect before cause) | Medium |
| Read After Write | User not seeing their own writes | Medium |
| Strong Consistency | All anomalies - all clients see same data at same time | Strongest (but impacts performance) |
Multi-leader replication
In single-leader replication, all writes must go through the one leader. This works well, but has limitations:
- Leader becomes a bottleneck for writes
- If you want to have datacenters in multiple geographic regions, all writes must route to one datacenter
- If the leader fails, system is unavailable for writes until failover completes