Introduction
In Chapters 5-8, we’ve discussed replication, partitioning, transactions, and the problems of distributed systems. Now we’ll explore one of the most important abstractions for building fault-tolerant distributed systems: consensus. Consensus means getting several nodes to agree on something. This sounds simple, but it is surprisingly difficult in a distributed system where nodes and networks can fail. Consensus is at the heart of many distributed systems problems:
- Leader election: Nodes must agree on which node is the leader
- Atomic commit: All nodes must agree to commit or abort a transaction
- State machine replication: All nodes must agree on the order of operations
- Total order broadcast: All nodes must deliver messages in the same order
Consistency guarantees
When we replicate data, different replicas may process updates at different times. What guarantees can we provide about when updates become visible?
The spectrum of consistency models
Eventual Consistency (weakest):
- If no new updates, all replicas eventually converge to same value
- No guarantee about when this happens
- Replicas may diverge in the meantime
Linearizability (strongest):
- System appears as if there’s only one copy of the data
- All operations appear to happen atomically at a single point in time
- Once a read returns a value, all subsequent reads return that value or a newer one
Linearizability
Linearizability (also called atomic consistency or strong consistency) is the strongest consistency guarantee. It makes a distributed system appear as if there’s only a single copy of the data.
What is linearizability?
Key property: Once a write completes, all subsequent reads (by any client) must return that value or a newer value. The system acts as if operations happen in a single, total order.
Linearizability vs. serializability
These terms are often confused, but they’re different:
| Aspect | Linearizability | Serializability |
|---|---|---|
| What | Recency guarantee on reads/writes | Isolation guarantee for transactions |
| Scope | Single object (or register) | Multiple objects |
| Purpose | Make distributed system look like single copy | Prevent race conditions in transactions |
| When | Real-time ordering | Can reorder as long as result equivalent to serial execution |
Can have both: Strict serializability = Serializability + Linearizability
Implementing linearizability
Approaches:
1. Single-leader replication (potentially linearizable)
   - Reads from leader or synchronously updated follower
   - Writes to leader
2. Consensus algorithms (linearizable)
   - Raft, Paxos, ZAB
   - Ensure all operations happen in agreed-upon order
3. Multi-leader replication (not linearizable)
   - Concurrent writes to different leaders
   - No total ordering
4. Leaderless replication with quorums (usually not linearizable)
   - Even with strict quorums (r + w > n), edge cases exist
   - Network delays can cause issues
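The strict quorum condition in that last point can be sketched as a one-liner (function name is mine):

```python
def is_strict_quorum(n, w, r):
    """With n replicas, w write acks, and r read responses, every read
    set overlaps every write set exactly when r + w > n."""
    return r + w > n

# 5 replicas: writing to 3 and reading from 3 guarantees overlap...
assert is_strict_quorum(5, w=3, r=3)
# ...but writing to 2 and reading from 2 lets a read miss the latest write
assert not is_strict_quorum(5, w=2, r=2)
```

As the bullets above note, overlap alone is not enough for linearizability: with variable network delays, one read can observe the new value while a later read (hitting different replicas mid-write) still sees the old one.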
The cost of linearizability
Linearizability has performance costs: every operation must coordinate on a single up-to-date copy, which adds latency and reduces availability during network partitions.
When to use linearizability:
- Leader election (must have one leader)
- Constraints and uniqueness (username, file locks)
- Cross-channel timing dependencies
When NOT to use linearizability:
- Most applications (eventual consistency is fine)
- Geo-distributed systems (too slow)
- High availability is critical
Ordering guarantees
Many distributed systems problems boil down to ordering: making sure all nodes agree on the order in which things happened.
Causality
Causality imposes an ordering on events: cause comes before effect. Causal consistency: Operations that are causally related must be seen in the same order by all nodes. Concurrent operations can be seen in any order. Why causality matters:
- Question comes before answer
- Row must be created before updated
- User must be registered before posting
Happens-before relationship
Operation A happens-before operation B if:
- A and B are in the same thread, and A comes before B
- A is sending a message, B is receiving that message
- Transitivity: If A happens-before B, and B happens-before C, then A happens-before C
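These rules are exactly what Lamport timestamps (discussed below) preserve: if A happens-before B, then A gets a smaller timestamp. A minimal logical-clock sketch (class and method names are mine):

```python
class LamportClock:
    """Logical clock whose timestamps respect happens-before:
    if A happens-before B, then ts(A) < ts(B) (the converse need not hold)."""
    def __init__(self, node_id):
        self.node_id = node_id
        self.counter = 0

    def local_event(self):
        # rule 1: events on the same node are ordered by the local counter
        self.counter += 1
        return (self.counter, self.node_id)  # node ID breaks ties

    def send(self):
        # sending is a local event; the timestamp travels with the message
        return self.local_event()

    def receive(self, msg_ts):
        # rule 2: advance past the sender's clock, so receive > send
        self.counter = max(self.counter, msg_ts[0]) + 1
        return (self.counter, self.node_id)

a, b = LamportClock("A"), LamportClock("B")
t_send = a.send()
t_recv = b.receive(t_send)
assert t_send < t_recv  # send happens-before receive; timestamps agree
```

Transitivity falls out for free: timestamps are numbers, and `<` on numbers is transitive.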
Capturing causality with version vectors
Version vectors track causality between operations: each node keeps a counter per node, and comparing two vectors reveals whether one operation happened before another or whether they were concurrent.
Sequence numbers and total ordering
Simpler than version vectors: assign incrementing sequence numbers to operations. Problem: in a distributed system there is no single counter all nodes can share. Solution: Lamport timestamps - include the node ID to break ties.
Total order broadcast
Total order broadcast (also called atomic broadcast): Protocol for exchanging messages between nodes where all nodes deliver messages in the same order. Properties:
- Reliable delivery: If message delivered to one node, delivered to all
- Totally ordered delivery: All nodes deliver messages in same order
Uses of total order broadcast:
- State machine replication: All nodes process commands in same order
- Database replication: All replicas apply updates in same order
- Serializable transactions: Assign transaction IDs in total order
Total order broadcast ≈ Consensus:
- Can implement consensus using total order broadcast
- Can implement total order broadcast using consensus
- They’re equivalent problems!
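One direction of the equivalence is easy to see: if you can pick a single sequencer node (which is itself a consensus problem), total order broadcast reduces to replaying its numbered log. A toy sketch, with all names mine:

```python
import itertools

class Sequencer:
    """Toy total order broadcast: one sequencer stamps every message
    with a global sequence number; nodes deliver in stamp order.
    (A single sequencer is a single point of failure - real systems
    use a consensus algorithm here, hence the equivalence.)"""
    def __init__(self):
        self._seq = itertools.count()
        self.log = []

    def broadcast(self, msg):
        self.log.append((next(self._seq), msg))

class Node:
    def __init__(self):
        self.delivered = []

    def catch_up(self, sequencer):
        # every node replays the sequencer's log in stamp order
        self.delivered = [m for _, m in sorted(sequencer.log)]

seq = Sequencer()
for m in ["a", "b", "c"]:
    seq.broadcast(m)
n1, n2 = Node(), Node()
n1.catch_up(seq)
n2.catch_up(seq)
assert n1.delivered == n2.delivered == ["a", "b", "c"]  # same order everywhere
```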
Distributed transactions and consensus
Now we get to the core: consensus algorithms that allow nodes to agree on something despite failures.
Two-phase commit (2PC)
We covered 2PC in Chapter 7, but let’s revisit it as a consensus algorithm. Goal: Get all nodes to agree to commit or abort a transaction.
Consensus algorithms
True consensus algorithms can tolerate node failures without blocking:
- Paxos (1989): Theoretically proven, complex
- Raft (2013): Easier to understand, becoming popular
- ZAB (ZooKeeper Atomic Broadcast): Used by Apache ZooKeeper
- Viewstamped Replication (1988): Similar to Raft
Properties a consensus algorithm must satisfy:
- Uniform agreement: All nodes decide the same value
- Integrity: No node decides twice
- Validity: If node decides v, then v was proposed by some node
- Termination: Every non-faulty node eventually decides
Raft consensus algorithm
Raft is easier to understand than Paxos. Let’s explore how it works. Key idea: Elect a leader; the leader coordinates all decisions.
Leader election
Key points:
- Each node votes for at most one candidate per term
- Candidate needs majority of votes to become leader
- Leader sends periodic heartbeats to maintain authority
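A much-simplified sketch of these rules (real Raft grants votes per incoming request and also checks log freshness, which this toy version skips; all names are mine):

```python
import random

def election_timeout(base_ms=150.0, jitter_ms=150.0):
    """Raft randomizes election timeouts so that candidates rarely
    time out simultaneously and split the vote forever."""
    return base_ms + random.uniform(0.0, jitter_ms)

def run_election(candidate, nodes, term, votes_cast):
    """A candidate wins a term iff a strict majority votes for it;
    each node grants at most one vote per term (votes_cast remembers)."""
    votes = 0
    for node in nodes:
        if (node, term) not in votes_cast:
            votes_cast[(node, term)] = candidate
            votes += 1
    return votes > len(nodes) // 2

nodes = ["n1", "n2", "n3", "n4", "n5"]
votes_cast = {}
assert run_election("n1", nodes, term=1, votes_cast=votes_cast)      # majority: leader
assert not run_election("n2", nodes, term=1, votes_cast=votes_cast)  # votes spent this term
assert 150.0 <= election_timeout() < 300.0
```

The randomized timeout is what makes termination likely in practice despite the FLP result discussed later: eventually one candidate times out alone and wins uncontested.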
Log replication
Once a leader is elected, it coordinates all client requests. Key properties:
- If two logs contain entry with same index and term, they’re identical up to that point
- If entry is committed, all future leaders will have that entry
- Committed entries are never lost
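The first property is enforced by a consistency check on AppendEntries. A simplified sketch (names are mine; real Raft also carries term numbers and commit indexes in the RPC):

```python
def append_entries(follower_log, prev_index, prev_term, entries):
    """Toy AppendEntries consistency check: the follower accepts new
    entries only if its log matches the leader's at prev_index.
    Log entries are (term, command) pairs; prev_index of -1 means
    'append from the very start of the log'."""
    if prev_index >= 0:
        if prev_index >= len(follower_log) or follower_log[prev_index][0] != prev_term:
            return False  # logs diverge: leader retries with an earlier index
    # delete any conflicting suffix, then append the leader's entries
    del follower_log[prev_index + 1:]
    follower_log.extend(entries)
    return True

log = [(1, "x=1"), (1, "x=2")]
assert append_entries(log, 1, 1, [(2, "x=3")])        # logs match at index 1: accept
assert log == [(1, "x=1"), (1, "x=2"), (2, "x=3")]
assert not append_entries(log, 5, 2, [(2, "y=1")])    # gap in follower's log: reject
```

Because each accepted append verified the entry before it, a successful match at one index inductively guarantees the two logs agree on everything up to that point.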
Handling failures
- Follower crash: Leader keeps retrying AppendEntries until follower recovers
- Leader crash: New leader elected, may need to repair logs
- Log matching principle: New leader’s log is the “truth” - it overwrites conflicting entries on followers
Consensus system invariants
Consensus algorithms maintain strong invariants. Raft maintains the following:
Election Safety:
- At most one leader per term
- Only one candidate can get majority votes in a term
Leader Append-Only:
- Leader never overwrites or deletes entries in its own log, it only appends
Log Matching:
- If two logs have same entry at same index and term, all preceding entries are identical
- Log consistency across all nodes
Leader Completeness:
- If entry committed in a term, it will be present in the logs of all future leaders
- Committed entries are never lost
State Machine Safety:
- If a node applies log entry at an index, no other node applies a different entry at same index
- All nodes execute same commands in same order
Consensus performance limitations
Consensus isn’t free - it has costs. Typical numbers:
- Commit latency: ~10-100ms
- Throughput: ~10k-100k ops/sec
- Fault tolerance: Can lose f nodes out of 2f+1
Good uses of consensus:
- Leader election
- Metadata storage (small amounts of critical data)
- Configuration management
- Lock services
Avoid consensus for:
- High-throughput data storage (use replication instead)
- Geo-distributed systems (too slow)
- Anywhere eventual consistency is acceptable
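The fault-tolerance figure above (losing f nodes out of 2f+1) follows directly from majority quorums:

```python
def majority(n):
    """Smallest number of nodes that forms a majority quorum."""
    return n // 2 + 1

def tolerated_failures(n):
    """Nodes that can fail while a majority still survives:
    for n = 2f + 1 this is exactly f."""
    return n - majority(n)

assert majority(3) == 2 and tolerated_failures(3) == 1
assert majority(5) == 3 and tolerated_failures(5) == 2
# an even-sized group is no better: 4 nodes still tolerate only 1 failure
assert tolerated_failures(4) == 1
```

This is why consensus clusters almost always have an odd number of members: the fourth node adds load without adding fault tolerance.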
Membership and coordination services
In practice, most applications don’t implement consensus algorithms directly. Instead, they use coordination services like ZooKeeper.
Apache ZooKeeper
ZooKeeper is a distributed coordination service that implements consensus (the ZAB algorithm, similar to Raft). What ZooKeeper provides:
- Linearizable key-value store
- Watch notifications for changes
- Atomic operations (test-and-set)
- Ephemeral nodes (session-based)
- Sequential nodes (automatic numbering)
Common use cases:
- Leader Election
- Distributed Locks
- Configuration Management
- Service Discovery
- Coordination Primitives
Leader election with ZooKeeper
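The standard recipe: each candidate creates an ephemeral sequential znode under an election path; the client owning the lowest sequence number is the leader, and when its session dies its znode vanishes automatically, so the next client takes over. Below is an in-memory simulation of that recipe (FakeZooKeeper is a stand-in for illustration, not the real client API; with a real client you would use a library recipe such as kazoo's Election):

```python
import itertools

class FakeZooKeeper:
    """In-memory stand-in for ZooKeeper's ephemeral sequential znodes,
    just enough to illustrate the election recipe (not a real client)."""
    def __init__(self):
        self._seq = itertools.count()
        self.znodes = {}  # path -> owning session

    def create_sequential(self, prefix, session):
        # ZooKeeper appends a monotonically increasing, zero-padded counter
        path = f"{prefix}{next(self._seq):010d}"
        self.znodes[path] = session
        return path

    def session_expired(self, session):
        # ephemeral znodes disappear when their session dies
        self.znodes = {p: s for p, s in self.znodes.items() if s != session}

def leader(zk, prefix="/election/n_"):
    """The client holding the lowest sequence number is the leader."""
    candidates = sorted(p for p in zk.znodes if p.startswith(prefix))
    return zk.znodes[candidates[0]] if candidates else None

zk = FakeZooKeeper()
zk.create_sequential("/election/n_", "node-A")
zk.create_sequential("/election/n_", "node-B")
assert leader(zk) == "node-A"   # lowest sequence number wins
zk.session_expired("node-A")    # leader crashes: its ephemeral znode vanishes
assert leader(zk) == "node-B"   # next in line takes over
```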
Distributed locks with ZooKeeper
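The lock recipe is the same idea: ephemeral sequential znodes form a queue, the lowest sequence number holds the lock, and each waiter watches only its immediate predecessor (avoiding a thundering herd when the lock is released). A simulation with all names mine:

```python
class LockQueue:
    """Simulation of ZooKeeper's lock recipe: each client's ephemeral
    sequential znode is a queue slot; the lowest slot holds the lock."""
    def __init__(self):
        self.waiters = []  # ordered like sequential znodes

    def acquire(self, client):
        """Join the queue. Returns True iff the lock is held right away;
        otherwise the client would set a watch on its predecessor."""
        self.waiters.append(client)
        return self.waiters[0] == client

    def release(self, client):
        """Delete the znode (explicit release, or session expiry).
        Returns the new lock holder, whose watch would now fire."""
        self.waiters.remove(client)
        return self.waiters[0] if self.waiters else None

q = LockQueue()
assert q.acquire("A")           # A is first in line: lock held
assert not q.acquire("B")       # B queues behind A, watching A's znode
assert q.release("A") == "B"    # A releases; B's watch fires, B holds the lock
```

Because the znodes are ephemeral, a client that crashes while holding the lock releases it automatically when its session times out; there is no separate cleanup step.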
Alternatives to ZooKeeper
| Feature | ZooKeeper | etcd | Consul |
|---|---|---|---|
| Algorithm | ZAB (Paxos-like) | Raft | Raft |
| Language | Java | Go | Go |
| API | Custom | gRPC, HTTP | HTTP, DNS |
| Watch | Yes | Yes | Yes |
| Service Discovery | Manual | Via API | Built-in |
| Health Checks | No | Via API | Built-in |
| Used By | Hadoop, Kafka | Kubernetes | HashiCorp stack |
Summary
This chapter covered the theory and practice of consistency and consensus in distributed systems.
Key concepts
Consistency Models:
Linearizability:
- Strongest single-object consistency
- System behaves as if only one copy
- Expensive: limits performance and availability
- Use for: leader election, locks, critical metadata
Ordering:
- Many problems reduce to ordering
- Causality: natural partial order (A causes B)
- Total order: artificial total order (sequence numbers)
- Lamport timestamps: Total order that respects causality
- Version vectors: Detect concurrent operations
Consensus Algorithms:
| Property | 2PC | Paxos | Raft | ZAB |
|---|---|---|---|---|
| Fault Tolerance | ❌ Coordinator SPOF | ✓ | ✓ | ✓ |
| Blocking | ❌ Yes | ✓ No | ✓ No | ✓ No |
| Complexity | Simple | Very complex | Understandable | Medium |
| Used In | Databases | Rare (theory) | etcd, Consul | ZooKeeper |
Consensus Properties:
- Agreement: All nodes decide same value
- Validity: Decided value was proposed
- Termination: All non-faulty nodes eventually decide
- Integrity: Nodes decide at most once
Practical takeaways
When to use what consistency model
Use Eventual Consistency when:
- High availability is critical
- Geographic distribution required
- Application can tolerate temporary inconsistency
- E.g., social media feeds, shopping carts
Use Causal Consistency when:
- Need to preserve cause-effect relationships
- Some inconsistency acceptable
- Better performance than linearizability needed
- E.g., comment threads, collaborative editing
Use Linearizability when:
- Strong consistency absolutely required
- Leader election or distributed locks
- Unique constraints (usernames, IDs)
- Can accept performance cost
- E.g., bank account balance, inventory count
Anti-patterns to avoid
What NOT to do:
- Don’t use consensus for high-throughput data
- Don’t use linearizable storage for everything
- Don’t implement Paxos yourself (use library)
- Don’t use ZooKeeper as a database
- Don’t ignore network partitions in design
- Don’t assume clocks are synchronized
The FLP impossibility result
Fischer, Lynch, Paterson (1985): In an asynchronous system where even one process can crash, there’s no deterministic algorithm that always reaches consensus. So how can real consensus algorithms exist? They work because:
- Real systems are not purely asynchronous (timeouts work most of the time)
- Algorithms use randomization (e.g., random election timeouts in Raft)
- Trade liveness for safety (may never terminate, but will never decide a wrong value)
Looking forward
Consensus is fundamental to distributed systems, but it’s not the only way to build reliable systems. In real-world applications:
- Most data doesn’t need consensus: Eventual consistency is fine
- Consensus for coordination only: Leader election, configuration
- Avoid consensus when possible: It’s slow and complex
- When you need it, use a library: ZooKeeper, etcd, Consul