Introduction
In previous chapters, we’ve discussed replication (Chapter 5), partitioning (Chapter 6), and transactions (Chapter 7). These techniques help build reliable systems from unreliable components, but we’ve glossed over many problems that occur in distributed systems. Working with distributed systems is fundamentally different from writing software on a single computer: a program on a single computer either works or it doesn’t, with little middle ground, whereas in a distributed system partial failures are the norm. This chapter explores the harsh realities of distributed systems: all the things that can go wrong, and what we can (and cannot) do about them.
Why build distributed systems?
Despite the complexity, we build distributed systems because:
- Scalability: a single machine has limited resources
- Fault tolerance: Hardware failures are inevitable
- Low latency: Serve users from geographically distributed locations
- Availability: Keep system running even when parts fail
Faults and partial failures
Single computer vs. distributed system
Single computer (idealistic model):
- Either works correctly, or doesn’t work at all
- When a hardware fault occurs, the system usually crashes completely
- Software bugs are deterministic (same input → same output)
Distributed system (reality):
- Some parts work, some parts fail (partial failures)
- Network may lose, delay, or duplicate messages
- Different nodes may see events in different orders
- Nondeterministic behavior
Partial failures are nondeterministic
In distributed systems, you cannot easily distinguish between:
- Node crashed
- Network connection broken
- Node slow (still processing)
- Request lost in network
- Response lost in network
Unreliable networks
Most distributed systems use asynchronous networks (like the internet). In an asynchronous network, there are no guarantees about:
- When a message will arrive (if at all)
- How long it will take
- Whether the recipient is alive
Network faults in practice
Network problems are more common than you might think. Example incidents:
- Sharks biting undersea cables (this has really happened!)
- Network switches fail or are misconfigured
- Data center network partitions (split into two groups that can’t communicate)
- Network congestion (packets delayed or dropped)
- Entire data center connectivity lost (power failure, backhoe cuts fiber)
Detecting faults
How can you tell if a remote node is down? The most common approach: timeouts.
Timeouts and unbounded delays
In an asynchronous network, there’s no upper bound on how long a message can take. Why packets get delayed:
- Application queue waiting to send
- OS network queue
- Network switch queue
- Recipient’s OS queue
- Recipient application queue/busy
- TCP congestion control
- Network congestion
Most latency comes from queuing, not the physical network.
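The timeout-based detection described above can be sketched in a few lines (a minimal sketch; the probed host and port are placeholders, and a timeout tells you only that no reply arrived in time, not why):

```python
import socket

def is_node_reachable(host: str, port: int, timeout_s: float = 1.0) -> bool:
    """Probe a node with a bounded wait.

    A False result is ambiguous: the node may be down, merely slow,
    or the network may have lost the request or the response.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:  # covers timeouts, refusals, and unreachable hosts
        return False
```

In practice the timeout value is a trade-off: too short, and slow-but-healthy nodes are declared dead; too long, and real failures go undetected.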
Network congestion and queueing
TCP congestion control: the sender adjusts its rate based on packet loss.
Switch queues: when the network is congested, switches queue packets.
Unreliable clocks
Distributed systems need to measure time for:
- Request timeouts
- Performance metrics
- Ordering events
- Expiring cached data
Types of clocks
Time-of-day clocks (wall-clock time):
- Return the current date and time
- Usually synchronized with NTP (Network Time Protocol)
- Can jump backward or forward (when synchronized)
Monotonic clocks:
- Only go forward
- Measure durations/intervals
- Values cannot be compared between different machines
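The distinction matters in code. Python, for example, exposes both kinds of clock in the standard library (a minimal illustration):

```python
import time

# Wall-clock time: roughly comparable across machines, but it can jump
# backward or forward when NTP adjusts the clock.
t_wall = time.time()

# Monotonic clock: only moves forward, so it is safe for measuring
# durations; its absolute value is meaningless across machines.
start = time.monotonic()
time.sleep(0.01)
elapsed = time.monotonic() - start  # unaffected by NTP clock steps
```

Using `time.time()` to measure a duration can yield a negative elapsed time if the clock is stepped backward mid-measurement; `time.monotonic()` cannot.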
Clock synchronization
Clocks on different machines can drift apart. NTP (Network Time Protocol) synchronizes clocks. Problems with NTP:
- Network delay is variable
- Round-trip time can’t be measured accurately
- Synchronization accuracy is limited (~1-10 ms on a LAN, worse over the internet)
- The clock can jump backward when stepped
- Clock drift (clocks running at different rates)
Relying on synchronized clocks
Software that depends on accurately synchronized clocks (for example, last-write-wins conflict resolution) tends to fail silently: a drifting clock causes quiet data loss rather than a dramatic crash, so clock offsets must be monitored.
Process pauses
Even more insidious: a process can be paused at any time! Causes of process pauses:
- Garbage collection (GC): 10 ms to over a minute
- Virtual machine suspended (cloud infrastructure)
- Operating system context switch
- Paging (memory swapped to disk)
- SIGSTOP signal (debugging)
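The danger is that a pause can strike between a check and the action it guards. A hypothetical lease-holding sketch (the `write` stub and the 10-second lease are illustrative):

```python
import time

processed = []

def write(request):
    processed.append(request)  # stand-in for a real storage write

lease_expiry = time.monotonic() + 10.0  # hypothetical 10-second lease

def process_if_leader(request):
    # DANGER: between this check and the write below, the process may be
    # paused (GC, VM suspension, paging) for longer than the remaining
    # lease. The check is then stale, but the write happens anyway.
    if time.monotonic() < lease_expiry:
        write(request)
```

The usual fix is not a longer lease but a fencing token that the storage service itself checks.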
Knowledge, truth, and lies
In a distributed system, a node cannot necessarily trust its own judgment. It must rely on messages from other nodes.
The truth is defined by the majority
Quorum: decisions are made by majority vote.
Split brain problem
Problem: a network partition causes multiple leaders.
Without majority vote: both think they’re leader → data corruption!
With majority vote: only one partition can have a majority.
Byzantine faults
So far we’ve assumed nodes are honest but unreliable (crash-stop failures). But what if nodes are malicious, or have bugs that cause arbitrary behavior? Byzantine fault: a node behaves in arbitrary, possibly malicious ways. Examples of Byzantine faults:
- Node sends corrupted data due to hardware error (bit flips)
- Node has software bug that causes wrong behavior
- Node is compromised by attacker
- Node intentionally lies to gain advantage
Byzantine fault tolerant systems:
- PBFT (Practical Byzantine Fault Tolerance)
- Blockchain consensus (Bitcoin, Ethereum)
- Require 2f + 1 honest nodes (3f + 1 total) to tolerate f Byzantine nodes
When to use Byzantine Fault Tolerance
When to use BFT:
- Aerospace systems (cosmic rays cause bit flips)
- Blockchain/cryptocurrency (don’t trust participants)
- Systems with untrusted participants
When NOT to use BFT:
- Most corporate data centers (you trust your own hardware)
- Too expensive for most applications (3x overhead minimum)
- Simpler crash-fault-tolerance usually sufficient
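The 2f + 1 honest-node requirement implies a total cluster size of 3f + 1, since up to f of the nodes may be Byzantine. As a quick arithmetic sketch:

```python
def min_cluster_size(f: int) -> int:
    """Minimum total nodes to tolerate f Byzantine nodes.

    BFT protocols such as PBFT need 2f + 1 honest nodes; with up to
    f faulty nodes on top of those, the total is 3f + 1.
    """
    return 3 * f + 1
```

So tolerating even a single Byzantine node already requires four nodes, which is part of why BFT is considered too expensive for most applications.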
System models
To reason about distributed systems, we use system models: assumptions about what can go wrong.
Timing assumptions
Synchronous model:
- Bounded network delay
- Bounded process execution time
- Bounded clock drift
- Used in: Real-time systems, aircraft control
Partially synchronous model:
- Usually bounded delays
- Occasionally unbounded
- System works most of the time
- Used in: Most distributed databases (most realistic model)
Asynchronous model:
- No timing assumptions
- No bound on delays
- No bound on clock drift
- Used in: Theoretical analysis, impossibility results
Node failure models
Crash-Stop:
- Node works correctly until it crashes
- After crash: stays down
Crash-Recovery:
- Node may crash
- Can restart
- May lose in-memory state
- Persistent storage survives
Byzantine (arbitrary) faults:
- Node can behave arbitrarily
- May lie, corrupt data
- Hardest to handle
Algorithm correctness
Distributed algorithms must satisfy two kinds of properties.
Safety: nothing bad happens
- Example: No two nodes both think they’re leader
- Example: All committed transactions are durable
Liveness: something good eventually happens
- Example: System eventually processes requests
- Example: Leader election eventually succeeds
Summary
Distributed systems face many challenges that don’t exist in single-machine systems.
Network problems
Unreliable networks:
- Messages can be lost, delayed, or duplicated
- No way to distinguish between different kinds of failure
- Timeouts are the only mechanism to detect failures
Solutions:
- Retry with exponential backoff
- Idempotent operations (safe to retry)
- Careful timeout selection
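These retry rules can be combined in a small helper (a sketch; the attempt limit and delays are arbitrary, and it is only safe when `operation` is idempotent):

```python
import random
import time

def retry_with_backoff(operation, max_attempts=5, base_delay_s=0.1):
    """Retry an idempotent operation with exponential backoff and jitter.

    If the operation is not idempotent, a retry after a lost *response*
    (rather than a lost request) would apply it twice.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            # Double the delay each attempt; jitter avoids retry storms
            # where many clients hammer a recovering node in lockstep.
            delay = base_delay_s * (2 ** attempt) * random.uniform(0.5, 1.5)
            time.sleep(delay)
```

Note that the caller still cannot tell whether a failed attempt meant the operation never ran or only its response was lost, which is exactly why idempotency is required.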
Clock problems
Unreliable clocks:
- Clocks can jump backward or forward
- Clocks on different machines drift apart
- Process pauses can happen at any time
Solutions:
- Use monotonic clocks for measuring durations
- Don’t rely on synchronized clocks for ordering events
- Use logical clocks (Lamport timestamps, vector clocks)
- Fencing tokens to prevent stale operations
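A Lamport timestamp, mentioned above, can be sketched in a few lines (a minimal single-counter version; real implementations also attach a node ID to break ties):

```python
class LamportClock:
    """Logical clock: orders events without relying on wall-clock time."""

    def __init__(self):
        self.time = 0

    def tick(self) -> int:
        """Local event: advance the clock and timestamp the event."""
        self.time += 1
        return self.time

    def receive(self, remote_time: int) -> int:
        """On message receipt, jump ahead of the sender's timestamp."""
        self.time = max(self.time, remote_time) + 1
        return self.time
```

The `receive` rule guarantees that if event A causally precedes event B, A’s timestamp is smaller, even though the two nodes’ wall clocks may disagree.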
Key principles
- Assume nothing: Networks, clocks, and nodes are unreliable
- Detect failures: Use timeouts and heartbeats
- Use quorums: Decisions by majority vote
- Idempotency: Make operations safe to retry
- Fencing: Prevent stale operations
- End-to-end argument: Application must handle failures
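The fencing principle can be sketched as a storage service that remembers the highest token it has seen (a hypothetical in-memory version; in a real system the lock service issues the monotonically increasing tokens):

```python
class FencedStorage:
    """Storage that rejects writes carrying a stale fencing token.

    A client that was paused and lost its lease still holds an old
    token, so its delayed write is rejected instead of corrupting data.
    """

    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, token: int, key: str, value: str) -> bool:
        if token < self.highest_token:
            return False  # stale client: reject the write
        self.highest_token = token
        self.data[key] = value
        return True
```

This pushes the final safety check to the resource itself, in the spirit of the end-to-end argument above.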
Next steps
The challenges discussed in this chapter are fundamental to distributed systems. In Chapter 9, we’ll see how consensus algorithms (Paxos, Raft) solve some of these problems by:
- Electing leaders safely
- Replicating data with strong consistency
- Maintaining correctness despite failures
Key trade-offs remain:
- Consistency vs. Availability
- Latency vs. Correctness
- Simplicity vs. Fault-tolerance