Why fault tolerance matters
Fault tolerance is imperative for several reasons:

- Business continuity: Applications built on Ably serve critical functions where downtime is unacceptable
- User experience: Even brief interruptions in service can lead to poor user experience and customer dissatisfaction
- Data integrity: Maintaining the correctness of data and message delivery is crucial for applications that rely on accurate realtime information
- Scale: At Ably’s scale, where the system handles billions of messages across millions of connections, component failures are expected occurrences that must be managed effectively
The gossip layer: Building a resilient foundation
The gossip layer uses the gossip protocol to establish a “netmap”: a data structure that describes a cluster by listing all of its nodes together with the relevant details of each node. The gossip protocol determines the liveness of each participant and ensures that all participants share an eventually-consistent view of one another.

How the gossip layer works
The gossip layer provides service discovery to all other layers of the platform. It:

- Continuously monitors the health of each participant through periodic heartbeats
- Ensures that all participants eventually share a consistent view of each other
- Automatically adapts to changes in the cluster’s composition, making it self-healing
- Multiple gossip nodes operate in each region, ensuring that the failure of any individual gossip node doesn’t compromise the system
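The heartbeat-based liveness check described above can be sketched as follows. This is an illustrative model, not Ably's implementation: the heartbeat interval and miss threshold are assumed values, and real gossip protocols add indirect probing and rumor propagation on top of this basic rule.

```python
# Assumed parameters for illustration only.
HEARTBEAT_INTERVAL = 1.0   # seconds between expected heartbeats
SUSPECT_AFTER_MISSES = 3   # missed intervals before a node is suspected

class LivenessTracker:
    def __init__(self):
        self.last_seen = {}  # node_id -> timestamp of last heartbeat

    def record_heartbeat(self, node_id, now):
        self.last_seen[node_id] = now

    def suspected_nodes(self, now):
        """Nodes that have missed too many heartbeat intervals."""
        deadline = SUSPECT_AFTER_MISSES * HEARTBEAT_INTERVAL
        return {n for n, t in self.last_seen.items() if now - t > deadline}

tracker = LivenessTracker()
tracker.record_heartbeat("node-a", now=0.0)
tracker.record_heartbeat("node-b", now=0.0)
tracker.record_heartbeat("node-a", now=4.0)   # node-b has gone quiet
print(tracker.suspected_nodes(now=4.0))       # {'node-b'}
```

A node that resumes heartbeating is automatically cleared of suspicion on its next `record_heartbeat` call, which mirrors the self-healing behavior described above.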
Service discovery
Using the netmap, the gossip layer ensures that any changes in the underlying population of nodes are:

- Detected in realtime
- Broadcast to all nodes
- Assimilated by all other nodes consistently
- Handled in a way that is tolerant to failures of the gossip nodes themselves
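One way to picture how broadcast updates are assimilated consistently is a versioned netmap, where each node adopts the newest version it has seen. This is a hedged sketch: the field names (`version`, `nodes`, `region`, `roles`) are illustrative assumptions, not Ably's actual schema.

```python
# Sketch: nodes converge on the same cluster view by always keeping
# the netmap with the highest version number they have seen.

def assimilate(local_netmap, incoming_netmap):
    """Adopt the incoming netmap if it is newer than the local one."""
    if incoming_netmap["version"] > local_netmap["version"]:
        return incoming_netmap
    return local_netmap

netmap_v1 = {
    "version": 1,
    "nodes": {
        "core-1": {"region": "us-east-1", "roles": ["channel"]},
        "core-2": {"region": "us-east-1", "roles": ["channel"]},
    },
}
# core-2 fails; the gossip layer broadcasts a newer netmap without it.
netmap_v2 = {
    "version": 2,
    "nodes": {"core-1": {"region": "us-east-1", "roles": ["channel"]}},
}

local = assimilate(netmap_v1, netmap_v2)
print(sorted(local["nodes"]))  # ['core-1']
```

Because assimilation is idempotent and order-insensitive (stale updates are ignored), every node eventually converges on the same view regardless of the order in which gossip messages arrive.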
The frontend layer: Stateless scalability
Frontend nodes are stateless, which simplifies the approach to fault tolerance. These nodes terminate client connections and handle REST API requests. Because they are stateless, each request can be processed independently, without depending on previous requests or maintaining state between operations.

Fault tolerance through redundancy
Fault tolerance in the frontend layer is achieved primarily through redundancy.

Sufficient capacity
Ably ensures that there is always a sufficient population of nodes to handle the current load, plus extra capacity to absorb the impact of node failures.
Health checks
Continuous health checks monitor the status of frontend nodes. When a node is detected as unhealthy, it is automatically removed from the load balancer’s pool.
Automatic replacement
Failed frontend instances are promptly replaced through automated processes. The system maintains a defined capacity level and automatically initiates the creation of new instances when failures reduce the available capacity.
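The health-check and automatic-replacement pattern above can be sketched as a reconciliation loop that removes unhealthy nodes from the pool and adds replacements until the defined capacity is restored. The capacity value and node naming are illustrative assumptions.

```python
DESIRED_CAPACITY = 3  # assumed target pool size for illustration

def reconcile(pool, is_healthy, next_id):
    """Drop unhealthy nodes, then add replacements up to DESIRED_CAPACITY."""
    pool = [node for node in pool if is_healthy(node)]
    while len(pool) < DESIRED_CAPACITY:
        pool.append(f"frontend-{next_id}")
        next_id += 1
    return pool

def healthy(node):
    return node != "frontend-2"  # frontend-2 fails its health check

pool = reconcile(["frontend-1", "frontend-2", "frontend-3"], healthy, next_id=4)
print(pool)  # ['frontend-1', 'frontend-3', 'frontend-4']
```

Because the nodes are stateless, the replacement instance needs no handover from the failed one; it simply starts accepting traffic from the load balancer.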
The core layer: Stateful resilience
The core layer, also called the channel processing layer, is where Ably processes messages for channels. Unlike the frontend layer, the core layer is stateful, which introduces additional complexity for fault tolerance.

Challenges of stateful fault tolerance
The stateful nature of this layer means that simply replacing a failed component is not sufficient:

- State must be preserved or recovered to maintain service continuity
- Roles performed by the failed node must be transferred to other nodes
- During role transfers, messages must not be lost, duplicated, or delivered out of order
Fault tolerance mechanisms
Ably implements several mechanisms to ensure fault tolerance in the stateful core layer.

Consistent hashing
Channels are assigned to core nodes using consistent hashing, which minimizes the redistribution required when nodes are added or removed. When a node fails, only the channels hosted on that node need to be relocated, rather than requiring a global reorganization.

State redundancy
Critical state information is stored redundantly across multiple nodes. This redundancy ensures that even if a node fails, the state required to resume processing is available elsewhere in the system.

Transactional processing
Message processing is designed to be transactional, ensuring that a message is either fully processed or not processed at all. This approach prevents partial updates that could lead to inconsistent state.

Message persistence
The system includes a channel persistence layer where messages are persisted in multiple locations before being acknowledged:

- Every message is stored in RAM in two or more physically isolated datacenters within the receiving region
- Every message is additionally stored in RAM in at least one other region, bringing the total to at least three copies
- For persisted messages, storage across three regions is required before the message is deemed successfully stored
- If durability requirements cannot be met, the message is rejected and the client is notified to retry
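The acknowledgement rule described in the list above can be expressed as a small predicate. The thresholds follow the text (two isolated datacenters in the receiving region, at least one additional region, three regions for persisted messages); the copy-placement machinery itself is elided, and the function signature is a hypothetical sketch.

```python
def can_acknowledge(copies, persisted=False):
    """copies: list of (region, datacenter) pairs where the message landed.

    The first entry is assumed to be in the receiving region.
    """
    if not copies:
        return False
    receiving_region = copies[0][0]
    regions = {region for region, _ in copies}
    local_dcs = {dc for region, dc in copies if region == receiving_region}
    if len(local_dcs) < 2:          # two isolated datacenters in the receiving region
        return False
    if persisted:
        return len(regions) >= 3    # persisted messages need three regions
    return len(regions) >= 2        # otherwise at least one additional region

copies = [("us-east-1", "dc-a"), ("us-east-1", "dc-b"), ("eu-west-1", "dc-c")]
print(can_acknowledge(copies))                  # True: 2 local DCs + 1 other region
print(can_acknowledge(copies, persisted=True))  # False: only 2 regions, so reject/retry
```

When the predicate fails, the message is rejected and the client is notified to retry, as described above; the client rather than the server owns the retry, which keeps the durability guarantee explicit.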
Graceful degradation
The system is designed for graceful degradation: it continues to function, potentially with reduced capacity or increased latency, even when multiple failures occur simultaneously.

Handling core node failures
When a core node fails, the following steps occur:

Failure detection
The failure is detected by the gossip layer, and consensus is formed that the node is no longer available.
Netmap propagation
The updated netmap is propagated to all remaining nodes, ensuring that all components have a consistent view of the cluster.
Channel relocation
Consistent hashing is used to determine the new locations for the channels previously assigned to the failed node.
Channel reactivation
The channels are reactivated on their new nodes, using persisted state to ensure continuity.
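The relocation step can be illustrated with a minimal consistent-hash ring. This is assumed mechanics for illustration, not Ably's exact placement scheme: each node owns many points on a ring, a channel maps to the next point clockwise, and removing a failed node's points relocates only the channels that node was hosting.

```python
import hashlib
from bisect import bisect

def _hash(key):
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes=50):
        # Each node contributes `vnodes` points to smooth the distribution.
        self.points = sorted(
            (_hash(f"{n}:{i}"), n) for n in nodes for i in range(vnodes)
        )

    def node_for(self, channel):
        keys = [p for p, _ in self.points]
        i = bisect(keys, _hash(channel)) % len(self.points)  # wrap around the ring
        return self.points[i][1]

channels = [f"channel-{i}" for i in range(1000)]
before = Ring(["core-1", "core-2", "core-3"])
after = Ring(["core-1", "core-3"])            # core-2 fails
moved = [c for c in channels if before.node_for(c) != after.node_for(c)]

# Only channels previously hosted on core-2 relocate; everything else stays put.
assert all(before.node_for(c) == "core-2" for c in moved)
print(f"{len(moved)} of {len(channels)} channels relocated")
```

This is the property the text relies on: removing a node's points only changes the mapping for keys whose successor point belonged to that node, so there is no global reorganization.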
Regional independence and global coordination
Ably operates across multiple geographic regions, with each region capable of functioning independently while still coordinating with other regions. This design provides several fault tolerance benefits.

Regional isolation
- Problems in one region are contained without affecting service in other regions
- If a region experiences failures or becomes unavailable due to infrastructure issues, other regions continue to operate normally
- This containment prevents local problems from cascading into global outages
Continuous global service
The multi-region architecture ensures continuous global service:

- Even if an entire region becomes unavailable, the global service continues to function
- Clients can be redirected to healthy regions when their default region experiences issues
- This maintains service availability during regional failures
Dynamic traffic redirection
Client traffic can be dynamically redirected between regions based on health and proximity:

- When a region is determined to be unhealthy or unreachable, traffic is automatically routed to the next closest healthy region
- This traffic redirection is transparent to the end user, minimizing disruption during regional failures
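The fallback-to-next-closest rule above reduces to a simple ordered scan. Region names and the health lookup are illustrative assumptions; in practice the proximity ordering would come from latency measurements or DNS-based routing.

```python
def choose_region(regions_by_proximity, healthy):
    """Pick the nearest healthy region; regions_by_proximity is nearest-first."""
    for region in regions_by_proximity:
        if healthy(region):
            return region
    raise RuntimeError("no healthy region available")

nearest_first = ["us-east-1", "us-west-1", "eu-west-1"]
health = {"us-east-1": False, "us-west-1": True, "eu-west-1": True}

print(choose_region(nearest_first, health.get))  # us-west-1
```

Because the client simply receives an endpoint in the chosen region, the fallback is transparent to the end user, as noted above.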
Cross-region data replication
Critical data is replicated across regions, ensuring it remains available even during regional outages:

- This replication is particularly important for stateful services, where continuity of state is essential for maintaining correct operation
- By distributing data geographically, Ably creates a resilient system that can withstand even large-scale regional failures
Gradations of failure and health
In traditional fault tolerance theory, components are often modeled as either functioning correctly or failing completely. In reality, failures in distributed systems are typically non-binary and can manifest in various ways.

Non-binary failures
Components may fail partially or intermittently:

- A node might respond to some requests but not others
- A node might respond with increased latency or error rates
- These partial failures can be more difficult to detect than complete failures
Multi-dimensional health monitoring
The system employs multiple health metrics to determine the status of components:

- Response time
- Error rates
- Resource utilization
- Application-specific indicators
Health scoring
Health assessment is a continuous process:

- Rather than making binary healthy/unhealthy determinations, the system assigns health scores
- Health scores reflect the gradual nature of performance degradation
- This nuanced approach allows for more informed decisions about traffic routing and component replacement
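A health score of this kind might combine normalized metrics into a single value in [0, 1]. The weights, saturation points, and metric choices below are entirely assumed for illustration; they are not Ably's scoring formula.

```python
def health_score(latency_ms, error_rate, cpu_util):
    """Combine metrics into one score: 1.0 is fully healthy, 0.0 is failed."""
    latency_penalty = min(latency_ms / 1000.0, 1.0)      # saturate at 1 s
    error_penalty = min(error_rate * 10, 1.0)            # 10% errors saturates
    return max(0.0, 1.0
               - 0.4 * latency_penalty
               - 0.4 * error_penalty
               - 0.2 * cpu_util)

good = health_score(latency_ms=50, error_rate=0.0, cpu_util=0.3)   # near 1.0
bad = health_score(latency_ms=800, error_rate=0.08, cpu_util=0.9)  # near 0.0
print(round(good, 2), round(bad, 2))
```

The point of the continuous score is that the second node is clearly degraded without being declared dead outright, which is what enables the graduated responses described next.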
Graduated remediation
Remediation strategies are tailored to the specific type and severity of failures:

- Minor degradations might trigger increased monitoring or load reduction
- More severe issues prompt component replacement or traffic redirection
- This graduated response ensures that remediation is proportional to the impact of the failure
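The graduated response can be pictured as a mapping from score bands to actions. The thresholds and action names are illustrative assumptions that mirror the escalation described in the list above.

```python
def remediation(score):
    """Map a continuous health score to a proportionate action."""
    if score >= 0.8:
        return "none"
    if score >= 0.5:
        return "increase-monitoring"   # minor degradation
    if score >= 0.2:
        return "shed-load"             # reduce traffic to the node
    return "replace-and-redirect"      # severe failure

print([remediation(s) for s in (0.95, 0.6, 0.3, 0.1)])
# ['none', 'increase-monitoring', 'shed-load', 'replace-and-redirect']
```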
Resource implications of fault tolerance
Implementing robust fault tolerance has significant resource implications that must be considered in system design and capacity planning.

Redundancy overhead
Fault tolerance often requires running more instances than would be needed just to handle the current load:

- This redundancy ensures that there is sufficient capacity to absorb the impact of component failures without service degradation
- The level of redundancy required depends on the service level objectives and the expected failure rate of components
Resource margin
Instances cannot be run at 100% capacity during normal operation:

- A resource margin must be maintained to handle failure scenarios
- This margin ensures that when failures occur, the remaining components can accommodate the redistributed load
- The necessary margin depends on factors such as the frequency and scope of expected failures, the elasticity of the system, and the acceptable performance impact during failure events
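The redundancy and margin considerations above amount to a back-of-envelope calculation: how many nodes are needed so that the survivors of F simultaneous failures can carry the full load without exceeding a target utilization. All the numbers below are assumptions for illustration.

```python
import math

def nodes_needed(total_load, per_node_capacity, target_util, tolerated_failures):
    """Smallest N such that N - F surviving nodes carry the load below target_util."""
    usable = per_node_capacity * target_util   # capacity per node at the margin
    return math.ceil(total_load / usable) + tolerated_failures

# Example: 90k units of load, 10k per node, run nodes at most 70% hot,
# and survive 2 simultaneous failures.
print(nodes_needed(90_000, 10_000, 0.70, 2))  # 15 nodes
```

The difference between this figure and the naive `ceil(load / capacity)` count is the cost of fault tolerance: both the headroom on every node and the extra nodes held in reserve.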
Reactive scaling
When failures occur, the system may need to scale up quickly to maintain service levels:

- This requires the underlying infrastructure to support rapid scaling
- Automated scaling mechanisms must respond to changing demand
- Reactive scaling must be fast enough to prevent service degradation during the scaling period
Recursive fault tolerance
The fault tolerance mechanisms themselves must be fault-tolerant:

- The system that detects and responds to failures must itself be resilient to failures
- This often requires its own redundancy and fallback mechanisms
- This recursive fault tolerance adds complexity and resource requirements to the system
Resource isolation
A particularly challenging aspect is that fault tolerance mechanisms require resources to operate:

- If a disruption occurs because resources are exhausted, the very mechanisms designed to handle the failure may be unable to function properly
- This creates a need for careful resource isolation and prioritization
- Fault tolerance mechanisms must have guaranteed access to the resources they need, even during resource contention
Next steps
Edge network
Learn about Ably’s edge network architecture and resilience
Performance
Understand how Ably maintains performance at scale
Scalability
Explore Ably’s horizontal scalability approach
Connection recovery
Learn about connection state recovery mechanisms
