Fault tolerance refers to a system’s ability to continue functioning correctly despite the failure of one or more of its components. In distributed systems, component failures must be assumed as normal events that happen continuously somewhere in the system. The goal is not to eliminate failures but to design the system to detect, contain, and recover from them without significant service disruption.

Why fault tolerance matters

Fault tolerance is imperative for several reasons:
  • Business continuity: Applications built on Ably serve critical functions where downtime is unacceptable
  • User experience: Even brief interruptions in service can lead to poor user experience and customer dissatisfaction
  • Data integrity: Maintaining the correctness of data and message delivery is crucial for applications that rely on accurate realtime information
  • Scale: At Ably’s scale, where the system handles billions of messages across millions of connections, component failures are expected occurrences that must be managed effectively
The aim of fault tolerance is dependability, which encompasses both availability and reliability. Availability ensures that the service is accessible when needed, while reliability ensures that the service behaves correctly.

The gossip layer: Building a resilient foundation

The gossip layer uses the gossip protocol to establish a “netmap” - a data structure that describes a cluster by listing all nodes and providing relevant details of each of those nodes. The gossip protocol determines liveness of each participant and ensures that all participants share an eventually-consistent view of each other participant.

How the gossip layer works

The gossip layer provides service discovery to all other layers of the platform:
  • Continuously monitors the health of each participant through periodic heartbeats
  • Ensures that all participants eventually share a consistent view of each other
  • Automatically adapts to changes in the cluster’s composition, making it self-healing
  • Multiple gossip nodes operate in each region, ensuring that the failure of any individual gossip node doesn’t compromise the system
The gossip layer forms the foundation upon which other fault tolerance mechanisms are built. If nodes cannot agree on the state of the cluster, higher-level fault tolerance mechanisms would be ineffective.

Service discovery

Using the netmap, the gossip layer ensures that any changes in the underlying population of nodes are:
  • Detected in realtime
  • Broadcast to all nodes
  • Assimilated by all other nodes consistently
  • Handled in a way that is tolerant to failures of the gossip nodes themselves
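The netmap and heartbeat-based liveness described above can be sketched as follows. This is an illustrative model only, not Ably’s implementation; all names, the timeout value, and the merge rule are assumptions.

```python
import time

class Netmap:
    """Illustrative netmap: node id -> metadata, plus last-heartbeat times.
    Names and the failure timeout are hypothetical, not Ably's actual code."""

    def __init__(self, failure_timeout=3.0):
        self.nodes = {}        # node_id -> {"region": ..., "roles": ...}
        self.last_seen = {}    # node_id -> timestamp of last heartbeat
        self.failure_timeout = failure_timeout

    def heartbeat(self, node_id, region, roles, now=None):
        now = time.time() if now is None else now
        self.nodes[node_id] = {"region": region, "roles": roles}
        self.last_seen[node_id] = now

    def live_nodes(self, now=None):
        now = time.time() if now is None else now
        return [n for n, t in self.last_seen.items()
                if now - t <= self.failure_timeout]

    def merge(self, other):
        # Gossip merge: keep the freshest heartbeat for each node, so all
        # participants converge on an eventually-consistent view.
        for node_id, t in other.last_seen.items():
            if t > self.last_seen.get(node_id, 0):
                self.last_seen[node_id] = t
                self.nodes[node_id] = other.nodes[node_id]

netmap = Netmap()
netmap.heartbeat("core-1", "us-east-1", ["channel"], now=100.0)
netmap.heartbeat("fe-1", "us-east-1", ["frontend"], now=101.0)
print(sorted(netmap.live_nodes(now=102.0)))  # both still live
print(netmap.live_nodes(now=110.0))          # both past the timeout
```

The key property is that `merge` is commutative and convergent: no matter in which order nodes exchange netmaps, they end up with the same view.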

The frontend layer: Stateless scalability

Frontend nodes are stateless, which simplifies the approach to fault tolerance. These nodes terminate client connections and handle REST API requests. The stateless nature of these components means that each request can be processed independently, without depending on previous requests or maintaining state between operations.

Fault tolerance through redundancy

Fault tolerance in the frontend layer is achieved primarily through redundancy:
1. Sufficient capacity: Ably ensures that there is always a sufficient population of nodes to handle the current load, plus extra capacity to absorb the impact of node failures.
2. Health checks: Continuous health checks monitor the status of frontend nodes. When a node is detected as unhealthy, it is automatically removed from the load balancer’s pool.
3. Automatic replacement: Failed frontend instances are promptly replaced through automated processes. The system maintains a defined capacity level and automatically initiates the creation of new instances when failures reduce the available capacity.
4. Client-side resilience: Client SDKs implement connection recovery mechanisms. When a client detects a connection failure, it automatically attempts to reconnect, often to a different frontend node.
The stateless nature of the frontend layer makes it straightforward to achieve fault tolerance through redundancy and rapid replacement.
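A single reconciliation pass over a stateless pool — drop unhealthy nodes, then launch replacements back up to the target capacity — can be sketched as below. The function and node names are hypothetical, chosen only to illustrate the health-check-and-replace cycle.

```python
def reconcile_pool(pool, healthy, target_capacity, next_id):
    """One illustrative reconciliation pass for a stateless frontend pool:
    remove nodes that fail their health check, then restore capacity."""
    # Step 1: drop nodes that failed the health check.
    pool = [n for n in pool if healthy(n)]
    # Step 2: launch replacement instances up to the target capacity.
    while len(pool) < target_capacity:
        pool.append(f"frontend-{next_id}")
        next_id += 1
    return pool, next_id

pool = ["frontend-1", "frontend-2", "frontend-3"]
down = {"frontend-2"}
pool, next_id = reconcile_pool(pool, lambda n: n not in down,
                               target_capacity=3, next_id=4)
print(pool)  # ['frontend-1', 'frontend-3', 'frontend-4']
```

Because the nodes are stateless, the replacement instance needs no handover: any request can be served by any member of the pool.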

The core layer: Stateful resilience

The core, or channel processing layer, is where Ably processes messages for channels. Unlike the frontend layer, the core layer is stateful, which introduces additional complexity for fault tolerance.

Challenges of stateful fault tolerance

The stateful nature of this layer means that simply replacing a failed component is not sufficient:
  • State must be preserved or recovered to maintain service continuity
  • Roles performed by the failed node must be transferred to other nodes
  • During role transfers, messages must not be lost, duplicated, or delivered out of order

Fault tolerance mechanisms

Ably implements several mechanisms to ensure fault tolerance in the stateful core layer:

Consistent hashing

Channels are assigned to core nodes using consistent hashing, which minimizes the redistribution required when nodes are added or removed. When a node fails, only the channels hosted on that node need to be relocated, rather than requiring a global reorganization.
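As a sketch of this property (not Ably’s implementation): a hash ring with virtual nodes assigns each channel to a node, and removing one node relocates only the channels that node owned — everything else keeps its placement.

```python
import hashlib
from bisect import bisect

def _h(key):
    # Stable hash onto the ring (Python's built-in hash() is randomized).
    return int(hashlib.sha256(key.encode()).hexdigest(), 16)

def build_ring(nodes, vnodes=64):
    # Each node owns many points on the ring ("virtual nodes") to even
    # out the distribution of channels across nodes.
    return sorted((_h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))

def owner(ring, channel):
    # A channel belongs to the first ring point at or after its hash.
    points = [p for p, _ in ring]
    idx = bisect(points, _h(channel)) % len(ring)
    return ring[idx][1]

nodes = ["core-a", "core-b", "core-c"]
channels = [f"channel-{i}" for i in range(1000)]
before = {c: owner(build_ring(nodes), c) for c in channels}
# core-b fails: recompute placement with the survivors only.
after = {c: owner(build_ring(["core-a", "core-c"]), c) for c in channels}
moved = [c for c in channels if before[c] != after[c]]
print(f"{len(moved)} of {len(channels)} channels relocated")  # roughly a third
```

The defining guarantee holds by construction: a channel only moves if its successor point on the ring belonged to the failed node, so `moved` is exactly the set of channels that were hosted on `core-b`.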

State redundancy

Critical state information is stored redundantly across multiple nodes. This redundancy ensures that even if a node fails, the state required to resume processing is available elsewhere in the system.

Transactional processing

Message processing is designed to be transactional, ensuring that a message is either fully processed or not processed at all. This transactional approach prevents partial updates that could lead to inconsistent state.

Message persistence

The system includes a channel persistence layer where messages are persisted in multiple locations before being acknowledged:
  • Every message is stored in RAM on two or more physically isolated datacenters within the receiving region
  • Every message is additionally stored in RAM in at least one other region, bringing the total to at least three copies
  • For persisted messages, storage across three regions is required before the message is deemed successfully stored
  • If durability requirements cannot be met, the message is rejected and the client is notified to retry
Once a message is acknowledged, it is stored in multiple physical locations, providing statistical guarantees of 99.999999% (8 nines) for message availability and survivability.
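The acknowledge-or-reject decision described above can be modelled as a durability check: count the copies successfully written and the distinct regions they landed in, and acknowledge only if both thresholds are met. This is a minimal sketch under assumed thresholds; the replica structure and names are hypothetical.

```python
def persist(message, replicas, min_copies=3, min_regions=2):
    """Illustrative durability gate: ack only once enough copies exist
    across enough regions; otherwise reject so the client can retry."""
    placed = []
    for replica in replicas:
        if replica["write"](message):   # each write may fail independently
            placed.append(replica["region"])
    if len(placed) >= min_copies and len(set(placed)) >= min_regions:
        return "ack"
    return "reject"

replicas = [
    {"region": "us-east-1", "write": lambda m: True},
    {"region": "us-east-1", "write": lambda m: True},  # isolated datacenter
    {"region": "eu-west-1", "write": lambda m: True},
]
print(persist("msg-1", replicas))  # ack: 3 copies across 2 regions

replicas[2]["write"] = lambda m: False  # cross-region write fails
print(persist("msg-1", replicas))  # reject: durability not met
```

Rejecting rather than silently acknowledging is what lets the client retry and preserves the integrity guarantee: an acknowledged message is always redundantly stored.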

Graceful degradation

The system is designed for graceful degradation, continuing to function, potentially with reduced capacity or increased latency, even when multiple failures occur simultaneously.

Handling core node failures

When a core node fails:
1. Failure detection: The failure is detected by the gossip layer, and consensus is formed that the node is no longer available.
2. Netmap propagation: The updated netmap is propagated to all remaining nodes, ensuring that all components have a consistent view of the cluster.
3. Channel relocation: Consistent hashing is used to determine the new locations for the channels previously assigned to the failed node.
4. Channel reactivation: The channels are reactivated on their new nodes, using persisted state to ensure continuity.
5. Resume processing: Processing continues with minimal disruption to service.
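The steps above can be tied together in one sketch. Everything here is illustrative: the function, the stand-in placement rule, and the persisted-state shape are assumptions, not Ably’s code.

```python
def handle_node_failure(failed, nodes, channels, placement, persisted_state):
    """Illustrative failover flow for a stateful core node."""
    # Steps 1-2: gossip consensus removes the node; the new netmap is
    # represented here by the list of surviving nodes.
    survivors = [n for n in nodes if n != failed]
    # Step 3: recompute placement only for channels on the failed node.
    relocated = {c: placement(c, survivors)
                 for c, node in channels.items() if node == failed}
    # Step 4: reactivate each relocated channel from persisted state.
    reactivated = {c: persisted_state.get(c, {}) for c in relocated}
    # Step 5: processing resumes on the new owners.
    channels.update(relocated)
    return survivors, relocated, reactivated

channels = {"orders": "core-b", "alerts": "core-a"}
placement = lambda c, ns: ns[len(c) % len(ns)]  # stand-in for the hash ring
state = {"orders": {"last_serial": 41}}
survivors, relocated, reactivated = handle_node_failure(
    "core-b", ["core-a", "core-b", "core-c"], channels, placement, state)
print(relocated)    # only 'orders' moves
print(reactivated)  # it resumes from its persisted state
```

The point the sketch makes is the division of labour: placement answers *where* a channel goes, while persisted state answers *how* it resumes without losing messages.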

Regional independence and global coordination

Ably operates across multiple geographic regions, with each region capable of functioning independently while still coordinating with other regions. This design provides several fault tolerance benefits.

Regional isolation

  • Problems in one region are contained without affecting service in other regions
  • If a region experiences failures or becomes unavailable due to infrastructure issues, other regions continue to operate normally
  • This containment prevents local problems from cascading into global outages

Continuous global service

The multi-region architecture ensures continuous global service:
  • Even if an entire region becomes unavailable, the global service continues to function
  • Clients can be redirected to healthy regions when their default region experiences issues
  • This maintains service availability during regional failures

Dynamic traffic redirection

Client traffic can be dynamically redirected between regions based on health and proximity:
  • When a region is determined to be unhealthy or unreachable, traffic is automatically routed to the next closest healthy region
  • This traffic redirection is transparent to the end user, minimizing disruption during regional failures
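A routing decision based on health and proximity can be sketched as below. The region names and latency figures are made-up illustrative values.

```python
def route(client_region, regions):
    """Pick the nearest healthy region for a client (illustrative)."""
    healthy = [r for r in regions if r["healthy"]]
    if not healthy:
        raise RuntimeError("no healthy region available")
    # Proximity modelled as a latency estimate from the client's region.
    return min(healthy, key=lambda r: r["latency_ms"][client_region])["name"]

regions = [
    {"name": "us-east-1", "healthy": True,
     "latency_ms": {"us": 10, "eu": 80}},
    {"name": "eu-west-1", "healthy": True,
     "latency_ms": {"us": 80, "eu": 10}},
]
print(route("eu", regions))   # eu-west-1: closest healthy region
regions[1]["healthy"] = False
print(route("eu", regions))   # us-east-1: next closest healthy region
```

Because the decision is recomputed per connection attempt, a client that reconnects after a regional failure lands in a healthy region without any action on its part.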

Cross-region data replication

Critical data is replicated across regions, ensuring it remains available even during regional outages:
  • This replication is particularly important for stateful services, where continuity of state is essential for maintaining correct operation
  • By distributing data geographically, Ably creates a resilient system that can withstand even large-scale regional failures
The gossip protocol extends across regions, enabling global coordination while maintaining regional independence. This coordination ensures that all regions share a consistent view of the global system state, enabling efficient routing and failover decisions.

Gradations of failure and health

In traditional fault tolerance theory, components are often modeled as either functioning correctly or failing completely. In reality, failures in distributed systems are typically non-binary and can manifest in various ways.

Non-binary failures

Components may fail partially or intermittently:
  • A node might respond to some requests but not others
  • A node might respond with increased latency or error rates
  • These partial failures can be more difficult to detect than complete failures

Multi-dimensional health monitoring

The system employs multiple health metrics to determine the status of components:
  • Response time
  • Error rates
  • Resource utilization
  • Application-specific indicators
By considering multiple dimensions of health, the system can detect various degradations that might otherwise go unnoticed.

Health scoring

Health assessment is a continuous process:
  • Rather than making binary healthy/unhealthy determinations, the system assigns health scores
  • Health scores reflect the gradual nature of performance degradation
  • This nuanced approach allows for more informed decisions about traffic routing and component replacement

Graduated remediation

Remediation strategies are tailored to the specific type and severity of failures:
  • Minor degradations might trigger increased monitoring or load reduction
  • More severe issues prompt component replacement or traffic redirection
  • This graduated response ensures that remediation is proportional to the impact of the failure
The challenge of dealing with non-binary health is further complicated in a globally distributed system. Ably addresses this challenge through consensus formation mechanisms and cross-validation between independent monitoring systems.

Resource implications of fault tolerance

Implementing robust fault tolerance has significant resource implications that must be considered in system design and capacity planning.

Redundancy overhead

Fault tolerance often requires running more instances than would be needed just to handle the current load:
  • This redundancy ensures that there is sufficient capacity to absorb the impact of component failures without service degradation
  • The level of redundancy required depends on the service level objectives and the expected failure rate of components

Resource margin

Instances cannot be run at 100% capacity during normal operation:
  • A resource margin must be maintained to handle failure scenarios
  • This margin ensures that when failures occur, the remaining components can accommodate the redistributed load
  • The necessary margin depends on factors such as the frequency and scope of expected failures, the elasticity of the system, and the acceptable performance impact during failure events
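The margin arithmetic follows from a simple constraint: if k of n nodes can fail and the survivors must absorb the full load, then steady-state utilization cannot exceed (n − k) / n. A minimal worked example:

```python
def max_safe_utilization(n_nodes, tolerated_failures):
    """Upper bound on steady-state utilization if the surviving
    nodes must absorb the full load after k failures (illustrative)."""
    return (n_nodes - tolerated_failures) / n_nodes

# With 10 nodes and tolerance for 2 simultaneous failures,
# each node should run at no more than 80% of capacity:
print(max_safe_utilization(10, 2))  # 0.8
```

The smaller the cluster, the larger the relative margin: tolerating one failure in a 4-node cluster already caps utilization at 75%, which is one reason redundancy overhead is a real cost rather than a rounding error.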

Reactive scaling

When failures occur, the system may need to scale up quickly to maintain service levels:
  • This requires the underlying infrastructure to support rapid scaling
  • Automated scaling mechanisms must respond to changing demand
  • Reactive scaling must be fast enough to prevent service degradation during the scaling period

Recursive fault tolerance

The fault tolerance mechanisms themselves must be fault-tolerant:
  • The system that detects and responds to failures must itself be resilient to failures
  • This often requires its own redundancy and fallback mechanisms
  • This recursive fault tolerance adds complexity and resource requirements to the system

Resource isolation

A particularly challenging aspect is that fault tolerance mechanisms require resources to operate:
  • If a disruption occurs because resources are exhausted, the very mechanisms designed to handle the failure may be unable to function properly
  • This creates a need for careful resource isolation and prioritization
  • Fault tolerance mechanisms must have guaranteed access to the resources they need, even during resource contention

Next steps

  • Edge network: Learn about Ably’s edge network architecture and resilience
  • Performance: Understand how Ably maintains performance at scale
  • Scalability: Explore Ably’s horizontal scalability approach
  • Connection recovery: Learn about connection state recovery mechanisms
