Overview
NATS clustering allows multiple NATS servers to form a highly available messaging system. Servers in a cluster form a full mesh topology where each server connects to every other server, enabling transparent failover and load distribution.
How Clustering Works
When you configure multiple NATS servers to form a cluster, they establish route connections between each other. These route connections:
- Propagate subscription interest across the cluster
- Forward messages to remote subscribers
- Share cluster topology information
- Enable automatic failover when servers go offline
Full Mesh Topology
In a NATS cluster, every server maintains a direct connection to every other server. This full mesh topology provides:
- No single point of failure: Clients can connect to any server
- Minimal message hops: At most one hop to reach any subscriber
- Fast failure detection: Direct connections detect failures quickly
- Simple topology: No complex routing decisions needed
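The cost of a full mesh is easy to quantify: with n servers, each server holds n-1 routes and the cluster holds n(n-1)/2 in total. The sketch below computes both numbers. The function name is illustrative, and the count assumes one route per server pair, so it ignores connection pooling.

```python
def mesh_route_counts(n_servers: int) -> tuple[int, int]:
    """Return (routes per server, total routes) for a full mesh of n servers."""
    if n_servers < 1:
        raise ValueError("a cluster needs at least one server")
    per_server = n_servers - 1                   # one route to every other server
    total = n_servers * (n_servers - 1) // 2     # each pair shares a single route
    return per_server, total


for n in (3, 5, 7, 9):
    per_server, total = mesh_route_counts(n)
    print(f"{n} servers: {per_server} routes per server, {total} routes in total")
```

The total grows quadratically with cluster size, which is one reason very large clusters become expensive to operate.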
Route Connections
Route connections are special server-to-server connections that differ from client connections.
Route Types
From server/route.go:36-44, NATS distinguishes between:
- Explicit routes: Configured directly in the server configuration
- Implicit routes: Learned dynamically from other cluster members
Route Protocol
Route connections use a specialized protocol from server/route.go:126-130:
- CONNECT: Establishes the route connection with authentication
- INFO: Exchanges server metadata and capabilities
- Account-aware protocols: A+/A- for account subscription management
- Remote subscription protocols: RS+/RS- for interest propagation
- Leaf node protocols: LS+/LS- for leaf node subscriptions
Connection Pooling
Modern NATS servers support route connection pooling:
- Multiple route connections per server pair
- Reduces lock contention for high-throughput accounts
- Dedicated routes for specific accounts (account pinning)
- Backward compatible with older servers
Cluster Configuration
Basic Configuration
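A minimal cluster configuration sets a cluster name, a listen address for route connections, and one or more routes to other members.
Configuration Options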
Key cluster configuration parameters:
- name: Cluster name for identification
- listen: Address for accepting route connections
- routes: Seed URLs for other cluster members
- authorization: Authentication for route connections
- tls: TLS configuration for encrypted routes
- pool_size: Number of route connections per server (default varies by version)
- compression: Enable compression for route connections
TLS for Routes
Route Authentication
Protect route connections with authentication.
Auto-Discovery
NATS servers automatically discover other cluster members through gossip:
- Server A connects to Server B (explicit route)
- Server B sends its INFO containing all known cluster members
- Server A learns about Server C and D from Server B
- Server A establishes implicit routes to C and D
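The discovery steps above can be sketched as a small simulation. This is a toy model, not the server's implementation: `discover` and its input format are hypothetical, and it assumes every peer shares all members it knows about and that every server appears as a key in the input.

```python
def discover(explicit_routes: dict[str, set[str]]) -> dict[str, set[str]]:
    """Expand explicit routes into the peer set each server ends up with."""
    # A route is bidirectional: if A dials B, B also knows about A.
    peers = {name: set() for name in explicit_routes}
    for src, targets in explicit_routes.items():
        for dst in targets:
            peers[src].add(dst)
            peers[dst].add(src)

    changed = True
    while changed:
        changed = False
        for server in peers:
            for peer in list(peers[server]):
                # The peer's INFO lists the members it knows; keep only new ones.
                learned = peers[peer] - peers[server] - {server}
                for new in learned:
                    # Establish an implicit route to the newly learned member.
                    peers[server].add(new)
                    peers[new].add(server)
                    changed = True
    return peers


# A, C and D each configure a single explicit route to B.
explicit = {"A": {"B"}, "B": set(), "C": {"B"}, "D": {"B"}}
mesh = discover(explicit)
print(sorted(mesh["A"]))  # ['B', 'C', 'D']
```

Starting from one seed route per server, the loop converges on a full mesh, which is why a new member only needs a route to one existing server.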
Gossip Protocol
From server/route.go:104-108, servers use gossip modes:
- gossipDefault: Normal gossip behavior
- gossipDisabled: Prevents gossip propagation
- gossipOverride: Forces gossip even when it would normally be suppressed
Split-Brain Prevention
NATS prevents split-brain scenarios through cluster naming:
Cluster Name Validation
From server/route.go:119, servers include their cluster name in the CONNECT protocol. Servers reject route connections from servers with:
- Different cluster names
- Missing cluster names (when expected)
Loop Detection
Servers detect and prevent routing loops:
- Each server has a unique server ID
- INFO messages include server identity
- Duplicate connections to the same server are rejected
- Prevents cycles in route topology
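The cluster name check and the duplicate-route check can be combined into one acceptance decision. The sketch below is a simplified model: `RouteAcceptor` and its return strings are hypothetical, it always expects a cluster name, and it treats any second connection to the same server as a duplicate, which ignores connection pooling.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RouteAcceptor:
    """Decides whether an incoming route should be kept (illustrative only)."""
    server_id: str
    cluster_name: str
    connected_ids: set = field(default_factory=set)

    def check(self, remote_id: str, remote_cluster: Optional[str]) -> str:
        # Missing or different cluster names are rejected first.
        if not remote_cluster or remote_cluster != self.cluster_name:
            return "reject: cluster name mismatch"
        # A route back to ourselves would form a loop.
        if remote_id == self.server_id:
            return "reject: route to self"
        # A second route to a known server ID is a duplicate.
        if remote_id in self.connected_ids:
            return "reject: duplicate route"
        self.connected_ids.add(remote_id)
        return "accept"


acceptor = RouteAcceptor(server_id="S1", cluster_name="prod")
print(acceptor.check("S2", "prod"))     # first route to S2
print(acceptor.check("S2", "prod"))     # same server ID again
print(acceptor.check("S3", "staging"))  # wrong cluster name
```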
Interest Propagation
Cluster efficiency depends on propagating subscription interest correctly.
Subject Interest
When a client subscribes:
- Local server registers the subscription
- Server propagates interest to all route connections
- Other servers remember interest for that subject
- Messages published remotely are forwarded only if interest exists
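A minimal in-memory model of these four steps is shown below. The `Server` class and `full_mesh` helper are hypothetical, and the sketch matches subjects by exact string only, so wildcards, accounts, and the real route protocol are left out.

```python
class Server:
    """Toy server that tracks local subscriptions and remote interest."""

    def __init__(self, name):
        self.name = name
        self.routes = []           # directly connected peers (full mesh)
        self.local_subs = {}       # subject -> list of callbacks
        self.remote_interest = {}  # subject -> names of peers with subscribers

    def subscribe(self, subject, callback):
        # Register locally, then announce the interest on every route.
        self.local_subs.setdefault(subject, []).append(callback)
        for peer in self.routes:
            peer.remote_interest.setdefault(subject, set()).add(self.name)

    def deliver_local(self, subject, payload):
        for callback in self.local_subs.get(subject, []):
            callback(payload)

    def publish(self, subject, payload):
        # Deliver locally, then forward only to peers that announced interest.
        self.deliver_local(subject, payload)
        interested = self.remote_interest.get(subject, set())
        forwarded = []
        for peer in self.routes:
            if peer.name in interested:
                peer.deliver_local(subject, payload)
                forwarded.append(peer.name)
        return forwarded


def full_mesh(*servers):
    """Connect every server to every other server."""
    for s in servers:
        s.routes = [p for p in servers if p is not s]


a, b, c = Server("A"), Server("B"), Server("C")
full_mesh(a, b, c)
received = []
b.subscribe("orders.new", received.append)
print(a.publish("orders.new", "order-1"))    # forwarded to B only
print(a.publish("orders.cancelled", "x"))    # no interest, nothing forwarded
```

Because C never announced interest in `orders.new`, A sends it nothing, which is the bandwidth saving that interest propagation is meant to provide.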
Queue Group Interest
Queue subscriptions are handled specially:
- Interest is propagated but marked as queue subscription
- Each message is delivered to only one member of a queue group, cluster-wide
- Server picks a random queue member (local or remote)
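The selection rule can be illustrated with a short sketch that picks one member per queue group for a single message. It follows the description above and chooses uniformly at random among all members; this is an assumption of the sketch, and the real server may weight or prefer some members. The function name and the "server/subscriber" labels are hypothetical.

```python
import random
from typing import Optional


def route_to_queue_groups(
    groups: dict[str, list[str]],
    rng: Optional[random.Random] = None,
) -> dict[str, str]:
    """Pick exactly one member per queue group for a single message."""
    rng = rng or random.Random()
    # Empty groups are skipped; every other group gets exactly one recipient.
    return {name: rng.choice(members) for name, members in groups.items() if members}


groups = {
    "workers": ["A/sub-1", "B/sub-7", "C/sub-3"],  # members spread over three servers
    "auditors": ["B/sub-9"],
}
print(route_to_queue_groups(groups))
```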
Account-Scoped Interest
With route pooling and account isolation:
- Interest is scoped to specific accounts
- Routes can be dedicated to a single account
- Protocol messages include account context
- Improves performance for multi-tenant deployments
Performance Considerations
Recommended Cluster Size
For optimal performance:
- 3 servers: Minimal HA deployment
- 5-7 servers: Recommended for production
- >9 servers: Consider gateway architecture instead
Route Compression
From server/route.go:139-140, routes support compression:
- Default ping interval is 30 seconds for routes (vs 2 minutes for clients)
- Compression uses RTT measurements to select compression level
- S2 compression provides good compression with low CPU overhead
- Particularly beneficial for high-bandwidth, high-latency routes
Connection Delays
From server/route.go:145-147:
- Default route connect delay: DEFAULT_ROUTE_CONNECT
- Max retry delay: DEFAULT_ROUTE_CONNECT_MAX
- Configurable for testing and specific network conditions
Best Practices
Cluster Topology
- Use odd numbers: 3, 5, or 7 servers for better quorum semantics with JetStream
- Co-locate in same region: Low latency between cluster members is critical
- Dedicated network: Use separate network for cluster traffic when possible
- Name your clusters: Always set a cluster name to prevent accidental merging
Configuration Management
- Consistent configuration: Keep cluster settings identical across servers
- Use seed servers: Configure at least 2-3 seed routes on each server
- Enable TLS: Encrypt route connections in production
- Authentication: Require route authentication to prevent unauthorized servers
Monitoring
- Route health: Monitor /routez endpoint for route connection status
- Interest graph: Check that subscriptions propagate correctly
- Message flow: Verify messages reach remote subscribers
- Connection count: Ensure all servers have n-1 route connections
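The connection-count check can be automated against the /routez endpoint. The sketch below assumes the response is JSON with a `routes` array whose entries carry a `remote_id` field, and that the monitoring port is reachable at the URL you pass in; verify both against your server version. It counts distinct peers rather than raw connections, since pooling can open several routes to the same server. The helper names are illustrative.

```python
import json
from urllib.request import urlopen


def fetch_routez(base_url: str) -> dict:
    """Fetch and decode /routez from a server's monitoring endpoint."""
    with urlopen(f"{base_url}/routez", timeout=5) as resp:
        return json.load(resp)


def distinct_route_peers(routez: dict) -> set:
    """Collect the IDs of the remote servers this server has routes to."""
    return {route["remote_id"] for route in routez.get("routes", [])}


def mesh_is_complete(routez: dict, cluster_size: int) -> bool:
    """True when the server has a route to each of the other n-1 members."""
    return len(distinct_route_peers(routez)) == cluster_size - 1


# Hand-written sample: two pooled routes to S2 and one route to S3.
sample = {
    "server_id": "S1",
    "routes": [
        {"rid": 1, "remote_id": "S2"},
        {"rid": 2, "remote_id": "S2"},
        {"rid": 3, "remote_id": "S3"},
    ],
}
print(mesh_is_complete(sample, cluster_size=3))
```

In practice you would call `fetch_routez` once per server, for example with a base URL such as `http://localhost:8222` if monitoring is enabled on that port, and alert when `mesh_is_complete` returns False.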
Scaling Patterns
- Vertical first: Scale individual servers before adding cluster members
- Connection limits: Set appropriate per-account limits
- Account isolation: Use accounts to segment traffic
- Gateway for geo-distribution: Use gateways instead of large clusters
Cluster Operations
Adding a Server
- Configure new server with cluster settings
- Add at least one route to an existing member
- Start the server
- Verify it discovers all other members (check /routez)
Removing a Server
- Stop accepting new client connections (lame duck mode)
- Wait for clients to drain and reconnect
- Shut down the server
- Other cluster members automatically detect the failure
- Remove from configuration and DNS
Rolling Upgrades
- Upgrade one server at a time
- Wait for it to rejoin cluster and sync
- Verify client reconnection and message flow
- Proceed to next server
Troubleshooting
Routes Not Forming
Check:
- Cluster name matches on all servers
- Route port (6222) is accessible
- Authentication credentials are correct
- TLS configuration matches (if enabled)
- Firewall allows bidirectional traffic
Messages Not Forwarding
Check:
- Subscriptions appear in /subsz on all servers
- Interest propagated in /routez subscription count
- Account isolation: messages don’t cross accounts
- Network connectivity between servers
Performance Issues
Check:
- Route connection count (should be n-1)
- Enable route compression for high traffic
- Consider connection pooling for hot accounts
- Monitor slow consumer events
- Check for route connection saturation
Next Steps
- Gateways: Connect multiple clusters across regions
- Accounts: Implement multi-tenancy in your cluster