Overview

NATS clustering allows multiple NATS servers to form a highly available messaging system. Servers in a cluster form a full mesh topology where each server connects to every other server, enabling transparent failover and load distribution.

How Clustering Works

When you configure multiple NATS servers to form a cluster, they establish route connections between each other. These route connections:
  • Propagate subscription interest across the cluster
  • Forward messages to remote subscribers
  • Share cluster topology information
  • Enable automatic failover when servers go offline

Full Mesh Topology

In a NATS cluster, every server maintains a direct connection to every other server. This full mesh topology provides:
  • No single point of failure: Clients can connect to any server
  • Minimal message hops: At most one hop to reach any subscriber
  • Fast failure detection: Direct connections detect failures quickly
  • Simple topology: No complex routing decisions needed

Route Connections

Route connections are special server-to-server connections that differ from client connections.

Route Types

From server/route.go:36-44, NATS distinguishes between:
  • Explicit routes: Configured directly in the server configuration
  • Implicit routes: Learned dynamically from other cluster members

Route Protocol

Route connections use a specialized protocol from server/route.go:126-130:
  • CONNECT: Establishes the route connection with authentication
  • INFO: Exchanges server metadata and capabilities
  • Account-aware protocols: A+/A- for account subscription management
  • Remote subscription protocols: RS+/RS- for interest propagation
  • Leaf node protocols: LS+/LS- for leaf node subscriptions
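
As a rough illustration of the list above, the sketch below maps the leading verb of a route protocol line to the role described here. The helper name, the wording of each role, and the sample input are hypothetical. The arguments that follow each verb are treated as opaque, since their layout is not covered in this section.

```python
# Hypothetical helper: classify a route protocol line by its leading verb.
# The verb-to-role mapping mirrors the list above; everything after the verb
# is ignored because its layout is not described here.
ROUTE_VERBS = {
    "CONNECT": "route handshake with authentication",
    "INFO": "server metadata and capabilities",
    "A+": "account subscription added",
    "A-": "account subscription removed",
    "RS+": "remote subscription interest added",
    "RS-": "remote subscription interest removed",
    "LS+": "leaf node subscription added",
    "LS-": "leaf node subscription removed",
}


def classify_route_line(line: str) -> str:
    """Return the role of a protocol line, or 'unknown' for other verbs."""
    parts = line.strip().split(maxsplit=1)
    if not parts:
        return "unknown"
    return ROUTE_VERBS.get(parts[0].upper(), "unknown")
```

A helper like this can be handy when reading debug traces, where the verb is usually enough to tell interest propagation apart from handshake traffic.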

Connection Pooling

Modern NATS servers support route connection pooling:
  • Multiple route connections per server pair
  • Reduces lock contention for high-throughput accounts
  • Dedicated routes for specific accounts (account pinning)
  • Backward compatible with older servers
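
To make the idea of pooling concrete, here is a minimal sketch of how an account could be mapped onto one of several pooled route connections, with pinned accounts bypassing the pool. The FNV-1a hash, the modulo mapping, and the function names are assumptions chosen for illustration. They are not a description of the server's actual selection logic.

```python
from typing import Optional


def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a hash (standard offset basis and prime)."""
    h = 0x811C9DC5
    for byte in data:
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF
    return h


def pool_slot(account: str, pool_size: int,
              pinned: frozenset = frozenset()) -> Optional[int]:
    """Return the pool index for an account, or None if it has a dedicated route.

    Assumption: accounts are spread over the pool by hashing their name.
    """
    if pool_size < 1:
        raise ValueError("pool_size must be at least 1")
    if account in pinned:
        return None  # pinned accounts use their own route, outside the pool
    return fnv1a_32(account.encode("utf-8")) % pool_size
```

Because the mapping depends only on the account name and the pool size, both ends of a route pair can agree on the slot without coordination, which is the property a scheme like this needs.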

Cluster Configuration

Basic Configuration

Minimal cluster configuration:
cluster {
  name: "my-cluster"
  listen: 0.0.0.0:6222
  
  routes: [
    nats-route://server1:6222
    nats-route://server2:6222
  ]
}

Configuration Options

Key cluster configuration parameters:
  • name: Cluster name for identification
  • listen: Address for accepting route connections
  • routes: Seed URLs for other cluster members
  • authorization: Authentication for route connections
  • tls: TLS configuration for encrypted routes
  • pool_size: Number of route connections per remote server (default varies by version)
  • compression: Enable compression for route connections

TLS for Routes

cluster {
  name: "secure-cluster"
  listen: 0.0.0.0:6222
  
  tls {
    cert_file: "./certs/server-cert.pem"
    key_file: "./certs/server-key.pem"
    ca_file: "./certs/ca.pem"
    verify: true
  }
}
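With verify: true, the server requires connecting routes to present a certificate that validates against ca_file. From server/route.go:133-134, when cluster TLS is configured as insecure, the certificate chain and hostname of solicited routes are not verified. Never use this in production.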

Route Authentication

Protect route connections with authentication:
cluster {
  name: "auth-cluster"
  listen: 0.0.0.0:6222
  
  authorization {
    user: "route_user"
    password: "$2a$11$..." # bcrypt hash
  }
  
  routes: [
    # The URL carries the plaintext password that matches the bcrypt hash above
    nats-route://route_user:password@server1:6222
  ]
}

Auto-Discovery

NATS servers automatically discover other cluster members through gossip:
  1. Server A connects to Server B (explicit route)
  2. Server B sends its INFO containing all known cluster members
  3. Server A learns about Server C and D from Server B
  4. Server A establishes implicit routes to C and D
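
The four steps above can be simulated in a few lines. This is a toy model, not the server's implementation: it assumes routes are bidirectional and that every server shares its full peer list with each neighbor until nothing new is learned. The function name and the server names are hypothetical.

```python
def discover(explicit_routes: dict[str, set[str]]) -> dict[str, set[str]]:
    """Expand configured (explicit) routes into the mesh that gossip produces."""
    # Explicit routes are treated as bidirectional connections.
    peers: dict[str, set[str]] = {server: set() for server in explicit_routes}
    for src, targets in explicit_routes.items():
        for dst in targets:
            peers.setdefault(dst, set())
            peers[src].add(dst)
            peers[dst].add(src)

    # Repeat gossip rounds until no server learns about a new member.
    changed = True
    while changed:
        changed = False
        for server in list(peers):
            for neighbor in list(peers[server]):
                learned = peers[neighbor] - peers[server] - {server}
                for new_peer in learned:
                    # Implicit route: the server connects to the learned member.
                    peers[server].add(new_peer)
                    peers[new_peer].add(server)
                    changed = True
    return peers
```

Calling `discover({"A": {"B"}, "B": {"C", "D"}})` ends with every server connected to the other three, even though A was only configured with a route to B.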

Gossip Protocol

From server/route.go:104-108, servers use gossip modes:
  • gossipDefault: Normal gossip behavior
  • gossipDisabled: Prevents gossip propagation
  • gossipOverride: Forces gossip even when it would normally be suppressed
This mechanism ensures all servers discover each other even when only configured with a subset of seed servers.

Split-Brain Prevention

NATS guards against accidentally merged clusters through cluster naming:

Cluster Name Validation

From server/route.go:119, servers include their cluster name in the CONNECT protocol message. A server rejects route connections from peers with:
  • Different cluster names
  • Missing cluster names (when expected)
This prevents accidentally merging separate clusters.

Loop Detection

Servers detect and prevent routing loops:
  • Each server has a unique server ID
  • INFO messages include server identity
  • Duplicate connections to the same server are rejected
  • Prevents cycles in route topology
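
The name check and the duplicate check described in this section can be combined into a single admission step. The sketch below is an illustration under stated assumptions: the class, its fields, and the rejection reasons are hypothetical, rejecting a route to oneself is an added assumption, and route pooling (which allows several connections per peer) is ignored.

```python
from dataclasses import dataclass, field


@dataclass
class RouteGuard:
    """Toy admission check for incoming route connections."""
    cluster_name: str
    server_id: str
    connected_ids: set = field(default_factory=set)

    def admit(self, remote_cluster: str, remote_id: str) -> tuple[bool, str]:
        if not remote_cluster:
            return False, "missing cluster name"
        if remote_cluster != self.cluster_name:
            return False, "cluster name mismatch"
        if remote_id == self.server_id:
            return False, "route to self"  # assumption, see lead-in
        if remote_id in self.connected_ids:
            return False, "duplicate route"
        self.connected_ids.add(remote_id)
        return True, "accepted"
```

Keeping the checks in one place makes the order explicit: identity of the cluster first, identity of the server second.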

Interest Propagation

Cluster efficiency depends on propagating subscription interest correctly.

Subject Interest

When a client subscribes:
  1. Local server registers the subscription
  2. Server propagates interest to all route connections
  3. Other servers remember interest for that subject
  4. Messages published remotely are forwarded only if interest exists
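
The forwarding decision in step 4 can be sketched as a lookup against remembered interest. The subject matcher follows the usual NATS wildcard rules (`*` matches exactly one token, `>` matches one or more trailing tokens). The `RemoteInterest` class and the route names are hypothetical and far simpler than the server's own data structures.

```python
def subject_matches(pattern: str, subject: str) -> bool:
    """Match a subject against a subscription pattern with * and > wildcards."""
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, token in enumerate(p_tokens):
        if token == ">":
            # '>' must be the last token and cover at least one subject token.
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False
        if token != "*" and token != s_tokens[i]:
            return False
    return len(p_tokens) == len(s_tokens)


class RemoteInterest:
    """Remembers which routes announced interest in which subjects."""

    def __init__(self) -> None:
        self._by_route: dict[str, set[str]] = {}

    def add(self, route: str, pattern: str) -> None:
        self._by_route.setdefault(route, set()).add(pattern)

    def remove(self, route: str, pattern: str) -> None:
        self._by_route.get(route, set()).discard(pattern)

    def routes_for(self, subject: str) -> list[str]:
        """Routes a message on this subject should be forwarded to."""
        return sorted(
            route for route, patterns in self._by_route.items()
            if any(subject_matches(p, subject) for p in patterns)
        )
```

A message whose subject matches no remembered pattern yields an empty list and is not forwarded at all, which is what keeps cluster traffic proportional to actual interest.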

Queue Group Interest

Queue subscriptions are handled specially:
  • Interest is propagated but marked as queue subscription
  • Only one message per queue group is delivered cluster-wide
  • Server picks a random queue member (local or remote)
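
A minimal sketch of the delivery rule above: for each queue group, exactly one member receives the message, whether it is local or reached over a route. The uniform random choice and the data layout are assumptions for illustration. The server may weight its choice differently.

```python
import random


def pick_queue_targets(queue_members: dict, rng=random) -> dict:
    """Choose one subscriber per queue group.

    queue_members maps a group name to a list of (server, subscriber_id)
    tuples; groups with no members are skipped.
    """
    return {
        group: rng.choice(members)
        for group, members in queue_members.items()
        if members
    }
```

Passing a seeded `random.Random` instance as `rng` makes the selection reproducible in tests.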

Account-Scoped Interest

With route pooling and account isolation:
  • Interest is scoped to specific accounts
  • Routes can be dedicated to a single account
  • Protocol messages include account context
  • Improves performance for multi-tenant deployments

Performance Considerations

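For optimal performance, size the cluster as follows:
  • 3 servers: Minimal HA deployment
  • 5-7 servers: Recommended for production
  • More than 9 servers: Consider a gateway architecture instead
A full mesh of n servers needs n(n-1)/2 server-to-server route pairs, so the number of connections grows as O(n²). Large deployments should be split into smaller clusters linked by gateways.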
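
The quadratic growth is easy to check with a short calculation. The sketch assumes that each server pair opens `pool_size` connections when route pooling is enabled. Dedicated per-account routes would add to that and are not counted here.

```python
def route_pairs(n: int) -> int:
    """Number of server pairs in a full mesh of n servers."""
    return n * (n - 1) // 2


def route_connections(n: int, pool_size: int = 1) -> int:
    """Total route connections, assuming pool_size connections per pair."""
    return route_pairs(n) * pool_size


for size in (3, 5, 7, 9):
    print(size, route_pairs(size), route_connections(size, pool_size=3))
```

Going from 3 to 9 servers triples the server count but multiplies the number of route pairs by twelve (3 versus 36).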

Route Compression

From server/route.go:139-140, routes support compression:
  • Default ping interval is 30 seconds for routes (vs 2 minutes for clients)
  • Compression uses RTT measurements to select compression level
  • S2 compression provides good compression with low CPU overhead
  • Particularly beneficial for high-bandwidth, high-latency routes
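
RTT-based selection can be pictured as a threshold lookup: the higher the measured round-trip time, the more aggressive the compression. In the sketch below the three thresholds (10, 50 and 100 ms), the level names, and the handling of exact boundary values are all assumptions. Check the documentation for your server version before relying on them.

```python
import bisect

# Assumed thresholds in milliseconds and assumed level names.
ASSUMED_THRESHOLDS_MS = (10, 50, 100)
LEVELS = ("off", "fast", "better", "best")


def select_compression(rtt_ms: float, thresholds=ASSUMED_THRESHOLDS_MS) -> str:
    """Pick a compression level for a route from its measured RTT.

    Expects exactly three ascending thresholds, one fewer than LEVELS.
    """
    if len(thresholds) != len(LEVELS) - 1:
        raise ValueError("expected three thresholds")
    return LEVELS[bisect.bisect_right(thresholds, rtt_ms)]
```

Under these assumptions a route inside one data center (a few milliseconds) stays uncompressed, while a cross-region route pays the CPU cost of the strongest level.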

Connection Delays

From server/route.go:145-147:
  • Default route connect delay: DEFAULT_ROUTE_CONNECT
  • Max retry delay: DEFAULT_ROUTE_CONNECT_MAX
  • Configurable for testing and specific network conditions
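
One common way to use an initial delay together with a maximum delay is capped exponential backoff. The sketch below shows that pattern only. Whether the server doubles the delay between attempts is an assumption, and the numbers in the usage note are placeholders, not the values of DEFAULT_ROUTE_CONNECT or DEFAULT_ROUTE_CONNECT_MAX.

```python
def route_retry_delays(initial: float, maximum: float, attempts: int) -> list:
    """Delays (in seconds) before each reconnect attempt, doubling up to a cap."""
    if initial <= 0 or maximum < initial:
        raise ValueError("need 0 < initial <= maximum")
    delays = []
    delay = initial
    for _ in range(attempts):
        delays.append(delay)
        delay = min(delay * 2, maximum)
    return delays
```

With placeholder values of 1 second and 30 seconds, seven attempts wait 1, 2, 4, 8, 16, 30 and 30 seconds.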

Best Practices

Cluster Topology

  1. Use odd numbers: 3, 5, or 7 servers for better quorum semantics with JetStream
  2. Co-locate in same region: Low latency between cluster members is critical
  3. Dedicated network: Use separate network for cluster traffic when possible
  4. Name your clusters: Always set a cluster name to prevent accidental merging

Configuration Management

  1. Consistent configuration: Keep cluster settings identical across servers
  2. Use seed servers: Configure at least 2-3 seed routes on each server
  3. Enable TLS: Encrypt route connections in production
  4. Authentication: Require route authentication to prevent unauthorized servers

Monitoring

  1. Route health: Monitor /routez endpoint for route connection status
  2. Interest graph: Check that subscriptions propagate correctly
  3. Message flow: Verify messages reach remote subscribers
  4. Connection count: Ensure every server has routes to all n-1 other members (with route pooling, expect several connections per peer)
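
The route health and connection count checks can be scripted against the /routez endpoint. This is a minimal sketch that assumes the response is JSON with a `routes` array whose entries carry a `remote_id` field. The base URL, the helper names, and the expected server IDs are placeholders.

```python
import json
from urllib.request import urlopen


def fetch_routez(base_url: str, timeout: float = 5.0) -> dict:
    """Fetch and decode /routez from a server's monitoring endpoint."""
    with urlopen(f"{base_url}/routez", timeout=timeout) as resp:
        return json.load(resp)


def missing_peers(routez: dict, expected_ids: set) -> set:
    """Return the expected server IDs that have no route in this report.

    Counting distinct remote IDs keeps the check valid when pooling opens
    several connections to the same peer.
    """
    seen = {route.get("remote_id") for route in routez.get("routes", [])}
    return set(expected_ids) - seen
```

A monitoring job could call `missing_peers(fetch_routez("http://server1:8222"), {...})` for each member and alert when the result is not empty. The host and port in that call are assumptions about where monitoring is exposed.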

Scaling Patterns

  1. Vertical first: Scale individual servers before adding cluster members
  2. Connection limits: Set appropriate per-account limits
  3. Account isolation: Use accounts to segment traffic
  4. Gateway for geo-distribution: Use gateways instead of large clusters

Cluster Operations

Adding a Server

  1. Configure new server with cluster settings
  2. Add at least one route to an existing member
  3. Start the server
  4. Verify it discovers all other members (check /routez)

Removing a Server

  1. Stop accepting new client connections (lame duck mode)
  2. Wait for clients to drain and reconnect
  3. Shut down the server
  4. Other cluster members automatically detect the failure
  5. Remove from configuration and DNS

Rolling Upgrades

  1. Upgrade one server at a time
  2. Wait for it to rejoin cluster and sync
  3. Verify client reconnection and message flow
  4. Proceed to next server
NATS protocol versioning ensures backward compatibility during upgrades.
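
The rolling upgrade steps can be expressed as a loop that refuses to move on until the current server is healthy again. The `upgrade` and `is_healthy` callables are hypothetical hooks that you would implement for your own deployment tooling, for example by restarting a service and then polling /routez.

```python
import time


def rolling_upgrade(servers, upgrade, is_healthy,
                    poll_seconds: float = 1.0, max_polls: int = 30) -> list:
    """Upgrade servers one at a time, waiting for each to rejoin the cluster."""
    done = []
    for server in servers:
        upgrade(server)
        for _ in range(max_polls):
            if is_healthy(server):
                break
            time.sleep(poll_seconds)
        else:
            # Stop immediately so at most one server is ever out of the cluster.
            raise RuntimeError(
                f"{server} did not rejoin; upgraded so far: {done}"
            )
        done.append(server)
    return done
```

Aborting on the first unhealthy server keeps the rest of the cluster on a known-good version while the problem is investigated.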

Troubleshooting

Routes Not Forming

Check:
  • Cluster name matches on all servers
  • Route port (6222) is accessible
  • Authentication credentials are correct
  • TLS configuration matches (if enabled)
  • Firewall allows bidirectional traffic
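
A quick way to rule out the network items on this checklist is a plain TCP connection test against the route port. The sketch only confirms that something accepts connections on that port. It does not validate the cluster name, credentials, or TLS settings, and the helper name is hypothetical.

```python
import socket


def route_port_reachable(host: str, port: int = 6222,
                         timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

Run it from each server toward every other member, since a firewall may allow traffic in one direction only.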

Messages Not Forwarding

Check:
  • Subscriptions appear in /subsz on all servers
  • Interest propagated in /routez subscription count
  • Account isolation: messages don’t cross accounts
  • Network connectivity between servers

Performance Issues

Check:
  • Route connection count (each server should reach all n-1 other members)
  • Whether route compression is enabled for high-traffic routes
  • Whether hot accounts would benefit from connection pooling
  • Slow consumer events
  • Route connection saturation

Next Steps

Gateways

Connect multiple clusters across regions

Accounts

Implement multi-tenancy in your cluster
