This guide covers deploying Temporal Server in a multi-node cluster configuration for high availability, fault tolerance, and horizontal scalability.

Cluster Architecture

A production Temporal cluster typically consists of:

1. Multiple Service Instances: Each Temporal service (Frontend, History, Matching, Worker) runs multiple instances across different nodes.
2. Membership Protocol: Services use the Ringpop gossip protocol for cluster membership and peer discovery.
3. Consistent Hashing: History shards are distributed across History service instances using consistent hashing.
4. Shared Database: All instances connect to the same persistence and visibility databases.

Service Roles and Scaling

Frontend Service

Stateless - Horizontal Scaling

The Frontend service is completely stateless and can be scaled horizontally.

Scaling Guidelines:
  • Start with 2-3 instances for HA
  • Add instances based on request rate
  • Each instance handles roughly 10k requests/sec (workload-dependent)
  • Use a load balancer for traffic distribution
services:
  frontend:
    rpc:
      grpcPort: 7233
      membershipPort: 6933
      bindOnIP: "0.0.0.0"
      httpPort: 7243
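The sizing guideline above reduces to simple arithmetic: take the expected request rate, add headroom, divide by per-instance throughput, and never go below the HA floor. A minimal sketch (the 10k requests/sec figure and 30% headroom are rough planning assumptions, not Temporal-specified values):

```python
import math

def frontend_instances(expected_rps: int,
                       per_instance_rps: int = 10_000,
                       ha_minimum: int = 2,
                       headroom: float = 0.3) -> int:
    """Rough capacity planning: enough instances for the request rate
    plus headroom, but never fewer than the HA minimum."""
    needed = math.ceil(expected_rps * (1 + headroom) / per_instance_rps)
    return max(needed, ha_minimum)

print(frontend_instances(5_000))    # light load: the HA floor wins -> 2
print(frontend_instances(50_000))   # 65k effective rps -> 7
```

Treat the result as a starting point and adjust per_instance_rps after load-testing your own workload.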

History Service

Stateful - Shard-Based Scaling

The History service manages workflow execution state using consistent hashing of history shards.

Scaling Guidelines:
  • Minimum 3 instances for HA
  • Shards automatically redistribute when instances join or leave
  • Each instance handles multiple shards
  • More instances means a finer-grained shard distribution
persistence:
  numHistoryShards: 512  # More shards = better distribution

services:
  history:
    rpc:
      grpcPort: 7234
      membershipPort: 6934
      bindOnIP: "0.0.0.0"
Shard Distribution Example:
512 shards, 3 History instances:
- Instance 1: ~170 shards
- Instance 2: ~171 shards  
- Instance 3: ~171 shards

512 shards, 5 History instances:
- Each instance: ~102 shards
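The per-instance counts in the example above are just an even split with the remainder spread across instances. A small sketch of that arithmetic (the ideal distribution; Temporal's actual assignment uses consistent hashing, so real counts vary slightly around these numbers):

```python
def shards_per_instance(num_shards: int, num_instances: int) -> list[int]:
    """Ideal even split of shards across instances: every instance gets
    the base count, and the remainder is spread one shard at a time."""
    base, extra = divmod(num_shards, num_instances)
    return [base + (1 if i < extra else 0) for i in range(num_instances)]

print(shards_per_instance(512, 3))  # [171, 171, 170]
print(shards_per_instance(512, 5))  # [103, 103, 102, 102, 102]
```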

Matching Service

Semi-Stateful - Queue-Based

The Matching service routes tasks to workers and maintains task queue state.

Scaling Guidelines:
  • Start with 2-3 instances
  • Scale based on task queue throughput
  • Task queues are partitioned across instances
services:
  matching:
    rpc:
      grpcPort: 7235
      membershipPort: 6935
      bindOnIP: "0.0.0.0"

Worker Service

Stateless - Background Operations

The Worker service executes internal system workflows, such as archival.

Scaling Guidelines:
  • Usually 1-2 instances are sufficient
  • Scale up if a backlog of system workflows develops
services:
  worker:
    rpc:
      grpcPort: 7239
      membershipPort: 6939
      bindOnIP: "0.0.0.0"

Cluster Membership

Temporal services use Ringpop for cluster membership and service discovery.

Membership Configuration

global:
  membership:
    maxJoinDuration: 30s
    broadcastAddress: ""  # Auto-detected if empty

How Membership Works

1. Service Startup: Each service instance binds to its membership port and announces itself.
2. Peer Discovery: Instances discover each other through the shared database (cluster_membership table).
3. Gossip Protocol: Ringpop gossip maintains a consistent view of cluster membership.
4. Health Checking: Services continuously monitor peer health and detect failures.
5. Shard Rebalancing: When topology changes, History shards automatically redistribute.
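A useful property of the consistent hashing used for shard placement is that a topology change moves only a fraction of the shards, rather than reshuffling all of them. The sketch below illustrates this with a simplified hash ring (MD5 and 100 virtual nodes per instance are illustration choices, not Ringpop's actual internals): adding a fourth History instance to a three-instance cluster moves roughly a quarter of the 512 shards.

```python
import hashlib
from bisect import bisect

def build_ring(instances, vnodes=100):
    """Place each instance on the ring at many virtual-node positions."""
    points = []
    for inst in instances:
        for v in range(vnodes):
            h = int(hashlib.md5(f"{inst}-{v}".encode()).hexdigest(), 16)
            points.append((h, inst))
    points.sort()
    return points

def owner(points, shard_id):
    """A shard is owned by the first instance clockwise from its hash."""
    h = int(hashlib.md5(str(shard_id).encode()).hexdigest(), 16)
    i = bisect(points, (h,)) % len(points)
    return points[i][1]

NUM_SHARDS = 512
before = build_ring(["history-1", "history-2", "history-3"])
after = build_ring(["history-1", "history-2", "history-3", "history-4"])
moved = sum(owner(before, s) != owner(after, s) for s in range(NUM_SHARDS))
print(f"{moved} of {NUM_SHARDS} shards moved")  # roughly a quarter
```

A naive `shard_id % num_instances` assignment would instead remap about three quarters of the shards on the same change, which is why consistent hashing is used.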

Broadcast Address

The broadcast address is how services advertise themselves to peers:
global:
  membership:
    broadcastAddress: ""  # Auto-detect from hostname
Best for:
  • Simple deployments
  • Kubernetes with pod DNS
  • Docker with proper networking

Multi-Node Deployment Examples

Docker Compose Cluster

services:
  # Frontend instances
  frontend-1:
    image: temporalio/server:latest
    container_name: temporal-frontend-1
    environment:
      - SERVICES=frontend
      - BIND_ON_IP=0.0.0.0
      - TEMPORAL_BROADCAST_ADDRESS=frontend-1
      - NUM_HISTORY_SHARDS=128
      - DB=postgres12
      - POSTGRES_SEEDS=postgres
      - POSTGRES_USER=temporal
      - POSTGRES_PWD=temporal
    networks:
      - temporal-network
    depends_on:
      - postgres
  
  frontend-2:
    image: temporalio/server:latest
    container_name: temporal-frontend-2
    environment:
      - SERVICES=frontend
      - BIND_ON_IP=0.0.0.0
      - TEMPORAL_BROADCAST_ADDRESS=frontend-2
      - NUM_HISTORY_SHARDS=128
      - DB=postgres12
      - POSTGRES_SEEDS=postgres
      - POSTGRES_USER=temporal
      - POSTGRES_PWD=temporal
    networks:
      - temporal-network
    depends_on:
      - postgres
  
  # History instances
  history-1:
    image: temporalio/server:latest
    container_name: temporal-history-1
    environment:
      - SERVICES=history
      - BIND_ON_IP=0.0.0.0
      - TEMPORAL_BROADCAST_ADDRESS=history-1
      - NUM_HISTORY_SHARDS=128
      - DB=postgres12
      - POSTGRES_SEEDS=postgres
      - POSTGRES_USER=temporal
      - POSTGRES_PWD=temporal
    networks:
      - temporal-network
    depends_on:
      - postgres
  
  history-2:
    image: temporalio/server:latest
    container_name: temporal-history-2
    environment:
      - SERVICES=history
      - BIND_ON_IP=0.0.0.0
      - TEMPORAL_BROADCAST_ADDRESS=history-2
      - NUM_HISTORY_SHARDS=128
      - DB=postgres12
      - POSTGRES_SEEDS=postgres
      - POSTGRES_USER=temporal
      - POSTGRES_PWD=temporal
    networks:
      - temporal-network
    depends_on:
      - postgres
  
  history-3:
    image: temporalio/server:latest
    container_name: temporal-history-3
    environment:
      - SERVICES=history
      - BIND_ON_IP=0.0.0.0
      - TEMPORAL_BROADCAST_ADDRESS=history-3
      - NUM_HISTORY_SHARDS=128
      - DB=postgres12
      - POSTGRES_SEEDS=postgres
      - POSTGRES_USER=temporal
      - POSTGRES_PWD=temporal
    networks:
      - temporal-network
    depends_on:
      - postgres
  
  # Matching instances
  matching-1:
    image: temporalio/server:latest
    container_name: temporal-matching-1
    environment:
      - SERVICES=matching
      - BIND_ON_IP=0.0.0.0
      - TEMPORAL_BROADCAST_ADDRESS=matching-1
      - NUM_HISTORY_SHARDS=128
      - DB=postgres12
      - POSTGRES_SEEDS=postgres
      - POSTGRES_USER=temporal
      - POSTGRES_PWD=temporal
    networks:
      - temporal-network
    depends_on:
      - postgres
  
  matching-2:
    image: temporalio/server:latest
    container_name: temporal-matching-2
    environment:
      - SERVICES=matching
      - BIND_ON_IP=0.0.0.0
      - TEMPORAL_BROADCAST_ADDRESS=matching-2
      - NUM_HISTORY_SHARDS=128
      - DB=postgres12
      - POSTGRES_SEEDS=postgres
      - POSTGRES_USER=temporal
      - POSTGRES_PWD=temporal
    networks:
      - temporal-network
    depends_on:
      - postgres
  
  # Worker instance
  worker:
    image: temporalio/server:latest
    container_name: temporal-worker
    environment:
      - SERVICES=worker
      - BIND_ON_IP=0.0.0.0
      - TEMPORAL_BROADCAST_ADDRESS=worker
      - NUM_HISTORY_SHARDS=128
      - DB=postgres12
      - POSTGRES_SEEDS=postgres
      - POSTGRES_USER=temporal
      - POSTGRES_PWD=temporal
    networks:
      - temporal-network
    depends_on:
      - postgres
  
  # Load balancer for Frontend
  nginx:
    image: nginx:alpine
    container_name: temporal-nginx
    ports:
      - "7233:7233"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
    networks:
      - temporal-network
    depends_on:
      - frontend-1
      - frontend-2
  
  postgres:
    image: postgres:13
    container_name: temporal-postgres
    environment:
      POSTGRES_USER: temporal
      POSTGRES_PASSWORD: temporal
    networks:
      - temporal-network
    volumes:
      - postgres-data:/var/lib/postgresql/data

networks:
  temporal-network:
    driver: bridge

volumes:
  postgres-data:

Kubernetes Deployment

For Kubernetes, use the official Helm charts:
# Add Temporal Helm repository
helm repo add temporalio https://go.temporal.io/helm-charts
helm repo update

# Create namespace
kubectl create namespace temporal

# Install with custom values
helm install temporal temporalio/temporal \
  --namespace temporal \
  --values values.yaml
values.yaml
server:
  replicaCount: 1
  
  config:
    persistence:
      default:
        sql:
          driver: "postgres"
          host: postgres.temporal.svc.cluster.local
          port: 5432
          database: temporal
          user: temporal
          password: temporal
          maxConns: 20
          maxIdleConns: 20
      visibility:
        sql:
          driver: "postgres"
          host: postgres.temporal.svc.cluster.local
          port: 5432
          database: temporal_visibility
          user: temporal
          password: temporal
          maxConns: 10
          maxIdleConns: 10
    
    numHistoryShards: 512

frontend:
  replicaCount: 3
  resources:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      cpu: 2000m
      memory: 2Gi
  service:
    type: LoadBalancer
    port: 7233

history:
  replicaCount: 5
  resources:
    requests:
      cpu: 1000m
      memory: 2Gi
    limits:
      cpu: 4000m
      memory: 4Gi

matching:
  replicaCount: 3
  resources:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      cpu: 2000m
      memory: 2Gi

worker:
  replicaCount: 1
  resources:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      cpu: 2000m
      memory: 2Gi

prometheus:
  enabled: true

grafana:
  enabled: true

Multi-Cluster Replication

For geo-distributed deployments, Temporal supports multi-cluster replication.

Cluster Configuration

config/cluster-a.yaml
persistence:
  numHistoryShards: 512

clusterMetadata:
  enableGlobalNamespace: true
  failoverVersionIncrement: 100
  masterClusterName: "cluster-a"
  currentClusterName: "cluster-a"
  clusterInformation:
    cluster-a:
      enabled: true
      initialFailoverVersion: 1
      rpcName: "frontend"
      rpcAddress: "cluster-a.example.com:7233"
      httpAddress: "cluster-a.example.com:7243"
    cluster-b:
      enabled: true
      initialFailoverVersion: 2
      rpcName: "frontend"
      rpcAddress: "cluster-b.example.com:7233"
      httpAddress: "cluster-b.example.com:7243"

dcRedirectionPolicy:
  policy: "all-apis-forwarding"

Registering Remote Clusters

After both clusters are running:
# From Cluster A, register Cluster B
tctl --address cluster-a.example.com:7233 \
  admin cluster upsert-remote-cluster \
  --frontend_address "cluster-b.example.com:7233"

# From Cluster B, register Cluster A
tctl --address cluster-b.example.com:7233 \
  admin cluster upsert-remote-cluster \
  --frontend_address "cluster-a.example.com:7233"

Global Namespaces

Create a namespace that replicates across clusters:
tctl --address cluster-a.example.com:7233 \
  namespace register \
  --global \
  --namespace my-global-namespace \
  --clusters cluster-a cluster-b \
  --active_cluster cluster-a

Failover

Manual failover to switch active cluster:
tctl --address cluster-a.example.com:7233 \
  namespace update \
  --namespace my-global-namespace \
  --active_cluster cluster-b

Load Balancing

Frontend Load Balancing

nginx.conf
stream {
    upstream temporal_grpc {
        least_conn;
        server frontend-1.temporal.svc.cluster.local:7233 max_fails=3 fail_timeout=30s;
        server frontend-2.temporal.svc.cluster.local:7233 max_fails=3 fail_timeout=30s;
        server frontend-3.temporal.svc.cluster.local:7233 max_fails=3 fail_timeout=30s;
    }
    
    server {
        listen 7233;
        proxy_pass temporal_grpc;
        proxy_connect_timeout 5s;
        proxy_timeout 300s;
    }
}

High Availability Checklist

1. Multiple Instances
  • At least 2 Frontend instances
  • At least 3 History instances
  • At least 2 Matching instances
  • At least 1 Worker instance

2. Database HA
  • Database replication configured
  • Automatic failover enabled
  • Regular backups scheduled
  • Connection pooling tuned

3. Network Configuration
  • Load balancer for Frontend service
  • Health checks configured
  • Network policies in place
  • DNS properly configured

4. Monitoring
  • Prometheus metrics collection
  • Grafana dashboards
  • Alerting rules configured
  • Log aggregation setup

5. Resource Management
  • CPU/Memory limits set
  • Pod/container auto-scaling
  • Disk space monitoring
  • Rate limiting configured

Troubleshooting Cluster Issues

Shard Ownership Conflicts

Symptoms: History service logs show shard ownership conflicts.
Causes:
  • Split-brain scenarios
  • Network partitions
  • Clock skew between nodes
Solutions:
  • Ensure NTP is synchronized across all nodes
  • Check network connectivity between services
  • Verify database connectivity
  • Review the maxJoinDuration setting

Uneven Shard Distribution

Symptoms: Some History instances have many more shards than others.
Solutions:
  • Restart History instances one at a time
  • Wait for membership stabilization (30-60s)
  • Check that broadcast addresses are correct
  • Verify all instances can reach the database

Service Discovery Failures

Symptoms: Services cannot find each other.
Solutions:
  • Check that membership ports are accessible
  • Verify broadcast addresses resolve correctly
  • Check firewall rules
  • Ensure database connectivity for the membership table

Load Balancer Problems

Symptoms: Connection failures, timeouts, uneven load.
Solutions:
  • Configure health checks properly
  • Use a least-connections algorithm
  • Set appropriate timeouts (5s connect, 300s idle)
  • Monitor backend health status

Performance Optimization

Shard Count Guidelines

# < 100 workflows/second
numHistoryShards: 128
history:
  replicaCount: 3
# Result: ~43 shards per instance
Warning: The shard count cannot be changed after deployment. Choose it carefully based on expected scale.
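The reason the shard count is fixed: each workflow is mapped to a shard by hashing its identity modulo numHistoryShards, and that shard ID is baked into the persisted data. Changing the modulus would remap almost every workflow to a different shard, orphaning existing state. A simplified illustration (Temporal uses a different hash function internally; MD5 and the key format here are illustration choices):

```python
import hashlib

def shard_for(namespace_id: str, workflow_id: str, num_shards: int) -> int:
    """Map a workflow to a history shard by hash-mod. Illustrative only,
    not the server's exact algorithm."""
    key = f"{namespace_id}_{workflow_id}".encode()
    return int(hashlib.md5(key).hexdigest(), 16) % num_shards

wf = ("my-namespace", "order-12345")
print(shard_for(*wf, 128))   # deterministic shard in [0, 128)
print(shard_for(*wf, 512))   # usually lands on a different shard entirely
```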

Resource Recommendations

Component   Small             Medium             Large
Frontend    2 CPU, 4 GB RAM   4 CPU, 8 GB RAM    8 CPU, 16 GB RAM
History     4 CPU, 8 GB RAM   8 CPU, 16 GB RAM   16 CPU, 32 GB RAM
Matching    2 CPU, 4 GB RAM   4 CPU, 8 GB RAM    8 CPU, 16 GB RAM
Worker      2 CPU, 4 GB RAM   2 CPU, 4 GB RAM    4 CPU, 8 GB RAM
These are starting points. Monitor actual usage and adjust based on workload patterns.
