UTMStack supports horizontal scaling for large enterprise deployments handling more than 500 data sources. This page explains the architecture and configuration for multi-node deployments.
When to Scale Horizontally: Deployments exceeding 500 data sources/devices require adding secondary nodes for horizontal scaling.

Scaling Architecture

Single-Node vs Multi-Node

Single-Node Deployment

Capacity: Up to 500 data sources (1 TB/month)
Resource Requirements:
  • CPU: 32 cores
  • RAM: 64 GB
  • Disk: 1 TB SSD
Components on Single Server:
  • Backend API
  • Frontend UI
  • Correlation Engine
  • Agent Manager
  • PostgreSQL
  • Elasticsearch (single node)
  • Redis
Advantages:
  • Simple deployment and management
  • Lower infrastructure costs
  • No network latency between components
  • Easier troubleshooting
Limitations:
  • Single point of failure
  • Limited vertical scaling
  • Cannot exceed 500 data sources efficiently

Multi-Node Deployment

Capacity: 500 to 10,000+ data sources
Architecture:
  • Primary node: Management and core services
  • Secondary nodes: Distributed processing
  • Database cluster: High availability and read scaling
  • Search cluster: Distributed indexing and search
Advantages:
  • Horizontal scalability
  • High availability
  • Load distribution
  • Fault tolerance
  • Better performance under heavy load

Deployment Topology

Small Multi-Node (500-1500 sources)

Node Configuration:
| Node | Role | Components | Resources |
| --- | --- | --- | --- |
| Primary | Management | Backend API, Frontend, PostgreSQL Master, Redis Master | 32 cores, 64 GB RAM, 500 GB SSD |
| Secondary 1 | Processing | Agent Manager, Correlation, Elasticsearch Data | 32 cores, 64 GB RAM, 2 TB SSD |
| Secondary 2 | Processing | Agent Manager, Correlation, Elasticsearch Data | 32 cores, 64 GB RAM, 2 TB SSD |
Total Capacity: ~1,500 data sources, ~3 TB/month

Medium Multi-Node (1500-5000 sources)

Node Configuration:
| Node | Role | Components | Resources |
| --- | --- | --- | --- |
| Primary | Management | Backend API, Frontend | 32 cores, 64 GB RAM, 500 GB SSD |
| Database | Database | PostgreSQL Primary + Replica | 32 cores, 128 GB RAM, 1 TB SSD |
| Search 1 | Search Master | Elasticsearch Master + Data | 32 cores, 128 GB RAM, 4 TB SSD |
| Search 2 | Search Data | Elasticsearch Data | 32 cores, 128 GB RAM, 4 TB SSD |
| Search 3 | Search Data | Elasticsearch Data | 32 cores, 128 GB RAM, 4 TB SSD |
| Worker 1 | Processing | Agent Manager, Correlation | 32 cores, 64 GB RAM, 500 GB SSD |
| Worker 2 | Processing | Agent Manager, Correlation | 32 cores, 64 GB RAM, 500 GB SSD |
| Worker 3 | Processing | Agent Manager, Correlation | 32 cores, 64 GB RAM, 500 GB SSD |
Total Capacity: ~5,000 data sources, ~10 TB/month

Large Multi-Node (5000-10000 sources)

Node Configuration:
| Node Count | Type | Purpose | Resources per Node |
| --- | --- | --- | --- |
| 2 | Management | API, Frontend, Load Balancing | 32 cores, 64 GB RAM, 500 GB SSD |
| 3 | Database | PostgreSQL Cluster | 64 cores, 256 GB RAM, 2 TB SSD |
| 5 | Search | Elasticsearch Cluster | 64 cores, 256 GB RAM, 8 TB SSD |
| 8 | Processing | Agent Managers, Correlation | 32 cores, 64 GB RAM, 1 TB SSD |
| 2 | Cache | Redis Cluster | 16 cores, 32 GB RAM, 200 GB SSD |
Total Capacity: ~10,000 data sources, ~20 TB/month

Component Scaling Strategies

Backend API Scaling

Stateless Design: All API instances share the following backing services (a configuration sketch follows this list):
  • PostgreSQL for persistent data
  • Redis for session storage
  • Elasticsearch for log queries
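Because every instance is stateless, scaling out is largely a matter of pointing each new instance at the same shared services. The fragment below is a minimal, illustrative application.yml sketch using standard Spring Boot 2.x property names; the elasticsearch block is a placeholder key, since the exact property name depends on the UTMStack release:
spring:
  datasource:
    url: jdbc:postgresql://pg-master:5432/utmstack   # same database for every instance
  redis:
    host: redis-master                               # shared cache/session store
    port: 6379
    password: ${REDIS_PASSWORD}

# Placeholder key: all instances query the same search cluster
elasticsearch:
  hosts:
    - http://es-master:9200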
Load Balancing Configuration (NGINX):
upstream utmstack_backend {
    least_conn;
    server primary:8080 weight=2;
    server worker1:8080;
    server worker2:8080;
    keepalive 32;
}

server {
    listen 443 ssl http2;
    server_name utmstack.company.com;
    
    ssl_certificate /etc/ssl/utmstack.crt;
    ssl_certificate_key /etc/ssl/utmstack.key;
    
    location /api/ {
        proxy_pass http://utmstack_backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
    
    location /websocket {
        proxy_pass http://utmstack_backend;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}
Health Checks (the passive max_fails/fail_timeout checks work in stock NGINX; the active check directives below require the third-party nginx_upstream_check_module or an equivalent such as the NGINX Plus health_check directive):
upstream utmstack_backend {
    server primary:8080 max_fails=3 fail_timeout=30s;
    server worker1:8080 max_fails=3 fail_timeout=30s;
    server worker2:8080 max_fails=3 fail_timeout=30s;
    
    # Active health checks (nginx_upstream_check_module syntax)
    check interval=3000 rise=2 fall=3 timeout=1000 type=http;
    check_http_send "GET /actuator/health HTTP/1.0\r\n\r\n";
    check_http_expect_alive http_2xx http_3xx;
}

Agent Manager Scaling

gRPC Load Balancing:
# HAProxy configuration for gRPC
global
    maxconn 50000
    
defaults
    mode tcp
    timeout connect 10s
    timeout client 30s
    timeout server 30s
    
frontend grpc_frontend
    bind *:50051
    mode tcp
    default_backend grpc_backend
    
backend grpc_backend
    mode tcp
    balance roundrobin
    # In TCP mode, balancing happens per connection; long-lived gRPC/HTTP-2
    # streams stay pinned to one backend until the agent reconnects
    
    # Health check
    option tcp-check
    tcp-check connect
    
    server primary 10.0.1.10:50051 check
    server worker1 10.0.1.11:50051 check
    server worker2 10.0.1.12:50051 check
Agent Configuration:
# Agents connect to load balancer
server: grpc-lb.utmstack.local:50051
agent_key: ${AGENT_KEY}

# Connection pooling
connection_pool:
  max_connections: 10
  min_connections: 2
  keepalive_time: 60s

PostgreSQL Scaling

Master-Replica Setup
Master Configuration (postgresql.conf):
# Replication
wal_level = replica
max_wal_senders = 10
max_replication_slots = 10
wal_keep_size = 1GB

# Performance
shared_buffers = 16GB
effective_cache_size = 48GB
maintenance_work_mem = 2GB
work_mem = 64MB
max_connections = 200

# Write performance
wal_buffers = 16MB
checkpoint_completion_target = 0.9
max_wal_size = 4GB
Replica Configuration (add to the replica's postgresql.conf; the replica is initialized from a base backup of the master, and on PostgreSQL 12+ an empty standby.signal file in the data directory marks the node as a standby):
hot_standby = on
max_standby_streaming_delay = 30s
wal_receiver_status_interval = 10s
hot_standby_feedback = on

# Connection back to the master (illustrative user and connection options)
primary_conninfo = 'host=pg-master port=5432 user=replicator application_name=utmstack_replica'
Application Configuration:
spring:
  datasource:
    # Write operations go to master
    url: jdbc:postgresql://pg-master:5432/utmstack
    
  # Read-only queries can use the replica
  # (datasource-readonly is a custom key, not a standard Spring Boot property;
  #  it is consumed by the readOnlyDataSource bean shown below)
  datasource-readonly:
    url: jdbc:postgresql://pg-replica:5432/utmstack
    hikari:
      read-only: true
Connection Routing:
@Configuration
public class DatabaseConfig {
    @Bean
    @Primary
    public DataSource dataSource() {
        return DataSourceBuilder.create()
            .url("jdbc:postgresql://pg-master:5432/utmstack")
            .build();
    }
    
    @Bean(name = "readOnlyDataSource")
    public DataSource readOnlyDataSource() {
        return DataSourceBuilder.create()
            .url("jdbc:postgresql://pg-replica:5432/utmstack")
            .build();
    }
}

@Service
public class ReportService {
    @Autowired
    @Qualifier("readOnlyDataSource")
    private DataSource readOnlyDataSource;
    
    public List<Report> generateReport() {
        // Use read replica for heavy queries
        JdbcTemplate jdbc = new JdbcTemplate(readOnlyDataSource);
        return jdbc.query("SELECT ...", new ReportMapper());
    }
}

Elasticsearch Cluster

3-Node Cluster Configuration
Master + Data Node (es-master):
cluster.name: utmstack
node.name: es-master
# Boolean role settings apply to Elasticsearch 7.x; on 7.9+ and OpenSearch the
# equivalent is node.roles: [ master, data, ingest ]
node.master: true
node.data: true
node.ingest: true

network.host: 0.0.0.0
http.port: 9200
transport.port: 9300

discovery.seed_hosts:
  - es-master:9300
  - es-data1:9300
  - es-data2:9300

# A single master-eligible node is a single point of failure; larger clusters
# typically make three nodes master-eligible
cluster.initial_master_nodes:
  - es-master

# Memory
bootstrap.memory_lock: true

# Paths
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
Data Nodes (es-data1, es-data2):
cluster.name: utmstack
node.name: es-data1  # or es-data2
node.master: false
node.data: true
node.ingest: true

network.host: 0.0.0.0
http.port: 9200
transport.port: 9300

discovery.seed_hosts:
  - es-master:9300
  - es-data1:9300
  - es-data2:9300
Shard Allocation (index settings, applied per index or through an index template; zone-aware allocation uses the cluster-level setting cluster.routing.allocation.awareness.attributes in elasticsearch.yml rather than an index setting):
{
  "settings": {
    "number_of_shards": 6,
    "number_of_replicas": 1,
    "routing.allocation.total_shards_per_node": 3
  }
}
Client Configuration:
@Configuration
public class ElasticsearchConfig {
    @Bean
    public RestHighLevelClient client() {
        return new RestHighLevelClient(
            RestClient.builder(
                new HttpHost("es-master", 9200, "http"),
                new HttpHost("es-data1", 9200, "http"),
                new HttpHost("es-data2", 9200, "http")
            )
            .setRequestConfigCallback(builder -> 
                builder.setConnectTimeout(5000)
                    .setSocketTimeout(60000)
            )
            .setHttpClientConfigCallback(httpClientBuilder ->
                httpClientBuilder.setMaxConnTotal(100)
                    .setMaxConnPerRoute(30)
            )
        );
    }
}

Redis Cluster

Master-Replica Configuration
Master (redis-master):
bind 0.0.0.0
port 6379
requirepass ${REDIS_PASSWORD}

# Replication
repl-diskless-sync yes
repl-diskless-sync-delay 5

# Memory
maxmemory 16gb
maxmemory-policy allkeys-lru

# Persistence
appendonly yes
appendfilename "appendonly.aof"
appendfsync everysec
Replica (redis-replica):
bind 0.0.0.0
port 6379
replicaof redis-master 6379
masterauth ${REDIS_PASSWORD}
requirepass ${REDIS_PASSWORD}

# Read-only
replica-read-only yes
Sentinel for High Availability:
port 26379
sentinel monitor utmstack-redis redis-master 6379 2
sentinel auth-pass utmstack-redis ${REDIS_PASSWORD}
sentinel down-after-milliseconds utmstack-redis 5000
sentinel parallel-syncs utmstack-redis 1
sentinel failover-timeout utmstack-redis 10000
Application Configuration:
spring:
  redis:
    sentinel:
      master: utmstack-redis
      nodes:
        - sentinel1:26379
        - sentinel2:26379
        - sentinel3:26379
    password: ${REDIS_PASSWORD}

Monitoring Multi-Node Deployment

Key Metrics

Per-Node Metrics:
  • CPU usage
  • Memory usage
  • Disk I/O
  • Network throughput
  • Process count
Application Metrics:
  • Request rate per node
  • Response time per node
  • Error rate per node
  • Queue depth
  • Cache hit rate
Database Metrics:
  • Replication lag
  • Connection pool usage
  • Query performance
  • Lock contention
Search Cluster Metrics:
  • Cluster health (green/yellow/red)
  • Shard allocation
  • Indexing rate
  • Search rate
  • JVM heap usage

Monitoring Stack

Prometheus Configuration:
scrape_configs:
  - job_name: 'utmstack-backend'
    static_configs:
      - targets:
        - primary:8080
        - worker1:8080
        - worker2:8080
    metrics_path: '/actuator/prometheus'
    
  - job_name: 'elasticsearch'
    # Assumes a Prometheus exporter plugin is installed on the Elasticsearch nodes;
    # alternatively, scrape a standalone elasticsearch_exporter instance
    static_configs:
      - targets:
        - es-master:9200
        - es-data1:9200
        - es-data2:9200
    metrics_path: '/_prometheus/metrics'
    
  - job_name: 'postgresql'
    static_configs:
      - targets:
        - pg-exporter:9187

Migration Path

From Single-Node to Multi-Node

Phase 1: Add Secondary Processing Node
  1. Provision secondary node
  2. Install UTMStack worker components (a sketch of a processing-only node follows this list)
  3. Configure to connect to primary database/search
  4. Add to load balancer pool
  5. Monitor performance
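A hypothetical Docker Compose fragment for such a processing-only node is sketched below; the service names, images, and environment variables are placeholders rather than the actual UTMStack deployment manifests:
services:
  agent-manager:
    image: utmstack/agent-manager:latest      # placeholder image name
    environment:
      DB_HOST: pg-master                      # reuse the primary node's PostgreSQL
      DB_PASS: ${DB_PASS}
    ports:
      - "50051:50051"                         # gRPC port exposed to the load balancer
  correlation:
    image: utmstack/correlation:latest        # placeholder image name
    environment:
      ELASTICSEARCH_HOST: es-master           # reuse the existing search cluster
      ELASTICSEARCH_PORT: "9200"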
Phase 2: Scale Database
  1. Set up PostgreSQL replication
  2. Configure application for read replicas
  3. Migrate heavy queries to replicas
Phase 3: Scale Search Cluster
  1. Add Elasticsearch data nodes
  2. Rebalance shards
  3. Update index templates for more shards
  4. Update application connection pool
Phase 4: Add More Workers
  1. Add processing nodes as needed
  2. Update load balancer configuration
  3. Monitor and tune

Next Steps

High Availability

Configure for maximum uptime

Performance Tuning

Optimize multi-node performance

Data Storage

Understand storage scaling

System Architecture

Review overall architecture

Build docs developers (and LLMs) love