Snuba is designed to scale horizontally across multiple dimensions. This guide covers scaling strategies for API servers, consumers, and ClickHouse storage.
Scaling Overview
Snuba consists of three independently scalable components:
API Layer
Stateless query servers that can be scaled horizontally:
Scale based on query rate and latency
No coordination required between instances
Typically CPU-bound
Consumer Layer
Stateful Kafka consumers that ingest data:
Scale based on Kafka lag and throughput
Limited by Kafka partition count
Typically I/O and network bound
Storage Layer
ClickHouse clusters for data storage:
Scale based on data volume and query complexity
Requires careful planning for sharding
Both vertical and horizontal scaling options
Scaling the API Layer
Horizontal Scaling
API servers are stateless and can be scaled freely:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: snuba-api
spec:
  replicas: 10  # Scale to 10 replicas
  selector:
    matchLabels:
      app: snuba
      component: api
  template:
    spec:
      containers:
        - name: api
          image: snuba:latest
          resources:
            requests:
              memory: "2Gi"
              cpu: "1"
            limits:
              memory: "4Gi"
              cpu: "2"
Auto-scaling Configuration
Use Horizontal Pod Autoscaler (HPA):
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: snuba-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: snuba-api
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
        - type: Pods
          value: 2
          periodSeconds: 60
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
Set minReplicas to at least 2 for high availability. Use conservative scale-down policies to avoid thrashing.
Worker Configuration
Configure workers per API pod:
# settings.py
API_WORKERS = 4  # Number of worker processes
API_THREADS = 8  # Number of threads per worker
# Total concurrent requests = API_WORKERS * API_THREADS = 32

# Or via environment variables
docker run -e API_WORKERS=4 -e API_THREADS=8 snuba:latest api
Worker sizing guidelines:
CPU-bound workloads: workers = num_cores
I/O-bound workloads: workers = num_cores * 2-4
Threads per worker: 4-8 for query serving
Memory: ~500MB base + 100MB per worker
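The guidelines above can be expressed as a small sizing helper. This is a hypothetical sketch (size_api_workers is not part of Snuba) that just encodes the rules of thumb:

```python
def size_api_workers(num_cores: int, io_bound: bool = True,
                     threads_per_worker: int = 8) -> dict:
    """Apply the rule-of-thumb sizing above (illustrative helper only)."""
    # I/O-bound: workers = num_cores * 2; CPU-bound: workers = num_cores
    workers = num_cores * 2 if io_bound else num_cores
    return {
        "API_WORKERS": workers,
        "API_THREADS": threads_per_worker,
        "max_concurrent_requests": workers * threads_per_worker,
        # ~500 MB base + ~100 MB per worker
        "est_memory_mb": 500 + 100 * workers,
    }

print(size_api_workers(num_cores=4))
```

For a 4-core, I/O-bound pod this suggests 8 workers and 64 concurrent requests, with an estimated 1.3 GB of memory, which fits the 2Gi request in the Deployment above.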
Load Balancing
Distribute traffic across API pods:
apiVersion: v1
kind: Service
metadata:
  name: snuba-api
spec:
  type: LoadBalancer
  selector:
    app: snuba
    component: api
  ports:
    - port: 80
      targetPort: 1218
  sessionAffinity: None  # Round-robin distribution
Connection Pool Tuning
Optimize ClickHouse connections:
# settings.py
CLICKHOUSE_MAX_POOL_SIZE = 25  # Max connections per cluster

CLUSTERS = [
    {
        "host": "clickhouse",
        "port": 9000,
        "max_connections": 10,       # Connections per API worker
        "block_connections": False,  # Don't block on a full pool
    }
]
Connection pool sizing: Total connections = API_WORKERS * max_connections * num_pods. Ensure ClickHouse can handle the total connection count (default max_connections: 4096).
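That arithmetic is easy to get wrong at a glance, so a quick check before rolling out new pool settings can help (a hypothetical helper, not a Snuba API):

```python
CLICKHOUSE_DEFAULT_MAX_CONNECTIONS = 4096  # ClickHouse server default

def total_clickhouse_connections(api_workers: int, max_connections: int,
                                 num_pods: int) -> int:
    """Total connections = API_WORKERS * max_connections * num_pods."""
    return api_workers * max_connections * num_pods

total = total_clickhouse_connections(api_workers=4, max_connections=10,
                                     num_pods=20)
print(total)  # 800
assert total <= CLICKHOUSE_DEFAULT_MAX_CONNECTIONS
```

With 4 workers, 10 connections each, and 20 pods, the total is 800, comfortably under the 4096 default.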
Scaling Consumers
Understanding Consumer Scaling
Consumer scaling is limited by Kafka partition count:
Max consumer replicas = Number of Kafka partitions
Example: A topic with 16 partitions can have max 16 consumer instances in the same consumer group.
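The cap can be encoded directly when computing replica counts (a minimal sketch; the function name is illustrative):

```python
def effective_consumer_replicas(desired: int, partitions: int) -> int:
    # Replicas beyond the partition count join the consumer group
    # but receive no partition assignments and sit idle.
    return min(desired, partitions)

print(effective_consumer_replicas(desired=20, partitions=16))  # 16
```

Scaling past the partition count wastes resources; to go beyond it, repartition the Kafka topic first.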
Horizontal Scaling
Scale consumers based on lag:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: snuba-consumer-errors
spec:
  replicas: 8  # Match partition count for best performance
  selector:
    matchLabels:
      app: snuba
      component: consumer
      storage: errors
  template:
    spec:
      containers:
        - name: consumer
          image: snuba:latest
          command:
            - snuba
            - consumer
            - --storage=errors
            - --consumer-group=snuba-consumers
            - --max-batch-size=50000
            - --max-batch-time-ms=2000
            - --processes=2  # Parallel processing within the consumer
          resources:
            requests:
              memory: "2Gi"
              cpu: "1"
            limits:
              memory: "8Gi"
              cpu: "4"
Per-Storage Scaling
Each storage type can be scaled independently:
# Scale errors consumer
kubectl scale deployment snuba-consumer-errors --replicas=16
# Scale transactions consumer
kubectl scale deployment snuba-consumer-transactions --replicas=8
# Scale metrics consumer
kubectl scale deployment snuba-consumer-metrics --replicas=4
Batch Size Tuning
Optimize batch sizes for throughput:
# Default batch configuration
DEFAULT_MAX_BATCH_SIZE = 50000 # Max messages per batch
DEFAULT_MAX_BATCH_TIME_MS = 2000 # Max time to wait for batch
# Tuning guidelines:
# - High throughput, low latency: smaller batches (10k-25k)
# - High throughput, latency tolerant: larger batches (50k-100k)
# - Low throughput: shorter time window (500-1000ms)
# Consumer with optimized batching
snuba consumer \
--storage=errors \
--max-batch-size=100000 \
--max-batch-time-ms=5000 \
--max-insert-batch-size=50000 \
--max-insert-batch-time-ms=2000
Batch Size Guidelines by Storage
Errors Storage:
Batch size: 50,000 messages
Batch time: 2,000ms
Rationale: Balance between latency and throughput
Transactions Storage:
Batch size: 75,000 messages
Batch time: 3,000ms
Rationale: Larger events, need bigger batches
Metrics Storage:
Batch size: 100,000 messages
Batch time: 5,000ms
Rationale: Small events, high volume, can tolerate latency
Replays Storage:
Batch size: 25,000 messages
Batch time: 1,000ms
Rationale: Large payloads, need quick flushing
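The per-storage guidelines above can be kept in one place and turned into consumer flags. This is a hedged sketch (BATCH_CONFIG and consumer_args are hypothetical, not Snuba settings):

```python
BATCH_CONFIG = {
    # storage: (max_batch_size, max_batch_time_ms)
    "errors":       (50_000, 2_000),
    "transactions": (75_000, 3_000),
    "metrics":      (100_000, 5_000),
    "replays":      (25_000, 1_000),
}

def consumer_args(storage: str) -> list[str]:
    """Build a consumer command line from the per-storage guidelines."""
    size, time_ms = BATCH_CONFIG[storage]
    return [
        "snuba", "consumer",
        f"--storage={storage}",
        f"--max-batch-size={size}",
        f"--max-batch-time-ms={time_ms}",
    ]

print(" ".join(consumer_args("transactions")))
```

Centralizing the numbers this way keeps the per-storage rationale reviewable in one diff instead of scattered across Deployment manifests.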
Parallel Processing
Increase processing parallelism within consumers:
snuba consumer \
--storage=errors \
--processes=4 \
--input-block-size=10000 \
--output-block-size=10000
Process count guidelines:
Start with 2 processes per consumer
Increase if CPU usage < 80%
Max: num_cores - 1 to leave headroom
Monitor memory usage (each process needs ~1-2GB)
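The process-count guidelines above can be combined into one calculation, taking the lower of the CPU-headroom and memory limits (an illustrative sketch, not a Snuba API):

```python
def consumer_process_count(num_cores: int, mem_gb: float,
                           gb_per_process: float = 2.0) -> int:
    # Leave one core of headroom; budget ~1-2 GB RAM per process.
    by_cpu = max(1, num_cores - 1)
    by_mem = max(1, int(mem_gb // gb_per_process))
    return min(by_cpu, by_mem)

print(consumer_process_count(num_cores=4, mem_gb=8))  # 3
```

A 4-core, 8 GB pod lands on 3 processes: the CPU headroom rule binds before the memory budget does.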
Kafka Consumer Configuration
Optimize Kafka consumer settings:
snuba consumer \
--storage=errors \
--queued-max-messages-kbytes=100000 \
--queued-min-messages=50000 \
--max-poll-interval-ms=300000 \
--group-instance-id=consumer-1 # Static membership
Static membership (group-instance-id) reduces rebalancing overhead. Use unique IDs per pod (e.g., based on pod name).
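One way to derive a unique, stable ID is from the pod name, exposed via the Kubernetes downward API. This sketch assumes a POD_NAME environment variable wired up in the pod spec from metadata.name:

```python
import os

def group_instance_id(default: str = "consumer-local") -> str:
    # Pod names are unique per namespace; StatefulSet pod names are
    # also stable across restarts, which suits static membership well.
    return os.environ.get("POD_NAME", default)

# e.g. pass f"--group-instance-id={group_instance_id()}" to the consumer
os.environ["POD_NAME"] = "snuba-consumer-errors-3"
print(group_instance_id())  # snuba-consumer-errors-3
```

Note that plain Deployment pods get a random suffix on each restart, so a StatefulSet is the better fit when the whole point is a stable identity.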
Consumer Auto-scaling
Scale based on Kafka lag using KEDA:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: snuba-consumer-errors-scaler
spec:
  scaleTargetRef:
    name: snuba-consumer-errors
  minReplicaCount: 2
  maxReplicaCount: 16
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka:9092
        consumerGroup: snuba-consumers
        topic: ingest-events
        lagThreshold: "1000000"  # Scale up if lag > 1M
        offsetResetPolicy: latest
Scaling ClickHouse
Vertical Scaling
Increase resources for existing nodes:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: clickhouse
spec:
  template:
    spec:
      containers:
        - name: clickhouse
          image: altinity/clickhouse-server:25.3.6.10034.altinitystable
          resources:
            requests:
              memory: "16Gi"  # Increased from 8Gi
              cpu: "4"        # Increased from 2
            limits:
              memory: "32Gi"  # Increased from 16Gi
              cpu: "8"        # Increased from 4
Vertical scaling benefits:
Simpler than horizontal scaling
Better for single-node setups
Good for query-heavy workloads
No resharding required
Limitations:
Single point of failure
Limited by node capacity
Can’t exceed max node size
Horizontal Scaling with Sharding
Distribute data across multiple nodes. See ClickHouse Topology for detailed cluster setup.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: clickhouse-shard
spec:
  replicas: 3  # 3 shards
  serviceName: clickhouse-shard
  template:
    spec:
      containers:
        - name: clickhouse
          image: altinity/clickhouse-server:25.3.6.10034.altinitystable
          volumeMounts:
            - name: config
              mountPath: /etc/clickhouse-server/config.d
Storage Slicing
Snuba supports storage slicing for horizontal scaling:
# settings.py
SLICED_STORAGE_SETS = {
    "events": 4,  # 4 slices for the events storage set
}
# Each slice gets its own:
# - ClickHouse cluster
# - Kafka topics
# - Consumer group
Slicing configuration:
SLICED_CLUSTERS = [
    {
        "host": "clickhouse-slice-0",
        "storage_sets": {("events", 0)},  # Slice 0
    },
    {
        "host": "clickhouse-slice-1",
        "storage_sets": {("events", 1)},  # Slice 1
    },
    {
        "host": "clickhouse-slice-2",
        "storage_sets": {("events", 2)},  # Slice 2
    },
    {
        "host": "clickhouse-slice-3",
        "storage_sets": {("events", 3)},  # Slice 3
    },
]

SLICED_KAFKA_TOPIC_MAP = {
    ("ingest-events", 0): "ingest-events-slice-0",
    ("ingest-events", 1): "ingest-events-slice-1",
    ("ingest-events", 2): "ingest-events-slice-2",
    ("ingest-events", 3): "ingest-events-slice-3",
}

LOGICAL_PARTITION_MAPPING = {
    "events": {
        0: 0,  # Partition 0 -> Slice 0
        1: 1,  # Partition 1 -> Slice 1
        2: 2,  # Partition 2 -> Slice 2
        3: 3,  # Partition 3 -> Slice 3
    }
}
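To see how these maps fit together, a small routing sketch (a hypothetical helper, not Snuba's internal API) resolves a logical partition to its sliced topic:

```python
LOGICAL_PARTITION_MAPPING = {"events": {0: 0, 1: 1, 2: 2, 3: 3}}
SLICED_KAFKA_TOPIC_MAP = {
    ("ingest-events", s): f"ingest-events-slice-{s}" for s in range(4)
}

def topic_for_partition(storage_set: str, logical_partition: int,
                        base_topic: str) -> str:
    # Logical partition -> slice, then (topic, slice) -> physical topic.
    slice_id = LOGICAL_PARTITION_MAPPING[storage_set][logical_partition]
    return SLICED_KAFKA_TOPIC_MAP[(base_topic, slice_id)]

print(topic_for_partition("events", 2, "ingest-events"))  # ingest-events-slice-2
```

The indirection through LOGICAL_PARTITION_MAPPING is what allows remapping logical partitions to slices later without renaming topics.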
Storage Volume Expansion
Expand persistent volumes:
# Edit the PVC to increase its size
kubectl edit pvc data-clickhouse-0

# Change spec.resources.requests.storage:
spec:
  resources:
    requests:
      storage: 500Gi  # Increased from 100Gi

# Kubernetes will then expand the volume automatically
Volume expansion requires a storage class that supports expansion (allowVolumeExpansion: true). The pod may need to be restarted for the expansion to take effect.
Query Optimization
Optimize expensive queries:
Add indexes for frequently filtered columns
Use materialized views for common aggregations
Partition pruning - structure queries to use partition keys
Limit result sets - use LIMIT clauses
Avoid SELECT * - specify needed columns only
Table Optimization
Regularly optimize tables:
# Run optimize as a Kubernetes CronJob
apiVersion: batch/v1
kind: CronJob
metadata:
  name: snuba-optimize
spec:
  schedule: "0 2 * * *"  # Daily at 2 AM UTC
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: optimize
              image: snuba:latest
              command: ["snuba", "optimize"]
              env:
                - name: SNUBA_SETTINGS
                  value: "production"
Optimize configuration:
OPTIMIZE_JOB_CUTOFF_TIME = 23 # Stop at 11 PM UTC
OPTIMIZE_QUERY_TIMEOUT = 14400 # 4 hour timeout
OPTIMIZE_BASE_SLEEP_TIME = 300 # 5 min between checks
OPTIMIZE_MAX_SLEEP_TIME = 7200 # Max 2 hours wait
Retention Management
Manage data retention:
ENFORCE_RETENTION = True
DEFAULT_RETENTION_DAYS = 90
LOWER_RETENTION_DAYS = 30

# Run the cleanup job as a Kubernetes CronJob
apiVersion: batch/v1
kind: CronJob
metadata:
  name: snuba-cleanup
spec:
  schedule: "0 3 * * *"  # Daily at 3 AM UTC
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: cleanup
              image: snuba:latest
              command: ["snuba", "cleanup"]
Capacity Planning
Estimating Storage Needs
Events storage:
Daily events: 1 billion
Avg event size: 2 KB
Daily storage: 1B * 2KB = 2 TB
With compression (4x): 500 GB/day
90-day retention: 45 TB
Metrics storage:
Daily metrics: 10 billion
Avg metric size: 200 bytes
Daily storage: 10B * 200B = 2 TB
With compression (10x): 200 GB/day
90-day retention: 18 TB
Add 50% overhead for indexes, merges, and temporary data. Plan for 3x current usage to allow for growth.
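The arithmetic above generalizes to a small estimator, using the same assumptions (4x compression for events, 50% overhead, 3x growth headroom); the helper itself is illustrative:

```python
def retention_storage_tb(events_per_day: float, avg_event_bytes: float,
                         compression_ratio: float, retention_days: int,
                         overhead: float = 1.5, growth: float = 3.0) -> dict:
    """Estimate storage needs from daily volume and retention."""
    raw_gb_per_day = events_per_day * avg_event_bytes / 1e9
    compressed_gb_per_day = raw_gb_per_day / compression_ratio
    base_tb = compressed_gb_per_day * retention_days / 1000
    return {
        "gb_per_day": compressed_gb_per_day,
        "retention_tb": base_tb,
        # Overhead covers indexes, merges, temp data; growth is headroom.
        "provisioned_tb": base_tb * overhead * growth,
    }

est = retention_storage_tb(1e9, 2_000, 4, 90)
print(est)  # {'gb_per_day': 500.0, 'retention_tb': 45.0, 'provisioned_tb': 202.5}
```

Reproducing the events example: 1B events * 2 KB at 4x compression gives 500 GB/day and 45 TB over 90 days; with overhead and growth applied, the provisioned figure is roughly 200 TB.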
Computing Resource Requirements
API servers:
1 core per 100 QPS
2-4 GB RAM per core
Min 2 replicas for HA
Consumers:
1 core per 50k messages/sec
2-4 GB RAM per core
1 consumer per Kafka partition (max)
ClickHouse:
1 core per 50 QPS
4-8 GB RAM per core
RAM should be 2-4x total dataset size for hot data
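The API ratio above can be turned into a replica estimator (a hedged sketch; the pod size and function name are assumptions, not Snuba defaults):

```python
import math

def api_replica_estimate(peak_qps: float, qps_per_core: float = 100,
                         cores_per_pod: int = 2, min_replicas: int = 2) -> int:
    # 1 core per ~100 QPS; never fewer than 2 replicas for HA.
    cores = math.ceil(peak_qps / qps_per_core)
    return max(min_replicas, math.ceil(cores / cores_per_pod))

print(api_replica_estimate(peak_qps=1500))  # 8
```

For a 1,500 QPS peak this suggests 15 cores, i.e. 8 two-core pods; the same shape of calculation applies to consumers and ClickHouse with their respective ratios.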
Scaling Timeline
Current State (0-3 months)
3 API replicas
4 consumer replicas per storage
1 ClickHouse node (16 cores, 64GB RAM)
100 GB storage per day
6 Month Projection
6 API replicas (+100%)
8 consumer replicas per storage (+100%)
3 ClickHouse nodes (scale out)
200 GB storage per day
12 Month Projection
12 API replicas (+100%)
16 consumer replicas per storage (+100%)
6 ClickHouse nodes (scale out)
500 GB storage per day
Monitoring Scaling Metrics
Key Metrics to Watch
# API scaling indicators
api.cpu.usage > 70%
api.memory.usage > 80%
api.request_queue > 100
query.latency.p95 > 5s
# Consumer scaling indicators
consumer.lag > 1M
consumer.cpu.usage > 80%
consumer.processing_time.p95 > 10s
consumer.error_rate > 1%
# ClickHouse scaling indicators
clickhouse.cpu.usage > 80%
clickhouse.memory.usage > 85%
clickhouse.disk.usage > 75%
clickhouse.query.queue > 50
Best Practices
Scale proactively: Don't wait for alerts; scale before hitting limits
Test scaling: Practice scaling operations in staging
Monitor continuously: Watch metric trends, not just current values
Automate scaling: Use HPA and KEDA for automatic scaling
Document capacity: Keep capacity planning docs up to date
Plan for peaks: Size for 2-3x normal load to handle spikes
Scale gradually: Increase by 25-50% at a time
Balance costs: Over-provisioning is expensive; under-provisioning is worse
Troubleshooting
High API Latency
# Check if API pods are CPU-bound
kubectl top pods -l component=api
# Scale up if CPU > 80%
kubectl scale deployment snuba-api --replicas=10
# Check ClickHouse query performance
SELECT query, elapsed FROM system.processes WHERE elapsed > 5;
Consumer Lag Not Decreasing
# Check consumer CPU usage
kubectl top pods -l component=consumer
# Scale up consumers (max = partition count)
kubectl scale deployment snuba-consumer-errors --replicas=16
# Check ClickHouse insert performance
SELECT * FROM system.metrics WHERE metric LIKE '%Insert%';
ClickHouse Out of Memory
# Check memory usage
SELECT * FROM system.metrics WHERE metric = 'MemoryTracking';
# Kill long-running queries
KILL QUERY WHERE elapsed > 300;
# Scale vertically or add nodes
kubectl edit statefulset clickhouse