CockroachDB is designed to scale horizontally by adding more nodes to your cluster. The database automatically rebalances data across nodes to maintain optimal performance and fault tolerance.
Horizontal Scaling
CockroachDB scales horizontally, meaning you add more nodes rather than upgrading existing hardware. This approach provides:
Linear Scalability: Performance scales nearly linearly with the number of nodes
Automatic Rebalancing: Data automatically redistributes across all nodes
No Downtime: Add or remove nodes without interrupting service
Increased Capacity: Distribute storage and compute across more resources
Adding Nodes to a Cluster
Manual Node Addition
Prepare the New Node
Install CockroachDB on the new machine and ensure it can communicate with existing nodes on port 26257.
Copy Security Certificates
If running in secure mode, copy the CA certificate and key, then create a node certificate:
# On the new node
mkdir certs my-safe-directory
# Copy CA certificate from existing node
scp user@existing-node:/path/to/certs/ca.crt certs/
scp user@existing-node:/path/to/certs/ca.key my-safe-directory/
# Create node certificate
cockroach cert create-node \
<new-node-address> \
localhost \
127.0.0.1 \
--certs-dir=certs \
--ca-key=my-safe-directory/ca.key
Start the New Node
cockroach start \
--certs-dir=certs \
--advertise-addr=<new-node-address> \
--join=<existing-node1>,<existing-node2>,<existing-node3> \
--cache=.25 \
--max-sql-memory=.25 \
--background
The new node automatically joins the cluster and begins receiving data.
Verify Node Addition
Check that the node has joined successfully:
SELECT node_id, address, is_live FROM crdb_internal.gossip_nodes;
Kubernetes Node Scaling
For Kubernetes deployments, scale the StatefulSet:
# Increase replicas from 3 to 5
kubectl scale statefulset cockroachdb --replicas=5
# Verify scaling
kubectl get pods -l app=cockroachdb
When scaling down, ensure you’re not removing so many nodes that you lose quorum or data replicas. CockroachDB requires a majority of nodes to be available.
Removing Nodes from a Cluster
Graceful Node Decommissioning
Before removing a node, decommission it to safely transfer its data:
Initiate Decommissioning
cockroach node decommission <node-id> \
--certs-dir=certs \
--host=<any-node-address>
This process moves all data off the target node to other nodes in the cluster.
Monitor Decommissioning Progress
cockroach node status \
--certs-dir=certs \
--host=<any-node-address> \
--decommission
Wait until the node shows as “decommissioned” before proceeding.
Stop the Node
Once decommissioning completes, stop the process:
# Find the process ID
ps aux | grep cockroach
# Gracefully stop
kill -TERM <process-id>
Verify Removal
SELECT node_id, address, is_live, membership
FROM crdb_internal.gossip_nodes;
Never force-remove nodes without decommissioning unless they’ve permanently failed. Improper removal can lead to data loss or unavailability.
Automatic Rebalancing
CockroachDB automatically rebalances data across nodes based on:
Rebalancing Triggers
Node Addition: New nodes receive data from existing nodes
Node Removal: Data moves from decommissioned nodes
Uneven Distribution: Data shifts to balance storage utilization
Locality Changes: Data moves closer to where it’s accessed
Monitoring Rebalancing
Monitor Replica Distribution
SELECT
store_id,
node_id,
replica_count,
capacity,
available,
used
FROM crdb_internal.kv_store_status
ORDER BY node_id;
Controlling Rebalancing Rate
Adjust cluster settings to control rebalancing speed:
-- Slower rebalancing (less impact on performance)
SET CLUSTER SETTING kv.snapshot_rebalance.max_rate = '32MiB';
-- Faster rebalancing (completes sooner, higher impact)
SET CLUSTER SETTING kv.snapshot_rebalance.max_rate = '128MiB';
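For intuition about the trade-off, the rate cap puts a floor on how long rebalancing takes. A rough back-of-envelope sketch, assuming a hypothetical 500 GiB of replica data to move:

```shell
# Minimum transfer time at each rate cap
# (hypothetical figure: 500 GiB of replica data to move).
data_mib=$((500 * 1024))
for rate in 32 128; do
  secs=$((data_mib / rate))
  echo "rate=${rate}MiB/s -> ~$((secs / 60)) min minimum"
done
```

At 32 MiB/s this works out to roughly 266 minutes versus about 66 minutes at 128 MiB/s; actual times are longer because nodes also serve foreground traffic.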
Scaling Considerations
When to Scale Up
High CPU Utilization: Sustained CPU usage above 70% across nodes
Memory Pressure: Frequent cache evictions or OOM warnings
Storage Capacity: Disk usage exceeding 80% capacity
Query Latency: Increasing query response times
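The storage-capacity signal above is simple to automate. A minimal sketch using hypothetical numbers (in practice you would read capacity and used from crdb_internal.kv_store_status):

```shell
# Flag a store that crosses the 80% disk-usage threshold.
# capacity_gib and used_gib are hypothetical placeholder values.
capacity_gib=500
used_gib=425
pct=$((used_gib * 100 / capacity_gib))
echo "store usage: ${pct}%"
if [ "$pct" -ge 80 ]; then
  echo "above threshold: plan to add nodes"
fi
```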
Scaling Best Practices
Scale in Odd Numbers
Always maintain an odd number of nodes (3, 5, 7, etc.) to ensure proper quorum for Raft consensus. With an even number of nodes, you don’t gain additional fault tolerance. A 4-node cluster can still only lose 1 node, same as a 3-node cluster.
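The quorum arithmetic behind this rule can be checked directly: a Raft group of n voters needs floor(n/2) + 1 live members, so it tolerates floor((n-1)/2) failures. A quick sketch:

```shell
# Majority quorum: floor(n/2) + 1 live nodes required,
# so floor((n-1)/2) failures are tolerated.
for n in 3 4 5 6 7; do
  echo "nodes=$n quorum=$((n / 2 + 1)) tolerates=$(((n - 1) / 2))"
done
```

Note that 4 nodes tolerate the same single failure as 3, and 6 the same two failures as 5, which is why even node counts add capacity but no extra fault tolerance.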
Scale Gradually
Add one or two nodes at a time and wait for rebalancing to complete before adding more.
Monitor During Scaling
Watch metrics for CPU, memory, disk I/O, and network throughput during scaling operations.
Plan for Replication Factor
Ensure you have at least replication_factor + 1 nodes, so a node can fail or be decommissioned without leaving ranges under-replicated.
Replication Factor
The replication factor determines how many copies of each data range exist:
View Current Replication Factor
SHOW ZONE CONFIGURATIONS;
Change Default Replication Factor
ALTER RANGE default CONFIGURE ZONE USING num_replicas = 5;
Replication Factor Guidelines:
3 replicas (default): Tolerates 1 node failure
5 replicas: Tolerates 2 node failures
7 replicas: Tolerates 3 node failures
Higher replication increases fault tolerance but requires more storage and network bandwidth.
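The storage cost is easy to quantify: each range is stored num_replicas times, so raw storage grows linearly with the replication factor. A small sketch, assuming a hypothetical 100 GiB of logical data:

```shell
# Failures tolerated and raw storage used per replication factor
# (hypothetical 100 GiB of logical data).
logical_gib=100
for r in 3 5 7; do
  echo "replicas=$r tolerates=$(((r - 1) / 2)) raw=$((r * logical_gib))GiB"
done
```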
Locality-Aware Scaling
When deploying across multiple regions or availability zones:
cockroach start \
--certs-dir=certs \
--advertise-addr=<node-address> \
--join=<join-addresses> \
--locality=region=us-east,zone=us-east-1a \
--background
Constrain Data to Specific Localities
ALTER DATABASE mydb CONFIGURE ZONE USING
num_replicas = 3,
constraints = '{+region=us-east: 2, +region=us-west: 1}';
This ensures 2 replicas stay in us-east and 1 in us-west.
Metrics to Watch During Scaling
Replica Count per Store: Should be balanced across stores
Snapshot Rate: Shows active rebalancing activity
Disk I/O: Increases during rebalancing
Network Throughput: Higher during data movement
Query Latency: Should remain stable during scaling
Range Count: Total ranges should distribute evenly
Troubleshooting Scaling Issues
Rebalancing Stuck
SELECT
range_id,
start_key,
end_key,
replicas,
replica_localities
FROM crdb_internal.ranges
WHERE unavailable = true OR under_replicated = true;
Uneven Distribution
Identify Unevenly Distributed Ranges
SELECT
store_id,
node_id,
replica_count,
capacity,
used * 100.0 / capacity AS pct_used
FROM crdb_internal.kv_store_status
ORDER BY pct_used DESC;
Next Steps
Deployment: Learn deployment strategies
Migration: Migrate data to CockroachDB
Backup & Restore: Implement backup strategies