CockroachDB is designed to scale horizontally by adding more nodes to your cluster. The database automatically rebalances data across nodes to maintain optimal performance and fault tolerance.

Horizontal Scaling

CockroachDB scales horizontally, meaning you add more nodes rather than upgrading existing hardware. This approach provides:

  • Linear Scalability: performance scales nearly linearly with the number of nodes
  • Automatic Rebalancing: data automatically redistributes across all nodes
  • No Downtime: add or remove nodes without interrupting service
  • Increased Capacity: storage and compute spread across more resources

Adding Nodes to a Cluster

Manual Node Addition

1. Prepare the New Node

Install CockroachDB on the new machine and ensure it can communicate with existing nodes on port 26257.
2. Copy Security Certificates

If running in secure mode, copy the CA certificate and create node certificates:
# On the new node
mkdir certs my-safe-directory

# Copy CA certificate from existing node
scp user@existing-node:/path/to/certs/ca.crt certs/
scp user@existing-node:/path/to/certs/ca.key my-safe-directory/

# Create node certificate
cockroach cert create-node \
  <new-node-address> \
  localhost \
  127.0.0.1 \
  --certs-dir=certs \
  --ca-key=my-safe-directory/ca.key
3. Start the New Node

cockroach start \
  --certs-dir=certs \
  --advertise-addr=<new-node-address> \
  --join=<existing-node1>,<existing-node2>,<existing-node3> \
  --cache=.25 \
  --max-sql-memory=.25 \
  --background
The new node automatically joins the cluster and begins receiving data.
4. Verify Node Addition

Check that the node has joined successfully:
SELECT node_id, address, is_live FROM crdb_internal.gossip_nodes;

Kubernetes Node Scaling

For Kubernetes deployments, scale the StatefulSet:
# Increase replicas from 3 to 5
kubectl scale statefulset cockroachdb --replicas=5

# Verify scaling
kubectl get pods -l app=cockroachdb
When scaling down, ensure you’re not removing so many nodes that you lose quorum or data replicas. CockroachDB requires a majority of nodes to be available.
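Before scaling down, a small guard script can encode that quorum warning; the rule shown here (never target fewer nodes than the replication factor) and the default factor of 3 are assumptions for illustration, not a CockroachDB-provided check.

```shell
# Hedged sketch: refuse a StatefulSet scale-down target below the
# replication factor, since fewer nodes than replicas means some ranges
# would lose copies. rf defaults to 3, CockroachDB's default.
safe_to_scale_down() {
  local target=$1 rf=${2:-3}
  [ "$target" -ge "$rf" ]
}

if safe_to_scale_down 4; then
  echo "scaling to 4 replicas keeps all range copies"
fi
safe_to_scale_down 2 || echo "refusing: target is below the replication factor"
```

Run a check like this before issuing `kubectl scale`, and decommission the nodes being removed first, as described below.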

Removing Nodes from a Cluster

Graceful Node Decommissioning

Before removing a node, decommission it to safely transfer its data:
1. Initiate Decommissioning

cockroach node decommission <node-id> \
  --certs-dir=certs \
  --host=<any-node-address>
This process moves all data off the target node to other nodes in the cluster.
2. Monitor Decommissioning Progress

cockroach node status \
  --certs-dir=certs \
  --host=<any-node-address> \
  --decommission
Wait until the node shows as “decommissioned” before proceeding.
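In automation, that wait can be scripted by checking the status text for the final membership state. Parsing CLI output like this is an assumption about its formatting, so treat the sketch as illustrative:

```shell
# Illustrative guard: only proceed once the status text reports the node
# as "decommissioned". In practice you would feed in the output of
# `cockroach node status --decommission`.
is_decommissioned() {
  echo "$1" | grep -qw "decommissioned"
}

status_line="4 10.0.0.4:26257 ... decommissioned"   # sample output line
if is_decommissioned "$status_line"; then
  echo "node is fully decommissioned; safe to stop the process"
fi
```

The word-boundary match (`-w`) avoids a false positive on the intermediate "decommissioning" state.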
3. Stop the Node

Once decommissioning completes:
# Find the process ID
ps aux | grep cockroach

# Gracefully stop
kill -TERM <process-id>
4. Verify Removal

SELECT node_id, draining, membership
FROM crdb_internal.gossip_liveness;
Never force-remove nodes without decommissioning unless they’ve permanently failed. Improper removal can lead to data loss or unavailability.

Automatic Rebalancing

CockroachDB automatically rebalances data across nodes based on:

Rebalancing Triggers

  • Node Addition: New nodes receive data from existing nodes
  • Node Removal: Data moves from decommissioned nodes
  • Uneven Distribution: Data shifts to balance storage utilization
  • Locality Changes: Data moves closer to where it’s accessed

Monitoring Rebalancing

Monitor Replica Distribution
SELECT 
  store_id,
  node_id,
  range_count,
  capacity,
  available,
  used
FROM crdb_internal.kv_store_status
ORDER BY node_id;

Controlling Rebalancing Rate

Adjust cluster settings to control rebalancing speed:
-- Slower rebalancing (less impact on performance)
SET CLUSTER SETTING kv.snapshot_rebalance.max_rate = '32MiB';

-- Faster rebalancing (completes sooner, higher impact)
SET CLUSTER SETTING kv.snapshot_rebalance.max_rate = '128MiB';
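The trade-off is roughly linear: doubling the rate cap halves the transfer time per snapshot. A back-of-envelope sketch (the snapshot size is illustrative, not a CockroachDB default):

```shell
# Rough estimate: seconds to transfer one snapshot of a given size (MiB)
# at a given rate cap (MiB/s). Integer arithmetic; illustrative only.
snapshot_seconds() {
  local size_mib=$1 rate_mib=$2
  echo $(( size_mib / rate_mib ))
}

snapshot_seconds 512 32    # 512 MiB snapshot at 32 MiB/s -> 16
snapshot_seconds 512 128   # same snapshot at 128 MiB/s   -> 4
```

Pick the higher rate when the cluster has I/O headroom and you want rebalancing to finish quickly; keep the lower rate under foreground load.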

Scaling Considerations

When to Scale Out

Consider adding nodes when you observe:

  • High CPU Utilization: sustained CPU usage above 70% across nodes
  • Memory Pressure: frequent cache evictions or OOM warnings
  • Storage Capacity: disk usage exceeding 80% capacity
  • Query Latency: steadily increasing query response times
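Thresholds like these are easy to wire into monitoring glue. The 80% figure mirrors the storage guideline above; everything else in this sketch is illustrative:

```shell
# Illustrative threshold check: flag a store whose disk usage crosses a
# percentage threshold (integer math; used/capacity in any common unit).
over_threshold() {
  local used=$1 capacity=$2 threshold=$3
  [ $(( used * 100 / capacity )) -ge "$threshold" ]
}

over_threshold 850 1000 80 && echo "disk usage at or above 80%: consider adding nodes"
```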

Scaling Best Practices

1. Scale in Odd Numbers

Always maintain an odd number of nodes (3, 5, 7, etc.) to ensure proper quorum for Raft consensus.
With an even number of nodes, you don’t gain additional fault tolerance. A 4-node cluster can still only lose 1 node, same as a 3-node cluster.
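The quorum arithmetic behind this rule is simple: a Raft group of n voting replicas needs a majority to make progress, so it survives floor((n - 1) / 2) failures. A sketch:

```shell
# Fault tolerance of a Raft group with n voting replicas: a majority
# (floor(n/2) + 1) must stay up, so floor((n - 1) / 2) replicas can fail.
fault_tolerance() {
  echo $(( ($1 - 1) / 2 ))
}

fault_tolerance 3   # -> 1
fault_tolerance 4   # -> 1 (no gain over 3)
fault_tolerance 5   # -> 2
```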
2. Scale Gradually

Add one or two nodes at a time and wait for rebalancing to complete before adding more.
3. Monitor During Scaling

Watch metrics for CPU, memory, disk I/O, and network throughput during scaling operations.
4. Plan for Replication Factor

Provision at least replication_factor + 1 nodes so the cluster can re-replicate ranges after a node failure or during decommissioning.

Replication Factor

The replication factor determines how many copies of each data range exist:
View Current Replication Factor
SHOW ZONE CONFIGURATIONS;
Change Default Replication Factor
ALTER RANGE default CONFIGURE ZONE USING num_replicas = 5;
Replication Factor Guidelines:
  • 3 replicas (default): Tolerates 1 node failure
  • 5 replicas: Tolerates 2 node failures
  • 7 replicas: Tolerates 3 node failures
Higher replication increases fault tolerance but requires more storage and network bandwidth.
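The storage cost scales directly with the replica count: raw capacity needed is roughly logical data size times num_replicas. A quick sketch (the sizes are illustrative):

```shell
# Raw storage needed for a given logical data size and replication factor.
raw_storage_gib() {
  echo $(( $1 * $2 ))   # logical GiB x replicas
}

raw_storage_gib 100 3   # -> 300 GiB raw for 100 GiB of data at 3 replicas
raw_storage_gib 100 5   # -> 500 GiB at 5 replicas
```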

Locality-Aware Scaling

When deploying across multiple regions or availability zones:
Start Node with Locality
cockroach start \
  --certs-dir=certs \
  --advertise-addr=<node-address> \
  --join=<join-addresses> \
  --locality=region=us-east,zone=us-east-1a \
  --background

Configure Zone Constraints

Constrain Data to Specific Localities
ALTER DATABASE mydb CONFIGURE ZONE USING 
  num_replicas = 3,
  constraints = '{+region=us-east: 2, +region=us-west: 1}';
This ensures 2 replicas stay in us-east and 1 in us-west.

Performance Monitoring During Scaling

  • Replica Count per Store: Should be balanced across stores
  • Snapshot Rate: Shows active rebalancing activity
  • Disk I/O: Increases during rebalancing
  • Network Throughput: Higher during data movement
  • Query Latency: Should remain stable during scaling
  • Range Count: Total ranges should distribute evenly

Troubleshooting Scaling Issues

Rebalancing Stuck

Check for Range Issues
SELECT zone_id, total_ranges, unavailable_ranges, under_replicated_ranges
FROM system.replication_stats
WHERE unavailable_ranges > 0 OR under_replicated_ranges > 0;

Uneven Distribution

Identify Unevenly Distributed Ranges
SELECT 
  store_id,
  node_id,
  range_count,
  capacity,
  used * 100.0 / capacity as pct_used
FROM crdb_internal.kv_store_status
ORDER BY pct_used DESC;
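Exported as rows of "store_id count", the same numbers can be post-processed to flag outlier stores. The 10% deviation threshold and the sample counts below are assumptions for illustration:

```shell
# Illustrative post-processing: flag stores whose per-store count
# deviates more than 10% from the mean. Input lines: "store_id count".
flag_unbalanced() {
  awk '{ id[NR] = $1; c[NR] = $2; sum += $2 }
       END {
         mean = sum / NR
         for (i = 1; i <= NR; i++)
           if (c[i] > mean * 1.1 || c[i] < mean * 0.9)
             printf "store %s: %d vs mean %.0f\n", id[i], c[i], mean
       }'
}

printf '1 140\n2 138\n3 142\n4 180\n' | flag_unbalanced
```

Here only store 4 exceeds the band around the mean of 150, so it is the one reported.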

Next Steps

  • Deployment: learn deployment strategies
  • Migration: migrate data to CockroachDB
  • Backup & Restore: implement backup strategies
