CockroachDB is designed to scale horizontally by adding more nodes to your cluster. The database automatically rebalances data across nodes to maintain optimal performance and fault tolerance.
Horizontal Scaling
CockroachDB scales horizontally, meaning you add more nodes rather than upgrading existing hardware. This approach provides:
Linear Scalability: Performance scales nearly linearly with the number of nodes
Automatic Rebalancing: Data automatically redistributes across all nodes
No Downtime: Add or remove nodes without interrupting service
Increased Capacity: Distribute storage and compute across more resources
Adding Nodes to a Cluster
Manual Node Addition
Prepare the New Node
Install CockroachDB on the new machine and ensure it can communicate with existing nodes on port 26257.
Copy Security Certificates
If running in secure mode, copy the CA certificate and key, then create a node certificate:
# On the new node
mkdir certs my-safe-directory
# Copy CA certificate from existing node
scp user@existing-node:/path/to/certs/ca.crt certs/
scp user@existing-node:/path/to/certs/ca.key my-safe-directory/
# Create node certificate
cockroach cert create-node \
<new-node-address> \
localhost \
127.0.0.1 \
--certs-dir=certs \
--ca-key=my-safe-directory/ca.key
Start the New Node
cockroach start \
--certs-dir=certs \
--advertise-addr=<new-node-address> \
--join=<existing-node1>,<existing-node2>,<existing-node3> \
--cache=.25 \
--max-sql-memory=.25 \
--background
The new node automatically joins the cluster and begins receiving data.
Verify Node Addition
Check that the node has joined successfully:
SELECT node_id, address, is_live FROM crdb_internal.gossip_nodes;
Kubernetes Node Scaling
For Kubernetes deployments, scale the StatefulSet:
# Increase replicas from 3 to 5
kubectl scale statefulset cockroachdb --replicas=5
# Verify scaling
kubectl get pods -l app=cockroachdb
When scaling down, ensure you’re not removing so many nodes that you lose quorum or data replicas. CockroachDB requires a majority of nodes to be available.
Removing Nodes from a Cluster
Graceful Node Decommissioning
Before removing a node, decommission it to safely transfer its data:
Initiate Decommissioning
cockroach node decommission <node-id> \
--certs-dir=certs \
--host=<any-node-address>
This process moves all data off the target node to other nodes in the cluster.
Monitor Decommissioning Progress
cockroach node status \
--certs-dir=certs \
--host=<any-node-address> \
--decommission
Wait until the node shows as “decommissioned” before proceeding.
Stop the Node
Once decommissioning completes, stop the process:
# Find the process ID
ps aux | grep cockroach
# Gracefully stop
kill -TERM <process-id>
Verify Removal
SELECT node_id, address, is_live, membership
FROM crdb_internal.gossip_nodes;
Never force-remove nodes without decommissioning unless they’ve permanently failed. Improper removal can lead to data loss or unavailability.
Automatic Rebalancing
CockroachDB automatically rebalances data across nodes based on:
Rebalancing Triggers
Node Addition: New nodes receive data from existing nodes
Node Removal: Data moves from decommissioned nodes
Uneven Distribution: Data shifts to balance storage utilization
Locality Changes: Data moves closer to where it’s accessed
Monitoring Rebalancing
Monitor Replica Distribution
SELECT
store_id,
node_id,
replica_count,
capacity,
available,
used
FROM crdb_internal.kv_store_status
ORDER BY node_id;
Controlling Rebalancing Rate
Adjust cluster settings to control rebalancing speed:
-- Slower rebalancing (less impact on performance)
SET CLUSTER SETTING kv.snapshot_rebalance.max_rate = '32MiB';
-- Faster rebalancing (completes sooner, higher impact)
SET CLUSTER SETTING kv.snapshot_rebalance.max_rate = '128MiB';
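For intuition about the trade-off, the rate cap puts a floor on how long rebalancing takes. A rough back-of-envelope sketch, assuming a hypothetical 500 GiB of replica data to move:

```shell
# Minimum transfer time at each rate cap
# (hypothetical figure: 500 GiB of replica data to move).
data_mib=$((500 * 1024))
for rate in 32 128; do
  secs=$((data_mib / rate))
  echo "rate=${rate}MiB/s -> ~$((secs / 60)) min minimum"
done
```

At 32 MiB/s this works out to roughly 266 minutes versus about 66 minutes at 128 MiB/s; actual times are longer because nodes also serve foreground traffic.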
Scaling Considerations
When to Scale Up
High CPU Utilization: Sustained CPU usage above 70% across nodes
Memory Pressure: Frequent cache evictions or OOM warnings
Storage Capacity: Disk usage exceeding 80% capacity
Query Latency: Increasing query response times
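The storage-capacity signal above is simple to automate. A minimal sketch using hypothetical numbers (in practice you would read capacity and used from crdb_internal.kv_store_status):

```shell
# Flag a store that crosses the 80% disk-usage threshold.
# capacity_gib and used_gib are hypothetical placeholder values.
capacity_gib=500
used_gib=425
pct=$((used_gib * 100 / capacity_gib))
echo "store usage: ${pct}%"
if [ "$pct" -ge 80 ]; then
  echo "above threshold: plan to add nodes"
fi
```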
Scaling Best Practices
Scale in Odd Numbers
Always maintain an odd number of nodes (3, 5, 7, etc.) to ensure proper quorum for Raft consensus. With an even number of nodes, you don’t gain additional fault tolerance. A 4-node cluster can still only lose 1 node, same as a 3-node cluster.
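The quorum arithmetic behind this rule can be checked directly: a Raft group of n voters needs floor(n/2) + 1 live members, so it tolerates floor((n-1)/2) failures. A quick sketch:

```shell
# Majority quorum: floor(n/2) + 1 live nodes required,
# so floor((n-1)/2) failures are tolerated.
for n in 3 4 5 6 7; do
  echo "nodes=$n quorum=$((n / 2 + 1)) tolerates=$(((n - 1) / 2))"
done
```

Note that 4 nodes tolerate the same single failure as 3, and 6 the same two failures as 5, which is why even node counts add capacity but no extra fault tolerance.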
Scale Gradually
Add one or two nodes at a time and wait for rebalancing to complete before adding more.
Monitor During Scaling
Watch metrics for CPU, memory, disk I/O, and network throughput during scaling operations.
Plan for Replication Factor
Ensure you have at least replication_factor + 1 nodes, so a node can fail or be decommissioned without leaving ranges under-replicated.
Replication Factor
The replication factor determines how many copies of each data range exist:
View Current Replication Factor
SHOW ZONE CONFIGURATIONS;
Change Default Replication Factor
ALTER RANGE default CONFIGURE ZONE USING num_replicas = 5;
Replication Factor Guidelines:
3 replicas (default): Tolerates 1 node failure
5 replicas: Tolerates 2 node failures
7 replicas: Tolerates 3 node failures
Higher replication increases fault tolerance but requires more storage and network bandwidth.
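The storage cost is easy to quantify: each range is stored num_replicas times, so raw storage grows linearly with the replication factor. A small sketch, assuming a hypothetical 100 GiB of logical data:

```shell
# Failures tolerated and raw storage used per replication factor
# (hypothetical 100 GiB of logical data).
logical_gib=100
for r in 3 5 7; do
  echo "replicas=$r tolerates=$(((r - 1) / 2)) raw=$((r * logical_gib))GiB"
done
```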
Locality-Aware Scaling
When deploying across multiple regions or availability zones:
cockroach start \
--certs-dir=certs \
--advertise-addr=<node-address> \
--join=<join-addresses> \
--locality=region=us-east,zone=us-east-1a \
--background
Constrain Data to Specific Localities
ALTER DATABASE mydb CONFIGURE ZONE USING
num_replicas = 3,
constraints = '{+region=us-east: 2, +region=us-west: 1}';
This ensures 2 replicas stay in us-east and 1 in us-west.
Metrics to Watch During Scaling
Replica Count per Store: Should be balanced across stores
Snapshot Rate: Shows active rebalancing activity
Disk I/O: Increases during rebalancing
Network Throughput: Higher during data movement
Query Latency: Should remain stable during scaling
Range Count: Total ranges should distribute evenly
Troubleshooting Scaling Issues
Rebalancing Stuck
SELECT
range_id,
start_key,
end_key,
replicas,
replica_localities
FROM crdb_internal.ranges
WHERE unavailable = true OR under_replicated = true;
Uneven Distribution
Identify Unevenly Distributed Ranges
SELECT
store_id,
node_id,
replica_count,
capacity,
used * 100.0 / capacity AS pct_used
FROM crdb_internal.kv_store_status
ORDER BY pct_used DESC;
Next Steps
Deployment: Learn deployment strategies
Migration: Migrate data to CockroachDB
Backup & Restore: Implement backup strategies