This checklist provides critical recommendations for deploying CockroachDB in production. Review each section carefully before going live.

Topology planning

1. Choose deployment pattern

Review topology patterns to select the best configuration for your latency and resiliency requirements.
2. Node distribution

  • Deploy at least 3 nodes for fault tolerance
  • Use at least 3 nodes per region for multi-region deployments
  • Distribute nodes across availability zones
  • Use identical hardware for all nodes
3. Replication factor

  • Default: 3x replication (suitable for cloud storage)
  • Local disks: Increase to 5x replication
  • Configure via zone configs for specific databases or tables
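The fault-tolerance math behind these replication factors can be sketched as a one-liner (the helper name is illustrative, not a CockroachDB API):

```python
def failures_tolerated(replication_factor: int) -> int:
    """A range with RF replicas needs a majority alive to serve reads
    and writes, so it survives floor((RF - 1) / 2) node failures."""
    return (replication_factor - 1) // 2

# 3x replication tolerates 1 node failure; 5x tolerates 2.
```

This is why replication factors are odd: moving from 3x to 4x adds storage cost without tolerating any additional failures.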

Hardware requirements

CPU sizing

Minimum

4 vCPUs per node. Absolute minimum for production stability; below this, foreground workloads compete with background tasks.

Recommended

8-16 vCPUs per node. Optimal range for most workloads; the maximum tested configuration is 32 vCPUs per node.
Avoid burstable or shared-core instances. They limit CPU resources and cause unpredictable performance.

Memory provisioning

Recommendation: 4 GiB RAM per vCPU
  • Minimum acceptable: 2 GiB per vCPU (testing only)
  • Benefits decrease beyond 4 GiB per vCPU as CPU count increases
  • Disable memory swap on Linux systems
Under-provisioning RAM causes reduced caching, disk spilling, and potential OOM crashes.
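As a quick sanity check, the per-node RAM targets implied by these ratios can be computed (illustrative helper, not part of any CockroachDB tooling):

```python
def ram_targets_gib(vcpus: int) -> tuple:
    """Return (minimum-for-testing, recommended) RAM in GiB,
    using the 2 GiB and 4 GiB per-vCPU guidance above."""
    return (vcpus * 2, vcpus * 4)

# An 8-vCPU node: 16 GiB bare minimum, 32 GiB recommended.
```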

Storage specifications

Metric: Recommendation
  • Capacity per vCPU: 100-150 GiB
  • Maximum per node: 10 TiB
  • IOPS per vCPU: 500
  • Throughput per vCPU: 30 MB/s
  • Filesystem: ext4 or XFS
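Multiplying the per-vCPU targets out for a chosen node size is a common pre-provisioning step; a small sketch (helper name is ours):

```python
def storage_targets(vcpus: int) -> dict:
    """Per-node storage targets derived from the per-vCPU figures above."""
    return {
        "capacity_gib": (vcpus * 100, vcpus * 150),  # (low, high) range
        "iops": vcpus * 500,
        "throughput_mb_s": vcpus * 30,
    }

# An 8-vCPU node needs 800-1200 GiB, 4000 IOPS, and 240 MB/s.
```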
1. Use SSDs

Always use solid-state drives. HDDs don’t provide sufficient IOPS for production workloads.
2. Separate volumes

  • Store data on dedicated volume (not OS disk)
  • Keep logs on separate volume from data
3. Monitor capacity

  • Maintain 10-15% free space at all times
  • Set up alerts at 80% usage
  • CockroachDB creates automatic ballast files for emergencies
4. Avoid distributed filesystems

Do not use Ceph or similar distributed filesystems. CockroachDB handles replication internally.

Cloud-specific recommendations

Instance types (AWS examples):
  • i3.xlarge, i3.2xlarge, i3.4xlarge (local SSD)
  • m6i.xlarge, m6i.2xlarge (network storage)
Storage (AWS EBS):
  • gp3 volumes (cost-effective, 3000 IOPS default)
  • io2 volumes (higher IOPS, provision separately)
  • Provision IOPS and throughput to meet 500 IOPS and 30 MB/s per vCPU

Security configuration

Never run insecure clusters in production. Insecure clusters have no authentication, encryption, or authorization.

TLS certificates

1. Generate certificates

Use cockroach cert or openssl to create:
  • CA certificate and key
  • Node certificates (common name: node)
  • Client certificates (common name: username)
2. Distribute certificates

  • Place CA cert and node cert/key on each node
  • Store CA key in secure location (off cluster)
  • Distribute client certificates to application servers
3. Monitor expiration

Set up alerts for certificate expiration and rotate before they expire.
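One way to drive such alerts is to compare a certificate's notAfter timestamp against the current time. A minimal sketch using only Python's standard library; the function name and the 30-day threshold in the comment are our choices:

```python
import ssl

def days_until_expiry(not_after: str, now: float) -> float:
    """not_after is the OpenSSL text form, e.g. 'Jan  1 00:00:00 2030 GMT';
    now is a Unix timestamp (pass time.time() in practice)."""
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400

# Alert when days_until_expiry(...) drops below 30, then rotate.
```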

Authentication methods

  • Recommended: Client certificates for applications
  • Alternative: Password authentication with strong passwords
  • Enterprise: SSO/SAML integration

Networking configuration

Required ports

  • 26257: Inter-node and client connections (SQL)
  • 8080: DB Console (HTTP)

Network flags

cockroach start \
  --listen-addr=<private-ip>:26257 \
  --advertise-addr=<public-address> \
  --locality=region=us-east,zone=us-east-1a \
  --join=<node1>,<node2>,<node3>
If --advertise-addr is omitted, it defaults to --listen-addr; set it explicitly whenever nodes must reach each other on an address other than the one they listen on.

Load balancing

Load balancing is essential for performance and reliability. It distributes traffic and routes around failed nodes.

Health checks

Configure load balancers to use the readiness endpoint:
GET /health?ready=1
This ensures traffic only routes to nodes ready to accept connections.

High availability

1. Deploy multiple load balancers

A single load balancer is a single point of failure.
2. Use floating IPs or DNS

Configure failover between load balancer instances.
3. Monitor load balancer health

Set up alerts for load balancer failures.

Connection pooling

Critical for performance: Applications must use connection pools.

Sizing guidelines

pool_size = (number_of_cores * 2) + number_of_disks
Typical ranges:
  • Minimum: 4-10 connections
  • Small applications: 10-20 connections
  • Large applications: 20-50 connections per application instance
Too few connections cause high latency. Too many connections waste resources and increase contention.
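The sizing formula above translates directly into a tiny calculator (helper name and default are ours):

```python
def pool_size(core_count: int, disk_count: int = 1) -> int:
    """connections = (cores * 2) + disks, per the guideline above."""
    return core_count * 2 + disk_count

# An app server with 8 cores and 1 disk: 17 connections.
```

Treat the result as a starting point and tune from load-test measurements, not as a hard rule.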

Configuration parameters

  • max_connections: Maximum pool size
  • min_connections: Minimum idle connections
  • max_lifetime: Connection lifetime (prevent stale connections)
  • idle_timeout: Close idle connections

Cache and memory tuning

Both flags accept a fraction of system memory or an absolute size:
--cache=.25  # 25% of system memory
--max-sql-memory=.25  # 25% of system memory

Production settings

--cache=.35 --max-sql-memory=.35
Or with absolute values:
--cache=8GiB --max-sql-memory=8GiB
Increasing cache improves read performance. Increasing SQL memory allows more concurrent connections and complex queries.
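To turn the fractional settings into absolute flags for a known machine size, something like the following works (illustrative helper; the rounding choice is ours):

```python
def memory_flags(system_ram_gib: float, fraction: float = 0.35) -> str:
    """Render --cache and --max-sql-memory as absolute GiB values."""
    gib = round(system_ram_gib * fraction, 1)
    return f"--cache={gib}GiB --max-sql-memory={gib}GiB"

# A 32 GiB node at 35%: --cache=11.2GiB --max-sql-memory=11.2GiB
```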

Monitoring and alerting

Essential metrics

CPU usage

Alert at >80% sustained usage

Memory usage

Alert at >85% of available RAM

Disk capacity

Alert at >80% full

Disk IOPS

Monitor against provisioned limits

Monitoring tools

  • DB Console: Built-in metrics at http://<node>:8080
  • Prometheus: Scrape metrics from http://<node>:8080/_status/vars on each node
  • Grafana: Use official CockroachDB dashboards
  • Alertmanager: Configure alerts for critical conditions

Backup and restore

1. Schedule regular backups

CREATE SCHEDULE daily_backup
FOR BACKUP INTO 's3://bucket/path?AUTH=implicit'
RECURRING '@daily'
WITH SCHEDULE OPTIONS first_run = 'now';
2. Use cloud storage

  • Amazon S3
  • Google Cloud Storage
  • Azure Blob Storage
Enable object locking for immutability.
3. Test restores regularly

Verify backup integrity by performing test restores.
4. Configure retention

WITH SCHEDULE OPTIONS keep_last_n_backups = 30

Clock synchronization

Clock skew causes transaction anomalies and data inconsistencies.

Configuration

1. Install NTP

apt-get install ntp
systemctl enable ntp
systemctl start ntp
2. Verify synchronization

ntpq -p
3. Monitor clock offset

Set alerts for clock offset >250ms.

File descriptors

Requirements

  • Minimum: 1956 (1700 per store + 256 for networking)
  • Recommended: 15000+ (10000 per store + 5000 for networking)
  • Best: Unlimited
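The recommended limit scales with the number of stores per node; as a quick calculator (hypothetical helper name):

```python
def recommended_fd_limit(store_count: int) -> int:
    """10,000 descriptors per store plus 5,000 for networking."""
    return store_count * 10_000 + 5_000

# A node with 2 stores should get LimitNOFILE=25000 or higher.
```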

Linux configuration

Add to service definition:
[Service]
LimitNOFILE=15000
Reload systemd:
systemctl daemon-reload

Kubernetes-specific

Storage

  • Use local SSDs, not network storage
  • Configure storage class appropriately

Resources

  • Set CPU and memory limits
  • Make requests equal to limits
  • Avoid burstable instances

Topology

  • Configure pod anti-affinity
  • Use topology spread constraints
  • One pod per Kubernetes node

Operator

  • Use CockroachDB Operator for management
  • Configure PodDisruptionBudget
  • Set up proper RBAC

Transaction retry handling

Applications must implement transaction retry logic.
Transaction contention can cause retries. Your application should:
  1. Catch transaction retry errors
  2. Retry the entire transaction
  3. Use exponential backoff
  4. Set maximum retry limits
import time

for attempt in range(max_retries):
    try:
        # Begin transaction
        # Execute statements
        # Commit transaction
        break
    except TransactionRetryError:  # driver-specific; CockroachDB marks retryable errors with SQLSTATE 40001
        if attempt == max_retries - 1:
            raise
        time.sleep(2 ** attempt)  # exponential backoff

Pre-deployment checklist

1. Hardware

  • Sufficient CPU (min 4 vCPUs, recommended 8+)
  • Adequate RAM (4 GiB per vCPU)
  • SSD storage with required IOPS
  • Network connectivity between nodes
2. Security

  • TLS certificates generated and distributed
  • CA key stored securely
  • Firewall rules configured
  • Network isolation in place
3. Configuration

  • Clock synchronization configured
  • File descriptor limits increased
  • Cache and SQL memory tuned
  • Load balancer deployed and tested
4. Operations

  • Monitoring and alerting configured
  • Backup schedule created
  • Restore procedure tested
  • Incident response plan documented
5. Application

  • Connection pooling implemented
  • Transaction retry logic added
  • Load testing completed
  • Failover testing performed
