This checklist provides critical recommendations for deploying CockroachDB in production. Review each section carefully before going live.

Topology planning

1. Choose deployment pattern

Review topology patterns to select the best configuration for your latency and resiliency requirements.
2. Node distribution

  • Deploy at least 3 nodes for fault tolerance
  • Use at least 3 nodes per region for multi-region deployments
  • Distribute nodes across availability zones
  • Use identical hardware for all nodes
3. Replication factor

  • Default: 3x replication (suitable for cloud storage)
  • Local disks: Increase to 5x replication
  • Configure via zone configs for specific databases or tables
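The fault-tolerance math behind these replication factors can be sketched as a one-liner (the helper name is illustrative, not a CockroachDB API):

```python
def failures_tolerated(replication_factor: int) -> int:
    """A range with RF replicas needs a majority alive to serve reads
    and writes, so it survives floor((RF - 1) / 2) node failures."""
    return (replication_factor - 1) // 2

# 3x replication tolerates 1 node failure; 5x tolerates 2.
```

This is why replication factors are odd: moving from 3x to 4x adds storage cost without tolerating any additional failures.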

Hardware requirements

CPU sizing

Minimum

4 vCPUs per node. Absolute minimum for production stability; below this, foreground workloads compete with background tasks.

Recommended

8-16 vCPUs per node. Optimal range for most workloads; the maximum tested configuration is 32 vCPUs per node.
Avoid burstable or shared-core instances. They limit CPU resources and cause unpredictable performance.

Memory provisioning

Recommendation: 4 GiB RAM per vCPU
  • Minimum acceptable: 2 GiB per vCPU (testing only)
  • Benefits decrease beyond 4 GiB per vCPU as CPU count increases
  • Disable memory swap on Linux systems
Under-provisioning RAM causes reduced caching, disk spilling, and potential OOM crashes.
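As a quick sanity check, the per-node RAM targets implied by these ratios can be computed (illustrative helper, not part of any CockroachDB tooling):

```python
def ram_targets_gib(vcpus: int) -> tuple:
    """Return (minimum-for-testing, recommended) RAM in GiB,
    using the 2 GiB and 4 GiB per-vCPU guidance above."""
    return (vcpus * 2, vcpus * 4)

# An 8-vCPU node: 16 GiB bare minimum, 32 GiB recommended.
```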

Storage specifications

Metric: Recommendation
  • Capacity per vCPU: 100-150 GiB
  • Maximum per node: 10 TiB
  • IOPS per vCPU: 500
  • Throughput per vCPU: 30 MB/s
  • Filesystem: ext4 or XFS
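Multiplying the per-vCPU targets out for a chosen node size is a common pre-provisioning step; a small sketch (helper name is ours):

```python
def storage_targets(vcpus: int) -> dict:
    """Per-node storage targets derived from the per-vCPU figures above."""
    return {
        "capacity_gib": (vcpus * 100, vcpus * 150),  # (low, high) range
        "iops": vcpus * 500,
        "throughput_mb_s": vcpus * 30,
    }

# An 8-vCPU node needs 800-1200 GiB, 4000 IOPS, and 240 MB/s.
```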
1. Use SSDs

Always use solid-state drives. HDDs don’t provide sufficient IOPS for production workloads.
2. Separate volumes

  • Store data on dedicated volume (not OS disk)
  • Keep logs on separate volume from data
3. Monitor capacity

  • Maintain 10-15% free space at all times
  • Set up alerts at 80% usage
  • CockroachDB creates automatic ballast files for emergencies
4. Avoid distributed filesystems

Do not use Ceph or similar distributed filesystems. CockroachDB handles replication internally.

Cloud-specific recommendations

Instance types (AWS examples):
  • i3.xlarge, i3.2xlarge, i3.4xlarge (local SSD)
  • m6i.xlarge, m6i.2xlarge (network storage)
Storage (AWS EBS):
  • gp3 volumes (cost-effective, 3000 IOPS default)
  • io2 volumes (higher IOPS, provision separately)
  • Provision IOPS and throughput to meet 500 IOPS and 30 MB/s per vCPU

Security configuration

Never run insecure clusters in production. Insecure clusters have no authentication, encryption, or authorization.

TLS certificates

1. Generate certificates

Use cockroach cert or openssl to create:
  • CA certificate and key
  • Node certificates (common name: node)
  • Client certificates (common name: username)
2. Distribute certificates

  • Place CA cert and node cert/key on each node
  • Store CA key in secure location (off cluster)
  • Distribute client certificates to application servers
3. Monitor expiration

Set up alerts for certificate expiration and rotate before they expire.
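One way to drive such alerts is to compare a certificate's notAfter timestamp against the current time. A minimal sketch using only Python's standard library; the function name and the 30-day threshold in the comment are our choices:

```python
import ssl

def days_until_expiry(not_after: str, now: float) -> float:
    """not_after is the OpenSSL text form, e.g. 'Jan  1 00:00:00 2030 GMT';
    now is a Unix timestamp (pass time.time() in practice)."""
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400

# Alert when days_until_expiry(...) drops below 30, then rotate.
```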

Authentication methods

  • Recommended: Client certificates for applications
  • Alternative: Password authentication with strong passwords
  • Enterprise: SSO/SAML integration

Networking configuration

Required ports

  • 26257: Inter-node and client connections (SQL)
  • 8080: DB Console (HTTP)

Network flags

cockroach start \
  --listen-addr=<private-ip>:26257 \
  --advertise-addr=<public-address> \
  --locality=region=us-east,zone=us-east-1a \
  --join=<node1>,<node2>,<node3>
If --advertise-addr is omitted, it defaults to --listen-addr; set it explicitly whenever nodes must reach each other on an address other than the one they listen on.

Load balancing

Load balancing is essential for performance and reliability. It distributes traffic and routes around failed nodes.

Health checks

Configure load balancers to use the readiness endpoint:
GET /health?ready=1
This ensures traffic only routes to nodes ready to accept connections.

High availability

1. Deploy multiple load balancers

A single load balancer is a single point of failure.
2. Use floating IPs or DNS

Configure failover between load balancer instances.
3. Monitor load balancer health

Set up alerts for load balancer failures.

Connection pooling

Critical for performance: Applications must use connection pools.

Sizing guidelines

pool_size = (number_of_cores * 2) + number_of_disks
Typical ranges:
  • Minimum: 4-10 connections
  • Small applications: 10-20 connections
  • Large applications: 20-50 connections per application instance
Too few connections cause high latency. Too many connections waste resources and increase contention.
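The sizing formula above translates directly into a tiny calculator (helper name and default are ours):

```python
def pool_size(core_count: int, disk_count: int = 1) -> int:
    """connections = (cores * 2) + disks, per the guideline above."""
    return core_count * 2 + disk_count

# An app server with 8 cores and 1 disk: 17 connections.
```

Treat the result as a starting point and tune from load-test measurements, not as a hard rule.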

Configuration parameters

  • max_connections: Maximum pool size
  • min_connections: Minimum idle connections
  • max_lifetime: Connection lifetime (prevent stale connections)
  • idle_timeout: Close idle connections

Cache and memory tuning

Both flags accept a fraction of system memory or an absolute size:
--cache=.25  # 25% of system memory
--max-sql-memory=.25  # 25% of system memory

Production settings

--cache=.35 --max-sql-memory=.35
Or with absolute values:
--cache=8GiB --max-sql-memory=8GiB
Increasing cache improves read performance. Increasing SQL memory allows more concurrent connections and complex queries.
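To turn the fractional settings into absolute flags for a known machine size, something like the following works (illustrative helper; the rounding choice is ours):

```python
def memory_flags(system_ram_gib: float, fraction: float = 0.35) -> str:
    """Render --cache and --max-sql-memory as absolute GiB values."""
    gib = round(system_ram_gib * fraction, 1)
    return f"--cache={gib}GiB --max-sql-memory={gib}GiB"

# A 32 GiB node at 35%: --cache=11.2GiB --max-sql-memory=11.2GiB
```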

Monitoring and alerting

Essential metrics

CPU usage

Alert at >80% sustained usage

Memory usage

Alert at >85% of available RAM

Disk capacity

Alert at >80% full

Disk IOPS

Monitor against provisioned limits

Monitoring tools

  • DB Console: Built-in metrics at http://<node>:8080
  • Prometheus: Scrape metrics from http://<node>:8080/_status/vars on each node
  • Grafana: Use official CockroachDB dashboards
  • Alertmanager: Configure alerts for critical conditions

Backup and restore

1. Schedule regular backups

CREATE SCHEDULE daily_backup
FOR BACKUP INTO 's3://bucket/path?AUTH=implicit'
RECURRING '@daily'
WITH SCHEDULE OPTIONS first_run = 'now';
2. Use cloud storage

  • Amazon S3
  • Google Cloud Storage
  • Azure Blob Storage
Enable object locking for immutability.
3. Test restores regularly

Verify backup integrity by performing test restores.
4. Configure retention

WITH SCHEDULE OPTIONS keep_last_n_backups = 30

Clock synchronization

Clock skew causes transaction anomalies and data inconsistencies.

Configuration

1. Install NTP

apt-get install ntp
systemctl enable ntp
systemctl start ntp
2. Verify synchronization

ntpq -p
3. Monitor clock offset

Set alerts for clock offset >250ms.

File descriptors

Requirements

  • Minimum: 1956 (1700 per store + 256 for networking)
  • Recommended: 15000+ (10000 per store + 5000 for networking)
  • Best: Unlimited
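The recommended limit scales with the number of stores per node; as a quick calculator (hypothetical helper name):

```python
def recommended_fd_limit(store_count: int) -> int:
    """10,000 descriptors per store plus 5,000 for networking."""
    return store_count * 10_000 + 5_000

# A node with 2 stores should get LimitNOFILE=25000 or higher.
```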

Linux configuration

Add to service definition:
[Service]
LimitNOFILE=15000
Reload systemd:
systemctl daemon-reload

Kubernetes-specific

Storage

  • Use local SSDs, not network storage
  • Configure storage class appropriately

Resources

  • Set CPU and memory limits
  • Make requests equal to limits
  • Avoid burstable instances

Topology

  • Configure pod anti-affinity
  • Use topology spread constraints
  • One pod per Kubernetes node

Operator

  • Use CockroachDB Operator for management
  • Configure PodDisruptionBudget
  • Set up proper RBAC

Transaction retry handling

Applications must implement transaction retry logic.
Transaction contention can cause retries. Your application should:
  1. Catch transaction retry errors
  2. Retry the entire transaction
  3. Use exponential backoff
  4. Set maximum retry limits
import time

for attempt in range(max_retries):
    try:
        # Begin transaction
        # Execute statements
        # Commit transaction
        break
    except TransactionRetryError:  # driver-specific; CockroachDB marks retryable errors with SQLSTATE 40001
        if attempt == max_retries - 1:
            raise
        time.sleep(2 ** attempt)  # exponential backoff

Pre-deployment checklist

1. Hardware

  • Sufficient CPU (min 4 vCPUs, recommended 8+)
  • Adequate RAM (4 GiB per vCPU)
  • SSD storage with required IOPS
  • Network connectivity between nodes
2. Security

  • TLS certificates generated and distributed
  • CA key stored securely
  • Firewall rules configured
  • Network isolation in place
3. Configuration

  • Clock synchronization configured
  • File descriptor limits increased
  • Cache and SQL memory tuned
  • Load balancer deployed and tested
4. Operations

  • Monitoring and alerting configured
  • Backup schedule created
  • Restore procedure tested
  • Incident response plan documented
5. Application

  • Connection pooling implemented
  • Transaction retry logic added
  • Load testing completed
  • Failover testing performed
