CockroachDB is a distributed SQL database built with scalability, strong consistency, and survivability as its primary design goals. The system aims to tolerate disk, machine, rack, and even datacenter failures with minimal latency disruption and no manual intervention.

Design Principles

CockroachDB follows several key architectural principles:

Homogeneous Deployment

All nodes are symmetric with identical capabilities. Every node runs the same binary and can assume any role in the cluster.

Strong Consistency

Uses Raft consensus protocol for synchronous replication, providing ACID semantics and serializable isolation.

Horizontal Scalability

Adding nodes increases capacity linearly. Queries can be sent to any node and distributed across the cluster.

Survivability

Tolerates failures at multiple levels through configurable replication (default 3x) across diverse geographic locations.

Layered Architecture

CockroachDB implements a layered architecture where each layer builds upon the abstractions provided by the layer below:

Layer Descriptions

SQL Layer
  • Provides familiar relational concepts (schemas, tables, columns, indexes)
  • Implements PostgreSQL wire protocol for client compatibility
  • Handles query parsing, optimization, and distributed execution
  • See SQL Layer for details
Distributed KV Layer
  • Presents abstraction of a single, monolithic sorted map from key to value
  • Both keys and values are byte strings
  • Handles range addressing and routing
  • See Distributed SQL for details
Replication Layer
  • Uses Raft consensus algorithm for strong consistency
  • Manages range replicas and leader election
  • Handles range splits, merges, and rebalancing
  • See Replication Layer for details
Storage Layer
  • Built on RocksDB (a variant of LevelDB)
  • Provides persistent storage with MVCC (Multi-Version Concurrency Control)
  • Manages local disk I/O and compaction
  • See Storage Layer for details

Data Distribution

The KV map is logically divided into ranges - contiguous segments of the keyspace, each approximately 64MB by default.
Each range is:
  • Replicated to a configurable number of nodes (default: 3)
  • Self-contained with its own Raft consensus group
  • Dynamic - automatically split when too large or merged when too small
  • Balanced - automatically rebalanced across nodes based on load

Range Example

From the source code at pkg/kv/:
// Batch provides for the parallel execution of a number of database
// operations. Operations are added to the Batch and then the Batch is executed
// via either DB.Run, Txn.Run or Txn.Commit.
type Batch struct {
    // The Txn the batch is associated with
    txn *Txn
    // Results contains an entry for each operation added to the batch
    Results []Result
    // The Header which will be used to send the resulting BatchRequest
    Header kvpb.Header
    reqs   []kvpb.RequestUnion
}

Node Architecture

Each CockroachDB node consists of:
  • SQL Gateway: Accepts client connections via PostgreSQL protocol
  • SQL Execution Engine: Parses and executes SQL statements
  • Transaction Coordinator: Manages distributed transactions
  • KV Client: Sends KV operations to appropriate ranges
  • Stores: One or more stores, each backed by a physical disk
  • Ranges: Multiple range replicas per store

Symmetric Node Design

┌─────────────────────────────────────┐
│         CockroachDB Node            │
├─────────────────────────────────────┤
│  SQL Layer (PostgreSQL Protocol)    │
├─────────────────────────────────────┤
│  Distributed SQL Execution          │
├─────────────────────────────────────┤
│  Transaction Coordinator            │
├─────────────────────────────────────┤
│  KV Client / Range Router           │
├─────────────────────────────────────┤
│  Raft Groups (Range Replicas)       │
├─────────────────────────────────────┤
│  RocksDB Storage Engine             │
└─────────────────────────────────────┘

Scalability Characteristics

CockroachDB achieves horizontal scalability through:
Storage Capacity: Each added node increases usable cluster capacity by its storage divided by the replication factor. The keyspace theoretically supports up to 4 exabytes (4E) of logical data.
Query Throughput:
  • Client queries can be sent to any node
  • Independent queries can execute without conflicts
  • Overall throughput scales linearly with cluster size
Distributed Queries:
  • Single queries are distributed across nodes
  • Processing happens close to the data
  • Parallelization improves performance for large operations

Consistency Guarantees

Snapshot Isolation (SI)

Lock-free reads and writes, with the possibility of write skew in rare cases. The higher-performance option.

Serializable SI (SSI)

Default isolation level. Eliminates write skew through timestamp validation. No locking required.
From the design document:
CockroachDB provides snapshot isolation (SI) and serializable snapshot isolation (SSI) semantics, allowing externally consistent, lock-free reads and writes - both from a historical snapshot timestamp and from the current wall clock time.

Fault Tolerance

CockroachDB survives failures through configurable replica placement.
Replication Topology Options:
  • Same datacenter, different disks → disk failure tolerance
  • Same datacenter, different racks → rack failure tolerance
  • Different datacenters, same region → datacenter failure tolerance
  • Different regions → regional failure tolerance
With N replicas, the system tolerates F failures, where N = 2F + 1 (equivalently, F = (N - 1) / 2).
Examples:
  • 3 replicas → tolerates 1 failure
  • 5 replicas → tolerates 2 failures

Key Features

Hybrid Logical Clock (HLC)

CockroachDB uses Hybrid Logical Clocks for timestamp management:
  • Combines physical time (wall clock) and logical counters
  • Tracks causality like vector clocks with less overhead
  • Enables distributed transaction ordering without atomic clocks

Zone Configurations

Similar to Google Spanner directories, zones allow configuration of:
  • Replication factor per keyspace region
  • Geographic placement constraints
  • Storage device types (SSD vs. HDD)
  • Performance and availability trade-offs

Gossip Protocol

Nodes use an efficient gossip protocol to disseminate:
  • Cluster topology information
  • Node capacity and health metrics
  • Range metadata and location
  • Network connectivity status

Entry Point: SQL Interface

Every node in a CockroachDB cluster can act as a SQL gateway. The gateway transforms SQL statements into KV operations and distributes them across the cluster.
SQL operations flow:
  1. Client connects to any node via PostgreSQL protocol
  2. SQL statement parsed and planned
  3. Plan converted to distributed execution
  4. KV operations sent to appropriate ranges
  5. Results aggregated and returned to client

Further Reading

SQL Layer

Query parsing and execution

Distributed SQL

Distributed query processing

Transaction Layer

ACID transactions

Replication Layer

Raft consensus

Storage Layer

RocksDB integration

References

  • Original design document: docs/design.md
  • KV layer implementation: pkg/kv/
  • SQL layer implementation: pkg/sql/
  • Storage implementation: pkg/storage/
