CockroachDB’s SQL layer provides a familiar relational database interface with PostgreSQL wire protocol compatibility, translating SQL statements into distributed KV operations.

Architecture Overview

Every node in a CockroachDB cluster can act as a SQL gateway. Clients can connect to any node, and the gateway handles query distribution transparently.

PostgreSQL Compatibility

Wire Protocol

CockroachDB implements the PostgreSQL wire protocol, allowing use of native PostgreSQL drivers:
# Python example using psycopg2
import psycopg2

conn = psycopg2.connect(
    host="localhost",
    port=26257,
    user="root",
    database="mydb"
)

Drivers

Most PostgreSQL drivers work unmodified: psycopg2, pg (node-postgres), Npgsql, JDBC, etc.

Tools

Compatible with pgAdmin, DBeaver, DataGrip, and other Postgres tools.

ORMs

Works with ActiveRecord, SQLAlchemy, Hibernate, Sequelize, etc.

SQL Dialect

From docs/design.md:
CockroachDB also attempts to emulate the flavor of SQL supported by PostgreSQL, although it also diverges in significant ways.
Key differences:
Isolation levels: only SNAPSHOT and SERIALIZABLE are supported; the weaker ANSI levels are upgraded:
-- READ UNCOMMITTED → mapped to SNAPSHOT
-- READ COMMITTED → mapped to SNAPSHOT  
-- REPEATABLE READ → mapped to SNAPSHOT
-- SERIALIZABLE → true SERIALIZABLE

SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
From the design document:
CockroachDB exclusively implements MVCC-based consistency for transactions, and thus only supports SQL’s isolation levels SNAPSHOT and SERIALIZABLE.
Simplified type coercion:
  • Limited implicit conversions
  • Rationale: simplicity and efficiency
  • Most generated SQL already has coherent typing
  • Existing SQL may need adjustments
See docs/RFCS/20160203_typing.md for details.
All tables must have a primary key:
-- Explicit primary key
CREATE TABLE users (
    id UUID PRIMARY KEY,
    name STRING
);

-- Implicit primary key added automatically
CREATE TABLE logs (
    message STRING
);
-- CockroachDB adds hidden rowid column

SQL Processing Pipeline

1. Connection Handling

From docs/design.md:
Client connections over the network are handled in each node by a pgwire server process (goroutine). This handles the stream of incoming commands and sends back responses including query/statement results.
Connection State:
// Simplified from pkg/sql/conn_executor.go
type connExecutor struct {
    sessionData *sessiondata.SessionData
    planner     *planner
    server      *Server
    metrics     *Metrics
}
Session Management:
  • Session variables (e.g., search_path, timezone)
  • Transaction state (BEGIN/COMMIT/ROLLBACK)
  • Prepared statements
  • Portal cursors
Implementation: pkg/sql/conn_executor.go

2. SQL Parsing

CockroachDB uses a yacc-based parser that generates a syntax tree from SQL text.
Parser Generator:
# From CLAUDE.md
./dev generate parsers  # Regenerate SQL parser
Parse Tree Example:
SELECT name, salary FROM employees WHERE dept = 'eng';
↓ Parses to:
&tree.Select{
    Select: &tree.SelectClause{
        Exprs: [name, salary],
        From:  [employees],
        Where: &ComparisonExpr{dept = 'eng'},
    },
}
Implementation: pkg/sql/parser/ and pkg/sql/sem/tree/

3. Semantic Analysis

Name Resolution:
  • Resolve table and column names
  • Check permissions
  • Validate types
  • Expand * in SELECT
Type Checking:
  • Infer expression types
  • Validate type compatibility
  • Apply implicit conversions where allowed
Constant Folding:
SELECT 1 + 2 * 3 FROM t;
-- Simplified to:
SELECT 7 FROM t;
Implementation: pkg/sql/sem/tree/ and pkg/sql/opt/optbuilder/
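The constant-folding step above can be sketched as a bottom-up rewrite over the expression tree. This is an illustrative Python sketch, not CockroachDB's actual implementation:

```python
# Minimal constant-folding sketch. Tuples model AST nodes: ("+", l, r) and
# ("*", l, r); ints are literals; strings stand in for column references,
# which block folding.

def fold(expr):
    """Recursively fold constant subtrees; leave column refs untouched."""
    if isinstance(expr, (int, str)):
        return expr
    op, l, r = expr
    l, r = fold(l), fold(r)
    if isinstance(l, int) and isinstance(r, int):
        return l + r if op == "+" else l * r
    return (op, l, r)

# SELECT 1 + 2 * 3 FROM t  →  SELECT 7 FROM t
print(fold(("+", 1, ("*", 2, 3))))         # 7
# A column reference keeps the surrounding node unfolded.
print(fold(("+", ("*", 2, 3), "salary")))  # ('+', 6, 'salary')
```

The key property is that folding happens after type checking, so the rewrite never changes the expression's type.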

4. Query Optimization

CockroachDB uses a cost-based optimizer (CBO):
Based on the Cascades/Columbia optimizer framework:
  1. Memo Structure: Compact representation of equivalent plans
  2. Exploration: Generate logically equivalent alternatives
  3. Implementation: Choose physical operators
  4. Costing: Estimate CPU, I/O, network costs
  5. Pruning: Eliminate dominated plans
Implementation: pkg/sql/opt/
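The memo-and-costing loop can be reduced to its essence: each memo group holds logically equivalent alternatives, and costing keeps the cheapest. A toy sketch with invented operator names and costs, nothing like the real memo's structure sharing:

```python
# Toy memo sketch: a group is a list of logically equivalent plan
# alternatives; "costing" just picks the minimum-cost one.

def best_plan(group):
    """Return the lowest-cost alternative in a memo group."""
    return min(group, key=lambda alt: alt["cost"])

# One group: three physically different but equivalent ways to read a table.
scan_group = [
    {"op": "full-scan",  "cost": 1000.0},
    {"op": "index-scan", "cost": 42.0},
    {"op": "index-join", "cost": 310.0},
]
print(best_plan(scan_group)["op"])  # index-scan
```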
Optimization Rules:
Move filters close to data source:
-- Original
SELECT * FROM (
    SELECT * FROM users
) WHERE age > 18;

-- Optimized  
SELECT * FROM users WHERE age > 18;
Choose best index for query:
CREATE INDEX idx_dept ON employees(dept);
CREATE INDEX idx_salary ON employees(salary);

-- Query: 
SELECT * FROM employees WHERE dept = 'eng' AND salary > 100000;

-- The optimizer picks whichever index its statistics
-- estimate as more selective (e.g. idx_dept)
Reorder joins for efficiency:
-- Original
SELECT * FROM large_table
  JOIN small_table ON ...
  JOIN tiny_table ON ...;

-- Optimizer may reorder to:
-- tiny ⋈ small ⋈ large
Choose join implementation:
  • Lookup Join: Index-based point lookups
  • Merge Join: Sorted inputs, linear scan
  • Hash Join: Build hash table, probe
  • Inverted Join: For JSON, arrays, geospatial
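As a rough illustration of the hash join strategy listed above (build a hash table on one input, probe it with the other), here is a minimal Python sketch; the real vectorized implementation looks very different:

```python
# Minimal hash-join sketch: build a hash table keyed on the join column of
# one input, then probe it with each row of the other (illustrative only).

def hash_join(build_rows, probe_rows, build_key, probe_key):
    table = {}
    for row in build_rows:                 # build phase
        table.setdefault(row[build_key], []).append(row)
    out = []
    for row in probe_rows:                 # probe phase
        for match in table.get(row[probe_key], []):
            out.append({**match, **row})   # merge matching rows
    return out

users  = [{"id": 1, "name": "ada"}, {"id": 2, "name": "bob"}]
orders = [{"user_id": 1, "total": 30}, {"user_id": 1, "total": 12}]
print(hash_join(users, orders, "id", "user_id"))  # two joined rows for ada
```

In practice the optimizer builds on the smaller input so the hash table fits in memory.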
Statistics:
The optimizer uses table statistics for cost estimation:
-- Collect statistics
CREATE STATISTICS stats_name FROM table_name;

-- View statistics
SHOW STATISTICS FOR TABLE table_name;
Statistics include:
  • Row count
  • Distinct value count
  • Null count
  • Histograms for selectivity estimation
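Histogram-based selectivity estimation can be sketched as follows, assuming values are uniformly distributed within each bucket (the bucket boundaries and counts here are invented):

```python
# Sketch of histogram selectivity: estimate the fraction of rows matching
# "column > value" from bucket boundaries (illustrative, not CRDB's code).

def selectivity_gt(buckets, value):
    """buckets: list of (lo, hi, count); assumes uniformity within buckets."""
    total = sum(c for _, _, c in buckets)
    kept = 0.0
    for lo, hi, count in buckets:
        if lo >= value:
            kept += count                              # entirely above value
        elif hi > value:
            kept += count * (hi - value) / (hi - lo)   # partially above value
    return kept / total

salary_hist = [(0, 50_000, 400), (50_000, 100_000, 400), (100_000, 200_000, 200)]
print(selectivity_gt(salary_hist, 100_000))  # 0.2
print(selectivity_gt(salary_hist, 150_000))  # 0.1
```

The estimated selectivity multiplied by the row count gives the cardinality estimate the costing step relies on.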

5. Physical Planning

Translates the logical plan into a physical execution plan.
Operators:
  • Scan: Read from table/index
  • Filter: Apply predicates
  • Project: Compute expressions
  • Join: Combine tables
  • Aggregate: GROUP BY operations
  • Sort: ORDER BY
  • Limit: Row count restriction
Distribution:
  • Determine which operations run where
  • Insert streaming/routing operators
  • Balance parallelism vs. overhead
See Distributed SQL for details. Implementation: pkg/sql/physicalplan/

6. Execution

Execution Engines: CockroachDB has two execution engines.
Vectorized engine (default): columnar execution for better performance:
// Process batches of rows (typically 1024)
type Batch struct {
    Columns []Vector  // Column-oriented storage
    Length  int       // Rows in batch
}
Benefits:
  • Better CPU cache utilization
  • SIMD vectorization opportunities
  • Reduced per-row overhead
Implementation: pkg/sql/colexec/
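Batch-at-a-time processing can be illustrated in miniature: a filter operates on whole column vectors rather than individual rows. A Python sketch (the real engine uses typed Go vectors and selection vectors):

```python
# Columnar filtering sketch: a batch is a dict of column name → vector of
# values; filtering computes the surviving row indices once, then slices
# every column by them (illustrative only).

def filter_batch(batch, column, predicate):
    """Keep only the rows of a column-oriented batch where predicate holds."""
    keep = [i for i, v in enumerate(batch[column]) if predicate(v)]
    return {col: [vals[i] for i in keep] for col, vals in batch.items()}

batch = {"dept": ["eng", "sales", "eng"], "salary": [120, 90, 150]}
print(filter_batch(batch, "dept", lambda d: d == "eng"))
# {'dept': ['eng', 'eng'], 'salary': [120, 150]}
```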
Row-based engine: traditional Volcano-style execution:
type PlanNode interface {
    Next() (Row, error)
}
Used for:
  • DDL statements
  • Some specialized operations
  • Fallback for unsupported operations
Implementation: pkg/sql/rowexec/
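The pull-based Next() loop of the Volcano model maps naturally onto generators. A toy pipeline, purely illustrative:

```python
# Volcano-style (row-at-a-time) execution sketch: each operator pulls rows
# from its child one at a time; Python generators make this compact.

def scan(rows):                  # leaf operator: yields stored rows
    yield from rows

def filter_op(child, pred):      # applies a predicate row by row
    for row in child:
        if pred(row):
            yield row

def project(child, cols):        # keeps only the requested columns
    for row in child:
        yield {c: row[c] for c in cols}

rows = [{"name": "ada", "dept": "eng"}, {"name": "bob", "dept": "sales"}]
plan = project(filter_op(scan(rows), lambda r: r["dept"] == "eng"), ["name"])
print(list(plan))  # [{'name': 'ada'}]
```

The per-row function-call overhead of this model is exactly what the vectorized engine's batching amortizes away.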

Data Mapping: SQL to KV

From docs/design.md:

Table Encoding

CockroachDB encodes SQL tables into KV pairs using a structured key format based on table ID, primary key, and column family.
Schema Metadata:
Database: mydb (ID: 51)
Table: customers (ID: 42)
Columns:
  - name (ID: 69)
  - url (ID: 66)
Stored as:
/system/databases/mydb/id → 51
/system/tables/customers/id → 42  
/system/desc/51/42/name → 69
/system/desc/51/42/url → 66
Row Data: For row with primary key “Apple”:
Key Structure: /TableID/PrimaryKey/ColumnID

/51/42/Apple/69 → "1 Infinite Loop, Cupertino, CA"
/51/42/Apple/66 → "http://apple.com/"
Prefix Compression: The storage engine's prefix compression (RocksDB in the original design; Pebble today) makes this layout efficient despite the repetitive key prefixes.
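The key layout above can be mimicked with plain strings (the real encoding is a binary, order-preserving format):

```python
# Sketch of the SQL→KV row layout described above: one KV pair per
# non-primary-key column, keyed by prefix, primary key, and column ID.
# (Illustrative string keys; the real encoding is binary.)

def kv_for_row(prefix, pk, columns):
    """columns: {column_id: value} → list of (key, value) pairs."""
    return [(f"{prefix}/{pk}/{col_id}", val) for col_id, val in columns.items()]

pairs = kv_for_row("/51/42", "Apple", {
    69: "1 Infinite Loop, Cupertino, CA",
    66: "http://apple.com/",
})
for k, v in pairs:
    print(k, "→", v)
# /51/42/Apple/69 → 1 Infinite Loop, Cupertino, CA
# /51/42/Apple/66 → http://apple.com/
```

Because all of a row's KV pairs share the `/51/42/Apple` prefix, a point read of the row is a single small range scan.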

Index Encoding

Secondary indexes:
CREATE INDEX idx_dept ON employees(dept);
Index entries:
/TableID/IndexID/IndexedColumns/PrimaryKey → []

/42/2/eng/emp001 → []
/42/2/eng/emp007 → []
/42/2/sales/emp003 → []
  • Non-unique indexes: include the primary key in the key for uniqueness
  • Unique indexes: the indexed columns alone form the key; the primary key is stored in the value
  • Covering indexes: store additional columns in the value
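The same string-key trick illustrates why a secondary-index lookup is a prefix scan (again, real keys use an order-preserving binary encoding):

```python
# Sketch of the secondary-index layout above: index entries are keys with
# empty values, and a lookup by indexed column is a prefix scan.

def index_entries(table_id, index_id, rows, col, pk_col):
    """Build sorted index keys of the form /table/index/value/pk."""
    return sorted(f"/{table_id}/{index_id}/{r[col]}/{r[pk_col]}" for r in rows)

def scan_prefix(entries, prefix):
    """Return the primary keys of all entries sharing the given prefix."""
    return [e.rsplit("/", 1)[1] for e in entries if e.startswith(prefix + "/")]

rows = [{"id": "emp001", "dept": "eng"},
        {"id": "emp003", "dept": "sales"},
        {"id": "emp007", "dept": "eng"}]
entries = index_entries(42, 2, rows, "dept", "id")
print(scan_prefix(entries, "/42/2/eng"))  # ['emp001', 'emp007']
```

The primary keys recovered from the scan are then used to fetch the rows themselves, which is exactly what a lookup join does.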

Column Families

Group frequently-accessed columns:
CREATE TABLE users (
    id INT PRIMARY KEY,
    email STRING,
    name STRING,
    profile_data JSONB,
    FAMILY f1 (id, email, name),
    FAMILY f2 (profile_data)
);
Benefits:
  • Reduce I/O for narrow queries
  • Better cache utilization
  • Smaller KV operations
From the design document:
Each remaining column or column family in the table is then encoded as a value in the underlying KV store, and the column/family identifier is appended as suffix to the KV key.

Prepared Statements

Preparation:
PREPARE get_user AS 
  SELECT * FROM users WHERE id = $1;
Execution:
EXECUTE get_user(123);
EXECUTE get_user(456);
Benefits:
  • Parse once, execute many times
  • Plan caching (with some limitations)
  • Prevent SQL injection
Implementation: pkg/sql/conn_executor_prepare.go

Schema Changes

CockroachDB supports online schema changes without blocking:
  1. Schema change initiated:
     ALTER TABLE users ADD COLUMN age INT;
  2. Lease-based synchronization: nodes acquire table descriptor leases with version numbers
  3. Multi-version support: multiple schema versions coexist during the transition
  4. Background backfill: new indexes/columns are populated asynchronously
  5. Cutover: atomic switch to the new schema once ready
See docs/RFCS/20151014_online_schema_change.md for details.
Some schema changes require multiple versions:
  • Adding index: (none) → (write-only) → (backfilling) → (public)
  • Dropping column: (public) → (write-only) → (delete-only) → (absent)
This prevents inconsistencies during distributed execution.
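The state progressions above can be written down as explicit transition tables. This toy sketch models only the forward happy path, not the real schema changer's job machinery or rollbacks:

```python
# Schema-change state machine sketch: each change type steps through a
# fixed sequence of descriptor states; the final state is terminal.

ADD_INDEX   = ["none", "write-only", "backfilling", "public"]
DROP_COLUMN = ["public", "write-only", "delete-only", "absent"]

def next_state(states, current):
    """Advance one step along the sequence; the last state is terminal."""
    i = states.index(current)
    return states[min(i + 1, len(states) - 1)]

print(next_state(ADD_INDEX, "write-only"))    # backfilling
print(next_state(DROP_COLUMN, "delete-only")) # absent
```

Bounding every node to within one version of its neighbors is what keeps writes consistent while the cluster transitions between states.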

Query Examples

Simple Query

SELECT name, email FROM users WHERE dept = 'eng';
Execution Path:
  1. Parse: SELECT statement → syntax tree
  2. Analyze: Resolve users table, validate columns
  3. Optimize: Choose index on dept if available
  4. Plan: Scan → Filter → Project
  5. Execute:
    • TableReader on range leaseholders
    • Filter dept = 'eng' remotely
    • Return name, email to gateway
  6. Result: Stream rows to client

Join Query

SELECT u.name, o.total
FROM users u
JOIN orders o ON u.id = o.user_id
WHERE u.dept = 'sales';
Possible Plans:
Lookup join:
1. Scan users with dept = 'sales' filter
2. For each user.id:
   - Lookup orders by index on user_id
   - Join matching rows
3. Project name, total
Good when: users table is small after filtering

Hash join:
1. Scan users (filter dept = 'sales')
2. Scan orders
3. Hash users by id
4. Probe hash table with orders.user_id
5. Project name, total
Good when: both tables are large with no suitable indexes

Merge join:
1. Scan users (ordered by id, filter dept)
2. Scan orders (ordered by user_id)
3. Merge streams on id = user_id
4. Project name, total
Good when: both inputs are already sorted

Aggregation Query

SELECT dept, COUNT(*), AVG(salary)
FROM employees
GROUP BY dept
HAVING AVG(salary) > 100000;
Distributed Execution:
Node 1, 2, 3: (Partial aggregation)
├─ Scan local employees ranges
├─ Partial GROUP BY dept
├─ Compute: count_partial, sum, count (for avg)
└─ Stream to coordinator

Coordinator: (Final aggregation)  
├─ Receive partial results
├─ Final GROUP BY dept
├─ Compute: COUNT(*) = sum(count_partial)
│           AVG(salary) = sum(sum) / sum(count)
├─ Filter HAVING AVG(salary) > 100000
└─ Return to client
See Distributed SQL for details.
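The partial/final aggregation split above hinges on AVG decomposing into (sum, count) partials. A small sketch of the coordinator's merge step, with invented partial results:

```python
# Distributed-AVG sketch: each node emits (dept, count, sum) partials;
# the coordinator merges them so AVG = sum(sums) / sum(counts).

def merge_partials(partials):
    """partials: iterable of (dept, count, sum) → {dept: (COUNT, AVG)}."""
    totals = {}
    for dept, count, total in partials:
        c, s = totals.get(dept, (0, 0))
        totals[dept] = (c + count, s + total)
    return {d: (c, s / c) for d, (c, s) in totals.items()}

# Partial results streamed in from three nodes:
partials = [("eng", 2, 240_000), ("eng", 1, 150_000), ("sales", 2, 180_000)]
result = merge_partials(partials)
print(result)  # {'eng': (3, 130000.0), 'sales': (2, 90000.0)}
# HAVING AVG(salary) > 100000 runs on the merged result:
print({d: v for d, v in result.items() if v[1] > 100_000})  # eng only
```

Averaging the per-node averages directly would be wrong when nodes hold different numbers of rows, which is why the partials carry sums and counts separately.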

Performance Features

Query Hints

Force specific query behavior:
-- Force index
SELECT * FROM users@idx_dept WHERE dept = 'eng';

-- Force a join algorithm (a join hint also fixes the join order as written)
SELECT * FROM users u INNER LOOKUP JOIN orders o ON u.id = o.user_id;
SELECT * FROM a INNER HASH JOIN b ON a.id = b.a_id;

Statement Bundle

Capture query diagnostics:
EXPLAIN ANALYZE (DEBUG) SELECT ...;
Generates bundle with:
  • Query plan
  • Statistics
  • Trace data
  • Environment info

Query Profiling

SET tracing = on;
SELECT ...;
SET tracing = off;

SELECT * FROM crdb_internal.session_trace;

Key Implementation Files

Connection & Execution:
  • pkg/sql/conn_executor.go - Main SQL execution loop
  • pkg/sql/exec_util.go - Execution utilities
Parsing:
  • pkg/sql/parser/ - SQL parser
  • pkg/sql/sem/tree/ - Syntax tree definitions
Optimization:
  • pkg/sql/opt/ - Cost-based optimizer
  • pkg/sql/opt/memo/ - Memo structure
  • pkg/sql/opt/xform/ - Optimization rules
Execution:
  • pkg/sql/colexec/ - Vectorized execution
  • pkg/sql/rowexec/ - Row-at-a-time execution
  • pkg/sql/distsql/ - Distributed SQL infrastructure

Further Reading

  • Distributed SQL: query distribution and parallelization
  • Transaction Layer: SQL transaction semantics
