Design Philosophy
Unlike traditional databases that use two-phase locking (2PL), CockroachDB transactions never acquire locks. Instead, they use write intents and timestamp-based conflict resolution.
docs/design.md:
Cockroach provides distributed transactions without locks. Both SI and SSI require that the outcome of reads must be preserved, i.e. a write of a key at a lower timestamp than a previous read must not succeed.
Key Advantages
No Deadlocks
Without locks, deadlock detection and resolution are unnecessary.
Lock-Free Reads
Reads never block writes and vice versa, improving concurrency.
Fast Failure
Conflicts detected early - writes fail fast rather than blocking.
No Starvation
Priority system ensures long transactions eventually complete.
Isolation Levels
CockroachDB supports two isolation levels:
Snapshot Isolation (SI)
Characteristics
- Reads see consistent snapshot at transaction start time
- Writes validated at commit time
- Allows write skew anomalies in rare cases
- Higher performance under contention
- Can have timestamp pushed forward without restart
Serializable Snapshot Isolation (SSI)
Characteristics
- Default isolation level
- Eliminates all anomalies including write skew
- Transactions restart if timestamp pushed
- Provides serializability guarantee
- Minimal overhead in low-contention scenarios
SSI is the default level, with SI provided for application developers who are certain enough of their need for performance and the absence of write skew conditions to consciously elect to use it.
Hybrid Logical Clock (HLC)
CockroachDB uses Hybrid Logical Clocks to assign timestamps that combine physical wall clock time with logical counters for causality tracking.
HLC Structure
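A minimal sketch of an HLC timestamp and clock in Go. The types and method names here are illustrative, not CockroachDB's actual pkg/util/hlc API; the wall clock is injected so the sketch is deterministic.

```go
package main

import "fmt"

// Timestamp is a hybrid logical clock reading: wall-clock time plus a
// logical counter used to order events within the same physical tick.
type Timestamp struct {
	Physical int64 // wall time (e.g. nanoseconds)
	Logical  int32 // tie-breaker for equal physical components
}

// Clock is a toy HLC. wallNow stands in for reading the local wall clock.
type Clock struct {
	last    Timestamp
	wallNow func() int64
}

// Now returns a timestamp that is >= wall time and strictly greater than
// every timestamp this clock has previously handed out.
func (c *Clock) Now() Timestamp {
	wall := c.wallNow()
	if wall > c.last.Physical {
		c.last = Timestamp{Physical: wall}
	} else {
		c.last.Logical++ // same physical tick: bump the logical component
	}
	return c.last
}

// Update folds in a timestamp received from another node:
// local HLC = max(local, received), then advance the logical component.
func (c *Clock) Update(remote Timestamp) Timestamp {
	wall := c.wallNow()
	max := c.last
	if remote.Physical > max.Physical ||
		(remote.Physical == max.Physical && remote.Logical > max.Logical) {
		max = remote
	}
	if wall > max.Physical {
		c.last = Timestamp{Physical: wall}
	} else {
		max.Logical++
		c.last = max
	}
	return c.last
}

func main() {
	c := &Clock{wallNow: func() int64 { return 100 }}
	fmt.Println(c.Now())                            // {100 0}
	fmt.Println(c.Update(Timestamp{Physical: 150})) // remote ahead: {150 1}
	fmt.Println(c.Now())                            // {150 2}: still >= wall time
}
```

Note how the logical component absorbs a remote clock that runs ahead of local wall time, preserving the guarantee that HLC time never runs backward.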
HLC Properties
From the design document: HLC time uses timestamps which are composed of a physical component (thought of as and always close to local wall time) and a logical component (used to distinguish between events with the same physical component).
Key behaviors:
- Reading: Update local HLC with max(local, received timestamp)
- Writing: Use current HLC timestamp
- Guarantee: HLC time ≥ wall time always
Clock Synchronization
Uncertainty Interval: When reading from remote nodes:
- Read at timestamp t
- Uncertainty window: [t, t + max_clock_offset]
- Values in uncertainty window trigger restart
- Optimization: mark node as “certain” after first restart
Transaction Execution
Transaction Lifecycle
Two-Phase Execution
From docs/design.md:
Phase 1: Write Intents
Start transaction:
- Select range likely to be heavily involved
- Write transaction record with state “PENDING”
- Assign random priority and candidate timestamp
- Write “intent” values (normal MVCC values with intent flag)
- Include transaction ID with each intent
- Intents written in parallel to all affected ranges
- Record maximum timestamp from all writes
Phase 2: Commit
Commit transaction:
- Update transaction record to “COMMITTED”
- Use final timestamp (max from phase 1)
- For SSI: verify timestamp not pushed, else restart
- For SI: accept pushed timestamp
- Transaction considered committed at this point
- Remove “intent” flag from all written values
- Can happen asynchronously after commit
- Gateway tracks intents for resolution
- Other transactions may resolve abandoned intents
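The SSI-vs-SI rule in phase 2 can be sketched as a small decision function (names and representation are ours, not the actual implementation):

```go
package main

import "fmt"

type isolation int

const (
	SSI isolation = iota // serializable snapshot isolation (default)
	SI                   // snapshot isolation
)

// commitOutcome applies the phase-2 rule: a transaction whose final
// timestamp was pushed past its candidate must restart under SSI, but may
// simply commit at the pushed timestamp under SI.
func commitOutcome(iso isolation, candidateTS, finalTS int64) string {
	if finalTS > candidateTS && iso == SSI {
		return "restart" // reads may no longer be valid at the pushed timestamp
	}
	return "commit"
}

func main() {
	fmt.Println(commitOutcome(SSI, 10, 10)) // commit: timestamp never pushed
	fmt.Println(commitOutcome(SSI, 10, 15)) // restart: pushed under SSI
	fmt.Println(commitOutcome(SI, 10, 15))  // commit: SI accepts the push
}
```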
Transaction Records
From docs/design.md:
Please see pkg/roachpb/data.proto for the up-to-date structures, the best entry point being message Transaction. The transaction record contains:
- Transaction ID (UUID)
- Transaction status (PENDING, COMMITTED, ABORTED)
- Candidate timestamp
- Priority
- Isolation level
- Written keys (for intent resolution)
- Stored in range containing first written key
- Accessible via transaction ID
- Updated via Raft consensus
Conflict Resolution
Conflicts are resolved using timestamp manipulation and transaction priorities rather than blocking or locks.
Transaction Interactions
Reader Encounters Write Intent (Newer)
If the intent’s timestamp is newer than the reader’s, there is no conflict: the reader ignores the intent and reads the latest value below its own timestamp.
Reader Encounters Write Intent (Older)
If the reader has the higher priority, it pushes the transaction’s commit timestamp (that transaction will then notice its timestamp has been pushed, and restart). If it has the lower or same priority, it retries itself using as a new priority max(new random priority, conflicting txn's priority - 1).
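The retry-priority rule quoted above can be written out directly (the function name is ours):

```go
package main

import "fmt"

// retryPriority implements max(new random priority, conflicting txn's
// priority - 1): the loser's priority ratchets up to just below the
// winner's, so a repeatedly retried transaction eventually outranks its
// contenders.
func retryPriority(newRandom, winnerPriority int32) int32 {
	if p := winnerPriority - 1; p > newRandom {
		return p
	}
	return newRandom
}

func main() {
	fmt.Println(retryPriority(3, 10))  // 9: one below the winner
	fmt.Println(retryPriority(20, 10)) // 20: the fresh random draw was already higher
}
```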
Writer Encounters Uncommitted Intent
If the writer has the higher priority, it aborts the conflicting transaction (marking its record ABORTED). Otherwise it retries itself, using as a new priority max(new random priority, conflicting txn's priority - 1).
Writer Encounters Newer Committed Value
The writer’s candidate timestamp is pushed past the committed value. Under SI the write proceeds at the pushed timestamp; under SSI the transaction must restart.
Writer Encounters Read Timestamp
If the key was already read at a timestamp later than the write’s candidate timestamp, the candidate timestamp is pushed past the read; see the read timestamp cache.
Read Timestamp Cache
Each range maintains an in-memory cache of the latest timestamp at which each key was read.
On write:
- Check if key was read after write’s timestamp
- If yes: return new timestamp, forcing restart (SSI only)
- Cache bounded, evicts oldest entries
docs/design.md:
If the write’s candidate timestamp is earlier than the low water mark on the cache itself (i.e. its last evicted timestamp) or if the key being written has a read timestamp later than the write’s candidate timestamp, this later timestamp value is returned with the write.
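The quoted rule reduces to taking a maximum over the candidate timestamp, the low-water mark, and any recorded read of the key. A toy sketch (the real structure lives in pkg/kv/kvserver/tscache/ and pushes the write one logical tick past the read):

```go
package main

import "fmt"

// tsCache is a toy read timestamp cache for one range: the last read
// timestamp per key, plus a low-water mark covering evicted entries.
type tsCache struct {
	lowWater int64
	reads    map[string]int64
}

// pushWriteTS returns the timestamp a write must use: its candidate,
// pushed past both the cache's low-water mark and any recorded read.
func (c *tsCache) pushWriteTS(key string, candidate int64) int64 {
	ts := candidate
	if c.lowWater > ts {
		ts = c.lowWater
	}
	if readTS, ok := c.reads[key]; ok && readTS > ts {
		ts = readTS
	}
	return ts
}

func main() {
	c := &tsCache{lowWater: 50, reads: map[string]int64{"a": 120}}
	fmt.Println(c.pushWriteTS("a", 100)) // 120: key was read after the candidate
	fmt.Println(c.pushWriteTS("b", 100)) // 100: no conflicting read
	fmt.Println(c.pushWriteTS("b", 40))  // 50: below the low-water mark
}
```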
Transaction Restart vs. Abort
Restart:
- Reuse same transaction ID
- Update priority and/or timestamp
- Implicit cleanup of old intents during re-execution
- More efficient than abort

Abort:
- Transaction explicitly aborted
- New transaction ID required for retry
- Intents cleaned up asynchronously
- Used when transaction record marked ABORTED
Transaction Priorities
Priorities prevent starvation and allow application-level control over conflict resolution.
Priority Assignment
Initial: Random value
On conflict: the loser retries with max(new random priority, conflicting txn's priority - 1)
Deadlock Freedom
From the design document: Priorities avoid starvation for arbitrarily long transactions and always pick a winner from between contending transactions (no mutual aborts).
How it works:
- Each conflict has a winner (higher priority)
- Loser increases priority on retry
- Eventually, retrying transaction has highest priority
- Guaranteed progress
Transaction Coordinator
Gateway Role
From docs/design.md:
Transactions are managed by the client proxy (or gateway in SQL Azure parlance). Unlike in Spanner, writes are not buffered but are sent directly to all implicated ranges.
Responsibilities:
- Track transaction state
- Send writes to appropriate ranges
- Heartbeat transaction record
- Resolve intents on commit/abort
- Handle transaction restarts
pkg/kv/kvclient/kvcoord/txn_coord_sender.go
Transaction Heartbeats
Purpose:
- Detect abandoned transactions
- Allow cleanup of orphaned intents
- Prevent blocking on dead transactions

If heartbeats stop:
- Other transactions can abort the abandoned transaction
- Intents cleaned up on encounter
- Range remains available
Transactions encountered by readers or writers with dangling intents which haven’t been heartbeat within the required interval are aborted.
Write Intents
Intent Structure
Intents are MVCC values with metadata: each is a normal versioned value carrying an intent flag and the owning transaction’s ID.
Intent Resolution
After commit, the gateway rewrites each intent as an ordinary committed MVCC value; this can happen asynchronously.
Opportunistic Resolution
Intents are resolved in multiple ways:
- Async resolution: Gateway resolves after commit
- On encounter: Other transactions resolve when found
- GC process: Periodic cleanup of abandoned intents
- Intent resolution queue: Background processing
Distributed Transactions
Single-Range Transactions
When every key a transaction touches lives on a single range, the intents and the commit can be applied together on that range, with no cross-range intent resolution.
Multi-Range Transactions
Commit is atomic: once transaction record committed in one range, entire transaction is committed even if intent resolution incomplete.
Serializable vs. Strict Serializable
From docs/design.md:
Serializability (Default)
CockroachDB guarantees serializability:
- Transactions appear to execute in some serial order
- No anomalies (dirty read, write skew, etc.)
- Does not guarantee real-time ordering
Strict Serializability (Linearizability)
Causality Tokens
Optional feature for strict serializability: a causality token, passed from one transaction to a dependent one, ensures causally-related transactions get increasing timestamps.
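A causality token can be modeled as little more than a timestamp handed from one transaction to the next. A sketch under that assumption (not the actual client API):

```go
package main

import "fmt"

// CausalityToken carries the commit timestamp of a finished transaction.
type CausalityToken struct{ CommitTS int64 }

// startTS picks a start timestamp for a dependent transaction: at least one
// tick past the token, so causally related transactions get increasing
// timestamps even across nodes with skewed clocks.
func startTS(localHLC int64, tok CausalityToken) int64 {
	if tok.CommitTS >= localHLC {
		return tok.CommitTS + 1
	}
	return localHLC
}

func main() {
	tok := CausalityToken{CommitTS: 200}
	fmt.Println(startTS(150, tok)) // 201: local clock lagging, token wins
	fmt.Println(startTS(300, tok)) // 300: local clock already ahead
}
```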
Commit Wait (Future)
Spanner-style commit wait: a possible future mechanism in which a transaction delays acknowledging its commit until the commit timestamp is guaranteed to have passed on every node’s clock.
Performance Characteristics
Advantages
Low Latency
Single-round commit for non-contentious transactions
High Concurrency
Reads never block writes, writes fail fast
No Deadlocks
Priorities ensure progress without deadlock detection
Scalable
No global lock manager, fully distributed
Trade-offs
- Under heavy contention, transactions pay with restarts instead of waiting on locks
- Correctness of the uncertainty interval depends on a bounded maximum clock offset
- Long-running transactions may restart several times before their priority climbs high enough to win
Example: Bank Transfer
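A toy, in-memory sketch of the client-side pattern for this example: a retry loop around a restartable transaction body. This mirrors the restart-not-abort model above but is not CockroachDB's actual client API; the restart here is simulated rather than triggered by a real pushed timestamp.

```go
package main

import (
	"errors"
	"fmt"
)

// errRestart stands in for a retryable transaction error, such as a pushed
// timestamp under SSI.
var errRestart = errors.New("restart transaction")

// runTxn retries fn until it returns nil or a non-retryable error,
// mirroring the client-side retry loop an application uses: same logical
// transaction, new timestamp/priority on each attempt.
func runTxn(fn func() error) error {
	for {
		err := fn()
		if errors.Is(err, errRestart) {
			continue
		}
		return err
	}
}

func main() {
	balances := map[string]int64{"alice": 100, "bob": 10}
	restarts := 2 // simulate two pushed-timestamp restarts before success

	err := runTxn(func() error {
		if restarts > 0 {
			restarts--
			return errRestart
		}
		const amount = 40
		if balances["alice"] < amount {
			return errors.New("insufficient funds") // non-retryable
		}
		balances["alice"] -= amount // write intent, committed on success
		balances["bob"] += amount
		return nil
	})
	fmt.Println(err, balances["alice"], balances["bob"]) // <nil> 60 50
}
```

Because the body may run several times, it must be idempotent up to its final commit, which is exactly why intents are only made visible when the transaction record flips to COMMITTED.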
Implementation Details
Key source files:
- Transaction coordination: pkg/kv/kvclient/kvcoord/txn_coord_sender.go, pkg/kv/txn.go
- Intent handling: pkg/kv/kvserver/replica_proposal.go, pkg/kv/kvserver/intent_resolver.go
- Read timestamp cache: pkg/kv/kvserver/tscache/
- Concurrency control: pkg/kv/kvserver/concurrency/
Further Reading
- Storage Layer: MVCC implementation
- Replication Layer: Raft and consensus
- SQL Layer: Transaction SQL interface