If you haven’t already, we recommend reading the architecture overview.
Overview
Above all else, CockroachDB believes consistency is the most important feature of a database. Without it, developers cannot build reliable tools, and businesses suffer from potentially subtle and hard-to-detect anomalies. To provide consistency, CockroachDB implements full support for ACID transaction semantics in the transaction layer. However, it’s important to realize that all statements are handled as transactions, including single statements. This is sometimes referred to as “autocommit mode” because it behaves as if every statement is followed by a COMMIT.
Because CockroachDB enables transactions that can span your entire cluster (including cross-range and cross-table transactions), it achieves correctness using a distributed, atomic commit protocol called Parallel Commits.
Transaction lifecycle
Transactions in CockroachDB follow a three-phase lifecycle:

Phase 1: Writes and reads
Writing
When the transaction layer executes write operations, it doesn’t directly write values to disk. Instead, it creates several things that help it mediate a distributed transaction:

Write intents
Write intents are replicated via Raft and act as a combination of a provisional value and an exclusive lock. They are essentially the same as standard multi-version concurrency control (MVCC) values but also contain a pointer to the transaction record stored on the cluster.
Unreplicated locks
Unreplicated locks for a transaction’s writes represent a provisional, uncommitted state. These locks are stored in an in-memory, per-node lock table by the concurrency control machinery.
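Taken together, a write intent and the per-node lock table can be sketched with a minimal Python model. The class and field names here are illustrative, not CockroachDB’s actual Go types:

```python
from dataclasses import dataclass

@dataclass
class TransactionRecord:
    txn_id: str
    status: str = "PENDING"   # PENDING -> STAGING -> COMMITTED / ABORTED

@dataclass
class WriteIntent:
    """A provisional MVCC value plus a pointer to its transaction record."""
    key: str
    value: str
    timestamp: int
    txn_id: str               # the pointer back to the transaction record

class LockTable:
    """In-memory, per-node table of unreplicated locks (illustrative)."""
    def __init__(self):
        self.locks = {}       # key -> txn_id of the lock holder

    def acquire(self, key, txn_id):
        holder = self.locks.get(key)
        if holder is not None and holder != txn_id:
            return False      # conflict: another transaction holds the lock
        self.locks[key] = txn_id
        return True

# A write creates an intent (replicated via Raft) and an unreplicated lock.
record = TransactionRecord("txn-1")
intent = WriteIntent("k", "v", timestamp=100, txn_id=record.txn_id)
table = LockTable()
table.acquire("k", record.txn_id)
```

A conflicting `acquire("k", "txn-2")` would return `False` here, which is the point at which the concurrency-control machinery would queue the second transaction behind the first.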
Reading
If the transaction has not been aborted, the transaction layer begins executing read operations. If a read only encounters standard MVCC values, everything is fine. However, if a locking read encounters any existing locks, the operation must be resolved as a transaction conflict. CockroachDB provides the following types of reads:

- Strongly-consistent (non-stale) reads: These are the default and most common type of read. These reads go through the leaseholder and see all writes performed by writers that committed before the reading transaction (under SERIALIZABLE isolation) or statement (under READ COMMITTED isolation) started. They always return data that is correct and up-to-date.
- Stale reads: These are useful in situations where you can afford to read data that is slightly stale in exchange for faster reads. They can only be used in read-only transactions that use the AS OF SYSTEM TIME clause. They do not need to go through the leaseholder, since they ensure consistency by reading from a local replica at a timestamp that is never higher than the closed timestamp.
A locking read acquires one of the following lock types:

- Exclusive locks block concurrent writes and locking reads on a row. Only one transaction can hold an exclusive lock on a row at a time, and only the transaction holding the exclusive lock can write to the row.
- Shared locks block concurrent writes and exclusive locking reads on a row. Multiple transactions can hold a shared lock on a row at the same time. When multiple transactions hold a shared lock on a row, none can write to the row.
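The compatibility rules between shared and exclusive locks can be sketched in a few lines of Python (an illustrative model, not CockroachDB’s internal representation):

```python
# Lock compatibility sketch: shared locks coexist; exclusive locks do not.
SHARED, EXCLUSIVE = "shared", "exclusive"

def compatible(held_modes, requested):
    """Can a new lock of mode `requested` join the locks already held?"""
    if requested == SHARED:
        # Shared locks block writers, but other shared locks are fine.
        return all(m == SHARED for m in held_modes)
    # An exclusive lock requires that no other lock is held.
    return len(held_modes) == 0

assert compatible([], EXCLUSIVE)                 # one writer, alone
assert compatible([SHARED, SHARED], SHARED)      # many readers coexist
assert not compatible([SHARED], EXCLUSIVE)       # shared blocks writers
assert not compatible([EXCLUSIVE], SHARED)       # exclusive blocks everyone
```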
Phase 2: Commits
CockroachDB checks the running transaction’s record to see if it’s been ABORTED. If it has, it throws a retryable error to the client.
Otherwise, CockroachDB sets the transaction record’s state to STAGING and checks the transaction’s pending write intents to see if they have been successfully replicated across the cluster.
When the transaction passes these checks, CockroachDB responds with the transaction’s success to the client, and moves on to the cleanup phase. At this point, the transaction is committed, and the client is free to begin sending more SQL statements to the cluster.
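The commit-time checks above can be sketched as a small decision function. The statuses match the ones named in this section, but the function itself is an illustrative model, not CockroachDB code:

```python
# Commit-phase sketch: abort check, then STAGING plus replication check.
def try_commit(txn_status, in_flight_writes_replicated):
    if txn_status == "ABORTED":
        return "retry"            # surface a retryable error to the client
    txn_status = "STAGING"        # the record moves to STAGING
    if all(in_flight_writes_replicated):
        return "committed"        # respond to the client; cleanup is async
    return "pending"              # replication not yet confirmed

assert try_commit("ABORTED", []) == "retry"
assert try_commit("PENDING", [True, True]) == "committed"
assert try_commit("PENDING", [True, False]) == "pending"
```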
Phase 3: Cleanup (asynchronous)
After the transaction has been committed, it should be marked as such, and all of the write intents should be resolved. To do this, the coordinating node:

Resolve write intents
Resolves the transaction’s write intents to MVCC values by removing the pointer to the transaction record.
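In the dictionary-based sketch below (illustrative, not CockroachDB’s actual storage format), resolving an intent is exactly this deletion of the transaction-record pointer:

```python
# Intent resolution sketch: an intent becomes an ordinary MVCC value once
# the pointer to its transaction record is removed.
def resolve_intent(intent):
    mvcc_value = dict(intent)
    mvcc_value.pop("txn_id")      # drop the transaction-record pointer
    return mvcc_value

intent = {"key": "k", "value": "v", "timestamp": 100, "txn_id": "txn-1"}
assert resolve_intent(intent) == {"key": "k", "value": "v", "timestamp": 100}
```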
Time and hybrid logical clocks
In distributed systems, ordering and causality are difficult problems to solve. While it’s possible to rely entirely on Raft consensus to maintain serializability, doing so would be inefficient for reading data. To optimize read performance, CockroachDB implements hybrid logical clocks (HLCs), which are composed of a physical component (always close to local wall time) and a logical component (used to distinguish between events with the same physical component). In terms of transactions, the gateway node picks a timestamp for the transaction using HLC time. This timestamp is used both to track versions of values (through multi-version concurrency control) and to provide transactional isolation guarantees.

When nodes send requests to other nodes, they include the timestamp generated by their local HLCs. When nodes receive requests, they inform their local HLC of the timestamp supplied with the event by the sender.
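A minimal HLC can be sketched in Python. This is an illustrative model of the two-component timestamp and the send/receive update rules, not CockroachDB’s Go implementation:

```python
# Hybrid logical clock sketch: a physical component plus a logical counter
# that orders events sharing the same physical time.
class HLC:
    def __init__(self):
        self.wall = 0      # physical component (highest wall time seen)
        self.logical = 0   # logical component

    def now(self, physical_now):
        """Timestamp a local event or an outgoing request."""
        if physical_now > self.wall:
            self.wall, self.logical = physical_now, 0
        else:
            self.logical += 1
        return (self.wall, self.logical)

    def observe(self, physical_now, remote):
        """Fold a timestamp received from another node into the clock."""
        rw, rl = remote
        if physical_now > self.wall and physical_now > rw:
            self.wall, self.logical = physical_now, 0
        elif rw > self.wall:
            self.wall, self.logical = rw, rl + 1
        elif rw == self.wall:
            self.logical = max(self.logical, rl) + 1
        else:
            self.logical += 1
        return (self.wall, self.logical)

clk = HLC()
clk.now(100)                          # local event at wall time 100
clk.observe(100, (105, 3))            # a request arrives from a faster node
```

After observing the remote timestamp `(105, 3)`, a subsequent local event at wall time 100 is stamped above it, preserving causal ordering despite the skewed local clock.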
Max clock offset enforcement
CockroachDB requires moderate levels of clock synchronization to preserve data consistency. For this reason, when a node detects that its clock is out of sync with at least half of the other nodes in the cluster by 80% of the maximum offset allowed, it crashes immediately.

Timestamp cache
Whenever an operation reads a value, CockroachDB stores the operation’s timestamp in a timestamp cache, a data structure maintained by leaseholders that records the high-water mark of reads. This is used to ensure that once a transaction reads a row, another transaction that comes along and tries to write to that row will be ordered after the first transaction, thus ensuring a serial order of transactions (serializability). Whenever a write occurs, its timestamp is checked against the timestamp cache. If the timestamp is earlier than the timestamp cache’s latest value, CockroachDB will attempt to push the timestamp for its transaction forward to a later time. Pushing the timestamp might cause the transaction to restart at commit time under SERIALIZABLE isolation.
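The read/write interaction with the cache can be sketched as follows. This is a simplified per-key model with illustrative names; the real cache tracks key spans:

```python
# Timestamp cache sketch: reads record a high-water mark; a later write
# below that mark has its timestamp pushed forward past the last read.
class TimestampCache:
    def __init__(self):
        self.high_water = {}   # key -> latest read timestamp

    def record_read(self, key, ts):
        self.high_water[key] = max(self.high_water.get(key, 0), ts)

    def push_if_needed(self, key, write_ts):
        """Return the timestamp the write must use to stay serializable."""
        floor = self.high_water.get(key, 0)
        if write_ts <= floor:
            return floor + 1   # push the write past the last read
        return write_ts

cache = TimestampCache()
cache.record_read("k", 10)
assert cache.push_if_needed("k", 5) == 11    # earlier write gets pushed
assert cache.push_if_needed("k", 20) == 20   # later write is unaffected
```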
Closed timestamps
Each CockroachDB range tracks a property called its closed timestamp, which means that no new writes can ever be introduced at or below that timestamp. The closed timestamp is advanced continuously on the leaseholder, and lags the current time by some target interval. In other words, a closed timestamp is a promise by the range’s leaseholder to its follower replicas that it will not accept writes below that timestamp. Generally speaking, the leaseholder continuously closes timestamps a few seconds in the past. Closed timestamps provide the guarantees that are used to provide support for low-latency historical (stale) reads, also known as Follower Reads. Follower reads can be particularly useful in multi-region deployments.

Transaction conflicts
CockroachDB’s transactions allow the following types of conflicts that involve running into a write intent:

- Write-write: Two transactions create write intents or acquire a lock on the same key.
- Write-read: A read encounters an existing write intent with a timestamp less than its own (only under SERIALIZABLE isolation).
Check explicit priority
If the transaction has an explicit priority set (i.e., HIGH or LOW), the transaction with the lower priority is aborted with a serialization error (in the write-write case) or has its timestamp pushed (in the write-read case).

Check if transaction is expired

If the encountered transaction is expired, it’s ABORTED and conflict resolution succeeds. A write intent is considered expired if:

- It doesn’t have a transaction record and its timestamp is outside of the transaction liveness threshold
- Its transaction record hasn’t been heartbeated within the transaction liveness threshold
The transaction layer also handles the following types of conflicts that do not involve running into write intents:

- Write after read: When a write with a lower timestamp encounters a later read. This is handled through the timestamp cache.
- Read within uncertainty window: When a read encounters a value with a higher timestamp, but it’s ambiguous whether the value is in the future or the past of the reading transaction because of possible clock skew.
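The priority and expiration checks described above can be sketched as a single decision function (an illustrative model; the field names and the numeric liveness threshold are assumptions, not CockroachDB’s):

```python
# Conflict-resolution sketch: explicit priorities decide first; otherwise
# an expired transaction is aborted; otherwise the conflict blocks.
def resolve_conflict(ours, theirs, now, liveness_threshold=5):
    # Check explicit priority: the lower-priority transaction loses.
    if ours["priority"] != theirs["priority"]:
        loser = ours if ours["priority"] < theirs["priority"] else theirs
        return ("abort", loser["id"])
    # Check whether the encountered transaction is expired
    # (no recent heartbeat on its transaction record).
    if now - theirs["last_heartbeat"] > liveness_threshold:
        return ("abort", theirs["id"])
    return ("wait", None)

a = {"id": "A", "priority": 1, "last_heartbeat": 10}
b = {"id": "B", "priority": 0, "last_heartbeat": 10}
assert resolve_conflict(a, b, now=12) == ("abort", "B")  # lower priority loses
c = {"id": "C", "priority": 1, "last_heartbeat": 0}
assert resolve_conflict(a, c, now=20) == ("abort", "C")  # expired txn aborted
assert resolve_conflict(a, b | {"priority": 1}, now=12) == ("wait", None)
```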
Read refreshing
Whenever a SERIALIZABLE transaction’s timestamp has been pushed, additional checks are required before allowing it to commit at the pushed timestamp. Any values which the transaction previously read must be checked to verify that no writes have subsequently occurred between the original transaction timestamp and the pushed transaction timestamp. This check prevents serializability violations.
The check is done by keeping track of all the reads using a dedicated RefreshRequest. If this succeeds, the transaction is allowed to commit. If the refreshing is unsuccessful (also known as read invalidation), then the transaction must be retried at the pushed timestamp.
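The refresh check can be sketched as a scan for conflicting writes between the two timestamps. This is a simplified per-key Python model; the real RefreshRequest tracks key spans, and these function names are illustrative:

```python
# Read-refreshing sketch: a pushed SERIALIZABLE transaction may commit at
# its new timestamp only if none of its reads were overwritten in between.
def refresh_reads(read_keys, orig_ts, pushed_ts, other_writes):
    """other_writes: (key, timestamp) pairs committed by other transactions."""
    for key, ts in other_writes:
        if key in read_keys and orig_ts < ts <= pushed_ts:
            return False   # read invalidated: the transaction must retry
    return True

reads = {"a", "b"}
assert refresh_reads(reads, 10, 20, [("c", 15)])       # unrelated key: ok
assert not refresh_reads(reads, 10, 20, [("a", 15)])   # overlapping write
assert refresh_reads(reads, 10, 20, [("a", 25)])       # write after push: ok
```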
READ COMMITTED transactions do not perform read refreshing before commit. Instead, they permit their read and write timestamps to diverge at commit time. This makes it possible for statements in concurrent READ COMMITTED transactions to interleave without aborting transactions.

Transaction pipelining
Transactional writes are pipelined when being replicated and when being written to disk, dramatically reducing the latency of transactions that perform multiple writes. At a high level, transaction pipelining works as follows:

Communicate with leaseholders
For each statement, the transaction gateway node communicates with the leaseholders for the ranges it wants to write to.
Create and send write intents
Each leaseholder creates write intents and sends them to its follower nodes. The leaseholder responds to the transaction gateway node that the write intents have been sent (note that replication is still in-flight at this stage).
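The latency effect of overlapping replication can be sketched with a back-of-the-envelope model. The consensus-round cost and function names are illustrative, assuming each write's replication takes one round-trip and pipelined writes replicate concurrently:

```python
# Pipelining sketch: without pipelining, the gateway waits for each write's
# replication in turn; with pipelining, replication overlaps and the
# transaction waits only once, just before commit.
CONSENSUS_ROUND = 10  # hypothetical cost of one replication round-trip

def sequential_latency(num_writes):
    return num_writes * CONSENSUS_ROUND   # wait after every write

def pipelined_latency(num_writes):
    return CONSENSUS_ROUND                # all writes replicate in parallel

assert sequential_latency(5) == 50
assert pipelined_latency(5) == 10         # no longer scales with write count
```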
Parallel commits
Parallel Commits is an optimized atomic commit protocol that cuts the commit latency of a transaction in half, from two rounds of consensus down to one. Combined with transaction pipelining, this brings the latency incurred by common OLTP transactions to near the theoretical minimum. Under this atomic commit protocol, the transaction coordinator can return to the client eagerly when it knows that the writes in the transaction have succeeded. Once this occurs, the transaction coordinator can set the transaction record’s state to COMMITTED and resolve the transaction’s write intents asynchronously.
The transaction coordinator is able to do this while maintaining correctness guarantees because it populates the transaction record with enough information (via a new STAGING state, and an array of in-flight writes) for other transactions to determine whether all writes in the transaction are present, and thus prove whether or not the transaction is committed.
The latency until intents are resolved is unchanged by the introduction of Parallel Commits. This means that contended workloads are expected to benefit less from this feature.
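The commit-proof condition described above can be sketched as a predicate over the STAGING record (an illustrative model of the idea, not CockroachDB code):

```python
# Parallel Commits sketch: a STAGING record lists its in-flight writes, so
# any observer can prove the commit by checking that each one replicated.
def is_implicitly_committed(txn_record, replicated_keys):
    return (txn_record["status"] == "STAGING"
            and all(k in replicated_keys for k in txn_record["in_flight"]))

record = {"status": "STAGING", "in_flight": ["a", "b"]}
assert not is_implicitly_committed(record, {"a"})    # still ambiguous
assert is_implicitly_committed(record, {"a", "b"})   # provably committed
# Afterward, the coordinator flips the record to COMMITTED asynchronously.
```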
Isolation levels
CockroachDB supports two transaction isolation levels:

- SERIALIZABLE: The default and strongest isolation level. Transactions appear to have executed in a serial order.
- READ COMMITTED: A weaker isolation level that permits certain anomalies in exchange for better performance in some scenarios.
Interactions with other layers
In relationship to other layers in CockroachDB, the transaction layer:

- Receives KV operations from the SQL layer
- Controls the flow of KV operations sent to the distribution layer