If you haven’t already, we recommend reading the architecture overview.
Overview
Above all else, CockroachDB believes consistency is the most important feature of a database. Without it, developers cannot build reliable tools, and businesses suffer from potentially subtle and hard-to-detect anomalies. To provide consistency, CockroachDB implements full support for ACID transaction semantics in the transaction layer. However, it’s important to realize that all statements are handled as transactions, including single statements. This is sometimes referred to as “autocommit mode” because it behaves as if every statement is followed by a COMMIT.
Because CockroachDB enables transactions that can span your entire cluster (including cross-range and cross-table transactions), it achieves correctness using a distributed, atomic commit protocol called Parallel Commits.
Transaction lifecycle
Transactions in CockroachDB follow a three-phase lifecycle:

Phase 1: Writes and reads
Writing
When the transaction layer executes write operations, it doesn’t directly write values to disk. Instead, it creates several things that help it mediate a distributed transaction:

Write intents
Write intents are replicated via Raft and act as a combination of a provisional value and an exclusive lock. They are essentially the same as standard multi-version concurrency control (MVCC) values but also contain a pointer to the transaction record stored on the cluster.
Unreplicated locks
Unreplicated locks for a transaction’s writes represent a provisional, uncommitted state. These locks are stored in an in-memory, per-node lock table by the concurrency control machinery.
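Taken together, a write intent and the per-node lock table can be sketched with a minimal Python model. The class and field names here are illustrative, not CockroachDB’s actual Go types:

```python
from dataclasses import dataclass

@dataclass
class TransactionRecord:
    txn_id: str
    status: str = "PENDING"   # PENDING -> STAGING -> COMMITTED / ABORTED

@dataclass
class WriteIntent:
    """A provisional MVCC value plus a pointer to its transaction record."""
    key: str
    value: str
    timestamp: int
    txn_id: str               # the pointer back to the transaction record

class LockTable:
    """In-memory, per-node table of unreplicated locks (illustrative)."""
    def __init__(self):
        self.locks = {}       # key -> txn_id of the lock holder

    def acquire(self, key, txn_id):
        holder = self.locks.get(key)
        if holder is not None and holder != txn_id:
            return False      # conflict: another transaction holds the lock
        self.locks[key] = txn_id
        return True

# A write creates an intent (replicated via Raft) and an unreplicated lock.
record = TransactionRecord("txn-1")
intent = WriteIntent("k", "v", timestamp=100, txn_id=record.txn_id)
table = LockTable()
table.acquire("k", record.txn_id)
```

A conflicting `acquire("k", "txn-2")` would return `False` here, which is the point at which the concurrency-control machinery would queue the second transaction behind the first.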
Reading
If the transaction has not been aborted, the transaction layer begins executing read operations. If a read only encounters standard MVCC values, everything is fine. However, if a locking read encounters any existing locks, the operation must be resolved as a transaction conflict. CockroachDB provides the following types of reads:

- Strongly-consistent (non-stale) reads: These are the default and most common type of read. These reads go through the leaseholder and see all writes performed by writers that committed before the reading transaction (under SERIALIZABLE isolation) or statement (under READ COMMITTED isolation) started. They always return data that is correct and up-to-date.
- Stale reads: These are useful in situations where you can afford to read data that is slightly stale in exchange for faster reads. They can only be used in read-only transactions that use the AS OF SYSTEM TIME clause. They do not need to go through the leaseholder, since they ensure consistency by reading from a local replica at a timestamp that is never higher than the closed timestamp.
A locking read acquires one of the following lock types:

- Exclusive locks block concurrent writes and locking reads on a row. Only one transaction can hold an exclusive lock on a row at a time, and only the transaction holding the exclusive lock can write to the row.
- Shared locks block concurrent writes and exclusive locking reads on a row. Multiple transactions can hold a shared lock on a row at the same time. When multiple transactions hold a shared lock on a row, none can write to the row.
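The compatibility rules between shared and exclusive locks can be sketched in a few lines of Python (an illustrative model, not CockroachDB’s internal representation):

```python
# Lock compatibility sketch: shared locks coexist; exclusive locks do not.
SHARED, EXCLUSIVE = "shared", "exclusive"

def compatible(held_modes, requested):
    """Can a new lock of mode `requested` join the locks already held?"""
    if requested == SHARED:
        # Shared locks block writers, but other shared locks are fine.
        return all(m == SHARED for m in held_modes)
    # An exclusive lock requires that no other lock is held.
    return len(held_modes) == 0

assert compatible([], EXCLUSIVE)                 # one writer, alone
assert compatible([SHARED, SHARED], SHARED)      # many readers coexist
assert not compatible([SHARED], EXCLUSIVE)       # shared blocks writers
assert not compatible([EXCLUSIVE], SHARED)       # exclusive blocks everyone
```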
Phase 2: Commits
CockroachDB checks the running transaction’s record to see if it’s been ABORTED. If it has, it throws a retryable error to the client.
Otherwise, CockroachDB sets the transaction record’s state to STAGING and checks the transaction’s pending write intents to see if they have been successfully replicated across the cluster.
When the transaction passes these checks, CockroachDB responds with the transaction’s success to the client, and moves on to the cleanup phase. At this point, the transaction is committed, and the client is free to begin sending more SQL statements to the cluster.
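The commit-time checks above can be sketched as a small decision function. The statuses match the ones named in this section, but the function itself is an illustrative model, not CockroachDB code:

```python
# Commit-phase sketch: abort check, then STAGING plus replication check.
def try_commit(txn_status, in_flight_writes_replicated):
    if txn_status == "ABORTED":
        return "retry"            # surface a retryable error to the client
    txn_status = "STAGING"        # the record moves to STAGING
    if all(in_flight_writes_replicated):
        return "committed"        # respond to the client; cleanup is async
    return "pending"              # replication not yet confirmed

assert try_commit("ABORTED", []) == "retry"
assert try_commit("PENDING", [True, True]) == "committed"
assert try_commit("PENDING", [True, False]) == "pending"
```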
Phase 3: Cleanup (asynchronous)
After the transaction has been committed, it should be marked as such, and all of the write intents should be resolved. To do this, the coordinating node:

Resolve write intents
Resolves the transaction’s write intents to MVCC values by removing the pointer to the transaction record.
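In the dictionary-based sketch below (illustrative, not CockroachDB’s actual storage format), resolving an intent is exactly this deletion of the transaction-record pointer:

```python
# Intent resolution sketch: an intent becomes an ordinary MVCC value once
# the pointer to its transaction record is removed.
def resolve_intent(intent):
    mvcc_value = dict(intent)
    mvcc_value.pop("txn_id")      # drop the transaction-record pointer
    return mvcc_value

intent = {"key": "k", "value": "v", "timestamp": 100, "txn_id": "txn-1"}
assert resolve_intent(intent) == {"key": "k", "value": "v", "timestamp": 100}
```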
Time and hybrid logical clocks
In distributed systems, ordering and causality are difficult problems to solve. While it’s possible to rely entirely on Raft consensus to maintain serializability, doing so would be inefficient for reading data. To optimize read performance, CockroachDB implements hybrid logical clocks (HLCs), which are composed of a physical component (always close to local wall time) and a logical component (used to distinguish between events with the same physical component). In terms of transactions, the gateway node picks a timestamp for the transaction using HLC time. This timestamp is used both to track versions of values (through multi-version concurrency control) and to provide transactional isolation guarantees.

When nodes send requests to other nodes, they include the timestamp generated by their local HLCs. When nodes receive requests, they inform their local HLC of the timestamp supplied with the event by the sender.
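A minimal HLC can be sketched in Python. This is an illustrative model of the two-component timestamp and the send/receive update rules, not CockroachDB’s Go implementation:

```python
# Hybrid logical clock sketch: a physical component plus a logical counter
# that orders events sharing the same physical time.
class HLC:
    def __init__(self):
        self.wall = 0      # physical component (highest wall time seen)
        self.logical = 0   # logical component

    def now(self, physical_now):
        """Timestamp a local event or an outgoing request."""
        if physical_now > self.wall:
            self.wall, self.logical = physical_now, 0
        else:
            self.logical += 1
        return (self.wall, self.logical)

    def observe(self, physical_now, remote):
        """Fold a timestamp received from another node into the clock."""
        rw, rl = remote
        if physical_now > self.wall and physical_now > rw:
            self.wall, self.logical = physical_now, 0
        elif rw > self.wall:
            self.wall, self.logical = rw, rl + 1
        elif rw == self.wall:
            self.logical = max(self.logical, rl) + 1
        else:
            self.logical += 1
        return (self.wall, self.logical)

clk = HLC()
clk.now(100)                          # local event at wall time 100
clk.observe(100, (105, 3))            # a request arrives from a faster node
```

After observing the remote timestamp `(105, 3)`, a subsequent local event at wall time 100 is stamped above it, preserving causal ordering despite the skewed local clock.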
Max clock offset enforcement
CockroachDB requires moderate levels of clock synchronization to preserve data consistency. For this reason, when a node detects that its clock is out of sync with at least half of the other nodes in the cluster by 80% of the maximum offset allowed, it crashes immediately.

Timestamp cache
Whenever an operation reads a value, CockroachDB stores the operation’s timestamp in a timestamp cache, a data structure maintained by leaseholders that records the high-water mark of reads. This is used to ensure that once a transaction reads a row, another transaction that comes along and tries to write to that row will be ordered after the first transaction, thus ensuring a serial order of transactions (serializability). Whenever a write occurs, its timestamp is checked against the timestamp cache. If the timestamp is earlier than the timestamp cache’s latest value, CockroachDB will attempt to push the timestamp for its transaction forward to a later time. Pushing the timestamp might cause the transaction to restart at commit time under SERIALIZABLE isolation.
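The read/write interaction with the cache can be sketched as follows. This is a simplified per-key model with illustrative names; the real cache tracks key spans:

```python
# Timestamp cache sketch: reads record a high-water mark; a later write
# below that mark has its timestamp pushed forward past the last read.
class TimestampCache:
    def __init__(self):
        self.high_water = {}   # key -> latest read timestamp

    def record_read(self, key, ts):
        self.high_water[key] = max(self.high_water.get(key, 0), ts)

    def push_if_needed(self, key, write_ts):
        """Return the timestamp the write must use to stay serializable."""
        floor = self.high_water.get(key, 0)
        if write_ts <= floor:
            return floor + 1   # push the write past the last read
        return write_ts

cache = TimestampCache()
cache.record_read("k", 10)
assert cache.push_if_needed("k", 5) == 11    # earlier write gets pushed
assert cache.push_if_needed("k", 20) == 20   # later write is unaffected
```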
Closed timestamps
Each CockroachDB range tracks a property called its closed timestamp, which means that no new writes can ever be introduced at or below that timestamp. The closed timestamp is advanced continuously on the leaseholder, and lags the current time by some target interval. In other words, a closed timestamp is a promise by the range’s leaseholder to its follower replicas that it will not accept writes below that timestamp. Generally speaking, the leaseholder continuously closes timestamps a few seconds in the past. Closed timestamps provide the guarantees that are used to provide support for low-latency historical (stale) reads, also known as Follower Reads. Follower reads can be particularly useful in multi-region deployments.

Transaction conflicts
CockroachDB’s transactions allow the following types of conflicts that involve running into a write intent:

- Write-write: Two transactions create write intents or acquire a lock on the same key.
- Write-read: A read encounters an existing write intent with a timestamp less than its own (only under SERIALIZABLE isolation).
Check explicit priority
If the transaction has an explicit priority set (i.e., HIGH or LOW), the transaction with the lower priority is aborted with a serialization error (in the write-write case) or has its timestamp pushed (in the write-read case).

Check if transaction is expired

If the encountered transaction is expired, it’s ABORTED and conflict resolution succeeds. A write intent is considered expired if:

- It doesn’t have a transaction record and its timestamp is outside of the transaction liveness threshold
- Its transaction record hasn’t been heartbeated within the transaction liveness threshold
The transaction layer also handles the following types of conflicts that do not involve running into write intents:

- Write after read: When a write with a lower timestamp encounters a later read. This is handled through the timestamp cache.
- Read within uncertainty window: When a read encounters a value with a higher timestamp, but it’s ambiguous whether the value is in the future or the past of the reading transaction because of possible clock skew.
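The priority and expiration checks described above can be sketched as a single decision function (an illustrative model; the field names and the numeric liveness threshold are assumptions, not CockroachDB’s):

```python
# Conflict-resolution sketch: explicit priorities decide first; otherwise
# an expired transaction is aborted; otherwise the conflict blocks.
def resolve_conflict(ours, theirs, now, liveness_threshold=5):
    # Check explicit priority: the lower-priority transaction loses.
    if ours["priority"] != theirs["priority"]:
        loser = ours if ours["priority"] < theirs["priority"] else theirs
        return ("abort", loser["id"])
    # Check whether the encountered transaction is expired
    # (no recent heartbeat on its transaction record).
    if now - theirs["last_heartbeat"] > liveness_threshold:
        return ("abort", theirs["id"])
    return ("wait", None)

a = {"id": "A", "priority": 1, "last_heartbeat": 10}
b = {"id": "B", "priority": 0, "last_heartbeat": 10}
assert resolve_conflict(a, b, now=12) == ("abort", "B")  # lower priority loses
c = {"id": "C", "priority": 1, "last_heartbeat": 0}
assert resolve_conflict(a, c, now=20) == ("abort", "C")  # expired txn aborted
assert resolve_conflict(a, b | {"priority": 1}, now=12) == ("wait", None)
```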
Read refreshing
Whenever a SERIALIZABLE transaction’s timestamp has been pushed, additional checks are required before allowing it to commit at the pushed timestamp. Any values which the transaction previously read must be checked to verify that no writes have subsequently occurred between the original transaction timestamp and the pushed transaction timestamp. This check prevents serializability violations.
The check is done by keeping track of all the reads using a dedicated RefreshRequest. If this succeeds, the transaction is allowed to commit. If the refreshing is unsuccessful (also known as read invalidation), then the transaction must be retried at the pushed timestamp.
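The refresh check can be sketched as a scan for conflicting writes between the two timestamps. This is a simplified per-key Python model; the real RefreshRequest tracks key spans, and these function names are illustrative:

```python
# Read-refreshing sketch: a pushed SERIALIZABLE transaction may commit at
# its new timestamp only if none of its reads were overwritten in between.
def refresh_reads(read_keys, orig_ts, pushed_ts, other_writes):
    """other_writes: (key, timestamp) pairs committed by other transactions."""
    for key, ts in other_writes:
        if key in read_keys and orig_ts < ts <= pushed_ts:
            return False   # read invalidated: the transaction must retry
    return True

reads = {"a", "b"}
assert refresh_reads(reads, 10, 20, [("c", 15)])       # unrelated key: ok
assert not refresh_reads(reads, 10, 20, [("a", 15)])   # overlapping write
assert refresh_reads(reads, 10, 20, [("a", 25)])       # write after push: ok
```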
READ COMMITTED transactions do not perform read refreshing before commit. Instead, they permit their read and write timestamps to diverge at commit time. This makes it possible for statements in concurrent READ COMMITTED transactions to interleave without aborting transactions.

Transaction pipelining
Transactional writes are pipelined when being replicated and when being written to disk, dramatically reducing the latency of transactions that perform multiple writes. At a high level, transaction pipelining works as follows:

Communicate with leaseholders
For each statement, the transaction gateway node communicates with the leaseholders for the ranges it wants to write to.
Create and send write intents
Each leaseholder creates write intents and sends them to its follower nodes. The leaseholder responds to the transaction gateway node that the write intents have been sent (note that replication is still in-flight at this stage).
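The latency effect of overlapping replication can be sketched with a back-of-the-envelope model. The consensus-round cost and function names are illustrative, assuming each write's replication takes one round-trip and pipelined writes replicate concurrently:

```python
# Pipelining sketch: without pipelining, the gateway waits for each write's
# replication in turn; with pipelining, replication overlaps and the
# transaction waits only once, just before commit.
CONSENSUS_ROUND = 10  # hypothetical cost of one replication round-trip

def sequential_latency(num_writes):
    return num_writes * CONSENSUS_ROUND   # wait after every write

def pipelined_latency(num_writes):
    return CONSENSUS_ROUND                # all writes replicate in parallel

assert sequential_latency(5) == 50
assert pipelined_latency(5) == 10         # no longer scales with write count
```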
Parallel commits
Parallel Commits is an optimized atomic commit protocol that cuts the commit latency of a transaction in half, from two rounds of consensus down to one. Combined with transaction pipelining, this brings the latency incurred by common OLTP transactions to near the theoretical minimum. Under this atomic commit protocol, the transaction coordinator can return to the client eagerly when it knows that the writes in the transaction have succeeded. Once this occurs, the transaction coordinator can set the transaction record’s state to COMMITTED and resolve the transaction’s write intents asynchronously.
The transaction coordinator is able to do this while maintaining correctness guarantees because it populates the transaction record with enough information (via a new STAGING state, and an array of in-flight writes) for other transactions to determine whether all writes in the transaction are present, and thus prove whether or not the transaction is committed.
The latency until intents are resolved is unchanged by the introduction of Parallel Commits. This means that contended workloads are expected to benefit less from this feature.
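The commit-proof condition described above can be sketched as a predicate over the STAGING record (an illustrative model of the idea, not CockroachDB code):

```python
# Parallel Commits sketch: a STAGING record lists its in-flight writes, so
# any observer can prove the commit by checking that each one replicated.
def is_implicitly_committed(txn_record, replicated_keys):
    return (txn_record["status"] == "STAGING"
            and all(k in replicated_keys for k in txn_record["in_flight"]))

record = {"status": "STAGING", "in_flight": ["a", "b"]}
assert not is_implicitly_committed(record, {"a"})    # still ambiguous
assert is_implicitly_committed(record, {"a", "b"})   # provably committed
# Afterward, the coordinator flips the record to COMMITTED asynchronously.
```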
Isolation levels
CockroachDB supports two transaction isolation levels:

- SERIALIZABLE: The default and strongest isolation level. Transactions appear to have executed in a serial order.
- READ COMMITTED: A weaker isolation level that permits certain anomalies in exchange for better performance in some scenarios.
Interactions with other layers
In relationship to other layers in CockroachDB, the transaction layer:

- Receives KV operations from the SQL layer
- Controls the flow of KV operations sent to the distribution layer