DocDB is YugabyteDB’s document-oriented storage engine that uses a key-value model for persisting and retrieving data. Each table row is represented as a document in DocDB, with data stored as values associated with unique keys.

Key-Value Storage Model

DocDB stores table data by encoding rows into multiple key-value pairs. This approach enables:
  • Efficient updates: Modify individual columns without rewriting entire rows
  • MVCC support: Multiple versions of values with hybrid timestamps
  • Flexible schema: Support for nested documents and collections
  • Optimized access: Binary-comparable encodings for fast lookups
As of v2.20, DocDB uses an optimized packed row format for better performance. The encoding described here represents the logical model.

DocDB Key Structure

Keys in DocDB are compound keys with the following components:
[ColocationId], [HashCode], DocKey, [SubKey1], ..., [SubKeyN], HybridTimestamp
  • ColocationId (optional): Present when using colocated tables/databases to separate data from different tables in the same tablet.
  • HashCode (optional): A 16-bit hash of the hash column values for hash-sharded tables. Determines which tablet owns the data.
  • DocKey: The primary key columns encoded in order: hash columns followed by range columns.
  • SubKeys: Column IDs for non-primary-key columns. These enable partial updates without reading full rows.
  • HybridTimestamp: MVCC timestamp stored in reverse order for efficient retrieval of the latest versions.
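The layout above can be sketched in Python. This is a simplified logical model, not DocDB's actual binary wire format, and names such as `make_key` are illustrative:

```python
# Simplified model of DocDB's compound key layout (illustrative only; the
# real engine uses binary-comparable encodings, not Python tuples).
# Optional components are included only when present.
def make_key(doc_key, subkeys=(), hybrid_ts=0, hash_code=None, colocation_id=None):
    parts = []
    if colocation_id is not None:      # colocated tables share one tablet
        parts.append(("colocation", colocation_id))
    if hash_code is not None:          # 16-bit hash for hash-sharded tables
        parts.append(("hash", hash_code & 0xFFFF))
    parts.append(("doc_key", tuple(doc_key)))
    parts.extend(("subkey", s) for s in subkeys)
    # Timestamps are stored inverted so newer versions sort first.
    parts.append(("ts", -hybrid_ts))
    return tuple(parts)

k_old = make_key(("user1", 10), subkeys=("msg_column",), hybrid_ts=100, hash_code=0xABCD)
k_new = make_key(("user1", 10), subkeys=("msg_column",), hybrid_ts=200, hash_code=0xABCD)
assert k_new < k_old  # newer version sorts first under the inverted timestamp
```

The inverted timestamp is the key trick: a scan that seeks to a key and reads forward encounters the newest version first.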

DocDB Value Types

Values in DocDB can be:
  • Primitive types: int32, int64, double, text, timestamp, UUID, etc.
  • Non-primitive types (objects): Sorted maps from scalar keys to values
  • Collections: Lists and sets implemented using DocDB’s object type

Encoding Example

Consider this YSQL table:
CREATE TABLE msgs (
    user_id text,
    msg_id int,
    msg text,
    msg_props jsonb,
    PRIMARY KEY ((user_id), msg_id)
);

Insert Operation

INSERT INTO msgs (user_id, msg_id, msg, msg_props)
VALUES ('user1', 10, 'Hello', '{"from": "[email protected]", "subject": "hi"}');
DocDB storage at time T1:
(hash1, 'user1', 10), liveness_column_id, T1 -> [NULL]
(hash1, 'user1', 10), msg_column_id, T1 -> 'Hello'
(hash1, 'user1', 10), msg_props_column_id, 'from', T1 -> '[email protected]'
(hash1, 'user1', 10), msg_props_column_id, 'subject', T1 -> 'hi'
The liveness column is a special system column that tracks row-level metadata. It’s invisible to users but essential for proper MVCC behavior.
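The storage listing above can be modeled as a flat key-value map, where each key combines the document key, subkeys, and timestamp. A minimal sketch (the `put` helper and column-ID names are illustrative):

```python
# Illustrative model of the four entries written by the INSERT above:
# one liveness marker plus one entry per non-key column / JSONB field.
T1, T2 = 1, 2
store = {}

def put(doc_key, subkeys, ts, value):
    store[(doc_key, subkeys, ts)] = value

doc_key = ("hash1", "user1", 10)
put(doc_key, ("liveness_column_id",), T1, None)
put(doc_key, ("msg_column_id",), T1, "Hello")
put(doc_key, ("msg_props_column_id", "from"), T1, "[email protected]")
put(doc_key, ("msg_props_column_id", "subject"), T1, "hi")
assert len(store) == 4

# A later partial update touches only the keys it changes:
put(doc_key, ("msg_props_column_id", "read"), T2, True)
assert len(store) == 5  # the four T1 entries are untouched
```

This is what makes column-level and JSONB-field-level updates cheap: an update writes only the entries it changes, as the next section shows.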

Update Operation

Updating a nested field:
UPDATE msgs
SET msg_props = msg_props || '{"read": true}'
WHERE user_id = 'user1' AND msg_id = 10;
DocDB storage at time T2 (only new entry added):
(hash1, 'user1', 10), liveness_column_id, T1 -> [NULL]
(hash1, 'user1', 10), msg_column_id, T1 -> 'Hello'
(hash1, 'user1', 10), msg_props_column_id, 'from', T1 -> '[email protected]'
(hash1, 'user1', 10), msg_props_column_id, 'read', T2 -> true  ← New
(hash1, 'user1', 10), msg_props_column_id, 'subject', T1 -> 'hi'

Delete Operation

Deleting a row adds a tombstone marker:
DELETE FROM msgs WHERE user_id = 'user1' AND msg_id = 10;
DocDB storage at time T3:
(hash1, 'user1', 10), T3 -> [DELETE]  ← Tombstone marker
Tombstones logically delete data but physically remain until compaction. Excessive deletes without compaction can impact performance.
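The effect of a tombstone on reads can be sketched as a filter over a row's entries. This is a simplified model with integer timestamps; `visible_entries` is an illustrative name, not a DocDB API:

```python
# Sketch of how a row-level tombstone hides older entries at read time,
# which is also what lets compaction physically drop them later.
def visible_entries(entries, tombstone_ts=None):
    """entries: {(subkey, ts): value}. Return entries the tombstone does not cover."""
    if tombstone_ts is None:
        return dict(entries)
    return {k: v for k, v in entries.items() if k[1] > tombstone_ts}

row = {("msg", 1): "Hello", ("subject", 1): "hi"}
assert visible_entries(row) == row
assert visible_entries(row, tombstone_ts=3) == {}  # DELETE at T3 hides T1 data
```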

Primary Key Encoding

The document key contains the full primary key with components in this order:
  1. Hash Value: A 16-bit hash of the hash column values (if hash columns are present)
  2. Hash Columns: All columns in the hash partition key, encoded with type prefixes
  3. Range Columns: Clustering/range columns in their defined order (ASC or DESC)
-- Hash columns: (user_id)
-- Range columns: msg_id
PRIMARY KEY ((user_id), msg_id)

-- Encoded as:
-- [hash(user_id)], user_id, msg_id
If no primary key is defined, YugabyteDB automatically generates an internal row ID (similar to PostgreSQL’s ctid).

Binary-Comparable Encoding

All key components use binary-comparable encoding, ensuring:
  • Sort order preservation: Encoded byte strings sort the same as original values
  • Type safety: Each data type has a unique byte prefix
  • Efficient comparisons: Direct byte-level comparisons without decoding
Example type prefixes:
  Type        Prefix Byte
  NULL        0x00
  Int32       0x03
  String      0x05
  Int64       0x07
  Timestamp   0x0B
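The properties above can be demonstrated with a toy order-preserving encoding. This is a sketch of the general technique, not DocDB's exact byte layout: the type-prefix bytes mirror the table above, the int32 encoding adds a sign-bias so that negative values sort before positive ones, and strings are NUL-terminated so prefixes sort first:

```python
import struct

# Toy binary-comparable encoding (illustrative, not DocDB's exact format).
# Goal: comparing encoded byte strings with plain memcmp gives the same
# order as comparing the original typed values.
def encode_int32(v):
    # Bias by 2**31 so the encoding of -2**31..2**31-1 is an unsigned
    # big-endian integer that sorts in value order.
    return b"\x03" + struct.pack(">I", (v + 2**31) % 2**32)

def encode_str(s):
    # NUL terminator makes "app" sort before its extension "apple".
    return b"\x05" + s.encode("utf-8") + b"\x00"

assert encode_int32(-5) < encode_int32(0) < encode_int32(7)  # order preserved
assert encode_str("app") < encode_str("apple")
assert encode_int32(0) < encode_str("")  # type prefix keeps types apart
```

Because encoded keys compare correctly as raw bytes, RocksDB can sort and seek without ever decoding them.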

Multi-Version Concurrency Control (MVCC)

DocDB maintains multiple versions of each key using hybrid timestamps:
(key), T1 -> value1
(key), T2 -> value2  ← Newer version
(key), T3 -> value3  ← Latest version
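A reader at a given snapshot picks the newest version at or before its timestamp. A minimal sketch of that version-resolution rule (integer timestamps stand in for hybrid timestamps):

```python
import bisect

# Each key maps to its versions sorted by timestamp; a reader at a given
# snapshot sees the newest version at or before that snapshot.
versions = [(1, "value1"), (2, "value2"), (3, "value3")]

def read_at(versions, snapshot_ts):
    i = bisect.bisect_right([ts for ts, _ in versions], snapshot_ts)
    return versions[i - 1][1] if i else None

assert read_at(versions, 3) == "value3"   # latest
assert read_at(versions, 2) == "value2"   # point-in-time read
assert read_at(versions, 0) is None       # before the row existed
```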
Benefits:
  • Lock-free reads: Readers never block writers, and writers never block readers
  • Point-in-time queries: Read data as of any past hybrid timestamp
  • Snapshot isolation: Transactions see a consistent snapshot of data
  • Garbage collection: Old versions are cleaned up once no active transaction needs them

Packed Rows Optimization

Since v2.20, DocDB uses a packed row format for better performance:
  • Reduced key count: Multiple columns packed into single key-value pairs
  • Lower overhead: Fewer RocksDB operations per row
  • Faster scans: Sequential reads benefit from reduced key lookups
  • Backward compatible: Seamlessly handles both formats
Packed rows are automatically used for new tables. Existing tables can be migrated using pg_repack or similar tools.
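The core idea can be sketched as follows. This models only the logical change, packing all of a row's columns into one value under a single key instead of one key-value pair per column; the helper names and column IDs are illustrative:

```python
# Sketch of the packed-row idea: one key-value pair carries the whole row,
# with column values addressable by column ID inside the packed value.
def pack_row(columns):
    """columns: {column_id: value} -> one packed value for one RocksDB key."""
    return tuple(sorted(columns.items()))

def unpack_column(packed, column_id):
    for cid, value in packed:
        if cid == column_id:
            return value
    return None

packed = pack_row({1: "Hello", 2: "hi"})
assert unpack_column(packed, 1) == "Hello"
# One RocksDB entry now serves the whole row instead of one entry per column.
```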

Data Expiration (YCQL Only)

YCQL supports Time-To-Live (TTL) at multiple levels: Table-level TTL:
CREATE TABLE sessions (
    session_id UUID PRIMARY KEY,
    data TEXT
) WITH default_time_to_live = 86400;  -- 24 hours
Row-level TTL:
INSERT INTO sessions (session_id, data)
VALUES (uuid(), 'session_data')
USING TTL 3600;  -- 1 hour
TTL is enforced during reads and compactions. Expired data is not immediately deleted but becomes invisible to queries.
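The read-time check amounts to comparing the current time against the entry's write time plus its TTL. A minimal sketch of that visibility rule (`is_visible` is an illustrative name):

```python
# Sketch of read-time TTL enforcement: an entry carries its write time and
# TTL, and readers treat it as absent once now >= write_ts + ttl_seconds.
# Compactions apply the same rule to physically drop expired entries.
def is_visible(write_ts, ttl_seconds, now):
    return ttl_seconds is None or now < write_ts + ttl_seconds

assert is_visible(write_ts=0, ttl_seconds=3600, now=100)       # still live
assert not is_visible(write_ts=0, ttl_seconds=3600, now=3600)  # expired
assert is_visible(write_ts=0, ttl_seconds=None, now=10**9)     # no TTL set
```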

Storage Characteristics

Log-Structured Merge (LSM) Tree

DocDB uses RocksDB’s LSM tree architecture:
  1. MemTable: In-memory write buffer for new data
  2. Immutable MemTable: Frozen MemTable being flushed to disk
  3. SST Files: Sorted String Table files on disk (multiple levels)
  4. Compaction: Background merging of SST files to reclaim space

Write Path

  1. Write to WAL: Write-ahead log for durability (Raft log)
  2. Insert into MemTable: In-memory update (fast)
  3. Flush to SST: When the MemTable is full, flush it to a Level 0 SST
  4. Compaction: Background process merges and optimizes SST files
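The first three steps of the write path can be sketched as a tiny in-memory LSM. This is a teaching model, not YugabyteDB code; the `MiniLSM` class and its tiny MemTable limit are illustrative:

```python
# Toy model of the LSM write path: append to the WAL for durability,
# update the MemTable in memory, and flush the MemTable to an immutable
# SST when it fills up. (Compaction is omitted for brevity.)
class MiniLSM:
    def __init__(self, memtable_limit=2):
        self.wal, self.memtable, self.ssts = [], {}, []
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.wal.append((key, value))   # 1. WAL append for durability
        self.memtable[key] = value      # 2. fast in-memory update
        if len(self.memtable) >= self.memtable_limit:
            # 3. flush a sorted, immutable snapshot to "disk"
            self.ssts.append(dict(sorted(self.memtable.items())))
            self.memtable = {}

lsm = MiniLSM()
lsm.write("a", 1)
lsm.write("b", 2)  # reaches the limit and triggers a flush
assert lsm.ssts == [{"a": 1, "b": 2}] and lsm.memtable == {}
```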

Read Path

  1. Check MemTable: Look for the latest version in memory
  2. Check Block Cache: Look for cached SST blocks
  3. Read SST Files: Search SST files from newest to oldest
  4. Merge Results: Combine versions and apply tombstones
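The lookup order above can be sketched as follows (block cache omitted for brevity; the `read` helper and `TOMBSTONE` sentinel are illustrative):

```python
# Sketch of the LSM read path: check the MemTable first, then SST files
# from newest to oldest, and treat a tombstone marker as "deleted".
TOMBSTONE = object()

def read(key, memtable, ssts):
    """ssts is ordered oldest -> newest; the newest match wins."""
    if key in memtable:
        value = memtable[key]
    else:
        for sst in reversed(ssts):  # newest SST first
            if key in sst:
                value = sst[key]
                break
        else:
            return None
    return None if value is TOMBSTONE else value

ssts = [{"a": 1}, {"a": 2}]                       # older, then newer
assert read("a", {}, ssts) == 2                   # newest SST wins
assert read("a", {"a": TOMBSTONE}, ssts) is None  # tombstone hides old data
```

Stopping at the newest match is why Bloom filters and compaction matter: both cut the number of SST files a lookup has to probe.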

Performance Considerations

Write amplification: LSM trees rewrite data during compaction. Configure compaction strategies based on workload:
  • Level-based: Better for read-heavy workloads
  • Size-tiered: Better for write-heavy workloads
Read amplification: A single lookup may need to check multiple SST files. Mitigate with:
  • Bloom filters (enabled by default)
  • Block cache sizing
  • Regular compaction
Space amplification: Multiple versions and tombstones consume space. Manage with:
  • Regular compaction
  • Appropriate TTL settings
  • History retention policies

Next Steps

  • Distributed Transactions: Learn how transactions work across tablets
  • Replication: Understand how data is replicated for fault tolerance
  • Consistency Model: Explore consistency guarantees and isolation levels
  • Architecture: Review the overall system architecture
