DocDB is YugabyteDB’s document-oriented storage engine that uses a key-value model for persisting and retrieving data. Each table row is represented as a document in DocDB, with data stored as values associated with unique keys.

Key-Value Storage Model

DocDB stores table data by encoding rows into multiple key-value pairs. This approach enables:
  • Efficient updates: Modify individual columns without rewriting entire rows
  • MVCC support: Multiple versions of values with hybrid timestamps
  • Flexible schema: Support for nested documents and collections
  • Optimized access: Binary-comparable encodings for fast lookups
As of v2.20, DocDB uses an optimized packed row format for better performance. The encoding described here represents the logical model.

DocDB Key Structure

Keys in DocDB are compound keys with the following components:
[ColocationId], [HashCode], DocKey, [SubKey1], ..., [SubKeyN], HybridTimestamp
  • ColocationId (optional): Present when using colocated tables/databases to separate data from different tables in the same tablet.
  • HashCode (optional): A 16-bit hash of the hash column values for hash-sharded tables. Determines which tablet owns the data.
  • DocKey: The primary key columns encoded in order: hash columns followed by range columns.
  • SubKeys: Column IDs for non-primary-key columns. These enable partial updates without reading full rows.
  • HybridTimestamp: MVCC timestamp stored in reverse order for efficient retrieval of the latest versions.
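The layout above can be sketched in Python. This is a simplified logical model, not DocDB's actual binary wire format, and names such as `make_key` are illustrative:

```python
# Simplified model of DocDB's compound key layout (illustrative only; the
# real engine uses binary-comparable encodings, not Python tuples).
# Optional components are included only when present.
def make_key(doc_key, subkeys=(), hybrid_ts=0, hash_code=None, colocation_id=None):
    parts = []
    if colocation_id is not None:      # colocated tables share one tablet
        parts.append(("colocation", colocation_id))
    if hash_code is not None:          # 16-bit hash for hash-sharded tables
        parts.append(("hash", hash_code & 0xFFFF))
    parts.append(("doc_key", tuple(doc_key)))
    parts.extend(("subkey", s) for s in subkeys)
    # Timestamps are stored inverted so newer versions sort first.
    parts.append(("ts", -hybrid_ts))
    return tuple(parts)

k_old = make_key(("user1", 10), subkeys=("msg_column",), hybrid_ts=100, hash_code=0xABCD)
k_new = make_key(("user1", 10), subkeys=("msg_column",), hybrid_ts=200, hash_code=0xABCD)
assert k_new < k_old  # newer version sorts first under the inverted timestamp
```

The inverted timestamp is the key trick: a scan that seeks to a key and reads forward encounters the newest version first.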

DocDB Value Types

Values in DocDB can be:
  • Primitive types: int32, int64, double, text, timestamp, UUID, etc.
  • Non-primitive types (objects): Sorted maps from scalar keys to values
  • Collections: Lists and sets implemented using DocDB’s object type

Encoding Example

Consider this YSQL table:
CREATE TABLE msgs (
    user_id text,
    msg_id int,
    msg text,
    msg_props jsonb,
    PRIMARY KEY ((user_id), msg_id)
);

Insert Operation

INSERT INTO msgs (user_id, msg_id, msg, msg_props)
VALUES ('user1', 10, 'Hello', '{"from": "[email protected]", "subject": "hi"}');
DocDB storage at time T1:
(hash1, 'user1', 10), liveness_column_id, T1 -> [NULL]
(hash1, 'user1', 10), msg_column_id, T1 -> 'Hello'
(hash1, 'user1', 10), msg_props_column_id, 'from', T1 -> '[email protected]'
(hash1, 'user1', 10), msg_props_column_id, 'subject', T1 -> 'hi'
The liveness column is a special system column that tracks row-level metadata. It’s invisible to users but essential for proper MVCC behavior.
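The storage listing above can be modeled as a flat key-value map, where each key combines the document key, subkeys, and timestamp. A minimal sketch (the `put` helper and column-ID names are illustrative):

```python
# Illustrative model of the four entries written by the INSERT above:
# one liveness marker plus one entry per non-key column / JSONB field.
T1, T2 = 1, 2
store = {}

def put(doc_key, subkeys, ts, value):
    store[(doc_key, subkeys, ts)] = value

doc_key = ("hash1", "user1", 10)
put(doc_key, ("liveness_column_id",), T1, None)
put(doc_key, ("msg_column_id",), T1, "Hello")
put(doc_key, ("msg_props_column_id", "from"), T1, "[email protected]")
put(doc_key, ("msg_props_column_id", "subject"), T1, "hi")
assert len(store) == 4

# A later partial update touches only the keys it changes:
put(doc_key, ("msg_props_column_id", "read"), T2, True)
assert len(store) == 5  # the four T1 entries are untouched
```

This is what makes column-level and JSONB-field-level updates cheap: an update writes only the entries it changes, as the next section shows.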

Update Operation

Updating a nested field:
UPDATE msgs
SET msg_props = msg_props || '{"read": true}'
WHERE user_id = 'user1' AND msg_id = 10;
DocDB storage at time T2 (only new entry added):
(hash1, 'user1', 10), liveness_column_id, T1 -> [NULL]
(hash1, 'user1', 10), msg_column_id, T1 -> 'Hello'
(hash1, 'user1', 10), msg_props_column_id, 'from', T1 -> '[email protected]'
(hash1, 'user1', 10), msg_props_column_id, 'read', T2 -> true  ← New
(hash1, 'user1', 10), msg_props_column_id, 'subject', T1 -> 'hi'

Delete Operation

Deleting a row adds a tombstone marker:
DELETE FROM msgs WHERE user_id = 'user1' AND msg_id = 10;
DocDB storage at time T3:
(hash1, 'user1', 10), T3 -> [DELETE]  ← Tombstone marker
Tombstones logically delete data but physically remain until compaction. Excessive deletes without compaction can impact performance.
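The effect of a tombstone on reads can be sketched as a filter over a row's entries. This is a simplified model with integer timestamps; `visible_entries` is an illustrative name, not a DocDB API:

```python
# Sketch of how a row-level tombstone hides older entries at read time,
# which is also what lets compaction physically drop them later.
def visible_entries(entries, tombstone_ts=None):
    """entries: {(subkey, ts): value}. Return entries the tombstone does not cover."""
    if tombstone_ts is None:
        return dict(entries)
    return {k: v for k, v in entries.items() if k[1] > tombstone_ts}

row = {("msg", 1): "Hello", ("subject", 1): "hi"}
assert visible_entries(row) == row
assert visible_entries(row, tombstone_ts=3) == {}  # DELETE at T3 hides T1 data
```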

Primary Key Encoding

The document key contains the full primary key with components in this order:
  1. Hash Value: A 16-bit hash of the hash column values (if hash columns are present)
  2. Hash Columns: All columns in the hash partition key, encoded with type prefixes
  3. Range Columns: Clustering/range columns in their defined order (ASC or DESC)
-- Hash columns: (user_id)
-- Range columns: msg_id
PRIMARY KEY ((user_id), msg_id)

-- Encoded as:
-- [hash(user_id)], user_id, msg_id
If no primary key is defined, YugabyteDB automatically generates an internal row ID (similar to PostgreSQL’s ctid).

Binary-Comparable Encoding

All key components use binary-comparable encoding, ensuring:
  • Sort order preservation: Encoded byte strings sort the same as original values
  • Type safety: Each data type has a unique byte prefix
  • Efficient comparisons: Direct byte-level comparisons without decoding
Example type prefixes:
  Type        Prefix Byte
  NULL        0x00
  Int32       0x03
  String      0x05
  Int64       0x07
  Timestamp   0x0B
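The properties above can be demonstrated with a toy order-preserving encoding. This is a sketch of the general technique, not DocDB's exact byte layout: the type-prefix bytes mirror the table above, the int32 encoding adds a sign-bias so that negative values sort before positive ones, and strings are NUL-terminated so prefixes sort first:

```python
import struct

# Toy binary-comparable encoding (illustrative, not DocDB's exact format).
# Goal: comparing encoded byte strings with plain memcmp gives the same
# order as comparing the original typed values.
def encode_int32(v):
    # Bias by 2**31 so the encoding of -2**31..2**31-1 is an unsigned
    # big-endian integer that sorts in value order.
    return b"\x03" + struct.pack(">I", (v + 2**31) % 2**32)

def encode_str(s):
    # NUL terminator makes "app" sort before its extension "apple".
    return b"\x05" + s.encode("utf-8") + b"\x00"

assert encode_int32(-5) < encode_int32(0) < encode_int32(7)  # order preserved
assert encode_str("app") < encode_str("apple")
assert encode_int32(0) < encode_str("")  # type prefix keeps types apart
```

Because encoded keys compare correctly as raw bytes, RocksDB can sort and seek without ever decoding them.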

Multi-Version Concurrency Control (MVCC)

DocDB maintains multiple versions of each key using hybrid timestamps:
(key), T1 -> value1
(key), T2 -> value2  ← Newer version
(key), T3 -> value3  ← Latest version
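A reader at a given snapshot picks the newest version at or before its timestamp. A minimal sketch of that version-resolution rule (integer timestamps stand in for hybrid timestamps):

```python
import bisect

# Each key maps to its versions sorted by timestamp; a reader at a given
# snapshot sees the newest version at or before that snapshot.
versions = [(1, "value1"), (2, "value2"), (3, "value3")]

def read_at(versions, snapshot_ts):
    i = bisect.bisect_right([ts for ts, _ in versions], snapshot_ts)
    return versions[i - 1][1] if i else None

assert read_at(versions, 3) == "value3"   # latest
assert read_at(versions, 2) == "value2"   # point-in-time read
assert read_at(versions, 0) is None       # before the row existed
```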
Benefits:
  • Lock-free reads: Readers never block writers, and writers never block readers
  • Point-in-time queries: Read data as of any past hybrid timestamp
  • Snapshot isolation: Transactions see a consistent snapshot of data
  • Garbage collection: Old versions are cleaned up once no active transaction needs them

Packed Rows Optimization

Since v2.20, DocDB uses a packed row format for better performance:
  • Reduced key count: Multiple columns packed into single key-value pairs
  • Lower overhead: Fewer RocksDB operations per row
  • Faster scans: Sequential reads benefit from reduced key lookups
  • Backward compatible: Seamlessly handles both formats
Packed rows are automatically used for new tables. Existing tables can be migrated using pg_repack or similar tools.
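The core idea can be sketched as follows. This models only the logical change, packing all of a row's columns into one value under a single key instead of one key-value pair per column; the helper names and column IDs are illustrative:

```python
# Sketch of the packed-row idea: one key-value pair carries the whole row,
# with column values addressable by column ID inside the packed value.
def pack_row(columns):
    """columns: {column_id: value} -> one packed value for one RocksDB key."""
    return tuple(sorted(columns.items()))

def unpack_column(packed, column_id):
    for cid, value in packed:
        if cid == column_id:
            return value
    return None

packed = pack_row({1: "Hello", 2: "hi"})
assert unpack_column(packed, 1) == "Hello"
# One RocksDB entry now serves the whole row instead of one entry per column.
```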

Data Expiration (YCQL Only)

YCQL supports Time-To-Live (TTL) at multiple levels: Table-level TTL:
CREATE TABLE sessions (
    session_id UUID PRIMARY KEY,
    data TEXT
) WITH default_time_to_live = 86400;  -- 24 hours
Row-level TTL:
INSERT INTO sessions (session_id, data)
VALUES (uuid(), 'session_data')
USING TTL 3600;  -- 1 hour
TTL is enforced during reads and compactions. Expired data is not immediately deleted but becomes invisible to queries.
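The read-time check amounts to comparing the current time against the entry's write time plus its TTL. A minimal sketch of that visibility rule (`is_visible` is an illustrative name):

```python
# Sketch of read-time TTL enforcement: an entry carries its write time and
# TTL, and readers treat it as absent once now >= write_ts + ttl_seconds.
# Compactions apply the same rule to physically drop expired entries.
def is_visible(write_ts, ttl_seconds, now):
    return ttl_seconds is None or now < write_ts + ttl_seconds

assert is_visible(write_ts=0, ttl_seconds=3600, now=100)       # still live
assert not is_visible(write_ts=0, ttl_seconds=3600, now=3600)  # expired
assert is_visible(write_ts=0, ttl_seconds=None, now=10**9)     # no TTL set
```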

Storage Characteristics

Log-Structured Merge (LSM) Tree

DocDB uses RocksDB’s LSM tree architecture:
  1. MemTable: In-memory write buffer for new data
  2. Immutable MemTable: Frozen MemTable being flushed to disk
  3. SST Files: Sorted String Table files on disk (multiple levels)
  4. Compaction: Background merging of SST files to reclaim space

Write Path

  1. Write to WAL: Write-ahead log for durability (Raft log)
  2. Insert into MemTable: In-memory update (fast)
  3. Flush to SST: When the MemTable is full, flush it to a Level 0 SST
  4. Compaction: Background process merges and optimizes SST files
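The first three steps of the write path can be sketched as a tiny in-memory LSM. This is a teaching model, not YugabyteDB code; the `MiniLSM` class and its tiny MemTable limit are illustrative:

```python
# Toy model of the LSM write path: append to the WAL for durability,
# update the MemTable in memory, and flush the MemTable to an immutable
# SST when it fills up. (Compaction is omitted for brevity.)
class MiniLSM:
    def __init__(self, memtable_limit=2):
        self.wal, self.memtable, self.ssts = [], {}, []
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.wal.append((key, value))   # 1. WAL append for durability
        self.memtable[key] = value      # 2. fast in-memory update
        if len(self.memtable) >= self.memtable_limit:
            # 3. flush a sorted, immutable snapshot to "disk"
            self.ssts.append(dict(sorted(self.memtable.items())))
            self.memtable = {}

lsm = MiniLSM()
lsm.write("a", 1)
lsm.write("b", 2)  # reaches the limit and triggers a flush
assert lsm.ssts == [{"a": 1, "b": 2}] and lsm.memtable == {}
```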

Read Path

  1. Check MemTable: Look for the latest version in memory
  2. Check Block Cache: Look for cached SST blocks
  3. Read SST Files: Search SST files from newest to oldest
  4. Merge Results: Combine versions and apply tombstones
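The lookup order above can be sketched as follows (block cache omitted for brevity; the `read` helper and `TOMBSTONE` sentinel are illustrative):

```python
# Sketch of the LSM read path: check the MemTable first, then SST files
# from newest to oldest, and treat a tombstone marker as "deleted".
TOMBSTONE = object()

def read(key, memtable, ssts):
    """ssts is ordered oldest -> newest; the newest match wins."""
    if key in memtable:
        value = memtable[key]
    else:
        for sst in reversed(ssts):  # newest SST first
            if key in sst:
                value = sst[key]
                break
        else:
            return None
    return None if value is TOMBSTONE else value

ssts = [{"a": 1}, {"a": 2}]                       # older, then newer
assert read("a", {}, ssts) == 2                   # newest SST wins
assert read("a", {"a": TOMBSTONE}, ssts) is None  # tombstone hides old data
```

Stopping at the newest match is why Bloom filters and compaction matter: both cut the number of SST files a lookup has to probe.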

Performance Considerations

Write amplification: LSM trees rewrite data during compaction. Configure compaction strategies based on workload:
  • Level-based: Better for read-heavy workloads
  • Size-tiered: Better for write-heavy workloads
Read amplification: A single lookup may need to check multiple SST files. Mitigate with:
  • Bloom filters (enabled by default)
  • Block cache sizing
  • Regular compaction
Space amplification: Multiple versions and tombstones consume space. Manage with:
  • Regular compaction
  • Appropriate TTL settings
  • History retention policies

Next Steps

  • Distributed Transactions: Learn how transactions work across tablets
  • Replication: Understand how data is replicated for fault tolerance
  • Consistency Model: Explore consistency guarantees and isolation levels
  • Architecture: Review the overall system architecture
