Key-Value Storage Model
DocDB stores table data by encoding rows into multiple key-value pairs. This approach enables:- Efficient updates: Modify individual columns without rewriting entire rows
- MVCC support: Multiple versions of values with hybrid timestamps
- Flexible schema: Support for nested documents and collections
- Optimized access: Binary-comparable encodings for fast lookups
As of v2.20, DocDB uses an optimized packed row format for better performance. The encoding described here represents the logical model.
DocDB Key Structure
Keys in DocDB are compound keys with the following components:Colocation ID (Optional)
Colocation ID (Optional)
Present when using colocated tables/databases to separate data from different tables in the same tablet.
Hash Code (Optional)
Hash Code (Optional)
A 16-bit hash of hash column values for hash-sharded tables. Determines which tablet owns the data.
Document Key (DocKey)
Document Key (DocKey)
The primary key columns encoded in order: hash columns followed by range columns.
Sub Keys (Optional)
Sub Keys (Optional)
Column IDs for non-primary key columns. Enables partial updates without reading full rows.
Hybrid Timestamp
Hybrid Timestamp
MVCC timestamp in reverse order for efficient retrieval of latest versions.
DocDB Value Types
Values in DocDB can be:- Primitive types: int32, int64, double, text, timestamp, UUID, etc.
- Non-primitive types: Sorted maps where objects map scalar keys to values
- Collections: Lists and sets implemented using DocDB’s object type
Encoding Example
Consider this YSQL table:Insert Operation
The liveness column is a special system column that tracks row-level metadata. It’s invisible to users but essential for proper MVCC behavior.
Update Operation
Updating a nested field:Delete Operation
Deleting a column adds a tombstone marker:Primary Key Encoding
The document key contains the full primary key with components in this order:If no primary key is defined, YugabyteDB automatically generates an internal row ID (similar to PostgreSQL’s ctid).
Binary-Comparable Encoding
All key components use binary-comparable encoding, ensuring:- Sort order preservation: Encoded byte strings sort the same as original values
- Type safety: Each data type has a unique byte prefix
- Efficient comparisons: Direct byte-level comparisons without decoding
| Type | Prefix Byte |
|---|---|
| NULL | 0x00 |
| Int32 | 0x03 |
| String | 0x05 |
| Int64 | 0x07 |
| Timestamp | 0x0B |
Multi-Version Concurrency Control (MVCC)
DocDB maintains multiple versions of each key using hybrid timestamps:Lock-Free Reads
Readers never block writers, and writers never block readers
Point-in-Time Queries
Read data as of any past hybrid timestamp
Snapshot Isolation
Transactions see a consistent snapshot of data
Garbage Collection
Old versions cleaned up after no active transactions need them
Packed Rows Optimization
Since v2.20, DocDB uses packed row format for better performance:- Reduced key count: Multiple columns packed into single key-value pairs
- Lower overhead: Fewer RocksDB operations per row
- Faster scans: Sequential reads benefit from reduced key lookups
- Backward compatible: Seamlessly handles both formats
Data Expiration (YCQL Only)
YCQL supports Time-To-Live (TTL) at multiple levels: Table-level TTL:TTL is enforced during reads and compactions. Expired data is not immediately deleted but becomes invisible to queries.
Storage Characteristics
Log-Structured Merge (LSM) Tree
DocDB uses RocksDB’s LSM tree architecture:- MemTable: In-memory write buffer for new data
- Immutable MemTable: Frozen MemTable being flushed to disk
- SST Files: Sorted String Table files on disk (multiple levels)
- Compaction: Background merging of SST files to reclaim space
Write Path
Read Path
Performance Considerations
Write Amplification
Write Amplification
LSM trees rewrite data during compaction. Configure compaction strategies based on workload:
- Level-based: Better for read-heavy workloads
- Size-tiered: Better for write-heavy workloads
Read Amplification
Read Amplification
Multiple SST files may need to be checked. Mitigate with:
- Bloom filters (enabled by default)
- Block cache sizing
- Regular compaction
Space Amplification
Space Amplification
Multiple versions and tombstones consume space. Manage with:
- Regular compaction
- Appropriate TTL settings
- History retention policies
Next Steps
Distributed Transactions
Learn how transactions work across tablets
Replication
Understand how data is replicated for fault tolerance
Consistency Model
Explore consistency guarantees and isolation levels
Architecture
Review the overall system architecture

