Architecture: Vector LSM
YugabyteDB implements a custom Vector LSM (Log-Structured Merge tree) storage subsystem optimized for vector data, integrating state-of-the-art vector indexing libraries at the storage layer.
Key Components
- Vector LSM Subsystem:
  - Maintains multiple vector indexes (mutable + immutable)
  - Uses Usearch and Hnswlib for HNSW graph construction
  - Persists indexes as memory-mapped files
  - Performs compactions to merge indexes and remove deleted vectors
- Vector ID Mapping:
  - Each vector assigned a unique UUID (vector_id)
  - Bidirectional mapping stored in RocksDB:
    - vector_id → ybctid (primary key of indexed row)
    - ybctid → vector_id (embedded in vector column)
- MVCC Filtering:
  - Vector search results filtered by hybrid timestamp
  - Post-filtering removes deleted/overwritten vectors
  - Predicate pushdown optimizes search performance
- Copartitioning:
  - Vector indexes stored in same tablet as indexed table
  - Enables single-RPC queries (search + fetch other columns)
  - Sharded and replicated with the main table
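The ID mapping and MVCC post-filtering described above can be pictured with a small in-memory model. This is a sketch under stated assumptions, not YugabyteDB's implementation: `RowVersion`, `VectorIdMap`, and `post_filter` are hypothetical names, hybrid timestamps are reduced to plain integers, and a Python dict stands in for the RocksDB mapping. Only the vector_id → ybctid direction is modeled.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class RowVersion:
    """Hypothetical record of one indexed vector's lifetime."""
    ybctid: bytes                      # primary key of the indexed row
    write_time: int                    # simplified hybrid time of the write
    delete_time: Optional[int] = None  # time of delete/overwrite, if any


class VectorIdMap:
    """Toy stand-in for the vector_id -> ybctid mapping."""

    def __init__(self) -> None:
        self._forward: Dict[uuid.UUID, RowVersion] = {}

    def insert(self, ybctid: bytes, write_time: int) -> uuid.UUID:
        vector_id = uuid.uuid4()  # each vector gets a unique UUID
        self._forward[vector_id] = RowVersion(ybctid, write_time)
        return vector_id

    def mark_deleted(self, vector_id: uuid.UUID, delete_time: int) -> None:
        self._forward[vector_id].delete_time = delete_time

    def visible(self, vector_id: uuid.UUID, read_time: int) -> Optional[bytes]:
        """Return the ybctid if the vector is visible at read_time, else None."""
        version = self._forward.get(vector_id)
        if version is None or version.write_time > read_time:
            return None  # unknown, or written after the read snapshot
        if version.delete_time is not None and version.delete_time <= read_time:
            return None  # deleted or overwritten before the read snapshot
        return version.ybctid


def post_filter(
    candidates: List[Tuple[uuid.UUID, float]],
    id_map: VectorIdMap,
    read_time: int,
    limit: int,
) -> List[Tuple[bytes, float]]:
    """Keep the nearest candidates that are visible at read_time."""
    rows: List[Tuple[bytes, float]] = []
    for vector_id, distance in sorted(candidates, key=lambda c: c[1]):
        if len(rows) >= limit:
            break
        ybctid = id_map.visible(vector_id, read_time)
        if ybctid is not None:
            rows.append((ybctid, distance))
    return rows
```

One consequence visible in the sketch: when many candidates are invisible at the read time, post-filtering can return fewer than `limit` rows unless the index search supplies extra candidates.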
Distance Functions
YugabyteDB supports three vector distance metrics:
| Function | Operator | Description | Use Cases |
|---|---|---|---|
| L2 (Euclidean) | <-> | Straight-line distance in vector space | Image recognition, spatial data |
| Inner Product | <#> | Negative dot product (the operator returns the inner product multiplied by -1, so smaller means more similar) | Ranking, recommendation models |
| Cosine Distance | <=> | Angle between vectors (direction, not magnitude) | Text similarity, semantic search |
Formula Reference
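The three metrics in the table above can be written out in a few lines of plain Python. This is a reference sketch for checking results by hand, not the code the database runs; the function names are hypothetical, and raising an error for zero-length vectors in `cosine_distance` is this sketch's choice.

```python
import math
from typing import List, Sequence


def _check(a: Sequence[float], b: Sequence[float]) -> None:
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance: sqrt(sum((a_i - b_i)^2)). Operator: <->"""
    _check(a, b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def negative_inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Negated dot product: -sum(a_i * b_i). Operator: <#>"""
    _check(a, b)
    return -sum(x * y for x, y in zip(a, b))


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - (a . b) / (|a| * |b|). Operator: <=>"""
    _check(a, b)
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (norm_a * norm_b)


def normalize(a: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in a))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in a]
```

For unit-length vectors, cosine distance equals one plus the negative inner product, so both metrics rank neighbors identically once vectors are normalized.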
Basic Usage
Installation
Creating Vector Columns
Inserting Vectors
Querying Vectors
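The basic flow (install, create a vector column, insert, query with one of the distance operators) can be sketched as SQL text assembled in Python. Everything here is illustrative: the table `items`, its three-dimensional `embedding` column, and the helper names are hypothetical, the statements follow pgvector-style syntax, and the `%s` placeholders assume a psycopg-style driver. No database connection is made.

```python
from typing import Sequence

# pgvector-style statements; table and column names are hypothetical.
INSTALL_SQL = "CREATE EXTENSION IF NOT EXISTS vector;"
CREATE_TABLE_SQL = (
    "CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3));"
)
INSERT_SQL = "INSERT INTO items (embedding) VALUES (%s::vector);"

# Operators taken from the Distance Functions table.
OPERATORS = {"l2": "<->", "inner_product": "<#>", "cosine": "<=>"}


def to_vector_literal(values: Sequence[float]) -> str:
    """Format numbers as vector text input, for example '[1.0,2.0,3.0]'."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"


def nearest_neighbors_sql(metric: str, limit: int) -> str:
    """Build a k-nearest-neighbor query for the chosen metric.

    Both placeholders take the same query vector literal.
    """
    op = OPERATORS[metric]  # KeyError for an unknown metric
    return (
        f"SELECT id, embedding {op} %s::vector AS distance "
        f"FROM items ORDER BY embedding {op} %s::vector "
        f"LIMIT {int(limit)};"
    )
```

With a driver, the query would be executed as `cursor.execute(nearest_neighbors_sql("cosine", 5), (literal, literal))`, where `literal = to_vector_literal(query_embedding)`.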
Vector Indexing (HNSW)
HNSW (Hierarchical Navigable Small World) creates a multi-layer graph structure for efficient approximate nearest neighbor search.
Creating HNSW Indexes
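- Use NONCONCURRENTLY when creating the index: CREATE INDEX CONCURRENTLY is not yet supported for vector indexes (see Limitations), so the build is not an online operation
- Create separate indexes for each distance function you need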
- Index creation can take significant time for large datasets
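The points above can be sketched as a small helper that emits one index statement per distance function. This rests on assumptions the section does not state: that the index access method is named `ybhnsw`, that the operator classes follow pgvector naming (`vector_l2_ops`, `vector_ip_ops`, `vector_cosine_ops`), and that `m` and `ef_construction` are passed through a `WITH (...)` clause. The helper name and the index naming scheme are hypothetical.

```python
# Assumed pgvector-style operator class names, keyed by metric.
OPCLASSES = {
    "l2": "vector_l2_ops",
    "inner_product": "vector_ip_ops",
    "cosine": "vector_cosine_ops",
}


def create_hnsw_index_sql(
    table: str,
    column: str,
    metric: str,
    m: int = 16,
    ef_construction: int = 64,
) -> str:
    """Build a CREATE INDEX statement for one distance function.

    Identifiers are interpolated without quoting, so pass trusted names only.
    """
    opclass = OPCLASSES[metric]  # KeyError for an unknown metric
    index_name = f"{table}_{column}_{metric}_idx"
    return (
        f"CREATE INDEX NONCONCURRENTLY {index_name} ON {table} "
        f"USING ybhnsw ({column} {opclass}) "
        f"WITH (m = {int(m)}, ef_construction = {int(ef_construction)});"
    )


def create_all_metric_indexes(table: str, column: str) -> list:
    """One statement per metric, since each operator needs its own index."""
    return [create_hnsw_index_sql(table, column, metric) for metric in OPCLASSES]
```

Check the generated statements against the YugabyteDB documentation for your version before running them.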
HNSW Parameters
| Parameter | Description | Default | Tuning |
|---|---|---|---|
| m | Max connections per layer in HNSW graph | 16 | Higher = better recall, more memory |
| ef_construction | Size of dynamic candidate list during build | 64 | Higher = better index quality, slower build |
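To see how `m` feeds into memory use, a common rule of thumb for HNSW graphs estimates each stored element at roughly `dim * 4 + m * 2 * 4` bytes (float32 components plus up to `2 * m` four-byte links on the bottom layer). The sketch below applies that approximation; it ignores upper graph layers and any overhead specific to Usearch or YugabyteDB, so treat the result as a lower-bound estimate. The function name is hypothetical.

```python
def estimate_hnsw_bytes(
    num_vectors: int,
    dim: int,
    m: int = 16,
    bytes_per_component: int = 4,  # float32 components (assumption)
    bytes_per_link: int = 4,       # 32-bit neighbor ids (assumption)
) -> int:
    """Rough HNSW footprint: raw vectors plus bottom-layer links."""
    vector_bytes = dim * bytes_per_component
    link_bytes = 2 * m * bytes_per_link  # bottom layer allows up to 2*m links
    return num_vectors * (vector_bytes + link_bytes)


if __name__ == "__main__":
    for m in (16, 32):
        size = estimate_hnsw_bytes(1_000_000, 1536, m=m)
        print(f"m={m}: about {size / 1e9:.2f} GB")
```

For high-dimensional embeddings the raw vectors dominate, so doubling `m` raises the estimate only slightly; the effect of `m` is larger for low-dimensional data.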
Index Performance Tuning
Advanced Usage
RAG Application Pattern
Multimodal Search
Aggregation Operations
Batch Vector Operations
MVCC and Transactional Behavior
Vector LSM respects YugabyteDB’s MVCC (Multi-Version Concurrency Control).
Read-Your-Writes
Snapshot Isolation
Handling Deletes and Updates
Vector LSM filters out deleted/overwritten vectors.
Performance Optimization
Index Build Time
Factors affecting build time:
- Number of vectors
- Vector dimensionality
- HNSW parameters (m, ef_construction)
- Available memory
Query Performance
Optimizations:
- Increase ef_search for better recall (at the cost of higher latency)
- Use LIMIT to reduce the result set
- Partition large tables
Memory Management
Vector LSM uses memory-mapped files for on-disk indexes:
- Provision sufficient memory for active indexes
- Monitor page cache hit rate
- Use SSD storage for vector indexes
Integration Examples
OpenAI Embeddings
LangChain Integration
Monitoring and Troubleshooting
Index Statistics
Query Performance Analysis
Common Issues
Slow queries:
- Lower ef_search (trades recall for speed)
- Check whether the index is being used (EXPLAIN)
- Verify sufficient memory for the index
Low recall:
- Increase m and ef_construction when creating the index
- Increase ef_search at query time
- Consider using exact search for validation
High memory usage:
- Reduce index size via table partitioning
- Lower the m parameter (fewer connections per layer)
- Monitor and tune the OS page cache
Best Practices
- Choose Right Distance Function:
  - L2 for absolute differences (image recognition)
  - Cosine for directional similarity (text, normalized vectors)
  - Inner product for ranking/scoring
- Normalize Vectors:
- Index Only After Bulk Load:
- Monitor Index Health:
  - Track index size growth
  - Monitor query latency
  - Rebuild indexes periodically for compaction
- Test Recall Quality:
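Recall can be checked by comparing the IDs an index returns against an exact brute-force search over the same data. The sketch below shows the measurement only; `exact_top_k` and `recall_at_k` are hypothetical helper names, L2 distance is assumed, and the approximate IDs would come from your own query against the index.

```python
import math
from typing import List, Sequence


def exact_top_k(
    query: Sequence[float],
    vectors: Sequence[Sequence[float]],
    k: int,
) -> List[int]:
    """Brute-force k nearest neighbors by L2 distance; returns row positions."""
    order = sorted(range(len(vectors)), key=lambda i: math.dist(query, vectors[i]))
    return order[:k]


def recall_at_k(approx_ids: Sequence[int], exact_ids: Sequence[int]) -> float:
    """Fraction of the exact neighbors that the approximate search also found."""
    if not exact_ids:
        raise ValueError("exact_ids must not be empty")
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


if __name__ == "__main__":
    vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]]
    exact = exact_top_k([0.0, 0.0], vectors, k=3)
    approx = [0, 2, 3]  # placeholder for IDs returned by an index query
    print(f"recall@3 = {recall_at_k(approx, exact):.2f}")
```

Averaging this value over a sample of representative queries gives a more stable picture than a single query.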
Limitations
- xCluster: Vector indexes not supported in xCluster replication
- Concurrent Index Creation: CREATE INDEX CONCURRENTLY not yet supported
- Partial Indexes: Not supported on vector columns
- Index Only Scans: Vector indexes require table access to fetch other columns

