
Introduction

Vector databases store and query high-dimensional embeddings efficiently, enabling:
  • Semantic search
  • Retrieval-Augmented Generation (RAG)
  • Recommendation systems
  • Similarity detection
  • Anomaly detection
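At their core, all of these use cases reduce to nearest-neighbor search over embeddings. A brute-force sketch with NumPy shows the basic operation that vector databases accelerate with ANN indexes (random data here, purely for illustration):

```python
import numpy as np

# 10,000 random unit vectors standing in for document embeddings
rng = np.random.default_rng(42)
corpus = rng.normal(size=(10_000, 384)).astype(np.float32)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

# A query that is a lightly perturbed copy of document 123
query = corpus[123] + 0.01 * rng.normal(size=384).astype(np.float32)
query /= np.linalg.norm(query)

# Cosine similarity via dot product (vectors are unit length),
# then take the top-5 most similar documents -- an O(n) scan
scores = corpus @ query
top5 = np.argsort(scores)[-5:][::-1]
print(top5[0])  # → 123, the document the query was derived from
```

A vector database replaces this linear scan with an index so the same query stays fast at billions of vectors.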

Why Vector Databases?

Semantic Search

Find similar items based on meaning, not just keywords

RAG Systems

Retrieve relevant context for LLM prompts

Scalability

Efficiently search billions of vectors

Real-time

Low-latency queries for production systems

LanceDB

LanceDB is an embedded vector database designed for AI applications.

Key Features

  • Embedded: No separate server required
  • Serverless: Works with cloud storage (S3, GCS)
  • Format: Built on Lance columnar format
  • Versioned: Built-in versioning and time travel
  • Multi-modal: Store vectors, text, images together

Installation

uv sync
# or
pip install lancedb sentence-transformers datasets

Building a RAG Application

Create a CLI application for semantic search over SQL questions.

Create Vector Database

vector-db/rag_cli_application.py
import random
import lancedb
from datasets import load_dataset
from sentence_transformers import SentenceTransformer

MODEL_NAME = "paraphrase-MiniLM-L3-v2"

def create_new_vector_db(
    table_name: str = "my-rag-app",
    number_of_documents: int = 1000,
    uri: str = ".lancedb"
):
    # Load dataset
    dataset = load_dataset("b-mc2/sql-create-context")
    docs = random.sample(list(dataset["train"]), k=number_of_documents)
    
    # Generate embeddings
    model = SentenceTransformer(MODEL_NAME)
    texts = [doc["question"] for doc in docs]
    embeddings = model.encode(texts, show_progress_bar=True)
    
    # Prepare data
    data = [
        {
            "id": idx,
            "text": texts[idx],
            "vector": embeddings[idx],
            "answer": docs[idx]["answer"],
            "question": docs[idx]["question"],
            "context": docs[idx]["context"],
        }
        for idx in range(len(texts))
    ]
    
    # Create database and index
    db = lancedb.connect(uri)
    lance_table = db.create_table(table_name, data=data)
    lance_table.create_index()
    
    print(f"Created table '{table_name}' with {number_of_documents} documents")

Components:
  • load_dataset: Load text-to-SQL dataset
  • SentenceTransformer: Generate embeddings
  • lancedb.connect: Create database connection
  • create_table: Store vectors and metadata
  • create_index: Build ANN index for fast search

Query Vector Database

vector-db/rag_cli_application.py
def query_existing_vector_db(
    query: str = "What was ARR last year?",
    table_name: str = "my-rag-app",
    top_n: int = 1,
    uri: str = ".lancedb"
):
    # Encode query
    model = SentenceTransformer(MODEL_NAME)
    query_embedding = model.encode(query)
    
    # Open database
    db = lancedb.connect(uri)
    tbl = db.open_table(table_name)
    
    # Search
    results = tbl.search(query_embedding).limit(top_n).to_list()
    
    # Display results
    print("Search results:")
    for result in results:
        print("\n--- RESULT ---")
        print(f"Answer: {result['answer']}")
        print(f"Context: {result['context']}")
        print(f"Question: {result['question']}")

CLI Usage

1. Create database

python vector-db/rag_cli_application.py create-new-vector-db \
  --table-name test \
  --number-of-documents 300

2. Query database

python vector-db/rag_cli_application.py query-existing-vector-db \
  --query "complex query" \
  --table-name test
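The subcommand names above imply a CLI wrapper around the two functions, but the article does not show the entry point. A minimal sketch with argparse (wiring and defaults assumed from the function signatures; your actual entry point may use a CLI library instead) could look like:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Hypothetical wiring; mirrors the flags used in the CLI examples above
    parser = argparse.ArgumentParser(prog="rag_cli_application.py")
    sub = parser.add_subparsers(dest="command", required=True)

    create = sub.add_parser("create-new-vector-db")
    create.add_argument("--table-name", default="my-rag-app")
    create.add_argument("--number-of-documents", type=int, default=1000)

    query = sub.add_parser("query-existing-vector-db")
    query.add_argument("--query", default="What was ARR last year?")
    query.add_argument("--table-name", default="my-rag-app")
    query.add_argument("--top-n", type=int, default=1)
    return parser

args = build_parser().parse_args(
    ["create-new-vector-db", "--number-of-documents", "300"]
)
print(args.command, args.number_of_documents)  # → create-new-vector-db 300
```

From there, dispatch on `args.command` to call `create_new_vector_db` or `query_existing_vector_db` with the parsed values.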

Architecture

Storage Format

LanceDB uses the Lance columnar format:
.lancedb/
├── my-rag-app.lance/
│   ├── data/
│   │   ├── chunk-0.lance
│   │   └── chunk-1.lance
│   ├── index/
│   │   └── ivf_pq.idx
│   └── manifest.json

Benefits:
  • Columnar storage for analytics
  • Efficient compression
  • Fast filtering on metadata
  • Version control built-in

Indexing

LanceDB supports multiple index types:
IVF-PQ (Inverted File with Product Quantization)

Best for:
  • Large datasets (>100k vectors)
  • Approximate nearest neighbor (ANN)
  • Trade-off: speed vs. accuracy

table.create_index(
    metric="L2",
    num_partitions=256,
    num_sub_vectors=96
)
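A common heuristic (a rule of thumb, not an official LanceDB recommendation) is to set num_partitions near the square root of the row count, and num_sub_vectors to a divisor of the embedding dimension. A small helper makes the idea concrete:

```python
import math

def suggest_ivf_pq_params(num_rows: int, dim: int) -> dict:
    """Heuristic IVF-PQ sizing: ~sqrt(n) partitions (rounded to a
    power of two), sub-vector count as a divisor of the dimension."""
    num_partitions = 2 ** round(math.log2(max(1, math.isqrt(num_rows))))
    # Largest divisor of dim that is <= dim // 4 keeps each sub-vector small
    num_sub_vectors = next(d for d in range(dim // 4, 0, -1) if dim % d == 0)
    return {"num_partitions": num_partitions, "num_sub_vectors": num_sub_vectors}

# 1M rows of 384-d MiniLM embeddings
print(suggest_ivf_pq_params(1_000_000, 384))
# → {'num_partitions': 1024, 'num_sub_vectors': 96}
```

Always validate recall on your own data after tuning; the right settings depend on the query/latency budget.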

Advanced Queries

Filtering

Combine vector search with SQL-like filters:
# Vector search + metadata filter
results = (
    table
    .search(query_embedding)
    .where("category = 'product'")
    .where("price < 100")
    .limit(10)
    .to_list()
)
Full-Text Search

LanceDB also supports keyword search over an indexed text column, which complements vector search:
# Create FTS index on the text column
table.create_fts_index("text")

# Full-text (keyword) query
results = (
    table
    .search("machine learning", query_type="fts")
    .limit(10)
    .to_list()
)

Reranking

Improve results with cross-encoder reranking:
from sentence_transformers import CrossEncoder

# Initial retrieval
results = table.search(query_embedding).limit(100).to_list()

# Rerank
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
scores = reranker.predict([(query, r['text']) for r in results])

# Sort by reranking scores
reranked = sorted(zip(results, scores), key=lambda x: x[1], reverse=True)
top_results = [r[0] for r in reranked[:10]]

Embedding Models

Model Selection

MiniLM

Fast, lightweight
384 dims, ~80MB
Good for: High throughput

SBERT

Balanced
768 dims, ~400MB
Good for: General purpose

BGE

High accuracy
1024 dims, ~1GB
Good for: Quality-critical

OpenAI

State-of-the-art
1536 dims, API
Good for: Best results
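Dimension and precision drive storage costs, so the choice above is also a capacity decision. A quick back-of-envelope for raw vector memory (index overhead excluded):

```python
def vector_memory_mb(num_vectors: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw storage for the vectors alone (FP32 by default)."""
    return num_vectors * dims * bytes_per_value / (1024 ** 2)

# 1M MiniLM vectors (384-d) vs 1M OpenAI vectors (1536-d), FP32
print(round(vector_memory_mb(1_000_000, 384)))   # → 1465
print(round(vector_memory_mb(1_000_000, 1536)))  # → 5859
```

At the same corpus size, the 1536-d option needs roughly 4x the memory of the 384-d one.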

Embedding Code

from sentence_transformers import SentenceTransformer

# Load model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Single text
embedding = model.encode("example text")

# Batch encoding
texts = ["text 1", "text 2", "text 3"]
embeddings = model.encode(
    texts,
    batch_size=32,
    show_progress_bar=True,
    normalize_embeddings=True  # L2 normalization
)
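The normalize_embeddings flag matters for metric choice: on unit vectors, L2 distance and cosine similarity rank results identically, since ||a − b||² = 2 − 2·cos(a, b). A quick check of the identity:

```python
import numpy as np

# Two random vectors, normalized to unit length
rng = np.random.default_rng(0)
a, b = rng.normal(size=384), rng.normal(size=384)
a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)

l2_sq = float(np.sum((a - b) ** 2))  # squared L2 distance
cos = float(a @ b)                   # cosine similarity (unit vectors)
assert abs(l2_sq - (2 - 2 * cos)) < 1e-9
```

So with normalized embeddings, either metric returns the same neighbors.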

Production RAG Pipeline

from typing import List, Dict
import lancedb
from sentence_transformers import SentenceTransformer
from openai import OpenAI

class RAGSystem:
    def __init__(self, db_path: str, table_name: str):
        self.db = lancedb.connect(db_path)
        self.table = self.db.open_table(table_name)
        self.encoder = SentenceTransformer('all-MiniLM-L6-v2')
        self.llm = OpenAI()
    
    def retrieve(self, query: str, top_k: int = 5) -> List[Dict]:
        """Retrieve relevant documents"""
        query_emb = self.encoder.encode(query)
        results = self.table.search(query_emb).limit(top_k).to_list()
        return results
    
    def generate(self, query: str, context: List[Dict]) -> str:
        """Generate answer using LLM"""
        context_str = "\n\n".join([doc['text'] for doc in context])
        
        prompt = f"""Answer the question based on the context below.
        
Context:
{context_str}

Question: {query}

Answer:"""
        
        response = self.llm.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}]
        )
        
        return response.choices[0].message.content
    
    def query(self, question: str) -> Dict:
        """Full RAG pipeline"""
        # Retrieve
        context = self.retrieve(question)
        
        # Generate
        answer = self.generate(question, context)
        
        return {
            "answer": answer,
            "sources": context
        }

Performance Optimization

Batching

Process multiple texts together:
# Slow: one forward pass per text
embeddings = [model.encode(text) for text in texts]

# Fast: batched forward passes
embeddings = model.encode(texts, batch_size=32)
Batching typically yields a 5-10x speedup.

Index Tuning

Adjust index parameters:
table.create_index(
    metric="cosine",
    num_partitions=256,  # More partitions = faster search, lower recall
    num_sub_vectors=96,  # More sub-vectors = higher accuracy, more memory
    accelerator="cuda"   # Use GPU if available
)

Caching

Cache embeddings for frequent queries:
from functools import lru_cache

@lru_cache(maxsize=1000)
def get_embedding(text: str):
    return model.encode(text)

Quantization

Reduce embedding precision:
import numpy as np

# FP32 -> FP16
embeddings_fp16 = embeddings.astype(np.float16)

# Or int8 (assumes values in [-1, 1], e.g. normalized embeddings)
embeddings_int8 = (embeddings * 127).astype(np.int8)
FP16 halves memory with minimal accuracy loss; int8 cuts it to a quarter at some cost in accuracy.
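Before shipping quantized embeddings, it is worth measuring the error they introduce. A quick NumPy check of FP16 round-off on normalized embeddings (random data, illustrative only):

```python
import numpy as np

# 1,000 random unit vectors standing in for embeddings
rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 384)).astype(np.float32)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

# Round-trip through FP16, then compare cosine scores against a query
quantized = emb.astype(np.float16).astype(np.float32)
query = emb[0]
exact = emb @ query
approx = quantized @ query

max_err = float(np.max(np.abs(exact - approx)))
print(max_err)  # small: FP16 keeps roughly 3 decimal digits of precision
```

If the score error is far below the typical gap between neighbors, the quantization is safe for retrieval; run the same check on your real embeddings before committing.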

Alternatives

ChromaDB

ChromaDB is another embedded vector database with a similar API:
import chromadb

client = chromadb.Client()
collection = client.create_collection("docs")

collection.add(
    documents=["doc1", "doc2"],
    embeddings=[[1,2,3], [4,5,6]],
    ids=["id1", "id2"]
)

results = collection.query(
    query_embeddings=[[1, 2, 3]],
    n_results=2  # only two documents in the collection
)

Best Practices

Chunk Size

  • Target 200-500 tokens per chunk
  • Use semantic chunking
  • Maintain context overlap
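A minimal sliding-window chunker illustrates the overlap guidance (word-level "tokens" for simplicity; a production pipeline would count tokens with the embedding model's tokenizer):

```python
def chunk_text(text: str, chunk_size: int = 300, overlap: int = 50) -> list[str]:
    """Split text into word-based chunks with a fixed overlap."""
    words = text.split()
    step = chunk_size - overlap  # advance by chunk size minus overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

doc = " ".join(f"w{i}" for i in range(700))
chunks = chunk_text(doc)
print(len(chunks))  # → 3
```

The overlap means each chunk repeats the tail of the previous one, so a sentence split at a boundary still appears intact in at least one chunk.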

Metadata

  • Store source, date, author
  • Enable filtering by metadata
  • Index filterable fields

Monitoring

  • Track query latency
  • Monitor recall quality
  • Log user feedback

Versioning

  • Version embeddings
  • Track model changes
  • Enable rollback
