
Performance Overview

Khoj is designed to scale from personal use on a laptop to enterprise deployment serving thousands of users. Understanding performance characteristics helps you optimize both your local development environment and production deployments.
Performance metrics vary based on hardware, data size, and configuration. The benchmarks below are representative examples, not guarantees.

Search Performance

Embedding Generation

< 100ms per query. Using the default sentence-transformers model, query embedding generation is fast and happens in real time.

Vector Search

< 50ms for 100K entries. pgvector performs cosine similarity search efficiently using HNSW or IVFFlat indexes.

Re-ranking

< 2s for 15 results. Cross-encoder models improve accuracy but add latency. Adjust top_k to balance speed against quality.

Filter Application

< 20ms overhead. Date, file, and word filters add minimal latency when properly indexed.

Optimization Strategies

Configure pgvector indexes for optimal search performance:
-- HNSW index (better for accuracy)
CREATE INDEX ON khoj_entry USING hnsw (embeddings vector_cosine_ops);

-- IVFFlat index (better for speed)
CREATE INDEX ON khoj_entry USING ivfflat (embeddings vector_cosine_ops)
WITH (lists = 100);
Trade-offs:
  • HNSW: Better recall, slower inserts, higher memory
  • IVFFlat: Faster inserts, lower memory, slightly lower recall
When indexing multiple documents, batch embed and insert operations:
# Bad: One at a time
for doc in documents:
    embedding = generate_embedding(doc)
    save_to_db(embedding)

# Good: Batch processing
embeddings = generate_embeddings_batch(documents, batch_size=32)
save_to_db_batch(embeddings)
Benefits:
  • 3-5x faster embedding generation
  • Reduced database connection overhead
  • Better GPU utilization
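The chunking step inside a batch helper like the hypothetical generate_embeddings_batch above can be sketched in pure Python (the name and batch size are illustrative, not Khoj's actual API):

```python
def batched(items, batch_size=32):
    """Yield successive batches of at most batch_size items.

    Sketch of the chunking step inside a hypothetical
    generate_embeddings_batch helper; a real implementation would
    encode each batch in a single model forward pass.
    """
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```

Each yielded batch can then be embedded and inserted with one model call and one database round-trip.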
Avoid recomputing embeddings for unchanged content:
  • Store content hash alongside embeddings
  • Only regenerate if content changes
  • Use incremental indexing for large corpora
import hashlib

content_hash = hashlib.sha256(content.encode()).hexdigest()
if existing_entry.content_hash != content_hash:
    # Content changed, regenerate the embedding
    embedding = generate_embedding(content)
Reduce re-ranking overhead by limiting results:
# Get top 100 from vector search
candidates = vector_search(query, limit=100)

# Re-rank only top 15
results = rerank(candidates, limit=15)
Adjust based on your accuracy requirements.

Indexing Performance

Baseline Metrics

First Run

~10 minutes for 100K lines. Initial indexing processes all content and generates embeddings.

Incremental Updates

< 1 minute for 100 changes. Only modified content is reprocessed.

Real-time Sync

< 5 seconds per file. Small files are indexed immediately after upload.

Factors Affecting Indexing Speed

1. Content Type

  • Plaintext/Markdown: Fastest (direct processing)
  • PDF: Medium (OCR for images, text extraction)
  • Images: Slowest (OCR with Tesseract/OCR models)
2. Content Size

Larger files take longer to:
  • Parse and extract text
  • Split into chunks
  • Generate embeddings
3. Hardware

  • CPU: Single-core performance matters for parsing
  • GPU: Accelerates embedding generation (optional)
  • RAM: 4GB minimum, 8GB+ recommended for large corpora
  • Disk I/O: SSD significantly faster than HDD
4. Embedding Model

  • Default: sentence-transformers/all-MiniLM-L6-v2 (fast, 384 dims)
  • Larger models: Better accuracy, slower speed
  • Custom models: Variable performance

Optimization Strategies

Use multiprocessing for CPU-bound tasks:
from multiprocessing import Pool

def process_file(file_path):
    content = extract_content(file_path)
    return generate_embedding(content)

with Pool(processes=4) as pool:
    embeddings = pool.map(process_file, file_paths)
Note: Be mindful of memory usage with large models.
Enable GPU for faster embedding generation:
# Install CUDA-enabled PyTorch
pip install torch --index-url https://download.pytorch.org/whl/cu118
Models automatically use GPU if available. Expect 2-10x speedup depending on batch size.
Only reindex changed files:
  • Track file modification timestamps
  • Store content hashes in database
  • Skip unchanged files during sync
This reduces 10-minute full indexes to seconds for typical updates.
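A minimal sketch of the hash-based skip logic, assuming changed-file detection by content hash (the function name and the shape of the stored hash map are illustrative, not Khoj's actual implementation):

```python
import hashlib
from pathlib import Path

def files_to_reindex(file_paths, known_hashes):
    """Return files whose content changed, plus an updated hash map.

    known_hashes maps file path -> SHA-256 digest stored from the
    previous sync; files with an unchanged digest are skipped.
    """
    changed, new_hashes = [], dict(known_hashes)
    for path in file_paths:
        digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
        if known_hashes.get(str(path)) != digest:
            changed.append(str(path))
            new_hashes[str(path)] = digest
    return changed, new_hashes
```

Persist the returned hash map after each sync so the next run only touches modified files.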
Offload indexing to background workers:
  • Use Celery or APScheduler for task queues
  • Process uploads asynchronously
  • Return immediate response to users
@app.post("/upload")
async def upload_file(file: UploadFile):
    task_id = queue_indexing_task(file)
    return {"task_id": task_id, "status": "processing"}

Chat Performance

Response Latency Breakdown

Context Retrieval

< 100ms. Semantic search to find relevant documents for chat context.

Tool Execution

Variable (0-5s). Depends on the tools used (web search, code execution, etc.).

Prompt Construction

< 50ms. Building the prompt from the system message, conversation history, and retrieved context.

LLM Generation

1-10s (streaming). Time to first token: ~500ms. Total time depends on response length and model.

Model Comparison

Model         | Speed             | Quality               | Cost
GPT-4o        | ⚡⚡⚡ Fast         | ⭐⭐⭐⭐⭐ Excellent | 💰💰 Medium
GPT-4 Turbo   | ⚡⚡ Medium        | ⭐⭐⭐⭐⭐ Excellent | 💰💰💰 High
GPT-3.5 Turbo | ⚡⚡⚡⚡ Very Fast | ⭐⭐⭐⭐ Good        | 💰 Low

Optimization Strategies

Always use streaming for better perceived performance:
async def chat_stream(prompt):
    async for chunk in llm.stream(prompt):
        yield chunk
Users see responses immediately instead of waiting for completion.
Optimize prompt size to reduce latency:
  • Limit conversation history (last 10-20 messages)
  • Truncate retrieved documents to relevant excerpts
  • Remove redundant system prompts
Rule of thumb: Keep prompts under 4K tokens for fastest responses.
Execute independent tools in parallel:
# Sequential (slow)
search_results = await web_search(query)
doc_results = await document_search(query)

# Parallel (fast)
search_results, doc_results = await asyncio.gather(
    web_search(query),
    document_search(query)
)
Cache responses for common queries:
  • Store (query_hash, response) pairs
  • Set TTL based on content volatility
  • Invalidate on content updates
Can reduce latency to < 100ms for cached hits.
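A minimal TTL cache implementing the bullets above might look like this (class and method names are illustrative; a production setup would typically use Redis):

```python
import time

class ResponseCache:
    """Minimal in-memory TTL cache for (query hash -> response) pairs."""

    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self._store = {}  # query hash -> (expiry timestamp, response)

    def get(self, query_hash):
        entry = self._store.get(query_hash)
        if entry is None:
            return None
        expires_at, response = entry
        if time.monotonic() > expires_at:
            del self._store[query_hash]  # Expired: evict and miss
            return None
        return response

    def set(self, query_hash, response):
        self._store[query_hash] = (time.monotonic() + self.ttl, response)

    def invalidate(self):
        """Clear everything, e.g. after a content update."""
        self._store.clear()
```

Choose the TTL per content type: stable reference material can cache for hours, while frequently synced notes warrant short TTLs or explicit invalidation.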

Database Performance

PostgreSQL Optimization

1. Connection Pooling

Configure Django database settings:
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql',
        'CONN_MAX_AGE': 600,  # Reuse connections
        'OPTIONS': {
            'connect_timeout': 10,
        }
    }
}
2. Index Strategy

Create indexes for common query patterns:
-- User queries
CREATE INDEX idx_entry_user ON khoj_entry(user_id);

-- Date filters
CREATE INDEX idx_entry_created ON khoj_entry(created_at);

-- File source lookups
CREATE INDEX idx_entry_source ON khoj_entry(file_source);

-- Vector search (choose one)
CREATE INDEX idx_entry_embeddings_hnsw 
  ON khoj_entry USING hnsw (embeddings vector_cosine_ops);
3. Query Optimization

  • Use select_related() and prefetch_related() to avoid N+1 queries
  • Add db_index=True to frequently filtered fields
  • Use only() and defer() to limit column fetches
4. Maintenance

Regular maintenance for optimal performance:
-- Analyze tables for query planner
ANALYZE khoj_entry;

-- Vacuum to reclaim space
VACUUM ANALYZE;

-- Reindex for fragmented indexes
REINDEX INDEX idx_entry_embeddings_hnsw;

Scaling Considerations

Vertical Scaling

Single-server improvements:
  • Increase PostgreSQL shared_buffers (25% of RAM)
  • Add more CPU cores for parallel queries
  • Use faster NVMe storage
  • Increase max_connections for high concurrency

Horizontal Scaling

Multi-server architecture:
  • Read replicas for search queries
  • Separate database for vectors (if needed)
  • Load balancer for FastAPI instances
  • Redis for session/cache layer

Memory Management

Model Loading

Embedding models are loaded into memory on startup. Plan for:
  • MiniLM-L6-v2: ~100MB RAM
  • MPNet-base: ~400MB RAM
  • Larger models: 1-4GB+ RAM

Optimization Strategies

Load models only when needed:
class EmbeddingModel:
    _model = None
    
    @classmethod
    def get_model(cls):
        if cls._model is None:
            cls._model = SentenceTransformer('model-name')
        return cls._model
Use reduced-precision or quantized models for lower memory:
  • FP16 (half precision): 2x memory reduction, minimal accuracy loss
  • INT8 quantization: 4x memory reduction, typically under 1% accuracy loss
  • 2-3x faster inference on supported hardware
model = SentenceTransformer('model-name')
model.half()  # FP16 precision, halves model memory
Balance memory usage and throughput:
# Small batch for low memory
embeddings = model.encode(texts, batch_size=8)

# Larger batch for throughput
embeddings = model.encode(texts, batch_size=64)

Monitoring & Profiling

Key Metrics to Track

Response Time

  • P50, P95, P99 latencies
  • Time-to-first-token for chat
  • Search query duration
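The latency percentiles above can be computed from raw samples with the standard library; this is a sketch, and a metrics backend such as Prometheus would normally derive them from histograms instead:

```python
import statistics

def latency_percentiles(samples_ms):
    """Summarize raw latency samples (in ms) as P50/P95/P99."""
    # n=100 yields 99 cut points: index i holds the (i+1)th percentile
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```

Track these per endpoint; tail percentiles (P95/P99) surface the slow queries that averages hide.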

Throughput

  • Requests per second
  • Concurrent users
  • Indexing rate (docs/minute)

Resource Usage

  • CPU utilization
  • Memory usage
  • Database connections
  • Disk I/O

Error Rates

  • 4xx/5xx status codes
  • Database timeouts
  • LLM API errors

Profiling Tools

# cProfile for CPU profiling
python -m cProfile -o profile.stats khoj_script.py

# memory_profiler for memory usage
# memory_profiler for memory usage (decorator applied in Python code)
@profile
def expensive_function():
    ...

# py-spy for production profiling
py-spy record -o profile.svg -- python -m khoj.main

Benchmarking

Running Performance Tests

1. Setup Test Data

Create representative test corpus:
python scripts/generate_test_data.py --size 100k
2. Run Benchmarks

# Search performance
pytest tests/benchmarks/test_search_performance.py

# Indexing performance
pytest tests/benchmarks/test_indexing_performance.py -v
3. Compare Results

pytest-benchmark compare baseline.json current.json

Load Testing

Use tools like Locust or k6 for load testing:
# locustfile.py
from locust import HttpUser, task, between

class KhojUser(HttpUser):
    wait_time = between(1, 3)
    
    @task
    def search(self):
        self.client.get("/api/search", params={"q": "test query"})
    
    @task(3)
    def chat(self):
        self.client.post("/api/chat", json={
            "q": "What is Khoj?",
            "conversation_id": "test"
        })
Run load test:
locust -f locustfile.py --host http://localhost:42110

Performance Best Practices

Use Async

Leverage async/await for I/O-bound operations to handle more concurrent requests.

Cache Aggressively

Cache embeddings, search results, and LLM responses where appropriate.

Batch Operations

Process multiple items together to reduce overhead and improve throughput.

Monitor Continuously

Set up monitoring and alerting to catch performance regressions early.

Profile Before Optimizing

Measure to find actual bottlenecks instead of optimizing prematurely.

Test at Scale

Test with realistic data volumes to identify scaling issues before production.

Additional Resources

Development Setup

Set up your local environment

Architecture

Understand the system design

PostgreSQL Performance

Official PostgreSQL optimization guide

FastAPI Performance

FastAPI async best practices
