
Performance Overview

Khoj is designed to scale from personal use on a laptop to enterprise deployment serving thousands of users. Understanding performance characteristics helps you optimize both your local development environment and production deployments.
Performance metrics vary based on hardware, data size, and configuration. The benchmarks below are representative examples, not guarantees.

Search Performance

Embedding Generation

< 100ms per query. Using the default sentence-transformers model, query embedding generation is fast and happens in real time.

Vector Search

< 50ms for 100K entries. pgvector performs cosine similarity search efficiently using HNSW or IVFFlat indexes.

Re-ranking

< 2s for 15 results. Cross-encoder models improve accuracy but add latency. Adjust top_k to balance speed against quality.

Filter Application

< 20ms overhead. Date, file, and word filters add minimal latency when properly indexed.

Optimization Strategies

Configure pgvector indexes for optimal search performance:
-- HNSW index (better for accuracy)
CREATE INDEX ON khoj_entry USING hnsw (embeddings vector_cosine_ops);

-- IVFFlat index (better for speed)
CREATE INDEX ON khoj_entry USING ivfflat (embeddings vector_cosine_ops)
WITH (lists = 100);
Trade-offs:
  • HNSW: Better recall, slower inserts, higher memory
  • IVFFlat: Faster inserts, lower memory, slightly lower recall
When indexing multiple documents, batch embed and insert operations:
# Bad: One at a time
for doc in documents:
    embedding = generate_embedding(doc)
    save_to_db(embedding)

# Good: Batch processing
embeddings = generate_embeddings_batch(documents, batch_size=32)
save_to_db_batch(embeddings)
Benefits:
  • 3-5x faster embedding generation
  • Reduced database connection overhead
  • Better GPU utilization
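The chunking step inside a batch helper like the hypothetical generate_embeddings_batch above can be sketched in pure Python (the name and batch size are illustrative, not Khoj's actual API):

```python
def batched(items, batch_size=32):
    """Yield successive batches of at most batch_size items.

    Sketch of the chunking step inside a hypothetical
    generate_embeddings_batch helper; a real implementation would
    encode each batch in a single model forward pass.
    """
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```

Each yielded batch can then be embedded and inserted with one model call and one database round-trip.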
Avoid recomputing embeddings for unchanged content:
  • Store content hash alongside embeddings
  • Only regenerate if content changes
  • Use incremental indexing for large corpora
import hashlib

content_hash = hashlib.sha256(content.encode()).hexdigest()
if existing_entry.content_hash != content_hash:
    # Content changed, regenerate the embedding
    embedding = generate_embedding(content)
Reduce re-ranking overhead by limiting results:
# Get top 100 from vector search
candidates = vector_search(query, limit=100)

# Re-rank only top 15
results = rerank(candidates, limit=15)
Adjust based on your accuracy requirements.

Indexing Performance

Baseline Metrics

First Run

~10 minutes for 100K lines. Initial indexing processes all content and generates embeddings.

Incremental Updates

< 1 minute for 100 changes. Only modified content is reprocessed.

Real-time Sync

< 5 seconds per file. Small files are indexed immediately after upload.

Factors Affecting Indexing Speed

1. Content Type

  • Plaintext/Markdown: Fastest (direct processing)
  • PDF: Medium (OCR for images, text extraction)
  • Images: Slowest (OCR with Tesseract/OCR models)
2. Content Size

Larger files take longer to:
  • Parse and extract text
  • Split into chunks
  • Generate embeddings
3. Hardware

  • CPU: Single-core performance matters for parsing
  • GPU: Accelerates embedding generation (optional)
  • RAM: 4GB minimum, 8GB+ recommended for large corpora
  • Disk I/O: SSD significantly faster than HDD
4. Embedding Model

  • Default: sentence-transformers/all-MiniLM-L6-v2 (fast, 384 dims)
  • Larger models: Better accuracy, slower speed
  • Custom models: Variable performance

Optimization Strategies

Use multiprocessing for CPU-bound tasks:
from multiprocessing import Pool

def process_file(file_path):
    content = extract_content(file_path)
    return generate_embedding(content)

with Pool(processes=4) as pool:
    embeddings = pool.map(process_file, file_paths)
Note: Be mindful of memory usage with large models.
Enable GPU for faster embedding generation:
# Install CUDA-enabled PyTorch
pip install torch --index-url https://download.pytorch.org/whl/cu118
Models automatically use GPU if available. Expect 2-10x speedup depending on batch size.
Only reindex changed files:
  • Track file modification timestamps
  • Store content hashes in database
  • Skip unchanged files during sync
This reduces 10-minute full indexes to seconds for typical updates.
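A minimal sketch of the hash-based skip logic, assuming changed-file detection by content hash (the function name and the shape of the stored hash map are illustrative, not Khoj's actual implementation):

```python
import hashlib
from pathlib import Path

def files_to_reindex(file_paths, known_hashes):
    """Return files whose content changed, plus an updated hash map.

    known_hashes maps file path -> SHA-256 digest stored from the
    previous sync; files with an unchanged digest are skipped.
    """
    changed, new_hashes = [], dict(known_hashes)
    for path in file_paths:
        digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
        if known_hashes.get(str(path)) != digest:
            changed.append(str(path))
            new_hashes[str(path)] = digest
    return changed, new_hashes
```

Persist the returned hash map after each sync so the next run only touches modified files.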
Offload indexing to background workers:
  • Use Celery or APScheduler for task queues
  • Process uploads asynchronously
  • Return immediate response to users
@app.post("/upload")
async def upload_file(file: UploadFile):
    task_id = queue_indexing_task(file)
    return {"task_id": task_id, "status": "processing"}

Chat Performance

Response Latency Breakdown

Context Retrieval

< 100ms. Semantic search to find relevant documents for chat context.

Tool Execution

Variable (0-5s). Depends on the tools used (web search, code execution, etc.).

Prompt Construction

< 50ms. Building the prompt from the system message, conversation history, and retrieved context.

LLM Generation

1-10s (streaming). Time to first token: ~500ms. Total time depends on response length and model.

Model Comparison

Model         | Speed             | Quality               | Cost
GPT-4o        | ⚡⚡⚡ Fast         | ⭐⭐⭐⭐⭐ Excellent | 💰💰 Medium
GPT-4 Turbo   | ⚡⚡ Medium        | ⭐⭐⭐⭐⭐ Excellent | 💰💰💰 High
GPT-3.5 Turbo | ⚡⚡⚡⚡ Very Fast | ⭐⭐⭐⭐ Good        | 💰 Low

Optimization Strategies

Always use streaming for better perceived performance:
async def chat_stream(prompt):
    async for chunk in llm.stream(prompt):
        yield chunk
Users see responses immediately instead of waiting for completion.
Optimize prompt size to reduce latency:
  • Limit conversation history (last 10-20 messages)
  • Truncate retrieved documents to relevant excerpts
  • Remove redundant system prompts
Rule of thumb: Keep prompts under 4K tokens for fastest responses.
Execute independent tools in parallel:
# Sequential (slow)
search_results = await web_search(query)
doc_results = await document_search(query)

# Parallel (fast)
search_results, doc_results = await asyncio.gather(
    web_search(query),
    document_search(query)
)
Cache responses for common queries:
  • Store (query_hash, response) pairs
  • Set TTL based on content volatility
  • Invalidate on content updates
Can reduce latency to < 100ms for cached hits.
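A minimal TTL cache implementing the bullets above might look like this (class and method names are illustrative; a production setup would typically use Redis):

```python
import time

class ResponseCache:
    """Minimal in-memory TTL cache for (query hash -> response) pairs."""

    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self._store = {}  # query hash -> (expiry timestamp, response)

    def get(self, query_hash):
        entry = self._store.get(query_hash)
        if entry is None:
            return None
        expires_at, response = entry
        if time.monotonic() > expires_at:
            del self._store[query_hash]  # Expired: evict and miss
            return None
        return response

    def set(self, query_hash, response):
        self._store[query_hash] = (time.monotonic() + self.ttl, response)

    def invalidate(self):
        """Clear everything, e.g. after a content update."""
        self._store.clear()
```

Choose the TTL per content type: stable reference material can cache for hours, while frequently synced notes warrant short TTLs or explicit invalidation.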

Database Performance

PostgreSQL Optimization

1. Connection Pooling

Configure Django database settings:
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql',
        'CONN_MAX_AGE': 600,  # Reuse connections
        'OPTIONS': {
            'connect_timeout': 10,
        }
    }
}
2. Index Strategy

Create indexes for common query patterns:
-- User queries
CREATE INDEX idx_entry_user ON khoj_entry(user_id);

-- Date filters
CREATE INDEX idx_entry_created ON khoj_entry(created_at);

-- File source lookups
CREATE INDEX idx_entry_source ON khoj_entry(file_source);

-- Vector search (choose one)
CREATE INDEX idx_entry_embeddings_hnsw 
  ON khoj_entry USING hnsw (embeddings vector_cosine_ops);
3. Query Optimization

  • Use select_related() and prefetch_related() to avoid N+1 queries
  • Add db_index=True to frequently filtered fields
  • Use only() and defer() to limit column fetches
4. Maintenance

Regular maintenance for optimal performance:
-- Analyze tables for query planner
ANALYZE khoj_entry;

-- Vacuum to reclaim space
VACUUM ANALYZE;

-- Reindex for fragmented indexes
REINDEX INDEX idx_entry_embeddings_hnsw;

Scaling Considerations

Vertical Scaling

Single-server improvements:
  • Increase PostgreSQL shared_buffers (25% of RAM)
  • Add more CPU cores for parallel queries
  • Use faster NVMe storage
  • Increase max_connections for high concurrency

Horizontal Scaling

Multi-server architecture:
  • Read replicas for search queries
  • Separate database for vectors (if needed)
  • Load balancer for FastAPI instances
  • Redis for session/cache layer

Memory Management

Model Loading

Embedding models are loaded into memory on startup. Plan for:
  • MiniLM-L6-v2: ~100MB RAM
  • MPNet-base: ~400MB RAM
  • Larger models: 1-4GB+ RAM

Optimization Strategies

Load models only when needed:
class EmbeddingModel:
    _model = None
    
    @classmethod
    def get_model(cls):
        if cls._model is None:
            cls._model = SentenceTransformer('model-name')
        return cls._model
Use reduced-precision or quantized models for lower memory:
  • FP16 (half precision): 2x memory reduction, minimal accuracy loss
  • INT8 quantization: 4x memory reduction, typically under 1% accuracy loss
  • 2-3x faster inference on supported hardware
model = SentenceTransformer('model-name')
model.half()  # FP16 precision, halves model memory
Balance memory usage and throughput:
# Small batch for low memory
embeddings = model.encode(texts, batch_size=8)

# Larger batch for throughput
embeddings = model.encode(texts, batch_size=64)

Monitoring & Profiling

Key Metrics to Track

Response Time

  • P50, P95, P99 latencies
  • Time-to-first-token for chat
  • Search query duration
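The latency percentiles above can be computed from raw samples with the standard library; this is a sketch, and a metrics backend such as Prometheus would normally derive them from histograms instead:

```python
import statistics

def latency_percentiles(samples_ms):
    """Summarize raw latency samples (in ms) as P50/P95/P99."""
    # n=100 yields 99 cut points: index i holds the (i+1)th percentile
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```

Track these per endpoint; tail percentiles (P95/P99) surface the slow queries that averages hide.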

Throughput

  • Requests per second
  • Concurrent users
  • Indexing rate (docs/minute)

Resource Usage

  • CPU utilization
  • Memory usage
  • Database connections
  • Disk I/O

Error Rates

  • 4xx/5xx status codes
  • Database timeouts
  • LLM API errors

Profiling Tools

# cProfile for CPU profiling
python -m cProfile -o profile.stats khoj_script.py

# memory_profiler for memory usage
# memory_profiler for memory usage (decorator applied in Python code)
@profile
def expensive_function():
    ...

# py-spy for production profiling
py-spy record -o profile.svg -- python -m khoj.main

Benchmarking

Running Performance Tests

1. Setup Test Data

Create representative test corpus:
python scripts/generate_test_data.py --size 100k
2. Run Benchmarks

# Search performance
pytest tests/benchmarks/test_search_performance.py

# Indexing performance
pytest tests/benchmarks/test_indexing_performance.py -v
3. Compare Results

pytest-benchmark compare baseline.json current.json

Load Testing

Use tools like Locust or k6 for load testing:
# locustfile.py
from locust import HttpUser, task, between

class KhojUser(HttpUser):
    wait_time = between(1, 3)
    
    @task
    def search(self):
        self.client.get("/api/search", params={"q": "test query"})
    
    @task(3)
    def chat(self):
        self.client.post("/api/chat", json={
            "q": "What is Khoj?",
            "conversation_id": "test"
        })
Run load test:
locust -f locustfile.py --host http://localhost:42110

Performance Best Practices

Use Async

Leverage async/await for I/O-bound operations to handle more concurrent requests.

Cache Aggressively

Cache embeddings, search results, and LLM responses where appropriate.

Batch Operations

Process multiple items together to reduce overhead and improve throughput.

Monitor Continuously

Set up monitoring and alerting to catch performance regressions early.

Profile Before Optimizing

Measure to find actual bottlenecks instead of optimizing prematurely.

Test at Scale

Test with realistic data volumes to identify scaling issues before production.

Additional Resources

Development Setup

Set up your local environment

Architecture

Understand the system design

PostgreSQL Performance

Official PostgreSQL optimization guide

FastAPI Performance

FastAPI async best practices
