# Performance Overview
Khoj is designed to scale from personal use on a laptop to enterprise deployment serving thousands of users. Understanding its performance characteristics helps you optimize both your local development environment and production deployments.

Performance metrics vary with hardware, data size, and configuration. The benchmarks below are representative examples, not guarantees.
## Search Performance

### Semantic Search

- **Embedding Generation** (< 100ms per query): Using the default sentence-transformers model, query embedding generation is fast and happens in real time.
- **Vector Search** (< 50ms for 100K entries): pgvector performs cosine similarity search efficiently using HNSW or IVFFlat indexes.
- **Re-ranking** (< 2s for 15 results): Cross-encoder models improve accuracy but add latency. Adjust `top_k` to balance speed against quality.
- **Filter Application** (< 20ms overhead): Date, file, and word filters add minimal latency when properly indexed.
### Optimization Strategies

#### Vector Indexing

Configure pgvector indexes for optimal search performance. Trade-offs:

- HNSW: better recall, slower inserts, higher memory use
- IVFFlat: faster inserts, lower memory use, slightly lower recall
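As a sketch, the index DDL might look like the following, assuming an `entry` table with a 384-dimension `embedding` column (the output size of the default sentence-transformers model):

```sql
-- HNSW: better recall, slower builds, more memory.
CREATE INDEX ON entry USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

-- IVFFlat alternative: faster to build, slightly lower recall.
-- CREATE INDEX ON entry USING ivfflat (embedding vector_cosine_ops)
--   WITH (lists = 100);
```

Table and column names are illustrative; adapt them to your schema.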
#### Batch Processing

When indexing multiple documents, batch the embedding and insert operations. Benefits:

- 3-5x faster embedding generation
- Reduced database connection overhead
- Better GPU utilization
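A minimal sketch of the batching step; the `SentenceTransformer` usage and `bulk_insert` helper in the comments are assumptions about the surrounding pipeline:

```python
from typing import Iterator

def batched(items: list, size: int) -> Iterator[list]:
    """Yield successive fixed-size batches from a list."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Sketch of batched embedding (sentence-transformers API assumed):
# model = SentenceTransformer("all-MiniLM-L6-v2")
# for chunk in batched(documents, 128):
#     vectors = model.encode(chunk)   # one forward pass per batch
#     bulk_insert(chunk, vectors)     # single DB round-trip per batch
```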
#### Cache Embeddings

Avoid recomputing embeddings for unchanged content:

- Store a content hash alongside each embedding
- Only regenerate an embedding when its content changes
- Use incremental indexing for large corpora
#### Limit Result Set

Reduce re-ranking overhead by limiting how many results reach the cross-encoder. Adjust `top_k` based on your accuracy requirements.
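For illustration, a truncation step before re-ranking might look like this (the cross-encoder call in the comment is a hypothetical placeholder):

```python
def rerank_candidates(candidates: list[tuple[str, float]], top_k: int = 15) -> list[str]:
    """Keep only the best vector-search hits before the expensive cross-encoder pass."""
    # Sort by vector-similarity score, descending, then truncate to top_k.
    best = sorted(candidates, key=lambda c: c[1], reverse=True)[:top_k]
    # cross_encoder.predict(...) would re-score just `best` here (hypothetical call).
    return [text for text, _score in best]
```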
## Indexing Performance

### Baseline Metrics

- **First Run** (~10 minutes for 100K lines): Initial indexing processes all content and generates embeddings.
- **Incremental Updates** (< 1 minute for 100 changes): Only modified content is reprocessed.
- **Real-time Sync** (< 5 seconds per file): Small files are indexed immediately after upload.
### Factors Affecting Indexing Speed

#### Content Type
- Plaintext/Markdown: Fastest (direct processing)
- PDF: Medium (OCR for images, text extraction)
- Images: Slowest (OCR with Tesseract/OCR models)
#### Content Size
Larger files take longer to:
- Parse and extract text
- Split into chunks
- Generate embeddings
#### Hardware
- CPU: Single-core performance matters for parsing
- GPU: Accelerates embedding generation (optional)
- RAM: 4GB minimum, 8GB+ recommended for large corpora
- Disk I/O: SSD significantly faster than HDD
### Optimization Strategies

#### Parallel Processing

Use multiprocessing for CPU-bound tasks such as parsing and chunking. Note: be mindful of memory usage, since each worker process may load its own copy of large models.
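A minimal sketch of a process pool for parsing; `parse_document` is a stand-in for the real parser, and the explicit `fork` context keeps the example simple on Unix (Windows requires `spawn`):

```python
from multiprocessing import get_context

def parse_document(text: str) -> int:
    """Stand-in for a CPU-bound parse step (here: a word count)."""
    return len(text.split())

def parse_all(documents: list[str], workers: int = 4) -> list[int]:
    """Fan parsing out across worker processes; results keep input order."""
    with get_context("fork").Pool(processes=workers) as pool:
        return pool.map(parse_document, documents)
```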
#### GPU Acceleration

Enable the GPU for faster embedding generation. Models automatically use the GPU when one is available; expect a 2-10x speedup depending on batch size.
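A minimal device-selection sketch; the commented torch and sentence-transformers lines are assumptions about the surrounding stack:

```python
def pick_device(cuda_available: bool) -> str:
    """Return the device string an embedding model should run on."""
    return "cuda" if cuda_available else "cpu"

# In practice (sketch, assuming torch and sentence-transformers are installed):
# import torch
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer("all-MiniLM-L6-v2",
#                             device=pick_device(torch.cuda.is_available()))
```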
#### Incremental Indexing

Only reindex changed files:
- Track file modification timestamps
- Store content hashes in database
- Skip unchanged files during sync
#### Background Processing

Offload indexing to background workers:
- Use Celery or APScheduler for task queues
- Process uploads asynchronously
- Return immediate response to users
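As an in-process sketch of this pattern (Celery or APScheduler would replace the thread pool in production; all names here are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor, Future

# Process-wide worker pool; a task queue fills this role in production.
_indexing_pool = ThreadPoolExecutor(max_workers=2)

def index_document(doc_id: str) -> str:
    """Stand-in for the slow parse + embed + insert pipeline."""
    return f"indexed:{doc_id}"

def handle_upload(doc_id: str) -> dict:
    """Queue indexing and return to the caller immediately."""
    future: Future = _indexing_pool.submit(index_document, doc_id)
    # The HTTP handler would return 202 Accepted here instead of blocking.
    return {"status": "queued", "doc_id": doc_id, "future": future}
```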
## Chat Performance

### Response Latency Breakdown
### Model Comparison

The table below compares OpenAI models; Anthropic, Google, and local models involve similar speed/quality/cost trade-offs.
| Model | Speed | Quality | Cost |
|---|---|---|---|
| GPT-4o | ⚡⚡⚡ Fast | ⭐⭐⭐⭐⭐ Excellent | 💰💰 Medium |
| GPT-4 Turbo | ⚡⚡ Medium | ⭐⭐⭐⭐⭐ Excellent | 💰💰💰 High |
| GPT-3.5 Turbo | ⚡⚡⚡⚡ Very Fast | ⭐⭐⭐⭐ Good | 💰 Low |
### Optimization Strategies

#### Streaming Responses

Always use streaming for better perceived performance: users see tokens as they are generated instead of waiting for the full completion.
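A generator-based sketch; the FastAPI `StreamingResponse` line in the comment is an assumption about how it would be wired up:

```python
from typing import Iterator

def stream_tokens(chunks: Iterator[str]) -> Iterator[str]:
    """Relay model output chunk-by-chunk instead of buffering the whole reply."""
    for chunk in chunks:
        yield chunk  # each chunk is flushed to the client as it arrives

# With FastAPI this generator would back a streaming endpoint (sketch):
# return StreamingResponse(stream_tokens(llm_chunks),
#                          media_type="text/event-stream")
```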
#### Context Window Management

Optimize prompt size to reduce latency:
- Limit conversation history (last 10-20 messages)
- Truncate retrieved documents to relevant excerpts
- Remove redundant system prompts
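The history-trimming step above can be sketched as:

```python
def trim_history(messages: list[dict], max_messages: int = 20) -> list[dict]:
    """Keep the system prompt plus only the most recent conversation turns."""
    system = [m for m in messages if m["role"] == "system"][:1]
    turns = [m for m in messages if m["role"] != "system"]
    return system + turns[-max_messages:]
```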
#### Tool Parallelization

Execute independent tool calls in parallel rather than sequentially.
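With asyncio this looks roughly like the following; both tool functions are hypothetical stand-ins for real retrieval calls:

```python
import asyncio

async def answer(query: str) -> list[str]:
    """Run independent retrieval tools concurrently rather than back-to-back."""
    async def web_search(q: str) -> str:    # hypothetical tool call
        await asyncio.sleep(0)              # stands in for network latency
        return f"web:{q}"

    async def notes_search(q: str) -> str:  # hypothetical tool call
        await asyncio.sleep(0)
        return f"notes:{q}"

    # gather() overlaps the two waits instead of paying them sequentially.
    return list(await asyncio.gather(web_search(query), notes_search(query)))
```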
#### Caching

Cache responses for common queries:
- Store (query_hash, response) pairs
- Set TTL based on content volatility
- Invalidate on content updates
## Database Performance

### PostgreSQL Optimization

#### Query Optimization

- Use `select_related()` and `prefetch_related()` to avoid N+1 queries
- Add `db_index=True` to frequently filtered fields
- Use `only()` and `defer()` to limit column fetches
### Scaling Considerations

#### Vertical Scaling

Single-server improvements:

- Increase PostgreSQL `shared_buffers` (~25% of RAM)
- Add more CPU cores for parallel queries
- Use faster NVMe storage
- Increase `max_connections` for high concurrency
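For instance, on a 32 GB server the corresponding `postgresql.conf` entries might look like this (illustrative values, not tuned recommendations):

```ini
shared_buffers = 8GB     # ~25% of system RAM
max_connections = 200    # raise cautiously; each connection costs memory
```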
#### Horizontal Scaling
Multi-server architecture:
- Read replicas for search queries
- Separate database for vectors (if needed)
- Load balancer for FastAPI instances
- Redis for session/cache layer
## Memory Management

### Model Loading
### Optimization Strategies

#### Lazy Loading

Load models only when needed.
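A minimal sketch using `functools.lru_cache`; the `CrossEncoder` lines in the comment are assumptions about the model being loaded:

```python
from functools import lru_cache

@lru_cache(maxsize=1)
def get_cross_encoder():
    """Load the re-ranking model on first use, then reuse the same instance."""
    # Real code would do e.g.:
    #   from sentence_transformers import CrossEncoder
    #   return CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    return object()  # stand-in so the sketch stays self-contained
```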
#### Model Quantization

Use quantized models to lower memory usage:
- INT8 quantization: 4x memory reduction
- Minimal accuracy loss (less than 1%)
- 2-3x faster inference
#### Batch Size Tuning

Balance memory usage against throughput: larger embedding batches improve GPU utilization but raise peak memory.
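As a purely illustrative heuristic (the per-item cost and the half-of-free-memory budget are assumptions you would measure for your model):

```python
def pick_batch_size(free_memory_mb: int, per_item_mb: float = 4.0,
                    floor: int = 8, ceiling: int = 256) -> int:
    """Spend about half of free memory on the batch, clamped to sane bounds."""
    fitting = int((free_memory_mb / 2) / per_item_mb)
    return max(floor, min(ceiling, fitting))
```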
## Monitoring & Profiling

### Key Metrics to Track

#### Response Time
- P50, P95, P99 latencies
- Time-to-first-token for chat
- Search query duration
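Percentile latencies can be computed from raw samples with the standard library, for example:

```python
import statistics

def latency_percentiles(samples_ms: list[float]) -> dict[str, float]:
    """P50/P95/P99 from raw request latencies (inclusive quantile method)."""
    qs = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": qs[49], "p95": qs[94], "p99": qs[98]}
```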
#### Throughput
- Requests per second
- Concurrent users
- Indexing rate (docs/minute)
#### Resource Usage
- CPU utilization
- Memory usage
- Database connections
- Disk I/O
#### Error Rates
- 4xx/5xx status codes
- Database timeouts
- LLM API errors
### Profiling Tools

- **Python**: e.g. `cProfile` or `py-spy` for CPU and flame-graph profiles
- **Database**: e.g. `EXPLAIN ANALYZE` and the `pg_stat_statements` extension for slow queries
- **API**: e.g. request-timing middleware or OpenTelemetry traces for end-to-end latency
## Benchmarking

### Running Performance Tests

#### Load Testing
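A minimal Locust script might look like this; the `/api/search` endpoint and its query parameters are assumptions, so adapt them to your deployment:

```python
# locustfile.py -- run with: locust -f locustfile.py --host=https://your-khoj-host
from locust import HttpUser, task, between

class SearchUser(HttpUser):
    wait_time = between(1, 3)  # simulated think time between requests

    @task
    def search(self):
        # Hypothetical search endpoint; adjust the path and params as needed.
        self.client.get("/api/search", params={"q": "meeting notes"})
```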
Tools like Locust or k6 work well for load testing.

## Performance Best Practices
- **Use Async**: Leverage async/await for I/O-bound operations to handle more concurrent requests.
- **Cache Aggressively**: Cache embeddings, search results, and LLM responses where appropriate.
- **Batch Operations**: Process multiple items together to reduce overhead and improve throughput.
- **Monitor Continuously**: Set up monitoring and alerting to catch performance regressions early.
- **Profile Before Optimizing**: Measure to find actual bottlenecks instead of optimizing prematurely.
- **Test at Scale**: Test with realistic data volumes to identify scaling issues before production.
## Additional Resources

- **Development Setup**: Set up your local environment
- **Architecture**: Understand the system design
- **PostgreSQL Performance**: The official PostgreSQL optimization guide
- **FastAPI Performance**: FastAPI async best practices
