Deployment checklist
Before deploying to production, ensure you have:
- Benchmarked retrieval quality on representative queries
- Tuned top_k, candidate_pool_size, and reranking settings
- Set up monitoring and observability
- Configured proper logging levels
- Secured API keys using environment variables
- Tested error handling and fallback behavior
- Established cost budgets and alerts
- Implemented rate limiting for LLM calls
- Validated latency meets SLO requirements
Environment configuration
Production environment variables
Use environment variables for all secrets and environment-specific settings, for example in a .env.production file.
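A minimal sketch of reading such settings at startup and failing fast when a secret is missing. The variable names (LLM_API_KEY, VECTOR_DB_URL, LOG_LEVEL) and the Settings class are hypothetical, chosen only for illustration:

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    llm_api_key: str
    vector_db_url: str
    log_level: str


def load_settings(env=None) -> Settings:
    """Read settings from the environment, failing fast on missing secrets."""
    env = os.environ if env is None else env
    required = ("LLM_API_KEY", "VECTOR_DB_URL")  # hypothetical variable names
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return Settings(
        llm_api_key=env["LLM_API_KEY"],
        vector_db_url=env["VECTOR_DB_URL"],
        log_level=env.get("LOG_LEVEL", "INFO"),  # optional, with a default
    )
```

Calling load_settings() once at startup surfaces a missing key immediately, rather than on the first request that needs it.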
Configuration file structure
Organize configurations by environment, with a shared base file (configs/base.yaml) and a production file (configs/production.yaml).
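One common way to combine a base file with an environment file is a recursive merge in which production values override base values. The sketch below works on already-parsed dictionaries; the keys and values are illustrative assumptions, not the actual contents of either file:

```python
import copy


def merge_config(base: dict, override: dict) -> dict:
    """Return a new dict where values in `override` win over `base`."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Merge nested sections key by key instead of replacing them.
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical parsed contents of the base and production files.
base = {
    "retrieval": {"top_k": 5, "candidate_pool_size": 50},
    "logging": {"level": "DEBUG"},
}
production = {
    "retrieval": {"top_k": 10},
    "logging": {"level": "WARNING"},
}

config = merge_config(base, production)
```

Settings that production does not mention, such as candidate_pool_size here, are inherited from the base file.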
Database-specific deployment considerations
Pinecone
Namespace strategy
Multi-tenancy: Use namespaces for tenant isolation (scales to 100,000+ tenants).
Best practices:
- Monitor pod utilization and scale replicas based on QPS
- Use serverless indexes for variable workloads
- Implement retry logic for rate limit errors (429)
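Namespace-based tenant isolation can be sketched as follows. The naming scheme ("tenant-" plus the lowercased ID) is an assumption, and `index` is any object that behaves like a Pinecone index, whose query() accepts vector, top_k, and namespace keyword arguments:

```python
def tenant_namespace(tenant_id: str) -> str:
    """Map a tenant ID to a namespace name (the scheme is an assumption)."""
    if not tenant_id or not tenant_id.strip():
        raise ValueError("tenant_id must be non-empty")
    return f"tenant-{tenant_id.strip().lower()}"


def query_for_tenant(index, tenant_id: str, vector: list, top_k: int = 5):
    """Query only the vectors stored in the given tenant's namespace."""
    return index.query(
        vector=vector,
        top_k=top_k,
        namespace=tenant_namespace(tenant_id),
    )
```

Routing every read and write through one helper like this makes it harder to query across tenants by accident.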
Weaviate
Connection configuration
Multi-tenancy: Weaviate supports native multi-tenancy with per-tenant shards.
Best practices:
- Use batch imports for initial indexing (100+ docs/batch)
- Enable quantization (PQ or BQ) to reduce memory usage by roughly 4x
- Monitor shard health and replication status
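For the batch-import recommendation, a small client-agnostic helper can split documents into fixed-size batches. This is only a sketch: the call that sends each batch to Weaviate depends on the client version and is deliberately left out.

```python
from itertools import islice


def batched(items, batch_size=100):
    """Yield lists of up to `batch_size` items from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Example: 250 hypothetical documents split into batches of 100.
batch_sizes = [len(batch) for batch in batched(range(250), batch_size=100)]
```

Because the helper consumes an iterator lazily, the full corpus does not have to fit in memory during initial indexing.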
Milvus
Production configuration
Partition-based multi-tenancy
Best practices:
- Use scalar quantization (SQ8) for 4x storage reduction
- Enable partition pruning with metadata filters
- Monitor memory usage per collection
- Set an appropriate index_file_size for write throughput
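The 4x figure for SQ8 follows from storing each dimension as one byte instead of a four-byte float32. The sketch below illustrates the arithmetic with a simple per-vector min/max quantizer. It is an illustration of the idea, not Milvus's internal implementation, and the byte count ignores the two floats of per-vector metadata.

```python
from array import array


def quantize_sq8(vector):
    """Map each float to an integer code in 0..255 using min/max scaling."""
    lo, hi = min(vector), max(vector)
    scale = (hi - lo) / 255 or 1.0  # avoid a zero scale for constant vectors
    codes = bytes(min(255, max(0, round((x - lo) / scale))) for x in vector)
    return codes, lo, scale


def dequantize_sq8(codes, lo, scale):
    """Approximately reconstruct the original floats from the codes."""
    return [lo + code * scale for code in codes]


vector = [0.0, 0.25, 0.5, 1.0]
codes, lo, scale = quantize_sq8(vector)

float32_bytes = len(array("f", vector).tobytes())  # 4 bytes per dimension
sq8_bytes = len(codes)                             # 1 byte per dimension
```

The reconstruction error per dimension is at most half a quantization step, which is the accuracy cost paid for the smaller footprint.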
Qdrant
Production setup
Payload-based multi-tenancy
Best practices:
- Enable payload indexing for frequently filtered fields
- Use quantization (scalar or binary) for large datasets
- Monitor disk usage and configure storage thresholds
- Use gRPC instead of HTTP for lower latency
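Payload-based multi-tenancy usually means every search carries a filter on a tenant field. The sketch below builds such a request body as plain dictionaries in the shape of Qdrant's REST filter syntax (must / key / match). The payload field name tenant_id is an assumption.

```python
def tenant_filter(tenant_id: str, field: str = "tenant_id") -> dict:
    """Build a filter that matches only points belonging to one tenant."""
    return {"must": [{"key": field, "match": {"value": tenant_id}}]}


def search_body(vector: list, tenant_id: str, limit: int = 5) -> dict:
    """Build a search request body restricted to a single tenant."""
    return {
        "vector": vector,
        "limit": limit,
        "filter": tenant_filter(tenant_id),
        "with_payload": True,
    }


body = search_body([0.1, 0.2, 0.3], tenant_id="acme", limit=3)
```

Creating a payload index on the tenant field, as the first best practice suggests, keeps this filter cheap as the collection grows.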
Chroma
Production configuration
Best practices:
- Run Chroma server in Docker for production
- Use persistent storage volumes
- Implement connection pooling for concurrent requests
- Monitor collection size and query latency
Logging and monitoring
Production logging configuration
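As a minimal sketch of per-environment log levels, the snippet below uses Python's standard logging module. The APP_ENV variable and the level chosen for each environment are assumptions for illustration, not settings defined by this guide.

```python
import logging
import os

# Hypothetical mapping from environment name to log level.
LOG_LEVELS = {
    "development": logging.DEBUG,
    "staging": logging.INFO,
    "production": logging.WARNING,
}


def configure_logging(environment=None) -> int:
    """Configure the root logger for the given environment and return the level."""
    environment = environment or os.environ.get("APP_ENV", "development")
    level = LOG_LEVELS.get(environment, logging.INFO)  # unknown names fall back to INFO
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s %(message)s",
        force=True,  # replace any handlers configured earlier (Python 3.8+)
    )
    return level
```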
Set appropriate log levels for each environment.
Key metrics to monitor
Query latency
- p50, p95, p99 latency by query type
- Breakdown: embedding, retrieval, reranking, generation
- Alert on p95 > SLO threshold
Retrieval quality
- Online Recall@k and MRR
- User feedback signals (clicks, dwell time)
- Fallback rate (queries with no results)
Cost metrics
- LLM API token usage per query
- Embedding API costs
- Database operations cost
- Cost per 1000 queries
System health
- Database connection errors
- API rate limit hits
- Retry and timeout rates
- Error rates by type
Example monitoring implementation
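A minimal in-process sketch of the latency metrics listed above: it records per-stage timings and reports nearest-rank percentiles. The class and stage names are hypothetical, and a production system would more likely export these samples to a metrics backend than keep them in memory.

```python
import math
import time
from collections import defaultdict
from contextlib import contextmanager


class LatencyTracker:
    """Collect latency samples per pipeline stage and report percentiles."""

    def __init__(self):
        self._samples = defaultdict(list)

    def record(self, stage, seconds):
        self._samples[stage].append(seconds)

    @contextmanager
    def timed(self, stage):
        """Time the enclosed block and record it under `stage`."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.record(stage, time.perf_counter() - start)

    def percentile(self, stage, pct):
        """Nearest-rank percentile (pct in 1..100) for one stage."""
        samples = sorted(self._samples[stage])
        if not samples:
            raise ValueError(f"no samples recorded for stage {stage!r}")
        rank = max(1, math.ceil(pct * len(samples) / 100))
        return samples[rank - 1]

    def breaches_slo(self, stage, slo_seconds, pct=95):
        """True when the chosen percentile exceeds the SLO threshold."""
        return self.percentile(stage, pct) > slo_seconds
```

Typical usage wraps each stage, for example `with tracker.timed("retrieval"): ...`, and periodically checks `tracker.breaches_slo("retrieval", slo_seconds=0.5)` to drive the p95 alert.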
Error handling and resilience
Retry logic with exponential backoff
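A minimal sketch of retrying with exponential backoff and full jitter. RateLimitError is a hypothetical stand-in for whatever exception your client raises on HTTP 429, and the default delays are illustrative rather than recommended values.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for a client's rate limit (HTTP 429) error."""


def retry_with_backoff(
    func,
    max_attempts=5,
    base_delay=0.5,
    max_delay=30.0,
    retry_on=(RateLimitError, TimeoutError),
    sleep=time.sleep,
):
    """Call func(), retrying on transient errors with growing, jittered delays."""
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # The cap doubles each attempt: base, 2*base, 4*base, ... up to max_delay.
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))  # full jitter spreads out retries
```

Errors outside `retry_on` propagate immediately, so bugs and invalid requests are not retried.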
Circuit breaker pattern
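A minimal single-threaded sketch of a circuit breaker: after a run of consecutive failures it rejects calls for a cool-down period, then lets one trial call through. The class names, thresholds, and timeout are illustrative assumptions, and a real deployment would also need thread safety.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; call skipped")
            # Cool-down elapsed: half-open, so allow one trial call through.
        try:
            result = func()
        except Exception:
            self._failures += 1
            half_open = self._opened_at is not None
            if half_open or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Callers can catch CircuitOpenError and switch to a fallback, such as a cached answer or a degraded retrieval path, instead of waiting on a failing dependency.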
Performance optimization
Cost-optimized configuration
Reduce costs while maintaining quality.
Caching strategy
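One way to cut repeated embedding or LLM calls is a small in-memory cache with a size limit and a time-to-live. The sketch below is an assumption about what such a cache could look like, not the guide's own implementation. The `compute` callback stands in for the expensive call.

```python
import time
from collections import OrderedDict


class TTLCache:
    """Least-recently-used cache whose entries also expire after a TTL."""

    def __init__(self, max_size=1024, ttl_seconds=3600.0, clock=time.monotonic):
        self.max_size = max_size
        self.ttl_seconds = ttl_seconds
        self._clock = clock
        self._entries = OrderedDict()  # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            self._entries.move_to_end(key)  # mark as recently used
            return entry[1]
        # Miss or expired entry: do the expensive work and store the result.
        value = compute(key)
        self._entries[key] = (now + self.ttl_seconds, value)
        self._entries.move_to_end(key)
        if len(self._entries) > self.max_size:
            self._entries.popitem(last=False)  # evict the least recently used
        return value
```

For example, `cache.get_or_compute(query_text, embed)` calls the hypothetical `embed` function only when the query has not been seen within the TTL.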
Deployment patterns
Containerized deployment
Sample Dockerfile
Kubernetes deployment
deployment.yaml
Next steps
Benchmarking
Validate production performance with benchmarks
Configuration
Fine-tune production settings
Environment variables
Complete reference for production credentials
Building RAG pipelines
Learn core RAG pipeline concepts