Production architecture
The system is designed for production deployment with these characteristics:

- Stateless API: Horizontally scalable FastAPI application
- External dependencies: Chroma vector store, OpenAI API, Unstructured API
- Persistent data: Models, vector database, evaluation reports
- Async-ready: Uses `async def` for non-blocking I/O operations
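The async pattern above can be sketched with plain `asyncio`; the function names are illustrative, not taken from the codebase:

```python
import asyncio

async def retrieve_documents(query: str) -> list[str]:
    # Simulates a non-blocking I/O call (vector search, LLM request).
    # While this awaits, the event loop can serve other requests.
    await asyncio.sleep(0)
    return [f"doc matching {query!r}"]

async def handle_query(q: str) -> dict:
    # An `async def` FastAPI handler would follow the same shape.
    docs = await retrieve_documents(q)
    return {"query": q, "results": docs}

result = asyncio.run(handle_query("refund policy"))
```

In a FastAPI path operation, declaring the handler with `async def` gives the same benefit: the worker is free to process other requests while awaiting the vector store or OpenAI API.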
Deployment options
- Kubernetes
- Cloud Run (GCP)
- AWS ECS
- Docker Swarm
Recommended for: Enterprise deployments with auto-scaling needs
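A hedged sketch of what a Kubernetes sample deployment could look like; the service name, image, ports, and resource values are all placeholders, not taken from the project:

```yaml
# Hypothetical Service + Deployment for the stateless API.
apiVersion: v1
kind: Service
metadata:
  name: rag-api
spec:
  selector:
    app: rag-api
  ports:
    - port: 80
      targetPort: 8000
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rag-api
spec:
  replicas: 3          # horizontal scaling: add replicas, not state
  selector:
    matchLabels:
      app: rag-api
  template:
    metadata:
      labels:
        app: rag-api
    spec:
      containers:
        - name: api
          image: rag-api:latest   # placeholder image name
          ports:
            - containerPort: 8000
          resources:
            limits:
              cpu: "1"
              memory: 1Gi
```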
Monitoring & observability
Metrics to track
System metrics:

- API response time (p50, p95, p99)
- Request rate (requests/second)
- Error rate (4xx, 5xx responses)
- CPU and memory utilization
- Container restarts
RAG metrics:

- RAG queries per minute
- Retrieval relevance scores
- Triage confidence scores
- Human review rate (`needs_human_review=true`)
- Average citations per response
Cost metrics:

- OpenAI API token usage
- Embedding API calls
- LLM generation calls
- Total API spend per day/week
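The latency percentiles listed above can be computed from raw samples with the standard library alone; in production these would typically be exported via Prometheus rather than computed ad hoc:

```python
import statistics

def latency_percentiles(samples_ms: list[float]) -> dict[str, float]:
    # quantiles(n=100) returns the 99 cut points between percentiles 1..99,
    # so index 49 is p50, index 94 is p95, index 98 is p99.
    cuts = statistics.quantiles(samples_ms, n=100)
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

samples = [float(x) for x in range(1, 101)]  # synthetic 1..100 ms samples
pcts = latency_percentiles(samples)
```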
Monitoring setup
From the architecture documentation (ARCHITECTURE.md):

Monitoring & Observability

Recommended stack:
- Track retrieval relevance, faithfulness rate, adversarial failures, LLM latency, cost and system metrics.
- Use structured JSON logging and export metrics via OpenTelemetry / Prometheus.
- Treat periodic offline evaluations as regression tests.
- Prometheus + Grafana: system metrics, API latency, and request rates
- LangFuse / LangSmith: LLM-specific observability (traces, costs, quality)
- Sentry: error tracking and alerting
- CloudWatch / Datadog: cloud-native monitoring
Structured logging
The system uses structured logging (`src/logger.py`). Enhance it for production:
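One way to emit structured JSON logs with the standard library; the field names here are illustrative, not those actually used in `src/logger.py`:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line for log aggregators."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("rag_api")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Format a record directly to show the output shape.
line = JsonFormatter().format(
    logging.LogRecord("rag_api", logging.INFO, __file__, 0, "query served", None, None)
)
```

Machine-parseable lines like this are what make the OpenTelemetry / Prometheus export path in the recommended stack practical.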
Cost controls
From the architecture documentation:

Cost Controls

- Enforce hard limits on `top_k` retrieval.
- Use embedding & retrieval caching and smaller verification models where possible.
- Run heavy adversarial checks offline.
- Where viable, host open-source models instead of paid APIs.
Implementation strategies
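One such strategy, caching embeddings so repeated texts never trigger a second API call, can be sketched as follows; `embed_text` is a stand-in for the real embedding client, and the fake embedding it returns is purely illustrative:

```python
from functools import lru_cache

API_CALLS = 0  # counts how often the "API" is actually hit

@lru_cache(maxsize=10_000)
def embed_text(text: str) -> tuple[float, ...]:
    global API_CALLS
    API_CALLS += 1
    # Placeholder: a real implementation would call the embedding API here.
    # Tuples are used because lru_cache requires hashable return reuse.
    return tuple(float(ord(c)) for c in text[:8])

embed_text("refund policy")
embed_text("refund policy")  # identical input: served from cache
```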
Set up billing alerts
Configure alerts in the OpenAI dashboard:
- Daily spend threshold
- Monthly budget limit
- Unusual usage patterns
Security
From the architecture documentation:

Security & Safety
- Never implicitly trust user input; tests for prompt injection are part of the evaluation suite.
- Store secrets only in environment variables; never expose system prompts.
- Implement explicit refusal behavior for out-of-scope or unsupported requests.
Security checklist
Secrets management
Never store secrets in code or Docker images.

✅ Use:
- AWS Secrets Manager / Parameter Store
- GCP Secret Manager
- Azure Key Vault
- Kubernetes Secrets
- HashiCorp Vault
❌ Never:

- Commit `.env` to git
- Hardcode API keys
- Copy `.env` into Docker images
Input validation
Validate all user inputs
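A minimal sketch of what validating a query string could look like; the length limit and rules are made-up defaults, not values from the actual codebase (FastAPI deployments would typically express these constraints in Pydantic models):

```python
MAX_QUERY_LEN = 2000  # illustrative limit, not a project constant

def validate_query(q: str) -> str:
    """Reject empty, non-string, or oversized queries before retrieval."""
    if not isinstance(q, str):
        raise ValueError("query must be a string")
    q = q.strip()
    if not q:
        raise ValueError("query must not be empty")
    if len(q) > MAX_QUERY_LEN:
        raise ValueError(f"query longer than {MAX_QUERY_LEN} characters")
    return q
```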
Rate limiting
Implement rate limiting per client
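Per-client rate limiting is often implemented as a token bucket; the sketch below keeps state in-process for illustration, whereas production deployments would usually rely on an API gateway or a shared store such as Redis. The rate and capacity values are placeholders:

```python
import time

class TokenBucket:
    """Allow bursts up to `capacity`, refilling `rate` tokens per second."""

    def __init__(self, rate: float, capacity: int) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

bucket = TokenBucket(rate=1.0, capacity=2)
results = [bucket.allow(), bucket.allow(), bucket.allow()]  # third call exceeds the burst
```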
Network security
Restrict network access
- Use VPC/network policies to isolate services
- Enable HTTPS/TLS for all external traffic
- Use API gateway for authentication
- Allowlist IP addresses where possible
Prompt injection prevention
The system includes adversarial evaluation tests. From `src/rag/evals.py`:

- Tests for prompt injection attempts
- Out-of-scope query detection
- Ambiguous input handling
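As a complement to the evaluation suite, a first-line defense can be a naive keyword heuristic like the one below; this is only a sketch of the idea, the marker phrases are illustrative, and real injection attempts are far more varied:

```python
# Phrases commonly seen in prompt-injection attempts (illustrative list).
INJECTION_MARKERS = (
    "ignore previous instructions",
    "disregard the system prompt",
    "reveal your system prompt",
)

def looks_like_injection(user_input: str) -> bool:
    """Flag inputs containing known injection phrasing, case-insensitively."""
    lowered = user_input.lower()
    return any(marker in lowered for marker in INJECTION_MARKERS)
```

Flagged inputs could be routed to refusal behavior or to `needs_human_review` rather than silently rejected.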
Scaling strategy
From the architecture documentation:

Scaling Strategy
- Application: stateless FastAPI services, horizontally scalable via containers (Docker/Kubernetes).
- Knowledge store: migrate to managed vector DBs as data grows.
- Background jobs: evaluations and retraining run in separate jobs to keep API responsive.
- Async: Use `async def` and `await` so I/O does not block the event loop.
Horizontal scaling
The FastAPI application is stateless and can be scaled horizontally by adding replicas behind a load balancer.

Vector database scaling
Current: Local Chroma instance (`./chroma_db`)
Production options:
- Chroma Cloud: managed Chroma with auto-scaling
- Pinecone: serverless vector database
- Weaviate: self-hosted or cloud, supports multi-tenancy
- Qdrant: high-performance vector search
Background jobs
Separate compute for:

- Model training (`src/ml/train.py`)
- Batch predictions (`src/ml/predict.py`)
- Offline evaluations (`src/rag/evals.py`)
Troubleshooting
High latency
Symptoms: API responses taking >5 seconds

Diagnosis:
- Check LLM generation time
- Check vector search latency
- Check network latency to OpenAI
Solutions:

- Reduce `top_k` retrieval count
- Use a faster embedding model
- Enable caching
- Use streaming responses for LLM
Memory leaks
Symptoms: Container memory usage increasing over time

Solutions:
- Clear LRU caches periodically
- Limit in-memory vector store size
- Trigger garbage collection explicitly after large batch jobs
API cost spikes
Symptoms: Unexpected OpenAI billing charges

Diagnosis:
- Check OpenAI usage dashboard
- Review application logs for token counts
- Look for retry loops or infinite recursion
Solutions:

- Implement token usage logging
- Set hard limits on `max_tokens`
- Enable caching for repeated queries
- Use smaller models for non-critical tasks
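The token-usage logging above could start as simply as the sketch below; the per-1k-token price is a placeholder, not real OpenAI pricing, and a production version would persist entries rather than keep them in memory:

```python
usage_log: list[dict] = []  # in-memory stand-in for a real metrics sink

def record_usage(model: str, prompt_tokens: int, completion_tokens: int,
                 price_per_1k: float) -> float:
    """Record one LLM call's token counts and estimated cost."""
    cost = (prompt_tokens + completion_tokens) / 1000 * price_per_1k
    usage_log.append({
        "model": model,
        "tokens": prompt_tokens + completion_tokens,
        "cost": cost,
    })
    return cost

cost = record_usage("small-model", prompt_tokens=800, completion_tokens=200,
                    price_per_1k=0.002)
```

Aggregating these entries per day/week feeds directly into the "Total API spend" metric and makes retry loops show up as obvious spikes.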
Deployment checklist
Pre-deployment
- Environment variables configured in secrets manager
- Resource limits set (CPU, memory)
- Health checks configured
- Logging and monitoring enabled
- Rate limiting configured
- TLS/HTTPS enabled
Deployment
- Deploy to staging environment first
- Run smoke tests
- Verify health endpoint
- Check logs for errors
- Test sample queries
Post-deployment
- Monitor error rates
- Track API latency
- Review cost metrics
- Set up alerting rules
- Schedule offline evaluations
Next steps
- Evaluation: set up the offline evaluation pipeline
- Training models: retrain triage models with production data
- Docker deployment: deploy with Docker and Docker Compose
- Environment variables: configure secrets and settings