Overview
Iqra AI is designed for horizontal scaling from the ground up. The architecture separates concerns among Proxy servers (media handling), Backend servers (logic processing), and Background services (async tasks), allowing you to scale each component independently based on workload characteristics. This guide covers strategies for scaling from hundreds to thousands of concurrent sessions.

Scaling architecture
Component responsibilities
Understanding each component’s role is critical for effective scaling:
Proxy servers
Primary function: WebRTC/SIP media streaming and RTP packet handling

Resource profile:
- High network I/O (audio streaming)
- Moderate CPU (codec processing)
- Low memory per connection
- Stateful (maintains WebRTC peer connections)
- Linear scaling with connection count
- Network bandwidth is typically the bottleneck
- Plan for 100-200 concurrent connections per server
Backend servers
Primary function: Agent logic, LLM integration, and business rules

Resource profile:
- High CPU (LLM inference, script execution)
- High memory (conversation context, state management)
- Moderate network I/O (API calls to LLM providers)
- Stateful (maintains active session state)
- Scales with complexity of agent logic
- Memory increases with conversation context size
- Plan for 50-100 concurrent sessions per server
Background services
Primary function: Async processing, scheduled tasks, and cleanup

Resource profile:
- Variable CPU (depends on job type)
- Moderate memory
- Low network I/O
- Mostly stateless
- Can run as singleton or distributed
- Scale based on job queue depth
- Use work queue for distribution
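The work-queue distribution mentioned above can be sketched with a shared in-process queue; in production the queue would typically be an external broker, and the job shape here is purely illustrative:

```python
# Hedged sketch: workers pull jobs from a shared queue, so throughput
# scales by adding workers as queue depth grows.
import queue
import threading

jobs = queue.Queue()
results = []
lock = threading.Lock()

def worker():
    while True:
        job = jobs.get()
        if job is None:           # sentinel: shut this worker down
            jobs.task_done()
            return
        with lock:
            results.append(f"done:{job}")  # stand-in for real job processing
        jobs.task_done()

workers = [threading.Thread(target=worker) for _ in range(4)]
for w in workers:
    w.start()
for j in range(10):               # enqueue 10 illustrative jobs
    jobs.put(j)
for _ in workers:                 # one sentinel per worker
    jobs.put(None)
jobs.join()                       # block until every job is processed
print(len(results))               # 10
```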
Capacity planning
Baseline requirements
Start with these baseline server specifications:

Proxy Server (100 concurrent connections):
- 4 vCPU
- 8 GB RAM
- 1 Gbps network
- 50 GB SSD storage

Backend Server:
- 8 vCPU
- 16 GB RAM
- 500 Mbps network
- 100 GB SSD storage

Background Services:
- 4 vCPU
- 8 GB RAM
- 100 Mbps network
- 100 GB SSD storage
Calculating required capacity
Determine how many servers you need:

Example calculation
For a deployment with 500 peak concurrent sessions, divide peak sessions by the per-server baselines above to get a server count for each tier, then add headroom for failover. These are conservative estimates; monitor actual resource utilization and adjust based on your specific agent complexity and conversation patterns.
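Using the conservative ends of the per-server baselines in this guide (100 connections per Proxy server, 50 sessions per Backend server), the arithmetic can be sketched as follows; the N+1 failover headroom is an assumption, not a documented requirement:

```python
# Hedged capacity sketch for 500 peak concurrent sessions.
import math

peak_sessions = 500
proxy_servers = math.ceil(peak_sessions / 100)    # conservative: 100 conns/server -> 5
backend_servers = math.ceil(peak_sessions / 50)   # conservative: 50 sessions/server -> 10

# N+1 headroom so losing one server doesn't push the rest over capacity
print(proxy_servers + 1, backend_servers + 1)     # 6 11
```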
Adding capacity
Adding servers to existing regions
Add servers to handle increased load.

Load testing new capacity
Before enabling a new server for production traffic:

Run load tests
Use your load testing tool to simulate traffic while the server is still in maintenance mode.
Auto-scaling strategies
Kubernetes horizontal pod autoscaling
Deploy Iqra AI on Kubernetes for automatic scaling:
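As a sketch (the Deployment name `iqra-backend` and the thresholds are assumptions, not documented values), a CPU-based HorizontalPodAutoscaler for the Backend tier might look like:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: iqra-backend-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: iqra-backend        # hypothetical Deployment name
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60   # matches the 30-60% CPU target range below
```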
Custom autoscaling based on session count
Implement application-aware autoscaling:
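A minimal sketch of session-count-based sizing, assuming the 50-sessions-per-Backend-server baseline above; the 70% utilization target and the floor of 3 replicas are illustrative choices:

```python
# Hedged sketch: derive a desired replica count from live session counts
# rather than CPU alone, keeping each server near a target utilization.
import math

def desired_replicas(active_sessions: int,
                     sessions_per_server: int = 50,
                     target_utilization: float = 0.7,
                     min_replicas: int = 3) -> int:
    """Size the fleet so each server runs at ~70% of its session capacity."""
    effective_capacity = sessions_per_server * target_utilization
    return max(min_replicas, math.ceil(active_sessions / effective_capacity))

print(desired_replicas(500))  # 500 / 35 -> 15
```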
Graceful draining
When removing capacity, drain sessions gracefully:
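The draining flow can be sketched as below; the `Server` class is a hypothetical stand-in for your server-management client, not a documented API:

```python
# Hedged sketch of graceful draining: put the server in maintenance mode so
# the load balancer stops routing new sessions to it, then wait for active
# sessions to finish before shutdown.
import time

class Server:
    """Hypothetical stand-in for a real server-management client."""
    def __init__(self, active_sessions: int):
        self.maintenance = False
        self._active = active_sessions

    def set_maintenance_mode(self, on: bool) -> None:
        self.maintenance = on

    def active_session_count(self) -> int:
        if self.maintenance and self._active:
            self._active -= 1          # simulate sessions winding down
        return self._active

def drain(server: Server, poll_interval: float = 0.01, timeout: float = 5.0) -> bool:
    server.set_maintenance_mode(True)  # no new sessions routed here
    deadline = time.monotonic() + timeout
    while server.active_session_count() > 0 and time.monotonic() < deadline:
        time.sleep(poll_interval)      # poll until sessions complete
    return server.active_session_count() == 0

print(drain(Server(active_sessions=3)))  # True
```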
Database scaling
MongoDB scaling
Iqra AI uses MongoDB for persistent storage. Scale your database infrastructure:
Replica sets
Use MongoDB replica sets for high availability.

Configure the connection string:
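As an illustration (hostnames, the `iqra` database name, and the replica-set name `rs0` are assumptions), a replica-set connection string might look like:

```
mongodb://mongo1.internal:27017,mongo2.internal:27017,mongo3.internal:27017/iqra?replicaSet=rs0&w=majority
```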
Sharding
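If write volume outgrows a single replica set, sharding distributes it across shards. A minimal mongosh sketch, where the `iqra` database, `sessions` collection, and hashed `sessionId` shard key are assumptions:

```javascript
// Enable sharding for the database, then shard the sessions collection
// on a hashed session ID so writes spread evenly across shards.
sh.enableSharding("iqra")
sh.shardCollection("iqra.sessions", { sessionId: "hashed" })
```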
Read preference
Optimize read performance by routing reads to secondary replicas. This distributes read load across replicas while maintaining consistency.
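For example, read preference can be set through standard MongoDB connection-string options (hostnames are placeholders):

```
mongodb://mongo1.internal:27017,mongo2.internal:27017,mongo3.internal:27017/iqra?replicaSet=rs0&readPreference=secondaryPreferred&maxStalenessSeconds=120
```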
Redis scaling
Redis handles real-time metrics and session state:
Redis Cluster
For high throughput, use Redis Cluster:
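For instance (addresses are placeholders), a six-node cluster with one replica per primary can be created with redis-cli:

```shell
# Create a Redis Cluster: 3 primaries + 3 replicas
redis-cli --cluster create \
  10.0.1.1:6379 10.0.1.2:6379 10.0.1.3:6379 \
  10.0.1.4:6379 10.0.1.5:6379 10.0.1.6:6379 \
  --cluster-replicas 1
```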
Redis Sentinel
For high availability without sharding:
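A minimal sentinel.conf sketch; the primary name `iqra-redis`, its address, and the quorum of 2 are assumptions:

```
# Monitor the primary; 2 sentinels must agree before declaring it down
sentinel monitor iqra-redis 10.0.2.1 6379 2
sentinel down-after-milliseconds iqra-redis 5000
sentinel failover-timeout iqra-redis 60000
```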
Network optimization
Load balancing
Use application-aware load balancing:

For HTTP/WebSocket traffic:
- Use a layer 7 load balancer (ALB, NGINX, HAProxy)
- Sticky sessions based on session ID
- Health check endpoints monitoring runtime status

For media (RTP) traffic:
- Use a layer 4 load balancer (NLB)
- UDP support for RTP
- Preserve source IP for geo-routing
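As a layer-7 sketch, NGINX can pin each session to a backend via consistent hashing on a session identifier; the upstream name, backend hosts, and `X-Session-Id` header are assumptions:

```
upstream iqra_backend {
    # Route each session ID to the same backend (sticky sessions)
    hash $http_x_session_id consistent;
    server backend1.internal:8080;
    server backend2.internal:8080;
}
```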
CDN for static assets
Offload static content delivery:
- Dashboard UI assets → CloudFront/Cloudflare
- Agent avatar images → CDN
- Shared media files → CDN
Monitoring scaling effectiveness
Track these metrics to validate scaling decisions:

Utilization metrics
Target ranges
Optimal operation:
- Capacity utilization: 40-70%
- CPU usage: 30-60%
- Memory usage: 40-70%
- Session distribution: Low standard deviation (balanced load)
Best practices
Do’s
- Scale proactively - Add capacity before you hit limits, not after
- Test at scale - Load test with realistic traffic patterns
- Monitor trends - Track growth rates to predict future capacity needs
- Document baselines - Record performance characteristics at different load levels
- Use infrastructure as code - Automate server provisioning for rapid scaling
Don’ts
- Don’t scale down aggressively - Be conservative removing capacity
- Don’t ignore database scaling - Application servers aren’t the only bottleneck
- Don’t forget network limits - Check NIC throughput limits
- Don’t scale without monitoring - Ensure metrics are flowing before scaling decisions
- Don’t mix workload types - Keep Proxy and Backend servers separate
Troubleshooting
New servers not receiving traffic
Check:
- Server is enabled (not in maintenance mode or disabled)
- Server is reporting healthy status to metrics system
- Load balancer health checks are passing
- Firewall rules allow inbound connections
- DNS/service discovery has updated
Unbalanced load distribution
Causes:
- Sticky sessions with long-lived connections
- Some servers in degraded state
- Heterogeneous hardware (different server specs)
- Load balancer algorithm (switch to least-connections)
Database becomes bottleneck
Solutions:
- Add read replicas for read-heavy workloads
- Enable query result caching in application
- Optimize slow queries (use database profiler)
- Implement connection pooling
- Consider MongoDB sharding for write-heavy workloads
Next steps
Multi-region
Deploy across multiple geographic regions
Monitoring
Set up comprehensive observability