Overview
Iqra AI provides a comprehensive metrics and monitoring system that tracks the health and performance of all infrastructure components in real-time. The system collects hardware metrics, application status, and session data to give you full visibility into your deployment. All metrics are stored in Redis for real-time access and MongoDB for historical analysis.
Architecture
Metrics collection
The monitoring system consists of three layers:
- Hardware monitoring - CPU, memory, and network utilization per server
- Application monitoring - Runtime status, session counts, queue depths
- Historical tracking - Time-series data for trend analysis
The metrics system is implemented in IqraInfrastructure/Managers/Server/Metrics/ServerMetricsManager.cs:8 and uses platform-specific hardware monitors for Linux and Windows.
Data flow
Server status data
Every node in the Iqra AI infrastructure reports standardized metrics.
Base metrics
All server types report these core metrics.
Backend server metrics
Backend nodes report additional session tracking.
Proxy server metrics
Proxy nodes track queue processing.
Runtime status
Servers report their current operational state:
Starting
The node is initializing and not yet ready to handle traffic. This is the initial state when a server boots.
Healthy
The node is fully operational and accepting new sessions. All health checks are passing.
Degraded
The node is operational but experiencing issues (high latency, elevated error rates, resource constraints). Consider investigating.
Draining
The node is gracefully shutting down. It’s completing existing sessions but not accepting new ones.
Offline
The node has stopped reporting metrics and is considered unavailable.
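The five states above can be modeled as an enumeration. The following is a minimal Python sketch, not the project's own (C#) implementation: the names `RuntimeStatus`, `accepts_new_sessions`, and `health_check_http_code` are hypothetical, and treating Degraded nodes as still eligible for new sessions is an assumption based on the text calling them "operational".

```python
from enum import Enum


class RuntimeStatus(Enum):
    """The five operational states described above."""
    STARTING = "Starting"
    HEALTHY = "Healthy"
    DEGRADED = "Degraded"
    DRAINING = "Draining"
    OFFLINE = "Offline"


# Assumption: Degraded nodes stay in rotation because they are still
# operational. Tighten this set if your routing policy differs.
_ACCEPTS_NEW_SESSIONS = {RuntimeStatus.HEALTHY, RuntimeStatus.DEGRADED}


def accepts_new_sessions(status: RuntimeStatus) -> bool:
    """Return True if a node in this state should receive new sessions."""
    return status in _ACCEPTS_NEW_SESSIONS


def health_check_http_code(status: RuntimeStatus) -> int:
    """Map a state to a load-balancer response: 200 keeps the node in
    rotation, 503 takes it out."""
    return 200 if accepts_new_sessions(status) else 503
```

A Draining node, for example, maps to 503 so that a load balancer stops routing new sessions to it while existing sessions finish.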
Querying metrics
Get all active nodes
Retrieve the current status of all infrastructure nodes.
Get specific server status
Query the status of a specific server by region and node ID.
Check node availability
Verify if specific node types are running.
Hardware metrics
Iqra AI monitors system resources using platform-specific implementations.
Linux monitoring
On Linux systems, metrics are collected from the /proc filesystem:
- CPU usage: Calculated from /proc/stat delta measurements
- Memory usage: Read from /proc/meminfo (used vs total)
- Network throughput: Measured from /proc/net/dev byte counters
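To illustrate what a "/proc/stat delta measurement" involves, here is a small Python sketch. It is not the project's Linux monitor: the function names are hypothetical, and counting iowait as idle time is a common convention assumed here rather than something the source confirms.

```python
from typing import Tuple


def parse_cpu_line(line: str) -> Tuple[int, int]:
    """Return (idle_ticks, total_ticks) from the aggregate 'cpu' line."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line of /proc/stat")
    # Columns: user nice system idle iowait irq softirq steal.
    # guest and guest_nice follow but are already counted in user and nice.
    values = [int(v) for v in fields[1:9]]
    if len(values) < 4:
        raise ValueError("cpu line has too few columns")
    iowait = values[4] if len(values) > 4 else 0
    idle = values[3] + iowait  # assumption: iowait counts as idle
    return idle, sum(values)


def cpu_usage_percent(before: str, after: str) -> float:
    """Compute CPU usage between two samples of the 'cpu' line."""
    idle_a, total_a = parse_cpu_line(before)
    idle_b, total_b = parse_cpu_line(after)
    total_delta = total_b - total_a
    if total_delta <= 0:
        return 0.0  # no ticks elapsed between the samples
    return 100.0 * (1.0 - (idle_b - idle_a) / total_delta)


def read_cpu_line(path: str = "/proc/stat") -> str:
    """Read the first line of /proc/stat (Linux only)."""
    with open(path) as f:
        return f.readline()
```

On a Linux host, calling `read_cpu_line()` twice with a short sleep in between and passing both results to `cpu_usage_percent` gives the usage over that interval. A single sample is not enough, because the counters are cumulative since boot.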
Windows monitoring
On Windows systems, metrics use Performance Counters:
- CPU usage: Processor(_Total)\% Processor Time
- Memory usage: Memory\% Committed Bytes In Use
- Network throughput: Sum of all network interface bytes/sec
Metrics publishing
The ServerMetricsMonitor automatically publishes metrics at regular intervals:
- Collects current hardware metrics from the platform monitor
- Updates the in-memory status object
- Publishes to Redis for real-time access
- Records to MongoDB every 1 minute for historical tracking
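The four steps above can be sketched as a single publishing cycle. This Python sketch is illustrative only and does not reproduce the actual ServerMetricsMonitor: a plain `dict` stands in for Redis, a `list` stands in for MongoDB, and the class name, field names, and the `status:{region}:{node_id}` key format are all assumptions.

```python
import json
import time
from dataclasses import asdict, dataclass


@dataclass
class ServerStatus:
    """Hypothetical in-memory status object for one node."""
    region: str
    node_id: str
    cpu_percent: float = 0.0
    memory_percent: float = 0.0
    updated_at: float = 0.0


class MetricsPublisher:
    def __init__(self, status, read_hardware, realtime_store, history_store,
                 history_interval=60.0, clock=time.time):
        self.status = status
        self._read_hardware = read_hardware  # returns (cpu %, memory %)
        self._realtime = realtime_store      # dict standing in for Redis
        self._history = history_store        # list standing in for MongoDB
        self._history_interval = history_interval
        self._clock = clock
        self._last_history = None

    def tick(self) -> None:
        """Run one publishing cycle."""
        now = self._clock()
        # 1. Collect current hardware metrics from the platform monitor.
        cpu, memory = self._read_hardware()
        # 2. Update the in-memory status object.
        self.status.cpu_percent = cpu
        self.status.memory_percent = memory
        self.status.updated_at = now
        snapshot = asdict(self.status)
        # 3. Publish the latest snapshot for real-time access.
        key = f"status:{self.status.region}:{self.status.node_id}"
        self._realtime[key] = json.dumps(snapshot)
        # 4. Record a historical sample at most once per interval.
        due = (self._last_history is None
               or now - self._last_history >= self._history_interval)
        if due:
            self._history.append(snapshot)
            self._last_history = now
```

Separating the real-time write (every cycle) from the historical write (once per interval) keeps the historical store small while the real-time view stays fresh.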
Setting runtime status
Applications should update their runtime status based on operational state.
Building dashboards
Real-time capacity monitoring
Build a dashboard showing regional capacity.
Health check endpoint
Implement a health check endpoint for load balancers.
Alerting strategies
Critical alerts
Set up alerts for conditions requiring immediate attention.
Warning alerts
Set up warnings for conditions that may require investigation:
- CPU usage > 70% for more than 10 minutes
- Memory usage > 75% for more than 10 minutes
- Regional capacity utilization > 60%
- Network throughput exceeding expected baseline by 50%
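Rules of the form "above X for more than N minutes" need state that tracks when the breach began. A minimal Python sketch of that pattern follows; `SustainedThreshold` is a hypothetical helper, not part of Iqra AI, and most alerting tools provide an equivalent built in.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SustainedThreshold:
    """Fires only when a value stays above `threshold` for longer than
    `duration_seconds` without dropping back."""
    threshold: float
    duration_seconds: float
    _breach_started: Optional[float] = field(default=None, init=False, repr=False)

    def observe(self, value: float, timestamp: float) -> bool:
        """Feed one sample; return True while the alert condition holds."""
        if value <= self.threshold:
            self._breach_started = None  # breach ended, reset the timer
            return False
        if self._breach_started is None:
            self._breach_started = timestamp
        return timestamp - self._breach_started > self.duration_seconds


# The CPU and memory warnings from the list above (10 minutes = 600 s).
cpu_warning = SustainedThreshold(threshold=70.0, duration_seconds=600)
memory_warning = SustainedThreshold(threshold=75.0, duration_seconds=600)
```

Resetting the timer on any sample at or below the threshold means brief spikes never fire the warning, which is the point of the duration clause.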
Metrics retention
Plan your metrics retention based on compliance and analysis needs:
| Time Range | Resolution | Use Case |
|---|---|---|
| Last 24 hours | 1 minute | Real-time troubleshooting |
| Last 7 days | 5 minutes | Recent trend analysis |
| Last 30 days | 1 hour | Capacity planning |
| Last 12 months | 1 day | Long-term trends |
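One way to apply the table is to look up the target resolution by sample age and then average older samples into buckets of that width. The Python sketch below makes several assumptions: 12 months is approximated as 365 days, averaging is the chosen rollup, and both function names are hypothetical. In production this rollup would more likely run inside the database.

```python
from datetime import timedelta

# (maximum sample age, resolution) pairs taken from the table above.
RETENTION_TIERS = [
    (timedelta(hours=24), timedelta(minutes=1)),
    (timedelta(days=7), timedelta(minutes=5)),
    (timedelta(days=30), timedelta(hours=1)),
    (timedelta(days=365), timedelta(days=1)),  # assumption: 12 months ~ 365 days
]


def resolution_for_age(age: timedelta):
    """Return the resolution to keep for a sample of this age, or None
    if the sample is older than every tier."""
    for max_age, resolution in RETENTION_TIERS:
        if age <= max_age:
            return resolution
    return None


def downsample(samples, resolution_seconds):
    """Average (timestamp, value) samples into fixed-width buckets.

    Returns a list of (bucket_start, mean_value) sorted by bucket start.
    """
    buckets = {}
    for timestamp, value in samples:
        start = int(timestamp // resolution_seconds) * resolution_seconds
        buckets.setdefault(start, []).append(value)
    return [(start, sum(vals) / len(vals))
            for start, vals in sorted(buckets.items())]
```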
Best practices
Metric collection
- Keep intervals consistent - Use the default 1-minute interval for historical recording unless you have specific requirements
- Monitor the monitors - Set up alerts if the metrics system itself stops reporting
- Use tags consistently - Always include region and node identifiers in queries
Performance
- Cache active node lists - Don’t query all active nodes on every request; cache for 5-10 seconds
- Aggregate in the database - Use MongoDB aggregation pipelines for historical analysis
- Limit real-time queries - Only query specific nodes when needed; use the map view for bulk access
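The "cache active node lists" advice can be sketched as a small time-based cache. This Python example is a generic pattern rather than an Iqra AI API: `TtlCache` and the `loader` callback are hypothetical, with `loader` standing in for whatever call fetches the active node list.

```python
import time
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class TtlCache(Generic[T]):
    """Caches the result of `loader` and refreshes it after `ttl_seconds`."""

    def __init__(self, loader: Callable[[], T], ttl_seconds: float = 5.0,
                 clock: Callable[[], float] = time.monotonic):
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Optional[T] = None
        self._expires_at: Optional[float] = None

    def get(self) -> T:
        """Return the cached value, reloading it if it has expired."""
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            self._value = self._loader()
            self._expires_at = now + self._ttl
        return self._value
```

With a 5-second TTL, a request handler that calls `get()` on every request triggers at most one underlying query every 5 seconds, at the cost of a node list that may be up to 5 seconds stale.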
Troubleshooting
Metrics not appearing
Verify:
- Redis is accessible and running
- ServerMetricsMonitor service is initialized
- Hardware monitor is supported on the platform
- No exceptions in application logs
Stale metrics
Check:
- Network connectivity between nodes and Redis
- Clock synchronization (NTP) across servers
- ServerMetricsMonitorService is running
High memory usage from metrics
Redis stores only current status; historical data is in MongoDB. If Redis memory is high:
- Verify nodes are cleaning up status on shutdown
- Check for zombie node entries in Redis
- Implement TTL on Redis keys (30 seconds recommended)
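To show why a TTL removes zombie entries, the Python sketch below models expiring keys in memory. It is a simulation, not Redis client code: `ExpiringStatusStore` is hypothetical. With a real Redis server, the equivalent is writing each status with an expiry (for example the `EX` option of the `SET` command) and refreshing it on every publish.

```python
import time


class ExpiringStatusStore:
    """In-memory model of status keys that expire unless refreshed."""

    def __init__(self, ttl_seconds: float = 30.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (value, expires_at)

    def publish(self, key: str, value: str) -> None:
        """Write a status and restart its expiry timer."""
        self._entries[key] = (value, self._clock() + self._ttl)

    def active_keys(self):
        """Drop expired entries and return the remaining keys, sorted."""
        now = self._clock()
        expired = [k for k, (_, expires_at) in self._entries.items()
                   if expires_at <= now]
        for key in expired:
            del self._entries[key]
        return sorted(self._entries)
```

A node that crashes without cleaning up simply stops refreshing its key, so the entry disappears after one TTL period. The TTL should be comfortably longer than the publish interval; otherwise healthy nodes will flicker out of the active list.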
Next steps
Multi-region
Learn about deploying across multiple regions
Scaling
Horizontal scaling strategies for high traffic