Monitoring and observability

Overview

Iqra AI's monitoring system tracks the health and performance of every infrastructure node in real time. It collects hardware metrics, application status, and session data so you can see the state of the whole deployment. Current metrics are stored in Redis for real-time access, and historical records are stored in MongoDB for analysis.

Architecture

Metrics collection

The monitoring system consists of three layers:

Hardware monitoring - CPU, memory, and network utilization per server
Application monitoring - Runtime status, session counts, queue depths
Historical tracking - Time-series data for trend analysis

The metrics system is implemented in IqraInfrastructure/Managers/Server/Metrics/ServerMetricsManager.cs:8 and uses platform-specific hardware monitors for Linux and Windows.

Data flow

[Node Hardware Monitor]
        ↓
[ServerMetricsMonitor]
        ↓
[Redis Live Status Channel] → Real-time dashboard
        ↓
[MongoDB Historical Store] → Analytics & alerts

Server status data

Every node in the Iqra AI infrastructure reports standardized metrics:

Base metrics

All server types report these core metrics:

public class ServerStatusData {
    public string NodeId { get; set; }
    public AppNodeTypeEnum Type { get; set; }
    public NodeRuntimeStatus RuntimeStatus { get; set; }
    public string RuntimeStatusReason { get; set; }
    public string Version { get; set; }
    public DateTime LastUpdated { get; set; }

    // Hardware metrics
    public double CpuUsagePercent { get; set; }
    public double MemoryUsagePercent { get; set; }
    public double NetworkDownloadMbps { get; set; }
    public double NetworkUploadMbps { get; set; }
}

Backend server metrics

Backend nodes report additional session tracking:

public class BackendServerStatusData : ServerStatusData {
    public string RegionId { get; set; }
    public int MaxConcurrentCallsCount { get; set; }
    public int CurrentActiveTelephonySessionCount { get; set; }
    public int CurrentActiveWebSessionCount { get; set; }
}

Proxy server metrics

Proxy nodes track queue processing:

public class ProxyServerStatusData : ServerStatusData {
    public string RegionId { get; set; }
    public int CurrentOutboundMarkedQueues { get; set; }
    public int CurrentOutboundProcessingMarkedQueues { get; set; }
    public int CurrentOutboundProcessedMarkedQueues { get; set; }
}

Runtime status

Servers report their current operational state:

Starting

The node is initializing and not yet ready to handle traffic. This is the initial state when a server boots.

Healthy

The node is fully operational and accepting new sessions. All health checks are passing.

Degraded

The node is operational but experiencing issues (high latency, elevated error rates, resource constraints). Consider investigating.

Draining

The node is gracefully shutting down. It’s completing existing sessions but not accepting new ones.

Offline

The node has stopped reporting metrics and is considered unavailable.
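
A scheduler or load balancer could turn these states into routing decisions. The following is a minimal sketch under stated assumptions: the enum members are assumed to match the state names above (the real `NodeRuntimeStatus` definition is not shown here), `RuntimeStatusPolicy` is a hypothetical helper, and continuing to route to Degraded nodes is a design choice of this sketch rather than documented behavior.

```csharp
using System;

// Assumed to mirror the states described above; the real enum may differ.
public enum NodeRuntimeStatus { Starting, Healthy, Degraded, Draining, Offline }

// Hypothetical helper, not part of Iqra AI.
public static class RuntimeStatusPolicy
{
    // True when a node in this state should be offered new sessions.
    public static bool AcceptsNewSessions(NodeRuntimeStatus status) => status switch
    {
        NodeRuntimeStatus.Healthy => true,
        // Assumption: a Degraded node is still operational, so keep routing to it.
        NodeRuntimeStatus.Degraded => true,
        // Starting, Draining, and Offline nodes get no new traffic.
        _ => false
    };

    // True when the node may still be serving existing sessions.
    public static bool MayHaveLiveSessions(NodeRuntimeStatus status) =>
        status is NodeRuntimeStatus.Healthy
               or NodeRuntimeStatus.Degraded
               or NodeRuntimeStatus.Draining;
}
```

For example, `RuntimeStatusPolicy.AcceptsNewSessions(NodeRuntimeStatus.Draining)` returns `false`, while `MayHaveLiveSessions` returns `true` for the same state, which matches the description of a node that is finishing existing sessions.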

Querying metrics

Get all active nodes

Retrieve the current status of all infrastructure nodes:

var serverMetricsManager = serviceProvider.GetRequiredService<ServerMetricsManager>();

// Get all active nodes
var activeNodes = await serverMetricsManager.GetAllActiveNodesAsync();

foreach (var node in activeNodes) {
    Console.WriteLine($"Node: {node.NodeId}");
    Console.WriteLine($"Type: {node.Type}");
    Console.WriteLine($"Status: {node.RuntimeStatus}");
    Console.WriteLine($"CPU: {node.CpuUsagePercent:F2}%");
    Console.WriteLine($"Memory: {node.MemoryUsagePercent:F2}%");
    Console.WriteLine();
}

Get specific server status

Query the status of a specific server by region and node ID:

var status = await serverMetricsManager.GetServerStatusData(
    regionId: "US-EAST",
    nodeId: "backend-us-east-1"
);

if (status is BackendServerStatusData backendStatus) {
    Console.WriteLine($"Active telephony sessions: {backendStatus.CurrentActiveTelephonySessionCount}");
    Console.WriteLine($"Active web sessions: {backendStatus.CurrentActiveWebSessionCount}");
    Console.WriteLine($"Total capacity: {backendStatus.MaxConcurrentCallsCount}");

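    // Guard against a zero capacity so the result is never NaN or Infinity
    var activeSessions = backendStatus.CurrentActiveTelephonySessionCount +
                         backendStatus.CurrentActiveWebSessionCount;
    var utilizationPercent = backendStatus.MaxConcurrentCallsCount > 0
        ? activeSessions / (double)backendStatus.MaxConcurrentCallsCount * 100
        : 0;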

    Console.WriteLine($"Capacity utilization: {utilizationPercent:F2}%");
}

Check node availability

Verify whether specific nodes or node types are running:

// Check whether a specific backend node is running
var backendRunning = await serverMetricsManager.CheckBackendNodeRunning(
    "US-EAST", "backend-us-east-1"
);

// Check whether a specific proxy node is running
var proxyRunning = await serverMetricsManager.CheckProxyNodeRunning(
    "US-EAST", "proxy-us-east-1"
);

// Check if background processing is running
var backgroundRunning = await serverMetricsManager.CheckAnyBackgroundNodeRunning();

// Count total worker nodes
var (anyWorkerRunning, workerCount) = await serverMetricsManager.AreAnyWorkerNodesRunningAndCount();
Console.WriteLine($"Worker nodes running: {workerCount}");

Hardware metrics

Iqra AI monitors system resources using platform-specific implementations:

Linux monitoring

On Linux systems, metrics are collected from the /proc filesystem:

CPU usage: Calculated from /proc/stat delta measurements
Memory usage: Read from /proc/meminfo (used vs total)
Network throughput: Measured from /proc/net/dev byte counters
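
To make the "delta measurements" idea concrete, here is a minimal sketch of the usual way CPU usage is derived from two readings of the aggregate `cpu` line in /proc/stat. It is not taken from `LinuxHardwareMonitor.cs`; the class and method names are hypothetical, and counting `iowait` as idle time is a common convention that the real monitor may or may not follow.

```csharp
using System;
using System.Globalization;
using System.Linq;

// Hypothetical helper illustrating the technique; not the project's implementation.
public static class ProcStatCpu
{
    // Parses the aggregate "cpu" line and returns cumulative (idle, total) jiffies.
    public static (long Idle, long Total) ParseCpuLine(string line)
    {
        var fields = line.Split(' ', StringSplitOptions.RemoveEmptyEntries);
        if (fields.Length < 5 || fields[0] != "cpu")
            throw new FormatException("Expected the aggregate 'cpu' line from /proc/stat.");

        // Columns: user nice system idle iowait irq softirq steal [guest guest_nice].
        // Guest time is already included in user/nice, so only the first eight are summed.
        long[] values = fields.Skip(1).Take(8)
            .Select(s => long.Parse(s, CultureInfo.InvariantCulture))
            .ToArray();

        long idle = values[3] + (values.Length > 4 ? values[4] : 0); // idle + iowait
        return (idle, values.Sum());
    }

    // Usage over the interval between two samples, as a percentage from 0 to 100.
    public static double UsagePercent(string earlierLine, string laterLine)
    {
        var earlier = ParseCpuLine(earlierLine);
        var later = ParseCpuLine(laterLine);

        long totalDelta = later.Total - earlier.Total;
        long idleDelta = later.Idle - earlier.Idle;
        if (totalDelta <= 0) return 0.0; // no time elapsed, or counters reset

        return (totalDelta - idleDelta) * 100.0 / totalDelta;
    }
}
```

In practice the two lines would come from reading the first line of /proc/stat twice, a short interval apart. The counters are cumulative since boot, which is why a single reading cannot give a current usage figure.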

Windows monitoring

On Windows systems, metrics are read from performance counters:

CPU usage: Processor(_Total)\% Processor Time
Memory usage: Memory\% Committed Bytes In Use
Network throughput: Sum of all network interface bytes/sec

The hardware monitoring implementation is located in IqraInfrastructure/Managers/Server/Metrics/Monitor/Hardware/ with separate LinuxHardwareMonitor.cs and WindowsHardwareMonitor.cs classes.
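
On both platforms, network throughput comes down to converting a change in byte counters into a rate. The sketch below shows that arithmetic. It assumes that the `Mbps` suffix in `NetworkDownloadMbps` and `NetworkUploadMbps` means megabits per second with 1 Mb = 1,000,000 bits, which the source does not state; the `Throughput` helper is hypothetical.

```csharp
using System;

// Hypothetical helper; the unit convention is an assumption (1 Mb = 1,000,000 bits).
public static class Throughput
{
    // Converts two cumulative byte-counter readings into megabits per second.
    public static double ToMbps(long earlierBytes, long laterBytes, TimeSpan elapsed)
    {
        if (elapsed <= TimeSpan.Zero) return 0.0;

        long deltaBytes = laterBytes - earlierBytes;
        if (deltaBytes < 0) return 0.0; // counter was reset or wrapped; skip this sample

        return deltaBytes * 8.0 / 1_000_000.0 / elapsed.TotalSeconds;
    }
}
```

For example, 1,250,000 bytes received over one second works out to 10 Mbps.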

Metrics publishing

The ServerMetricsMonitor automatically publishes metrics at regular intervals:

// Update and publish current metrics
await serverMetricsMonitor.UpdateAndPublishStatusAsync();

This operation:

Collects current hardware metrics from the platform monitor
Updates the in-memory status object
Publishes to Redis for real-time access
Records to MongoDB for historical tracking, at most once per minute

Historical metrics are recorded at 1-minute intervals to balance storage costs with data granularity. If you need higher resolution, adjust _historicalRecordInterval in ServerMetricsMonitor.cs:24.
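
The relationship between frequent publishing and once-per-minute historical recording can be pictured as a simple time gate. This is a sketch of the idea only: `HistoricalRecordGate` is a hypothetical class, and how `_historicalRecordInterval` is actually used inside `ServerMetricsMonitor` is an assumption.

```csharp
using System;

// Hypothetical sketch of interval gating; not the ServerMetricsMonitor implementation.
public sealed class HistoricalRecordGate
{
    private readonly TimeSpan _interval;
    private DateTime? _lastRecordedUtc;

    public HistoricalRecordGate(TimeSpan interval) => _interval = interval;

    // Returns true when a historical record is due at 'nowUtc' and marks it as written.
    public bool ShouldRecord(DateTime nowUtc)
    {
        if (_lastRecordedUtc is DateTime last && nowUtc - last < _interval)
            return false; // still inside the current interval: publish live status only

        _lastRecordedUtc = nowUtc;
        return true;
    }
}
```

With an interval of one minute, every publish call would still update the live status, but only the first call in each minute would also write a historical record. Shortening the interval raises resolution and storage use together.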

Setting runtime status

Applications should update their runtime status based on operational state:

serverMetricsMonitor.SetRuntimeStatus(
    NodeRuntimeStatus.Healthy,
    "All systems operational"
);

// During graceful shutdown
serverMetricsMonitor.SetRuntimeStatus(
    NodeRuntimeStatus.Draining,
    "Completing existing sessions before shutdown"
);

// If experiencing issues
serverMetricsMonitor.SetRuntimeStatus(
    NodeRuntimeStatus.Degraded,
    "Database connection pool exhausted"
);

Building dashboards

Real-time capacity monitoring

Build a dashboard showing regional capacity:

var activeNodes = await serverMetricsManager.GetAllActiveNodesAsync();

var regionStats = activeNodes
    .OfType<BackendServerStatusData>()
    .GroupBy(n => n.RegionId)
    .Select(g => new {
        RegionId = g.Key,
        TotalCapacity = g.Sum(n => n.MaxConcurrentCallsCount),
        ActiveSessions = g.Sum(n => n.CurrentActiveTelephonySessionCount +
                                    n.CurrentActiveWebSessionCount),
        NodeCount = g.Count(),
        AvgCpuUsage = g.Average(n => n.CpuUsagePercent),
        AvgMemoryUsage = g.Average(n => n.MemoryUsagePercent)
    });

foreach (var region in regionStats) {
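    // Guard against a zero capacity so the result is never NaN or Infinity
    var utilization = region.TotalCapacity > 0
        ? (region.ActiveSessions / (double)region.TotalCapacity) * 100
        : 0;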

    Console.WriteLine($"Region: {region.RegionId}");
    Console.WriteLine($"  Nodes: {region.NodeCount}");
    Console.WriteLine($"  Capacity: {region.ActiveSessions}/{region.TotalCapacity} ({utilization:F1}%)");
    Console.WriteLine($"  Avg CPU: {region.AvgCpuUsage:F1}%");
    Console.WriteLine($"  Avg Memory: {region.AvgMemoryUsage:F1}%");
    Console.WriteLine();
}

Health check endpoint

Implement a health check endpoint for load balancers:

app.MapGet("/health", async (ServerMetricsManager metricsManager) => {
    var status = await metricsManager.GetServerStatusData(
        Environment.GetEnvironmentVariable("REGION_ID"),
        Environment.GetEnvironmentVariable("NODE_ID")
    );

    if (status == null) {
        return Results.Problem("Metrics not available", statusCode: 503);
    }

    if (status.RuntimeStatus != NodeRuntimeStatus.Healthy) {
        return Results.Problem(
            status.RuntimeStatusReason,
            statusCode: 503
        );
    }

    // Check resource utilization
    if (status.CpuUsagePercent > 90 || status.MemoryUsagePercent > 90) {
        return Results.Problem(
            "Resource utilization critical",
            statusCode: 503
        );
    }

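    return Results.Ok(new {
        status = "healthy",
        version = status.Version,
        // Time since the node last published metrics (this is not process uptime)
        secondsSinceLastUpdate = (DateTime.UtcNow - status.LastUpdated).TotalSeconds
    });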
});

Alerting strategies

Critical alerts

Set up alerts for conditions requiring immediate attention:

Node offline

Alert when a node stops reporting metrics for more than 30 seconds.

Resource exhaustion

Alert when CPU or memory usage exceeds 90% for more than 5 minutes.

Capacity threshold

Alert when regional capacity utilization exceeds 80%.

Runtime status degraded

Alert when any node enters Degraded or Draining state unexpectedly.
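
Two of these conditions, a silent node and a region over its capacity threshold, can be evaluated directly from status snapshots. The sketch below is one possible shape for that check. `NodeSnapshot` and `CriticalAlerts` are hypothetical names; the snapshot carries only the fields needed here, loosely mirroring `BackendServerStatusData`, and timestamps are assumed to be UTC.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, trimmed-down view of a backend node's status.
public record NodeSnapshot(
    string NodeId, string RegionId, DateTime LastUpdatedUtc, int MaxCalls, int ActiveSessions);

// Hypothetical helper; thresholds are passed in rather than hard-coded.
public static class CriticalAlerts
{
    // Node IDs that have not reported within 'maxSilence' of 'nowUtc'.
    public static IEnumerable<string> OfflineNodes(
        IEnumerable<NodeSnapshot> nodes, DateTime nowUtc, TimeSpan maxSilence) =>
        nodes.Where(n => nowUtc - n.LastUpdatedUtc > maxSilence)
             .Select(n => n.NodeId);

    // Region IDs whose combined utilization exceeds 'thresholdPercent'.
    public static IEnumerable<string> RegionsOverCapacity(
        IEnumerable<NodeSnapshot> nodes, double thresholdPercent) =>
        nodes.GroupBy(n => n.RegionId)
             // Skip regions with no capacity to avoid dividing by zero.
             .Where(g => g.Sum(n => n.MaxCalls) > 0)
             .Where(g => 100.0 * g.Sum(n => n.ActiveSessions) / g.Sum(n => n.MaxCalls)
                         > thresholdPercent)
             .Select(g => g.Key);
}
```

Calling `OfflineNodes(nodes, DateTime.UtcNow, TimeSpan.FromSeconds(30))` and `RegionsOverCapacity(nodes, 80)` corresponds to the 30-second and 80% thresholds above.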

Warning alerts

Set up warnings for conditions that may require investigation:

CPU usage > 70% for more than 10 minutes
Memory usage > 75% for more than 10 minutes
Regional capacity utilization > 60%
Network throughput exceeding expected baseline by 50%
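
Conditions of the form "above X for more than N minutes" need a little state, because a single high sample should not fire the alert. A minimal sketch of such a tracker follows; `SustainedThreshold` is a hypothetical class, and it assumes samples arrive in time order.

```csharp
using System;

// Hypothetical helper: fires only after a value stays above a threshold long enough.
public sealed class SustainedThreshold
{
    private readonly double _threshold;
    private readonly TimeSpan _minDuration;
    private DateTime? _breachStartUtc; // when the current run above the threshold began

    public SustainedThreshold(double threshold, TimeSpan minDuration)
    {
        _threshold = threshold;
        _minDuration = minDuration;
    }

    // Feed one sample. Returns true once the value has stayed above the
    // threshold for longer than the minimum duration.
    public bool Observe(double value, DateTime sampleTimeUtc)
    {
        if (value <= _threshold)
        {
            _breachStartUtc = null; // back to normal: reset the run
            return false;
        }

        _breachStartUtc ??= sampleTimeUtc;
        return sampleTimeUtc - _breachStartUtc.Value > _minDuration;
    }
}
```

A CPU warning would use `new SustainedThreshold(70, TimeSpan.FromMinutes(10))` and feed it each node's `CpuUsagePercent` as new status data arrives, with one tracker per node and metric.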

Metrics retention

Plan your metrics retention based on compliance and analysis needs:

Time Range	Resolution	Use Case
Last 24 hours	1 minute	Real-time troubleshooting
Last 7 days	5 minutes	Recent trend analysis
Last 30 days	1 hour	Capacity planning
Last 12 months	1 day	Long-term trends

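Implement a MongoDB TTL index to automatically expire old historical records based on your retention policy. A TTL index only deletes documents; it does not reduce their resolution, so to keep older data at the coarser resolutions above, roll it up into a separate collection before it expires.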
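
The coarser tiers in the table imply a roll-up step that averages fine-grained samples into larger time buckets. The sketch below shows that step in memory for a single metric. `MetricSample` and `Downsampler` are hypothetical names, and a production version would more likely run as a MongoDB aggregation than in application code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sample type carrying one metric.
public record MetricSample(DateTime TimestampUtc, double CpuUsagePercent);

// Hypothetical helper that averages samples into fixed-size time buckets.
public static class Downsampler
{
    public static List<MetricSample> Average(IEnumerable<MetricSample> samples, TimeSpan bucket)
    {
        if (bucket <= TimeSpan.Zero)
            throw new ArgumentOutOfRangeException(nameof(bucket));

        return samples
            // Integer division assigns every sample to the bucket containing it.
            .GroupBy(s => s.TimestampUtc.Ticks / bucket.Ticks)
            .OrderBy(g => g.Key)
            .Select(g => new MetricSample(
                new DateTime(g.Key * bucket.Ticks, DateTimeKind.Utc), // bucket start
                g.Average(s => s.CpuUsagePercent)))
            .ToList();
    }
}
```

For instance, `Downsampler.Average(samples, TimeSpan.FromMinutes(5))` turns 1-minute samples into the 5-minute resolution listed for the last 7 days.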

Best practices

Metric collection

Keep intervals consistent - Use the default 1-minute interval for historical recording unless you have specific requirements
Monitor the monitors - Set up alerts if the metrics system itself stops reporting
Use tags consistently - Always include region and node identifiers in queries

Performance

Cache active node lists - Don’t query all active nodes on every request; cache for 5-10 seconds
Aggregate in the database - Use MongoDB aggregation pipelines for historical analysis
Limit real-time queries - Only query specific nodes when needed; use the map view for bulk access
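
The caching advice can be sketched as a small time-based cache that reloads its value only after a fixed lifetime has passed. `TimedCache<T>` is a hypothetical class written for illustration. It is synchronous for brevity, so wrapping an async call such as `GetAllActiveNodesAsync` would need an async variant, for example one guarded by `SemaphoreSlim`.

```csharp
#nullable enable
using System;

// Hypothetical helper: caches one value and reloads it after 'ttl' has elapsed.
public sealed class TimedCache<T>
{
    private readonly TimeSpan _ttl;
    private readonly Func<DateTime> _clock;
    private readonly object _lock = new object();
    private T _value = default!;
    private DateTime _expiresUtc = DateTime.MinValue; // forces a load on first use

    // 'clock' is injectable so the expiry logic can be tested without waiting.
    public TimedCache(TimeSpan ttl, Func<DateTime>? clock = null)
    {
        _ttl = ttl;
        _clock = clock ?? (() => DateTime.UtcNow);
    }

    // Returns the cached value, calling 'load' only when the entry has expired.
    public T GetOrRefresh(Func<T> load)
    {
        lock (_lock)
        {
            var now = _clock();
            if (now >= _expiresUtc)
            {
                _value = load();
                _expiresUtc = now + _ttl;
            }
            return _value;
        }
    }
}
```

A lifetime of 5 to 10 seconds, as suggested above, keeps dashboards close to real time while sparing Redis a full scan on every request.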

Troubleshooting

Metrics not appearing

Verify:

Redis is accessible and running
ServerMetricsMonitor service is initialized
Hardware monitor is supported on the platform
No exceptions in application logs

Stale metrics

Check:

Network connectivity between nodes and Redis
Clock synchronization (NTP) across servers
ServerMetricsMonitorService is running

High memory usage from metrics

Redis stores only current status; historical data is in MongoDB. If Redis memory is high:

Verify nodes are cleaning up status on shutdown
Check for zombie node entries in Redis
Implement TTL on Redis keys (30 seconds recommended)

Next steps

Multi-region

Learn about deploying across multiple regions

Scaling

Horizontal scaling strategies for high traffic
