
Infrastructure Monitoring

Infrastructure monitoring provides visibility into the health, performance, and availability of all systems within the SOC environment. This layer uses Zabbix and Prometheus to maintain operational reliability and to surface performance anomalies that may indicate a security issue.
The two tools are complementary: Zabbix covers traditional infrastructure monitoring (hosts, network devices, agents), and Prometheus covers cloud-native metrics and alerting.

Architecture Overview

Zabbix

Enterprise infrastructure monitoring for availability and performance

Prometheus

Time-series metrics collection and alerting for modern infrastructure

Zabbix Monitoring Platform

Core Capabilities

Zabbix provides comprehensive monitoring for enterprise infrastructure:
Data Collection Techniques:
  • Agent-based: Zabbix agents on monitored hosts
  • Agentless: SNMP, IPMI, JMX monitoring
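  • Active vs. passive checks: In active checks the agent fetches its item list and pushes results to the server; in passive checks the server connects to the agent and requests each value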
  • Web monitoring: HTTP/HTTPS endpoint checks
  • Database monitoring: Native database queries
  • Log file monitoring: Pattern matching in logs
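To make the passive-check case concrete: the server opens a TCP connection to the agent, sends an item key, and reads the value back. The sketch below only frames such a request using the Zabbix wire header (`ZBXD`, a flags byte, then little-endian length fields). The helper names `pack_zabbix` and `unpack_zabbix` are illustrative, and the layout should be checked against the protocol documentation for your Zabbix version.

```python
import struct

ZBX_HEADER = b"ZBXD"
ZBX_FLAG_PROTOCOL = 0x01  # plain (uncompressed) Zabbix communications protocol


def pack_zabbix(payload: str) -> bytes:
    """Wrap an item key (or any payload) in the Zabbix header."""
    data = payload.encode("utf-8")
    # "<BII": 1 flags byte, 4-byte data length, 4-byte reserved, little-endian
    return ZBX_HEADER + struct.pack("<BII", ZBX_FLAG_PROTOCOL, len(data), 0) + data


def unpack_zabbix(packet: bytes) -> str:
    """Strip the header from a packet and return the payload as text."""
    if packet[:4] != ZBX_HEADER:
        raise ValueError("not a Zabbix packet")
    _flags, length, _reserved = struct.unpack("<BII", packet[4:13])
    return packet[13:13 + length].decode("utf-8")
```

Sending `pack_zabbix("agent.ping")` over a TCP socket to the agent's listen port and passing the reply to `unpack_zabbix` would complete one passive check.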

Monitoring Templates

Zabbix includes 500+ pre-built templates for common systems and applications. Always start with official templates and customize as needed.
Linux Monitoring:
  • CPU utilization and load average
  • Memory usage (used, free, cached)
  • Disk space and I/O metrics
  • Network interface statistics
  • Process monitoring
  • System logs
Windows Monitoring:
  • Performance counters
  • Windows services
  • Event log monitoring
  • Active Directory health
  • IIS web server metrics
SNMP Monitoring:
  • Interface status and bandwidth
  • CPU and memory on network devices
  • Routing table monitoring
  • Temperature sensors
  • Power supply status
  • Fan speed monitoring
Common Applications:
  • Apache/Nginx web servers
  • MySQL/PostgreSQL databases
  • Elasticsearch clusters
  • Docker containers
  • Kubernetes clusters
  • Redis, MongoDB, RabbitMQ

Alert Configuration

Properly configured triggers prevent alert fatigue. Use appropriate thresholds, dependencies, and severity levels to ensure actionable alerts.
// Example: High CPU usage trigger (Zabbix 5.4+ expression syntax)
Expression: avg(/Linux/system.cpu.util,5m)>80
Severity: Warning
Description: Average CPU utilization above 80% over 5 minutes

// Critical CPU with dependency
Expression: avg(/Linux/system.cpu.util,5m)>95
Severity: High
Dependency: Host availability trigger (suppresses this alert while the host itself is down)

// Disk space prediction
Expression: timeleft(/Linux/vfs.fs.size[/,pfree],1h,0)<24h
Severity: Warning
Description: Root filesystem is projected to fill in less than 24 hours
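The threshold-plus-dependency logic above can be sketched in a few lines. This is a simplified model, not how Zabbix is implemented: the `Trigger` class and both functions are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional, Sequence


@dataclass
class Trigger:
    name: str
    threshold: float
    severity: str
    depends_on: Optional["Trigger"] = None  # parent trigger, if any


def is_problem(trigger: Trigger, window: Sequence[float]) -> bool:
    """True when the window average exceeds the trigger threshold."""
    return mean(window) > trigger.threshold


def should_alert(trigger: Trigger, window: Sequence[float],
                 parent_in_problem: bool = False) -> bool:
    """Alert only if the trigger is in problem state and its parent is not."""
    if trigger.depends_on is not None and parent_in_problem:
        return False  # suppressed: the parent problem already explains this one
    return is_problem(trigger, window)
```

With a dependency in place, a host outage produces one alert for the outage rather than one alert per dependent trigger.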

Zabbix Agent Configuration

# Install Zabbix agent
apt-get install zabbix-agent  # Debian/Ubuntu
yum install zabbix-agent      # RHEL/CentOS

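# Configure agent
# Note: this overwrites the packaged config, and $(hostname -f) is expanded
# by the shell at the moment the file is written.
cat > /etc/zabbix/zabbix_agentd.conf <<EOF
Server=zabbix-server.local
ServerActive=zabbix-server.local
Hostname=$(hostname -f)
# Remote commands disabled. On Zabbix 5.0+ this parameter is deprecated in favor of DenyKey=system.run[*]
EnableRemoteCommands=0
LogFile=/var/log/zabbix/zabbix_agentd.log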

# User parameters for custom metrics
UserParameter=custom.metric,/usr/local/bin/check_metric.sh
EOF

# Start agent
systemctl enable zabbix-agent
systemctl start zabbix-agent
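A `UserParameter` command only has to print one value to standard output. The configuration above points at a shell script whose contents are not shown; as a hedged illustration, a Python equivalent could look like the following. The metric chosen here (percent free space on `/`) and the function names are assumptions.

```python
#!/usr/bin/env python3
import shutil


def percent_free(path: str = "/") -> float:
    """Return free space on the filesystem containing `path`, as a percentage."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total * 100


def main() -> None:
    # Zabbix reads the item value from stdout, so print exactly one number.
    print(f"{percent_free():.2f}")


if __name__ == "__main__":
    main()
```

The script must be executable by the user the agent runs as, and the item's value type in Zabbix should be numeric (float).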

Prometheus Metrics System

Architecture and Concepts

Prometheus uses a pull-based model: the server scrapes metrics over HTTP from each target at a fixed interval. Its core components are:

Time-Series Database

Efficient storage of metrics with labels for multi-dimensional data

PromQL Query Language

Powerful query language for data aggregation and analysis

Service Discovery

Automatic target discovery for dynamic environments

Alertmanager

Flexible alert routing and notification management

Exporters

Exporters expose metrics in Prometheus format:
System and Infrastructure:
  • Node Exporter: Linux/Unix system metrics
  • Windows Exporter: Windows system metrics
  • Blackbox Exporter: Endpoint probing (HTTP, DNS, TCP)
  • SNMP Exporter: Network device metrics
Applications:
  • MySQL Exporter: Database metrics
  • PostgreSQL Exporter: Database performance
  • Redis Exporter: Redis statistics
  • Elasticsearch Exporter: Cluster health
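"Prometheus format" is a plain-text format served over HTTP, usually at `/metrics`. The standard-library sketch below shows the shape of such an endpoint. The metric `soc_queue_depth`, its value, and port 9200 are made up for illustration, and a production exporter would normally use the official client library instead.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(metrics: dict) -> str:
    """Render {name: (help, type, value)} in the Prometheus text format."""
    lines = []
    for name, (help_text, metric_type, value) in metrics.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {metric_type}")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    # Hypothetical metric; a real exporter would read this from the system.
    metrics = {"soc_queue_depth": ("Events waiting in the queue", "gauge", 42)}

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(self.metrics).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 9200) -> None:
    """Block forever, serving /metrics on the given (arbitrary) port."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` and adding the host and port as a scrape target would make the metric available to Prometheus.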

Prometheus Configuration

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'soc-production'
    environment: 'prod'

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Rule files
rule_files:
  - '/etc/prometheus/rules/*.yml'

# Scrape configurations
scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  
  # Node exporters
  - job_name: 'node'
    static_configs:
      - targets:
          - 'server1:9100'
          - 'server2:9100'
          - 'server3:9100'
        labels:
          group: 'production'
  
  # Blackbox exporter for endpoint monitoring
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://wazuh.local
          - https://elasticsearch.local
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
  
  # Service discovery for Kubernetes
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
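The blackbox job's three `relabel_configs` steps are easy to misread, so here is a small Python simulation of what they do to one target's labels. It mimics the effect of the rules only; the function names are illustrative and this is not Prometheus code.

```python
from urllib.parse import urlencode


def relabel_blackbox(target: str, exporter: str = "blackbox-exporter:9115") -> dict:
    """Apply the three blackbox relabel steps to one static target."""
    labels = {"__address__": target}
    # Step 1: copy the probe target into the ?target= URL parameter.
    labels["__param_target"] = labels["__address__"]
    # Step 2: keep the probed URL as the human-readable instance label.
    labels["instance"] = labels["__param_target"]
    # Step 3: point the actual scrape at the blackbox exporter.
    labels["__address__"] = exporter
    return labels


def scrape_url(labels: dict, module: str = "http_2xx") -> str:
    """Build the URL Prometheus ends up requesting for this target."""
    query = urlencode({"module": module, "target": labels["__param_target"]})
    return f"http://{labels['__address__']}/probe?{query}"
```

The net effect is that Prometheus scrapes the exporter, the exporter probes the real endpoint, and the resulting series still carry the endpoint URL in `instance`.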

PromQL Queries

PromQL enables powerful metric aggregation across dimensions. Learn the basics of rate(), increase(), and aggregation operators for effective monitoring.
# CPU usage percentage
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# Network traffic rate (bytes per second)
irate(node_network_receive_bytes_total[5m])

# HTTP request rate per minute
rate(http_requests_total[1m]) * 60

# 95th percentile response time (buckets aggregated across all series)
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# Predicted available disk space (bytes) one hour from now, extrapolated from the last hour
predict_linear(node_filesystem_avail_bytes[1h], 3600)
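To build intuition for what `rate()` and `predict_linear()` compute, the following Python sketch reimplements simplified versions over a list of `(timestamp, value)` samples. It omits details of the real functions: PromQL's `rate()` also extrapolates to the window boundaries, and `predict_linear()` projects from the evaluation time rather than from the last sample.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, value)


def simple_rate(samples: List[Sample]) -> float:
    """Per-second increase of a counter, compensating for counter resets."""
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter restarted from zero.
        increase += cur if cur < prev else cur - prev
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


def predict_linear(samples: List[Sample], seconds_ahead: float) -> float:
    """Fit a least-squares line and evaluate it seconds_ahead past the last sample."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    slope = (sum((t - mean_t) * (v - mean_v) for t, v in samples)
             / sum((t - mean_t) ** 2 for t, _ in samples))
    intercept = mean_v - slope * mean_t
    return slope * (samples[-1][0] + seconds_ahead) + intercept
```

A negative prediction for available bytes means the filesystem is on track to fill before the horizon, which is why alert rules typically compare `predict_linear(...)` against zero.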

Alert Rules

# /etc/prometheus/rules/alerts.yml
groups:
  - name: infrastructure
    interval: 30s
    rules:
      # High CPU usage
      - alert: HighCPUUsage
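        # rate() rather than irate(): irate() uses only the last two samples and is too volatile for alerting
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80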
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}% for 5 minutes"
      
      # Low disk space
      - alert: LowDiskSpace
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Only {{ $value }}% disk space remaining"
      
      # Service down
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.job }} is down"
          description: "{{ $labels.instance }} has been down for 1 minute"
      
      # High error rate
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High HTTP error rate"
          description: "Error rate is {{ $value }} requests/sec"
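The `for:` clause in each rule means the expression must stay true for that long before the alert fires; until then the alert is pending. A minimal state-machine sketch of that behavior follows. The class name is hypothetical and this is a simplification, not Prometheus internals.

```python
from typing import Optional


class PendingAlert:
    """Tracks one alert through inactive -> pending -> firing."""

    def __init__(self, for_seconds: float):
        self.for_seconds = for_seconds
        self.active_since: Optional[float] = None

    def evaluate(self, condition_true: bool, now: float) -> str:
        if not condition_true:
            self.active_since = None  # any false evaluation resets the timer
            return "inactive"
        if self.active_since is None:
            self.active_since = now
        if now - self.active_since >= self.for_seconds:
            return "firing"
        return "pending"
```

This is why a short `for:` catches outages quickly but is more sensitive to brief blips, while a longer one trades detection latency for fewer false alarms.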

Visualization and Dashboards

Zabbix Dashboards

Network Maps

Visual topology with real-time status indicators

Custom Widgets

Graphs, gauges, and tables for key metrics

Screens

Multi-graph displays for comprehensive views (folded into multi-page dashboards in Zabbix 5.4 and later)

Reports

Scheduled PDF/CSV reports for stakeholders

Grafana Integration

Grafana provides unified visualization for both Zabbix and Prometheus:
# Grafana datasource configuration
apiVersion: 1

datasources:
  # Prometheus datasource
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
  
  # Zabbix datasource (requires plugin)
  - name: Zabbix
    type: alexanderzobnin-zabbix-datasource
    access: proxy
    url: http://zabbix/api_jsonrpc.php
    jsonData:
      username: grafana
      trends: true
Grafana’s dashboard marketplace offers 1000+ pre-built dashboards for Prometheus exporters. Import and customize rather than building from scratch.

Integration with SOC Architecture

Security-Relevant Metrics

Infrastructure monitoring contributes to security operations:
Security Indicators:
  • Sudden CPU/memory spikes (cryptomining)
  • Unusual network traffic patterns
  • Unexpected process creation
  • Abnormal disk I/O (data exfiltration)
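One simple way to flag "sudden spikes" like those listed above is a rolling z-score: compare each sample with the mean and standard deviation of the samples just before it. The sketch below is one possible heuristic under assumed parameters (`baseline`, `z_threshold`, `min_sigma`), not a tuned detector.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def find_spikes(values: Sequence[float], baseline: int = 10,
                z_threshold: float = 3.0, min_sigma: float = 1.0) -> List[int]:
    """Return indexes whose value sits more than z_threshold standard
    deviations above the mean of the preceding `baseline` samples."""
    spikes = []
    for i in range(baseline, len(values)):
        window = values[i - baseline:i]
        mu = mean(window)
        # Floor sigma so a perfectly flat baseline does not flag tiny changes.
        sigma = max(pstdev(window), min_sigma)
        if (values[i] - mu) / sigma > z_threshold:
            spikes.append(i)
    return spikes
```

A sustained jump (as with cryptomining) is flagged when it starts and then becomes part of the baseline, so this kind of check complements fixed thresholds rather than replacing them.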

Forwarding to Wazuh

Integrate infrastructure monitoring alerts into your SIEM for correlation with security events. Performance issues often precede or accompany security incidents.
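# Example: Forward Prometheus alerts to Wazuh
# Must run on the Wazuh manager host. The queue is a Unix datagram socket,
# so it has to be written with the socket module, not open().
import json
import socket

WAZUH_SOCKET = '/var/ossec/queue/sockets/queue'

def forward_alert_to_wazuh(alert):
    alert_message = {
        'integration': 'prometheus',
        'prometheus': {
            'alertname': alert['labels']['alertname'],
            # Not every alert carries these fields, so fall back to defaults
            'instance': alert['labels'].get('instance', 'unknown'),
            'severity': alert['labels'].get('severity', 'unknown'),
            'description': alert['annotations'].get('description', '')
        }
    }

    # Message framing "<queue id>:<location>:<message>", following the
    # convention used by Wazuh's bundled integration scripts
    event = '1:prometheus:{}'.format(json.dumps(alert_message))

    sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    try:
        sock.connect(WAZUH_SOCKET)
        sock.send(event.encode())
    finally:
        sock.close()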
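The forwarder above takes a single alert, but Alertmanager's webhook receiver delivers alerts in batches, as a JSON body with an `alerts` array. A hedged sketch of the glue code follows: `handle_webhook` and its `forward` callback are illustrative names, and the HTTP server that receives the POST is left out.

```python
import json
from typing import Callable, Dict, List


def firing_alerts(webhook_body: bytes) -> List[Dict]:
    """Extract the firing alerts from an Alertmanager webhook POST body."""
    payload = json.loads(webhook_body)
    return [a for a in payload.get("alerts", []) if a.get("status") == "firing"]


def handle_webhook(webhook_body: bytes, forward: Callable[[Dict], None]) -> int:
    """Pass each firing alert to `forward`; return how many were forwarded."""
    alerts = firing_alerts(webhook_body)
    for alert in alerts:
        forward(alert)
    return len(alerts)
```

In practice `forward` would be `forward_alert_to_wazuh`, and the handler would sit behind whatever URL the Alertmanager `webhook_configs` entry points at.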

Best Practices

  • Monitor what matters: Focus on business-critical services
  • Set meaningful thresholds: Avoid alert fatigue
  • Use dependencies: Prevent alert storms
  • Document runbooks: Link alerts to resolution procedures
  • Optimize check intervals: Balance freshness vs overhead
  • Prefer active checks: In high-volume environments they move polling work from the Zabbix server to the agents
  • Database partitioning: Implement in Zabbix for large deployments
  • Retention policies: Keep only necessary historical data
  • Cluster Zabbix servers: For redundancy
  • Prometheus federation: Hierarchical monitoring
  • Backup configurations: Version control Grafana dashboards
  • Monitor the monitors: Ensure monitoring systems are healthy
  • Encrypt communications: Use TLS for all traffic
  • Restrict agent commands: Disable remote commands unless required
  • Authentication: Strong passwords and API tokens
  • Network segmentation: Isolate monitoring network

Official Documentation

Zabbix Documentation

Complete Zabbix installation and configuration guide

Prometheus Documentation

Official Prometheus documentation and best practices

Grafana Documentation

Grafana setup, datasources, and dashboard creation

PromQL Guide

PromQL query language reference

Next Steps

  1. Configure SIEM Platform to receive monitoring alerts
  2. Set up Incident Response workflows for infrastructure issues
  3. Review Operations Guide for day-to-day monitoring procedures
