Monitoring Guide
This guide covers the daily monitoring operations for the Enterprise SOC, including dashboard configuration, key metrics, alert management, and event correlation workflows.
Dashboard Setup
Wazuh Dashboards
Central security event visualization and correlation platform for unified threat monitoring
Prometheus Metrics
Real-time metrics and alerting for infrastructure performance and availability
Elasticsearch Analytics
Deep log analysis and search capabilities for forensic investigation
Zabbix Infrastructure
Infrastructure health monitoring and availability tracking
Wazuh Security Dashboard
The Wazuh platform serves as the central hub for security event visualization:
Access the Wazuh Dashboard
Navigate to the Wazuh web interface and authenticate with your SOC credentials
Configure Security Overview
Enable key panels:
- Security events summary
- Top triggered rules
- Alert evolution over time
- Agent status overview
Set Up Custom Views
Create role-based dashboards for:
- Tier 1 Analysts (high-priority alerts)
- Tier 2 Analysts (investigation workflows)
- SOC Manager (metrics and KPIs)
Infrastructure Monitoring
Combine Zabbix and Prometheus for comprehensive infrastructure visibility. Zabbix excels at availability monitoring, while Prometheus provides detailed metrics and alerting.
Zabbix
- Network device availability
- Server health (CPU, memory, disk)
- Service status monitoring
- Database performance
Prometheus
- Container metrics (if using containerized deployments)
- Application performance metrics
- Custom security metrics
- Resource utilization trends
Key Metrics to Monitor
Security Metrics
Critical Security Events
- Failed Authentication Attempts: Monitor for brute force attacks
- Privilege Escalation: Track sudo usage and administrative actions
- File Integrity Violations: Critical system file modifications
- Malware Detection: EDR alerts from Wazuh agents
- Network Intrusions: IDS/IPS alerts from Snort and Suricata
Network Security
- IDS/IPS Alert Volume: Track Snort and Suricata detection rates
- Blocked Connections: Firewall deny logs
- Unusual Traffic Patterns: Port scans, DDoS indicators
- External Communications: Unexpected outbound connections
- DNS Anomalies: DNS tunneling, DGA detection
Endpoint Security
- Agent Health: Wazuh agent connectivity status
- EDR Detections: Endpoint threats and suspicious behavior
- Vulnerability Status: Unpatched systems count
- Configuration Compliance: Policy violations
- Process Anomalies: Unusual process execution
Performance Metrics
- Log Ingestion Rate: Events per second in Logstash/Fluentd
- Elasticsearch Cluster Health: Index status and performance
- Query Response Time: Dashboard load times
- Storage Utilization: Log retention capacity
- Processing Lag: Pipeline delays
Alert Configuration and Tuning
Alert Levels
The SOC uses a tiered alert severity system:
| Severity | Level | Response Time | Examples |
|---|---|---|---|
| Critical | 12-15 | Immediate | Active exploitation, data exfiltration |
| High | 9-11 | < 15 minutes | Malware detection, privilege escalation |
| Medium | 6-8 | < 1 hour | Policy violations, suspicious activity |
| Low | 3-5 | < 4 hours | Information events, minor anomalies |
| Informational | 0-2 | Daily review | Audit logs, routine events |
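The table above can be encoded as a small lookup so that scripts and reports apply the same bands. This is a minimal sketch, assuming the Level column refers to Wazuh rule levels (0-15); the names `Severity` and `classify_level` are hypothetical and not part of any Wazuh API.

```python
from typing import NamedTuple


class Severity(NamedTuple):
    name: str
    response_time: str


# (lowest level, highest level, severity) copied from the severity table.
_SEVERITY_BANDS = [
    (12, 15, Severity("Critical", "Immediate")),
    (9, 11, Severity("High", "< 15 minutes")),
    (6, 8, Severity("Medium", "< 1 hour")),
    (3, 5, Severity("Low", "< 4 hours")),
    (0, 2, Severity("Informational", "Daily review")),
]


def classify_level(level: int) -> Severity:
    """Map a numeric alert level to its severity band."""
    for low, high, severity in _SEVERITY_BANDS:
        if low <= level <= high:
            return severity
    raise ValueError(f"alert level outside 0-15: {level}")
```

For example, `classify_level(10)` returns the High band with a response time of "< 15 minutes".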
Wazuh Alert Configuration
Configure Alert Destinations
Set up integrations:
- TheHive for incident creation
- Email notifications for critical alerts
- Slack/Teams for team notifications
IDS/IPS Alert Tuning
Snort and Suricata Configuration:
- Enable Community Rules: Start with Emerging Threats or Snort Community rules
- Suppress False Positives: Create suppression lists for known benign traffic
- Custom Signatures: Develop environment-specific detection rules
- Threshold Configuration: Set event thresholds to detect scanning and brute force
- Regular Updates: Schedule weekly rule updates from threat intelligence feeds
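The counting logic behind threshold configuration can be illustrated outside the IDS itself. The sketch below is not Snort or Suricata rule syntax; it only shows a sliding-window count per source, which is the idea a threshold relies on to flag scanning and brute force. The class name `ThresholdDetector` and the example limits are assumptions chosen for illustration.

```python
from collections import defaultdict, deque


class ThresholdDetector:
    """Flag a source once it produces `max_events` within `window_seconds`."""

    def __init__(self, max_events: int, window_seconds: float):
        self.max_events = max_events
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # source -> recent event timestamps

    def observe(self, source: str, timestamp: float) -> bool:
        """Record one event and return True if the threshold is reached."""
        window = self._events[source]
        window.append(timestamp)
        # Discard events that have fallen out of the time window.
        while window and timestamp - window[0] > self.window_seconds:
            window.popleft()
        return len(window) >= self.max_events
```

A detector created with `ThresholdDetector(max_events=5, window_seconds=60)` would flag a source on its fifth event within one minute; suitable values depend on the traffic baseline of the environment.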
Prometheus Alerting Rules
Configure alerting rules in Prometheus for infrastructure issues.
Event Correlation Workflows
Multi-Source Correlation
Wazuh provides powerful event correlation capabilities to detect complex attack patterns:
Identify Correlation Patterns
Define attack scenarios requiring multiple events:
- Reconnaissance → Exploitation → Lateral Movement
- Failed Login → Successful Login → Data Access
- Port Scan → Vulnerability Exploit → Malware Execution
Integrate Multiple Sources
Correlate events from:
- IDS/IPS alerts (Snort/Suricata)
- Firewall logs
- Endpoint EDR events
- Authentication logs
- Network flow data
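To make the idea of a multi-event pattern concrete, the following sketch checks for the Failed Login → Successful Login → Data Access sequence per user within a time window. It is a simplified illustration in plain Python, not how Wazuh evaluates its rules; the `Event` type, the event kind strings, and the 15-minute window are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    timestamp: float  # seconds since epoch
    user: str
    kind: str  # hypothetical normalized event type


SEQUENCE = ("failed_login", "successful_login", "data_access")


def find_sequences(events, sequence=SEQUENCE, window_seconds=900.0):
    """Return users whose events match `sequence` in order within the window."""
    progress = {}  # user -> (index of next expected step, time of first step)
    matched = []
    for event in sorted(events, key=lambda e: e.timestamp):
        step, started = progress.get(event.user, (0, event.timestamp))
        if step > 0 and event.timestamp - started > window_seconds:
            step = 0  # the window expired, so the partial match is dropped
        if event.kind == sequence[step]:
            if step == 0:
                started = event.timestamp
            step += 1
            if step == len(sequence):
                matched.append(event.user)
                step = 0
        progress[event.user] = (step, started)
    return matched
```

Events from different sources (authentication logs, endpoint events, and so on) would first need to be normalized into a common shape such as `Event` before a check like this can run across them.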
Elasticsearch Query Correlation
Use Elasticsearch for advanced correlation queries. Elasticsearch Query DSL enables complex temporal and cross-index correlations that complement Wazuh rule-based detection.
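As one simple example of a temporal query, the sketch below builds a Query DSL body that counts recent authentication failures per source address. The field names `event.outcome`, `source.ip`, and `@timestamp` are assumptions (ECS-style) and must be replaced with the fields actually present in your indices; the function name is hypothetical. The resulting dictionary is intended as the JSON body of a `_search` request.

```python
import json


def failed_login_burst_query(minutes: int = 15, min_failures: int = 10) -> dict:
    """Build a search body that groups recent failures by source address."""
    return {
        "size": 0,  # only the aggregation is needed, not individual hits
        "query": {
            "bool": {
                "filter": [
                    # Assumed field names; adjust to your index mapping.
                    {"term": {"event.outcome": "failure"}},
                    {"range": {"@timestamp": {"gte": f"now-{minutes}m"}}},
                ]
            }
        },
        "aggs": {
            "by_source": {
                "terms": {
                    "field": "source.ip",
                    "min_doc_count": min_failures,
                    "size": 50,
                }
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(failed_login_burst_query(), indent=2))
```

Using `filter` clauses rather than scored `must` clauses keeps the query cacheable, and `min_doc_count` limits the buckets to sources that meet the failure threshold.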
Daily Monitoring Checklist
Start of Shift (0-15 minutes)
- Review overnight critical and high alerts
- Check all monitoring systems are operational (Wazuh, Prometheus, Zabbix, Elasticsearch)
- Verify agent connectivity (check for disconnected endpoints)
- Review pending incidents in TheHive
- Check log ingestion rates and pipeline health
Morning Review (15-60 minutes)
- Analyze security event trends from past 24 hours
- Review IDS/IPS alerts (Snort/Suricata) for new attack patterns
- Check for failed authentication spikes
- Review file integrity monitoring alerts
- Investigate medium-severity alerts
- Update threat hunting queries based on new intelligence
Midday Operations (As needed)
- Respond to real-time alerts as they arrive
- Perform proactive threat hunting (see Threat Hunting guide)
- Tune false positive alerts
- Collaborate on active investigations
- Review and update correlation rules
Afternoon Review (30 minutes)
- Check compliance dashboard for policy violations
- Review vulnerability scan results
- Update incident tickets in TheHive
- Document findings and IOCs
- Review infrastructure metrics for anomalies
Best Practices
Monitoring Hygiene
- Tune Aggressively: Dedicate time weekly to reduce false positives
- Document Everything: Maintain runbooks for common alert types
- Baseline Normal: Understand normal behavior to identify anomalies
- Regular Reviews: Weekly review of alert effectiveness and coverage
- Continuous Learning: Stay updated on new attack techniques and adjust monitoring
Alert Response Priorities
- Active Exploitation - Drop everything and respond
- Data Exfiltration - Immediate containment required
- Malware Execution - Isolate and investigate
- Privilege Escalation - Verify legitimacy immediately
- Failed Authentication Patterns - Monitor for escalation
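If the list above is read as ordered from most to least urgent (an assumption), a triage queue can be sorted the same way. The category strings and the alert dictionary keys below are hypothetical; they would need to match whatever labels your alerts actually carry.

```python
# Categories in the assumed order of urgency, most urgent first.
RESPONSE_PRIORITY = [
    "active_exploitation",
    "data_exfiltration",
    "malware_execution",
    "privilege_escalation",
    "failed_authentication",
]
_RANK = {category: rank for rank, category in enumerate(RESPONSE_PRIORITY)}


def triage_order(alerts):
    """Sort alerts by response priority, then oldest first within a category."""
    unknown_rank = len(RESPONSE_PRIORITY)  # unlisted categories go last
    return sorted(
        alerts,
        key=lambda a: (_RANK.get(a["category"], unknown_rank), a["timestamp"]),
    )
```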
Communication Protocols
Clear communication during security events is critical for effective response.
- Critical Alerts: Immediately notify SOC lead and affected asset owners
- Incidents: Create TheHive case and notify stakeholders
- Ongoing Investigations: Regular updates every 2-4 hours
- False Positives: Document in knowledge base to prevent future confusion
- Shift Handover: Detailed written summary plus verbal briefing
Performance Optimization
Dashboard Performance
- Limit time ranges for heavy queries (default: last 24 hours)
- Use Elasticsearch aggregations instead of raw queries
- Schedule resource-intensive reports during off-peak hours
- Archive old indices to separate clusters if necessary
Query Optimization
Slow queries impact monitoring effectiveness. Optimize queries to return results in under 3 seconds.
- Use index patterns efficiently
- Filter at query time rather than post-processing
- Leverage Elasticsearch field caching
- Use time-based indices for log data
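With time-based indices, a query can target only the indices that cover the requested period instead of a broad wildcard. The sketch below computes daily index names for a date range. The prefix `wazuh-alerts-4.x-` and the `YYYY.MM.DD` suffix are assumptions based on a common Wazuh naming scheme; check the index names in your own cluster before relying on them.

```python
from datetime import date, timedelta
from typing import List


def daily_indices(start: date, end: date, prefix: str = "wazuh-alerts-4.x-") -> List[str]:
    """Return one index name per day from `start` to `end`, inclusive."""
    if end < start:
        raise ValueError("end date must not be before start date")
    days = (end - start).days
    return [
        prefix + (start + timedelta(days=offset)).strftime("%Y.%m.%d")
        for offset in range(days + 1)
    ]
```

The returned names can be joined with commas to form the index part of a search request, so a 24-hour dashboard query touches at most two daily indices.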
Troubleshooting Common Issues
High Alert Volume
Symptoms: Overwhelming number of alerts, analyst burnout
Solutions:
- Identify top noise generators using alert frequency analysis
- Implement alert grouping and deduplication
- Adjust severity levels for low-impact events
- Create suppression rules for known false positives
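Alert frequency analysis can be as simple as counting alerts per rule and looking at the share each rule contributes. The sketch below assumes alerts have been exported as dictionaries with a `rule_id` key; that key name and the function name are hypothetical.

```python
from collections import Counter


def top_noise_generators(alerts, top_n=10):
    """Return (rule_id, count, share of all alerts) for the noisiest rules."""
    counts = Counter(alert["rule_id"] for alert in alerts)
    total = sum(counts.values())
    return [
        (rule_id, count, count / total)
        for rule_id, count in counts.most_common(top_n)
    ]
```

Rules that account for a large share of volume but rarely lead to an incident are the natural first candidates for grouping, severity adjustment, or suppression.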
Missing Events
Symptoms: Expected events not appearing in dashboards
Solutions:
- Check agent connectivity in Wazuh
- Verify Logstash/Fluentd pipeline processing
- Review Elasticsearch index health
- Check log source configuration
- Verify firewall rules allow log transmission
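A quick way to narrow down where events are going missing is to compare the last time each log source was seen against a silence threshold. This is a minimal sketch; how the `last_seen` mapping is populated (for example from a per-source maximum timestamp query) is left open, and the 15-minute default is an assumption.

```python
from datetime import datetime, timedelta


def silent_sources(last_seen, now, max_silence=timedelta(minutes=15)):
    """Return the names of sources with no events for longer than `max_silence`."""
    return sorted(
        source
        for source, seen in last_seen.items()
        if now - seen > max_silence
    )
```

Sources returned here are the ones to check first for agent, pipeline, or firewall problems.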
Dashboard Slowness
Symptoms: Queries taking more than 10 seconds, timeouts
Solutions:
- Reduce query time range
- Check Elasticsearch cluster health
- Review index optimization status
- Increase cluster resources if needed
- Implement query result caching
Related Resources
- Incident Handling - Procedures for responding to security incidents
- Threat Hunting - Proactive threat detection techniques
- Maintenance - System maintenance and tuning procedures
