Maintenance
This guide covers regular maintenance tasks, system tuning, and operational procedures that keep the Enterprise SOC infrastructure operating at peak performance. Proper maintenance prevents system degradation, maintains detection effectiveness, and ensures long-term reliability.
Maintenance Overview
Daily Tasks
Weekly Tasks
Monthly Tasks
Quarterly Tasks
Regular Maintenance Tasks
Daily Maintenance (15-30 minutes)
System Health Checks
Monitoring System Status
- Wazuh: Check manager and agent status
- Elasticsearch: Verify cluster health
- Logstash/Fluentd: Check pipeline status and throughput
- Zabbix: Verify server and agent connectivity
- Prometheus: Check target health and scrape status
- TheHive: Confirm platform accessibility and background jobs
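The Elasticsearch item in the list above can be scripted. The following is a minimal Python sketch, not a definitive check: it assumes an unauthenticated cluster at http://localhost:9200 (a hypothetical address; real deployments normally need TLS and credentials). It reads the standard `_cluster/health` endpoint and maps the reported `status` to a simple verdict; the verdict names are illustrative.

```python
import json
import urllib.request

# Assumed endpoint for illustration; adjust host, scheme, and auth for your cluster.
ES_URL = "http://localhost:9200"


def classify_health(health: dict) -> str:
    """Map a _cluster/health response to a simple verdict."""
    status = health.get("status")
    if status == "green":
        return "ok"
    if status == "yellow":
        # All primary shards are assigned, but some replicas are not.
        return "warn"
    # "red" (at least one primary shard unassigned) or a missing/unknown status.
    return "fail"


def fetch_health(base_url: str = ES_URL, timeout: float = 5.0) -> dict:
    """Fetch cluster health over HTTP using only the standard library."""
    url = f"{base_url}/_cluster/health"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

A daily check could call `classify_health(fetch_health())` and escalate on anything other than "ok".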
Agent Connectivity
- Identify offline Wazuh agents
- Check for network connectivity issues
- Verify agent services are running
- Document persistent offline agents
- Escalate critical system outages
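Identifying persistently offline agents can be sketched as a small filter over an agent listing. The field names used here (`name`, `status`, `lastKeepAlive`) are modelled on the Wazuh API's agent listing and should be treated as assumptions, as should the 24-hour threshold.

```python
from datetime import datetime, timedelta, timezone


def persistent_offline(agents, now, threshold=timedelta(hours=24)):
    """Return names of agents that have been disconnected longer than `threshold`."""
    stale = []
    for agent in agents:
        if agent["status"] != "disconnected":
            continue
        # Timestamps are assumed to be ISO 8601 strings in UTC with a trailing "Z".
        last_seen = datetime.fromisoformat(agent["lastKeepAlive"].replace("Z", "+00:00"))
        if now - last_seen > threshold:
            stale.append(agent["name"])
    return sorted(stale)
```

Agents returned by this function are candidates for the "document" and "escalate" steps; agents that dropped only recently are left out so that short network blips do not create noise.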
Log Ingestion Verification
- Check Logstash/Fluentd event rates
- Verify events appearing in Elasticsearch
- Review pipeline errors and failed events
- Monitor queue depths and backlogs
- Identify silent log sources
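Silent log sources are easy to miss because nothing fails visibly. One possible approach, sketched below, compares a list of expected sources against the most recent event time seen for each. The source names and the 30-minute silence window are hypothetical; the `last_seen` mapping could be built, for example, from a terms aggregation with a max `@timestamp` sub-aggregation.

```python
from datetime import datetime, timedelta, timezone


def silent_sources(expected, last_seen, now, max_silence=timedelta(minutes=30)):
    """Return expected sources with no event in the last `max_silence`.

    `last_seen` maps source name -> datetime of its newest event. Sources that
    are expected but absent from the mapping are reported as silent too.
    """
    silent = []
    for source in sorted(expected):
        newest = last_seen.get(source)
        if newest is None or now - newest > max_silence:
            silent.append(source)
    return silent
```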
Alert Pipeline Health
- Verify recent alerts in Wazuh
- Check TheHive integration status
- Test notification channels (email, Slack, etc.)
- Review alert delivery times
Quick Performance Check
- Query Response Times: Test dashboard load times (should be < 5 seconds)
- Indexing Rate: Verify Elasticsearch indexing keeps pace with ingestion
- CPU/Memory: Check for resource exhaustion on critical systems
- Network Throughput: Monitor bandwidth utilization
Weekly Maintenance (2-4 hours)
Rule and Signature Updates
IDS/IPS Rule Updates
- Review new rules added
- Monitor for new false positives
- Document any rule suppressions needed
Wazuh Rule Updates
- Review custom rules in /var/ossec/etc/rules/local_rules.xml for compatibility
Threat Intelligence Updates
- Update IOC databases
- Import new MISP events (if using MISP)
- Update IP reputation lists
- Refresh malware hash databases
- Update domain blocklists
- Sync with industry threat feeds
False Positive Review
- Review top noisy alerts from past week
- Create suppression rules for confirmed false positives
- Adjust alert severity levels
- Update correlation thresholds
- Document tuning decisions
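Finding the noisiest rules for the weekly review can be sketched as a simple count over exported alerts. The flat `rule_id` key is an assumption made for illustration; real alert documents typically nest the rule identifier, so adapt the accessor to your export format.

```python
from collections import Counter


def top_noisy_rules(alerts, n=10):
    """Return (rule_id, count, share_of_total) for the n most frequent rules."""
    counts = Counter(alert["rule_id"] for alert in alerts)
    total = sum(counts.values())
    return [
        (rule_id, count, count / total)
        for rule_id, count in counts.most_common(n)
    ]
```

The share column helps prioritize: a single rule responsible for a large fraction of weekly volume is usually a better tuning target than many low-volume rules.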
Performance Optimization
Elasticsearch Index Optimization:
- Clear old logs and temporary files
- Review slow query logs
- Optimize heavy dashboard queries
- Check for index bloat
Monthly Maintenance (4-8 hours)
Comprehensive System Review
Security Updates and Patching
- Wazuh manager and agents
- Elasticsearch cluster
- Logstash/Fluentd
- TheHive and Cortex
- Zabbix server and agents
- Prometheus and exporters
Log Retention and Cleanup
- Snapshot indices to long-term storage
- Compress archived logs
- Verify archive integrity
- Update retention documentation
Capacity Planning Review
- Review storage growth rate
- Project future capacity needs (3-6 months)
- Analyze CPU and memory trends
- Review network bandwidth utilization
- Identify resource bottlenecks
- Plan infrastructure upgrades
Metrics to track:
- Events per second (EPS) trend
- Storage growth (GB per day)
- Query performance trends
- Agent count growth
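A capacity projection from the storage growth rate can be sketched with simple linear arithmetic. This assumes growth is roughly constant over the planning horizon, which is a simplification; the figures in the usage note are hypothetical.

```python
def days_until_full(used_gb, capacity_gb, growth_gb_per_day):
    """Estimate days until storage is exhausted, assuming linear growth.

    Returns None when usage is flat or shrinking.
    """
    if growth_gb_per_day <= 0:
        return None
    return max(capacity_gb - used_gb, 0) / growth_gb_per_day


def projected_usage_gb(used_gb, growth_gb_per_day, days):
    """Project storage usage `days` from now under the same linear assumption."""
    return used_gb + growth_gb_per_day * days
```

For example, a cluster using 6,000 GB of 10,000 GB and growing 40 GB per day has about 100 days of headroom, and would reach 9,600 GB after 90 days, which suggests planning an expansion inside the 3-6 month window.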
Access Review
- Review active user accounts
- Verify role assignments
- Remove inactive accounts
- Audit privileged access
- Review API key usage
- Update access documentation
Systems to review:
- Wazuh dashboard access
- Elasticsearch users
- TheHive user accounts
- System SSH access
- Service accounts
Detection Effectiveness Review
- Map detections to MITRE ATT&CK framework
- Identify coverage gaps
- Review detection rule effectiveness
- Analyze false positive rates
- Update detection priorities
- Document coverage improvements
Documentation Updates
- Update runbooks with new procedures
- Document configuration changes
- Refresh architecture diagrams
- Update contact lists
- Review and update incident playbooks
Quarterly Maintenance (1-2 days)
Major Updates and Testing
Major Version Upgrades
- Review release notes for breaking changes
- Test upgrades in staging environment
- Backup all configurations and data
- Schedule maintenance window
- Execute upgrade following vendor procedures
- Validate functionality post-upgrade
- Update documentation
Update priorities:
- Security patches (immediate)
- Critical bug fixes (within 1 month)
- Feature updates (quarterly)
Disaster Recovery Testing
- Test restore from backups
- Verify backup completeness
- Practice failover procedures
- Test DR site readiness (if applicable)
- Document recovery times (RTO/RPO)
- Update DR documentation
- Train staff on DR procedures
Security Assessment
- Vulnerability scan all SOC systems
- Review security configurations
- Audit authentication mechanisms
- Test network segmentation
- Review firewall rules
- Assess encryption in transit and at rest
- Penetration test SOC components (optional)
Performance Benchmarking
- Measure query response times
- Benchmark indexing rates
- Test maximum EPS capacity
- Measure alert processing latency
- Document baseline metrics
- Compare against previous quarters
- Identify performance degradation
Strategic Planning
- Review SOC metrics and KPIs
- Assess team training needs
- Plan infrastructure improvements
- Budget for upcoming year
- Evaluate new technologies
- Update SOC roadmap
Log Retention and Cleanup
Retention Policy Guidelines
Hot Storage
Warm Storage
Cold Storage
Frozen/Compliance
Elasticsearch Index Lifecycle Management
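A lifecycle policy along the lines of the hot, warm, cold, and delete tiers might be built as follows. This is a sketch only: the phase ages and rollover thresholds are placeholder assumptions, not this guide's retention periods, and should be replaced with the values your compliance requirements dictate. The resulting document has the shape expected by the `PUT _ilm/policy/<policy_name>` API.

```python
import json


def build_ilm_policy(warm_after="7d", cold_after="30d", delete_after="365d"):
    """Build an ILM policy body; all ages here are illustrative defaults."""
    return {
        "policy": {
            "phases": {
                "hot": {
                    "actions": {
                        # Roll over daily or when a primary shard reaches 50 GB.
                        "rollover": {"max_age": "1d", "max_primary_shard_size": "50gb"}
                    }
                },
                "warm": {
                    "min_age": warm_after,
                    "actions": {
                        # Merge segments once the index is no longer written to.
                        "forcemerge": {"max_num_segments": 1},
                        "set_priority": {"priority": 50},
                    },
                },
                "cold": {
                    "min_age": cold_after,
                    "actions": {"set_priority": {"priority": 0}},
                },
                "delete": {
                    "min_age": delete_after,
                    "actions": {"delete": {}},
                },
            }
        }
    }


# Print the JSON body so it can be reviewed before being sent to the cluster.
print(json.dumps(build_ilm_policy(), indent=2))
```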
Example ILM Policy:
Wazuh Log Management
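Archiving old Wazuh logs can be sketched as moving rotated files older than a cutoff into a separate archive location. The sketch assumes rotated logs are already compressed (`*.gz`) and live under a directory such as /var/ossec/logs/alerts; the archive destination and the 90-day cutoff are hypothetical.

```python
import shutil
import time
from pathlib import Path


def archive_old_logs(log_dir, archive_dir, max_age_days=90, pattern="*.gz", now=None):
    """Move files older than `max_age_days` to `archive_dir`, keeping relative paths.

    Returns the relative paths of the files that were moved.
    """
    log_dir, archive_dir = Path(log_dir), Path(archive_dir)
    cutoff = (time.time() if now is None else now) - max_age_days * 86400
    moved = []
    # sorted() materializes the listing before any file is moved.
    for path in sorted(log_dir.rglob(pattern)):
        if not path.is_file() or path.stat().st_mtime >= cutoff:
            continue
        rel = path.relative_to(log_dir)
        target = archive_dir / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(path), str(target))
        moved.append(rel.as_posix())
    return moved
```

Keeping the relative path avoids name collisions between files from different months that share the same day-based file name.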
Archive old Wazuh logs:
Performance Tuning
Elasticsearch Optimization
Cluster Performance
- Use time-based indices (daily or weekly rollover)
- Set appropriate shard count (aim for 20-50GB per shard)
- Increase refresh interval for write-heavy indices
- Enable compression for older indices
- Disable replicas during large bulk indexing jobs and re-enable them afterwards
- Use index templates for consistent settings
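The shard sizing guideline can be turned into a small calculation. The sketch below assumes a 40 GB target per shard (a midpoint of the 20-50GB range, chosen for illustration) and a 30-second refresh interval for write-heavy indices; both values are assumptions to tune per workload. `index.number_of_shards` and `index.refresh_interval` are standard Elasticsearch index settings.

```python
import math


def primary_shard_count(index_size_gb, target_shard_gb=40):
    """Number of primary shards so each lands near `target_shard_gb`."""
    return max(1, math.ceil(index_size_gb / target_shard_gb))


def index_settings(index_size_gb, target_shard_gb=40, refresh_interval="30s"):
    """Settings fragment suitable for an index template."""
    return {
        "index": {
            "number_of_shards": primary_shard_count(index_size_gb, target_shard_gb),
            "refresh_interval": refresh_interval,
        }
    }
```

For an index expected to reach 300 GB before rollover, this yields 8 primary shards of about 37.5 GB each, which stays inside the recommended range.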
Query Optimization
- Use filter context instead of query context when possible
- Limit result size and use pagination
- Avoid wildcard queries on large fields
- Use index patterns to limit search scope
- Cache frequently used aggregations
- Optimize field mappings (use keyword for exact match)
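To illustrate filter context, the sketch below builds a search body in which every clause sits under `bool.filter`, so no relevance scores are computed and the clauses are eligible for caching. The field names (`agent.name`, `rule.level`, `@timestamp`) are assumptions based on typical Wazuh alert documents, and `agent.name` is assumed to be mapped as `keyword`.

```python
def build_alert_query(agent_name, min_level, since="now-24h", size=100):
    """Build a search body that uses only filter context and a bounded size."""
    return {
        "size": size,  # limit the result size; paginate for more
        "query": {
            "bool": {
                "filter": [
                    # Exact match on a keyword field.
                    {"term": {"agent.name": agent_name}},
                    {"range": {"rule.level": {"gte": min_level}}},
                    # Bound the time range to limit the search scope.
                    {"range": {"@timestamp": {"gte": since}}},
                ]
            }
        },
        "sort": [{"@timestamp": "desc"}],
    }
```

The returned dictionary can be sent as the JSON body of a `_search` request against a time-bounded index pattern.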
Hardware and JVM Tuning
- Set min and max heap size equal
- Keep heap size below 32GB so the JVM can continue to use compressed object pointers
- Allocate up to 50% of RAM to heap and leave the rest for the filesystem cache
- Use SSD storage for data directories
- Ensure adequate CPU cores (2+ per node)
- Monitor GC pauses (should be < 1 second)
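The heap rules above reduce to a short calculation. This sketch assumes a ceiling of 31 GB as a conservative way to stay under the 32GB limit; the exact safe threshold depends on the JVM, so treat that number as an assumption. `-Xms` and `-Xmx` are the standard JVM flags for minimum and maximum heap.

```python
def recommended_heap_gb(total_ram_gb, ceiling_gb=31):
    """Half of RAM, capped at `ceiling_gb` (assumed safe value below 32 GB)."""
    return int(min(total_ram_gb // 2, ceiling_gb))


def jvm_options(total_ram_gb):
    """Return matching min/max heap flags so the heap never resizes."""
    heap = recommended_heap_gb(total_ram_gb)
    return [f"-Xms{heap}g", f"-Xmx{heap}g"]
```

A node with 64 GB of RAM would get `-Xms31g` and `-Xmx31g`, leaving the remaining memory to the operating system's filesystem cache.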
Wazuh Performance Tuning
Increase concurrent agent connections:
Logstash/Fluentd Pipeline Tuning
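As a starting point for Logstash pipeline sizing, the sketch below renders the `pipeline.workers`, `pipeline.batch.size`, and `pipeline.batch.delay` settings for logstash.yml. Matching workers to CPU cores and using a batch size of 125 with a 50 ms delay mirrors the Logstash defaults; treat these as baselines to adjust after measuring throughput, not as tuned values.

```python
import os


def logstash_pipeline_settings(cpu_cores=None, batch_size=125, batch_delay_ms=50):
    """Baseline pipeline settings; workers default to the local CPU count."""
    cores = cpu_cores or os.cpu_count() or 1
    return {
        "pipeline.workers": cores,
        "pipeline.batch.size": batch_size,
        "pipeline.batch.delay": batch_delay_ms,
    }


def render_yaml(settings):
    """Render flat key/value settings as logstash.yml lines."""
    return "\n".join(f"{key}: {value}" for key, value in settings.items())
```

Larger batch sizes generally improve throughput at the cost of more heap per worker, so raise `pipeline.batch.size` gradually while watching memory use.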
Logstash pipeline workers:
Backup and Disaster Recovery
Backup Strategy
Identify Critical Data
- Elasticsearch indices (snapshots)
- Wazuh manager configuration and rules
- TheHive case database
- Custom detection rules and scripts
- System configurations
- SSL certificates and keys
- User and access control data
Offsite Backup
- Cloud storage (S3, Azure Blob, Google Cloud Storage)
- Secondary datacenter
- Tape backup for long-term retention
- Encrypted backup transfer
- Verify backup integrity after transfer
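Verifying backup integrity after transfer can be done by comparing checksums of the source and the copy. The following sketch uses SHA-256 from the Python standard library; the function names are illustrative, and for remote storage the second digest would normally be computed on the destination side or taken from the storage provider's metadata.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_transfer(source_path, copied_path):
    """Return True when both files have identical SHA-256 digests."""
    return sha256_of(source_path) == sha256_of(copied_path)
```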
Disaster Recovery Procedures
Recovery Priority:
- Critical (RTO: 4 hours)
  - Wazuh manager (detection and alerting)
  - Elasticsearch cluster (log search)
  - TheHive (incident management)
- High (RTO: 8 hours)
  - IDS/IPS systems (Snort/Suricata)
  - Log ingestion pipeline (Logstash/Fluentd)
  - Prometheus monitoring
- Medium (RTO: 24 hours)
  - Zabbix infrastructure monitoring
  - Historical data restore
  - Dashboard customizations
Assess Damage
- Determine scope of failure
- Identify affected systems
- Estimate recovery time
- Activate incident response team
- Notify stakeholders
Restore Core Systems
- Deploy fresh OS on replacement hardware
- Restore system configurations from backup
- Restore application data
- Verify system functionality
- Re-establish network connectivity
Restore Data
- Restore Elasticsearch snapshots
- Import Wazuh agent keys
- Restore TheHive case database
- Verify data integrity
- Resume log ingestion
Compliance and Auditing
Audit Log Management
- User authentication and authorization
- Configuration changes
- Rule modifications
- Incident access and modifications
- Data exports and queries
- System administrative actions
- Backup and restore operations
Compliance Reporting
Generate regular compliance reports:
- Log retention compliance: Verify retention periods met
- Access reviews: Document user access audits
- Incident response: Timeline and actions for all incidents
- System availability: Uptime and SLA metrics
- Vulnerability management: Patching compliance
- Change management: Documentation of all changes
Maintenance Best Practices
Document Everything
- Maintenance procedures
- Configuration changes
- Troubleshooting steps
- Lessons learned
Test Before Deploying
- New rules and signatures
- Software updates
- Configuration modifications
- Integration changes
Maintain Rollback Plans
- Configuration changes
- Software upgrades
- Rule deployments
- Infrastructure changes
Monitor After Changes
- Watch for new errors
- Monitor performance metrics
- Review alert volume
- Validate functionality
Change Management Process
Plan the Change
- Document what will change and why
- Identify affected systems
- Assess risk and impact
- Schedule maintenance window
- Prepare rollback plan
Communicate
- Notify stakeholders of maintenance window
- Inform SOC team of expected changes
- Update status pages
- Set expectations for downtime
Execute Change
- Follow documented procedure
- Take snapshots/backups before making changes
- Make changes incrementally
- Test at each step
- Document actual changes made
Validate
- Test all affected functionality
- Verify integrations
- Check performance metrics
- Review logs for errors
- Confirm with stakeholders
Maintenance Windows
Recommended Schedule:
- Emergency Patches: As needed (security critical)
- Routine Updates: Weekly, Tuesday 2-4 AM
- Major Changes: Monthly, first Sunday 12-6 AM
- DR Testing: Quarterly, scheduled 3 months in advance
Troubleshooting Common Issues
High Resource Usage
Symptoms: CPU, memory, or disk at capacity
Solutions:
- Identify resource-intensive processes
- Optimize heavy queries
- Increase refresh intervals
- Archive or delete old data
- Scale horizontally (add nodes)
Agent Connectivity Issues
Symptoms: Agents showing as disconnected
Solutions:
- Verify network connectivity
- Check firewall rules (port 1514 for Wazuh)
- Restart agent service
- Re-key agent if authentication fails
- Check manager capacity
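The network and firewall checks above can be approximated with a quick TCP reachability test from the agent's host. This sketch assumes the manager accepts agent connections over TCP on port 1514; deployments configured for UDP or a different port need a different check. The host name in the usage note is hypothetical.

```python
import socket


def tcp_port_open(host, port=1514, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For example, `tcp_port_open("wazuh-manager.example.internal")` returning False from an agent host points toward a firewall or routing problem rather than an agent-side fault.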
Slow Query Performance
Symptoms: Dashboards loading slowly
Solutions:
- Review slow query logs
- Optimize query filters
- Reduce time range
- Make sure fields used in filters are mapped for exact matching (for example, as keyword)
- Increase cluster resources
Related Resources
- Monitoring Guide - Daily monitoring operations and alert management
- Incident Handling - Procedures for responding to security incidents
- Threat Hunting - Proactive threat detection techniques
