Maintenance

This guide covers regular maintenance tasks, system tuning, and operational procedures to ensure the Enterprise SOC infrastructure operates at peak performance. Proper maintenance prevents system degradation, maintains detection effectiveness, and ensures long-term reliability.

Maintenance Overview

Regular maintenance is essential for SOC health. Schedule maintenance windows during low-activity periods and always have rollback procedures ready.

Daily Tasks

Quick health checks and operational verification to ensure all systems are functioning correctly

Weekly Tasks

Rule updates, log review, and performance optimization for sustained effectiveness

Monthly Tasks

Deep system analysis, comprehensive updates, and capacity planning reviews

Quarterly Tasks

Major upgrades, disaster recovery testing, and security assessments

Regular Maintenance Tasks

System Health Checks

1. Monitoring System Status

Verify all SOC components are operational:
  • Wazuh: Check manager and agent status
    /var/ossec/bin/wazuh-control status
    /var/ossec/bin/agent_control -l
    
  • Elasticsearch: Verify cluster health
    curl -X GET "localhost:9200/_cluster/health?pretty"
    
  • Logstash/Fluentd: Check pipeline status and throughput
  • Zabbix: Verify server and agent connectivity
  • Prometheus: Check target health and scrape status
  • TheHive: Confirm platform accessibility and background jobs
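A minimal sketch of how the Elasticsearch health check above could be scripted for a daily job. The function names `cluster_status` and `health_exit_code` are hypothetical, and the parsing assumes the usual `"status"` field in the `_cluster/health` response; `jq` would be a sturdier parser where it is installed.

```shell
# Extract the status field (green, yellow, or red) from _cluster/health JSON on stdin.
cluster_status() {
  sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p' | head -n 1
}

# Map the status to an exit code: 0 = green, 1 = yellow, 2 = red or unknown.
health_exit_code() {
  case "$1" in
    green) return 0 ;;
    yellow) return 1 ;;
    *) return 2 ;;
  esac
}

# Usage (assumes Elasticsearch answers on localhost:9200 without authentication):
#   status=$(curl -s "localhost:9200/_cluster/health" | cluster_status)
#   health_exit_code "$status" || echo "Cluster status is ${status:-unknown}"
```

Returning distinct exit codes lets a scheduler or monitoring wrapper treat yellow as a warning and red as a failure.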
2. Agent Connectivity

Review disconnected agents:
  • Identify offline Wazuh agents
  • Check for network connectivity issues
  • Verify agent services are running
  • Document persistent offline agents
  • Escalate critical system outages
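One way to pull the offline agents out of the agent list is sketched below. It assumes `agent_control -l` prints one agent per line in the form `ID: 002, Name: db01, IP: any, Disconnected`; check the output of your Wazuh version before relying on it. The function name `offline_agents` is hypothetical.

```shell
# Print ID, name, and status of agents reported as Disconnected or Never connected.
# Reads `agent_control -l` output on stdin.
offline_agents() {
  awk -F', ' '
    { sub(/[ \t\r]+$/, "") }   # drop trailing whitespace so the status field compares cleanly
    /ID: [0-9]+/ && ($NF == "Disconnected" || $NF == "Never connected") {
      sub(/^ *ID: /, "", $1)
      sub(/^Name: /, "", $2)
      print $1 " " $2 " (" $NF ")"
    }'
}

# Usage: /var/ossec/bin/agent_control -l | offline_agents
```

Saving each day's output makes it easier to spot agents that stay offline across several checks.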
3. Log Ingestion Verification

Confirm logs are being received:
  • Check Logstash/Fluentd event rates
  • Verify events appearing in Elasticsearch
  • Review pipeline errors and failed events
  • Monitor queue depths and backlogs
  • Identify silent log sources
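Silent log sources can be flagged by comparing each source's most recent event time with the current time. The sketch below assumes you already have a list of `source_name last_event_epoch` pairs (for example, exported from an Elasticsearch aggregation); the input format and the function name `silent_sources` are hypothetical.

```shell
# Print sources whose newest event is older than max_age seconds.
# $1: current time (epoch seconds), $2: max_age in seconds.
# stdin: one "source_name last_event_epoch" pair per line.
silent_sources() {
  now=$1
  max_age=$2
  awk -v now="$now" -v max="$max_age" 'NF == 2 && (now - $2) > max { print $1 }'
}

# Usage: silent_sources "$(date +%s)" 3600 < last_seen.txt
```

The threshold should reflect how chatty each source normally is; one hour is only an illustrative value.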
4. Alert Pipeline Health

Ensure alerting is functioning:
  • Verify recent alerts in Wazuh
  • Check TheHive integration status
  • Test notification channels (email, Slack, etc.)
  • Review alert delivery times
5. Storage Monitoring

Check disk space on all systems:
  • Elasticsearch data nodes (alert at 75% usage)
  • Wazuh manager log storage
  • Backup storage capacity
  • Database storage
  • Archive retention
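The 75% threshold above can be checked from `df -P` output, as in this minimal sketch. The function name `over_threshold` is hypothetical, the example paths are assumed default locations, and mount points containing spaces are not handled.

```shell
# Print mount points at or above a usage threshold (percent).
# $1: threshold, defaulting to 75. stdin: output of `df -P`.
over_threshold() {
  awk -v limit="${1:-75}" 'NR > 1 {
    use = $5
    sub(/%/, "", use)
    if (use + 0 >= limit + 0) print $6 " " use "%"
  }'
}

# Usage: df -P /var/lib/elasticsearch /var/ossec | over_threshold 75
```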

Quick Performance Check

  • Query Response Times: Test dashboard load times (should be < 5 seconds)
  • Indexing Rate: Verify Elasticsearch indexing keeps pace with ingestion
  • CPU/Memory: Check for resource exhaustion on critical systems
  • Network Throughput: Monitor bandwidth utilization

Rule and Signature Updates

1. IDS/IPS Rule Updates

Update Snort and Suricata signatures:
Snort:
# Pull latest rules from source
pulledpork.pl -c /etc/snort/pulledpork.conf

# Test configuration
snort -T -c /etc/snort/snort.conf

# Restart Snort
systemctl restart snort
Suricata:
# Update rules using suricata-update
suricata-update

# Reload rules without restart
kill -USR2 $(pidof suricata)
Post-Update:
  • Review new rules added
  • Monitor for new false positives
  • Document any rule suppressions needed
2. Wazuh Rule Updates

Update Wazuh detection rules:
# Backup current rules
cp -r /var/ossec/ruleset/rules /var/ossec/ruleset/rules.backup.$(date +%F)

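# Update Wazuh ruleset (update_ruleset is deprecated as of Wazuh 4.2;
# on newer releases the ruleset is updated by upgrading the manager package)
/var/ossec/bin/update_ruleset

# Test rules and decoders against sample log lines (interactive tool)
/var/ossec/bin/wazuh-logtest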

# Restart Wazuh manager
systemctl restart wazuh-manager
Review custom rules in /var/ossec/etc/rules/local_rules.xml for compatibility
3. Threat Intelligence Updates

Refresh threat intelligence feeds:
  • Update IOC databases
  • Import new MISP events (if using MISP)
  • Update IP reputation lists
  • Refresh malware hash databases
  • Update domain blocklists
  • Sync with industry threat feeds
4. False Positive Review

Tune detection rules:
  • Review top noisy alerts from past week
  • Create suppression rules for confirmed false positives
  • Adjust alert severity levels
  • Update correlation thresholds
  • Document tuning decisions
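Example Wazuh alert threshold and local rules include (the suppression rules themselves belong in local_rules.xml, typically as level 0 rules that match the noisy alert):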
<!-- In /var/ossec/etc/ossec.conf -->
<ossec_config>
  <alerts>
    <log_alert_level>3</log_alert_level>
  </alerts>
  <rules>
    <include>local_rules.xml</include>
  </rules>
</ossec_config>
5. Vulnerability Management

Review and prioritize vulnerabilities:
  • Check for new CVEs affecting SOC infrastructure
  • Review Wazuh vulnerability detection results
  • Prioritize patching based on risk
  • Schedule patch deployment
  • Verify patch application

Performance Optimization

  • Elasticsearch Index Optimization:
    # Force merge old indices (target only indices that are no longer being written to)
    curl -X POST "localhost:9200/wazuh-alerts-*/_forcemerge?max_num_segments=1"
    
  • Clear old logs and temporary files
  • Review slow query logs
  • Optimize heavy dashboard queries
  • Check for index bloat

Comprehensive System Review

1. Security Updates and Patching

Apply system updates:
Operating System Updates:
# Ubuntu/Debian
apt update && apt upgrade -y

# CentOS/RHEL
yum update -y
SOC Component Updates:
  • Wazuh manager and agents
  • Elasticsearch cluster
  • Logstash/Fluentd
  • TheHive and Cortex
  • Zabbix server and agents
  • Prometheus and exporters
Test updates in a staging environment before production deployment, and always have a rollback plan ready.
2. Log Retention and Cleanup

Manage log data lifecycle:
Elasticsearch Index Management:
# Delete indices older than 90 days
curator_cli --host localhost delete_indices --filter_list \
  '[{"filtertype":"age","source":"name","direction":"older","timestring":"%Y.%m.%d","unit":"days","unit_count":90}]'

# Close indices older than 30 days (kept on disk, but not searchable until reopened)
curator_cli --host localhost close --filter_list \
  '[{"filtertype":"age","source":"name","direction":"older","timestring":"%Y.%m.%d","unit":"days","unit_count":30}]'
Archive old data:
  • Snapshot indices to long-term storage
  • Compress archived logs
  • Verify archive integrity
  • Update retention documentation
3. Capacity Planning Review

Analyze resource usage trends:
  • Review storage growth rate
  • Project future capacity needs (3-6 months)
  • Analyze CPU and memory trends
  • Review network bandwidth utilization
  • Identify resource bottlenecks
  • Plan infrastructure upgrades
Key Metrics:
  • Events per second (EPS) trend
  • Storage growth (GB per day)
  • Query performance trends
  • Agent count growth
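The storage projection can start as simple arithmetic on the metrics above. This sketch assumes growth stays constant (a linear extrapolation) and works in whole gigabytes; the function name `days_until_full` is hypothetical.

```shell
# Days until storage is full at a constant daily growth rate.
# $1: capacity in GB, $2: used in GB, $3: growth in GB per day.
days_until_full() {
  capacity_gb=$1
  used_gb=$2
  growth_gb_per_day=$3
  if [ "$growth_gb_per_day" -le 0 ]; then
    echo "never"
    return 0
  fi
  echo $(( (capacity_gb - used_gb) / growth_gb_per_day ))
}

# Usage: days_until_full 2000 1400 5
```

If the result falls inside the 3-6 month planning horizon, the upgrade belongs in the next capacity review.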
4. Access Review

Audit user access and permissions:
  • Review active user accounts
  • Verify role assignments
  • Remove inactive accounts
  • Audit privileged access
  • Review API key usage
  • Update access documentation
Systems to review:
  • Wazuh dashboard access
  • Elasticsearch users
  • TheHive user accounts
  • System SSH access
  • Service accounts
5. Detection Effectiveness Review

Evaluate detection coverage:
  • Map detections to MITRE ATT&CK framework
  • Identify coverage gaps
  • Review detection rule effectiveness
  • Analyze false positive rates
  • Update detection priorities
  • Document coverage improvements
6. Integration Testing

Verify integrations are functioning:
  • Test Wazuh → TheHive alert creation
  • Verify Cortex analyzer connectivity
  • Test IDS → Logstash → Elasticsearch pipeline
  • Confirm Prometheus → Alertmanager flow
  • Validate email/Slack notifications
  • Check firewall log ingestion

Documentation Updates

  • Update runbooks with new procedures
  • Document configuration changes
  • Refresh architecture diagrams
  • Update contact lists
  • Review and update incident playbooks

Major Updates and Testing

1. Major Version Upgrades

Plan and execute major upgrades:
  • Review release notes for breaking changes
  • Test upgrades in staging environment
  • Backup all configurations and data
  • Schedule maintenance window
  • Execute upgrade following vendor procedures
  • Validate functionality post-upgrade
  • Update documentation
Upgrade Priority:
  1. Security patches (immediate)
  2. Critical bug fixes (within 1 month)
  3. Feature updates (quarterly)
2. Disaster Recovery Testing

Validate backup and recovery procedures:
  • Test restore from backups
  • Verify backup completeness
  • Practice failover procedures
  • Test DR site readiness (if applicable)
  • Document recovery times (RTO/RPO)
  • Update DR documentation
  • Train staff on DR procedures
Disaster recovery testing is critical. Untested backups are not backups.
3. Security Assessment

Conduct security review of SOC infrastructure:
  • Vulnerability scan all SOC systems
  • Review security configurations
  • Audit authentication mechanisms
  • Test network segmentation
  • Review firewall rules
  • Assess encryption in transit and at rest
  • Penetration test SOC components (optional)
4. Performance Benchmarking

Establish performance baselines:
  • Measure query response times
  • Benchmark indexing rates
  • Test maximum EPS capacity
  • Measure alert processing latency
  • Document baseline metrics
  • Compare against previous quarters
  • Identify performance degradation
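Comparing a metric with the previous quarter's baseline comes down to a percentage change. A small sketch, with a hypothetical function name `pct_change`:

```shell
# Percentage change from a baseline value to a current value, to one decimal place.
# $1: baseline (must be non-zero), $2: current.
pct_change() {
  awk -v old="$1" -v new="$2" 'BEGIN {
    if (old == 0) { print "n/a"; exit }
    printf "%.1f\n", (new - old) / old * 100
  }'
}

# Usage: pct_change 200 250   (e.g. query latency in ms last quarter vs. this quarter)
```

Whether a positive change is good or bad depends on the metric: higher latency is a regression, while a higher indexing rate is an improvement.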
5. Compliance Review

Verify regulatory compliance:
  • Review audit logs for completeness
  • Verify log retention meets requirements
  • Confirm encryption standards
  • Validate access controls
  • Review incident documentation
  • Generate compliance reports
  • Address any findings

Strategic Planning

  • Review SOC metrics and KPIs
  • Assess team training needs
  • Plan infrastructure improvements
  • Budget for upcoming year
  • Evaluate new technologies
  • Update SOC roadmap

Log Retention and Cleanup

Retention Policy Guidelines

Hot Storage

30 days - Full search and analysis
All logs immediately searchable in Elasticsearch with full indexing

Warm Storage

31-90 days - Reduced access
Indices closed after 30 days remain on disk but must be reopened before they can be searched

Cold Storage

91-365 days - Archive storage
Snapshots stored on cheaper storage, restore required for access

Frozen/Compliance

1-7 years - Compliance retention
Compressed archives for regulatory compliance, rarely accessed
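The tier boundaries above can be expressed as a small lookup, which is handy when scripting archive jobs. This is a sketch only: the function name `retention_tier` is hypothetical, and the 7-year upper bound (2555 days) is an assumption that depends on the regulations you fall under.

```shell
# Map the age of a log, in days, to the retention tier it belongs to.
retention_tier() {
  age_days=$1
  if [ "$age_days" -le 30 ]; then
    echo hot
  elif [ "$age_days" -le 90 ]; then
    echo warm
  elif [ "$age_days" -le 365 ]; then
    echo cold
  elif [ "$age_days" -le 2555 ]; then
    echo compliance
  else
    echo expired
  fi
}

# Usage: retention_tier 45
```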

Elasticsearch Index Lifecycle Management

Use Elasticsearch Index Lifecycle Management (ILM) to automate index transitions through lifecycle phases.
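Example ILM Policy (the freeze action targets Elasticsearch 7.x and is deprecated in later releases, so check the ILM reference for your version):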
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "50GB",
            "max_age": "1d"
          }
        }
      },
      "warm": {
        "min_age": "30d",
        "actions": {
          "forcemerge": {"max_num_segments": 1},
          "shrink": {"number_of_shards": 1}
        }
      },
      "cold": {
        "min_age": "90d",
        "actions": {
          "freeze": {}
        }
      },
      "delete": {
        "min_age": "365d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}

Wazuh Log Management

Archive old Wazuh logs:
# Compress logs older than 30 days
find /var/ossec/logs/archives -name "*.log" -mtime +30 -exec gzip {} \;

# Move compressed archives to cold storage
find /var/ossec/logs/archives -name "*.gz" -mtime +90 -exec mv {} /mnt/cold-storage/wazuh/ \;

# Delete archives older than retention policy
find /mnt/cold-storage/wazuh -name "*.gz" -mtime +365 -delete

Performance Tuning

Elasticsearch Optimization

Index Settings Optimization:
{
  "index": {
    "number_of_shards": 1,
    "number_of_replicas": 1,
    "refresh_interval": "30s",
    "codec": "best_compression"
  }
}
Best Practices:
  • Use time-based indices (daily or weekly rollover)
  • Set appropriate shard count (aim for 20-50GB per shard)
  • Increase refresh interval for write-heavy indices
  • Enable compression for older indices
  • Disable replicas during bulk indexing
  • Use index templates for consistent settings
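The shard-sizing guideline above reduces to a ceiling division. A minimal sketch, assuming whole-gigabyte index sizes and a default target of 40GB per shard (a value picked from the middle of the 20-50GB range); the function name `shards_needed` is hypothetical.

```shell
# Primary shards needed to keep each shard at or below the target size.
# $1: expected index size in GB, $2: target shard size in GB (default 40).
shards_needed() {
  index_gb=$1
  target_gb=${2:-40}
  if [ "$index_gb" -le 0 ]; then
    echo 1
    return 0
  fi
  echo $(( (index_gb + target_gb - 1) / target_gb ))
}

# Usage: shards_needed 120 40
```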
Slow Query Analysis:
# Enable slow query logging (nothing is logged until index.search.slowlog.threshold.* values are also set on the indices)
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": {
    "logger.index.search.slowlog": "DEBUG",
    "logger.index.indexing.slowlog": "DEBUG"
  }
}'
Optimization Techniques:
  • Use filter context instead of query context when possible
  • Limit result size and use pagination
  • Avoid wildcard queries on large fields
  • Use index patterns to limit search scope
  • Cache frequently used aggregations
  • Optimize field mappings (use keyword for exact match)
JVM Heap Size:
# In /etc/elasticsearch/jvm.options
# Set heap to 50% of available RAM, max 32GB
-Xms16g
-Xmx16g
Best Practices:
  • Set min and max heap size equal
  • Keep heap below about 32GB so the JVM can keep using compressed object pointers
  • Allocate 50% of RAM to heap, leave 50% for filesystem cache
  • Use SSD storage for data directories
  • Ensure adequate CPU cores (2+ per node)
  • Monitor GC pauses (should be < 1 second)
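The heap sizing rule can be written as a one-line calculation. This sketch takes half of RAM and caps the result at 31GB; the cap is an assumption chosen to stay just under the roughly 32GB limit, and the function name `heap_gb` is hypothetical.

```shell
# Suggested JVM heap in GB: half of physical RAM, capped below the ~32GB limit.
# $1: total RAM in GB.
heap_gb() {
  ram_gb=$1
  half=$(( ram_gb / 2 ))
  cap=31   # assumption: stay just under the compressed-pointer threshold
  if [ "$half" -gt "$cap" ]; then
    echo "$cap"
  else
    echo "$half"
  fi
}

# Usage: printf -- '-Xms%sg\n-Xmx%sg\n' "$(heap_gb 32)" "$(heap_gb 32)"
```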

Wazuh Performance Tuning

Size the manager's remote event queue and disable full event archiving (logall):
<!-- In /var/ossec/etc/ossec.conf -->
<ossec_config>
  <remote>
    <connection>secure</connection>
    <port>1514</port>
    <protocol>tcp</protocol>
    <queue_size>131072</queue_size>
  </remote>
  
  <global>
    <logall>no</logall>
    <logall_json>no</logall_json>
    <email_notification>yes</email_notification>
  </global>
</ossec_config>
Agent buffer optimization:
<!-- In agent ossec.conf -->
<client_buffer>
  <disabled>no</disabled>
  <queue_size>5000</queue_size>
  <events_per_second>500</events_per_second>
</client_buffer>

Logstash/Fluentd Pipeline Tuning

Logstash pipeline workers:
# In /etc/logstash/logstash.yml
pipeline.workers: 4
pipeline.batch.size: 250
pipeline.batch.delay: 50
queue.type: persisted
queue.max_bytes: 1gb

Backup and Disaster Recovery

Backup Strategy

1. Identify Critical Data

What to back up:
  • Elasticsearch indices (snapshots)
  • Wazuh manager configuration and rules
  • TheHive case database
  • Custom detection rules and scripts
  • System configurations
  • SSL certificates and keys
  • User and access control data
2. Implement Backup Automation

Elasticsearch Snapshots:
# Create snapshot repository (the location must be listed under path.repo in elasticsearch.yml)
curl -X PUT "localhost:9200/_snapshot/backup_repository" -H 'Content-Type: application/json' -d'
{
  "type": "fs",
  "settings": {
    "location": "/mnt/backup/elasticsearch",
    "compress": true
  }
}'

# Create snapshot (automated via cron)
curl -X PUT "localhost:9200/_snapshot/backup_repository/snapshot_$(date +%F)" -H 'Content-Type: application/json' -d'
{
  "indices": "wazuh-*,suricata-*,snort-*",
  "ignore_unavailable": true,
  "include_global_state": false
}'
Wazuh Configuration Backup:
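#!/bin/bash
# Daily Wazuh backup script
set -euo pipefail  # stop on the first failed command or unset variable

BACKUP_DIR="/mnt/backup/wazuh/$(date +%F)"
mkdir -p "$BACKUP_DIR"

# Backup configurations
tar -czf "$BACKUP_DIR/wazuh-config.tar.gz" /var/ossec/etc/

# Backup rules
tar -czf "$BACKUP_DIR/wazuh-rules.tar.gz" /var/ossec/ruleset/

# Backup agent keys (sensitive: keep the copy readable by its owner only)
cp -p /var/ossec/etc/client.keys "$BACKUP_DIR/"
chmod 600 "$BACKUP_DIR/client.keys"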
3. Offsite Backup

Replicate to offsite location:
  • Cloud storage (S3, Azure Blob, Google Cloud Storage)
  • Secondary datacenter
  • Tape backup for long-term retention
  • Encrypted backup transfer
  • Verify backup integrity after transfer
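Integrity after transfer can be verified with a checksum manifest written at the source and checked at the destination. A minimal sketch using `sha256sum` from GNU coreutils; the function names and the `SHA256SUMS` file name are hypothetical choices.

```shell
# Write a SHA-256 manifest covering every file in a backup directory.
write_manifest() {
  ( cd "$1" && find . -type f ! -name SHA256SUMS -exec sha256sum {} + > SHA256SUMS )
}

# Verify a directory against its manifest; a non-zero exit means a mismatch
# or a missing file.
verify_manifest() {
  ( cd "$1" && sha256sum -c SHA256SUMS > /dev/null )
}

# Usage:
#   write_manifest "/mnt/backup/wazuh/$(date +%F)"    # before transfer
#   verify_manifest /path/to/offsite/copy             # after transfer
```

Storing the manifest alongside the backup also gives the quarterly restore test something concrete to check against.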
4. Backup Testing

Quarterly restore tests:
  • Restore Elasticsearch snapshot to test cluster
  • Restore Wazuh configuration to test manager
  • Verify data completeness and integrity
  • Document restore procedures and timing
  • Update DR documentation with findings

Disaster Recovery Procedures

Disaster recovery procedures must be tested regularly. Plan for complete SOC failure and practice recovery.
Recovery Priority:
  1. Critical (RTO: 4 hours)
    • Wazuh manager (detection and alerting)
    • Elasticsearch cluster (log search)
    • TheHive (incident management)
  2. High (RTO: 8 hours)
    • IDS/IPS systems (Snort/Suricata)
    • Log ingestion pipeline (Logstash/Fluentd)
    • Prometheus monitoring
  3. Medium (RTO: 24 hours)
    • Zabbix infrastructure monitoring
    • Historical data restore
    • Dashboard customizations
Recovery Procedures:
1. Assess Damage

  • Determine scope of failure
  • Identify affected systems
  • Estimate recovery time
  • Activate incident response team
  • Notify stakeholders
2. Restore Core Systems

  • Deploy fresh OS on replacement hardware
  • Restore system configurations from backup
  • Restore application data
  • Verify system functionality
  • Re-establish network connectivity
3. Restore Data

  • Restore Elasticsearch snapshots
  • Import Wazuh agent keys
  • Restore TheHive case database
  • Verify data integrity
  • Resume log ingestion
4. Validate and Resume

  • Test all integrations
  • Verify alerting functions
  • Reconnect agents
  • Resume normal operations
  • Document recovery process and timing

Compliance and Auditing

Audit Log Management

Maintain comprehensive audit logs for security operations, system changes, and access to comply with regulations.
What to Audit:
  • User authentication and authorization
  • Configuration changes
  • Rule modifications
  • Incident access and modifications
  • Data exports and queries
  • System administrative actions
  • Backup and restore operations
Elasticsearch Audit Logging:
# In elasticsearch.yml
xpack.security.audit.enabled: true
xpack.security.audit.logfile.events.include:
  - access_granted
  - access_denied
  - authentication_failed
  - authentication_success
  - connection_denied
  - connection_granted

Compliance Reporting

Generate regular compliance reports:
  • Log retention compliance: Verify retention periods met
  • Access reviews: Document user access audits
  • Incident response: Timeline and actions for all incidents
  • System availability: Uptime and SLA metrics
  • Vulnerability management: Patching compliance
  • Change management: Documentation of all changes

Maintenance Best Practices

Document Everything

Maintain detailed documentation of:
  • Maintenance procedures
  • Configuration changes
  • Troubleshooting steps
  • Lessons learned

Test Before Deploying

Always test changes in staging:
  • New rules and signatures
  • Software updates
  • Configuration modifications
  • Integration changes

Maintain Rollback Plans

Have rollback procedures for:
  • Configuration changes
  • Software upgrades
  • Rule deployments
  • Infrastructure changes

Monitor After Changes

Enhanced monitoring post-maintenance:
  • Watch for new errors
  • Monitor performance metrics
  • Review alert volume
  • Validate functionality

Change Management Process

1. Plan the Change

  • Document what will change and why
  • Identify affected systems
  • Assess risk and impact
  • Schedule maintenance window
  • Prepare rollback plan
2. Communicate

  • Notify stakeholders of maintenance window
  • Inform SOC team of expected changes
  • Update status pages
  • Set expectations for downtime
3. Execute Change

  • Follow documented procedure
  • Take snapshots/backups before making changes
  • Make changes incrementally
  • Test at each step
  • Document actual changes made
4. Validate

  • Test all affected functionality
  • Verify integrations
  • Check performance metrics
  • Review logs for errors
  • Confirm with stakeholders
5. Document

  • Record changes made
  • Note any issues encountered
  • Update configuration documentation
  • Share lessons learned
  • Close change ticket

Maintenance Windows

Recommended Schedule:
  • Emergency Patches: As needed (security critical)
  • Routine Updates: Weekly, Tuesday 2-4 AM
  • Major Changes: Monthly, first Sunday 12-6 AM
  • DR Testing: Quarterly, scheduled 3 months in advance
Schedule maintenance during lowest traffic periods based on your organization’s patterns. Review metrics to identify optimal windows.
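Standard cron syntax cannot express "first Sunday of the month" directly, so a scheduled job usually runs every Sunday and exits early unless a guard passes. A sketch of such a guard, assuming GNU `date` (for the `-d` option); the function name `is_first_sunday` is hypothetical.

```shell
# Succeed only if the given date (YYYY-MM-DD, default today) is the first Sunday of its month.
is_first_sunday() {
  day=${1:-$(date +%F)}
  dow=$(date -d "$day" +%u)   # 1 = Monday ... 7 = Sunday
  dom=$(date -d "$day" +%d)   # day of month, 01-31
  [ "$dow" -eq 7 ] && [ "$dom" -le 7 ]
}

# Usage inside a job scheduled for every Sunday:
#   is_first_sunday || exit 0
```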

Troubleshooting Common Issues

High Resource Usage

Symptoms: CPU, memory, or disk at capacity
Solutions:
  • Identify resource-intensive processes
  • Optimize heavy queries
  • Increase refresh intervals
  • Archive or delete old data
  • Scale horizontally (add nodes)

Agent Connectivity Issues

Symptoms: Agents showing as disconnected
Solutions:
  • Verify network connectivity
  • Check firewall rules (port 1514 for Wazuh)
  • Restart agent service
  • Re-key agent if authentication fails
  • Check manager capacity

Slow Query Performance

Symptoms: Dashboards loading slowly
Solutions:
  • Review slow query logs
  • Optimize query filters
  • Reduce time range
  • Ensure fields used in filters are indexed (for example, mapped as keyword)
  • Increase cluster resources