Maintenance

This guide covers regular maintenance tasks, system tuning, and operational procedures to ensure the Enterprise SOC infrastructure operates at peak performance. Proper maintenance prevents system degradation, maintains detection effectiveness, and ensures long-term reliability.

Maintenance Overview

Regular maintenance is essential for SOC health. Schedule maintenance windows during low-activity periods and always have rollback procedures ready.

Daily Tasks

Quick health checks and operational verification to ensure all systems are functioning correctly

Weekly Tasks

Rule updates, log review, and performance optimization for sustained effectiveness

Monthly Tasks

Deep system analysis, comprehensive updates, and capacity planning reviews

Quarterly Tasks

Major upgrades, disaster recovery testing, and security assessments

Regular Maintenance Tasks

Daily Maintenance (15-30 minutes)

System Health Checks

Monitoring System Status

Verify all SOC components are operational:

Wazuh: Check manager and agent status

/var/ossec/bin/wazuh-control status
/var/ossec/bin/agent_control -l

Elasticsearch: Verify cluster health

curl -X GET "localhost:9200/_cluster/health?pretty"

Logstash/Fluentd: Check pipeline status and throughput
Zabbix: Verify server and agent connectivity
Prometheus: Check target health and scrape status
TheHive: Confirm platform accessibility and background jobs
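The Elasticsearch check can be folded into a daily script. The following is a minimal sketch that interprets an already-parsed `_cluster/health` response. The function name and the mapping of `yellow` to a warning are assumptions for illustration, not part of the procedure above.

```python
def summarize_cluster_health(health: dict) -> tuple[str, list[str]]:
    """Return (severity, reasons) for a parsed _cluster/health response."""
    severity_by_status = {"green": "ok", "yellow": "warning", "red": "critical"}
    status = health.get("status", "unknown")
    # Assumption: anything other than green/yellow/red is treated as critical.
    severity = severity_by_status.get(status, "critical")

    reasons = []
    if status != "green":
        reasons.append(f"cluster status is {status}")
    unassigned = health.get("unassigned_shards", 0)
    if unassigned:
        reasons.append(f"{unassigned} unassigned shard(s)")
    return severity, reasons


if __name__ == "__main__":
    # Hypothetical sample response, trimmed to the fields used above.
    sample = {"status": "yellow", "unassigned_shards": 3}
    print(summarize_cluster_health(sample))
```

Feed it the JSON returned by the curl command above, for example by piping the output into a small wrapper that calls `json.load(sys.stdin)`.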

Agent Connectivity

Review disconnected agents:

Identify offline Wazuh agents
Check for network connectivity issues
Verify agent services are running
Document persistent offline agents
Escalate critical system outages

Log Ingestion Verification

Confirm logs are being received:

Check Logstash/Fluentd event rates
Verify events appearing in Elasticsearch
Review pipeline errors and failed events
Monitor queue depths and backlogs
Identify silent log sources
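Silent log sources are easy to miss because nothing fails visibly. One way to catch them, sketched below, is to compare each source's last event timestamp with the current time. The 30-minute gap and the source names are assumptions; how you collect the last-seen timestamps (for example from an Elasticsearch aggregation) is left out.

```python
from datetime import datetime, timedelta, timezone


def find_silent_sources(
    last_seen: dict[str, datetime],
    now: datetime,
    max_gap: timedelta = timedelta(minutes=30),  # assumed threshold
) -> list[str]:
    """Return the names of sources whose last event is older than max_gap."""
    return sorted(name for name, seen in last_seen.items() if now - seen > max_gap)


if __name__ == "__main__":
    now = datetime(2024, 6, 4, 12, 0, tzinfo=timezone.utc)
    # Hypothetical sources and timestamps.
    last_seen = {
        "firewall-01": now - timedelta(minutes=5),
        "dc-01": now - timedelta(hours=3),
    }
    print(find_silent_sources(last_seen, now))
```

A per-source threshold would suit environments where some systems log only a few events per day.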

Alert Pipeline Health

Ensure alerting is functioning:

Verify recent alerts in Wazuh
Check TheHive integration status
Test notification channels (email, Slack, etc.)
Review alert delivery times

Storage Monitoring

Check disk space on all systems:

Elasticsearch data nodes (alert at 75% usage)
Wazuh manager log storage
Backup storage capacity
Database storage
Archive retention
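The disk check can be scripted with the standard library. This sketch applies the 75% alert level stated above for Elasticsearch data nodes to any path; reusing that threshold for other systems and the example path are assumptions.

```python
import shutil

ALERT_THRESHOLD_PERCENT = 75.0  # alert level given above for Elasticsearch data nodes


def usage_percent(total_bytes: int, used_bytes: int) -> float:
    """Percentage of a filesystem that is in use."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * used_bytes / total_bytes


def check_path(path: str, threshold: float = ALERT_THRESHOLD_PERCENT) -> tuple[float, bool]:
    """Return (percent used, whether the threshold is reached) for the filesystem holding path."""
    usage = shutil.disk_usage(path)
    percent = usage_percent(usage.total, usage.used)
    return percent, percent >= threshold


if __name__ == "__main__":
    # "/" is a placeholder; point this at your data, log, and backup mounts.
    percent, over = check_path("/")
    print(f"/: {percent:.1f}% used, alert={over}")
```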

Quick Performance Check

Query Response Times: Test dashboard load times (should be < 5 seconds)
Indexing Rate: Verify Elasticsearch indexing keeps pace with ingestion
CPU/Memory: Check for resource exhaustion on critical systems
Network Throughput: Monitor bandwidth utilization

Weekly Maintenance (2-4 hours)

Rule and Signature Updates

IDS/IPS Rule Updates

Update Snort and Suricata signatures:

Snort:

# Pull latest rules from source
pulledpork.pl -c /etc/snort/pulledpork.conf

# Test configuration
snort -T -c /etc/snort/snort.conf

# Restart Snort
systemctl restart snort

Suricata:

# Update rules using suricata-update
suricata-update

# Reload rules without restart
kill -USR2 $(pidof suricata)

Post-Update:

Review new rules added
Monitor for new false positives
Document any rule suppressions needed

Wazuh Rule Updates

Update Wazuh detection rules:

# Backup current rules
cp -r /var/ossec/ruleset/rules /var/ossec/ruleset/rules.backup.$(date +%F)

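# Update Wazuh ruleset (script shipped with older releases; on recent 4.x
# versions the ruleset is updated by upgrading the wazuh-manager package)
/var/ossec/bin/update_ruleset

# Test configuration and rule syntax
/var/ossec/bin/wazuh-analysisd -t

# Optionally replay sample log lines against the ruleset (interactive)
/var/ossec/bin/wazuh-logtest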

# Restart Wazuh manager
systemctl restart wazuh-manager

Review custom rules in /var/ossec/etc/rules/local_rules.xml for compatibility

Threat Intelligence Updates

Refresh threat intelligence feeds:

Update IOC databases
Import new MISP events (if using MISP)
Update IP reputation lists
Refresh malware hash databases
Update domain blocklists
Sync with industry threat feeds

False Positive Review

Tune detection rules:

Review top noisy alerts from past week
Create suppression rules for confirmed false positives
Adjust alert severity levels
Update correlation thresholds
Document tuning decisions
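To find the noisiest rules from the past week, a short script can count alerts per rule ID. This sketch assumes alerts are available as one JSON object per line with a `rule.id` and `rule.description` field, as in the Wazuh manager's JSON alert log; the function name and sample alerts are hypothetical.

```python
import json
from collections import Counter
from typing import Iterable


def top_noisy_rules(lines: Iterable[str], limit: int = 10) -> list[tuple[str, int, str]]:
    """Return (rule id, alert count, description) for the most frequent rules."""
    counts: Counter = Counter()
    descriptions: dict[str, str] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            alert = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or partially written lines
        if not isinstance(alert, dict):
            continue
        rule = alert.get("rule") or {}
        rule_id = rule.get("id")
        if rule_id is None:
            continue
        counts[rule_id] += 1
        descriptions.setdefault(rule_id, rule.get("description", ""))
    return [(rid, n, descriptions[rid]) for rid, n in counts.most_common(limit)]


if __name__ == "__main__":
    # Hypothetical sample; in practice iterate over the alert log file.
    sample = [
        '{"rule": {"id": "5710", "description": "sshd: non-existent user"}}',
        '{"rule": {"id": "5710", "description": "sshd: non-existent user"}}',
        '{"rule": {"id": "5501", "description": "PAM: login session opened"}}',
    ]
    for rule_id, count, description in top_noisy_rules(sample):
        print(rule_id, count, description)
```

High counts alone do not make a rule a false positive; treat the output as a shortlist for manual review.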

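Example Wazuh configuration that sets the minimum alert level to log and loads local_rules.xml, where suppression rules (custom rules at level 0 that match the noisy rule) are defined: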

<!-- In /var/ossec/etc/ossec.conf -->
<ossec_config>
  <alerts>
    <log_alert_level>3</log_alert_level>
  </alerts>
  <rules>
    <include>local_rules.xml</include>
  </rules>
</ossec_config>

Vulnerability Management

Review and prioritize vulnerabilities:

Check for new CVEs affecting SOC infrastructure
Review Wazuh vulnerability detection results
Prioritize patching based on risk
Schedule patch deployment
Verify patch application

Performance Optimization

Elasticsearch Index Optimization:

# Force merge indices that are no longer being written to
# (narrow the pattern so it excludes the current write index)
curl -X POST "localhost:9200/wazuh-alerts-*/_forcemerge?max_num_segments=1"

Clear old logs and temporary files
Review slow query logs
Optimize heavy dashboard queries
Check for index bloat

Monthly Maintenance (4-8 hours)

Comprehensive System Review

Security Updates and Patching

Apply system updates:

Operating System Updates:

# Ubuntu/Debian
apt update && apt upgrade -y

# CentOS/RHEL
yum update -y

SOC Component Updates:

Wazuh manager and agents
Elasticsearch cluster
Logstash/Fluentd
TheHive and Cortex
Zabbix server and agents
Prometheus and exporters

Test updates in a staging environment before production deployment. Always have a rollback plan ready.

Log Retention and Cleanup

Manage log data lifecycle:

Elasticsearch Index Management:

# Delete indices older than 90 days
curator_cli --host localhost delete_indices --filter_list \
  '[{"filtertype":"age","source":"name","direction":"older","timestring":"%Y.%m.%d","unit":"days","unit_count":90}]'

# Close indices older than 30 days (kept on disk, but not searchable until reopened)
curator_cli --host localhost close --filter_list \
  '[{"filtertype":"age","source":"name","direction":"older","timestring":"%Y.%m.%d","unit":"days","unit_count":30}]'
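Before running a destructive cleanup it helps to preview which indices an age filter would match. The sketch below derives an index's age from a `%Y.%m.%d` date suffix in its name, the same timestring used above. It is a dry-run aid under that naming assumption, not a replacement for Curator, and the index names are hypothetical.

```python
import re
from datetime import date, datetime
from typing import Iterable, Optional

DATE_SUFFIX = re.compile(r"(\d{4}\.\d{2}\.\d{2})$")


def index_age_days(index_name: str, today: date) -> Optional[int]:
    """Age in days from the date suffix of the index name, or None if there is none."""
    match = DATE_SUFFIX.search(index_name)
    if match is None:
        return None
    try:
        index_date = datetime.strptime(match.group(1), "%Y.%m.%d").date()
    except ValueError:
        return None  # digits that do not form a valid date
    return (today - index_date).days


def select_older_than(names: Iterable[str], days: int, today: date) -> list[str]:
    """Return the index names whose date suffix is more than `days` days old."""
    selected = []
    for name in names:
        age = index_age_days(name, today)
        if age is not None and age > days:
            selected.append(name)
    return selected


if __name__ == "__main__":
    names = ["wazuh-alerts-4.x-2024.03.01", "wazuh-alerts-4.x-2024.06.15", ".kibana"]
    print(select_older_than(names, 90, date(2024, 6, 30)))
```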

Archive old data:

Snapshot indices to long-term storage
Compress archived logs
Verify archive integrity
Update retention documentation

Capacity Planning Review

Analyze resource usage trends:

Review storage growth rate
Project future capacity needs (3-6 months)
Analyze CPU and memory trends
Review network bandwidth utilization
Identify resource bottlenecks
Plan infrastructure upgrades

Key Metrics:

Events per second (EPS) trend
Storage growth (GB per day)
Query performance trends
Agent count growth
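The storage projection is simple arithmetic once the daily growth rate is known. This sketch assumes linear growth and, as an assumption, reuses the 75% disk alert level as the ceiling; the figures in the example are invented.

```python
from typing import Optional


def days_until_full(
    capacity_gb: float,
    used_gb: float,
    growth_gb_per_day: float,
    ceiling_percent: float = 75.0,  # assumed ceiling, matching the disk alert level
) -> Optional[float]:
    """Days until usage reaches the ceiling, or None if storage is not growing."""
    if growth_gb_per_day <= 0:
        return None
    remaining_gb = capacity_gb * ceiling_percent / 100 - used_gb
    if remaining_gb <= 0:
        return 0.0
    return remaining_gb / growth_gb_per_day


def projected_usage_gb(used_gb: float, growth_gb_per_day: float, days: int) -> float:
    """Expected usage after `days` days of linear growth."""
    return used_gb + growth_gb_per_day * days


if __name__ == "__main__":
    # Hypothetical cluster: 10 TB capacity, 4.5 TB used, growing 25 GB per day.
    print(days_until_full(10000, 4500, 25))
    print(projected_usage_gb(4500, 25, 180))
```

In this example the 75% ceiling is reached in 120 days, inside the 3-6 month planning horizon, so an upgrade would need to be scheduled now.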

Access Review

Audit user access and permissions:

Review active user accounts
Verify role assignments
Remove inactive accounts
Audit privileged access
Review API key usage
Update access documentation

Systems to review:

Wazuh dashboard access
Elasticsearch users
TheHive user accounts
System SSH access
Service accounts

Detection Effectiveness Review

Evaluate detection coverage:

Map detections to MITRE ATT&CK framework
Identify coverage gaps
Review detection rule effectiveness
Analyze false positive rates
Update detection priorities
Document coverage improvements

Integration Testing

Verify integrations are functioning:

Test Wazuh → TheHive alert creation
Verify Cortex analyzer connectivity
Test IDS → Logstash → Elasticsearch pipeline
Confirm Prometheus → Alertmanager flow
Validate email/Slack notifications
Check firewall log ingestion

Documentation Updates

Update runbooks with new procedures
Document configuration changes
Refresh architecture diagrams
Update contact lists
Review and update incident playbooks

Quarterly Maintenance (1-2 days)

Major Updates and Testing

Major Version Upgrades

Plan and execute major upgrades:

Review release notes for breaking changes
Test upgrades in staging environment
Backup all configurations and data
Schedule maintenance window
Execute upgrade following vendor procedures
Validate functionality post-upgrade
Update documentation

Upgrade Priority:

Security patches (immediate)
Critical bug fixes (within 1 month)
Feature updates (quarterly)

Disaster Recovery Testing

Validate backup and recovery procedures:

Test restore from backups
Verify backup completeness
Practice failover procedures
Test DR site readiness (if applicable)
Document recovery times (RTO/RPO)
Update DR documentation
Train staff on DR procedures

Disaster recovery testing is critical. Untested backups are not backups.

Security Assessment

Conduct security review of SOC infrastructure:

Vulnerability scan all SOC systems
Review security configurations
Audit authentication mechanisms
Test network segmentation
Review firewall rules
Assess encryption in transit and at rest
Penetration test SOC components (optional)

Performance Benchmarking

Establish performance baselines:

Measure query response times
Benchmark indexing rates
Test maximum EPS capacity
Measure alert processing latency
Document baseline metrics
Compare against previous quarters
Identify performance degradation

Compliance Review

Verify regulatory compliance:

Review audit logs for completeness
Verify log retention meets requirements
Confirm encryption standards
Validate access controls
Review incident documentation
Generate compliance reports
Address any findings

Strategic Planning

Review SOC metrics and KPIs
Assess team training needs
Plan infrastructure improvements
Budget for upcoming year
Evaluate new technologies
Update SOC roadmap

Log Retention and Cleanup

Retention Policy Guidelines

Hot Storage

30 days - Full search and analysis
All logs immediately searchable in Elasticsearch with full indexing

Warm Storage

31-90 days - Reduced access
Force-merged or closed indices; still retained on disk, but closed indices must be reopened before they can be searched

Cold Storage

91-365 days - Archive storage
Snapshots stored on cheaper storage, restore required for access

Frozen/Compliance

1-7 years - Compliance retention
Compressed archives for regulatory compliance, rarely accessed
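The tier boundaries above translate directly into a lookup that cleanup or reporting scripts can share. A minimal sketch follows; the function name is hypothetical, and the upper end of the compliance tier is left open because it depends on the regulation that applies.

```python
def retention_tier(age_days: int) -> str:
    """Map the age of log data in days to the storage tier described above."""
    if age_days < 0:
        raise ValueError("age_days must not be negative")
    if age_days <= 30:
        return "hot"
    if age_days <= 90:
        return "warm"
    if age_days <= 365:
        return "cold"
    return "frozen"  # compliance retention, 1-7 years depending on requirements


if __name__ == "__main__":
    for age in (7, 45, 200, 800):
        print(age, retention_tier(age))
```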

Elasticsearch Index Lifecycle Management

Use Elasticsearch Index Lifecycle Management (ILM) to automate index transitions through lifecycle phases.

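Example ILM Policy (the freeze action is deprecated in recent Elasticsearch releases; confirm it is supported by the version you run):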

{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "50GB",
            "max_age": "1d"
          }
        }
      },
      "warm": {
        "min_age": "30d",
        "actions": {
          "forcemerge": {"max_num_segments": 1},
          "shrink": {"number_of_shards": 1}
        }
      },
      "cold": {
        "min_age": "90d",
        "actions": {
          "freeze": {}
        }
      },
      "delete": {
        "min_age": "365d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
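A policy whose phases are out of order is an easy mistake to make when editing `min_age` values by hand. The sketch below checks that `min_age` never decreases from hot to delete. It is an illustrative pre-flight check under the assumption that ages use only the `d`, `h`, `m`, or `s` units; it does not call Elasticsearch.

```python
import re

_SECONDS_PER_UNIT = {"d": 86400, "h": 3600, "m": 60, "s": 1}
_PHASE_ORDER = ("hot", "warm", "cold", "delete")


def age_to_seconds(value: str) -> int:
    """Convert an age such as '30d' to seconds (d, h, m, s units only)."""
    match = re.fullmatch(r"(\d+)([dhms])", value)
    if match is None:
        raise ValueError(f"unsupported age value: {value!r}")
    return int(match.group(1)) * _SECONDS_PER_UNIT[match.group(2)]


def phases_in_order(policy: dict) -> bool:
    """True if min_age does not decrease across the phases that are present."""
    phases = policy["policy"]["phases"]
    ages = [
        age_to_seconds(phases[name].get("min_age", "0s"))
        for name in _PHASE_ORDER
        if name in phases
    ]
    return ages == sorted(ages)


if __name__ == "__main__":
    policy = {
        "policy": {
            "phases": {
                "hot": {"actions": {}},
                "warm": {"min_age": "30d", "actions": {}},
                "cold": {"min_age": "90d", "actions": {}},
                "delete": {"min_age": "365d", "actions": {}},
            }
        }
    }
    print(phases_in_order(policy))
```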

Wazuh Log Management

Archive old Wazuh logs:

# Compress logs older than 30 days
find /var/ossec/logs/archives -name "*.log" -mtime +30 -exec gzip {} \;

# Move compressed archives to cold storage
find /var/ossec/logs/archives -name "*.gz" -mtime +90 -exec mv {} /mnt/cold-storage/wazuh/ \;

# Delete archives older than retention policy
find /mnt/cold-storage/wazuh -name "*.gz" -mtime +365 -delete

Performance Tuning

Elasticsearch Optimization

Cluster Performance

Index Settings Optimization:

{
  "index": {
    "number_of_shards": 1,
    "number_of_replicas": 1,
    "refresh_interval": "30s",
    "codec": "best_compression"
  }
}

Best Practices:

Use time-based indices (daily or weekly rollover)
Set appropriate shard count (aim for 20-50GB per shard)
Increase refresh interval for write-heavy indices
Enable compression for older indices
Disable replicas during bulk indexing
Use index templates for consistent settings
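The shard sizing guideline can be turned into a quick calculation when planning an index template. This sketch divides the expected index size by a target shard size; the 40 GB default is an assumed midpoint of the 20-50GB range above.

```python
import math


def primary_shard_count(expected_index_size_gb: float, target_shard_size_gb: float = 40.0) -> int:
    """Number of primary shards needed to keep each shard near the target size."""
    if expected_index_size_gb <= 0 or target_shard_size_gb <= 0:
        raise ValueError("sizes must be positive")
    return max(1, math.ceil(expected_index_size_gb / target_shard_size_gb))


if __name__ == "__main__":
    # Hypothetical daily index sizes in GB.
    for size_gb in (30, 120, 200):
        print(size_gb, primary_shard_count(size_gb))
```

A daily index under the target size needs only one primary shard, which matches the `number_of_shards: 1` setting shown above.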

Query Optimization

Slow Query Analysis:

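# Enable slow query logging
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": {
    "logger.index.search.slowlog": "DEBUG",
    "logger.index.indexing.slowlog": "DEBUG"
  }
}'

# Set slow log thresholds on the indices to watch (disabled by default; values are examples)
curl -X PUT "localhost:9200/wazuh-alerts-*/_settings" -H 'Content-Type: application/json' -d'
{
  "index.search.slowlog.threshold.query.warn": "10s",
  "index.search.slowlog.threshold.query.info": "5s"
}'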

Optimization Techniques:

Use filter context instead of query context when possible
Limit result size and use pagination
Avoid wildcard queries on large fields
Use index patterns to limit search scope
Cache frequently used aggregations
Optimize field mappings (use keyword for exact match)

Hardware and JVM Tuning

JVM Heap Size:

# In /etc/elasticsearch/jvm.options
# Set heap to no more than 50% of available RAM, and keep it below 32GB
-Xms16g
-Xmx16g

Best Practices:

Set min and max heap size equal
Keep heap below 32GB (around 31GB or less) so the JVM can keep using compressed object pointers
Allocate up to 50% of RAM to heap, leave the rest for filesystem cache
Use SSD storage for data directories
Ensure adequate CPU cores (2+ per node)
Monitor GC pauses (should be < 1 second)
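The heap rule can be expressed as a small helper for provisioning scripts. The sketch takes half of the node's RAM and caps it at 31 GB, an assumed safety margin just under the 32GB limit mentioned above, then prints the matching `jvm.options` lines.

```python
def recommended_heap_gb(ram_gb: int, cap_gb: int = 31) -> int:
    """Half of RAM, capped just below the 32 GB limit (the cap value is an assumption)."""
    if ram_gb <= 0:
        raise ValueError("ram_gb must be positive")
    return max(1, min(ram_gb // 2, cap_gb))


def jvm_options(heap_gb: int) -> list[str]:
    """Minimum and maximum heap flags, set to the same value."""
    return [f"-Xms{heap_gb}g", f"-Xmx{heap_gb}g"]


if __name__ == "__main__":
    for ram_gb in (32, 64, 128):
        print(ram_gb, jvm_options(recommended_heap_gb(ram_gb)))
```

For a 32 GB node this yields the `-Xms16g` and `-Xmx16g` values shown in the example above.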

Wazuh Performance Tuning

Size the manager's remote event queue for the number of connected agents:

<!-- In /var/ossec/etc/ossec.conf -->
<ossec_config>
  <remote>
    <connection>secure</connection>
    <port>1514</port>
    <protocol>tcp</protocol>
    <queue_size>131072</queue_size>
  </remote>
  
  <global>
    <logall>no</logall>
    <logall_json>no</logall_json>
    <email_notification>yes</email_notification>
  </global>
</ossec_config>

Agent buffer optimization:

<!-- In agent ossec.conf -->
<client_buffer>
  <disabled>no</disabled>
  <queue_size>5000</queue_size>
  <events_per_second>500</events_per_second>
</client_buffer>

Logstash/Fluentd Pipeline Tuning

Logstash pipeline workers:

# In /etc/logstash/logstash.yml
pipeline.workers: 4
pipeline.batch.size: 250
pipeline.batch.delay: 50
queue.type: persisted
queue.max_bytes: 1gb

Backup and Disaster Recovery

Backup Strategy

Identify Critical Data

What to backup:

Elasticsearch indices (snapshots)
Wazuh manager configuration and rules
TheHive case database
Custom detection rules and scripts
System configurations
SSL certificates and keys
User and access control data

Implement Backup Automation

Elasticsearch Snapshots:

# Create snapshot repository (the location must be listed under path.repo
# in elasticsearch.yml on every node)
curl -X PUT "localhost:9200/_snapshot/backup_repository" -H 'Content-Type: application/json' -d'
{
  "type": "fs",
  "settings": {
    "location": "/mnt/backup/elasticsearch",
    "compress": true
  }
}'

# Create snapshot (automated via cron)
curl -X PUT "localhost:9200/_snapshot/backup_repository/snapshot_$(date +%F)" -H 'Content-Type: application/json' -d'
{
  "indices": "wazuh-*,suricata-*,snort-*",
  "ignore_unavailable": true,
  "include_global_state": false
}'
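A snapshot request that returns without error has not necessarily produced a complete snapshot. The sketch below inspects the parsed response of a snapshot status lookup (`GET _snapshot/backup_repository/<name>`) and reports anything other than a successful state or any failed shards. The function name and sample data are hypothetical.

```python
def snapshot_problems(response: dict) -> list[str]:
    """Return human-readable problems found in a parsed snapshot lookup response."""
    snapshots = response.get("snapshots", [])
    if not snapshots:
        return ["no snapshot found in response"]

    problems = []
    for snap in snapshots:
        name = snap.get("snapshot", "<unnamed>")
        state = snap.get("state")
        if state != "SUCCESS":
            problems.append(f"{name}: state is {state}")
        failed = snap.get("shards", {}).get("failed", 0)
        if failed:
            problems.append(f"{name}: {failed} shard(s) failed")
    return problems


if __name__ == "__main__":
    # Hypothetical response, trimmed to the fields used above.
    sample = {
        "snapshots": [
            {
                "snapshot": "snapshot_2024-06-30",
                "state": "PARTIAL",
                "shards": {"total": 10, "failed": 2, "successful": 8},
            }
        ]
    }
    for problem in snapshot_problems(sample):
        print(problem)
```

Running this from the same cron job that creates the snapshot lets a failed backup raise an alert instead of going unnoticed until a restore is needed.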

Wazuh Configuration Backup:

#!/bin/bash
# Daily Wazuh backup script
set -euo pipefail  # stop on the first failed command or unset variable

BACKUP_DIR="/mnt/backup/wazuh/$(date +%F)"
mkdir -p "$BACKUP_DIR"

# Backup configurations
tar -czf "$BACKUP_DIR/wazuh-config.tar.gz" /var/ossec/etc/

# Backup rules
tar -czf "$BACKUP_DIR/wazuh-rules.tar.gz" /var/ossec/ruleset/

# Backup agent keys (also included in the configuration archive above)
cp -p /var/ossec/etc/client.keys "$BACKUP_DIR/"

Offsite Backup

Replicate to offsite location:

Cloud storage (S3, Azure Blob, Google Cloud Storage)
Secondary datacenter
Tape backup for long-term retention
Encrypted backup transfer
Verify backup integrity after transfer
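Integrity after transfer is commonly verified by comparing checksums of the source file and the copy. A minimal sketch using SHA-256 follows; the file paths are placeholders, and for remote storage you would compare against a checksum computed or stored on the far side instead of reading both files locally.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1024 * 1024) -> str:
    """SHA-256 hex digest of a file, read in chunks to keep memory use low."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def transfer_intact(source_path: str, copy_path: str) -> bool:
    """True if the copy has the same SHA-256 digest as the source."""
    return sha256_of(source_path) == sha256_of(copy_path)


if __name__ == "__main__":
    import sys

    # Usage: python verify_backup.py <source file> <copied file>
    if len(sys.argv) == 3:
        print("intact" if transfer_intact(sys.argv[1], sys.argv[2]) else "MISMATCH")
```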

Backup Testing

Quarterly restore tests:

Restore Elasticsearch snapshot to test cluster
Restore Wazuh configuration to test manager
Verify data completeness and integrity
Document restore procedures and timing
Update DR documentation with findings

Disaster Recovery Procedures

Disaster recovery procedures must be tested regularly. Plan for complete SOC failure and practice recovery.

Recovery Priority:

Critical (RTO: 4 hours)
- Wazuh manager (detection and alerting)
- Elasticsearch cluster (log search)
- TheHive (incident management)
High (RTO: 8 hours)
- IDS/IPS systems (Snort/Suricata)
- Log ingestion pipeline (Logstash/Fluentd)
- Prometheus monitoring
Medium (RTO: 24 hours)
- Zabbix infrastructure monitoring
- Historical data restore
- Dashboard customizations

Recovery Procedures:

Assess Damage

Determine scope of failure
Identify affected systems
Estimate recovery time
Activate incident response team
Notify stakeholders

Restore Core Systems

Deploy fresh OS on replacement hardware
Restore system configurations from backup
Restore application data
Verify system functionality
Re-establish network connectivity

Restore Data

Restore Elasticsearch snapshots
Import Wazuh agent keys
Restore TheHive case database
Verify data integrity
Resume log ingestion

Validate and Resume

Test all integrations
Verify alerting functions
Reconnect agents
Resume normal operations
Document recovery process and timing

Compliance and Auditing

Audit Log Management

Maintain comprehensive audit logs for security operations, system changes, and access to comply with regulations.

What to Audit:

User authentication and authorization
Configuration changes
Rule modifications
Incident access and modifications
Data exports and queries
System administrative actions
Backup and restore operations

Elasticsearch Audit Logging:

# In elasticsearch.yml
xpack.security.audit.enabled: true
xpack.security.audit.logfile.events.include:
  - access_granted
  - access_denied
  - authentication_failed
  - authentication_success
  - connection_denied
  - connection_granted

Compliance Reporting

Generate regular compliance reports:

Log retention compliance: Verify retention periods met
Access reviews: Document user access audits
Incident response: Timeline and actions for all incidents
System availability: Uptime and SLA metrics
Vulnerability management: Patching compliance
Change management: Documentation of all changes

Maintenance Best Practices

Document Everything

Maintain detailed documentation of:

Maintenance procedures
Configuration changes
Troubleshooting steps
Lessons learned

Test Before Deploying

Always test changes in staging:

New rules and signatures
Software updates
Configuration modifications
Integration changes

Maintain Rollback Plans

Have rollback procedures for:

Configuration changes
Software upgrades
Rule deployments
Infrastructure changes

Monitor After Changes

Enhanced monitoring post-maintenance:

Watch for new errors
Monitor performance metrics
Review alert volume
Validate functionality

Change Management Process

Plan the Change

Document what will change and why
Identify affected systems
Assess risk and impact
Schedule maintenance window
Prepare rollback plan

Communicate

Notify stakeholders of maintenance window
Inform SOC team of expected changes
Update status pages
Set expectations for downtime

Execute Change

Follow documented procedure
Take before snapshots/backups
Make changes incrementally
Test at each step
Document actual changes made

Validate

Test all affected functionality
Verify integrations
Check performance metrics
Review logs for errors
Confirm with stakeholders

Document

Record changes made
Note any issues encountered
Update configuration documentation
Share lessons learned
Close change ticket

Maintenance Windows

Recommended Schedule:

Emergency Patches: As needed (security critical)
Routine Updates: Weekly, Tuesday 2-4 AM
Major Changes: Monthly, first Sunday 12-6 AM
DR Testing: Quarterly, scheduled 3 months in advance

Schedule maintenance during lowest traffic periods based on your organization’s patterns. Review metrics to identify optimal windows.
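Automation such as rule deployments or restarts can be guarded so that it only runs inside an approved window. The sketch below encodes the recommended schedule, reading "12-6 AM" as midnight to 06:00 and treating all times as local time; both readings are assumptions, as are the function names.

```python
from datetime import datetime


def in_routine_window(moment: datetime) -> bool:
    """Weekly routine window: Tuesday, 02:00 up to but not including 04:00."""
    return moment.weekday() == 1 and 2 <= moment.hour < 4


def in_major_window(moment: datetime) -> bool:
    """Monthly major-change window: first Sunday of the month, 00:00 up to 06:00."""
    is_first_sunday = moment.weekday() == 6 and moment.day <= 7
    return is_first_sunday and moment.hour < 6


if __name__ == "__main__":
    now = datetime.now()
    print("routine window:", in_routine_window(now))
    print("major window:", in_major_window(now))
```

Adjust the days and hours to the low-traffic periods your own metrics show.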

Troubleshooting Common Issues

High Resource Usage

Symptoms: CPU, memory, or disk at capacity

Solutions:

Identify resource-intensive processes
Optimize heavy queries
Increase refresh intervals
Archive or delete old data
Scale horizontally (add nodes)

Agent Connectivity Issues

Symptoms: Agents showing as disconnected

Solutions:

Verify network connectivity
Check firewall rules (port 1514 for Wazuh agent traffic, 1515 for agent enrollment)
Restart agent service
Re-key agent if authentication fails
Check manager capacity

Slow Query Performance

Symptoms: Dashboards loading slowly

Solutions:

Review slow query logs
Optimize query filters
Reduce time range
Make sure fields used in filters are indexed with a suitable mapping (keyword for exact match)
Increase cluster resources

Related Resources

Monitoring Guide - Daily monitoring operations and alert management
Incident Handling - Procedures for responding to security incidents
Threat Hunting - Proactive threat detection techniques
