Log Aggregation and Storage
The log aggregation layer serves as the central nervous system of the SOC architecture, collecting, processing, enriching, and storing security events from all sources for analysis and long-term retention. This layer uses industry-standard tools to build a scalable, high-performance log pipeline capable of handling millions of events per second.
Architecture Overview
Log Processing
Logstash and Fluentd for data collection and transformation
Storage Engine
Elasticsearch for scalable search and analytics
Log Processing Pipeline
Logstash
Logstash is a server-side data processing pipeline that ingests data from multiple sources:
- Input Plugins
- Filter Plugins
- Output Plugins
Supported Input Sources:
- Beats: Lightweight shippers (Filebeat, Winlogbeat)
- Syslog: RFC3164 and RFC5424 formats
- HTTP: RESTful API endpoints
- TCP/UDP: Raw network data
- File: Direct file reading
- Kafka: Distributed streaming platform
- JDBC: Database connections
Logstash Configuration Example
IDS/IPS Alert Processing
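As a sketch of what such a pipeline might look like, the following configuration ingests IDS alerts (assumed here to be Suricata EVE JSON shipped by Filebeat); the port, hosts, and index name are illustrative assumptions, not fixed values from this deployment:

```
# Minimal Logstash pipeline sketch for IDS/IPS alerts
# (assumes Suricata EVE JSON arriving via Beats on port 5044;
#  adjust hosts, credentials, and index naming for your environment)
input {
  beats {
    port => 5044
  }
}

filter {
  # EVE events arrive as a JSON string in the message field
  json {
    source => "message"
  }
  # Tag alert events so later stages can route them separately
  if [event_type] == "alert" {
    mutate {
      add_tag => ["ids_alert"]
    }
  }
  # Use the event's own timestamp rather than ingestion time
  date {
    match => ["timestamp", "ISO8601"]
  }
}

output {
  elasticsearch {
    hosts => ["https://elasticsearch:9200"]
    index => "ids-alerts-%{+YYYY.MM.dd}"
  }
}
```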
Fluentd
Fluentd is a lightweight, Ruby-based log collector with a rich plugin ecosystem.
Why Use Fluentd?
Advantages:
- Lower memory footprint than Logstash
- Excellent for containerized environments
- Native Kubernetes integration
- Buffer and retry mechanisms
- 500+ community plugins
Fluentd vs Logstash
Use Fluentd when:
- Collecting logs from Kubernetes/Docker
- Resource constraints are critical
- A lightweight forwarding agent is needed

Use Logstash when:
- Complex data transformation is required
- An extensive plugin ecosystem is needed
- Tight Elastic Stack integration is preferred
Fluentd Configuration Example
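A minimal sketch of a Fluentd configuration in this role, tailing container logs and forwarding them to Elasticsearch (assumes the fluent-plugin-elasticsearch plugin is installed; paths, host, and index prefix are illustrative):

```
# Tail container log files and parse each line as JSON
<source>
  @type tail
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-containers.pos
  tag kubernetes.*
  <parse>
    @type json
  </parse>
</source>

# Forward matched events to Elasticsearch with a file-backed
# buffer so events survive restarts and transient outages
<match kubernetes.**>
  @type elasticsearch
  host elasticsearch
  port 9200
  logstash_format true
  logstash_prefix k8s-logs
  <buffer>
    @type file
    path /var/log/fluentd-buffer
    flush_interval 10s
    retry_max_times 5
  </buffer>
</match>
```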
Elasticsearch Storage
Elasticsearch serves as the primary storage and search engine for all security events.
Core Capabilities
Scalable Storage
Distributed architecture scales horizontally across nodes
Fast Search
Near real-time search across billions of documents
Aggregations
Complex analytics and statistical operations
RESTful API
JSON-based API for all operations
Cluster Architecture
- Node Types
- Index Management
- High Availability
Specialized Node Roles:
- Master Nodes: Cluster state management (3 nodes for HA)
- Data Nodes: Store indices and handle search queries
- Ingest Nodes: Pre-processing pipeline execution
- Coordinating Nodes: Route requests, merge results
Index Templates
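As a hedged sketch, a composable index template for the alert indices described above might apply ECS-style mappings and an ILM policy to every matching index at creation time (the index pattern, policy name `soc-logs-policy`, and field list are illustrative assumptions):

```
PUT _index_template/ids-alerts
{
  "index_patterns": ["ids-alerts-*"],
  "template": {
    "settings": {
      "number_of_shards": 2,
      "number_of_replicas": 1,
      "index.lifecycle.name": "soc-logs-policy"
    },
    "mappings": {
      "properties": {
        "@timestamp":     { "type": "date" },
        "source.ip":      { "type": "ip" },
        "destination.ip": { "type": "ip" },
        "event.severity": { "type": "integer" },
        "rule.name":      { "type": "keyword" }
      }
    }
  }
}
```

Explicit `ip` and `keyword` mappings avoid dynamic-mapping surprises and keep aggregations on these fields efficient.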
Data Retention and Lifecycle
Index Lifecycle Policies (ILM)
Automate index management through lifecycle phases:

Hot Phase (0-7 days):
- Actively written and queried
- High-performance SSD storage
- Full replicas for availability

Warm Phase:
- Read-only, infrequent queries
- Reduce replica count
- Force merge to fewer segments

Cold Phase:
- Rarely accessed
- Move to cheaper storage
- Minimum replicas

Delete Phase:
- Automatically delete old indices
- Or snapshot to long-term storage
ILM Configuration Example
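A sketch of a policy implementing the phases above (the policy name and the warm/cold/delete thresholds of 7, 30, and 90 days are illustrative; only the 7-day hot window is fixed by the description above):

```
PUT _ilm/policy/soc-logs-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_age": "7d", "max_primary_shard_size": "40gb" }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "allocate": { "number_of_replicas": 1 },
          "forcemerge": { "max_num_segments": 1 }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "allocate": { "number_of_replicas": 0 }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": { "delete": {} }
      }
    }
  }
}
```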
Data Processing Patterns
Parsing and Normalization
Standardize log formats across different sources:
- Grok Patterns
- Field Extraction
- Enrichment
Parse unstructured logs with Grok:
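For example, a Grok filter might extract ECS-style fields from a syslog-style firewall line (the pattern and field names below are illustrative; test any pattern against real samples, e.g. with the Kibana Grok Debugger, before production use):

```
filter {
  grok {
    # Matches lines like:
    #   Oct 12 08:15:01 fw01 DROP src=10.0.0.5 dst=192.0.2.1 dport=445
    match => {
      "message" => "%{SYSLOGTIMESTAMP:timestamp} %{HOSTNAME:host} %{WORD:action} src=%{IP:[source][ip]} dst=%{IP:[destination][ip]} dport=%{INT:[destination][port]}"
    }
  }
}
```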
Performance Optimization
Logstash Tuning
- Increase pipeline workers
- Adjust batch size and delay
- Use persistent queues
- Enable pipeline monitoring
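These tuning knobs live in `logstash.yml`; the values below are illustrative starting points, not recommendations (size workers to CPU cores and benchmark before settling):

```
pipeline.workers: 8        # defaults to the number of CPU cores
pipeline.batch.size: 250   # events per worker batch (default 125)
pipeline.batch.delay: 50   # ms to wait while filling a batch
queue.type: persisted      # disk-backed queue survives restarts
queue.max_bytes: 4gb       # cap disk usage for the persistent queue
```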
Elasticsearch Tuning
- Optimize JVM heap size (no more than 50% of RAM, and below 32GB to keep compressed object pointers)
- Use SSD storage for hot data
- Tune thread pools
- Configure circuit breakers
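The heap guidance translates into a fixed min/max heap in the JVM options; for example, a data node with 32GB of RAM (an assumed size) might use:

```
# jvm.options.d/heap.options sketch for a 32 GB RAM data node:
# ~50% of RAM, below the ~32 GB compressed-oops threshold,
# with -Xms equal to -Xmx to avoid resize pauses
-Xms16g
-Xmx16g
```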
Monitoring Pipeline Health
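A few built-in APIs cover the basics; the requests below (Kibana Dev Tools syntax) are a starting point, not an exhaustive monitoring setup:

```
GET _cluster/health                  # status, shard counts, pending tasks
GET _cat/indices?v&h=index,health,docs.count,store.size
GET _nodes/stats/jvm,thread_pool     # heap pressure and rejected tasks

# Logstash exposes its own monitoring API (default port 9600), e.g.:
#   curl -s localhost:9600/_node/stats/pipelines
# to inspect per-pipeline event throughput and filter timings.
```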
Integration with SOC Components
Data Flow
- Collection: Logstash/Fluentd collect from multiple sources
- Processing: Parse, enrich, and normalize events
- Storage: Write to Elasticsearch indices
- Analysis: Wazuh queries Elasticsearch for correlation
- Visualization: Dashboards display aggregated data
Source Integration Examples
- IDS/IPS Integration
- Firewall Logs
- Wazuh Agents
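As one concrete sketch of the IDS/IPS integration, a Filebeat input could ship Suricata's EVE JSON to Logstash (the path and host are illustrative assumptions; Wazuh agents, by contrast, report to their own manager rather than through Beats):

```
# filebeat.yml sketch: forward Suricata EVE JSON to Logstash
filebeat.inputs:
  - type: filestream
    id: suricata-eve
    paths:
      - /var/log/suricata/eve.json
    parsers:
      - ndjson:
          target: ""    # merge decoded JSON into the event root

output.logstash:
  hosts: ["logstash:5044"]
```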
Best Practices
Data Collection
- Use buffering to handle traffic spikes
- Implement input rate limiting
- Tag data at ingestion point
- Validate data format before processing
Processing
- Keep filters simple and efficient
- Use conditionals to avoid unnecessary processing
- Test Grok patterns before production
- Monitor filter execution time
Storage
- Implement time-based indices
- Use appropriate shard sizes (20-40GB)
- Enable compression for cold data
- Take regular snapshot backups
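Snapshots can be scheduled with snapshot lifecycle management; the sketch below assumes a snapshot repository named `soc_backups` has already been registered, and the schedule, index patterns, and retention values are illustrative:

```
PUT _slm/policy/nightly-snapshots
{
  "schedule": "0 30 1 * * ?",
  "name": "<soc-snap-{now/d}>",
  "repository": "soc_backups",
  "config": {
    "indices": ["ids-alerts-*", "firewall-*"]
  },
  "retention": {
    "expire_after": "30d",
    "min_count": 5,
    "max_count": 50
  }
}
```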
Security
- Enable TLS for all communications
- Implement role-based access control
- Audit log access and modifications
- Encrypt data at rest
Official Documentation
Logstash Guide
Complete Logstash documentation and plugin reference
Fluentd Documentation
Fluentd installation, configuration, and plugins
Elasticsearch Guide
Comprehensive Elasticsearch documentation
Elastic Common Schema
ECS field naming conventions for normalization
Next Steps
- Configure SIEM Platform to query Elasticsearch indices
- Set up Detection Layer log forwarding
- Implement Infrastructure Monitoring integration
