Overview
UTMStack processes security data through a pipeline optimized for real-time threat detection. Its key innovation is pre-ingestion correlation: data is analyzed and correlated before permanent storage, which reduces overhead and improves response times.
Complete Data Flow Diagram
Detailed Data Flow Stages
Stage 1: Data Collection
Purpose: Gather security events from diverse sources
Agent-Based Collection
The UTMStack agent (written in Go) collects data using multiple methods:
Filebeat Integration:
- Monitors log files on disk
- Supports multi-line event parsing
- Handles log rotation automatically
- Configurable file patterns
Netflow Collection:
- Listens for Netflow v1, v5, v6, v7, v9
- IPFIX protocol support
- Real-time packet metadata extraction
- Flow aggregation and sampling
Syslog Server:
- RFC 3164 and RFC 5424 compliant
- TCP and UDP listeners
- TLS-encrypted syslog (RFC 5425)
- Supports CEF and LEEF formats
Operating System Collection:
- Windows Event Log collection
- Linux auditd integration
- macOS unified logging
- Process and file monitoring
Direct Integrations
Cloud Services:
- AWS CloudTrail, VPC Flow Logs, GuardDuty
- Azure Activity Logs, Security Center
- Google Cloud Logging
- Office 365 audit logs
Security Tools:
- Firewall logs (Palo Alto, Fortinet, Cisco)
- IDS/IPS (Snort, Suricata)
- EDR platforms
- Email security gateways
Stage 2: Local Buffering
Purpose: Ensure reliable delivery and handle network interruptions
The agent maintains a local SQLite database that:
- Buffers events when the server is unreachable
- Implements retry logic with exponential backoff
- Prevents data loss during network issues
- Manages disk space with automatic cleanup
- Provides local query capability for debugging
Stage 3: Secure Transport
Purpose: Securely transmit events to the UTMStack server
gRPC Communication
The agent communicates with the backend using gRPC over TLS:
- Bidirectional streaming for efficient data transfer
- Protocol Buffers for compact serialization
- HTTP/2 multiplexing reduces connection overhead
- Built-in authentication with certificates
Authentication
- Each agent uses a unique 24+ character key
- TLS client certificates for mutual authentication
- Key rotation without service interruption
- Centralized key management in backend
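A minimal Go sketch of agent key generation and the 24+ character length check (the function names and base64 encoding are assumptions for illustration; the real backend manages keys centrally):

```go
package main

import (
	"crypto/rand"
	"encoding/base64"
	"fmt"
)

// minKeyLen reflects the "unique 24+ character key" requirement above.
const minKeyLen = 24

// newAgentKey generates a random key; 18 random bytes encode to exactly
// 24 URL-safe base64 characters.
func newAgentKey() (string, error) {
	b := make([]byte, 18)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return base64.RawURLEncoding.EncodeToString(b), nil
}

// validKey enforces the minimum length check an agent or server might apply.
func validKey(k string) bool { return len(k) >= minKeyLen }

func main() {
	k, err := newAgentKey()
	if err != nil {
		panic(err)
	}
	fmt.Println(len(k), validKey(k)) // 24 true
}
```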
Stage 4: Parsing and Normalization
Purpose: Convert diverse log formats into a unified schema
Log Parser
Supports multiple formats:
- JSON: Native parsing with schema validation
- Syslog: RFC-compliant parsing with custom patterns
- CEF: Common Event Format from security tools
- LEEF: Log Event Extended Format
- CSV: Custom delimiter support
- Key-Value: Generic key=value parsing
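As a rough illustration, key=value parsing followed by mapping into a common schema might look like this in Go (`parseKV`, `normalize`, and the field names are illustrative, not UTMStack's actual schema; quoting and escaping are omitted):

```go
package main

import (
	"fmt"
	"strings"
)

// parseKV parses a generic key=value log line into a map, the simplest of
// the formats listed above.
func parseKV(line string) map[string]string {
	fields := map[string]string{}
	for _, tok := range strings.Fields(line) {
		if k, v, ok := strings.Cut(tok, "="); ok {
			fields[k] = v
		}
	}
	return fields
}

// schemaMap maps source-specific keys to a common schema.
var schemaMap = map[string]string{"src": "source.ip", "dst": "destination.ip"}

// normalize rewrites known keys to their canonical names and passes the
// rest through unchanged.
func normalize(raw map[string]string) map[string]string {
	out := map[string]string{}
	for k, v := range raw {
		if canon, ok := schemaMap[k]; ok {
			k = canon
		}
		out[k] = v
	}
	return out
}

func main() {
	ev := normalize(parseKV("action=deny src=10.0.0.5 dst=8.8.8.8 proto=tcp"))
	fmt.Println(ev["action"], ev["source.ip"]) // deny 10.0.0.5
}
```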
Field Normalization
Maps source-specific fields to a common schema.
Stage 5: Enrichment
Purpose: Add context to events for better analysis
GeoIP Enrichment
- IP address to geographic location
- ASN and organization information
- City, country, and region data
- Threat reputation scores
Asset Correlation
- Link events to known assets
- User and device information
- Business context (department, criticality)
- Historical behavior baselines
Threat Intelligence
- Check IPs against threat feeds
- Domain reputation lookup
- File hash analysis (VirusTotal, etc.)
- MITRE ATT&CK technique mapping
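A toy Go sketch of the threat-feed lookup step (the in-memory feed, field names, and `enrich` function are stand-ins for real feed integrations):

```go
package main

import "fmt"

// threatFeed is a toy in-memory feed of known-bad IPs; real deployments
// pull and refresh feeds from external providers.
var threatFeed = map[string]string{
	"203.0.113.7": "known C2 server", // RFC 5737 documentation address
}

// enrich attaches a reputation note to an event when its source IP
// appears on a feed.
func enrich(event map[string]string) map[string]string {
	if reason, bad := threatFeed[event["source.ip"]]; bad {
		event["threat.indicator"] = reason
	}
	return event
}

func main() {
	ev := enrich(map[string]string{"source.ip": "203.0.113.7"})
	fmt.Println(ev["threat.indicator"]) // known C2 server
}
```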
Stage 6: Real-Time Correlation
Purpose: Detect threats by correlating events BEFORE storage
This is UTMStack's key differentiator. The correlation engine:
Pattern Matching
- Applies correlation rules in real-time
- Matches against known attack patterns
- Uses stateful analysis (tracks sessions)
- Implements complex event processing
Behavioral Analysis
- Compares against learned baselines
- Detects statistical anomalies
- Identifies unusual access patterns
- Recognizes privilege escalation attempts
Alert Generation
- Creates alerts for matched rules
- Assigns severity and confidence scores
- Links related events (incident timeline)
- Enriches with context and recommendations
Stage 7: Storage
Purpose: Persist data for search, analysis, and compliance
OpenSearch/Elasticsearch
- What: Raw and correlated log events
- Why: Fast full-text search and aggregations
- Retention: Configurable (default 30 days hot storage)
- Indexing: Daily indices with automatic rollover
PostgreSQL
- What: Alerts, incidents, configuration, users
- Why: ACID compliance for critical data
- Retention: Long-term (years)
- Schema: Relational with foreign keys
Redis
- What: Session data, real-time counters, cache
- Why: In-memory speed for active sessions
- Retention: Short-term (hours to days)
- Persistence: AOF for durability
Stage 8: Query and Analysis
Purpose: Enable security analysts to investigate and respond
Real-Time Dashboards
- WebSocket updates for live data
- ECharts visualizations
- Customizable widgets
- Drill-down capabilities
Log Search
- Lucene query syntax
- Field-specific searches
- Time-range filtering
- Export to CSV/JSON
Analytics
- Aggregation queries
- Trend analysis
- Comparative analysis
- Statistical functions
Data Retention Strategy
Hot Storage (OpenSearch/Elasticsearch)
- Recent data (default 30 days)
- Immediately searchable
- High-performance SSD storage
- Regular indices
Warm Storage
- Older data (30-90 days)
- Read-only indices
- Compressed for space efficiency
- Slower but still accessible
Cold Storage
- Archive data (90+ days)
- Snapshot to object storage
- Restore required before search
- Compliance retention
Retention Configuration
Performance Optimization
Batching
- Agents batch events (default 100 events or 5 seconds)
- Reduces network overhead
- Improves indexing throughput
Compression
- gRPC automatic compression
- Reduces bandwidth by 70-80%
- Minimal CPU overhead
Caching
- Frequently accessed data cached in Redis
- Reduces database load
- Improves dashboard response time
Indexing Strategy
- Time-based indices for log data
- Index templates for consistency
- Mapping optimization for common fields
Monitoring Data Flow
Track data flow health with these metrics:
- Collection Rate: Events/second per agent
- Buffer Size: Events in local buffer
- Transport Lag: Delay between collection and ingestion
- Processing Rate: Events/second through pipeline
- Correlation Rate: Alerts generated/hour
- Storage Rate: Data indexed/second
- Query Performance: Average search response time
Next Steps
Agent System
Learn more about agent-based data collection
Correlation Engine
Understand real-time correlation
Data Storage
Deep dive into storage architecture
Performance Tuning
Optimize data processing performance