How Sources Work
Sources run as independent async tasks that:

- Connect to external systems or listen for incoming data
- Read raw data from files, network sockets, APIs, or system interfaces
- Parse data into Vector’s internal event format (logs, metrics, or traces)
- Emit events downstream to transforms and sinks
- Handle backpressure by slowing down data ingestion when buffers fill
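As an illustrative sketch of this flow (component names and paths are hypothetical), a minimal configuration wires a source into a pipeline and the sink consumes its events via `inputs`:

```toml
# A file source reading application logs (hypothetical path and names).
[sources.app_logs]
type = "file"
include = ["/var/log/app/*.log"]

# Events emitted by the source flow downstream to this sink.
[sinks.console]
type = "console"
inputs = ["app_logs"]
encoding.codec = "json"
```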
Source Types
Pull Sources
Actively fetch data from external systems (polling)
Push Sources
Listen for incoming data sent by external systems
Generator Sources
Generate synthetic data for testing and demos
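One example of each type, as a sketch (addresses and intervals are illustrative):

```toml
# Pull: polls a Prometheus-compatible endpoint on an interval.
[sources.pull_example]
type = "prometheus_scrape"
endpoints = ["http://127.0.0.1:9100/metrics"]
scrape_interval_secs = 15

# Push: listens for data sent by external systems.
[sources.push_example]
type = "socket"
mode = "tcp"
address = "0.0.0.0:9000"

# Generator: emits synthetic events for testing.
[sources.generator_example]
type = "demo_logs"
format = "syslog"
```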
Source Categories
File and Log Collection
file
Read logs from files with glob patterns and automatic rotation handling
docker_logs
Collect container logs from Docker daemon
kubernetes_logs
Gather logs from Kubernetes pods via the Kubernetes API
journald
Read from systemd’s journal for system logs
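A sketch of a `file` source using glob patterns (paths are illustrative):

```toml
[sources.app_files]
type = "file"
include = ["/var/log/app/*.log"]       # glob patterns
exclude = ["/var/log/app/debug.log"]
read_from = "beginning"                # or "end" to tail only new data
```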
Network and Syslog
syslog
Receive syslog messages via TCP, UDP, or Unix sockets
socket
Listen for data on TCP or UDP sockets
http_server
Accept HTTP/HTTPS POST requests with log or metric data
fluent
Receive data in Fluentd’s forward protocol
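For example, a minimal `syslog` source listening on TCP (address is illustrative):

```toml
[sources.syslog_tcp]
type = "syslog"
mode = "tcp"
address = "0.0.0.0:514"
```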
Cloud Platform Sources
aws_s3
Read logs from S3 buckets with SQS notifications
aws_sqs
Poll messages from AWS SQS queues
aws_kinesis_firehose
Receive data from Kinesis Data Firehose
gcp_pubsub
Subscribe to Google Cloud Pub/Sub topics
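A sketch of an `aws_s3` source driven by SQS notifications (region and queue URL are hypothetical placeholders):

```toml
[sources.s3_logs]
type = "aws_s3"
region = "us-east-1"
sqs.queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"
```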
Metrics and System Observability
host_metrics
Collect CPU, memory, disk, and network metrics from the host
prometheus_scrape
Scrape Prometheus-compatible metric endpoints
internal_metrics
Expose Vector’s own internal metrics
statsd
Receive StatsD metrics via UDP
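A sketch of a `host_metrics` source with an explicit collector list (interval is illustrative):

```toml
[sources.system]
type = "host_metrics"
collectors = ["cpu", "memory", "disk", "network"]
scrape_interval_secs = 30
```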
Application and Service Logs
exec
Execute commands and collect stdout/stderr as logs
stdin
Read logs from standard input (useful for testing)
datadog_agent
Receive logs and metrics from Datadog agents
heroku_logs
Collect logs from Heroku log drains
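A sketch of an `exec` source running a command on a schedule (command and interval are illustrative):

```toml
[sources.uptime]
type = "exec"
mode = "scheduled"
command = ["uptime"]
scheduled.exec_interval_secs = 60
```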
Development and Testing
demo_logs
Generate realistic fake logs for testing
internal_logs
Capture Vector’s own internal logs
Source Configuration
Common Options
All sources support a set of common options in addition to their type-specific settings.

Acknowledgements
Sources can wait for downstream confirmation before acknowledging data.

Acknowledgements add latency but provide stronger delivery guarantees. Only enable them for critical data paths where loss is unacceptable.
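A sketch of enabling end-to-end acknowledgements. In Vector the `acknowledgements.enabled` option is set globally or per sink, and participating sources then wait for delivery confirmation (component names and paths are hypothetical):

```toml
# Global default; can also be set on individual sinks.
[acknowledgements]
enabled = true

[sources.audit]
type = "file"
include = ["/var/log/audit/*.log"]

[sinks.archive]
type = "console"
inputs = ["audit"]
encoding.codec = "json"
# With acknowledgements enabled, the file source will not checkpoint
# past events until this sink confirms delivery.
acknowledgements.enabled = true
```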
Multi-line Log Handling
Many sources support multi-line log aggregation.

Data Types
Sources produce specific event types:

| Source | Event Type | Notes |
|---|---|---|
| `file`, `socket`, `syslog` | Logs | String data with optional parsing |
| `host_metrics`, `prometheus_scrape` | Metrics | Counters, gauges, histograms |
| `datadog_agent` (traces) | Traces | Distributed tracing spans |
| `internal_metrics` | Metrics | Vector's operational metrics |
| `demo_logs` | Logs | Synthetic log generation |
Backpressure Handling
Sources respect backpressure from downstream components.

How It Works
- When transforms or sinks are slow, their input buffers fill
- Backpressure propagates to the source’s output channel
- The source pauses reading new data (file positions maintained)
- For network sources, TCP windows shrink or connections queue
- When buffers drain, sources resume automatically
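The buffers involved are configured per sink; a sketch of a memory buffer that blocks (and so propagates backpressure upstream) rather than dropping events (source name and sizes are illustrative):

```toml
[sinks.slow_sink]
type = "console"
inputs = ["app_logs"]   # hypothetical source name
encoding.codec = "json"

[sinks.slow_sink.buffer]
type = "memory"
max_events = 1000
when_full = "block"   # exert backpressure instead of dropping events
```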
File Sources
File sources checkpoint their position:

- Current read position is saved regularly
- On restart, reading resumes from the last checkpoint
- No data is lost during restarts or crashes
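Checkpoints are stored under Vector's data directory, set as a global option (path is illustrative):

```toml
# Global option; file source checkpoints live under this directory.
data_dir = "/var/lib/vector"

[sources.app_logs]
type = "file"
include = ["/var/log/app/*.log"]
```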
Network Sources
Network sources handle backpressure via:

- TCP: Send window shrinks, clients slow down naturally
- UDP: Packets may be dropped (UDP is inherently lossy)
- HTTP: Requests queue; configure appropriate timeouts
Performance Considerations
File Source Optimization
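A sketch of common file-source tuning options (values are illustrative defaults, not recommendations):

```toml
[sources.busy_files]
type = "file"
include = ["/var/log/app/*.log"]
max_read_bytes = 2048             # bytes read per file per iteration
glob_minimum_cooldown_ms = 5000   # how often include globs are re-evaluated
oldest_first = true               # prioritize reading older files first
```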
Network Source Tuning
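A sketch of socket-source tuning (values are illustrative):

```toml
[sources.tcp_in]
type = "socket"
mode = "tcp"
address = "0.0.0.0:9000"
receive_buffer_bytes = 65536   # OS-level socket receive buffer size
connection_limit = 100         # cap concurrent TCP connections
```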
Best Practices
Choose the right source type
- Use `file` for application logs written to disk
- Use `syslog` for centralized syslog collection
- Use `http_server` for application-direct log shipping
- Use cloud-native sources (S3, Pub/Sub) for cloud logs
- Use `host_metrics` for infrastructure monitoring
Configure appropriate timeouts
Set reasonable timeouts for network sources:
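One timeout-related knob, sketched on a socket source (value is illustrative; other network sources expose their own timeout options):

```toml
[sources.tcp_in]
type = "socket"
mode = "tcp"
address = "0.0.0.0:9000"
shutdown_timeout_secs = 30   # how long to wait for connections to drain on shutdown
```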
Use acknowledgements wisely
Enable acknowledgements only for critical data:

- Financial transactions
- Security audit logs
- Compliance-required data

Avoid acknowledgements for:

- Debug logs
- Sampled metrics
- Non-critical telemetry
Monitor source health
Track source metrics:
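A sketch of exposing Vector's own telemetry for scraping (exporter address is illustrative):

```toml
[sources.vector_metrics]
type = "internal_metrics"

[sinks.metrics_out]
type = "prometheus_exporter"
inputs = ["vector_metrics"]
address = "0.0.0.0:9598"
```

Per-component counters such as `component_received_events_total` and `component_errors_total` are useful starting points for source health.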
Handle multi-line logs correctly
Configure multi-line aggregation at the source (not in transforms):
- More efficient (less data movement)
- Preserves log context
- Reduces downstream processing
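A sketch of source-level multi-line aggregation for indented continuation lines such as stack traces (patterns are illustrative):

```toml
[sources.app_logs]
type = "file"
include = ["/var/log/app/*.log"]

[sources.app_logs.multiline]
start_pattern = '^[^\s]'          # a non-indented line starts a new event
mode = "continue_through"
condition_pattern = '^[\s]+'      # indented lines continue the current event
timeout_ms = 1000                 # flush if no continuation arrives in time
```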
Troubleshooting
Source Not Receiving Data
- Check connectivity: Verify network/file access
- Verify configuration: Use Vector's `validate` command
- Check permissions: Ensure Vector can read files/bind ports
- Monitor metrics: Check internal metrics for errors
High Memory Usage
- Reduce `max_read_bytes` for file sources
- Decrease `receive_buffer_bytes` for network sources
- Add buffering between source and slow sinks
- Enable sampling for high-volume sources
File Source Missing Data
- Check the `ignore_older_secs` setting
- Verify `read_from` is set to `beginning` if historical data is needed
- Ensure file patterns match correctly
- Check Vector’s data directory for checkpoint corruption
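A sketch combining the settings above when historical data appears to be missing (path and values are illustrative):

```toml
[sources.history]
type = "file"
include = ["/var/log/app/*.log"]
read_from = "beginning"     # read existing file contents, not just new lines
ignore_older_secs = 86400   # skips files untouched for 24h; raise or remove if data is missing
```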
Related Topics
- Data Model - Understanding event types
- Pipeline Model - How sources connect to transforms and sinks
- Transforms - Processing events from sources
- Buffering - Managing backpressure from sources