Sources are Vector components that ingest data from external systems. They generate events (logs, metrics, or traces) and send them downstream to transforms and sinks. Vector includes 30+ built-in sources for common data collection scenarios.

How Sources Work

Sources run as independent async tasks that:
  1. Connect to external systems or listen for incoming data
  2. Read raw data from files, network sockets, APIs, or system interfaces
  3. Parse data into Vector’s internal event format (logs, metrics, or traces)
  4. Emit events downstream to transforms and sinks
  5. Handle backpressure by slowing down data ingestion when buffers fill
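The flow above can be sketched as a minimal pipeline. This is an illustrative configuration (component names like `app_logs` and `parse` are placeholders, not fixed conventions):

```yaml
sources:
  app_logs:                  # Steps 1-2: connect and read raw data
    type: file
    include:
      - /var/log/app/*.log

transforms:
  parse:                     # Step 4: events flow downstream to transforms
    type: remap
    inputs: [app_logs]
    source: |
      . = parse_json!(.message)

sinks:
  out:                       # ...and finally to sinks
    type: console
    inputs: [parse]
    encoding:
      codec: json
```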

Source Types

Pull Sources

Actively fetch data from external systems (polling)

Push Sources

Listen for incoming data sent by external systems

Generator Sources

Generate synthetic data for testing and demos
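One sketch of each type, side by side (source names are illustrative; addresses and endpoints are assumptions for the example):

```yaml
sources:
  pull_example:              # Pull: actively polls an external system
    type: prometheus_scrape
    endpoints:
      - http://localhost:9090/metrics
  push_example:              # Push: listens for incoming data
    type: syslog
    address: 0.0.0.0:514
    mode: udp
  generator_example:         # Generator: synthesizes test data
    type: demo_logs
    format: json
```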

Source Categories

File and Log Collection

file

Read logs from files with glob patterns and automatic rotation handling

docker_logs

Collect container logs from Docker daemon

kubernetes_logs

Gather logs from Kubernetes pods via the Kubernetes API

journald

Read from systemd’s journal for system logs

Example: File source with rotation
sources:
  app_logs:
    type: file
    include:
      - /var/log/myapp/*.log
      - /var/log/myapp/**/*.log  # Recursive
    exclude:
      - /var/log/myapp/*.gz       # Ignore compressed files
    read_from: beginning           # Or 'end' for new data only
    max_line_bytes: 102400         # 100KB max line size
    ignore_older_secs: 86400       # Skip files older than 1 day
    fingerprint:
      strategy: checksum            # Track files across rotations

Network and Syslog

syslog

Receive syslog messages via TCP, UDP, or Unix sockets

socket

Listen for data on TCP or UDP sockets

http_server

Accept HTTP/HTTPS POST requests with log or metric data

fluent

Receive data in Fluentd’s forward protocol

Example: HTTP server source
sources:
  http_logs:
    type: http_server
    address: 0.0.0.0:8080
    path: /logs                    # Accept POST to /logs
    encoding: json                  # Expect JSON payloads
    headers:
      - Authorization               # Capture these headers as fields
    tls:
      enabled: true
      crt_file: /path/to/cert.pem
      key_file: /path/to/key.pem

Cloud Platform Sources

aws_s3

Read logs from S3 buckets with SQS notifications

aws_sqs

Poll messages from AWS SQS queues

aws_kinesis_firehose

Receive data from Kinesis Data Firehose

gcp_pubsub

Subscribe to Google Cloud Pub/Sub topics

Example: AWS S3 source with SQS
sources:
  s3_logs:
    type: aws_s3
    region: us-east-1
    strategy: sqs                   # Use SQS for notifications
    sqs:
      queue_url: https://sqs.us-east-1.amazonaws.com/123456789/logs-queue
      poll_secs: 15
    compression: auto               # Auto-detect gzip, zstd, etc.
    multiline:                      # Handle multi-line logs
      start_pattern: '^\[\d{4}-'    # Lines starting with [2024-...
      mode: continue_past
      timeout_ms: 1000

Metrics and System Observability

host_metrics

Collect CPU, memory, disk, and network metrics from the host

prometheus_scrape

Scrape Prometheus-compatible metric endpoints

internal_metrics

Expose Vector’s own internal metrics

statsd

Receive StatsD metrics via UDP

Example: Host metrics collection
sources:
  system_metrics:
    type: host_metrics
    scrape_interval_secs: 15
    collectors:
      - cpu
      - disk
      - memory
      - network
      - filesystem
    namespace: host                 # Prefix metrics with 'host.'
    filesystem:
      devices:
        excludes: [tmpfs, devfs]    # Skip virtual filesystems
      filesystems:
        includes: [ext4, xfs]

Application and Service Logs

exec

Execute commands and collect stdout/stderr as logs

stdin

Read logs from standard input (useful for testing)

datadog_agent

Receive logs and metrics from Datadog agents

heroku_logs

Collect logs from Heroku log drains

Example: Execute command periodically
sources:
  custom_check:
    type: exec
    mode: scheduled
    scheduled:
      exec_interval_secs: 300       # Run every 5 minutes
    command:
      - /usr/local/bin/check-health.sh
    decoding:
      codec: json                   # Parse output as JSON

Development and Testing

demo_logs

Generate realistic fake logs for testing

internal_logs

Capture Vector’s own internal logs

Example: Demo logs for testing
sources:
  test_data:
    type: demo_logs
    format: apache_common          # apache_common, apache_error, syslog, json
    interval: 0.1                   # Generate 10 logs per second
    count: 1000                     # Stop after 1000 logs (optional)

Source Configuration

Common Options

Most sources support a common set of options:
sources:
  my_source:
    type: <source_type>
    
    # Optional: Acknowledgements for delivery guarantees
    acknowledgements:
      enabled: true                 # Wait for sink confirmation
    
    # Optional: Decoding configuration
    decoding:
      codec: json                   # json, bytes, gelf, native, etc.
    
    # Optional: Framing for network sources
    framing:
      method: newline_delimited     # newline_delimited, character_delimited, length_delimited

Acknowledgements

Sources can wait for downstream confirmation before acknowledging data:
sources:
  critical_logs:
    type: kafka
    bootstrap_servers: localhost:9092
    topics:
      - critical-app-logs
    acknowledgements:
      enabled: true                 # Don't commit Kafka offsets until sinks confirm

sinks:
  elasticsearch:
    type: elasticsearch
    inputs:
      - critical_logs
    acknowledgements:
      enabled: true                 # Confirm when data is written
Acknowledgements add latency but provide stronger delivery guarantees. Only enable for critical data paths where loss is unacceptable.

Multi-line Log Handling

Many sources support multi-line log aggregation:
sources:
  java_logs:
    type: file
    include:
      - /var/log/myapp/*.log
    multiline:
      start_pattern: '^\[\d{4}-\d{2}-\d{2}'   # New events start with a timestamp
      condition_pattern: '^\s'                 # Indented continuation lines (e.g. stack traces)
      mode: continue_through
      timeout_ms: 1000                         # Flush incomplete aggregates after 1s

Data Types

Sources produce specific event types:
Source                             Event Type   Notes
file, socket, syslog               Logs         String data with optional parsing
host_metrics, prometheus_scrape    Metrics      Counters, gauges, histograms
datadog_agent (traces)             Traces       Distributed tracing spans
internal_metrics                   Metrics      Vector's operational metrics
demo_logs                          Logs         Synthetic log generation

Backpressure Handling

Sources respect backpressure from downstream components:

How It Works

  1. When transforms or sinks are slow, their input buffers fill
  2. Backpressure propagates to the source’s output channel
  3. The source pauses reading new data (file positions maintained)
  4. For network sources, TCP windows shrink or connections queue
  5. When buffers drain, sources resume automatically
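The steps above depend on downstream buffers filling rather than dropping. A hedged sketch of a sink buffer configured to block, which is what propagates backpressure to the source (sink name and input are illustrative):

```yaml
sinks:
  slow_sink:
    type: elasticsearch
    inputs: [app_logs]
    buffer:
      type: memory
      max_events: 10000
      when_full: block     # Propagate backpressure upstream instead of dropping events
```

With `when_full: drop_newest` instead, the buffer sheds load and the source never pauses; choose per data path.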

File Sources

File sources checkpoint their position:
  • Current read position is saved regularly
  • On restart, reading resumes from the last checkpoint
  • No data is lost during restarts or crashes
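Checkpoints are persisted under Vector's data directory, which is set as a global option (the path here is illustrative):

```yaml
# Top level of vector.yaml (not inside a source)
data_dir: /var/lib/vector   # File source checkpoints are stored here

sources:
  app_logs:
    type: file
    include:
      - /var/log/app/*.log
```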

Network Sources

Network sources handle backpressure via:
  • TCP: Send window shrinks, clients slow down naturally
  • UDP: Packets may be dropped (UDP is inherently lossy)
  • HTTP: Requests queue; configure appropriate timeouts

Performance Considerations

File Source Optimization

sources:
  high_throughput_logs:
    type: file
    include:
      - /var/log/high-volume/*.log
    
    # Increase read buffer for throughput
    max_read_bytes: 1048576         # Read 1MB at a time (default: 2KB)
    
    # Store checkpoints on fast local storage
    data_dir: /var/lib/vector
    
    # Use faster fingerprinting
    fingerprint:
      strategy: device_and_inode    # Faster than checksum

Network Source Tuning

sources:
  high_volume_syslog:
    type: syslog
    address: 0.0.0.0:514
    mode: tcp
    
    # Increase connection limits
    max_length: 1048576             # 1MB max message size
    receive_buffer_bytes: 65536     # 64KB socket buffer
    
    # TCP keepalive
    keepalive:
      time_secs: 60

Best Practices

  • Use file for application logs written to disk
  • Use syslog for centralized syslog collection
  • Use http_server for application-direct log shipping
  • Use cloud-native sources (S3, Pub/Sub) for cloud logs
  • Use host_metrics for infrastructure monitoring
Set reasonable timeouts for network sources:
sources:
  http_logs:
    type: http_server
    keepalive:
      max_connection_age_secs: 300  # Recycle long-lived connections
Enable acknowledgements only for critical data:
  • Financial transactions
  • Security audit logs
  • Compliance-required data
Disable for high-volume, best-effort data:
  • Debug logs
  • Sampling metrics
  • Non-critical telemetry
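For such best-effort paths, acknowledgements can be disabled explicitly (source name and path are illustrative):

```yaml
sources:
  debug_logs:
    type: file
    include:
      - /var/log/app/debug/*.log
    acknowledgements:
      enabled: false    # Best-effort: don't wait for sink confirmation
```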
Track source metrics:
sources:
  vector_metrics:
    type: internal_metrics

transforms:
  filter_source_metrics:
    type: filter
    inputs:
      - vector_metrics
    condition: |
      starts_with!(.name, "component_received_events_total") ||
      starts_with!(.name, "component_errors_total")
Configure multi-line aggregation at the source (not in transforms):
  • More efficient (less data movement)
  • Preserves log context
  • Reduces downstream processing

Troubleshooting

Source Not Receiving Data

  1. Check connectivity: Verify network/file access
    # For network sources
    telnet localhost 8080
    
    # For file sources
    ls -la /var/log/myapp/
    
  2. Verify configuration: Use Vector’s validate command
    vector validate --config /etc/vector/vector.yaml
    
  3. Check permissions: Ensure Vector can read files/bind ports
    # File permissions
    sudo -u vector cat /var/log/myapp/app.log
    
    # Port binding (< 1024 requires privileges)
    sudo setcap 'cap_net_bind_service=+ep' /usr/bin/vector
    
  4. Monitor metrics: Check internal metrics for errors
    curl http://localhost:8686/metrics | grep component_errors
    

High Memory Usage

  • Reduce max_read_bytes for file sources
  • Decrease receive_buffer_bytes for network sources
  • Add buffering between source and slow sinks
  • Enable sampling for high-volume sources
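A sketch combining the first two tunings for a file source (values are illustrative starting points, not recommendations):

```yaml
sources:
  big_files:
    type: file
    include:
      - /var/log/huge/*.log
    max_read_bytes: 16384     # Smaller read chunks cap per-read memory
    max_line_bytes: 32768     # Truncate pathological lines instead of buffering them
```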

File Source Missing Data

  • Check ignore_older_secs setting
  • Verify read_from is set to beginning if historical data is needed
  • Ensure file patterns match correctly
  • Check Vector’s data directory for checkpoint corruption
