Sources are Vector components that ingest data from external systems. They generate events (logs, metrics, or traces) and send them downstream to transforms and sinks. Vector includes 30+ built-in sources for common data collection scenarios.

How Sources Work

Sources run as independent async tasks that:
  1. Connect to external systems or listen for incoming data
  2. Read raw data from files, network sockets, APIs, or system interfaces
  3. Parse data into Vector’s internal event format (logs, metrics, or traces)
  4. Emit events downstream to transforms and sinks
  5. Handle backpressure by slowing down data ingestion when buffers fill
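The flow above can be sketched as a minimal pipeline. This is an illustrative configuration (component names like `app_logs` and `parse` are placeholders, not fixed conventions):

```yaml
sources:
  app_logs:                  # Steps 1-2: connect and read raw data
    type: file
    include:
      - /var/log/app/*.log

transforms:
  parse:                     # Step 4: events flow downstream to transforms
    type: remap
    inputs: [app_logs]
    source: |
      . = parse_json!(.message)

sinks:
  out:                       # ...and finally to sinks
    type: console
    inputs: [parse]
    encoding:
      codec: json
```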

Source Types

Pull Sources

Actively fetch data from external systems (polling)

Push Sources

Listen for incoming data sent by external systems

Generator Sources

Generate synthetic data for testing and demos
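One sketch of each type, side by side (source names are illustrative; addresses and endpoints are assumptions for the example):

```yaml
sources:
  pull_example:              # Pull: actively polls an external system
    type: prometheus_scrape
    endpoints:
      - http://localhost:9090/metrics
  push_example:              # Push: listens for incoming data
    type: syslog
    address: 0.0.0.0:514
    mode: udp
  generator_example:         # Generator: synthesizes test data
    type: demo_logs
    format: json
```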

Source Categories

File and Log Collection

file

Read logs from files with glob patterns and automatic rotation handling

docker_logs

Collect container logs from Docker daemon

kubernetes_logs

Gather logs from Kubernetes pods via the Kubernetes API

journald

Read from systemd’s journal for system logs

Example: File source with rotation
sources:
  app_logs:
    type: file
    include:
      - /var/log/myapp/*.log
      - /var/log/myapp/**/*.log  # Recursive
    exclude:
      - /var/log/myapp/*.gz       # Ignore compressed files
    read_from: beginning           # Or 'end' for new data only
    max_line_bytes: 102400         # 100KB max line size
    ignore_older_secs: 86400       # Skip files older than 1 day
    fingerprint:
      strategy: checksum            # Track files across rotations

Network and Syslog

syslog

Receive syslog messages via TCP, UDP, or Unix sockets

socket

Listen for data on TCP or UDP sockets

http_server

Accept HTTP/HTTPS POST requests with log or metric data

fluent

Receive data in Fluentd’s forward protocol

Example: HTTP server source
sources:
  http_logs:
    type: http_server
    address: 0.0.0.0:8080
    path: /logs                    # Accept POST to /logs
    encoding: json                  # Expect JSON payloads
    headers:
      - Authorization               # Capture these headers as fields
    tls:
      enabled: true
      crt_file: /path/to/cert.pem
      key_file: /path/to/key.pem

Cloud Platform Sources

aws_s3

Read logs from S3 buckets with SQS notifications

aws_sqs

Poll messages from AWS SQS queues

aws_kinesis_firehose

Receive data from Kinesis Data Firehose

gcp_pubsub

Subscribe to Google Cloud Pub/Sub topics

Example: AWS S3 source with SQS
sources:
  s3_logs:
    type: aws_s3
    region: us-east-1
    strategy: sqs                   # Use SQS for notifications
    sqs:
      queue_url: https://sqs.us-east-1.amazonaws.com/123456789/logs-queue
      poll_secs: 15
    compression: auto               # Auto-detect gzip, zstd, etc.
    multiline:                      # Handle multi-line logs
      start_pattern: '^\[\d{4}-'    # Lines starting with [2024-...
      mode: continue_past
      timeout_ms: 1000

Metrics and System Observability

host_metrics

Collect CPU, memory, disk, and network metrics from the host

prometheus_scrape

Scrape Prometheus-compatible metric endpoints

internal_metrics

Expose Vector’s own internal metrics

statsd

Receive StatsD metrics via UDP

Example: Host metrics collection
sources:
  system_metrics:
    type: host_metrics
    scrape_interval_secs: 15
    collectors:
      - cpu
      - disk
      - memory
      - network
      - filesystem
    namespace: host                 # Prefix metrics with 'host.'
    filesystem:
      devices:
        excludes: [tmpfs, devfs]    # Skip virtual filesystems
      filesystems:
        includes: [ext4, xfs]

Application and Service Logs

exec

Execute commands and collect stdout/stderr as logs

stdin

Read logs from standard input (useful for testing)

datadog_agent

Receive logs and metrics from Datadog agents

heroku_logs

Collect logs from Heroku log drains

Example: Execute command periodically
sources:
  custom_check:
    type: exec
    mode: scheduled
    scheduled:
      exec_interval_secs: 300       # Run every 5 minutes
    command:
      - /usr/local/bin/check-health.sh
    decoding:
      codec: json                   # Parse output as JSON

Development and Testing

demo_logs

Generate realistic fake logs for testing

internal_logs

Capture Vector’s own internal logs

Example: Demo logs for testing
sources:
  test_data:
    type: demo_logs
    format: apache_common          # apache_common, apache_error, syslog, json
    interval: 0.1                   # Generate 10 logs per second
    count: 1000                     # Stop after 1000 logs (optional)

Source Configuration

Common Options

Most sources support a common set of options:
sources:
  my_source:
    type: <source_type>
    
    # Optional: Acknowledgements for delivery guarantees
    acknowledgements:
      enabled: true                 # Wait for sink confirmation
    
    # Optional: Decoding configuration
    decoding:
      codec: json                   # json, bytes, gelf, native, etc.
    
    # Optional: Framing for network sources
    framing:
      method: newline_delimited     # newline_delimited, character_delimited, length_delimited

Acknowledgements

Sources can wait for downstream confirmation before acknowledging data:
sources:
  critical_logs:
    type: kafka
    bootstrap_servers: localhost:9092
    topics:
      - critical-app-logs
    acknowledgements:
      enabled: true                 # Don't commit Kafka offsets until sinks confirm

sinks:
  elasticsearch:
    type: elasticsearch
    inputs:
      - critical_logs
    acknowledgements:
      enabled: true                 # Confirm when data is written
Acknowledgements add latency but provide stronger delivery guarantees. Only enable for critical data paths where loss is unacceptable.

Multi-line Log Handling

Many sources support multi-line log aggregation:
sources:
  java_logs:
    type: file
    include:
      - /var/log/myapp/*.log
    multiline:
      start_pattern: '^\[\d{4}-\d{2}-\d{2}'   # New events start with a timestamp
      condition_pattern: '^\s'                 # Indented continuation lines (e.g. stack traces)
      mode: continue_through
      timeout_ms: 1000                         # Flush incomplete aggregates after 1s

Data Types

Sources produce specific event types:
Source                             Event Type   Notes
file, socket, syslog               Logs         String data with optional parsing
host_metrics, prometheus_scrape    Metrics      Counters, gauges, histograms
datadog_agent (traces)             Traces       Distributed tracing spans
internal_metrics                   Metrics      Vector's operational metrics
demo_logs                          Logs         Synthetic log generation

Backpressure Handling

Sources respect backpressure from downstream components:

How It Works

  1. When transforms or sinks are slow, their input buffers fill
  2. Backpressure propagates to the source’s output channel
  3. The source pauses reading new data (file positions maintained)
  4. For network sources, TCP windows shrink or connections queue
  5. When buffers drain, sources resume automatically
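The steps above depend on downstream buffers filling rather than dropping. A hedged sketch of a sink buffer configured to block, which is what propagates backpressure to the source (sink name and input are illustrative):

```yaml
sinks:
  slow_sink:
    type: elasticsearch
    inputs: [app_logs]
    buffer:
      type: memory
      max_events: 10000
      when_full: block     # Propagate backpressure upstream instead of dropping events
```

With `when_full: drop_newest` instead, the buffer sheds load and the source never pauses; choose per data path.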

File Sources

File sources checkpoint their position:
  • Current read position is saved regularly
  • On restart, reading resumes from the last checkpoint
  • No data is lost during restarts or crashes
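Checkpoints are persisted under Vector's data directory, which is set as a global option (the path here is illustrative):

```yaml
# Top level of vector.yaml (not inside a source)
data_dir: /var/lib/vector   # File source checkpoints are stored here

sources:
  app_logs:
    type: file
    include:
      - /var/log/app/*.log
```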

Network Sources

Network sources handle backpressure via:
  • TCP: Send window shrinks, clients slow down naturally
  • UDP: Packets may be dropped (UDP is inherently lossy)
  • HTTP: Requests queue; configure appropriate timeouts

Performance Considerations

File Source Optimization

sources:
  high_throughput_logs:
    type: file
    include:
      - /var/log/high-volume/*.log
    
    # Increase read buffer for throughput
    max_read_bytes: 1048576         # Read 1MB at a time (default: 2KB)
    
    # Store checkpoints on fast local storage
    data_dir: /var/lib/vector
    
    # Use faster fingerprinting
    fingerprint:
      strategy: device_and_inode    # Faster than checksum

Network Source Tuning

sources:
  high_volume_syslog:
    type: syslog
    address: 0.0.0.0:514
    mode: tcp
    
    # Increase connection limits
    max_length: 1048576             # 1MB max message size
    receive_buffer_bytes: 65536     # 64KB socket buffer
    
    # TCP keepalive
    keepalive:
      time_secs: 60

Best Practices

  • Use file for application logs written to disk
  • Use syslog for centralized syslog collection
  • Use http_server for application-direct log shipping
  • Use cloud-native sources (S3, Pub/Sub) for cloud logs
  • Use host_metrics for infrastructure monitoring
Set reasonable timeouts for network sources:
sources:
  http_logs:
    type: http_server
    keepalive:
      max_connection_age_secs: 300  # Recycle long-lived connections
Enable acknowledgements only for critical data:
  • Financial transactions
  • Security audit logs
  • Compliance-required data
Disable for high-volume, best-effort data:
  • Debug logs
  • Sampling metrics
  • Non-critical telemetry
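For such best-effort paths, acknowledgements can be disabled explicitly (source name and path are illustrative):

```yaml
sources:
  debug_logs:
    type: file
    include:
      - /var/log/app/debug/*.log
    acknowledgements:
      enabled: false    # Best-effort: don't wait for sink confirmation
```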
Track source metrics:
sources:
  vector_metrics:
    type: internal_metrics

transforms:
  filter_source_metrics:
    type: filter
    inputs:
      - vector_metrics
    condition: |
      starts_with!(.name, "component_received_events_total") ||
      starts_with!(.name, "component_errors_total")
Configure multi-line aggregation at the source (not in transforms):
  • More efficient (less data movement)
  • Preserves log context
  • Reduces downstream processing

Troubleshooting

Source Not Receiving Data

  1. Check connectivity: Verify network/file access
    # For network sources
    telnet localhost 8080
    
    # For file sources
    ls -la /var/log/myapp/
    
  2. Verify configuration: Use Vector’s validate command
    vector validate --config /etc/vector/vector.yaml
    
  3. Check permissions: Ensure Vector can read files/bind ports
    # File permissions
    sudo -u vector cat /var/log/myapp/app.log
    
    # Port binding (< 1024 requires privileges)
    sudo setcap 'cap_net_bind_service=+ep' /usr/bin/vector
    
  4. Monitor metrics: Check internal metrics for errors
    curl http://localhost:8686/metrics | grep component_errors
    

High Memory Usage

  • Reduce max_read_bytes for file sources
  • Decrease receive_buffer_bytes for network sources
  • Add buffering between source and slow sinks
  • Enable sampling for high-volume sources
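A sketch combining the first two tunings for a file source (values are illustrative starting points, not recommendations):

```yaml
sources:
  big_files:
    type: file
    include:
      - /var/log/huge/*.log
    max_read_bytes: 16384     # Smaller read chunks cap per-read memory
    max_line_bytes: 32768     # Truncate pathological lines instead of buffering them
```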

File Source Missing Data

  • Check ignore_older_secs setting
  • Verify read_from is set to beginning if historical data is needed
  • Ensure file patterns match correctly
  • Check Vector’s data directory for checkpoint corruption
