Transforms are Vector components that process events as they flow through your pipeline. They can parse, filter, enrich, aggregate, and route data between sources and sinks. Transforms are the core of Vector’s data processing capabilities.

How Transforms Work

Transforms sit between sources and sinks in Vector's topology. Each transform:
  1. Receives events from one or more inputs (sources or other transforms)
  2. Processes events according to its configuration
  3. Emits results to one or more outputs (other transforms or sinks)
  4. Handles backpressure from downstream components
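A minimal pipeline illustrating this flow (component names and paths are hypothetical):

```yaml
sources:
  app_logs:
    type: file
    include:
      - /var/log/app/*.log

transforms:
  drop_debug:
    type: filter
    inputs:
      - app_logs                      # 1. receive events from a source
    condition: '.level != "debug"'    # 2. process according to configuration

sinks:
  console_out:
    type: console
    inputs:
      - drop_debug                    # 3. emit results to a sink
    encoding:
      codec: json
```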

Transform Types

Parsing

Extract structured data from raw text

Filtering

Select or discard events based on conditions

Routing

Send events to different destinations

Enrichment

Add context and metadata

Aggregation

Combine multiple events

Conversion

Change event types (logs ↔ metrics)
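These categories compose freely. A sketch chaining filtering, parsing, and conversion (component and field names are hypothetical):

```yaml
transforms:
  keep_requests:            # Filtering
    type: filter
    inputs:
      - raw_logs
    condition: '.message != null'

  parse_requests:           # Parsing
    type: remap
    inputs:
      - keep_requests
    source: |
      . = parse_json!(.message)

  request_counts:           # Conversion (logs → metrics)
    type: log_to_metric
    inputs:
      - parse_requests
    metrics:
      - type: counter
        field: status
        name: requests_total
```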

Transform Categories

Parsing and Structuring

remap is the most powerful and commonly used transform. It uses Vector Remap Language (VRL) for complex event manipulation.
transforms:
  parse_logs:
    type: remap
    inputs:
      - raw_logs
    source: |
      # Parse JSON log message
      . = parse_json!(.message)
      
      # Extract timestamp
      .timestamp = parse_timestamp!(.time, "%Y-%m-%d %H:%M:%S")
      
      # Add environment tag
      .environment = "production"
      
      # Parse user agent
      .user_agent = parse_user_agent!(.user_agent_string)
      
      # Remove sensitive data
      del(.password)
      del(.api_key)
Use cases: JSON parsing, log parsing, field extraction, data transformation, enrichment
The regex_parser transform extracts structured fields using regular expressions. It is less powerful than remap but simpler for basic parsing; newer Vector releases fold this functionality into remap's parse_regex function.
transforms:
  parse_apache:
    type: regex_parser
    inputs:
      - apache_logs
    regex: '^(?P<ip>\S+) \S+ (?P<user>\S+) \[(?P<timestamp>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d+) (?P<bytes>\d+)$'
    field: message
The grok_parser transform applies Grok patterns (Logstash-compatible) to parse common log formats; newer Vector releases fold this into remap's parse_grok function.
transforms:
  grok_parse:
    type: grok_parser
    inputs:
      - logs
    pattern: '%{COMMONAPACHELOG}'

Filtering and Sampling

The filter transform keeps or discards events based on conditions.
transforms:
  # Keep only errors
  errors_only:
    type: filter
    inputs:
      - parsed_logs
    condition: '.level == "error" || .status >= 400'
  
  # Drop health check logs
  no_healthchecks:
    type: filter
    inputs:
      - parsed_logs
    condition: '.path != "/health" && .path != "/ready"'
Use cases: Reducing data volume, removing noise, isolating specific events
The sample transform keeps only a fraction of events, useful for high-volume data.
transforms:
  sample_debug_logs:
    type: sample
    inputs:
      - debug_logs
    rate: 10              # Keep 1 in every 10 events
    key_field: request_id # Sample by request (optional)
Use cases: Cost reduction, load testing, debugging high-traffic endpoints
The dedupe transform removes duplicate events based on field values.
transforms:
  remove_dupes:
    type: dedupe
    inputs:
      - logs
    fields:
      match:
        - request_id
        - timestamp
    cache:
      num_events: 10000   # Remember last 10k events

Routing and Distribution

The route transform sends events to named outputs based on conditions.
transforms:
  route_by_severity:
    type: route
    inputs:
      - logs
    route:
      critical: '.level == "critical" || .level == "fatal"'
      errors: '.level == "error"'
      warnings: '.level == "warning"'
      info: '.level == "info"'
      # _unmatched: Everything else

sinks:
  pagerduty_alerts:
    type: http
    inputs:
      - route_by_severity.critical
    uri: https://events.pagerduty.com/v2/enqueue
  
  elasticsearch_errors:
    type: elasticsearch
    inputs:
      - route_by_severity.errors
      - route_by_severity.warnings
  
  s3_archive:
    type: aws_s3
    inputs:
      - route_by_severity._unmatched  # Everything else
The legacy swimlanes transform (since renamed route) creates parallel processing paths for different event types; prefer route in new configurations.
transforms:
  split_by_type:
    type: swimlanes
    inputs:
      - logs
    lanes:
      application:
        type: check_fields
        "source_type.eq": "application"
      system:
        type: check_fields
        "source_type.eq": "system"

Enrichment and Context

Enrichment tables add external data (CSV files, GeoIP databases) to events through VRL's enrichment functions.
enrichment_tables:
  geoip:
    type: geoip
    path: /usr/share/GeoIP/GeoLite2-City.mmdb
  
  user_data:
    type: file
    file:
      path: /etc/vector/users.csv
      encoding:
        type: csv
    schema:
      user_id: integer
      name: string
      department: string

transforms:
  enrich_logs:
    type: remap
    inputs:
      - logs
    source: |
      # Add GeoIP data
      .geo = get_enrichment_table_record!("geoip", {
        "ip": .ip_address
      })
      
      # Add user information
      .user = get_enrichment_table_record!("user_data", {
        "user_id": .user_id
      })
The lua transform lets you write custom transformation logic in Lua.
transforms:
  custom_logic:
    type: lua
    inputs:
      - logs
    version: "2"
    hooks:
      process: |
        function process(event, emit)
          -- Custom Lua logic
          event.log.processed = true
          event.log.custom_field = calculate_something(event.log.value)
          emit(event)
        end
Note: VRL (via remap) is preferred over Lua for better performance and type safety.

Aggregation and Reduction

The reduce transform combines multiple events into a single aggregate event.
transforms:
  merge_by_request:
    type: reduce
    inputs:
      - logs
    group_by:
      - request_id
    merge_strategies:
      timestamp: min       # Keep earliest timestamp
      status: max          # Keep highest status code
      duration: sum        # Sum all durations
      messages: array      # Collect all messages
    ends_when: '.status != null && .final == true'
    expire_after_ms: 30000  # Flush after 30s
Use cases: Combining fragmented logs, request/response pairing, transaction assembly
The aggregate transform combines multiple metric events over a time window, reducing metric volume. (To turn logs into metrics first, use log_to_metric, below.)
transforms:
  aggregate_metrics:
    type: aggregate
    inputs:
      - app_metrics
    interval_ms: 60000    # Flush aggregated metrics every minute

Type Conversion

The log_to_metric transform converts log events into metric events.
transforms:
  extract_metrics:
    type: log_to_metric
    inputs:
      - access_logs
    metrics:
      - type: counter
        field: request_count
        name: http_requests_total
        namespace: app
        tags:
          method: "{{ method }}"
          status: "{{ status }}"
      
      - type: histogram
        field: duration_ms
        name: http_request_duration_milliseconds
        namespace: app
The metric_to_log transform converts metric events into log events.
transforms:
  metrics_as_logs:
    type: metric_to_log
    inputs:
      - host_metrics
    host_tag: host
    timezone: local

Specialized Transforms

throttle

Rate limit events to prevent overwhelming downstream systems

tag_cardinality_limit

Prevent cardinality explosion in metric tags
The throttle transform controls event throughput to avoid overwhelming downstream systems.
transforms:
  rate_limit:
    type: throttle
    inputs:
      - high_volume_logs
    threshold: 1000       # Max events per window
    window_secs: 1        # Per second
    key_field: client_id  # Rate limit per client
The tag_cardinality_limit transform prevents cardinality explosion by limiting the number of unique values per tag.
transforms:
  protect_metrics:
    type: tag_cardinality_limit
    inputs:
      - app_metrics
    limit_exceeded_action: drop_tag
    mode: probabilistic
    value_limit: 1000     # Max unique values per tag

Transform Behavior

Synchronous vs. Asynchronous

Synchronous transforms (e.g., filter, remap):
  • Process events immediately
  • Maintain event order
  • Support concurrent processing for performance
  • Most common type
Asynchronous transforms (e.g., reduce, aggregate):
  • Process events over time windows
  • May reorder events
  • Require internal state management
  • Used for aggregation and stateful operations

Multiple Outputs

Some transforms support multiple named outputs:
transforms:
  route_logs:
    type: route
    inputs:
      - logs
    route:
      errors: '.level == "error"'
      warnings: '.level == "warning"'
      # _unmatched: implicit output for unmatched events

# Reference specific outputs
sinks:
  error_sink:
    inputs:
      - route_logs.errors
  
  warning_sink:
    inputs:
      - route_logs.warnings
  
  other_sink:
    inputs:
      - route_logs._unmatched

Vector Remap Language (VRL)

VRL is Vector’s purpose-built language for event transformation. It’s the recommended way to process events.

VRL Features

  • Type-safe: Compile-time type checking prevents runtime errors
  • Fast: Compiled to efficient bytecode
  • Ergonomic: Designed specifically for event processing
  • Fail-safe: fallible operations must be handled explicitly, either by aborting with ! or by capturing the error

Common VRL Patterns

transforms:
  parse:
    type: remap
    source: |
      . = parse_json!(.message)
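A few more everyday patterns, sketched in a single remap (field names are hypothetical):

```yaml
transforms:
  normalize:
    type: remap
    source: |
      # Coerce a type, with a fallback if coercion fails
      .status = to_int(.status) ?? 0

      # Normalize casing
      .level = downcase(string!(.level))

      # Default a missing field
      .region = .region ?? "unknown"

      # Drop a temporary field
      del(.tmp)
```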

VRL Error Handling

transforms:
  safe_parsing:
    type: remap
    source: |
      # Capture the error instead of aborting with !
      parsed, err = parse_json(.message)
      if err != null {
        .parse_error = err
        .parsed = false
      } else {
        . = parsed
        .parsed = true
      }
Test VRL expressions interactively with the vector vrl REPL:
vector vrl

Performance Optimization

Transform Ordering

Order transforms to minimize processing:
# Good: Filter early, process less data
transforms:
  1_filter:
    type: filter
    inputs: [logs]
    condition: '.level != "debug"'
  
  2_parse:
    type: remap
    inputs: [1_filter]
    source: |
      . = parse_json!(.message)  # Only parse filtered events

# Bad: Parse everything, then filter
transforms:
  1_parse:
    type: remap
    inputs: [logs]
    source: |
      . = parse_json!(.message)  # Parse all events
  
  2_filter:
    type: filter
    inputs: [1_parse]
    condition: '.level != "debug"'  # Filter after expensive parsing

Concurrent Processing

Vector automatically enables concurrency for eligible transforms. To maximize performance:
  • Use remap over lua (VRL is faster and supports better concurrency)
  • Avoid stateful operations when possible
  • Use route to split traffic before expensive operations
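For example, route can split traffic so only one branch pays for expensive parsing (names are hypothetical):

```yaml
transforms:
  split:
    type: route
    inputs:
      - logs
    route:
      is_json: 'starts_with(string!(.message), "{")'

  parse_json_only:
    type: remap
    inputs:
      - split.is_json       # expensive parsing runs on this branch only
    source: |
      . = parse_json!(.message)
```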

Memory Management

transforms:
  # Reduce memory in aggregating transforms
  aggregate:
    type: reduce
    expire_after_ms: 5000  # Flush state frequently
  
  # Limit cache sizes
  dedupe:
    type: dedupe
    cache:
      num_events: 5000     # Smaller cache = less memory

Best Practices

  • Parse and structure data as early as possible
  • Filter out unnecessary data before expensive operations
  • Route to different destinations at the end of processing
Prefer remap over lua where possible. VRL is faster, safer, and better integrated:
  • Type safety prevents runtime errors
  • Better performance through compilation
  • First-class support for Vector data types
  • Interactive REPL for testing
Handle VRL errors gracefully rather than silently dropping events:
transforms:
  safe_transform:
    type: remap
    drop_on_error: false  # Keep events even if VRL fails
    source: |
      parsed, err = parse_json(.message)
      if err == null {
        . = parsed
      }
Use unit tests for transform logic:
tests:
  - name: parse_apache_logs
    inputs:
      - insert_at: parse_apache
        type: raw
        value: '127.0.0.1 - frank [01/Jan/2024:00:00:00 +0000] "GET /api HTTP/1.1" 200 1234'
    outputs:
      - extract_from: parse_apache
        conditions:
          - type: vrl
            source: '.ip == "127.0.0.1" && .status == "200"'
Watch internal metrics for bottlenecks:
  • component_received_events_total
  • component_sent_events_total
  • component_errors_total
  • component_execution_time_seconds
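These counters come from Vector's internal_metrics source; one way to watch them is to export them to Prometheus (the address shown is an assumption):

```yaml
sources:
  vector_metrics:
    type: internal_metrics

sinks:
  prometheus:
    type: prometheus_exporter
    inputs:
      - vector_metrics
    address: "0.0.0.0:9598"
```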

Troubleshooting

Events Not Flowing

  1. Check transform condition logic
  2. Verify input references are correct
  3. Look for errors in VRL compilation
  4. Enable debug logging: VECTOR_LOG=debug

High Memory Usage

  • Reduce cache sizes in dedupe
  • Decrease expiration times in reduce and aggregate
  • Add sample transforms for high-volume data
  • Filter earlier in the pipeline

VRL Errors

Use the VRL REPL to debug:
echo '{"message": "{\"level\": \"info\"}"}' | vector vrl '. = parse_json!(.message)'
