Understanding Delivery Guarantees
Vector offers three levels of delivery guarantees:

At-Most-Once Delivery
Events are sent once without confirmation. If delivery fails, the event is lost.

Use case: Non-critical telemetry where some data loss is acceptable

Characteristics:
- Highest throughput
- Lowest latency
- No delivery confirmation
- Possible data loss
At-Least-Once Delivery
Events are retried until acknowledged. The same event may be delivered multiple times.

Use case: Most observability data where duplicates can be handled downstream

Characteristics:
- Strong delivery guarantee
- Possible duplicates
- Automatic retries
- Higher resource usage
Exactly-Once Delivery
Events are delivered exactly once with no duplicates.

Use case: Financial transactions, billing data, critical metrics

Characteristics:
- Strongest guarantee
- No duplicates
- Highest overhead
- Requires downstream deduplication support
Component Delivery Guarantees
Different Vector components provide different guarantees:

| Component Type | Default Guarantee | Configurable |
|---|---|---|
| File source | At-least-once | No |
| Syslog source | At-most-once | No |
| HTTP source | At-least-once | Yes |
| Kafka source | At-least-once | No |
| S3 sink | At-least-once | No |
| Elasticsearch sink | At-least-once | No |
| HTTP sink | At-least-once | Yes |
| Kafka sink | At-least-once | No |
Acknowledgments System
Vector’s acknowledgment system ensures data is not dropped during processing.

How Acknowledgments Work
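At a high level, a sink confirms each batch it successfully writes, the acknowledgment propagates back through the topology, and only then does the source commit its position. A minimal sketch of a pipeline wired this way (component names are illustrative; option names follow Vector's TOML configuration):

```toml
# Hypothetical pipeline: the file source checkpoints a line's offset
# only after the sink has acknowledged delivery of that line.
[sources.app_logs]
type    = "file"
include = ["/var/log/app/*.log"]

[sinks.archive]
type   = "http"
inputs = ["app_logs"]
uri    = "https://logs.example.com/ingest"
encoding.codec = "json"
# Ask upstream components to wait for this sink's delivery confirmation.
acknowledgements.enabled = true
```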
Configuring Acknowledgments
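Acknowledgments can be enabled globally or per sink. A sketch (option names from Vector's TOML configuration; component names are illustrative):

```toml
# Global default: every source that supports acknowledgements waits
# for downstream sinks to confirm delivery before acknowledging receipt.
[acknowledgements]
enabled = true

# Or enable it only on the sinks whose data is critical:
[sinks.billing_events]
type              = "kafka"
inputs            = ["billing"]
bootstrap_servers = "kafka-1:9092"
topic             = "billing"
encoding.codec    = "json"
acknowledgements.enabled = true
```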
End-to-End Acknowledgments
For multi-hop pipelines, enable end-to-end acknowledgments so a source does not confirm receipt until every downstream sink has accepted the data.

Buffering Strategies
Buffers are crucial for reliability, providing temporary storage during downstream failures.

Memory Buffers
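A memory buffer is configured through the `buffer` table on a sink. A sketch (sink name and endpoint are illustrative):

```toml
[sinks.es_out]
type      = "elasticsearch"
inputs    = ["app_logs"]
endpoints = ["http://elasticsearch:9200"]

[sinks.es_out.buffer]
type       = "memory"
max_events = 10000      # events held in RAM before the buffer is full
when_full  = "block"    # apply backpressure rather than drop events
```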
Memory buffers are best for performance when durability across restarts isn’t required.

Disk Buffers
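Disk buffers persist queued events so they survive a restart; the trade-off is disk I/O on every event. A sketch (sink name and bucket are illustrative; options follow Vector's buffer configuration):

```toml
[sinks.s3_archive]
type           = "aws_s3"
inputs         = ["app_logs"]
bucket         = "example-log-archive"
region         = "us-east-1"
encoding.codec = "json"

[sinks.s3_archive.buffer]
type      = "disk"
max_size  = 1073741824   # 1 GiB on disk; contents survive restarts
when_full = "block"
```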
Disk buffers provide durability across Vector restarts.

Choose buffer location
Use fast storage (SSD) for buffer directories to minimize performance impact. By default, Vector stores buffer data under its data directory (typically /var/lib/vector). Specify a custom location:
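Assuming Vector's top-level `data_dir` option, which controls where disk buffers and other state are kept:

```toml
# /etc/vector/vector.toml
data_dir = "/mnt/fast-ssd/vector"   # put disk buffers on fast local SSD
```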
Size appropriately
Calculate buffer size based on:
- Expected downtime duration
- Average event throughput
- Available disk space
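As a worked example with assumed numbers: surviving a 2-hour outage at 5,000 events/s with ~500-byte events requires roughly 2 × 3,600 × 5,000 × 500 = 18,000,000,000 bytes (18 GB) of buffer:

```toml
[sinks.s3_archive.buffer]
type      = "disk"
# 2 h outage × 5,000 events/s × 500 B/event = 18,000,000,000 bytes
max_size  = 18000000000
when_full = "block"
```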
Retry Configuration
Configure retry behavior for transient failures.

Exponential Backoff
Vector automatically applies exponential backoff between retries.

Customizing Retry Logic
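Retry behavior is tuned per sink through its `request` options. A sketch (sink name and endpoint are illustrative; option names assume Vector's sink `request` configuration):

```toml
[sinks.archive]
type           = "http"
inputs         = ["app_logs"]
uri            = "https://logs.example.com/ingest"
encoding.codec = "json"

[sinks.archive.request]
retry_attempts             = 10    # give up after 10 tries
retry_initial_backoff_secs = 1     # first retry waits 1 s
retry_max_duration_secs    = 300   # cap any single backoff at 5 min
timeout_secs               = 60    # per-request timeout
```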
Health Checks
Health checks verify sink availability before sending data.

Custom Health Checks
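Health checks are toggled per sink; the usual customization is disabling the built-in check when an endpoint exposes no health route. A sketch:

```toml
[sinks.archive]
type           = "http"
inputs         = ["app_logs"]
uri            = "https://logs.example.com/ingest"
encoding.codec = "json"
# Skip the startup health check if the endpoint has no health route.
healthcheck.enabled = false
```

Vector can also be launched with the `--require-healthy` flag to refuse startup when any enabled health check fails.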
Dead Letter Queues
Handle events that repeatedly fail processing.

Dead Letter Queue Pattern
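One way to build a dead letter queue in Vector is the `remap` transform's dropped-events output: with `drop_on_error` and `reroute_dropped` set, events that fail processing are emitted on a separate `<transform_id>.dropped` output that can feed its own sink. A sketch (component names are illustrative):

```toml
[transforms.parse]
type            = "remap"
inputs          = ["app_logs"]
drop_on_error   = true
reroute_dropped = true
source = '''
. = parse_json!(string!(.message))
'''

# Healthy events continue down the main path...
[sinks.es_out]
type      = "elasticsearch"
inputs    = ["parse"]
endpoints = ["http://elasticsearch:9200"]

# ...while failures land in a file-based dead letter queue.
[sinks.dead_letters]
type           = "file"
inputs         = ["parse.dropped"]
path           = "/var/lib/vector/dlq/%Y-%m-%d.log"
encoding.codec = "json"
```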
Failure Handling Patterns
Pattern 1: Graceful Degradation
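One graceful-degradation approach is to shed the newest, least-critical events instead of stalling the whole pipeline, via the buffer's `when_full` behavior. A sketch (sink name and endpoint are illustrative):

```toml
[sinks.metrics_out]
type           = "http"
inputs         = ["telemetry"]
uri            = "https://metrics.example.com/ingest"
encoding.codec = "json"

[sinks.metrics_out.buffer]
type       = "memory"
max_events = 50000
when_full  = "drop_newest"  # shed load instead of backpressuring upstream
```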
Pattern 2: Circuit Breaker
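Vector has no literal circuit breaker, but its adaptive request concurrency behaves similarly: it ramps a sink's request concurrency down as the destination starts erroring or slowing, and back up as it recovers. A sketch:

```toml
[sinks.archive]
type           = "http"
inputs         = ["app_logs"]
uri            = "https://logs.example.com/ingest"
encoding.codec = "json"

[sinks.archive.request]
# Concurrency adapts to downstream health, easing load off a
# failing service much like a circuit breaker would.
concurrency = "adaptive"
```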
Pattern 3: Multi-Path Delivery
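Redundant delivery is a matter of listing the same inputs on more than one sink. A sketch with two destinations (names are illustrative):

```toml
[sinks.primary]
type      = "elasticsearch"
inputs    = ["app_logs"]
endpoints = ["http://es-primary:9200"]

[sinks.backup]
type           = "aws_s3"
inputs         = ["app_logs"]   # same inputs: every event goes to both
bucket         = "example-log-backup"
region         = "us-east-1"
encoding.codec = "json"
```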
Send to multiple destinations for redundancy.

Pattern 4: Sampling on Pressure
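Under sustained pressure, a `sample` transform can thin low-value traffic before it reaches a struggling sink. Note that Vector's `sample` transform applies a fixed rate rather than reacting to pressure automatically; here a hypothetical noisy stream is cut to 1 in 10 events:

```toml
[transforms.shed_load]
type   = "sample"
inputs = ["app_logs"]
rate   = 10            # keep 1 out of every 10 events

[sinks.archive]
type           = "http"
inputs         = ["shed_load"]
uri            = "https://logs.example.com/ingest"
encoding.codec = "json"
```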
Monitoring Reliability
Key Metrics
Track these metrics to ensure reliability:

- component_sent_events_total: Events successfully delivered
- component_sent_event_bytes_total: Bytes successfully delivered
- component_errors_total: Errors encountered
- component_discarded_events_total: Events dropped
- buffer_events: Current buffer size
- buffer_byte_size: Buffer memory/disk usage
- buffer_received_events_total: Events entering buffer
- buffer_sent_events_total: Events leaving buffer
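These metrics come from Vector's own `internal_metrics` source, which can be exposed for scraping through a `prometheus_exporter` sink. A sketch:

```toml
[sources.vector_metrics]
type = "internal_metrics"

[sinks.prometheus]
type    = "prometheus_exporter"
inputs  = ["vector_metrics"]
address = "0.0.0.0:9598"
```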
Alerting on Reliability Issues
High Availability Deployments
Active-Active Configuration
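One common active-active arrangement (assumed here): identically configured Vector instances consuming from the same Kafka consumer group, so partitions rebalance onto the survivors if an instance dies. The per-instance config sketch:

```toml
# Identical config deployed to vector-1 and vector-2; Kafka's consumer
# group protocol splits partitions between them and rebalances on failure.
[sources.events]
type              = "kafka"
bootstrap_servers = "kafka-1:9092,kafka-2:9092"
group_id          = "vector-consumers"
topics            = ["app-events"]
```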
Deploy multiple Vector instances for redundancy.

Load Balancing
Disaster Recovery
Backup and Restore
Recovery Testing
Regularly test failure scenarios.

Best Practices
- Enable acknowledgments: For critical data, always enable acknowledgments
- Use disk buffers: Protect against data loss during restarts
- Size buffers appropriately: Balance durability with resource constraints
- Monitor continuously: Track delivery metrics and alert on anomalies
- Test failure scenarios: Regularly verify failure handling works as expected
- Document guarantees: Clearly define delivery guarantees for each pipeline
- Plan for disasters: Have runbooks and recovery procedures ready
- Use dead letter queues: Isolate and investigate persistent failures
- Configure retries wisely: Balance retry attempts with downstream capacity
- Deploy redundantly: Use multiple instances for high availability
Troubleshooting Reliability Issues
Data Loss Investigation
Performance Degradation
- Symptom: Increasing buffer sizes
- Causes: Downstream slowness, insufficient concurrency, network issues
- Solutions: Increase concurrency, add more sinks, optimize transforms