Overview
DLQs store failed tasks for:
- Replication Tasks - Cross-cluster replication failures
- History Tasks - Transfer, timer, visibility, archival task failures
Tasks are moved to a DLQ when:
- Processing fails repeatedly
- A permanent error is detected (e.g., corrupted data)
- The target namespace has been deleted
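The conditions listed above (repeated failure, a permanent error, a deleted target namespace) can be read as a routing decision made each time a task fails. The following is a minimal Python sketch of that decision, not the server's actual implementation. `TaskFailure`, `route`, and `MAX_ATTEMPTS` are hypothetical names, and the retry limit of 5 is an illustrative assumption.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Outcome(Enum):
    RETRY = auto()
    SEND_TO_DLQ = auto()


@dataclass
class TaskFailure:
    attempts: int            # how many times processing has failed so far
    permanent_error: bool    # e.g. corrupted task data
    namespace_exists: bool   # False once the target namespace is deleted


# Hypothetical retry limit; the real limit is a server setting not given here.
MAX_ATTEMPTS = 5


def route(failure: TaskFailure, max_attempts: int = MAX_ATTEMPTS) -> Outcome:
    """Return SEND_TO_DLQ if any of the three DLQ conditions holds."""
    if failure.permanent_error:
        return Outcome.SEND_TO_DLQ   # retrying cannot fix corrupted data
    if not failure.namespace_exists:
        return Outcome.SEND_TO_DLQ   # nothing left to apply the task to
    if failure.attempts >= max_attempts:
        return Outcome.SEND_TO_DLQ   # processing has failed repeatedly
    return Outcome.RETRY
```

Checking the permanent conditions first means a task with corrupted data is diverted on its first failure instead of consuming the full retry budget.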
DLQ Types
Replication DLQ
Stores failed cross-cluster replication tasks:
- Namespace replication events
- Workflow history replication
- Task queue replication
History Task DLQ
Stores failed internal task processing:
- Transfer tasks (cross-workflow operations)
- Timer tasks (scheduled operations)
- Visibility tasks (search indexing)
- Archival tasks (long-term storage)
Monitoring DLQs
DLQ Metrics
dlq_message_count
Type: Gauge
Description: Number of messages in DLQ by task category
Tags: task_category
Update Frequency: Every 3 hours from shard 1 owner
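Because the gauge is refreshed only every 3 hours, an alert rule should distinguish a quiet DLQ from a gauge that has stopped updating. The Python sketch below shows one way to classify a sample. `GaugeSample`, `evaluate`, the threshold, and the choice to tolerate one missed update are all assumptions made for illustration, not part of any metrics API.

```python
from datetime import datetime, timedelta, timezone
from typing import NamedTuple, Optional

UPDATE_INTERVAL = timedelta(hours=3)   # update frequency stated for the metric
STALE_AFTER = 2 * UPDATE_INTERVAL      # assumption: tolerate one missed update


class GaugeSample(NamedTuple):
    task_category: str      # value of the task_category tag
    value: int              # dlq_message_count reading
    observed_at: datetime   # when the reading was emitted (timezone-aware)


def evaluate(sample: GaugeSample, threshold: int,
             now: Optional[datetime] = None) -> str:
    """Classify a sample as 'stale', 'alert', or 'ok'."""
    now = now or datetime.now(timezone.utc)
    if now - sample.observed_at > STALE_AFTER:
        return "stale"      # gauge stopped updating; the value cannot be trusted
    if sample.value > threshold:
        return "alert"      # more messages than the operator tolerates
    return "ok"
```

A "stale" result is worth its own alert, since the metric is emitted by a single shard owner and can go silent while the DLQ keeps growing.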
Persistence Metrics
Task Categories
Inspecting DLQ
Replication DLQ
List DLQ Messages
Get DLQ Size
History Task DLQ
List Messages by Shard and Category
Count Messages
Recovering from DLQ
Replication DLQ
Merge Single Task
Reprocess one failed task:
Merge All Tasks
Reprocess all DLQ messages:
Merge with Filtering
History Task DLQ
Merge Tasks by Category
Merge Specific Task
Merge Range of Tasks
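When a range is large, it can be merged in smaller sub-ranges so that a failure stops the run early and leaves a clear resume point. The Python sketch below illustrates that batching logic only. `merge_range` is a hypothetical callable standing in for whatever command or API call performs the merge, and the inclusive message ID bounds are an assumption.

```python
from typing import Callable, Iterator, Tuple


def batch_ranges(first_id: int, last_id: int,
                 batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield inclusive (start, end) sub-ranges covering first_id..last_id."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    start = first_id
    while start <= last_id:
        end = min(start + batch_size - 1, last_id)
        yield start, end
        start = end + 1


def merge_in_batches(first_id: int, last_id: int, batch_size: int,
                     merge_range: Callable[[int, int], bool]) -> int:
    """Merge batch by batch and stop at the first batch that fails.

    Returns the last message ID that was merged, or first_id - 1 if the
    very first batch failed.
    """
    merged_up_to = first_id - 1
    for start, end in batch_ranges(first_id, last_id, batch_size):
        if not merge_range(start, end):   # hypothetical merge operation
            break
        merged_up_to = end
    return merged_up_to
```

The return value doubles as a checkpoint: a later run can resume from `merged_up_to + 1` once the cause of the failure has been fixed.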
Purging DLQ
Warning: Purging permanently deletes tasks. Only use if tasks cannot be recovered or are no longer needed.
Replication DLQ
Delete Single Task
Delete All Tasks
History Task DLQ
Purge by Category
Purge Range
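Since a purge cannot be undone, it helps to compute the purge bound from inspected tasks instead of choosing it by hand. The Python sketch below assumes that a range purge removes every task up to a chosen message ID (the exact range semantics are not stated here) and that each inspected task has already been marked recoverable or not. `safe_purge_upper_bound` is a hypothetical helper.

```python
from typing import Iterable, Optional, Tuple


def safe_purge_upper_bound(tasks: Iterable[Tuple[int, bool]]) -> Optional[int]:
    """Find the highest message ID that is safe to purge up to.

    `tasks` holds (message_id, recoverable) pairs from a prior inspection.
    The result is the largest ID such that every inspected task with an ID
    at or below it is unrecoverable, or None if the lowest-ID task is still
    recoverable (or there are no tasks).
    """
    bound = None
    for message_id, recoverable in sorted(tasks):
        if recoverable:
            break             # purging past this point would lose a usable task
        bound = message_id
    return bound
```

Tasks above the returned bound are candidates for merging first; the bound can be recomputed after they leave the queue.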
Common DLQ Scenarios
Scenario 1: Namespace Deleted on Target Cluster
Cause: Replication tasks for non-existent namespace
Solution:
- Recreate Namespace
- Purge Tasks
Scenario 2: Corrupted Task Data
Cause: Data corruption or schema mismatch
Solution:
Scenario 3: Transient Target Cluster Outage
Cause: Target cluster was unavailable during replication
Solution:
Scenario 4: High Volume of Transfer Tasks in DLQ
Cause: Downstream service failures or rate limiting
Solution:
Scenario 5: Visibility Task Failures
Cause: Elasticsearch indexing errors
Solution:
Automation
Automated DLQ Monitoring
Create monitoring script:
Automated DLQ Merge
Auto-merge after transient failures:
Dynamic Configuration
Tune DLQ behavior:
Best Practices
1. Monitor DLQ Size
Set up alerts:
2. Investigate Before Purging
Always inspect tasks before deletion:
3. Fix Root Cause
DLQ is a symptom, not the problem:
- Check target cluster health
- Verify namespace configuration
- Review error logs
- Check resource availability
4. Merge in Batches
For large DLQs, process incrementally:
5. Regular DLQ Audits
Schedule weekly reviews:
Troubleshooting
DLQ Metrics Not Updating
Cause: Metrics emitted only by shard 1 owner
Solution:
Cannot Read DLQ
Cause: Insufficient permissions or wrong cluster
Solution:
Merge Fails
Cause: Tasks still failing for same reason
Solution:
See Also
- Monitoring - DLQ metrics
- Persistence - Queue operations
- Archival - Archival task failures