Temporal Server uses Dead Letter Queues (DLQs) to handle tasks that fail processing after exhausting retry attempts. This prevents poison messages from blocking queue progress.

Overview

DLQs store failed tasks for:
  • Replication Tasks - Cross-cluster replication failures
  • History Tasks - Transfer, timer, visibility, archival task failures
Tasks move to DLQ when:
  1. Processing fails repeatedly
  2. Permanent error detected (e.g., corrupted data)
  3. Target namespace deleted

DLQ Types

Replication DLQ

Stores failed cross-cluster replication tasks:
  • Namespace replication events
  • Workflow history replication
  • Task queue replication
Location: Per source cluster, per target cluster

History Task DLQ

Stores failed internal task processing:
  • Transfer tasks (cross-workflow operations)
  • Timer tasks (scheduled operations)
  • Visibility tasks (search indexing)
  • Archival tasks (long-term storage)
Location: Per history shard, per task category

Monitoring DLQs

DLQ Metrics

dlq_message_count

Type: Gauge
Description: Number of messages in DLQ by task category
Tags: task_category
Update Frequency: Every 3 hours from shard 1 owner
# View DLQ messages by category
dlq_message_count

# Alert on non-zero DLQ count
dlq_message_count > 0
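To check the alert condition from a script (for example, a cron health check), a small helper can build the PromQL expression and pass it to Prometheus's HTTP query API. This is a sketch: the `dlq_alert_expr` helper and the Prometheus address are illustrative assumptions, not part of Temporal.

```shell
# Build the PromQL alert expression for one task category.
dlq_alert_expr() {
  printf 'dlq_message_count{task_category="%s"} > 0' "$1"
}

# Usage (assumes Prometheus is reachable at this placeholder address):
# curl -sG "http://prometheus:9090/api/v1/query" \
#   --data-urlencode "query=$(dlq_alert_expr transfer)"
```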

Persistence Metrics

PersistenceEnqueueMessageToDLQ         # Tasks moved to DLQ
PersistenceReadMessagesFromDLQ         # DLQ reads
PersistenceDeleteMessageFromDLQ        # Individual message deletion
PersistenceRangeDeleteMessagesFromDLQ  # Bulk deletion

Task Categories

transfer     # Transfer tasks (activities, child workflows, signals)
timer        # Timer tasks (timeouts, retries, scheduled events)
visibility   # Visibility updates (search indexing)
archival     # Archival tasks (history upload)

Inspecting DLQ

Replication DLQ

List DLQ Messages

tctl admin dlq read \
  --cluster source-cluster \
  --namespace my-namespace
Output:
[
  {
    "taskId": 12345,
    "taskType": "HistoryReplicationTask",
    "namespaceId": "abc-123",
    "workflowId": "my-workflow",
    "runId": "run-456",
    "firstEventId": 1,
    "nextEventId": 10,
    "version": 5
  }
]

Get DLQ Size

tctl admin dlq count \
  --cluster source-cluster \
  --namespace my-namespace

History Task DLQ

List Messages by Shard and Category

tctl admin queue dlq read \
  --shard-id 1 \
  --category transfer

Count Messages

tctl admin queue dlq count \
  --shard-id 1 \
  --category transfer
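Counting one shard at a time does not show the cluster-wide backlog. A loop can sum counts across all shards for one category; this sketch assumes the count command's output contains the count as its first integer and that the cluster has the default 4096 shards, so adjust both for your deployment.

```shell
# Sum history-task DLQ counts across shards for one category.
# Parsing is a best-effort assumption about tctl's output format.
dlq_total() {
  local category="$1" shard_count="${2:-4096}" total=0 n
  for shard in $(seq 1 "$shard_count"); do
    n=$(tctl admin queue dlq count \
      --shard-id "$shard" \
      --category "$category" 2>/dev/null | grep -oE '[0-9]+' | head -1)
    total=$((total + ${n:-0}))
  done
  echo "$total"
}

# dlq_total transfer 4096
```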

Recovering from DLQ

Replication DLQ

Merge Single Task

Reprocess one failed task:
tctl admin dlq merge \
  --cluster source-cluster \
  --namespace my-namespace \
  --task-id 12345

Merge All Tasks

Reprocess all DLQ messages:
tctl admin dlq merge \
  --cluster source-cluster \
  --namespace my-namespace
Note: Large DLQs may take time to process. Monitor progress with the count command.
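One way to follow a long-running merge is to poll the count on an interval. This sketch is bounded by an iteration limit so it always terminates; the cluster and namespace flags mirror the placeholder names used above.

```shell
# Poll the replication DLQ size while a merge runs in another terminal.
watch_dlq() {
  local iterations="$1" interval="${2:-30}"
  for _ in $(seq 1 "$iterations"); do
    tctl admin dlq count \
      --cluster source-cluster \
      --namespace my-namespace
    sleep "$interval"
  done
}

# watch_dlq 20 30   # check every 30s, up to 20 times
```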

Merge with Filtering

# Merge tasks for specific workflow
tctl admin dlq merge \
  --cluster source-cluster \
  --namespace my-namespace \
  --workflow-id my-workflow

History Task DLQ

Merge Tasks by Category

tctl admin queue dlq merge \
  --shard-id 1 \
  --category transfer

Merge Specific Task

tctl admin queue dlq merge \
  --shard-id 1 \
  --category transfer \
  --min-message-id 12345 \
  --max-message-id 12345

Merge Range of Tasks

tctl admin queue dlq merge \
  --shard-id 1 \
  --category transfer \
  --min-message-id 10000 \
  --max-message-id 20000
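For very large ranges, the single call above can be split into fixed-size batches so no one merge covers too many tasks. This sketch derives each batch's bounds and merges them in turn; the batch size is a tunable chosen here for illustration, not a tctl default.

```shell
# Merge a large message-ID range in fixed-size batches.
merge_in_batches() {
  local shard="$1" category="$2" min="$3" max="$4" step="${5:-1000}"
  local lo="$min" hi
  while [ "$lo" -le "$max" ]; do
    hi=$((lo + step - 1))
    if [ "$hi" -gt "$max" ]; then hi="$max"; fi
    tctl admin queue dlq merge \
      --shard-id "$shard" \
      --category "$category" \
      --min-message-id "$lo" \
      --max-message-id "$hi"
    lo=$((hi + 1))
  done
}

# merge_in_batches 1 transfer 10000 20000 1000
```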

Purging DLQ

Warning: Purging permanently deletes tasks. Only use if tasks cannot be recovered or are no longer needed.

Replication DLQ

Delete Single Task

tctl admin dlq purge \
  --cluster source-cluster \
  --namespace my-namespace \
  --task-id 12345

Delete All Tasks

tctl admin dlq purge \
  --cluster source-cluster \
  --namespace my-namespace

History Task DLQ

Purge by Category

tctl admin queue dlq purge \
  --shard-id 1 \
  --category transfer

Purge Range

tctl admin queue dlq purge \
  --shard-id 1 \
  --category transfer \
  --min-message-id 10000 \
  --max-message-id 20000

Common DLQ Scenarios

Scenario 1: Namespace Deleted on Target Cluster

Cause: Replication tasks for a non-existent namespace
Solution:
# Recreate namespace on target cluster
tctl --cluster target-cluster namespace register my-namespace \
  --clusters source-cluster target-cluster \
  --active-cluster source-cluster

# Merge DLQ tasks
tctl admin dlq merge \
  --cluster source-cluster \
  --namespace my-namespace

Scenario 2: Corrupted Task Data

Cause: Data corruption or schema mismatch
Solution:
# Read task details
tctl admin dlq read \
  --cluster source-cluster \
  --namespace my-namespace

# If corrupted, purge specific task
tctl admin dlq purge \
  --cluster source-cluster \
  --namespace my-namespace \
  --task-id 12345

# Investigate root cause in logs
kubectl logs -l app=temporal-history | grep "task-id=12345"

Scenario 3: Transient Target Cluster Outage

Cause: Target cluster was unavailable during replication
Solution:
# Wait for target cluster to recover
# Verify target cluster health
tctl --cluster target-cluster cluster health

# Merge all DLQ tasks
tctl admin dlq merge \
  --cluster source-cluster \
  --namespace my-namespace

Scenario 4: High Volume of Transfer Tasks in DLQ

Cause: Downstream service failures or rate limiting
Solution:
# Check for pattern in failed tasks
tctl admin queue dlq read \
  --shard-id 1 \
  --category transfer | jq '.[] | .taskType' | sort | uniq -c

# Fix underlying issue (e.g., scale matching service)

# Merge DLQ in batches
for i in {1..10}; do
  tctl admin queue dlq merge \
    --shard-id $i \
    --category transfer
done

Scenario 5: Visibility Task Failures

Cause: Elasticsearch indexing errors
Solution:
# Check Elasticsearch health
curl http://elasticsearch:9200/_cluster/health

# Recreate index if corrupted (deleting the index discards existing visibility data)
curl -X DELETE http://elasticsearch:9200/temporal_visibility_v1
temporal-elasticsearch-setup --version v1

# Merge visibility DLQ
for shard in {1..4096}; do
  tctl admin queue dlq merge \
    --shard-id $shard \
    --category visibility
done

Automation

Automated DLQ Monitoring

Create monitoring script:
#!/bin/bash
# dlq-monitor.sh

CLUSTER="source-cluster"
NAMESPACES=("production" "staging")

for ns in "${NAMESPACES[@]}"; do
  count=$(tctl admin dlq count --cluster "$CLUSTER" --namespace "$ns" 2>/dev/null | grep -oE '[0-9]+' | head -1)

  if [ "${count:-0}" -gt 0 ]; then
    echo "WARNING: DLQ for namespace $ns has $count messages"
    # Alert to PagerDuty, Slack, etc.
  fi
done
Run via cron:
*/15 * * * * /usr/local/bin/dlq-monitor.sh

Automated DLQ Merge

Auto-merge after transient failures:
#!/bin/bash
# dlq-auto-merge.sh

SHARD_COUNT=4096
CATEGORIES=("transfer" "timer" "visibility")
THRESHOLD=100  # Auto-merge if less than threshold

for shard in $(seq 1 $SHARD_COUNT); do
  for category in "${CATEGORIES[@]}"; do
    count=$(tctl admin queue dlq count \
      --shard-id "$shard" \
      --category "$category" 2>/dev/null | grep -oE '[0-9]+' | head -1)

    if [ "${count:-0}" -gt 0 ] && [ "${count:-0}" -lt "$THRESHOLD" ]; then
      echo "Auto-merging $count tasks from shard $shard category $category"
      tctl admin queue dlq merge \
        --shard-id $shard \
        --category $category
    fi
  done
done

Dynamic Configuration

Tune DLQ behavior:
# config/dynamicconfig/production.yaml

# Max retries before moving to DLQ
history.transferProcessorMaxRetryCount:
  - value: 100
    constraints: {}

history.timerProcessorMaxRetryCount:
  - value: 100
    constraints: {}

# DLQ message batch size
history.dlqMaxMessageCount:
  - value: 1000
    constraints: {}

# Enable DLQ metrics emission
worker.dlqMetricsEmitterEnabled:
  - value: true
    constraints: {}

Best Practices

1. Monitor DLQ Size

Set up alerts:
dlq_message_count > 0

2. Investigate Before Purging

Always inspect tasks before deletion:
tctl admin dlq read --cluster source-cluster --namespace my-namespace | less

3. Fix Root Cause

DLQ is a symptom, not the problem:
  • Check target cluster health
  • Verify namespace configuration
  • Review error logs
  • Check resource availability

4. Merge in Batches

For large DLQs, process incrementally:
# Process message IDs in batches of 1000
for i in {0..10}; do
  start=$((i * 1000))
  end=$((start + 999))
  
  tctl admin queue dlq merge \
    --shard-id 1 \
    --category transfer \
    --min-message-id $start \
    --max-message-id $end
  
  sleep 60  # Pause between batches
done

5. Regular DLQ Audits

Schedule weekly reviews:
# Weekly DLQ report
for ns in $(tctl namespace list | grep Name | awk '{print $2}'); do
  count=$(tctl admin dlq count --cluster source-cluster --namespace $ns 2>/dev/null)
  echo "$ns: $count"
done

Troubleshooting

DLQ Metrics Not Updating

Cause: Metrics are emitted only by the owner of shard 1
Solution:
# Find shard 1 owner
tctl admin shard describe --shard-id 1

# Check logs on that host
kubectl logs <pod-name> | grep DLQMetricsEmitter

Cannot Read DLQ

Cause: Insufficient permissions or wrong cluster
Solution:
# Verify cluster connectivity
tctl cluster health

# Check admin permissions
tctl admin cluster describe

Merge Fails

Cause: Tasks are still failing for the same reason
Solution:
# Check recent errors
tctl admin queue dlq read --shard-id 1 --category transfer

# Review history service logs
kubectl logs -l app=temporal-history --tail=1000 | grep DLQ

# Fix underlying issue before retrying merge
