Overview

Cadence provides extensive monitoring capabilities through metrics, structured logging, and health checks. This guide covers production monitoring strategies, metric integration, and observability best practices.

Metrics Architecture

Cadence supports multiple metrics backends through the Tally metrics library:

Supported Reporters

  • Prometheus - Recommended for production use
  • StatsD - Legacy support with tag limitations
  • M3 - Uber’s internal metrics platform

Only one metrics reporter can be configured per service. Attempting to configure multiple reporters will result in a fatal error.

Prometheus Integration

Configuration

Configure Prometheus metrics in your config.yaml:
services:
  frontend:
    metrics:
      prometheus:
        timerType: "histogram"
        listenAddress: "0.0.0.0:9090"
        defaultHistogramBuckets:
          - 0.001
          - 0.005
          - 0.01
          - 0.05
          - 0.1
          - 0.5
          - 1.0
          - 5.0
          - 10.0
      tags:
        environment: "production"
        datacenter: "us-east-1"
      prefix: "cadence"
      reportingInterval: "1s"
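
On the Prometheus side, a matching scrape job might look like the following sketch (the job name and target host are assumptions to adapt to your deployment; the port must match `listenAddress`):

```yaml
scrape_configs:
  - job_name: "cadence-frontend"
    scrape_interval: 15s
    static_configs:
      - targets: ["cadence-frontend:9090"]  # must match listenAddress above
```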

Configuration Parameters

| Parameter | Description | Default |
| --- | --- | --- |
| listenAddress | Host:port for the Prometheus scrape endpoint | Required |
| timerType | Use "histogram" for latency metrics | "histogram" |
| defaultHistogramBuckets | Histogram bucket boundaries, in seconds | See above |
| reportingInterval | Metric reporting interval | 1s |
| tags | Global tags applied to all metrics | (none) |
| prefix | Metric name prefix | "" |

Metric Sanitization

Cadence automatically sanitizes metric names to comply with Prometheus naming conventions:
  • The characters "-" and "." are replaced with "_"
  • Only alphanumeric characters and "_" are allowed
  • Tag names follow the same rules

Emitted metric names may differ from internal metric names due to sanitization. Ensure your dashboards account for this transformation.
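
Applying these rules, a few hypothetical internal names would be emitted as:

```text
persistence.latency      -> persistence_latency
task-list.backlog        -> task_list_backlog
history.shard-controller -> history_shard_controller
```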

StatsD Integration

Configuration

services:
  history:
    metrics:
      statsd:
        hostPort: "127.0.0.1:8125"
        prefix: "cadence.history"
        flushInterval: "10s"
        flushBytes: 512
      tags:
        service: "history"

Tally’s standard StatsD implementation doesn’t support tagging. Cadence provides an enhanced reporter with tag support.

Key Metrics

Service Health Metrics

These metrics are emitted by all Cadence services:
# Process metrics
cadence_restarts                    # Service restart counter
cadence_num_goroutines             # Active goroutines
cadence_gomaxprocs                 # GOMAXPROCS setting

# Memory metrics
cadence_memory_allocated           # Total allocated bytes
cadence_memory_heap                # Heap memory
cadence_memory_heapidle            # Idle heap memory
cadence_memory_heapinuse           # In-use heap memory
cadence_memory_stack               # Stack memory
cadence_memory_num_gc              # GC run count
cadence_memory_gc_pause_ms         # GC pause duration

Persistence Metrics

Monitor database operations across all persistence calls:
# Operation metrics (per operation type)
cadence_persistence_requests
cadence_persistence_errors
cadence_persistence_latency

# Example operations:
# - PersistenceCreateWorkflowExecution
# - PersistenceUpdateWorkflowExecution
# - PersistenceGetWorkflowExecution
# - PersistenceAppendHistoryEvents

RPC Metrics

Track inter-service and client communication:
# Client metrics (Frontend, History, Matching)
cadence_frontend_client_requests
cadence_frontend_client_errors
cadence_frontend_client_latency

cadence_history_client_requests
cadence_history_client_errors
cadence_history_client_latency

cadence_matching_client_requests
cadence_matching_client_errors
cadence_matching_client_latency

Workflow Execution Metrics

# Workflow lifecycle
cadence_workflow_started
cadence_workflow_completed
cadence_workflow_failed
cadence_workflow_timeout
cadence_workflow_canceled
cadence_workflow_continued_as_new

# Task processing
cadence_decision_task_schedule_to_start_latency
cadence_activity_task_schedule_to_start_latency
cadence_decision_task_execution_latency
cadence_activity_task_execution_latency
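
When `timerType` is set to "histogram", the latency metrics above are exposed as Prometheus histograms, so percentiles can be computed with `histogram_quantile`. A sketch (the `_bucket` suffix and the 5-minute window follow standard Prometheus histogram conventions):

```promql
# p95 decision-task schedule-to-start latency, per task list
histogram_quantile(
  0.95,
  sum by (le, tasklist) (
    rate(cadence_decision_task_schedule_to_start_latency_bucket[5m])
  )
)
```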

Service-Specific Metrics

History Service

cadence_history_cache_requests
cadence_history_cache_errors
cadence_history_cache_latency
cadence_history_shard_context_closed
cadence_history_replication_tasks_applied

Matching Service

cadence_matching_tasks_added
cadence_matching_tasks_dispatched
cadence_matching_poll_success
cadence_matching_poll_timeout
cadence_matching_tasklist_backlog

Metric Tags

Cadence automatically applies these standard tags:
| Tag | Description | Example |
| --- | --- | --- |
| cadence_service | Service name | "frontend", "history" |
| operation | API operation | "StartWorkflowExecution" |
| domain | Workflow domain | "my-domain" |
| tasklist | Task list name | "my-tasklist" |
| workflow_type | Workflow type | "MyWorkflow" |
| activity_type | Activity type | "MyActivity" |

Health Checks

Endpoint Configuration

Cadence services expose health check endpoints:
GET http://<host>:<port>/health

Response Format

{
  "ok": true,
  "msg": "All systems operational"
}

Health Check Implementation

Health checks verify:
  1. Service startup - Service has initialized successfully
  2. Persistence connectivity - Database connections are healthy
  3. Ring membership - Service is part of the ring

Integrate health checks with your load balancer and orchestration platform (Kubernetes, ECS) for automated failover.
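
In Kubernetes, the /health endpoint maps directly onto container probes. A minimal sketch, assuming the service serves health checks on port 7933 (adjust the port and timings to your deployment):

```yaml
# Hypothetical probe configuration for a Cadence frontend container
livenessProbe:
  httpGet:
    path: /health
    port: 7933
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /health
    port: 7933
  initialDelaySeconds: 10
  periodSeconds: 5
```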

Logging Configuration

Structured Logging

Cadence uses Zap for structured JSON logging:
log:
  level: "info"
  encoding: "json"
  outputFile: "/var/log/cadence/cadence.log"
  levelKey: "level"

Log Levels

| Level | Use Case |
| --- | --- |
| debug | Development and troubleshooting |
| info | Normal operations (recommended for production) |
| warn | Warning conditions |
| error | Error conditions |
| fatal | Fatal errors causing service shutdown |

Console vs JSON Encoding

# JSON format (production)
log:
  encoding: "json"
  outputFile: "/var/log/cadence/frontend.log"

# Console format (development)
log:
  encoding: "console"
  stdout: true

Log Fields

Cadence includes contextual fields in all log entries:
{
  "ts": "2026-03-04T10:15:30.123Z",
  "level": "info",
  "msg": "Workflow execution started",
  "logger": "history.engine",
  "wf-domain-name": "my-domain",
  "wf-id": "workflow-123",
  "wf-run-id": "abc-def-ghi",
  "shard-id": 42
}

Alerting Guidelines

Critical Alerts

Configure alerts for these conditions:

Service Availability

# Service down
up{job="cadence-frontend"} == 0

# High restart rate
rate(cadence_restarts[5m]) > 0.1

Persistence Issues

# High error rate
rate(cadence_persistence_errors[5m]) > 10

# High latency (p99 over the last 5 minutes; histograms expose a *_bucket series)
histogram_quantile(0.99, rate(cadence_persistence_latency_bucket[5m])) > 1.0

Task Processing

# Task list backlog growth (backlog is a gauge, so use delta rather than rate)
delta(cadence_matching_tasklist_backlog[5m]) > 100

# High task timeout rate
rate(cadence_workflow_timeout[5m]) > 5

Warning Alerts

# Memory usage
cadence_memory_heap > 8e9  # 8GB

# High GC pause time
rate(cadence_memory_gc_pause_ms[5m]) > 100

# High request latency (p95 over the last 5 minutes)
histogram_quantile(0.95, rate(cadence_frontend_client_latency_bucket[5m])) > 0.5
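
Expressed as a Prometheus alerting rule, the service-down condition above might look like this sketch (the job label, `for` duration, and severity label are assumptions to adapt to your setup):

```yaml
groups:
  - name: cadence-critical
    rules:
      - alert: CadenceFrontendDown
        expr: up{job="cadence-frontend"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Cadence frontend target has been down for 2 minutes"
```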

Distributed Tracing

Cadence supports distributed tracing through YARPC:

Configuration

services:
  frontend:
    rpc:
      grpcPort: 7833
      # Tracing is enabled via YARPC configuration

Distributed tracing integration requires custom YARPC middleware configuration. Refer to YARPC documentation for Jaeger or Zipkin integration.

Monitoring Best Practices

Dashboard Organization

Create dashboards for each layer:
  1. Service Health - CPU, memory, goroutines, restarts
  2. RPC Layer - Request rates, errors, latency by service and operation
  3. Persistence Layer - Database operations, latency, errors
  4. Workflow Execution - Started/completed/failed workflows, task latency
  5. Business Metrics - Domain-specific workflow metrics

Cardinality Management

Avoid high-cardinality dimensions in metric tags:
  • Workflow IDs
  • Run IDs
  • Specific timestamps
  • User IDs
Use workflow_type, domain, and operation instead.
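
As a safety net, high-cardinality labels can also be dropped at scrape time. A sketch using Prometheus `metric_relabel_configs` (the `workflow_id` and `run_id` label names are hypothetical examples of labels you would not want ingested):

```yaml
scrape_configs:
  - job_name: "cadence-frontend"
    static_configs:
      - targets: ["cadence-frontend:9090"]
    metric_relabel_configs:
      # Drop hypothetical high-cardinality labels before ingestion
      - action: labeldrop
        regex: "workflow_id|run_id"
```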

Retention Policies

  • Raw metrics: 15-30 days
  • Aggregated metrics: 90-365 days
  • Logs: 7-30 days (ship to archival for longer retention)

Troubleshooting

No Metrics Appearing

  1. Verify metrics endpoint is accessible:
    curl http://localhost:9090/metrics
    
  2. Check service logs for metrics initialization:
    grep "metric" /var/log/cadence/frontend.log
    
  3. Validate configuration:
    # Only one reporter should be configured
    metrics:
      prometheus:  # OR statsd OR m3, not multiple
        listenAddress: "0.0.0.0:9090"
    

High Metric Cardinality

If Prometheus complains about high cardinality:
  1. Review custom tags configuration
  2. Ensure workflow IDs aren’t being used as tag values
  3. Check for runaway domain/tasklist creation

Missing Log Output

Check log configuration:
log:
  level: "info"  # Not "INFO"
  outputFile: "/var/log/cadence/service.log"  # Ensure path exists and is writable
