Temporal Server provides comprehensive monitoring capabilities through metrics, structured logging, and distributed tracing to observe cluster health and performance.

Metrics Collection

Temporal emits metrics to either a Prometheus or StatsD backend. The underlying metrics framework can be configured as either Tally (legacy) or OpenTelemetry.

Prometheus Configuration

Configure Prometheus metrics in your config.yaml:
global:
  metrics:
    prometheus:
      framework: "opentelemetry"  # or "tally"
      listenAddress: "127.0.0.1:8000"
      handlerPath: "/metrics"
      loggerRPS: 0  # 0 means no limit
Framework Options:
  • tally - Legacy framework using uber-go/tally
  • opentelemetry - Modern OpenTelemetry-based metrics (recommended)
Configuration Fields:
  • listenAddress - Address where Prometheus scrapes metrics
  • handlerPath - HTTP endpoint path (default: /metrics)
  • loggerRPS - Rate limit for metric logger (0 = unlimited)
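
With the listener above, Prometheus needs a matching scrape job. A minimal sketch (job name and scrape interval are illustrative):

```yaml
scrape_configs:
  - job_name: "temporal"
    scrape_interval: 15s
    metrics_path: /metrics             # must match handlerPath
    static_configs:
      - targets: ["127.0.0.1:8000"]    # must match listenAddress
```

In a multi-service deployment, each Temporal service (frontend, history, matching, worker) typically gets its own listener address and scrape target.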

StatsD Configuration

For StatsD integration:
global:
  metrics:
    statsd:
      framework: "opentelemetry"
      hostPort: "127.0.0.1:8125"
      prefix: "temporal"
      flushInterval: "1s"
      flushBytes: 1432
      reporter:
        tagSeparator: ":"
Configuration Fields:
  • hostPort - StatsD server address
  • prefix - Metric name prefix
  • flushInterval - Batch flush interval (default: 1s)
  • flushBytes - Maximum UDP packet size (default: 1432)
  • tagSeparator - Character to separate tags (optional)

Common Metrics Configuration

global:
  metrics:
    clientConfig:
      tags:
        environment: "production"
        cluster: "us-east-1"
      excludeTags:
        namespace:
          - "system-namespace"  # allowed values; all others become _tag_excluded_
      prefix: "temporal_"
      perUnitHistogramBoundaries:
        dimensionless: [1, 2, 5, 10, 20, 50, 100, 200, 500, 1000]
        milliseconds: [1, 2, 5, 10, 20, 50, 100, 200, 500, 1000, 2000, 5000, 10000]
        bytes: [1024, 2048, 4096, 8192, 16384, 32768, 65536, 131072, 262144, 524288]
Options:
  • tags - Global tags added to all metrics
  • excludeTags - Per-tag allowlist of values; any value not listed is replaced with _tag_excluded_
  • prefix - Prefix for all metric names
  • perUnitHistogramBoundaries - Custom histogram buckets by unit type
  • withoutUnitSuffix - Remove unit suffixes (OpenTelemetry only)
  • withoutCounterSuffix - Remove _total suffix from counters (OpenTelemetry only)
  • recordTimerInSeconds - Emit timers in seconds instead of milliseconds
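
With the clientConfig above, a scraped sample would look roughly like the following (exact name depends on the framework and the suffix options; the value and operation tag are illustrative):

```text
temporal_service_requests_total{environment="production",cluster="us-east-1",operation="StartWorkflowExecution"} 42
```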

Key Metrics by Service

Service Health Metrics

These metrics track overall service health:
service_requests              # Total RPC requests received
service_pending_requests      # Current pending requests (gauge)
service_errors                # Unexpected service errors
service_error_with_type       # Errors by error type
service_latency               # Request latency
service_latency_nouserlatency # Server-side latency only
service_latency_userlatency   # User workflow latency
Common Tags:
  • operation - API method name
  • service_role - Service type (frontend, history, matching, worker)
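
These metrics and tags combine into per-operation health queries. A PromQL sketch, assuming the default metric names without a custom prefix:

```promql
# Error ratio per operation over 5 minutes
sum(rate(service_errors[5m])) by (operation)
  / sum(rate(service_requests[5m])) by (operation)

# p99 request latency per service role
histogram_quantile(
  0.99,
  sum(rate(service_latency_bucket[5m])) by (le, service_role)
)
```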

Persistence Layer Metrics

Track database operations:
# Shard Operations
GetOrCreateShard
UpdateShard
AssertShardOwnership

# Workflow Execution
CreateWorkflowExecution
GetWorkflowExecution
UpdateWorkflowExecution
DeleteWorkflowExecution

# Task Queue Operations
CreateTaskQueue
GetTaskQueue
UpdateTaskQueue
DeleteTaskQueue

# Task Operations
GetTransferTasks
CompleteTransferTask
GetTimerTasks
CompleteTimerTask
GetVisibilityTasks
GetReplicationTasks

# History Operations
AppendHistoryNodes
ReadHistoryBranch
DeleteHistoryBranch
Each persistence operation emits:
  • Request count
  • Error count
  • Latency histogram
  • db_kind tag (cassandra, mysql, postgres, sqlite)
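
Because every operation carries the db_kind tag, a per-backend latency view is straightforward. A PromQL sketch, assuming the `_latency` histogram naming:

```promql
# p95 GetWorkflowExecution latency, broken out by database backend
histogram_quantile(
  0.95,
  sum(rate(GetWorkflowExecution_latency_bucket[5m])) by (le, db_kind)
)
```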

History Service Metrics

# Core Operations
StartWorkflowExecution
RecordActivityTaskHeartbeat
RespondWorkflowTaskCompleted
RespondActivityTaskCompleted

# Shard Management
ShardController
ShardInfo

# Task Processing
TransferQueueProcessor
TimerQueueProcessor
VisibilityQueueProcessor
ArchivalQueueProcessor
OutboundQueueProcessor

# Cache Performance
HistoryCacheGetOrCreate
EventsCacheGetEvent
EventsCachePutEvent

Matching Service Metrics

PollWorkflowTaskQueue
PollActivityTaskQueue
AddActivityTask
AddWorkflowTask
TaskQueueMgr
TaskQueuePartitionManager

Authorization Metrics

service_authorization_latency      # Authorization check duration
service_errors_unauthorized        # Rejected requests
service_errors_authorize_failed    # Authorization system errors
Tagged with:
  • namespace - Target namespace
  • operation - API being authorized

Error Tracking

Error metrics by type:
service_errors_invalid_argument
service_errors_namespace_not_active
service_errors_resource_exhausted
service_errors_entity_not_found
service_errors_execution_already_started
service_errors_context_timeout
service_errors_retry_task
service_errors_incomplete_history
service_errors_nondeterministic
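
Each of these is a counter, so trend them with rate() rather than comparing raw values. For example, in PromQL:

```promql
# Backpressure from rate limiting or overloaded persistence
rate(service_errors_resource_exhausted[5m])

# Timeouts often precede broader availability problems
rate(service_errors_context_timeout[5m])
```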

Resource Metrics

Lock and Semaphore Usage

lock_requests        # Lock acquisition attempts
lock_latency         # Time waiting for locks
semaphore_requests   # Semaphore acquisition attempts
semaphore_failures   # Failed semaphore acquisitions
semaphore_latency    # Time waiting for semaphore

Cache Metrics

NamespaceCache
EventsCacheGetEvent
EventsCachePutEvent
EventsCacheGetFromStore
VersionMembershipCacheGet
VersionMembershipCachePut
Tagged with cache_type:
  • mutablestate
  • events
  • version_membership
  • routing_info
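
A hit ratio per cache can be derived from the get/miss counters. A PromQL sketch for the events cache (metric names as listed above; counter suffixes may vary by framework):

```promql
# Events cache hit ratio: 1 minus the fraction of gets served from the store
1 - (
  sum(rate(EventsCacheGetFromStore[5m]))
  / sum(rate(EventsCacheGetEvent[5m]))
)
```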

TLS Certificate Monitoring

certificates_expired   # Number of expired certificates (gauge)
certificates_expiring  # Number of certificates expiring soon (gauge)
Configure certificate monitoring:
global:
  tls:
    expirationChecks:
      warningWindow: "720h"  # 30 days
      errorWindow: "168h"    # 7 days
      checkInterval: "1h"
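
These gauges translate directly into Prometheus alerting rules. For example (alert names and `for` durations are illustrative):

```yaml
groups:
  - name: temporal-tls
    rules:
      - alert: TemporalCertificateExpired
        expr: certificates_expired > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "A Temporal TLS certificate has expired"
      - alert: TemporalCertificateExpiring
        expr: certificates_expiring > 0
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "A Temporal TLS certificate expires within the warning window"
```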

Alerting Guidelines

Critical Alerts

Set up alerts for:
  1. Service Availability
    rate(service_errors[5m]) / rate(service_requests[5m]) > 0.05  # 5% error rate
    service_pending_requests > 1000
    
  2. Persistence Layer
    rate(UpdateShard_errors[1m]) > 0
    histogram_quantile(0.99, sum(rate(GetWorkflowExecution_latency_bucket[5m])) by (le)) > 1000  # p99 in ms
    
  3. Shard Health
    rate(ShardController_errors[5m]) > 0
    
  4. Certificate Expiration
    certificates_expired > 0
    certificates_expiring > 0
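
Expressed as a Prometheus rules file, the first two criticals might look like this (thresholds are illustrative and should be tuned per deployment):

```yaml
groups:
  - name: temporal-critical
    rules:
      - alert: TemporalHighErrorRate
        expr: |
          sum(rate(service_errors[5m])) by (service_role)
            / sum(rate(service_requests[5m])) by (service_role) > 0.05
        for: 5m
        labels:
          severity: critical
      - alert: TemporalPendingRequestsHigh
        expr: service_pending_requests > 1000
        for: 10m
        labels:
          severity: critical
```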
    

Warning Alerts

  1. High Latency
    histogram_quantile(0.95, sum(rate(service_latency_bucket[5m])) by (le)) > 500  # p95 in ms
    
  2. Resource Pressure
    rate(semaphore_failures[5m]) > 1
    histogram_quantile(0.95, sum(rate(lock_latency_bucket[5m])) by (le)) > 100
    
  3. Cache Efficiency
    rate(EventsCacheGetFromStore[5m]) / rate(EventsCacheGetEvent[5m]) > 0.5
    

Logging Configuration

Configure structured logging:
log:
  stdout: true
  level: "info"  # debug, info, warn, error
  outputFile: "/var/log/temporal/server.log"
  format: "json"  # json or console
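With JSON output, each entry is a single structured object. A representative (illustrative) line, using the tags listed below:

```json
{"level":"info","ts":"2024-05-01T12:00:00.000Z","msg":"workflow task completed","service":"history","namespace":"default","workflowID":"order-12345","runID":"8f3a2b1c-0d4e-4f5a-9b6c-7d8e9f0a1b2c","shard-id":512}
```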
Log Levels:
  • debug - Detailed diagnostic information
  • info - General operational events
  • warn - Warning messages, degraded state
  • error - Error events, requires attention
Important Log Tags:
  • namespace - Namespace name
  • workflowID - Workflow execution ID
  • runID - Workflow run ID
  • operation - Operation being performed
  • error - Error details
  • shard-id - History shard ID

Health Checks

Temporal services expose standard gRPC health checks rather than plain HTTP endpoints (default ports shown):
# Frontend health
grpc_health_probe -addr=localhost:7233 -service=temporal.api.workflowservice.v1.WorkflowService

# Service-specific health
grpc_health_probe -addr=localhost:7234  # History
grpc_health_probe -addr=localhost:7235  # Matching

Distributed Tracing

Enable OpenTelemetry tracing:
global:
  otel:
    enabled: true
    exporters:
      - type: "otlp"
        endpoint: "otel-collector:4317"
        insecure: false
        headers:
          api-key: "your-api-key"
Tracing captures:
  • Request flow across services
  • Persistence operation timing
  • Cross-namespace operations
  • Replication latency

Dashboard Recommendations

Service Overview Dashboard

  • Request rate by service and operation
  • Error rate and types
  • Latency percentiles (p50, p95, p99)
  • Active connections

Persistence Dashboard

  • Operation latency by type
  • Error rates by operation
  • Connection pool utilization
  • Query duration

Workflow Execution Dashboard

  • Workflow start rate
  • Workflow completion rate
  • Task queue backlog
  • Activity timeouts

Resource Usage Dashboard

  • CPU and memory per service
  • Lock contention
  • Cache hit rates
  • GC pause time
