Metrics Collection
Temporal emits metrics using either Prometheus or StatsD backends. The metrics framework can be configured to use either Tally or OpenTelemetry.
Prometheus Configuration
Configure Prometheus metrics in your config.yaml. Two frameworks are available:
- tally - Legacy framework using uber-go/tally
- opentelemetry - Modern OpenTelemetry-based metrics (recommended)
Prometheus options:
- listenAddress - Address where Prometheus scrapes metrics
- handlerPath - HTTP endpoint path (default: /metrics)
- loggerRPS - Rate limit for metric logger (0 = unlimited)
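As a rough illustration of how the options above fit together, here is a minimal sketch of a Prometheus section. It assumes the settings live under global.metrics.prometheus and that the framework is selected with a framework key; the address is a placeholder. Check the names against the configuration reference for your server version.

```yaml
# Sketch only: the nesting under global.metrics and the "framework" key are assumptions.
global:
  metrics:
    prometheus:
      framework: "opentelemetry"      # or "tally" for the legacy framework
      listenAddress: "0.0.0.0:9090"   # placeholder address that Prometheus scrapes
      handlerPath: "/metrics"         # default endpoint path
      loggerRPS: 0                    # 0 = unlimited
```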
StatsD Configuration
For StatsD integration, the following options are available:
- hostPort - StatsD server address
- prefix - Metric name prefix
- flushInterval - Batch flush interval (default: 1s)
- flushBytes - Maximum UDP packet size (default: 1432)
- tagSeparator - Character to separate tags (optional)
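A StatsD section might look like the sketch below. The nesting under global.metrics.statsd, the placement of tagSeparator under a reporter key, and the host and prefix values are all assumptions made for illustration.

```yaml
# Sketch only: nesting and values are assumptions, not a verified schema.
global:
  metrics:
    statsd:
      hostPort: "127.0.0.1:8125"   # placeholder StatsD server address
      prefix: "temporal"           # placeholder metric name prefix
      flushInterval: 1s            # default batch flush interval
      flushBytes: 1432             # default maximum UDP packet size
      reporter:
        tagSeparator: ","          # optional; placement under "reporter" is an assumption
```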
Common Metrics Configuration
- tags - Global tags added to all metrics
- excludeTags - Filter sensitive tag values (replaced with _tag_excluded_)
- prefix - Prefix for all metric names
- perUnitHistogramBoundaries - Custom histogram buckets by unit type
- withoutUnitSuffix - Remove unit suffixes (OpenTelemetry only)
- withoutCounterSuffix - Remove _total suffix from counters (OpenTelemetry only)
- recordTimerInSeconds - Emit timers in seconds instead of milliseconds
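The common options could be combined roughly as follows. This is a sketch under stated assumptions: the placement directly under global.metrics, the map shape of excludeTags (tag name mapped to the list of values that stay visible), the milliseconds unit key, and every concrete value are hypothetical.

```yaml
# Sketch only: key placement, value shapes, and all values are assumptions.
global:
  metrics:
    tags:
      environment: "production"      # hypothetical global tag
    excludeTags:
      task_queue: []                 # hypothetical: no values allowed, so all become _tag_excluded_
    prefix: "temporal_"              # hypothetical metric name prefix
    perUnitHistogramBoundaries:
      milliseconds: [1, 5, 10, 50, 100, 500, 1000, 5000]   # hypothetical buckets
    withoutUnitSuffix: false         # OpenTelemetry only
    withoutCounterSuffix: false      # OpenTelemetry only
    recordTimerInSeconds: true       # timers in seconds instead of milliseconds
```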
Key Metrics by Service
Service Health Metrics
These metrics track overall service health and carry the following tags:
- operation - API method name
- service_role - Service type (frontend, history, matching, worker)
Persistence Layer Metrics
Track database operations:
- Request count
- Error count
- Latency histogram
These metrics include the db_kind tag (cassandra, mysql, postgres, sqlite).
History Service Metrics
Matching Service Metrics
Authorization Metrics
- namespace - Target namespace
- operation - API being authorized
Error Tracking
Error metrics are reported by type.
Resource Metrics
Lock and Semaphore Usage
Cache Metrics
Values of the cache_type tag:
- mutablestate
- events
- version_membership
- routing_info
TLS Certificate Monitoring
Alerting Guidelines
Critical Alerts
Set up alerts for:
- Service Availability
- Persistence Layer
- Shard Health
- Certificate Expiration
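To make the first two critical alerts concrete, here is a hedged sketch of a Prometheus rule file. The metric names service_requests, service_errors, and persistence_errors are assumptions (names may also carry a _total suffix unless withoutCounterSuffix is set), and the thresholds and durations are placeholders. Shard health and certificate expiration are left out because their metric names are not given here.

```yaml
# Sketch only: metric names and thresholds are assumptions; adjust to what your server emits.
groups:
  - name: temporal-critical
    rules:
      - alert: TemporalServiceErrorRateHigh
        # Fraction of failed requests per service role over the last 5 minutes.
        expr: |
          sum by (service_role) (rate(service_errors[5m]))
            /
          sum by (service_role) (rate(service_requests[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
      - alert: TemporalPersistenceErrors
        # Any sustained persistence error rate, split by database kind.
        expr: sum by (db_kind) (rate(persistence_errors[5m])) > 0
        for: 5m
        labels:
          severity: critical
```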
Warning Alerts
- High Latency
- Resource Pressure
- Cache Efficiency
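Warning-level rules for latency and cache efficiency could be sketched in the same way. The metric names service_latency_bucket, cache_miss, and cache_requests are assumptions, and the 1-second latency threshold assumes timers are emitted in seconds (recordTimerInSeconds).

```yaml
# Sketch only: metric names, units, and thresholds are assumptions.
groups:
  - name: temporal-warning
    rules:
      - alert: TemporalHighLatency
        # p95 request latency per operation, computed from histogram buckets.
        expr: |
          histogram_quantile(0.95,
            sum by (le, operation) (rate(service_latency_bucket[5m]))
          ) > 1
        for: 10m
        labels:
          severity: warning
      - alert: TemporalLowCacheHitRate
        # Miss ratio per cache type; above 0.5 means fewer than half of lookups hit.
        expr: |
          sum by (cache_type) (rate(cache_miss[10m]))
            /
          sum by (cache_type) (rate(cache_requests[10m])) > 0.5
        for: 15m
        labels:
          severity: warning
```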
Logging Configuration
Configure structured logging with one of the following levels:
- debug - Detailed diagnostic information
- info - General operational events
- warn - Warning messages, degraded state
- error - Error events, requires attention
Log entries may include the following fields:
- namespace - Namespace name
- workflowID - Workflow execution ID
- runID - Workflow run ID
- operation - Operation being performed
- error - Error details
- shard-id - History shard ID
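A logging section might look like the sketch below. The top-level log key and the stdout and format keys are assumptions; only the level values come from the list above.

```yaml
# Sketch only: key names other than the level values are assumptions.
log:
  stdout: true      # write logs to standard output
  level: "info"     # one of: debug, info, warn, error
  format: "json"    # structured output for log aggregation
```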
Health Checks
Temporal exposes health check endpoints.
Distributed Tracing
Enable OpenTelemetry tracing to capture:
- Request flow across services
- Persistence operation timing
- Cross-namespace operations
- Replication latency
Dashboard Recommendations
Service Overview Dashboard
- Request rate by service and operation
- Error rate and types
- Latency percentiles (p50, p95, p99)
- Active connections
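The request-rate and latency-percentile panels can be backed by Prometheus recording rules. The sketch below assumes the metric names service_requests and service_latency_bucket and uses hypothetical rule names; it is one possible layout, not a prescribed one.

```yaml
# Sketch only: metric names and rule names are assumptions.
groups:
  - name: temporal-dashboard
    rules:
      # Request rate by service and operation.
      - record: temporal:service_requests:rate5m
        expr: sum by (service_role, operation) (rate(service_requests[5m]))
      # p50, p95, and p99 latency from histogram buckets.
      - record: temporal:service_latency:p50_5m
        expr: |
          histogram_quantile(0.50,
            sum by (le, service_role, operation) (rate(service_latency_bucket[5m])))
      - record: temporal:service_latency:p95_5m
        expr: |
          histogram_quantile(0.95,
            sum by (le, service_role, operation) (rate(service_latency_bucket[5m])))
      - record: temporal:service_latency:p99_5m
        expr: |
          histogram_quantile(0.99,
            sum by (le, service_role, operation) (rate(service_latency_bucket[5m])))
```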
Persistence Dashboard
- Operation latency by type
- Error rates by operation
- Connection pool utilization
- Query duration
Workflow Execution Dashboard
- Workflow start rate
- Workflow completion rate
- Task queue backlog
- Activity timeouts
Resource Usage Dashboard
- CPU and memory per service
- Lock contention
- Cache hit rates
- GC pause time
See Also
- Metrics Reference - Complete metrics catalog
- Scaling Guide - Performance tuning
- Security - Authentication and authorization