Overview
Cadence provides extensive monitoring capabilities through metrics, structured logging, and health checks. This guide covers production monitoring strategies, metric integration, and observability best practices.
Metrics Architecture
Cadence supports multiple metrics backends through the Tally metrics library:
Supported Reporters
- Prometheus - Recommended for production use
- StatsD - Legacy support with tag limitations
- M3 - Uber’s internal metrics platform
Only one metrics reporter can be configured per service. Attempting to configure multiple reporters will result in a fatal error.
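The single-reporter rule can be checked before deployment. The following is a minimal sketch, not Cadence's own validation code: it assumes the metrics section has already been loaded into a dictionary and that the reporter keys are named prometheus, statsd, and m3. The function names are hypothetical.

```python
# Hypothetical pre-flight check; the reporter key names are assumptions.
REPORTER_KEYS = ("prometheus", "statsd", "m3")


def configured_reporters(metrics_cfg):
    """Return the reporter keys that are present and non-empty."""
    return [key for key in REPORTER_KEYS if metrics_cfg.get(key)]


def validate_metrics_config(metrics_cfg):
    """Return the single configured reporter, or None if there is none.

    Raises ValueError when more than one reporter is configured.
    """
    found = configured_reporters(metrics_cfg)
    if len(found) > 1:
        raise ValueError(
            "multiple metrics reporters configured: " + ", ".join(found)
        )
    return found[0] if found else None
```

Running a check like this in CI surfaces the problem before a service hits the fatal error at startup.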
Prometheus Integration
Configuration
Configure Prometheus metrics in your config.yaml:
Configuration Parameters
| Parameter | Description | Default |
|---|---|---|
| listenAddress | Host:Port for Prometheus scrape endpoint | Required |
| timerType | Use "histogram" for latency metrics | "histogram" |
| defaultHistogramBuckets | Histogram bucket boundaries in seconds | See above |
| reportingInterval | Metric reporting interval | 1s |
| tags | Global tags applied to all metrics | |
| prefix | Metric name prefix | "" |
Metric Sanitization
Cadence automatically sanitizes metric names to comply with Prometheus naming conventions:
- Characters "-" and "." are replaced with "_"
- Only alphanumeric characters and "_" are allowed
- Tag names follow the same rules
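The sanitization rules can be approximated in a few lines, which helps when predicting the name a metric will have in Prometheus. This is a sketch rather than the actual implementation. It assumes that every disallowed character, not only "-" and ".", is replaced with an underscore; the guide states only that such characters are not allowed.

```python
import re

# Anything outside ASCII letters, digits, and underscore is treated as invalid.
_INVALID = re.compile(r"[^A-Za-z0-9_]")


def sanitize(name):
    """Replace every character outside [A-Za-z0-9_] with an underscore."""
    return _INVALID.sub("_", name)
```

For example, `sanitize("cadence-frontend.requests")` yields `cadence_frontend_requests`.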
StatsD Integration
Configuration
Tally’s standard StatsD implementation doesn’t support tagging. Cadence provides an enhanced reporter with tag support.
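One general way to carry tags over a protocol without native tagging is to fold them into the metric name. The sketch below illustrates that technique only. The dot-delimited, key-sorted encoding is an assumption for illustration; this guide does not specify the encoding used by Cadence's reporter.

```python
def statsd_name_with_tags(name, tags):
    """Append tag key/value pairs to a metric name, dot-delimited.

    Keys are sorted so the same tag set always produces the same name.
    """
    parts = [name]
    for key in sorted(tags):
        parts.extend([key, tags[key]])
    return ".".join(parts)
```

Because each distinct tag combination becomes a distinct metric name, tag values with many possible values multiply the number of StatsD metrics.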
Key Metrics
Service Health Metrics
These metrics are emitted by all Cadence services:
Persistence Metrics
Monitor database operations across all persistence calls:
RPC Metrics
Track inter-service and client communication:
Workflow Execution Metrics
Service-Specific Metrics
History Service
Matching Service
Metric Tags
Cadence automatically applies these standard tags:
| Tag | Description | Example |
|---|---|---|
| cadence_service | Service name | "frontend", "history" |
| operation | API operation | "StartWorkflowExecution" |
| domain | Workflow domain | "my-domain" |
| tasklist | Task list name | "my-tasklist" |
| workflow_type | Workflow type | "MyWorkflow" |
| activity_type | Activity type | "MyActivity" |
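With the Prometheus reporter, these tags appear as labels and can be used to narrow a query. The helper below is a small sketch that builds a PromQL selector from tag values. The metric name `requests` used in the usage note is a placeholder, not a confirmed Cadence metric name.

```python
def promql_selector(metric, labels):
    """Build a PromQL instant-vector selector with exact-match labels."""

    def escape(value):
        # Backslashes and double quotes must be escaped inside label values.
        return value.replace("\\", "\\\\").replace('"', '\\"')

    pairs = ",".join(
        '{}="{}"'.format(key, escape(value))
        for key, value in sorted(labels.items())
    )
    return "{}{{{}}}".format(metric, pairs)
```

Calling `promql_selector("requests", {"cadence_service": "frontend", "operation": "StartWorkflowExecution"})` returns `requests{cadence_service="frontend",operation="StartWorkflowExecution"}`.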
Health Checks
Endpoint Configuration
Cadence services expose health check endpoints:
Response Format
Health Check Implementation
Health checks verify:
- Service startup - Service has initialized successfully
- Persistence connectivity - Database connections are healthy
- Ring membership - Service is part of the ring
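The way several checks combine into one health result can be sketched as follows. This is an illustration of the aggregation logic, not Cadence's implementation. The check names and the callable-based interface are assumptions.

```python
def run_health_checks(checks):
    """Run each named check and report overall and per-check status.

    `checks` maps a name to a zero-argument callable returning a truthy
    value when healthy. A check that raises is counted as failed.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return all(results.values()), results
```

The service is reported healthy only if every check passes, so a failed persistence connection marks the instance unhealthy even when startup and ring membership are fine.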
Logging Configuration
Structured Logging
Cadence uses Zap for structured JSON logging:
Log Levels
| Level | Use Case |
|---|---|
| debug | Development and troubleshooting |
| info | Normal operations (recommended for production) |
| warn | Warning conditions |
| error | Error conditions |
| fatal | Fatal errors causing service shutdown |
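Because the levels are ordered, JSON log output can be filtered by severity after the fact. The sketch below assumes each log line is a JSON object with `level` and `msg` keys, which are Zap's default production key names. A deployment with a customized encoder may use different keys.

```python
import json

# Levels from the table above, ordered from least to most severe.
LEVELS = ["debug", "info", "warn", "error", "fatal"]


def filter_logs(lines, min_level):
    """Yield parsed JSON entries whose level is at or above min_level."""
    threshold = LEVELS.index(min_level)
    for line in lines:
        try:
            entry = json.loads(line)
        except ValueError:
            continue  # skip non-JSON lines such as console-encoded output
        if not isinstance(entry, dict):
            continue
        level = entry.get("level")
        if level in LEVELS and LEVELS.index(level) >= threshold:
            yield entry
```

For instance, `list(filter_logs(open_file, "warn"))` keeps only warn, error, and fatal entries, where `open_file` is any iterable of log lines.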
Console vs JSON Encoding
Log Fields
Cadence includes contextual fields in all log entries:
Alerting Guidelines
Critical Alerts
Configure alerts for these conditions:
Service Availability
Persistence Issues
Task Processing
Warning Alerts
Distributed Tracing
Cadence supports distributed tracing through YARPC:
Configuration
Distributed tracing integration requires custom YARPC middleware configuration. Refer to YARPC documentation for Jaeger or Zipkin integration.
Monitoring Best Practices
Dashboard Organization
Create dashboards for each layer:
- Service Health - CPU, memory, goroutines, restarts
- RPC Layer - Request rates, errors, latency by service and operation
- Persistence Layer - Database operations, latency, errors
- Workflow Execution - Started/completed/failed workflows, task latency
- Business Metrics - Domain-specific workflow metrics
Cardinality Management
Retention Policies
- Raw metrics: 15-30 days
- Aggregated metrics: 90-365 days
- Logs: 7-30 days (ship to archival for longer retention)
Troubleshooting
No Metrics Appearing
- Verify that the metrics endpoint is accessible
- Check the service logs for metrics initialization messages
- Validate the metrics configuration
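The endpoint check can be scripted by fetching the scrape output and listing the metric names it contains. This is a sketch under stated assumptions: the URL in the usage note is a placeholder that must match your listenAddress, and the `/metrics` path is assumed rather than confirmed by this guide.

```python
from urllib.request import urlopen


def fetch_metrics(url, timeout=5):
    """Download the raw Prometheus text exposition from a scrape endpoint."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def metric_names(exposition_text):
    """Return the set of sample names found in Prometheus text output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # A sample line looks like `name{labels} value` or `name value`.
        head = line.split("{", 1)[0].split()
        if head:
            names.add(head[0])
    return names
```

A call such as `metric_names(fetch_metrics("http://127.0.0.1:8000/metrics"))` returning an empty set suggests the reporter started but nothing is being emitted. A connection error suggests the endpoint is not listening at that address.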
High Metric Cardinality
If Prometheus complains about high cardinality:
- Review custom tags configuration
- Ensure workflow IDs aren’t being used as tag values
- Check for runaway domain/tasklist creation
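To find which tag is responsible, count the distinct values each label takes across the exported series. The sketch below works on a list of label dictionaries, which could be built from a scrape or from a Prometheus API response. The `workflow_id` label in the usage note is a hypothetical example of a misused tag.

```python
from collections import defaultdict


def label_cardinality(series):
    """Count distinct values per label across a list of label dicts."""
    values = defaultdict(set)
    for labels in series:
        for key, value in labels.items():
            values[key].add(value)
    return {key: len(vals) for key, vals in values.items()}


def high_cardinality_labels(series, limit):
    """Return labels with more than `limit` distinct values, sorted by name."""
    counts = label_cardinality(series)
    return sorted(key for key, n in counts.items() if n > limit)
```

If a label such as `workflow_id` shows one distinct value per series, it is acting as a unique identifier and should be removed from the tag set.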
Missing Log Output
Check log configuration:
See Also
- Scaling Guide - Performance tuning
- Security Guide - Access control and encryption
- Configuration Reference - Complete config options