Service Overview
Cadence consists of four core services that work together to provide workflow orchestration capabilities. Each service is stateless, horizontally scalable, and has a specific set of responsibilities.

Frontend Service
The Frontend service acts as the API gateway for all client interactions with Cadence.

Responsibilities
- API Gateway: Exposes public APIs for workflow and activity operations
- Request Validation: Validates all incoming requests
- Rate Limiting: Enforces per-domain rate limits
- Authentication & Authorization: Handles security concerns
- Cluster Redirection: Routes requests to appropriate cluster in multi-region setup
- Domain Management: Handles domain registration and updates
Key Components
API Layers
The frontend implements multiple decorator layers:

- Base Handler: Core API implementation
- Version Check: Ensures client compatibility
- Rate Limiter: Enforces quota limits
- Metrics: Captures telemetry data
- Cluster Redirection: Handles multi-cluster routing
- Access Control: Authorization enforcement
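The layering above can be sketched as nested handler decorators, each wrapping the next. The `Handler` interface, types, and method below are simplified stand-ins for the real frontend interfaces, not Cadence's actual code:

```go
package main

import "fmt"

// Handler is a minimal stand-in for the frontend's API handler interface.
type Handler interface {
	StartWorkflow(req string) (string, error)
}

// baseHandler is the innermost layer: the core API implementation.
type baseHandler struct{}

func (baseHandler) StartWorkflow(req string) (string, error) {
	return "started:" + req, nil
}

// metricsHandler decorates another Handler, recording call counts.
type metricsHandler struct {
	next  Handler
	calls int
}

func (m *metricsHandler) StartWorkflow(req string) (string, error) {
	m.calls++
	return m.next.StartWorkflow(req)
}

// rateLimitHandler rejects requests once a fixed quota is exhausted.
type rateLimitHandler struct {
	next      Handler
	remaining int
}

func (r *rateLimitHandler) StartWorkflow(req string) (string, error) {
	if r.remaining == 0 {
		return "", fmt.Errorf("rate limit exceeded")
	}
	r.remaining--
	return r.next.StartWorkflow(req)
}

func main() {
	// Compose layers outside-in: rate limiter -> metrics -> base handler.
	h := &rateLimitHandler{next: &metricsHandler{next: baseHandler{}}, remaining: 1}
	resp, err := h.StartWorkflow("wf-1")
	fmt.Println(resp, err) // started:wf-1 <nil>
	_, err = h.StartWorkflow("wf-2")
	fmt.Println(err) // rate limit exceeded
}
```

Each layer only knows about the `Handler` interface, so layers can be added, removed, or reordered without touching the core implementation.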
Configuration
Rate Limiting
The frontend implements a multi-stage rate limiting system:

- User RPS: Rate limit for client API calls
- Worker RPS: Rate limit for worker poll requests
- Visibility RPS: Rate limit for visibility queries
- Async RPS: Rate limit for async workflow operations
APIs Exposed
Workflow APIs
- StartWorkflowExecution
- SignalWorkflowExecution
- TerminateWorkflowExecution
- GetWorkflowExecutionHistory
- DescribeWorkflowExecution
Domain APIs
- RegisterDomain
- DescribeDomain
- UpdateDomain
- ListDomains
- DeprecateDomain
Task List APIs
- PollForDecisionTask
- PollForActivityTask
- RespondDecisionTaskCompleted
- RespondActivityTaskCompleted
History Service
The History service is the core workflow execution engine that maintains workflow state and makes execution decisions.

Responsibilities
- Workflow State Management: Maintains mutable state for active workflows
- Event History: Persists immutable workflow history events
- Decision Processing: Processes decisions from workflow workers
- Task Generation: Creates decision and activity tasks
- Timer Management: Handles workflow and activity timeouts
- Shard Ownership: Manages history shard ownership
Shard-Based Architecture
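Workflow state is partitioned across a fixed number of history shards, and each workflow is routed to a shard by a stable hash of its workflow ID, so a given workflow always lands on the same shard. A minimal sketch (FNV-1a here is an illustrative choice, not necessarily Cadence's actual hash function):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardID maps a workflow to one of numShards history shards using a
// stable hash of the workflow ID.
func shardID(workflowID string, numShards int) int {
	h := fnv.New32a()
	h.Write([]byte(workflowID))
	return int(h.Sum32() % uint32(numShards))
}

func main() {
	for _, id := range []string{"order-1", "order-2", "order-3"} {
		fmt.Printf("%s -> shard %d\n", id, shardID(id, 16))
	}
	// The mapping is deterministic: repeated lookups agree.
	fmt.Println(shardID("order-1", 16) == shardID("order-1", 16)) // true
}
```

Because the shard count is fixed at cluster creation, only shard *ownership* moves between History instances; the workflow-to-shard mapping never changes.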
Key Components
Workflow Execution State
The History service maintains two types of state:

- Mutable State: Current workflow execution state
  - Pending decision/activity tasks
  - Timers
  - Signals
  - Child workflows
  - Execution info (status, timeouts, etc.)
- Immutable History: Event log of all workflow actions
  - WorkflowExecutionStarted
  - DecisionTaskScheduled
  - ActivityTaskStarted
  - WorkflowExecutionCompleted
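The append-only nature of the history can be sketched as an event log with monotonically increasing event IDs; the types below are simplified stand-ins for Cadence's actual event structures:

```go
package main

import "fmt"

// historyEvent is a minimal stand-in for a workflow history event.
type historyEvent struct {
	ID   int64
	Type string
}

// eventLog models the immutable history: events are only ever appended,
// never mutated, and IDs increase monotonically per execution.
type eventLog struct{ events []historyEvent }

func (l *eventLog) append(eventType string) historyEvent {
	e := historyEvent{ID: int64(len(l.events) + 1), Type: eventType}
	l.events = append(l.events, e)
	return e
}

func main() {
	var hist eventLog
	for _, t := range []string{
		"WorkflowExecutionStarted",
		"DecisionTaskScheduled",
		"ActivityTaskStarted",
		"WorkflowExecutionCompleted",
	} {
		e := hist.append(t)
		fmt.Printf("%d %s\n", e.ID, e.Type)
	}
}
```

Mutable state, by contrast, is rewritten on every update; the immutable log is what allows a workflow's mutable state to be rebuilt by replay if it is ever lost.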
Task Queues
History service manages multiple task queues per shard:

Transfer Queue
- Processes tasks that need immediate execution
- Examples: Decision tasks, activity tasks, close execution
- FIFO processing within each shard
Timer Queue
- Processes time-based tasks
- Examples: Workflow timeout, activity timeout, retry timer
- Priority queue ordered by fire time
Configuration
Scalability Considerations
- Maximum Scale: Limited by numHistoryShards
- Shard Distribution: Automatic rebalancing when instances join/leave
- Graceful Shutdown: Drains shards before stopping
Matching Service
The Matching service routes tasks from History to Workers using task lists.

Responsibilities
- Task List Management: Maintains task lists for decisions and activities
- Task Routing: Delivers tasks to polling workers
- Sync Match: Optimizes latency by matching tasks with waiting pollers
- Task Persistence: Stores unmatched tasks in database
- Load Balancing: Distributes tasks across available workers
Task List Architecture
Key Components
Sync Match Optimization
Sync Match: When a task arrives while workers are already polling, it is handed directly to a waiting poller instead of being written to the database first. This provides:

- Near-zero latency task delivery
- Reduced database load
- Better throughput
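Sync match can be sketched as a non-blocking channel handoff: if a poller is already blocked waiting, the send succeeds immediately; otherwise the task falls back to persistence. The `matcher` type is a hypothetical stand-in, with the backlog slice standing in for the database:

```go
package main

import (
	"fmt"
	"time"
)

// matcher sketches sync match for a single task list.
type matcher struct {
	tasks   chan string // unbuffered: a send succeeds only if a poller is waiting
	backlog []string    // stand-in for tasks persisted to the database
}

func newMatcher() *matcher {
	return &matcher{tasks: make(chan string)}
}

// offer attempts a sync match; it reports whether a waiting poller took
// the task. If not, the task is added to the backlog instead.
func (m *matcher) offer(task string) bool {
	select {
	case m.tasks <- task:
		return true
	default:
		m.backlog = append(m.backlog, task)
		return false
	}
}

func main() {
	m := newMatcher()

	// No poller is waiting: the task goes to the backlog.
	fmt.Println("sync matched:", m.offer("task-1")) // false

	// A poller blocks on the channel; the next offer goes straight to it.
	done := make(chan string)
	go func() { done <- <-m.tasks }()
	time.Sleep(10 * time.Millisecond) // let the poller start waiting
	fmt.Println("sync matched:", m.offer("task-2")) // true
	fmt.Println("poller got:", <-done)              // task-2
}
```

The real service also drains the persisted backlog to pollers when no new tasks are arriving, so backlogged tasks are eventually delivered too.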
Task List Types
- Decision Task List: Routes decision tasks
  - One per workflow task list
  - Workers poll for workflow execution decisions
- Activity Task List: Routes activity tasks
  - Can be different from the decision task list
  - Workers poll for activity execution
Configuration
Scalability
- Task List Partitioning: High-throughput task lists can be partitioned
- Isolation Groups: Route tasks to specific worker pools
- Dynamic Partitioning: Automatic partition adjustment based on load
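One simple way to spread a high-throughput task list over partitions is round-robin selection over derived partition names. Both the `partitioner` type and the `/__partition_` naming below are purely illustrative, not Cadence's internal scheme:

```go
package main

import "fmt"

// partitioner picks partitions for a task list round-robin, so writes
// spread evenly across partitions. Partition 0 keeps the root name, so
// a single-partition task list is unchanged.
type partitioner struct {
	taskList string
	n        int
	next     int
}

func (p *partitioner) pick() string {
	i := p.next % p.n
	p.next++
	if i == 0 {
		return p.taskList
	}
	return fmt.Sprintf("%s/__partition_%d", p.taskList, i)
}

func main() {
	p := &partitioner{taskList: "orders", n: 3}
	for i := 0; i < 4; i++ {
		fmt.Println(p.pick())
	}
	// orders, orders/__partition_1, orders/__partition_2, orders
}
```

Each partition behaves as an independent task list internally, so partitioning multiplies the throughput ceiling at the cost of weaker ordering guarantees across the whole list.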
Worker Service
The Worker service handles internal background processing tasks for the Cadence system.

Responsibilities
- Replication: Processes cross-datacenter replication tasks
- Indexing: Indexes workflow data to Elasticsearch/Pinot for visibility
- Archival: Archives old workflow histories to blob storage
- System Workflows: Runs internal system workflows
- Domain Replication: Replicates domain metadata across clusters
Key Components
Replicator
Handles cross-cluster replication:

- History service writes replication tasks to Kafka
- Worker service consumes from Kafka
- Applies tasks to target cluster
- Handles conflict resolution
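Conflict resolution can be sketched as a highest-version-wins rule: an update from another cluster is applied only if it is at least as new as what the target cluster has already seen. This is a simplification of Cadence's failover-version mechanism, and all names below are illustrative:

```go
package main

import "fmt"

// replicationTask carries a workflow update plus the version of the
// cluster state that produced it.
type replicationTask struct {
	workflowID string
	version    int64
	state      string
}

// applyTask applies an incoming update unless a newer version has
// already been applied, in which case the stale update is dropped.
func applyTask(current map[string]replicationTask, t replicationTask) bool {
	if cur, ok := current[t.workflowID]; ok && cur.version > t.version {
		return false // stale update from a lagging cluster
	}
	current[t.workflowID] = t
	return true
}

func main() {
	state := map[string]replicationTask{}
	fmt.Println(applyTask(state, replicationTask{"wf-1", 100, "running"}))   // true
	fmt.Println(applyTask(state, replicationTask{"wf-1", 90, "completed"}))  // false: stale
	fmt.Println(applyTask(state, replicationTask{"wf-1", 110, "completed"})) // true
}
```

The real mechanism operates on event-history branches rather than whole-state snapshots, but the version comparison is the core idea.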
Indexer
Indexes workflow data for advanced visibility queries.

Archiver

Archives workflow histories to long-term storage:

- Local filesystem
- AWS S3
- Google Cloud Storage
- Custom implementations
Scanner
Performs data consistency checks and cleanup:

- Task List Scanner: Removes orphaned task list entries
- History Scanner: Validates workflow history integrity
- Timer Scanner: Checks for stuck timers
- Execution Scanner: Identifies zombie workflows
Configuration
System Domains
Worker service creates internal system domains:

- cadence-system: Core system workflows
- cadence-batcher: Batch operations
- cadence-canary: Health checks
Inter-Service Communication
Communication Patterns
RPC Configuration
Protocols Supported:

- gRPC (recommended)
- TChannel (legacy)
Service Discovery
Services discover each other using Ringpop, a gossip-based membership library that maintains a consistent hash ring of instances.

Deployment Considerations
Resource Requirements
| Service | CPU | Memory | Disk | Network |
|---|---|---|---|---|
| Frontend | Low-Medium | Low | Minimal | High |
| History | High | High | Minimal | Medium |
| Matching | Low-Medium | Low-Medium | Minimal | Medium |
| Worker | Medium | Medium | Low | Medium |
Scaling Guidelines
- Frontend: Scale based on RPS
  - Start with 2-3 instances
  - Add instances as traffic increases
- History: Scale based on shard count
  - Each instance should own 100-200 shards
  - More instances = better distribution
- Matching: Scale based on task throughput
  - Start with 2-3 instances
  - Scale if sync match rate drops
- Worker: Scale based on background load
  - Replication lag
  - Indexing lag
  - Archival backlog
Next Steps
Persistence Layer
Learn about database design and configuration
Cross-DC Replication
Set up multi-region deployments