Overview
The Matching Service is an internal Cadence service that coordinates task distribution between workflow workers and the Cadence cluster. It manages task lists, handles long-polling from workers, and routes tasks efficiently.
Service Location: service/matching/handler/interfaces.go
Service Architecture
The Matching Service operates on a per-task-list basis.
Health Check
Health
Check the health status of the matching service.
Whether the service is healthy
Health status message (e.g., “matching good”)
Task Addition APIs
AddDecisionTask
Add a decision task to a task list.
UUID of the domain
Workflow execution that owns this task
Task list to add the task to
Event ID of the decision task scheduled event
Maximum time before task times out if not started
Source of the task (History or DbBacklog)
Address of the host that forwarded this task
Task list partition configuration
Current partition configuration after adding the task
- Validates domain and task list
- Applies rate limiting
- Attempts sync match with waiting poller
- If no poller, persists to task queue
- Returns partition config for task list
- Worker RPS limit per domain
- Global matching service RPS limit
- Returns ServiceBusyError if throttled
AddActivityTask
Add an activity task to a task list.
UUID of the domain
Workflow execution that owns this task
UUID of the source domain (for cross-domain activities)
Task list to add the task to
Event ID of the activity task scheduled event
Maximum time before task times out if not started
Source of the task
Address of forwarding host
Additional dispatch information for the activity
Task list partition configuration
Current partition configuration
Task Polling APIs
PollForDecisionTask
Long poll for a decision task from a task list.
UUID of the domain
Unique identifier for this poller
The poll request details
Address of forwarding host for partition routing
Opaque task token for completing the task
Workflow execution for this task
The workflow type
Event ID of the previous decision task
Event ID when this decision task started
Retry attempt number
Approximate number of tasks in backlog
Workflow execution history
Token for fetching additional history
Query to execute if present
Multiple queries to execute
- Blocks until a task is available or context timeout
- Returns empty response on timeout (not an error)
- Validates context has appropriate timeout (1-90 seconds recommended)
- Supports sync match for immediate task dispatch
- Decision tasks preferentially routed to same worker
- Reduces history loading overhead
- Falls back to normal task list if sticky worker unavailable
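The blocking behavior listed above can be sketched with a context deadline and a channel. This is a minimal illustration, not the service's implementation: `Task`, `pollForTask`, and `pollWithTimeout` are hypothetical names, and the channel stands in for whatever the matching engine uses to hand tasks to pollers.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Task is a hypothetical stand-in for a matched task.
type Task struct {
	ID int64
}

// pollForTask blocks until a task arrives or ctx is done.
// A timeout yields (nil, nil): an empty response rather than an error.
func pollForTask(ctx context.Context, tasks <-chan *Task) (*Task, error) {
	select {
	case task := <-tasks:
		return task, nil
	case <-ctx.Done():
		return nil, nil
	}
}

// pollWithTimeout wraps pollForTask with the kind of deadline a worker would set.
func pollWithTimeout(tasks <-chan *Task, timeout time.Duration) (*Task, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return pollForTask(ctx, tasks)
}

func main() {
	tasks := make(chan *Task, 1)
	tasks <- &Task{ID: 42}

	task, err := pollWithTimeout(tasks, 50*time.Millisecond)
	fmt.Println("task available:", task != nil, "err:", err)

	task, err = pollWithTimeout(tasks, 50*time.Millisecond)
	fmt.Println("empty response after timeout:", task == nil, "err:", err)
}
```

The short timeouts here only keep the example fast; a worker would normally use a much longer poll deadline.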
PollForActivityTask
Long poll for an activity task from a task list.
UUID of the domain
Unique identifier for this poller
The poll request details
Address of forwarding host
Opaque task token for completing the task
Workflow execution for this activity
Activity ID
The activity type to execute
Serialized activity input
When the activity was scheduled
When the activity started
Total timeout from schedule to completion
Timeout from start to completion
Maximum time between heartbeats
Retry attempt number
When this retry attempt was scheduled
Details from last heartbeat
Parent workflow type
Parent workflow domain name
Context propagation headers
Query APIs
QueryWorkflow
Query a workflow execution through the matching service.
UUID of the domain
Task list where the query should be sent
The query request
Address of forwarding host
Serialized query result
Information if query was rejected
- Matches queries to pending decision tasks
- Forwards to appropriate partition if partitioned
- Returns query result synchronously
- Supports query reject conditions
RespondQueryTaskCompleted
Respond to a query task.
UUID of the domain
Task list for the query
Query task ID
Query completion details
Task List Management APIs
DescribeTaskList
Get information about a task list.
UUID of the domain
Description request parameters
List of currently active pollers
Status information if requested
- Identity: Worker identity string
- LastAccessTime: Last poll timestamp
- RatePerSecond: Poll rate from this worker
ListTaskListPartitions
List partitions for a task list.
Domain name
Task list to query
Activity task list partitions
Decision task list partitions
- Key: Partition identifier
- OwnerHostName: Host owning this partition
GetTaskListsByDomain
Retrieve all task lists in a domain.
Domain name to query
Map of decision task list names to their status
Map of activity task list names to their status
UpdateTaskListPartitionConfig
Update partition configuration for a task list.
UUID of the domain
Task list to update
Type (Decision or Activity)
New partition configuration
RefreshTaskListPartitionConfig
Refresh partition configuration from persistence.
Poller Management APIs
CancelOutstandingPoll
Cancel an outstanding poll request.
UUID of the domain
Type of task list (0=Decision, 1=Activity)
Task list being polled
ID of the poller to cancel
- Worker shutdown
- Connection errors
- Task list reassignment
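One way to picture poll cancellation is a registry that maps each poller ID to the cancel function of its outstanding poll. The sketch below is an assumption about how such bookkeeping could look, not the service's actual code; `pollerRegistry`, `startPoll`, and `cancelOutstandingPoll` are hypothetical names.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// pollerRegistry tracks the cancel function of each outstanding poll by poller ID.
type pollerRegistry struct {
	mu      sync.Mutex
	cancels map[string]context.CancelFunc
}

func newPollerRegistry() *pollerRegistry {
	return &pollerRegistry{cancels: make(map[string]context.CancelFunc)}
}

// startPoll registers pollerID and returns the context its long poll waits on.
// A real handler would derive this context from the incoming request
// rather than from context.Background.
func (r *pollerRegistry) startPoll(pollerID string) context.Context {
	ctx, cancel := context.WithCancel(context.Background())
	r.mu.Lock()
	r.cancels[pollerID] = cancel
	r.mu.Unlock()
	return ctx
}

// cancelOutstandingPoll cancels the poll for pollerID and reports
// whether an outstanding poll was found.
func (r *pollerRegistry) cancelOutstandingPoll(pollerID string) bool {
	r.mu.Lock()
	cancel, ok := r.cancels[pollerID]
	delete(r.cancels, pollerID)
	r.mu.Unlock()
	if ok {
		cancel()
	}
	return ok
}

func main() {
	reg := newPollerRegistry()
	ctx := reg.startPoll("poller-1")

	fmt.Println("cancelled:", reg.cancelOutstandingPoll("poller-1"))
	<-ctx.Done() // returns immediately because the poll was cancelled
	fmt.Println("poll context error:", ctx.Err())
}
```

Cancelling through the context means a poll blocked in a `select` wakes up promptly instead of holding its slot until the poll timeout.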
Task List Partitioning
Overview
Task list partitioning allows horizontal scaling of high-throughput task lists.
Partition Configuration
- NumReadPartitions: Number of partitions for polling
- NumWritePartitions: Number of partitions for task addition
- Version: Configuration version for consistency
Partition Routing
Tasks are routed to partitions using:
- Workflow ID hash: Ensures tasks from the same workflow go to the same partition
- Round-robin: For non-workflow-specific tasks
- Isolation groups: For tenant isolation
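The first two routing strategies can be illustrated with a short sketch. It assumes an FNV-1a hash and a simple counter; the hash function the service actually uses is not stated here, and `partitionFor` and `roundRobin` are hypothetical names.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// partitionFor maps a workflow ID onto one of numPartitions partitions.
// The same workflow ID always yields the same partition for a given count.
func partitionFor(workflowID string, numPartitions int) int {
	if numPartitions <= 1 {
		return 0
	}
	h := fnv.New32a()
	h.Write([]byte(workflowID)) // Write on a hash.Hash never returns an error
	return int(h.Sum32() % uint32(numPartitions))
}

// roundRobin hands out partitions in rotation for tasks that have
// no workflow affinity. It is not safe for concurrent use.
type roundRobin struct {
	next int
}

func (r *roundRobin) pick(numPartitions int) int {
	if numPartitions <= 1 {
		return 0
	}
	p := r.next % numPartitions
	r.next++
	return p
}

func main() {
	fmt.Println("order-123 ->", partitionFor("order-123", 4))
	fmt.Println("order-123 ->", partitionFor("order-123", 4)) // same partition again

	rr := &roundRobin{}
	for i := 0; i < 5; i++ {
		fmt.Println("round-robin pick:", rr.pick(4))
	}
}
```

Note that a modulo-based mapping changes for most workflow IDs when the partition count changes, which is one reason repartitioning has to be handled gradually.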
Dynamic Repartitioning
- Update partition count via UpdateTaskListPartitionConfig
- New tasks routed to new partition count
- Existing pollers gradually migrate
- Old partitions drain automatically
Rate Limiting
The Matching Service implements multiple rate limiting strategies:
Worker Rate Limiting
Applies to:
- AddDecisionTask
- AddActivityTask
- PollForDecisionTask
- PollForActivityTask
Configuration:
- matching.workerRPS: Global worker RPS
- matching.domainWorkerRPS: Per-domain worker RPS
User Rate Limiting
Applies to:
- QueryWorkflow
- DescribeTaskList
- ListTaskListPartitions
- GetTaskListsByDomain
Configuration:
- matching.userRPS: Global user RPS
- matching.domainUserRPS: Per-domain user RPS
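Both tiers pair a global limit with a per-domain limit. A minimal sketch of that two-level check follows, using a hand-rolled token bucket so the example needs only the standard library. The names `tokenBucket`, `limiter`, and `errServiceBusy` are hypothetical, and the burst sizes are illustrative assumptions rather than documented settings.

```go
package main

import (
	"errors"
	"fmt"
	"math"
	"sync"
	"time"
)

// errServiceBusy stands in for the ServiceBusyError returned when throttled.
var errServiceBusy = errors.New("service busy: rate limit exceeded")

// tokenBucket refills at rate tokens per second up to burst tokens.
type tokenBucket struct {
	mu     sync.Mutex
	rate   float64
	burst  float64
	tokens float64
	last   time.Time
}

func newTokenBucket(rps float64, burst int) *tokenBucket {
	return &tokenBucket{rate: rps, burst: float64(burst), tokens: float64(burst), last: time.Now()}
}

// allow consumes one token if available.
func (b *tokenBucket) allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	now := time.Now()
	b.tokens = math.Min(b.burst, b.tokens+now.Sub(b.last).Seconds()*b.rate)
	b.last = now
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}

// limiter applies a per-domain bucket first, then a shared global bucket.
type limiter struct {
	mu          sync.Mutex
	global      *tokenBucket
	domainRPS   float64
	domainBurst int
	perDomain   map[string]*tokenBucket
}

func newLimiter(globalRPS float64, globalBurst int, domainRPS float64, domainBurst int) *limiter {
	return &limiter{
		global:      newTokenBucket(globalRPS, globalBurst),
		domainRPS:   domainRPS,
		domainBurst: domainBurst,
		perDomain:   make(map[string]*tokenBucket),
	}
}

// check returns errServiceBusy when either limit is exhausted.
func (l *limiter) check(domain string) error {
	l.mu.Lock()
	b, ok := l.perDomain[domain]
	if !ok {
		b = newTokenBucket(l.domainRPS, l.domainBurst)
		l.perDomain[domain] = b
	}
	l.mu.Unlock()
	if !b.allow() || !l.global.allow() {
		return errServiceBusy
	}
	return nil
}

func main() {
	l := newLimiter(1000, 1000, 1, 2) // generous global limit, tight per-domain limit
	for i := 0; i < 3; i++ {
		fmt.Println("domain-a request:", l.check("domain-a"))
	}
	fmt.Println("domain-b request:", l.check("domain-b"))
}
```

In this sketch a request rejected by the global bucket has already spent a per-domain token; a production limiter would likely reserve and roll back, or check the global limit first.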
Rate Limit Errors
When rate limited, the service returns a ServiceBusyError.
Task Synchronization
Sync Match
When a task is added and pollers are waiting:
- Task immediately dispatched to poller
- No persistence overhead
- Lowest possible latency (< 1ms typical)
Async Match
When no pollers are waiting:
- Task persisted to task queue
- Next poller retrieves from queue
- Higher latency but guarantees delivery
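The sync and async paths can be sketched with an unbuffered channel, where a non-blocking send succeeds only if a poller is already waiting. This is an illustration under assumptions: `taskList`, `trySyncMatch`, `addTask`, and `poll` are hypothetical names, and the in-memory slice stands in for the persisted task queue.

```go
package main

import (
	"fmt"
	"sync"
)

// Task is a hypothetical stand-in for a decision or activity task.
type Task struct {
	ID int64
}

type taskList struct {
	mu      sync.Mutex
	pollers chan *Task // unbuffered: a send succeeds only while a poller is receiving
	backlog []*Task    // stand-in for the persisted task queue
}

func newTaskList() *taskList {
	return &taskList{pollers: make(chan *Task)}
}

// trySyncMatch hands the task to a waiting poller without blocking.
func (tl *taskList) trySyncMatch(task *Task) bool {
	select {
	case tl.pollers <- task:
		return true
	default:
		return false
	}
}

// addTask attempts a sync match and otherwise appends to the backlog.
// It reports whether the task was sync matched.
func (tl *taskList) addTask(task *Task) bool {
	if tl.trySyncMatch(task) {
		return true
	}
	tl.mu.Lock()
	tl.backlog = append(tl.backlog, task)
	tl.mu.Unlock()
	return false
}

// poll returns a backlog task if one exists, otherwise waits for a sync match.
func (tl *taskList) poll() *Task {
	tl.mu.Lock()
	if len(tl.backlog) > 0 {
		task := tl.backlog[0]
		tl.backlog = tl.backlog[1:]
		tl.mu.Unlock()
		return task
	}
	tl.mu.Unlock()
	return <-tl.pollers
}

func main() {
	tl := newTaskList()
	fmt.Println("sync matched:", tl.addTask(&Task{ID: 1})) // no poller is waiting
	fmt.Println("polled from backlog, task ID:", tl.poll().ID)
}
```

The sketch omits poll timeouts and leaves a small window where a task can be added to the backlog after a poller has checked it; the poller picks that task up on its next poll.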
Backlog Management
Task queues maintain:
- Read Level: Current read position
- Ack Level: Last acknowledged task
- Backlog Count: Approximate pending tasks
- Task ID Blocks: Pre-allocated ID ranges
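The relationship between the read level and the ack level is easiest to see in code. The sketch below assumes tasks are read in increasing ID order and that the ack level only advances over a contiguous run of acknowledged tasks; `ackManager` and its methods are hypothetical names, not the service's API.

```go
package main

import "fmt"

// ackManager tracks tasks that were read from the queue but not yet acknowledged.
type ackManager struct {
	outstanding map[int64]bool // task ID -> acknowledged?
	readLevel   int64          // highest task ID read from the queue
	ackLevel    int64          // every task at or below this ID is acknowledged
}

func newAckManager() *ackManager {
	return &ackManager{outstanding: make(map[int64]bool)}
}

// read records that a task was loaded from the queue. IDs must be increasing.
func (m *ackManager) read(id int64) {
	m.outstanding[id] = false
	m.readLevel = id
}

// ack marks a task as done, then advances the ack level past every
// leading task that has already been acknowledged. It returns the new ack level.
func (m *ackManager) ack(id int64) int64 {
	if _, ok := m.outstanding[id]; ok {
		m.outstanding[id] = true
	}
	for cur := m.ackLevel + 1; cur <= m.readLevel; cur++ {
		acked, tracked := m.outstanding[cur]
		if tracked && !acked {
			break // an earlier task is still in flight
		}
		if tracked {
			delete(m.outstanding, cur)
		}
		m.ackLevel = cur
	}
	return m.ackLevel
}

// pending counts tasks that were read but not yet acknowledged.
func (m *ackManager) pending() int {
	n := 0
	for _, acked := range m.outstanding {
		if !acked {
			n++
		}
	}
	return n
}

func main() {
	m := newAckManager()
	for id := int64(1); id <= 3; id++ {
		m.read(id)
	}
	fmt.Println("ack 2 -> ack level", m.ack(2)) // task 1 still blocks the level
	fmt.Println("ack 1 -> ack level", m.ack(1))
	fmt.Println("pending:", m.pending())
}
```

Because the ack level never passes an unacknowledged task, everything at or below it can be safely deleted from persistence.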
Performance Optimization
Local Dispatch
Tasks are preferentially dispatched to pollers on the same host:
- Reduces network round-trips
- Improves cache locality
- Lowers tail latency
Task Batching
Multiple tasks can be batched for persistence:
- Reduces database write load
- Improves throughput
- Slight increase in latency
Poller Management
Active poller tracking:
- Maintains poller registry
- Monitors poller health
- Routes tasks to healthy pollers
- Expires stale pollers
Isolation Groups
Isolation groups provide task routing based on worker capabilities:
- GPU workers
- Compliance zones
- Resource-specific workers
- Tenant isolation
Monitoring & Metrics
Key Metrics
- matching.tasks.sync-match: Sync match success rate
- matching.tasks.backlog: Task backlog size
- matching.poll.latency: Poll latency distribution
- matching.poll.timeouts: Poll timeout rate
- matching.tasks.expired: Task expiration rate
- matching.pollers.count: Active poller count
Health Indicators
- High sync match rate (>90%): Healthy
- Growing backlog: Need more workers or partitions
- High poll timeouts: Need more tasks or reduce pollers
- Task expirations: Increase timeouts or add workers
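As a rough illustration, some of these indicators can be turned into a simple check over one monitoring window. The `snapshot` type and `diagnose` function are hypothetical, and the poll-timeout indicator is left out because no threshold for "high" is given here.

```go
package main

import "fmt"

// snapshot holds counters for one monitoring window (hypothetical shape).
type snapshot struct {
	SyncMatches   int64 // tasks handed directly to a waiting poller
	TotalMatches  int64 // all tasks delivered to pollers
	BacklogBefore int64 // backlog size at the start of the window
	BacklogAfter  int64 // backlog size at the end of the window
	Expired       int64 // tasks that expired during the window
}

// diagnose maps the counters onto the health indicators for a task list.
func diagnose(s snapshot) []string {
	var findings []string
	if s.TotalMatches > 0 {
		rate := float64(s.SyncMatches) / float64(s.TotalMatches)
		if rate > 0.9 {
			findings = append(findings, "healthy sync match rate")
		}
	}
	if s.BacklogAfter > s.BacklogBefore {
		findings = append(findings, "backlog growing: add workers or partitions")
	}
	if s.Expired > 0 {
		findings = append(findings, "tasks expiring: increase timeouts or add workers")
	}
	return findings
}

func main() {
	s := snapshot{SyncMatches: 40, TotalMatches: 100, BacklogBefore: 10, BacklogAfter: 250, Expired: 3}
	for _, f := range diagnose(s) {
		fmt.Println(f)
	}
}
```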
Error Handling
ServiceBusyError
Rate limiting or resource constraints.
EntityNotExistsError
Task list or domain not found.
StickyWorkerUnavailableError
Sticky worker not polling.
Best Practices
Poll Timeout Configuration
- Set context timeout: 60-90 seconds
- Shorter timeouts increase overhead
- Longer timeouts delay shutdown
Task List Design
- One task list per use case
- Avoid sharing task lists across workflows
- Use partitioning for high throughput
- Consider isolation groups for specialized workers
Poller Management
- Maintain stable poller count
- Gracefully shut down pollers
- Monitor backlog and adjust capacity
- Use sticky execution for decision tasks
Error Handling
- Implement retry logic for transient errors
- Monitor rate limiting errors
- Handle task expiration gracefully
- Log poller connectivity issues
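The retry recommendation can be sketched as exponential backoff that retries only transient errors. The sentinel errors below are stand-ins for ServiceBusyError and EntityNotExistsError, `retryTransient` is a hypothetical helper, and the attempt count and base delay are illustrative rather than recommended values.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Sentinel errors standing in for ServiceBusyError and EntityNotExistsError.
var (
	errServiceBusy = errors.New("service busy")
	errNotFound    = errors.New("entity not exists")
)

// retryTransient calls op up to attempts times, doubling the delay after
// each throttled attempt. Errors other than errServiceBusy are returned
// immediately because retrying them would not help.
func retryTransient(attempts int, base time.Duration, op func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		err = op()
		if err == nil || !errors.Is(err, errServiceBusy) {
			return err
		}
		if i < attempts-1 {
			time.Sleep(base << uint(i))
		}
	}
	return err
}

func main() {
	calls := 0
	err := retryTransient(5, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errServiceBusy // throttled on the first two attempts
		}
		return nil
	})
	fmt.Println("calls:", calls, "err:", err)
}
```

A production client would typically also add jitter and honor context cancellation so that shutdown is not delayed by a pending backoff.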
See Also
- Frontend Service API - Public task polling APIs
- History Service API - Workflow state management
- Task Types - Task type definitions