
Overview

The Matching Service is an internal Cadence service that coordinates task distribution between workflow workers and the Cadence cluster. It manages task lists, handles long polls from workers, and routes tasks to waiting pollers.
Service Location: service/matching/handler/interfaces.go
The Matching Service API is internal to Cadence. Applications should use the Frontend Service API instead.

Service Architecture

The Matching Service operates on a per-task-list basis.

Health Check

Health

Check the health status of the matching service.
Health(context.Context) (*types.HealthStatus, error)
Ok
boolean
Whether the service is healthy
Msg
string
Health status message (e.g., “matching good”)

Task Addition APIs

AddDecisionTask

Add a decision task to a task list.
AddDecisionTask(context.Context, *types.AddDecisionTaskRequest) (*types.AddDecisionTaskResponse, error)
DomainUUID
string
required
UUID of the domain
Execution
WorkflowExecution
required
Workflow execution that owns this task
TaskList
TaskList
required
Task list to add the task to
ScheduleID
int64
required
Event ID of the decision task scheduled event
ScheduleToStartTimeoutSeconds
int32
Maximum time before task times out if not started
Source
TaskSource
Source of the task (History or DbBacklog)
ForwardedFrom
string
Address of the host that forwarded this task
PartitionConfig
map[string]string
Task list partition configuration
PartitionConfig
TaskListPartitionConfig
Response field: current partition configuration after adding the task
Task Addition Flow:
  1. Validates domain and task list
  2. Applies rate limiting
  3. Attempts sync match with waiting poller
  4. If no poller, persists to task queue
  5. Returns partition config for task list
Rate Limiting:
  • Worker RPS limit per domain
  • Global matching service RPS limit
  • Returns ServiceBusyError if throttled
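The task addition flow above can be sketched roughly as follows. This is a minimal illustration, not the Matching Service implementation: the `taskList` type, its `pollers` channel, the in-memory `backlog` slice, and the `allow` hook are all hypothetical stand-ins, and domain and task list validation is omitted.

```go
package main

import (
	"errors"
	"fmt"
)

// Every name in this sketch is hypothetical; none of them are Cadence types.
var errServiceBusy = errors.New("matching host rps exceeded")

type task struct{ scheduleID int64 }

type taskList struct {
	pollers chan chan task // waiting pollers, each offering a channel to receive on
	backlog []task         // stand-in for the persisted task queue
	allow   func() bool    // stand-in for the rate limiter
}

// addTask reports true when the task was sync matched to a waiting poller
// and false when it was appended to the backlog instead.
func (tl *taskList) addTask(t task) (bool, error) {
	if !tl.allow() {
		return false, errServiceBusy // throttled
	}
	select {
	case p := <-tl.pollers:
		p <- t // sync match: hand the task straight to the poller
		return true, nil
	default:
		tl.backlog = append(tl.backlog, t) // no poller waiting: persist
		return false, nil
	}
}

func main() {
	tl := &taskList{
		pollers: make(chan chan task, 1),
		allow:   func() bool { return true },
	}

	synced, _ := tl.addTask(task{scheduleID: 2})
	fmt.Println(synced, len(tl.backlog)) // false 1

	p := make(chan task, 1) // buffered so the handoff never blocks
	tl.pollers <- p
	synced, _ = tl.addTask(task{scheduleID: 5})
	fmt.Println(synced, (<-p).scheduleID) // true 5
}
```

The non-blocking `select` is what separates the two paths: a waiting poller is taken if one exists, otherwise the task goes to the backlog.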

AddActivityTask

Add an activity task to a task list.
AddActivityTask(context.Context, *types.AddActivityTaskRequest) (*types.AddActivityTaskResponse, error)
DomainUUID
string
required
UUID of the domain
Execution
WorkflowExecution
required
Workflow execution that owns this task
SourceDomainUUID
string
UUID of the source domain (for cross-domain activities)
TaskList
TaskList
required
Task list to add the task to
ScheduleID
int64
required
Event ID of the activity task scheduled event
ScheduleToStartTimeoutSeconds
int32
Maximum time before task times out if not started
Source
TaskSource
Source of the task
ForwardedFrom
string
Address of forwarding host
ActivityTaskDispatchInfo
ActivityTaskDispatchInfo
Additional dispatch information for the activity
PartitionConfig
map[string]string
Task list partition configuration
PartitionConfig
TaskListPartitionConfig
Response field: current partition configuration after adding the task

Task Polling APIs

PollForDecisionTask

Long poll for a decision task from a task list.
PollForDecisionTask(context.Context, *types.MatchingPollForDecisionTaskRequest) (*types.MatchingPollForDecisionTaskResponse, error)
DomainUUID
string
required
UUID of the domain
PollerID
string
required
Unique identifier for this poller
PollRequest
PollForDecisionTaskRequest
required
The poll request details
ForwardedFrom
string
Address of forwarding host for partition routing
TaskToken
[]byte
Opaque task token for completing the task
WorkflowExecution
WorkflowExecution
Workflow execution for this task
WorkflowType
WorkflowType
The workflow type
PreviousStartedEventId
int64
Event ID of the previous decision task
StartedEventId
int64
Event ID when this decision task started
Attempt
int64
Retry attempt number
BacklogCountHint
int64
Approximate number of tasks in backlog
History
History
Workflow execution history
NextPageToken
[]byte
Token for fetching additional history
Query
WorkflowQuery
Query to execute if present
Queries
map[string]WorkflowQuery
Multiple queries to execute
Long Polling Behavior:
  • Blocks until a task is available or the context times out
  • Returns empty response on timeout (not an error)
  • Validates context has appropriate timeout (1-90 seconds recommended)
  • Supports sync match for immediate task dispatch
Sticky Task Lists (when sticky execution is enabled):
  • Decision tasks are preferentially routed to the same worker
  • Reduces history loading overhead
  • Falls back to the normal task list if the sticky worker is unavailable

PollForActivityTask

Long poll for an activity task from a task list.
PollForActivityTask(context.Context, *types.MatchingPollForActivityTaskRequest) (*types.MatchingPollForActivityTaskResponse, error)
DomainUUID
string
required
UUID of the domain
PollerID
string
required
Unique identifier for this poller
PollRequest
PollForActivityTaskRequest
required
The poll request details
ForwardedFrom
string
Address of forwarding host
TaskToken
[]byte
Opaque task token for completing the task
WorkflowExecution
WorkflowExecution
Workflow execution for this activity
ActivityId
string
Activity ID
ActivityType
ActivityType
The activity type to execute
Input
[]byte
Serialized activity input
ScheduledTimestamp
int64
When the activity was scheduled
StartedTimestamp
int64
When the activity started
ScheduleToCloseTimeoutSeconds
int32
Total timeout from schedule to completion
StartToCloseTimeoutSeconds
int32
Timeout from start to completion
HeartbeatTimeoutSeconds
int32
Maximum time between heartbeats
Attempt
int32
Retry attempt number
ScheduledTimestampOfThisAttempt
int64
When this retry attempt was scheduled
HeartbeatDetails
[]byte
Details from last heartbeat
WorkflowType
WorkflowType
Parent workflow type
WorkflowDomain
string
Parent workflow domain name
Header
Header
Context propagation headers

Query APIs

QueryWorkflow

Query a workflow execution through the matching service.
QueryWorkflow(context.Context, *types.MatchingQueryWorkflowRequest) (*types.MatchingQueryWorkflowResponse, error)
DomainUUID
string
required
UUID of the domain
TaskList
TaskList
required
Task list where the query should be sent
QueryRequest
QueryWorkflowRequest
required
The query request
ForwardedFrom
string
Address of forwarding host
QueryResult
[]byte
Serialized query result
QueryRejected
QueryRejected
Information if query was rejected
Query Routing:
  1. Matches queries to pending decision tasks
  2. Forwards to appropriate partition if partitioned
  3. Returns query result synchronously
  4. Supports query reject conditions

RespondQueryTaskCompleted

Respond to a query task.
RespondQueryTaskCompleted(context.Context, *types.MatchingRespondQueryTaskCompletedRequest) error
DomainUUID
string
required
UUID of the domain
TaskList
TaskList
required
Task list for the query
TaskID
string
required
Query task ID
CompletedRequest
RespondQueryTaskCompletedRequest
required
Query completion details

Task List Management APIs

DescribeTaskList

Get information about a task list.
DescribeTaskList(context.Context, *types.MatchingDescribeTaskListRequest) (*types.DescribeTaskListResponse, error)
DomainUUID
string
required
UUID of the domain
DescRequest
DescribeTaskListRequest
required
Description request parameters
Pollers
[]PollerInfo
List of currently active pollers
TaskListStatus
TaskListStatus
Status information if requested
PollerInfo includes:
  • Identity: Worker identity string
  • LastAccessTime: Last poll timestamp
  • RatePerSecond: Poll rate from this worker

ListTaskListPartitions

List partitions for a task list.
ListTaskListPartitions(context.Context, *types.MatchingListTaskListPartitionsRequest) (*types.ListTaskListPartitionsResponse, error)
Domain
string
required
Domain name
TaskList
TaskList
required
Task list to query
ActivityTaskListPartitions
[]TaskListPartitionMetadata
Activity task list partitions
DecisionTaskListPartitions
[]TaskListPartitionMetadata
Decision task list partitions
Partition Metadata:
  • Key: Partition identifier
  • OwnerHostName: Host owning this partition

GetTaskListsByDomain

Retrieve all task lists in a domain.
GetTaskListsByDomain(context.Context, *types.GetTaskListsByDomainRequest) (*types.GetTaskListsByDomainResponse, error)
Domain
string
required
Domain name to query
DecisionTaskListMap
map[string]TaskListStatus
Map of decision task list names to their status
ActivityTaskListMap
map[string]TaskListStatus
Map of activity task list names to their status

UpdateTaskListPartitionConfig

Update partition configuration for a task list.
UpdateTaskListPartitionConfig(context.Context, *types.MatchingUpdateTaskListPartitionConfigRequest) (*types.MatchingUpdateTaskListPartitionConfigResponse, error)
DomainUUID
string
required
UUID of the domain
TaskList
TaskList
required
Task list to update
TaskListType
TaskListType
required
Type (Decision or Activity)
PartitionConfig
TaskListPartitionConfig
required
New partition configuration

RefreshTaskListPartitionConfig

Refresh partition configuration from persistence.
RefreshTaskListPartitionConfig(context.Context, *types.MatchingRefreshTaskListPartitionConfigRequest) (*types.MatchingRefreshTaskListPartitionConfigResponse, error)

Poller Management APIs

CancelOutstandingPoll

Cancel an outstanding poll request.
CancelOutstandingPoll(context.Context, *types.CancelOutstandingPollRequest) error
DomainUUID
string
required
UUID of the domain
TaskListType
int32
required
Type of task list (0=Decision, 1=Activity)
TaskList
TaskList
required
Task list being polled
PollerID
string
required
ID of the poller to cancel
Use Cases:
  • Worker shutdown
  • Connection errors
  • Task list reassignment
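One way to picture cancellation of an outstanding poll is a registry that maps each PollerID to the cancel function of its poll context. The sketch below assumes that design; `pollerRegistry` and its methods are hypothetical names and are not taken from the Cadence source.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// pollerRegistry is a hypothetical map from poller ID to the cancel
// function of that poller's in-flight poll.
type pollerRegistry struct {
	mu      sync.Mutex
	cancels map[string]context.CancelFunc
}

func newPollerRegistry() *pollerRegistry {
	return &pollerRegistry{cancels: make(map[string]context.CancelFunc)}
}

// register creates a cancellable context for a poll and remembers how to
// cancel it. A real handler would derive it from the request context.
func (r *pollerRegistry) register(pollerID string) context.Context {
	ctx, cancel := context.WithCancel(context.Background())
	r.mu.Lock()
	r.cancels[pollerID] = cancel
	r.mu.Unlock()
	return ctx
}

// cancelOutstandingPoll cancels the poll registered under pollerID and
// reports whether such a poll existed.
func (r *pollerRegistry) cancelOutstandingPoll(pollerID string) bool {
	r.mu.Lock()
	cancel, ok := r.cancels[pollerID]
	delete(r.cancels, pollerID)
	r.mu.Unlock()
	if ok {
		cancel()
	}
	return ok
}

func main() {
	r := newPollerRegistry()
	ctx := r.register("poller-1")
	fmt.Println(r.cancelOutstandingPoll("poller-1"), ctx.Err() != nil)
	fmt.Println(r.cancelOutstandingPoll("poller-1")) // already removed
}
```

A poll blocked on `ctx.Done()` wakes up as soon as the cancel function runs, which is what lets a shutting-down worker release its poll without waiting for the timeout.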

Task List Partitioning

Overview

Task list partitioning allows horizontal scaling of high-throughput task lists.

Partition Configuration

type TaskListPartitionConfig struct {
    Version              int32
    NumReadPartitions    int32
    NumWritePartitions   int32
}
  • NumReadPartitions: Number of partitions for polling
  • NumWritePartitions: Number of partitions for task addition
  • Version: Configuration version for consistency

Partition Routing

Tasks are routed to partitions using:
  • Workflow ID hash: Ensures tasks from the same workflow go to the same partition
  • Round-robin: For non-workflow-specific tasks
  • Isolation groups: For tenant isolation
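As an illustration of hash-based routing, the sketch below maps a workflow ID to a write partition index. The choice of FNV-1a and the function name `pickWritePartition` are assumptions; the hash function the Matching Service actually uses is not stated here.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// pickWritePartition maps a workflow ID to a partition index in
// [0, numWritePartitions). FNV-1a is an assumption of this sketch.
func pickWritePartition(workflowID string, numWritePartitions int32) int32 {
	if numWritePartitions <= 1 {
		return 0 // unpartitioned task list
	}
	h := fnv.New32a()
	h.Write([]byte(workflowID)) // hash.Hash.Write never returns an error
	return int32(h.Sum32() % uint32(numWritePartitions))
}

func main() {
	a := pickWritePartition("order-42", 4)
	b := pickWritePartition("order-42", 4)
	fmt.Println(a == b, a >= 0 && a < 4)
}
```

Because the hash is deterministic, repeated calls with the same workflow ID and partition count always pick the same partition. Changing the partition count can move a workflow to a different partition.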

Dynamic Repartitioning

  1. Update partition count via UpdateTaskListPartitionConfig
  2. New tasks routed to new partition count
  3. Existing pollers gradually migrate
  4. Old partitions drain automatically

Rate Limiting

The Matching Service implements multiple rate limiting strategies:

Worker Rate Limiting

workerRateLimiter quotas.Policy
Applied to:
  • AddDecisionTask
  • AddActivityTask
  • PollForDecisionTask
  • PollForActivityTask
Configuration:
  • matching.workerRPS: Global worker RPS
  • matching.domainWorkerRPS: Per-domain worker RPS

User Rate Limiting

userRateLimiter quotas.Policy
Applied to:
  • QueryWorkflow
  • DescribeTaskList
  • ListTaskListPartitions
  • GetTaskListsByDomain
Configuration:
  • matching.userRPS: Global user RPS
  • matching.domainUserRPS: Per-domain user RPS

Rate Limit Errors

When a request is rate limited, the service returns:
errMatchingHostThrottle = &types.ServiceBusyError{
    Message: "Matching host rps exceeded",
}
Clients should implement exponential backoff.
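A minimal sketch of client-side exponential backoff is shown below. The `serviceBusyError` type is a local stand-in for `types.ServiceBusyError`, and `retryOnBusy` is a hypothetical helper; a production client would normally add jitter and an upper bound on the delay.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// serviceBusyError is a local stand-in for types.ServiceBusyError.
type serviceBusyError struct{ Message string }

func (e *serviceBusyError) Error() string { return e.Message }

// retryOnBusy calls op until it succeeds, fails with a non-busy error, or
// maxAttempts is reached. The delay doubles after each throttled attempt.
// It returns the number of attempts made and the last error.
func retryOnBusy(op func() error, maxAttempts int, initial time.Duration) (int, error) {
	delay := initial
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err = op()
		var busy *serviceBusyError
		if !errors.As(err, &busy) {
			return attempt, err // success or a non-retryable error
		}
		if attempt < maxAttempts {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return maxAttempts, err
}

func main() {
	calls := 0
	attempts, err := retryOnBusy(func() error {
		calls++
		if calls < 3 {
			return &serviceBusyError{Message: "Matching host rps exceeded"}
		}
		return nil
	}, 5, 10*time.Millisecond)
	fmt.Println(attempts, err) // 3 <nil>
}
```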

Task Synchronization

Sync Match

When a task is added and pollers are waiting:
  1. Task immediately dispatched to poller
  2. No persistence overhead
  3. Lowest possible latency (< 1ms typical)

Async Match

When no pollers are waiting:
  1. Task persisted to task queue
  2. Next poller retrieves from queue
  3. Higher latency but guarantees delivery

Backlog Management

Task queues maintain:
  • Read Level: Current read position
  • Ack Level: Last acknowledged task
  • Backlog Count: Approximate pending tasks
  • Task ID Blocks: Pre-allocated ID ranges
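The relationship between the read level and the ack level can be illustrated with a small tracker: the ack level only advances across a contiguous run of acknowledged task IDs, so one slow task holds it back. The `ackTracker` type below is a hypothetical sketch and does not reproduce the service's own bookkeeping.

```go
package main

import "fmt"

// ackTracker is a hypothetical illustration of read/ack level bookkeeping.
type ackTracker struct {
	readLevel int64          // highest task ID read from the queue
	ackLevel  int64          // every task ID <= ackLevel is acknowledged
	pending   map[int64]bool // task ID -> acknowledged yet?
}

func newAckTracker() *ackTracker {
	return &ackTracker{pending: make(map[int64]bool)}
}

// read records that taskID was read. IDs are expected to be increasing.
func (a *ackTracker) read(taskID int64) {
	a.pending[taskID] = false
	a.readLevel = taskID
}

// ack marks taskID as done, then advances ackLevel over the contiguous
// prefix of acknowledged tasks and returns the new ack level.
func (a *ackTracker) ack(taskID int64) int64 {
	if _, ok := a.pending[taskID]; ok {
		a.pending[taskID] = true
	}
	for id := a.ackLevel + 1; id <= a.readLevel; id++ {
		acked, tracked := a.pending[id]
		if tracked && !acked {
			break // an earlier task is still outstanding
		}
		delete(a.pending, id)
		a.ackLevel = id
	}
	return a.ackLevel
}

func main() {
	a := newAckTracker()
	a.read(1)
	a.read(2)
	a.read(3)
	fmt.Println(a.ack(2)) // 0: task 1 is still outstanding
	fmt.Println(a.ack(1)) // 2: tasks 1 and 2 are both done
	fmt.Println(a.ack(3)) // 3
}
```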

Performance Optimization

Local Dispatch

Tasks are preferentially dispatched to pollers on the same host:
  • Reduces network round-trips
  • Improves cache locality
  • Lowers tail latency

Task Batching

Multiple tasks can be batched for persistence:
  • Reduces database write load
  • Improves throughput
  • Slight increase in latency

Poller Management

Active poller tracking:
  • Maintains poller registry
  • Monitors poller health
  • Routes tasks to healthy pollers
  • Expires stale pollers
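Stale-poller expiry can be sketched as a map from worker identity to last poll time, pruned against a time-to-live. The `pollerHistory` type, its methods, and the 300-second TTL are assumptions for illustration; timestamps are plain Unix seconds to keep the example small.

```go
package main

import "fmt"

// pollerHistory is a hypothetical registry of recently seen pollers.
type pollerHistory struct {
	ttlSeconds int64
	lastSeen   map[string]int64 // worker identity -> Unix time of last poll
}

func newPollerHistory(ttlSeconds int64) *pollerHistory {
	return &pollerHistory{ttlSeconds: ttlSeconds, lastSeen: make(map[string]int64)}
}

// recordPoll notes that identity polled at nowUnix.
func (h *pollerHistory) recordPoll(identity string, nowUnix int64) {
	h.lastSeen[identity] = nowUnix
}

// activePollers drops pollers not seen within the TTL and returns how many
// remain. Deleting from a map while ranging over it is safe in Go.
func (h *pollerHistory) activePollers(nowUnix int64) int {
	for identity, seen := range h.lastSeen {
		if nowUnix-seen > h.ttlSeconds {
			delete(h.lastSeen, identity)
		}
	}
	return len(h.lastSeen)
}

func main() {
	h := newPollerHistory(300) // assumed TTL of five minutes
	h.recordPoll("worker-a", 1000)
	h.recordPoll("worker-b", 1200)
	fmt.Println(h.activePollers(1400)) // 1: worker-a was last seen 400s ago
}
```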

Isolation Groups

Isolation groups provide task routing based on worker capabilities:
type TaskListMetadata struct {
    IsolationGroups []string
}
Tasks can be routed to specific worker pools for:
  • GPU workers
  • Compliance zones
  • Resource-specific workers
  • Tenant isolation
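A rough sketch of isolation-group routing follows. The `poller` type and `pickPoller` function are hypothetical, and the fallback to any available poller when the group has none is an assumption of this example, not documented behavior.

```go
package main

import "fmt"

// poller is a hypothetical view of a waiting worker.
type poller struct {
	identity       string
	isolationGroup string
}

// pickPoller prefers a poller in the task's isolation group. Falling back
// to the first available poller is an assumption made for this sketch.
func pickPoller(taskGroup string, pollers []poller) (poller, bool) {
	for _, p := range pollers {
		if p.isolationGroup == taskGroup {
			return p, true
		}
	}
	if len(pollers) > 0 {
		return pollers[0], true
	}
	return poller{}, false
}

func main() {
	pollers := []poller{
		{identity: "worker-1", isolationGroup: "cpu"},
		{identity: "worker-2", isolationGroup: "gpu"},
	}
	p, _ := pickPoller("gpu", pollers)
	fmt.Println(p.identity) // worker-2
}
```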

Monitoring & Metrics

Key Metrics

  • matching.tasks.sync-match: Sync match success rate
  • matching.tasks.backlog: Task backlog size
  • matching.poll.latency: Poll latency distribution
  • matching.poll.timeouts: Poll timeout rate
  • matching.tasks.expired: Task expiration rate
  • matching.pollers.count: Active poller count

Health Indicators

  • High sync match rate (>90%): Healthy
  • Growing backlog: Need more workers or partitions
  • High poll timeouts: Pollers are mostly idle; reduce the poller count if this persists
  • Task expirations: Increase timeouts or add workers

Error Handling

ServiceBusyError

Rate limiting or resource constraints:
&types.ServiceBusyError{Message: "Matching host rps exceeded"}
Recovery: Exponential backoff, increase capacity

EntityNotExistsError

Task list or domain not found:
&types.EntityNotExistsError{Message: "..."}
Recovery: Verify domain/task list name

StickyWorkerUnavailableError

Sticky worker not polling:
&types.StickyWorkerUnavailableError{}
Recovery: Automatic fallback to normal task list
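The recovery guidance for these three errors can be expressed as a small classifier built on `errors.As`. The error types below are local stand-ins for the corresponding `types` structs, and `recoveryFor` is a hypothetical helper that only returns a description of the suggested action.

```go
package main

import (
	"errors"
	"fmt"
)

// Local stand-ins for the error types in Cadence's types package.
type serviceBusyError struct{ Message string }
type entityNotExistsError struct{ Message string }
type stickyWorkerUnavailableError struct{}

func (e *serviceBusyError) Error() string             { return e.Message }
func (e *entityNotExistsError) Error() string         { return e.Message }
func (e *stickyWorkerUnavailableError) Error() string { return "sticky worker unavailable" }

// recoveryFor maps an error to the recovery action suggested for it.
func recoveryFor(err error) string {
	var busy *serviceBusyError
	var missing *entityNotExistsError
	var sticky *stickyWorkerUnavailableError
	switch {
	case err == nil:
		return "none"
	case errors.As(err, &busy):
		return "retry with exponential backoff"
	case errors.As(err, &missing):
		return "verify domain and task list name"
	case errors.As(err, &sticky):
		return "fall back to normal task list"
	default:
		return "unknown"
	}
}

func main() {
	fmt.Println(recoveryFor(&serviceBusyError{Message: "Matching host rps exceeded"}))
	fmt.Println(recoveryFor(&stickyWorkerUnavailableError{}))
}
```

Using `errors.As` rather than a type switch means the classification still works when a caller wraps the error with additional context.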

Best Practices

Poll Timeout Configuration

  • Set context timeout: 60-90 seconds
  • Shorter timeouts increase overhead
  • Longer timeouts delay shutdown

Task List Design

  • One task list per use case
  • Avoid sharing task lists across workflows
  • Use partitioning for high throughput
  • Consider isolation groups for specialized workers

Poller Management

  • Maintain stable poller count
  • Gracefully shut down pollers
  • Monitor backlog and adjust capacity
  • Use sticky execution for decision tasks

Error Handling

  • Implement retry logic for transient errors
  • Monitor rate limiting errors
  • Handle task expiration gracefully
  • Log poller connectivity issues
