# Matching Service

The Matching Service is responsible for managing task queues and matching tasks from the History Service with polling workers. It acts as a high-performance, scalable task distribution system.

## Overview

The Matching Service sits between the History Service and Worker processes:

- History Service adds Workflow Tasks and Activity Tasks to task queues
- Worker processes poll task queues for work to execute
- Matching Service efficiently matches tasks with available workers
See `proto/internal/temporal/server/api/matchingservice/v1/service.proto` for the API definition.

## Task Queues

### What is a Task Queue?

A Task Queue in the Matching Service:

- Is identified by a name (e.g., `checkout`, `analytics`, `email-sender`)
- Belongs to a specific namespace
- Holds tasks for many different Workflow Executions
- Has separate queues for Workflow Tasks and Activity Tasks
- Is polled by one or more Worker processes
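The identity described above can be sketched as a small Go value. The type and field names here are illustrative assumptions for this document, not the server's actual definitions:

```go
package main

import "fmt"

// TaskQueueType distinguishes the separate workflow and activity queues
// that share one user-visible task queue name.
type TaskQueueType int

const (
	TaskQueueTypeWorkflow TaskQueueType = iota
	TaskQueueTypeActivity
)

// TaskQueueKey is a hypothetical illustration of the fields that
// uniquely identify one logical task queue.
type TaskQueueKey struct {
	NamespaceID string
	Name        string
	Type        TaskQueueType
}

// String renders a stable identity, usable for routing or map keys.
func (k TaskQueueKey) String() string {
	return fmt.Sprintf("%s/%s/%d", k.NamespaceID, k.Name, k.Type)
}

func main() {
	wf := TaskQueueKey{NamespaceID: "ns-1", Name: "checkout", Type: TaskQueueTypeWorkflow}
	act := TaskQueueKey{NamespaceID: "ns-1", Name: "checkout", Type: TaskQueueTypeActivity}
	// Same name and namespace, but two distinct queues.
	fmt.Println(wf, act)
}
```

Keeping the task type inside the key is what makes "separate queues for Workflow Tasks and Activity Tasks" fall out naturally: the two queues simply have different keys.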
### Task Types

The Matching Service handles two types of tasks.

#### Workflow Tasks
Workflow Tasks are delivered to workers to advance workflow execution:
- Worker receives workflow history events
- Worker executes workflow code (replaying if needed)
- Worker sends back commands (schedule activity, start timer, etc.)
- Workflow Tasks are used to progress workflow state
#### Activity Tasks
Activity Tasks are delivered to workers to execute activities:
- Worker receives activity input
- Worker executes activity function (with side effects)
- Worker sends back result or error
- Activities perform actual business logic
## Request Flow

Long-poll requests from workers follow this path:

1. A worker sends `PollWorkflowTaskQueue` or `PollActivityTaskQueue` to the Frontend Service
2. Frontend routes the request to the Matching Service instance responsible for that task queue
3. Matching Service either:
   - Immediately returns a task if one is available
   - Holds the long-poll request until a task arrives (with timeout)
4. When History Service adds a task via `AddWorkflowTask` or `AddActivityTask`:
   - The task is matched with a waiting poller, or
   - The task is added to the queue backlog
## Task Queue Partitions

### Partitioning Strategy

To achieve higher throughput, the Matching Service splits task queues into partitions:

- Default: 4 partitions per task queue
- Configurable: the partition count can be raised or lowered
- Each partition is owned by one Matching Service instance
- Partition ownership can be reassigned for load balancing
- Partitions can be loaded/unloaded from storage dynamically
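One simple way to spread traffic across partitions is to pick a partition index from a hash of a routing key. The sketch below assumes an FNV-1a hash and a `/_sys/`-style derived-name convention; both are illustrative choices for this document, not necessarily the server's exact scheme:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// partitionName builds the name of partition i of a task queue.
// Partition 0 keeps the user-visible name; others get a derived name.
func partitionName(queue string, i int) string {
	if i == 0 {
		return queue
	}
	return fmt.Sprintf("/_sys/%s/%d", queue, i)
}

// pickPartition deterministically spreads a routing key (for example
// a workflow ID) over n partitions using an FNV-1a hash.
func pickPartition(key string, n int) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() % uint32(n))
}

func main() {
	for _, wf := range []string{"wf-a", "wf-b", "wf-c"} {
		p := pickPartition(wf, 4)
		fmt.Printf("%s -> %s\n", wf, partitionName("checkout", p))
	}
}
```

Determinism matters here only for the naming; task-to-partition assignment could equally be random or round-robin, since any poller can ultimately reach any task via forwarding.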
### Partition Hierarchy

Partitions are organized in a tree structure:

- Root partition: parent of all child partitions
- Child partitions: leaf nodes that hold tasks and pollers
- For small partition counts, root is direct parent of all children
- For larger partition counts, multiple tree levels may exist
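The tree shape can be expressed with a standard N-ary-tree parent formula. The degree and formula below are a generic sketch of such a hierarchy, not necessarily the server's exact layout:

```go
package main

import "fmt"

// parentPartition returns the parent of child partition c in a tree
// with fan-out degree d, where partition 0 is the root.
// When d is at least the child count, the root is the direct parent
// of every child, matching the small-partition-count case above.
func parentPartition(c, d int) int {
	if c == 0 {
		return -1 // the root has no parent
	}
	return (c - 1) / d
}

func main() {
	// With 4 partitions and degree 3, the root (0) is the direct
	// parent of partitions 1, 2, and 3.
	for c := 1; c < 4; c++ {
		fmt.Printf("parent(%d) = %d\n", c, parentPartition(c, 3))
	}
}
```

With many more partitions than the degree, the same formula produces intermediate levels, giving the multi-level tree mentioned above.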
### Task and Poller Forwarding

To improve matching efficiency, the Matching Service forwards both pollers and tasks.

Poller forwarding:

- When a worker polls an empty partition, the poll can be forwarded to the parent partition
- Allows pollers to be matched with tasks in other partitions

Task forwarding:

- When a task is added to a partition with no pollers, it can be forwarded to the parent partition
- Increases the likelihood of finding an available poller
If the root partition of a task queue is loaded, this forces all other partitions to load as well, ensuring forwarding can occur between any child partition with tasks and any child with pollers.
### Load Balancing

Partitioning provides natural load balancing:

- Different partitions can be owned by different Matching Service instances
- High-throughput task queues benefit from multiple partitions
- Low-throughput task queues can use fewer partitions or rely on forwarding
## Task Matching Strategies

The Matching Service uses different strategies to match tasks with workers.

### Synchronous Match

When a task arrives and a poller is already waiting:

- The task is immediately delivered to the waiting poller
- No database write occurs
- Lowest possible latency
### Asynchronous Match

When a task arrives but no poller is waiting:

- The task is written to the partition's backlog in persistence
- When a poller arrives, task is loaded and delivered
- Higher latency but ensures task is not lost
### In-Memory Buffering

For high-throughput scenarios:

- Tasks may be held in memory briefly before persistence
- Reduces write load on database
- Trade-off between latency and durability
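The write-reduction idea can be sketched as a batch buffer: accumulate tasks in memory and issue one persistence write per batch instead of one per task. A minimal sketch, with the persistence call stubbed out and all names assumed for illustration:

```go
package main

import "fmt"

// taskBuffer accumulates tasks in memory and flushes them in batches,
// trading a little latency for fewer persistence writes.
type taskBuffer struct {
	pending   []string
	batchSize int
	flushed   [][]string // stands in for database writes
}

// Add buffers a task, flushing when the batch fills up.
func (b *taskBuffer) Add(task string) {
	b.pending = append(b.pending, task)
	if len(b.pending) >= b.batchSize {
		b.Flush()
	}
}

// Flush writes any pending tasks as a single batch.
func (b *taskBuffer) Flush() {
	if len(b.pending) == 0 {
		return
	}
	batch := make([]string, len(b.pending))
	copy(batch, b.pending)
	b.flushed = append(b.flushed, batch) // one write per batch, not per task
	b.pending = b.pending[:0]
}

func main() {
	b := &taskBuffer{batchSize: 3}
	for i := 1; i <= 7; i++ {
		b.Add(fmt.Sprintf("task-%d", i))
	}
	b.Flush() // flush the remainder on a timer or shutdown
	fmt.Println("writes:", len(b.flushed)) // 3 writes for 7 tasks
}
```

A real implementation would also flush on a timer, bounding how long a task can sit unpersisted; that interval is the durability side of the trade-off.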
## Workflow Task Scheduling

Workflow Tasks have special semantics.

### Task Deduplication

Only one Workflow Task can be outstanding for a workflow execution at a time:

- History Service ensures this via Mutable State
- If a task is already scheduled, new events are batched
- Prevents concurrent execution of workflow code
### Sticky Execution

Workers can request sticky task queue assignment:

- The worker caches workflow state after executing a task
- Next task for that workflow is preferentially routed to the same worker
- Avoids reloading history from persistence
- Significantly improves latency
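Sticky routing reduces to a lookup with a fallback: if the workflow has a known sticky queue (a worker-specific queue the worker advertised), route there; otherwise use the normal queue. A toy sketch, with all names assumed for illustration:

```go
package main

import "fmt"

// stickyRouter picks a task queue for a workflow's next Workflow Task.
type stickyRouter struct {
	normalQueue string
	sticky      map[string]string // workflow ID -> worker-specific sticky queue
}

func (r *stickyRouter) queueFor(workflowID string) string {
	if q, ok := r.sticky[workflowID]; ok {
		return q // same worker, cached state: no history reload needed
	}
	return r.normalQueue // any worker; history is replayed from persistence
}

func main() {
	r := &stickyRouter{
		normalQueue: "checkout",
		sticky:      map[string]string{"wf-1": "checkout-sticky-worker-42"},
	}
	fmt.Println(r.queueFor("wf-1")) // routed to the caching worker's queue
	fmt.Println(r.queueFor("wf-2")) // routed to the shared queue
}
```

The real system also handles the failure path this sketch omits: if the sticky poll times out (the worker is gone or evicted the cache), the task falls back to the normal queue.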
## Activity Task Scheduling

Activity Tasks are simpler than Workflow Tasks:

- Multiple activities from one workflow can run concurrently
- No state caching (activities start fresh each time)
- Can have very high throughput per workflow
### Activity Heartbeating

For long-running activities:

- Workers can send heartbeats to indicate progress
- Heartbeats reset the activity timeout
- History Service tracks last heartbeat time
- If heartbeat timeout expires, activity is marked as failed
## Task Queue Visibility

The Matching Service provides APIs to inspect task queue state.

### DescribeTaskQueue

Returns information about a task queue:
- Number of tasks in backlog
- Number of active pollers
- Task queue type (workflow or activity)
- Partition information
### Task Queue Statistics

For monitoring and autoscaling:

- Track queue depth over time
- Monitor poller count
- Measure task latency (time from task creation to delivery)
- Identify hot partitions
## Failure Handling

### Task Timeouts

Tasks can time out at multiple stages.

Schedule-to-Start timeout:

- The task sat in the queue too long before being picked up
- Tracked by a timer in the History Service
- Causes the task to be retried or the workflow to fail

Start-to-Close timeout:

- The worker took too long to complete the task
- Tracked by a timer in the History Service
- A timed-out Activity Task is retried per its retry policy; a timed-out Workflow Task is failed and rescheduled
### Poller Disconnection

When a worker disconnects:

- Outstanding long-poll requests are cancelled
- In-flight tasks time out and are retried
- Matching Service removes poller from active list
### Matching Service Failure

When a Matching Service instance fails:

- Task queue ownership is transferred to another instance
- In-memory state is lost but can be recovered from persistence
- Pollers reconnect and re-establish long-poll requests
- Disruption is minimal because durable state lives in persistence
## Performance Optimizations

### Task Buffering

For high-throughput task queues:

- Tasks are buffered in memory before persistence writes
- Reduces database write load
- Configured via dynamic configuration
### Poller Isolation

Workers can specify task queue types:

- Normal vs. sticky task queues
- Workflow vs. activity task queues
- Allows specialized worker pools
### Rate Limiting

The Matching Service enforces rate limits:

- Per-namespace task dispatch rate limits
- Per-task-queue rate limits
- Protects downstream workers from overload
## Multi-Cluster Considerations

In multi-region deployments:

- Each cluster has its own Matching Service
- Task queues are cluster-local (not replicated)
- Workers must connect to the correct cluster
- Namespace failover does not migrate task queues
## Internals (Limited Documentation)

Contributions to expand this documentation are welcome!

## Further Reading

- History Service: how workflows are executed
- Worker Service: internal background workers
- Workflow Lifecycle: sequence diagrams of workflow execution
- Architecture Overview: high-level system architecture