
Matching Service

The Matching Service is responsible for managing task queues and matching tasks from the History Service with polling workers. It acts as a high-performance, scalable task distribution system.

Overview

The Matching Service sits between the History Service and Worker processes:
  • History Service adds Workflow Tasks and Activity Tasks to task queues
  • Worker processes poll task queues for work to execute
  • Matching Service efficiently matches tasks with available workers

Task Queues

What is a Task Queue?

A Task Queue in Matching Service:
  • Is identified by a name (e.g., checkout, analytics, email-sender)
  • Belongs to a specific namespace
  • Holds tasks for many different Workflow Executions
  • Has separate queues for Workflow Tasks and Activity Tasks
  • Is polled by one or more Worker processes

Task Types

The Matching Service handles two types of tasks:
Workflow Tasks are delivered to workers to advance workflow execution:
  • Worker receives workflow history events
  • Worker executes workflow code (replaying if needed)
  • Worker sends back commands (schedule activity, start timer, etc.)
  • Workflow Tasks are used to progress workflow state
Activity Tasks are delivered to workers to execute activities:
  • Worker receives activity input
  • Worker executes activity function (with side effects)
  • Worker sends back result or error
  • Activities perform actual business logic
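The split above can be sketched as a worker-side dispatch loop. The task shapes here are illustrative stand-ins, not the server's actual proto types:

```go
package main

import "fmt"

// Hypothetical task shapes for illustration; the real types are defined in
// the Temporal protos.
type WorkflowTask struct{ WorkflowID string }

type ActivityTask struct {
	ActivityType string
	Input        []byte
}

// dispatch models a worker's top-level loop: a polled task is either a
// Workflow Task (advance workflow state by replaying and producing commands)
// or an Activity Task (run actual business logic).
func dispatch(task any) string {
	switch t := task.(type) {
	case WorkflowTask:
		return fmt.Sprintf("replay history and produce commands for %s", t.WorkflowID)
	case ActivityTask:
		return fmt.Sprintf("execute activity %s with %d input bytes", t.ActivityType, len(t.Input))
	default:
		return "unknown task type"
	}
}

func main() {
	fmt.Println(dispatch(WorkflowTask{WorkflowID: "order-42"}))
	fmt.Println(dispatch(ActivityTask{ActivityType: "ChargeCard", Input: []byte("{}")}))
}
```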

Request Flow

Long-poll requests from workers follow this path:
  1. Worker sends PollWorkflowTaskQueue or PollActivityTaskQueue to Frontend Service
  2. Frontend routes the request to the Matching Service instance responsible for that task queue
  3. Matching Service either:
    • Immediately returns a task if one is available
    • Holds the long-poll request until a task arrives (with timeout)
  4. When History Service adds a task via AddWorkflowTask or AddActivityTask:
    • Task is matched with a waiting poller
    • Or task is added to the queue backlog

Task Queue Partitions

Partitioning Strategy

To achieve higher throughput, the Matching Service splits task queues into partitions:
  • Default: 4 partitions per task queue
  • Configurable: Can be more or fewer partitions
  • Each partition is owned by one Matching Service instance
  • Partition ownership can be reassigned for load balancing
  • Partitions can be loaded/unloaded from storage dynamically
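Partition counts are typically tuned through Temporal's dynamic configuration. A fragment along these lines raises the partition count for one busy task queue (key and constraint names vary by server version, so treat this as a sketch, not a drop-in config):

```yaml
# Dynamic config sketch: give the "checkout" task queue in the "default"
# namespace 8 partitions instead of the default 4.
matching.numTaskqueueReadPartitions:
  - value: 8
    constraints:
      namespace: "default"
      taskQueueName: "checkout"
matching.numTaskqueueWritePartitions:
  - value: 8
    constraints:
      namespace: "default"
      taskQueueName: "checkout"
```

Read and write partition counts are usually changed together; when scaling down, the write count is typically lowered first so existing backlogs can drain.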

Partition Hierarchy

Partitions are organized in a tree structure:
  • Root partition: Parent of all child partitions
  • Child partitions: Leaf nodes that hold tasks and pollers
  • For small partition counts, root is direct parent of all children
  • For larger partition counts, multiple tree levels may exist

Task and Poller Forwarding

To improve matching efficiency, the Matching Service forwards both pollers and tasks between partitions.
Poller Forwarding:
  • When a worker polls an empty partition, the poll can be forwarded to the parent partition
  • Allows pollers to be matched with tasks held in other partitions
Task Forwarding:
  • When a task is added to a partition with no pollers, it can be forwarded to the parent partition
  • Increases the likelihood of finding an available poller
If the root partition of a task queue is loaded, all other partitions are forced to load as well, ensuring forwarding can occur between any child partition holding tasks and any child partition with pollers.
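Forwarding walks up the partition tree toward the root. A minimal sketch of parent lookup, assuming a standard k-ary tree layout (the server's actual layout and forwarding limits are more involved):

```go
package main

import "fmt"

// parentPartition sketches parent lookup in a degree-d partition tree.
// Partition 0 is the root; children of node n are n*d+1 .. n*d+d, so the
// parent of p is (p-1)/d. This layout is an illustrative assumption.
func parentPartition(p, degree int) (int, bool) {
	if p == 0 {
		return 0, false // root has no parent; nowhere left to forward
	}
	return (p - 1) / degree, true
}

func main() {
	// With degree 4, partitions 1-4 forward directly to the root.
	for _, p := range []int{0, 1, 4, 5} {
		parent, ok := parentPartition(p, 4)
		fmt.Printf("partition %d -> parent %d (can forward: %v)\n", p, parent, ok)
	}
}
```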

Load Balancing

Partitioning provides natural load balancing:
  • Different partitions can be owned by different Matching Service instances
  • High-throughput task queues benefit from multiple partitions
  • Low-throughput task queues can use fewer partitions or rely on forwarding

Task Matching Strategies

The Matching Service uses different strategies to match tasks with workers:

Synchronous Match

When a task arrives and a poller is already waiting:
  1. Task is immediately delivered to the waiting poller
  2. No database write occurs
  3. Lowest possible latency

Asynchronous Match

When a task arrives but no poller is waiting:
  1. Task is written to the partition’s backlog in persistence
  2. When a poller arrives, task is loaded and delivered
  3. Higher latency but ensures task is not lost
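The sync-vs-async decision can be sketched with a channel of parked pollers: if one is already waiting, hand the task over directly (no database write); otherwise append it to the backlog, which the real server persists. Purely illustrative:

```go
package main

import "fmt"

// matcher models one partition: waiting pollers park a reply channel in
// `pollers`; tasks that find no poller accumulate in `backlog`.
type matcher struct {
	pollers chan chan string
	backlog []string
}

// addTask attempts a synchronous match first, falling back to the backlog.
func (m *matcher) addTask(task string) string {
	select {
	case poller := <-m.pollers: // a poller is parked: sync match, no write
		poller <- task
		return "sync"
	default: // no poller waiting: async match via (persisted) backlog
		m.backlog = append(m.backlog, task)
		return "async"
	}
}

func main() {
	m := &matcher{pollers: make(chan chan string, 16)}

	// No poller yet: task goes to the backlog.
	fmt.Println(m.addTask("task-1"), len(m.backlog))

	// Park a poller, then add a task: delivered synchronously.
	reply := make(chan string, 1)
	m.pollers <- reply
	fmt.Println(m.addTask("task-2"), <-reply)
}
```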

In-Memory Buffering

For high-throughput scenarios:
  • Tasks may be held in memory briefly before persistence
  • Reduces write load on database
  • Trade-off between latency and durability

Workflow Task Scheduling

Workflow Tasks have special semantics:

Task Deduplication

Only one Workflow Task can be outstanding for a workflow execution at a time:
  • History Service ensures this via Mutable State
  • If a task is already scheduled, new events are batched
  • Prevents concurrent execution of workflow code

Sticky Execution

Workers can request sticky task queue assignment:
  • Worker caches workflow state after executing a task
  • Next task for that workflow is preferentially routed to the same worker
  • Avoids reloading history from persistence
  • Significantly improves latency
When a worker completes a Workflow Task with sticky execution enabled:
  • The worker specifies a sticky task queue name
  • History Service delivers the next Workflow Task to the sticky queue first
  • If the sticky poll times out, the task falls back to the normal task queue

Activity Task Scheduling

Activity Tasks are simpler than Workflow Tasks:
  • Multiple activities from one workflow can run concurrently
  • No state caching (activities start fresh each time)
  • Can have very high throughput per workflow

Activity Heartbeating

For long-running activities:
  • Workers can send heartbeats to indicate progress
  • Heartbeats reset the activity timeout
  • History Service tracks last heartbeat time
  • If heartbeat timeout expires, activity is marked as failed

Task Queue Visibility

The Matching Service provides APIs to inspect task queue state:

DescribeTaskQueue

Returns information about a task queue:
  • Number of tasks in backlog
  • Number of active pollers
  • Task queue type (workflow or activity)
  • Partition information

Task Queue Statistics

For monitoring and autoscaling:
  • Track queue depth over time
  • Monitor poller count
  • Measure task latency (time from task creation to delivery)
  • Identify hot partitions

Failure Handling

Task Timeouts

Tasks can time out at multiple stages:
Schedule-to-Start Timeout:
  • Task sat in the queue too long before being picked up
  • Tracked by a timer in the History Service
  • Causes the task to be retried or the workflow to fail
Start-to-Close Timeout:
  • Worker took too long to complete the task
  • Tracked by a timer in the History Service
  • Activities are retried according to their retry policy; a timed-out Workflow Task is rescheduled and retried

Poller Disconnection

When a worker disconnects:
  • Outstanding long-poll requests are cancelled
  • In-flight tasks time out and are retried
  • Matching Service removes poller from active list

Matching Service Failure

When a Matching Service instance fails:
  • Task queue ownership is transferred to another instance
  • In-memory state is lost but can be recovered from persistence
  • Pollers reconnect and re-establish long-poll requests
  • Minimal disruption, because durable state lives in persistence rather than in the Matching Service itself

Performance Optimizations

Task Buffering

For high-throughput task queues:
  • Tasks are buffered in memory before persistence writes
  • Reduces database write load
  • Configured via dynamic configuration

Poller Isolation

Workers can specify task queue types:
  • Normal vs. sticky task queues
  • Workflow vs. activity task queues
  • Allows specialized worker pools

Rate Limiting

Matching Service enforces rate limits:
  • Per-namespace task dispatch rate limits
  • Per-task-queue rate limits
  • Protects downstream workers from overload

Multi-Cluster Considerations

In multi-region deployments:
  • Each cluster has its own Matching Service
  • Task queues are cluster-local (not replicated)
  • Workers must connect to the correct cluster
  • Namespace failover does not migrate task queues

Internals (Limited Documentation)

Detailed internal documentation for the Matching Service is still being developed. The following areas are particularly complex:
  • Partition ownership and loading/unloading logic
  • Task versioning and Build ID-based routing
  • Rate limiting implementation
  • Backlog management and persistence
Contributions to expand this documentation are welcome!

Further Reading

History Service

How workflows are executed

Worker Service

Internal background workers

Workflow Lifecycle

Sequence diagrams of workflow execution

Architecture Overview

High-level system architecture
