
Matching Service

The Matching Service is responsible for managing task queues and matching tasks from the History Service with polling workers. It acts as a high-performance, scalable task distribution system.

Overview

The Matching Service sits between the History Service and Worker processes:
  • History Service adds Workflow Tasks and Activity Tasks to task queues
  • Worker processes poll task queues for work to execute
  • Matching Service efficiently matches tasks with available workers

Task Queues

What is a Task Queue?

A Task Queue in Matching Service:
  • Is identified by a name (e.g., checkout, analytics, email-sender)
  • Belongs to a specific namespace
  • Holds tasks for many different Workflow Executions
  • Has separate queues for Workflow Tasks and Activity Tasks
  • Is polled by one or more Worker processes

Task Types

The Matching Service handles two types of tasks:
Workflow Tasks are delivered to workers to advance workflow execution:
  • Worker receives workflow history events
  • Worker executes workflow code (replaying if needed)
  • Worker sends back commands (schedule activity, start timer, etc.)
  • Workflow Tasks are used to progress workflow state
Activity Tasks are delivered to workers to execute activities:
  • Worker receives activity input
  • Worker executes activity function (with side effects)
  • Worker sends back result or error
  • Activities perform actual business logic
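The split above can be sketched as a worker-side dispatch loop. The task shapes here are illustrative stand-ins, not the server's actual proto types:

```go
package main

import "fmt"

// Hypothetical task shapes for illustration; the real types are defined in
// the Temporal protos.
type WorkflowTask struct{ WorkflowID string }

type ActivityTask struct {
	ActivityType string
	Input        []byte
}

// dispatch models a worker's top-level loop: a polled task is either a
// Workflow Task (advance workflow state by replaying and producing commands)
// or an Activity Task (run actual business logic).
func dispatch(task any) string {
	switch t := task.(type) {
	case WorkflowTask:
		return fmt.Sprintf("replay history and produce commands for %s", t.WorkflowID)
	case ActivityTask:
		return fmt.Sprintf("execute activity %s with %d input bytes", t.ActivityType, len(t.Input))
	default:
		return "unknown task type"
	}
}

func main() {
	fmt.Println(dispatch(WorkflowTask{WorkflowID: "order-42"}))
	fmt.Println(dispatch(ActivityTask{ActivityType: "ChargeCard", Input: []byte("{}")}))
}
```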

Request Flow

Long-poll requests from workers follow this path:
  1. Worker sends PollWorkflowTaskQueue or PollActivityTaskQueue to Frontend Service
  2. Frontend routes the request to the Matching Service instance responsible for that task queue
  3. Matching Service either:
    • Immediately returns a task if one is available
    • Holds the long-poll request until a task arrives (with timeout)
  4. When History Service adds a task via AddWorkflowTask or AddActivityTask:
    • Task is matched with a waiting poller
    • Or task is added to the queue backlog

Task Queue Partitions

Partitioning Strategy

To achieve higher throughput, the Matching Service splits task queues into partitions:
  • Default: 4 partitions per task queue
  • Configurable: Can be more or fewer partitions
  • Each partition is owned by one Matching Service instance
  • Partition ownership can be reassigned for load balancing
  • Partitions can be loaded/unloaded from storage dynamically
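Partition counts are typically tuned through Temporal's dynamic configuration. A fragment along these lines raises the partition count for one busy task queue (key and constraint names vary by server version, so treat this as a sketch, not a drop-in config):

```yaml
# Dynamic config sketch: give the "checkout" task queue in the "default"
# namespace 8 partitions instead of the default 4.
matching.numTaskqueueReadPartitions:
  - value: 8
    constraints:
      namespace: "default"
      taskQueueName: "checkout"
matching.numTaskqueueWritePartitions:
  - value: 8
    constraints:
      namespace: "default"
      taskQueueName: "checkout"
```

Read and write partition counts are usually changed together; when scaling down, the write count is typically lowered first so existing backlogs can drain.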

Partition Hierarchy

Partitions are organized in a tree structure:
  • Root partition: Parent of all child partitions
  • Child partitions: Leaf nodes that hold tasks and pollers
  • For small partition counts, root is direct parent of all children
  • For larger partition counts, multiple tree levels may exist

Task and Poller Forwarding

To improve matching efficiency, the Matching Service forwards both pollers and tasks between partitions.
Poller Forwarding:
  • When a worker polls an empty partition, the poll can be forwarded to the parent partition
  • Allows pollers to be matched with tasks held in other partitions
Task Forwarding:
  • When a task is added to a partition with no pollers, it can be forwarded to the parent partition
  • Increases the likelihood of finding an available poller
If the root partition of a task queue is loaded, all other partitions are forced to load as well, ensuring forwarding can occur between any child partition holding tasks and any child partition with pollers.
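Forwarding walks up the partition tree toward the root. A minimal sketch of parent lookup, assuming a standard k-ary tree layout (the server's actual layout and forwarding limits are more involved):

```go
package main

import "fmt"

// parentPartition sketches parent lookup in a degree-d partition tree.
// Partition 0 is the root; children of node n are n*d+1 .. n*d+d, so the
// parent of p is (p-1)/d. This layout is an illustrative assumption.
func parentPartition(p, degree int) (int, bool) {
	if p == 0 {
		return 0, false // root has no parent; nowhere left to forward
	}
	return (p - 1) / degree, true
}

func main() {
	// With degree 4, partitions 1-4 forward directly to the root.
	for _, p := range []int{0, 1, 4, 5} {
		parent, ok := parentPartition(p, 4)
		fmt.Printf("partition %d -> parent %d (can forward: %v)\n", p, parent, ok)
	}
}
```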

Load Balancing

Partitioning provides natural load balancing:
  • Different partitions can be owned by different Matching Service instances
  • High-throughput task queues benefit from multiple partitions
  • Low-throughput task queues can use fewer partitions or rely on forwarding

Task Matching Strategies

The Matching Service uses different strategies to match tasks with workers:

Synchronous Match

When a task arrives and a poller is already waiting:
  1. Task is immediately delivered to the waiting poller
  2. No database write occurs
  3. Lowest possible latency

Asynchronous Match

When a task arrives but no poller is waiting:
  1. Task is written to the partition’s backlog in persistence
  2. When a poller arrives, task is loaded and delivered
  3. Higher latency but ensures task is not lost
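The sync-vs-async decision can be sketched with a channel of parked pollers: if one is already waiting, hand the task over directly (no database write); otherwise append it to the backlog, which the real server persists. Purely illustrative:

```go
package main

import "fmt"

// matcher models one partition: waiting pollers park a reply channel in
// `pollers`; tasks that find no poller accumulate in `backlog`.
type matcher struct {
	pollers chan chan string
	backlog []string
}

// addTask attempts a synchronous match first, falling back to the backlog.
func (m *matcher) addTask(task string) string {
	select {
	case poller := <-m.pollers: // a poller is parked: sync match, no write
		poller <- task
		return "sync"
	default: // no poller waiting: async match via (persisted) backlog
		m.backlog = append(m.backlog, task)
		return "async"
	}
}

func main() {
	m := &matcher{pollers: make(chan chan string, 16)}

	// No poller yet: task goes to the backlog.
	fmt.Println(m.addTask("task-1"), len(m.backlog))

	// Park a poller, then add a task: delivered synchronously.
	reply := make(chan string, 1)
	m.pollers <- reply
	fmt.Println(m.addTask("task-2"), <-reply)
}
```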

In-Memory Buffering

For high-throughput scenarios:
  • Tasks may be held in memory briefly before persistence
  • Reduces write load on database
  • Trade-off between latency and durability

Workflow Task Scheduling

Workflow Tasks have special semantics:

Task Deduplication

Only one Workflow Task can be outstanding for a workflow execution at a time:
  • History Service ensures this via Mutable State
  • If a task is already scheduled, new events are batched
  • Prevents concurrent execution of workflow code

Sticky Execution

Workers can request sticky task queue assignment:
  • Worker caches workflow state after executing a task
  • Next task for that workflow is preferentially routed to the same worker
  • Avoids reloading history from persistence
  • Significantly improves latency
When a worker completes a Workflow Task with sticky execution enabled:
  • The worker specifies a sticky task queue name
  • History Service delivers the next Workflow Task to the sticky queue first
  • If the sticky poll times out, the task falls back to the normal task queue

Activity Task Scheduling

Activity Tasks are simpler than Workflow Tasks:
  • Multiple activities from one workflow can run concurrently
  • No state caching (activities start fresh each time)
  • Can have very high throughput per workflow

Activity Heartbeating

For long-running activities:
  • Workers can send heartbeats to indicate progress
  • Heartbeats reset the activity timeout
  • History Service tracks last heartbeat time
  • If heartbeat timeout expires, activity is marked as failed

Task Queue Visibility

The Matching Service provides APIs to inspect task queue state:

DescribeTaskQueue

Returns information about a task queue:
  • Number of tasks in backlog
  • Number of active pollers
  • Task queue type (workflow or activity)
  • Partition information

Task Queue Statistics

For monitoring and autoscaling:
  • Track queue depth over time
  • Monitor poller count
  • Measure task latency (time from task creation to delivery)
  • Identify hot partitions

Failure Handling

Task Timeouts

Tasks can time out at multiple stages:
Schedule-to-Start Timeout:
  • Task sat in the queue too long before being picked up
  • Tracked by a timer in the History Service
  • Causes the task to be retried or the workflow to fail
Start-to-Close Timeout:
  • Worker took too long to complete the task
  • Tracked by a timer in the History Service
  • Activities are retried according to their retry policy; a timed-out Workflow Task is rescheduled and retried

Poller Disconnection

When a worker disconnects:
  • Outstanding long-poll requests are cancelled
  • In-flight tasks time out and are retried
  • Matching Service removes poller from active list

Matching Service Failure

When a Matching Service instance fails:
  • Task queue ownership is transferred to another instance
  • In-memory state is lost but can be recovered from persistence
  • Pollers reconnect and re-establish long-poll requests
  • Minimal disruption, because durable state lives in persistence rather than in the Matching Service itself

Performance Optimizations

Task Buffering

For high-throughput task queues:
  • Tasks are buffered in memory before persistence writes
  • Reduces database write load
  • Configured via dynamic configuration

Poller Isolation

Workers can specify task queue types:
  • Normal vs. sticky task queues
  • Workflow vs. activity task queues
  • Allows specialized worker pools

Rate Limiting

Matching Service enforces rate limits:
  • Per-namespace task dispatch rate limits
  • Per-task-queue rate limits
  • Protects downstream workers from overload

Multi-Cluster Considerations

In multi-region deployments:
  • Each cluster has its own Matching Service
  • Task queues are cluster-local (not replicated)
  • Workers must connect to the correct cluster
  • Namespace failover does not migrate task queues

Internals (Limited Documentation)

Detailed internal documentation for the Matching Service is still being developed. The following areas are particularly complex:
  • Partition ownership and loading/unloading logic
  • Task versioning and Build ID-based routing
  • Rate limiting implementation
  • Backlog management and persistence
Contributions to expand this documentation are welcome!

Further Reading

History Service

How workflows are executed

Worker Service

Internal background workers

Workflow Lifecycle

Sequence diagrams of workflow execution

Architecture Overview

High-level system architecture
