
Service Overview

Cadence consists of four core services that work together to provide workflow orchestration capabilities. Each service is stateless, horizontally scalable, and has a specific set of responsibilities.

Frontend Service

The Frontend service acts as the API gateway for all client interactions with Cadence.

Responsibilities

  • API Gateway: Exposes public APIs for workflow and activity operations
  • Request Validation: Validates all incoming requests
  • Rate Limiting: Enforces per-domain rate limits
  • Authentication & Authorization: Handles security concerns
  • Cluster Redirection: Routes requests to appropriate cluster in multi-region setup
  • Domain Management: Handles domain registration and updates

Key Components

// Frontend service structure
type Service struct {
    Resource
    handler      *api.WorkflowHandler
    adminHandler admin.Handler
    config       *config.Config
}

API Layers

The frontend implements multiple decorator layers:
  1. Base Handler: Core API implementation
  2. Version Check: Ensures client compatibility
  3. Rate Limiter: Enforces quota limits
  4. Metrics: Captures telemetry data
  5. Cluster Redirection: Handles multi-cluster routing
  6. Access Control: Authorization enforcement

Configuration

services:
  frontend:
    rpc:
      port: 7933
      grpcPort: 7833
      bindOnLocalHost: true
      grpcMaxMsgSize: 33554432
    metrics:
      statsd:
        hostPort: "127.0.0.1:8125"
        prefix: "cadence"

Rate Limiting

The frontend applies rate limits in multiple stages, combining a global dynamic limit with per-domain limits:
// Rate limiter combining a global dynamic limit with per-domain limits
userRateLimiter := quotas.NewMultiStageRateLimiter(
    quotas.NewDynamicRateLimiter(s.config.UserRPS.AsFloat64()),
    collections.user, // per-domain limiter collection
)
Rate Limit Types:
  • User RPS: Rate limit for client API calls
  • Worker RPS: Rate limit for worker poll requests
  • Visibility RPS: Rate limit for visibility queries
  • Async RPS: Rate limit for async workflow operations

APIs Exposed

Workflow APIs

  • StartWorkflowExecution
  • SignalWorkflowExecution
  • TerminateWorkflowExecution
  • GetWorkflowExecutionHistory
  • DescribeWorkflowExecution

Domain APIs

  • RegisterDomain
  • DescribeDomain
  • UpdateDomain
  • ListDomains
  • DeprecateDomain

Task List APIs

  • PollForDecisionTask
  • PollForActivityTask
  • RespondDecisionTaskCompleted
  • RespondActivityTaskCompleted

History Service

The History service is the core workflow execution engine that maintains workflow state and makes execution decisions.

Responsibilities

  • Workflow State Management: Maintains mutable state for active workflows
  • Event History: Persists immutable workflow history events
  • Decision Processing: Processes decisions from workflow workers
  • Task Generation: Creates decision and activity tasks
  • Timer Management: Handles workflow and activity timeouts
  • Shard Ownership: Manages history shard ownership

Shard-Based Architecture

Key Components

// History service structure
type Service struct {
    Resource
    handler handler.Handler
    config  *config.Config
}

// History engine per shard
type Engine struct {
    shard            ShardContext
    executionManager persistence.ExecutionManager
    taskProcessor    processor.TaskProcessor
}

Workflow Execution State

The History service maintains two types of state:
  1. Mutable State: Current workflow execution state
    • Pending decision/activity tasks
    • Timers
    • Signals
    • Child workflows
    • Execution info (status, timeouts, etc.)
  2. Immutable History: Event log of all workflow actions
    • WorkflowExecutionStarted
    • DecisionTaskScheduled
    • ActivityTaskStarted
    • WorkflowExecutionCompleted

Task Queues

History service manages multiple task queues per shard:
// Task types processed by history
type TaskType int
const (
    TransferTaskType    TaskType = iota // immediate tasks (decisions, activities)
    TimerTaskType                       // delayed tasks (timeouts, retries)
    ReplicationTaskType                 // cross-DC replication tasks
)

Transfer Queue

  • Processes tasks that need immediate execution
  • Examples: Decision tasks, activity tasks, close execution
  • FIFO processing within each shard

Timer Queue

  • Processes time-based tasks
  • Examples: Workflow timeout, activity timeout, retry timer
  • Priority queue ordered by fire time

Configuration

services:
  history:
    rpc:
      port: 7934
      grpcPort: 7834
      grpcMaxMsgSize: 33554432

Scalability Considerations

  • Maximum Scale: Limited by numHistoryShards, which is fixed when the cluster is first provisioned and cannot be changed later
  • Shard Distribution: Automatic rebalancing when instances join/leave
  • Graceful Shutdown: Drains shards before stopping

Matching Service

The Matching service routes tasks from History to Workers using task lists.

Responsibilities

  • Task List Management: Maintains task lists for decisions and activities
  • Task Routing: Delivers tasks to polling workers
  • Sync Match: Optimizes latency by matching tasks with waiting pollers
  • Task Persistence: Stores unmatched tasks in database
  • Load Balancing: Distributes tasks across available workers

Task List Architecture

Each task list is owned by a single Matching host, which buffers incoming tasks and matches them with polling workers.

Key Components

// Matching service structure
type Service struct {
    Resource
    handler handler.Handler
    config  *config.Config
}

// Task list manager
type taskListManager struct {
    taskListID   *tasklist.Identifier
    taskBuffer   chan *persistence.TaskInfo
    deliverBuffer chan *InternalTask
}

Sync Match Optimization

Sync match occurs when a task arrives while a worker is already waiting on a long poll:
  1. A worker sends PollForTask (long poll) and waits
  2. The History service adds a task to Matching
  3. Matching checks for waiting pollers
  4. If a poller is waiting: the task is delivered immediately (sync match)
  5. If no poller is waiting: the task is persisted to the database (async match)
Benefits:
  • Near-zero latency task delivery
  • Reduced database load
  • Better throughput

Task List Types

  1. Decision Task List: Routes decision tasks
    • One per workflow task list
    • Workers poll for workflow execution decisions
  2. Activity Task List: Routes activity tasks
    • Can be different from decision task list
    • Workers poll for activity execution

Configuration

services:
  matching:
    rpc:
      port: 7935
      grpcPort: 7835
      grpcMaxMsgSize: 33554432

Scalability

  • Task List Partitioning: High-throughput task lists can be partitioned
  • Isolation Groups: Route tasks to specific worker pools
  • Dynamic Partitioning: Automatic partition adjustment based on load

Worker Service

The Worker service handles internal background processing tasks for the Cadence system.

Responsibilities

  • Replication: Processes cross-datacenter replication tasks
  • Indexing: Indexes workflow data to Elasticsearch/Pinot for visibility
  • Archival: Archives old workflow histories to blob storage
  • System Workflows: Runs internal system workflows
  • Domain Replication: Replicates domain metadata across clusters
Note: The Worker service is not the same as the application workers that execute your workflow and activity code; it is an internal Cadence component.

Key Components

// Worker service structure
type Service struct {
    Resource
    config *Config
}

// Background processors
type processors struct {
    replicator         *replicator.Replicator
    indexer            *indexer.Indexer
    archiver           *archiver.Archiver
    scanner            *scanner.Scanner
    esAnalyzer         *esanalyzer.Analyzer
    failoverManager    *failovermanager.Manager
}

Replicator

Handles cross-cluster replication:
// Replicator processes replication tasks from Kafka
type Replicator struct {
    kafkaClient         messaging.Client
    historyClient       history.Client
    domainReplicator    domain.Replicator
}
Flow:
  1. History service writes replication tasks to Kafka
  2. Worker service consumes from Kafka
  3. Applies tasks to target cluster
  4. Handles conflict resolution

Indexer

Indexes workflow data for advanced visibility:
// Indexer processes visibility events
type Indexer struct {
    kafkaClient    messaging.Client
    esClient       elasticsearch.Client
    bulkProcessor  es.BulkProcessor
}
Configuration:
dynamicconfig:
  WorkerIndexerConcurrency: 100
  WorkerESProcessorBulkActions: 500
  WorkerESProcessorBulkSize: 2097152  # 2MB
  WorkerESProcessorFlushInterval: 1s

Archiver

Archives workflow histories to long-term storage:
archival:
  history:
    status: "enabled"
    enableRead: true
    provider:
      filestore:
        fileMode: "0666"
        dirMode: "0766"
Supported Providers:
  • Local filesystem
  • AWS S3
  • Google Cloud Storage
  • Custom implementations

Scanner

Performs data consistency checks and cleanup:
  • Task List Scanner: Removes orphaned task list entries
  • History Scanner: Validates workflow history integrity
  • Timer Scanner: Checks for stuck timers
  • Execution Scanner: Identifies zombie workflows

Configuration

services:
  worker:
    rpc:
      port: 7939
    metrics:
      statsd:
        hostPort: "127.0.0.1:8125"
        prefix: "cadence"

System Domains

Worker service creates internal system domains:
  • cadence-system: Core system workflows
  • cadence-batcher: Batch operations
  • cadence-canary: Health checks

Inter-Service Communication

Communication Patterns

RPC Configuration

Protocols Supported:
  • gRPC (recommended)
  • TChannel (legacy)
Message Size Limits:
grpcMaxMsgSize: 33554432  # 32MB default

Service Discovery

Services discover each other using Ringpop:
ringpop:
  name: cadence
  bootstrapMode: hosts
  bootstrapHosts:
    - "127.0.0.1:7933"
    - "127.0.0.1:7934"
    - "127.0.0.1:7935"

Deployment Considerations

Resource Requirements

Service    CPU         Memory      Disk     Network
Frontend   Low-Medium  Low         Minimal  High
History    High        High        Minimal  Medium
Matching   Low-Medium  Low-Medium  Minimal  Medium
Worker     Medium      Medium      Low      Medium

Scaling Guidelines

  1. Frontend: Scale based on RPS
    • Start with 2-3 instances
    • Add instances as traffic increases
  2. History: Scale based on shard count
    • Each instance should own 100-200 shards
    • More instances = better distribution
  3. Matching: Scale based on task throughput
    • Start with 2-3 instances
    • Scale if sync match rate drops
  4. Worker: Scale based on background load
    • Replication lag
    • Indexing lag
    • Archival backlog

Next Steps

Persistence Layer

Learn about database design and configuration

Cross-DC Replication

Set up multi-region deployments
