Overview
Serverless Workflow is designed with resilience in mind, acknowledging that errors are an inevitable part of any system. The DSL provides mechanisms to identify, describe, and handle errors, so a workflow can recover gracefully from a wide range of failure scenarios rather than simply aborting.
Errors
Errors in Serverless Workflow are described using the Problem Details format (RFC 7807). This standardizes how errors are communicated; the `instance` property is a JSON Pointer that identifies the specific workflow component that raised the error.
Error Structure
An error follows this structure:

```yaml
type: https://serverlessworkflow.io/spec/1.0.0/errors/communication
title: Service Unavailable
status: 503
detail: The service is currently unavailable. Please try again later.
instance: /do/getPetById
```

- `type`: A URI reference that identifies the error type
- `title`: A short, human-readable summary of the error type
- `status`: The HTTP status code for this occurrence of the error
- `detail`: A human-readable explanation specific to this occurrence of the error
- `instance`: A JSON Pointer to the workflow component that raised the error
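Because `instance` is a standard JSON Pointer (RFC 6901), tooling can resolve it mechanically against the workflow definition. A minimal sketch with a hand-rolled resolver, assuming a simplified workflow document shaped as nested maps (the real DSL nests tasks differently):

```python
# Illustrative sketch (not part of the spec): resolve an error's
# `instance` JSON Pointer (RFC 6901) to the component that raised it.

def resolve_pointer(document, pointer):
    """Walk an RFC 6901 JSON Pointer through nested dicts/lists."""
    target = document
    for token in pointer.lstrip("/").split("/"):
        # RFC 6901 unescaping: "~1" -> "/", then "~0" -> "~"
        token = token.replace("~1", "/").replace("~0", "~")
        target = target[int(token)] if isinstance(target, list) else target[token]
    return target

# Hypothetical, simplified workflow document for illustration only.
workflow = {"do": {"getPetById": {"call": "http"}}}
error = {"instance": "/do/getPetById"}
print(resolve_pointer(workflow, error["instance"]))  # → {'call': 'http'}
```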
Standard Error Types
The Serverless Workflow specification defines several standard error types to describe commonly known errors:
| Error Type | Status | Description |
|---|---|---|
| https://serverlessworkflow.io/spec/1.0.0/errors/configuration | 500 | Configuration error |
| https://serverlessworkflow.io/spec/1.0.0/errors/validation | 400 | Validation error |
| https://serverlessworkflow.io/spec/1.0.0/errors/expression | 400 | Expression evaluation error |
| https://serverlessworkflow.io/spec/1.0.0/errors/communication | 500 | Communication error |
| https://serverlessworkflow.io/spec/1.0.0/errors/timeout | 408 | Timeout error |
| https://serverlessworkflow.io/spec/1.0.0/errors/authorization | 403 | Authorization error |
Using these standard error types ensures that workflows behave consistently across different runtimes and allows authors to rely on predictable error handling and recovery processes.
Defining Custom Errors
You can define reusable custom errors in the `use` section:

```yaml
use:
  errors:
    businessRuleViolation:
      type: https://example.com/errors/business-rule
      status: 422
      title: Business Rule Violation
    insufficientFunds:
      type: https://example.com/errors/insufficient-funds
      status: 402
      title: Insufficient Funds
    invalidOperation:
      type: https://example.com/errors/invalid-operation
      status: 400
      title: Invalid Operation
```
Try-Catch Pattern
The `try` task lets you attempt a task and handle any errors it raises gracefully:
Basic Try-Catch
```yaml
tryExample:
  try:
    call: http
    with:
      method: get
      endpoint:
        uri: https://api.example.com/data
  catch:
    errors:
      with:
        status: 503
    as: serviceError
```

- `catch`: Error handling configuration
- `catch.errors.with`: Error filter specifying which errors to catch
- `catch.as`: Variable name to store the caught error
Catching Specific Errors
```yaml
processPayment:
  try:
    call: paymentService
    with:
      amount: ${ .orderTotal }
      customerId: ${ .customerId }
  catch:
    errors:
      with:
        type: https://example.com/errors/insufficient-funds
    do:
      - notifyCustomer:
          call: notificationService
          with:
            message: Insufficient funds for payment
```
Multiple Error Handlers
```yaml
processOrder:
  try:
    call: orderService
    with:
      orderId: ${ .orderId }
  catch:
    errors:
      one:
        - with:
            type: https://example.com/errors/validation
          do:
            - handleValidationError:
                call: errorLogger
                with:
                  error: Validation failed
        - with:
            type: https://example.com/errors/insufficient-inventory
          do:
            - handleInventoryError:
                call: inventoryService
                with:
                  action: backorder
        - with:
            status: 503
          retry:
            delay:
              seconds: 3
            limit:
              attempt:
                count: 5
```
Catch All Errors
```yaml
riskyOperation:
  try:
    call: unreliableService
    with:
      data: ${ .inputData }
  catch:
    errors: {}
    as: caughtError
    do:
      - logError:
          call: logger
          with:
            error: ${ .caughtError }
      - useDefault:
          set:
            result: default-value
```

An empty `errors` object catches all errors, providing a fallback for any failure.
Retry Policies
Retry policies allow workflows to automatically retry failed operations, which is especially useful for handling transient failures.
Basic Retry
```yaml
fetchData:
  try:
    call: http
    with:
      method: get
      endpoint:
        uri: https://api.example.com/data
  catch:
    errors:
      with:
        status: 503
    retry:
      delay:
        seconds: 3
      limit:
        attempt:
          count: 5
```

- `retry.delay`: Duration to wait before retrying
Retry with Exponential Backoff
```yaml
reliableCall:
  try:
    call: http
    with:
      method: post
      endpoint:
        uri: https://api.example.com/process
      body: ${ .data }
  catch:
    errors:
      with:
        type: https://serverlessworkflow.io/spec/1.0.0/errors/communication
    retry:
      delay:
        seconds: 1
      backoff:
        exponential:
          factor: 2
      limit:
        attempt:
          count: 5
```

- `retry.backoff.exponential`: Exponential backoff configuration
- `retry.backoff.exponential.factor`: Multiplication factor for each retry delay
With exponential backoff and factor 2:
- Attempt 1: 1 second delay
- Attempt 2: 2 seconds delay
- Attempt 3: 4 seconds delay
- Attempt 4: 8 seconds delay
- Attempt 5: 16 seconds delay
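The schedule above is simple arithmetic: each delay is the base delay multiplied by the factor raised to the number of prior attempts. A minimal sketch of the calculation a runtime might perform (not part of the DSL itself):

```python
# Sketch: exponential backoff delays for base delay 1s, factor 2, 5 attempts.
def exponential_delays(base_seconds, factor, attempts):
    """Delay before attempt n is base * factor^(n-1)."""
    return [base_seconds * factor ** (n - 1) for n in range(1, attempts + 1)]

print(exponential_delays(1, 2, 5))  # → [1, 2, 4, 8, 16]
```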
Retry with Linear Backoff
```yaml
steadyRetry:
  try:
    call: dataService
  catch:
    errors:
      with:
        status: 503
    retry:
      delay:
        seconds: 3
      backoff:
        linear: {}
      limit:
        attempt:
          count: 5
```

- `retry.backoff.linear`: Linear backoff configuration (the delay increases by the same amount each time)
With linear backoff:
- Attempt 1: 3 seconds delay
- Attempt 2: 6 seconds delay
- Attempt 3: 9 seconds delay
- Attempt 4: 12 seconds delay
- Attempt 5: 15 seconds delay
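Here each delay grows by the initial delay, so the delay before attempt n is simply the base delay times n. A minimal sketch of that calculation (illustrative only, not part of the DSL):

```python
# Sketch: linear backoff delays for base delay 3s, 5 attempts.
def linear_delays(base_seconds, attempts):
    """Delay before attempt n is base * n."""
    return [base_seconds * n for n in range(1, attempts + 1)]

print(linear_delays(3, 5))  # → [3, 6, 9, 12, 15]
```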
Time-Based Retry Limit
```yaml
timedRetry:
  try:
    call: longRunningService
  catch:
    errors:
      with:
        type: https://serverlessworkflow.io/spec/1.0.0/errors/timeout
    retry:
      delay:
        seconds: 5
      limit:
        duration:
          minutes: 10
```

- `retry.limit.duration`: Maximum total time to spend retrying
Reusable Retry Policies
Define retry policies once and reuse them across multiple tasks:
```yaml
use:
  retries:
    standardRetry:
      delay:
        seconds: 2
      backoff:
        exponential:
          factor: 2
      limit:
        attempt:
          count: 5
    aggressiveRetry:
      delay:
        milliseconds: 500
      backoff:
        exponential:
          factor: 1.5
      limit:
        attempt:
          count: 10
    patientRetry:
      delay:
        seconds: 10
      backoff:
        linear: {}
      limit:
        duration:
          minutes: 30
do:
  - criticalOperation:
      try:
        call: criticalService
      catch:
        errors:
          with:
            status: 503
        retry: standardRetry
  - rapidOperation:
      try:
        call: fastService
      catch:
        errors:
          with:
            type: https://serverlessworkflow.io/spec/1.0.0/errors/communication
        retry: aggressiveRetry
```
Advanced Error Handling Patterns
Error Recovery with Fallback
```yaml
do:
  - tryPrimaryService:
      try:
        call: http
        with:
          method: get
          endpoint:
            uri: https://primary.example.com/api
      catch:
        errors:
          with:
            status: 503
        retry:
          delay:
            seconds: 2
          limit:
            attempt:
              count: 3
        as: primaryError
  - tryFallbackService:
      if: ${ .primaryError != null }
      try:
        call: http
        with:
          method: get
          endpoint:
            uri: https://fallback.example.com/api
      catch:
        errors:
          with:
            status: 503
        as: fallbackError
  - useDefaultData:
      if: ${ .primaryError != null and .fallbackError != null }
      set:
        result:
          data: default-data
          source: default
```
Circuit Breaker Pattern
```yaml
do:
  - checkCircuitState:
      call: circuitBreakerService
      with:
        service: externalApi
  - callService:
      if: ${ .checkCircuitState.output.state != "open" }
      try:
        call: http
        with:
          method: get
          endpoint:
            uri: https://external.example.com/api
      catch:
        errors: {}
        do:
          - recordFailure:
              call: circuitBreakerService
              with:
                action: recordFailure
                service: externalApi
  - useCachedData:
      if: ${ .checkCircuitState.output.state == "open" }
      call: cacheService
      with:
        key: lastKnownGood
```
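The `circuitBreakerService` above is a hypothetical external service; the workflow only consumes its `state` output and reports failures to it. For illustration, a minimal sketch of the state tracking such a service might implement (the threshold and cooldown values are arbitrary assumptions):

```python
import time

class CircuitBreaker:
    """Per-service breaker: closed -> open after N consecutive failures,
    half-open again once a cooldown period has elapsed."""

    def __init__(self, failure_threshold=3, cooldown_seconds=30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.failures = 0
        self.opened_at = None  # monotonic timestamp when the breaker opened

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def state(self):
        if self.opened_at is None:
            return "closed"
        if time.monotonic() - self.opened_at >= self.cooldown_seconds:
            return "half-open"  # allow one probe request through
        return "open"

breaker = CircuitBreaker(failure_threshold=2)
breaker.record_failure()
breaker.record_failure()
print(breaker.state())  # → open
```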
Saga Pattern with Compensation
```yaml
do:
  - reserveInventory:
      try:
        call: inventoryService
        with:
          action: reserve
          items: ${ .orderItems }
      catch:
        errors: {}
        then: end
      export:
        as: ${ $context + { inventoryReserved: true } }
  - processPayment:
      try:
        call: paymentService
        with:
          amount: ${ .orderTotal }
      catch:
        errors: {}
        do:
          - compensateInventory:
              call: inventoryService
              with:
                action: release
                items: ${ .orderItems }
        then: end
      export:
        as: ${ $context + { paymentProcessed: true } }
  - confirmOrder:
      try:
        call: orderService
        with:
          action: confirm
      catch:
        errors: {}
        do:
          - compensatePayment:
              call: paymentService
              with:
                action: refund
                amount: ${ .orderTotal }
          - compensateInventory:
              call: inventoryService
              with:
                action: release
                items: ${ .orderItems }
        then: end
```
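Stripped of the DSL, the saga follows one rule: each completed step registers a compensation, and a failure unwinds the registered compensations in reverse order. A minimal sketch of that logic (the step and compensation names are illustrative, not from the spec):

```python
def run_saga(steps):
    """steps: list of (action, compensation) callables.
    On failure, run compensations for completed steps in reverse order."""
    compensations = []
    try:
        for action, compensation in steps:
            action()
            compensations.append(compensation)
    except Exception:
        for compensation in reversed(compensations):
            compensation()  # undo completed work, newest first
        raise

# Hypothetical order flow: payment fails, so the inventory reservation
# is released; the payment step never registered its own compensation.
log = []
def reserve(): log.append("reserve")
def release(): log.append("release")
def pay(): raise RuntimeError("payment failed")
def refund(): log.append("refund")

try:
    run_saga([(reserve, release), (pay, refund)])
except RuntimeError:
    pass
print(log)  # → ['reserve', 'release']
```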
Nested Try-Catch
```yaml
processWithRecovery:
  try:
    do:
      - step1:
          try:
            call: service1
          catch:
            errors:
              with:
                status: 503
            retry:
              delay:
                seconds: 2
              limit:
                attempt:
                  count: 3
      - step2:
          try:
            call: service2
            with:
              data: ${ .step1.output }
          catch:
            errors:
              with:
                type: https://example.com/errors/validation
            do:
              - fixData:
                  call: dataFixer
                  with:
                    data: ${ .step1.output }
              - retryStep2:
                  call: service2
                  with:
                    data: ${ .fixData.output }
  catch:
    errors: {}
    do:
      - logFailure:
          call: logger
          with:
            message: Complete process failed
      - notifyAdmin:
          call: notificationService
          with:
            recipient: [email protected]
            message: Critical workflow failure
```
Error Handling with Context Preservation
```yaml
do:
  - initializeContext:
      set:
        processId: ${ .requestId }
        startTime: ${ now }
        status: processing
  - processData:
      try:
        call: processor
        with:
          data: ${ .inputData }
      catch:
        errors: {}
        as: processingError
      export:
        as: ${ $context + {
          error: .processingError,
          status: "failed",
          failedAt: now
        } }
  - finalizeStatus:
      set:
        finalStatus: ${ if $context.error then "failed" else "success" end }
```
Raising Errors
The `raise` task explicitly raises an error:

```yaml
validateInput:
  call: validator
  with:
    data: ${ .inputData }
checkValidation:
  if: ${ .validateInput.output.isValid == false }
  raise:
    error:
      type: https://serverlessworkflow.io/spec/1.0.0/errors/validation
      status: 400
      title: Validation Failed
      detail: ${ .validateInput.output.message }
```

- `raise.error`: The error to raise, following the RFC 7807 Problem Details format
Conditional Error Raising
```yaml
do:
  - checkBusinessRules:
      call: businessRuleEngine
      with:
        data: ${ .orderData }
  - raiseIfViolation:
      if: ${ .checkBusinessRules.output.violations | length > 0 }
      raise:
        error:
          type: https://example.com/errors/business-rule
          status: 422
          title: Business Rule Violation
          detail: ${ .checkBusinessRules.output.violations | map(.message) | join(", ") }
```
Error Logging and Monitoring
Logging Errors
```yaml
processWithLogging:
  try:
    call: riskyOperation
  catch:
    errors: {}
    as: caughtError
    do:
      - logError:
          call: http
          with:
            method: post
            endpoint:
              uri: https://logging.example.com/errors
            body:
              workflowId: ${ $workflow.id }
              taskName: ${ $task.name }
              error: ${ .caughtError }
              timestamp: ${ now }
      - handleError:
          call: errorHandler
          with:
            error: ${ .caughtError }
```
Metrics and Alerting
```yaml
processWithMetrics:
  try:
    call: monitoredOperation
  catch:
    errors: {}
    as: operationError
    do:
      - incrementErrorCounter:
          call: http
          with:
            method: post
            endpoint:
              uri: https://metrics.example.com/increment
            body:
              metric: operation_errors
              tags:
                service: ${ $task.name }
                errorType: ${ .operationError.type }
      - sendAlert:
          if: ${ .operationError.status >= 500 }
          call: alertingService
          with:
            severity: high
            message: ${ .operationError.detail }
```
Best Practices
- **Catch specific errors first.** Handle specific error types before catching general errors to provide targeted recovery strategies.
- **Use appropriate retry strategies.** Apply exponential backoff for transient failures and set reasonable retry limits to avoid infinite loops.
- **Log all errors.** Always log errors with sufficient context for debugging and monitoring.
- **Provide fallback mechanisms.** Implement fallback strategies such as cached data or default values when services are unavailable.
- **Clean up resources.** Use compensation tasks to release resources when errors occur midway through a process.
- **Set appropriate timeouts.** Combine error handling with timeout configuration to prevent workflows from hanging indefinitely.
- **Use standard error types.** Prefer standard error types for common failure scenarios to ensure consistency across workflows.
Common Pitfalls
Catching Too Broadly
```yaml
# Bad: Catches and ignores all errors
try:
  call: importantOperation
catch:
  errors: {}
  # No error handling or logging
```

```yaml
# Good: Specific error handling with logging
try:
  call: importantOperation
catch:
  errors:
    with:
      type: https://example.com/errors/expected-error
  as: error
  do:
    - logError:
        call: logger
        with:
          error: ${ .error }
```
Infinite Retry Loops
```yaml
# Bad: No retry limit
retry:
  delay:
    seconds: 1
```

```yaml
# Good: Reasonable retry limit
retry:
  delay:
    seconds: 1
  limit:
    attempt:
      count: 5
```
Not Handling Compensation
```yaml
# Bad: No compensation for partial failures
do:
  - reserveResource:
      call: reserveService
  - processResource:
      call: processService # If this fails, the resource remains reserved
```

```yaml
# Good: Proper compensation
do:
  - reserveResource:
      try:
        call: reserveService
      catch:
        errors: {}
        then: end
  - processResource:
      try:
        call: processService
      catch:
        errors: {}
        do:
          - releaseResource:
              call: releaseService
```