Overview
Serverless Workflow is designed with resilience in mind, acknowledging that errors are an inevitable part of any system. The DSL provides mechanisms to identify, describe, and handle errors, so a workflow can recover gracefully from a wide range of failure scenarios rather than simply aborting.
Errors
Errors in Serverless Workflow are described using the Problem Details format (RFC 7807). This standardizes how errors are communicated; the `instance` property is a JSON Pointer that identifies the specific workflow component that raised the error.
Error Structure
An error follows this structure:

```yaml
type: https://serverlessworkflow.io/spec/1.0.0/errors/communication
title: Service Unavailable
status: 503
detail: The service is currently unavailable. Please try again later.
instance: /do/getPetById
```

- `type`: A URI reference that identifies the error type
- `title`: A short, human-readable summary of the error type
- `status`: The HTTP status code for this occurrence of the error
- `detail`: A human-readable explanation specific to this occurrence of the error
- `instance`: A JSON Pointer to the workflow component that raised the error
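Because `instance` is a standard JSON Pointer (RFC 6901), tooling can resolve it mechanically against the workflow definition. A minimal sketch with a hand-rolled resolver, assuming a simplified workflow document shaped as nested maps (the real DSL nests tasks differently):

```python
# Illustrative sketch (not part of the spec): resolve an error's
# `instance` JSON Pointer (RFC 6901) to the component that raised it.

def resolve_pointer(document, pointer):
    """Walk an RFC 6901 JSON Pointer through nested dicts/lists."""
    target = document
    for token in pointer.lstrip("/").split("/"):
        # RFC 6901 unescaping: "~1" -> "/", then "~0" -> "~"
        token = token.replace("~1", "/").replace("~0", "~")
        target = target[int(token)] if isinstance(target, list) else target[token]
    return target

# Hypothetical, simplified workflow document for illustration only.
workflow = {"do": {"getPetById": {"call": "http"}}}
error = {"instance": "/do/getPetById"}
print(resolve_pointer(workflow, error["instance"]))  # → {'call': 'http'}
```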
Standard Error Types
The Serverless Workflow specification defines several standard error types to describe commonly known errors:
| Error Type | Status | Description |
|---|---|---|
| https://serverlessworkflow.io/spec/1.0.0/errors/configuration | 500 | Configuration error |
| https://serverlessworkflow.io/spec/1.0.0/errors/validation | 400 | Validation error |
| https://serverlessworkflow.io/spec/1.0.0/errors/expression | 400 | Expression evaluation error |
| https://serverlessworkflow.io/spec/1.0.0/errors/communication | 500 | Communication error |
| https://serverlessworkflow.io/spec/1.0.0/errors/timeout | 408 | Timeout error |
| https://serverlessworkflow.io/spec/1.0.0/errors/authorization | 403 | Authorization error |
Using these standard error types ensures that workflows behave consistently across different runtimes and allows authors to rely on predictable error handling and recovery processes.
Defining Custom Errors
You can define reusable custom errors in the `use` section:

```yaml
use:
  errors:
    businessRuleViolation:
      type: https://example.com/errors/business-rule
      status: 422
      title: Business Rule Violation
    insufficientFunds:
      type: https://example.com/errors/insufficient-funds
      status: 402
      title: Insufficient Funds
    invalidOperation:
      type: https://example.com/errors/invalid-operation
      status: 400
      title: Invalid Operation
```
Try-Catch Pattern
The `try` task lets you attempt a task and handle any errors it raises gracefully:
Basic Try-Catch
```yaml
tryExample:
  try:
    call: http
    with:
      method: get
      endpoint:
        uri: https://api.example.com/data
  catch:
    errors:
      with:
        status: 503
    as: serviceError
```

- `catch`: Error handling configuration
- `catch.errors.with`: Error filter specifying which errors to catch
- `catch.as`: Variable name to store the caught error
Catching Specific Errors
```yaml
processPayment:
  try:
    call: paymentService
    with:
      amount: ${ .orderTotal }
      customerId: ${ .customerId }
  catch:
    errors:
      with:
        type: https://example.com/errors/insufficient-funds
    do:
      - notifyCustomer:
          call: notificationService
          with:
            message: Insufficient funds for payment
```
Multiple Error Handlers
```yaml
processOrder:
  try:
    call: orderService
    with:
      orderId: ${ .orderId }
  catch:
    errors:
      one:
        - with:
            type: https://example.com/errors/validation
          do:
            - handleValidationError:
                call: errorLogger
                with:
                  error: Validation failed
        - with:
            type: https://example.com/errors/insufficient-inventory
          do:
            - handleInventoryError:
                call: inventoryService
                with:
                  action: backorder
        - with:
            status: 503
          retry:
            delay:
              seconds: 3
            limit:
              attempt:
                count: 5
```
Catch All Errors
```yaml
riskyOperation:
  try:
    call: unreliableService
    with:
      data: ${ .inputData }
  catch:
    errors: {}
    as: caughtError
    do:
      - logError:
          call: logger
          with:
            error: ${ .caughtError }
      - useDefault:
          set:
            result: default-value
```

An empty `errors` object catches all errors, providing a fallback for any failure.
Retry Policies
Retry policies allow workflows to automatically retry failed operations, which is especially useful for handling transient failures.
Basic Retry
```yaml
fetchData:
  try:
    call: http
    with:
      method: get
      endpoint:
        uri: https://api.example.com/data
  catch:
    errors:
      with:
        status: 503
    retry:
      delay:
        seconds: 3
      limit:
        attempt:
          count: 5
```

- `retry.delay`: Duration to wait before retrying
Retry with Exponential Backoff
```yaml
reliableCall:
  try:
    call: http
    with:
      method: post
      endpoint:
        uri: https://api.example.com/process
      body: ${ .data }
  catch:
    errors:
      with:
        type: https://serverlessworkflow.io/spec/1.0.0/errors/communication
    retry:
      delay:
        seconds: 1
      backoff:
        exponential:
          factor: 2
      limit:
        attempt:
          count: 5
```

- `retry.backoff.exponential`: Exponential backoff configuration
- `retry.backoff.exponential.factor`: Multiplication factor for each retry delay
With exponential backoff and factor 2:
- Attempt 1: 1 second delay
- Attempt 2: 2 seconds delay
- Attempt 3: 4 seconds delay
- Attempt 4: 8 seconds delay
- Attempt 5: 16 seconds delay
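The schedule above is simple arithmetic: each delay is the base delay multiplied by the factor raised to the number of prior attempts. A minimal sketch of the calculation a runtime might perform (not part of the DSL itself):

```python
# Sketch: exponential backoff delays for base delay 1s, factor 2, 5 attempts.
def exponential_delays(base_seconds, factor, attempts):
    """Delay before attempt n is base * factor^(n-1)."""
    return [base_seconds * factor ** (n - 1) for n in range(1, attempts + 1)]

print(exponential_delays(1, 2, 5))  # → [1, 2, 4, 8, 16]
```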
Retry with Linear Backoff
```yaml
steadyRetry:
  try:
    call: dataService
  catch:
    errors:
      with:
        status: 503
    retry:
      delay:
        seconds: 3
      backoff:
        linear: {}
      limit:
        attempt:
          count: 5
```

- `retry.backoff.linear`: Linear backoff configuration (the delay increases by the same amount each time)
With linear backoff:
- Attempt 1: 3 seconds delay
- Attempt 2: 6 seconds delay
- Attempt 3: 9 seconds delay
- Attempt 4: 12 seconds delay
- Attempt 5: 15 seconds delay
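Here each delay grows by the initial delay, so the delay before attempt n is simply the base delay times n. A minimal sketch of that calculation (illustrative only, not part of the DSL):

```python
# Sketch: linear backoff delays for base delay 3s, 5 attempts.
def linear_delays(base_seconds, attempts):
    """Delay before attempt n is base * n."""
    return [base_seconds * n for n in range(1, attempts + 1)]

print(linear_delays(3, 5))  # → [3, 6, 9, 12, 15]
```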
Time-Based Retry Limit
```yaml
timedRetry:
  try:
    call: longRunningService
  catch:
    errors:
      with:
        type: https://serverlessworkflow.io/spec/1.0.0/errors/timeout
    retry:
      delay:
        seconds: 5
      limit:
        duration:
          minutes: 10
```

- `retry.limit.duration`: Maximum total time to spend retrying
Reusable Retry Policies
Define retry policies once and reuse them across multiple tasks:
```yaml
use:
  retries:
    standardRetry:
      delay:
        seconds: 2
      backoff:
        exponential:
          factor: 2
      limit:
        attempt:
          count: 5
    aggressiveRetry:
      delay:
        milliseconds: 500
      backoff:
        exponential:
          factor: 1.5
      limit:
        attempt:
          count: 10
    patientRetry:
      delay:
        seconds: 10
      backoff:
        linear: {}
      limit:
        duration:
          minutes: 30
do:
  - criticalOperation:
      try:
        call: criticalService
      catch:
        errors:
          with:
            status: 503
        retry: standardRetry
  - rapidOperation:
      try:
        call: fastService
      catch:
        errors:
          with:
            type: https://serverlessworkflow.io/spec/1.0.0/errors/communication
        retry: aggressiveRetry
```
Advanced Error Handling Patterns
Error Recovery with Fallback
```yaml
do:
  - tryPrimaryService:
      try:
        call: http
        with:
          method: get
          endpoint:
            uri: https://primary.example.com/api
      catch:
        errors:
          with:
            status: 503
        retry:
          delay:
            seconds: 2
          limit:
            attempt:
              count: 3
        as: primaryError
  - tryFallbackService:
      if: ${ .primaryError != null }
      try:
        call: http
        with:
          method: get
          endpoint:
            uri: https://fallback.example.com/api
      catch:
        errors:
          with:
            status: 503
        as: fallbackError
  - useDefaultData:
      if: ${ .primaryError != null and .fallbackError != null }
      set:
        result:
          data: default-data
          source: default
```
Circuit Breaker Pattern
```yaml
do:
  - checkCircuitState:
      call: circuitBreakerService
      with:
        service: externalApi
  - callService:
      if: ${ .checkCircuitState.output.state != "open" }
      try:
        call: http
        with:
          method: get
          endpoint:
            uri: https://external.example.com/api
      catch:
        errors: {}
        do:
          - recordFailure:
              call: circuitBreakerService
              with:
                action: recordFailure
                service: externalApi
  - useCachedData:
      if: ${ .checkCircuitState.output.state == "open" }
      call: cacheService
      with:
        key: lastKnownGood
```
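The `circuitBreakerService` above is a hypothetical external service; the workflow only consumes its `state` output and reports failures to it. For illustration, a minimal sketch of the state tracking such a service might implement (the threshold and cooldown values are arbitrary assumptions):

```python
import time

class CircuitBreaker:
    """Per-service breaker: closed -> open after N consecutive failures,
    half-open again once a cooldown period has elapsed."""

    def __init__(self, failure_threshold=3, cooldown_seconds=30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.failures = 0
        self.opened_at = None  # monotonic timestamp when the breaker opened

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def state(self):
        if self.opened_at is None:
            return "closed"
        if time.monotonic() - self.opened_at >= self.cooldown_seconds:
            return "half-open"  # allow one probe request through
        return "open"

breaker = CircuitBreaker(failure_threshold=2)
breaker.record_failure()
breaker.record_failure()
print(breaker.state())  # → open
```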
Saga Pattern with Compensation
```yaml
do:
  - reserveInventory:
      try:
        call: inventoryService
        with:
          action: reserve
          items: ${ .orderItems }
      catch:
        errors: {}
        then: end
      export:
        as: ${ $context + { inventoryReserved: true } }
  - processPayment:
      try:
        call: paymentService
        with:
          amount: ${ .orderTotal }
      catch:
        errors: {}
        do:
          - compensateInventory:
              call: inventoryService
              with:
                action: release
                items: ${ .orderItems }
        then: end
      export:
        as: ${ $context + { paymentProcessed: true } }
  - confirmOrder:
      try:
        call: orderService
        with:
          action: confirm
      catch:
        errors: {}
        do:
          - compensatePayment:
              call: paymentService
              with:
                action: refund
                amount: ${ .orderTotal }
          - compensateInventory:
              call: inventoryService
              with:
                action: release
                items: ${ .orderItems }
        then: end
```
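Stripped of the DSL, the saga follows one rule: each completed step registers a compensation, and a failure unwinds the registered compensations in reverse order. A minimal sketch of that logic (the step and compensation names are illustrative, not from the spec):

```python
def run_saga(steps):
    """steps: list of (action, compensation) callables.
    On failure, run compensations for completed steps in reverse order."""
    compensations = []
    try:
        for action, compensation in steps:
            action()
            compensations.append(compensation)
    except Exception:
        for compensation in reversed(compensations):
            compensation()  # undo completed work, newest first
        raise

# Hypothetical order flow: payment fails, so the inventory reservation
# is released; the payment step never registered its own compensation.
log = []
def reserve(): log.append("reserve")
def release(): log.append("release")
def pay(): raise RuntimeError("payment failed")
def refund(): log.append("refund")

try:
    run_saga([(reserve, release), (pay, refund)])
except RuntimeError:
    pass
print(log)  # → ['reserve', 'release']
```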
Nested Try-Catch
```yaml
processWithRecovery:
  try:
    do:
      - step1:
          try:
            call: service1
          catch:
            errors:
              with:
                status: 503
            retry:
              delay:
                seconds: 2
              limit:
                attempt:
                  count: 3
      - step2:
          try:
            call: service2
            with:
              data: ${ .step1.output }
          catch:
            errors:
              with:
                type: https://example.com/errors/validation
            do:
              - fixData:
                  call: dataFixer
                  with:
                    data: ${ .step1.output }
              - retryStep2:
                  call: service2
                  with:
                    data: ${ .fixData.output }
  catch:
    errors: {}
    do:
      - logFailure:
          call: logger
          with:
            message: Complete process failed
      - notifyAdmin:
          call: notificationService
          with:
            recipient: [email protected]
            message: Critical workflow failure
```
Error Handling with Context Preservation
```yaml
do:
  - initializeContext:
      set:
        processId: ${ .requestId }
        startTime: ${ now }
        status: processing
  - processData:
      try:
        call: processor
        with:
          data: ${ .inputData }
      catch:
        errors: {}
        as: processingError
      export:
        as: ${ $context + {
          error: .processingError,
          status: "failed",
          failedAt: now
        } }
  - finalizeStatus:
      set:
        finalStatus: ${ if $context.error then "failed" else "success" end }
```
Raising Errors
The `raise` task explicitly raises an error:

```yaml
validateInput:
  call: validator
  with:
    data: ${ .inputData }
checkValidation:
  if: ${ .validateInput.output.isValid == false }
  raise:
    error:
      type: https://serverlessworkflow.io/spec/1.0.0/errors/validation
      status: 400
      title: Validation Failed
      detail: ${ .validateInput.output.message }
```

- `raise.error`: The error to raise, following the RFC 7807 Problem Details format
Conditional Error Raising
```yaml
do:
  - checkBusinessRules:
      call: businessRuleEngine
      with:
        data: ${ .orderData }
  - raiseIfViolation:
      if: ${ .checkBusinessRules.output.violations | length > 0 }
      raise:
        error:
          type: https://example.com/errors/business-rule
          status: 422
          title: Business Rule Violation
          detail: ${ .checkBusinessRules.output.violations | map(.message) | join(", ") }
```
Error Logging and Monitoring
Logging Errors
```yaml
processWithLogging:
  try:
    call: riskyOperation
  catch:
    errors: {}
    as: caughtError
    do:
      - logError:
          call: http
          with:
            method: post
            endpoint:
              uri: https://logging.example.com/errors
            body:
              workflowId: ${ $workflow.id }
              taskName: ${ $task.name }
              error: ${ .caughtError }
              timestamp: ${ now }
      - handleError:
          call: errorHandler
          with:
            error: ${ .caughtError }
```
Metrics and Alerting
```yaml
processWithMetrics:
  try:
    call: monitoredOperation
  catch:
    errors: {}
    as: operationError
    do:
      - incrementErrorCounter:
          call: http
          with:
            method: post
            endpoint:
              uri: https://metrics.example.com/increment
            body:
              metric: operation_errors
              tags:
                service: ${ $task.name }
                errorType: ${ .operationError.type }
      - sendAlert:
          if: ${ .operationError.status >= 500 }
          call: alertingService
          with:
            severity: high
            message: ${ .operationError.detail }
```
Best Practices
- **Catch specific errors first.** Handle specific error types before catching general errors to provide targeted recovery strategies.
- **Use appropriate retry strategies.** Apply exponential backoff for transient failures and set reasonable retry limits to avoid infinite loops.
- **Log all errors.** Always log errors with sufficient context for debugging and monitoring.
- **Provide fallback mechanisms.** Implement fallback strategies such as cached data or default values when services are unavailable.
- **Clean up resources.** Use compensation tasks to release resources when errors occur midway through a process.
- **Set appropriate timeouts.** Combine error handling with timeout configuration to prevent workflows from hanging indefinitely.
- **Use standard error types.** Prefer standard error types for common failure scenarios to ensure consistency across workflows.
Common Pitfalls
Catching Too Broadly
```yaml
# Bad: Catches and ignores all errors
try:
  call: importantOperation
catch:
  errors: {}
  # No error handling or logging
```

```yaml
# Good: Specific error handling with logging
try:
  call: importantOperation
catch:
  errors:
    with:
      type: https://example.com/errors/expected-error
  as: error
  do:
    - logError:
        call: logger
        with:
          error: ${ .error }
```
Infinite Retry Loops
```yaml
# Bad: No retry limit
retry:
  delay:
    seconds: 1
```

```yaml
# Good: Reasonable retry limit
retry:
  delay:
    seconds: 1
  limit:
    attempt:
      count: 5
```
Not Handling Compensation
```yaml
# Bad: No compensation for partial failures
do:
  - reserveResource:
      call: reserveService
  - processResource:
      call: processService # If this fails, the resource remains reserved
```

```yaml
# Good: Proper compensation
do:
  - reserveResource:
      try:
        call: reserveService
      catch:
        errors: {}
        then: end
  - processResource:
      try:
        call: processService
      catch:
        errors: {}
        do:
          - releaseResource:
              call: releaseService
```