
What is durability?

Durability in DBOS means your workflows automatically recover from any failure or interruption. When your application crashes, is killed, or restarts, DBOS detects all in-progress workflows and resumes them from the last completed step. You don’t need to write any recovery code—it’s built into the framework.
from dbos import DBOS

@DBOS.workflow()
def durable_order_processing(order_id: str):
    charge_payment(order_id)      # Step 1: If app crashes here
    reserve_inventory(order_id)   # Step 2: ...recovery skips Step 1
    ship_order(order_id)          # Step 3: ...and continues from Step 2
    return "order completed"

# If the app crashes after Step 1 completes, when it restarts:
# - Step 1 is skipped (already completed, result is reused)
# - Step 2 executes
# - Step 3 executes
# - Workflow completes successfully

How durability works

DBOS achieves durability by checkpointing workflow execution in Postgres:
  1. Workflow starts - When you call a workflow, DBOS creates a record in the system database with a unique workflow ID and initial state.
  2. Each step is recorded - Before executing a step or transaction, DBOS records the operation in the database. After the operation completes, DBOS stores its result.
  3. Progress is checkpointed - As your workflow progresses, each completed step is permanently recorded in Postgres.
  4. Recovery on restart - When your application restarts, DBOS queries the database for incomplete workflows and automatically resumes each one from its last checkpoint.
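The checkpoint-and-replay mechanism can be pictured with a toy in-memory sketch, where a plain dict stands in for the Postgres system database. This is an illustration of the idea, not DBOS's actual implementation:

```python
# Toy illustration of checkpoint/replay: completed step results are
# stored by step index; a replay reuses stored results instead of
# re-executing the step.
checkpoints: dict[int, str] = {}  # stands in for the Postgres system database
executions: list[str] = []        # tracks which steps actually ran

def run_step(index, name):
    if index in checkpoints:       # already completed: reuse stored result
        return checkpoints[index]
    executions.append(name)        # not yet completed: actually execute
    result = f"{name}-done"
    checkpoints[index] = result    # checkpoint the result
    return result

def order_workflow(crash_after=None):
    for i, name in enumerate(["charge_payment", "reserve_inventory", "ship_order"]):
        run_step(i, name)
        if crash_after == i:
            return None            # simulate a crash mid-workflow
    return "order completed"

order_workflow(crash_after=0)      # crash after step 1 completes
result = order_workflow()          # "recovery": replay from the beginning
# charge_payment ran only once; the replay reused its checkpoint
```

Note that the replay walks the workflow from the top, but the first step is served from the checkpoint, so it executes exactly once.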

What gets checkpointed?

DBOS stores in the system database:
  • Workflow metadata: ID, function name, start time, status
  • Workflow inputs: Serialized arguments passed to the workflow
  • Step results: Output of each completed step and transaction
  • Workflow output: Final return value when the workflow completes
  • Error information: Exceptions raised during execution
  • Events and notifications: Durable messages sent/received by workflows
Checkpointing happens automatically. You don’t need to call any save or commit functions.

Recovery guarantees

Workflows resume from last checkpoint

When a workflow is interrupted, DBOS guarantees:
  1. Already-completed steps are skipped - Their stored results are reused
  2. In-progress steps are retried - The step that was executing when the crash occurred runs again
  3. Not-yet-started steps execute normally - The workflow continues forward
@DBOS.workflow()
def multi_step_workflow():
    step_a()  # ✅ Completed before crash → skipped on recovery
    step_b()  # ❌ Was running during crash → retried on recovery
    step_c()  # ⏸️ Not started yet → runs on recovery
    step_d()  # ⏸️ Not started yet → runs on recovery

At-least-once execution

Each step executes at least once but may execute multiple times if:
  • The step was running during a crash (replayed on recovery)
  • The step failed and is being retried
  • A workflow restart occurs mid-step
Important: Design your steps to be idempotent (safe to run multiple times with the same inputs). This ensures correct behavior even if a step is retried.
# ❌ Not idempotent - dangerous!
@DBOS.step()
def bad_step(account_id: str):
    balance = get_balance(account_id)
    new_balance = balance + 100  # Adds 100 every time
    set_balance(account_id, new_balance)

# ✅ Idempotent - safe!
@DBOS.step()
def good_step(account_id: str, amount: float):
    # Always sets the balance to a specific value
    set_balance(account_id, amount)

Deterministic replay

For recovery to work correctly, workflows must be deterministic: given the same inputs, they must execute the same sequence of steps in the same order.

❌ Non-deterministic workflow (broken)

import random
from datetime import datetime

@DBOS.workflow()
def non_deterministic_workflow():
    # Different behavior on replay!
    if random.random() > 0.5:
        step_a()
    else:
        step_b()
    
    # Different timestamp on replay!
    if datetime.now().hour < 12:
        morning_task()
Problem: On recovery, the workflow might take a different path than during original execution, leading to incorrect state.

✅ Deterministic workflow (correct)

@DBOS.workflow()
def deterministic_workflow():
    # Move non-deterministic logic into steps
    random_value = get_random_value()  # Step returns consistent result
    if random_value > 0.5:
        step_a()
    else:
        step_b()
    
    current_hour = get_current_hour()  # Step returns consistent result
    if current_hour < 12:
        morning_task()

@DBOS.step()
def get_random_value():
    import random
    return random.random()  # Result is checkpointed

@DBOS.step()
def get_current_hour():
    from datetime import datetime
    return datetime.now().hour  # Result is checkpointed
Solution: Put all non-deterministic operations (random numbers, timestamps, API calls) in steps. Their results are checkpointed and reused during recovery.

Workflow recovery lifecycle

When you call DBOS.launch(), this happens:
  1. Connect to system database - DBOS connects to Postgres and checks the workflow execution table.
  2. Find incomplete workflows - DBOS queries for all workflows with status PENDING (started but not completed).
  3. Resume each workflow - For each incomplete workflow, DBOS:
    • Loads the workflow function by name
    • Deserializes the original arguments
    • Replays the workflow from the beginning
    • Skips completed steps (reusing stored results)
    • Executes incomplete steps
  4. Handle errors - If recovery fails repeatedly, DBOS marks the workflow as ERROR after max recovery attempts.
# Start your application
DBOS(config=config)
DBOS.launch()  # Automatically recovers all incomplete workflows

# Your application is now running
# All interrupted workflows have been resumed
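Conceptually, the launch-time recovery pass is a loop over incomplete workflows. Here is a stdlib-only sketch of that loop, with a list of dicts standing in for the workflow execution table (not DBOS's real code):

```python
# Toy model of launch-time recovery: find PENDING workflows and resume each.
workflow_table = [
    {"id": "wf-1", "status": "SUCCESS"},   # finished before the crash
    {"id": "wf-2", "status": "PENDING"},   # interrupted mid-execution
    {"id": "wf-3", "status": "PENDING"},   # interrupted mid-execution
]

def recover_pending(table):
    resumed = []
    for wf in table:
        if wf["status"] == "PENDING":      # started but never completed
            wf["status"] = "SUCCESS"       # replay runs it to completion
            resumed.append(wf["id"])
    return resumed

resumed_ids = recover_pending(workflow_table)
# Only the two PENDING workflows were resumed
```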

Recovery configuration

Control recovery behavior through configuration and decorators:

Max recovery attempts

Limit how many times DBOS attempts to recover a workflow:
from dbos import DBOS, DBOSConfig

# Set globally in config
config = DBOSConfig(
    max_recovery_attempts=5  # Default: 50
)

# Or per-workflow via the decorator
@DBOS.workflow(max_recovery_attempts=10)
def my_workflow():
    # This workflow will be recovered at most 10 times
    pass
After exceeding max recovery attempts, the workflow is marked as ERROR and stops retrying. You can inspect and manually restart such workflows using DBOSClient.
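The attempt limit can be sketched as a per-workflow counter that is bumped on every recovery attempt; once it passes the limit, the workflow is parked in ERROR. The names below are illustrative, and the real bookkeeping lives in the system database:

```python
MAX_RECOVERY_ATTEMPTS = 3  # illustrative; the section above cites a default of 50

def attempt_recovery(wf):
    # Each recovery attempt increments a counter stored with the workflow.
    wf["attempts"] += 1
    if wf["attempts"] > MAX_RECOVERY_ATTEMPTS:
        wf["status"] = "ERROR"   # give up: mark for manual inspection
    return wf["status"]

wf = {"id": "wf-stuck", "status": "PENDING", "attempts": 0}
statuses = [attempt_recovery(wf) for _ in range(4)]
# The first three attempts leave it PENDING; the fourth exceeds the
# limit and marks it ERROR
```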

Recovery timeout

Workflows can have timeouts to prevent them from running indefinitely:
from dbos import SetWorkflowTimeout

@DBOS.workflow()
def timed_workflow():
    long_running_process()

# Set the timeout when invoking the workflow
with SetWorkflowTimeout(300):  # 5 minutes
    timed_workflow()

Handling workflow updates

What happens when you update workflow code while workflows are in progress?

Safe updates

These changes are safe:
  • Adding new steps at the end of a workflow
  • Changing step implementation (as long as the step name stays the same)
  • Modifying workflow logic after all active workflows complete
# Version 1 (deployed, workflows running)
@DBOS.workflow()
def process_order(order_id: str):
    charge_payment(order_id)
    ship_order(order_id)

# Version 2 (safe update - adds new step at end)
@DBOS.workflow()
def process_order(order_id: str):
    charge_payment(order_id)
    ship_order(order_id)
    send_confirmation_email(order_id)  # New step

Unsafe updates

These changes can break recovery:
  • Removing or reordering existing steps in workflows with active executions
  • Changing a step’s name (DBOS can’t match the stored checkpoint)
  • Changing workflow parameters (deserialization may fail)
# Version 1 (deployed, workflows running)
@DBOS.workflow()
def process_order(order_id: str):
    charge_payment(order_id)
    ship_order(order_id)

# Version 2 (UNSAFE - reordered steps)
@DBOS.workflow()
def process_order(order_id: str):
    ship_order(order_id)  # Now runs first!
    charge_payment(order_id)  # Now runs second!
# Recovery will replay with new order, causing issues

Application versioning

DBOS automatically computes an application version hash based on your workflow code. When you deploy a new version, DBOS detects the change:
# DBOS tracks app version automatically
app_version = dbos.app_version  # e.g., "a3f7d8e9..."

# Workflows from old versions can still recover
# But you can query workflows by app version:
client = DBOSClient(system_database_url)
old_workflows = client.list_workflows(app_version="a3f7d8e9...")
Use DBOSClient.list_workflows() to find workflows from previous versions that are still running. You can cancel or migrate them before deploying breaking changes.
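The version hash can be thought of as a digest over the registered workflow code: any change to a workflow yields a different version string. Here is a toy stdlib sketch of that idea (DBOS's real derivation may differ; it hashes bytecode and constants here purely for illustration):

```python
import hashlib

def toy_app_version(*workflow_funcs):
    # Digest the bytecode and constants of every registered workflow;
    # changing any workflow's code changes the version string.
    digest = hashlib.sha256()
    for fn in workflow_funcs:
        digest.update(fn.__code__.co_code)
        digest.update(repr(fn.__code__.co_consts).encode())
    return digest.hexdigest()[:12]

def process_order_v1(order_id):
    return f"charged+shipped {order_id}"

def process_order_v2(order_id):
    return f"charged+shipped+emailed {order_id}"

v1 = toy_app_version(process_order_v1)
v2 = toy_app_version(process_order_v2)
# v1 != v2: the code change produces a new application version
```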

Durable primitives

Beyond workflow recovery, DBOS provides durable primitives:

Durable sleep

Sleep for any duration (seconds to weeks) and resume exactly on schedule:
@DBOS.workflow()
def reminder_workflow(user_id: str):
    send_initial_message(user_id)
    
    # Sleep for 7 days - survives app restarts
    DBOS.sleep(7 * 24 * 60 * 60)
    
    send_reminder(user_id)
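Durable sleep works by checkpointing an absolute wakeup time rather than a duration, so a replay after a crash sleeps only the remaining time instead of restarting the clock. A toy sketch of that mechanism (a dict stands in for the system database; short durations keep the example fast):

```python
import time

checkpointed_wakeup: dict[str, float] = {}  # stands in for the system database

def durable_sleep(workflow_id, seconds):
    # First execution checkpoints an absolute wakeup time; a replay
    # reuses it, so a crash doesn't restart the clock.
    wakeup = checkpointed_wakeup.setdefault(workflow_id, time.time() + seconds)
    remaining = max(0.0, wakeup - time.time())
    time.sleep(remaining)
    return remaining

first = durable_sleep("wf-1", 0.2)   # sleeps ~0.2s, checkpoints the wakeup time
replay = durable_sleep("wf-1", 0.2)  # replay after a "crash": wakeup already passed
# replay slept 0s because the checkpointed wakeup time was already reached
```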

Durable events

Send and receive events that persist across restarts:
@DBOS.workflow()
def wait_for_approval(request_id: str):
    submit_request(request_id)
    
    # Wait for an approval message - survives app restarts
    approval = DBOS.recv(topic="approval", timeout_seconds=86400)
    
    if approval:
        process_approved_request(request_id)
    else:
        handle_timeout(request_id)

# Another workflow sends the approval to the waiting workflow by its ID
@DBOS.workflow()
def approve_request(waiting_workflow_id: str):
    DBOS.send(waiting_workflow_id, {"approved": True}, topic="approval")

Durable queues

Enqueue work that survives crashes:
from dbos import Queue

processing_queue = Queue("tasks", concurrency=10)

@DBOS.workflow()
def submit_tasks(tasks: list[dict]):
    # All enqueued tasks will be processed even if app crashes
    handles = [processing_queue.enqueue(process_task, t) for t in tasks]
    return [h.get_result() for h in handles]
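The reason enqueued work survives crashes is that each task is durably recorded at enqueue time, before any processing begins. A stdlib-only sketch of that invariant (a list stands in for the queue state in Postgres):

```python
# Toy durable queue: tasks are recorded durably at enqueue time, so a
# crash between enqueue and processing loses nothing.
durable_tasks: list[dict] = []   # stands in for queue state in Postgres

def enqueue(task):
    durable_tasks.append({"task": task, "status": "ENQUEUED"})

def process_next():
    for entry in durable_tasks:
        if entry["status"] == "ENQUEUED":
            entry["status"] = "SUCCESS"   # a worker completes the task
            return entry["task"]
    return None

for t in [{"n": 1}, {"n": 2}, {"n": 3}]:
    enqueue(t)

process_next()   # one task completes, then the app "crashes"
# On restart, the durably recorded queue still holds the remaining tasks
remaining = [e["task"] for e in durable_tasks if e["status"] == "ENQUEUED"]
```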

Observability

Monitor workflow execution and recovery:

Query workflow status

from dbos import DBOSClient

client = DBOSClient(system_database_url)

# Get workflow status
status = client.get_workflow_status(workflow_id)
print(f"Status: {status.status}")  # PENDING, SUCCESS, ERROR
print(f"Started: {status.created_at}")
print(f"Completed: {status.updated_at}")

# List all workflows
workflows = client.list_workflows(
    status="ERROR",
    start_time="2025-01-01T00:00:00Z"
)
for wf in workflows:
    print(f"{wf.workflow_id}: {wf.status}")

Inspect workflow steps

# Get all steps in a workflow
steps = client.list_workflow_steps(workflow_id)

for step in steps:
    print(f"Step: {step.name}")
    print(f"Status: {step.status}")  # SUCCESS, ERROR, PENDING
    if step.error:
        print(f"Error: {step.error}")

Logs and traces

DBOS integrates with OpenTelemetry for distributed tracing:
# Enable OTLP export in config
config = DBOSConfig(
    otlp_endpoint="https://your-collector:4318"
)

# Workflows, steps, and transactions are automatically traced
# View traces in Jaeger, Honeycomb, or other OTLP-compatible tools

Best practices

Ensure steps can be safely retried without side effects.
# Use unique IDs for external operations
@DBOS.step()
def idempotent_api_call(request_id: str, data: dict):
    # API uses request_id to deduplicate
    return api.process(idempotency_key=request_id, data=data)
Move all non-deterministic operations into steps.
@DBOS.workflow()
def deterministic_workflow():
    # Don't: random.random(), datetime.now(), uuid.uuid4()
    # Do: Call these in steps
    random_val = get_random()  # Step
    timestamp = get_timestamp()  # Step
Set custom workflow IDs to prevent duplicate execution.
from dbos import SetWorkflowID

# Process each event exactly once
with SetWorkflowID(f"event-{event_id}"):
    process_event(event_data)
Set up alerts for workflows in ERROR status.
# Check for failed workflows
failed = client.list_workflows(status="ERROR")
if failed:
    send_alert(f"{len(failed)} workflows failed")
Wait for active workflows to complete before deploying breaking changes, or use versioning strategies.
# Check for active workflows before deploy
active = client.list_workflows(status="PENDING")
if active:
    print(f"Warning: {len(active)} workflows still running")

Next steps

Workflows

Deep dive into workflows

Error handling

Handle failures gracefully

Workflow management

Manage workflows programmatically

Configuration

Configure recovery behavior
