
The fundamental insight

Interruption is inevitable. Servers restart, deployments roll out, processes crash. Traditional code loses all in-flight state when any of this happens. Durable execution treats interruption as a first-class concern: every meaningful result is persisted before the workflow continues, so when execution resumes it picks up exactly where it left off.
Workflow starts → Executes steps → Each step result is persisted

Server crashes → Workflow resumes → Cached results are replayed

                  Execution continues from last incomplete step
Two primitives create durability boundaries: Workflow.step() and Workflow.sleep(). The code inside a step's execute effect runs non-durably (only its final result is persisted), so step and sleep calls must never be nested inside another step's execute effect.

Step caching

Every Workflow.step() call persists its result to Durable Object storage before returning. On any subsequent execution of the same workflow instance, completed steps check the cache first:
  • Cache miss — the effect runs, the result is stored, and the value is returned.
  • Cache hit — the effect is skipped entirely and the stored value is returned.
This is what makes replay safe. If a workflow has completed steps A and B before crashing, a resumed execution returns cached results for both without re-running any side effects.
const order = yield* Workflow.step({
  name: "Fetch order",
  execute: fetchOrder(orderId),   // only runs if not already cached
});
Cached step results include metadata alongside the value:
interface CachedStepResult<T> {
  value: T
  meta: {
    completedAt: number
    attempt: number
    durationMs: number
  }
}
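The cache-first behavior can be sketched in plain TypeScript. This is an illustrative model, not the library's implementation: `runStep`, the in-memory `Map`, and the single-attempt metadata are stand-ins for Durable Object storage and the real executor.

```typescript
interface CachedStepResult<T> {
  value: T;
  meta: { completedAt: number; attempt: number; durationMs: number };
}

// In-memory stand-in for Durable Object storage (illustrative only).
const cache = new Map<string, CachedStepResult<unknown>>();

async function runStep<T>(name: string, execute: () => Promise<T>): Promise<T> {
  const hit = cache.get(name);
  if (hit !== undefined) return hit.value as T; // cache hit: skip the effect

  const startedAt = Date.now();
  const value = await execute(); // cache miss: run the effect
  cache.set(name, {
    value,
    meta: { completedAt: Date.now(), attempt: 1, durationMs: Date.now() - startedAt },
  });
  return value;
}
```

Calling `runStep` twice with the same step name runs the effect only once; the second call returns the stored value without executing anything.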
Step results must be JSON-serializable. If your effect returns a class instance, ORM result, or other non-serializable value, map it to a plain object or use Effect.asVoid to discard the result before the step completes.
// Map to serializable shape
yield* Workflow.step({
  name: "Create order",
  execute: createOrder(data).pipe(
    Effect.map((order) => ({ id: order.id, status: order.status }))
  ),
});

// Or discard the result entirely
yield* Workflow.step({
  name: "Update database",
  execute: updateRecord(id).pipe(Effect.asVoid),
});
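As a concrete illustration of the mapping approach, consider a hypothetical ORM-style class. `OrderRecord`, its `save` method, and `toStepResult` are invented for this sketch; the point is that the plain object survives a JSON round trip while the class instance's method does not.

```typescript
// Hypothetical ORM-style result: a class instance with methods is not
// safely JSON-serializable as a step result.
class OrderRecord {
  constructor(public id: string, public status: string) {}
  save() { /* persistence logic elided */ }
}

// Map to a plain shape before the step result is cached.
function toStepResult(order: OrderRecord): { id: string; status: string } {
  return { id: order.id, status: order.status };
}
```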

Sleep and the pause signal

Workflow.sleep() does not block a thread. Instead, it fails with a PauseSignal that propagates up to the executor:
// Inside Workflow.sleep():
yield* Effect.fail(new PauseSignal({
  reason: "sleep",
  resumeAt: Date.now() + durationMs,
}))
When the executor catches this signal it:
  1. Records the pause state in the state machine
  2. Schedules a Durable Object alarm for resumeAt
  3. Returns { _tag: "Paused", reason: "sleep", resumeAt: ... }
The Durable Object then goes idle. When the alarm fires, the orchestrator resumes the workflow in resume mode, replaying all cached steps and continuing past the sleep point.
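The three executor actions above can be sketched as a plain function. `recordPause` and `scheduleAlarm` are hypothetical stand-ins for the state machine and the Durable Object alarm API; this is a model of the described behavior, not the executor's actual code.

```typescript
type PauseSignal = { reason: "sleep" | "retry"; resumeAt: number };

type ExecResult =
  | { _tag: "Completed"; value: unknown }
  | { _tag: "Paused"; reason: "sleep" | "retry"; resumeAt: number };

function handlePause(
  signal: PauseSignal,
  recordPause: (s: PauseSignal) => void,
  scheduleAlarm: (at: number) => void
): ExecResult {
  recordPause(signal);            // 1. record the pause state in the state machine
  scheduleAlarm(signal.resumeAt); // 2. schedule a Durable Object alarm for resumeAt
  return { _tag: "Paused", reason: signal.reason, resumeAt: signal.resumeAt }; // 3. report Paused
}
```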
// Short delays for rate limiting
yield* Workflow.sleep("30 seconds");

// Wait a full day — durable across restarts
yield* Workflow.sleep("24 hours");

// Subscription renewal
yield* Workflow.sleep("30 days");
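One way such duration strings could map to milliseconds is sketched below. The library's actual parser and accepted formats are not shown in this document, so treat `durationToMs` and its unit table as illustrative.

```typescript
// Illustrative duration parsing; the library's real accepted formats may differ.
const UNIT_MS: Record<string, number> = {
  second: 1_000, seconds: 1_000,
  hour: 3_600_000, hours: 3_600_000,
  day: 86_400_000, days: 86_400_000,
};

function durationToMs(spec: string): number {
  const [amount, unit] = spec.split(" ");
  const ms = UNIT_MS[unit];
  if (ms === undefined) throw new Error(`unknown unit: ${unit}`);
  return Number(amount) * ms;
}
```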
Retry delays work the same way: a failed step with retry config fails with a PauseSignal carrying reason: "retry", schedules an alarm, and resumes when the delay expires.

Three execution modes

When the executor runs a workflow definition, it operates in one of three modes:
  • fresh — First execution of an instance. No cached data; execute everything.
  • resume — After a scheduled pause (sleep or retry delay). Replay cached steps; continue from the pause point.
  • recover — After an infrastructure failure. Same as resume, but triggered by the recovery system.
In resume and recover modes the workflow function runs from the top, but every completed step returns its cached result immediately without executing the underlying effect.
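Mode selection can be modeled as a small function. This is a sketch of the rules above, not the orchestrator's actual code; `hasPriorState` and `triggeredByRecovery` are hypothetical inputs.

```typescript
type Mode = "fresh" | "resume" | "recover";

function selectMode(hasPriorState: boolean, triggeredByRecovery: boolean): Mode {
  if (!hasPriorState) return "fresh"; // first execution of the instance
  return triggeredByRecovery ? "recover" : "resume";
}
```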

The workflow state machine

Every workflow instance moves through a defined set of states. The valid transitions are:
┌─────────┐     ┌────────┐     ┌─────────┐
│ Pending │────▶│ Queued │────▶│ Running │
└─────────┘     └────────┘     └────┬────┘

             ┌──────────────────────┼──────────────────────┐
             ▼                      ▼                      ▼
        ┌─────────┐           ┌──────────┐          ┌───────────┐
        │ Running │◀─────────▶│  Paused  │          │ Cancelled │
        └────┬────┘           └──────────┘          └───────────┘

     ┌───────┴───────┐
     ▼               ▼
┌───────────┐  ┌──────────┐
│ Completed │  │  Failed  │
└───────────┘  └──────────┘
Valid transitions by status:
const VALID_TRANSITIONS = {
  Pending:   ["Start", "Queue"],
  Queued:    ["Start", "Cancel"],
  Running:   ["Complete", "Pause", "Fail", "Cancel", "Recover"],
  Paused:    ["Resume", "Cancel", "Recover"],
  Completed: [],  // Terminal
  Failed:    [],  // Terminal
  Cancelled: [],  // Terminal
} as const;
Completed, Failed, and Cancelled are terminal — no further transitions are possible.
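A validity check over this table is straightforward. The `canTransition` helper below is a sketch built on the transition map shown above, not an exported library function.

```typescript
const VALID_TRANSITIONS = {
  Pending:   ["Start", "Queue"],
  Queued:    ["Start", "Cancel"],
  Running:   ["Complete", "Pause", "Fail", "Cancel", "Recover"],
  Paused:    ["Resume", "Cancel", "Recover"],
  Completed: [],  // Terminal
  Failed:    [],  // Terminal
  Cancelled: [],  // Terminal
} as const;

type Status = keyof typeof VALID_TRANSITIONS;
type Transition = (typeof VALID_TRANSITIONS)[Status][number];

// Returns true only if the transition is listed for the current status.
function canTransition(from: Status, transition: Transition): boolean {
  return (VALID_TRANSITIONS[from] as readonly Transition[]).includes(transition);
}
```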
The full discriminated union for workflow status:
type WorkflowStatus =
  | { _tag: "Pending" }
  | { _tag: "Queued"; queuedAt: number }
  | { _tag: "Running"; runningAt: number }
  | { _tag: "Paused"; reason: "sleep" | "retry"; resumeAt: number; stepName?: string }
  | { _tag: "Completed"; completedAt: number }
  | { _tag: "Failed"; failedAt: number; error: WorkflowError }
  | { _tag: "Cancelled"; cancelledAt: number; reason?: string }

Recovery after infrastructure failure

When a Durable Object restarts (process crash, deployment, eviction), any workflow that was Running did not get a chance to record a Completed or Paused transition. The recovery system detects this on startup:
  1. The engine’s constructor runs RecoveryManager.checkAndScheduleRecovery().
  2. The recovery manager reads the current status. If it is Running and lastUpdated is older than the stale threshold (default: 30 s), the workflow is considered stale.
  3. A short-delay alarm is scheduled. When it fires, the orchestrator re-executes the workflow in recover mode.
  4. Because all completed steps are cached in DO storage, replay is safe and the workflow continues from the last incomplete step.
Durable Object restarts (process crash, deployment, etc.)


RecoveryManager.checkAndScheduleRecovery()

    ├─► getStatus()               // Status: Running
    ├─► now - lastUpdated > 30s   // Stale check
    ├─► incrementRecoveryAttempts()
    └─► schedule(now + recoveryDelayMs)


        Alarm fires → WorkflowOrchestrator.handleAlarm()

            └─► execute(definition, { mode: "recover" })
Recovery attempt count is bounded by maxRecoveryAttempts (configurable in createDurableWorkflows). If the limit is exceeded the workflow transitions to Failed.
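The stale check and the attempt bound can be modeled in a few lines. The 30 s threshold comes from the description above; `isStale` and `shouldRecover` are illustrative helpers, not library exports.

```typescript
const STALE_THRESHOLD_MS = 30_000; // default stale threshold from the docs above

// A workflow is stale if it is still marked Running but has not
// been updated within the stale threshold.
function isStale(statusTag: string, lastUpdated: number, now: number): boolean {
  return statusTag === "Running" && now - lastUpdated > STALE_THRESHOLD_MS;
}

// Recovery is attempted only while under the configured limit;
// exceeding it transitions the workflow to Failed.
function shouldRecover(attempts: number, maxRecoveryAttempts: number): boolean {
  return attempts < maxRecoveryAttempts;
}
```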

Durability boundaries

Only Workflow.step() and Workflow.sleep() create durability checkpoints. Code written directly in the workflow body between steps runs non-durably on each execution:
const orderWorkflow = Workflow.make((orderId: string) =>
  Effect.gen(function* () {
    // ✅ Durable — result is cached
    const order = yield* Workflow.step({
      name: "Fetch order",
      execute: fetchOrder(orderId),
    });

    // ✅ Durable — pause is recorded, alarm is scheduled
    yield* Workflow.sleep("24 hours");

    // ✅ Durable — result is cached
    yield* Workflow.step({
      name: "Charge card",
      execute: chargeCard(order),
    });
  })
);
Do not call Workflow.step() or Workflow.sleep() inside the execute effect of another step. The library enforces this with both a compile-time guard (WorkflowLevel context) and a runtime check (StepScope). Violating this constraint will cause an error.
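The runtime side of this guard can be sketched with a simple re-entrancy flag. This mimics the spirit of the StepScope check but is not its actual implementation; `enterStep` is an invented name.

```typescript
// Illustrative re-entrancy guard: a flag tracks whether we are inside a
// step's execute effect, and nested boundary calls throw.
let insideStep = false;

function enterStep<T>(run: () => T): T {
  if (insideStep) {
    throw new Error("Workflow.step() called inside another step's execute effect");
  }
  insideStep = true;
  try {
    return run();
  } finally {
    insideStep = false; // reset even when the inner effect fails
  }
}
```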
