Spot instances

Spot instances (AWS) and preemptible instances (GCP) are unused cloud capacity offered at a significant discount — up to 90% cheaper than on-demand pricing. The trade-off is that the cloud provider can reclaim them with little warning. Flyte’s interruptible task flag makes it straightforward to run tasks on spot capacity while ensuring correctness through automatic retries.

Why spot instances can be reclaimed

A spot instance may be interrupted for several reasons:

Price: The current spot price exceeds your maximum bid price.
Capacity: AWS or GCP needs the capacity back and interrupts instances to reclaim it.
Constraints: A launch group or Availability Zone constraint can no longer be satisfied.

The median spot instance runs for around 2 hours; some run much longer, and others are reclaimed within 20 minutes.

Making tasks interruptible

Set interruptible=True on the @task decorator to signal that the task may run on spot/preemptible nodes. Always pair this with at least one retry so Flyte can recover from preemptions:

from flytekit import task

@task(interruptible=True, retries=1)
def add_one_and_print(value_to_print: int) -> int:
    return value_to_print + 1

Flyte schedules interruptible tasks on an auto-scaling group (ASG) that uses only spot/preemptible instances. If the task is preempted, Flyte retries it: on a spot instance for the first n-1 attempts, and on a regular (on-demand) instance for the final attempt, where n is the total number of attempts (retries + 1). Because the final attempt does not run on spot capacity, a preemption cannot exhaust the retry budget, even in volatile spot markets.

Retry behavior

The retry semantics for interruptible tasks follow a specific pattern:

from flytekit import task


def run_batch(batch_id: int) -> float:
    # Hypothetical placeholder for the real per-batch computation.
    return float(batch_id)


# With retries=3 and interruptible=True:
# - Attempt 1: spot instance
# - Attempt 2 (after preemption or failure): spot instance
# - Attempt 3: spot instance
# - Attempt 4 (final): on-demand instance
@task(interruptible=True, retries=3)
def resilient_training_step(batch_id: int) -> float:
    # Computation that may be preempted
    return run_batch(batch_id)

Tasks are only retried if retries is set to at least 1. Without any retries, a preempted task will fail permanently. Always set retries >= 1 for interruptible tasks.
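
The attempt pattern described above can be modeled in a few lines of plain Python. This is only a sketch of the behavior as this page describes it (every attempt except the last on spot capacity), not Flyte's scheduling code. The helper name attempt_plan is hypothetical, and the actual cutoff may depend on how FlytePropeller is configured in your deployment.

```python
from typing import List


def attempt_plan(retries: int, interruptible: bool = True) -> List[str]:
    """Return the capacity type assumed for each attempt, in order.

    Models the pattern described on this page: `retries` retries give
    `retries + 1` attempts in total, and only the final attempt of an
    interruptible task with at least one retry moves to on-demand capacity.
    """
    if retries < 0:
        raise ValueError("retries must be non-negative")
    total_attempts = retries + 1
    if not interruptible:
        return ["on-demand"] * total_attempts
    if retries == 0:
        # No retry budget: the single attempt runs on spot capacity.
        return ["spot"]
    return ["spot"] * (total_attempts - 1) + ["on-demand"]


if __name__ == "__main__":
    for number, capacity in enumerate(attempt_plan(retries=3), start=1):
        print(f"Attempt {number}: {capacity}")
```

With retries=0 the plan contains a single spot attempt and no on-demand fallback, which is why a preempted task without retries fails permanently.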

Which tasks are good candidates for spot instances?

Most Flyte workloads are suitable for spot instances. Mark a task as interruptible unless it has any of the following properties:

Time-sensitive tasks

If a task must complete by a hard deadline and cannot tolerate unexpected delays from restarts, run it on on-demand instances (interruptible=False).

Tasks with side effects

If a task is not idempotent — for example, it writes to an external system in a way that cannot be safely retried — use on-demand instances to avoid duplicate writes.
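
To see why a retry can duplicate a side effect, consider the sketch below. It uses an in-memory list and dict as stand-ins for an external system (both hypothetical, as are all function names). The first write appends blindly, while the second is keyed so that repeating it leaves the same final state.

```python
from typing import Callable, Dict, List

# Hypothetical stand-ins for an external system.
event_log: List[str] = []
records_by_key: Dict[str, str] = {}


def append_event(payload: str) -> None:
    """Non-idempotent write: every call adds another entry."""
    event_log.append(payload)


def upsert_record(key: str, payload: str) -> None:
    """Idempotent write: repeating the call leaves the same final state."""
    records_by_key[key] = payload


def run_twice(write: Callable[..., None], *args: str) -> None:
    """Simulate an attempt that is preempted after writing, then retried."""
    write(*args)  # first attempt: the write lands, then the node is reclaimed
    write(*args)  # retry: the task body runs again from the start


if __name__ == "__main__":
    run_twice(append_event, "order-42 shipped")
    run_twice(upsert_record, "order-42", "shipped")
    print(len(event_log), len(records_by_key))
```

Tasks whose writes behave like append_event are the ones this section warns about.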

Long-running tasks (> 2 hours)

Tasks that run for more than 2 hours risk wasting a large amount of compute time if preempted near completion. Consider using checkpointing (see below) or running on on-demand instances.
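
The three exclusion criteria in this section can be written down as a small decision helper. This is a sketch of the guidance on this page, not part of flytekit. TaskProfile and recommend_interruptible are hypothetical names, and the 2-hour threshold is simply the figure quoted above.

```python
from dataclasses import dataclass


@dataclass
class TaskProfile:
    """Hypothetical description of a task, used only for this sketch."""
    has_hard_deadline: bool
    is_idempotent: bool
    expected_hours: float
    uses_checkpointing: bool = False


def recommend_interruptible(profile: TaskProfile,
                            long_running_hours: float = 2.0) -> bool:
    """Return True if the task looks like a good candidate for spot capacity."""
    if profile.has_hard_deadline:
        return False  # time-sensitive: restarts could miss the deadline
    if not profile.is_idempotent:
        return False  # side effects: a retry could repeat the write
    if (profile.expected_hours > long_running_hours
            and not profile.uses_checkpointing):
        return False  # long-running without checkpoints: too much work at risk
    return True


if __name__ == "__main__":
    nightly_report = TaskProfile(has_hard_deadline=False, is_idempotent=True,
                                 expected_hours=0.5)
    print(recommend_interruptible(nightly_report))
```

The result is meant to be passed as the interruptible argument of the @task decorator; the helper only encodes the rules of thumb and does not inspect a real task.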

Combining spot instances with checkpointing

For long-running tasks where spot instances are still desirable (due to cost), combine interruptible=True with Flyte’s checkpointing API. This way, a preempted task resumes from its last checkpoint rather than starting from scratch:

from flytekit import task, current_context


def train_one_epoch(epoch: int) -> None:
    # Hypothetical placeholder for one epoch of real training.
    pass


@task(interruptible=True, retries=5)
def long_training_job(num_epochs: int) -> str:
    ctx = current_context()
    checkpoint = ctx.checkpoint

    # Restore from a previous checkpoint if available.
    # The checkpoint stores the number of completed epochs as bytes.
    start_epoch = 0
    prev = checkpoint.read()
    if prev:
        start_epoch = int(prev.decode())

    for epoch in range(start_epoch, num_epochs):
        train_one_epoch(epoch)
        # Save progress after each epoch
        checkpoint.write(str(epoch + 1).encode())

    return "training_complete"

Checkpointing is especially powerful when training ML models over many epochs. Saving a checkpoint after each epoch means a preemption never costs more than one epoch of work.
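
The resume logic can be exercised without a cluster by simulating a preemption. The sketch below is plain Python and makes several assumptions: InMemoryCheckpoint is a hypothetical stand-in that only mimics a bytes-based read()/write() pair, Preempted is an exception invented for the simulation, and appending to a list stands in for real training.

```python
from typing import List, Optional


class InMemoryCheckpoint:
    """Hypothetical stand-in for a checkpoint store with bytes read/write."""

    def __init__(self) -> None:
        self._data: Optional[bytes] = None

    def read(self) -> Optional[bytes]:
        return self._data

    def write(self, data: bytes) -> None:
        self._data = data


class Preempted(Exception):
    """Raised by the sketch to simulate a node being reclaimed."""


def train(checkpoint: InMemoryCheckpoint, num_epochs: int,
          trained: List[int], preempt_at: Optional[int] = None) -> None:
    """Run epochs from the last checkpoint, optionally failing at `preempt_at`."""
    prev = checkpoint.read()
    start_epoch = int(prev.decode()) if prev else 0
    for epoch in range(start_epoch, num_epochs):
        if epoch == preempt_at:
            raise Preempted(f"reclaimed during epoch {epoch}")
        trained.append(epoch)  # stands in for one epoch of real training
        checkpoint.write(str(epoch + 1).encode())


def run_with_one_preemption(num_epochs: int, preempt_at: int) -> List[int]:
    """First attempt is preempted; the retry resumes from the checkpoint."""
    checkpoint = InMemoryCheckpoint()
    trained: List[int] = []
    try:
        train(checkpoint, num_epochs, trained, preempt_at=preempt_at)
    except Preempted:
        train(checkpoint, num_epochs, trained)  # retry without preemption
    return trained


if __name__ == "__main__":
    print(run_with_one_preemption(num_epochs=5, preempt_at=3))
```

Each epoch index appears in the returned list exactly once: the retry starts at the epoch recorded in the checkpoint instead of repeating the epochs that already completed.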

Setting up spot/preemptible node groups

To isolate spot workloads from regular workloads on your cluster, configure a dedicated auto-scaling group (ASG) for spot instances with Kubernetes taints and tolerations.

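Create a dedicated spot (AWS) or preemptible (GCP) node group following your cloud provider's guide, for example the AWS spot instance request guide for a spot ASG. Then add a taint to that node group. The example below uses the GKE preemptible key; on AWS, substitute the taint key you apply to your spot nodes:

# Node group taint (GKE preemptible key shown)
taints:
  - key: cloud.google.com/gke-preemptible
    value: "true"
    effect: NoSchedule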

Configure the tolerations in the FlytePropeller config so that tasks with interruptible=True are automatically scheduled to the spot node group. Refer to the flyteplugins configuration for the relevant config fields.

Cost impact

Using spot instances for suitable workloads can reduce compute costs by up to 90%. In practice, savings of 60–70% are typical for mixed workloads where tasks have short-to-medium runtimes and a small number of retries is acceptable.

Track cost savings by comparing on-demand vs. spot pricing for your instance types. AWS and GCP both publish current spot prices in their consoles and APIs.
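
A rough way to make this comparison is to account for the work that has to be redone after preemptions. The function below is a back-of-the-envelope sketch, not a billing tool. Its name and parameters are hypothetical, the prices in the usage example are made up, and it assumes all recomputed work also runs at the spot price, ignoring the final on-demand attempt.

```python
def estimated_savings_fraction(on_demand_hourly: float, spot_hourly: float,
                               rerun_fraction: float = 0.0) -> float:
    """Fraction of on-demand cost saved, allowing for recomputed work.

    `rerun_fraction` is the extra compute, as a fraction of the task's
    runtime, spent redoing work after preemptions (0.2 means 20% extra).
    """
    if on_demand_hourly <= 0:
        raise ValueError("on_demand_hourly must be positive")
    if spot_hourly < 0 or rerun_fraction < 0:
        raise ValueError("spot_hourly and rerun_fraction must be non-negative")
    effective_spot_cost = spot_hourly * (1.0 + rerun_fraction)
    return 1.0 - effective_spot_cost / on_demand_hourly


if __name__ == "__main__":
    # Made-up prices: 1.00 per hour on-demand, 0.25 per hour spot, 20% rework.
    saving = estimated_savings_fraction(1.00, 0.25, rerun_fraction=0.2)
    print(f"Estimated saving: {saving:.0%}")
```

A negative result means the rework outweighs the discount, which is the situation the long-running-task guidance above is meant to avoid.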
