interruptible task flag makes it straightforward to run tasks on spot capacity while ensuring correctness through automatic retries.
Why spot instances can be reclaimed
A spot instance may be interrupted for several reasons:
- Price: The current spot price exceeds your maximum bid price.
- Capacity: AWS or GCP needs the capacity back and interrupts instances to reclaim it.
- Constraints: A launch group or Availability Zone constraint can no longer be satisfied.
Making tasks interruptible
Set interruptible=True on the @task decorator to signal that the task may run on spot/preemptible nodes. Always pair this with at least one retry so Flyte can recover from preemptions.
Flyte schedules interruptible tasks on an auto-scaling group (ASG) that uses only spot/preemptible instances. If the task is preempted, Flyte retries it: on a spot instance for the first n-1 attempts, and on a regular (on-demand) instance for the final attempt. This guarantees eventual completion even in volatile spot markets.
Retry behavior
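To make the fallback rule concrete, the sketch below models which capacity type each attempt would use. It is only an illustration of the rule as described in this guide, not Flyte's scheduler code. It assumes n means the total number of attempts and that at least one retry is configured; the function name capacity_for_attempt is hypothetical.

```python
def capacity_for_attempt(attempt: int, total_attempts: int) -> str:
    """Return the capacity type for a 1-based attempt number."""
    if total_attempts < 2:
        raise ValueError("at least one retry (two attempts) is required")
    if not 1 <= attempt <= total_attempts:
        raise ValueError("attempt must be between 1 and total_attempts")
    # Every attempt except the last stays on spot capacity; the last one
    # falls back to on-demand so the task can still finish.
    return "on-demand" if attempt == total_attempts else "spot"


# With four attempts in total, only the fourth runs on on-demand capacity.
schedule = [capacity_for_attempt(a, 4) for a in range(1, 5)]
print(schedule)  # ['spot', 'spot', 'spot', 'on-demand']
```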
The retry semantics for interruptible tasks follow the pattern described above: spot capacity for every attempt except the last, and on-demand capacity for the last.
Which tasks are good candidates for spot instances?
Most Flyte workloads are suitable for spot instances. Mark a task as interruptible unless it has any of the following properties:
Time-sensitive tasks
If a task must complete by a hard deadline and cannot tolerate unexpected delays from restarts, run it on on-demand instances (interruptible=False).
Tasks with side effects
If a task is not idempotent — for example, it writes to an external system in a way that cannot be safely retried — use on-demand instances to avoid duplicate writes.
Long-running tasks (> 2 hours)
Tasks that run for more than 2 hours risk wasting a large amount of compute time if preempted near completion. Consider using checkpointing (see below) or running on on-demand instances.
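The three criteria above can be folded into a small decision helper. This is a sketch for reasoning about a task's profile, not part of flytekit; the TaskProfile fields and the should_be_interruptible function are hypothetical names introduced here.

```python
from dataclasses import dataclass

LONG_RUNNING_HOURS = 2.0  # threshold taken from the guidance above


@dataclass
class TaskProfile:
    has_hard_deadline: bool
    is_idempotent: bool
    expected_hours: float
    uses_checkpointing: bool = False


def should_be_interruptible(profile: TaskProfile) -> bool:
    """Apply the three exclusion rules; default to spot otherwise."""
    if profile.has_hard_deadline:
        return False  # restarts could push the task past its deadline
    if not profile.is_idempotent:
        return False  # a retry could repeat an external side effect
    if profile.expected_hours > LONG_RUNNING_HOURS and not profile.uses_checkpointing:
        return False  # a late preemption would discard hours of work
    return True
```

For example, an idempotent 5-hour task with no deadline would be rejected by this helper unless it also uses checkpointing.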
Combining spot instances with checkpointing
For long-running tasks where spot instances are still desirable (due to cost), combine interruptible=True with Flyte’s checkpointing API. This way, a preempted task resumes from its last checkpoint rather than starting from scratch.
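A minimal sketch of the resume pattern. It assumes a checkpoint object with read() returning the last saved bytes (or None) and write(bytes), which is the shape of the object flytekit exposes through current_context().checkpoint. The InMemoryCheckpoint class, the Preempted exception, and the run_steps function are hypothetical stand-ins so the logic can run without a cluster.

```python
from typing import Optional


class InMemoryCheckpoint:
    """Hypothetical stand-in for a checkpoint store with read()/write()."""

    def __init__(self) -> None:
        self._data: Optional[bytes] = None

    def read(self) -> Optional[bytes]:
        return self._data

    def write(self, data: bytes) -> None:
        self._data = data


class Preempted(Exception):
    """Simulates the node being reclaimed in the middle of a task."""


def run_steps(cp, total_steps: int, preempt_at: Optional[int] = None) -> int:
    """Run the remaining steps, resuming from the last checkpoint if any."""
    prev = cp.read()
    start = int(prev.decode()) if prev else 0
    for step in range(start, total_steps):
        if preempt_at is not None and step == preempt_at:
            raise Preempted(f"interrupted before step {step}")
        # ... one unit of real work would go here ...
        cp.write(str(step + 1).encode())  # record the next step to run
    return total_steps - start  # steps executed in this attempt


cp = InMemoryCheckpoint()
try:
    run_steps(cp, total_steps=10, preempt_at=6)  # first attempt is preempted
except Preempted:
    pass
remaining = run_steps(cp, total_steps=10)  # retry resumes at step 6
print(remaining)  # 4
```

In a Flyte task, the cp argument would be current_context().checkpoint, and the task would be declared with interruptible=True and a nonzero retries value so that the retry actually happens.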
Setting up spot/preemptible node groups
To isolate spot workloads from regular workloads on your cluster, configure a dedicated auto-scaling group (ASG) for spot instances with Kubernetes taints and tolerations.
Create a spot instance ASG following the AWS spot instance request guide, then add a taint to the spot node group.
With the matching toleration configured, tasks with interruptible=True are automatically scheduled to the spot node group. Refer to the flyteplugins configuration for the relevant config fields.
Cost impact
Using spot instances for suitable workloads can reduce compute costs by up to 90%. In practice, savings of 60–70% are typical for mixed workloads where tasks have short-to-medium runtimes and a small number of retries is acceptable.
Track cost savings by comparing on-demand vs. spot pricing for your instance types. AWS and GCP both publish current spot prices in their consoles and APIs.
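The comparison can be reduced to a short calculation. The sketch below is a simplified model, not a billing tool: the spot_savings function and its rerun_overhead parameter are hypothetical, the prices are made-up hourly rates, and it assumes all redone work also runs at the spot price.

```python
def spot_savings(on_demand_price: float, spot_price: float,
                 rerun_overhead: float = 0.0) -> float:
    """Fractional savings of spot over on-demand for the same work.

    rerun_overhead is the extra fraction of compute redone after
    preemptions (0.10 means 10% of the work runs twice).
    """
    if on_demand_price <= 0:
        raise ValueError("on_demand_price must be positive")
    effective_spot = spot_price * (1.0 + rerun_overhead)
    return 1.0 - effective_spot / on_demand_price


# Hypothetical rates: $1.00/h on-demand, $0.30/h spot, 10% of work redone.
savings = spot_savings(1.00, 0.30, rerun_overhead=0.10)
print(f"{savings:.0%}")  # 67%
```

The rerun overhead term is why realized savings tend to fall below the headline discount: every preemption adds compute that must be paid for again.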