- Stop and resume a job, preserving all state
- Fork a job into two separate instances from the same point in time
- Update job logic or Flink version while retaining state
- Rescale parallelism up or down
Savepoints vs checkpoints
| | Checkpoints | Savepoints |
|---|---|---|
| Triggered by | Flink automatically | User explicitly |
| Purpose | Automatic fault tolerance | Planned maintenance and upgrades |
| Lifetime | Deleted on job cancel (unless retained) | Kept until you delete them |
| Portability | Not self-contained; paths may be absolute | Self-contained; can be moved |
| Format | State-backend specific | Canonical (default) or native |
Assigning operator IDs
Before relying on savepoints, assign stable UIDs to all stateful operators in your job. Flink uses these UIDs to map each operator's state in a savepoint back to the correct operator after a restart.
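As a sketch, UIDs are assigned with `uid(...)` in the DataStream API. The source, the `MyStatefulMapper` class, and all names below are illustrative, not from the original text:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SavepointReadyJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.socketTextStream("localhost", 9999)  // illustrative source
            .keyBy(line -> line)
            .map(new MyStatefulMapper())         // hypothetical stateful operator
            .uid("my-stateful-mapper")           // stable ID: savepoint state is matched by this UID
            .print()
            .uid("print-sink");                  // sinks can hold state too, so name them as well

        env.execute("savepoint-ready-job");
    }
}
```

If you rely on auto-generated IDs instead, any change to the job graph can shift them and make existing savepoints unrestorable.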
Triggering a savepoint
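A savepoint is triggered through the CLI; a minimal sketch, with job ID and target directory as placeholders:

```shell
# Trigger a savepoint for the running job :jobId; the target directory is
# optional if a default savepoint directory is configured.
$ ./bin/flink savepoint :jobId [:targetDirectory]
```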
Triggering on YARN
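On YARN, the application ID has to be passed as well; sketched as:

```shell
# -yid identifies the YARN application hosting the Flink cluster
$ ./bin/flink savepoint :jobId [:targetDirectory] -yid :yarnAppId
```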
Stopping a job with a savepoint
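A sketch of the stop-with-savepoint command, placeholders as before:

```shell
# Take a savepoint and stop the job atomically; transactional side effects
# are committed at the savepoint boundary.
$ ./bin/flink stop --savepointPath [:targetDirectory] :jobId
```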
Stopping with a savepoint atomically takes a savepoint and then stops the job. Starting with Flink 1.15, intermediate savepoints (triggered with `flink savepoint`, not `flink stop --savepointPath`) are not used for recovery and do not commit transactional side effects. Use `flink stop --savepointPath` when you need side effects (e.g., Kafka transactions) to be committed at exactly the savepoint boundary.

Savepoint directory structure
Flink writes savepoints under the configured or specified directory.

Savepoint format
- Canonical (default): a unified format that works across all state backends. You can take a savepoint with one backend and restore with a different backend. This is the most compatible format and is recommended for upgrades and long-term storage.
- Native: a state-backend-specific format. It is typically faster to take and restore, but it cannot be used to switch state backends.
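Regardless of format, a savepoint on durable storage typically looks like the following (illustrative layout; file names vary):

```
/savepoints/savepoint-:shortjobid-:savepointid/
├── _metadata        # savepoint metadata plus pointers to the state files
└── ...              # operator state data files
```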
Resuming from a savepoint
To resume, pass either the savepoint directory or the `_metadata` file path to the run command.
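A sketch of resuming from a savepoint (the path may point at the savepoint directory or at its `_metadata` file):

```shell
# Restore state from :savepointPath, then run the job as usual
$ ./bin/flink run -s :savepointPath [:runArgs]
```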
Allowing non-restored state
If you removed an operator from your job, Flink will fail to restore by default because the savepoint contains state that cannot be mapped to any operator. Allow this with the `-n` flag:
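Sketched with the short flag (long form `--allowNonRestoredState`):

```shell
# -n skips savepoint state that no longer maps to any operator
$ ./bin/flink run -s :savepointPath -n [:runArgs]
```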
Claim mode
The claim mode controls who takes ownership of the savepoint files after restore.

NO_CLAIM (default)
Flink does not take ownership of the savepoint files. You retain full control. You can start multiple jobs from the same savepoint. Flink forces the first checkpoint after restore to be a full checkpoint, then returns to incremental checkpointing as configured.
CLAIM
Flink takes ownership of the savepoint and treats it like a regular checkpoint. Flink may delete it once it is no longer needed for recovery. Do not manually delete the savepoint or start multiple jobs from it.
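The mode is selected when resuming. As a sketch, assuming the `--claimMode` flag of recent Flink releases (older 1.15–1.19 releases call it `--restoreMode`, so check your version):

```shell
# Restore in CLAIM mode: Flink takes ownership of the savepoint files
$ ./bin/flink run -s :savepointPath --claimMode CLAIM [:runArgs]
```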
LEGACY (deprecated)
Flink never deletes the initial checkpoint, but ownership is ambiguous. Deprecated since Flink 1.15 and will be removed in Flink 2.0. Use NO_CLAIM or CLAIM instead.

Disposing a savepoint
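Disposal goes through the same CLI; a sketch:

```shell
# Delete the savepoint through Flink rather than by removing files directly,
# so state backends that keep data outside the savepoint directory clean up too.
$ ./bin/flink savepoint -d :savepointPath
```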
Configuration
Set a default savepoint directory to avoid specifying it on every trigger command, either in config.yaml or programmatically in Java.
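A minimal sketch of the config entry, assuming the classic `state.savepoints.dir` key (verify the key name and path scheme against your Flink version):

```yaml
# config.yaml (flink-conf.yaml in older versions)
state.savepoints.dir: hdfs:///flink/savepoints
```

In Java, `StreamExecutionEnvironment#setDefaultSavepointDirectory` serves the same purpose for a single job.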
FAQ
Should I assign UIDs to all operators?
Yes, as a rule of thumb. Strictly speaking, only stateful operators need UIDs because savepoints only contain state for stateful operators. In practice, assign UIDs to all operators because built-in operators such as the Window operator are stateful, and it is not always obvious which built-in operators hold state.
What happens when I add a new stateful operator?
The new operator starts with empty state. This is equivalent to a stateless operator on first run.
What happens when I delete a stateful operator?
By default, the restore fails because there is unmatched state in the savepoint. Use `--allowNonRestoredState` (`-n`) to skip the unmapped state.
Can I change parallelism when restoring?
Yes. Specify the new parallelism in the run command. Flink redistributes the keyed state across the new number of tasks.
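Sketched with the `-p` flag (the parallelism value and paths are placeholders):

```shell
# Restore from the savepoint and run with parallelism 8
$ ./bin/flink run -s :savepointPath -p 8 [:runArgs]
```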
Can I move savepoint files to a different location?
Yes. Savepoints are self-contained. Move the entire savepoint directory and restore from the new path. Two exceptions apply: (1) if S3 entropy injection is enabled, the savepoint contains absolute paths and cannot be moved; (2) the same restriction applies if the job uses task-owned state (e.g., the `GenericWriteAheadLog` sink).
