- The application must be able to take checkpoints reliably and within the checkpoint interval
- There must be enough resources to catch up with the input stream after a failure
Monitoring checkpoints
Before tuning, establish a baseline by monitoring checkpoint behaviour in the Flink Web UI under the Checkpoints tab. The two most important metrics are:
- Time to first barrier: The time from triggering the checkpoint until operators receive their first checkpoint barrier. High values indicate back-pressure upstream.
- Alignment duration: The time between receiving the first and last checkpoint barrier at an operator. High values with aligned exactly-once checkpoints indicate back-pressure or data skew.
Tuning checkpoint intervals and pauses
When individual checkpoints take longer than the checkpoint interval, Flink starts the next checkpoint immediately after the previous one completes. This creates a feedback loop in which the system spends all of its time checkpointing. Break the loop by setting a minimum pause between checkpoints.
Do not allow multiple concurrent checkpoints for jobs with large state. Concurrent checkpoints multiply resource usage (network bandwidth, disk I/O) and usually make the situation worse.
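As a sketch, assuming the unified checkpointing option keys from recent Flink releases (verify the exact keys against your version's configuration reference):

```yaml
# Trigger a checkpoint at most every 10 minutes.
execution.checkpointing.interval: 10min
# Guarantee at least 5 minutes between the end of one checkpoint
# and the start of the next, breaking the back-to-back loop.
execution.checkpointing.min-pause: 5min
# Never run checkpoints concurrently for jobs with large state.
execution.checkpointing.max-concurrent-checkpoints: 1
```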
Incremental checkpoints with RocksDB
The most impactful single change for large RocksDB state is enabling incremental checkpoints. Instead of uploading the entire state on every checkpoint, Flink uploads only the SST files that changed since the last checkpoint.
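A minimal configuration sketch, assuming the option keys used by recent Flink releases (older versions use `state.backend: rocksdb` instead of `state.backend.type`):

```yaml
# Select the EmbeddedRocksDBStateBackend.
state.backend.type: rocksdb
# Upload only the SST files that changed since the last checkpoint.
state.backend.incremental: true
```

Equivalently, the backend can be constructed in code with `new EmbeddedRocksDBStateBackend(true)`.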
Whether incremental checkpoints also speed up recovery depends on where the bottleneck lies:
- If recovery is CPU/IOPs bound: incremental is faster (no need to rebuild all SST tables from scratch)
- If recovery is network bound: incremental may be slower (must fetch all accumulated deltas)
Tuning RocksDB memory
RocksDB performance is heavily influenced by memory allocation. By default, Flink allocates RocksDB memory from the TaskManager’s managed memory budget.
Increase managed memory
The first step for any RocksDB performance problem is to increase the managed memory fraction.
Tune write buffer vs. block cache ratio
If you see frequent RocksDB MemTable flushes (write-side bottleneck), increase the write buffer ratio.
Understand per-state memory
RocksDB allocates one ColumnFamily per state per operator. Each ColumnFamily needs its own write buffers. Jobs with many states therefore need proportionally more memory to achieve the same write performance. To diagnose this, compare:
- Number of states × operators × tasks in your job
- Current managed memory per slot
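The two memory adjustments described above might be expressed as follows (a sketch; the keys and the defaults noted in comments are taken from recent Flink releases and should be checked against your version):

```yaml
# Give RocksDB a larger share of TaskManager memory (default: 0.4).
taskmanager.memory.managed.fraction: 0.6
# Devote more of that memory to write buffers (MemTables) when
# frequent flushes show a write-side bottleneck (default: 0.5).
state.backend.rocksdb.memory.write-buffer-ratio: 0.6
```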
Expert: per-column-family options
For jobs with many states that suffer from frequent MemTable flushes, use a custom RocksDBOptionsFactory to increase background flush threads and reduce arena block size:
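A sketch of such a factory, assuming Flink's `ConfigurableRocksDBOptionsFactory` interface and the RocksDB Java API (`setMaxBackgroundFlushes`, `setArenaBlockSize`); it requires the Flink RocksDB state backend and RocksDB jars on the classpath, and the class name here is hypothetical:

```java
import java.util.Collection;

import org.apache.flink.configuration.ReadableConfig;
import org.apache.flink.contrib.streaming.state.ConfigurableRocksDBOptionsFactory;
import org.apache.flink.contrib.streaming.state.RocksDBOptionsFactory;
import org.rocksdb.ColumnFamilyOptions;
import org.rocksdb.DBOptions;

public class ManyStatesOptionsFactory implements ConfigurableRocksDBOptionsFactory {

    @Override
    public DBOptions createDBOptions(
            DBOptions currentOptions, Collection<AutoCloseable> handlesToClose) {
        // More background flush threads, since each state's ColumnFamily flushes independently.
        return currentOptions.setMaxBackgroundFlushes(4);
    }

    @Override
    public ColumnFamilyOptions createColumnOptions(
            ColumnFamilyOptions currentOptions, Collection<AutoCloseable> handlesToClose) {
        // Smaller arena blocks so that many write buffers waste less memory.
        return currentOptions.setArenaBlockSize(1024 * 1024);
    }

    @Override
    public RocksDBOptionsFactory configure(ReadableConfig configuration) {
        return this;
    }
}
```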
Timers: RocksDB vs. heap
Timers are stored in RocksDB by default. For jobs with a small, bounded number of timers (no windows, minimal ProcessFunction timer usage), storing timers on the JVM heap can reduce checkpoint overhead:
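Assuming the timer-service factory option from Flink's RocksDB backend configuration:

```yaml
# Store timers on the JVM heap instead of in RocksDB.
# Heap timers do not spill to disk, so reserve this for jobs
# with a small, bounded number of timers.
state.backend.rocksdb.timer-service.factory: HEAP
```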
Task-local recovery
In a typical failure scenario, a failed task is rescheduled to the same TaskManager. However, it still reads its state from the distributed store (HDFS, S3), which can be slow for large state. Task-local recovery keeps a secondary copy of state on local disk. On recovery, Flink first tries to restore from the local copy (fast), and transparently falls back to the distributed store if the local copy is unavailable.
With the EmbeddedRocksDBStateBackend and incremental checkpoints, the local copy reuses RocksDB’s native checkpoint mechanism. It shares active SST files via hard links, so it consumes no additional disk space for those files.
Task-local recovery only covers keyed state. Unaligned checkpoints do not currently support task-local recovery.
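Task-local recovery can be switched on in the configuration (key as documented for recent Flink releases):

```yaml
# Keep a secondary copy of keyed state on the TaskManager's local disk.
state.backend.local-recovery: true
```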
Checkpoint compression
Flink supports Snappy compression for checkpoints and savepoints. Compression is off by default.
Compression has no effect on incremental RocksDB checkpoints, which already use Snappy internally via RocksDB’s native format.
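A configuration sketch; the key below is an assumption based on recent releases, and in code the long-standing way to enable it is `ExecutionConfig#setUseSnapshotCompression(true)`:

```yaml
# Snappy-compress full checkpoints and savepoints.
execution.checkpointing.snapshot-compression: true
```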
Capacity planning
When planning resources for a Flink job with large state:
- Baseline throughput: establish the resources needed to run without back-pressure during normal operation. Always measure with checkpointing enabled, since checkpointing consumes CPU and network bandwidth.
- Recovery headroom: provision extra resources to allow catch-up after a failure. The required headroom depends on how long state recovery takes and how fast the source accumulates lag during recovery.
- Maximum parallelism: set setMaxParallelism() to a value high enough to allow future scaling, since it cannot be changed after a job has started without discarding state.
- Window spikes: operators downstream of large windows experience bursty load when the window fires. Ensure downstream parallelism can handle the peak output rate.
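The recovery-headroom reasoning above can be made concrete with a back-of-envelope calculation (the numbers and the helper below are illustrative, not from the text): with a steady input rate R records/s, an outage of D seconds, and a target catch-up time of T seconds, the job must sustain roughly R + R·D/T records/s while catching up.

```java
// Back-of-envelope estimate of the throughput needed during catch-up.
public class RecoveryHeadroom {

    /** Throughput required during catch-up, in records per second. */
    static double requiredThroughput(
            double steadyRate, double outageSeconds, double catchUpSeconds) {
        // Records that accumulated at the source while the job was down.
        double backlog = steadyRate * outageSeconds;
        // Steady load plus the rate needed to drain the backlog in time.
        return steadyRate + backlog / catchUpSeconds;
    }

    public static void main(String[] args) {
        // 1,000 rec/s input, 2-minute recovery, catch up within 10 minutes.
        double needed = requiredThroughput(1_000, 120, 600);
        System.out.println("Required throughput: " + needed + " rec/s");
    }
}
```

In this example the job needs 20% headroom (1,200 rec/s capacity for a 1,000 rec/s stream); a shorter catch-up target demands proportionally more.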

