How speculative execution works
Slow task detection
Flink periodically monitors the execution time of running tasks. When a sufficient fraction of tasks in a stage have finished, the scheduler computes a baseline execution time (median of finished tasks × a configurable multiplier). Any still-running task that exceeds this baseline is marked as slow.
Node blocklisting
Nodes (TaskManagers) hosting slow tasks are identified as problematic and added to a blocklist. No new task attempts are scheduled on blocked nodes until the block duration expires.
Speculative attempt launch
New backup task attempts are created for the slow tasks and deployed on nodes that are not blocked. These speculative attempts process the same input data as the original attempt.
Prerequisites
Speculative execution is only supported by the Adaptive Batch Scheduler, which is the default batch scheduler in Flink.Enabling speculative execution
Tuning configuration
Scheduler options
Slow task detector options
How the baseline is computed
- Wait until
baseline-ratiofraction of tasks in the stage have finished. - Compute the median execution time of the finished tasks.
- Multiply by
baseline-multiplierto get the baseline. - Any running task with execution time > baseline and >
baseline-lower-boundis marked slow.
Enabling sources for speculative execution
Most sources work with speculative execution without changes. However, if your custom source usesSourceEvent, you must implement SupportsHandleExecutionAttemptSourceEvent on the SplitEnumerator:
SplitEnumerator to route source events to the correct attempt. Without this interface, job failures will occur when events arrive from speculative attempts.
All built-in Apache Flink source connectors support speculative execution.
Enabling sinks for speculative execution
Sinks are excluded from speculative execution by default for safety reasons. To allow a sink to run speculatively, implement theSupportsConcurrentExecutionAttempts marker interface:
Sink, SinkFunction, and OutputFormat implementations.
Even if a Sink implements
SupportsConcurrentExecutionAttempts, the Committer portion of a two-phase commit sink is never speculatively executed. Flink disables speculative execution for Committer operators automatically.If any operator in a task does not support speculative execution, the entire task is marked as non-speculative.
Checking effectiveness
After enabling speculative execution, monitor its impact through: Web UI:- Navigate to your job’s vertex detail page and click the SubTasks tab to see speculative attempts listed alongside original attempts.
- Check the cluster Overview and Task Managers pages to see which TaskManagers are currently blocked.
speculative-execution scope, including the number of speculative task attempts started and the number of blocked nodes. See the metrics documentation for the full list.
