Log-based debugging
Each Flink daemon writes stdout and stderr to a file with a .out suffix and internal logging to a file with a .log suffix. To produce additional diagnostic files alongside the standard logs, reference FLINK_LOG_PREFIX in the Java options you configure.
Files whose names use FLINK_LOG_PREFIX rotate with the standard .out and .log files.
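As an illustration of how FLINK_LOG_PREFIX can appear in Java options, the sketch below renders a single configuration line. It assumes the options live under the env.java.opts.all key (the key name differs between Flink versions, so check yours) and uses the standard HotSpot heap dump flags as the example payload. The helper name java_opts_line is hypothetical.

```python
# Placeholder that Flink's startup scripts expand; Python leaves it untouched.
LOG_PREFIX = "${FLINK_LOG_PREFIX}"


def java_opts_line(flags, key="env.java.opts.all"):
    """Render one config line holding the given JVM flags.

    The key name is an assumption; older Flink releases use a different one.
    """
    joined = " ".join(flags)
    # Double quotes keep the placeholder literal until the scripts evaluate it.
    return f'{key}: "{joined}"'


heap_dump_flags = [
    "-XX:+HeapDumpOnOutOfMemoryError",
    f"-XX:HeapDumpPath={LOG_PREFIX}.hprof",
]

print(java_opts_line(heap_dump_flags))
```

The printed line can be pasted into the Flink configuration file. Any other diagnostic flag that accepts an output path can be passed the same way.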
Java Flight Recorder (JFR)
Java Flight Recorder is a low-overhead profiling and event collection framework built into the JDK (Oracle JDK and OpenJDK 11+). It records JVM-level events (thread activity, GC, memory allocation, I/O) with minimal impact on the running application.
Enabling JFR at startup
A recording configured at startup is written to a .jfr file when the JVM exits. Open it with JDK Mission Control to analyse:
- Method profiling (CPU hotspots)
- Memory allocation and GC events
- Thread states and lock contention
- Exception frequency
Recording on a running process
For a running TaskManager, start a JFR recording using jcmd.
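A minimal sketch of the jcmd invocation, written as a small Python helper so the arguments are explicit. JFR.start and its name, duration, and filename options are standard jcmd diagnostic commands. The process ID, output path, and recording name below are placeholders, and the helper names are hypothetical.

```python
import subprocess


def jfr_start_command(pid, duration_s, out_file, name="flink-profile"):
    """Build the argv for `jcmd <pid> JFR.start ...`."""
    return [
        "jcmd",
        str(pid),
        "JFR.start",
        f"name={name}",
        f"duration={duration_s}s",  # the recording stops by itself after this
        f"filename={out_file}",     # written by the target JVM, not by jcmd
    ]


def start_recording(pid, duration_s, out_file):
    """Run jcmd; requires a JDK on PATH and the same OS user as the JVM."""
    return subprocess.run(
        jfr_start_command(pid, duration_s, out_file),
        check=True, capture_output=True, text=True,
    ).stdout


# Placeholder PID and path, for illustration only.
print(" ".join(jfr_start_command(12345, 120, "/tmp/taskmanager.jfr")))
```

Because the target JVM writes the file, the path must be writable on the TaskManager host.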
Heap dumps
Heap dumps capture the full state of the Java heap for offline analysis with tools like Eclipse MAT or JDK Mission Control.
Automatic heap dump on OOM
With automatic heap dumps enabled, the JVM writes a heap dump when an OutOfMemoryError occurs. Analyse it to find:
- Objects consuming the most memory
- Retained object graphs indicating memory leaks
- Unexpected large collections
On-demand heap dump
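One way to take a heap dump from a running TaskManager is the standard jcmd GC.heap_dump command. The sketch below only builds the command line. The PID and path are placeholders and the helper name is hypothetical. It rejects relative paths, on the assumption that the target JVM would otherwise resolve them against its own working directory.

```python
import os


def heap_dump_command(pid, out_file):
    """Build the argv for `jcmd <pid> GC.heap_dump <file>`."""
    if not os.path.isabs(out_file):
        # A relative path would be ambiguous on the TaskManager host.
        raise ValueError("use an absolute path for the heap dump file")
    return ["jcmd", str(pid), "GC.heap_dump", out_file]


# Placeholder PID and path, for illustration only.
print(" ".join(heap_dump_command(12345, "/tmp/taskmanager.hprof")))
```

Writing a dump pauses the JVM while the heap is walked, so expect a visible stall on large heaps.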
Thread dumps via REST API
Flink exposes thread dump endpoints that let you capture thread stacks from any TaskManager without SSH access. In the dump, look for:
- Tasks stuck in BLOCKED or WAITING state (lock contention, I/O)
- Deadlocks
- Threads spending time in GC or JVM safepoint operations
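The checks above can be partly automated. The sketch below counts thread states in a dump. It assumes the endpoint path /taskmanagers/<id>/thread-dump and a JSON response with a threadInfos array whose entries carry threadName and stringifiedThreadInfo, which is how I understand the REST API; verify both against your Flink version. The sample payload is made up for illustration.

```python
import json
import re
from collections import Counter
from urllib.request import urlopen

# java.lang.management.ThreadInfo prints the state right after "Id=<n> ".
STATE_RE = re.compile(r"Id=\d+ ([A-Z_]+)")


def thread_dump_url(base_url, taskmanager_id):
    """Assumed endpoint path; check the REST API docs for your version."""
    return f"{base_url.rstrip('/')}/taskmanagers/{taskmanager_id}/thread-dump"


def fetch_thread_dump(base_url, taskmanager_id, timeout=10):
    """Download and decode the thread dump JSON."""
    with urlopen(thread_dump_url(base_url, taskmanager_id), timeout=timeout) as resp:
        return json.load(resp)


def count_states(dump):
    """Count threads per state (BLOCKED, WAITING, RUNNABLE, ...)."""
    counts = Counter()
    for info in dump.get("threadInfos", []):
        match = STATE_RE.search(info.get("stringifiedThreadInfo", ""))
        counts[match.group(1) if match else "UNKNOWN"] += 1
    return counts


# Made-up sample in the assumed response shape.
sample = {
    "threadInfos": [
        {"threadName": "map-1",
         "stringifiedThreadInfo": '"map-1" Id=71 BLOCKED on java.lang.Object@1a2b\n'},
        {"threadName": "sink-1",
         "stringifiedThreadInfo": '"sink-1" Id=72 RUNNABLE\n'},
    ]
}
print(dict(count_states(sample)))
```

Against a live cluster, pass the result of fetch_thread_dump to count_states. A high or growing BLOCKED count across successive dumps is a reasonable cue to read the full stacks.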
Garbage collection analysis
GC pauses can manifest as high operator latency, checkpoint timeouts, or apparent back-pressure. Enable GC logging and check the log for:
- Full GC frequency and duration
- Heap growth trends
- Long stop-the-world pauses
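A rough sketch of how such a log can be scanned for Full GCs and long pauses. It assumes JDK 9+ unified logging (for example a log produced with -Xlog:gc) where pause lines end in a duration such as 35.210ms. Older JDKs and custom log decorators use other formats and would need a different pattern. The sample lines and the 200 ms threshold are illustrative only.

```python
import re

# Matches e.g. "GC(7) Pause Young (...) 512M->128M(1024M) 35.210ms".
PAUSE_RE = re.compile(r"GC\(\d+\) Pause (\w+).*\s(\d+(?:\.\d+)?)ms\s*$")


def parse_pauses(lines):
    """Return (kind, milliseconds) for every completed pause line."""
    pauses = []
    for line in lines:
        match = PAUSE_RE.search(line)
        if match:
            pauses.append((match.group(1), float(match.group(2))))
    return pauses


def summarize(lines, threshold_ms=200.0):
    """Summarise Full GC count, pauses at or above the threshold, and the maximum."""
    pauses = parse_pauses(lines)
    return {
        "full_gc_count": sum(1 for kind, _ in pauses if kind == "Full"),
        "long_pauses": [p for p in pauses if p[1] >= threshold_ms],
        "max_pause_ms": max((ms for _, ms in pauses), default=0.0),
    }


# Illustrative lines in the assumed format, not taken from a real job.
sample_log = [
    "[12.345s][info][gc] GC(7) Pause Young (Normal) (G1 Evacuation Pause) 512M->128M(1024M) 35.210ms",
    "[98.001s][info][gc] GC(9) Pause Full (Allocation Failure) 900M->850M(1024M) 1250.000ms",
    "[99.100s][info][gc] GC(10) Concurrent Mark Cycle",
]
print(summarize(sample_log))
```

Comparing pause timestamps with checkpoint timeouts in the Flink logs helps confirm whether GC is the actual cause.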
Memory debug logging
Enable TaskManager memory usage logging to track Flink’s managed memory consumption over time.
JIT compiler analysis
For CPU-bound jobs where JIT compilation behaviour is relevant, JITWatch can visualise the HotSpot JIT compiler’s decisions.
Profiling workflow
Identify the bottleneck
Check the Web UI for back-pressure indicators (red operators) and low throughput. Review checkpoint duration metrics to distinguish state-size problems from CPU or I/O problems.
Collect a thread dump
Use the REST API to collect a thread dump from the suspected TaskManager. Look for threads in BLOCKED or WAITING states.
Enable flame graphs
Set rest.flamegraph.enabled: true and check the Flame Graph tab in the Web UI to identify CPU hotspots at the operator level.
Start a JFR recording
For deeper analysis, start a 2–5 minute JFR recording on the target TaskManager and open it in JDK Mission Control.

