Profiling helps you identify CPU hotspots, memory leaks, GC pressure, and blocking operations in Flink jobs. Flink provides several mechanisms for profiling TaskManager and JobManager JVM processes.

Log-based debugging

Each Flink daemon writes stdout and stderr to a file with an .out suffix and internal logging to a .log file. Java options can direct additional diagnostic files alongside the standard logs via the FLINK_LOG_PREFIX environment variable:
# config.yaml
env.java.opts.all: "-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=${FLINK_LOG_PREFIX}.hprof"
Files using FLINK_LOG_PREFIX rotate with the standard .out and .log files.

Java Flight Recorder (JFR)

Java Flight Recorder is a low-overhead profiling and event collection framework built into the JDK (Oracle JDK and OpenJDK 11+). It records JVM-level events—thread activity, GC, memory allocation, I/O—with minimal impact on the running application.

Enabling JFR at startup

# config.yaml (JDK 11+; the legacy defaultrecording option was removed in JDK 9)
env.java.opts.all: >-
  -XX:+UnlockDiagnosticVMOptions
  -XX:+DebugNonSafepoints
  -XX:StartFlightRecording=dumponexit=true,filename=${FLINK_LOG_PREFIX}.jfr
The recording is written to a .jfr file when the JVM exits. Open it with JDK Mission Control to analyse:
  • Method profiling (CPU hotspots)
  • Memory allocation and GC events
  • Thread states and lock contention
  • Exception frequency

Recording on a running process

For a running TaskManager, start a JFR recording using jcmd:
# Find the TaskManager PID
ps aux | grep TaskManager

# Start a 60-second recording
jcmd <pid> JFR.start duration=60s filename=/tmp/taskmanager.jfr

# Or dump an ongoing recording
jcmd <pid> JFR.dump filename=/tmp/taskmanager-snapshot.jfr
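If you script these jcmd invocations, building the argv explicitly avoids shell-quoting surprises. The helper below is a hypothetical convenience wrapper around the commands above, not a Flink or JDK API:

```python
import subprocess

def jfr_start_cmd(pid: int, duration_s: int, filename: str) -> list[str]:
    """Build the jcmd argv for a fixed-duration JFR recording."""
    return ["jcmd", str(pid), "JFR.start",
            f"duration={duration_s}s", f"filename={filename}"]

def record(pid: int, duration_s: int = 60,
           filename: str = "/tmp/taskmanager.jfr") -> None:
    """Start the recording (requires jcmd on PATH and access to the target JVM)."""
    subprocess.run(jfr_start_cmd(pid, duration_s, filename), check=True)

# Example argv produced for PID 12345:
#   jcmd 12345 JFR.start duration=60s filename=/tmp/taskmanager.jfr
```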

Heap dumps

Heap dumps capture the full state of the Java heap for offline analysis with tools like Eclipse MAT or JDK Mission Control.

Automatic heap dump on OOM

# config.yaml
env.java.opts.all: >-
  -XX:+HeapDumpOnOutOfMemoryError
  -XX:HeapDumpPath=${FLINK_LOG_PREFIX}.hprof
The heap dump is written when an OutOfMemoryError occurs. Analyse it to find:
  • Objects consuming the most memory
  • Retained object graphs indicating memory leaks
  • Unexpected large collections

On-demand heap dump

# Trigger a heap dump from a running TaskManager
jmap -dump:format=b,file=/tmp/heap-$(date +%s).hprof <pid>
Heap dumps on busy JVMs can pause the process for seconds to minutes while the heap is serialised. Avoid triggering them on production TaskManagers handling latency-sensitive workloads.

Thread dumps via REST API

Flink exposes thread dump endpoints that let you capture thread stacks from any TaskManager without SSH access:
# Get a thread dump from a specific TaskManager
curl http://jobmanager:8081/v1/taskmanagers/:taskmanagerid/thread-dump
The response is a JSON object with thread names, states, and full stack traces. This is useful for diagnosing:
  • Tasks stuck in BLOCKED or WAITING state (lock contention, I/O)
  • Deadlocks
  • Threads spending time in GC or JVM safepoint operations
# Get TaskManager IDs first
curl http://jobmanager:8081/v1/taskmanagers | jq '.taskmanagers[].id'

# Then get a thread dump
curl http://jobmanager:8081/v1/taskmanagers/container_xyz_000001/thread-dump | \
  jq -r '.threadInfos[].stringifiedThreadInfo' | grep -A 4 'BLOCKED'
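Beyond one-off jq filters, a short script can tally thread states across a whole dump. This sketch assumes each threadInfos entry carries a stringifiedThreadInfo string whose header line names the thread state; verify the response shape against your Flink version:

```python
import re
from collections import Counter

# Thread states that can appear in a stringified java.lang.Thread.State header.
STATE_RE = re.compile(
    r"\b(NEW|RUNNABLE|BLOCKED|WAITING|TIMED_WAITING|TERMINATED)\b")

def count_thread_states(thread_dump: dict) -> Counter:
    """Tally thread states from a Flink thread-dump REST response."""
    states = Counter()
    for info in thread_dump.get("threadInfos", []):
        match = STATE_RE.search(info.get("stringifiedThreadInfo", ""))
        if match:
            states[match.group(1)] += 1
    return states

# Hand-written response fragment for illustration:
sample = {"threadInfos": [
    {"threadName": "map -> sink (1/4)",
     "stringifiedThreadInfo": '"map -> sink (1/4)" Id=87 BLOCKED on java.lang.Object@1a2b3c'},
    {"threadName": "dispatcher-2",
     "stringifiedThreadInfo": '"dispatcher-2" Id=12 TIMED_WAITING on java.lang.Object@4d5e6f'},
]}
print(count_thread_states(sample))
```

A dump dominated by BLOCKED task threads points at lock contention; many WAITING threads are often just idle pool threads and usually benign.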

Garbage collection analysis

GC pauses can manifest as high operator latency, checkpoint timeouts, or apparent back-pressure. Enable GC logging:
# config.yaml (Java 9+)
env.java.opts.all: >-
  -Xlog:gc*:file=${FLINK_LOG_PREFIX}.gc.log:time,uptime:filecount=10,filesize=10m
# config.yaml (Java 8)
env.java.opts.all: >-
  -Xloggc:${FLINK_LOG_PREFIX}.gc.log
  -XX:+PrintGCApplicationStoppedTime
  -XX:+PrintGCDetails
  -XX:+PrintGCDateStamps
  -XX:+UseGCLogFileRotation
  -XX:NumberOfGCLogFiles=10
  -XX:GCLogFileSize=10M
Analyse GC logs with GCEasy or GCViewer to identify:
  • Full GC frequency and duration
  • Heap growth trends
  • Long stop-the-world pauses
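To spot the worst pauses without a GUI tool, a small parser over the unified GC log is often enough. The log-line shape below is an assumption based on -Xlog:gc* with the time,uptime decorators; adapt the regex to your JVM's actual output:

```python
import re

# Matches unified-logging pause lines shaped like (illustrative, not guaranteed):
# [2024-01-01T00:00:12.345+0000][12.345s] GC(7) Pause Young (Normal) (G1 Evacuation Pause) 24M->4M(256M) 3.456ms
PAUSE_RE = re.compile(
    r"\[(?P<uptime>[\d.]+)s\].*GC\(\d+\) (?P<kind>Pause \w+).*? (?P<ms>[\d.]+)ms")

def worst_pauses(lines, top=5):
    """Return the longest GC pauses as (milliseconds, uptime_seconds, kind) tuples."""
    pauses = []
    for line in lines:
        m = PAUSE_RE.search(line)
        if m:
            pauses.append((float(m.group("ms")),
                           float(m.group("uptime")),
                           m.group("kind")))
    return sorted(pauses, reverse=True)[:top]

sample_log = [
    "[2024-01-01T00:00:12.345+0000][12.345s] GC(7) Pause Young (Normal) (G1 Evacuation Pause) 24M->4M(256M) 3.456ms",
    "[2024-01-01T00:01:02.000+0000][61.900s] GC(8) Pause Full (G1 Compaction Pause) 200M->80M(256M) 812.113ms",
]
print(worst_pauses(sample_log))
```

Cross-reference the uptime of long pauses with checkpoint timestamps to see whether GC, rather than state size, is what pushes checkpoints over their timeout.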

Memory debug logging

Enable TaskManager memory usage logging to track Flink’s managed memory consumption over time:
taskmanager.debug.memory.log: true
taskmanager.debug.memory.log-interval: 10000  # log every 10 seconds
This adds periodic memory usage summaries to the TaskManager log at DEBUG level.
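The exact format of these summaries varies by Flink version. Assuming lines shaped like the illustrative example below (verify against your own TaskManager log and adjust the regex), a quick parse can reveal a heap-growth trend:

```python
import re

# Illustrative line shape (assumption; check your Flink version's log output):
# Memory usage stats: [HEAP: 170/388/3221 MB, NON HEAP: 68/70/-1 MB (used/committed/max)]
HEAP_RE = re.compile(r"HEAP: (\d+)/(\d+)/(\d+) MB")

def heap_used_mb(lines):
    """Extract the 'used' heap figure (MB) from each memory-stats log line."""
    return [int(m.group(1)) for line in lines if (m := HEAP_RE.search(line))]

sample = [
    "... Memory usage stats: [HEAP: 170/388/3221 MB, NON HEAP: 68/70/-1 MB (used/committed/max)]",
    "... Memory usage stats: [HEAP: 930/1024/3221 MB, NON HEAP: 69/70/-1 MB (used/committed/max)]",
]
print(heap_used_mb(sample))  # used-heap values, one per logging interval
```

Used-heap values that keep climbing across intervals without dropping back after GC are a hint to take a heap dump and inspect retained object graphs.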

JIT compiler analysis

For CPU-bound jobs where JIT compilation behaviour is relevant, JITWatch can visualise the HotSpot JIT compiler’s decisions:
env.java.opts.all: >-
  -XX:+UnlockDiagnosticVMOptions
  -XX:+TraceClassLoading
  -XX:+LogCompilation
  -XX:LogFile=${FLINK_LOG_PREFIX}.jit
  -XX:+PrintAssembly
-XX:+LogCompilation and -XX:+PrintAssembly produce very large log files on long-running applications, so use them only during short profiling sessions. -XX:+PrintAssembly additionally requires the hsdis disassembler library to be available to the JVM.

Profiling workflow

1. Identify the bottleneck. Check the Web UI for back-pressure indicators (red operators) and low throughput. Review checkpoint duration metrics to distinguish state-size problems from CPU or I/O problems.
2. Collect a thread dump. Use the REST API to collect a thread dump from the suspected TaskManager. Look for threads in BLOCKED or WAITING states.
3. Enable flame graphs. Set rest.flamegraph.enabled: true and check the Flame Graph tab in the Web UI to identify CPU hotspots at the operator level.
4. Start a JFR recording. For deeper analysis, start a 2–5 minute JFR recording on the target TaskManager and open it in JDK Mission Control.
5. Analyse and act. Based on findings, tune serializers, reduce state size, adjust parallelism, or optimise user code hotspots.
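The thread-dump collection step can be scripted against the REST endpoints shown earlier. This sketch assumes only those endpoints; the helper names are hypothetical conveniences, not a Flink API:

```python
import json
from urllib.request import urlopen

def taskmanager_ids(overview: dict) -> list[str]:
    """Pull TaskManager IDs out of a /v1/taskmanagers response body."""
    return [tm["id"] for tm in overview.get("taskmanagers", [])]

def dump_all_threads(base_url: str) -> dict:
    """Fetch a thread dump from every registered TaskManager (needs network access)."""
    with urlopen(f"{base_url}/v1/taskmanagers") as resp:
        overview = json.load(resp)
    dumps = {}
    for tm_id in taskmanager_ids(overview):
        with urlopen(f"{base_url}/v1/taskmanagers/{tm_id}/thread-dump") as resp:
            dumps[tm_id] = json.load(resp)
    return dumps

# Usage (requires a reachable JobManager REST endpoint):
#   dumps = dump_all_threads("http://jobmanager:8081")
#   for tm_id, dump in dumps.items():
#       print(tm_id, len(dump.get("threadInfos", [])), "threads")
```

Collecting two or three dumps a few seconds apart makes stuck threads obvious: a thread BLOCKED in the same frame across every dump is contended, not merely busy.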
