This checklist covers the configuration options and operational practices you should review before running a Flink job in production. While Flink ships with sensible defaults, several settings require deliberate choices based on your specific workload and SLAs.

Job configuration

1. Set explicit operator UIDs

Assign a stable UID to every stateful operator using uid(String). Without explicit UIDs, Flink auto-generates IDs from the job graph topology. Any structural change to the job (even adding a stateless operator) regenerates these IDs and breaks savepoint compatibility.
stream
    .map(new StatefulMapper()).uid("mapper-v1")
    .keyBy(Event::getKey)
    .window(TumblingEventTimeWindows.of(Duration.ofMinutes(5)))
    .aggregate(new MyAggregator()).uid("aggregator-v1")
    .sinkTo(kafkaSink).uid("kafka-sink-v1"); // kafkaSink: a KafkaSink built via KafkaSink.builder()
2. Set maximum parallelism explicitly

Maximum parallelism determines the upper bound for rescaling and cannot be changed after a job has started without discarding state. Set it high enough for anticipated growth but low enough to keep metadata overhead reasonable.
env.setMaxParallelism(1024); // must be between parallelism and 2^15
If not set, Flink derives it from the operator’s current parallelism:
  • Parallelism ≤ 128: max parallelism = 128
  • Parallelism > 128: MIN(nextPowerOfTwo(parallelism × 1.5), 32768)
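This derivation can be sketched in plain Java (a direct transcription of the documented rule above, not Flink's internal code; the class and method names are illustrative):

```java
public class MaxParallelism {
    // Smallest power of two >= n (for n >= 1).
    static int nextPowerOfTwo(int n) {
        int p = 1;
        while (p < n) p <<= 1;
        return p;
    }

    // Default max parallelism per the documented rule:
    // 128 for parallelism <= 128, otherwise
    // MIN(nextPowerOfTwo(parallelism * 1.5), 32768).
    static int defaultMaxParallelism(int parallelism) {
        if (parallelism <= 128) return 128;
        return Math.min(nextPowerOfTwo(parallelism + parallelism / 2), 32768);
    }
}
```

For example, parallelism 200 yields 200 + 100 = 300, rounded up to 512; parallelism 30000 is capped at 32768. Because the derived value changes as you change parallelism, setting it explicitly is the only way to keep it stable across deployments.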
3. Choose the right state backend

Select a state backend appropriate for your state size:
State size                       Recommended backend
Fits in JVM heap                 HashMapStateBackend
Larger than heap, fits on disk   EmbeddedRocksDBStateBackend
Exceeds local disk               ForStStateBackend (experimental)
state.backend.type: rocksdb
execution.checkpointing.dir: hdfs:///flink/checkpoints
4. Configure checkpointing

Enable checkpointing with an interval appropriate for your SLA. The checkpoint interval bounds how much data must be reprocessed after a failure.
execution.checkpointing.interval: 60 s
execution.checkpointing.mode: EXACTLY_ONCE
execution.checkpointing.min-pause: 30 s
Factors to consider:
  • Recovery SLA: a 5-minute interval means up to 5 minutes of data reprocessing after failure
  • Sink delivery: exactly-once sinks only make results visible at checkpoint boundaries; shorter intervals reduce output latency
  • Cluster load: checkpointing consumes CPU and network; incremental checkpoints reduce per-checkpoint cost
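A few related knobs are worth reviewing alongside the interval. This is a sketch using option names from recent Flink releases; verify them against your version's configuration reference:

```yaml
execution.checkpointing.timeout: 10 min                  # abort checkpoints that take too long
execution.checkpointing.max-concurrent-checkpoints: 1    # avoid overlapping checkpoints
execution.checkpointing.tolerable-failed-checkpoints: 3  # consecutive failures tolerated before job failover
```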
5. Enable incremental checkpoints for RocksDB

If you are using EmbeddedRocksDBStateBackend with large state, enable incremental checkpoints to reduce checkpoint duration and uploaded data volume:
execution.checkpointing.incremental: true
6. Configure retained checkpoints

Configure retained checkpoints so you can restart a job from a checkpoint after a cancellation:
env.getCheckpointConfig().setExternalizedCheckpointRetention(
    ExternalizedCheckpointRetention.RETAIN_ON_CANCELLATION
);
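The same retention policy can be set in the configuration file (key name as in recent Flink releases; check your version):

```yaml
execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION
```

A cancelled job can then be restarted from the retained checkpoint path with `flink run -s <checkpoint-path> …`.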

Cluster configuration

1. Configure JobManager high availability

Without HA, the JobManager is a single point of failure. Configure HA using ZooKeeper or Kubernetes to allow swift leader re-election and job recovery:
# ZooKeeper-based HA
high-availability.type: zookeeper
high-availability.zookeeper.quorum: zk1:2181,zk2:2181,zk3:2181
high-availability.storageDir: hdfs:///flink/ha
# Kubernetes-based HA
high-availability.type: kubernetes
high-availability.storageDir: s3://my-bucket/flink/ha
2. Tune TaskManager memory

Size TaskManager memory carefully. For RocksDB jobs, ensure managed memory is large enough:
taskmanager.memory.process.size: 8g
taskmanager.memory.managed.fraction: 0.5  # 50% of total Flink memory to managed memory for RocksDB
Monitor Status.JVM.Memory.Heap.Used and Status.Flink.Memory.Managed.Used to verify memory is within bounds.
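As a rough back-of-envelope for where the 8g goes: the managed fraction applies to total Flink memory, i.e. the process size minus JVM metaspace and JVM overhead. The sketch below (plain Java, illustrative names) assumes the default metaspace size (256 MB) and default JVM overhead (10% of process size, clamped to [192 MB, 1 GB]); it is a simplification of Flink's memory model, not its exact sizing logic:

```java
public class TmMemorySketch {
    // Approximate managed memory in MB for a given process size and managed fraction.
    static double managedMemoryMb(double processMb, double managedFraction) {
        double metaspaceMb = 256.0; // taskmanager.memory.jvm-metaspace.size default
        // taskmanager.memory.jvm-overhead: 10% of process size, clamped to [192 MB, 1 GB]
        double overheadMb = Math.min(Math.max(processMb * 0.10, 192.0), 1024.0);
        double totalFlinkMb = processMb - metaspaceMb - overheadMb;
        return totalFlinkMb * managedFraction; // fraction of *total Flink memory*
    }
}
```

For the 8g / 0.5 example this lands around 3.5 GB of managed memory available to RocksDB, noticeably less than half of the full process size.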
3. Configure a restart strategy

The default restart strategy when checkpointing is enabled is exponential delay, which is recommended for production. Verify or customise it:
restart-strategy.type: exponential-delay
restart-strategy.exponential-delay.initial-backoff: 10 s
restart-strategy.exponential-delay.max-backoff: 2 min
restart-strategy.exponential-delay.backoff-multiplier: 1.4
restart-strategy.exponential-delay.reset-backoff-threshold: 10 min
restart-strategy.exponential-delay.jitter-factor: 0.1
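To build intuition for these numbers, here is a small sketch (plain Java, illustrative names, not Flink code) of the delay sequence they produce, ignoring jitter:

```java
import java.util.ArrayList;
import java.util.List;

public class BackoffSketch {
    // Successive restart delays in seconds: start at initial, multiply, cap at max.
    static List<Double> delays(double initialSec, double maxSec, double multiplier, int attempts) {
        List<Double> out = new ArrayList<>();
        double d = initialSec;
        for (int i = 0; i < attempts; i++) {
            out.add(Math.min(d, maxSec));
            d *= multiplier;
        }
        return out;
    }
}
```

With initial-backoff 10 s, multiplier 1.4, and max-backoff 2 min, the delays run 10, 14, 19.6, 27.4, … and plateau at 120 s from the ninth restart; after 10 min without a failure the backoff resets to 10 s.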
4. Use the region failover strategy

The region failover strategy restarts only the minimum set of tasks needed to recover, reducing the blast radius of individual task failures:
jobmanager.execution.failover-strategy: region

Observability

1. Configure a metrics reporter

Export metrics to an external monitoring system. Prometheus is the most common choice:
metrics.reporter.prom.factory.class: org.apache.flink.metrics.prometheus.PrometheusReporterFactory
metrics.reporter.prom.port: 9249
Key metrics to alert on:
  • numberOfFailedCheckpoints — rising value indicates checkpointing problems
  • lastCheckpointDuration — high values indicate state growth or resource pressure
  • buffers.outPoolUsage (per task) — sustained >0.8 indicates back-pressure
  • numRecordsInPerSecond — monitor for drops indicating source problems
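As an illustration, a Prometheus alerting rule on failed checkpoints might look like the following. The exported metric name follows the reporter's `flink_<scope>_<metric>` pattern; confirm the exact name and labels against your cluster's /metrics endpoint before relying on it:

```yaml
groups:
  - name: flink
    rules:
      - alert: FlinkCheckpointsFailing
        expr: increase(flink_jobmanager_job_numberOfFailedCheckpoints[15m]) > 0
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Flink job {{ $labels.job_name }} has failing checkpoints"
```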
2. Review log configuration

Use INFO log level in production. Set noisy third-party libraries to WARN:
rootLogger.level = INFO
logger.zookeeper.name = org.apache.zookeeper
logger.zookeeper.level = WARN
logger.kafka.name = org.apache.kafka
logger.kafka.level = WARN
Integrate log shipping with your existing log aggregation infrastructure (ELK, Loki, etc.).
3. Enable flame graphs (optional)

For performance debugging in non-production environments:
rest.flamegraph.enabled: true
Flame graphs collect stack traces by sampling, which has a small CPU overhead. Enable in production only during active incident investigation.

Security

1. Restrict cluster access

Flink is designed to execute arbitrary user code remotely. Do not expose the JobManager REST API (port 8081) to the public internet. Restrict access via:
  • Network policies (Kubernetes)
  • Security groups or firewall rules
  • A reverse proxy with authentication
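For the Kubernetes case, a minimal NetworkPolicy sketch is shown below; the namespace, pod labels, and allowed-client selector are placeholders to adapt to your deployment:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: flink-jobmanager-rest
  namespace: flink              # placeholder
spec:
  podSelector:
    matchLabels:
      component: jobmanager     # placeholder: your JobManager pod label
  ingress:
    - from:
        - podSelector:
            matchLabels:
              role: flink-client  # placeholder: pods allowed to reach the REST API
      ports:
        - protocol: TCP
          port: 8081
```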
2. Configure SSL/TLS

Enable TLS for REST and RPC communication:
security.ssl.rest.enabled: true
security.ssl.rest.keystore: /path/to/keystore.jks
security.ssl.rest.keystore-password: changeit
security.ssl.rest.truststore: /path/to/truststore.jks
security.ssl.rest.truststore-password: changeit
3. Configure Kerberos (if applicable)

For Hadoop/HDFS environments with Kerberos authentication:
security.kerberos.login.principal: [email protected]
security.kerberos.login.keytab: /etc/security/keytabs/flink.keytab

Pre-launch checklist

  • All stateful operators have explicit UIDs
  • Maximum parallelism is set to a value that allows future scaling
  • State backend is appropriate for the expected state size
  • Checkpointing is enabled with an appropriate interval
  • Checkpoint storage points to a distributed filesystem accessible by all nodes
  • Incremental checkpoints are enabled (if using RocksDB with large state)
  • Retained checkpoints or savepoints are configured
  • JobManager HA is configured
  • Restart and failover strategies are configured
  • Memory is sized appropriately for the state backend
  • Metrics reporter is configured and dashboards are in place
  • Alerting is configured for checkpoint failures and back-pressure
  • Log level is set to INFO; log aggregation is set up
  • Cluster access is restricted; TLS is enabled if required
