Job configuration
Set explicit operator UIDs
Assign a stable UID to every stateful operator using
uid(String). Without explicit UIDs, Flink auto-generates IDs from the job graph topology. Any structural change to the job (even adding a stateless operator) regenerates these IDs and breaks savepoint compatibility.Set maximum parallelism explicitly
Maximum parallelism determines the upper bound for rescaling and cannot be changed after a job has started without discarding state. Set it high enough for anticipated growth but low enough to keep metadata overhead reasonable.If not set, Flink derives it from the operator’s current parallelism:
- Parallelism ≤ 128: max parallelism = 128
- Parallelism > 128:
MIN(nextPowerOfTwo(parallelism × 1.5), 32768)
Choose the right state backend
Select a state backend appropriate for your state size:
| State size | Recommended backend |
|---|---|
| Fits in JVM heap | HashMapStateBackend |
| Larger than heap, fits on disk | EmbeddedRocksDBStateBackend |
| Exceeds local disk | ForStStateBackend (experimental) |
Configure checkpointing
Enable checkpointing with an interval appropriate for your SLA. The checkpoint interval determines the maximum data you can lose on failure.Factors to consider:
- Recovery SLA: a 5-minute interval means up to 5 minutes of data reprocessing after failure
- Sink delivery: exactly-once sinks only make results visible at checkpoint boundaries; shorter intervals reduce output latency
- Cluster load: checkpointing consumes CPU and network; incremental checkpoints reduce per-checkpoint cost
Enable incremental checkpoints for RocksDB
If you are using
EmbeddedRocksDBStateBackend with large state, enable incremental checkpoints to reduce checkpoint duration and uploaded data volume:Cluster configuration
Configure JobManager high availability
Without HA, the JobManager is a single point of failure. Configure HA using ZooKeeper or Kubernetes to allow swift leader re-election and job recovery:
Tune TaskManager memory
Size TaskManager memory carefully. For RocksDB jobs, ensure managed memory is large enough:Monitor
Status.JVM.Memory.Heap.Used and Status.Flink.Memory.Managed.Used to verify memory is within bounds.Configure a restart strategy
The default restart strategy when checkpointing is enabled is exponential delay, which is recommended for production. Verify or customise it:
Observability
Configure a metrics reporter
Export metrics to an external monitoring system. Prometheus is the most common choice:Key metrics to alert on:
numberOfFailedCheckpoints— rising value indicates checkpointing problemslastCheckpointDuration— high values indicate state growth or resource pressurebuffers.outPoolUsage(per task) — sustained >0.8 indicates back-pressurenumRecordsInPerSecond— monitor for drops indicating source problems
Review log configuration
Use Integrate log shipping with your existing log aggregation infrastructure (ELK, Loki, etc.).
INFO log level in production. Set noisy third-party libraries to WARN:Security
Restrict cluster access
Flink is designed to execute arbitrary user code remotely. Do not expose the JobManager REST API (port 8081) to the public internet. Restrict access via:
- Network policies (Kubernetes)
- Security groups or firewall rules
- A reverse proxy with authentication
Pre-launch checklist
- All stateful operators have explicit UIDs
- Maximum parallelism is set to a value that allows future scaling
- State backend is appropriate for the expected state size
- Checkpointing is enabled with an appropriate interval
- Checkpoint storage points to a distributed filesystem accessible by all nodes
- Incremental checkpoints are enabled (if using RocksDB with large state)
- Retained checkpoints or savepoints are configured
- JobManager HA is configured
- Restart and failover strategies are configured
- Memory is sized appropriately for the state backend
- Metrics reporter is configured and dashboards are in place
- Alerting is configured for checkpoint failures and back-pressure
- Log level is set to INFO; log aggregation is set up
- Cluster access is restricted; TLS is enabled if required

