Why HA matters
By default, each Flink cluster has a single JobManager. The JobManager is responsible for scheduling, resource management, and tracking distributed checkpoints. If it fails, no new jobs can be submitted and any running jobs fail. With HA configured, one or more standby JobManagers are ready to take over leadership if the active JobManager crashes. Failover is automatic and transparent to running jobs.
How Flink HA works
At any point in time, exactly one leading JobManager is active. The others remain on standby, waiting to be elected leader. Flink's HA services provide three capabilities:

| Capability | Description |
|---|---|
| Leader election | Selects a single active JobManager from a pool of candidates |
| Service discovery | Allows TaskManagers and clients to find the current leader |
| State persistence | Stores recovery metadata so the successor can resume job execution |
What is persisted for recovery
When a new JobManager takes over, it reads persisted state to resume jobs:

- JobGraph: the logical execution plan of each submitted job
- User code JARs: uploaded application JARs
- Completed checkpoint metadata: location of the latest successful checkpoint for each job
- Leader pointers: ZooKeeper/Kubernetes records pointing to the latest persisted state
The JobGraphs, JARs, and checkpoint metadata themselves are written to the file system path configured via high-availability.storageDir. Only a small pointer to that storage is kept in ZooKeeper or Kubernetes.
Available HA services
ZooKeeper HA
Works with any Flink deployment. Requires a running ZooKeeper quorum. Uses ZooKeeper for leader election and stores metadata pointers.
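As an illustration, a minimal ZooKeeper HA setup in the Flink configuration file could look like the following. The hostnames, ports, and storage path are placeholders; recent Flink releases use the high-availability.type key, while older releases use high-availability instead:

```yaml
# Enable the ZooKeeper-based HA services
high-availability.type: zookeeper
# ZooKeeper quorum used for leader election (placeholder hosts)
high-availability.zookeeper.quorum: zk1:2181,zk2:2181,zk3:2181
# Root znode under which Flink keeps its HA pointers
high-availability.zookeeper.path.root: /flink
# Durable storage for JobGraphs, JARs, and checkpoint metadata
high-availability.storageDir: hdfs:///flink/ha/
```

Note that storageDir must point to durable, shared storage (e.g. HDFS or S3) that all JobManager candidates can reach; ZooKeeper itself only holds the small pointers.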
Kubernetes HA
Only available when running on Kubernetes. Uses Kubernetes ConfigMaps for metadata and the Kubernetes LeaderElection API for leader election. No external ZooKeeper required.
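A sketch of the corresponding Kubernetes HA configuration, with placeholder values for the cluster id and storage path:

```yaml
# Enable the Kubernetes-based HA services
high-availability.type: kubernetes
# Identifies this Flink cluster's ConfigMaps within the namespace
kubernetes.cluster-id: my-flink-cluster
# Durable storage for HA metadata files
high-availability.storageDir: s3://my-bucket/flink/ha/
```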
High availability data lifecycle
Flink retains HA data only as long as the associated job is running or recovering. Once a job reaches a globally terminal state (finished, cancelled, or failed), Flink deletes all associated HA data, including metadata in ZooKeeper or Kubernetes and files in high-availability.storageDir.
JobResultStore
The JobResultStore archives the final result of jobs that have reached a globally terminal state. Results are stored on a filesystem path configured via job-result-store.storage-path.
Entries are marked as dirty while the corresponding job’s artifacts haven’t been fully cleaned up. Dirty entries are scheduled for cleanup by Flink. Once cleanup succeeds, the entry is removed.
If cleanup fails repeatedly and exceeds the configured maximum number of retries, the job is left in a dirty state. You can then inspect the directory configured via high-availability.storageDir and manually remove the leftover artifacts.
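A hedged sketch of JobResultStore configuration; the storage path is a placeholder, and delete-on-commit (which, when set to false, keeps a marker for each cleaned-up result instead of deleting the entry) is an optional setting you may not need to change:

```yaml
# Filesystem path where final job results are archived
job-result-store.storage-path: hdfs:///flink/job-result-store/
# Keep entries after successful cleanup instead of deleting them
job-result-store.delete-on-commit: false
```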
Deployment-specific HA setup
Standalone + ZooKeeper
Configure multiple JobManagers in conf/masters and point them at a ZooKeeper quorum.
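The conf/masters file lists one host and web UI port per JobManager, one per line; for example (hostnames are placeholders):

```
master1.example.com:8081
master2.example.com:8081
```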
YARN + ZooKeeper
YARN restarts failed JobManager containers automatically, so no standby JobManagers are needed; ZooKeeper HA handles leader election and metadata persistence across restarts.
Native Kubernetes
Use Kubernetes HA with multiple JobManager replicas configured via kubernetes.jobmanager.replicas.
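Putting this together, a sketch of the settings for a native Kubernetes deployment with two JobManagers (cluster id and bucket are placeholders):

```yaml
# Run one active and one standby JobManager
kubernetes.jobmanager.replicas: 2
high-availability.type: kubernetes
kubernetes.cluster-id: my-flink-cluster
high-availability.storageDir: s3://my-bucket/flink/recovery/
```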

