JobManager High Availability (HA) hardens a Flink cluster against JobManager failures. With HA enabled, submitted jobs can recover and continue executing even if the active JobManager crashes.

Why HA matters

By default, each Flink cluster has a single JobManager. The JobManager is responsible for scheduling, resource management, and tracking distributed checkpoints. If it fails, no new jobs can be submitted and any running jobs fail.

With HA configured, one or more standby JobManagers are ready to take over leadership if the active JobManager crashes. Failover is automatic and transparent to running jobs. At any point in time, exactly one JobManager is the active leader; the others remain on standby, waiting to be elected. Flink’s HA services provide three capabilities:
  • Leader election: selects a single active JobManager from a pool of candidates
  • Service discovery: allows TaskManagers and clients to find the current leader
  • State persistence: stores recovery metadata so a successor can resume job execution

What is persisted for recovery

When a new JobManager takes over, it reads persisted state to resume jobs:
  • JobGraph: the logical execution plan of each submitted job
  • User code JARs: uploaded application JARs
  • Completed checkpoint metadata: location of the latest successful checkpoint for each job
  • Leader pointers: ZooKeeper/Kubernetes records pointing to the latest persisted state
The heavy state (checkpoint data) lives in the file system at high-availability.storageDir. Only a small pointer to that storage is kept in ZooKeeper or Kubernetes.
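As a minimal sketch, both pieces are set in the Flink configuration; the key names below follow recent Flink releases (older versions use `high-availability` instead of `high-availability.type`), and the storage path is an illustrative example:

```yaml
# Flink configuration — illustrative values
high-availability.type: zookeeper              # or "kubernetes"
high-availability.storageDir: s3://my-bucket/flink/ha   # durable store for JobGraphs, JARs, checkpoint metadata
```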

Available HA services

ZooKeeper HA

Works with any Flink deployment. Requires a running ZooKeeper quorum. Uses ZooKeeper for leader election and stores metadata pointers.
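A hedged configuration sketch for ZooKeeper HA; the quorum hosts, paths, and cluster id below are placeholders, not defaults:

```yaml
# ZooKeeper HA — hostnames and paths are examples
high-availability.type: zookeeper
high-availability.zookeeper.quorum: zk1:2181,zk2:2181,zk3:2181
high-availability.zookeeper.path.root: /flink          # root znode for all Flink clusters
high-availability.cluster-id: /my-cluster              # isolates this cluster within the quorum
high-availability.storageDir: hdfs:///flink/ha
```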

Kubernetes HA

Only available when running on Kubernetes. Uses Kubernetes ConfigMaps for metadata and the Kubernetes LeaderElection API for leader election. No external ZooKeeper required.
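A corresponding sketch for Kubernetes HA, assuming an object store for the HA data; the cluster id and bucket are illustrative:

```yaml
# Kubernetes HA — cluster-id and bucket are examples
high-availability.type: kubernetes
kubernetes.cluster-id: my-flink-cluster    # prefixes the HA ConfigMap names
high-availability.storageDir: s3://my-bucket/flink/ha
```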

High availability data lifecycle

Flink retains HA data only as long as the associated job is running or recovering. Once a job reaches a globally terminal state (finished, cancelled, or failed), Flink deletes all associated HA data, including metadata in ZooKeeper or Kubernetes and files in high-availability.storageDir.

JobResultStore

The JobResultStore archives the final result of jobs that have reached a globally terminal state. Results are stored on a filesystem path configured via job-result-store.storage-path. An entry is marked as dirty while the corresponding job’s artifacts haven’t been fully cleaned up; Flink schedules dirty entries for cleanup and removes each entry once its cleanup succeeds. If cleanup fails repeatedly and exceeds the configured maximum retries, the job is left in a dirty state, and you can inspect the directory configured via high-availability.storageDir and remove the leftover artifacts manually.
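The storage path can be pinned explicitly; the bucket path below is an example (by default Flink derives it from high-availability.storageDir):

```yaml
# JobResultStore — path is an example
job-result-store.storage-path: s3://my-bucket/flink/job-results
job-result-store.delete-on-commit: true    # drop entries once cleanup completes
```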

Deployment-specific HA setup

Standalone + ZooKeeper

Configure multiple JobManagers in conf/masters and point them at a ZooKeeper quorum.
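For instance, a two-JobManager standalone cluster lists one host:webui-port pair per line in conf/masters; the hostnames here are placeholders:

```
# conf/masters — one JobManager per line (host:webui-port), hosts are examples
master1.example.com:8081
master2.example.com:8081
```

Each listed host also needs the ZooKeeper HA settings in its Flink configuration so the JobManagers can run leader election.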

YARN + ZooKeeper

YARN restarts failed JobManager (ApplicationMaster) containers, while ZooKeeper handles leader election and metadata persistence.
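A sketch of the relevant settings, assuming YARN's restart budget is raised alongside ZooKeeper HA; the attempt count and hosts are illustrative:

```yaml
# YARN + ZooKeeper HA — values are examples
yarn.application-attempts: 10              # allow YARN to restart the JobManager container
high-availability.type: zookeeper
high-availability.zookeeper.quorum: zk1:2181,zk2:2181,zk3:2181
high-availability.storageDir: hdfs:///flink/ha
```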

Native Kubernetes

Use Kubernetes HA with multiple JobManager replicas configured via kubernetes.jobmanager.replicas.
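Putting it together for native Kubernetes, a hedged sketch with one leader and one standby; the cluster id and bucket are placeholders:

```yaml
# Native Kubernetes with a standby JobManager — values are examples
kubernetes.jobmanager.replicas: 2          # one leader + one standby
high-availability.type: kubernetes
kubernetes.cluster-id: my-flink-cluster
high-availability.storageDir: s3://my-bucket/flink/ha
```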
