Reference architecture
Every Flink cluster consists of three core components: a client, a JobManager, and one or more TaskManagers.

- The client takes your application code, compiles it into a JobGraph, and submits it to the JobManager.
- The JobManager coordinates the distributed execution: it schedules tasks, tracks distributed checkpoints, and reacts to failures.
- TaskManagers (also called workers) execute the actual operators: sources, transformations, and sinks.
Deployment modes
Flink supports two modes for submitting jobs, which differ in cluster lifecycle, resource isolation, and where the application’s main() method runs.
- Application mode: a dedicated cluster is created per application.
- Session mode: a pre-existing, long-running cluster is shared by multiple jobs.
In Application mode, Flink creates a dedicated cluster for a single application. The application’s main() method executes on the JobManager, which means:

- User JARs must be available on the classpath of all Flink components (typically in the usrlib/ folder or bundled into the container image).
- Dependencies are not shipped via RPC from the client, making submission faster and cheaper.
- The cluster shuts down when the application finishes.
- Resource isolation is strong: each application has its own JobManager.
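As a sketch, an Application-mode submission to a native Kubernetes cluster might look like the following. The cluster ID, image name, and JAR path are placeholders, and exact configuration key names can vary between Flink versions; note that the JAR is referenced with a local:// path, i.e. it is expected inside the image rather than uploaded by the client.

```sh
# Sketch: Application mode on native Kubernetes (all names are placeholders).
./bin/flink run-application \
    --target kubernetes-application \
    -Dkubernetes.cluster-id=my-application \
    -Dkubernetes.container.image=my-registry/my-flink-app:latest \
    local:///opt/flink/usrlib/my-app.jar
```

Because main() runs on the JobManager, no user code travels from the machine issuing this command; the client only triggers cluster creation.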
Resource providers
The JobManager has different implementations depending on where you deploy Flink. Choose the one that matches your infrastructure.

Standalone
Launch Flink directly as JVM processes. The simplest setup with no external dependencies. Suitable for local testing and on-premises clusters.
YARN
Deploy Flink on Apache Hadoop YARN. Flink dynamically allocates and releases TaskManager containers from YARN’s ResourceManager.
Kubernetes
Native Kubernetes integration. Flink directly talks to the Kubernetes API to spawn and release pods as needed.
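For illustration, a native Kubernetes session cluster can be started and targeted roughly as follows; the cluster ID and JAR are placeholders.

```sh
# Sketch: start a session cluster on native Kubernetes, then submit to it.
./bin/kubernetes-session.sh -Dkubernetes.cluster-id=my-session
./bin/flink run --target kubernetes-session \
    -Dkubernetes.cluster-id=my-session ./my-job.jar
```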
External components
The following components are optional but important for production deployments.

| Component | Purpose |
|---|---|
| High Availability Service | Enables JobManager failover. Supported backends: ZooKeeper and Kubernetes. |
| File storage | Used for checkpointing and savepoints. Flink supports HDFS, S3, GCS, Azure Blob Storage, and local filesystems. |
| Metrics storage | Flink components emit internal metrics. Supported reporters: Prometheus, Graphite, InfluxDB, and others. |
| Data sources and sinks | External systems like Apache Kafka, Amazon S3, Elasticsearch, and Apache Cassandra that your jobs read from or write to. |
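A sketch of how some of these components are wired up in conf/config.yaml, shown with flat keys (bucket names and the reporter name are placeholders; key names may differ slightly across Flink versions):

```yaml
# File storage for checkpoints and savepoints (placeholder bucket)
state.checkpoints.dir: s3://my-bucket/checkpoints
state.savepoints.dir: s3://my-bucket/savepoints

# Metrics reporting to Prometheus
metrics.reporter.prom.factory.class: org.apache.flink.metrics.prometheus.PrometheusReporterFactory
metrics.reporter.prom.port: 9249
```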
Repeatable resource cleanup
When a job reaches a globally terminal state (finished, cancelled, or failed), Flink cleans up all associated external resources. If cleanup fails, Flink retries according to a configurable retry strategy. If the maximum number of retries is exhausted, the job is left in a dirty state and its artifacts must be cleaned up manually. Restarting the same job (using the same job ID) will restart the cleanup process rather than re-execute the job.

There is a known issue where CompletedCheckpoints that failed to be deleted during normal checkpoint management are not covered by repeatable cleanup. These must be deleted manually. See FLINK-26606 for details.
Explore deployment options
Configuration
Configure Flink using conf/config.yaml, environment variables, and dynamic properties.
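A minimal conf/config.yaml might look like the sketch below (hostname and values are illustrative; flat dotted keys are shown, and the same options can be overridden per submission with -D dynamic properties):

```yaml
# Sketch: minimal cluster configuration (illustrative values).
jobmanager.rpc.address: jobmanager-host
taskmanager.numberOfTaskSlots: 4
parallelism.default: 2
```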
CLI
Use the flink command-line interface to submit, monitor, and manage jobs.
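Common CLI interactions, sketched below; the example JAR ships with the Flink distribution, while the job ID and bucket are placeholders you would substitute:

```sh
# Submit a job in detached mode
./bin/flink run --detached ./examples/streaming/TopSpeedWindowing.jar

# List running and scheduled jobs
./bin/flink list

# Trigger a savepoint, then cancel (placeholders for job ID and path)
./bin/flink savepoint <jobId> s3://my-bucket/savepoints
./bin/flink cancel <jobId>
```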
High availability
Set up JobManager HA with ZooKeeper or Kubernetes to eliminate single points of failure.
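As a sketch, Kubernetes-backed HA needs only a few configuration entries (the storage directory and cluster ID are placeholders; key names follow recent Flink versions):

```yaml
# Sketch: Kubernetes-based JobManager high availability.
high-availability.type: kubernetes
high-availability.storageDir: s3://my-bucket/flink-ha
kubernetes.cluster-id: my-cluster
```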
Memory configuration
Understand and tune Flink’s memory model for TaskManagers and JobManagers.
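At the coarsest level, memory is tuned by setting the total process size per component, as in this sketch (sizes are illustrative, not recommendations):

```yaml
# Sketch: coarse-grained memory sizing.
jobmanager.memory.process.size: 1600m
taskmanager.memory.process.size: 4g
# Finer-grained options exist for individual memory components, e.g.:
# taskmanager.memory.managed.fraction: 0.4
```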
Elastic scaling
Automatically adjust job parallelism at runtime using Reactive Mode and the Adaptive Scheduler.
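A configuration sketch for both options (Reactive Mode requires a standalone cluster in Application mode):

```yaml
# Sketch: enable the Adaptive Scheduler explicitly...
jobmanager.scheduler: adaptive
# ...or enable Reactive Mode, which implies it (standalone Application mode only):
# scheduler-mode: reactive
```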
Security
Secure Flink’s network communication with SSL/TLS and authenticate with Kerberos.
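A sketch of the relevant configuration surface; all paths and the principal are placeholders, and a real setup additionally needs keystore/truststore passwords:

```yaml
# Sketch: internal TLS between Flink components.
security.ssl.internal.enabled: true
security.ssl.internal.keystore: /path/to/keystore.jks
security.ssl.internal.truststore: /path/to/truststore.jks

# Sketch: Kerberos authentication (placeholder keytab and principal).
security.kerberos.login.keytab: /path/to/flink.keytab
security.kerberos.login.principal: flink-user
```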

