Job Manager in Flinkπ
The Job Manager in Apache Flink is the brain or control center of the entire Flink application. It is responsible for coordinating, scheduling, recovering, and managing the execution of your Flink job.
Hereβs a clear, structured explanation:
What is the Job Manager in Flink?π
The Job Manager is the master node in a Flink cluster that controls and manages the execution of Flink jobs.
A Flink cluster typically has:
- 1 Job Manager (or multiple in HA mode)
- Many Task Managers (workers)
What does the Job Manager do?π
1. Receives the job and builds an execution planπ
When you submit a Flink job:
- Job Manager parses your program
-
It creates:
-
Logical graph (DataStream or Table API DAG)
- Optimized execution plan (physical graph)
2. Schedules tasks to Task Managersπ
The Job Manager decides:
- How many parallel tasks to create
- Which tasks run where
- How to allocate slots
- How to balance load across the cluster
It is the scheduler for the whole cluster.
3. Coordinates checkpoints (fault tolerance)π
To provide exactly-once guarantees, the JM:
- Triggers checkpoints
- Coordinates with all Task Managers
- Ensures consistent snapshots across operators
- Stores checkpoint metadata
This is critical for correctness.
4. Handles failures and restarts jobsπ
If a Task Manager fails:
- JM detects the failure
- Cancels affected tasks
- Restarts tasks from last successful checkpoint
- Reassigns tasks to healthy Task Managers
It ensures the job keeps running even when machines die.
5. Manages job lifecycleπ
Job Manager controls transitions:
- Created
- Running
- Failing
- Restarting
- Finished
- Canceled
It keeps the jobβs state machine.
6. Provides the Web UIπ
The Flink dashboard (port 8081) is served by the Job Manager:
- Job DAG
- Throughput & backpressure
- Checkpoints
- Task metrics
- Logs
- Watermarks
- Operator graphs
Is there only one Job Manager?π
It depends:
Standalone modeπ
- One active Job Manager
- Optional standby Job Managers for HA
Kubernetes / YARNπ
- Multiple replicas for HA
- Only one is active at a time
- Others are standby (via leader election)
Analogyπ
Think of a Flink cluster like a large kitchen:
-
Job Manager = Head Chef
-
Decides who cooks what
- Designs the plan
- Checks progress
-
Recovers if something goes wrong
-
Task Managers = Line cooks
-
Actually do the work
- Handle ingredients (streams of data)
- Store intermediate states
- Report progress to the head chef
Short summaryπ
The Job Manager is the command center that schedules, monitors, restarts, and coordinates every part of the Flink job. Without it, the workers have no instructions.