Skip to content

Job Manager in FlinkπŸ”—

The Job Manager in Apache Flink is the brain or control center of the entire Flink application. It is responsible for coordinating, scheduling, recovering, and managing the execution of your Flink job.

Here’s a clear, structured explanation:


What is the Job Manager in Flink?πŸ”—

The Job Manager is the master node in a Flink cluster that controls and manages the execution of Flink jobs.

A Flink cluster typically has:

  • 1 Job Manager (or multiple in HA mode)
  • Many Task Managers (workers)

What does the Job Manager do?πŸ”—

1. Receives the job and builds an execution planπŸ”—

When you submit a Flink job:

  • Job Manager parses your program
  • It creates:

  • Logical graph (DataStream or Table API DAG)

  • Optimized execution plan (physical graph)

2. Schedules tasks to Task ManagersπŸ”—

The Job Manager decides:

  • How many parallel tasks to create
  • Which tasks run where
  • How to allocate slots
  • How to balance load across the cluster

It is the scheduler for the whole cluster.


3. Coordinates checkpoints (fault tolerance)πŸ”—

To provide exactly-once guarantees, the JM:

  • Triggers checkpoints
  • Coordinates with all Task Managers
  • Ensures consistent snapshots across operators
  • Stores checkpoint metadata

This is critical for correctness.


4. Handles failures and restarts jobsπŸ”—

If a Task Manager fails:

  • JM detects the failure
  • Cancels affected tasks
  • Restarts tasks from last successful checkpoint
  • Reassigns tasks to healthy Task Managers

It ensures the job keeps running even when machines die.


5. Manages job lifecycleπŸ”—

Job Manager controls transitions:

  • Created
  • Running
  • Failing
  • Restarting
  • Finished
  • Canceled

It keeps the job’s state machine.


6. Provides the Web UIπŸ”—

The Flink dashboard (port 8081) is served by the Job Manager:

  • Job DAG
  • Throughput & backpressure
  • Checkpoints
  • Task metrics
  • Logs
  • Watermarks
  • Operator graphs

Is there only one Job Manager?πŸ”—

It depends:

Standalone modeπŸ”—

  • One active Job Manager
  • Optional standby Job Managers for HA

Kubernetes / YARNπŸ”—

  • Multiple replicas for HA
  • Only one is active at a time
  • Others are standby (via leader election)

AnalogyπŸ”—

Think of a Flink cluster like a large kitchen:

  • Job Manager = Head Chef

  • Decides who cooks what

  • Designs the plan
  • Checks progress
  • Recovers if something goes wrong

  • Task Managers = Line cooks

  • Actually do the work

  • Handle ingredients (streams of data)
  • Store intermediate states
  • Report progress to the head chef

Short summaryπŸ”—

The Job Manager is the command center that schedules, monitors, restarts, and coordinates every part of the Flink job. Without it, the workers have no instructions.