
Airflow Job & Task Lifecycle (Deep, Internals-Oriented)


This explanation follows what actually happens in the metadata database and scheduler, not just the UI view.


1) What is a "Job" in Airflow?

In Airflow terminology:

  • Job (informal) → a DAG Run
  • Task execution → a Task Instance

So when people say "Airflow job", they usually mean:

One execution of a DAG for a specific logical date
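
A minimal sketch of how those two concepts map onto the metadata models, assuming Airflow 2.2+ (where task_instance rows carry run_id); the helper name describe_job is illustrative, not an Airflow API:

from airflow.models import DagRun, TaskInstance
from airflow.utils.session import provide_session

@provide_session
def describe_job(dag_id, run_id, session=None):
    # The "job": a single dag_run row for one logical date.
    run = session.query(DagRun).filter_by(dag_id=dag_id, run_id=run_id).one()
    # The task executions: one task_instance row per task in that run.
    tis = session.query(TaskInstance).filter_by(dag_id=dag_id, run_id=run_id).all()
    return run.state, {ti.task_id: ti.state for ti in tis}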


2) DAG Run Lifecycle (End-to-End)

A DAG Run represents one logical execution window.


Step 1: Scheduler decides a DAG run is needed

Scheduler checks:

  • start_date
  • schedule_interval
  • catchup
  • is_paused

Metadata DB impact

  • Inserts a row into dag_run

Initial state:

dag_run.state = RUNNING

Important:

  • This is logical time, not wall-clock time
  • A DAG run for 2025-01-01 may run on 2025-01-02
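
For concreteness, a hedged DAG sketch (the DAG id and callables are illustrative) showing the fields the scheduler reads here; with this definition, the run for logical date 2025-01-01 only starts after that interval closes, i.e. on 2025-01-02:

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG(
    dag_id="example_daily_job",
    start_date=datetime(2025, 1, 1),
    schedule_interval="@daily",    # one DAG run per daily interval
    catchup=True,                  # create runs for missed past intervals
    is_paused_upon_creation=True,  # paused DAGs are never scheduled
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=lambda: "rows")
    load = PythonOperator(task_id="load", python_callable=lambda: None)
    extract >> load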

Step 2: Scheduler creates Task Instances

For each task in the DAG:

  • Scheduler creates a row in task_instance

Initial state:

task_instance.state = NONE

Nothing has executed yet.


3) Task Instance Lifecycle (Very Detailed)

This is the most important part to understand.


TaskInstance States (Core)

NONE → SCHEDULED → QUEUED → RUNNING → SUCCESS
                               ↓ on exception
               retries remain:    UP_FOR_RETRY → SCHEDULED (again)
               retries exhausted: FAILED (terminal)

Each transition is persisted to the metadata DB.
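
These states are plain values in the task_instance.state column; a hedged sketch of how Airflow 2.x exposes them (NONE is literally a NULL column):

from airflow.utils.state import State, TaskInstanceState

# NONE means "no state yet": the column is NULL and State.NONE is Python None.
assert State.NONE is None

# The other states are short strings stored on the row.
for s in (
    TaskInstanceState.SCHEDULED,
    TaskInstanceState.QUEUED,
    TaskInstanceState.RUNNING,
    TaskInstanceState.SUCCESS,
    TaskInstanceState.UP_FOR_RETRY,
    TaskInstanceState.FAILED,
):
    print(s.value)  # "scheduled", "queued", "running", "success", ...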


Step 1: Dependency Resolution

Scheduler evaluates whether a task can run:

Checks include:

  • Upstream task states
  • Trigger rules
  • Pools
  • Concurrency limits
  • depends_on_past
  • Sensors

If all conditions pass:

Metadata DB

task_instance.state = SCHEDULED
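
Most of these checks are driven by arguments declared on the task itself; a hedged sketch (the DAG id, pool name, and values are illustrative):

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.utils.trigger_rule import TriggerRule

with DAG(
    dag_id="example_daily_job",
    start_date=datetime(2025, 1, 1),
    schedule_interval="@daily",
) as dag:
    load = PythonOperator(
        task_id="load",
        python_callable=lambda: None,
        trigger_rule=TriggerRule.ALL_SUCCESS,  # upstream-state rule (this is the default)
        depends_on_past=True,                  # also require this task's success in the previous run
        pool="warehouse_pool",                 # shared slot pool (must already exist)
        max_active_tis_per_dag=1,              # per-task concurrency cap
    )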

Step 2: Queuing

Scheduler hands off the task to the executor.

Metadata DB

task_instance.state = QUEUED

At this point:

  • Task is eligible for execution
  • Not yet running
  • Waiting for a worker slot

Step 3: Execution Begins (Worker Side)

Executor assigns the task to a worker.

Worker:

  • Forks a process
  • Loads DAG code
  • Instantiates the operator
  • Calls operator.execute()

Metadata DB

task_instance.state = RUNNING
task_instance.start_date = now()

This is when:

  • Logs start streaming
  • The attempt counter (try_number) is incremented for this attempt
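
What the worker ultimately calls is the operator's execute() method; a hedged sketch of a minimal custom operator to make that concrete (the class and its behaviour are illustrative):

from airflow.models.baseoperator import BaseOperator

class TouchFileOperator(BaseOperator):
    """Illustrative operator: execute() runs inside the worker process,
    after task_instance.state has already been set to RUNNING."""

    def __init__(self, path: str, **kwargs):
        super().__init__(**kwargs)
        self.path = path

    def execute(self, context):
        # Returning normally leads to SUCCESS (Step 4a);
        # raising an exception leads to the failure path (Step 4b).
        with open(self.path, "a"):
            pass
        return self.path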

Step 4a: Successful Completion

Operator finishes without exception.

Metadata DB

task_instance.state = SUCCESS
task_instance.end_date = now()

Scheduler now evaluates downstream tasks.


Step 4b: Failure Path

If the operator raises an exception, failure handling consults the retry policy:

Metadata DB

task_instance.state = FAILED
task_instance.end_date = now()

FAILED is terminal for the task instance and is only written when no retries remain. If retries are still available, the task moves to UP_FOR_RETRY instead (Step 5).


Step 5: Retry Logic

If retries remain for the task:

Metadata DB

task_instance.state = UP_FOR_RETRY

After retry_delay has elapsed:

  • Task goes back to SCHEDULED
  • The entire lifecycle repeats, with try_number incremented for the new attempt

Retries are state transitions on the same task_instance row, not new rows.
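
Retry behaviour is declared on the task; a hedged sketch (the DAG id and values are illustrative):

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG(
    dag_id="example_daily_job",
    start_date=datetime(2025, 1, 1),
    schedule_interval="@daily",
) as dag:
    flaky_extract = PythonOperator(
        task_id="flaky_extract",
        python_callable=lambda: None,
        retries=3,                             # up to 3 extra attempts on the same row
        retry_delay=timedelta(minutes=5),      # wait before UP_FOR_RETRY becomes SCHEDULED again
        retry_exponential_backoff=True,        # optionally grow the delay per attempt
        max_retry_delay=timedelta(minutes=30),
    )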


4) DAG Run Completion Logic

Scheduler continuously checks:

  • Are all task instances in terminal states?

Terminal states:

SUCCESS, FAILED, SKIPPED, UPSTREAM_FAILED

If all tasks succeed

dag_run.state = SUCCESS

If any leaf task fails (directly or via UPSTREAM_FAILED)

dag_run.state = FAILED
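
A hedged sketch of the same check from the metadata side (the helper name is illustrative; the scheduler's real finalization logic also weighs leaf-task states):

from airflow.models import TaskInstance
from airflow.utils.session import provide_session

TERMINAL = {"success", "failed", "skipped", "upstream_failed"}

@provide_session
def run_can_be_finalized(dag_id, run_id, session=None):
    # The run stays RUNNING until every task_instance in it is terminal.
    states = [
        ti.state
        for ti in session.query(TaskInstance).filter_by(dag_id=dag_id, run_id=run_id)
    ]
    return all(s in TERMINAL for s in states)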

5) Clearing Tasks (Why It Works)

When you clear a task in UI:

  • task_instance.state is reset to NONE
  • Same DAG run
  • Same logical date

Airflow replays the lifecycle without creating a new DAG run.

This is why partial reprocessing is cheap and safe.
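
A hedged sketch of what clearing amounts to at the metadata level (illustrative only; in practice you clear via the UI, CLI, or REST API rather than raw ORM updates):

from airflow.models import TaskInstance
from airflow.utils.session import provide_session

@provide_session
def clear_like_the_ui(dag_id, run_id, task_id, session=None):
    # Clearing resets the existing task_instance row in the same DAG run;
    # it does not create a new dag_run or a new task_instance.
    ti = (
        session.query(TaskInstance)
        .filter_by(dag_id=dag_id, run_id=run_id, task_id=task_id)
        .one()
    )
    ti.state = None  # back to NONE; the scheduler re-runs the whole lifecycle
    session.merge(ti)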


6) Backfills (Special Case)

Backfill:

  • Inserts multiple rows into dag_run
  • Each DAG run has its own task instances
  • Scheduler treats them as independent runs

Key insight:

Backfills are just normal DAG runs created retroactively


7) Failure Scenarios (Real Internals)

Worker Crash

  • Task instance is left in RUNNING, but its heartbeat stops
  • The scheduler's zombie detection eventually marks it failed
  • Retry logic then kicks in as usual

Scheduler Restart

  • Scheduler reloads state from metadata DB
  • No task state is lost

Metadata DB Slow

  • Tasks stay QUEUED
  • DAG appears "stuck"
  • No scheduling progress

8) Why Task Granularity Matters

One large task:

  • Large blast radius
  • Expensive retries

Multiple small tasks:

  • Fine-grained recovery
  • Faster reruns
  • Better observability

This is an advantage of the metadata-DB state model, not just a coding-style preference.
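
A hedged sketch of the same pipeline modeled both ways (function bodies are placeholders, and both variants are shown in one DAG purely for comparison); with the second form, a failure in load can be cleared and rerun without repeating extract or transform:

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def run_extract(): ...
def run_transform(): ...
def run_load(): ...

def run_everything():
    run_extract(); run_transform(); run_load()

with DAG(
    dag_id="example_daily_job",
    start_date=datetime(2025, 1, 1),
    schedule_interval="@daily",
) as dag:
    # One large task: a failure anywhere reruns all three phases together.
    etl_all = PythonOperator(task_id="etl_all", python_callable=run_everything)

    # Three small tasks: each phase gets its own task_instance row, logs, and retry budget.
    extract = PythonOperator(task_id="extract", python_callable=run_extract)
    transform = PythonOperator(task_id="transform", python_callable=run_transform)
    load = PythonOperator(task_id="load", python_callable=run_load)
    extract >> transform >> load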


9) Mental Model (Critical)

Think of lifecycle like this:

DAG Run  = orchestration instance (time window)
Task     = logical step
TaskInstance = execution record
State changes = rows updated in metadata DB

Airflow is state-machine driven, not event-driven.


One-Line Summary

A DAG run is created first, task instances are materialized next, and every scheduling, retry, failure, and success is driven entirely by state transitions stored in the metadata database.