Airflow Job & Task Lifecycle (Deep, Internals-Oriented)
This explanation of the Apache Airflow job and task lifecycle follows what actually happens in the metadata database and scheduler, not just the UI view.
1) What is a "Job" in Airflow?
In Airflow terminology:
- Job (informal) → a DAG Run
- Task execution → a Task Instance
So when people say "Airflow job", they usually mean:
One execution of a DAG for a specific logical date
2) DAG Run Lifecycle (End-to-End)
A DAG Run represents one logical execution window.
Step 1: Scheduler decides a DAG run is needed
Scheduler checks:
- `start_date`
- `schedule_interval`
- `catchup`
- `is_paused`
Metadata DB impact
- Inserts a row into `dag_run`
Initial state: `QUEUED` (in Airflow 2.2+; the scheduler promotes the run to `RUNNING` once it starts scheduling its tasks)
Important:
- This is logical time, not wall-clock time
- A DAG run for `2025-01-01` may run on `2025-01-02`
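As a concrete reference, here is a minimal sketch (Airflow 2.3+; the DAG id and task are illustrative) of the attributes the scheduler inspects:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.empty import EmptyOperator

# Minimal sketch: the scheduler reads these attributes to decide
# whether a new dag_run row is due for this DAG.
with DAG(
    dag_id="my_dag",                      # illustrative name
    start_date=datetime(2025, 1, 1),      # first logical date
    schedule_interval=timedelta(days=1),  # one run per day
    catchup=True,                         # create runs for missed windows
) as dag:
    EmptyOperator(task_id="noop")
```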
Step 2: Scheduler creates Task Instances
For each task in the DAG:
- Scheduler creates a row in `task_instance`
Initial state: `NONE` (the state column is still NULL)
Nothing has executed yet.
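A hedged sketch of how you could observe those freshly materialized rows (assumes an Airflow 2.2+ environment with access to the metadata DB; the `run_id` value is illustrative):

```python
from airflow.models import TaskInstance
from airflow.utils.session import create_session

# Sketch: list the task_instance rows for one DAG run. Their state is
# still None because nothing has executed yet.
with create_session() as session:
    tis = (
        session.query(TaskInstance)
        .filter(
            TaskInstance.dag_id == "my_dag",
            TaskInstance.run_id == "scheduled__2025-01-01T00:00:00+00:00",
        )
        .all()
    )
    for ti in tis:
        print(ti.task_id, ti.state)
```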
3) Task Instance Lifecycle (Very Detailed)
This is the most important part to understand.
TaskInstance States (Core)
The core progression is:
`NONE` → `SCHEDULED` → `QUEUED` → `RUNNING` → `SUCCESS` | `FAILED` | `UP_FOR_RETRY`
(plus `SKIPPED`, `UPSTREAM_FAILED`, and `UP_FOR_RESCHEDULE` for special cases)
Each transition is persisted to the metadata DB.
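The canonical names can be enumerated programmatically; a small sketch (the enum lives in `airflow.utils.state` in recent Airflow 2.x):

```python
from airflow.utils.state import TaskInstanceState

# Print the state names Airflow persists in task_instance.state,
# e.g. 'scheduled', 'queued', 'running', 'success', 'failed', 'up_for_retry'.
for state in TaskInstanceState:
    print(state.value)
```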
Step 1: Dependency Resolution
Scheduler evaluates whether a task can run:
Checks include:
- Upstream task states
- Trigger rules
- Pools
- Concurrency limits
- `depends_on_past`
- Sensors
If all conditions pass, the scheduler advances the task instance.
Metadata DB
- `task_instance.state` → `SCHEDULED`
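For reference, a sketch of the task-level knobs consulted during that evaluation (the pool `etl_pool` is illustrative and must already exist):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.utils.trigger_rule import TriggerRule

with DAG(dag_id="my_dag", start_date=datetime(2025, 1, 1), schedule_interval="@daily") as dag:
    check = PythonOperator(
        task_id="check",
        python_callable=lambda: None,
        trigger_rule=TriggerRule.ALL_SUCCESS,  # default: every upstream succeeded
        depends_on_past=True,                  # previous run's instance must have succeeded
        pool="etl_pool",                       # needs a free slot in this (illustrative) pool
    )
```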
Step 2: Queuing
Scheduler hands off the task to the executor.
Metadata DB
- `task_instance.state` → `QUEUED`
- `queued_dttm` is recorded
At this point:
- Task is eligible for execution
- Not yet running
- Waiting for a worker slot
Step 3: Execution Begins (Worker Side)
Executor assigns the task to a worker.
Worker:
- Forks a process
- Loads DAG code
- Instantiates the operator
- Calls `operator.execute()`
Metadata DB
- `task_instance.state` → `RUNNING`
- `start_date`, `hostname`, and `pid` are recorded
This is when:
- Logs start streaming
- The `try_number` counter is incremented
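To make "instantiates the operator, calls `execute()`" concrete, a minimal custom operator sketch (the class is illustrative, not an Airflow built-in):

```python
from airflow.models.baseoperator import BaseOperator

class HelloOperator(BaseOperator):
    """Illustrative operator: the worker process calls execute()."""

    def execute(self, context):
        # Runs inside the forked worker process; anything logged here
        # is what streams to the task log.
        self.log.info("running for logical date %s", context["ds"])
        return "done"  # the return value is pushed to XCom by default
```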
Step 4a: Successful Completion
Operator finishes without exception.
Metadata DB
- `task_instance.state` → `SUCCESS`
- `end_date` and `duration` are recorded
Scheduler now evaluates downstream tasks.
Step 4b: Failure Path
If an exception is raised:
Metadata DB
- `task_instance.state` → `FAILED` (or `UP_FOR_RETRY` if retries remain, see Step 5)
- `end_date` is recorded; the traceback goes to the task log
Scheduler decides next step based on retry policy.
Step 5: Retry Logic
If retries are configured and attempts remain:
Metadata DB
- `task_instance.state` → `UP_FOR_RETRY`
After `retry_delay`:
- Task goes back to `SCHEDULED`
- Entire lifecycle repeats
Retries are state transitions, not new rows.
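A sketch of a retry policy on a task (values are illustrative; the command always fails so the retry path is exercised):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(dag_id="my_dag", start_date=datetime(2025, 1, 1), schedule_interval="@daily") as dag:
    flaky = BashOperator(
        task_id="flaky",
        bash_command="exit 1",             # always fails, to exercise retries
        retries=3,                         # up to 3 extra attempts
        retry_delay=timedelta(minutes=5),  # wait between UP_FOR_RETRY and SCHEDULED
        retry_exponential_backoff=True,    # grow the delay on each attempt
    )
```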
4) DAG Run Completion Logic
Scheduler continuously checks:
- Are all task instances in terminal states?
Terminal states: `SUCCESS`, `FAILED`, `SKIPPED`, `UPSTREAM_FAILED`
If all tasks succeed
- `dag_run.state` → `SUCCESS`
If any critical task fails
- `dag_run.state` → `FAILED`
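A toy illustration of that decision (not Airflow source; the real scheduler evaluates leaf tasks, but the shape is the same):

```python
# Toy model of DAG-run finalization against task_instance states.
TERMINAL = {"success", "failed", "skipped", "upstream_failed"}

def dag_run_state(task_states: dict[str, str]) -> str:
    states = set(task_states.values())
    if not states <= TERMINAL:
        return "running"                       # something is still in flight
    if {"failed", "upstream_failed"} & states:
        return "failed"                        # a critical failure fails the run
    return "success"

print(dag_run_state({"extract": "success", "load": "failed"}))  # -> failed
```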
5) Clearing Tasks (Why It Works)
When you clear a task in the UI:
- `task_instance.state` is reset to `NONE`
- Same DAG run
- Same logical date
Airflow replays the lifecycle without creating a new DAG run.
This is why partial reprocessing is cheap and safe.
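The same reset is available from the CLI, e.g. `airflow tasks clear <dag_id> --task-regex <task_id> --start-date <date> --end-date <date>` (Airflow 2.x; the placeholders are yours to fill in).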
6) Backfills (Special Case)
Backfill:
- Inserts multiple rows into `dag_run`
- Each DAG run has its own task instances
- Scheduler treats them as independent runs
Key insight:
Backfills are just normal DAG runs created retroactively
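For example, `airflow dags backfill --start-date 2025-01-01 --end-date 2025-01-03 <dag_id>` (Airflow 2.x) retroactively creates the daily runs in that window, each with its own `dag_run` row.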
7) Failure Scenarios (Real Internals)
Worker Crash
- Task remains `RUNNING` with a stale heartbeat
- Scheduler's zombie detection eventually marks it `FAILED`
- Retry logic kicks in
Scheduler Restart
- Scheduler reloads state from metadata DB
- No task state is lost
Metadata DB Slow
- Tasks stay `QUEUED`
- DAG appears "stuck"
- No scheduling progress
8) Why Task Granularity Matters
One large task:
- Large blast radius
- Expensive retries
Multiple small tasks:
- Fine-grained recovery
- Faster reruns
- Better observability
This is a metadata DB design advantage, not just a coding style.
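As a sketch of the small-task style using the TaskFlow API (DAG id and function names are illustrative):

```python
from datetime import datetime

from airflow.decorators import dag, task

@dag(dag_id="etl_small_tasks", start_date=datetime(2025, 1, 1), schedule_interval="@daily", catchup=False)
def etl():
    @task
    def extract() -> list[int]:
        return [1, 2, 3]               # each step is its own task_instance row

    @task
    def transform(rows: list[int]) -> list[int]:
        return [r * 10 for r in rows]  # fails and retries independently of extract

    @task
    def load(rows: list[int]) -> None:
        print(f"loading {len(rows)} rows")

    load(transform(extract()))

etl()
```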
9) Mental Model (Critical)
Think of lifecycle like this:
DAG Run = orchestration instance (time window)
Task = logical step
TaskInstance = execution record
State changes = rows updated in metadata DB
Airflow is state-machine driven, not event-driven.
One-Line Summary
A DAG run is created first, task instances are materialized next, and every scheduling, retry, failure, and success is driven entirely by state transitions stored in the metadata database.