
Airflow Architecture Deep DiveπŸ”—

Metadata Database (Very Important Component)πŸ”—


In Apache Airflow, the Metadata Database is the single source of truth. If you truly understand this database, you understand how Airflow actually works.


1) What is the Metadata Database?πŸ”—

The metadata database is a relational database (PostgreSQL or MySQL in production) that stores:

  • DAG definitions (state, schedule, pause status)
  • DAG runs (logical dates, run status)
  • Task instances (success, failed, retries)
  • Scheduling decisions
  • XComs (inter-task communication)
  • Connections, Variables, Pools
  • SLA misses and log metadata

Airflow does not store state in memory. Everything important is persisted here.
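
Because everything lives in these tables, you can inspect them directly through Airflow's own ORM models. A minimal sketch, assuming an initialized Airflow 2.x environment with access to the metadata DB:

```python
# Count rows in a few of the core tables described in this section.
from airflow.models import Connection, DagModel, DagRun, TaskInstance, Variable, XCom
from airflow.utils.session import create_session

with create_session() as session:
    print("dags:          ", session.query(DagModel).count())
    print("dag runs:      ", session.query(DagRun).count())
    print("task instances:", session.query(TaskInstance).count())
    print("xcoms:         ", session.query(XCom).count())
    print("connections:   ", session.query(Connection).count())
    print("variables:     ", session.query(Variable).count())
```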


2) Why Metadata DB is Central to AirflowπŸ”—

Airflow is stateless at the process level:

  • Scheduler can restart
  • Webserver can restart
  • Workers can die

But the metadata DB preserves reality.

If the DB is lost or corrupted:

  • Airflow forgets what ran
  • Backfills break
  • Task retries become incorrect

3) Core Tables (Conceptual Model)πŸ”—

Below are the most important tables you must understand.


3.1 dag TableπŸ”—

Stores DAG-level metadata.

Key columns:

  • dag_id: DAG name
  • is_paused: whether the DAG is paused
  • is_active: whether the scheduler still finds the DAG file on disk
  • last_parsed_time: when the DAG file was last parsed
  • owners: DAG owners

Purpose:

  • Controls whether a DAG is eligible for scheduling (queried in the sketch below)
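
A hedged sketch of how you might list which DAGs the scheduler currently considers schedulable, assuming Airflow 2.x (DagModel is the ORM class backing the dag table):

```python
# List active DAGs and whether they are paused.
from airflow.models import DagModel
from airflow.utils.session import create_session

with create_session() as session:
    rows = (
        session.query(DagModel.dag_id, DagModel.is_paused, DagModel.last_parsed_time)
        .filter(DagModel.is_active.is_(True))   # still present on disk / in the DagBag
        .order_by(DagModel.dag_id)
        .all()
    )
    for dag_id, is_paused, last_parsed in rows:
        status = "paused" if is_paused else "schedulable"
        print(f"{dag_id:<40} {status:<12} last parsed {last_parsed}")
```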

3.2 dag_run TableπŸ”—

Each logical execution of a DAG.

Key columns:

  • dag_id: DAG name
  • run_id: unique run identifier
  • execution_date: logical date of the run
  • state: running / success / failed
  • start_date: actual (wall-clock) start time
  • end_date: completion time

Key insight:

  • Airflow schedules by logical time, not wall-clock time; the sketch below prints both side by side
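
A minimal sketch of that distinction, assuming Airflow 2.x and a hypothetical DAG called daily_sales: execution_date marks the logical data interval, while start_date records when the run actually began.

```python
# Compare logical time with wall-clock start time for recent runs.
from airflow.models import DagRun
from airflow.utils.session import create_session

with create_session() as session:
    runs = (
        session.query(DagRun)
        .filter(DagRun.dag_id == "daily_sales")
        .order_by(DagRun.execution_date.desc())
        .limit(5)
        .all()
    )
    for run in runs:
        print(run.run_id, run.state,
              "| logical:", run.execution_date,
              "| actually started:", run.start_date)
```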

3.3 task_instance Table (MOST IMPORTANT)πŸ”—

This table represents execution reality.

Key columns:

  • dag_id: DAG name
  • task_id: task name
  • execution_date: logical date
  • state: success / failed / up_for_retry / queued / running
  • try_number: attempt count
  • start_date: when execution started
  • end_date: when execution ended
  • hostname: worker host that ran the attempt

Each retry updates this same row and increments try_number; it does not create a new row.

If you want to debug:

  • Why a task retried
  • Why a dependency failed
  • Why a DAG is stuck

You inspect this table; a query sketch follows.
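
A hedged sketch of such an inspection, assuming Airflow 2.2+ (where task instances are keyed by run_id) and the hypothetical names daily_sales / load_to_warehouse:

```python
# Show recent attempts of one task: state, try number, and which worker ran it.
from airflow.models import TaskInstance
from airflow.utils.session import create_session

with create_session() as session:
    tis = (
        session.query(TaskInstance)
        .filter(
            TaskInstance.dag_id == "daily_sales",
            TaskInstance.task_id == "load_to_warehouse",
        )
        .order_by(TaskInstance.start_date.desc())
        .limit(10)
        .all()
    )
    for ti in tis:
        print(ti.run_id, ti.state, f"try={ti.try_number}",
              ti.hostname, ti.start_date, ti.end_date)
```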


3.4 task_failπŸ”—

Stores one row per failed task attempt:

  • Which task failed, and for which run
  • Failure start and end timestamps, plus duration

Used for:

  • Alerting
  • SLA failures
  • Debugging flaky tasks

3.5 xcomπŸ”—

Used for inter-task communication.

Key columns:

  • dag_id: DAG name
  • task_id: task name
  • key: XCom key
  • value: serialized payload

Key rules:

  • XComs are not for large data
  • They are stored in the metadata DB
  • Large XComs cause DB bloat, so pass references instead of datasets (see the sketch below)
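
A minimal sketch of idiomatic XCom usage with the TaskFlow API, assuming Airflow 2.4+ (DAG and path names are hypothetical). The return value is serialized into the xcom table, so the task passes a small reference rather than the data itself:

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def xcom_demo():
    @task
    def extract() -> str:
        # Push a small pointer to the data, not the data itself.
        return "s3://my-bucket/raw/2024-01-01.parquet"

    @task
    def load(path: str) -> None:
        # Pulled from the xcom table at runtime.
        print(f"loading {path}")

    load(extract())


xcom_demo()
```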

3.6 slot_poolπŸ”—

Controls concurrency limits.

Key columns:

  • pool: pool name
  • slots: maximum number of parallel task slots

(Occupied slots are not stored here; they are computed from running and queued task instances.)

Used to:

  • Protect downstream systems from too many parallel tasks
  • Control blast radius (a pool-limited task is sketched below)
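
A minimal sketch of using a pool to protect a rate-limited API, assuming Airflow 2.4+ and a pool named api_pool created beforehand (Admin > Pools in the UI, or the airflow pools set CLI command); all names are hypothetical:

```python
from datetime import datetime

from airflow.decorators import dag
from airflow.operators.bash import BashOperator


@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def pool_demo():
    # Ten tasks are defined, but only as many as the pool has slots run at once.
    for i in range(10):
        BashOperator(
            task_id=f"call_api_{i}",
            bash_command="echo calling the API",
            pool="api_pool",   # all ten tasks compete for this pool's slots
            pool_slots=1,      # each occupies a single slot while running
        )


pool_demo()
```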

3.7 connection and variableπŸ”—

Runtime configuration storage.

  • connection: credentials, endpoints
  • variable: flags, parameters

Connection passwords and Variable values are encrypted at rest when a Fernet key is configured.
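
A minimal sketch of reading both at runtime, assuming Airflow 2.x (reporting_db and feature_flag are hypothetical names):

```python
from airflow.hooks.base import BaseHook
from airflow.models import Variable

# Variables: small flags and parameters, fetched by key.
flag = Variable.get("feature_flag", default_var="off")

# Connections: credentials and endpoints, resolved by conn_id.
conn = BaseHook.get_connection("reporting_db")
print(flag, conn.host, conn.port, conn.login)
# Secrets are decrypted transparently when a Fernet key is configured.
```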


4) How Scheduler Uses Metadata DB (Step-by-Step)πŸ”—

This is critical.

Step 1: DAG ParsingπŸ”—

  • Scheduler parses DAG files
  • Writes DAG metadata to dag

Step 2: Create DAG RunsπŸ”—

  • Scheduler checks:
    • schedule interval
    • start_date
    • catchup
  • Inserts one dag_run row per due logical interval (a minimal DAG showing these inputs follows)
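
A minimal sketch of the three inputs, assuming Airflow 2.4+ (DAG name is hypothetical). With catchup=False, the scheduler creates only the latest missing dag_run instead of backfilling every interval since start_date:

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(
    dag_id="daily_sales",
    schedule="@daily",                # one dag_run row per logical day
    start_date=datetime(2024, 1, 1),  # earliest logical date eligible for a run
    catchup=False,                    # skip backfilling missed intervals
)
def daily_sales():
    @task
    def build_report():
        print("building report")

    build_report()


daily_sales()
```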

Step 3: Create Task InstancesπŸ”—

  • For each DAG run, the scheduler creates rows in task_instance
  • Initial state = none

Step 4: Dependency ResolutionπŸ”—

Scheduler queries:

  • task_instance.state
  • Upstream task states
  • Pools, concurrency limits
  • Sensors

Only then does it queue tasks.
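
A hedged sketch of the same state picture seen from the DB side, assuming Airflow 2.2+ and the hypothetical daily_sales DAG (the run_id value is illustrative):

```python
# State of every task instance in one DAG run, roughly what the scheduler evaluates.
from airflow.models import TaskInstance
from airflow.utils.session import create_session

with create_session() as session:
    rows = (
        session.query(TaskInstance.task_id, TaskInstance.state, TaskInstance.pool)
        .filter(
            TaskInstance.dag_id == "daily_sales",
            TaskInstance.run_id == "scheduled__2024-01-01T00:00:00+00:00",
        )
        .all()
    )
    for task_id, state, pool in rows:
        print(f"{task_id:<30} {str(state):<15} pool={pool}")
```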


Step 5: Executor & WorkersπŸ”—

  • Executor pulls queued tasks
  • Workers update:
    • start_date
    • state
    • end_date

Everything is persisted back to the metadata DB.


5) Why Metadata DB Becomes a BottleneckπŸ”—

Common enterprise issues:

High Write VolumeπŸ”—

  • Millions of task instances/day
  • Frequent state updates

Slow QueriesπŸ”—

  • Scheduler constantly queries task_instance
  • Missing indexes = scheduler lag

XCom AbuseπŸ”—

  • Large payloads stored
  • DB grows rapidly

6) Production Best PracticesπŸ”—

Database ChoiceπŸ”—

  • PostgreSQL recommended
  • Avoid SQLite except local dev

Performance TuningπŸ”—

  • Enable proper indexing
  • Archive old task instances
  • Clean XComs periodically (a cleanup sketch follows)
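
Airflow 2.3+ ships an airflow db clean command for purging old metadata rows. Where that is not available, a hedged sketch of a manual XCom cleanup, assuming Airflow 2.x ORM models (the retention window is illustrative):

```python
# Delete XCom rows older than a retention window to keep the metadata DB lean.
from datetime import datetime, timedelta, timezone

from airflow.models import XCom
from airflow.utils.session import create_session

RETENTION = timedelta(days=30)
cutoff = datetime.now(timezone.utc) - RETENTION

with create_session() as session:
    deleted = (
        session.query(XCom)
        .filter(XCom.timestamp < cutoff)
        .delete(synchronize_session=False)
    )
    print(f"deleted {deleted} xcom rows older than {cutoff}")
```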

High AvailabilityπŸ”—

  • Use managed Postgres
  • Enable backups
  • Monitor DB latency

7) Failure Scenarios (Real World)πŸ”—

DB DownπŸ”—

  • Scheduler stops scheduling
  • Running tasks may finish
  • No state updates recorded

DB LaggingπŸ”—

  • DAGs appear β€œstuck”
  • Tasks remain queued
  • UI becomes inconsistent

8) Mental Model (Very Important)πŸ”—

Think of Airflow as:

Scheduler = Decision Maker
Executor  = Task Dispatcher
Workers   = Task Runners
Metadata DB = Brain & Memory

All components read from and write to the metadata database.


One-Line SummaryπŸ”—

The metadata database is Airflow’s brain: every DAG run, task state, retry, dependency, and scheduling decision lives there.