Airflow Architecture Deep Dive
Metadata Database (Very Important Component)



In Apache Airflow, the Metadata Database is the single source of truth. If you truly understand this database, you understand how Airflow actually works.
1) What is the Metadata Database?
The metadata database is a relational database (PostgreSQL or MySQL in production) that stores:
- DAG definitions (state, schedule, pause status)
- DAG runs (logical dates, run status)
- Task instances (success, failed, retries)
- Scheduling decisions
- XComs (inter-task communication)
- Connections, Variables, Pools
- SLA misses and log metadata
Airflow does not store state in memory. Everything important is persisted here.
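Where that database lives is just configuration. A minimal sketch of pointing Airflow at PostgreSQL via an environment-variable override (host and credentials are placeholders; the [database] section name assumes Airflow 2.3+, where this option moved out of [core]):

```python
import os

# Point Airflow at a PostgreSQL metadata DB (placeholder credentials/host).
# This env var overrides sql_alchemy_conn in the [database] section of airflow.cfg.
os.environ["AIRFLOW__DATABASE__SQL_ALCHEMY_CONN"] = (
    "postgresql+psycopg2://airflow:airflow@db-host:5432/airflow"
)
```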
2) Why the Metadata DB is Central to Airflow
Airflow is stateless at the process level:
- Scheduler can restart
- Webserver can restart
- Workers can die
But the metadata DB preserves reality.
If the DB is lost or corrupted:
- Airflow forgets what ran
- Backfills break
- Task retries become incorrect
3) Core Tables (Conceptual Model)
Below are the most important tables you must understand.
3.1 dag Table
Stores DAG-level metadata.
| Column | Meaning |
|---|---|
| dag_id | DAG name |
| is_paused | Whether the DAG is paused |
| is_active | Whether the DAG is currently loaded by the scheduler |
| last_parsed_time | When the DAG file was last parsed |
| owners | DAG owners |
Purpose:
- Controls whether a DAG is eligible for scheduling
3.2 dag_run Table
Each logical execution of a DAG.
| Column | Meaning |
|---|---|
| dag_id | DAG name |
| run_id | Unique run identifier |
| execution_date | Logical date |
| state | running / success / failed |
| start_date | Actual start time |
| end_date | Completion time |
Key insight:
- Airflow schedules by logical time, not wall-clock time
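A minimal sketch of that distinction, assuming Airflow 2.4+ TaskFlow syntax (DAG id and dates are arbitrary). The logical_date injected into the task comes from the dag_run row, and it usually differs from the wall-clock time at which the worker actually runs:

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def logical_time_demo():
    @task
    def show_dates(logical_date=None):
        # logical_date is read from the dag_run row, not from the clock:
        # for the Jan 1 run it is 2024-01-01 even if it executes days later.
        print(f"logical date: {logical_date}")
        print(f"wall clock:   {datetime.utcnow()}")

    show_dates()


logical_time_demo()
```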
3.3 task_instance Table (MOST IMPORTANT)
This table represents execution reality.
| Column | Meaning |
|---|---|
| dag_id | DAG name |
| task_id | Task name |
| execution_date | Logical date |
| state | success / failed / up_for_retry |
| try_number | Attempt count |
| start_date | When execution started |
| end_date | When execution ended |
| hostname | Worker host |
Every retry increments try_number and updates this row in place.
If you want to debug:
- Why task retried
- Why dependency failed
- Why DAG is stuck
You inspect this table.
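A sketch of that kind of inspection using Airflow's own ORM session rather than raw SQL (the dag_id is a placeholder):

```python
from airflow.settings import Session
from airflow.models import TaskInstance

session = Session()
# Recent attempts for one DAG: state, try_number, and worker host together
# tell you whether a task is retrying, stuck, or failing on a specific node.
rows = (
    session.query(TaskInstance)
    .filter(TaskInstance.dag_id == "my_dag")  # placeholder dag_id
    .order_by(TaskInstance.start_date.desc())
    .limit(20)
)
for ti in rows:
    print(ti.task_id, ti.state, ti.try_number, ti.hostname)
session.close()
```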
3.4 task_fail
Stores failure metadata.
- Exception info
- Task failure timestamps
- Retry eligibility
Used for:
- Alerting
- SLA failures
- Debugging flaky tasks
3.5 xcom
Used for inter-task communication.
| Column | Meaning |
|---|---|
| dag_id | DAG name |
| task_id | Task name |
| key | XCom key |
| value | Serialized payload |
Key rule:
- XComs are not for large data
- Stored in metadata DB
- Large XComs cause DB bloat
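A small sketch of the intended pattern, assuming Airflow 2.4+ TaskFlow syntax: push a pointer to the data, never the data itself (the bucket and path are made up):

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def xcom_demo():
    @task
    def extract():
        # The return value is serialized into the xcom table: keep it tiny.
        return "s3://example-bucket/extracts/2024-01-01.parquet"  # hypothetical path

    @task
    def load(path: str):
        # The path arrives via XCom; the heavy payload stays in object storage.
        print(f"loading from {path}")

    load(extract())


xcom_demo()
```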
3.6 slot_pool
Controls concurrency limits.
| Column | Meaning |
|---|---|
| pool | Pool name |
| slots | Max parallel tasks |
| used_slots | Currently used slots |
Used to:
- Protect downstream systems
- Control blast radius
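For example, to cap concurrency against a fragile downstream API, tasks reference a pool by name. A sketch assuming Airflow 2.4+ (the pool name is hypothetical and must exist first, e.g. created in the UI or with `airflow pools set`):

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def pool_demo():
    @task(pool="legacy_api", pool_slots=1)  # hypothetical pool, created beforehand
    def call_api():
        # At most `slots` tasks from the "legacy_api" pool run at once,
        # no matter how many are scheduled across all DAGs.
        print("calling the fragile API")

    call_api()


pool_demo()
```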
3.7 connection and variable
Runtime configuration storage.
- connection: credentials, endpoints
- variable: flags, parameters

Stored encrypted at rest (optional, via Airflow's Fernet key).
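Both are read through Airflow's models at runtime; a sketch (the variable name and conn_id are assumptions):

```python
from airflow.hooks.base import BaseHook
from airflow.models import Variable

# Variables: small flags and parameters, with an optional default.
# Typically called inside a task or operator, since it hits the metadata DB.
batch_size = Variable.get("batch_size", default_var=500)  # hypothetical variable

# Connections: credentials and endpoints, resolved by conn_id.
conn = BaseHook.get_connection("warehouse")  # hypothetical conn_id
print(conn.host, conn.login)
```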
4) How the Scheduler Uses the Metadata DB (Step-by-Step)
This is critical.
Step 1: DAG Parsing
- Scheduler parses DAG files
- Writes DAG metadata to the dag table
Step 2: Create DAG Runs
- Scheduler checks:
  - schedule interval
  - start_date
  - catchup
- Inserts rows into dag_run
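You can verify what the scheduler inserted by querying dag_run through the ORM; a sketch using the same session pattern as above (dag_id is a placeholder; on Airflow 2.2+ the underlying column is logical_date, with execution_date kept as an alias):

```python
from airflow.settings import Session
from airflow.models import DagRun

session = Session()
# One row per logical run; with catchup=True the scheduler backfills one
# row per schedule interval between start_date and now.
for run in (
    session.query(DagRun)
    .filter(DagRun.dag_id == "my_dag")  # placeholder dag_id
    .order_by(DagRun.start_date.desc())
    .limit(5)
):
    print(run.run_id, run.execution_date, run.state)
session.close()
```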
Step 3: Create Task Instances
- For each DAG run, scheduler creates rows in task_instance
- Initial state = none
Step 4: Dependency Resolution
Scheduler queries:
- task_instance.state
- Upstream task states
- Pools and concurrency limits
- Sensor states
Only then does it queue tasks.
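Concretely, a dependency like the one below is resolved purely from task_instance rows: with the default trigger rule, t2 is queued only once t1's row reads success (a sketch assuming Airflow 2.4+; before 2.3, EmptyOperator was DummyOperator):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="dependency_demo",
    schedule=None,
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    t1 = EmptyOperator(task_id="t1")
    t2 = EmptyOperator(task_id="t2")
    # The scheduler queues t2 only after t1's task_instance state is 'success'.
    t1 >> t2
```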
Step 5: Executor & Workers
- Executor pulls queued tasks
- Workers update:
  - start_date
  - state
  - end_date

Everything is persisted back to the metadata DB.
5) Why the Metadata DB Becomes a Bottleneck
Common enterprise issues:
High Write Volume
- Millions of task instances/day
- Frequent state updates
Slow Queries
- Scheduler constantly queries task_instance
- Missing indexes = scheduler lag
XCom Abuse
- Large payloads stored
- DB grows rapidly
6) Production Best Practices
Database Choice
- PostgreSQL recommended
- Avoid SQLite except for local dev
Performance Tuning
- Enable proper indexing
- Archive old task instances
- Clean XComs periodically
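Airflow 2.3+ also ships a supported cleanup command, airflow db clean. As a sketch of the same idea applied manually to XComs (the 30-day retention window is an assumption):

```python
from datetime import datetime, timedelta, timezone

from airflow.settings import Session
from airflow.models import XCom

session = Session()
# Delete XCom rows older than a retention window to keep the table small.
cutoff = datetime.now(timezone.utc) - timedelta(days=30)  # assumed retention
deleted = (
    session.query(XCom)
    .filter(XCom.timestamp < cutoff)
    .delete(synchronize_session=False)
)
session.commit()
print(f"removed {deleted} xcom rows")
session.close()
```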
High Availability
- Use managed Postgres
- Enable backups
- Monitor DB latency
7) Failure Scenarios (Real World)
DB Down
- Scheduler stops scheduling
- Running tasks may finish
- No state updates recorded
DB Lagging
- DAGs appear "stuck"
- Tasks remain queued
- UI becomes inconsistent
8) Mental Model (Very Important)
Think of Airflow as:
- Scheduler = Decision Maker
- Executor = Task Dispatcher
- Workers = Task Runners
- Metadata DB = Brain & Memory
All components read from and write to the metadata database.
One-Line Summary
The metadata database is Airflow's brain: every DAG run, task state, retry, dependency, and scheduling decision lives there.