
What is Apache Airflow?

Apache Airflow is an open-source workflow orchestration platform used to author, schedule, and monitor batch workflows (data pipelines). It lets you define what should run, in what order, and when, while handling retries, failures, dependencies, and visibility.


Core Idea

Airflow coordinates work across systems. It does not process data itself; instead, it triggers and monitors external systems such as Spark, Databricks, dbt, data warehouses, APIs, and scripts.


Key Concepts

1) DAG (Directed Acyclic Graph)

  • A DAG defines a workflow
  • Written in Python
  • Describes:
      • Tasks
      • Dependencies
      • Schedule
  • β€œAcyclic” means no loops

Example flow:

Extract β†’ Transform β†’ Load β†’ Validate β†’ Notify
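
A minimal sketch of this flow as a Python DAG file (the dag_id is illustrative, and EmptyOperator placeholders stand in for real work; assumes Airflow 2.4+, where the schedule argument and EmptyOperator are available):

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="example_etl",            # illustrative name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract = EmptyOperator(task_id="extract")
    transform = EmptyOperator(task_id="transform")
    load = EmptyOperator(task_id="load")
    validate = EmptyOperator(task_id="validate")
    notify = EmptyOperator(task_id="notify")

    # >> declares dependencies: each task runs only after the one before it succeeds
    extract >> transform >> load >> validate >> notify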

2) Tasks and Operators

  • A task is one unit of work
  • Operators define how that task runs

Common operators:

  • PythonOperator – run Python logic
  • BashOperator – run shell commands
  • SparkSubmitOperator – submit Spark jobs
  • DatabricksRunNowOperator – trigger Databricks jobs
  • Warehouse operators (Snowflake, Redshift, BigQuery)
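
A small sketch pairing two of these operators in one DAG (the dag_id, callable, and command are placeholders):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def _extract():
    # placeholder for real extraction logic
    print("pulling records from the source API")

with DAG(
    dag_id="operator_examples",      # illustrative name
    start_date=datetime(2024, 1, 1),
    schedule=None,                   # no schedule; trigger manually
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=_extract)
    archive = BashOperator(task_id="archive", bash_command="echo 'archiving output'")

    extract >> archive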

3) Scheduler

  • Determines when a DAG run should start
  • Handles:
      • Cron schedules
      • Dependencies
      • Backfills
      • Catchup logic
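
A minimal sketch of a cron schedule with catchup enabled (the dag_id is illustrative). With catchup=True, the scheduler creates a run for every interval between start_date and the current date:

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="nightly_report",         # illustrative name
    start_date=datetime(2024, 1, 1),
    schedule="0 2 * * *",            # cron expression: every day at 02:00
    catchup=True,                    # also create runs for past, unprocessed intervals
) as dag:
    EmptyOperator(task_id="build_report")

Historical intervals can also be run on demand with the CLI, e.g. airflow dags backfill --start-date 2024-01-01 --end-date 2024-01-31 nightly_report.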

4) Executor

  • Controls how tasks are executed
  • Common executors:
      • SequentialExecutor (local testing)
      • LocalExecutor (parallel tasks on a single machine)
      • CeleryExecutor (distributed workers)
      • KubernetesExecutor (cloud-native)
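
The executor is chosen in Airflow's configuration rather than in DAG code. A minimal sketch of the relevant airflow.cfg entry (the environment-variable form is equivalent):

[core]
executor = LocalExecutor

# or, via environment variable:
# AIRFLOW__CORE__EXECUTOR=LocalExecutor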

5) Web UI

  • Visual DAG graphs
  • Task execution history
  • Logs per task attempt
  • Retry, clear, and rerun controls
  • SLA monitoring

What Airflow Is Good At

  • Orchestrating complex batch pipelines
  • Managing dependencies across systems
  • Handling retries and failures
  • Scheduling jobs reliably
  • Providing operational visibility

What Airflow Is NOT

  • Not a data processing engine
  • Not a streaming engine
  • Not a replacement for Spark, Flink, or SQL engines

Airflow only orchestrates those systems.


Typical Use Cases

  • Daily ETL pipelines
  • Triggering Spark or Databricks jobs
  • Running dbt models in sequence
  • ML training and scoring pipelines
  • Data quality and validation checks
  • SLA-based alerting
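
As a concrete sketch of the β€œtriggering Spark or Databricks jobs” case (requires the apache-airflow-providers-apache-spark package; the application path and dag_id are placeholders):

from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="spark_trigger_example",  # illustrative name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Airflow only submits and monitors the job; Spark does the actual processing
    SparkSubmitOperator(
        task_id="run_spark_job",
        application="/opt/jobs/transform.py",  # placeholder path to the Spark app
        conn_id="spark_default",               # default Spark connection
    )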

Where Airflow Fits in a Data Architecture

Example:

Sources (APIs, Kafka, DBs)
        ↓
     Airflow
        ↓
Spark / Databricks / dbt
        ↓
Data Warehouse / Lakehouse
        ↓
BI / Analytics / ML

Airflow sits above compute systems and coordinates them.