πŸ— Azure Data Factory Architecture (In Depth)πŸ”—

At a high level, ADF has 5 core building blocks:

  1. Pipelines
  2. Activities
  3. Datasets
  4. Linked Services
  5. Integration Runtimes

Let’s explore each one step by step.


πŸ”Ή 1. Control Plane vs Data Plane

ADF runs on a serverless architecture inside Azure. It is split into two planes:

  • Control Plane:

      • Manages metadata, pipelines, triggers, monitoring.
      • What you see in the ADF Studio (the UI).
      • Stores JSON definitions of pipelines in Azure.

  • Data Plane:

      • Where the actual data movement/processing happens.
      • Uses Integration Runtime (IR) to copy or transform data.
      • Example: Copying a file from On-prem SQL β†’ Blob storage.

πŸ”Ή 2. Core Components

βœ… Pipelines

  • A pipeline = workflow.
  • Groups multiple activities into a sequence/graph.
  • Example (sketched in JSON below):

      • Step 1: Copy sales data from SQL β†’ Data Lake
      • Step 2: Run Databricks transformation
      • Step 3: Load into Synapse
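
Since the control plane stores pipelines as JSON, the example above boils down to a definition roughly like the one below. This is a minimal sketch, not a complete definition: every name here (DailySalesPipeline, the two datasets, the DatabricksWorkspace linked service, the notebook path) is a placeholder rather than something from a real workspace.

```json
{
  "name": "DailySalesPipeline",
  "properties": {
    "activities": [
      {
        "name": "CopySalesToLake",
        "type": "Copy",
        "inputs": [ { "referenceName": "SqlSalesDataset", "type": "DatasetReference" } ],
        "outputs": [ { "referenceName": "LakeSalesDataset", "type": "DatasetReference" } ],
        "typeProperties": {
          "source": { "type": "SqlServerSource" },
          "sink": { "type": "DelimitedTextSink" }
        }
      },
      {
        "name": "TransformSales",
        "type": "DatabricksNotebook",
        "dependsOn": [
          { "activity": "CopySalesToLake", "dependencyConditions": [ "Succeeded" ] }
        ],
        "linkedServiceName": { "referenceName": "DatabricksWorkspace", "type": "LinkedServiceReference" },
        "typeProperties": { "notebookPath": "/Shared/transform_sales" }
      }
    ]
  }
}
```

The dependsOn block is what turns a flat list of activities into a sequence/graph: TransformSales starts only after CopySalesToLake succeeds.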

βœ… Activities

  • Steps inside a pipeline.
  • Types:

      • Data Movement β†’ Copy Activity (move data between stores).
      • Data Transformation β†’ Mapping Data Flows, Databricks, Synapse SQL, HDInsight.
      • Control Activities β†’ If/Else, ForEach loops, Web calls, Execute Pipeline.
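
To show how a control activity wraps a data-movement activity, here is a hedged sketch of a ForEach looping over a pipeline parameter and running a Copy per item. The regions parameter and both dataset names are assumptions made up for illustration.

```json
{
  "name": "ForEachRegion",
  "type": "ForEach",
  "typeProperties": {
    "items": { "value": "@pipeline().parameters.regions", "type": "Expression" },
    "isSequential": false,
    "activities": [
      {
        "name": "CopyRegionFile",
        "type": "Copy",
        "inputs": [ { "referenceName": "RegionSourceDataset", "type": "DatasetReference" } ],
        "outputs": [ { "referenceName": "RegionSinkDataset", "type": "DatasetReference" } ],
        "typeProperties": {
          "source": { "type": "DelimitedTextSource" },
          "sink": { "type": "ParquetSink" }
        }
      }
    ]
  }
}
```

With isSequential set to false, the inner Copy runs once per item, in parallel batches.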

βœ… Datasets

  • Definition of data structure you want to read/write.
  • Think of it as a pointer to data inside a storage system.
  • Example (sketched below):

      • A dataset for "SalesTable in SQL DB".
      • A dataset for "CSV file in Data Lake folder".

βœ… Linked Services

  • Connection info (credentials + endpoints).
  • Similar to connection strings.
  • Examples (sketched below):

      • Linked Service for Azure SQL DB
      • Linked Service for Blob Storage
      • Linked Service for On-prem SQL via Self-hosted IR
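
For the first example, a linked service might look roughly like the sketch below. The server, database, and user names are placeholders, and the password is deliberately left out here because in practice it would come from Key Vault (see the Security Layer section).

```json
{
  "name": "AzureSqlDbLinkedService",
  "properties": {
    "type": "AzureSqlDatabase",
    "typeProperties": {
      "connectionString": "Server=tcp:myserver.database.windows.net,1433;Database=SalesDb;User ID=adf_user;"
    }
  }
}
```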

βœ… Integration Runtime (IR)

This is the engine that actually runs ADF activities. Types of IR:

  1. Azure IR β†’ Managed, serverless compute (default). Used for copying and transforming data between cloud data stores.
  2. Self-Hosted IR β†’ Installed on your own on-prem VM or server. Used for hybrid scenarios (on-prem ↔ cloud).
  3. Azure-SSIS IR β†’ Runs legacy SSIS packages in Azure.

πŸ“Œ Example:

  • If your data is in an on-prem SQL Server, you must install a Self-hosted IR in your data center to move that data to Azure.
  • If your data moves from Azure Blob β†’ Synapse, the Azure IR handles it.
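
Which IR a connection uses is controlled by the connectVia property of the linked service. A minimal sketch for the on-prem case, assuming a Self-hosted IR registered under the name SelfHostedIR (the server and database names are also placeholders):

```json
{
  "name": "OnPremSqlLinkedService",
  "properties": {
    "type": "SqlServer",
    "connectVia": { "referenceName": "SelfHostedIR", "type": "IntegrationRuntimeReference" },
    "typeProperties": {
      "connectionString": "Server=ONPREM-SQL01;Database=SalesDb;Integrated Security=True;"
    }
  }
}
```

If connectVia is omitted, the default Azure IR (AutoResolveIntegrationRuntime) is used.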

πŸ”Ή 3. Orchestration Layer

  • Pipelines are triggered by:

      • Schedule (daily, hourly)
      • Event-based (a new file arrives in Blob)
      • Manual/REST API call

  • Pipelines can branch, loop, or run in parallel.
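
For the scheduled case, a trigger is just another JSON artifact that references the pipeline it starts. A minimal sketch, with the trigger name, start time, and pipeline reference as placeholders:

```json
{
  "name": "DailyTrigger",
  "properties": {
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": {
        "frequency": "Day",
        "interval": 1,
        "startTime": "2024-01-01T02:00:00Z",
        "timeZone": "UTC"
      }
    },
    "pipelines": [
      { "pipelineReference": { "referenceName": "DailySalesPipeline", "type": "PipelineReference" } }
    ]
  }
}
```

An event-based trigger uses the BlobEventsTrigger type instead and fires when a blob is created (or deleted) in the configured container.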

πŸ”Ή 4. Monitoring Layer

  • Built-in monitoring in ADF Studio.
  • Shows pipeline runs, activity runs, duration, errors.
  • Integrated with Azure Monitor & Log Analytics for alerts.

πŸ”Ή 5. Security Layer

  • Authentication: Managed Identity, Service Principal, Key Vault.
  • Data never passes through the control plane β†’ only through the IR.
  • Network isolation is possible with VNet integration.
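
To make the Key Vault point concrete, here is a sketch of a linked service that pulls its connection string from Key Vault at runtime instead of storing it in the ADF definition. The Key Vault linked service and secret name are placeholders.

```json
{
  "name": "BlobStorageLinkedService",
  "properties": {
    "type": "AzureBlobStorage",
    "typeProperties": {
      "connectionString": {
        "type": "AzureKeyVaultSecret",
        "store": { "referenceName": "KeyVaultLinkedService", "type": "LinkedServiceReference" },
        "secretName": "blob-connection-string"
      }
    }
  }
}
```

With Managed Identity authentication you can often skip stored secrets entirely and grant the factory's identity access to the data store directly.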

πŸ”Ή 6. Typical Data Flow Example

Scenario: Ingest daily sales data from On-prem SQL to Synapse

  1. Trigger fires daily.
  2. Pipeline starts.
  3. Copy Activity (using Self-hosted IR) moves data β†’ Azure Data Lake.
  4. Mapping Data Flow Activity cleans & transforms data.
  5. Copy Activity loads transformed data β†’ Synapse DW.
  6. Monitoring logs success/failure.
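
Putting the pieces together, the pipeline behind this scenario could be sketched as three chained activities. All names (the datasets, the data flow, the Synapse dataset) are placeholders, and a real definition would carry more settings (staging, load options, retries, and so on).

```json
{
  "name": "IngestDailySales",
  "properties": {
    "activities": [
      {
        "name": "LandRawSales",
        "type": "Copy",
        "inputs": [ { "referenceName": "OnPremSalesDataset", "type": "DatasetReference" } ],
        "outputs": [ { "referenceName": "RawLakeDataset", "type": "DatasetReference" } ],
        "typeProperties": {
          "source": { "type": "SqlServerSource" },
          "sink": { "type": "DelimitedTextSink" }
        }
      },
      {
        "name": "CleanAndTransform",
        "type": "ExecuteDataFlow",
        "dependsOn": [ { "activity": "LandRawSales", "dependencyConditions": [ "Succeeded" ] } ],
        "typeProperties": {
          "dataFlow": { "referenceName": "CleanSalesDataFlow", "type": "DataFlowReference" }
        }
      },
      {
        "name": "LoadToSynapse",
        "type": "Copy",
        "dependsOn": [ { "activity": "CleanAndTransform", "dependencyConditions": [ "Succeeded" ] } ],
        "inputs": [ { "referenceName": "CuratedLakeDataset", "type": "DatasetReference" } ],
        "outputs": [ { "referenceName": "SynapseSalesDataset", "type": "DatasetReference" } ],
        "typeProperties": {
          "source": { "type": "DelimitedTextSource" },
          "sink": { "type": "SqlDWSink" }
        }
      }
    ]
  }
}
```

LandRawSales runs on the Self-hosted IR because its source linked service points at on-prem SQL Server via connectVia; the later activities run on the Azure IR.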

πŸ”Ή Architecture Diagram (Explained in Words)

Imagine:

  • Top Layer (UI + Control Plane) β†’ ADF Studio where you design pipelines.
  • Middle Layer (Orchestration) β†’ Pipelines + Triggers + Activities.
  • Bottom Layer (Execution via IR) β†’ Data is copied/transformed by IR across data sources.

So:

  • Control Plane = Think "Blueprint + Control Room".
  • Integration Runtime (Data Plane) = Think "Workers doing the job".

βœ… In short: ADF is an orchestrator + integration engine with pipelines as workflows, activities as tasks, linked services as connections, datasets as pointers, and IR as the worker that executes jobs.