Azure Data Factory Architecture (In Depth)
At a high level, ADF has 5 core building blocks:
- Pipelines
- Activities
- Datasets
- Linked Services
- Integration Runtimes
Let's explore them step by step.
🔹 1. Control Plane vs Data Plane
ADF runs on a serverless architecture inside Azure. It is split into two planes:
- Control Plane:
  - Manages metadata, pipelines, triggers, and monitoring.
  - What you see in ADF Studio (the UI).
  - Stores the JSON definitions of pipelines in Azure.
- Data Plane:
  - Where the actual data movement/processing happens.
  - Uses an Integration Runtime (IR) to copy or transform data.
  - Example: copying a file from on-prem SQL Server → Blob Storage.
🔹 2. Core Components
✅ Pipelines
- A pipeline = a workflow.
- Groups multiple activities into a sequence/graph.
- Example (a trimmed JSON sketch follows this list):
  - Step 1: Copy sales data from SQL → Data Lake
  - Step 2: Run a Databricks transformation
  - Step 3: Load into Synapse
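Pipelines are authored in ADF Studio but stored as JSON. The sketch below shows roughly what the first two steps of such a pipeline could look like; the pipeline, activity, dataset, and linked-service names (IngestSalesPipeline, CopySalesToLake, SalesTable, SalesCsvInLake, DatabricksLS) and the notebook path are hypothetical, and most typeProperties are trimmed for brevity.

```json
{
  "name": "IngestSalesPipeline",
  "properties": {
    "activities": [
      {
        "name": "CopySalesToLake",
        "type": "Copy",
        "dependsOn": [],
        "inputs": [ { "referenceName": "SalesTable", "type": "DatasetReference" } ],
        "outputs": [ { "referenceName": "SalesCsvInLake", "type": "DatasetReference" } ],
        "typeProperties": {
          "source": { "type": "AzureSqlSource" },
          "sink": { "type": "DelimitedTextSink" }
        }
      },
      {
        "name": "TransformSales",
        "type": "DatabricksNotebook",
        "dependsOn": [
          { "activity": "CopySalesToLake", "dependencyConditions": [ "Succeeded" ] }
        ],
        "linkedServiceName": { "referenceName": "DatabricksLS", "type": "LinkedServiceReference" },
        "typeProperties": { "notebookPath": "/Shared/transform_sales" }
      }
    ]
  }
}
```

The dependsOn block is what turns a flat list of activities into a sequence/graph: TransformSales only starts after CopySalesToLake succeeds.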
✅ Activities
- Steps inside a pipeline.
- Types (a control-activity sketch follows this list):
  - Data Movement → Copy Activity (move data between stores).
  - Data Transformation → Mapping Data Flows, Databricks, Synapse SQL, HDInsight.
  - Control Activities → If Condition, ForEach loops, Web calls, Execute Pipeline.
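As an illustration of a control activity, here is a rough sketch of a ForEach that fans a Copy Activity out over a list passed in as a pipeline parameter. The parameter name (regions), the activity names, and the dataset references are all hypothetical.

```json
{
  "name": "ForEachRegion",
  "type": "ForEach",
  "typeProperties": {
    "items": { "value": "@pipeline().parameters.regions", "type": "Expression" },
    "isSequential": false,
    "activities": [
      {
        "name": "CopyRegionFile",
        "type": "Copy",
        "inputs": [ { "referenceName": "RegionSourceDataset", "type": "DatasetReference" } ],
        "outputs": [ { "referenceName": "RegionLakeDataset", "type": "DatasetReference" } ],
        "typeProperties": {
          "source": { "type": "DelimitedTextSource" },
          "sink": { "type": "DelimitedTextSink" }
        }
      }
    ]
  }
}
```

With isSequential set to false, the iterations run in parallel, which is one way pipelines achieve the parallelism described later in the orchestration layer.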
✅ Datasets
- A definition of the data structure you want to read/write.
- Think of it as a pointer to data inside a storage system.
- Examples (both are sketched below):
  - A dataset for "SalesTable in an Azure SQL DB".
  - A dataset for "a CSV file in a Data Lake folder".
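Hedged sketches of those two datasets follow; the names (SalesTable, SalesCsvInLake), the linked-service references (AzureSqlLS, DataLakeLS), and the file path are assumptions for illustration only.

```json
{
  "name": "SalesTable",
  "properties": {
    "type": "AzureSqlTable",
    "linkedServiceName": { "referenceName": "AzureSqlLS", "type": "LinkedServiceReference" },
    "typeProperties": { "schema": "dbo", "table": "SalesTable" }
  }
}
```

```json
{
  "name": "SalesCsvInLake",
  "properties": {
    "type": "DelimitedText",
    "linkedServiceName": { "referenceName": "DataLakeLS", "type": "LinkedServiceReference" },
    "typeProperties": {
      "location": {
        "type": "AzureBlobFSLocation",
        "fileSystem": "raw",
        "folderPath": "sales/daily",
        "fileName": "sales.csv"
      },
      "columnDelimiter": ",",
      "firstRowAsHeader": true
    }
  }
}
```

Note that a dataset holds only the shape and location of the data; the connection details live in the linked service it references.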
✅ Linked Services
- Connection info (credentials + endpoints).
- Similar to connection strings.
- Examples (the first one is sketched below):
  - Linked Service for Azure SQL DB
  - Linked Service for Blob Storage
  - Linked Service for on-prem SQL Server via a Self-hosted IR
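A minimal sketch of an Azure SQL DB linked service, assuming managed identity authentication (so no credentials appear in the connection string); the name AzureSqlLS and the server/database values are placeholders.

```json
{
  "name": "AzureSqlLS",
  "properties": {
    "type": "AzureSqlDatabase",
    "typeProperties": {
      "connectionString": "Data Source=tcp:myserver.database.windows.net,1433;Initial Catalog=SalesDb;"
    }
  }
}
```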
✅ Integration Runtime (IR)
This is the engine that actually runs ADF activities. Types of IR:
- Azure IR → managed, serverless compute (the default). Used for copying data in the cloud.
- Self-Hosted IR → installed on your own on-prem VM. Used for hybrid (on-prem ↔ cloud) scenarios.
- Azure-SSIS IR → runs legacy SSIS packages in Azure.
👉 Example:
- If your data sits in an on-prem SQL Server, you must install a Self-hosted IR in your data center to move it to Azure (a linked service bound to that IR is sketched below).
- If your data moves from Azure Blob → Synapse, the Azure IR handles it.
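The binding between a data store and a Self-hosted IR happens in the linked service via connectVia. The sketch below assumes a Self-hosted IR registered as SelfHostedIR and an on-prem server/database named onprem-sql-01/Sales; the password reference is omitted here (it would typically come from Key Vault, as shown in the Security section).

```json
{
  "name": "OnPremSqlServerLS",
  "properties": {
    "type": "SqlServer",
    "typeProperties": {
      "connectionString": "Data Source=onprem-sql-01;Initial Catalog=Sales;Integrated Security=False;User ID=adf_reader;"
    },
    "connectVia": {
      "referenceName": "SelfHostedIR",
      "type": "IntegrationRuntimeReference"
    }
  }
}
```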
🔹 3. Orchestration Layer
- Pipelines are triggered by (a schedule trigger is sketched after this list):
  - Schedule (daily, hourly)
  - Event (a new file arrives in Blob Storage)
  - Manual / REST API call
- Pipelines can branch, loop, or run in parallel.
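A hedged sketch of a daily schedule trigger; the trigger name, start time, and the referenced pipeline name (IngestSalesPipeline from the earlier sketch) are assumptions. Event-based runs would use a BlobEventsTrigger instead, with the storage account and a path filter in its typeProperties.

```json
{
  "name": "DailySalesTrigger",
  "properties": {
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": {
        "frequency": "Day",
        "interval": 1,
        "startTime": "2024-01-01T06:00:00Z",
        "timeZone": "UTC"
      }
    },
    "pipelines": [
      {
        "pipelineReference": { "referenceName": "IngestSalesPipeline", "type": "PipelineReference" }
      }
    ]
  }
}
```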
🔹 4. Monitoring Layer
- Built-in monitoring in ADF Studio.
- Shows pipeline runs, activity runs, duration, errors.
- Integrated with Azure Monitor & Log Analytics for alerts.
🔹 5. Security Layer
- Authentication: Managed Identity, Service Principal, Key Vault (a Key Vault reference is sketched below).
- Data never passes through the control plane; it only flows through the IR.
- Network isolation is possible with VNet integration.
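One common pattern is to keep secrets out of the linked-service JSON entirely by referencing Key Vault. A sketch, assuming a Key Vault linked service named KeyVaultLS and a secret named sales-sql-connection-string (both hypothetical):

```json
{
  "name": "AzureSqlFromKeyVaultLS",
  "properties": {
    "type": "AzureSqlDatabase",
    "typeProperties": {
      "connectionString": {
        "type": "AzureKeyVaultSecret",
        "store": { "referenceName": "KeyVaultLS", "type": "LinkedServiceReference" },
        "secretName": "sales-sql-connection-string"
      }
    }
  }
}
```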
🔹 6. Typical Data Flow Example
Scenario: ingest daily sales data from on-prem SQL Server to Synapse (the transformation step is sketched after this list).
- The trigger fires daily.
- The pipeline starts.
- A Copy Activity (using the Self-hosted IR) moves data → Azure Data Lake.
- A Mapping Data Flow activity cleans & transforms the data.
- A Copy Activity loads the transformed data → Synapse DW.
- Monitoring logs success/failure.
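Step 4 of this flow would be a Mapping Data Flow activity inside the pipeline; a rough sketch follows, where the data flow name CleanSalesDataFlow, the upstream activity name CopySalesToLake, and the compute sizing are all assumptions.

```json
{
  "name": "CleanSalesData",
  "type": "ExecuteDataFlow",
  "dependsOn": [
    { "activity": "CopySalesToLake", "dependencyConditions": [ "Succeeded" ] }
  ],
  "typeProperties": {
    "dataFlow": { "referenceName": "CleanSalesDataFlow", "type": "DataFlowReference" },
    "compute": { "coreCount": 8, "computeType": "General" }
  }
}
```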
🔹 Architecture Diagram (Explained in Words)
Imagine:
- Top Layer (UI + Control Plane) → ADF Studio, where you design pipelines.
- Middle Layer (Orchestration) → Pipelines + Triggers + Activities.
- Bottom Layer (Execution via IR) → Data is copied/transformed by the IR across data sources.
So:
- Control Plane = Think "Blueprint + Control Room".
- Integration Runtime (Data Plane) = Think "Workers doing the job".
✅ In short: ADF is an orchestrator + integration engine, with pipelines as workflows, activities as tasks, linked services as connections, datasets as pointers, and the IR as the worker that executes the jobs.