Databricks Control Plane and Data Plane
What is Databricks Lakehouse Architecture?
Traditionally, companies had two separate systems:
- Data Lake (cheap storage, e.g., Azure Data Lake, S3, Blob): stores raw structured, semi-structured, and unstructured data → flexible, but lacks strong data management (ACID, governance, BI).
- Data Warehouse (expensive but fast): optimized for SQL queries, BI, and analytics → strong schema enforcement and governance, but limited flexibility and costly.
The Lakehouse combines both in one system:
- The low-cost, flexible storage of a data lake
- The governance, ACID transactions, and performance of a warehouse
Core Components of Databricks Lakehouse
- Storage Layer (Data Lake foundation)
  - Data is stored in open formats like Parquet, ORC, Avro, and Delta.
  - Uses cloud object storage (e.g., Azure Data Lake Storage Gen2, AWS S3, GCS).
- Delta Lake (the secret sauce)
  - Adds ACID transactions on top of data lake storage.
  - Provides schema enforcement, schema evolution, time travel, and data versioning (see the first sketch after this list).
  - Solves problems like eventual consistency and corrupted files in raw data lakes.
- Unified Governance (Unity Catalog)
  - Centralized metadata and permissions for files, tables, ML models, and dashboards.
  - Manages security, lineage, and data discovery across the Lakehouse.
- Compute Layer (Databricks Runtime / Spark + Photon)
  - Uses Apache Spark plus the Photon execution engine for batch, streaming, ML, and BI.
  - The same engine handles ETL, streaming, AI, and SQL queries → no silos.
- Data Management Features
  - Streaming + batch in one pipeline (via Delta Live Tables; see the second sketch after this list).
  - Materialized views, incremental processing, Change Data Capture (CDC).
  - MLflow integration for machine learning lifecycle management.
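To make the Delta Lake bullet concrete, here is a minimal PySpark sketch of ACID writes, schema enforcement, and time travel. It assumes a Databricks notebook (or any Spark session with the Delta Lake library installed); the `demo.users` table name is a placeholder.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already available on Databricks
spark.sql("CREATE SCHEMA IF NOT EXISTS demo")

# ACID write: save a DataFrame as a Delta table (each write is a transaction)
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
df.write.format("delta").mode("overwrite").saveAsTable("demo.users")

# Schema enforcement: an append with a mismatched schema is rejected
bad = spark.createDataFrame([("3", "carol", True)], ["id", "name", "active"])
try:
    bad.write.format("delta").mode("append").saveAsTable("demo.users")
except Exception as e:
    print("Rejected by schema enforcement:", type(e).__name__)

# Time travel: query the table as of an earlier version of its transaction log
spark.sql("SELECT * FROM demo.users VERSION AS OF 0").show()
```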
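The Delta Live Tables bullet deserves its own sketch. The pipeline below is illustrative only (the source path and table names are invented, and DLT code runs inside a pipeline configured in Databricks Workflows rather than as a plain script), but it shows how one declarative definition covers both streaming ingestion and batch-style transformation:

```python
import dlt
from pyspark.sql.functions import col

# Bronze: incrementally ingest raw JSON files with Auto Loader (a streaming source).
# `spark` is provided by the DLT runtime.
@dlt.table(comment="Raw orders ingested from cloud storage")
def orders_bronze():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/mnt/raw/orders")  # placeholder path
    )

# Silver: validated view of the same data; rows failing the expectation are dropped
@dlt.table(comment="Validated orders")
@dlt.expect_or_drop("valid_amount", "amount > 0")
def orders_silver():
    return dlt.read_stream("orders_bronze").where(col("order_id").isNotNull())
```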
Architecture Diagram (Conceptual Flow)
┌─────────────────────────┐
│   Business Apps / BI    │
│   (Power BI, Tableau)   │
└───────────▲─────────────┘
            │
┌───────────┴─────────────┐
│     Databricks SQL      │
│     & Photon Engine     │
└───────────▲─────────────┘
            │
┌───────────┴─────────────────────────────┐
│    Delta Lake (ACID, Schema, CDC)       │
│ (Open Storage Format on Parquet + Log) │
└───────────▲─────────────────────────────┘
            │
┌───────────┴─────────────┐
│   Cloud Object Store    │
│    (ADLS, S3, GCS)      │
└─────────────────────────┘
Benefits of Lakehouse
- One platform → no need for a separate warehouse + lake.
- Cost efficient → cheap storage, scalable compute.
- Flexibility → structured + semi-structured + unstructured data.
- ACID reliability → transactions, schema enforcement.
- End-to-end → supports ETL, real-time streaming, ML/AI, and BI in the same system.
In Databricks Azure Context
- Storage → Azure Data Lake Storage (ADLS Gen2)
- Security/Governance → Azure Key Vault + Unity Catalog
- Compute → Databricks clusters with Photon
- Serving → Power BI (Direct Lake Mode)
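As a concrete (but hedged) example of how the Storage and Security pieces meet in a notebook: pull a service principal secret from an Azure Key Vault-backed secret scope with dbutils, then configure the Spark session to read from ADLS Gen2 over OAuth. All names below (scope, key, storage account, container, IDs) are placeholders:

```python
# Fetch the service principal secret from a Key Vault-backed secret scope
client_secret = dbutils.secrets.get(scope="kv-scope", key="sp-client-secret")

storage_account = "mydatalake"   # placeholder ADLS Gen2 account
client_id = "<application-id>"   # placeholder service principal
tenant_id = "<directory-id>"     # placeholder Azure AD tenant

# Configure OAuth access to ADLS Gen2 for this Spark session
base = f"{storage_account}.dfs.core.windows.net"
spark.conf.set(f"fs.azure.account.auth.type.{base}", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{base}",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{base}", client_id)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{base}", client_secret)
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{base}",
               f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")

# Read data that lives in your own storage account (the data plane)
df = spark.read.format("delta").load(f"abfss://bronze@{base}/orders")
```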
Data Plane vs Control Plane
1. Simple Analogy
Think of Databricks like Uber:
- Control Plane = the Uber app → handles where you go, who drives, billing, monitoring.
- Data Plane = the car → where the actual ride happens (your data processing).
So, Databricks separates management functions (control) from execution functions (data).
2. Databricks Architecture
Control Plane
- Managed by Databricks itself (it runs in Databricks' own AWS/Azure/GCP accounts).
- Contains:
  - Web UI / REST API → where you log in, create clusters, and manage jobs.
  - Cluster Manager → decides how to spin up VMs/compute.
  - Job Scheduler → triggers pipelines, notebooks, workflows.
  - Metadata Storage → notebooks, workspace configs, Unity Catalog metadata.
  - Monitoring / Logging → cluster health, job logs, error reporting.
Important: your raw data does not go here. This plane is about orchestration, configs, and metadata.
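Everything listed above is exposed through the control plane's REST API. As a rough, hedged illustration (the workspace URL and token are placeholders, and endpoint paths can vary by API version), listing the clusters in a workspace looks roughly like this:

```python
import requests

WORKSPACE_URL = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
TOKEN = "<personal-access-token>"  # placeholder; prefer a secret store in practice

# Call the control plane's REST API to list clusters in the workspace
resp = requests.get(
    f"{WORKSPACE_URL}/api/2.0/clusters/list",
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,
)
resp.raise_for_status()

for cluster in resp.json().get("clusters", []):
    print(cluster["cluster_id"], cluster["cluster_name"], cluster["state"])
```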
Data Plane
- Runs inside your cloud account (your subscription/project).
- Contains:
  - Clusters / compute (Spark driver, executors, Photon) → where the data is processed.
  - Your data → stored in ADLS, S3, or GCS.
  - Networking → VNets, Private Endpoints, peering.
  - Libraries / runtime → Spark, Delta Lake, MLflow, etc.
Key point: the actual data never leaves your cloud account. Processing happens within your boundary.
3. Security Perspective
- Control Plane:
  - Managed by Databricks.
  - Contains metadata, credentials, and configs, but not raw data.
  - Can be hardened with SCIM, SSO, RBAC, and IP access lists (see the sketch after this list).
- Data Plane:
  - Fully inside your cloud subscription.
  - Your sensitive data (PII, transactions, crypto, etc.) never touches Databricks' account.
  - You control the networking:
    - Private Link / VNet injection → ensures traffic never goes over the public internet.
    - Key Vault / KMS for secrets.
    - Storage firewalls.
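For example, one of those hardening knobs, IP access lists, is driven through the control plane's REST API. The sketch below reflects my understanding of the workspace-conf and ip-access-lists endpoints and should be checked against the current Databricks docs; the URL, token, label, and CIDR are placeholders:

```python
import requests

WORKSPACE_URL = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
TOKEN = "<admin-personal-access-token>"  # placeholder
headers = {"Authorization": f"Bearer {TOKEN}"}

# Turn on IP access list enforcement for the workspace
requests.patch(
    f"{WORKSPACE_URL}/api/2.0/workspace-conf",
    headers=headers,
    json={"enableIpAccessLists": "true"},
    timeout=30,
).raise_for_status()

# Allow logins to the control plane only from the corporate network range
requests.post(
    f"{WORKSPACE_URL}/api/2.0/ip-access-lists",
    headers=headers,
    json={
        "label": "corp-vpn",                 # placeholder label
        "list_type": "ALLOW",
        "ip_addresses": ["203.0.113.0/24"],  # placeholder CIDR
    },
    timeout=30,
).raise_for_status()
```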
4. Architecture Diagram
┌────────────────────────────────────────┐
│              CONTROL PLANE             │
│      (Databricks-managed account)      │
│                                        │
│  - Web UI / API                        │
│  - Cluster Manager                     │
│  - Job Scheduler                       │
│  - Unity Catalog Metadata              │
│  - Logs / Monitoring                   │
└────────────────▲───────────────────────┘
                 │
                 │  Secure REST/API calls
                 │
┌────────────────┴───────────────────────┐
│               DATA PLANE               │
│    (Your cloud subscription/project)   │
│                                        │
│  - Spark Driver & Executors            │
│  - Photon Engine                       │
│  - Data in ADLS/S3/GCS                 │
│  - Networking (VNet, Firewall, PEs)    │
│  - Secrets from Key Vault/KMS          │
└────────────────────────────────────────┘
5. Why This Separation?
- Security → your data never leaves your account.
- Scalability → Databricks manages orchestration, you manage compute.
- Multi-cloud → the same control plane works across AWS, Azure, and GCP.
- Compliance → helps with HIPAA, GDPR, and financial regulations.
6. Special Feature: Databricks Serverless SQL
- Here, the data plane compute is also managed by Databricks (it runs in Databricks' account, not yours).
- Good for quick BI queries (e.g., from Power BI), but some enterprises avoid it for sensitive data.
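Whether the warehouse is serverless or classic, clients connect to it the same way. Below is a hedged sketch using the databricks-sql-connector package; the hostname, HTTP path, and token are placeholders taken from a warehouse's connection details:

```python
from databricks import sql  # pip install databricks-sql-connector

# Connection details come from the SQL warehouse's "Connection details" tab
conn = sql.connect(
    server_hostname="adb-1234567890123456.7.azuredatabricks.net",  # placeholder
    http_path="/sql/1.0/warehouses/abcdef1234567890",              # placeholder
    access_token="<personal-access-token>",                        # placeholder
)

cursor = conn.cursor()
cursor.execute("SELECT 1 AS ok")  # any Databricks SQL statement works here
print(cursor.fetchall())

cursor.close()
conn.close()
```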