Azure Integration with Databricks
Managed Resource Group
We can see:
- Identity
- Managed Storage Account: DBFS is stored here.
- Access Connector: used to connect to the storage account from Databricks.

Blob containers sit inside the managed storage account.
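A hedged sketch of inspecting that managed resource group with the Azure Python SDK; the subscription ID and resource group name below are placeholders (the real managed group name is shown on the workspace overview page):

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

# "databricks-rg-dbw-demo" is a hypothetical managed resource group name.
client = ResourceManagementClient(DefaultAzureCredential(), "<subscription-id>")
for res in client.resources.list_by_resource_group("databricks-rg-dbw-demo"):
    # Expect a managed identity, a storage account (DBFS root), and an
    # access connector, matching the list above.
    print(res.type, "->", res.name)
```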
Creating Compute in Databricks
When we create compute in Databricks, we can see VMs created in the managed resource group: one VM for each driver and worker node. When the compute is terminated, the VMs are deleted.
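As a rough illustration with the Databricks Python SDK (the cluster name, runtime label, and node type are placeholders, not values from these notes), a cluster with one driver and two workers should surface three VMs, and auto-termination removes them:

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()  # picks up host/token from env vars or .databrickscfg

cluster = w.clusters.create(
    cluster_name="demo-cluster",
    spark_version="15.4.x-scala2.12",  # hypothetical runtime label
    node_type_id="Standard_DS3_v2",    # Azure VM size used per node
    num_workers=2,                     # 1 driver + 2 workers -> 3 VMs
    autotermination_minutes=30,        # on termination the VMs are deleted
).result()                             # waits until the cluster is running

print(cluster.cluster_id, cluster.state)
```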
Unity Catalog
- Metadata is stored in the control plane and data is stored in the data plane. A single metastore per region is advised.
- Catalogs are securable objects, and so are the schemas, tables, and views beneath them in the Unity Catalog hierarchy.
- Location of managed tables: their data is written under the managed storage location configured for the metastore (or overridden at the catalog/schema level).
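A minimal sketch (catalog, schema, and table names are hypothetical) of checking where a managed table's data actually lands:

```python
# Run in a Databricks notebook attached to a Unity Catalog-enabled workspace.
spark.sql("CREATE CATALOG IF NOT EXISTS demo_catalog")
spark.sql("CREATE SCHEMA IF NOT EXISTS demo_catalog.demo_schema")
spark.sql(
    "CREATE TABLE IF NOT EXISTS demo_catalog.demo_schema.events (id INT, name STRING)"
)

# DESCRIBE EXTENDED includes a 'Location' row; for a managed table this is a
# path inside the metastore's (or catalog/schema-level) managed storage.
display(spark.sql("DESCRIBE EXTENDED demo_catalog.demo_schema.events"))
```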
Detailed Steps to Set Up
1. Create an Azure Databricks Workspace
- Go to the Azure Portal.
- Create a new resource → search for Azure Databricks → click Create.
- Fill in:
  - Workspace name
  - Region
  - Pricing tier (Standard, Premium, or Trial)
- Under Networking, choose whether to deploy into a VNet (VNet injection) or let Databricks create a managed VNet.
- Deploy the workspace.
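The same step can be scripted; a hedged sketch with the azure-mgmt-databricks SDK, where every name, region, and ID is a placeholder:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.databricks import AzureDatabricksManagementClient

client = AzureDatabricksManagementClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
)

poller = client.workspaces.begin_create_or_update(
    resource_group_name="rg-analytics",
    workspace_name="dbw-demo",
    parameters={
        "location": "eastus",
        "sku": {"name": "premium"},
        # Azure provisions this locked resource group for the workspace's
        # managed storage account, identity, etc.
        "managed_resource_group_id": (
            "/subscriptions/<subscription-id>/resourceGroups/rg-dbw-demo-managed"
        ),
    },
)
workspace = poller.result()
print(workspace.workspace_url)
```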
2. Understand the Storage Setup
When you create a Databricks workspace, Azure automatically provisions a Managed Storage Account:
- A hidden, Microsoft-managed storage account that Databricks uses for workspace metadata (notebooks, cluster logs, ML models, job configs).
- You don't manage or see this storage directly.
- This is different from your customer-managed storage account, where you store your actual data (e.g., in ADLS Gen2 or Blob Storage).

So you'll normally integrate Databricks with your own Azure Data Lake Storage (ADLS Gen2) for raw and processed data.
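As a quick illustration of the two storage planes (the container and account names are placeholders): the DBFS root lives in the workspace-managed account, while your own lake is addressed explicitly with abfss:// URIs:

```python
# DBFS root -> workspace-managed storage account (provisioned automatically).
display(dbutils.fs.ls("dbfs:/"))

# Your own ADLS Gen2 account -> addressed via abfss://.
# "raw" and "mydatalake" are hypothetical names.
display(dbutils.fs.ls("abfss://raw@mydatalake.dfs.core.windows.net/"))
```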
3. Grant Access Using an Access Connector
To allow Databricks to access your storage securely:
- Create a resource called Azure Databricks Access Connector.
  - This acts like a bridge between Databricks and Azure services.
  - It's assigned a system-assigned Managed Identity.
- Assign RBAC roles to this Access Connector on your storage account. Examples:
  - Storage Blob Data Contributor → read/write data in ADLS Gen2.
  - Storage Blob Data Reader → read-only access.
- Go to your Databricks workspace → Advanced Settings → Access Connector → attach the Access Connector you created.
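The role assignment can also be scripted; a hedged sketch with the azure-mgmt-authorization SDK, where the subscription, resource group, account, and principal IDs are placeholders (the GUID is the built-in Storage Blob Data Contributor role definition):

```python
import uuid

from azure.identity import DefaultAzureCredential
from azure.mgmt.authorization import AuthorizationManagementClient

subscription_id = "<subscription-id>"
client = AuthorizationManagementClient(DefaultAzureCredential(), subscription_id)

# Scope: your customer-managed ADLS Gen2 storage account (placeholder names).
scope = (
    f"/subscriptions/{subscription_id}/resourceGroups/rg-analytics"
    "/providers/Microsoft.Storage/storageAccounts/mydatalake"
)

# Built-in role definition for Storage Blob Data Contributor.
role_definition_id = (
    f"/subscriptions/{subscription_id}/providers/Microsoft.Authorization"
    "/roleDefinitions/ba92f5b4-2d11-453d-a403-e96b0029c9fe"
)

client.role_assignments.create(
    scope=scope,
    role_assignment_name=str(uuid.uuid4()),
    parameters={
        "role_definition_id": role_definition_id,
        # Principal: the Access Connector's system-assigned managed identity.
        "principal_id": "<access-connector-principal-id>",
        "principal_type": "ServicePrincipal",
    },
)
```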
4. Use Managed Identity in Databricks
- The Access Connector's managed identity is used by Databricks clusters and jobs to authenticate to Azure services without secrets.
- Benefits:
  - No need to store SAS tokens, keys, or service principal secrets in Databricks.
  - Authentication happens via Azure AD automatically.
- When you mount ADLS or connect to other services (like Key Vault, Synapse, or Event Hubs), Databricks uses this managed identity.

Example of accessing ADLS with managed identity (replace <storage_account> with your account name):

# Configure the ABFS driver to obtain OAuth tokens from the managed identity.
spark.conf.set("fs.azure.account.auth.type.<storage_account>.dfs.core.windows.net", "OAuth")
spark.conf.set(
    "fs.azure.account.oauth.provider.type.<storage_account>.dfs.core.windows.net",
    "org.apache.hadoop.fs.azurebfs.oauth2.MsiTokenProvider",
)
df = spark.read.format("parquet").load("abfss://container@<storage_account>.dfs.core.windows.net/data/")
5. Typical Flow in Production
- Create a Databricks workspace → the managed storage account is provisioned automatically.
- Create an Access Connector → acts as Databricks' identity in Azure.
- Assign roles on your ADLS storage account to the Access Connector's identity.
- Enable the Access Connector in the Databricks workspace.
- Access ADLS data from notebooks or jobs using managed identity authentication.
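A short end-to-end sanity check once the pieces above are in place; all account, path, column, and table names here are placeholders:

```python
# Read raw data from your own ADLS Gen2 account (authenticated via the
# managed identity configured above) and land it as a table.
raw = (
    spark.read.format("parquet")
    .load("abfss://raw@mydatalake.dfs.core.windows.net/events/")
)

(
    raw.filter("event_date >= '2024-01-01'")  # hypothetical column
    .write.mode("overwrite")
    .saveAsTable("demo_catalog.demo_schema.events_clean")
)
```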