Azure Integration with Databricks
Managed Resource Group
We can see:
- Identity
- Managed Storage Account: DBFS is stored here.
- Access Connector: used to connect to the storage account from Databricks.

Blob containers sit inside the managed storage account.
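A hedged sketch of inspecting that managed resource group with the Azure Python SDK; the subscription ID and resource group name below are placeholders (the real managed group name is shown on the workspace overview page):

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

# "databricks-rg-dbw-demo" is a hypothetical managed resource group name.
client = ResourceManagementClient(DefaultAzureCredential(), "<subscription-id>")
for res in client.resources.list_by_resource_group("databricks-rg-dbw-demo"):
    # Expect a managed identity, a storage account (DBFS root), and an
    # access connector, matching the list above.
    print(res.type, "->", res.name)
```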
Creating Compute in Databricks
When we create compute in Databricks, we can see VMs created in the managed resource group: one VM for each driver and worker node. When the compute is terminated, the VMs are deleted.
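As a rough illustration with the Databricks Python SDK (the cluster name, runtime label, and node type are placeholders, not values from these notes), a cluster with one driver and two workers should surface three VMs, and auto-termination removes them:

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()  # picks up host/token from env vars or .databrickscfg

cluster = w.clusters.create(
    cluster_name="demo-cluster",
    spark_version="15.4.x-scala2.12",  # hypothetical runtime label
    node_type_id="Standard_DS3_v2",    # Azure VM size used per node
    num_workers=2,                     # 1 driver + 2 workers -> 3 VMs
    autotermination_minutes=30,        # on termination the VMs are deleted
).result()                             # waits until the cluster is running

print(cluster.cluster_id, cluster.state)
```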
Unity Catalog
- Metadata is stored in the control plane and data is stored in the data plane. A single metastore per region is advised.
- Catalogs are securable objects, and so are the schemas, tables, and views beneath them in the Unity Catalog hierarchy.
- Location of managed tables: their data is written under the managed storage location configured for the metastore (or overridden at the catalog/schema level).
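A minimal sketch (catalog, schema, and table names are hypothetical) of checking where a managed table's data actually lands:

```python
# Run in a Databricks notebook attached to a Unity Catalog-enabled workspace.
spark.sql("CREATE CATALOG IF NOT EXISTS demo_catalog")
spark.sql("CREATE SCHEMA IF NOT EXISTS demo_catalog.demo_schema")
spark.sql(
    "CREATE TABLE IF NOT EXISTS demo_catalog.demo_schema.events (id INT, name STRING)"
)

# DESCRIBE EXTENDED includes a 'Location' row; for a managed table this is a
# path inside the metastore's (or catalog/schema-level) managed storage.
display(spark.sql("DESCRIBE EXTENDED demo_catalog.demo_schema.events"))
```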
Detailed Steps to Set Up
1. Create an Azure Databricks Workspace
- Go to the Azure Portal.
- Create a new resource → search for Azure Databricks → click Create.
- Fill in:
  - Workspace name
  - Region
  - Pricing tier (Standard, Premium, or Trial)
- Under Networking, choose whether to deploy into a VNet (VNet injection) or let Databricks create a managed VNet.
- Deploy the workspace.
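The same step can be scripted; a hedged sketch with the azure-mgmt-databricks SDK, where every name, region, and ID is a placeholder:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.databricks import AzureDatabricksManagementClient

client = AzureDatabricksManagementClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
)

poller = client.workspaces.begin_create_or_update(
    resource_group_name="rg-analytics",
    workspace_name="dbw-demo",
    parameters={
        "location": "eastus",
        "sku": {"name": "premium"},
        # Azure provisions this locked resource group for the workspace's
        # managed storage account, identity, etc.
        "managed_resource_group_id": (
            "/subscriptions/<subscription-id>/resourceGroups/rg-dbw-demo-managed"
        ),
    },
)
workspace = poller.result()
print(workspace.workspace_url)
```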
2. Understand the Storage Setup
When you create a Databricks workspace, Azure automatically provisions a Managed Storage Account:
- A hidden, Microsoft-managed storage account that Databricks uses for workspace metadata (notebooks, cluster logs, ML models, job configs).
- You don't manage or see this storage directly.
- This is different from your customer-managed storage account, where you store your actual data (e.g., in ADLS Gen2 or Blob Storage).

So you'll normally integrate Databricks with your own Azure Data Lake Storage (ADLS Gen2) for raw and processed data.
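As a quick illustration of the two storage planes (the container and account names are placeholders): the DBFS root lives in the workspace-managed account, while your own lake is addressed explicitly with abfss:// URIs:

```python
# DBFS root -> workspace-managed storage account (provisioned automatically).
display(dbutils.fs.ls("dbfs:/"))

# Your own ADLS Gen2 account -> addressed via abfss://.
# "raw" and "mydatalake" are hypothetical names.
display(dbutils.fs.ls("abfss://raw@mydatalake.dfs.core.windows.net/"))
```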
3. Grant Access Using an Access Connector
To allow Databricks to access your storage securely:
- Create a resource called Azure Databricks Access Connector.
  - This acts like a bridge between Databricks and Azure services.
  - It's assigned a system-assigned Managed Identity.
- Assign RBAC roles to this Access Connector on your storage account. Examples:
  - Storage Blob Data Contributor → read/write data in ADLS Gen2.
  - Storage Blob Data Reader → read-only access.
- Go to your Databricks workspace → Advanced Settings → Access Connector → attach the Access Connector you created.
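The role assignment can also be scripted; a hedged sketch with the azure-mgmt-authorization SDK, where the subscription, resource group, account, and principal IDs are placeholders (the GUID is the built-in Storage Blob Data Contributor role definition):

```python
import uuid

from azure.identity import DefaultAzureCredential
from azure.mgmt.authorization import AuthorizationManagementClient

subscription_id = "<subscription-id>"
client = AuthorizationManagementClient(DefaultAzureCredential(), subscription_id)

# Scope: your customer-managed ADLS Gen2 storage account (placeholder names).
scope = (
    f"/subscriptions/{subscription_id}/resourceGroups/rg-analytics"
    "/providers/Microsoft.Storage/storageAccounts/mydatalake"
)

# Built-in role definition for Storage Blob Data Contributor.
role_definition_id = (
    f"/subscriptions/{subscription_id}/providers/Microsoft.Authorization"
    "/roleDefinitions/ba92f5b4-2d11-453d-a403-e96b0029c9fe"
)

client.role_assignments.create(
    scope=scope,
    role_assignment_name=str(uuid.uuid4()),
    parameters={
        "role_definition_id": role_definition_id,
        # Principal: the Access Connector's system-assigned managed identity.
        "principal_id": "<access-connector-principal-id>",
        "principal_type": "ServicePrincipal",
    },
)
```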
4. Use Managed Identity in Databricks
- The Access Connector's managed identity is used by Databricks clusters and jobs to authenticate to Azure services without secrets.
- Benefits:
  - No need to store SAS tokens, keys, or service principal secrets in Databricks.
  - Authentication happens via Azure AD automatically.
- When you mount ADLS or connect to other services (like Key Vault, Synapse, or Event Hubs), Databricks uses this managed identity.

Example of accessing ADLS with managed identity (replace <storage_account> with your account name):

# Configure the ABFS driver to obtain OAuth tokens from the managed identity.
spark.conf.set("fs.azure.account.auth.type.<storage_account>.dfs.core.windows.net", "OAuth")
spark.conf.set(
    "fs.azure.account.oauth.provider.type.<storage_account>.dfs.core.windows.net",
    "org.apache.hadoop.fs.azurebfs.oauth2.MsiTokenProvider",
)
df = spark.read.format("parquet").load("abfss://container@<storage_account>.dfs.core.windows.net/data/")
5. Typical Flow in Production
- Create a Databricks workspace → the managed storage account is provisioned automatically.
- Create an Access Connector → acts as Databricks' identity in Azure.
- Assign roles on your ADLS storage account to the Access Connector's identity.
- Enable the Access Connector in the Databricks workspace.
- Access ADLS data from notebooks or jobs using managed identity authentication.
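A short end-to-end sanity check once the pieces above are in place; all account, path, column, and table names here are placeholders:

```python
# Read raw data from your own ADLS Gen2 account (authenticated via the
# managed identity configured above) and land it as a table.
raw = (
    spark.read.format("parquet")
    .load("abfss://raw@mydatalake.dfs.core.windows.net/events/")
)

(
    raw.filter("event_date >= '2024-01-01'")  # hypothetical column
    .write.mode("overwrite")
    .saveAsTable("demo_catalog.demo_schema.events_clean")
)
```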