

Azure Integration with Databricks

Managed Resource Group

We can see:

  • Identity
  • Managed Storage Account: the DBFS root is stored here.
  • Access Connector: used by Databricks to connect to the storage account.

Blob Containers inside the storage account

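As a quick sanity check, you can enumerate what was provisioned in this managed resource group. A minimal sketch using the azure-identity and azure-mgmt-resource packages; the subscription ID and resource group name are placeholders:

from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

# Subscription ID and managed resource group name are placeholders - use your own
client = ResourceManagementClient(DefaultAzureCredential(), "<subscription_id>")

# Lists the managed identity, storage account, and access connector described above
for res in client.resources.list_by_resource_group("databricks-rg-<workspace_name>"):
    print(res.type, res.name)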

Creating Compute in Databricks

When we create compute in Databricks, we can see VMs created in the managed resource group: one VM for the driver node and one for each worker node.

When the compute is terminated, the VMs are deleted.
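
To see this behaviour end to end, you can create a small cluster programmatically and watch the VMs appear in the managed resource group. A minimal sketch using the databricks-sdk package; the runtime version and VM size are assumptions:

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()  # picks up host/token from env vars or ~/.databrickscfg

# One driver VM plus two worker VMs appear in the managed resource group
cluster = w.clusters.create(
    cluster_name="demo-cluster",
    spark_version="14.3.x-scala2.12",  # assumed runtime; list options via w.clusters.spark_versions()
    node_type_id="Standard_DS3_v2",    # assumed Azure VM size
    num_workers=2,
    autotermination_minutes=30,        # the VMs are deleted once the cluster terminates
).result()
print(cluster.cluster_id)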

Unity Catalog

  • Metadata is stored in the control plane and data is stored in the data plane. It is advised to have a single metastore per region.

  • The catalog is the top-level securable object in the hierarchy; the objects beneath it (schemas, tables, views, volumes) are also securable. An example is shown below.
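
For example, the securable hierarchy can be exercised from a notebook; the catalog, schema, table, and group names below are hypothetical:

# Unity Catalog three-level namespace: catalog.schema.table
spark.sql("CREATE CATALOG IF NOT EXISTS demo_catalog")
spark.sql("CREATE SCHEMA IF NOT EXISTS demo_catalog.sales")
spark.sql("CREATE TABLE IF NOT EXISTS demo_catalog.sales.orders (id INT, amount DOUBLE)")

# Privileges are granted per securable object at each level of the hierarchy
spark.sql("GRANT USE CATALOG ON CATALOG demo_catalog TO `data-engineers`")
spark.sql("GRANT SELECT ON TABLE demo_catalog.sales.orders TO `data-engineers`")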

Location in Managed Tables

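You can confirm where a managed table's files actually live with DESCRIBE DETAIL; the table here is the hypothetical one from the snippet above:

# The 'location' column points into the metastore's managed storage location
spark.sql("DESCRIBE DETAIL demo_catalog.sales.orders") \
     .select("location").show(truncate=False)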

Detailed Setup Steps


1. Create an Azure Databricks Workspace

  • Go to the Azure Portal.
  • Create a new resource → search for Azure Databricks → click Create.
  • Fill in:
      • Workspace name
      • Region
      • Pricing tier (Standard, Premium, or Trial)
  • Under Networking, choose whether to deploy into your own VNet (VNet injection) or let Databricks create a managed VNet.
  • Deploy the workspace.
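
The same deployment can also be scripted. A minimal sketch using the azure-mgmt-databricks package, assuming an existing resource group; all names and the region are placeholders:

from azure.identity import DefaultAzureCredential
from azure.mgmt.databricks import AzureDatabricksManagementClient

client = AzureDatabricksManagementClient(DefaultAzureCredential(), "<subscription_id>")

# begin_create_or_update is a long-running operation; .result() blocks until it finishes
workspace = client.workspaces.begin_create_or_update(
    resource_group_name="my-rg",          # assumed existing resource group
    workspace_name="my-databricks-ws",
    parameters={
        "location": "eastus",
        "sku": {"name": "premium"},
        # Azure creates this locked resource group for the workspace's VMs, storage, etc.
        "managed_resource_group_id": (
            "/subscriptions/<subscription_id>/resourceGroups/databricks-rg-my-databricks-ws"
        ),
    },
).result()
print(workspace.workspace_url)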

2. Understand the Storage Setup

When you create a Databricks workspace, Azure automatically provisions:

  • Managed Storage Account:
      • A hidden, Microsoft-managed storage account that Databricks uses for workspace metadata (notebooks, cluster logs, ML models, job configs).
      • You don't manage or see this storage directly.
      • This is different from your customer-managed storage account, where you store your actual data (e.g., in ADLS Gen2 or Blob).

So you'll normally integrate Databricks with your own Azure Data Lake Storage (ADLS Gen2) for raw and processed data.
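
The split is easy to see from a notebook: dbfs:/ paths resolve to the hidden managed storage account, while your own data lake is addressed with abfss:// URIs (the container and account names below are placeholders):

# DBFS root - backed by the Microsoft-managed storage account in the managed resource group
display(dbutils.fs.ls("dbfs:/"))

# Your own ADLS Gen2 account - where the actual raw/processed data should live
display(dbutils.fs.ls("abfss://raw@<storage_account>.dfs.core.windows.net/"))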


3. Grant Access Using an Access Connector

To allow Databricks to access your storage securely:

  1. Create a resource called Azure Databricks Access Connector.
  2. This acts as a bridge between Databricks and Azure services.
  3. It's assigned a system-assigned managed identity.
  4. Assign RBAC roles to this Access Connector on your storage account. For example:
       • Storage Blob Data Contributor → read/write data in ADLS Gen2.
       • Storage Blob Data Reader → read-only.
  5. Go to your Databricks workspace → Advanced Settings → Access Connector → attach the Access Connector you created. A scripted version of step 4 follows below.
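
Step 4 can be scripted as well. A hedged sketch using the azure-mgmt-authorization package to grant Storage Blob Data Contributor (well-known role definition ID ba92f5b4-2d11-453d-a403-e96b0029c9fe) to the connector's managed identity; the subscription, resource group, storage account, and principal IDs are placeholders:

import uuid

from azure.identity import DefaultAzureCredential
from azure.mgmt.authorization import AuthorizationManagementClient

sub = "<subscription_id>"
client = AuthorizationManagementClient(DefaultAzureCredential(), sub)

# Scope the role assignment to the storage account
scope = (f"/subscriptions/{sub}/resourceGroups/my-rg"
         "/providers/Microsoft.Storage/storageAccounts/<storage_account>")

# Built-in 'Storage Blob Data Contributor' role definition
role_definition_id = (f"/subscriptions/{sub}/providers/Microsoft.Authorization"
                      "/roleDefinitions/ba92f5b4-2d11-453d-a403-e96b0029c9fe")

client.role_assignments.create(
    scope=scope,
    role_assignment_name=str(uuid.uuid4()),  # each assignment needs a fresh GUID
    parameters={
        "role_definition_id": role_definition_id,
        "principal_id": "<access_connector_principal_id>",  # the connector's managed identity
        "principal_type": "ServicePrincipal",
    },
)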

4. Use Managed Identity in Databricks

  • The Access Connector's managed identity is used by Databricks clusters and jobs to authenticate to Azure services without secrets.
  • Benefits:
      • No need to store SAS tokens, keys, or service principal secrets in Databricks.
      • Authentication happens via Azure AD automatically.
  • When you mount ADLS or connect to other services (like Key Vault, Synapse, Event Hubs), Databricks uses this managed identity.

Example of reading from ADLS with managed identity authentication:

spark.conf.set("fs.azure.account.auth.type.<storage_account>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<storage_account>.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ManagedIdentityTokenProvider")

df = spark.read.format("parquet").load("abfss://container@storage_account.dfs.core.windows.net/data/")
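
Note that with Unity Catalog the usual pattern is to register the Access Connector as a storage credential and expose paths through an external location, so no per-cluster Spark confs are needed. A sketch with hypothetical names, assuming a storage credential adls_cred already wraps the connector:

# External location bound to the storage credential that wraps the Access Connector
spark.sql("""
  CREATE EXTERNAL LOCATION IF NOT EXISTS raw_data
  URL 'abfss://<container>@<storage_account>.dfs.core.windows.net/'
  WITH (STORAGE CREDENTIAL adls_cred)
""")

# After that, reads work without any spark.conf.set calls
df = spark.read.parquet("abfss://<container>@<storage_account>.dfs.core.windows.net/data/")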

5. Typical Flow in Production

  1. Create the Databricks workspace → the managed storage account is provisioned automatically.
  2. Create an Access Connector → it acts as Databricks' identity in Azure.
  3. Assign roles on your ADLS storage account to the Access Connector's identity.
  4. Enable the Access Connector in the Databricks workspace.
  5. Access ADLS data from notebooks or jobs using managed identity authentication.