πŸ”Ή What is Blob Storage?

Blob = Binary Large Object β†’ any file (text, image, video, Parquet, JSON, etc.). Azure Blob Storage is Microsoft’s object storage solution for unstructured data.

It’s cheap, scalable, durable β†’ you can store petabytes of data and pay only for what you use.


πŸ”Ή Types of Blobs

Azure Blob Storage supports 3 types:

  1. Block Blob (most common)

    β€’ Optimized for streaming and storing files.
    β€’ Stores data as blocks β†’ you can upload in chunks.
    β€’ Used for: documents, CSV, Parquet, images, logs.

  2. Append Blob

    β€’ Optimized for append operations.
    β€’ Great for logs β†’ you can only add to the end, not modify existing content.

  3. Page Blob

    β€’ Optimized for random read/write.
    β€’ Used for VM disks (VHD files).

πŸ‘‰ For Data Engineering / Delta Lake β†’ you’ll almost always use Block Blobs.
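
A minimal sketch of the two blob types you’ll touch most often, using the azure-storage-blob Python SDK; the connection string, container name, and file paths below are placeholders:

```python
# Sketch only: connection string, container and blob names are placeholders.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection-string>")

# Block blob (the default type): upload_blob() splits large payloads into blocks for you.
block_client = service.get_blob_client(container="bronze", blob="2025/08/data.csv")
with open("data.csv", "rb") as f:
    block_client.upload_blob(f, overwrite=True)

# Append blob: create it once, then keep adding to the end (typical for logs).
log_client = service.get_blob_client(container="bronze", blob="logs/app.log")
if not log_client.exists():
    log_client.create_append_blob()
log_client.append_block(b"2025-08-27T12:00:00Z INFO pipeline started\n")
```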


πŸ”Ή Storage Account + Containers

  • A Storage Account = the root of your blob storage.
  • Inside it, you create containers β†’ logical groups of blobs.
  • Inside containers, you can have folders (if Hierarchical Namespace is enabled = ADLS Gen2).

Example path:

abfss://bronze@mydatalake.dfs.core.windows.net/2025/08/data.csv

Breakdown:

  • abfss β†’ protocol for ADLS Gen2 secure access.
  • bronze β†’ container.
  • mydatalake β†’ storage account.
  • dfs.core.windows.net β†’ ADLS Gen2 endpoint.
  • 2025/08/data.csv β†’ folder path + file.

πŸ”Ή Access Tiers (Cost Optimization)

Blob storage offers 3 main tiers:

  1. Hot – frequently accessed, higher cost per GB, lower access cost.
  2. Cool – infrequently accessed, cheaper storage, higher access charges.
  3. Archive – very cheap storage, but must be β€œrehydrated” before use (hours).

Example:

  • Store last 30 days of logs in Hot.
  • Move logs > 30 days old to Cool.
  • Move logs > 1 year old to Archive.

πŸ”Ή Security & Access

  1. Authentication options:

    β€’ Azure AD (recommended) β†’ RBAC roles, Managed Identity.
    β€’ Shared Key (account key) β†’ full access, risky.
    β€’ SAS Tokens β†’ temporary, limited access (e.g., read-only link valid for 1 hour).

  2. Authorization β†’ RBAC roles:

    β€’ Storage Blob Data Reader β†’ read only.
    β€’ Storage Blob Data Contributor β†’ read/write.
    β€’ Storage Blob Data Owner β†’ full control.

  3. Networking:

    β€’ Private endpoints (VNet integration).
    β€’ Firewalls + IP restrictions.
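
For example, a read-only SAS valid for 1 hour can be generated like this (a minimal sketch; the account key and names are placeholders, and Azure AD with Managed Identity remains the preferred option for service-to-service access):

```python
# Sketch only: short-lived, read-only SAS for a single blob.
from datetime import datetime, timedelta, timezone
from azure.storage.blob import generate_blob_sas, BlobSasPermissions

sas = generate_blob_sas(
    account_name="mydatalake",
    container_name="bronze",
    blob_name="2025/08/data.csv",
    account_key="<account-key>",
    permission=BlobSasPermissions(read=True),
    expiry=datetime.now(timezone.utc) + timedelta(hours=1),
)
url = f"https://mydatalake.blob.core.windows.net/bronze/2025/08/data.csv?{sas}"
```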

πŸ”Ή Features for Data Engineering

  β€’ Hierarchical Namespace (HNS) β†’ required for Data Lake Gen2.

    β€’ Allows directories + POSIX-like permissions.
    β€’ Needed for Delta Lake + Databricks UC.

  β€’ Soft delete / versioning β†’ recover accidentally deleted blobs.
  β€’ Lifecycle rules β†’ auto-move data across tiers.
  β€’ Event Grid integration β†’ trigger pipelines when new data arrives.
  β€’ Immutable blobs (WORM) β†’ compliance, can’t be modified/deleted.

πŸ”Ή Example Scenario (ETL Pipeline with Blob Storage)

  1. Raw CSV files land in bronze container.
  2. An Azure Function triggered by Event Grid detects the new files.
  3. Data Factory (ADF) or Databricks picks up files β†’ transforms β†’ saves as Delta in silver.
  4. Aggregated tables saved in gold.
  5. Access controlled via Unity Catalog external location with Managed Identity.
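
A minimal PySpark sketch of step 3 in Databricks (the container layout, the column "id", and the silver output path are hypothetical; ADF would express the same step as copy / data flow activities):

```python
# Sketch only: read raw CSVs from bronze, clean them, write Delta to silver.
base = "abfss://{}@mydatalake.dfs.core.windows.net"

raw = (spark.read
       .option("header", "true")
       .csv(base.format("bronze") + "/2025/08/"))

cleaned = raw.dropDuplicates().na.drop(subset=["id"])   # "id" is a hypothetical key column

(cleaned.write
 .format("delta")
 .mode("append")
 .save(base.format("silver") + "/events"))
```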

πŸ”Ή Quick Analogy

  • Block Blob = Lego blocks (you can build files in chunks).
  • Append Blob = notebook (you can only keep adding pages).
  • Page Blob = hard disk (you can jump to any page and edit).