
Microsoft Fabric

Section 1 : Introduction and Installation

1. Introduction

image

image

2. Prerequisites

image

3. Project Architecture

image

image

image

4. Installation

  • Ensure the hierarchical namespace is enabled when creating the Azure Data Lake Storage Gen2 resource.
  • We don't get charged for Synapse Analytics until we create compute.

Create Spark Pool

Spark Pool Settings

image

Section 2 : Understanding Microsoft Fabric

5. Evolution of Architecture

image

The metadata caching layer is what brings ACID properties.

6. Delta Lake Structure

image

image

image

What happens under the hood?

image
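As a minimal sketch of what happens under the hood (the table name is hypothetical), writing a DataFrame in Delta format produces ordinary Parquet data files plus a _delta_log folder of JSON commit files, which is what provides the ACID guarantees:

# Minimal sketch: write a tiny DataFrame as a Delta table (hypothetical table name)
df = spark.createDataFrame([(1, "CA"), (2, "NJ")], ["id", "state"])
df.write.format("delta").mode("overwrite").saveAsTable("states_demo")

# Under the hood: Parquet data files plus a _delta_log folder of JSON commit files
# (relative "Tables/..." paths are assumed to resolve against the notebook's default lakehouse)
for f in mssparkutils.fs.ls("Tables/states_demo"):
    print(f.name)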

7. Why Fabric?

image

A lot of services need to be created individually.

image

image

8. Starting Up Fabric

  • Log in to Fabric with a Microsoft Entra ID account for the free trial.
  • Go to Settings -> Admin Portal -> enable Fabric

9. Licensing and Costs

image

image

image

Creating Azure Capacity

Microsoft Official Link

If you are not able to select a subscription, follow these steps:

  • If you can open the subscription but not perform actions:

  • Go to Azure Portal > Subscriptions

  • Click on the subscription

  • Go to Access Control (IAM) > Role Assignments

  • Filter by Role = Owner

  • You’ll see a list of users, groups, or service principals who are assigned the “Owner” role.

10. Fabric Terms

image

Example

image

11. OneLake in Fabric

Data is stored in one data lake (OneLake), organized by workspace names.

There is only one storage point.

image

We will have only one copy of the data and nothing is duplicated.

The files are stored in Parquet format, with metadata powered by Delta Lake.
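As a hedged sketch (the workspace and lakehouse names below are placeholders), any Spark session can read that single copy through the OneLake abfss path convention that also appears later in these notes:

# Hypothetical workspace / lakehouse names; OneLake exposes everything through one ADLS-style endpoint
onelake_path = "abfss://MyWorkspace@onelake.dfs.fabric.microsoft.com/MyLakehouse.Lakehouse/Tables/Unemployment"
df = spark.read.format("delta").load(onelake_path)   # the same single copy, nothing duplicated
display(df)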

12. One Copy for all computes

All engines store data in OneLake.

image

All data is stored in Delta Parquet format.

Section 3 : Fabric Lakehouse

13. Microsoft Fabric Workspaces

image

14. Workspace Roles

image

15. Creating a Lakehouse

When we create a Lakehouse there are three things:

image

Lakehouse - the data platform to store the data.

Semantic Model - a dataset to present to Power BI.

SQL Endpoint - an endpoint where we can run queries.

16. Lakehouse Explorer

image

Data Uploaded to table

image

Table created from the file

image

Files are stored in Parquet format with a delta log

image

Here is the delta log info:

image

On clicking Properties we can see whether it is managed or not.

image

17. SQL Analytics Endpoint

We can only read data from this endpoint, not write / update.

image

We can create views

image

18. Default Semantic Model View

image

image

In the context of the semantic model in Microsoft Fabric lakehouse, the semantic model itself doesn't directly store raw data. Instead, it provides a logical, structured view of the data stored in the underlying data lake or warehouse.

The semantic model acts as an abstraction layer that organizes and simplifies access to the data, making it easier for reporting and analysis tools to query the data efficiently. The raw data is stored in the data lake or data warehouse, and the semantic model helps to structure and shape this data into meaningful formats suitable for analysis and reporting.

Section 4 : Data Factory in Fabric

19. How to Load data in Lakehouse

image

20. Fabric vs Azure Data Factory

image

21. Data Gateway Types

  • A gateway connects two networks

image

Imagine you have a big box of toys at home, and you want to show them to your friends who live far away. You have two ways to show them: one way is through a special window, and the other way is to use a delivery truck.

On-Premise Data Gateway (like a special window):

This is like a window that you open to let your friends see your toys without taking them out of the box. It connects your toys (data) at home to an online game or app that your friends are using. You can think of this as a way to share data that's stored in your house without letting your friends take it out or change it. It keeps your toys safe inside but lets you show them off.

VNet Data Gateway (like a delivery truck):

This method is like using a delivery truck to send some of your toys to your friends' houses. The VNet (Virtual Network) is a big, secure road that connects your house and your friends' houses. When you use this truck, you're moving data across this secure road, allowing your friends to actually play with the toys (data) at their place, while still keeping everything safe and controlled. So, in simple terms, the on-premises data gateway lets you show your toys to friends securely while they are still at home, and the VNet data gateway lets you share some toys by sending them out safely to your friends' houses.

image

image

image

22. Connections

image

Click Gear Icon -> Manage Connections and Gateways

image

  • Gateway : Equivalent to Integration Runtime in ADF
  • Connection : Similar to Linked Service in ADF

23. Creating Pipeline

Step 1 : Lookup activity to query the connected SQL database

image

Step 2 : ForEach activity to loop over both tables

image

Step 3 : For each iteration, run a Copy Data activity

image

Destination : Our OneLake data lakehouse

image

24. Dataflow Gen2

image

Adding Data Source to Dataflow Gen 2

image

Alice here has the Storage Blob Data Contributor role, which can be granted on the container screen.

image

Click Combine

image

Click Add Column -> Custom Column

image

if [State] = "CA" then "California" else if 
[State] = "NJ" then "New Jersey" else if
[State] = "FL" then "Florida"
else [State]

Click Add Destination -> Lakehouse

image

Next, go to Home -> Save and Run; the refresh should start automatically.

We should be able to see the data once the refresh is completed.

image

Section 5 : Fabric One Lake

25. Shortcuts in Fabric

Shortcuts can be created only at OneLake level.

Let's say the finance team wants data from the marketing lakehouse.

They can create a shortcut to the marketing lakehouse without copying the data.

The data is refreshed/updated automatically.

image

No need to copy data while loading from Amazon S3.

26. How to create a shortcut?

image

image

image

27. Creating Files Shortcut

image

image

Deleting file at Azure Data Lake Storage

image

The data gets deleted here in Fabric also.

Deleting data in Fabric

image

The data gets deleted in Azure Blob Storage also.

image

28. Creating Table Shortcut

image

We can see that this table is in the unmanaged section.

image

image

In Microsoft Fabric, unidentified tables are entries displayed in the managed section of your data environment that lack associated metadata or table references. Here’s a breakdown of the concept:

Managed vs. Unmanaged: In Fabric, the managed section refers to tables that have both metadata and data managed by the Fabric engine. In contrast, the unmanaged section allows you to upload files in any format, which do not have the same management.

Unidentified Tables: If you create a table that is not in the delta format, it will be saved in the unidentified folder. This often occurs when files, such as CSVs, are included without a defined table structure, leading Fabric to categorize them as unidentified.

Purpose: The main goal of the unidentified prompt is to alert users that these files do not conform to the required structure for the managed section and cannot be used as tables. Essentially, it indicates that there are files present that need to be reviewed and potentially removed.

If we want files from a subfolder, we can't create a table shortcut.

When we create a shortcut in the Files section, it can point to subdirectories as well.

Now I dropped a Parquet file in ADLS and there is no unmanaged error.

image

29. Creating Delta from Parquet

image

image

  1. Go to the Synapse workspace.

  2. Create a new notebook.

# Read the Parquet file from ADLS Gen2
df = spark.read.format("parquet").load('abfss://containername@storageaccountname.dfs.core.windows.net/UnEmployment.parquet')
# Rewrite it in Delta format so Fabric can create a table shortcut on top of it
df.write.format('delta').save('abfss://shortcutdelta@msfabriclakehousevedanth.dfs.core.windows.net/')
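To sanity-check the conversion before creating the shortcut (a sketch reusing the same container path as above), the Delta output can be read back:

# Read the newly written Delta folder back to confirm the conversion worked
df_check = spark.read.format("delta").load('abfss://shortcutdelta@msfabriclakehousevedanth.dfs.core.windows.net/')
display(df_check.limit(10))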

30. Creating Shortcut in Fabric

Just execute the above code and create a table-level shortcut.

image

image

31. Shortcut from a subfolder in Fabric

We cannot create a delta table / shortcut from a subfolder in ADLS Gen2.

image

32. Creating Shortcut from Parquet file

image

# Writing plain Parquet (no Delta log) to the shortcut container
df.write.format('parquet').mode('append').save('abfss://shortcutparquet@msfabriclakehousevedanth.dfs.core.windows.net/')

We cannot create a table shortcut from plain Parquet files either; it lands in the Unidentified folder.

image

33. Summary of Shortcuts

image

image

34. Update Scenarios Using Shortcuts : Lakehouse to Datalake

image

What happens to the table and the file when either is updated?

image

We cannot update using the SQL editor in Fabric, so let's use a notebook.

Code:

df = spark.sql("SELECT * FROM demo_lakehouse.Unemployment LIMIT 1000")
display(df)

We can see that the session is created in 11s, much faster than Synapse.

image

# Update the table in place; Delta rewrites the affected Parquet file and logs the change
df_updated = spark.sql("UPDATE demo_lakehouse.Unemployment SET Industry = 'Healthcare Updated' WHERE Industry = 'Healthcare'")
display(df_updated)

If we have only reader access then this operation will fail.

This update is also reflected in the underlying file in the delta lake.
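One way to confirm the update landed in the Delta log (a sketch against the same table) is to look at the table history from the notebook:

# Inspect the Delta transaction history; the latest version should show an UPDATE operation
history = spark.sql("DESCRIBE HISTORY demo_lakehouse.Unemployment")
display(history.select("version", "timestamp", "operation", "operationMetrics"))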

35. Storage to Data Lake Shortcut Updates

1st Version

image

Update Code

# Update the shortcut table that points at the Delta folder in the storage account
spark.sql('''update vw_unemployment_new set Industry = 'Retail' where Industry = 'Retail Trade' ''')

The Industry value has changed from Retail Trade to Retail.

image

Update Operation Delta Log

{"commitInfo":{"timestamp":1752330893911,"operation":"UPDATE","operationParameters":{"predicate":"[\"(Industry#761 = Retail Trade)\"]"},"readVersion":0,"isolationLevel":"Serializable","isBlindAppend":false,"operationMetrics":{"numRemovedFiles":"1","numRemovedBytes":"39721","numCopiedRows":"1510","numAddedChangeFiles":"0","executionTimeMs":"10094","scanTimeMs":"9455","numAddedFiles":"1","numUpdatedRows":"14","numAddedBytes":"39715","rewriteTimeMs":"635"},"engineInfo":"Apache-Spark/3.4.3.5.3.20250511.1 Delta-Lake/2.4.0.24","txnId":"6ae56a22-9ff9-4b9a-aec1-d7ab21bb57e8"}}
{"remove":{"path":"part-00000-33b67e12-a6d1-40de-86ac-20a6843bdc2a-c000.snappy.parquet","deletionTimestamp":1752330893906,"dataChange":true,"extendedFileMetadata":true,"partitionValues":{},"size":39721,"tags":{}}}

36. Deleting File Data in Fabric Lake House or ADLS

Deleting any record, or the file itself, from the Fabric lakehouse deletes it from ADLS storage also.

The reverse scenario is the same.

37. Deleting Data from Tables in Lake House

image

This is reflected in the storage file also; we see 1,428 records.

image

The reverse is the same scenario: deleting from storage is reflected in the lakehouse table also.
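A minimal sketch of the table-side delete described above (the table and filter value are reused from earlier and are illustrative):

# Delete rows from the lakehouse table; Delta rewrites the affected Parquet file,
# so the change shows up in the ADLS storage behind the shortcut as well
spark.sql("DELETE FROM demo_lakehouse.Unemployment WHERE Industry = 'Healthcare Updated'")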

38. Deleting Shortcut

Deleting the entire shortcut does not delete data in storage.

Section 7 : Fabric Synapse Data Engineering

39. Spark Pools

image

Starter Pools in Spark

Spark starter pools are pre-provisioned machines that are ready and can be spun up anytime.

image

Billing time doesn't include the idle time to initialize the Spark session.

40. Spark Pool Node Size (Starter Pool)

image

image

If a starter pool is not used for 20 minutes, the session expires.

Only the minimum and maximum number of nodes can be changed in starter pools; the node size is always Medium.

41. Custom Pools

image

We can adjust the node family and the number of nodes, but the caveat is that custom pools will not give the same instant start time as starter pools, since Fabric needs to allocate the resources.

42. Standard vs High Concurrency Sessions

A high concurrency session acts like a shared cluster in Databricks.

image

43. Custom Magic Commands

image

44. MSSparkUtils

image

For using Azure Key Vault: mssparkutils.credentials.help()

image

image

The Key Vault Secrets Officer role allows creating secrets and using them in a notebook.
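A hedged sketch of pulling a secret inside a notebook (the vault URL and secret name are placeholders); mssparkutils.credentials.help() lists the exact helpers available in your runtime:

# List the available credential helpers
mssparkutils.credentials.help()

# Fetch a secret from Azure Key Vault (placeholder vault / secret names);
# the identity running the notebook needs the Key Vault Secrets Officer (or Secrets User) role
secret_value = mssparkutils.credentials.getSecret("https://my-keyvault.vault.azure.net/", "my-secret-name")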

45. Call Fabric Notebook from Fabric Pipeline

Create New cloud connection

image

image

We cannot use Key Vault at the moment.

image

image

46. Managed Vs External Table

image

image

Shortcut tables are managed because we are storing the data in the Tables section and it comes from a root-level folder; the data is being managed by the Fabric engine.

Changes to the data in the table are replicated to the data lake.
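A rough way to see the difference from a notebook (the table name and storage path are hypothetical): saveAsTable registers both data and metadata under the Tables section, managed by the Fabric engine, while writing to an outside storage path leaves the data unmanaged until a table or shortcut is created on top of it:

# Managed: data + metadata live under the lakehouse Tables section, managed by the Fabric engine
df = spark.createDataFrame([(1, "a")], ["id", "val"])
df.write.format("delta").mode("overwrite").saveAsTable("managed_demo")

# External-style: data written to an outside storage location (placeholder ADLS path);
# Fabric does not manage it unless a table or shortcut is created on top of it
df.write.format("delta").mode("overwrite").save("abfss://mycontainer@mystorageaccount.dfs.core.windows.net/external_demo")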

47. Environments in Fabric

image

image

We cannot change the node family since it's a default pool.

image

48. V Order in Fabric

image

spark.conf.get("spark.sql.parquet.vorder.enabled")
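A small sketch of the session-level switch (the table name is a placeholder); V-Order is applied at write time, so changing the setting only affects Parquet files written afterwards:

# Check whether V-Order is currently applied to Parquet writes in this session
print(spark.conf.get("spark.sql.parquet.vorder.enabled"))

# Disable it for this session (e.g. for write-heavy staging tables) and write as usual
spark.conf.set("spark.sql.parquet.vorder.enabled", "false")
df = spark.createDataFrame([(1, "a")], ["id", "val"])
df.write.format("delta").mode("overwrite").saveAsTable("vorder_demo")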

48. Domains in Fabric

We can isolate data to create a data mesh using domains. One for IT, one for HR and so on.

Go to settings -> then click 'Create Domain'

image

We can assign workspaces to each domain.

image

image

image

image

Section 6 : Synapse to Fabric Migration

49. Migrating notebooks from Synapse to Fabric

image

First, create a service principal using App Registration.

Then enter the credentials:

azure_client_id  = ""
azure_tenant_id  = ""
azure_client_secret  = ""
synapse_workspace_name  = "synapseprojectvedanth"

Go to Synapse workspace and add role assignment.

image

Give the Synapse Administrator role to the service principal.

Go to any table in the lakehouse, click Properties, and get the abfss path:

abfss://264d9187-xxxx-yyyy-zzzz-aaaaa@onelake.dfs.fabric.microsoft.com/46d6a3de-xxxx-xxxx-xxxx-xxxxxxx/

Now configure the Fabric details this way:

# Workspace and lakehouse GUIDs taken from the abfss path above
workspace_guid = "264d9187-xxxx-yyyy-zzzz-aaaaa"
lakehouse_id = "46d6a3de-fc48-4a41-xxx-xxxxxx"
export_folder_name = f"export/{synapse_workspace_name}"
prefix = "mig"

output_folder = f"abfss://{workspace_guid}@onelake.dfs.fabric.microsoft.com/{lakehouse_id}/Files/{export_folder_name}"
print(output_folder)

Import the migration utility file:

sc.addPyFile('https://raw.githubusercontent.com/microsoft/fabric-migration/main/data-engineering/utils/util.py')

Export the notebooks from Synapse to Fabric:

Utils.export_notebooks(azure_client_id, azure_tenant_id, azure_client_secret, synapse_workspace_name, output_folder)

Import everything as notebooks in Fabric:

Utils.import_notebooks(f"/lakehouse/default/Files/{export_folder_name}", workspace_guid, prefix)

50. Migrating / Running Pipelines from Synapse in Fabric

Lecture Link

ADF Pipeline Migration Docs

51. Migrating ADLS DataLake Gen2 to Fabric OneLake

image

Using FastCP - Best Way

# Azure storage access info (public NYC taxi open dataset)
blob_account_name = "azureopendatastorage"
blob_container_name = "nyctlc"
blob_relative_path = "green"
blob_sas_token = r""

# Allow Spark to read from the blob container remotely
wasbs_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (blob_container_name, blob_account_name, blob_relative_path)
spark.conf.set(
  'fs.azure.sas.%s.%s.blob.core.windows.net' % (blob_container_name, blob_account_name),
  blob_sas_token)
print('Remote blob path: ' + wasbs_path)

# Spark reads the Parquet lazily; no data is loaded yet at this point
df = spark.read.parquet(wasbs_path)

# Destination paths in OneLake (Files section of the LH_Fabric lakehouse)
NYDestinationCPPath = 'abfss://Fabric_trail@onelake.dfs.fabric.microsoft.com/LH_Fabric.Lakehouse/Files/NYDestinationCP'
NYDestinationFSPath = 'abfss://Fabric_trail@onelake.dfs.fabric.microsoft.com/LH_Fabric.Lakehouse/Files/NYDestinationFS'

# fastcp copies the files directly (recursively) from blob storage into OneLake
mssparkutils.fs.fastcp(wasbs_path, NYDestinationFSPath, True)

Section 7 : Capacity Metrics App

image

51. UI of the App

image

52. Capacity Metrics and Units in Fabric

image

If we are using an F64 capacity, then each 30-second timepoint has a maximum of 64*30 = 1920 CU-seconds of capacity.
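A quick back-of-the-envelope check of that number (plain arithmetic, assuming the Capacity Metrics app's 30-second timepoints):

# Capacity available per 30-second timepoint, in CU-seconds, for a few F SKUs
timepoint_seconds = 30
for sku_cu in (2, 64, 128):   # F2, F64, F128
    print(f"F{sku_cu}: {sku_cu * timepoint_seconds} CU-seconds per timepoint")
# F64 -> 64 * 30 = 1920, matching the figure above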

Out of this, whatever we see in blue (29.99 units) is consumed by background processes like report refreshes.

Whatever we see in red is the interactive unit consumption driven by users.

Throttling

Throttling occurs when the tenant consumes more capacity units than it has purchased.

image

Smoothing spreads usage over time to balance capacity.

image

image

Overage is the capacity consumed beyond 100%.

image

If we overuse for less than 10 minutes, Fabric will not charge us anything.

image

The overage capacity units will be compensated by background processes over the next 24 hours.

image

53. Interactive Delay

If our workload exceeds 100% for more than 10 minutes, then there will be some delay for anyone running notebooks / SQL queries.

image

If we are over-utilizing for more than 10 minutes but less than 60 minutes, there will be a 5-second delay.

53. Interactive Rejection

We get interactive rejection if the over-utilization exceeds 100% for more than 1 hour and less than 24 hours: interactive queries are rejected, but pipelines keep running.

image

54. Background Rejection

We get background / pipeline failures when the over-utilization exceeds 24 hours.

image

image

Imagine you have a toy box, and you can fit 10 toys inside. If you add 12 toys, you have 2 extra toys that don't have a place. This is like an 'overage'—you are over the limit!

Now, let's break down the terms:

Add % (Added Percentage): This is like saying, "How many new toys did you put in the box?" For example, if you added 2 more toys, the added percentage would show that you've put in more than what fits.

Burndown %: Imagine playing with your toys. As you play, you may give some away or remove them. The burndown percentage is like counting how many toys you no longer have because you've given or taken them out of the box.

Cumulative % (Cumulative Percentage): This is like keeping track of all the toys that you didn't have space for but still remember they were there. If your toy box is overflowing for a few days, you add each extra toy to a list, showing how many toys you owe the box over time—it's the total amount of toys that you've added and not yet managed!

So, when you look at how many toys you have in the box and how many more you need to fit, you're really keeping track of your 'add %', 'burndown %', and 'cumulative %' to make sure you know how many you can play with without going over!

Section 8 : Synapse Data Warehouse

image

There is no SQL Analytics Endpoint in Warehouse.

image

55. Limitations of Tables in SQL Warehouse

  1. Both table and column names are case sensitive.

image

  2. No support for primary keys.

  3. No support for unique constraints.

56. Loading Data In Warehouse

image