Microsoft Fabric
Section 1 : Introduction and Installation
1. Introduction
2. Pre Requisites
3. Project Architecture
4. Installation
- Ensure the hierarchical namespace is enabled when creating the Azure Data Lake Storage Gen2 resource.
- We don't get charged for Synapse Analytics until we create compute.
Create Spark Pool
Spark Pool Settings
Section 2 : Understanding Microsoft Fabric
5. Evolution of Architecture
The metadata and caching layer is what brings ACID properties.
6. Delta Lake Structure
What happens under the hood?
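A minimal sketch of this in a Synapse/Fabric notebook (the path below is a placeholder): writing a DataFrame in Delta format produces Parquet data files plus a _delta_log folder of JSON commit files.
# Placeholder ADLS Gen2 path - replace with your own container and storage account
path = "abfss://<container>@<storageaccount>.dfs.core.windows.net/delta-demo"
df = spark.createDataFrame([(1, "CA"), (2, "NJ")], ["id", "state"])
df.write.format("delta").mode("overwrite").save(path)
# Each commit adds a numbered JSON file (00000000000000000000.json, ...) to _delta_log
for f in mssparkutils.fs.ls(path + "/_delta_log"):
    print(f.name)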
7. Why Fabric?
Without Fabric, a lot of services need to be created and managed individually.
8. Starting Up Fabric
- Log in to Fabric with a Microsoft Entra ID account for the free trial.
- Go to Settings -> Admin Portal -> enable Fabric
9. Licensing and Costs
Creating Azure Capacity
If you are not able to select a subscription, follow these steps.
If you can open the subscription but cannot perform actions:
- Go to Azure Portal > Subscriptions
- Click on the subscription
- Go to Access Control (IAM) > Role Assignments
- Filter by Role = Owner
- You'll see a list of users, groups, or service principals who are assigned the "Owner" role.
10. Fabric Terms
Example
11. OneLake in Fabric
Data is stored in one data lake (OneLake), organized by workspace names.
There is only one storage point.
We will have only one copy of data and nothing is duplicated.
The files are stored in Parquet format, with metadata powered by Delta Lake.
12. One Copy for all computes
All engines store data in OneLake.
All data is stored in Delta Parquet format.
Section 3 : Fabric Lakehouse
13. Microsoft Fabric Workspaces
14. Workspace Roles
15. Creating a Lakehouse
When we create a Lakehouse there are three things:
Lakehouse - the data platform to store the data.
Semantic Model - a dataset presented to Power BI.
SQL Endpoint - lets us run SQL queries.
16. Lakehouse Explorer
Data Uploaded to table
Table created from the file
Files are stored in parquet with delta log
Here is delta log info
On clicking Properties we can see whether it's managed or not.
17. SQL Analytics Endpoint
We can only read data from this endpoint, not write or update it.
We can create views
18. Default Semantic Model View
In the context of the semantic model in Microsoft Fabric lakehouse, the semantic model itself doesn't directly store raw data. Instead, it provides a logical, structured view of the data stored in the underlying data lake or warehouse.
The semantic model acts as an abstraction layer that organizes and simplifies access to the data, making it easier for reporting and analysis tools to query the data efficiently. The raw data is stored in the data lake or data warehouse, and the semantic model helps to structure and shape this data into meaningful formats suitable for analysis and reporting.
Section 4 : Data Factory in Fabric
19. How to Load data in Lakehouse
20. Fabric vs Azure Data Factory
21. Data Gateway Types
- Gateway connects two networks
Imagine you have a big box of toys at home, and you want to show them to your friends who live far away. You have two ways to show them: one way is through a special window, and the other way is to use a delivery truck.
On-Premise Data Gateway (like a special window):
This is like a window that you open to let your friends see your toys without taking them out of the box. It connects your toys (data) at home to an online game or app that your friends are using. You can think of this as a way to share data that's stored in your house without letting your friends take it out or change it. It keeps your toys safe inside but still lets you show them off.
VNet Data Gateway (like a delivery truck):
This method is like using a delivery truck to send some of your toys to your friends' houses. The VNet (Virtual Network) is a big, secure road that connects your house and your friends' houses. When you use this truck, you're moving data across this secure road, allowing your friends to actually play with the toys (data) at their place, while still keeping everything safe and controlled.
So, in simple terms, the on-premises data gateway lets you show your toys to friends securely while they are still at home, and the VNet data gateway lets you share some toys by sending them out safely to your friends' houses.
22. Connections
Click Gear Icon -> Manage Connections and Gateways
- Gateway : Equivalent to Integration Runtime in ADF
- Connection : Similar to Linked Service in ADF
23. Creating Pipeline
Step 1 : Lookup Activity to query the SQL-connected database
Step 2 : Foreach activity to go over both tables
Step 3 : For each iteration run copy data activity
Destination : Our Onelake data lakehouse
24. Dataflow Gen2
Adding Data Source to Dataflow Gen 2
Alice here has the Storage Blob Data Contributor role, which can be granted from the container's Access Control (IAM) screen.
Click Combine
Click Add Column -> Custom Column
if [State] = "CA" then "California"
else if [State] = "NJ" then "New Jersey"
else if [State] = "FL" then "Florida"
else [State]
Click Add Destination -> Lakehouse
Next, go to Home -> Save and Run; the refresh should start automatically.
We should be able to see the data once refresh is completed.
Section 5 : Fabric One Lake
25. Shortcuts in Fabric
Shortcuts can be created only at OneLake level.
Let's say finance team wants data from marketing lakehouse.
They can create a shortcut to the marketing lakehouse without copying the data.
The data is refreshed/updated automatically.
No need to copy data while loading from Amazon S3.
26. How to create a shortcut?
27. Creating Files Shortcut
Deleting file at Azure Data Lake Storage
Data gets deleted here in fabric also.
Deleting data in Fabric
Data gets deleted in Azure Blob also.
28. Creating Table Shortcut
We can see that this table is in the unmanaged section.
In Microsoft Fabric, unidentified tables are entries displayed in the managed section of your data environment that lack associated metadata or table references. Here’s a breakdown of the concept:
Managed vs. Unmanaged: In Fabric, the managed section refers to tables that have both metadata and data managed by the Fabric engine. In contrast, the unmanaged section allows you to upload files in any format, which do not have the same management.
Unidentified Tables: If you create a table that is not in the delta format, it will be saved in the unidentified folder. This often occurs when files, such as CSVs, are included without a defined table structure, leading Fabric to categorize them as unidentified.
Purpose: The main goal of the unidentified prompt is to alert users that these files do not conform to the required structure for the managed section and do not support any tables. Essentially, it indicates that there are files present that need to be reviewed and potentially removed.
We cannot create a table shortcut that points to a subfolder.
A shortcut created under Files, however, can point to subdirectories as well.
Now I dropped a Parquet file in ADLS and there is no unmanaged error.
29. Creating Delta from Parquet
- Go to the Synapse workspace
- Create a new notebook
# Read the Parquet file from ADLS Gen2
df = spark.read.format("parquet").load('abfss://containername@storageaccountname.dfs.core.windows.net/UnEmployment.parquet')
# Write it out in Delta format to the container the shortcut will point at
df.write.format('delta').save('abfss://shortcutdelta@msfabriclakehousevedanth.dfs.core.windows.net/')
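To confirm the conversion, the Delta output can be read back from the same path:
# Read the Delta folder back to verify the conversion worked
display(spark.read.format('delta').load('abfss://shortcutdelta@msfabriclakehousevedanth.dfs.core.windows.net/'))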
30. Creating Shortcut in Fabric
Just execute the above code and create a table-level shortcut.
31. Shortcut from a subfolder in Fabric
We cannot create a Delta table / shortcut from a subfolder in ADLS Gen2.
32. Creating Shortcut from Parquet file
# Append the DataFrame as plain Parquet (no Delta log) to the shortcut container
df.write.format('parquet').mode('append').save('abfss://shortcutparquet@msfabriclakehousevedanth.dfs.core.windows.net/')
We cannot create a shortcut from plain Parquet files either; it lands in the Unidentified folder.
33. Summary of Shortcuts
34. Update Scenarios Using Shortcuts : Lakehouse to Datalake
What happens to the table and the file when either one is updated?
We cannot run updates from the SQL editor in Fabric, so let's use a notebook.
Code:
We can see that the session is created in 11 seconds, much faster than in Synapse.
df_updated = spark.sql("UPDATE demo_lakehouse.Unemployment SET Industry = 'Healthcare Updated' WHERE Industry = 'Healthcare'")
display(df_updated)
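A quick optional check against the same table to confirm the update landed:
# Healthcare rows should now show 'Healthcare Updated'
display(spark.sql("SELECT DISTINCT Industry FROM demo_lakehouse.Unemployment"))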
If we have only reader access then this operation will fail.
This update also reflects on the file in delta lake.
35. Storage to Data Lake Shortcut Updates
1st Version
Update Code
spark.sql('''update vw_unemployment_new set Industry = 'Retail' where Industry = 'Retail Trade' ''')
The Industry has changed from Retail Trade to Retail
Update Operation Delta Log
{"commitInfo":{"timestamp":1752330893911,"operation":"UPDATE","operationParameters":{"predicate":"[\"(Industry#761 = Retail Trade)\"]"},"readVersion":0,"isolationLevel":"Serializable","isBlindAppend":false,"operationMetrics":{"numRemovedFiles":"1","numRemovedBytes":"39721","numCopiedRows":"1510","numAddedChangeFiles":"0","executionTimeMs":"10094","scanTimeMs":"9455","numAddedFiles":"1","numUpdatedRows":"14","numAddedBytes":"39715","rewriteTimeMs":"635"},"engineInfo":"Apache-Spark/3.4.3.5.3.20250511.1 Delta-Lake/2.4.0.24","txnId":"6ae56a22-9ff9-4b9a-aec1-d7ab21bb57e8"}}
{"remove":{"path":"part-00000-33b67e12-a6d1-40de-86ac-20a6843bdc2a-c000.snappy.parquet","deletionTimestamp":1752330893906,"dataChange":true,"extendedFileMetadata":true,"partitionValues":{},"size":39721,"tags":{}}}
36. Deleting File Data in Fabric Lake House or ADLS
Deleting a record, or the file itself, from the lakehouse deletes it from the storage account as well.
The reverse scenario behaves the same way.
37. Deleting Data from Tables in Lake House
This is reflected in the storage file as well; we see 1,428 records.
The reverse scenario is the same: deleting from storage is also reflected in the lakehouse table.
38. Deleting Shortcut
Deleting the entire shortcut does not delete data in storage.
Section 6 : Fabric Synapse Data Engineering
39. Spark Pools
Starter Pools in Spark
Spark starter pools are machines that are kept ready and can be spun up at any time.
The billing time doesn't include the idle time spent initializing the Spark session.
40. Spark Pool Node Size (Starter Pool)
If a starter pool session is not used for 20 minutes, the session expires.
Only the minimum and maximum number of nodes can be changed in starter pools; the node size is always Medium.
41. Custom Pools
We can adjust the node family and the number of nodes, but the caveat is that custom pools don't give the same instant start time as starter pools, since Fabric needs to allocate the resources first.
42. Standard vs High Concurrency Sessions
It acts like a shared cluster in Databricks.
43. Custom Magic Commands
44. MSSparkUtils
For using Azure Key Vault: mssparkutils.credentials.help()
The Key Vault Secrets Officer role allows creating secrets and using them in a notebook.
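A minimal sketch of reading a secret from a notebook (the vault URI and secret name are placeholders):
# Requires a suitable Key Vault role (e.g. Key Vault Secrets Officer) on the identity running the notebook
secret_value = mssparkutils.credentials.getSecret(
    "https://<your-key-vault>.vault.azure.net/",  # Key Vault URI (placeholder)
    "<secret-name>"                               # secret name (placeholder)
)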
45. Call Fabric Notebook from Fabric Pipeline
Create New cloud connection
We cannot use Key Vault at the moment.
46. Managed Vs External Table
Shortcut tables are managed because the data is stored in the Tables section and comes from a root-level folder; the data is managed by the Fabric engine.
Changes to the data in the table are replicated to the data lake.
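A minimal sketch (hypothetical table and path names) contrasting the two from a notebook attached to the lakehouse:
df = spark.read.table("demo_lakehouse.Unemployment")
# Managed: saveAsTable writes into the Tables section, so the Fabric engine
# manages both the data and the metadata
df.write.format("delta").mode("overwrite").saveAsTable("unemployment_managed")
# Unmanaged/external: writing under Files leaves Delta files that are not
# registered as a table until a table or shortcut points at them
df.write.format("delta").mode("overwrite").save("Files/external/unemployment")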
47. Environments in Fabric
We cannot change the node family since it's the default pool.
48. V Order in Fabric
49. Domains in Fabric
We can isolate data to create a data mesh using domains. One for IT, one for HR and so on.
Go to settings -> then click 'Create Domain'
We can assign workspaces to each domain.
Section 7 : Synapse to Fabric Migration
50. Migrating notebooks from Synapse to Fabric
First, create a Service Principal using App Registration.
Then enter credentials.
azure_client_id = ""
azure_tenant_id = ""
azure_client_secret = ""
synapse_workspace_name = "synapseprojectvedanth"
Go to Synapse workspace and add role assignment.
Give Synapse Admin role to the SP.
Go to any table in the lakehouse, click Properties, and get the abfss path:
abfss://264d9187-xxxx-yyyy-zzzz-aaaaa@onelake.dfs.fabric.microsoft.com/46d6a3de-xxxx-xxxx-xxxx-xxxxxxx/
Now configure fabric details this way
workspace_id = "Data_Engineering_Workspace"
lakehouse_id = "46d6a3de-fc48-4a41-xxx-xxxxxx"
export_folder_name = f"export/{synapse_workspace_name}"
prefix = "mig"
output_folder = f"abfss://{workspace_id}@onelake.dfs.fabric.microsoft.com/{lakehouse_id}/Files/{export_folder_name}"
print(output_folder)
Import utility file
sc.addPyFile('https://raw.githubusercontent.com/microsoft/fabric-migration/main/data-engineering/utils/util.py')
Export notebooks from synapse to fabric
from util import Utils  # import the helper class from the util.py file added above
Utils.export_notebooks(azure_client_id, azure_tenant_id, azure_client_secret, synapse_workspace_name, output_folder)
Import everything as notebooks in Fabric.
51. Migrating / Running Pipelines from Synapse in Fabric
52. Migrating ADLS DataLake Gen2 to Fabric OneLake
Using FastCP - Best Way
# Azure storage access info
blob_account_name = "azureopendatastorage"
blob_container_name = "nyctlc"
blob_relative_path = "green"
blob_sas_token = r""
# Allow SPARK to read from Blob remotely
wasbs_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (blob_container_name, blob_account_name, blob_relative_path)
spark.conf.set(
'fs.azure.sas.%s.%s.blob.core.windows.net' % (blob_container_name, blob_account_name),
blob_sas_token)
print('Remote blob path: ' + wasbs_path)
# SPARK read parquet, note that it won't load any data yet by now
df = spark.read.parquet(wasbs_path)
NYDestinationCPPath = 'abfss://Fabric_trail@onelake.dfs.fabric.microsoft.com/LH_Fabric.Lakehouse/Files/NYDestinationCP'
NYDestinationFSPath = 'abfss://Fabric_trail@onelake.dfs.fabric.microsoft.com/LH_Fabric.Lakehouse/Files/NYDestinationFS'
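The copy step itself is not shown above; a sketch of how it could look with mssparkutils (treat the exact call as an assumption to verify against the docs):
# Parallel copy of the source folder into OneLake (recurse=True)
mssparkutils.fs.fastcp(wasbs_path, NYDestinationCPPath, True)
# Regular copy to the second destination, for comparison
mssparkutils.fs.cp(wasbs_path, NYDestinationFSPath, True)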
Section 8 : Capacity Metrics App
53. UI of the App
54. Capacity Metrics and Units in Fabric
If we are using the F64 SKU, we have a maximum of 64 × 30 = 1,920 capacity-unit seconds per 30-second window.
Out of this, whatever we see in blue (29.99 units) is used for background processes such as report refreshes.
Whatever we see in red is the interactive consumption generated by user activity.
Throttling
Throttling occurs when the tenant consumes more capacity units than it has purchased.
Smoothing creates capacity balance
Overage is the capacity beyond 100%
If we overuse for less than 10 minutes, Fabric will not charge us anything.
The overage capacity units will be compensated against background processes over the next 24 hours.
55. Interactive Delay
If our workload exceeds 100% for more than 10 minutes, then there will be some delay for users running notebooks or SQL queries.
If we are over-utilizing for more than 10 minutes but less than 60 minutes, there will be a 5-second delay.
56. Interactive Rejection
We get an interactive rejection if the over-utilization exceeds 100% for more than 1 hour but less than 24 hours: interactive queries are rejected, but pipelines keep running.
57. Background Rejection
We get background / pipeline failures when the over-utilization exceeds 24 hours.
Imagine you have a toy box, and you can fit 10 toys inside. If you add 12 toys, you have 2 extra toys that don't have a place. This is like an 'overage'—you are over the limit!
Now, let's break down the terms:
Add % (Added Percentage): This is like saying, "How many new toys did you put in the box?" For example, if you added 2 more toys, the added percentage would show that you've put in more than what fits.
Burndown %: Imagine playing with your toys. As you play, you may give some away or remove them. The burndown percentage is like counting how many toys you no longer have because you've given or taken them out of the box.
Cumulative % (Cumulative Percentage): This is like keeping track of all the toys that you didn't have space for but still remember they were there. If your toy box is overflowing for a few days, you add each extra toy to a list, showing how many toys you owe the box over time—it's the total amount of toys that you've added and not yet managed!
So, when you look at how many toys you have in the box and how many more you need to fit, you're really keeping track of your 'add %', 'burndown %', and 'cumulative %' to make sure you know how many you can play with without going over!
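In plain numbers (made-up figures, not taken from the app), the three percentages relate roughly like this:
# Toy model of overage carryforward per 30-second window on an F64 (1,920 CU-seconds)
capacity = 64 * 30
usage = [2100, 1800, 1500]  # hypothetical smoothed usage per window, in CU-seconds
cumulative = 0.0
for used in usage:
    add = max(used - capacity, 0) / capacity * 100        # Add %: new overage this window
    burndown = max(capacity - used, 0) / capacity * 100   # Burndown %: spare capacity paying it back
    cumulative = max(cumulative + add - burndown, 0)      # Cumulative %: running carryforward
    print(f"add={add:.1f}%  burndown={burndown:.1f}%  cumulative={cumulative:.1f}%")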
Section 9 : Synapse Data Warehouse
There is no SQL Analytics Endpoint in Warehouse.
58. Limitations of Tables in SQL Warehouse
- Both table and column names are case sensitive.
- No support for primary keys.
- No support for unique constraints.