Microsoft Fabric
Section 1 : Introduction and Installation
1. Introduction
2. Pre Requisites
3. Project Architecture
4. Installation
- Ensure the hierarchical namespace is enabled when creating the Azure Data Lake Storage Gen2 resource.
- We don't get charged for Synapse Analytics until we create compute.
Create Spark Pool
Spark Pool Settings
Section 2 : Understanding Microsoft Fabric
5. Evolution of Architecture
The metadata and caching layer is what brings ACID properties.
6. Delta Lake Structure
What happens under the hood?
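A minimal sketch of this in a Synapse/Fabric notebook (the path below is a placeholder): writing a DataFrame in Delta format produces Parquet data files plus a _delta_log folder of JSON commit files.
# Placeholder ADLS Gen2 path - replace with your own container and storage account
path = "abfss://<container>@<storageaccount>.dfs.core.windows.net/delta-demo"
df = spark.createDataFrame([(1, "CA"), (2, "NJ")], ["id", "state"])
df.write.format("delta").mode("overwrite").save(path)
# Each commit adds a numbered JSON file (00000000000000000000.json, ...) to _delta_log
for f in mssparkutils.fs.ls(path + "/_delta_log"):
    print(f.name)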
7. Why Fabric?
Without Fabric, a lot of services need to be created and managed individually.
8. Starting Up Fabric
- Log in to Fabric with a Microsoft Entra ID account for the free trial.
- Go to Settings -> Admin Portal -> enable Fabric
9. Licensing and Costs
Creating Azure Capacity
If you are not able to select a subscription, follow these steps.
If you can open the subscription but cannot perform actions:
- Go to Azure Portal > Subscriptions
- Click on the subscription
- Go to Access Control (IAM) > Role Assignments
- Filter by Role = Owner
- You'll see a list of users, groups, or service principals who are assigned the "Owner" role.
10. Fabric Terms
Example
11. OneLake in Fabric
Data is stored in one data lake (OneLake), organized by workspace names.
There is only one storage point.
We will have only one copy of data and nothing is duplicated.
The files are stored in Parquet format, with metadata powered by Delta Lake.
12. One Copy for all computes
All engines store data in OneLake.
All data is stored in Delta Parquet format.
Section 3 : Fabric Lakehouse
13. Microsoft Fabric Workspaces
14. Workspace Roles
15. Creating a Lakehouse
When we create a Lakehouse there are three things:
Lakehouse - the data platform to store the data.
Semantic Model - a dataset presented to Power BI.
SQL Endpoint - lets us run SQL queries.
16. Lakehouse Explorer
Data Uploaded to table
Table created from the file
Files are stored in parquet with delta log
Here is delta log info
On clicking Properties we can see whether it's managed or not.
17. SQL Analytics Endpoint
We can only read data from this endpoint, not write or update it.
We can create views
18. Default Semantic Model View
In the context of the semantic model in Microsoft Fabric lakehouse, the semantic model itself doesn't directly store raw data. Instead, it provides a logical, structured view of the data stored in the underlying data lake or warehouse.
The semantic model acts as an abstraction layer that organizes and simplifies access to the data, making it easier for reporting and analysis tools to query the data efficiently. The raw data is stored in the data lake or data warehouse, and the semantic model helps to structure and shape this data into meaningful formats suitable for analysis and reporting.
Section 4 : Data Factory in Fabric
19. How to Load data in Lakehouse
20. Fabric vs Azure Data Factory
21. Data Gateway Types
- Gateway connects two networks
Imagine you have a big box of toys at home, and you want to show them to your friends who live far away. You have two ways to show them: one way is through a special window, and the other way is to use a delivery truck.
On-Premise Data Gateway (like a special window):
This is like a window that you open to let your friends see your toys without taking them out of the box. It connects your toys (data) at home to an online game or app that your friends are using. You can think of this as a way to share data that's stored in your house without letting your friends take it out or change it. It keeps your toys safe inside but still lets you show them off.
VNet Data Gateway (like a delivery truck):
This method is like using a delivery truck to send some of your toys to your friends' houses. The VNet (Virtual Network) is a big, secure road that connects your house and your friends' houses. When you use this truck, you're moving data across this secure road, allowing your friends to actually play with the toys (data) at their place, while still keeping everything safe and controlled.
So, in simple terms, the on-premises data gateway lets you show your toys to friends securely while they are still at home, and the VNet data gateway lets you share some toys by sending them out safely to your friends' houses.
22. Connections
Click Gear Icon -> Manage Connections and Gateways
- Gateway : Equivalent to Integration Runtime in ADF
- Connection : Similar to Linked Service in ADF
23. Creating Pipeline
Step 1 : Lookup Activity to query the SQL-connected database
Step 2 : Foreach activity to go over both tables
Step 3 : For each iteration run copy data activity
Destination : Our Onelake data lakehouse
24. Dataflow Gen2
Adding Data Source to Dataflow Gen 2
Alice here has the Storage Blob Data Contributor role, which can be granted from the container's Access Control (IAM) screen.
Click Combine
Click Add Column -> Custom Column
if [State] = "CA" then "California"
else if [State] = "NJ" then "New Jersey"
else if [State] = "FL" then "Florida"
else [State]
Click Add Destination -> Lakehouse
Next, go to Home -> Save and Run; the refresh should start automatically.
We should be able to see the data once refresh is completed.
Section 5 : Fabric One Lake
25. Shortcuts in Fabric
Shortcuts can be created only at OneLake level.
Let's say finance team wants data from marketing lakehouse.
They can create a shortcut to the marketing lakehouse without copying the data.
The data is refreshed/updated automatically.
No need to copy data while loading from Amazon S3.
26. How to create a shortcut?
27. Creating Files Shortcut
Deleting file at Azure Data Lake Storage
Data gets deleted here in fabric also.
Deleting data in Fabric
Data gets deleted in Azure Blob also.
28. Creating Table Shortcut
We can see that this table is in the unmanaged section.
In Microsoft Fabric, unidentified tables are entries displayed in the managed section of your data environment that lack associated metadata or table references. Here’s a breakdown of the concept:
Managed vs. Unmanaged: In Fabric, the managed section refers to tables that have both metadata and data managed by the Fabric engine. In contrast, the unmanaged section allows you to upload files in any format, which do not have the same management.
Unidentified Tables: If you create a table that is not in the delta format, it will be saved in the unidentified folder. This often occurs when files, such as CSVs, are included without a defined table structure, leading Fabric to categorize them as unidentified.
Purpose: The main goal of the unidentified prompt is to alert users that these files do not conform to the required structure for the managed section and do not support any tables. Essentially, it indicates that there are files present that need to be reviewed and potentially removed.
We cannot create a table shortcut that points to a subfolder.
A shortcut created under Files, however, can point to subdirectories as well.
Now I dropped a Parquet file in ADLS and there is no unmanaged error.
29. Creating Delta from Parquet
- Go to the Synapse workspace
- Create a new notebook
# Read the Parquet file from ADLS Gen2
df = spark.read.format("parquet").load('abfss://containername@storageaccountname.dfs.core.windows.net/UnEmployment.parquet')
# Write it out in Delta format to the container the shortcut will point at
df.write.format('delta').save('abfss://shortcutdelta@msfabriclakehousevedanth.dfs.core.windows.net/')
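To confirm the conversion, the Delta output can be read back from the same path:
# Read the Delta folder back to verify the conversion worked
display(spark.read.format('delta').load('abfss://shortcutdelta@msfabriclakehousevedanth.dfs.core.windows.net/'))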
30. Creating Shortcut in Fabric
Just execute the above code and create a table-level shortcut.
31. Shortcut from a subfolder in Fabric
We cannot create a Delta table / shortcut from a subfolder in ADLS Gen2.
32. Creating Shortcut from Parquet file
# Append the DataFrame as plain Parquet (no Delta log) to the shortcut container
df.write.format('parquet').mode('append').save('abfss://shortcutparquet@msfabriclakehousevedanth.dfs.core.windows.net/')
We cannot create a shortcut from plain Parquet files either; it lands in the Unidentified folder.
33. Summary of Shortcuts
34. Update Scenarios Using Shortcuts : Lakehouse to Datalake
What happens to the table and the file when either one is updated?
We cannot run updates from the SQL editor in Fabric, so let's use a notebook.
Code:
We can see that the session is created in 11 seconds, much faster than in Synapse.
df_updated = spark.sql("UPDATE demo_lakehouse.Unemployment SET Industry = 'Healthcare Updated' WHERE Industry = 'Healthcare'")
display(df_updated)
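A quick optional check against the same table to confirm the update landed:
# Healthcare rows should now show 'Healthcare Updated'
display(spark.sql("SELECT DISTINCT Industry FROM demo_lakehouse.Unemployment"))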
If we have only reader access then this operation will fail.
This update also reflects on the file in delta lake.
35. Storage to Data Lake Shortcut Updates
1st Version
Update Code
spark.sql('''update vw_unemployment_new set Industry = 'Retail' where Industry = 'Retail Trade' ''')
The Industry has changed from Retail Trade to Retail
Update Operation Delta Log
{"commitInfo":{"timestamp":1752330893911,"operation":"UPDATE","operationParameters":{"predicate":"[\"(Industry#761 = Retail Trade)\"]"},"readVersion":0,"isolationLevel":"Serializable","isBlindAppend":false,"operationMetrics":{"numRemovedFiles":"1","numRemovedBytes":"39721","numCopiedRows":"1510","numAddedChangeFiles":"0","executionTimeMs":"10094","scanTimeMs":"9455","numAddedFiles":"1","numUpdatedRows":"14","numAddedBytes":"39715","rewriteTimeMs":"635"},"engineInfo":"Apache-Spark/3.4.3.5.3.20250511.1 Delta-Lake/2.4.0.24","txnId":"6ae56a22-9ff9-4b9a-aec1-d7ab21bb57e8"}}
{"remove":{"path":"part-00000-33b67e12-a6d1-40de-86ac-20a6843bdc2a-c000.snappy.parquet","deletionTimestamp":1752330893906,"dataChange":true,"extendedFileMetadata":true,"partitionValues":{},"size":39721,"tags":{}}}
36. Deleting File Data in Fabric Lake House or ADLS
Deleting a record, or the file itself, from the lakehouse deletes it from the storage account as well.
The reverse scenario behaves the same way.
37. Deleting Data from Tables in Lake House
This is reflected in the storage file as well; we see 1,428 records.
The reverse scenario is the same: deleting from storage is also reflected in the lakehouse table.
38. Deleting Shortcut
Deleting the entire shortcut does not delete data in storage.
Section 6 : Fabric Synapse Data Engineering
39. Spark Pools
Starter Pools in Spark
Spark starter pools are machines that are kept ready and can be spun up at any time.
The billing time doesn't include the idle time spent initializing the Spark session.
40. Spark Pool Node Size (Starter Pool)
If a starter pool session is not used for 20 minutes, the session expires.
Only the minimum and maximum number of nodes can be changed in starter pools; the node size is always Medium.
41. Custom Pools
We can adjust the node family and the number of nodes, but the caveat is that custom pools don't give the same instant start time as starter pools, since Fabric needs to allocate the resources first.
42. Standard vs High Concurrency Sessions
It acts like a shared cluster in Databricks.
43. Custom Magic Commands
44. MSSparkUtils
For using Azure Key Vault: mssparkutils.credentials.help()
The Key Vault Secrets Officer role allows creating secrets and using them in a notebook.
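A minimal sketch of reading a secret from a notebook (the vault URI and secret name are placeholders):
# Requires a suitable Key Vault role (e.g. Key Vault Secrets Officer) on the identity running the notebook
secret_value = mssparkutils.credentials.getSecret(
    "https://<your-key-vault>.vault.azure.net/",  # Key Vault URI (placeholder)
    "<secret-name>"                               # secret name (placeholder)
)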
45. Call Fabric Notebook from Fabric Pipeline
Create New cloud connection
We cannot use Key Vault at the moment.
46. Managed Vs External Table
Shortcut tables are managed because the data is stored in the Tables section and comes from a root-level folder; the data is managed by the Fabric engine.
Changes to the data in the table are replicated to the data lake.
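A minimal sketch (hypothetical table and path names) contrasting the two from a notebook attached to the lakehouse:
df = spark.read.table("demo_lakehouse.Unemployment")
# Managed: saveAsTable writes into the Tables section, so the Fabric engine
# manages both the data and the metadata
df.write.format("delta").mode("overwrite").saveAsTable("unemployment_managed")
# Unmanaged/external: writing under Files leaves Delta files that are not
# registered as a table until a table or shortcut points at them
df.write.format("delta").mode("overwrite").save("Files/external/unemployment")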
47. Environments in Fabric
We cannot change the node family since it's the default pool.
48. V Order in Fabric
49. Domains in Fabric
We can isolate data to create a data mesh using domains. One for IT, one for HR and so on.
Go to settings -> then click 'Create Domain'
We can assign workspaces to each domain.
Section 7 : Synapse to Fabric Migration
50. Migrating notebooks from Synapse to Fabric
First, create a Service Principal using App Registration.
Then enter credentials.
azure_client_id = ""
azure_tenant_id = ""
azure_client_secret = ""
synapse_workspace_name = "synapseprojectvedanth"
Go to Synapse workspace and add role assignment.
Give Synapse Admin role to the SP.
Go to any table in the lakehouse, click Properties, and get the abfss path:
abfss://264d9187-xxxx-yyyy-zzzz-aaaaa@onelake.dfs.fabric.microsoft.com/46d6a3de-xxxx-xxxx-xxxx-xxxxxxx/
Now configure fabric details this way
workspace_id = "Data_Engineering_Workspace"
lakehouse_id = "46d6a3de-fc48-4a41-xxx-xxxxxx"
export_folder_name = f"export/{synapse_workspace_name}"
prefix = "mig"
output_folder = f"abfss://{workspace_id}@onelake.dfs.fabric.microsoft.com/{lakehouse_id}/Files/{export_folder_name}"
print(output_folder)
Import utility file
sc.addPyFile('https://raw.githubusercontent.com/microsoft/fabric-migration/main/data-engineering/utils/util.py')
Export notebooks from synapse to fabric
from util import Utils  # import the helper class from the util.py file added above
Utils.export_notebooks(azure_client_id, azure_tenant_id, azure_client_secret, synapse_workspace_name, output_folder)
Import everything as notebooks in Fabric.
51. Migrating / Running Pipelines from Synapse in Fabric
52. Migrating ADLS DataLake Gen2 to Fabric OneLake
Using FastCP - Best Way
# Azure storage access info
blob_account_name = "azureopendatastorage"
blob_container_name = "nyctlc"
blob_relative_path = "green"
blob_sas_token = r""
# Allow SPARK to read from Blob remotely
wasbs_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (blob_container_name, blob_account_name, blob_relative_path)
spark.conf.set(
'fs.azure.sas.%s.%s.blob.core.windows.net' % (blob_container_name, blob_account_name),
blob_sas_token)
print('Remote blob path: ' + wasbs_path)
# SPARK read parquet, note that it won't load any data yet by now
df = spark.read.parquet(wasbs_path)
NYDestinationCPPath = 'abfss://Fabric_trail@onelake.dfs.fabric.microsoft.com/LH_Fabric.Lakehouse/Files/NYDestinationCP'
NYDestinationFSPath = 'abfss://Fabric_trail@onelake.dfs.fabric.microsoft.com/LH_Fabric.Lakehouse/Files/NYDestinationFS'
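The copy step itself is not shown above; a sketch of how it could look with mssparkutils (treat the exact call as an assumption to verify against the docs):
# Parallel copy of the source folder into OneLake (recurse=True)
mssparkutils.fs.fastcp(wasbs_path, NYDestinationCPPath, True)
# Regular copy to the second destination, for comparison
mssparkutils.fs.cp(wasbs_path, NYDestinationFSPath, True)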
Section 8 : Capacity Metrics App
53. UI of the App
54. Capacity Metrics and Units in Fabric
If we are using the F64 SKU, we have a maximum of 64 × 30 = 1,920 capacity-unit seconds per 30-second window.
Out of this, whatever we see in blue (29.99 units) is used for background processes such as report refreshes.
Whatever we see in red is the interactive consumption generated by user activity.
Throttling
Throttling occurs when the tenant consumes more capacity units than it has purchased.
Smoothing creates capacity balance
Overage is the capacity beyond 100%
If we overuse for less than 10 minutes, Fabric will not charge us anything.
The overage capacity units will be compensated against background processes over the next 24 hours.
55. Interactive Delay
If our workload exceeds 100% for more than 10 minutes, then there will be some delay for users running notebooks or SQL queries.
If we are over-utilizing for more than 10 minutes but less than 60 minutes, there will be a 5-second delay.
56. Interactive Rejection
We get an interactive rejection if the over-utilization exceeds 100% for more than 1 hour but less than 24 hours: interactive queries are rejected, but pipelines keep running.
57. Background Rejection
We get background / pipeline failures when the over-utilization exceeds 24 hours.
Imagine you have a toy box, and you can fit 10 toys inside. If you add 12 toys, you have 2 extra toys that don't have a place. This is like an 'overage'—you are over the limit!
Now, let's break down the terms:
Add % (Added Percentage): This is like saying, "How many new toys did you put in the box?" For example, if you added 2 more toys, the added percentage would show that you've put in more than what fits.
Burndown %: Imagine playing with your toys. As you play, you may give some away or remove them. The burndown percentage is like counting how many toys you no longer have because you've given or taken them out of the box.
Cumulative % (Cumulative Percentage): This is like keeping track of all the toys that you didn't have space for but still remember they were there. If your toy box is overflowing for a few days, you add each extra toy to a list, showing how many toys you owe the box over time—it's the total amount of toys that you've added and not yet managed!
So, when you look at how many toys you have in the box and how many more you need to fit, you're really keeping track of your 'add %', 'burndown %', and 'cumulative %' to make sure you know how many you can play with without going over!
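In plain numbers (made-up figures, not taken from the app), the three percentages relate roughly like this:
# Toy model of overage carryforward per 30-second window on an F64 (1,920 CU-seconds)
capacity = 64 * 30
usage = [2100, 1800, 1500]  # hypothetical smoothed usage per window, in CU-seconds
cumulative = 0.0
for used in usage:
    add = max(used - capacity, 0) / capacity * 100        # Add %: new overage this window
    burndown = max(capacity - used, 0) / capacity * 100   # Burndown %: spare capacity paying it back
    cumulative = max(cumulative + add - burndown, 0)      # Cumulative %: running carryforward
    print(f"add={add:.1f}%  burndown={burndown:.1f}%  cumulative={cumulative:.1f}%")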
Section 9 : Synapse Data Warehouse
There is no SQL Analytics Endpoint in Warehouse.
58. Limitations of Tables in SQL Warehouse
- Both table and column names are case sensitive.
- No support for primary keys.
- No support for unique constraints.