Spark Joins Intro
Lecture 24 : Joins in Spark part 1🔗
Which customers joined platform but never brought anything?
Whenever we need information from another table, we use joins and there should be some common column.
Join is a costly wide dependency operation.
How do joins work?🔗
How many records do we get after inner joining the below two tables.
We get a total of 9 records.
Sometimes data gets duplicated when we do joins, so we should use distinct() but remember distinct is wide dependency transform.
Lecture 25 : Types of Join in Spark🔗
Inner Join🔗
Left Join🔗
All records in left table + those that join with right table, whereever we dont get match on right table the columns become null.
Right Join🔗
Full Outer Join🔗
Left Semi Join🔗
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("LeftSemiJoinExample").getOrCreate()
# Left DataFrame: Orders
orders = spark.createDataFrame([
(1, "iPhone"),
(2, "Pixel"),
(3, "OnePlus"),
(4, "Nokia")
], ["customer_id", "product"])
# Right DataFrame: Valid Customers
valid_customers = spark.createDataFrame([
(1,), (3,)
], ["customer_id"])
# Perform left semi join
filtered_orders = orders.join(valid_customers, on="customer_id", how="left_semi")
filtered_orders.show()
Output
+-----------+--------+
|customer_id|product |
+-----------+--------+
| 1|iPhone |
| 3|OnePlus |
+-----------+--------+
Left Anti Join🔗
Find out all customers who have never purchased any product.
Cross Join🔗
Never use cross join!