Spark Joins Intro

Lecture 24 : Joins in Spark part 1🔗

Which customers joined platform but never brought anything?

Whenever we need information from another table, we use joins and there should be some common column.

Join is a costly wide dependency operation.

How do joins work?🔗

How many records do we get after inner joining the below two tables.

We get a total of 9 records.

Sometimes data gets duplicated when we do joins, so we should use distinct() but remember distinct is wide dependency transform.

Lecture 25 : Types of Join in Spark🔗

Inner Join🔗

Left Join🔗

All records in left table + those that join with right table, whereever we dont get match on right table the columns become null.

Right Join🔗

Full Outer Join🔗

Left Semi Join🔗

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("LeftSemiJoinExample").getOrCreate()

# Left DataFrame: Orders
orders = spark.createDataFrame([
    (1, "iPhone"),
    (2, "Pixel"),
    (3, "OnePlus"),
    (4, "Nokia")
], ["customer_id", "product"])

# Right DataFrame: Valid Customers
valid_customers = spark.createDataFrame([
    (1,), (3,)
], ["customer_id"])

# Perform left semi join
filtered_orders = orders.join(valid_customers, on="customer_id", how="left_semi")
filtered_orders.show()

Output

+-----------+--------+
|customer_id|product |
+-----------+--------+
|          1|iPhone  |
|          3|OnePlus |
+-----------+--------+

Left Anti Join🔗

Find out all customers who have never purchased any product.

Cross Join🔗

Never use cross join!