16 dataset distribution executors

Dataset Distribution in Executors🔗

Spark breaks each dataset into partitions, and those partitions are distributed across all executors.

So with:

2 datasets (A, B)
2 executors (E1, E2)

You get something like:

Dataset A partitions → A1, A2, A3, A4
Dataset B partitions → B1, B2, B3, B4

Distribution:

Executor E1 → A1, A3, B2, B4
Executor E2 → A2, A4, B1, B3

Both datasets are spread across both executors

Why Spark does this🔗

Because Spark is:

Partition-based, not dataset-based
Designed for parallelism
Each executor processes many partitions from different datasets

What happens during a join🔗

Example:

df1.join(df2, "id")

Step 1: Shuffle🔗

Both datasets are repartitioned by key (id)
Same keys from both datasets must land in same partition

After shuffle:

Partition P1 → df1(id=1,2) + df2(id=1,2)
Partition P2 → df1(id=3,4) + df2(id=3,4)

Step 2: Execution🔗

Executor E1 → processes P1 (both df1 + df2 data)
Executor E2 → processes P2 (both df1 + df2 data)

Each executor works on both datasets together

Key Insight🔗

Executors do not “own” datasets — they execute tasks on partitions, and each partition may contain data from multiple datasets (especially after shuffle).

Exception: Broadcast Join🔗

If one dataset is small:

df_large.join(broadcast(df_small), "id")

Then:

df_large → partitioned across executors  
df_small → copied to ALL executors

So:

E1 → large partition + full small dataset  
E2 → large partition + full small dataset

How shuffling of data happens?🔗

Almost—you’re very close. The idea is correct, but the exact logic is slightly more precise.

How Spark decides partition during shuffle🔗

When you do something like:

df.groupby("id")
# or
df.join(other_df, "id")

Spark uses a hash partitioning function, not a direct division.

Actual formula🔗

partition_id = hash(id) % num_partitions

So if:

num_partitions = 200

Then:

partition_id = hash(id) % 200

Why not just `id % 200`?🔗

Because:

IDs may not be numeric
Could be strings, composite keys, etc.
Even if numeric:
Raw id % 200 may lead to poor distribution
Hashing helps spread data more uniformly

Example🔗

Let’s say:

id = 12345
hash(12345) = 987654321   (example)

Then:

partition_id = 987654321 % 200 = 121

So this row goes to:

Partition 121

Important Property🔗

Same key always goes to same partition

Because:

hash(id) % 200 → deterministic

This is critical for:

joins
groupBy
aggregations

Where skew comes from🔗

If one key appears a lot:

id = 1 → 1 million rows

Then:

hash(1) % 200 → say partition 37

Now:

Partition 37 → huge
Others → small

That’s skew

Final Clarification🔗

Your statement:

“do we divide by 200 to decide partition?”

Correct version:

Spark hashes the key and then takes modulo with number of partitions (e.g., 200) to decide which partition the data goes to.

One-Line Answer🔗

Partition assignment is done using hash(key) % num_partitions, not direct division, ensuring consistent and distributed placement of data across partitions.

16 dataset distribution executors

Dataset Distribution in Executors🔗

Why Spark does this🔗

What happens during a join🔗

Step 1: Shuffle🔗

Step 2: Execution🔗

Key Insight🔗

Exception: Broadcast Join🔗

How shuffling of data happens?🔗

How Spark decides partition during shuffle🔗

Actual formula🔗

Why not just id % 200?🔗

Example🔗

Important Property🔗

Where skew comes from🔗

Final Clarification🔗

One-Line Answer🔗

Why not just `id % 200`?🔗