Skip to content

16 dataset distribution executors

Dataset Distribution in ExecutorsπŸ”—

Spark breaks each dataset into partitions, and those partitions are distributed across all executors.

So with:

2 datasets (A, B)
2 executors (E1, E2)

You get something like:

Dataset A partitions β†’ A1, A2, A3, A4
Dataset B partitions β†’ B1, B2, B3, B4

Distribution:

Executor E1 β†’ A1, A3, B2, B4
Executor E2 β†’ A2, A4, B1, B3

Both datasets are spread across both executors


Why Spark does thisπŸ”—

Because Spark is:

  • Partition-based, not dataset-based
  • Designed for parallelism
  • Each executor processes many partitions from different datasets

What happens during a joinπŸ”—

Example:

df1.join(df2, "id")

Step 1: ShuffleπŸ”—

  • Both datasets are repartitioned by key (id)
  • Same keys from both datasets must land in same partition

After shuffle:

Partition P1 β†’ df1(id=1,2) + df2(id=1,2)
Partition P2 β†’ df1(id=3,4) + df2(id=3,4)

Step 2: ExecutionπŸ”—

Executor E1 β†’ processes P1 (both df1 + df2 data)
Executor E2 β†’ processes P2 (both df1 + df2 data)

Each executor works on both datasets together


Key InsightπŸ”—

Executors do not β€œown” datasets β€” they execute tasks on partitions, and each partition may contain data from multiple datasets (especially after shuffle).


Exception: Broadcast JoinπŸ”—

If one dataset is small:

df_large.join(broadcast(df_small), "id")

Then:

df_large β†’ partitioned across executors  
df_small β†’ copied to ALL executors

So:

E1 β†’ large partition + full small dataset  
E2 β†’ large partition + full small dataset

How shuffling of data happens?πŸ”—

Almostβ€”you’re very close. The idea is correct, but the exact logic is slightly more precise.


How Spark decides partition during shuffleπŸ”—

When you do something like:

df.groupby("id")
# or
df.join(other_df, "id")

Spark uses a hash partitioning function, not a direct division.


Actual formulaπŸ”—

partition_id = hash(id) % num_partitions

So if:

num_partitions = 200

Then:

partition_id = hash(id) % 200

Why not just id % 200?πŸ”—

Because:

  1. IDs may not be numeric

  2. Could be strings, composite keys, etc.

  3. Even if numeric:

  4. Raw id % 200 may lead to poor distribution

  5. Hashing helps spread data more uniformly

ExampleπŸ”—

Let’s say:

id = 12345
hash(12345) = 987654321   (example)

Then:

partition_id = 987654321 % 200 = 121

So this row goes to:

Partition 121

Important PropertyπŸ”—

Same key always goes to same partition

Because:

hash(id) % 200 β†’ deterministic

This is critical for:

  • joins
  • groupBy
  • aggregations

Where skew comes fromπŸ”—

If one key appears a lot:

id = 1 β†’ 1 million rows

Then:

hash(1) % 200 β†’ say partition 37

Now:

Partition 37 β†’ huge
Others β†’ small

That’s skew


Final ClarificationπŸ”—

Your statement:

β€œdo we divide by 200 to decide partition?”

Correct version:

Spark hashes the key and then takes modulo with number of partitions (e.g., 200) to decide which partition the data goes to.


One-Line AnswerπŸ”—

Partition assignment is done using hash(key) % num_partitions, not direct division, ensuring consistent and distributed placement of data across partitions.