SMJ Spill to Disk Q2

Explain How Streaming Data from Disk to Executor Works for SMJ?


🔹 Sort-Merge Join Execution (with skew + spill)

  1. Both sides sorted & partitioned

    • Table A (smaller side) → all rows sorted by key.
    • Table B (skewed side) → rows also sorted by key, but because the skewed key group is huge, Spark may spill many of its sorted chunks to disk.

  2. Executor merge phase

    • Spark creates iterators:

      • One for A (fits in memory).
      • One for B (some rows in memory, some spilled to disk).

  3. When a join key is encountered

    • Spark buffers all rows for that key from A (usually small enough to keep in memory).
    • Spark starts pulling rows for that key from B in batches:

      • If the rows are in memory, it reads them directly.
      • If rows were spilled, it loads them back sequentially from disk (streaming).

  4. Join output

    • For each batch of rows from B, Spark computes the Cartesian product with A’s buffered rows.
    • Results are emitted in a streaming fashion.
    • If B’s key group is gigantic, Spark keeps pulling batches from disk until all pairs are produced (see the sketch after this list).
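
To make the merge loop concrete, here is a minimal, self-contained sketch of the same idea in plain Python (illustrative only, not Spark’s actual internals; `merge_join` and the row/key shapes are made up). The small side A’s key group is buffered in memory, while B’s rows are pulled one at a time, which is exactly the streaming behaviour described above:

```python
def merge_join(a_iter, b_iter, key=lambda r: r[0]):
    """Sketch of SMJ's merge phase over two key-sorted iterators.

    A's key group is buffered (small side); B's rows are streamed one
    at a time, whether they come from memory or from spilled files.
    """
    a = next(a_iter, None)
    b = next(b_iter, None)
    while a is not None and b is not None:
        ka, kb = key(a), key(b)
        if ka < kb:                      # no match on B yet: advance A
            a = next(a_iter, None)
        elif ka > kb:                    # no match on A yet: advance B
            b = next(b_iter, None)
        else:
            # Buffer the full key group from A (assumed small).
            group_a = []
            while a is not None and key(a) == ka:
                group_a.append(a)
                a = next(a_iter, None)
            # Stream B's key group: join each incoming B row against
            # the buffered A rows and emit results immediately.
            while b is not None and key(b) == ka:
                for ra in group_a:
                    yield (ra, b)
                b = next(b_iter, None)

# Example: A has 2 rows, B has a skewed key group for key 1.
a_rows = iter([(1, "a1"), (2, "a2")])
b_rows = iter([(1, "b1"), (1, "b2"), (1, "b3"), (3, "b4")])
for pair in merge_join(a_rows, b_rows):
    print(pair)  # one pair per incoming B row, emitted as a stream
```

Note that at no point does the sketch hold all of B’s key group in a list; memory use is bounded by A’s group plus one B row at a time.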

🔹 Key Insight

  • Spark never tries to load all of the skewed table B into memory at once.
  • Instead:

    • A is small → fully in memory.
    • B is big → read batch → join with A → emit results → read next batch → repeat.
  • If B is insanely large, Spark may spill intermediate join buffers again, but the logic is still stream & spill, not “load all at once.” (A related config sketch follows below.)
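
As a hedged aside (Spark 3.x; not claimed in the text above): Adaptive Query Execution can also split an oversized key group of B into several smaller tasks before the merge even starts, which limits how much any one task has to stream and spill. The config keys below are real Spark settings; the app name and values are illustrative only:

```python
from pyspark.sql import SparkSession

# Spark 3.x knobs that let AQE break a skewed partition of B into
# smaller tasks; the values shown are illustrative, not recommendations.
spark = (
    SparkSession.builder
    .appName("smj-skew-demo")  # hypothetical app name
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.skewJoin.enabled", "true")
    # A partition counts as skewed when it is skewedPartitionFactor times
    # the median partition size AND above the byte threshold.
    .config("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
    .config("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256m")
    .getOrCreate()
)
```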

🔹 Analogy

Imagine:

  • Table A = a tiny bowl of 5 apples 🍎.
  • Table B = a giant truckload of apples 🚚.
  • Spark doesn’t dump the whole truck into memory.
  • Instead, it unloads one crate at a time, joins with the 5 apples from A, writes results out, and then grabs the next crate.

✅ So yes, exactly: the executor keeps the small table A in memory and streams batches of rows for the skewed key from table B (from memory and from disk), joining them incrementally.
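
For completeness, a small usage-level sketch (hypothetical DataFrames, reusing the `spark` session from the config sketch above; with a 5-row table Spark would normally choose a broadcast join, so the "merge" join hint forces the sort-merge path described here):

```python
# Hypothetical tables: a tiny A and a B skewed onto just a few keys.
small_df = spark.range(5).withColumnRenamed("id", "key")                          # Table A
skewed_df = spark.range(10_000_000).selectExpr("id % 3 AS key", "id AS payload")  # Table B

# The "merge" hint forces a sort-merge join even though A is broadcast-sized.
joined = small_df.join(skewed_df.hint("merge"), "key")
joined.explain()  # the physical plan should contain SortMergeJoin
```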