
Is the data output only after all execution on the executors is complete?


🔹 What Happens Inside the Executor

  1. Join Execution Starts

    • The executor's task begins processing its partition (say, skewed table B against the smaller table A).
    • Table A (small) is buffered in memory.
    • Table B rows are streamed batch by batch (from memory, and from disk if they were spilled).

  2. Streaming Join Loop

    • For each batch of rows from B, the task joins the batch against the buffered A (a full Cartesian product in the case of a cross join).
    • Output rows are emitted as each batch is joined; the task doesn't wait to finish all batches.
    • If the output rows themselves don't fit in memory, they too can be spilled to temporary files (shuffle/disk spill).

  3. Completion of Task

    • The task keeps producing and spilling/streaming until all rows for that partition are joined.
    • When done, the results of that task are either:
        • Stored in shuffle files (if another stage depends on them).
        • Sent to the driver (if you requested .collect()).
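The executor-side steps above can be sketched in plain Python. This is a toy model under stated assumptions, not Spark's implementation: `stream_join`, `batches`, and the `(key, payload)` row layout are hypothetical names made up for illustration, the join is assumed to be an equi-join on the key, and the small table is assumed to fit in memory. For a cross join, the key lookup would simply become a loop over every buffered row of A.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Row = Tuple[int, str]  # (join_key, payload); a made-up row layout


def batches(rows: List[Row], size: int) -> Iterator[List[Row]]:
    """Yield fixed-size slices, standing in for batches read from memory or spill files."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def stream_join(
    small_a: Iterable[Row], big_b_batches: Iterable[List[Row]]
) -> Iterator[Tuple[int, str, str]]:
    """Buffer the small side once, then stream the large side batch by batch."""
    # Buffer table A in memory, indexed by join key.
    index: Dict[int, List[str]] = defaultdict(list)
    for key, payload in small_a:
        index[key].append(payload)

    # Probe with each batch of B and emit matches as soon as they are found.
    for batch in big_b_batches:
        for key, b_payload in batch:
            for a_payload in index.get(key, []):
                yield (key, a_payload, b_payload)


table_a = [(1, "a1"), (2, "a2")]
table_b = [(1, "b1"), (1, "b2"), (2, "b3"), (3, "b4"), (1, "b5")]  # key 1 is skewed

joined = stream_join(table_a, batches(table_b, size=2))
first = next(joined)   # produced before the later batches of B are even read
rest = list(joined)    # draining the generator corresponds to the task finishing
```

The generator mirrors the "emit immediately" behaviour inside the task: `first` is available after only the first batch of B has been touched, while the task as a whole is not done until the generator is exhausted.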

🔹 What Happens at the Driver (for .collect())

  • The driver doesn't wait for all executors to finish globally before receiving anything.
  • Instead, each task sends its partition's results back as soon as that task finishes.
  • The Spark driver accumulates those partitions until the entire dataset has been received.
  • Only when all partitions have been received does .collect() return the final Python list.
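The driver-side behaviour described above can be modelled with the standard library. This is only a sketch: `run_task` and `collect` here are hypothetical stand-ins (threads play the role of executors), not Spark APIs, and the toy driver is assumed to reassemble results in partition order.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Dict, List


def run_task(partition_id: int, rows: List[int]) -> List[int]:
    """Stand-in for one task: process the whole partition, then return it in one piece."""
    return [r * 10 for r in rows]


def collect(partitions: List[List[int]]) -> List[int]:
    """Toy driver: accept finished partitions in any order, return only when all have arrived."""
    results: Dict[int, List[int]] = {}
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = {
            pool.submit(run_task, i, rows): i for i, rows in enumerate(partitions)
        }
        # Each partition arrives as soon as its own task finishes.
        for fut in as_completed(futures):
            results[futures[fut]] = fut.result()
    # Reassemble in partition order and hand back one flat list.
    return [row for i in range(len(partitions)) for row in results[i]]


partitions = [[1, 2], [3], [4, 5, 6]]
output = collect(partitions)
```

Note that `collect` receives whole partitions one task at a time, yet the caller sees nothing until the final `return`, which is the blocking behaviour described in the bullets.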

🔹 So to Your Question

"Does the executor show all data in collect() only when it finishes joining all batches of table B from disk?"

✅ Yes, but at the executor (task) level:

  • Each task must finish its partition (processing all batches of table B, including spills) before it can hand that partition's results to the driver.
  • The driver only gets complete partitions from executors, not row-by-row streaming.
  • When all tasks finish and send their partitions → the driver merges them → .collect() returns.

🔹 Analogy (Apples Again 🍎)

  • Executor = worker making apple pairs.
  • Worker has one small bowl (A) and one giant truck (B).
  • Worker processes crates from the truck one at a time, makes pairs with the bowl, and stacks results.
  • Worker doesn't hand over pairs to the boss (driver) crate by crate; he waits until all his crates are processed (partition done).
  • Then he delivers his entire stack of results to the boss.
  • Boss waits for all workers to deliver their stacks → only then shows you the final full list (collect()).

✅ Takeaway

  • Executors stream through skewed/spilled data batch by batch.
  • But the driver only receives results partition by partition (after the task for that partition finishes).
  • .collect() blocks until all tasks finish and return their partitions.
  • That's why .collect() can OOM the driver → it tries to hold the entire dataset at once.
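To make the memory point concrete, here is a small plain-Python comparison of how many rows a driver-like consumer holds at its peak. It is an illustration only: `partition_results`, `collect_all`, and `consume_per_partition` are hypothetical helpers, and the three partitions of three rows each are made-up data. In PySpark, `toLocalIterator()` is the API that fetches roughly one partition at a time instead of materialising everything as `.collect()` does.

```python
from typing import Iterator, List


def partition_results() -> Iterator[List[int]]:
    """Hypothetical stand-in for finished partitions arriving at the driver."""
    for start in range(0, 9, 3):
        yield list(range(start, start + 3))


def collect_all() -> int:
    """Hold every partition at once; return the peak number of rows held."""
    held: List[int] = []
    for part in partition_results():
        held.extend(part)  # nothing is released until the end
    return len(held)


def consume_per_partition() -> int:
    """Process and drop one partition at a time; return the peak number of rows held."""
    peak = 0
    for part in partition_results():
        peak = max(peak, len(part))  # only the current partition is alive
    return peak


peak_collect = collect_all()
peak_incremental = consume_per_partition()
```

With the collect-style approach the peak grows with the whole dataset, while the per-partition approach is bounded by the largest single partition.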