
RDD Execution

What actually happens with RDD processing?

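When you write Spark code (Scala, Python, or Java), Spark does not compile your RDD transformations into a generated, fused bytecode pipeline. The JVM still JIT-compiles whatever runs inside it, as it does for any program, but that is not Spark's execution model for RDDs.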

Instead, Spark works at a higher abstraction level:

1. Your code → function objects (closures)

When you write something like:

rdd.map(lambda x: x * 2)

or in Scala:

rdd.map(x => x * 2)

Spark treats that logic as a function object (a closure: the function plus any variables it captures).
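To make "closure" concrete, here is a minimal plain-Python sketch (no Spark involved). The names `make_scaler` and `scale` are made up for illustration. It shows that a function object carries captured variables along with its code, which is why Spark has to serialize both when it ships your logic. PySpark relies on cloudpickle for that step.

```python
def make_scaler(factor):
    """Return a function that remembers `factor` (a closure)."""
    def scale(x):
        return x * factor
    return scale

double = make_scaler(2)

# The function object carries both code and captured state.
free_names = double.__code__.co_freevars                        # names of captured variables
captured = [cell.cell_contents for cell in double.__closure__]  # their current values

print(free_names, captured, double(21))
```

Anything the function captures (here `factor`) has to travel with it, so capturing large objects in a closure makes every task heavier.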


2. Spark builds a logical execution plan (RDD lineage)

RDD transformations are lazy: Spark doesn’t execute them immediately.

Instead, it records a lineage graph (DAG) of transformations such as:

  • map
  • filter
  • flatMap

These are just transformations stored as metadata, not executed yet.
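The laziness described above can be imitated in a few lines of plain Python. This is a toy model, not Spark's implementation: `ToyRDD` and its `lineage` attribute are hypothetical names. Transformations only record metadata, and nothing runs until the `collect()` action.

```python
class ToyRDD:
    """Toy stand-in for an RDD: data plus a list of pending transformations."""

    def __init__(self, data, lineage=()):
        self._data = list(data)
        self.lineage = tuple(lineage)  # recorded (name, function) pairs, not yet run

    def map(self, f):
        # Returns a new ToyRDD; nothing is computed here.
        return ToyRDD(self._data, self.lineage + (("map", f),))

    def filter(self, f):
        return ToyRDD(self._data, self.lineage + (("filter", f),))

    def collect(self):
        """Action: only here is the recorded lineage actually applied."""
        out = self._data
        for name, f in self.lineage:
            if name == "map":
                out = [f(x) for x in out]
            else:  # "filter"
                out = [x for x in out if f(x)]
        return out


calls = []

def double(x):
    calls.append(x)  # side effect lets us observe when the function really runs
    return x * 2

rdd = ToyRDD([1, 2, 3, 4]).map(double).filter(lambda x: x > 4)
names = [name for name, _ in rdd.lineage]
calls_before_action = len(calls)  # still 0: nothing has executed
result = rdd.collect()            # the action triggers execution
```

Before `collect()` the `calls` list is still empty, which is the toy equivalent of "stored as metadata, not executed yet".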


3. When an action is triggered → execution starts

Example:

rdd.map(lambda x: x * 2).collect()

Now Spark:

  • Breaks the DAG into stages (at shuffle boundaries)
  • Sends tasks to executors, one task per partition
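As a rough sketch of what "breaking the DAG into stages" means, the toy function below groups a linear chain of transformation names into stages, starting a new stage at each shuffle operation. It is a simplification under stated assumptions: the `WIDE` set is an illustrative subset, real lineages are graphs rather than lists, and the partition count can change after a shuffle.

```python
# Illustrative subset of operations that require a shuffle (assumption for this sketch).
WIDE = {"reduceByKey", "groupByKey", "repartition"}

def split_into_stages(ops):
    """Group a linear chain of transformation names into stages.

    In this toy model, a wide (shuffle) operation starts a new stage.
    """
    stages = [[]]
    for op in ops:
        if op in WIDE and stages[-1]:
            stages.append([])
        stages[-1].append(op)
    return stages

stages = split_into_stages(["map", "filter", "reduceByKey", "map"])

# One task per (stage, partition) pair, assuming 3 partitions throughout.
num_partitions = 3
tasks = [(stage_id, p) for stage_id in range(len(stages)) for p in range(num_partitions)]

print(stages)
print(len(tasks), "tasks")
```

Here `map` and `filter` land in the same stage because neither needs data from other partitions.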

4. What executors actually run

This is the key part:

  • Executors are JVM processes
  • Your transformation logic is sent as serialized functions
  • These functions are executed on data partitions

So internally:

  • In Scala/Java Spark:
      • Yes, your code is compiled to JVM bytecode
      • BUT Spark is not "processing RDDs as bytecode"
      • It is executing functions on partitions

  • In PySpark:
      • Python code runs in a Python worker process
      • JVM ↔ Python communication happens via sockets
      • No JVM bytecode execution for your Python logic
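The idea of "serialized functions executed on data partitions" can be sketched with the standard library alone. This is only an analogy: `marshal` stands in for the serializer (PySpark uses cloudpickle, which also handles closures and lambdas), threads stand in for executors, and `run_task` is a hypothetical helper name.

```python
import marshal
import types
from concurrent.futures import ThreadPoolExecutor

def double(x):
    return x * 2

# "Driver" side: turn the function's code into bytes that could be shipped.
payload = marshal.dumps(double.__code__)

def run_task(payload, partition):
    """'Executor' side: rebuild the function from bytes, apply it to one partition."""
    func = types.FunctionType(marshal.loads(payload), globals())
    return [func(x) for x in partition]

partitions = [[1, 2], [3, 4], [5]]

# Each partition is handled by its own task; pool.map keeps partition order.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(lambda part: run_task(payload, part), partitions))

print(results)
```

The same bytes are reused for every partition: the function is shipped once per task, and the data stays split.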

Important distinction

| Concept | Reality |
| --- | --- |
| Is Spark executing bytecode pipelines for RDDs? | No |
| Does Spark run on JVM bytecode? | Yes (for Scala/Java) |
| What is actually executed? | Functions applied to partitions |
| Optimization level | Low (compared to DataFrames) |

Why people get confused

Because:

  • Spark runs on the JVM → so bytecode exists
  • But RDD execution is function-based, not query-plan optimized

Compare with DataFrames (important)

RDD:

  • Executes your functions as-is
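  • No query optimization (narrow transformations are pipelined within a stage, but your functions are opaque to Spark)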
  • No code generation

DataFrame (Catalyst + Tungsten):

  • Generates optimized bytecode at runtime (whole-stage code generation)
  • Often much faster, because Spark can see and rewrite the expressions
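The contrast can be illustrated with a Python analogy. Spark itself generates Java source that is compiled to bytecode; the sketch below instead generates Python source, and `interpret` and `generate_fused` are hypothetical names. The point is only the shape of the idea: opaque functions must be called one by one, while expressions that are visible as data can be fused into one generated loop.

```python
def interpret(rows, ops):
    """RDD-style: call each opaque user function in turn for every row."""
    out = []
    for x in rows:
        keep = True
        for kind, f in ops:
            if kind == "map":
                x = f(x)
            elif not f(x):  # kind == "filter"
                keep = False
                break
        if keep:
            out.append(x)
    return out

def generate_fused(exprs):
    """DataFrame-style: expressions are data, so emit one fused loop as source text."""
    lines = ["def fused(rows):", "    out = []", "    for x in rows:"]
    for kind, expr in exprs:
        if kind == "map":
            lines.append(f"        x = {expr}")
        else:  # "filter"
            lines.append(f"        if not ({expr}): continue")
    lines += ["        out.append(x)", "    return out"]
    source = "\n".join(lines)
    namespace = {}
    exec(compile(source, "<generated>", "exec"), namespace)  # compile at runtime
    return namespace["fused"], source

fused, source = generate_fused([("map", "x * 2"), ("filter", "x > 4")])
a = interpret([1, 2, 3, 4], [("map", lambda x: x * 2), ("filter", lambda x: x > 4)])
b = fused([1, 2, 3, 4])

print(source)
```

Both paths return the same rows, but the generated version has no per-row function calls or dispatch on operator kind.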

Final takeaway

  • RDDs are not processed as bytecode pipelines
  • They are processed as serialized functions executed on partitions across executors
  • Bytecode exists only as a lower-level implementation detail of the JVM, not as Spark’s execution model

Here’s a concise summary of all three topics (RDD execution, PySpark vs Scala, Python workers):


1. RDD Execution

  • RDDs are not processed as bytecode pipelines
  • Spark builds a DAG (lineage) of transformations and executes it lazily
  • Executors run serialized functions on partitions
  • In Scala/Java: functions run as JVM bytecode
  • In Python: functions run in Python workers
  • RDDs have no query optimization or code generation

2. PySpark vs Scala Execution

Scala Spark

  • Runs entirely inside the JVM
  • Functions execute directly on data
  • DataFrame/Dataset code benefits from the Catalyst Optimizer and Tungsten Engine
  • DataFrame/Dataset code supports whole-stage code generation (optimized bytecode); RDD code does not
  • Faster, with minimal overhead

PySpark

  • Uses dual runtime: Python + JVM
  • Python code runs outside JVM
  • Data moves between JVM and Python
  • Slower due to serialization and communication overhead
  • Fast only when using DataFrame APIs without Python UDFs

3. Python Workers

  • Python workers are separate processes outside the JVM
  • Each executor JVM communicates with them via sockets
  • No shared memory; all data is serialized
  • Driver uses Py4J, executors use socket communication
  • Apache Arrow improves transfer efficiency (Pandas UDFs) but does not eliminate the boundary
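The boundary described above can be mimicked with a socket pair and pickle. This is a toy under loud assumptions: a thread plays the Python worker (real workers are separate processes), Python also plays the executor side (really the JVM), and the length-prefixed framing in `send_msg` / `recv_msg` is made up for this sketch rather than Spark's wire protocol.

```python
import pickle
import socket
import struct
import threading

def send_msg(sock, obj):
    """Serialize obj and send it with a 4-byte big-endian length prefix."""
    data = pickle.dumps(obj)
    sock.sendall(struct.pack("!I", len(data)) + data)

def recv_exact(sock, n):
    """Read exactly n bytes from the socket."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed early")
        buf += chunk
    return buf

def recv_msg(sock):
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return pickle.loads(recv_exact(sock, length))

def worker(sock):
    """Stand-in for a Python worker: receive a partition, apply the logic, reply."""
    partition = recv_msg(sock)
    send_msg(sock, [x * 2 for x in partition])
    sock.close()

executor_side, worker_side = socket.socketpair()
t = threading.Thread(target=worker, args=(worker_side,))
t.start()

send_msg(executor_side, [1, 2, 3])  # data is serialized to cross the boundary
result = recv_msg(executor_side)    # results are serialized on the way back
t.join()
executor_side.close()

print(result)
```

Every element crosses the socket twice, once in each direction, which is the overhead the bullets above refer to.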

Final Mental Model

  • RDDs: function execution on partitions (no optimization)
  • Scala Spark: native JVM execution (fast, optimized)
  • PySpark: Python execution outside JVM (flexible, but overhead due to data movement)

Here’s a concise summary of the last two points:


Spark SQL Functions

  • Usually the best-performing option in Spark
  • Fully optimized by Catalyst Optimizer
  • Uses Tungsten Engine for execution
  • Supports whole-stage code generation
  • Runs entirely inside JVM (no Python overhead)
  • Avoids serialization and cross-language communication

Execution Model (Bytecode vs Python)

  • Spark SQL / DataFrame logic:
      • Logical plan → optimized → physical plan
      • Then compiled to JVM bytecode at runtime
      • Executed inside executor JVM

  • No Python workers involved unless:
      • Python UDFs or Pandas UDFs are used
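To give a feel for the "optimized" step, here is a tiny plan-rewrite sketch in plain Python. It imitates one kind of rule, constant folding, on a made-up tuple-based expression tree. Catalyst has a rule of that kind, but the tree encoding and the `fold_constants` function here are assumptions for illustration, not Catalyst's API.

```python
def fold_constants(expr):
    """Rewrite an expression tree bottom-up, replacing literal-only additions by their value.

    Nodes are tuples: ("lit", value), ("col", name), or ("add", left, right).
    """
    if expr[0] != "add":
        return expr  # literals and column references have nothing to fold
    left, right = fold_constants(expr[1]), fold_constants(expr[2])
    if left[0] == "lit" and right[0] == "lit":
        return ("lit", left[1] + right[1])
    return ("add", left, right)

# price + (1 + 2)  becomes  price + 3
plan = ("add", ("col", "price"), ("add", ("lit", 1), ("lit", 2)))
optimized = fold_constants(plan)

print(optimized)
```

A rewrite like this is only possible because the expression is visible as data. An arbitrary Python or Scala function passed to `rdd.map` offers nothing comparable to inspect.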


Final takeaway

Spark SQL = optimized + codegen + JVM execution (fast)
Python UDF = JVM ↔ Python boundary (slow)

Not exactly. It’s the other way around.

RDDs are divided into partitions, not partitions into RDDs.


Correct relationship

  • An RDD (Resilient Distributed Dataset) is the logical dataset
  • It is split into multiple partitions
  • Each partition is:
      • A chunk of data
      • Processed independently by a task

Mental model

RDD
 ├── Partition 1
 ├── Partition 2
 ├── Partition 3

  • Spark schedules one task per partition
  • Maximum parallelism per stage = number of partitions (bounded by the available cores)

How it works in execution

  1. You create an RDD / DataFrame
  2. Spark splits data into partitions
  3. Each partition is processed by a task on an executor

Important clarification

  • Partitions are not smaller RDDs
  • They are internal pieces of an RDD

However:

  • Each transformation (like map, filter) creates a new RDD
  • That new RDD still consists of partitions

Example

rdd = sc.parallelize([1, 2, 3, 4], 2)

  • Creates 1 RDD
  • With 2 partitions
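To see what "1 RDD with 2 partitions" looks like as data, here is a toy splitter in plain Python. `split_into_partitions` is a hypothetical helper and not Spark's exact slicing algorithm. In real PySpark, `rdd.glom().collect()` is the usual way to inspect how elements were distributed across partitions.

```python
def split_into_partitions(data, num_partitions):
    """Cut a list into contiguous, near-equal chunks (a toy splitter)."""
    n = len(data)
    return [
        data[i * n // num_partitions:(i + 1) * n // num_partitions]
        for i in range(num_partitions)
    ]

partitions = split_into_partitions([1, 2, 3, 4], 2)

# One task per partition: each task sees only its own chunk.
task_results = [sum(part) for part in partitions]

print(partitions, task_results)
```

The dataset stays one logical collection. The chunks are just how the work is divided among tasks.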

Final takeaway

RDD = dataset abstraction
Partitions = physical chunks of that dataset used for parallel processing