
Spark Executor OOM

Executor Memory OOM🔗


10 GB per executor and 4 cores

Expanding one executor


Exceeding either the 10 GB heap or the ~1 GB memory overhead (off-heap; by default max(384 MB, 10% of executor memory)) leads to OOM.

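A minimal sketch of where these two limits are set. The configs are real (spark.executor.memory and spark.executor.memoryOverhead); the 1 GB figure is just the default overhead worked out for a 10 GB executor:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("ExecutorMemoryLimits")
    .config("spark.executor.memory", "10g")    # on-heap limit: the 10 GB
    # Off-heap overhead: defaults to max(384 MB, 10% of executor memory) = 1 GB here.
    # Exceeding it gets the container killed by the resource manager.
    .config("spark.executor.memoryOverhead", "1g")
    .getOrCreate())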

How is 10GB divided?🔗

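A back-of-the-envelope sketch of the split, assuming the default values of spark.memory.fraction (0.6) and spark.memory.storageFraction (0.5):

heap = 10 * 1024              # 10 GB executor heap, in MB
reserved = 300                # fixed reserved memory
usable = heap - reserved      # 9940 MB

spark_memory = usable * 0.6   # ~5964 MB: unified pool (execution + storage)
user_memory = usable * 0.4    # ~3976 MB: user memory
storage_pool = spark_memory * 0.5    # ~2982 MB (~2.9 GB)
execution_pool = spark_memory * 0.5  # ~2982 MB (~2.9 GB)

That ~2.9 GB execution pool is the number that shows up again in the skew example below.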

What does each part of the executor memory do?🔗

  1. Reserved Memory

Fixed at 300 MB, set aside for Spark's internal objects. The executor heap must be at least 1.5 × reserved memory (450 MB), or Spark refuses to start the executor.

  2. User Memory

The part of the heap outside the unified pool (1 - spark.memory.fraction, roughly 40% of usable memory by default). Holds user-defined data structures, UDF objects, and Spark's internal metadata.

  3. Storage Memory

Caches RDDs and DataFrames (cache/persist) and stores broadcast variables.

  4. Execution Memory

Holds intermediate data for shuffles, joins, sorts, and aggregations.

What does each part of the Spark memory do?🔗


⚙️ Background: Memory in Spark Executors

Each executor in Spark has a limited memory budget. This memory is split for:

  • Execution Memory: used for joins, aggregations, shuffles

  • Storage Memory: used for caching RDDs or DataFrames

  • User Memory: everything else (user-defined data structures, UDF objects, Spark's internal metadata)

🔄 1. Static Memory Manager (Old)

This was Spark's memory model before Spark 1.6.

🔧 How It Works:

  • Fixed memory boundaries set in config.
  • You manually allocate how much memory goes to:
      • Storage (RDD cache)
      • Execution (shuffles, joins)
  • If storage fills up → cached blocks are evicted.
  • No sharing between execution and storage.

Example fractions

spark.storage.memoryFraction = 0.6
spark.shuffle.memoryFraction = 0.2
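These fractions only apply when the old manager is active. A minimal sketch of opting back into it on Spark 1.6–2.x (the spark.memory.useLegacyMode flag was removed in Spark 3.0):

from pyspark.sql import SparkSession

# Spark 1.6-2.x only: switch back to the Static Memory Manager
spark = (SparkSession.builder
    .appName("StaticMemoryDemo")
    .config("spark.memory.useLegacyMode", "true")    # removed in Spark 3.0
    .config("spark.storage.memoryFraction", "0.6")   # region for RDD cache
    .config("spark.shuffle.memoryFraction", "0.2")   # region for shuffles/joins
    .getOrCreate())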

🔄 2. Unified Memory Manager (Modern, Default)

Introduced in Spark 1.6 and the default ever since; the legacy Static Memory Manager was removed entirely in Spark 3.0.

🔧 How It Works:

Combines execution + storage into a single unified memory pool.

Dynamic memory sharing: if execution needs more, storage can give up memory, and vice versa.

Much more flexible and efficient.

✅ Benefits:

  • Less tuning needed
  • Avoids wasted memory in one region while another needs more
  • Better stability under pressure
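Two knobs control the unified pool; a minimal sketch with the defaults written out explicitly:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("UnifiedMemoryDemo")
    .config("spark.executor.memory", "10g")
    # Fraction of (heap - 300 MB reserved) that goes to the unified pool (default 0.6)
    .config("spark.memory.fraction", "0.6")
    # Share of the unified pool protected for storage (default 0.5). It is a soft
    # boundary: execution can borrow it while nothing is cached, and vice versa.
    .config("spark.memory.storageFraction", "0.5")
    .getOrCreate())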

Suppose the execution pool is empty: storage memory then borrows from it and caches more data than its nominal share.


Now the executor starts doing real work, and tasks begin claiming execution memory.


Now the entire unified pool is full, so some cached data must be evicted. Eviction happens in LRU (least recently used) order.


Now let's say the executor's execution pool is entirely used (about 2.9 GB here), but a task needs more memory.


If the storage pool has free space, execution can borrow from it.


If the storage pool is also full, then we get OOM!!!
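A related safety valve on the caching side: persist with a storage level that spills to disk, so blocks evicted from the storage pool are written out rather than dropped and recomputed. A minimal sketch:

from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CacheSpillDemo").getOrCreate()
df = spark.range(10_000_000)

# MEMORY_AND_DISK: evicted blocks go to local disk instead of being dropped
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()      # materialize the cache

df.unpersist()  # release the blocks once you're done with them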

When can we neither evict the data nor spill to disk?🔗

Suppose we have two DataFrames, df1 and df2, and the key id = 1 is heavily skewed in both: the rows for that one key add up to 3 GB.

Since the join must bring all rows with id = 1 from df1 and df2 onto the same executor, and the execution pool is only about 2.9 GB while the key's data is 3 GB, the task cannot fit it and the executor OOMs.


As a rule of thumb, 3-4 cores per executor is manageable; beyond that, the concurrent tasks split the same execution pool, each task's slice shrinks, and executor memory errors become more likely.
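On Spark 3.x, adaptive query execution can split an oversized join partition across several tasks instead of forcing it onto one. A minimal sketch using the real AQE configs (df1/df2 stand in for the skewed inputs above):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("SkewJoinDemo")
    # Adaptive Query Execution (Spark 3.x): detect and split skewed partitions
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.skewJoin.enabled", "true")
    .getOrCreate())

df1 = spark.range(1_000_000).withColumnRenamed("id", "key")
df2 = spark.range(1_000_000).withColumnRenamed("id", "key")

# With skew handling on, a huge partition for key = 1 would be split into
# smaller chunks rather than landing on a single memory-limited task
joined = df1.join(df2, "key")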

โ“ When can Spark neither evict nor spill data from executor memory?

This happens when both eviction and spilling are not possible, and it leads to:

💥 OutOfMemoryError in executors.

✅ These are the main scenarios:

🧱 1. Execution Memory Pressure with No Spill Support

Execution memory is used for:

  • Joins (SortMergeJoin, HashJoin)
  • Aggregations (groupByKey, reduceByKey)
  • Sorts

Some operations (like hash-based aggregations) need a lot of memory, and not all are spillable.

🔥 Example:

from pyspark.sql.functions import collect_set

df.groupBy("user_id").agg(collect_set("event"))

If collect_set() builds a huge in-memory structure (e.g., millions of unique events per user), that structure can't be spilled to disk, and execution memory is full:

👉 Spark can't evict (nothing is cached) and can't spill (not supported for this op) → 💣 OOM
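If the job only needs the number of unique events rather than the events themselves, one way out is to avoid materializing the set at all; a sketch using approx_count_distinct, which keeps a small fixed-size sketch per key:

from pyspark.sql import functions as F

# Bounded per-key state: a HyperLogLog sketch instead of a full set of events
unique_counts = df.groupBy("user_id").agg(
    F.approx_count_distinct("event").alias("unique_events")
)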

๐Ÿ” 2. Execution Takes Priority, So Storage Can't Evict Enough

In the Unified Memory Manager, execution gets priority over storage.

But sometimes, even after evicting all cached data, execution still doesn't get enough memory.

🔥 Example:

  • You cached a large DataFrame.
  • Then you run a massive join.

Spark evicts all cached data, but still can't free enough memory.

👉 No more memory to give → 💥
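A defensive habit for this case: release the cache yourself before the memory-hungry stage instead of relying on eviction. A sketch, with big_df and other_df as hypothetical stand-ins:

# big_df and other_df are hypothetical DataFrames
big_df.cache()
big_df.count()       # materialize the cache while it's actually useful

# ... several reads of the cached big_df ...

big_df.unpersist()   # free storage memory before the heavy join
result = big_df.join(other_df, "id")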

3. User Code Holding References

๐Ÿ• Imagine Spark is a Pizza Party Spark is throwing a pizza party. You and your friends (the executors) are each given a plate (memory) to hold some pizza slices (data).

The rule is:

โ€œEat your slice, then give your plate back so someone else can use it.โ€

๐Ÿ˜ฌ But You Keep Holding Your Plate You finish your slice, but instead of giving the plate back, you say:

โ€œHmmโ€ฆ I might want to lick the plate later,โ€ so you hold on to it.

And you keep doing this with every plate ๐Ÿฝ๏ธ.

Now, you have 10 plates stacked up, all empty, but you're still holding them.

๐Ÿ• But Thereโ€™s a Problemโ€ฆ Spark wants to serve more pizza (more data), but now there are no plates left. Even though youโ€™re not using yours, Spark canโ€™t take them back, because youโ€™re still holding on.

๐Ÿ’ฅ Result? Spark gets frustrated and says:

โ€œIโ€™m out of plates! I canโ€™t serve any more pizza!โ€

Thatโ€™s when Spark crashes with a memory error (OOM) โ€” because it canโ€™t clean up the memory you're holding onto.

โœ… What Should You Do? Let go of the plates as soon as you're done eating (i.e., donโ€™t store data in variables or lists forever).

That way, Spark can reuse memory and everyone gets more pizza. ๐Ÿ•

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("HoldingReferencesOOM") \
    .config("spark.driver.memory", "1g") \
    .getOrCreate()

# Create a large DataFrame
df = spark.range(1_000_000)  # 1 million rows

# โŒ BAD: Holding all rows in a Python list
all_data = df.collect()  # Loads entire DataFrame into driver memory

# Still holding reference to a big object
# Spark can't clean this up because Python is holding it

# Do more operations
df2 = df.selectExpr("id * 2 as double_id")
df2.show()

Spark wants to free memory, but it can't, because your code is still holding a reference to the list: all_data is still referenced, and even though we may never use it again, the garbage collector doesn't know that. It's like finishing with a teddy bear but still holding onto it; the teacher thinks we're still playing with it, so they can't take it back. Dropping the reference explicitly (del all_data) lets the memory be reclaimed.

df = spark.range(1_000_000)

# ✅ GOOD: Process data without collecting everything into memory
df.filter("id % 2 == 0").show(10)  # only shows first 10 rows
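And if a collect() is genuinely needed, keep the result small and drop the reference as soon as you're done with it; a minimal sketch:

small = df.filter("id < 100").collect()  # collect only a small result
for row in small:
    pass  # ... consume each row here ...
del small  # drop the reference so the memory can be reclaimed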