Spark Executor OOM
Executor Memory OOM
Assume 10 GB of memory per executor and 4 cores.
Expanding one executor: on top of the 10 GB heap, the container also gets overhead memory, by default max(384 MB, 10% of executor memory), so roughly 1 GB here.
Exceeding either the 10 GB heap or the ~1 GB overhead leads to OOM.
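A minimal sketch of how such an executor layout might be requested; the 10 GB / 4 cores / 1 GB overhead values are just the example numbers from above:

from pyspark.sql import SparkSession

# Sketch: request executors with a 10 GB heap and 4 cores each.
# With the default overhead (10% of executor memory, minimum 384 MB),
# each executor container is sized at roughly 10 GB + 1 GB.
spark = (
    SparkSession.builder
    .appName("executor-oom-demo")
    .config("spark.executor.memory", "10g")
    .config("spark.executor.cores", "4")
    .config("spark.executor.memoryOverhead", "1g")  # explicit here; equals the 10% default
    .getOrCreate()
)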
How is the 10 GB divided?
What does each part of the executor memory do?
- Reserved Memory: fixed at 300 MB, set aside for Spark's internal objects. The executor memory must be at least 1.5 × 300 MB = 450 MB, otherwise the executor will not start.
- User Memory
- Storage Memory
- Execution Memory
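A rough back-of-the-envelope split of the 10 GB heap under the default settings (spark.memory.fraction = 0.6, spark.memory.storageFraction = 0.5); the exact numbers depend on your configuration:

# Sketch: how a 10 GB executor heap is carved up with the default fractions.
executor_memory_gb = 10.0
reserved_gb = 0.3                              # fixed 300 MB reserved memory
usable_gb = executor_memory_gb - reserved_gb   # 9.7 GB left to divide
spark_memory_gb = usable_gb * 0.6              # spark.memory.fraction = 0.6  -> ~5.8 GB
user_memory_gb = usable_gb * 0.4               # the remaining 40%            -> ~3.9 GB
storage_gb = spark_memory_gb * 0.5             # spark.memory.storageFraction -> ~2.9 GB
execution_gb = spark_memory_gb * 0.5           # soft boundary                -> ~2.9 GB
print(f"spark memory: {spark_memory_gb:.1f} GB "
      f"(storage ~{storage_gb:.1f} GB, execution ~{execution_gb:.1f} GB), "
      f"user memory: {user_memory_gb:.1f} GB, reserved: {reserved_gb:.1f} GB")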
What does each part of the Spark memory do?
Background: Memory in Spark Executors
Each executor in Spark has a limited memory budget. This memory is split into:
- Execution Memory: used for joins, aggregations, shuffles
- Storage Memory: used for caching RDDs or DataFrames
- User Memory: everything else (user data structures, UDF objects, internal metadata)
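A tiny illustration of which operations land in which region (the DataFrame and names here are just placeholders):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("memory-regions").getOrCreate()
df = spark.range(10_000_000)

# Storage memory: cached partitions of this DataFrame live here.
df.cache()
df.count()                                   # materializes the cache

# Execution memory: shuffle and aggregation buffers for this groupBy live here.
counts = df.groupBy((F.col("id") % 100).alias("bucket")).count()
counts.collect()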
1. Static Memory Manager (Old)
This was Spark's memory model before Spark 1.6.
How It Works:
- Fixed memory boundaries set in config.
- You manually allocate how much memory goes to:
- Storage (RDD cache)
- Execution (shuffles, joins)
- If storage fills up, cached blocks are evicted.
- No sharing between execution and storage.
Example fractions:
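In the static model the split was controlled by fixed fractions of the heap; the (now-legacy) settings and their old defaults looked roughly like this:

# Legacy Static Memory Manager fractions (pre-Spark 1.6 defaults, shown for illustration):
legacy_conf = {
    "spark.storage.memoryFraction": "0.6",   # fixed share of the heap for cached RDD blocks
    "spark.shuffle.memoryFraction": "0.2",   # fixed share of the heap for shuffle/aggregation buffers
}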
2. Unified Memory Manager (Modern - Default)
Introduced in Spark 1.6+ and the default since Spark 2.0.
How It Works:
- Combines execution + storage into a single unified memory pool.
- Dynamic memory sharing: if execution needs more, storage can give up memory, and vice versa.
- Much more flexible and efficient.
Benefits:
- Less tuning needed
- Avoids wasted memory in one region while another needs more
- Better stability under pressure
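The unified pool is governed by two knobs (defaults shown below). Note that spark.memory.storageFraction is only a soft boundary: cached blocks above it can still be evicted when execution needs the space.

# Unified Memory Manager knobs (defaults shown):
unified_conf = {
    "spark.memory.fraction": "0.6",          # share of (heap - 300 MB) used for execution + storage
    "spark.memory.storageFraction": "0.5",   # portion of that pool protected from eviction
}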
In the case below, execution memory is empty, so storage memory borrows from execution memory to cache more data.
Now the executor starts doing some work (the blue boxes).
Now the entire memory is full, so some of the cached data has to be evicted. This happens in LRU (least recently used) fashion.
Now let's say execution has used its entire ~2.9 GB share but needs more memory.
If the storage pool has free memory, execution can borrow it.
If the storage pool is also full, we get an OOM.
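A small sketch of LRU eviction under the unified manager (run with a deliberately small executor heap so eviction is easy to trigger; the sizes are illustrative):

from pyspark import StorageLevel
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# e.g. spark-submit --conf spark.executor.memory=1g lru_eviction_demo.py
spark = SparkSession.builder.appName("lru-eviction").getOrCreate()

# Cache two DataFrames that together do not fit in storage memory.
df_a = spark.range(20_000_000).withColumn("payload", F.rand()).persist(StorageLevel.MEMORY_ONLY)
df_b = spark.range(20_000_000).withColumn("payload", F.rand()).persist(StorageLevel.MEMORY_ONLY)

df_a.count()     # fills storage memory with df_a's blocks
df_b.count()     # caching df_b evicts df_a's least recently used blocks

# Evicted MEMORY_ONLY partitions are recomputed from lineage on the next access;
# the Storage tab of the Spark UI shows the drop in "Fraction Cached".
df_a.count()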
When can we neither evict the data nor spill to disk?
Suppose we have two DataFrames, df1 and df2, and the key id = 1 is heavily skewed in both of them, amounting to about 3 GB.
Since all the data from df1 and df2 with id = 1 must land on the same executor to perform the join, and we have only ~2.9 GB of execution memory for 3 GB of data, we get an OOM.
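A sketch of that situation (the data and sizes are made up; the point is that every row with id = 1 from both sides is shuffled into the same task):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("skewed-join").getOrCreate()

# Illustrative skew: ~90% of the rows in both DataFrames carry the same key, id = 1.
df1 = (spark.range(10_000_000)
       .withColumn("id", F.when(F.rand() < 0.9, F.lit(1)).otherwise(F.col("id")))
       .withColumn("left_payload", F.rand()))
df2 = (spark.range(10_000_000)
       .withColumn("id", F.when(F.rand() < 0.9, F.lit(1)).otherwise(F.col("id")))
       .withColumn("right_payload", F.rand()))

# A shuffle join hashes on "id", so every id = 1 row from both sides lands in
# the same shuffle partition / task. If that partition holds ~3 GB of data and
# the task has only ~2.9 GB of execution memory, the task fails with an OOM.
joined = df1.join(df2, on="id")
joined.count()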
As a rule of thumb we can handle about 3-4 cores per executor; beyond that, too many tasks run concurrently and share the same executor memory, which increases the risk of executor memory errors.
When can Spark neither evict nor spill data from executor memory?
This happens when both eviction and spilling are impossible, and it leads to an OutOfMemoryError in the executors.
These are the main scenarios:
1. Execution Memory Pressure with No Spill Support
Execution memory is used for:
- Joins (SortMergeJoin, HashJoin)
- Aggregations (groupByKey, reduceByKey)
- Sorts
Some operations (like hash-based aggregations) need a lot of memory, and not all are spillable.
Example:
If collect_set() builds a huge in-memory structure (e.g., millions of unique events per user), that structure can't be spilled to disk,
and execution memory is full,
then Spark can't evict (nothing is cached) and can't spill (not supported for this operation): OOM.
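A sketch of that pattern (the event log below is synthetic); each user's set of distinct events has to be held in memory as one piece, which is what makes it un-spillable:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("collect-set-oom").getOrCreate()

# Synthetic event log: a handful of users, millions of distinct events each.
events = (spark.range(50_000_000)
          .withColumn("user_id", (F.col("id") % 10).cast("string"))
          .withColumn("event_id", F.concat(F.lit("evt_"), F.col("id").cast("string"))))

# collect_set builds one in-memory set per user. With millions of distinct
# events per user, that structure cannot be spilled to disk partway through,
# so a single task can exhaust execution memory and OOM.
per_user = events.groupBy("user_id").agg(F.collect_set("event_id").alias("events"))
per_user.count()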
2. Execution Takes Priority, So Storage Can't Evict Enough
In the Unified Memory Manager, execution gets priority over storage.
But sometimes, even after evicting all the cache, execution still doesn't get enough memory.
Example:
- You cached a large DataFrame.
- Then you do a massive join.
Spark evicts all the cached data, but still can't free enough memory.
There is no more memory to give: OOM.
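A sketch of that sequence (sizes are illustrative): even after the cached blocks are dropped, the join may still need more execution memory than the pool has.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cache-then-big-join").getOrCreate()

# 1) Cache a large DataFrame -> fills up storage memory.
big = spark.range(100_000_000).withColumn("payload", F.rand())
big.cache()
big.count()                      # materialize the cache

# 2) Run a massive shuffle join -> execution memory demand spikes.
other = spark.range(100_000_000).withColumn("other_payload", F.rand())
joined = big.join(other, on="id")

# Execution has priority, so Spark evicts big's cached blocks to make room.
# If the join still needs more than the whole unified pool, the task OOMs.
joined.count()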
User Code holding References
Imagine Spark is a Pizza Party
Spark is throwing a pizza party. You and your friends (the executors) are each given a plate (memory) to hold some pizza slices (data).
The rule is:
"Eat your slice, then give your plate back so someone else can use it."
But You Keep Holding Your Plate
You finish your slice, but instead of giving the plate back, you say:
"Hmm... I might want to lick the plate later," so you hold on to it.
And you keep doing this with every plate.
Now you have 10 plates stacked up, all empty, but you're still holding them.
But There's a Problem
Spark wants to serve more pizza (more data), but now there are no plates left. Even though you're not using yours, Spark can't take them back, because you're still holding on.
Result? Spark gets frustrated and says:
"I'm out of plates! I can't serve any more pizza!"
That's when Spark crashes with a memory error (OOM), because it can't clean up the memory you're holding onto.
What Should You Do?
Let go of the plates as soon as you're done eating (i.e., don't store data in variables or lists forever).
That way, Spark can reuse memory and everyone gets more pizza.
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.appName("HoldingReferencesOOM") \
.config("spark.driver.memory", "1g") \
.getOrCreate()
# Create a large DataFrame
df = spark.range(1_000_000) # 1 million rows
# โ BAD: Holding all rows in a Python list
all_data = df.collect() # Loads entire DataFrame into driver memory
# Still holding reference to a big object
# Spark can't clean this up because Python is holding it
# Do more operations
df2 = df.selectExpr("id * 2 as double_id")
df2.show()
Spark wants to free memory, but it can't, because your code is still holding a reference to the list all_data.
all_data is still a live reference, and even though we may never use it again, the garbage collector doesn't know that. It's like finishing playing with a teddy bear but still holding onto it: the teacher thinks we are still playing with it, so they can't take it back.
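A sketch of the fix for the same example: drop the reference (or avoid pulling everything back in the first place) so the memory can be reclaimed. process() below is a hypothetical stand-in for whatever actually consumes the rows.

# Option 1: don't hold the whole DataFrame in a Python list at all;
# pull back only what you actually need.
preview = df.limit(100).collect()

# Option 2: if you really must collect, release the reference as soon as
# you are done with it so the garbage collector can reclaim the memory.
all_data = df.collect()
process(all_data)    # hypothetical function that consumes the rows
del all_data         # "give the plate back"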