Types of Memory in Spark
Great question! Spark's memory model is tricky but very important for tuning and avoiding OOMs. Let's break it down.
🔹 Types of Memory in Spark
Broadly, Spark memory can be thought of at two levels:
- Execution vs Storage memory (inside the JVM heap managed by Spark)
- Other JVM memory categories (outside Sparkβs unified memory)
1. Execution Memory
- Used for:
  - Shuffle operations (sort, join, aggregation)
  - Hash tables for joins and aggregations
  - Temporary buffers when spilling to disk
- When it runs out: data is spilled to disk.
👉 Example: when Spark does a groupByKey or sortByKey, it needs execution memory to build in-memory data structures.
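The spill behavior described above can be illustrated with a simplified, Spark-free sketch. Everything here (the `max_entries` budget, the function name, the pickle-based spill files) is an illustrative stand-in for Spark's internal `ExternalAppendOnlyMap`, not its actual implementation:

```python
import os
import pickle
import tempfile

def group_by_key_with_spill(pairs, max_entries=3):
    """Aggregate (key, value) pairs into per-key sums, spilling the
    partial map to disk whenever it exceeds max_entries (a stand-in
    for running out of execution memory)."""
    current = {}
    spill_files = []

    def spill():
        f = tempfile.NamedTemporaryFile(delete=False)
        pickle.dump(current.copy(), f)
        f.close()
        spill_files.append(f.name)
        current.clear()

    for k, v in pairs:
        current[k] = current.get(k, 0) + v
        if len(current) > max_entries:
            spill()  # "execution memory" exhausted -> spill to disk

    # Merge the in-memory map with all spilled maps, mimicking the
    # merge phase that follows spilling.
    result = dict(current)
    for name in spill_files:
        with open(name, "rb") as f:
            for k, v in pickle.load(f).items():
                result[k] = result.get(k, 0) + v
        os.remove(name)
    return result

pairs = [("CA", 1), ("NY", 2), ("TX", 3), ("WA", 4), ("CA", 5)]
print(group_by_key_with_spill(pairs, max_entries=3))
```

The point of the sketch: spilling trades memory for disk I/O, which is why shuffle-heavy stages slow down rather than crash when execution memory is tight.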
2. Storage Memory
- Used for:
  - Caching/persisting RDDs/DataFrames (df.cache(), rdd.persist())
  - Broadcast variables (e.g., in broadcast joins)
  - Unrolling RDD elements before caching
- When it runs out:
  - Cached blocks may be evicted (LRU).
  - Broadcast variables may spill to disk.
👉 Example: if you call df.cache() and then run an action, the DataFrame's blocks sit in storage memory.
3. Unified Memory Management
Since Spark 1.6, execution and storage memory have shared a unified pool (spark.memory.fraction, by default 0.6 of the heap remaining after reserved memory).
- If execution needs more → it can borrow from storage (by evicting cached blocks).
- If storage needs more → it can borrow from execution, but only memory that execution is not using.
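The split can be checked with simple arithmetic. The sketch below assumes a 4 GB executor heap, the default spark.memory.fraction=0.6 and spark.memory.storageFraction=0.5, and the fixed ~300 MB reserved region; the function name and MB units are illustrative:

```python
RESERVED_MB = 300  # fixed reserved memory per executor

def unified_pool_mb(heap_mb, memory_fraction=0.6, storage_fraction=0.5):
    """Return (unified, execution, storage, user) region sizes in MB,
    following the unified memory manager's formula:
    unified = (heap - reserved) * spark.memory.fraction."""
    usable = heap_mb - RESERVED_MB
    unified = usable * memory_fraction
    storage = unified * storage_fraction   # eviction-protected storage share
    execution = unified - storage
    user = usable - unified                # everything else in the heap
    return unified, execution, storage, user

unified, execution, storage, user = unified_pool_mb(4096)
print(f"unified={unified:.0f}MB execution={execution:.0f}MB "
      f"storage={storage:.0f}MB user={user:.0f}MB")
```

Note that the execution/storage split from spark.memory.storageFraction is only a soft boundary; the borrowing rules above let either side grow past it at runtime.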
4. User Memory
- Used for:
  - Data structures created by your Spark code inside UDFs, accumulators, custom objects, etc.
- Spark doesn't manage this; it's just regular JVM heap outside the unified pool.
👉 Example: if you write a UDF that builds a big in-memory map, that map lives in user memory.
5. Reserved Memory
- A fixed amount Spark reserves for internal operations (default ~300 MB per executor).
- Not configurable (except by changing Spark code).
- Ensures Spark doesn't use 100% of the JVM heap and leave nothing for itself.
6. Off-Heap Memory
- Used for:
  - Tungsten's optimized binary storage format (off-heap caching)
- Enabled with spark.memory.offHeap.enabled=true.
- Managed outside the JVM heap → avoids GC overhead.
- Sized with spark.memory.offHeap.size.
👉 Example: when you enable off-heap caching, Spark stores columnar data in native memory instead of the JVM heap for efficiency.
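Enabling it is purely configuration. A hedged sketch (the 2g size and my_job.py are illustrative values, not recommendations):

```shell
spark-submit \
  --conf spark.memory.offHeap.enabled=true \
  --conf spark.memory.offHeap.size=2g \
  my_job.py
```

Keep in mind that off-heap memory adds to the executor's total footprint: the resource manager must grant heap plus off-heap (plus overhead), or the container can be killed by YARN/Kubernetes.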
🔹 Spark Memory Layout (Executor JVM Heap)
+-------------------------------------------------------------+
| JVM Heap |
| |
| Reserved Memory (~300MB, always kept aside) |
|-------------------------------------------------------------|
| Unified Memory Region (spark.memory.fraction ~ 60%) |
| - Execution Memory <---- shareable ----> Storage Memory |
|-------------------------------------------------------------|
| User Memory (UDF objects, data structures, not Spark-managed)|
+-------------------------------------------------------------+
Outside JVM Heap:
- Off-Heap Memory (optional, managed by Spark)
🔹 Where They Are Used in Practice
- Execution Memory → sorting, shuffling, joins, aggregations
- Storage Memory → caching/persist, broadcast variables
- User Memory → UDFs, custom data structures, accumulators
- Reserved Memory → Spark internal bookkeeping
- Off-Heap Memory → Tungsten, columnar cache, avoids GC overhead
✅ Summary:
- Spark divides memory into execution (processing/shuffle) and storage (cache/broadcast).
- These share a unified pool for efficiency.
- User memory and reserved memory sit outside Spark's control.
- Off-heap memory is optional but useful for performance.
🔹 1. Execution Memory
Definition: Memory used for processing computations in Spark.
What it stores:
- Shuffle operations (sorts, aggregations, joins)
- Hash tables for joins and aggregations
- Temporary buffers for sorting, spilling data to disk
Behavior:
- Borrowable from storage memory if storage is not using all of its share (because Spark uses unified memory pool)
- If execution memory runs out, Spark spills intermediate data to disk to avoid crashing
Example:
- Spark builds a hash map of states → execution memory is used.
- If there are too many states to fit in memory → it spills to disk.
🔹 2. Storage Memory
Definition: Memory used for caching and storing data in memory.
What it stores:
- Cached/persisted RDDs or DataFrames (df.cache())
- Broadcast variables for joins
- Unrolled blocks before writing to cache
Behavior:
- Evictable (Spark uses LRU: least recently used blocks get removed if execution needs memory)
- Part of the unified memory pool (spark.memory.fraction)
- Helps avoid recomputation or re-reading data from disk
Example: after df.cache(), later actions read the DataFrame from storage memory instead of recomputing it.
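The LRU eviction just described can be sketched in plain Python; OrderedDict stands in for Spark's LRU list of cached blocks, and the block names, sizes, and budget are made-up illustrative values:

```python
from collections import OrderedDict

class StorageMemory:
    """Toy LRU block store: caching a new block may evict the
    least recently used blocks to stay under the budget."""
    def __init__(self, budget_mb):
        self.budget = budget_mb
        self.blocks = OrderedDict()  # block name -> size, oldest first

    def cache(self, name, size_mb):
        while self.blocks and sum(self.blocks.values()) + size_mb > self.budget:
            evicted, _ = self.blocks.popitem(last=False)  # drop LRU block
            print(f"evicting {evicted}")
        self.blocks[name] = size_mb

    def read(self, name):
        self.blocks.move_to_end(name)  # mark block as recently used

store = StorageMemory(budget_mb=120)
store.cache("rdd_1", 60)
store.cache("rdd_2", 30)
store.read("rdd_1")          # rdd_1 is now most recently used
store.cache("rdd_3", 50)     # evicts rdd_2, the LRU block
print(list(store.blocks))    # ['rdd_1', 'rdd_3']
```

This is why recently touched cached data tends to survive memory pressure while stale cached data silently disappears (and gets recomputed from lineage if needed again).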
🔹 3. Key Difference
| Feature | Execution Memory | Storage Memory |
|---|---|---|
| Purpose | Computations (shuffle, join, aggregation) | Caching/persisting data, broadcast variables |
| Managed by Spark? | Yes (part of unified memory pool) | Yes (part of unified memory pool) |
| Borrowable? | Can borrow from storage if free | Can be borrowed by execution if storage is not full |
| Eviction policy | Spills to disk when full | LRU eviction to free memory for execution |
| Examples | groupBy, sort, join buffers | df.cache(), broadcast joins |
🔹 4. Unified Memory Pool
Spark >= 1.6 uses unified memory management:
+--------------------------------------------------+
| Unified Memory Pool (spark.memory.fraction ~60%) |
|--------------------------------------------------|
|   Execution Memory <------> Storage Memory       |
+--------------------------------------------------+
- Execution can borrow from storage memory if needed.
- Storage can use free execution memory.
- Prevents fixed partitioning of memory and reduces OOMs.
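The borrowing rules can be modeled with a few lines of Python. This is a deliberately simplified model, not Spark's actual `UnifiedMemoryManager`: the pool size, the protected storage floor (pool × spark.memory.storageFraction), and the function name are all illustrative assumptions:

```python
def execution_can_get(request, exec_used, storage_used,
                      pool=1000, storage_floor=500):
    """How much of `request` execution can obtain in this simplified
    model: it may use free pool space, and may additionally evict
    cached blocks, but never push storage below its protected floor."""
    free = pool - exec_used - storage_used
    evictable = max(0, storage_used - storage_floor)
    return min(request, free + evictable)

# Storage holds 700 of a 1000-unit pool; execution asks for 400.
# It gets 300 free + 100 evicted from storage -> the full 400.
print(execution_can_get(400, exec_used=0, storage_used=700))
```

The asymmetry is the key design point: execution can forcibly evict cached blocks (they can be recomputed), but storage can never evict execution pages (in-flight shuffle state cannot be recomputed mid-task).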
✅ In short:
- Execution Memory → used temporarily for computations like shuffles, joins, sorts.
- Storage Memory → used to store cached RDDs/DataFrames or broadcast variables for reuse.
- Both share the same unified memory pool.
Perfect! Let's visualize Execution Memory vs Storage Memory in Spark with a clear diagram and examples.
🔹 Spark Executor Memory Layout (Simplified)
Executor JVM Heap
+-----------------------------------------------------------+
| Reserved Memory (~300MB) |
|-----------------------------------------------------------|
| User Memory (UDFs, accumulators, temp objects) |
|-----------------------------------------------------------|
| Unified Memory Pool (~60% of executor memory) |
| +---------------------+ +--------------------------+ |
| | Execution Memory | | Storage Memory | |
| |---------------------| |--------------------------| |
| | Used for: | | Used for: | |
| | - Shuffle buffers | | - Cached RDD/DataFrames | |
| | - Join/hash tables | | - Broadcast variables | |
| | - Aggregations | | - Unrolled blocks | |
| | If full -> spills | | If needed -> evict LRU | |
| +---------------------+ +--------------------------+ |
+-----------------------------------------------------------+
🔹 Examples of Memory Usage
| Operation / Action | Memory Used | Notes |
|---|---|---|
| df.groupBy("state").agg(sum("revenue")) | Execution Memory | Hash map for aggregation stored here. If too large → spill to disk. |
| df.sort("date") | Execution Memory | Sort buffers stored in memory before writing or returning results. |
| df.cache() | Storage Memory | Cached DataFrame resides here for reuse. |
| broadcast(df) | Storage Memory | Broadcasted DataFrame for joins stored here. |
| Temporary object inside a UDF | User Memory | Not managed by Spark's unified memory. |
🔹 Unified Memory Behavior
- Execution can borrow from storage if storage has free space.
- Storage can borrow from free execution memory if execution isnβt using it.
- Helps prevent OOM errors and improves memory efficiency.
🔹 Quick Visual Summary
Execution Memory <----> Storage Memory
(shuffle, join) (cache, broadcast)
| |
v v
spills to disk evict LRU
✅ Key Takeaways:
- Execution Memory: Temporary, computation-related, spills to disk if needed.
- Storage Memory: Persistent, caching/broadcast, evictable.
- Unified Memory Pool: Flexible sharing to reduce memory pressure.
When can we neither spill to disk nor evict storage memory?
Would you like me to also give you a real-world scenario of an executor OOM and show which type of memory usually causes it (shuffle-heavy job vs cache-heavy job vs UDF-heavy job)?