Types of Memory in Spark

Great question! 🚀 Spark's memory model is tricky but very important for tuning and avoiding OOMs. Let's break it down.


🔹 Types of Memory in Spark

Broadly, Spark memory can be thought of at two levels:

  1. Execution vs Storage memory (inside the JVM heap managed by Spark)
  2. Other JVM memory categories (outside Spark's unified memory)

1. Execution Memory

  • Used for:
    • Shuffle operations (sort, join, aggregation)
    • Hash tables for joins and aggregations
    • Temporary buffers when spilling to disk
  • When it runs out: Data is spilled to disk.

👉 Example: When Spark does a groupByKey or sortByKey, it needs execution memory to build in-memory data structures.


2. Storage Memory

  • Used for:
    • Caching/persisting RDDs/DataFrames (df.cache(), rdd.persist())
    • Broadcast variables (e.g., in broadcast joins)
    • Unrolling RDD elements before caching
  • When it runs out:
    • Cached blocks may be evicted (LRU).
    • Broadcast variables may spill.

👉 Example: If you do:

val cachedDF = df.cache()

The DataFrame sits in storage memory.


3. Unified Memory Management

Since Spark 1.6, execution and storage memory share a unified pool sized by spark.memory.fraction (default 0.6, applied to the JVM heap that remains after the ~300 MB reserve).

  • If execution needs more → it can borrow from storage by evicting cached blocks, but only down to the share protected by spark.memory.storageFraction (default 0.5 of the pool).
  • If storage needs more → it can borrow from execution, but only if execution isn't using it.
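
To put numbers on the split, here is a minimal sketch of the on-heap arithmetic. HeapRegions and its field names are made up for illustration (they are not Spark classes). It assumes the defaults spark.memory.fraction = 0.6 and spark.memory.storageFraction = 0.5 (the setting that protects part of the pool for storage), and it takes the heap size as a plain input, whereas a real executor reports a value somewhat below -Xmx.

```scala
// Sketch of the on-heap region arithmetic; HeapRegions is a hypothetical
// helper, not part of Spark.
final case class HeapRegions(reserved: Long, unified: Long, storageRegion: Long, user: Long)

object HeapRegions {
  // Fixed reserve kept aside before any fraction is applied (~300 MB).
  val ReservedBytes: Long = 300L * 1024 * 1024

  def compute(heapBytes: Long,
              memoryFraction: Double = 0.6,   // assumed default of spark.memory.fraction
              storageFraction: Double = 0.5   // assumed default of spark.memory.storageFraction
             ): HeapRegions = {
    require(heapBytes > ReservedBytes, "heap must exceed the reserved memory")
    val usable  = heapBytes - ReservedBytes
    val unified = (usable * memoryFraction).toLong   // execution + storage pool
    val storage = (unified * storageFraction).toLong // part of the pool execution cannot evict
    HeapRegions(ReservedBytes, unified, storage, usable - unified)
  }
}
```

For a 4 GiB heap, HeapRegions.compute(4L * 1024 * 1024 * 1024) gives roughly 2.2 GiB for the unified pool (about 1.1 GiB of it protected for storage) and roughly 1.5 GiB of user memory.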

4. User Memory

  • Used for:
    • Data structures created by your Spark code inside UDFs, accumulators, custom objects, etc.
  • Spark doesn't manage this; it's just regular JVM heap outside the unified pool.

👉 Example: If you write a UDF that builds a big in-memory map, it goes into user memory.


5. Reserved Memory

  • A fixed amount Spark reserves for internal operations (default ~300 MB per executor).
  • Not configurable (except by changing Spark code).
  • Ensures Spark doesn't use 100% of JVM heap and leave nothing for itself.

6. Off-Heap Memory

  • Used for:
    • Tungsten's optimized binary storage format (off-heap caching)
  • Only used when spark.memory.offHeap.enabled=true (disabled by default).
  • Managed outside the JVM heap → avoids GC overhead.
  • Sized with spark.memory.offHeap.size, which must be positive when off-heap is enabled.

👉 Example: When you enable off-heap caching, Spark stores columnar data in native memory instead of the JVM heap for efficiency.
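
A minimal sketch of switching this on when the session is created. The app name, the local master, and the 2g size are placeholder assumptions, and it needs Spark on the classpath. Both settings have to be in place before the executors start, so in practice they are usually passed at launch (for example with --conf) rather than changed mid-job.

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: app name, master, and size are placeholders, not recommendations.
val spark = SparkSession.builder()
  .appName("offheap-sketch")
  .master("local[*]") // local master only so the sketch runs standalone
  .config("spark.memory.offHeap.enabled", "true")
  .config("spark.memory.offHeap.size", "2g") // must be positive when off-heap is enabled
  .getOrCreate()
```

Off-heap space is allocated in addition to the JVM heap, so the executor's total footprint grows by whatever size is set here.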


🔹 Spark Memory Layout (Executor JVM Heap)

+-------------------------------------------------------------+
| JVM Heap                                                    |
|                                                             |
|  Reserved Memory (~300MB, always kept aside)                |
|-------------------------------------------------------------|
| Unified Memory Region (spark.memory.fraction ~ 60%)         |
|   - Execution Memory  <---- shareable ----> Storage Memory  |
|-------------------------------------------------------------|
| User Memory (UDF objects, data structures, not Spark-managed)|
+-------------------------------------------------------------+

Outside JVM Heap:

  • Off-Heap Memory (optional, managed by Spark)

🔹 Where They Are Used in Practice

  • Execution Memory → Sorting, shuffling, joins, aggregations
  • Storage Memory → Caching/persist, broadcast variables
  • User Memory → UDFs, custom data structures, accumulators
  • Reserved Memory → Spark internal bookkeeping
  • Off-Heap Memory → Tungsten, columnar cache, avoids GC overhead

✅ Summary:

  • Spark divides memory into execution (processing/shuffle) and storage (cache/broadcast).
  • These share a unified pool for efficiency.
  • User memory and reserved memory sit outside Spark's control.
  • Off-heap memory is optional but useful for performance.

🔹 1. Execution Memory

Definition: Memory used for processing computations in Spark.

What it stores:

  • Shuffle operations (sorts, aggregations, joins)
  • Hash tables for joins and aggregations
  • Temporary buffers for sorting, spilling data to disk

Behavior:

  • Can borrow from storage memory: it takes any free storage space and can evict cached blocks that storage holds beyond its protected share (because Spark uses a unified memory pool)
  • If execution memory runs out, Spark spills intermediate data to disk to avoid crashing

Example:

import org.apache.spark.sql.functions.sum

df.groupBy("state").agg(sum("revenue"))

  • Spark builds a hash map of states → execution memory is used.
  • If there are too many states to fit in memory → it spills to disk.
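
The build-then-spill pattern can be sketched with a toy aggregator. SpillingSum is a hypothetical name, the "disk" here is just an in-memory buffer of sorted runs, and the budget is a key count rather than bytes, so this only illustrates the idea and is not how Spark implements it.

```scala
import scala.collection.mutable

// Toy spilling aggregator: sums values per key with a bounded in-memory map.
final class SpillingSum(maxKeysInMemory: Int) {
  require(maxKeysInMemory > 0, "need room for at least one key")

  private val current = mutable.HashMap.empty[String, Long]
  private val spilled = mutable.ArrayBuffer.empty[Vector[(String, Long)]]

  def spillCount: Int = spilled.size

  def add(key: String, value: Long): Unit = {
    // A new key that would exceed the budget forces a spill first.
    if (!current.contains(key) && current.size >= maxKeysInMemory) spill()
    current(key) = current.getOrElse(key, 0L) + value
  }

  private def spill(): Unit = {
    // Stand-in for writing a sorted run of partial sums to disk.
    spilled += current.toVector.sortBy(_._1)
    current.clear()
  }

  // Combine the in-memory map with every spilled run. A real engine would
  // stream-merge the sorted runs instead of rebuilding one big map.
  def result(): Map[String, Long] = {
    val merged = mutable.HashMap.empty[String, Long]
    (spilled.iterator.flatMap(_.iterator) ++ current.iterator).foreach { case (k, v) =>
      merged(k) = merged.getOrElse(k, 0L) + v
    }
    merged.toMap
  }
}
```

With a budget of two keys, feeding in rows for three states spills partial sums more than once, and the final result still matches a plain group-and-sum.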

🔹 2. Storage Memory

Definition: Memory used for caching and storing data in memory.

What it stores:

  • Cached/persisted RDDs or DataFrames (df.cache())
  • Broadcast variables for joins
  • Unrolled blocks before writing to cache

Behavior:

  • Evictable (Spark uses LRU: least recently used blocks get removed if execution needs memory)
  • Part of unified memory pool (spark.memory.fraction)
  • Helps avoid recomputation or re-reading data from disk

Example:

val cachedDF = df.cache()
cachedDF.count()  // Storage memory used to keep DF in memory
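
To make the LRU eviction described above concrete, here is a toy block cache built on an access-ordered java.util.LinkedHashMap. BlockCache and its methods are hypothetical names for illustration, not Spark's API, and real eviction follows extra rules that this sketch ignores.

```scala
import java.util.{LinkedHashMap => JLinkedHashMap}
import scala.collection.mutable.ListBuffer

// Toy block cache with LRU eviction, tracking sizes only (no real data).
final class BlockCache(capacityBytes: Long) {
  // accessOrder = true keeps iteration order least-recently-used first.
  private val blocks = new JLinkedHashMap[String, java.lang.Long](16, 0.75f, true)
  private var used = 0L

  def usedBytes: Long = used
  def contains(id: String): Boolean = blocks.containsKey(id)

  // Reading a block marks it as most recently used; returns false on a miss.
  def touch(id: String): Boolean = blocks.get(id) != null

  // Insert a block, evicting least-recently-used blocks until it fits.
  // Returns the ids that were evicted, oldest first.
  def put(id: String, sizeBytes: Long): List[String] = {
    require(sizeBytes >= 0 && sizeBytes <= capacityBytes, "block cannot fit in the cache")
    val previous = blocks.remove(id)
    if (previous != null) used -= previous.longValue
    val evicted = ListBuffer.empty[String]
    val it = blocks.entrySet().iterator()
    while (used + sizeBytes > capacityBytes && it.hasNext) {
      val entry = it.next()
      used -= entry.getValue.longValue
      evicted += entry.getKey
      it.remove()
    }
    blocks.put(id, java.lang.Long.valueOf(sizeBytes))
    used += sizeBytes
    evicted.toList
  }
}
```

With a 100-byte cache holding blocks a and b (40 bytes each), touching a and then adding a 40-byte block c evicts b, because b is now the least recently used.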

🔹 3. Key Difference

| Feature | Execution Memory | Storage Memory |
|---|---|---|
| Purpose | For computations (shuffle, join, aggregation) | For caching/persisting data, broadcast variables |
| Managed by Spark? | Yes (part of unified memory pool) | Yes (part of unified memory pool) |
| Borrowable? | Can borrow from storage (free space first, then by evicting cached blocks) | Can borrow from execution only while that memory is free |
| When full | Spills to disk | LRU eviction to free memory for execution |
| Examples | groupBy, sort, join buffers | df.cache(), broadcast joins |

🔹 4. Unified Memory Pool

Spark >= 1.6 uses unified memory management:

+--------------------------------------------------+
| Unified Memory Pool (spark.memory.fraction ~60%) |
|--------------------------------------------------|
| Execution Memory     <------>     Storage Memory |
+--------------------------------------------------+
  • Execution can borrow from storage memory if needed.
  • Storage can use free execution memory.
  • Prevents fixed partitioning of memory and reduces OOMs.
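
The borrowing rules can be sketched as a small bookkeeping model. UnifiedPool is a hypothetical class, not Spark's memory manager, and it assumes storage keeps a protected share that execution cannot evict. It also leaves out details such as storage evicting its own older blocks to make room for new ones.

```scala
// Toy model of the unified pool: tracks byte counts only.
final class UnifiedPool(totalBytes: Long, protectedStorageBytes: Long) {
  require(protectedStorageBytes >= 0 && protectedStorageBytes <= totalBytes)

  private var execution = 0L
  private var storage = 0L

  def executionUsed: Long = execution
  def storageUsed: Long = storage
  def free: Long = totalBytes - execution - storage

  // Storage may only take memory that is currently free; it never evicts execution.
  def acquireStorage(bytes: Long): Boolean =
    if (bytes <= free) { storage += bytes; true } else false

  // Execution takes free memory first, then evicts cached blocks, but only
  // the part of storage that exceeds its protected share.
  // Returns the bytes granted; any shortfall is what a task would have to spill.
  def acquireExecution(bytes: Long): Long = {
    val evictable = math.max(0L, storage - protectedStorageBytes)
    val granted   = math.min(bytes, free + evictable)
    val toEvict   = math.max(0L, granted - free)
    storage -= toEvict
    execution += granted
    granted
  }
}
```

In a 1000-byte pool with 500 bytes protected, caching 800 bytes and then requesting 400 bytes for execution evicts 200 bytes of cache. A further 300-byte request is granted only 100 bytes, because storage is then down to its protected share.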

✅ In short:

  • Execution Memory → used temporarily for computations like shuffles, joins, sorts.
  • Storage Memory → used to store cached RDDs/DataFrames or broadcast variables for reuse.
  • Both share the same unified memory pool.

Perfect! Let's visualize Execution Memory vs Storage Memory in Spark with a clear diagram and examples.


🔹 Spark Executor Memory Layout (Simplified)

Executor JVM Heap
+---------------------------------------------------------+
| Reserved Memory (~300MB)                                |
|---------------------------------------------------------|
| User Memory (UDFs, accumulators, temp objects)          |
|---------------------------------------------------------|
| Unified Memory Pool (~60% of heap after reserve)        |
|   +---------------------+  +--------------------------+ |
|   | Execution Memory    |  | Storage Memory           | |
|   |---------------------|  |--------------------------| |
|   | Used for:           |  | Used for:                | |
|   | - Shuffle buffers   |  | - Cached RDD/DataFrames  | |
|   | - Join/hash tables  |  | - Broadcast variables    | |
|   | - Aggregations      |  | - Unrolled blocks        | |
|   | If full -> spills   |  | If needed -> evict LRU   | |
|   +---------------------+  +--------------------------+ |
+---------------------------------------------------------+

🔹 Examples of Memory Usage

| Operation / Action | Memory Used | Notes |
|---|---|---|
| df.groupBy("state").agg(sum("revenue")) | Execution Memory | Hash map for aggregation stored here. If too large → spill to disk. |
| df.sort("date") | Execution Memory | Sort buffers stored in memory before writing or returning results. |
| df.cache() | Storage Memory | Cached DataFrame resides here for reuse. |
| broadcast(df) | Storage Memory | Broadcasted DataFrame for joins stored here. |
| Temporary object inside a UDF | User Memory | Not managed by Spark's unified memory. |

🔹 Unified Memory Behavior

  • Execution can borrow from storage: it uses free storage space first, then evicts cached blocks if it still needs room.
  • Storage can borrow from free execution memory if execution isn't using it.
  • Helps prevent OOM errors and improves memory efficiency.

🔹 Quick Visual Summary

Execution Memory   <----> Storage Memory
 (shuffle, join)         (cache, broadcast)
      |                        |
      v                        v
  spills to disk           evict LRU

✅ Key Takeaways:

  • Execution Memory: Temporary, computation-related, spills to disk if needed.
  • Storage Memory: Persistent, caching/broadcast, evictable.
  • Unified Memory Pool: Flexible sharing to reduce memory pressure.

When can we neither spill to disk nor evict storage memory? Link

Would you like me to also give you a real-world scenario of an executor OOM and show which type of memory usually causes it (shuffle-heavy job vs cache-heavy job vs UDF-heavy job)?