What Actually Happens in coalesce()π
coalesce(n) is a narrow transformation, but that does not mean zero data movement.
It means:
- No full shuffle across all partitions
- But some data may still be reassigned/moved
How Coalesce Reduces Partitionsπ
Spark:
- Selects a subset of existing partitions as targets
- Maps multiple old partitions β fewer new partitions
- Avoids re-hashing / global redistribution
Exampleπ
Before (4 partitions across executors)π
After coalesce(2)π
Possible mapping:
Does Data Move?π
Case 1: Ideal (Same Executor)π
If:
- P1 and P2 are already on same executor
Then:
- Minimal or no data movement
Case 2: Different Executorsπ
If:
- P1 is on Executor 1
- P2 is on Executor 2
Then:
- One partitionβs data must move to the other executor
Key Insightπ
coalesce()avoids global shuffle, but does not guarantee data locality
- It does not rebalance data evenly
- It just collapses partitions with minimal coordination
Why Itβs Still Faster Than repartition()π
coalesce()π
- Moves only necessary partitions
- No re-hashing
- No all-to-all communication
repartition()π
- Every partition sends data to every other partition
- Full network shuffle
Internal Behavior (Important)π
- Uses a Partition Coalescer
- Groups partitions into fewer buckets
-
Tries to:
-
Prefer data locality
- Minimize cross-node transfer
But:
- No strict guarantee
Visual Comparisonπ
coalesce(2)π
(Limited movement)
repartition(2)π
P1 ββ¬ββ P1'
P2 ββΌββ P1'
P3 ββΌββ P1'
P4 ββ
P1 ββ¬ββ P2'
P2 ββΌββ P2'
P3 ββΌββ P2'
P4 ββ
(All-to-all shuffle)
Final Answerπ
coalesce()does not strictly merge partitions within the same executor. It may move data across executors, but only as needed, avoiding a full shuffle and minimizing network I/O.
Interview One-Linerπ
coalesce()minimizes data movement but does not eliminate it; it can move data across executors, just without triggering a full shuffle likerepartition().
Hereβs a real-world case study that clearly shows how coalesce() behaves (including data movement) and why itβs chosen over repartition() in practice.
Case Study: Databricks ETL Pipeline (Event Logs β Delta Lake)π
Scenarioπ
Youβre a Data Engineer processing clickstream / event data:
- Source: Kafka β Bronze table (raw JSON)
- Processing: Spark (Databricks)
- Output: Delta table in S3 / ADLS
- Volume: ~500 GB/day
- Initial partitions: 2000 partitions
Step 1: Raw Data Loadπ
After ingestion:
Why so many partitions?
- Auto Loader / Kafka ingestion
- Small micro-batches
- High parallelism
Step 2: Transform + Filterπ
Now:
- Data volume drops to ~50 GB
- But partitions are still 2000
Problemπ
If you write directly:
You get:
- 2000 small files
- Poor query performance
- Metadata overhead
- Slow downstream reads
Step 3: Optimize with coalesce()π
What Actually Happens Internallyπ
Beforeπ
Coalesce Mapping (Example)π
New Partition 1 β P1 + P2 + ... + P20
New Partition 2 β P21 + ... + P40
...
New Partition 100 β ...
Important: Does Data Move?π
Yes β but selectivelyπ
Case A: Same Executorπ
If P1βP20 are on same executor:
- No network movement
- Just merged locally
Case B: Different Executorsπ
If partitions are spread:
- Some partitions must move across executors
-
But:
-
No full shuffle
- No all-to-all exchange
Why Not Use repartition(100)?π
This would:
- Trigger full shuffle
- Every partition sends data everywhere
- Massive network I/O for 50 GB
Performance Comparison (Realistic)π
| Operation | Shuffle | Time | Network |
|---|---|---|---|
| No change (2000 files) | None | Fast write, slow reads | Low |
coalesce(100) |
Minimal | Fast | Low |
repartition(100) |
Full | Slow | High |
Step 4: Write Optimized Outputπ
Result:
- ~100 files (~500 MB each)
-
Much better for:
-
Query performance
- File pruning
- Metadata handling
Where This Matters in Real Systemsπ
1. Delta Lake Optimizationπ
In platforms like Databricks:
- Small file problem is very common
coalesce()is frequently used before writes
2. Daily Batch Pipelinesπ
Typical pattern:
3. Cost Optimizationπ
- Less shuffle β less compute time
- Less network β lower cloud cost
- Fewer files β faster queries
When This Strategy Failsπ
Skewed Data Exampleπ
If:
Then:
coalesce()may create uneven partitions- One task becomes a bottleneck
In such cases:
is better
Key Takeawaysπ
1. coalesce() in real lifeπ
- Used to reduce output files
- Common before writes
2. Data Movement Realityπ
It can move data across executors, but avoids full shuffle
3. Why Engineers Prefer Itπ
- Faster than repartition
- Good enough distribution for writes
- Minimizes shuffle cost
Letβs build a clean, first-principles understanding of skew, partitions, tasks, and how Spark actually executes workβthe way you should think about it in interviews and real systems.
1. Fundamental Building Blocksπ
Partitionπ
- A logical chunk of data
- Smallest unit Spark can process independently
Taskπ
- A unit of computation on one partition
1 partition β 1 task
Executorπ
- A JVM process with CPU + memory
- Runs multiple tasks (based on available cores)
2. How Spark Actually Executes a Jobπ
Suppose you have:
Total parallel capacity:π
Execution:π
- Spark launches 20 tasks in parallel
- Remaining tasks wait in queue
3. Where Skew Comes Fromπ
Ideal Case (Balanced)π
All tasks:
- Start together
- Finish roughly together
Cluster utilization = high
Skewed Case (Problem)π
Execution:
Task 1 β 10MB β finishes fast
Task 2 β 10MB β finishes
Task 3 β 10MB β finishes
Task 4 β 500MB β runs long
Now:
- 3 executors become idle
- Only 1 executor is busy
This is called a straggler task
4. Why Spark Cannot Fix This Automaticallyπ
Because:
- Each partition is independent
- A task cannot be split across executors
- Spark cannot βstealβ work from a running task
So one big partition = one long-running task
5. Where Skew Happens Most Oftenπ
5.1 Joins (Most common)π
Spark:
- Shuffles data by
user_id - Same keys go to same partition
Problem:π
β One partition becomes huge
5.2 Aggregationsπ
If:
β One partition gets overloaded
5.3 Repartition by keyπ
Same issue if key distribution is uneven
6. How Shuffle Creates Skewπ
During shuffle:
- Data is redistributed based on key
- Hash function assigns key β partition
If one key is frequent:
β That partition gets most of the data
7. Diagnosing Skewπ
In Spark UI:
- One task takes much longer than others
-
Large difference in:
-
Input size
- Shuffle read size
- Duration
This is a clear sign of skew
8. Solutions (Deep Understanding)π
8.1 Repartition (General balancing)π
- Full shuffle
- Redistributes rows more evenly
But:
- Does NOT fix skew caused by a single heavy key
8.2 Salting (Best for key skew)π
Idea:
Break one heavy key into multiple smaller keys
Example:π
Instead of:
Make:
Code:
from pyspark.sql.functions import col, rand
df = df.withColumn("salt", (rand()*10).cast("int"))
df = df.withColumn("user_id_salted", concat(col("user_id"), col("salt")))
Now:
- Data spreads across partitions
- No single partition is overloaded
8.3 Broadcast Join (Best when one table is small)π
- No shuffle on large dataset
- Eliminates skew entirely
8.4 Skew Join Optimization (Spark 3+)π
Spark can automatically:
- Detect skewed partitions
- Split large partitions
Config:
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
8.5 Coalesce (Not for skew)π
- Reduces partitions
- No full shuffle
- Does NOT rebalance data
9. Repartition vs Coalesce (Final Clarity)π
| Feature | Repartition | Coalesce |
|---|---|---|
| Shuffle | Yes | No (mostly) |
| Balancing | Yes | No |
| Increase partitions | Yes | No |
| Decrease partitions | Yes | Yes |
| Fix skew | Partial | No |
10. Final Mental Modelπ
Think like this:
- Spark = parallel system
- Parallelism depends on number of partitions
- Efficiency depends on balanced partitions
The goal is not just βmore partitionsβ but evenly sized partitions
11. One-Line Summaryπ
Skew occurs when one or few partitions contain disproportionately large data, causing long-running tasks (stragglers) and underutilized executors, and it must be handled using techniques like repartitioning, salting, broadcast joins, or adaptive execution.

Example Use Cases
Repartition for Large Joins:
Ensures both datasets have balanced partitions before joining. Coalesce Before Writing to Disk:
Reduces shuffle and writes fewer output files.
Repartitioning by column nameπ
When you use .repartition(column_name), Spark redistributes the data based on the values in the specified column. This helps when working with data that needs to be grouped by a specific column before performing operations like joins or aggregations.
Syntax:
- This will shuffle data so that all rows with the same category value end up in the same partition.
- The number of partitions will be decided by Spark's default settings (typically based on cluster size and data volume).
Repartition with Both Column & Number of Partitions
You can also specify both a column name and a number of partitions: df.repartition(10, "category") # Redistributes by 'category' into 10 partitions
Spark will first hash partition the data based on category values. Then, it will further distribute the data into 10 partitions.
This ensures a controlled number of partitions while still grouping data by column.