Spark Repartition Vs Coalesce

Lecture 19: Repartitioning and Coalesce

Suppose we have 5 partitions and one of them is heavily skewed at 100MB, say because it holds the records for the best-selling product. This partition takes much longer to process than the others, so the other executors sit idle until the executor handling it finishes.
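To see why one skewed partition holds up the whole stage, here is a minimal plain-Python sketch (not Spark code). Only the 100MB skewed partition comes from the example above; the other partition sizes and the processing rate of 10MB per second are assumed values chosen for illustration.

```python
# Hypothetical partition sizes in MB; only the 100MB partition is from the notes.
partition_sizes_mb = [10, 20, 30, 40, 100]

# Assumed processing rate per executor, for illustration only.
MB_PER_SECOND = 10

# Each executor processes one partition, all in parallel.
task_seconds = [size / MB_PER_SECOND for size in partition_sizes_mb]

# The stage finishes only when the slowest task finishes.
stage_seconds = max(task_seconds)

# Time each executor spends waiting for the skewed partition.
idle_seconds = [stage_seconds - t for t in task_seconds]

print("task times:", task_seconds)
print("stage time:", stage_seconds)
print("idle times:", idle_seconds)
```

Under these assumptions the stage takes as long as the skewed task alone, and every other executor spends most of that time idle.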

Repartitioning vs Coalesce

Repartitioning

Suppose we have the partitions above. If we call repartition(5), the data is shuffled into 5 partitions of roughly equal size, each holding about one fifth of the total (for example, 40MB per partition if the total is 200MB).

Coalesce

With coalesce, the data is not split evenly across the new partitions; instead, existing partitions are merged together.

Coalesce avoids a full shuffle, whereas repartitioning always shuffles the data.
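The difference can be sketched with a small plain-Python model. This is a simplified illustration, not Spark's actual implementation: the round-robin dealing in `repartition`, the contiguous grouping rule in `coalesce`, and the partition sizes are all assumptions made for the example.

```python
def repartition(partitions, n):
    """Model a full shuffle: deal every row round-robin into n new partitions."""
    rows = [row for part in partitions for row in part]
    new_parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        new_parts[i % n].append(row)
    return new_parts


def coalesce(partitions, n):
    """Model a merge without a shuffle: whole partitions are concatenated."""
    n = min(n, len(partitions))  # merging cannot create more partitions
    new_parts = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        # Assumed grouping rule: contiguous runs of existing partitions.
        new_parts[i * n // len(partitions)].extend(part)
    return new_parts


# Hypothetical skewed input: row counts stand in for partition sizes.
sizes = [10, 20, 30, 40, 100]
parts = [[f"p{i}-r{j}" for j in range(s)] for i, s in enumerate(sizes)]

print([len(p) for p in repartition(parts, 5)])  # [40, 40, 40, 40, 40]
print([len(p) for p in coalesce(parts, 2)])     # [60, 140]
```

In this model `repartition` touches every row to balance the sizes, while `coalesce` only concatenates whole partitions, so the skew carries over into the result.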

Pros and Cons of Repartitioning and Coalesce

Pro of repartitioning: the data is evenly distributed across partitions.
Con of repartitioning: the shuffle requires more I/O operations, so it is expensive.
Con of coalesce: the data can end up unevenly distributed.

Repartitioning can increase or decrease the number of partitions, but coalescing can only decrease it.

How to get the number of partitions?

flight_df.rdd.getNumPartitions() returns the current number of partitions. We can then repartition with flight_df.repartition(4), which returns a new DataFrame with 4 partitions and the data evenly distributed across them.

Repartitioning based on columns

Since we asked for 300 partitions but have only 255 records, at least 45 of the partitions will be empty.
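The empty partitions follow from simple counting, which the plain-Python sketch below illustrates. It is a model only: CRC32 stands in for Spark's own hash function, and the keys (one distinct value per record) are hypothetical.

```python
import zlib

NUM_PARTITIONS = 300
NUM_RECORDS = 255


def partition_for(key, num_partitions):
    """Stand-in hash partitioner (Spark uses its own hash function, not CRC32)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


# Hypothetical records: one distinct key per record.
counts = [0] * NUM_PARTITIONS
for key in range(NUM_RECORDS):
    counts[partition_for(key, NUM_PARTITIONS)] += 1

empty = sum(1 for c in counts if c == 0)
print(f"non-empty: {NUM_PARTITIONS - empty}, empty: {empty}")
```

With 255 records there can be at most 255 non-empty partitions. Hash collisions place several keys in the same partition, so in practice more than 45 are usually empty.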

Coalescing

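Suppose we have 8 partitions and we coalesce them into 3. Unlike repartition, which can also take columns, coalesce takes only one argument: the target number of partitions.

The result is an uneven distribution of data across the partitions.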
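A quick plain-Python calculation shows why merging 8 partitions into 3 ends up uneven even when the input is perfectly balanced. The 10MB partition size and the contiguous grouping rule are assumptions for illustration; Spark chooses which partitions to merge by its own logic.

```python
def coalesce_sizes(sizes, n):
    """Merge partition sizes into at most n groups of contiguous partitions."""
    n = min(n, len(sizes))
    merged = [0] * n
    for i, size in enumerate(sizes):
        merged[i * n // len(sizes)] += size
    return merged


# Eight hypothetical partitions of 10MB each, merged into three.
print(coalesce_sizes([10] * 8, 3))  # [30, 30, 20]
```

Because 8 does not divide evenly by 3, one merged partition is smaller than the others. If the input partitions are already skewed, the imbalance grows accordingly.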