Spark Dynamic Partition Pruning
Lecture 32: Dynamic Partition Pruning
Here a filter is applied to select only the data for 19 April 2023.
As a result, only one file, the one for 19 April 2023, is read instead of all of them.
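The effect can be pictured with a small plain-Python model of a Hive-style partitioned layout, one `date=...` directory per day, which is the layout Spark produces when a table is written with `partitionBy`. This is only a sketch of the idea, not Spark's implementation; the dates, file names, and the helpers `build_demo_table` and `files_to_read` are made up for illustration.

```python
import tempfile
from pathlib import Path


def build_demo_table(root: Path, days: list[str]) -> None:
    """Create one partition directory and one (empty) data file per day."""
    for day in days:
        part = root / f"date={day}"
        part.mkdir(parents=True)
        (part / "part-00000.parquet").touch()


def files_to_read(root: Path, column: str, value: str) -> list[Path]:
    """Return only the files inside the partition directory column=value."""
    partition_dir = root / f"{column}={value}"
    if not partition_dir.is_dir():
        return []
    return sorted(p for p in partition_dir.iterdir() if p.is_file())


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    build_demo_table(root, ["2023-04-18", "2023-04-19", "2023-04-20"])
    # The filter value picks a single directory; the others are never opened.
    selected = files_to_read(root, "date", "2023-04-19")
    selected_names = [p.relative_to(root).as_posix() for p in selected]
```

Because the filter value is known before the query runs, the engine can decide which directories to skip while planning. That is what makes this case "static" pruning.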
DPP with 2 tables
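When two tables are joined and the filter is on the second table, partition pruning does not happen on the first table; the filter only takes effect on table 2. Dynamic Partition Pruning (DPP) lets Spark update the filter on the first table at runtime.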
Two conditions must hold:
- The data in the first table should be partitioned on the join key.
- The second table should be broadcast.
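The second condition depends on session settings, so a small helper like the one below can make them explicit. It is a minimal sketch: the function name `enable_dpp` and the 10 MB threshold are assumptions for illustration, and in Spark 3.x these values are normally already the defaults.

```python
def enable_dpp(spark, broadcast_threshold_bytes=10 * 1024 * 1024):
    """Set the session options that DPP relies on.

    `spark` is expected to be a pyspark.sql.SparkSession.
    """
    conf = spark.conf
    # Turn the optimization itself on.
    conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")
    # Only prune when the broadcast side of the join can be reused for the filter.
    conf.set(
        "spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly", "true"
    )
    # Tables smaller than this many bytes are eligible for a broadcast join.
    conf.set("spark.sql.autoBroadcastJoinThreshold", str(broadcast_threshold_bytes))
```

With a live session this would be called as `enable_dpp(spark)`. Setting the first option to `"false"` instead is a simple way to reproduce the "without DPP" behaviour for comparison.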
Without Dynamic Partition Pruning
A total of 123 files are read from the first table, not one as in the previous case.
With Dynamic Partition Pruning
The smaller dimdate table is broadcast and a hash join is performed. Only 3 files are read this time.
At runtime a subquery is run...
Now, because of the runtime filter, only 4 partitions are read/scanned.
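The runtime behaviour can be modelled in a few lines of plain Python. This is a toy sketch and not Spark's implementation: the two tables, the `is_holiday` column, and the function `join_with_holidays` are invented for illustration, and the counts it produces are unrelated to the file counts quoted in this lecture.

```python
# Fact table, partitioned by date: partition value -> rows in that partition.
fact_partitions = {
    "2023-04-17": [("2023-04-17", 10)],
    "2023-04-18": [("2023-04-18", 20)],
    "2023-04-19": [("2023-04-19", 30)],
    "2023-04-20": [("2023-04-20", 40)],
}

# Small dimension table: (date, is_holiday).
dim_date = [
    ("2023-04-17", False),
    ("2023-04-18", True),
    ("2023-04-19", False),
    ("2023-04-20", True),
]


def join_with_holidays(fact, dim, prune):
    """Join fact rows to holiday dates; return (result rows, partitions scanned)."""
    # Step 1: filter the small table and collect its join keys.
    # In Spark these keys would come from the broadcast side of the join.
    holiday_keys = {day for day, is_holiday in dim if is_holiday}

    # Step 2: decide which fact partitions to scan.
    if prune:
        scanned = [p for p in fact if p in holiday_keys]  # runtime filter
    else:
        scanned = list(fact)  # no filter on the fact table: read everything

    # Step 3: hash join on the partitions that were actually read.
    rows = [row for p in scanned for row in fact[p] if row[0] in holiday_keys]
    return rows, scanned


rows_full, scanned_full = join_with_holidays(fact_partitions, dim_date, prune=False)
rows_dpp, scanned_dpp = join_with_holidays(fact_partitions, dim_date, prune=True)
```

Both calls return the same joined rows, but the pruned version touches only the partitions whose key survived the filter on the small table. The filter value is not known until the small table has been read, which is why this pruning has to happen at runtime.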