Spark DAG Lazy Eval
Lecture 10: DAG and Lazy Evaluation in Spark🔗
- For every action there is a new job. Here there are four actions: read, inferSchema, sum, and show.
- When used with groupBy().sum(), it is considered an action because it triggers computation to aggregate data across partitions and produce a result, forcing Spark to execute the transformations leading up to it and effectively creating a job.
- When used as a column expression, e.g. df.select(sum("value")), it acts more like a transformation: it only defines the operation and does not create a job until an action such as show() or collect() is called (see the sketch after this list).
- Job for reading the file
    - Whole-Stage Codegen: Spark generates Java bytecode for the query plan
- Job for inferSchema
- Job for groupBy and count: as explained above, this is treated as an action
- Job for show: the final action, which displays the DataFrame
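A minimal PySpark sketch of the pipeline these notes describe (the file path and column names are assumptions, not from the lecture); the comments mark where a job typically appears in the Spark UI:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum as _sum

spark = SparkSession.builder.appName("dag-lazy-eval").getOrCreate()

# Reading with inferSchema=True triggers jobs, because Spark
# must scan the data to sample the column types.
df = (spark.read
      .option("header", True)
      .option("inferSchema", True)
      .csv("data/transactions.csv"))   # hypothetical path

# No job yet: this only builds the logical plan.
agg = df.groupBy("category").sum("value")

# Job: show() is the action that forces execution of the whole DAG.
agg.show()

# By contrast, sum() as a column expression is lazy on its own:
lazy_total = df.select(_sum("value"))  # no job created here
lazy_total.show()                      # the job appears only now
```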
After we read the CSV and infer the schema, no further jobs are created, since filter and repartition are both transformations, not actions.
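For example, a short sketch reusing the hypothetical df from above: filter and repartition only extend the DAG, and a job appears only once an action runs.

```python
# Transformations only: no job shows up in the Spark UI for these lines.
filtered = df.filter(df.value > 100).repartition(8)

# A job is created only when an action is called.
filtered.count()
```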
When there are two filters on the same dataset, the result is still a single job.
Optimizations on the Filter🔗
Both filters execute within the same task. These optimizations are possible because Spark is lazily evaluated: since nothing runs until an action is called, the optimizer can see the full plan and combine the filters before execution.
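A brief sketch of this, under the same hypothetical schema: because nothing executes until an action, Catalyst can merge the two filter() calls into a single Filter node in the physical plan, so both predicates run inside the same task.

```python
two_filters = (df.filter(df.value > 100)
                 .filter(df.category == "books"))

# The physical plan shows one combined predicate, roughly
# Filter ((value > 100) AND (category = 'books')),
# executed inside a single whole-stage-codegen'd task.
two_filters.explain()
```

Calling explain() before any action is a convenient way to see this merging without triggering a job.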