

Lecture 10: DAG and Lazy Evaluation in Spark


  • For every action, a new job is created. Here there are four actions: read, inferSchema, sum, and show.
  • When used with groupBy().sum(), sum is considered an action because it triggers computation to aggregate data across partitions and produce a result. This forces Spark to execute the transformations leading up to it, effectively creating a job.
  • When used as a column expression, such as df.select(sum("value")), sum acts more like a transformation, especially as part of a larger query or pipeline that does not immediately trigger execution. In that case it only defines the operation; no job is created until an action like show() or collect() is called (see the sketch below).
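A minimal sketch of this difference, assuming a local SparkSession; the toy data and the column name "value" are illustrative:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum as spark_sum

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Toy data; the column name "value" is illustrative.
df = spark.createDataFrame([(1,), (2,), (3,)], ["value"])

# Column-expression sum: a transformation only.
# Defining it does not launch a job.
agg = df.select(spark_sum("value"))

# show() is the action that triggers execution,
# so only now does a job appear in the Spark UI.
agg.show()
```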

  • Job for reading the file: Whole-Stage Codegen generates Java bytecode.

  • inferSchema: a separate job scans the file to infer column types.

  • GroupBy and Count: as explained above, this is treated as an action.

  • Show: the final action, which displays the DataFrame. A sketch of this whole pipeline follows the list.
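A sketch of a pipeline that would produce these jobs, assuming a local session; the file path and column name below are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# inferSchema=True forces an extra pass over the file, so jobs
# appear in the Spark UI for the read/schema-inference steps alone.
# The path "data/sales.csv" and column "country" are hypothetical.
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("data/sales.csv"))

# groupBy + count, then show: triggers the remaining jobs.
df.groupBy("country").count().show()
```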

After we read the CSV and infer the schema, no new jobs are created, because filter and repartition are both transformations, not actions.
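A minimal sketch of this, with toy data and an illustrative column name:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame([(1, 50), (2, 150)], ["id", "amount"])  # toy data

# Transformations only: neither line creates a job.
filtered = df.filter(df["amount"] > 100)
repartitioned = filtered.repartition(4)

# A job is created only once an action runs.
repartitioned.show()
```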

Now consider the case where there are two filters on the same dataset.


A single job is created for this pipeline.
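A sketch of the two-filter case, again with illustrative data:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame([(1, 50), (2, 150), (3, 250)], ["id", "amount"])

# Two filters chained on the same dataset: both are transformations,
# so nothing executes yet.
two_filters = df.filter(df["amount"] > 100).filter(df["id"] < 3)

# A single action produces a single job covering both filters.
two_filters.show()
```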

Optimizations on the Filter

Both filters run in the same task. This optimization is possible because Spark is lazily evaluated: the optimizer sees the whole plan before anything executes, so it can combine the two filters into a single operation.
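One way to see this (a sketch, with illustrative data) is to print the physical plan with explain(): because the plan is built lazily, the Catalyst optimizer fuses the adjacent filters into one predicate before execution.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame([(1, 50), (2, 150), (3, 250)], ["id", "amount"])

chained = df.filter(df["amount"] > 100).filter(df["id"] < 3)

# explain() prints the physical plan. The two filters appear as a
# single combined Filter node, roughly:
#   Filter ((amount > 100) AND (id < 3))
chained.explain()
```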