Spark RDD

Lecture 13: Resilient Distributed Dataset

Data Storage of List

The elements of a list are stored in a single contiguous location in memory.

Data Storage in RDD

Suppose we have 500 MB of data and a partition size of 128 MB. The data is split into 4 partitions: three full 128 MB partitions and a fourth one holding the remaining 116 MB.
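The partition count above is simple arithmetic, and a minimal sketch in plain Python can make it concrete. This is not Spark code; the helper name `partition_sizes` is hypothetical and only models cutting the data into fixed-size chunks.

```python
def partition_sizes(total_mb, partition_mb=128):
    """Cut total_mb into chunks of at most partition_mb (all sizes in MB).

    Hypothetical helper for illustration; it is not part of Spark.
    """
    sizes = []
    remaining = total_mb
    while remaining > 0:
        chunk = min(partition_mb, remaining)  # last chunk may be smaller
        sizes.append(chunk)
        remaining -= chunk
    return sizes


sizes = partition_sizes(500)
print(len(sizes), sizes)  # 4 [128, 128, 128, 116]
```

The last partition is smaller than the others, which is why 500 MB needs 4 partitions rather than 3.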

The data is scattered across various executors.

It's not in a single contiguous location like the elements of a list. The data structure used to process this distributed data is called an RDD (Resilient Distributed Dataset).

Why is RDD recoverable?

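  • RDD is immutable. A transformation never changes an existing RDD: if we apply multiple filters, the result of each filter is a new, separate dataset.

  • Spark records the lineage, i.e. the chain of transformations that produced each RDD. If rdd2 is lost, it can be recomputed from rdd1 by replaying the step that created it.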
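To see how immutability and lineage work together, here is a toy model in plain Python. It is only a sketch and not the Spark API: the class `ToyRDD` and its method `lose_data` are hypothetical names, and the "failure" is simulated by discarding the materialized result.

```python
class ToyRDD:
    """Plain-Python stand-in for an RDD (hypothetical, not the Spark API)."""

    def __init__(self, source=None, parent=None, predicate=None):
        self._source = tuple(source) if source is not None else None
        self.parent = parent        # lineage: the dataset this one came from
        self.predicate = predicate  # lineage: the filter that produced it
        self._cache = None          # materialized rows; these can be lost

    def filter(self, predicate):
        # Never modifies self; returns a new dataset that remembers its parent.
        return ToyRDD(parent=self, predicate=predicate)

    def collect(self):
        if self._source is not None:
            return list(self._source)
        if self._cache is None:
            # Recompute from the parent by replaying the recorded filter.
            self._cache = [x for x in self.parent.collect() if self.predicate(x)]
        return list(self._cache)

    def lose_data(self):
        # Simulate a failure that wipes the materialized rows.
        self._cache = None


rdd0 = ToyRDD(source=range(10))
rdd1 = rdd0.filter(lambda x: x % 2 == 0)  # new dataset, rdd0 is untouched
rdd2 = rdd1.filter(lambda x: x > 4)       # new dataset, rdd1 is untouched

before = rdd2.collect()
rdd2.lose_data()
after = rdd2.collect()  # rebuilt from rdd1 through the lineage
print(before, after)  # [6, 8] [6, 8]
```

Because no dataset is ever modified in place, replaying the recorded filter on the parent is guaranteed to give the same rows again.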

Disadvantage of RDD

  • Spark does no optimization on RDD code. The developer must explicitly specify how to optimize it.

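Advantages of RDD

  • Works well with unstructured data that has no rows and columns or key-value pairs.
  • RDD is type safe: in compiled languages such as Scala and Java, type errors are caught at compile time, whereas with the DataFrame API many errors (for example a misspelled column name) only show up at runtime.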

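Avoiding RDDs

  • RDD: we spell out how to do it, step by step. DataFrame API: we just specify what to do, and Spark decides how.

Take a job that has a join and a filter. With RDDs, if we write the join first and then the filter, Spark runs the steps exactly in that order: the join triggers a shuffle of the full data, and only afterwards are rows filtered out. This is not beneficial, because filtering first would shrink the data before the shuffle.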
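The cost of the join-then-filter order can be illustrated with a small plain-Python model. This is a sketch under stated assumptions, not Spark code: the data, the `hash_join` helper, and the idea of counting every input row of the join as "moved" are all hypothetical simplifications of what a shuffle does.

```python
# Hypothetical data: (customer_id, country) and (customer_id, amount).
customers = [(i, "IN" if i % 100 == 0 else "US") for i in range(10000)]
orders = [(i, 10 * i) for i in range(1000)]


def hash_join(left, right):
    """Join on the first field; also report how many rows were moved.

    Assumption: every input row of the join counts as one shuffled row.
    """
    moved = len(left) + len(right)
    index = {}
    for key, value in left:
        index.setdefault(key, []).append(value)
    joined = [(key, lv, rv) for key, rv in right for lv in index.get(key, [])]
    return joined, moved


# Plan A (as written by hand): join first, then filter.
joined_a, moved_a = hash_join(customers, orders)
result_a = [row for row in joined_a if row[1] == "IN"]

# Plan B: filter first, then join the much smaller input.
india = [c for c in customers if c[1] == "IN"]
result_b, moved_b = hash_join(india, orders)

print(moved_a, moved_b, result_a == result_b)  # 11000 1100 True
```

Both plans return the same rows, but the second one moves far fewer rows through the join. With RDDs the developer has to choose this order by hand, while the DataFrame API lets Spark's optimizer push the filter ahead of the join on its own.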