Spark RDD
Lecture 13: Resilient Distributed Dataset
Data Storage of List
Data Storage in RDD
Suppose we have 500MB of data and a 128MB partition size; then we will have ⌈500 / 128⌉ = 4 partitions.
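The partition count above is just the data size divided by the partition size, rounded up. A minimal sketch of the arithmetic (plain Python, not a Spark API call):

```python
import math

data_size_mb = 500       # total data size
partition_size_mb = 128  # partition (block) size

# number of partitions = data size / partition size, rounded up
num_partitions = math.ceil(data_size_mb / partition_size_mb)
print(num_partitions)  # 4
```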
The data is scattered across various executors. It is not in a single contiguous location like the elements of a list. The data structure used to process this distributed data is called an RDD (Resilient Distributed Dataset).
Why is RDD recoverable?
- RDD is immutable. If we apply multiple filters, each dataset produced after a filter is a new, separate dataset; the original is never modified.
- Spark records the lineage of transformations. If rdd2 (derived from rdd1) is lost, Spark can recompute it from rdd1 using that lineage.
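The two points above can be sketched in plain Python. This is an analogy, not the Spark API: a `FakeRDD` records its parent and the transformation that produced it, so a lost child can be rebuilt, and filtering never mutates the original.

```python
# Plain-Python analogy of RDD immutability and lineage (not the Spark API).

class FakeRDD:
    def __init__(self, data=None, parent=None, transform=None):
        self._data = data          # materialized data (may be "lost")
        self.parent = parent       # lineage: which dataset this came from
        self.transform = transform # lineage: how it was derived

    def filter(self, predicate):
        # Immutability: filtering returns a NEW dataset; self is unchanged.
        child_data = [x for x in self._data if predicate(x)]
        return FakeRDD(child_data, parent=self, transform=predicate)

    def recompute(self):
        # Recovery: rebuild this dataset from its parent via the lineage.
        return [x for x in self.parent._data if self.transform(x)]

rdd1 = FakeRDD([1, 2, 3, 4, 5, 6])
rdd2 = rdd1.filter(lambda x: x % 2 == 0)

rdd2._data = None            # simulate losing rdd2
print(rdd2.recompute())      # [2, 4, 6] — rebuilt from rdd1
print(rdd1._data)            # [1, 2, 3, 4, 5, 6] — original untouched
```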
Disadvantage of RDD🔗
- Spark performs no automatic optimization on RDDs (the Catalyst optimizer does not apply). The developer must explicitly decide how to optimize RDD operations.
Advantage🔗
- Works well with unstructured data where there are no columns and rows / key-value pairs
- RDDs are type-safe: errors surface at compile time rather than at runtime, which is what can happen with the DataFrame API.
Avoiding RDDs🔗
- RDD: you spell out *how* to do it. DataFrame API: you only declare *what* to do.
With RDDs, if we write a join followed by a filter, Spark executes exactly that order: the join triggers a shuffle on the full data first and the filter runs only afterwards, which is not beneficial. With the DataFrame API, the optimizer is free to push the filter before the join.
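The cost difference can be illustrated without Spark at all. The sketch below (plain Python, hypothetical data) compares "join first, filter later", which is what an RDD program spells out, against "filter first, then join", which a DataFrame optimizer can choose:

```python
# Hypothetical illustration of join-then-filter vs filter-then-join.
users  = [(i, f"user{i}") for i in range(1000)]          # (user_id, name)
orders = [(i % 1000, f"order{i}") for i in range(5000)]  # (user_id, order)

# Plan A: join first, filter later (the order an RDD program dictates).
joined = [(uid, name, o) for uid, name in users
                         for ouid, o in orders if ouid == uid]
plan_a_joined_rows = len(joined)                 # full join output: 5000 rows
plan_a_result = [r for r in joined if r[0] < 10]

# Plan B: filter first, then join (what the optimizer can rewrite to).
small_users = [(uid, name) for uid, name in users if uid < 10]
plan_b_result = [(uid, name, o) for uid, name in small_users
                                for ouid, o in orders if ouid == uid]

# Same result either way, but Plan A materialized 5000 joined rows
# while Plan B only ever joined 10 users' worth of data.
print(plan_a_joined_rows, len(plan_a_result), len(plan_b_result))
```

Both plans produce identical rows; the difference is how much data crosses the join (and, in real Spark, the shuffle) before the filter discards it.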