Spark RDD
Lecture 13: Resilient Distributed Dataset
Data Storage of List
Data Storage in RDD
Suppose we have 500MB of data and a 128MB partition size; then we will have ⌈500 / 128⌉ = 4 partitions.
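The partition count above is just the data size divided by the partition size, rounded up. A minimal sketch of the arithmetic (plain Python, not a Spark API call):

```python
import math

data_size_mb = 500       # total data size
partition_size_mb = 128  # partition (block) size

# number of partitions = data size / partition size, rounded up
num_partitions = math.ceil(data_size_mb / partition_size_mb)
print(num_partitions)  # 4
```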
The data is scattered across various executors. It is not in a single contiguous location like the elements of a list. The data structure used to process this distributed data is called an RDD (Resilient Distributed Dataset).
Why is RDD recoverable?
- RDD is immutable. If we apply multiple filters, each dataset produced after a filter is a new, separate dataset; the original is never modified.
- Spark records the lineage of transformations. If rdd2 (derived from rdd1) is lost, Spark can recompute it from rdd1 using that lineage.
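The two points above can be sketched in plain Python. This is an analogy, not the Spark API: a `FakeRDD` records its parent and the transformation that produced it, so a lost child can be rebuilt, and filtering never mutates the original.

```python
# Plain-Python analogy of RDD immutability and lineage (not the Spark API).

class FakeRDD:
    def __init__(self, data=None, parent=None, transform=None):
        self._data = data          # materialized data (may be "lost")
        self.parent = parent       # lineage: which dataset this came from
        self.transform = transform # lineage: how it was derived

    def filter(self, predicate):
        # Immutability: filtering returns a NEW dataset; self is unchanged.
        child_data = [x for x in self._data if predicate(x)]
        return FakeRDD(child_data, parent=self, transform=predicate)

    def recompute(self):
        # Recovery: rebuild this dataset from its parent via the lineage.
        return [x for x in self.parent._data if self.transform(x)]

rdd1 = FakeRDD([1, 2, 3, 4, 5, 6])
rdd2 = rdd1.filter(lambda x: x % 2 == 0)

rdd2._data = None            # simulate losing rdd2
print(rdd2.recompute())      # [2, 4, 6] — rebuilt from rdd1
print(rdd1._data)            # [1, 2, 3, 4, 5, 6] — original untouched
```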
Disadvantage of RDD🔗
- Spark performs no automatic optimization on RDDs (the Catalyst optimizer does not apply). The developer must explicitly decide how to optimize RDD operations.
Advantage🔗
- Works well with unstructured data where there are no columns and rows / key-value pairs
- RDDs are type-safe: errors surface at compile time rather than at runtime, which is what can happen with the DataFrame API.
Avoiding RDDs🔗
- RDD: you spell out *how* to do it. DataFrame API: you only declare *what* to do.
With RDDs, if we write a join followed by a filter, Spark executes exactly that order: the join triggers a shuffle on the full data first and the filter runs only afterwards, which is not beneficial. With the DataFrame API, the optimizer is free to push the filter before the join.
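The cost difference can be illustrated without Spark at all. The sketch below (plain Python, hypothetical data) compares "join first, filter later", which is what an RDD program spells out, against "filter first, then join", which a DataFrame optimizer can choose:

```python
# Hypothetical illustration of join-then-filter vs filter-then-join.
users  = [(i, f"user{i}") for i in range(1000)]          # (user_id, name)
orders = [(i % 1000, f"order{i}") for i in range(5000)]  # (user_id, order)

# Plan A: join first, filter later (the order an RDD program dictates).
joined = [(uid, name, o) for uid, name in users
                         for ouid, o in orders if ouid == uid]
plan_a_joined_rows = len(joined)                 # full join output: 5000 rows
plan_a_result = [r for r in joined if r[0] < 10]

# Plan B: filter first, then join (what the optimizer can rewrite to).
small_users = [(uid, name) for uid, name in users if uid < 10]
plan_b_result = [(uid, name, o) for uid, name in small_users
                                for ouid, o in orders if ouid == uid]

# Same result either way, but Plan A materialized 5000 joined rows
# while Plan B only ever joined 10 users' worth of data.
print(plan_a_joined_rows, len(plan_a_result), len(plan_b_result))
```

Both plans produce identical rows; the difference is how much data crosses the join (and, in real Spark, the shuffle) before the filter discards it.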