Data Engineering Knowledge Base

Hadoop Vs Spark

Lecture 3: Hadoop vs Spark

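Misconceptions

Hadoop is a database. It is not: Hadoop is a framework whose storage layer, HDFS, is a distributed filesystem, not a database.
Spark is 100 times faster than Hadoop. The speedup depends on the workload and is not a fixed figure.
Spark processes data in RAM but Hadoop doesn't. Both use memory; the difference lies in how often intermediate data is written to disk.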

Differences

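Performance

Hadoop MapReduce performs a lot of read/write I/O: it writes intermediate results to disk and reads them back between stages.

In Spark, each executor has its own memory, so intermediate data can stay in memory between steps instead of going through the disk.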
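The contrast between disk-bound stages and in-memory chaining can be sketched in plain Python. This is a toy model, not Hadoop or Spark code: the stage functions, `run_with_disk`, and `run_in_memory` are hypothetical names chosen for illustration, and a temporary directory stands in for the cluster's disks.

```python
import json
import os
import tempfile


def stage_double(values):
    """First processing stage: double every value."""
    return [v * 2 for v in values]


def stage_increment(values):
    """Second processing stage: add one to every value."""
    return [v + 1 for v in values]


STAGES = [stage_double, stage_increment]


def run_with_disk(values, stages, workdir):
    """MapReduce-style sketch: persist each intermediate result, then read it back."""
    disk_writes = 0
    current = values
    for i, stage in enumerate(stages):
        result = stage(current)
        path = os.path.join(workdir, f"stage_{i}.json")
        with open(path, "w") as f:
            json.dump(result, f)      # intermediate result goes to disk
        disk_writes += 1
        with open(path) as f:
            current = json.load(f)    # next stage reads it back from disk
    return current, disk_writes


def run_in_memory(values, stages):
    """Spark-style sketch: hand each intermediate result to the next stage in memory."""
    current = values
    for stage in stages:
        current = stage(current)
    return current, 0


data = list(range(5))
with tempfile.TemporaryDirectory() as tmp:
    disk_result, disk_writes = run_with_disk(data, STAGES, tmp)
mem_result, mem_writes = run_in_memory(data, STAGES)

print("same result:", disk_result == mem_result)
print("disk round trips:", disk_writes, "vs", mem_writes)
```

Both paths produce the same answer; the only difference is the number of disk round trips, which is where the extra time goes on large datasets.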

Where is there no difference?

When the dataset is small, for example 10 GB, there is little practical difference: the data fits in memory on the first pass, so the Hadoop cluster does not need to keep going back and forth to the disk either.

Batch vs Stream Processing

Ease of Use

Spark offers both low-level and high-level APIs in Python, which are easier to use than Hive. Low-level programming works directly with RDDs; the high-level APIs are DataFrames and Spark SQL.

Security

Hadoop supports Kerberos authentication, which YARN uses on a secured cluster. Spark's own security features are turned off by default, so on its own it is not secured out of the box.
Authenticated identities allow HDFS to enforce ACLs at the directory level.
Because Spark uses HDFS for storage, it inherits the ACL feature, and when it runs on YARN it gets Kerberos authentication.

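Fault Tolerance

Data Replication in Hadoop

HDFS stores several copies of each piece of data on different nodes. The NameNode keeps track of which node and rack holds each copy of the data (A, B, C and D in this example), so if a node fails the data can still be read from another copy.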
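A minimal sketch of that bookkeeping, in plain Python rather than the HDFS API. The node and rack names, `place_blocks`, `locations`, and `live_replicas` are all hypothetical, and the round-robin placement is a simplification, not HDFS's actual rack-aware placement policy. The replication factor of 3 matches the HDFS default.

```python
REPLICATION_FACTOR = 3  # HDFS default for dfs.replication

# Hypothetical cluster layout: node name -> rack name.
NODE_RACK = {
    "node1": "rack1",
    "node2": "rack1",
    "node3": "rack2",
    "node4": "rack2",
}


def place_blocks(blocks, node_rack, replication=REPLICATION_FACTOR):
    """Assign each block to `replication` distinct nodes (simple round-robin)."""
    nodes = sorted(node_rack)
    return {
        block: [nodes[(i + r) % len(nodes)] for r in range(replication)]
        for i, block in enumerate(blocks)
    }


def locations(block_map, node_rack, block):
    """Return the (node, rack) pairs that hold a copy of `block`."""
    return [(node, node_rack[node]) for node in block_map[block]]


def live_replicas(block_map, failed_nodes):
    """Return the copies of every block that survive the given node failures."""
    return {
        block: [n for n in nodes if n not in failed_nodes]
        for block, nodes in block_map.items()
    }


block_map = place_blocks(["A", "B", "C", "D"], NODE_RACK)
after_failure = live_replicas(block_map, failed_nodes={"node2"})

print(locations(block_map, NODE_RACK, "A"))
print(after_failure)
```

With `node2` down, every block still has at least two live copies on other nodes, which is the point of replication.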

DAG in Spark

Spark runs a computation as a chain of transformations: Process 1 -> Process 2 -> Process 3 ...
The result of each step is an RDD, an immutable data structure. Spark records the chain of transformations that produced each RDD (its lineage, represented as a DAG), so after a failure the engine can rebuild the lost data for a given step by re-running the transformations from the earlier RDD.
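The recovery idea can be illustrated with a toy stand-in for an RDD in plain Python. `LineageDataset` and its methods are hypothetical names for this sketch and are not the Spark API; the sketch only assumes that each dataset is immutable and remembers its parent and the transformation that produced it.

```python
class LineageDataset:
    """Toy stand-in for an RDD: immutable rows plus a record of how they were derived."""

    def __init__(self, source=None, parent=None, transform=None):
        self.source = tuple(source) if source is not None else None
        self.parent = parent          # dataset this one was derived from
        self.transform = transform    # function applied to the parent's rows
        self._cache = None            # computed rows; may be lost on failure

    def map(self, fn):
        """Return a NEW dataset; the current one is never modified."""
        return LineageDataset(
            parent=self,
            transform=lambda rows: tuple(fn(r) for r in rows),
        )

    def filter(self, pred):
        """Return a NEW dataset that keeps only rows matching `pred`."""
        return LineageDataset(
            parent=self,
            transform=lambda rows: tuple(r for r in rows if pred(r)),
        )

    def compute(self):
        """Return the rows, rebuilding them from the lineage if they are missing."""
        if self._cache is None:
            if self.parent is None:
                self._cache = self.source
            else:
                self._cache = self.transform(self.parent.compute())
        return self._cache

    def lose(self):
        """Simulate a failure that wipes this dataset's computed rows."""
        self._cache = None


step1 = LineageDataset(source=[1, 2, 3, 4])   # Process 1
step2 = step1.map(lambda x: x * 10)           # Process 2
step3 = step2.filter(lambda x: x > 10)        # Process 3

before = step3.compute()
step2.lose()
step3.lose()
after = step3.compute()   # rebuilt by replaying Process 2 and Process 3

print(before, after)
```

Because every step returns a new object and never mutates its parent, replaying the recorded transformations always yields the same rows as before the failure.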