Lecture 3: Hadoop vs Spark

Misconceptions:

  • Hadoop is a database. It is not a database; its storage layer, HDFS, is just a filesystem.
  • Spark is 100 times faster than Hadoop.
  • Spark processes data in RAM but Hadoop doesn't.

Differences

Performance

Hadoop performs a lot of read/write I/O operations, sending intermediate data back and forth to the disk.

In Spark, by contrast, each executor has its own memory, so intermediate data can stay in memory.

Where is there no difference?

When the data is small, for example 10 GB, there is no real difference: the data fits in memory the first time, so the Hadoop cluster does not need to write to disk either.
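
The difference between the two execution styles can be illustrated with a toy model. This is only a sketch, not how Hadoop or Spark are implemented: the function names `run_with_disk` and `run_in_memory`, the JSON files, and the three sample stages are all assumptions made for illustration. Both pipelines produce the same result; the first one pays a write and a read for every stage.

```python
import json
import os
import tempfile


def run_with_disk(data, stages):
    """Apply each stage, then write the intermediate result to disk and read it back."""
    io_ops = 0
    with tempfile.TemporaryDirectory() as tmp:
        for i, stage in enumerate(stages):
            data = [stage(x) for x in data]
            path = os.path.join(tmp, f"stage_{i}.json")
            with open(path, "w") as f:  # persist the intermediate output
                json.dump(data, f)
            io_ops += 1
            with open(path) as f:  # the next stage starts by reading it back
                data = json.load(f)
            io_ops += 1
    return data, io_ops


def run_in_memory(data, stages):
    """Apply each stage and hand the result to the next one without touching disk."""
    io_ops = 0
    for stage in stages:
        data = [stage(x) for x in data]
    return data, io_ops


# Hypothetical three-stage job on a tiny dataset.
stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
disk_result, disk_io = run_with_disk([1, 2, 3], stages)
mem_result, mem_io = run_in_memory([1, 2, 3], stages)
print(disk_result, disk_io)  # same values as the in-memory run, plus 6 disk operations
print(mem_result, mem_io)
```

The gap grows with the number of stages, which is why multi-step jobs tend to show the largest difference between the two approaches.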

Batch vs Stream Processing

Ease of Use

Spark offers both low-level and high-level APIs in Python, which are easier to use than Hive. Low-level programming is done at the RDD level.

Security

  • Hadoop has built-in Kerberos authentication via YARN, whereas Spark does not enable any security mechanism of its own by default.
  • Authentication makes it possible to enforce access control lists (ACLs) at the directory level in HDFS.
  • When Spark uses HDFS for storage it inherits the ACL capability, and when it runs on YARN it gets Kerberos authentication.

Fault Tolerance

Data Replication in Hadoop

HDFS keeps track of which node and rack hold each of the data blocks A, B, C and D.
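
A minimal sketch of this bookkeeping, assuming three replicas per block (the HDFS default). The rack names, node names, and placement below are invented for illustration, and `live_replicas` and `readable_blocks` are hypothetical helpers, not HDFS APIs.

```python
# Hypothetical block map: block id -> replicas as (rack, node) pairs.
block_locations = {
    "A": [("rack1", "node1"), ("rack1", "node2"), ("rack2", "node3")],
    "B": [("rack1", "node2"), ("rack2", "node3"), ("rack2", "node4")],
    "C": [("rack1", "node1"), ("rack2", "node3"), ("rack2", "node4")],
    "D": [("rack1", "node1"), ("rack1", "node2"), ("rack2", "node4")],
}


def live_replicas(block, failed_nodes):
    """Return the replicas of a block that are not on a failed node."""
    return [
        (rack, node)
        for rack, node in block_locations[block]
        if node not in failed_nodes
    ]


def readable_blocks(failed_nodes):
    """Return the blocks that still have at least one live replica."""
    return {b for b in block_locations if live_replicas(b, failed_nodes)}


# Losing node1 leaves block A with two replicas to read from.
print(live_replicas("A", {"node1"}))
# Even with node1 and node2 both down, every block is still readable.
print(sorted(readable_blocks({"node1", "node2"})))
```

Because the map records the rack as well as the node, the same lookup can tell whether the surviving copies sit on a different rack from the failed one.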

DAG in Spark

  • Spark runs a job as a chain of transformations: Process 1 -> Process 2 -> Process 3 ...
  • The output of each process is represented as an RDD, a data structure that is immutable. Spark also records how each RDD was derived from the one before it, so even if there is a failure the engine knows how to reconstruct the data for a particular process from the RDDs leading up to that stage.
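
The recovery idea can be sketched with a toy class. This is not Spark's RDD implementation: `ToyRDD`, its `map`, `compute`, and `lose_data` methods are hypothetical names used only to show how an immutable dataset that remembers its parent and its transformation can be rebuilt after its computed data is lost.

```python
class ToyRDD:
    """Immutable dataset that remembers its parent and the function that produced it."""

    def __init__(self, source=None, parent=None, fn=None):
        self._source = tuple(source) if source is not None else None
        self._parent = parent  # previous step in the chain, if any
        self._fn = fn  # transformation applied to the parent's data
        self._cache = None  # computed data; can be lost on failure

    def map(self, fn):
        """Return a new ToyRDD; the current one is never modified."""
        return ToyRDD(parent=self, fn=fn)

    def compute(self):
        """Return the data, recomputing it from the parent chain if it is missing."""
        if self._cache is None:
            if self._parent is None:
                self._cache = self._source
            else:
                self._cache = tuple(self._fn(x) for x in self._parent.compute())
        return self._cache

    def lose_data(self):
        """Simulate a failure that wipes the computed data of this step."""
        self._cache = None


# Process 1 -> Process 2 -> Process 3
base = ToyRDD(source=[1, 2, 3])
step1 = base.map(lambda x: x * 10)
step2 = step1.map(lambda x: x + 1)

first = step2.compute()
step1.lose_data()  # simulate losing the intermediate results
step2.lose_data()
recovered = step2.compute()  # rebuilt by replaying the recorded steps
print(first, recovered)
```

Only the steps whose data was lost are replayed; the original input in `base` is never changed, which is what makes the replay safe.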