Lecture 6 : Spark Architecture
Spark Cluster
- 20 cores and 100 GB RAM per machine
- Total cluster: 10 machines, so 200 cores and 1 TB RAM (see the sketch below)
- The Resource Manager runs on the master node, and a Node Manager runs on each worker node.
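A tiny sketch of the sizing arithmetic, assuming the cluster has 10 worker machines (which is what the totals above imply):

```python
# Cluster sizing arithmetic (assumes 10 worker machines, implied by the totals above)
cores_per_machine = 20
ram_gb_per_machine = 100
machines = 10

print(machines * cores_per_machine)   # 200 cores in total
print(machines * ram_gb_per_machine)  # 1000 GB, i.e. roughly 1 TB of RAM
```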
What happens when the user submits code?
- The user submits some Spark code to the Resource Manager for execution. The job asks for a 20 GB driver and 5 executors in total, each with 25 GB RAM and 5 CPU cores (see the config sketch below).
- So the Resource Manager asks the Node Manager on worker W5 to create a 20 GB container that will act as the driver.
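As a rough illustration of how such a request is expressed: these resources are usually passed to spark-submit (--driver-memory, --executor-memory, --executor-cores, --num-executors), and the equivalent configuration keys can also be set on the session builder. The app name below is made up, and driver memory in particular normally has to be set at submit time rather than inside the application:

```python
from pyspark.sql import SparkSession

# Sketch of the resource request described above; values only, not a full job.
spark = (
    SparkSession.builder
    .appName("resource-request-demo")          # hypothetical application name
    .config("spark.driver.memory", "20g")      # 20 GB driver container
    .config("spark.executor.memory", "25g")    # 25 GB RAM per executor
    .config("spark.executor.cores", "5")       # 5 CPU cores per executor
    .config("spark.executor.instances", "5")   # 5 executors in total
    .getOrCreate()
)
```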
What happens inside the container?
Driver Allocation
Now this 20 GB driver container is called the Application Master.
There are two main() functions inside the Application Master: one for PySpark and one for the JVM languages (Java, Scala, etc.).
The JVM main() is called the Application Driver.
Spark Core has a Java wrapper, and the Java wrapper in turn has a Python wrapper on top of it.
When we write PySpark code, the calls go through the Python wrapper and are translated into calls on the Java wrapper (illustrated in the sketch below).
The PySpark main() is only needed when the application is written in Python, but the JVM Application Driver is always required to run any code.
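A small sketch of that layering: every PySpark object is a thin Python wrapper around a JVM object, reachable through Py4J. The underscore attributes used here (_jdf, _jvm) are internal and shown only to make the wrapping visible, not as part of the public API:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wrapper-demo").getOrCreate()

df = spark.range(10)              # PySpark DataFrame: the Python wrapper

# The Python DataFrame holds a reference to its JVM counterpart via Py4J.
print(type(df._jdf))              # py4j JavaObject wrapping a JVM Dataset
print(spark.sparkContext._jvm)    # gateway into the JVM where the Application Driver lives
```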
Worker Allocation
- Now the Application Master asks the Resource Manager for the executors, and the Resource Manager allocates containers for them on the worker nodes.
Executor Container
Each executor has 5 CPU cores and 25 GB RAM, and each one runs in a separate container.
The above is enough when the code is pure JVM code (or PySpark that only uses built-in functions) and we don't use Python UDFs.
But what if we use Python UDFs?
Then we need a Python worker process inside each executor to run that code, as the sketch below shows.
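A minimal sketch of the difference, assuming a plain SparkSession: built-in column expressions stay inside the executor JVM, while the Python UDF forces each executor to spin up a Python worker and ship rows back and forth:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import LongType

spark = SparkSession.builder.appName("python-udf-demo").getOrCreate()

df = spark.range(5).toDF("n")

# Built-in expressions are evaluated entirely inside the executor JVM.
jvm_only = df.select((col("n") * 2).alias("doubled"))

# A Python UDF makes each executor start a Python worker process:
# rows are serialized to the worker, evaluated in Python, and sent back.
@udf(returnType=LongType())
def double_py(n):
    return n * 2

needs_python_worker = df.select(double_py(col("n")).alias("doubled"))

jvm_only.show()
needs_python_worker.show()
```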