Lecture 6 : Spark Architecture
Spark Cluster
- 20 cores and 100 GB RAM per machine
- Total cluster: 10 machines, so 200 cores and 1 TB RAM (see the sketch below)
- The Resource Manager runs on the master node, and a Node Manager runs on each worker node.
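A tiny sketch of the sizing arithmetic, assuming the cluster has 10 worker machines (which is what the totals above imply):

```python
# Cluster sizing arithmetic (assumes 10 worker machines, implied by the totals above)
cores_per_machine = 20
ram_gb_per_machine = 100
machines = 10

print(machines * cores_per_machine)   # 200 cores in total
print(machines * ram_gb_per_machine)  # 1000 GB, i.e. roughly 1 TB of RAM
```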
What happens when the user submits code?
- The user submits some Spark code to the Resource Manager for execution. The job asks for a 20 GB driver and 5 executors in total, each with 25 GB RAM and 5 CPU cores (see the config sketch below).
- So the Resource Manager asks the Node Manager on worker W5 to create a 20 GB container that will act as the driver.
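As a rough illustration of how such a request is expressed: these resources are usually passed to spark-submit (--driver-memory, --executor-memory, --executor-cores, --num-executors), and the equivalent configuration keys can also be set on the session builder. The app name below is made up, and driver memory in particular normally has to be set at submit time rather than inside the application:

```python
from pyspark.sql import SparkSession

# Sketch of the resource request described above; values only, not a full job.
spark = (
    SparkSession.builder
    .appName("resource-request-demo")          # hypothetical application name
    .config("spark.driver.memory", "20g")      # 20 GB driver container
    .config("spark.executor.memory", "25g")    # 25 GB RAM per executor
    .config("spark.executor.cores", "5")       # 5 CPU cores per executor
    .config("spark.executor.instances", "5")   # 5 executors in total
    .getOrCreate()
)
```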
What happens inside the container?
Driver Allocation
Now this 20 GB driver container is called the Application Master.
There are two main() functions inside the Application Master: one for PySpark and one for the JVM languages (Java, Scala, etc.).
The JVM main() is called the Application Driver.
Spark Core has a Java wrapper, and the Java wrapper in turn has a Python wrapper on top of it.
When we write PySpark code, the calls go through the Python wrapper and are translated into calls on the Java wrapper (illustrated in the sketch below).
The PySpark main() is only needed when the application is written in Python, but the JVM Application Driver is always required to run any code.
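A small sketch of that layering: every PySpark object is a thin Python wrapper around a JVM object, reachable through Py4J. The underscore attributes used here (_jdf, _jvm) are internal and shown only to make the wrapping visible, not as part of the public API:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wrapper-demo").getOrCreate()

df = spark.range(10)              # PySpark DataFrame: the Python wrapper

# The Python DataFrame holds a reference to its JVM counterpart via Py4J.
print(type(df._jdf))              # py4j JavaObject wrapping a JVM Dataset
print(spark.sparkContext._jvm)    # gateway into the JVM where the Application Driver lives
```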
Worker Allocation
- Now the Application Master asks the Resource Manager for the executors, and the Resource Manager allocates containers for them on the worker nodes.
Executor Container
Each executor has 5 CPU cores and 25 GB RAM, and each one runs in a separate container.
The above is enough when the code is pure JVM code (or PySpark that only uses built-in functions) and we don't use Python UDFs.
But what if we use Python UDFs?
Then we need a Python worker process inside each executor to run that code, as the sketch below shows.
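A minimal sketch of the difference, assuming a plain SparkSession: built-in column expressions stay inside the executor JVM, while the Python UDF forces each executor to spin up a Python worker and ship rows back and forth:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import LongType

spark = SparkSession.builder.appName("python-udf-demo").getOrCreate()

df = spark.range(5).toDF("n")

# Built-in expressions are evaluated entirely inside the executor JVM.
jvm_only = df.select((col("n") * 2).alias("doubled"))

# A Python UDF makes each executor start a Python worker process:
# rows are serialized to the worker, evaluated in Python, and sent back.
@udf(returnType=LongType())
def double_py(n):
    return n * 2

needs_python_worker = df.select(double_py(col("n")).alias("doubled"))

jvm_only.show()
needs_python_worker.show()
```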