In an EMR cluster, or any cluster, is it possible that YARN allocates the driver and executors on the same EC2 instance?
I want to know whether the driver can utilize the storage and processing power of one EC2 instance, or whether some part of that instance will be used to serve other Spark jobs running in the cluster. That could cause my driver to run out of memory.
I think the Resource Manager decides this based on cluster resource availability?
In non-AWS EMR: the driver and executors can be on the same machine / instance.
In AWS EMR: the driver may run on the Master Node or on one of the Core Instances --> so on the same EC2 instance.
Incidentally, on EMR the YARN-related services run on the Master Node.
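As a rough sketch (the app name is a placeholder, and the submit commands shown in the comments are just the usual client/cluster variants), you can check at runtime which instance ended up hosting the driver:

```python
# Sketch only: check where the driver actually landed after submission.
# Typically submitted on EMR as one of:
#   spark-submit --master yarn --deploy-mode cluster my_app.py   (driver runs inside the AM container on a core/task node)
#   spark-submit --master yarn --deploy-mode client  my_app.py   (driver runs where spark-submit runs, usually the master node)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("where-is-my-driver").getOrCreate()
conf = spark.sparkContext.getConf()

# spark.driver.host is filled in by Spark at runtime with the driver's address,
# so printing it shows which instance is hosting the driver.
print("driver host:", conf.get("spark.driver.host", "unknown"))
print("deploy mode:", conf.get("spark.submit.deployMode", "client"))
spark.stop()
```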
Related
Do executors in Spark and the application master in YARN do the same thing?
In Spark, there is a driver and there are executors. I'm not going to go into detail about what the driver and executors are, but in a one-liner: the driver manages the job flow and schedules tasks, and executors are worker-node processes in charge of running individual tasks.
YARN is basically a resource manager which allocates memory to compute engines. This compute engine can be Spark/Tez/MapReduce. What you need to understand here is that when YARN successfully allocates resources, those allocations are called containers.
Now, when a Spark job is deployed on YARN (assuming YARN has sufficient memory for it to run), YARN first allocates a container for the Spark Application Master, which holds the driver program (in cluster mode). This Application Master then requests further resources for the Spark executors, which YARN again allocates as containers. So a Spark job will have multiple containers: one for the driver program and n containers for n executors. In the computing sense, the fundamental difference between Spark running on a Spark cluster and Spark running on YARN is the use of containers.
So executors and the application master in YARN run inside containers and do the same thing as Spark on Spark clusters.
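To make the container layout above concrete, here is a hedged sketch (the script name and the memory/core numbers are placeholders, not recommendations) of a job submitted in YARN cluster mode, which ends up as one Application Master container holding the driver plus three executor containers:

```python
# Sketch only: sizes below are illustrative.
# Typically submitted as something like:
#   spark-submit --master yarn --deploy-mode cluster \
#     --driver-memory 2g --num-executors 3 --executor-memory 4g --executor-cores 2 \
#     containers_demo.py
# which yields 1 Application Master container (holding the driver)
# plus 3 executor containers, i.e. 4 YARN containers in total.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("yarn-containers-demo").getOrCreate()

# The values were fixed at submit time; reading them back just confirms
# what YARN was asked to allocate.
conf = spark.sparkContext.getConf()
for key in ("spark.executor.instances", "spark.executor.memory", "spark.executor.cores"):
    print(key, "=", conf.get(key, "not set"))
spark.stop()
```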
I'm a bit confused about how master and worker nodes are assigned to the respective connected machines (VMs) on the network in the cluster mode of Spark.
My question is: when I launch a Spark job (using spark-submit), what is the process workflow responsible for assigning a master node and a worker node?
Thanks!
The driver and executors request containers from YARN to launch and do work. YARN takes care of the allocation for you, so you don't need to worry about where the master (driver) / slave (executor) are allocated.
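If you want to see which machines YARN actually picked, one informal trick (a sketch, not an official API) is to run a trivial job and collect each task's hostname:

```python
# Sketch: discover which hosts YARN placed the executors on by running a
# trivial job and collecting the hostname seen by each task.
import socket
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("who-runs-my-tasks").getOrCreate()
sc = spark.sparkContext

# With enough partitions we usually touch every executor at least once,
# though this is not strictly guaranteed.
hosts = (
    sc.parallelize(range(100), numSlices=100)
      .map(lambda _: socket.gethostname())
      .distinct()
      .collect()
)
print("executor hosts chosen by YARN:", sorted(hosts))
spark.stop()
```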
If I am using a Kubernetes cluster to run Spark, then I am using the Kubernetes resource manager in Spark.
If I am using a Hadoop cluster to run Spark, then I am using the YARN resource manager in Spark.
But my question is: if I am spawning multiple Linux nodes in Kubernetes, and using one of the nodes as the Spark master and three others as workers, what resource manager should I use? Can I use YARN here?
Second question: in the case of any 4-node Linux Spark cluster (not in Kubernetes and not Hadoop, just simple connected Linux machines), even if I do not have HDFS, can I use YARN here as the resource manager? If not, then what resource manager should be used for Spark?
Thanks.
if I am spawning multiple Linux nodes in Kubernetes,
Then you'd obviously use Kubernetes, since it's available.
in the case of any 4-node Linux Spark cluster (not in Kubernetes and not Hadoop, simple connected Linux machines), even if I do not have HDFS, can I use YARN here
You can, or you can use the Spark Standalone scheduler instead. However, Spark requires a shared filesystem for reading and writing data, so while you could attempt to use NFS, or S3/GCS, for this, HDFS is faster.
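For illustration, the chosen cluster manager shows up only in the master URL you point Spark at; the host names and ports below are placeholders, not real endpoints:

```python
# Sketch: the master URL selects the cluster manager. Hosts/ports are placeholders.
from pyspark.sql import SparkSession

# Kubernetes: point at the k8s API server (also needs a Spark container image configured).
# spark = SparkSession.builder.master("k8s://https://k8s-apiserver.example.com:6443").getOrCreate()

# YARN: the Hadoop/YARN configuration on the classpath tells Spark where the ResourceManager is.
# spark = SparkSession.builder.master("yarn").getOrCreate()

# Spark Standalone: point at the standalone master started with start-master.sh.
spark = (
    SparkSession.builder
    .master("spark://standalone-master.example.com:7077")  # placeholder host
    .appName("cluster-manager-demo")
    .getOrCreate()
)
spark.stop()
```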
I have just launched an Amazon Elastic MapReduce cluster after hitting "java.lang.OutofMemorySpace: Java heap space" while fetching 120 million rows from a database in pyspark; I have 1 master and 2 slave nodes running, each having 4 cores and 8G RAM.
I am trying to load a massive dataset from a MySQL database (containing approx. 120M rows). The query loads fine, but when I do a df.show() operation or when I try to perform operations on the Spark dataframe, I get errors like -
org.apache.spark.SparkException: Job 0 cancelled because SparkContext was shut down
Task 0 in stage 0.0 failed 1 times; aborting job
java.lang.OutOfMemoryError: GC overhead limit exceeded
My questions are -
When I SSH into the Amazon EMR server and run htop, I see that 5GB out of 8GB is already in use. Why is this?
On the Amazon EMR portal, I can see that the master and slave servers are running. I'm not sure if the slave servers are being used or if it's just the master doing all the work. Do I have to separately launch or "start" the 2 slave nodes, or does Spark do that automatically? If yes, how do I do this?
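(A minimal sketch of the kind of JDBC load described above; the connection details are placeholders and the MySQL connector jar is assumed to be on the classpath. The read itself is lazy, which is why the failure only surfaces when an action such as df.show() runs.)

```python
# Sketch of the JDBC load in the question; URL, table, and credentials are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mysql-load").getOrCreate()

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://db-host.example.com:3306/mydb")  # placeholder
    .option("dbtable", "big_table")                               # placeholder
    .option("user", "user")
    .option("password", "password")
    .load()
)

# Nothing is fetched until an action runs, so the OutOfMemoryError / GC overhead
# errors only appear here, when Spark actually starts pulling rows.
df.show()
```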
If you are running Spark in local mode (local[*]) from the master, then it will only use the master node.
How are you submitting the Spark job?
Use YARN cluster or client mode when submitting the Spark job to use the cluster's resources efficiently, as sketched below.
Read more on YARN cluster vs client mode.
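For instance, a minimal sketch (the script name and sizes are placeholders) of submitting the same PySpark script in local mode versus on YARN:

```python
# Sketch: the same script run two ways.
# Local mode uses only the machine where spark-submit runs (the master node):
#   spark-submit --master "local[*]" load_from_mysql.py
# YARN mode spreads executors across the core nodes of the EMR cluster:
#   spark-submit --master yarn --deploy-mode client \
#     --num-executors 2 --executor-memory 4g --executor-cores 4 \
#     load_from_mysql.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("load_from_mysql").getOrCreate()
# Prints "yarn" or "local[*]" depending on how the job was submitted above.
print("master:", spark.sparkContext.master)
spark.stop()
```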
The master node runs all the other services like Hive, MySQL, etc. Those services may be taking that 5GB of RAM if you aren't using standalone mode.
In the YARN UI (http://<master-public-dns>:8088) you can check in more detail what other containers are running.
You can check where your Spark driver and executors are running
in the Spark UI, http://<master-public-dns>:18080.
Select your job and go to the Executors section; there you will find the machine IP of each executor.
Enable Ganglia in EMR, or check the CloudWatch EC2 metrics, to see each machine's utilization.
Spark doesn't start or terminate nodes.
If you want to scale your cluster depending on job load, apply an autoscaling policy to the CORE or TASK instance group.
But you need at least 1 CORE node always running.
In a Spark standalone cluster, what exactly is the role of the master (a node started with the start-master.sh script)?
I understand that it is the node that receives the jobs from the submit-job.sh script, but what is its role when processing a job?
I'm seeing in the web UI that it always delivers the job to a slave (a node started with start-slave.sh) and does not participate in the processing. Am I right? In that case, should I also run the start-slave.sh script on the same machine as the master to take advantage of its resources (CPU and memory)?
Thanks in advance.
Spark runs in the following cluster modes:
Local
Standalone
Mesos
Yarn
The above are the cluster modes that offer resources to Spark applications.
Spark standalone mode is a master-slave architecture; we have a Spark Master and Spark Workers. The Spark Master runs on one of the cluster nodes and the Spark Workers run on the slave nodes of the cluster.
The Spark Master (often written Standalone Master) is the resource manager
for the Spark Standalone cluster; it allocates the resources (CPU, memory, disk, etc.) among the
Spark applications. These resources are used to run the Spark driver and executors.
The Spark Workers report to the Spark Master about the resource information of the slave nodes.
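As a sketch of how this plays out in practice (host names are placeholders): the standalone Master only manages resources, so if you want the master machine to also process tasks, you additionally start a Worker on it, as the question suggests.

```python
# Sketch: a standalone cluster where the master machine also runs a worker.
# The shell steps are shown as comments; host names are placeholders.
#
#   # on the master machine
#   $SPARK_HOME/sbin/start-master.sh                          # Spark Master (resource manager only)
#   $SPARK_HOME/sbin/start-slave.sh spark://master-host:7077  # optional: a worker co-located with the master
#   # on each slave machine
#   $SPARK_HOME/sbin/start-slave.sh spark://master-host:7077
#
# The application then just points at the master URL; the Master hands out the
# workers' resources (including the one on its own machine, if started) to the app.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("spark://master-host:7077")   # placeholder host
    .appName("standalone-demo")
    .getOrCreate()
)
print(spark.range(10).count())
spark.stop()
```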
Spark standalone comes with its own resource manager. Think of the Spark Master/Worker as the YARN ResourceManager/NodeManager.