I'm provisioning a Google Cloud Dataproc cluster in the following way:
gcloud dataproc clusters create spark --async --image-version 1.2 \
--master-machine-type n1-standard-1 --master-boot-disk-size 10 \
--worker-machine-type n1-highmem-8 --num-workers 4 --worker-boot-disk-size 10 \
--num-worker-local-ssds 1
Launching a Spark application in yarn-cluster mode with
spark.driver.cores=1
spark.driver.memory=1g
spark.executor.instances=4
spark.executor.cores=8
spark.executor.memory=36g
will only ever launch 3 executor instances instead of the requested 4, effectively "wasting" a full worker node, which seems to be running the driver only. Also, reducing spark.executor.cores to 7 to "reserve" a core on a worker node for the driver does not seem to help.
What configuration is required to be able to run the driver in yarn-cluster mode alongside executor processes, making optimal use of the available resources?
An n1-highmem-8 using Dataproc 1.2 is configured to have 40960m allocatable per YARN NodeManager. Instructing Spark to use 36g of heap memory per executor will also add 3.6g of memoryOverhead (0.1 × heap memory, with a 384m floor). YARN rounds the request up and allocates it as the full 40960m.
The driver will use 1g of heap plus 384m for memoryOverhead (the floor value); YARN allocates this as 2g. Because the driver always launches before the executors, its memory is allocated first. When a 40960m allocation request then comes in for an executor, no node has that much memory still available, so no executor container is allocated on the same node as the driver.
Using spark.executor.memory=34g will allow the driver and executor to run on the same node.
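A minimal sketch of this allocation math (assuming YARN rounds container requests up to multiples of yarn.scheduler.minimum-allocation-mb, taken here to be the common 1024m default; the values below mirror the cluster in the question):

```python
import math

def yarn_container_mb(heap_mb, overhead_fraction=0.10, min_overhead_mb=384,
                      min_allocation_mb=1024):
    """Approximate the YARN container size for a Spark JVM with the given heap.

    Overhead is max(fraction * heap, 384m); YARN rounds the total request up
    to a multiple of yarn.scheduler.minimum-allocation-mb (assumed 1024m).
    """
    overhead_mb = max(int(heap_mb * overhead_fraction), min_overhead_mb)
    requested_mb = heap_mb + overhead_mb
    return math.ceil(requested_mb / min_allocation_mb) * min_allocation_mb

node_mb = 40960                              # n1-highmem-8 NodeManager capacity
driver_mb = yarn_container_mb(1024)          # spark.driver.memory=1g  -> 2048m
executor_36g = yarn_container_mb(36 * 1024)  # spark.executor.memory=36g -> 40960m
executor_34g = yarn_container_mb(34 * 1024)  # spark.executor.memory=34g -> 38912m

print(driver_mb + executor_36g <= node_mb)   # driver + 36g executor do not fit
print(driver_mb + executor_34g <= node_mb)   # 34g leaves room for the driver
```

With 34g the executor container shrinks to 38912m, which together with the 2048m driver container exactly fills the 40960m node.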
Related
I am learning spark and trying to execute simple wordcount application. I am using
Spark 2.4.7 (spark-2.4.7-bin-hadoop2.7)
scala 2.12
java 8
The Spark cluster has 1 master and 2 worker nodes, running as a standalone cluster.
spark config is
spark.master spark://localhost:7077
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.memory 500M
master start script is ${SPARK_HOME}/sbin/start-master.sh
slave start script is ${SPARK_HOME}/sbin/start-slave.sh spark://localhost:7077 -c 1 -m 50M
I want to start the driver in cluster mode
${SPARK_HOME}/bin/spark-submit --master spark://localhost:7077 --deploy-mode cluster --driver-memory 500M --driver-cores 8 --executor-memory 50M --executor-cores 4 <absolute path to the jar file containing the code>
Note: The completed driver/apps are the ones I had to kill
I have used the above params after reading spark doc and checking the blogs.
But after I submit the job, the driver does not run; it always shows the worker as none. I have read multiple blogs and checked the documentation to find out how to submit the job in cluster mode, and I tweaked different spark-submit params, but it does not execute. Interestingly, when I submit in client mode it works.
Can you help me in fixing this issue?
Take a look at CPU and memory configurations of your workers and the driver.
Your application requires 500 MB of RAM and one CPU core to run the driver, plus 50 MB and one core to run computational tasks. So you need 550 MB of RAM and two cores. These resources must be provided by a worker when you run your driver in cluster mode. But each worker is allowed to use only one CPU core and 50 MB of RAM, so the resources the worker has are not enough to execute your driver.
You have to allocate your Spark cluster as much resources as you need for your work:
Worker Cores >= Driver Cores + Executor Cores
Worker Memory >= Driver Memory + Executor Memory
Perhaps you need to increase the amount of memory for both the driver and the executor. Try running the worker with 1 GB of memory, setting --driver-memory to 512 MB and sizing --executor-memory to fit in the remainder.
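Those two inequalities can be turned into a quick feasibility check; the numbers below are the ones from the question plus an assumed larger worker, not measurements:

```python
def cluster_fits(worker_cores, worker_mem_mb, driver_cores, driver_mem_mb,
                 executor_cores, executor_mem_mb):
    """In standalone cluster mode the driver and executors both consume
    worker resources, so both inequalities must hold on some worker."""
    return (worker_cores >= driver_cores + executor_cores and
            worker_mem_mb >= driver_mem_mb + executor_mem_mb)

# Worker started with -c 1 -m 50M, driver asking for 500M and 8 cores:
print(cluster_fits(1, 50, 8, 500, 4, 50))     # the worker is far too small
# Hypothetical worker with 12 cores / 1 GB, 512M driver, 50M executor:
print(cluster_fits(12, 1024, 1, 512, 4, 50))  # now the driver can be placed
```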
what is the relationship between spark executor and yarn container when using spark on yarn?
For example, when I set executor-memory = 20G and yarn container memory = 10G, does 1 executor contains 2 containers?
A Spark executor runs inside a YARN container, which the ResourceManager provides on demand. In YARN mode, each container holds exactly one Spark executor.
Spark executors are the processes that run the tasks.
Spark Executor will be started on a Worker Node(DataNode)
In your case, setting executor-memory = 20G means each executor asks YARN for a container of roughly 20 GB (plus memory overhead). One executor cannot be split across two 10 GB containers; if the container request exceeds yarn.scheduler.maximum-allocation-mb, YARN will refuse it.
So, for example, if you have a cluster of 8 nodes each running one such executor, that is 8 × 20 GB of total memory for your job.
Below are the 3 config options available in yarn-site.xml with which you can play around and see the differences.
yarn.scheduler.minimum-allocation-mb
yarn.scheduler.maximum-allocation-mb
yarn.nodemanager.resource.memory-mb
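For illustration, a yarn-site.xml fragment setting all three (the values are placeholders for experimentation, not recommendations):

```xml
<configuration>
  <!-- Smallest container YARN hands out; requests are rounded up to a multiple of this -->
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>1024</value>
  </property>
  <!-- Largest single container; executor memory + overhead must fit under this -->
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>20480</value>
  </property>
  <!-- Total memory one NodeManager can allocate across all containers on the node -->
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>40960</value>
  </property>
</configuration>
```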
When running Spark on YARN, each Spark executor runs as a YARN container. This means the number of containers will always be the same as the number of executors created by a Spark application, e.g. via the --num-executors parameter of spark-submit.
https://stackoverflow.com/a/38348175/9605741
In YARN mode, each executor runs in one container. The number of executors is the same as the number of containers allocated from YARN (except in cluster mode, where YARN allocates an additional container to run the driver).
Currently I am using Spark 2.0.0 in cluster mode (Standalone cluster) with the following cluster config:
Workers: 4
Cores in use: 32 Total, 32 Used
Memory in use: 54.7 GB Total, 42.0 GB Used
I have 4 slaves (workers), and 1 master machine. There are 3 main parts to a Spark cluster - Master, Driver, Workers (ref)
Now my problem is that the driver starts up on one of the worker nodes, which prevents me from using the worker nodes at full capacity (RAM-wise). For example, if I run my Spark job with 2g of memory for the driver, then I am left with only ~13 GB in each machine for executor memory (assuming total RAM in each machine is 15 GB). Now I think there can be 2 ways to fix this:
1) Run driver on master machine, this way I can specify full 15gb RAM as executor memory
2) Specify driver machine explicitly (one of the worker nodes), and assign memory to both driver and executor for this machine accordingly. For rest of the worker nodes I can specify max executor memory.
How do I achieve point 1 or 2? Or is it even possible?
Any pointers to it are appreciated.
To run the driver on the master, run spark-submit from the master and specify --deploy-mode client. See "Launching applications with spark-submit" in the Spark docs.
It is not possible to specify which worker the driver will run on when using --deploy-mode cluster. However, you can run the driver on a worker and achieve maximum cluster utilisation if you use a cluster manager such as YARN or Mesos.
I have a cluster of 3 nodes, each with 12 cores, and 30G, 20G and 10G of RAM respectively. When I run my application, I set the executor memory to 20G, which prevents an executor from being launched on the 10G machine since it exceeds that worker's memory threshold, and it also under-utilizes the resources on the 30G machine. I searched but didn't find any way to set the executor memory dynamically based on the capacity of the node, so how can I configure the cluster or my Spark job to fully utilize the resources of the cluster?
The solution is to have more executors with less memory each. You can use all of the memory by having six 10G executors (1 on the 10G node, 2 on the 20G node, 3 on the 30G node), or by having twelve 5G executors, etc.
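That packing can be sketched as a quick calculation (standalone mode; memory in GB, memory overhead ignored for simplicity):

```python
def executors_per_node(node_mem_gb, executor_mem_gb):
    """How many fixed-size executors fit on each node; executor memory must
    be chosen so that even the smallest node can host at least one."""
    return [mem // executor_mem_gb for mem in node_mem_gb]

nodes = [30, 20, 10]                  # the three workers from the question
print(executors_per_node(nodes, 20))  # 20G executors skip the 10G node entirely
print(executors_per_node(nodes, 10))  # six 10G executors use all the memory
print(executors_per_node(nodes, 5))   # twelve 5G executors do too
```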
I have an Apache Spark application running on a YARN cluster (spark has 3 nodes on this cluster) on cluster mode.
When the application is running, the Spark UI shows 2 executors (each running on a different node) and the driver running on the third node.
I want the application to use more executors so I tried adding the argument --num-executors to Spark-submit and set it to 6.
spark-submit --driver-memory 3G --num-executors 6 --class main.Application --executor-memory 11G --master yarn-cluster myJar.jar <arg1> <arg2> <arg3> ...
However, the number of executors remains 2.
On the Spark UI I can see that the parameter spark.executor.instances is 6, just as I intended, yet there are still only 2 executors.
I even tried setting this parameter from the code
sparkConf.set("spark.executor.instances", "6")
Again, I can see that the parameter was set to 6, but still there are only 2 executors.
Does anyone know why I couldn't increase the number of my executors?
yarn.nodemanager.resource.memory-mb is 12g in yarn-site.xml
Increase yarn.nodemanager.resource.memory-mb in yarn-site.xml
With 12g per node you can only launch driver(3g) and 2 executors(11g).
Node1 - driver 3g (+7% overhead)
Node2 - executor1 11g (+7% overhead)
Node3 - executor2 11g (+7% overhead)
Now you are requesting executor3 with 11g, and no node has 11g (plus overhead) of memory available.
For the 7% overhead, see spark.yarn.executor.memoryOverhead and spark.yarn.driver.memoryOverhead in https://spark.apache.org/docs/1.2.0/running-on-yarn.html
Note that yarn.nodemanager.resource.memory-mb is total memory that a single NodeManager can allocate across all containers on one node.
In your case, since yarn.nodemanager.resource.memory-mb = 12G, if you add up the memory allocated to all YARN containers on any single node, it cannot exceed 12G.
You have requested 11G (--executor-memory 11G) for each Spark executor container. Though 11G is less than 12G, this still won't work. Why?
Because you have to account for spark.yarn.executor.memoryOverhead, which is max(executorMemory * 0.10, 384) MB by default, unless you override it.
So, the following math must hold true:
spark.executor.memory + spark.yarn.executor.memoryOverhead <= yarn.nodemanager.resource.memory-mb
See: https://spark.apache.org/docs/latest/running-on-yarn.html for latest documentation on spark.yarn.executor.memoryOverhead
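Plugging the numbers from the question into the inequality above (a sketch, assuming the 10% default overhead with its 384 MB floor and no override):

```python
def executor_fits(executor_mem_mb, node_mem_mb, overhead_fraction=0.10,
                  min_overhead_mb=384):
    """spark.executor.memory + memoryOverhead <= yarn.nodemanager.resource.memory-mb"""
    overhead_mb = max(int(executor_mem_mb * overhead_fraction), min_overhead_mb)
    return executor_mem_mb + overhead_mb <= node_mem_mb

node_mb = 12 * 1024                       # yarn.nodemanager.resource.memory-mb = 12g
print(executor_fits(11 * 1024, node_mb))  # 11264 + 1126 = 12390 > 12288: too big
print(executor_fits(10 * 1024, node_mb))  # 10240 + 1024 = 11264 fits
```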
Moreover, spark.executor.instances is merely a request. The Spark ApplicationMaster for your application asks the YARN ResourceManager for a number of containers equal to spark.executor.instances. The request is granted by the ResourceManager on a NodeManager node based on:
Resource availability on the node. YARN scheduling has its own nuances - this is a good primer on how YARN FairScheduler works.
Whether the yarn.nodemanager.resource.memory-mb threshold would not be exceeded on the node:
(number of Spark containers running on the node * (spark.executor.memory + spark.yarn.executor.memoryOverhead)) <= yarn.nodemanager.resource.memory-mb
If the request is not granted, it will be queued and granted once the above conditions are met.
To utilize the spark cluster to its full capacity you need to set values for --num-executors, --executor-cores and --executor-memory as per your cluster:
--num-executors command-line flag or spark.executor.instances configuration property controls the number of executors requested;
--executor-cores command-line flag or spark.executor.cores configuration property controls the number of concurrent tasks an executor can run;
--executor-memory command-line flag or spark.executor.memory configuration property controls the heap size.
You only have 3 nodes in the cluster, and one will be used for the driver, leaving you only 2 nodes; how can you create 6 executors?
I think you confused --num-executors with --executor-cores.
To increase concurrency you need more cores; you want to utilize all the CPUs in your cluster.
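The relationship between these two flags can be sketched in one line: the cluster's task parallelism is bounded by executors × cores per executor (the numbers below are just the values from this thread):

```python
def max_concurrent_tasks(num_executors, executor_cores):
    """Upper bound on simultaneously running tasks across the cluster:
    each executor runs at most executor_cores tasks at once."""
    return num_executors * executor_cores

print(max_concurrent_tasks(2, 4))  # the 2 executors that actually launched
print(max_concurrent_tasks(6, 4))  # the parallelism 6 executors would give
```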