Running Spark on heterogeneous cluster in standalone mode

I have a cluster of 3 nodes, each with 12 cores, and with 30G, 20G and 10G of RAM respectively. When I run my application, I set the executor memory to 20G, which prevents an executor from being launched on the 10G machine since it exceeds the slave memory threshold, and it also underutilizes the resources on the 30G machine. I searched but didn't find any way to set the executor memory dynamically based on the capacity of the node, so how can I configure the cluster or my Spark job to fully utilize the resources of the cluster?

The solution is to have more executors with less memory each. You can use all of the memory by having six 10G executors (1 on the 10G node, 2 on the 20G node, 3 on the 30G node), or twelve 5G executors, and so on.
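
The arithmetic behind this answer can be checked with a small sketch. It assumes the node sizes from the question and ignores memory reserved for the OS and the Spark daemons; the function name executors_per_node is made up for illustration and is not part of Spark.

```python
def executors_per_node(node_memory_gb, executor_memory_gb):
    """Return how many executors of the given size fit on each node (memory only)."""
    return [mem // executor_memory_gb for mem in node_memory_gb]


# RAM per node from the question, in GB.
nodes = [30, 20, 10]

# Compare the original 20G setting with the two smaller sizes from the answer.
for size in (20, 10, 5):
    counts = executors_per_node(nodes, size)
    used = sum(counts) * size
    print(f"{size}G executors: {counts} per node, {used}G of {sum(nodes)}G used")
```

In standalone mode a worker launches more than one executor for the same application only when spark.executor.cores is set and enough cores remain, so these counts are an upper bound derived from memory alone.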

Related

Spark Driver does not have any worker allotted

I am learning Spark and trying to run a simple wordcount application. I am using:
spark version 2.4.7-bin-hadoop.2.7
scala 2.12
java 8
The Spark cluster has 1 master and 2 worker nodes and runs as a standalone cluster.
The Spark config is:
spark.master spark://localhost:7077
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.memory 500M
The master start script is ${SPARK_HOME}/sbin/start-master.sh
The slave start script is ${SPARK_HOME}/sbin/start-slave.sh spark://localhost:7077 -c 1 -m 50M
I want to start the driver in cluster mode:
${SPARK_HOME}/bin/spark-submit --master spark://localhost:7077 --deploy-mode cluster --driver-memory 500M --driver-cores 8 --executor-memory 50M --executor-cores 4 <absolute path to the jar file having code>
Note: The completed driver/apps are the ones I had to kill
I chose the above parameters after reading the Spark docs and checking blogs.
But after I submit the job, the driver does not run. It always shows the worker as none. I have read multiple blogs and checked the documentation to find out how to submit the job in cluster mode. I tweaked different parameters for spark-submit, but it still does not execute. Interestingly, when I submit in client mode it works.
Can you help me in fixing this issue?

Take a look at CPU and memory configurations of your workers and the driver.
Your spark-submit command asks for 500 MB of RAM and 8 cores for the driver (--driver-memory 500M --driver-cores 8) and for 50 MB and 4 cores per executor. So you need at least 550 MB of RAM and 12 cores in total. In cluster mode these resources are provided by a worker, because the driver itself runs on one. But each worker is allowed to use only one CPU core and 50 MB of RAM (-c 1 -m 50M). So no worker has enough resources to run your driver.
You have to give your Spark cluster as many resources as your job needs:
Worker Cores >= Driver Cores + Executor Cores
Worker Memory >= Driver Memory + Executor Memory
You may also have to increase the amount of memory for both the driver and the executor. Try starting the worker with 1 GB of memory and setting both --driver-memory and --executor-memory to 512 MB.
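
As a rough illustration of the two inequalities above, the following sketch checks whether a single worker could host a driver and an executor together. The numbers come from the start-slave.sh and spark-submit commands in the question; the Resources class and the fits function are hypothetical helpers, not Spark APIs, and the check models only cores and memory.

```python
from dataclasses import dataclass


@dataclass
class Resources:
    cores: int
    memory_mb: int


def fits(worker: Resources, *requests: Resources) -> bool:
    """True if the worker has enough cores and memory for all requests combined."""
    total_cores = sum(r.cores for r in requests)
    total_memory = sum(r.memory_mb for r in requests)
    return worker.cores >= total_cores and worker.memory_mb >= total_memory


worker = Resources(cores=1, memory_mb=50)     # start-slave.sh ... -c 1 -m 50M
driver = Resources(cores=8, memory_mb=500)    # --driver-cores 8 --driver-memory 500M
executor = Resources(cores=4, memory_mb=50)   # --executor-cores 4 --executor-memory 50M

print(fits(worker, driver))            # can this worker host the driver at all?
print(fits(worker, driver, executor))  # can it host the driver and one executor?
```

With these values the worker cannot even host the driver, which would be consistent with the driver never being assigned a worker.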

what is the relationship between spark executor and yarn container when using spark on yarn

What is the relationship between a Spark executor and a YARN container when using Spark on YARN?
For example, when I set executor-memory = 20G and yarn container memory = 10G, does 1 executor contain 2 containers?

A Spark executor runs within a YARN container. A YARN container is provided by the ResourceManager on demand, and each container holds exactly one Spark executor.
Spark executors are the processes that run the tasks.
A Spark executor is started on a worker node (a node running a NodeManager, often also a DataNode).
In your case, setting executor-memory = 20G means you are asking for a container large enough to hold a 20 GB executor heap plus its memory overhead. An executor is never split across containers, so if the largest container YARN will grant is 10 GB, the request cannot be satisfied.
So, for example, if you have a cluster of 8 nodes and one 20 GB executor runs on each, your job gets 8 * 20 GB of executor memory in total.
Below are the 3 config options available in yarn-site.xml with which you can play around and see the differences.
yarn.scheduler.minimum-allocation-mb
yarn.scheduler.maximum-allocation-mb
yarn.nodemanager.resource.memory-mb

When running Spark on YARN, each Spark executor runs as a YARN container. This means the number of containers will always be the same as the number of executors created by a Spark application, e.g. via the --num-executors parameter in spark-submit.
https://stackoverflow.com/a/38348175/9605741

In YARN mode, each executor runs in one container. The number of executors is the same as the number of containers allocated from YARN (except in cluster mode, where another container is allocated to run the driver).

Forcing driver to run on specific slave in spark standalone cluster running with "--deploy-mode cluster"

I am running a small spark cluster, with two EC2 instances (m4.xlarge).
So far I have been running the spark master on one node, and a single spark slave (4 cores, 16g memory) on the other, then deploying my spark (streaming) app in client deploy-mode on the master. Summary of settings is:
--executor-memory 16g
--executor-cores 4
--driver-memory 8g
--driver-cores 2
--deploy-mode client
This results in a single executor on my single slave running with 4 cores and 16 GB of memory. The driver runs "outside" of the cluster on the master node (i.e. it is not allocated its resources by the master).
Ideally I'd like to use cluster deploy-mode so that I can take advantage of the supervise option. I have started a second slave on the master node giving it 2 cores and 8g memory (smaller allocated resources so as to leave space for the master daemon).
When I run my Spark job in cluster deploy-mode (using the same settings as above but with --deploy-mode cluster), around 50% of the time I get the desired deployment: the driver runs on the slave on the master node (which has the right resources of 2 cores and 8 GB), which leaves the original slave node free to allocate an executor of 4 cores and 16 GB. However, the other 50% of the time the master runs the driver on the non-master slave node. That puts a driver with 2 cores and 8 GB of memory on that node, which leaves no node with sufficient resources to start an executor (which requires 4 cores and 16 GB).
Is there any way to force the spark master to use a specific worker / slave for my driver? Given spark knows that there are two slave nodes, one with 2 cores and the other with 4 cores, and that my driver needs 2 cores, and my executor needs 4 cores it would ideally work out the right optimal placement, but this doesn't seem to be the case.
Any ideas / suggestions gratefully received!
Thanks!

I can see that this is an old question, but let me answer it still, someone might find it useful.
Add the --driver-java-options="-Dspark.driver.host=<HOST>" option to spark-submit when submitting the application, and Spark should deploy the driver to the specified host.

Spark executor configuration

My Spark cluster has the following configuration set:
spark.executor.memory=2g
I would like to know whether this 2G of RAM is shared by all executors, or whether this 2G of RAM is used by each executor on each worker machine.

"I would like to know whether this 2G of RAM is shared by all executors, or whether this 2G of RAM is used by each executor on each worker machine."
This setting will cause each executor on every one of your Worker nodes to have 2G memory. This setting doesn't mean "share 2G of memory between all executors", it means "give each executor 2G of memory".
This is explicitly stated in the documentation (emphasis mine):
spark.executor.memory | 1g | Amount of memory to use per executor process (e.g. 2g, 8g).
If you have multiple executors per Worker node, this means that each one of these executors will consume 2G of memory.

Apache Spark: setting executor instances does not change the executors

I have an Apache Spark application running on a YARN cluster (Spark has 3 nodes on this cluster) in cluster mode.
When the application is running, the Spark UI shows 2 executors (each running on a different node) and the driver running on the third node.
I want the application to use more executors, so I tried adding the argument --num-executors to spark-submit and setting it to 6.
spark-submit --driver-memory 3G --num-executors 6 --class main.Application --executor-memory 11G --master yarn-cluster myJar.jar <arg1> <arg2> <arg3> ...
However, the number of executors remains 2.
In the Spark UI I can see that the parameter spark.executor.instances is 6, just as I intended, and yet there are still only 2 executors.
I even tried setting this parameter from the code:
sparkConf.set("spark.executor.instances", "6")
Again, I can see that the parameter was set to 6, but still there are only 2 executors.
Does anyone know why I couldn't increase the number of my executors?
yarn.nodemanager.resource.memory-mb is 12g in yarn-site.xml

Increase yarn.nodemanager.resource.memory-mb in yarn-site.xml
With 12g per node you can only launch driver(3g) and 2 executors(11g).
Node1 - driver 3g (+7% overhead)
Node2 - executor1 11g (+7% overhead)
Node3 - executor2 11g (+7% overhead)
Now you are requesting executor3 with 11g, and no node has 11g of memory available.
For the 7% overhead, see spark.yarn.executor.memoryOverhead and spark.yarn.driver.memoryOverhead in https://spark.apache.org/docs/1.2.0/running-on-yarn.html

Note that yarn.nodemanager.resource.memory-mb is total memory that a single NodeManager can allocate across all containers on one node.
In your case, since yarn.nodemanager.resource.memory-mb = 12G, if you add up the memory allocated to all YARN containers on any single node, it cannot exceed 12G.
You have requested 11G (--executor-memory 11G) for each Spark executor container. Though 11G is less than 12G, this still won't work. Why?
Because you have to account for spark.yarn.executor.memoryOverhead, which is max(executorMemory * 0.10, 384 MB) by default, unless you override it.
So the following must hold:
spark.executor.memory + spark.yarn.executor.memoryOverhead <= yarn.nodemanager.resource.memory-mb
See: https://spark.apache.org/docs/latest/running-on-yarn.html for latest documentation on spark.yarn.executor.memoryOverhead
Moreover, spark.executor.instances is merely a request. The Spark ApplicationMaster for your application asks the YARN ResourceManager for a number of containers equal to spark.executor.instances. The ResourceManager grants each request on a NodeManager node based on:
Resource availability on the node. YARN scheduling has its own nuances - this is a good primer on how YARN FairScheduler works.
Whether the yarn.nodemanager.resource.memory-mb threshold would be exceeded on the node:
(number of spark containers running on the node * (spark.executor.memory + spark.yarn.executor.memoryOverhead)) <= yarn.nodemanager.resource.memory-mb
If a request is not granted, it is queued and granted once the above conditions are met.
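
The per-node condition above can be worked through numerically. The sketch below assumes the overhead is the larger of a fraction of executor memory and 384 MB, as the Spark documentation describes, with the fraction left as a parameter because it was 7% in the Spark 1.2 docs linked earlier and 10% in later versions. It ignores YARN rounding container sizes up to a multiple of yarn.scheduler.minimum-allocation-mb, and the function names are made up for illustration.

```python
MIN_OVERHEAD_MB = 384  # assumed minimum overhead, per the Spark docs


def container_size_mb(executor_memory_mb, factor=0.10):
    """Executor heap plus memory overhead: what one container needs from YARN."""
    overhead = max(int(executor_memory_mb * factor), MIN_OVERHEAD_MB)
    return executor_memory_mb + overhead


def executors_per_node(node_memory_mb, executor_memory_mb, factor=0.10):
    """How many executor containers fit under yarn.nodemanager.resource.memory-mb."""
    return node_memory_mb // container_size_mb(executor_memory_mb, factor)


node_mb = 12 * 1024  # yarn.nodemanager.resource.memory-mb = 12g, from the question

# 11G executors with the older 7% factor and the newer 10% factor.
for factor in (0.07, 0.10):
    fit = executors_per_node(node_mb, 11 * 1024, factor)
    print(f"11G executor, {factor:.0%} overhead: {fit} per node")

# Smaller executors with the 10% factor.
for gb in (5, 3):
    print(f"{gb}G executor: {executors_per_node(node_mb, gb * 1024)} per node")
```

Under these assumptions an 11G executor leaves room for at most one container per 12G node, so asking for 6 executors cannot be satisfied on 3 nodes unless the executors are made smaller or the node limit is raised.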

To utilize the Spark cluster to its full capacity you need to set values for --num-executors, --executor-cores and --executor-memory as per your cluster:
The --num-executors command-line flag or spark.executor.instances configuration property controls the number of executors requested;
The --executor-cores command-line flag or spark.executor.cores configuration property controls the number of concurrent tasks an executor can run;
The --executor-memory command-line flag or spark.executor.memory configuration property controls the heap size.
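
To see how the first two settings combine, here is a minimal sketch of the number of tasks that can run at once. It assumes each task uses one core (the default for spark.task.cpus); the function name task_slots and the example values are illustrative only.

```python
def task_slots(num_executors, executor_cores, task_cpus=1):
    """Upper bound on concurrently running tasks across all executors."""
    return num_executors * (executor_cores // task_cpus)


# Two executors with 4 cores each versus six executors with 2 cores each.
print(task_slots(num_executors=2, executor_cores=4))
print(task_slots(num_executors=6, executor_cores=2))
```

The same concurrency can often be reached with fewer, larger executors or more, smaller ones, so the choice is usually driven by how much memory each node can give a single container.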

You only have 3 nodes in the cluster and one of them is running the driver, so only 2 nodes are left. With 11G executors, each of those nodes can hold just one, so how can you create 6 executors?
I think you confused --num-executors with --executor-cores.
To increase concurrency you need more cores, so you want to utilize all the CPUs in your cluster.
