executor memory and local deployment - apache-spark

When adjusting the memory for the executors (e.g., by setting --executor-memory 2g) and setting the master to a local deployment (local[4]), does each local thread receive 2 GB of memory, or are the 2 GB the total for the local run?

spark.executor.memory is set per executor process, and this amount is shared between that executor's threads.
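As a minimal sketch (app.jar is a placeholder name), the two settings from the question would be passed like this; per the answer above, with local[4] there is a single executor process whose four task threads share the configured 2g rather than each getting 2g:
${SPARK_HOME}/bin/spark-submit \
  --master local[4] \
  --executor-memory 2g \
  app.jar
# local[4]: one executor process, four task threads sharing the 2g set above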

Related

Spark Driver does not have any worker allotted

I am learning Spark and trying to execute a simple wordcount application. I am using:
spark version 2.4.7-bin-hadoop2.7
scala 2.12
java 8
A Spark cluster with 1 master and 2 worker nodes is running as a standalone cluster.
The Spark config is:
spark.master spark://localhost:7077
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.memory 500M
master start script is ${SPARK_HOME}/sbin/start-master.sh
slave start script is ${SPARK_HOME}/sbin/start-slave.sh spark://localhost:7077 -c 1 -m 50M
I want to start the driver in cluster mode
${SPARK_HOME}/bin/spark-submit --master spark://localhost:7077 --deploy-mode cluster --driver-memory 500M --driver-cores 8 --executor-memory 50M --executor-cores 4 <absolute path to the jar file>
Note: The completed driver/apps are the ones I had to kill
I have used the above params after reading the Spark docs and checking blogs. But after I submit the job, the driver does not run; it always shows the worker as none. I have read multiple blogs and checked the documentation to find out how to submit the job in cluster mode, and I have tweaked different params for spark-submit, but it does not execute. Interestingly, when I submit in client mode it works.
Can you help me fix this issue?
Take a look at the CPU and memory configuration of your workers and of the driver.
Your application requires 500 MB of RAM and one CPU core to run the driver, and 50 MB and one core to run computational jobs, so you need 550 MB of RAM and two cores in total. These resources have to be provided by a worker when you run your driver in cluster mode, but each of your workers is allowed to use only one CPU core and 50 MB of RAM. The resources the worker has are therefore not enough to execute your driver.
You have to give your Spark cluster as many resources as your job needs:
Worker Cores >= Driver Cores + Executor Cores
Worker Memory >= Driver Memory + Executor Memory
Perhaps you have to increase the amount of memory for both the driver and the executor. Try running the worker with 1 GB of memory and setting 512 MB for both --driver-memory and --executor-memory.
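A minimal sketch of a worker launch and a submit command that satisfy both inequalities, using the 1 GB / 512 MB figures suggested above (the jar path stays a placeholder):
${SPARK_HOME}/sbin/start-slave.sh spark://localhost:7077 -c 2 -m 1G
# worker offers 2 cores and 1 GB >= driver (1 core, 512M) + executor (1 core, 512M)
${SPARK_HOME}/bin/spark-submit --master spark://localhost:7077 --deploy-mode cluster --driver-memory 512M --driver-cores 1 --executor-memory 512M --executor-cores 1 <absolute path to the jar file>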

How many YARN containers can each client-submitted application launch in each Node Manager?

A container is an abstract notion in YARN. When running Spark on YARN, each Spark executor runs as a YARN container. How many YARN containers can each client-submitted application launch on each NodeManager?
You can run as many executors on a single NodeManager as you want, as long as you have the resources. If you have a server with 20 GB of RAM and 10 cores, you can run ten 2 GB, 1-core executors on that NodeManager. It wouldn't be advisable to run multiple executors on the same NodeManager, though, as there is an overhead cost in shuffling data between executors, even if the processes are running on the same machine.
Each executor runs in a YARN container.
Depending on how big your YARN cluster is, how your data is spread out among the worker nodes (for better data locality), how many executors you requested for your application, how much resource you requested per executor (cores per executor, memory per executor), and whether you have enabled dynamic resource allocation, Spark decides how many executors are needed in total and how many executors to launch per worker node.
If you request resources that the YARN cluster cannot accommodate, your request will be rejected.
The following are the properties to look out for when making a spark-submit request (an example command follows the list):
--num-executors - number of total executors you need
--executor-cores - number of cores per executor. Max 5 is recommended.
--executor-memory - amount of memory per executor.
--conf spark.dynamicAllocation.enabled=true - enables dynamic resource allocation
--conf spark.dynamicAllocation.maxExecutors=<n> - upper bound on the number of executors when dynamic allocation is enabled
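A minimal sketch of a submit command combining these properties; the numbers and app.jar are placeholders, and dynamic allocation additionally requires the external shuffle service (or shuffle tracking in newer Spark versions), which is not shown here:
${SPARK_HOME}/bin/spark-submit \
  --master yarn \
  --num-executors 4 \
  --executor-cores 4 \
  --executor-memory 4g \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.maxExecutors=10 \
  app.jar
# with dynamic allocation enabled, --num-executors acts as the initial count and
# spark.dynamicAllocation.maxExecutors caps how far YARN can scale it up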

Running Spark on heterogeneous cluster in standalone mode

I have a cluster of 3 nodes; each has 12 cores, and they have 30 GB, 20 GB and 10 GB of RAM respectively. When I run my application, I set the executor memory to 20 GB, which prevents an executor from being launched on the 10 GB machine since that exceeds the worker's memory threshold, and it also under-utilizes the resources of the 30 GB machine. I searched but didn't find any way to set the executor memory dynamically based on the capacity of the node, so how can I configure the cluster or my Spark job to fully utilize the resources of the cluster?
The solution is to have more executors, each with less memory. You can use all of the memory by having six 10 GB executors (1 on the 10 GB node, 2 on the 20 GB node, 3 on the 30 GB node), or by having twelve 5 GB executors, and so on; a sketch of the first layout follows.
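A minimal sketch of the six-executor layout, assuming a standalone cluster whose workers are configured (via SPARK_WORKER_MEMORY) to offer their full 10/20/30 GB; the master URL is a placeholder:
${SPARK_HOME}/bin/spark-submit \
  --master spark://master-host:7077 \
  --executor-memory 10g \
  --executor-cores 4 \
  app.jar
# with spark.executor.cores set explicitly, standalone Spark can place several
# executors of the same application on one worker: 1 on the 10 GB node,
# 2 on the 20 GB node and 3 on the 30 GB node, so all of the RAM is used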

Spark executor configuration

On my Spark cluster the following configuration is set:
spark.executor.memory=2g
I would like to know whether this 2 GB of RAM is shared by all executors, or whether this 2 GB of RAM is used by each executor on each worker machine.
This setting causes each executor on every one of your worker nodes to have 2 GB of memory. It doesn't mean "share 2 GB of memory between all executors"; it means "give each executor 2 GB of memory".
This is explicitly stated in the documentation (emphasis mine):
spark.executor.memory | 1g | Amount of memory to use per executor process (e.g. 2g, 8g).
If you have multiple executors per worker node, each one of these executors will consume 2 GB of memory.
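A minimal sketch of what this means in spark-defaults.conf terms; the worker sizes are made-up numbers used only to illustrate the per-executor semantics:
# spark-defaults.conf
spark.executor.memory  2g
spark.executor.cores   2
# on a worker offering 8 GB and 8 cores, up to 4 executors of the application
# can be placed there, and each of them gets its own 2 GB; the 2 GB is never
# split or shared between executors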

Configuring Executor memory and number of executors per Worker node

How do I configure the executor memory in a Spark cluster? Also, how do I configure the number of executors per worker node?
Is there any way to know how much executor memory is free to cache or persist new RDDs?
Configuring Spark executor memory: use the property spark.executor.memory or the --executor-memory flag when submitting the job.
Configuring the number of executors per node depends on which scheduler you use for Spark. With YARN and Mesos you don't have control over this; you can only set the number of executors. With a Spark standalone cluster, you can tune the SPARK_WORKER_INSTANCES parameter.
You can check the amount of free memory in the WebUI of the Spark driver. Refer to How to set Apache Spark Executor memory to see why this is not equal to the total executor memory you've set.
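A minimal sketch for the standalone case, with per-machine settings in conf/spark-env.sh; all numbers and the master URL are placeholders:
# conf/spark-env.sh on each worker machine (standalone mode)
SPARK_WORKER_INSTANCES=2   # two worker instances per machine
SPARK_WORKER_MEMORY=4g     # memory each worker instance can hand out to executors
SPARK_WORKER_CORES=4       # cores each worker instance can hand out

# per-executor memory is then set at submit time
${SPARK_HOME}/bin/spark-submit --master spark://master-host:7077 --executor-memory 2g app.jar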
