spark.executor.instances over spark.dynamicAllocation.enabled = True

I'm working on a Spark project using the MapR distribution, where dynamic allocation is enabled. Please refer to the parameters below:
spark.dynamicAllocation.enabled true
spark.shuffle.service.enabled true
spark.dynamicAllocation.minExecutors 0
spark.dynamicAllocation.maxExecutors 20
spark.executor.instances 2
As per my understanding, spark.executor.instances is what we define as --num-executors when submitting our PySpark job.
I have the following 2 questions:
If I use --num-executors 5 during job submission, will it overwrite the spark.executor.instances 2 config setting?
What is the purpose of having spark.executor.instances defined when the dynamic allocation min and max executors are already defined?

There is one more parameter, spark.dynamicAllocation.initialExecutors, which defaults to the value of spark.dynamicAllocation.minExecutors. If spark.executor.instances is defined and is larger than minExecutors, then it will be used as the initial number of executors.

spark.executor.instances is basically the property for static allocation. However, if dynamic allocation is enabled, the initial set of executors will be at least as large as spark.executor.instances.
The value in the config setting won't get overwritten when you set --num-executors; the flag only applies to that submission.
Extra read: official doc
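As a concrete sketch under the question's settings (the script name my_job.py is just a placeholder), a submission that passes --num-executors explicitly might look like this:
spark-submit \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=0 \
  --conf spark.dynamicAllocation.maxExecutors=20 \
  --num-executors 5 \
  my_job.py
With dynamic allocation enabled, the 5 is only used as the initial executor count; the job can still scale anywhere between 0 and 20 executors, and nothing in spark-defaults.conf is modified.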

Related

How to ensure Spark uses the set number of executors across every stage?

I've found that some intermediate stages use (many) fewer executors than the value of spark.executor.instances.
Currently, spark.executor.instances is set while spark.dynamicAllocation.enabled is set to false.
We've also tried setting spark.dynamicAllocation.enabled to true and spark.dynamicAllocation.minExecutors to some value.
But in both scenarios, spark.dynamicAllocation.minExecutors and spark.executor.instances seem to be ignored.
I wonder if anyone has any pointers on how to investigate further or what the root cause might be.
Edit: when dynamic allocation is enabled, we also set spark.dynamicAllocation.maxExecutors.
Try setting spark.dynamicAllocation.maxExecutors.
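For reference, and assuming the external shuffle service is available on the nodes, a minimal dynamic allocation setup usually looks something like this (the min/max values are only illustrative):
spark.dynamicAllocation.enabled true
spark.shuffle.service.enabled true
spark.dynamicAllocation.minExecutors 2
spark.dynamicAllocation.maxExecutors 20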

Reset the value of the configuration "spark.executor.instances"

We have a YARN cluster and we use Spark version 2.3.2. I wanted to use Spark's dynamic resource allocation when submitting Spark applications, but in the spark-defaults.conf file the value of the property spark.executor.instances is set to 16. From my understanding, I should not set spark.executor.instances if we want to use dynamic resource allocation; otherwise, even if dynamic resource allocation is enabled, it gets overridden by the property spark.executor.instances.
I can't edit spark-defaults.conf, so I wanted to reset the value assigned to spark.executor.instances through the --conf argument to spark-submit. I tried setting it to blank and to 0, but in both cases the job failed with "spark.executor.instances should be a positive number".
Given this situation, how can I successfully use Spark's dynamic resource allocation?
One more observation: I do not see dynamic resource allocation enabled in the spark-defaults.conf file. Isn't it enabled by default since Spark 2.0?
Read about dynamic allocation here:
https://dzone.com/articles/spark-dynamic-allocation
It will give you a brief idea of how it works.
There is no way to reset the configuration spark.executor.instances. If dynamic resource allocation is enabled and spark.executor.instances is also set, then spark.executor.instances is used to calculate the initial number of executors required:
initial executors = max(spark.executor.instances, spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors)
As long as we set either spark.dynamicAllocation.initialExecutors or spark.dynamicAllocation.minExecutors higher than spark.executor.instances, the configuration spark.executor.instances does not have any impact on the dynamic resource allocation.
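Since spark-defaults.conf cannot be edited here, one hedged workaround (the application name and the min/max values below are placeholders) is to pass the dynamic allocation settings on the command line so they line up with the inherited spark.executor.instances=16:
spark-submit \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.initialExecutors=16 \
  --conf spark.dynamicAllocation.minExecutors=2 \
  --conf spark.dynamicAllocation.maxExecutors=50 \
  your_app.py
The job then starts with max(16, 16, 2) = 16 executors, and dynamic allocation is free to scale between 2 and 50 afterwards.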

Spark dynamic allocation configuration settings precedence

I have a Spark job that runs on a cluster with dynamic resource allocation enabled. I submit the Spark job with the num-executors and executor-memory properties. What takes precedence here? Will the job run with dynamic allocation, or with the resources that I mention in the config?
It depends on which config parameter has the greater value: spark.dynamicAllocation.initialExecutors or spark.executor.instances (aka --num-executors when launching via the terminal at runtime).
Here is the reference doc if you are using Cloudera on YARN; make sure you are looking at the correct CDH version for your environment.
https://www.cloudera.com/documentation/enterprise/6/6.2/topics/cdh_ig_running_spark_on_yarn.html#spark_on_yarn_dynamic_allocation__table_tkb_nyv_yr
The Apache Spark documentation on dynamic allocation, too:
https://spark.apache.org/docs/latest/configuration.html#dynamic-allocation
So to sum it up: if you are using --num-executors, it will most likely override (cancel and not use) dynamic allocation, unless you set spark.dynamicAllocation.initialExecutors to a higher value.
The configuration documentation (2.4.4) says about spark.dynamicAllocation.initialExecutors:
Initial number of executors to run if dynamic allocation is enabled.
If --num-executors (or spark.executor.instances) is set and larger than this value, it will be used as the initial number of executors.
So, as I read it, if dynamic allocation is enabled (i.e. spark.dynamicAllocation.enabled is true) then it will be used, and the initial number of executors will simply be max(spark.dynamicAllocation.initialExecutors, spark.executor.instances).
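A quick worked example of that rule, with made-up numbers:
spark.dynamicAllocation.initialExecutors = 4
spark.executor.instances = 10 (i.e. --num-executors 10)
initial executors = max(4, 10) = 10
Dynamic allocation then stays active and can scale the job between spark.dynamicAllocation.minExecutors and spark.dynamicAllocation.maxExecutors as the workload changes.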

How is YARN ResourceManager's Total Memory calculated?

I'm running a Spark cluster in a 1 master node, 3 worker node configuration using AWS EMR and YARN client mode, with the master node being the client machine. All 4 nodes have 8GB of memory and 4 cores each. Given that hardware setup, I set the following:
spark.executor.memory = 5G
spark.executor.cores = 3
spark.yarn.executor.memoryOverhead = 600
With that configuration, would the expected total memory recognized by YARN's ResourceManager be 15GB? It's displaying 18GB. I've only seen YARN use up to 15GB when running Spark applications. Is that 15GB from spark.executor.memory * 3 nodes?
I want to assume that the YARN Total Memory is calculated by spark.executor.memory + spark.yarn.executor.memoryOverhead but I can't find that documented anywhere. What's the proper way to find the exact number?
And I should be able to increase the value of spark.executor.memory to 6G, right? I've gotten errors in the past when it was set that way. Are there other configurations I need to set?
Edit: So it looks like the worker nodes' value for yarn.scheduler.maximum-allocation-mb is 6114, i.e. about 6GB. This is the default that EMR sets for the instance type. And since 6GB * 3 = 18GB, that likely makes sense. I want to restart YARN and increase that value from 6GB to 7GB, but can't since this cluster is in use, so I guess my question still stands.
I want to assume that the YARN Total Memory is calculated by spark.executor.memory + spark.yarn.executor.memoryOverhead but I can't find that documented anywhere. What's the proper way to find the exact number?
This is sort of correct, but said backwards. YARN's total memory is independent of any configurations you set up for Spark. yarn.scheduler.maximum-allocation-mb controls how much memory YARN has access to, and can be found here. To use all available memory with Spark, you would set spark.executor.memory + spark.yarn.executor.memoryOverhead to equal yarn.scheduler.maximum-allocation-mb. See here for more info on tuning your spark job and this spreadsheet for calculating configurations.
And I should be able to increase the value of spark.executor.memory to 6G right?
Based on the spreadsheet, the upper limit of spark.executor.memory is 5502M if yarn.scheduler.maximum-allocation-mb is 6114M. Calculated by hand, this is 0.9 * 6114, since spark.executor.memoryOverhead defaults to
executorMemory * 0.10, with minimum of 384 (source)
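Plugging the question's numbers into that rule (treating yarn.scheduler.maximum-allocation-mb = 6114M as the per-container cap, and rounding to whole megabytes):
spark.executor.memory = 5G: 5120M + 600M overhead = 5720M <= 6114M, so the container fits
spark.executor.memory = 6G: 6144M + 600M overhead = 6744M > 6114M, so YARN rejects the request
Even with the default overhead of max(0.10 * executorMemory, 384M), the 6G case would be 6144M + 614M = 6758M, still over the cap, which would explain the earlier errors.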

How to increase the number of executors in Spark Standalone mode if spark.executor.instances and spark.cores.max aren't working

I looked everywhere but couldn't find the answer I need. I'm running Spark 1.5.2 in standalone mode with SPARK_WORKER_INSTANCES=1, because I only want 1 executor per worker per host. What I would like is to increase the number of hosts for my job and hence the number of executors. I've tried changing spark.executor.instances and spark.cores.max in spark-defaults.conf, but I'm still seeing the same number of executors. People suggest changing --num-executors; is that not the same as spark.executor.instances?
This Cloudera blog post
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/ says "The --num-executors command-line flag or spark.executor.instances configuration property control the number of executors requested. Starting in CDH 5.4/Spark 1.3, you will be able to avoid setting this property by turning on dynamic allocation with the spark.dynamicAllocation.enabled property"
but I"m not sure if spark.dynamicAllocation.enabled only works for YARN.
Any suggestion on how to do this for Spark 1.5.2 is great appreciated!
I don't believe you need to set SPARK_WORKER_INSTANCES at all. If you do want to use it, you need to set the SPARK_WORKER_CORES environment variable as well; otherwise you will end up with one worker consuming all the cores, and the other workers can't be launched correctly.
I haven't seen spark.executor.instances used outside a YARN configuration with Spark.
That said, I would definitely suggest using --num-executors, given that your cluster has multiple workers.
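If YARN is not in the picture, a hedged alternative for standalone mode (the master URL, core counts, and script name below are placeholders) is to control the executor count through spark.cores.max and spark.executor.cores, since the number of executors works out to roughly spark.cores.max / spark.executor.cores:
spark-submit \
  --master spark://master-host:7077 \
  --conf spark.cores.max=12 \
  --conf spark.executor.cores=4 \
  your_app.py
With those illustrative numbers the application would get roughly 12 / 4 = 3 executors, spread across the available workers.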
