We have a YARN cluster and we use Spark 2.3.2. I wanted to use Spark's dynamic resource allocation when submitting Spark applications, but in the spark-defaults.conf file the property spark.executor.instances is set to 16. From my understanding, I should not set spark.executor.instances if I want to use dynamic resource allocation; otherwise, even if dynamic resource allocation is enabled, it gets overridden by the spark.executor.instances property.
I can't edit spark-defaults.conf, so I wanted to reset the value assigned to spark.executor.instances through the --conf argument to spark-submit. I tried setting it to blank and to 0, but in both cases the job failed with an error saying spark.executor.instances should be a positive number.
Given this situation, how can I successfully use Spark's dynamic resource allocation?
One more observation: I do not see dynamic resource allocation enabled in the spark-defaults.conf file. Isn't it enabled by default since Spark 2.0?
Read about dynamic allocation here:
https://dzone.com/articles/spark-dynamic-allocation
This will give you a brief idea of how it works.
There is no way to reset the configuration spark.executor.instances. If dynamic resource allocation is enabled and spark.executor.instances is also set, then spark.executor.instances is used in calculating the initial number of executors:
initial executors = max(spark.executor.instances, spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors)
As long as we set either spark.dynamicAllocation.initialExecutors or spark.dynamicAllocation.minExecutors to a value greater than spark.executor.instances, the configuration spark.executor.instances has no impact on dynamic resource allocation.
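For example, a spark-submit along these lines enables dynamic allocation and makes the pre-set spark.executor.instances=16 irrelevant to the max() above. This is only a sketch: the application name and the numbers are placeholders, and it assumes the external shuffle service is already running on the YARN NodeManagers.
# Enable dynamic allocation and set its bounds explicitly at submit time;
# the values below are placeholders.
spark-submit \
  --master yarn \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.initialExecutors=20 \
  --conf spark.dynamicAllocation.minExecutors=2 \
  --conf spark.dynamicAllocation.maxExecutors=40 \
  my_app.py
Because initialExecutors (20) is larger than spark.executor.instances (16), the initial executor count comes from the dynamic-allocation settings, and executors can then scale between minExecutors and maxExecutors.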
Related
I've found that some intermediate stages use (much) fewer executors than the value of spark.executor.instances.
Currently, spark.executor.instances is set while spark.dynamicAllocation.enabled is set to false.
We've also tried setting spark.dynamicAllocation.enabled to true and setting spark.dynamicAllocation.minExecutors to some value.
But, in both scenarios, spark.dynamicAllocation.minExecutors and spark.executor.instances seem to be ignored.
I wonder if anyone has any pointers on how to investigate further or what the root cause might be.
Edit: when dynamic allocation is enabled, we also set spark.dynamicAllocation.maxExecutors.
Try setting
spark.dynamicAllocation.maxExecutors
I'm working on a Spark project using the MapR distribution where dynamic allocation is enabled. Please refer to the parameters below:
spark.dynamicAllocation.enabled true
spark.shuffle.service.enabled true
spark.dynamicAllocation.minExecutors 0
spark.dynamicAllocation.maxExecutors 20
spark.executor.instances 2
As per my understanding, spark.executor.instances is what we define as --num-executors while submitting our pySpark job.
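For reference, on YARN the flag is just shorthand for the property, so these two invocations should be equivalent (the application name is a placeholder):
spark-submit --num-executors 5 my_app.py
spark-submit --conf spark.executor.instances=5 my_app.py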
I have the following two questions:
If I use --num-executors 5 during job submission, will it override the spark.executor.instances 2 config setting?
What is the purpose of having spark.executor.instances defined when dynamic allocation min and max executors are already defined?
There is one more parameter, spark.dynamicAllocation.initialExecutors, which by default takes the value of spark.dynamicAllocation.minExecutors. If spark.executor.instances is defined and is larger than the minimum, it will be used as the initial number of executors.
spark.executor.instances is basically the property for static allocation. However, if dynamic allocation is enabled, the initial set of executors will be at least equal to spark.executor.instances.
It won't get overwritten in the config setting when you set --num-executors.
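As a worked example with the settings quoted in the question (this simply applies the max() rule from above, not the result of any particular run): with spark.executor.instances 2, minExecutors 0, and initialExecutors left unset (so it defaults to minExecutors), the initial number of executors would be
initial executors = max(spark.executor.instances, spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors)
                  = max(2, 0, 0)
                  = 2
From there, dynamic allocation is free to scale the executor count anywhere between minExecutors (0) and maxExecutors (20) based on the workload.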
Extra read: official doc
I have a Spark job that runs on a cluster with dynamic resource allocation enabled. I submit the Spark job with the num-executors and executor-memory properties. What will take precedence here? Will the job run with dynamic allocation or with the resources that I mention in the config?
It depends on which config parameter has the greater value:
spark.dynamicAllocation.initialExecutors or spark.executor.instances, a.k.a. --num-executors (when launching via the terminal at runtime).
Here is the reference doc if you are using Cloudera on YARN; make sure you are looking at the correct CDH version for your environment.
https://www.cloudera.com/documentation/enterprise/6/6.2/topics/cdh_ig_running_spark_on_yarn.html#spark_on_yarn_dynamic_allocation__table_tkb_nyv_yr
The Apache Spark documentation covers this too:
https://spark.apache.org/docs/latest/configuration.html#dynamic-allocation
So, to sum it up: if you are using --num-executors, it is most likely overriding (cancelling, and not using) dynamic allocation, unless you set spark.dynamicAllocation.initialExecutors to a higher value.
The configuration documentation (2.4.4) says about spark.dynamicAllocation.initialExecutors:
Initial number of executors to run if dynamic allocation is enabled.
If --num-executors (or spark.executor.instances) is set and larger than this value, it will be used as the initial number of executors.
So for me, if dynamic allocation is enabled (i.e. spark.dynamicAllocation.enabled is true) then it will be used, and the initial number of executors will simply be max(spark.dynamicAllocation.initialExecutors, spark.executor.instances).
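As a quick illustration (the numbers here are made up, not taken from any real job): with spark.dynamicAllocation.initialExecutors set to 10, passing --num-executors 4 does not disable dynamic allocation; the job simply starts with the larger of the two values, and scales from there:
initial executors = max(10, 4)  = 10
initial executors = max(10, 12) = 12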
I'm trying to set up a small Dataproc Spark cluster of 3 workers (2 regular and one preemptible), but I'm running into problems.
Specifically, I've been struggling to find a way to let Spark application submitters have the freedom to specify the number of executors while still being able to specify how many cores should be assigned to them.
The Dataproc image of YARN and Spark has the following defaults:
Spark dynamic allocation enabled
YARN Capacity Scheduler configured with DefaultResourceCalculator
With these defaults the number of cores is not taken into account (the container-to-vcores ratio is always 1:1), as DefaultResourceCalculator only cares about memory. In any case, when configured this way, the number of executors is honored (by setting spark.dynamicAllocation.enabled=false and spark.executor.instances=<num> as properties in gcloud submit).
So I changed it to DominantResourceCalculator and now it takes care of the requested cores, but I'm no longer able to specify the number of executors, regardless of whether I disable Spark dynamic allocation or not.
It might also be of interest to know that the default YARN queue is limited to 70% of capacity by configuration (in capacity-scheduler.xml) and that there is another non-default queue configured (but not used yet). My understanding is that both the Capacity and Fair schedulers do not limit resource allocation for uncontended job submissions as long as the maximum capacity is kept at 100. In any case, for the sake of clarity, these are the properties set up during cluster creation:
capacity-scheduler:yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DominantResourceCalculator
capacity-scheduler:yarn.scheduler.capacity.root.queues=default,online
capacity-scheduler:yarn.scheduler.capacity.root.default.capacity=30
capacity-scheduler:yarn.scheduler.capacity.root.online.capacity=70
capacity-scheduler:yarn.scheduler.capacity.root.online.user-limit-factor=1
capacity-scheduler:yarn.scheduler.capacity.root.online.maximum-capacity=100
capacity-scheduler:yarn.scheduler.capacity.root.online.state=RUNNING
capacity-scheduler:yarn.scheduler.capacity.root.online.acl_submit_applications=*
capacity-scheduler:yarn.scheduler.capacity.root.online.acl_administer_queue=*
The job submission is done by means of the gcloud tool, and the queue used is the default one.
E.g., the following properties, set when executing gcloud dataproc submit:
--properties spark.dynamicAllocation.enabled=false,spark.executor.memory=5g,spark.executor.instances=3
end up in an allocation that does not honor the requested number of executors.
Is there a way to configure YARN so that it accepts both?
EDITED to specify queue setup
You may try setting a higher value, such as 2, for yarn.scheduler.capacity.root.online.user-limit-factor in place of the current value of 1. This setting lets a single user leverage twice the queue's configured capacity, and your maximum-capacity setting of 100% leaves room for that doubling.
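In the same cluster-creation property format used in the question, that suggestion would look something like this (a sketch, assuming the online queue is the one you want to relax):
capacity-scheduler:yarn.scheduler.capacity.root.online.user-limit-factor=2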
I looked everywhere but couldn't find the answer I need. I'm running Spark 1.5.2 in standalone mode with SPARK_WORKER_INSTANCES=1 because I only want 1 executor per worker per host. What I would like is to increase the number of hosts for my job and hence the number of executors. I've tried changing spark.executor.instances and spark.cores.max in spark-defaults.conf, but I still see the same number of executors. People suggest changing --num-executors; is that not the same as spark.executor.instances?
This Cloudera blog post
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/ says "The --num-executors command-line flag or spark.executor.instances configuration property control the number of executors requested. Starting in CDH 5.4/Spark 1.3, you will be able to avoid setting this property by turning on dynamic allocation with the spark.dynamicAllocation.enabled property"
but I'm not sure if spark.dynamicAllocation.enabled only works for YARN.
Any suggestion on how to do this for Spark 1.5.2 is greatly appreciated!
I don't believe you need to set up SPARK_WORKER_INSTANCES! And if you do want to use it, you need to set the SPARK_WORKER_CORES environment variable as well; otherwise, you will end up with one worker consuming all the cores, and the other workers can't be launched correctly.
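If you do keep it, a minimal sketch of the two variables together might look like this (the core count is a placeholder for your hardware):
# conf/spark-env.sh on each worker host
SPARK_WORKER_INSTANCES=1   # one worker process per host
SPARK_WORKER_CORES=4       # cap the cores this worker offers so a single worker doesn't take them all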
I haven't seen spark.executor.instances used outside of a YARN configuration with Spark.
That said, I would definitely suggest using --num-executors, given that your cluster has multiple workers!