I have a Spark job that runs on a cluster with dynamic resource allocation enabled. I submit the job with the num-executors and executor-memory properties set. What takes precedence here? Will the job run with dynamic allocation or with the resources that I specify in the config?
It depends on which config parameter has the greater value:
spark.dynamicAllocation.initialExecutors or spark.executor.instances, aka --num-executors (when launching from the command line at runtime).
Here is the reference doc if you are using Cloudera on YARN; make sure you are looking at the CDH version that matches your environment:
https://www.cloudera.com/documentation/enterprise/6/6.2/topics/cdh_ig_running_spark_on_yarn.html#spark_on_yarn_dynamic_allocation__table_tkb_nyv_yr
The Apache Spark documentation covers this too:
https://spark.apache.org/docs/latest/configuration.html#dynamic-allocation
So, to sum it up: if you are using --num-executors, it will most likely override (cancel, i.e. not use) dynamic allocation, unless you set spark.dynamicAllocation.initialExecutors to a higher value.
The configuration documentation (Spark 2.4.4) says about spark.dynamicAllocation.initialExecutors:
Initial number of executors to run if dynamic allocation is enabled.
If --num-executors (or spark.executor.instances) is set and larger than this value, it will be used as the initial number of executors.
So, as I read it, if dynamic allocation is enabled (i.e. spark.dynamicAllocation.enabled is true) it will be used, and the initial number of executors will simply be max(spark.dynamicAllocation.initialExecutors, spark.executor.instances).
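For example, a submission along these lines (the class and jar names are placeholders, and this assumes the Spark 2.x behaviour on YARN described above) would start with max(5, 10) = 10 executors and still let dynamic allocation scale the count afterwards:
# starts with max(5, 10) = 10 executors; dynamic allocation can then scale up or down
spark-submit \
  --master yarn \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.initialExecutors=5 \
  --num-executors 10 \
  --executor-memory 4g \
  --class com.example.MyApp myapp.jar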
Related
I have to set the number of executors in my Spark application to 20. Looking at the official documentation, I'm confused about which is the better config to set:
spark.dynamicAllocation.initialExecutors = 20
spark.executor.instances=20
I have the following config enabled
spark.dynamicAllocation.enabled = true
In what scenario would I use each one?
As per the Spark documentation:
spark.dynamicAllocation.initialExecutors
Initial number of executors to run if dynamic allocation is enabled.
If --num-executors (or spark.executor.instances) is set and larger
than this value, it will be used as the initial number of executors.
As you can see in the quoted text, it can be overridden by --num-executors when that is set to a higher value than spark.dynamicAllocation.initialExecutors.
Basically, when dynamic allocation is enabled, your application starts with spark.dynamicAllocation.initialExecutors executors and then gradually scales up as needed, up to spark.dynamicAllocation.maxExecutors.
spark.executor.instances
number of executors for static allocation.
In layman's terms, it is like saying either:
"I want x resources (spark.executor.instances) to finish a job"
(OR)
"I want at least x resources, at most y resources, and initially z resources to finish a job."
The condition x <= z <= y should always hold, and the actual resource usage is decided by what the job needs while it is running.
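A minimal sketch of the two styles (all numbers and the class/jar names are arbitrary placeholders):
# Static allocation: exactly x executors for the whole job
spark-submit --conf spark.dynamicAllocation.enabled=false \
  --conf spark.executor.instances=20 \
  --class com.example.MyApp myapp.jar

# Dynamic allocation: min x, initial z, max y, with x <= z <= y
spark-submit --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=5 \
  --conf spark.dynamicAllocation.initialExecutors=20 \
  --conf spark.dynamicAllocation.maxExecutors=40 \
  --class com.example.MyApp myapp.jar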
When to use dynamic allocation?
Use it when you have multiple streaming applications running on your cluster, or on-demand Spark SQL jobs. Most of the time such jobs need few resources and stay almost idle; only during big data chunks (peak hours) do they need more resources to process data. The rest of the time, cluster resources should be freed up and used for other purposes.
Note: make sure to enable the external shuffle service (spark.shuffle.service.enabled=true) when dynamic allocation is enabled.
The purpose of the external shuffle service is to allow executors to be removed without deleting shuffle files written by them. The way to set up this service varies across cluster managers.
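On YARN, for example, the setup roughly looks like this (a sketch based on the Spark-on-YARN docs; file paths and the Spark version in the jar name will vary by distribution):
# yarn-site.xml on every NodeManager:
#   yarn.nodemanager.aux-services                     = mapreduce_shuffle,spark_shuffle
#   yarn.nodemanager.aux-services.spark_shuffle.class = org.apache.spark.network.yarn.YarnShuffleService
# Also add the spark-<version>-yarn-shuffle.jar to the NodeManager classpath and restart the NodeManagers.
# Application side:
spark-submit \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --class com.example.MyApp myapp.jar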
References:
https://dzone.com/articles/spark-dynamic-allocation
We have a YARN cluster and we use Spark 2.3.2. I wanted to use Spark's dynamic resource allocation when submitting applications, but in the spark-defaults.conf file the property spark.executor.instances is set to 16. From my understanding, I should not set spark.executor.instances if I want to use dynamic resource allocation; otherwise, even if dynamic resource allocation is enabled, it gets overridden by spark.executor.instances.
I can't edit spark-defaults.conf, so I wanted to reset the value assigned to spark.executor.instances through the --conf argument of spark-submit. I tried setting it to blank and to 0, but in both cases the job failed with "spark.executor.instances should be a positive number".
Given this situation, how can I successfully use Spark's dynamic resource allocation?
One more observation: I do not see dynamic resource allocation enabled in the spark-defaults.conf file. Isn't it enabled by default since Spark 2.0?
Read about dynamic allocation here:
https://dzone.com/articles/spark-dynamic-allocation
This will give you a brief idea of how it works.
There is no way to unset the configuration spark.executor.instances. If dynamic resource allocation is enabled and spark.executor.instances is also set, then spark.executor.instances is used to calculate the initial number of executors required:
initial executors = max(spark.executor.instances, spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors)
As long as we set either spark.dynamicAllocation.initialExecutors or spark.dynamicAllocation.minExecutors to a value higher than spark.executor.instances, the spark.executor.instances setting has no impact on dynamic resource allocation.
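Under that rule, a submission roughly like the following should let dynamic allocation take over without touching spark-defaults.conf (a sketch; the min/max values and class/jar names are placeholders):
# minExecutors is set above the 16 coming from spark-defaults.conf,
# so spark.executor.instances no longer drives the executor count
spark-submit \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=17 \
  --conf spark.dynamicAllocation.maxExecutors=64 \
  --class com.example.MyApp myapp.jar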
I'm trying to set up a small Dataproc Spark cluster of 3 workers (2 regular and one preemptible), but I'm running into problems.
Specifically, I've been struggling to find a way to let Spark application submitters freely specify the number of executors while also being able to specify how many cores should be assigned to them.
The Dataproc image of YARN and Spark has the following defaults:
Spark dynamic allocation enabled
Yarn Capacity Scheduler configured with DefaultResourceCalculator
With these defaults the number of cores is not taken into account (the container:vcores ratio is always 1:1), as DefaultResourceCalculator only cares about memory. In any case, when configured this way, the number of executors is honored (by setting spark.dynamicAllocation.enabled=false and spark.executor.instances=<num> as properties in the gcloud submit command).
So I changed it to DominantResourceCalculator, and now it takes the requested cores into account, but I'm no longer able to specify the number of executors, regardless of whether I disable Spark dynamic allocation or not.
It might also be of interest to know that the default YARN queue is limited to 70% of capacity by configuration (in capacity-scheduler.xml) and that there is another, non-default queue configured (but not used yet). My understanding is that neither the Capacity nor the Fair scheduler limits resource allocation for uncontended job submissions as long as the max capacity is kept at 100. In any case, for the sake of clarity, these are the properties set during cluster creation:
capacity-scheduler:yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DominantResourceCalculator
capacity-scheduler:yarn.scheduler.capacity.root.queues=default,online
capacity-scheduler:yarn.scheduler.capacity.root.default.capacity=30
capacity-scheduler:yarn.scheduler.capacity.root.online.capacity=70
capacity-scheduler:yarn.scheduler.capacity.root.online.user-limit-factor=1
capacity-scheduler:yarn.scheduler.capacity.root.online.maximum-capacity=100
capacity-scheduler:yarn.scheduler.capacity.root.online.state=RUNNING
capacity-scheduler:yarn.scheduler.capacity.root.online.acl_submit_applications=*
capacity-scheduler:yarn.scheduler.capacity.root.online.acl_administer_queue=*
The job submission is done with the gcloud tool, and the queue used is the default one.
E.g., the following properties are set when executing gcloud dataproc submit:
--properties spark.dynamicAllocation.enabled=false,spark.executor.memory=5g,spark.executor.instances=3
end up in an allocation that does not honor the requested number of executors.
Is there a way to configure YARN so that it honors both the executor count and the requested cores?
EDITED to specify queue setup
You may try setting a higher value, such as 2, for yarn.scheduler.capacity.root.online.user-limit-factor in place of the current value of 1. This setting lets a single user leverage up to twice the queue's configured capacity, and your maximum-capacity setting of 100% leaves room for that doubling.
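At cluster creation time that could look roughly like this (a sketch; the cluster name and any other flags are placeholders):
gcloud dataproc clusters create my-cluster \
  --properties 'capacity-scheduler:yarn.scheduler.capacity.root.online.user-limit-factor=2'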
I looked everywhere but couldn't find the answer I need. I'm running Spark 1.5.2 in standalone mode with SPARK_WORKER_INSTANCES=1, because I only want one executor per worker per host. What I would like is to increase the number of hosts for my job and hence the number of executors. I've tried changing spark.executor.instances and spark.cores.max in spark-defaults.conf, but I still see the same number of executors. People suggest changing --num-executors; is that not the same as spark.executor.instances?
This Cloudera blog post
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/ says "The --num-executors command-line flag or spark.executor.instances configuration property control the number of executors requested. Starting in CDH 5.4/Spark 1.3, you will be able to avoid setting this property by turning on dynamic allocation with the spark.dynamicAllocation.enabled property"
but I'm not sure whether spark.dynamicAllocation.enabled only works on YARN.
Any suggestion on how to do this for Spark 1.5.2 is greatly appreciated!
I don't believe you need to set SPARK_WORKER_INSTANCES at all. If you do use it, you also need to set the SPARK_WORKER_CORES environment variable; otherwise you will end up with one worker consuming all the cores, and the other workers cannot be launched correctly.
I haven't seen spark.executor.instances used outside of Spark's YARN configuration.
That said, I would definitely suggest using --num-executors, given that your cluster has multiple workers!
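If --num-executors does not take effect in standalone mode (it is primarily a YARN option), a core-based sketch can achieve a similar result; the master URL, core counts, and class/jar names below are placeholders:
# conf/spark-env.sh on each worker host:
#   SPARK_WORKER_CORES=8          # cores this worker offers to applications
# Submit, capping total cores so the application spreads across several hosts:
spark-submit --master spark://master-host:7077 \
  --conf spark.cores.max=24 \
  --executor-cores 8 \
  --class com.example.MyApp myapp.jar
# With 8 cores per executor and a 24-core cap, roughly 24 / 8 = 3 executors are launched.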
I have a Hadoop cluster of 5 nodes where Spark runs in yarn-client mode.
I use --num-executors for the number of executors. The maximum number of executors I am able to get is 20. Even if I specify more, I get only 20 executors.
Is there any upper limit on the number of executors that can be allocated? Is it a configuration setting, or is the decision made based on the available resources?
Apparently your 20 running executors consume all of the available memory. You can try decreasing the executor memory with the spark.executor.memory parameter, which should leave a bit more room for additional executors to spawn.
Also, are you sure you set the executor count correctly? You can verify your environment settings in the Spark UI by looking at the spark.executor.instances value in the Environment tab.
EDIT: As Mateusz Dymczyk pointed out in the comments, the limited number of executors may be caused not only by exhausted RAM but also by exhausted CPU cores. In both cases the limit comes from the resource manager.
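As a hypothetical illustration of how the resource manager caps the count (all numbers are made up):
# 5 NodeManagers, each offering ~64 GB of container memory
# spark.executor.memory=14g + ~1.4 GB overhead ≈ 15.4 GB per container
#   executors per node ≈ floor(64 / 15.4) = 4  ->  5 nodes * 4 = 20 executors max
# Lowering spark.executor.memory to 10g (≈ 11 GB with overhead):
#   executors per node ≈ floor(64 / 11)  = 5  ->  5 nodes * 5 = 25 executors max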