Can we have the same Spark config for all jobs/applications? - apache-spark

I am trying to understand Spark configuration. I see that the number of executors, executor cores, and executor memory are calculated based on the cluster. E.g.:
Cluster Config:
10 Nodes
16 cores per Node
64GB RAM per Node
Recommended config is 29 executors, 18GB memory each, and 5 cores each!
However, would this config be the same for all the jobs/applications that run on the cluster? What happens if more than one job/app is running at the same time? Also, would this config be the same irrespective of the data that I am processing, whether it is 1GB or 100GB, or would the config change based on the data as well? If so, how is it calculated?
Reference for the recommended config: https://spoddutur.github.io/spark-notes/distribution_of_executors_cores_and_memory_for_spark_application.html
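For context, the arithmetic behind that recommendation (following the approach in the linked post) goes roughly as follows; the 1-core/1GB reservations and the ~7% overhead are rules of thumb, not exact values:
16 cores per node - 1 core reserved for OS/Hadoop daemons = 15 usable cores per node
15 usable cores / 5 cores per executor = 3 executors per node
3 executors x 10 nodes = 30, minus 1 executor for the YARN ApplicationMaster = 29 executors
(64GB - 1GB reserved) / 3 executors per node = ~21GB, minus ~7% YARN memory overhead ≈ 18-19GB per executor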

The default configuration in Spark applies to all jobs; you can set it in spark-defaults.conf.
In the case of YARN, jobs are automatically queued if enough resources are not available.
You can set the number of executor cores and other configuration options during spark-submit to override the defaults. You can also look at dynamic allocation to avoid doing this yourself, though it is not guaranteed to work as efficiently as setting the configuration yourself.
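For example (the values are purely illustrative), cluster-wide defaults can be placed in spark-defaults.conf and then overridden per application at submit time; the class and jar names below are placeholders:
# spark-defaults.conf (applies to every application unless overridden)
spark.executor.memory     18g
spark.executor.cores      5
spark.executor.instances  29
# Per-application override at submit time
spark-submit \
  --executor-memory 8g \
  --executor-cores 4 \
  --num-executors 10 \
  --class com.example.MyApp myapp.jar
If you would rather let Spark size the executor count itself, dynamic allocation is driven by properties such as spark.dynamicAllocation.enabled=true, spark.dynamicAllocation.minExecutors, spark.dynamicAllocation.maxExecutors and, on YARN, spark.shuffle.service.enabled=true.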

Related

Spark - only one partition is processed at each node

I see in my Spark job that usually (but not always) only one partition is being processed on each node. What could be the possible reasons? How can I debug it?
You should check the executors' resource configuration:
spark.executor.memory
spark.executor.cores
These configs control how many executors can run concurrently on each node and therefore how many partitions are processed concurrently (by default, with one core per executor, each executor processes a single partition at a time).
For example, if your nodes have 8 cores and 32GB of memory each and your Spark application is configured with:
spark.executor.memory=25g
spark.executor.cores=3
only one executor will be able to run concurrently on each node; to run 2 executors concurrently, each node would need at least 50GB of memory.
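As a counter-illustration on the same assumed 8-core / 32GB nodes, shrinking the per-executor memory lets two executors fit on each node:
spark.executor.memory=12g
spark.executor.cores=3
2 x 12GB = 24GB, which still fits in 32GB even with some overhead, so two executors run per node and, with one task per core, up to 6 partitions are processed concurrently on each node.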

Spark jobs seem to only be using a small amount of resources

Please bear with me because I am still quite new to Spark.
I have a GCP DataProc cluster which I am using to run a large number of Spark jobs, 5 at a time.
The cluster is 1 + 16 nodes, with 8 cores / 40GB memory / 1TB storage per node.
Now I might be misunderstanding something or not doing something correctly, but I currently have 5 jobs running at once, and the Spark UI shows that only 34/128 vcores are in use and that they do not appear to be evenly distributed (the jobs were executed simultaneously, but the distribution is 2/7/7/11/7). Only one core is allocated per running container.
I have used the flags --executor-cores 4 and --num-executors 6, which don't seem to have made any difference.
Can anyone offer some insight/resources as to how I can fine tune these jobs to use all available resources?
I have managed to solve the issue: I had no cap on memory usage, so it looked as though all the memory was allocated to just 2 cores per node.
I added the property spark.executor.memory=4G and re-ran the job; it instantly allocated 92 cores.
Hope this helps someone else!
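For reference, a sketch of how such a property can be passed when submitting to Dataproc; the cluster name, class, and jar path are placeholders:
gcloud dataproc jobs submit spark \
  --cluster=my-cluster \
  --class=com.example.MyApp \
  --jars=gs://my-bucket/myapp.jar \
  --properties=spark.executor.memory=4g,spark.executor.cores=4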
The Dataproc default configurations should take care of the number of executors. Dataproc also enables dynamic allocation, so executors will only be allocated if needed (according to Spark).
Spark cannot parallelize beyond the number of partitions in a Dataset/RDD. You may need to set the following properties to get good cluster utilization:
spark.default.parallelism: the default number of output partitions from transformations on RDDs (when not explicitly set)
spark.sql.shuffle.partitions: the number of output partitions from aggregations using the SQL API
Depending on your use case, it may make sense to explicitly set partition counts for each operation.
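As a rough sketch (the numbers are assumptions to tune; a common starting point is 2-3 tasks per available core, and 16 workers x 8 cores = 128 cores suggests something in the 256-384 range):
spark.default.parallelism=256
spark.sql.shuffle.partitions=256
Alternatively, set the partition count explicitly per operation, e.g. with repartition(256) on a DataFrame or RDD.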

Is there a way to specify all three resource properties (executor instances, cores and memory) in Spark on YARN (Dataproc)?

I'm trying to set up a small Dataproc Spark cluster of 3 workers (2 regular and one preemptible), but I'm running into problems.
Specifically, I've been struggling to find a way to let Spark application submitters freely specify the number of executors while also being able to specify how many cores should be assigned to them.
The Dataproc image of YARN and Spark has the following defaults:
Spark dynamic allocation enabled
Yarn Capacity Scheduler configured with DefaultResourceCalculator
With these defaults the number of cores is not taken into account (the container-to-vcore ratio is always 1:1), as DefaultResourceCalculator only cares about memory. In any case, when configured this way, the number of executors is honored (by setting spark.dynamicAllocation.enabled=false and spark.executor.instances=<num> as properties in the gcloud submit).
So I changed it to DominantResourceCalculator and now it takes the requested cores into account, but I'm no longer able to specify the number of executors, regardless of whether I disable Spark dynamic allocation or not.
It might also be of interest to know that the default YARN queue is limited to 70% of capacity by configuration (in capacity-scheduler.xml) and that there is another, non-default queue configured (but not used yet). My understanding is that neither the Capacity nor the Fair scheduler limits resource allocation for uncontended job submissions as long as the maximum capacity is kept at 100. In any case, for the sake of clarity, these are the properties set up during cluster creation:
capacity-scheduler:yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DominantResourceCalculator
capacity-scheduler:yarn.scheduler.capacity.root.queues=default,online
capacity-scheduler:yarn.scheduler.capacity.root.default.capacity=30
capacity-scheduler:yarn.scheduler.capacity.root.online.capacity=70
capacity-scheduler:yarn.scheduler.capacity.root.online.user-limit-factor=1
capacity-scheduler:yarn.scheduler.capacity.root.online.maximum-capacity=100
capacity-scheduler:yarn.scheduler.capacity.root.online.state=RUNNING
capacity-scheduler:yarn.scheduler.capacity.root.online.acl_submit_applications=*
capacity-scheduler:yarn.scheduler.capacity.root.online.acl_administer_queue=*
The job submission is done by means of the gcloud tool, and the queue used is the default one.
E.g., the following properties are set when executing gcloud dataproc submit:
--properties spark.dynamicAllocation.enabled=false,spark.executor.memory=5g,spark.executor.instances=3
end up in the following assignment:
Is there a way to configure YARN so that it accepts both?
EDITED to specify queue setup
You may try setting a higher value, such as 2, for yarn.scheduler.capacity.root.online.user-limit-factor in place of the current value of 1. This setting lets a single user consume up to twice the queue's configured capacity, and your maximum-capacity setting of 100% leaves room for that.
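A minimal sketch of that change at cluster creation time (the cluster name is a placeholder; the property would be combined with the other capacity-scheduler properties in the same --properties list):
gcloud dataproc clusters create my-cluster \
  --properties='capacity-scheduler:yarn.scheduler.capacity.root.online.user-limit-factor=2'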

Spark shows different number of cores than what is passed to it using spark-submit

TL;DR
The Spark UI shows a different number of cores and a different amount of memory than what I ask for when using spark-submit.
More details:
I'm running Spark 1.6 in standalone mode.
When I run spark-submit I pass it 1 executor instance with 1 core for the executor and also 1 core for the driver.
What I would expect to happen is that my application would run with 2 cores in total.
When I check the Environment tab in the UI, I see that it received the correct parameters I gave it; however, it still seems to be using a different number of cores. You can see it here:
This is my spark-defaults.conf that I'm using:
spark.executor.memory 5g
spark.executor.cores 1
spark.executor.instances 1
spark.driver.cores 1
Checking the Environment tab in the Spark UI shows that these are indeed the received parameters, but the UI still shows something else.
Does anyone have any idea what might cause Spark to use a different number of cores than what I pass it? I obviously tried googling it but didn't find anything useful on the topic.
Thanks in advance
TL;DR
Use spark.cores.max instead to define the total number of cores available, and thus limit the number of executors.
In standalone mode, a greedy strategy is used: as many executors will be created as the cores and memory available on your worker allow.
In your case, you specified 1 core and 5GB of memory per executor.
The following will be calculated by Spark :
As there are 8 cores available, it will try to create 8 executors.
However, as there is only 30GB of memory available, it can only create 6 executors : each executor will have 5GB of memory, which adds up to 30GB.
Therefore, 6 executors will be created, and a total of 6 cores will be used with 30GB of memory.
Spark basically fulfilled what you asked for. In order to achieve what you want, you can make use of the spark.cores.max option documented here and specify the exact number of cores you need.
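A minimal sketch of that change in spark-defaults.conf, assuming you still want a single 1-core executor plus the 1-core driver:
spark.executor.memory 5g
spark.executor.cores  1
spark.cores.max       1
spark.driver.cores    1
With spark.cores.max=1 the standalone master grants the application at most one executor core in total, so only one 1-core executor is launched; spark.executor.instances is dropped because, as noted below, it is not used in standalone mode.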
A few side notes:
spark.executor.instances is a YARN-only configuration
spark.driver.cores already defaults to 1
I am also working on making the number of executors in standalone mode easier to reason about; this might get integrated into a future release of Spark and will hopefully help you figure out exactly how many executors you are going to have, without having to calculate it on the fly.

How to control how many executors to run in yarn-client mode?

I have a Hadoop cluster of 5 nodes where Spark runs in yarn-client mode.
I use --num-executors for the number of executors. The maximum number of executors I am able to get is 20. Even if I specify more, I get only 20 executors.
Is there any upper limit on the number of executors that can get allocated ? Is it a configuration or the decision is made on the basis of the resources available ?
Apparently your 20 running executors consume all the available memory. You can try decreasing the executor memory with the spark.executor.memory parameter, which should leave a bit more room for additional executors to spawn.
Also, are you sure that you set the number of executors correctly? You can verify your environment settings in the Spark UI by looking at the spark.executor.instances value in the Environment tab.
EDIT: As Mateusz Dymczyk pointed out in the comments, the limited number of executors may be caused not only by memory pressure but also by CPU cores. In both cases the limit comes from the resource manager.
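For example (a sketch with assumed numbers, not your actual configuration): if each of your 5 nodes gives YARN roughly 24GB and every executor requests 5GB plus overhead, about 4 executors fit per node, i.e. 20 in total; lowering the per-executor request leaves room for more:
spark-submit \
  --master yarn \
  --deploy-mode client \
  --num-executors 30 \
  --executor-memory 3g \
  --executor-cores 2 \
  --class com.example.MyApp myapp.jar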
