spark-submit criteria to set parameter values - apache-spark

I am so confused about the right criteria to use when it comes to setting the following spark-submit parameters, for example:
spark-submit --deploy-mode cluster --name 'CoreLogic Transactions Curated ${var_date}' \
--driver-memory 4G --executor-memory 4G --num-executors 10 --executor-cores 4 \
/etl/scripts/corelogic/transactions/corelogic_transactions_curated.py \
--from_date ${var_date} \
--to_date ${var_to_date}
One person is telling me that I am using a lot of executors and cores but he is not explaining why he said that.
Can someone explain to me the right criteria to use when it comes to setting these parameters (--driver-memory 4G --executor-memory 4G --num-executors 10 --executor-cores 4) according to my dataset?
The same in the following case
spark = SparkSession.builder \
    .appName('DemoEcon PEP hist stage') \
    .config('spark.sql.shuffle.partitions', args.shuffle_partitions) \
    .enableHiveSupport() \
    .getOrCreate()
I am not sure what criteria to use when setting the parameter "spark.sql.shuffle.partitions".
Can someone help me get this clear in my mind?
Thank you in advance

This website has the answer I needed: an excellent explanation with some examples.
http://site.clairvoyantsoft.com/understanding-resource-allocation-configurations-spark-application/
Here is one of those examples:
Case 1 Hardware – 6 nodes, each with 16 cores and 64 GB RAM
First, on each node, 1 core and 1 GB are needed for the operating system and Hadoop daemons, so we have 15 cores and 63 GB RAM available per node.
We start with how to choose number of cores:
Number of cores = Concurrent tasks an executor can run
So we might think that more concurrent tasks per executor will give better performance. But the commonly cited guidance is that running more than 5 concurrent tasks per executor hurts performance, so 5 is a good value.
This number comes from the ability of an executor to run parallel tasks, not from how many cores a system has. So the number 5 stays the same even if we have double the cores (32) in the CPU.
Number of executors:
Coming to the next step, with 5 as cores per executor, and 15 as total available cores in one node (CPU) – we come to 3 executors per node which is 15/5. We need to calculate the number of executors on each node and then get the total number for the job.
So with 6 nodes, and 3 executors per node – we get a total of 18 executors. Out of 18 we need 1 executor (java process) for Application Master in YARN. So final number is 17 executors
This 17 is the number we give to Spark using --num-executors when running the spark-submit shell command.
Memory for each executor:
From above step, we have 3 executors per node. And available RAM on each node is 63 GB
So memory for each executor in each node is 63/3 = 21GB.
However, a small memory overhead must also be counted to determine the full memory request to YARN for each executor.
The formula for that overhead is max(384 MB, 0.07 * spark.executor.memory). (Newer Spark releases document a default factor of 0.10 instead of 0.07.)
Calculating that overhead: .07 * 21 (Here 21 is calculated as above 63/3) = 1.47
Since 1.47 GB > 384 MB, the overhead is 1.47
Take the above from each 21 above => 21 – 1.47 ~ 19 GB
So executor memory – 19 GB
Final numbers – Executors – 17, Cores 5, Executor Memory – 19 GB
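The arithmetic in this example can be written down as a small function. This is only a sketch of the rule of thumb quoted above: the function name and parameters are made up for illustration, and the defaults (5 cores per executor, 0.07 overhead factor, 1 core and 1 GB reserved per node, 1 executor slot given up for the Application Master) are the assumptions of the example, not fixed Spark behaviour.

```python
def size_executors(nodes, cores_per_node, ram_gb_per_node,
                   cores_per_executor=5, overhead_fraction=0.07,
                   reserved_cores=1, reserved_ram_gb=1):
    """Return (total_executors, cores_per_executor, executor_memory_gb)."""
    # Leave 1 core and 1 GB per node for the OS and Hadoop daemons.
    usable_cores = cores_per_node - reserved_cores
    usable_ram_gb = ram_gb_per_node - reserved_ram_gb

    executors_per_node = usable_cores // cores_per_executor
    # One executor slot is given up for the YARN Application Master.
    total_executors = nodes * executors_per_node - 1

    # Each executor's share of node memory, minus the off-heap overhead.
    share_gb = usable_ram_gb / executors_per_node
    overhead_gb = max(384 / 1024, overhead_fraction * share_gb)
    executor_memory_gb = int(share_gb - overhead_gb)  # round down

    return total_executors, cores_per_executor, executor_memory_gb


print(size_executors(nodes=6, cores_per_node=16, ram_gb_per_node=64))  # (17, 5, 19)
```

The same function can be called with other hardware figures, but the result is a starting point for tuning rather than a guaranteed optimum.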

Related

yarn allocate containers for Spark in AWS

I was able to create YARN containers for my spark jobs.
I have come across various blogs and YouTube videos on using --executor-cores efficiently (values from 4 to 6 for efficient throughput) and --executor-memory, after reserving 1 CPU core and 1 GB RAM for Hadoop daemons, and determined the right values for each executor.
I also came across articles like these.
I am checking how many containers are created by YARN from the Spark shell, and I am not able to understand how the containers are allocated.
For example, I have created an EMR cluster with 1 master node m5.xlarge (4 vCores, 16 GiB) and 1 core node with instance type c5.2xlarge (8 vCores, 16 GiB RAM).
When I create the Spark shell with the following command: spark-shell --num-executors=6 --executor-cores=5 --conf spark.executor.memoryOverhead=1G --executor-memory 1G --driver-memory 1G
I see that 6 executors, including a driver, are created with 5 cores for each executor, for a total of 25 cores.
However, the metrics from the Hadoop history server do not reflect the right calculations.
I am very confused about how, in the Spark UI, more cores than are available were allocated to the executors. The total number of vCores in the cluster is 8, considering the core node, but a total of 25 cores are allocated to the executors.
Can someone please explain what I am missing?

Spark Resource Allocation: Number of Cores

I need to understand how to configure cores for a Spark job.
My machine has a maximum of 11 cores and 28 GB of memory.
Below is how I'm allocating resources for my Spark job; its execution time is 4.9 minutes:
--driver-memory 2g \
--executor-memory 24g \
--executor-cores 10 \
--num-executors 6
But I read multiple articles saying the number of cores should be ~5. When I ran the job with this configuration, its execution time increased to 6.9 minutes:
--driver-memory 2g \
--executor-memory 24g \
--executor-cores 5 \
--num-executors 6
Will there be any issue keeping the number of cores close to the maximum value (10 in my case)?
Are there any benefits to keeping the number of cores at 5, as suggested in many articles?
So, in general, what factors should be considered when determining the number of cores?
It all depends on the behaviour of the job; no single config optimises all needs.
--executor-cores is the number of cores given to each executor, i.e. how many tasks one executor can run concurrently.
If that number is too big (>5), the machine's disk and network (which are shared among all of the executor's cores on that machine) become a bottleneck. If that number is too small (~1), the job will not achieve good data parallelism and won't benefit from locality of data on the same machine.
TLDR: --executor-cores 5 is fine.
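Before comparing timings, it may help to check whether the request fits on the machine at all. The sketch below only multiplies out the flags from the question; the helper name check_request is made up for illustration, and it does not model how any particular cluster manager grants or trims the request.

```python
def check_request(machine_cores, machine_ram_gb, num_executors,
                  executor_cores, executor_memory_gb, driver_memory_gb=0):
    """Compare what the flags ask for with what one machine has (hypothetical helper)."""
    requested_cores = num_executors * executor_cores
    requested_ram_gb = num_executors * executor_memory_gb + driver_memory_gb
    return {
        "requested_cores": requested_cores,
        "requested_ram_gb": requested_ram_gb,
        "cores_oversubscribed": requested_cores > machine_cores,
        "ram_oversubscribed": requested_ram_gb > machine_ram_gb,
    }


# Both configurations from the question, on an 11-core / 28 GB machine.
print(check_request(11, 28, num_executors=6, executor_cores=10,
                    executor_memory_gb=24, driver_memory_gb=2))
print(check_request(11, 28, num_executors=6, executor_cores=5,
                    executor_memory_gb=24, driver_memory_gb=2))
```

Both runs ask for far more than 11 cores and 28 GB, so the number of executors actually running (visible in the Executors tab of the Spark UI) is probably smaller than 6 in both cases, and the two timings may not be comparing what they appear to compare.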

Optimize Spark and Yarn configuration

We have a cluster of 4 nodes with the characteristics above :
Our Spark jobs take a long time to process. How could we optimize this time, given that our jobs run from RStudio and we still have a lot of unused memory?
To add more context to the answer above, I would like to explain how to set the parameters --num-executors, --executor-memory and --executor-cores appropriately.
The following answer covers the 3 main aspects mentioned in the title: number of executors, executor memory and number of cores.
There may be other parameters, such as driver memory, which I do not address in this answer.
Case 1 Hardware - 6 nodes, each with 16 cores and 64 GB RAM
Each executor is a JVM instance, so we can have multiple executors on a single node.
First, 1 core and 1 GB are needed for the OS and Hadoop daemons, so 15 cores and 63 GB RAM are available on each node.
Let us go through how to choose these parameters one by one.
Number of cores:
Number of cores = Concurrent tasks an executor can run
So we might think that more concurrent tasks per executor will give better performance.
But the commonly cited guidance is that running more than 5 concurrent tasks per executor hurts performance. So stick to 5.
This number comes from the ability of the executor, not from how many cores a system has. So the number 5 stays the same even if you have double the cores (32) in the CPU.
Number of executors:
Coming to the next step, with 5 cores per executor and 15 total available cores in one node (CPU), we come to 3 executors per node.
So with 6 nodes and 3 executors per node, we get 18 executors. Out of 18 we need 1 executor (Java process) for the AM in YARN, so we get 17 executors.
This 17 is the number we give to Spark using --num-executors when running the spark-submit shell command.
Memory for each executor:
From the step above, we have 3 executors per node, and the available RAM per node is 63 GB.
So memory for each executor is 63/3 = 21 GB.
However, a small memory overhead must also be counted to determine the full memory request to YARN for each executor.
The formula for that overhead is max(384 MB, 0.07 * spark.executor.memory). (Newer Spark releases document a default factor of 0.10 instead of 0.07.)
Calculating that overhead: 0.07 * 21 (here 21 is calculated as above, 63/3) = 1.47
Since 1.47 GB > 384 MB, the overhead is 1.47 GB.
Take the above from each 21 above => 21 - 1.47 ~ 19 GB
So executor memory - 19 GB
Final numbers - Executors - 17 in total, Cores - 5 per executor, Executor Memory - 19 GB
Assigning resources properly to the Spark jobs in the cluster this way should speed up the jobs by using the available resources efficiently.
I recommend having a look at these parameters:
--num-executors : controls how many executors will be allocated
--executor-memory : RAM for each executor
--executor-cores : cores for each executor
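As an illustration of how these three flags fit together, here is a small sketch that assembles a spark-submit command line from the Case 1 numbers (17 executors, 5 cores, 19 GB). The helper name and the application file my_job.py are hypothetical; only the three flags themselves come from the answer above.

```python
import shlex


def spark_submit_command(app, num_executors, executor_cores, executor_memory_gb):
    """Build a spark-submit command string from the three sizing parameters."""
    args = [
        "spark-submit",
        "--num-executors", str(num_executors),          # total executors for the application
        "--executor-cores", str(executor_cores),        # concurrent tasks per executor
        "--executor-memory", f"{executor_memory_gb}G",  # heap per executor
        app,                                            # hypothetical application file
    ]
    return shlex.join(args)


print(spark_submit_command("my_job.py", 17, 5, 19))
```

Building the command from numbers rather than typing it by hand makes it easier to rerun the same job with a few different sizings and compare execution times.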

Spark num-executors

I have setup a 10 node HDP platform on AWS. Below is my configuration
2 Servers - Name Node and Standby Name node
7 Data Nodes and each node has 40 vCPU and 160 GB of memory.
I am trying to calculate the number of executors while submitting spark applications and after going through different blogs I am confused on what this parameter actually means.
Looking at the below blog it seems the num executors are the total number of executors across all nodes
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
But looking at the below blog it seems that the num executors is per node or server
https://blogs.aws.amazon.com/bigdata/post/Tx578UTQUV7LRP/Submitting-User-Applications-with-spark-submit
Can anyone please clarify and review the below :-
Is the num-executors value per node, or is it the total number of executors across all the data nodes?
I am using the below calculation to come up with the core count, executor count and memory per executor
Number of cores <= 5 (assuming 5)
Num executors = (40-1)/5 = 7
Memory = (160-1)/7 = 22 GB
With the above calculation which would be the correct way
--master yarn-client --driver-memory 10G --executor-memory 22G --num-executors 7 --executor-cores 5
OR
--master yarn-client --driver-memory 10G --executor-memory 22G --num-executors 49 --executor-cores 5
Thanks,
Jayadeep
Can anyone please clarify and review the below :-
Is the num-executors value per node, or is it the total number of executors across all the data nodes?
You need to first understand that the executors run on the NodeManagers (You can think of this like workers in Spark standalone). A number of Containers (includes vCPU, memory, network, disk, etc.) equal to number of executors specified will be allocated for your Spark application on YARN. Now these executor containers will be run on multiple NodeManagers and that depends on the CapacityScheduler (default scheduler in HDP).
So to sum up, total number of executors is the number of resource containers you specify for your application to run.
Refer to this blog to understand this better.
I am using the below calculation to come up with the core count, executor count and memory per executor
Number of cores <= 5 (assuming 5) Num executors = (40-1)/5 = 7 Memory = (160-1)/7 = 22 GB
There is no rigid formula for calculating the number of executors. Instead you can try enabling Dynamic Allocation in YARN for your application.
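For reference, dynamic allocation is switched on through Spark configuration properties. The sketch below lists the properties I would expect to set on YARN and turns them into --conf flags; the min/max values are illustrative only, and whether the external shuffle service is required depends on your Spark version and cluster setup, so treat this as an assumption to verify rather than a recipe.

```python
# Property names are standard Spark settings; the values are illustrative assumptions.
DYNAMIC_ALLOCATION_CONF = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.shuffle.service.enabled": "true",       # external shuffle service on the NodeManagers
    "spark.dynamicAllocation.minExecutors": "1",   # illustrative lower bound
    "spark.dynamicAllocation.maxExecutors": "49",  # illustrative upper bound
}


def to_conf_flags(conf):
    """Turn a dict of Spark properties into a flat list of --conf arguments."""
    flags = []
    for key, value in conf.items():
        flags += ["--conf", f"{key}={value}"]
    return flags


print(" ".join(to_conf_flags(DYNAMIC_ALLOCATION_CONF)))
```

With dynamic allocation enabled, Spark requests and releases executors based on the workload, so a fixed --num-executors value is typically left out.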
There is a hiccup with the capacity scheduler. As far as I understand, it only allows you to schedule by memory. You will first need to change that to the dominant resource calculator scheduling type. That will allow you to request a combination of memory and cores. Once you change that, you should be able to ask for both CPU and memory with your Spark application.
As for --num-executors flag, you can even keep it at a very high value of 1000. It will still allocate only the number of containers that is possible to launch on each node. As and when your cluster resources increase your containers attached to your application will increase. The number of containers that you can launch per node will be limited by the amount of resources allocated to the nodemanagers on those nodes.
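Since --num-executors counts containers for the whole application rather than per node, the per-node figure from the question has to be multiplied by the number of data nodes. A minimal sketch of that arithmetic, using the question's own rule (reserve 1 vCPU and 1 GB per node, 5 cores per executor); the function name is made up, and it ignores the Application Master container and the per-executor memory overhead.

```python
def cluster_wide_request(data_nodes, vcpus_per_node, ram_gb_per_node,
                         cores_per_executor=5):
    """Return (executors_per_node, total_executors, memory_per_executor_gb)."""
    # Reserve 1 vCPU and 1 GB per node, as in the question.
    executors_per_node = (vcpus_per_node - 1) // cores_per_executor
    memory_per_executor_gb = (ram_gb_per_node - 1) // executors_per_node
    # --num-executors is a total across all NodeManagers, not a per-node count.
    total_executors = data_nodes * executors_per_node
    return executors_per_node, total_executors, memory_per_executor_gb


print(cluster_wide_request(data_nodes=7, vcpus_per_node=40, ram_gb_per_node=160))  # (7, 49, 22)
```

Under these assumptions the second command in the question (--num-executors 49) is the one that matches the intent of 7 executors on each of the 7 data nodes, although the 22 GB figure is the full per-executor share of node memory and would need to be lowered to leave room for overhead.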

Spark increasing the number of executors in yarn mode

I am running Spark on YARN on a 4-node cluster. Each node has 128 GB of memory and a 24-core CPU. I run Spark using this command:
spark-shell --master yarn --num-executors 19 --executor-memory 18g --executor-cores 4 --driver-memory 4g
But Spark only launches a maximum of 16 executors. I have the maximum vcore allocation in YARN set to 80 (out of the 94 cores I have). So I was under the impression that this would launch 19 executors, but it only goes up to 16 executors. Also, I don't think even these executors are using the allocated vcores completely.
These are my questions:
Why isn't Spark creating 19 executors? Is there a computation behind the scenes that's limiting it?
What is the optimal configuration to run spark-shell given my cluster configuration, if I want to get the best possible Spark performance?
--driver-cores is set to 1 by default. Will increasing it improve performance?
Here is my Yarn Config
yarn.nodemanager.resource.memory-mb: 106496
yarn.scheduler.minimum-allocation-mb: 3584
yarn.scheduler.maximum-allocation-mb: 106496
yarn.scheduler.minimum-allocation-vcores: 1
yarn.scheduler.maximum-allocation-vcores: 20
yarn.nodemanager.resource.cpu-vcores: 20
Ok so going by your configurations we have:
(I am also a newbie at Spark but below is what I speculate in this scenario)
24 cores and 128GB ram per node and we have 4 nodes in the cluster.
We reserve 1 core and 1 GB of memory per node for overhead, assuming you're running in YARN client mode.
That leaves 127 GB RAM and 23 cores on each of the 4 nodes.
As mentioned in the Cloudera blog, YARN runs at optimal performance when at most 5 cores are allocated per executor.
So, 23 x 4 = 92 cores.
If we allocate 5 cores per executor, then 18 executors have 5 cores each and 1 executor has the remaining 2 cores, or similar.
So let's assume we have 18 executors in our application and 5 cores per executor.
Spark distributes these 18 executors across the 4 nodes. Suppose they are distributed as:
1st node : 4 executors
2nd node : 4 executors
3rd node : 5 executors
4th node : 5 executors
Now, as 'yarn.nodemanager.resource.memory-mb: 106496' is set as 104GB in your configurations, each node can have max 104 GB memory allocated (I would suggest increasing this parameter).
For nodes with 4 executors: 104/4 = 26 GB per executor
For nodes with 5 executors: 104/5 ~ 21 GB per executor
Now, leaving out 7% of memory for overhead, we get about 24 GB and 20 GB.
So I would suggest using the following configuration:
--num-executors : 18
--executor-memory : 20G
--executor-cores : 5
Also, this assumes you're running in client mode. If you run in yarn-cluster mode, one node will also host the driver program, and the calculation will need to be done differently.
I still cannot comment, so it will be as an answer.
See this question. Could you please decrease executor memory and try run this again?
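To illustrate why executor memory is a likely culprit, here is a rough sketch of the container arithmetic using the YARN values from the question. It assumes that YARN rounds each container request up to a multiple of the minimum allocation, and that the executor overhead is max(384 MB, 10% of executor memory); both depend on scheduler settings and Spark version, and the small Application Master container is ignored. The function name is made up for illustration.

```python
import math


def executors_per_node_by_memory(executor_memory_mb, node_memory_mb,
                                 min_allocation_mb, overhead_fraction=0.10,
                                 min_overhead_mb=384):
    """Return (container_mb, executors_that_fit_on_one_node)."""
    overhead_mb = max(min_overhead_mb, int(overhead_fraction * executor_memory_mb))
    requested_mb = executor_memory_mb + overhead_mb
    # Assumption: YARN rounds the request up to a multiple of the minimum allocation.
    container_mb = math.ceil(requested_mb / min_allocation_mb) * min_allocation_mb
    return container_mb, node_memory_mb // container_mb


container_mb, per_node = executors_per_node_by_memory(
    executor_memory_mb=18 * 1024,   # --executor-memory 18g
    node_memory_mb=106496,          # yarn.nodemanager.resource.memory-mb
    min_allocation_mb=3584,         # minimum-allocation-mb from the question
)
print(container_mb, per_node, per_node * 4)  # 21504 4 16
```

Under these assumptions each 18g executor occupies a 21504 MB container, only 4 of which fit in the 106496 MB available per node, giving 16 executors on 4 nodes. That matches the observed limit, which is why trying a smaller --executor-memory is a reasonable next step.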
