Spark Resource Allocation: Number of Cores - apache-spark

I need to understand how to configure cores for a Spark job.
My machine has a maximum of 11 cores and 28 GB of memory.
Below is how I'm allocating resources for my Spark job, and its execution time is 4.9 minutes:
--driver-memory 2g \
--executor-memory 24g \
--executor-cores 10 \
--num-executors 6
But I have come across multiple articles saying the number of cores should be ~5. When I ran the job with that configuration, execution time increased to 6.9 minutes:
--driver-memory 2g \
--executor-memory 24g \
--executor-cores 5 \
--num-executors 6
Will there be any issue with keeping the number of cores close to the maximum (10 in my case)?
Are there any benefits to keeping the number of cores at 5, as suggested in many articles?
In general, what factors should be considered when determining the number of cores?

It all depends on the behaviour of the job; one configuration does not optimise all workloads.
--executor-cores is the number of cores given to each executor, and hence the number of tasks it runs concurrently.
If that number is too big (>5), then the machine's disk and network, which are shared among all executor cores on that machine, become a bottleneck. If it is too small (~1), you don't get good data parallelism and you lose the benefit of data locality on the same machine.
TL;DR: --executor-cores 5 is fine.
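For illustration only, here is how that rule of thumb could be worked out for the 11-core / 28 GB machine in the question. The 1-core / 1-GB reservation for the OS and the ~7% memory-overhead allowance are assumptions borrowed from the sizing walkthrough further below, not values from this answer:

TOTAL_CORES, TOTAL_MEM_GB = 11, 28                 # the machine in the question
usable_cores = TOTAL_CORES - 1                     # leave ~1 core for OS/daemons (assumption)
usable_mem_gb = TOTAL_MEM_GB - 1                   # leave ~1 GB for OS/daemons (assumption)
executor_cores = 5                                 # rule of thumb from this answer
num_executors = usable_cores // executor_cores     # 10 // 5 = 2
mem_per_executor = usable_mem_gb / num_executors   # 27 / 2 = 13.5 GB
executor_memory_gb = int(mem_per_executor * 0.93)  # minus ~7% YARN overhead -> 12 GB
print(f"--executor-cores {executor_cores} --num-executors {num_executors} "
      f"--executor-memory {executor_memory_gb}g")
# prints: --executor-cores 5 --num-executors 2 --executor-memory 12g

Note also that six 24 GB executors cannot coexist on a 28 GB machine, so a resource manager would likely run only a subset of the requested executors at any one time.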

Related

spark-submit criteria to set parameter values

I am so confused about the right criteria to use when it comes to setting the following spark-submit parameters, for example:
spark-submit --deploy-mode cluster --name 'CoreLogic Transactions Curated ${var_date}' \
--driver-memory 4G --executor-memory 4G --num-executors 10 --executor-cores 4 \
/etl/scripts/corelogic/transactions/corelogic_transactions_curated.py \
--from_date ${var_date} \
--to_date ${var_to_date}
One person is telling me that I am using a lot of executors and cores, but he is not explaining why he said that.
Can someone explain the right criteria to use when setting these parameters (--driver-memory 4G --executor-memory 4G --num-executors 10 --executor-cores 4) according to my dataset?
The same applies in the following case:
spark = SparkSession.builder \
.appName('DemoEcon PEP hist stage') \
.config('spark.sql.shuffle.partitions', args.shuffle_partitions) \
.enableHiveSupport() \
.getOrCreate()
I am not quite sure what criteria should be used to set the parameter spark.sql.shuffle.partitions.
Can someone help me get this clear in my mind?
Thank you in advance.
This website has the answer I needed: an excellent explanation with some examples.
http://site.clairvoyantsoft.com/understanding-resource-allocation-configurations-spark-application/
Here is one of those examples:
Case 1 Hardware: 6 nodes, each node with 16 cores and 64 GB RAM.
First, on each node, 1 core and 1 GB are needed for the operating system and Hadoop daemons, so we have 15 cores and 63 GB RAM on each node.
We start with how to choose the number of cores:
Number of cores = concurrent tasks an executor can run
So we might think that more concurrent tasks per executor will give better performance. But research shows that any application with more than 5 concurrent tasks per executor leads to poor performance. So the optimal value is 5.
This number comes from an executor's ability to run parallel tasks, not from how many cores the system has. So the number 5 stays the same even if we have double the cores (32) in the CPU.
Number of executors:
Coming to the next step: with 5 cores per executor and 15 available cores per node, we get 3 executors per node (15 / 5). We calculate the number of executors on each node and then multiply to get the total for the job.
So with 6 nodes and 3 executors per node, we get a total of 18 executors. Of those 18, we need 1 executor (Java process) for the YARN Application Master, so the final number is 17 executors.
This 17 is the number we give to Spark using --num-executors when running from the spark-submit shell command.
Memory for each executor:
From the step above, we have 3 executors per node, and the available RAM on each node is 63 GB.
So the memory for each executor on each node is 63 / 3 = 21 GB.
However, a small amount of overhead memory is also needed to determine the full memory request to YARN for each executor.
The formula for that overhead is max(384 MB, 0.07 * spark.executor.memory).
Calculating that overhead: 0.07 * 21 GB (21 being the 63 / 3 from above) = 1.47 GB.
Since 1.47 GB > 384 MB, the overhead is 1.47 GB.
Subtracting that from each 21 GB: 21 - 1.47 ≈ 19 GB.
So executor memory is ~19 GB.
Final numbers: 17 executors, 5 cores each, 19 GB executor memory.
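The arithmetic above can be collected into a small helper. This is only a sketch of the same formula, assuming a YARN deployment, a 1-core / 1-GB OS reservation per node, 5 cores per executor, and the max(384 MB, 7%) overhead rule; the function name is made up for illustration:

def size_executors(nodes, cores_per_node, mem_gb_per_node,
                   executor_cores=5, os_cores=1, os_mem_gb=1):
    # Sketch of the walkthrough above: reserve resources for the OS,
    # pack 5-core executors onto each node, subtract one executor for
    # the YARN Application Master, then carve off the memory overhead.
    usable_cores = cores_per_node - os_cores              # 16 - 1 = 15
    usable_mem_gb = mem_gb_per_node - os_mem_gb           # 64 - 1 = 63
    executors_per_node = usable_cores // executor_cores   # 15 // 5 = 3
    num_executors = nodes * executors_per_node - 1        # 18 - 1 = 17
    mem_per_executor = usable_mem_gb / executors_per_node       # 63 / 3 = 21
    overhead_gb = max(0.384, 0.07 * mem_per_executor)           # max(384 MB, 7%)
    executor_memory_gb = int(mem_per_executor - overhead_gb)    # ~19
    return num_executors, executor_cores, executor_memory_gb

print(size_executors(nodes=6, cores_per_node=16, mem_gb_per_node=64))
# (17, 5, 19)  ->  --num-executors 17 --executor-cores 5 --executor-memory 19g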

how to decide no. of cores and executor in aws

I have a data size of 5 TB. If I decide to use r4.8xlarge EC2 machines, which have 244 GB of memory and 32 vCPUs each,
how can I decide the number of cores and executors to use?
I tried a few combinations of cores and executors, but the Spark job fails with a heap-space issue. I have listed the configuration below, excluding other parameters:
--master yarn
--conf spark.yarn.executor.memoryOverhead=4000
--driver-memory 25G
--executor-memory 240G
--executor-cores 26
--num-executors 13
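Purely as an illustration, the per-node walkthrough from the answer above could be applied to an r4.8xlarge (32 vCPUs, 244 GB); the OS reservation and overhead figures are the same assumptions as before. Requesting --executor-memory 240G on a 244 GB node leaves essentially no headroom for the OS or the YARN memory overhead, so more modest per-executor sizes are usually a better starting point:

cores_per_node, mem_gb_per_node = 32, 244          # r4.8xlarge
usable_cores = cores_per_node - 1                  # reserve 1 core for the OS (assumption)
usable_mem_gb = mem_gb_per_node - 1                # reserve 1 GB for the OS (assumption)
executor_cores = 5
executors_per_node = usable_cores // executor_cores        # 31 // 5 = 6
mem_per_executor = usable_mem_gb / executors_per_node      # ~40.5 GB
executor_memory_gb = int(mem_per_executor * 0.93)          # ~7% overhead -> ~37 GB
print(executors_per_node, executor_cores, executor_memory_gb)
# 6 executors per node, 5 cores each, ~37 GB heap each; multiply by the
# number of nodes and subtract one executor for the Application Master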

Spark-submit executor memory issue

I have a 10-node cluster: 8 DataNodes (256 GB, 48 cores) and 2 NameNodes. I have a Spark SQL job being submitted to the YARN cluster. Below are the parameters I have used for spark-submit.
--num-executors 8 \
--executor-cores 50 \
--driver-memory 20G \
--executor-memory 60G
As can be seen above, executor-memory is 60 GB, but when I check the Spark UI it shows 31 GB.
1) Can anyone explain why it shows 31 GB instead of 60 GB?
2) Please also help with setting optimal values for the parameters mentioned above.
I think the memory allocated gets divided into two parts:
1. Storage (caching DataFrames/tables)
2. Processing (the one you can see)
31 GB is the memory available for processing.
Play around with the spark.memory.fraction property to increase/decrease the memory available for processing.
I would suggest reducing the executor cores to about 8-10.
My configuration:
spark-shell --executor-memory 40g --executor-cores 8 --num-executors 100 --conf spark.memory.fraction=0.2
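For reference, a rough sketch of how the reported number comes about. In Spark 2.x the UI's "Storage Memory" column shows the unified storage-plus-execution region, roughly (JVM heap - 300 MB reserved) * spark.memory.fraction (default 0.6), and the heap the JVM reports is itself a bit smaller than -Xmx, which is why the figure sits well below the 60 GB requested. The numbers below are illustrative, not an exact reconstruction of the 31 GB:

RESERVED_MB = 300            # reserved system memory in the unified memory manager
MEMORY_FRACTION = 0.6        # spark.memory.fraction default in Spark 2.x

def unified_region_gb(jvm_heap_gb):
    # portion of the heap that the UI reports as "Storage Memory"
    usable_mb = jvm_heap_gb * 1024 - RESERVED_MB
    return usable_mb * MEMORY_FRACTION / 1024

print(round(unified_region_gb(60), 1))   # ~35.8 GB if the JVM saw the full 60 GB;
                                         # the reported heap is smaller, so the UI shows less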

Spark tuning job

I have a problem with tuning Spark jobs executed on a YARN cluster. I have a feeling that I'm not getting the most out of my cluster, and additionally my jobs fail (executors get removed all the time).
I have the following setup:
4 machines
each machine has 10GB of RAM
each machine has 8 cores
8 GB of RAM is allocated for YARN jobs
14 (of 16) virtual cores are allocated for YARN jobs
I have run my Spark job (actually connected to a Jupyter notebook) using different setups, e.g.
pyspark --master yarn --num-executors 7 --executor-cores 4 --executor-memory 3G
pyspark --master yarn --num-executors 7 --executor-cores 7 --executor-memory 2G
pyspark --master yarn --num-executors 11 --executor-cores 4 --executor-memory 1G
I've tried different combinations and none of them seems to work, as my executors get destroyed. Additionally, I've read somewhere that it is good to increase spark.yarn.executor.memoryOverhead to 600 MB so as not to lose executors (and I did that), but it doesn't seem to help. How should I set up my job?
Additionally, it confuses me that when I look at the ResourceManager UI, it says for my job: VCores Used 8, VCores Total 56. It seems that I'm using a single core per executor, but I don't understand why.
One more thing: when I set up my job, how many partitions should I specify when reading data from HDFS to get maximal performance?
Donald Knuth said premature optimisation is the root of all evil. A faster-running program that fails is of no use. Start by giving all the memory to one executor, say 7 GB of the 8 GB, and just 1 core. This is a complete waste of cores, but if it works, it proves your application can run on this hardware. If even this doesn't work, you should try getting bigger machines. Assuming it works, keep increasing the number of cores for as long as it keeps working.
The gist of the argument is: your application requires a certain amount of memory per task, but the number of tasks running per executor depends on the number of cores. First find the worst-case memory per core for your application, then set executor memory and cores to some multiple of that number.
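As a sketch only, that procedure could look like this once you have measured (or guessed) a worst-case memory-per-task figure; every number and name below is a placeholder, not something from the question:

yarn_mem_per_node_gb = 8          # RAM YARN can hand out per node (placeholder)
mem_per_task_gb = 2.5             # measured worst-case memory one task needs (placeholder)
overhead_fraction = 0.10          # headroom kept for memoryOverhead (assumption)

max_tasks_per_node = int(yarn_mem_per_node_gb * (1 - overhead_fraction)
                         // mem_per_task_gb)              # -> 2
executor_cores = min(5, max(1, max_tasks_per_node))       # cores == concurrent tasks
executor_memory_gb = int(executor_cores * mem_per_task_gb)  # heap as a multiple of per-task need

print(executor_cores, executor_memory_gb)   # e.g. 2 cores, 5 GB heap per executor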

Using all resources in Apache Spark with Yarn

I am using Apache Spark with the YARN client.
I have 4 worker PCs, each with 8 vCPUs and 30 GB of RAM, in my Spark cluster.
I set my executor memory to 2G and the number of instances to 33.
My job takes 10 hours to run and all machines are about 80% idle.
I don't understand the correlation between executor memory and executor instances. Should I have an instance per vCPU? Should I set the executor memory to the machine's memory divided by the number of executors per machine?
I believe that you have to use the following command:
spark-submit --num-executors 4 --executor-memory 7G --driver-memory 2G --executor-cores 8 --class "YourClassName" --master yarn-client
The number of executors should be 4, since you have 4 workers. The executor memory should be close to the maximum memory that each YARN node has allocated, roughly ~5-6 GB (I assume you have 30 GB total RAM).
You should take a look at the spark-submit parameters and fully understand them.
We were using Cassandra as our data source for Spark. The problem was there were not enough partitions; we needed to split up the data more. Our mapping of Cassandra partitions to Spark partitions was not fine-grained enough, and we would generate only 10 or 20 tasks instead of hundreds of tasks.
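If you see the same symptom (only a handful of tasks after reading from the source), one generic remedy is to repartition right after the read so downstream stages get more parallelism. A minimal PySpark sketch, with made-up paths and partition counts:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repartition-example").getOrCreate()

df = spark.read.parquet("hdfs:///data/events")   # placeholder source that yields few partitions
print(df.rdd.getNumPartitions())                 # e.g. only 10-20 tasks' worth
df = df.repartition(400)                         # spread the work over many more tasks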
