I'm fairly new to Spark in cluster-mode and want to know what is wrong with my current configuration since resource utilization seems extremely low.
I am using Dataproc on GCP. I am comparing two cases of reading 10+ GB of data, split into 100K+ JSON files, into a Spark dataframe.
Case 1: 4 vCPUs, 16 GB memory for the master and for each of 10 worker nodes. 10 executors with 3 vCPUs and 11 GB memory each are spawned (assuming 1 vCPU is reserved for each worker node and ~30% memory overhead).
Case 2: 4 vCPUs, 16 GB memory for the master; 16 vCPUs, 64 GB memory for each of 4 worker nodes. 12 executors with 5 vCPUs and 15 GB memory each are spawned.
In both cases, I tested a few different values of a parallelism constant C, where parallelism = C * <num_executors> * <num_cores_per_executor>. The driver is set to 11 GB memory and 3 vCPUs.
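For reference, the read test looks roughly like the sketch below (Scala; the GCS path, the app name and the use of repartition to apply the parallelism are placeholders/assumptions, not the actual job):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("json-read-test").getOrCreate()

val numExecutors = 10      // case 1
val coresPerExecutor = 3
val c = 2                  // one of the parallelism constants under test
val parallelism = c * numExecutors * coresPerExecutor

// Read the ~100K JSON files and repartition to the tested parallelism.
// The GCS path is a placeholder, not the real one.
val df = spark.read
  .json("gs://<bucket>/path/to/json/")
  .repartition(parallelism)

df.count()  // force a full scan so utilization can be observed in YARN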
Case 1 result:
2/10 containers used
<memory:7168, vCores:2> out of <memory:110000, vCores:27> available executor resource used
Case 2 result:
2/4 containers used
<memory:26624, vCores:2> out of <memory:180000, vCores:60> available executor resource used.
In either case, resource utilization (confirmed through the YARN console) is extremely low.
I think I am missing some major configuration, so I tried the following, but none of it made any identifiable difference.
{
  "spark.dynamicAllocation.enabled": "false",
  "yarn.scheduler.maximum-allocation-mb": "11 or 15gb",
  "yarn.nodemanager.resource.memory-mb": "11g or 15gb",
  "spark.submit.deployMode": "cluster"
}
What am I missing/doing wrong?
Related
We have a Spark job that reads a CSV file, applies a series of transformations, and writes the result to an ORC file.
The Spark job breaks into close to 20 stages and runs for around an hour.
input csv file size: 10 GB
spark-submit job resource configuration:
driver-memory= 5 GB
num-executors= 2
executor-cores= 3
executor-memory= 20 GB
EC2 instance type: r5d.xlarge, i.e. 32 GB memory and 4 vCPUs with an attached 128 GB EBS volume
The EMR cluster comprises 1 master node and 2 core nodes.
When we run the Spark job on the above cluster configuration, the CPU utilization is only close to 10-15%.
Our requirement is to maximize the CPU utilization of the EC2 instances for the Spark job.
Any suggestions are appreciated!
AFAIK, if you increase the parallelism, CPU usage will automatically increase. Try using these in your Spark job configuration:
num-executors= 4
executor-cores= 5
executor-memory= 25 GB
Especially if you increase the CPU cores, parallelism will increase.
More than 5 cores per executor is not recommended. This is based on a study showing that any application with more than 5 concurrent threads starts hampering performance.
spark.dynamicAllocation.enabled could be another option.
spark.default.parallelism = 2 * number of CPUs in total on worker nodes
Make sure that you always use YARN mode.
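Putting those suggestions together, a rough sketch of the equivalent programmatic configuration (these are normally passed as spark-submit flags, shown in the comments; adjust the values to what actually fits on your instances):

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Sketch of the suggested settings; not tested on your cluster.
val conf = new SparkConf()
  .set("spark.executor.instances", "4")    // --num-executors 4
  .set("spark.executor.cores", "5")        // --executor-cores 5
  .set("spark.executor.memory", "25g")     // --executor-memory 25g
  .set("spark.default.parallelism", "16")  // 2 * (2 worker nodes * 4 vCPUs) = 16
  .setMaster("yarn")                       // always run on YARN

val spark = SparkSession.builder().config(conf).getOrCreate()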
Follow Using maximizeResourceAllocation from the AWS docs, where all of these things are discussed in detail. Read it completely.
You can configure your executors to utilize the maximum resources possible on each node in a cluster by using the spark configuration classification to set maximizeResourceAllocation option to true. This EMR-specific option calculates the maximum compute and memory resources available for an executor on an instance in the core instance group. It then sets the corresponding spark-defaults settings based on this information.
[
  {
    "Classification": "spark",
    "Properties": {
      "maximizeResourceAllocation": "true"
    }
  }
]
Further reading
Best practices for successfully managing memory for Apache Spark applications on Amazon EMR
EMR-spark-tuning-demystified
Yarn Cluster Configuration:
8 Nodes
8 cores per Node
8 GB RAM per Node
1TB HardDisk per Node
Executor memory & No of Executors
Executor memory and the number of executors per node are interlinked, so you would first pick either the executor memory or the number of executors, and then, based on that choice, set the following properties to get the desired results.
In YARN, these properties determine the number of containers (executors in Spark) that can be instantiated on a NodeManager, based on the spark.executor.cores and spark.executor.memory property values (along with the executor memory overhead).
For example, take a cluster with 10 nodes (RAM: 16 GB, cores: 6) set with the following YARN properties:
yarn.scheduler.maximum-allocation-mb=10GB
yarn.nodemanager.resource.memory-mb=10GB
yarn.scheduler.maximum-allocation-vcores=4
yarn.nodemanager.resource.cpu-vcores=4
Then with the Spark properties spark.executor.cores=2 and spark.executor.memory=4GB, you can expect 2 executors per node, so in total you'll get 19 executors + 1 container for the driver.
If the Spark properties are spark.executor.cores=3 and spark.executor.memory=8GB, then you will get 9 executors (only 1 executor per node) + 1 container for the driver.
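The per-node arithmetic behind those two results can be sketched as follows (overhead is assumed here to be Spark's default of max(384 MB, 10% of spark.executor.memory), and YARN's allocation rounding is ignored):

// Sketch: executors per node implied by the YARN limits above.
def executorsPerNode(executorMemGb: Double, executorCores: Int,
                     nodeMemGb: Double = 10.0, nodeVcores: Int = 4): Int = {
  val overheadGb = math.max(0.384, 0.10 * executorMemGb)
  val byMemory   = (nodeMemGb / (executorMemGb + overheadGb)).toInt
  val byCores    = nodeVcores / executorCores
  math.min(byMemory, byCores)
}

executorsPerNode(4.0, 2)  // 2 per node -> 10 nodes * 2 - 1 (driver) = 19 executors
executorsPerNode(8.0, 3)  // 1 per node -> 10 nodes * 1 - 1 (driver) = 9 executors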
Driver memory
spark.driver.memory — Maximum size of each Spark driver's Java heap memory
spark.yarn.driver.memoryOverhead — Amount of extra off-heap memory that can be requested from YARN, per driver. This, together with spark.driver.memory, is the total memory that YARN can use to create a JVM for a driver process.
Spark driver memory does not impact performance directly, but it ensures that the Spark jobs run without memory constraints at the driver. Adjust the total amount of memory allocated to a Spark driver by using the following formula, assuming the value of yarn.nodemanager.resource.memory-mb is X:
12 GB when X is greater than 50 GB
4 GB when X is between 12 GB and 50 GB
1 GB when X is between 1 GB and 12 GB
256 MB when X is less than 1 GB
These numbers are for the sum of spark.driver.memory and spark.yarn.driver.memoryOverhead. Overhead should be 10-15% of the total.
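That rule of thumb could be written as a small helper, as in the sketch below (how the exact boundaries at 1 GB, 12 GB and 50 GB are treated is an assumption, since the list above does not specify it):

// Recommended total driver memory (spark.driver.memory + spark.yarn.driver.memoryOverhead),
// given X = yarn.nodemanager.resource.memory-mb expressed in GB.
def totalDriverMemoryGb(x: Double): Double =
  if (x > 50) 12.0
  else if (x >= 12) 4.0
  else if (x >= 1) 1.0
  else 0.25

// e.g. for the 10 GB NodeManagers in the example above: totalDriverMemoryGb(10) == 1.0,
// of which roughly 10-15% should go to the overhead.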
You can also follow this Cloudera link for tuning Spark jobs
A Spark cluster has 2 worker nodes.
Node 1: 64 GB, 8 cores.
Node 2: 64 GB, 8 cores.
Now if I submit a Spark job using spark-submit in cluster mode with
2 executors, each with 32 GB of executor memory and 4 cores per executor:
My question is: since the above configuration can be accommodated on a single node, will Spark run it using 2 worker nodes or just one node?
Also, if a configuration does not have a core count that divides evenly across the executors, how many cores are allocated to each executor?
Example: if the number of cores available in a node, after excluding one core for the YARN daemon, is 7, then with 2 nodes there are 2*7=14 cores available in total, and HDFS gives good throughput when there are 5 cores per executor.
Now 14/5 gives the number of executors. Should I treat 14/5 as 2 or 3 executors? And how are these cores then distributed equally?
It is more of a resource manager question than a Spark question, but in your case the 2 executors can't run on a single machine, because the OS has an overhead that uses at least 1 core and 1 GB of RAM, even if you set the memory to 30 GB and 3 cores per executor. They will run on different nodes, and because Spark tries to get the best data locality it can, it obviously won't use the same node for both executors.
We have a cluster of 4 nodes with the characteristics above :
Spark jobs take a lot of time to process. How could we optimize this time, knowing that our jobs are run from RStudio and we still have a lot of memory left unutilized?
To add more context to the answer above, I would like to explain how to set the parameters --num-executors, --executor-memory and --executor-cores appropriately.
The following answer covers the 3 main aspects mentioned in the title: number of executors, executor memory and number of cores.
There may be other parameters, like driver memory, that I did not address in this answer.
Case 1 Hardware: 6 nodes, each node with 16 cores and 64 GB RAM
Each executor is a JVM instance, so we can have multiple executors on a single node.
First, 1 core and 1 GB are needed for the OS and Hadoop daemons, so 15 cores and 63 GB RAM are available on each node.
Let's go one by one through how to choose these parameters.
Number of cores:
Number of cores = number of concurrent tasks an executor can run.
So we might think that more concurrent tasks per executor will give better performance.
But research shows that any application with more than 5 concurrent tasks starts to perform badly, so stick to 5.
This number comes from an executor's ability to run concurrent tasks, not from how many cores a system has. So the number 5 stays the same even if you have double (32) the cores in the CPU.
Number of executors:
Coming to the next step: with 5 cores per executor and 15 available cores per node, we arrive at 3 executors per node.
So with 6 nodes and 3 executors per node, we get 18 executors. Out of those 18, we need 1 executor (Java process) for the ApplicationMaster in YARN, which leaves 17 executors.
This 17 is the number we give to Spark via --num-executors when running from the spark-submit shell command.
Memory for each executor:
From the step above, we have 3 executors per node, and the available RAM is 63 GB.
So the memory for each executor is 63/3 = 21 GB.
However, a small amount of overhead memory is also needed to determine the full memory request to YARN for each executor.
The formula for that overhead is max(384 MB, 0.07 * spark.executor.memory).
Calculating that overhead: 0.07 * 21 GB (where 21 GB is the 63/3 calculated above) = 1.47 GB.
Since 1.47 GB > 384 MB, the overhead is 1.47 GB.
Subtracting that from each 21 GB above => 21 - 1.47 ~ 19 GB.
So executor memory = 19 GB.
Final numbers: Executors - 17 in total, Cores - 5 per executor, Executor memory - 19 GB.
Assigning the resources to the Spark jobs in the cluster this way speeds up the jobs by using the available resources efficiently.
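For reference, a sketch of those final numbers set programmatically (they are normally passed as --num-executors, --executor-cores and --executor-memory on spark-submit):

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Final numbers from the walkthrough above: 17 executors in total, 5 cores each,
// 19 GB heap each (plus ~1.5 GB per-executor overhead requested from YARN).
val conf = new SparkConf()
  .set("spark.executor.instances", "17")  // --num-executors 17
  .set("spark.executor.cores", "5")       // --executor-cores 5
  .set("spark.executor.memory", "19g")    // --executor-memory 19g

val spark = SparkSession.builder().config(conf).getOrCreate()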
I recommend you have a look at these parameters:
--num-executors : controls how many executors will be allocated
--executor-memory : RAM for each executor
--executor-cores : cores for each executor
I have 8 executors with 4 cores each, and I repartition an RDD to 32 partitions. I expect all 8 executors to take part in the next action that I call on the repartitioned data. But it seems like sometimes 3 executors participate and sometimes 4, but not more than that.
How can I ensure that the data gets divided across all executors?
rdd.repartition(32).foreachPartition{ part =>
updateMem(part)
}
The last part calls insert/update into MemSQL.
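For reference, a self-contained version of that snippet might look like the sketch below; updateMem here is a hypothetical stub standing in for the MemSQL insert/update logic, and the input data is made up:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("repartition-test").getOrCreate()
val rdd   = spark.sparkContext.parallelize(1 to 100000)

// Hypothetical stub: the real version would open one MemSQL connection
// per partition and batch the inserts/updates for that partition.
def updateMem(part: Iterator[Int]): Unit = part.foreach(_ => ())

// 32 partitions for 8 executors * 4 cores, so every core can get a task.
rdd.repartition(32).foreachPartition { part =>
  updateMem(part)
}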
The below answer is valid only if you are using AWS EMR.
I don't think it is correct to say that you have 8 executors with 4 cores each. Here is the explanation. Say I am using an m3.2xlarge machine (EMR).
Each machine contains 30 GB of memory (total) and 8 vcores.
There is no way you can use all 30 GB of memory for executors, as the machine needs some memory for its own use.
You would like to leave enough memory for the machine's own use (the OS and other processes) so that there will not be any system failure.
Say you want to leave 10 GB of memory for the machine; then you are left with 20 GB of memory.
In 20 GB of memory you can have 6 executors (3 GB each, 6*3 GB = 18 GB), or 4 executors (5 GB each, 4*5 GB = 20 GB), and so on.
So you can decide the number of executors depending on how much memory you need for each executor.
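A small sketch of that arithmetic (the 10 GB reservation is just the example figure used above, not a recommendation):

// Memory budgeting from the explanation above: 30 GB per m3.2xlarge node,
// minus ~10 GB reserved for the OS and other processes.
val usableMemGb = 30 - 10

// Executors that fit for a given per-executor memory
// (executor memory overhead is ignored here, as it is in the explanation above).
def executorsThatFit(memPerExecutorGb: Int): Int = usableMemGb / memPerExecutorGb

executorsThatFit(3)  // 6 executors (6 * 3 GB = 18 GB)
executorsThatFit(5)  // 4 executors (4 * 5 GB = 20 GB)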
To be specific to your use case, look at the total memory available on each machine and at the spark conf (/etc/spark/conf/spark-defaults.conf) for these two parameters, and adjust accordingly:
spark.executor.memory
spark.executor.cores