We have a Spark job that reads a CSV file, applies a series of transformations, and writes the result to an ORC file.
The job breaks into close to 20 stages and runs for around an hour.
Input CSV file size: 10 GB
spark-submit resource configuration:
driver-memory = 5 GB
num-executors = 2
executor-cores = 3
executor-memory = 20 GB
EC2 instance type: r5d.xlarge, i.e. 32 GB memory and 4 vCPU, with an attached 128 GB EBS volume
The EMR cluster comprises 1 master node and 2 core nodes.
When we run the Spark job on the above cluster configuration, CPU utilization is only around 10-15%.
Our requirement is to maximize the CPU utilization of the EC2 instances for this Spark job.
Any suggestions are appreciated!
AFAIK, if you increase the parallelism, CPU usage will automatically increase. Try using these in your Spark job configuration:
num-executors = 4
executor-cores = 5
executor-memory = 25 GB
Especially if you increase the CPU cores, parallelism will increase.
More than 5 cores per executor is not recommended. This is based on a study finding that any application with more than 5 concurrent threads per executor starts to hamper performance.
spark.dynamicAllocation.enabled could be another option.
spark.default.parallelism = 2 * the total number of CPU cores on the worker nodes
Make sure that you always use YARN mode.
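The rules of thumb above (at most 5 cores per executor, default parallelism = 2 × total worker cores) can be sketched as a small sizing helper. This is a rough sketch, not a Spark API: the function name and the 1-core/1-GB per-node reservation for the OS and YARN daemons are my own assumptions.

```python
# Rough executor-sizing sketch based on the rules of thumb above.
# Assumptions (mine, not Spark's): reserve 1 core and 1 GB per node
# for OS/YARN daemons, and cap executors at 5 cores each.

def suggest_layout(nodes, vcpus_per_node, mem_gb_per_node, max_cores_per_executor=5):
    usable_cores = nodes * (vcpus_per_node - 1)      # 1 core/node reserved
    usable_mem = nodes * (mem_gb_per_node - 1)       # 1 GB/node reserved
    cores_per_executor = min(max_cores_per_executor, vcpus_per_node - 1)
    num_executors = usable_cores // cores_per_executor
    mem_per_executor = usable_mem // num_executors
    default_parallelism = 2 * usable_cores           # 2 x total worker cores
    return num_executors, cores_per_executor, mem_per_executor, default_parallelism

# The asker's cluster: 2 core nodes, r5d.xlarge (4 vCPU, 32 GB each)
print(suggest_layout(nodes=2, vcpus_per_node=4, mem_gb_per_node=32))
# -> (2, 3, 31, 12)
```

Note that on a 4-vCPU instance the 5-core cap never applies (only 3 cores per node are usable), so on this particular cluster the low utilization is more likely about too few tasks in flight than executor shape.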
Follow Using maximizeResourceAllocation in the AWS docs, where all of these settings are discussed in detail; read it completely.
You can configure your executors to utilize the maximum resources possible on each node in a cluster by using the spark configuration classification to set the maximizeResourceAllocation option to true. This EMR-specific option calculates the maximum compute and memory resources available to an executor on an instance in the core instance group, then sets the corresponding spark-defaults settings accordingly.
[
  {
    "Classification": "spark",
    "Properties": {
      "maximizeResourceAllocation": "true"
    }
  }
]
Further reading
Best practices for successfully managing memory for Apache Spark applications on Amazon EMR
EMR-spark-tuning-demystified
Related
I'm fairly new to Spark in cluster mode and want to know what is wrong with my current configuration, since resource utilization seems extremely low.
I am using Dataproc on GCP, comparing two cases of reading 10+ GB of data, split across 100K+ JSON files, into a Spark dataframe.
Case 1: 4 vCPU, 16 GB memory for the master and each of 10 worker nodes. 10 executors with 3 vCPU and 11 GB memory each are spawned (assuming 1 vCPU reserved per worker node and 30% memory overhead).
Case 2: 4 vCPU, 16 GB memory for the master; 16 vCPU, 64 GB memory for each of 4 worker nodes. 12 executors with 5 vCPU and 15 GB memory each are spawned.
In both cases, I tested a few different parallelism constants C, where parallelism = C * <num_executors> * <num_cores_per_executor>. The driver is given 11 GB memory and 3 vCPU.
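For concreteness, the parallelism formula above works out as follows for the two executor layouts; the C values shown are illustrative, since the question does not say which constants were tried:

```python
# parallelism = C * num_executors * cores_per_executor
def parallelism(c, num_executors, cores_per_executor):
    return c * num_executors * cores_per_executor

for c in (1, 2, 4):
    case1 = parallelism(c, num_executors=10, cores_per_executor=3)  # Case 1: 10 x 3 vCPU
    case2 = parallelism(c, num_executors=12, cores_per_executor=5)  # Case 2: 12 x 5 vCPU
    print(c, case1, case2)
# C=2 gives 60 partitions in Case 1 and 120 in Case 2
```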
Case 1 result:
2/10 containers used
<memory:7168, vCores:2> out of <memory:110000, vCores:27> available executor resource used
Case 2 result:
2/4 containers used
<memory:26624, vCores:2> out of <memory:180000, vCores:60> available executor resource used.
In either case, resource utilization (confirmed through the YARN console) is extremely low.
I thought I was missing some major configuration, so I tried the following, but none of it made any identifiable difference.
{
  "spark.dynamicAllocation.enabled": "false",
  "yarn.scheduler.maximum-allocation-mb": "11 or 15gb",
  "yarn.nodemanager.resource.memory-mb": "11g or 15gb",
  "spark.submit.deployMode": "cluster"
}
What am I missing/doing wrong?
I'm running a Spark batch job on AWS Fargate in standalone mode. The compute environment has 8 vCPU, and the job definition has 1 vCPU and 2048 MB memory. In the Spark application I can specify how many cores I want to use, which I do with the code below:
from pyspark.sql import SparkSession

sparkSess = SparkSession.builder.master("local[8]") \
    .appName("test app") \
    .config("spark.debug.maxToStringFields", "1000") \
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic") \
    .getOrCreate()
local[8] specifies 8 cores/threads (that's my assumption).
Initially I ran the Spark app without specifying cores; I think the job ran in a single thread and took around 10 minutes to complete. Specifying a core count reduces the processing time: starting with 2 brought it down to about 5 minutes, and after changing it to 4 and then 8 it now takes almost 4 minutes. But I don't understand the relation between vCPUs and Spark threads. Whatever number I specify for cores, sparkContext.defaultParallelism shows that value.
Is this the correct way? Is there any relation between this number and the vCPUs I specify in the job definition or compute environment?
You are running in Spark Local Mode. Learning Spark has this to say about Local mode:
Spark driver runs on a single JVM, like a laptop or single node
Spark executor runs on the same JVM as the driver
Cluster manager Runs on the same host
Damji, Jules S.,Wenig, Brooke,Das, Tathagata,Lee, Denny. Learning Spark (p. 30). O'Reilly Media. Kindle Edition.
local[N] launches with N threads. Given the above definition of local mode, those N threads must be shared by the local-mode driver, executor, and cluster manager.
As such, from the available vCPUs, allotting one vCPU for the driver thread, one for the cluster manager, one for the OS, and the remainder for the executor seems reasonable.
The optimal number of threads/vCPUs for the Executor will depend on the number of partitions your data has.
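That budgeting rule can be sketched as follows; note the three-way reservation is the answer's heuristic, not anything Spark enforces, and the helper name is mine:

```python
# Heuristic from the answer above: in local mode, reserve one vCPU each for
# the driver thread, the cluster manager, and the OS; the rest go to executor tasks.
def executor_threads(total_vcpus, reserved=3):
    return max(1, total_vcpus - reserved)

print(executor_threads(8))  # 8-vCPU Fargate environment -> local[5] for task threads
```

Whether local[5] actually beats local[8] in practice still depends on how CPU-bound the tasks are and how many partitions the data has.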
A Spark cluster has 2 worker nodes.
Node 1: 64 GB, 8 cores.
Node 2: 64 GB, 8 cores.
Now suppose I submit a Spark job using spark-submit in cluster mode with
2 executors, each with 32 GB executor memory and 4 cores.
My question is: since the above configuration can be accommodated on a single node, will Spark run it across 2 worker nodes or on just one?
Also, if the total core count is not a multiple of the cores per executor, how many cores are allocated to each executor?
Example: if the number of cores available on a node, after excluding one core for the YARN daemon, is 7, then with 2 nodes we have 2*7 = 14 total cores available. Since HDFS gives good throughput at 5 cores per executor,
we compute 14/5 to find the number of executors. Should I consider 14/5 as 2 or 3 executors? And how are the cores then distributed equally?
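Rounding down is the usual approach, since rounding up would over-commit cores. A quick sketch of the arithmetic (plain Python, nothing Spark-specific):

```python
# Executor count from available cores, rounding down so cores are never over-committed.
total_cores = 2 * 7          # 2 nodes, 7 usable cores each (1 reserved for YARN daemon)
cores_per_executor = 5

num_executors = total_cores // cores_per_executor   # floor division -> 2
leftover_cores = total_cores % cores_per_executor   # 4 cores left idle

print(num_executors, leftover_cores)  # -> 2 4
```

With 14 cores and a 5-core target you get 2 executors (one per node) and 4 idle cores; dropping to 4 cores per executor would instead give 3 executors with only 2 cores idle.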
This is more of a resource-manager question than a Spark question, but in your case the 2 executors can't run on a single machine, because the OS has overhead that uses at least 1 core and 1 GB RAM. Even if you set 30 GB RAM and 3 cores per executor, they will run on different nodes, because Spark tries to get the best data locality it can, so it won't place both executors on the same node.
I have a 20-node c4.4xlarge cluster to run a Spark job. Each node is a 16 vCore, 30 GiB memory, EBS-only machine with 32 GiB of EBS storage.
Since each node has 16 vCores, I understand that the maximum number of executors is 16*20 = 320. Total memory available is 20 (nodes) * 30 ~ 600 GB. Assigning 1/3 to system operations, I have 400 GB of memory to process my data in-memory. Is this the right understanding?
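The arithmetic itself checks out; whether a flat 1/3 reservation is right is a separate question, since YARN's overhead is normally configured per-container rather than as a fraction of the cluster:

```python
# Sanity-check of the capacity figures in the question
nodes = 20
vcores_per_node = 16
mem_gib_per_node = 30

max_executors = nodes * vcores_per_node   # one 1-core executor per vCore -> 320
total_mem = nodes * mem_gib_per_node      # 600 GiB
usable_mem = total_mem * 2 // 3           # reserving 1/3 for the system -> 400 GiB

print(max_executors, total_mem, usable_mem)
```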
Also, the Spark History UI shows a non-uniform distribution of input and shuffle, so I believe the processing is not distributed evenly across executors. I pass these config parameters in my spark-submit:
> --conf spark.dynamicAllocation.enabled=true --conf spark.dynamicAllocation.minExecutors=20
The executor summary in the Spark History UI also shows that the data distribution is completely skewed, and I am not using the cluster in the best way. How can I distribute my load better?
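One common remedy for this kind of skew is to salt the hot keys so they hash across more partitions. A pure-Python sketch of the idea follows; the key names and salt factor are illustrative, not from the question:

```python
import random

# Salting: append a random suffix to hot keys so a hash partitioner
# spreads their records over several partitions instead of one.
SALT_BUCKETS = 8

def salt_key(key, hot_keys):
    if key in hot_keys:
        return f"{key}#{random.randrange(SALT_BUCKETS)}"
    return key

records = [("user_1", 1)] * 100 + [("user_2", 1)] * 5   # user_1 is the hot key
salted = [(salt_key(k, hot_keys={"user_1"}), v) for k, v in records]

# The 100 user_1 records now fall under up to 8 distinct salted keys.
print(len({k for k, _ in salted}))
```

In Spark you would apply the same transformation in a map before the shuffle, then strip the salt and re-aggregate afterwards.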
I'm seeing a low number of writes to Elasticsearch using Spark (Java).
Here is the configuration:
Using 13.xlarge machines for the ES cluster,
4 instances, each with 4 processors.
Refresh interval set to -1, replication set to '0', and other basic
configurations required for better write performance.
Spark:
2-node EMR cluster with
2 core instances
- 8 vCPU, 16 GiB memory, EBS-only storage
- EBS storage: 1000 GiB
1 master node
- 1 vCPU, 3.8 GiB memory, 410 GB SSD storage
ES index has 16 shards defined in mapping.
I run the job with the following configuration:
executor-memory - 8g
spark.executor.instances=2
spark.executor.cores=4
and using
es.batch.size.bytes - 6MB
es.batch.size.entries - 10000
es.batch.write.refresh - false
With this configuration, I try to load 1 million documents (each with a size of 1300 bytes), and the load runs at about 500 records/docs per second per ES node.
In the Spark log I see for each task:
1116 bytes result sent to driver
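A quick back-of-the-envelope on the bulk settings above shows that the byte limit, not the entry limit, caps each bulk request (sizes taken from the question; per-document request overhead is ignored):

```python
# Effective bulk-request size given the es.batch.* settings above
doc_size = 1300                  # bytes per document
batch_bytes = 6 * 1024 * 1024    # es.batch.size.bytes = 6MB
batch_entries = 10_000           # es.batch.size.entries

docs_per_bulk = min(batch_entries, batch_bytes // doc_size)
print(docs_per_bulk)  # -> 4839: the 6MB limit is hit before the 10K entry limit
```

So each task flushes roughly 4.8K docs per bulk request, well under the configured 10K entries.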
Spark Code
import org.apache.spark.api.java.JavaRDD;
import org.elasticsearch.spark.rdd.api.java.JavaEsSpark;

// Read the raw JSON lines from S3 and bulk-index them into ES
JavaRDD<String> javaRDD = jsc.textFile("<S3 Path>");
JavaEsSpark.saveJsonToEs(javaRDD, "<Index name>");
Also, when I look at the in-network graph on the ES cluster, traffic is very low, and I can see that EMR is not sending much data over the network. Is there a way I can tell Spark to send the right amount of data to make writes faster?
Or is there any other config that I am missing to tweak?
500 docs per second per ES instance seems low. Can someone please point out what I'm missing in these settings to improve my ES write performance?
Thanks in advance.
You may have an issue here:
spark.executor.instances=2
You are limited to two executors, where you could have 4 based on your cluster configuration. I would change this to 4 or greater. I might also try executor-memory = 1500M, cores = 1, instances = 16. I like to leave a little overhead in my memory, which is why I dropped from 2G to 1.5G (but you can't specify 1.5G, so it has to be 1500M). If you are connecting via your executors, this will improve performance.
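To check that 16 such executors actually fit: Spark's default executor memory overhead is max(384 MB, 10% of executor memory) per YARN container, which works out roughly as below (applied here as an estimate, not an exact YARN calculation):

```python
# Rough YARN footprint check for 16 executors of 1500M / 1 core each
executor_mem_mb = 1500
overhead_mb = max(384, executor_mem_mb // 10)   # Spark default: max(384MB, 10%)
instances = 16

total_mem_mb = instances * (executor_mem_mb + overhead_mb)
total_cores = instances * 1

print(total_mem_mb, total_cores)  # ~30 GB / 16 cores vs ~32 GB RAM / 16 vCPU on the 2 core nodes
```

So 16 one-core executors just about saturate the two 8-vCPU, 16 GiB core nodes, which is the point of the suggestion.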
I would need some code to debug further. I wonder if you are connecting to Elasticsearch only in your driver, and not in your worker nodes, meaning you are getting only one connection instead of one per executor.