Spark job using more executors than allocated - apache-spark

I have the following settings in my Spark job:
--num-executors 2 \
--executor-cores 1 \
--executor-memory 12G \
--driver-memory 16G \
--conf spark.streaming.dynamicAllocation.enabled=false \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.streaming.receiver.writeAheadLog.enable=false \
--conf spark.executor.memoryOverhead=8192 \
--conf spark.driver.memoryOverhead=8192
My understanding is that the job should run with 2 executors; however, it is running with 3. This is happening to several of my jobs. Could someone please explain the reason?
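One way to see what the cluster actually handed the job is to list its YARN containers; a minimal sketch, assuming the job runs on YARN (the application and attempt ids below are placeholders). Note that on YARN the ApplicationMaster, which hosts the driver in cluster mode, takes a container of its own, so a request for 2 executors shows up as 3 running containers in the ResourceManager UI.

# List the attempt for the application, then the containers YARN granted it
yarn applicationattempt -list application_1700000000000_0001
yarn container -list appattempt_1700000000000_0001_000001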

Related

Can we have multiple executors in Spark with master local[*] and deploy-mode client?

I have a 1-node Hadoop cluster, and I am submitting a Spark job like this:
spark-submit \
--class com.compq.scriptRunning \
--master local[*] \
--deploy-mode client \
--num-executors 3 \
--executor-cores 4 \
--executor-memory 21g \
--driver-cores 2 \
--driver-memory 5g \
--conf "spark.local.dir=/data/spark_tmp" \
--conf "spark.sql.shuffle.partitions=2000" \
--conf "spark.sql.inMemoryColumnarStorage.compressed=true" \
--conf "spark.sql.autoBroadcastJoinThreshold=200000" \
--conf "spark.speculation=false" \
--conf "spark.hadoop.mapreduce.map.speculative=false" \
--conf "spark.hadoop.mapreduce.reduce.speculative=false" \
--conf "spark.ui.port=8099" \
.....
Though I define 3 executors, I see only 1 executor on the Spark UI page running all the time. Can we have multiple executors running in parallel with
--master local[*] \
--deploy-mode client \
It's an on-prem, plain open-source Hadoop flavor installed on the cluster.
I tried changing the master from local to local[*] and playing around with deploy modes; still, I could see only 1 executor running in the Spark UI.
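As a side note: with --master local[*] the driver and its single executor share one JVM, so --num-executors has no effect there. Below is a rough sketch of masters that do start separate executor processes on a single box; the local-cluster master is mainly used for Spark's own tests, and the resource figures are just carried over from the command above.

# Option 1: local-cluster[numWorkers, coresPerWorker, memoryPerWorkerMB]
spark-submit \
--class com.compq.scriptRunning \
--master "local-cluster[3,4,21504]" \
.....

# Option 2: submit to the single-node cluster's YARN ResourceManager
spark-submit \
--class com.compq.scriptRunning \
--master yarn \
--deploy-mode client \
--num-executors 3 \
--executor-cores 4 \
--executor-memory 21g \
.....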

Why are resources being shared among exclusive node labels?

I have 5 different node labels, all of them exclusive, and each belongs to the same queue. When I submit Spark jobs to different node labels, resources are shared between them, but that should not be the case when using exclusive node labels.
What could be the possible reason?
HDP Version - HDP-3.1.0.0
My spark-submit command:
bash $SPARK_HOME/bin/spark-submit \
--packages net.java.dev.jets3t:jets3t:0.9.0,com.google.guava:guava:16.0.1,com.amazonaws:aws-java-sdk:1.7.4,com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.6.0 \
--master yarn \
--queue prodQueue \
--conf "spark.executor.extraJavaOptions= -XX:SurvivorRatio=16 -XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintReferenceGC -XX:+PrintAdaptiveSizePolicy -XX:MaxDirectMemorySize=4g -XX:NewRatio=1" \
--conf spark.hadoop.yarn.timeline-service.enabled=false \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.yarn.executor.memoryOverhead=2048 \
--conf spark.task.maxFailures=40 \
--conf spark.kryoserializer.buffer.max=8m \
--conf spark.driver.memory=3g \
--conf spark.shuffle.sort.bypassMergeThreshold=5000 \
--conf spark.executor.heartbeatInterval=60s \
--conf spark.memory.storageFraction=0.20 \
--conf spark.ui.port=7070 \
--conf spark.reducer.maxReqsInFlight=10 \
--conf spark.scheduler.mode=FAIR \
--conf spark.port.maxRetries=100 \
--conf spark.yarn.max.executor.failures=280 \
--conf spark.shuffle.service.enabled=true \
--conf spark.cleaner.ttl=600 \
--executor-cores 2 \
--executor-memory 6g \
--num-executors 8 \
--conf spark.yarn.am.nodeLabelExpression=amNodeLabel \
--conf spark.yarn.executor.nodeLabelExpression=myNodeLabel \
my-application.jar
Thanks for helping.
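One thing that might be worth verifying (an assumption, not a confirmed diagnosis) is whether the labels were really registered as exclusive, since sharing idle capacity with other partitions is the expected behaviour for non-exclusive labels. A quick sketch with the YARN CLI, using the myNodeLabel name from the command above:

# Show every cluster node label together with its exclusivity flag
yarn cluster --list-node-labels

# If a label turns out to be non-exclusive, it generally has to be removed and
# re-created with the flag set explicitly, for example:
yarn rmadmin -removeFromClusterNodeLabels myNodeLabel
yarn rmadmin -addToClusterNodeLabels "myNodeLabel(exclusive=true)"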

Correct spark configuration to fully utilise EMR cluster resources

I'm quite new to configuring Spark, so I wanted to know whether I am fully utilising my EMR cluster.
The EMR cluster is using spark 2.4 and hadoop 2.8.5.
The app reads loads of small gzipped json files from s3, transforms the data and writes them back out to s3.
I've read various articles, but I was hoping I could get my configuration double-checked in case there are settings that conflict with each other or something.
I'm using a c4.8xlarge cluster with each of the 3 worker nodes having 36 CPU cores and 60 GB of RAM.
So that's 108 CPU cores and 180 GB of RAM overall.
Here are the spark-submit settings that I paste into the EMR add-step box:
--class com.example.app
--master yarn
--driver-memory 12g
--executor-memory 3g
--executor-cores 3
--num-executors 33
--conf spark.executor.memory=5g
--conf spark.executor.cores=3
--conf spark.executor.instances=33
--conf spark.driver.cores=16
--conf spark.driver.memory=12g
--conf spark.default.parallelism=200
--conf spark.sql.shuffle.partitions=500
--conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2
--conf spark.speculation=false
--conf spark.yarn.am.memory=1g
--conf spark.executor.heartbeatInterval=360000
--conf spark.network.timeout=420000
--conf spark.hadoop.fs.hdfs.impl.disable.cache=true
--conf spark.kryoserializer.buffer.max=512m
--conf spark.shuffle.consolidateFiles=true
--conf spark.hadoop.fs.s3a.multiobjectdelete.enable=false
--conf spark.hadoop.fs.s3a.fast.upload=true
--conf spark.worker.instances=3
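As a point of comparison, here is one common way to size executors for this cluster, as a sketch only: it assumes roughly 1 core and 7 GB per node are kept back for the OS, YARN and HDFS daemons (the exact amount YARN exposes on EMR may differ). Also note that pairs such as --executor-memory 3g and spark.executor.memory=5g above request different values, so only one of them can take effect.

# Per c4.8xlarge worker: 36 cores / 60 GB RAM
#   reserve ~1 core and ~7 GB per node  -> ~35 cores, ~53 GB usable
#   5 cores per executor                -> 7 executors per node, ~7.5 GB per container
#   keep ~1 GB of that for overhead     -> ~6 GB of executor heap
#   3 workers x 7 slots = 21; keep one slot free for the YARN AM/driver -> 20
--executor-cores 5
--executor-memory 6g
--num-executors 20
--conf spark.executor.memoryOverhead=1g
--conf spark.default.parallelism=200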

Spark Streaming - Diagnostics: Container is running beyond physical memory limits

My Spark Streaming job failed with the exception below:
Diagnostics: Container is running beyond physical memory limits.
Current usage: 1.5 GB of 1.5 GB physical memory used; 3.6 GB of 3.1 GB
virtual memory used. Killing container.
Here is my spark-submit command:
spark2-submit \
--name App name \
--class Class name \
--master yarn \
--deploy-mode cluster \
--queue Queue name \
--num-executors 5 --executor-cores 3 --executor-memory 5G \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.locality.wait=10 \
--conf spark.task.maxFailures=8 \
--conf spark.ui.killEnabled=false \
--conf spark.logConf=true \
--conf spark.yarn.driver.memoryOverhead=512 \
--conf spark.yarn.executor.memoryOverhead=2048 \
--conf spark.yarn.max.executor.failures=40 \
jar path
I am not sure what's causing the above issue. Am I missing something in the above command, or is it failing because I didn't set --driver-memory in my spark-submit command?
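For what it's worth, the 1.5 GB ceiling in the diagnostic happens to equal Spark's default 1 GB of driver memory plus the 512 MB of spark.yarn.driver.memoryOverhead set above, which points at the driver container rather than an executor; that is a guess, not a confirmed diagnosis. A sketch of the same submit with the driver sized explicitly (2G and 1024 are only starting points):

spark2-submit \
--name App name \
--class Class name \
--master yarn \
--deploy-mode cluster \
--queue Queue name \
--driver-memory 2G \
--num-executors 5 --executor-cores 3 --executor-memory 5G \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.locality.wait=10 \
--conf spark.task.maxFailures=8 \
--conf spark.ui.killEnabled=false \
--conf spark.logConf=true \
--conf spark.yarn.driver.memoryOverhead=1024 \
--conf spark.yarn.executor.memoryOverhead=2048 \
--conf spark.yarn.max.executor.failures=40 \
jar path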

What should my spark-submit options be for better performance, and how do I fix the heap memory issue?

I have 1 driver and 6 core instances, each with 16 GB RAM and 8 cores.
I am running spark-submit with the options below:
spark-submit --driver-memory 4g \
--executor-memory 6g \
--num-executors 12 \
--executor-cores 2 \
--conf spark.driver.maxResultSize=0 \
--conf spark.network.timeout=800 job.py
I am getting a Java heap memory error multiple times. I think there is something wrong with the options; can someone help me out with this?
Thanks
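A sizing sketch under stated assumptions (not a verified fix): with 6 workers of 8 cores / 16 GB each, reserving roughly 1 core and 2 GB per node for the OS and Hadoop daemons leaves about 7 cores and 14 GB per node, and the following fits within that. Putting a cap back on spark.driver.maxResultSize may also help, since 0 means unlimited and a large collect can then exhaust the driver heap.

# Per node: ~7 usable cores, ~14 GB -> 2 executors of 3 cores and 5g heap plus
# ~1g overhead each (~12 GB per node); 6 nodes x 2 = 12 executors, 36 cores in use
spark-submit --driver-memory 4g \
--num-executors 12 \
--executor-cores 3 \
--executor-memory 5g \
--conf spark.executor.memoryOverhead=1g \
--conf spark.driver.maxResultSize=2g \
--conf spark.network.timeout=800 \
job.py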
