spark-sql on yarn hangs when number of executors is increased - v1.3.0 - apache-spark

I am running spark-sql on a hive table.
It runs successfully when the spark-shell is started with the following parameters,
"--driver-memory 8G --executor-memory 10G --executor-cores 1 --num-executors 30"
however the job hangs when the spark-shell is started with
"--driver-memory 8G --executor-memory 10G --executor-cores 1 --num-executors 40"
The difference is only in the number of executors (30 vs 40).
In the second case I see that there is 1 task active on each executor, but it never makes progress; I do not see any "task completed" messages in the spark-shell.
The job runs successfully with number of executors below 30.
My yarn cluster has 42 nodes and 30 cores per node and about 50G memory per node.
Any pointers on where I should look?
I compared the debug-level logs from both runs: the good runs had a bunch of lines like the ones below, while the runs that appeared to hang had none of them.
"org.apache.spark.storage.BlockManager logDebug - Level for block broadcast_0_piece0 is StorageLevel(true, true, false, false, 1)"
"org.apache.spark.storage.BlockManager logDebug - Level for block broadcast_1 is StorageLevel(true, true, false, true, 1)"

This was because of classpath issues: I was including some older versions of the dependencies, and once those were removed the problem no longer occurred.
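A couple of checks that can surface this kind of conflict early (a sketch only: the lib/ directory is a placeholder for wherever your application jars live, and --verbose simply makes spark-shell print the arguments and Spark properties it resolved):
# print what spark-shell actually resolved (jars, extraClassPath, conf values)
./bin/spark-shell --verbose --driver-memory 8G --executor-memory 10G --executor-cores 1 --num-executors 40
# look for the same artifact present in two different versions
ls lib/*.jar | sed 's/-[0-9][0-9.]*\.jar$//' | sort | uniq -d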

Related

Spark: use of driver-memory parameter

When I submit the following command, my job fails with the error "Container is running beyond physical memory limits".
spark-submit --master yarn --deploy-mode cluster --executor-memory 5G --total-executor-cores 30 --num-executors 15 --conf spark.yarn.executor.memoryOverhead=1000
But when I add the --driver-memory parameter set to 5G (or higher), the job finishes without error.
spark-submit --master yarn --deploy-mode cluster --executor-memory 5G --total-executor-cores 30 --num-executors 15 --driver-memory 5G --conf spark.yarn.executor.memoryOverhead=1000
Cluster info: 6 nodes with 120GB of Memory. YARN Container Memory Minimum: 1GB
The question is: what difference does using this parameter make?
If increasing the driver memory helps the job complete successfully, it means the driver is receiving a lot of data back from the executors. The driver program is responsible for collecting the results from each executor after the tasks are executed, so in your case it seems the extra driver memory gave it enough room to hold those results.
If you read up on executor memory, driver memory and the way the driver interacts with the executors, you will get a clearer picture of the situation you are in.
Hope it helps to some extent.
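As an illustration of the knobs involved (a sketch only: the application jar name is a placeholder, and spark.driver.maxResultSize, which caps how much result data the driver will accept from the executors, is shown with an arbitrary value):
# --driver-memory sizes the JVM that collects results; maxResultSize makes oversized collects fail fast instead of OOMing the driver
spark-submit --master yarn --deploy-mode cluster \
  --executor-memory 5G --num-executors 15 \
  --driver-memory 5G \
  --conf spark.driver.maxResultSize=2g \
  --conf spark.yarn.executor.memoryOverhead=1000 \
  your-app.jar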

spark on yarn, only one executor on one node works and the allocation is random

I run spark-shell with this command:
./bin/spark-shell --master yarn --num-executors 16 \
  --executor-memory 14G --executor-cores 8
I have four nodes; every node has 16G of memory and 4 cores.
After I changed num-executors, the Spark web UI tells me it worked,
but only one executor, on the node named "slave", is running.
We can see that TRANSFER1 and TRANSFER2 are empty.
How can I solve this? When I submit a job the situation does not change.
Should I change worker-instances, num-executors, and so on?
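For comparison, a request sized so that every 16G / 4-core node can actually host an executor might look roughly like this (a sketch only: the exact numbers depend on what yarn.nodemanager.resource.memory-mb exposes on each node, since YARN only places an executor on a node where executor-memory plus the YARN memory overhead fits):
# one executor per node: 10G heap plus roughly 1G overhead fits inside a 16G node's NodeManager allocation
./bin/spark-shell --master yarn \
  --num-executors 4 \
  --executor-cores 3 \
  --executor-memory 10G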

Yarn Spark HBase - ExecutorLostFailure Container killed by YARN for exceeding memory limits

I am trying to read a big hbase table in spark (~100GB in size).
Spark Version : 1.6
Spark submit parameters:
spark-submit --master yarn-client --num-executors 10 --executor-memory 4G \
  --executor-cores 4 \
  --conf spark.yarn.executor.memoryOverhead=2048
Error: ExecutorLostFailure Reason: Container killed by YARN for exceeding memory limits.
4.5 GB of 3 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
I have tried setting spark.yarn.executor.memoryOverhead to 100000 and still get a similar error.
I don't understand why Spark doesn't spill to disk if the memory is insufficient, or whether YARN is causing the problem here.
Please share the code you use to read the table in, and also your cluster architecture.
Container killed by YARN for exceeding memory limits. 4.5 GB of 3 GB physical memory used
Try the following if you have 128 GB of RAM:
spark-submit \
  --master yarn-client \
  --num-executors 4 \
  --executor-memory 100G \
  --executor-cores 4 \
  --conf spark.yarn.executor.memoryOverhead=20480
The situation is clear: you are running out of RAM. Try to rewrite your code in a disk-friendly way.
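The arithmetic behind the kill message is that each YARN container must be big enough for executor-memory plus spark.yarn.executor.memoryOverhead, so the overhead should scale with the executor size rather than be set to an arbitrary value. A more moderate sketch along the lines of the original 10-executor setup (the jar name and the exact sizes are placeholders, not recommendations):
# each container asks YARN for roughly 8G heap + 2G overhead = 10G
spark-submit --master yarn-client \
  --num-executors 10 \
  --executor-cores 4 \
  --executor-memory 8G \
  --conf spark.yarn.executor.memoryOverhead=2048 \
  your-app.jar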

Spark executor GC taking long

I am running a Spark job on a standalone cluster, and I noticed that after some time GC starts taking long and the scary red color begins to show in the UI.
Here are the resources available:
Cores in use: 80 Total, 76 Used
Memory in use: 312.8 GB Total, 292.0 GB Used
Job details:
spark-submit --class com.mavencode.spark.MonthlyReports \
  --master spark://192.168.12.14:7077 \
  --deploy-mode cluster --supervise \
  --executor-memory 16G --executor-cores 4 \
  --num-executors 18 --driver-cores 8 \
  --driver-memory 20G montly-reports-assembly-1.0.jar
How do I fix the GC time taking so long?
I had the same problem and could resolve it by using the Parallel GC instead of G1GC. You may add the following options to the executors' additional Java options in the submit request:
-XX:+UseParallelGC -XX:+UseParallelOldGC
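On the submit command that might look roughly like this (a sketch: everything except the spark.executor.extraJavaOptions line simply mirrors the job shown above):
spark-submit --class com.mavencode.spark.MonthlyReports \
  --master spark://192.168.12.14:7077 \
  --deploy-mode cluster --supervise \
  --executor-memory 16G --executor-cores 4 \
  --num-executors 18 --driver-cores 8 --driver-memory 20G \
  --conf "spark.executor.extraJavaOptions=-XX:+UseParallelGC -XX:+UseParallelOldGC" \
  montly-reports-assembly-1.0.jar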

launched executors are fewer than the number of executors specified

I have an EMR cluster with the following configuration:
Node            Cores   RAM (GB)   yarn.nodemanager.resource.memory-mb (MB)
Master              4         15       11532
core (slave1)      16         30       23040
core (slave2)      16         30       23040
core (slave3)      16         30       23040
core (slave4)      16         30       23040
I am starting a Spark application with one job that gets divided into 2 stages, using --master yarn-client with the following configurations:
--num-executors 12 --executor-cores 5 --executor-memory 7G ---->(1)
--num-executors 12 --executor-cores 5 --executor-memory 6G ---->(2)
I have not modified any other parameters, so the spark.storage.* and spark.shuffle.* fractions are at their defaults.
The calculations I performed to arrive at the above configuration (the master node does no computation except serving as the driver, as verified using Ganglia) are:
1. I allocated 15 cores to YARN per node and started 3 executors per node, which implies 4 (number of slave nodes) * 3 = 12 executors.
2. 15 cores / 3 executors = 5 cores per executor.
3. 23040 MB * (1 - 0.07) ≈ 21G; dividing this among three executors gives 21/3 = 7G.
With configuration (1) it does not launch 12 executors, whereas with (2) it does. Since the memory appears to be available for each executor, why is it unable to launch 12 executors in case (1)?
What is your memory utilization like? Have you checked yarn-site.xml on the NodeManager hosts to see whether all of that memory and CPU is actually being exposed through the NodeManager configuration?
You can run yarn node -list for a list of nodes and then yarn node -status (I believe) to see what each node exposes to YARN in terms of resources.
Consult yarn logs -applicationId to see a detailed log of your application's interaction, including captured output.
Finally, look at the YARN logs on the ResourceManager host to see if there are any issues there.
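Put together, those checks look something like the following (the <node-id> and <application-id> values are placeholders taken from the -list output and from the ResourceManager UI respectively):
# list the NodeManagers known to the ResourceManager
yarn node -list
# show the memory and vcores a particular node exposes to YARN
yarn node -status <node-id>
# fetch the aggregated logs for the application
yarn logs -applicationId <application-id>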
