When does a Spark on YARN application exit with exitCode: -104?

My Spark application reads 3 files of 7 MB, 40 MB and 100 MB, applies many transformations, and stores the output in multiple directories.
Spark version: 1.5 (CDH 5.5)
MASTER_URL=yarn-cluster
NUM_EXECUTORS=15
EXECUTOR_MEMORY=4G
EXECUTOR_CORES=6
DRIVER_MEMORY=3G
My Spark job ran for some time, then threw the error message below and restarted from the beginning:
18/03/27 18:59:44 INFO avro.AvroRelation: using snappy for Avro output
18/03/27 18:59:47 ERROR yarn.ApplicationMaster: RECEIVED SIGNAL 15: SIGTERM
18/03/27 18:59:47 INFO CuratorFrameworkSingleton: Closing ZooKeeper client.
Once it restarted, it ran for some time and then failed with this error:
Application application_1521733534016_7233 failed 2 times due to AM Container for appattempt_1521733534016_7233_000002 exited with exitCode: -104
For more detailed output, check application tracking page:http://entline.com:8088/proxy/application_1521733534016_7233/Then, click on links to logs of each attempt.
Diagnostics: Container [pid=52716,containerID=container_e98_1521733534016_7233_02_000001] is running beyond physical memory limits. Current usage: 3.5 GB of 3.5 GB physical memory used; 4.3 GB of 7.3 GB virtual memory used. Killing container.
Dump of the process-tree for container_e98_1521733534016_7233_02_000001 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 52720 52716 52716 52716 (java) 89736 8182 4495249408 923677 /usr/java/jdk1.7.0_67-cloudera/bin/java -server -Xmx3072m -Djava.io.tmpdir=/apps/hadoop/data04/yarn/nm/usercache/bdbuild/appcache/application_1521733534016_7233/container_e98_1521733534016_7233_02_000001/tmp -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1521733534016_7233/container_e98_1521733534016_7233_02_000001 -XX:MaxPermSize=256m org.apache.spark.deploy.yarn.ApplicationMaster --class com.sky.ids.dovetail.asrun.etl.DovetailAsRunETLMain --jar file:/apps/projects/dovetail_asrun_etl/jars/EntLine-1.0-SNAPSHOT-jar-with-dependencies.jar --arg --app.conf.path --arg application.conf --arg --run_type --arg AUTO --arg --bus_date --arg 2018-03-27 --arg --code_base_id --arg EntLine-1.0-SNAPSHOT --executor-memory 4096m --executor-cores 6 --properties-file /apps/hadoop/data04/yarn/nm/usercache/bdbuild/appcache/application_1521733534016_7233/container_e98_1521733534016_7233_02_000001/__spark_conf__/__spark_conf__.properties
|- 52716 52714 52716 52716 (bash) 2 0 108998656 389 /bin/bash -c LD_LIBRARY_PATH=/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/hadoop/../../../CDH-5.5.1-1.cdh5.5.1.p0.11/lib/hadoop/lib/native: /usr/java/jdk1.7.0_67-cloudera/bin/java -server -Xmx3072m -Djava.io.tmpdir=/apps/hadoop/data04/yarn/nm/usercache/bdbuild/appcache/application_1521733534016_7233/container_e98_1521733534016_7233_02_000001/tmp -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1521733534016_7233/container_e98_1521733534016_7233_02_000001 -XX:MaxPermSize=256m org.apache.spark.deploy.yarn.ApplicationMaster --class 'com.sky.ids.dovetail.asrun.etl.DovetailAsRunETLMain' --jar file:/apps/projects/dovetail_asrun_etl/jars/EntLine-1.0-SNAPSHOT-jar-with-dependencies.jar --arg '--app.conf.path' --arg 'application.conf' --arg '--run_type' --arg 'AUTO' --arg '--bus_date' --arg '2018-03-27' --arg '--code_base_id' --arg 'EntLine-1.0-SNAPSHOT' --executor-memory 4096m --executor-cores 6 --properties-file /apps/hadoop/data04/yarn/nm/usercache/bdbuild/appcache/application_1521733534016_7233/container_e98_1521733534016_7233_02_000001/__spark_conf__/__spark_conf__.properties 1> /var/log/hadoop-yarn/container/application_1521733534016_7233/container_e98_1521733534016_7233_02_000001/stdout 2> /var/log/hadoop-yarn/container/application_1521733534016_7233/container_e98_1521733534016_7233_02_000001/stderr
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Failing this attempt. Failing the application.
As per my CDH configuration:
Container Memory[Amount of physical memory, in MiB, that can be allocated for containers]
yarn.nodemanager.resource.memory-mb 50655 MiB
Please see the containers running on my driver node.
Why are there so many containers running on one node?
I know that container_e98_1521733534016_7880_02_000001 is for my Spark application. I don't know what the other containers are for. Any idea on that?
I also see that the physical memory for container_e98_1521733534016_7880_02_000001 is 3584 MB, which is close to 3.5 GB.
What does this error mean? When does it usually occur?
What is 3.5 GB of 3.5 GB physical memory? Is it driver memory?
Could someone help me with this issue?

container_e98_1521733534016_7233_02_000001 is the first container started, and given MASTER_URL=yarn-cluster, it is not only the ApplicationMaster but also the driver of the Spark application.
It appears that the memory setting for the driver, i.e. DRIVER_MEMORY=3G, is too low and you have to bump it up.
Spark on YARN runs two executors by default (see --num-executors) and so you'll end up with 3 YARN containers with 000001 for the ApplicationMaster (perhaps with the driver) and 000002 and 000003 for the two executors.
What is 3.5 GB of 3.5 GB physical memory? Is it driver memory?
Since you use yarn-cluster, the driver, the ApplicationMaster and container_e98_1521733534016_7233_02_000001 are all the same and live in the same JVM. That means the error is about how much memory you assigned to the driver.
My understanding is that you gave DRIVER_MEMORY=3G, which happened to be too little for your processing, and once YARN figured that out, it killed the driver (and hence the entire Spark application, as it is not possible to have a Spark application up and running without the driver).
See the document Running Spark on YARN.
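A minimal sketch of that bump on the spark-submit command line (a sketch only: the class, jar and executor settings are copied from the question and the container dump above, and 4g is just an example value to start from, not a tuned one):
spark-submit \
  --master yarn-cluster \
  --driver-memory 4g \
  --num-executors 15 \
  --executor-memory 4G \
  --executor-cores 6 \
  --class com.sky.ids.dovetail.asrun.etl.DovetailAsRunETLMain \
  /apps/projects/dovetail_asrun_etl/jars/EntLine-1.0-SNAPSHOT-jar-with-dependencies.jar
(followed by the application arguments shown in the container dump)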

A small addition to what Jacek already wrote to answer the question
why you get 3.5GB instead of 3GB?
is that apart from DRIVER_MEMORY=3G you also need to consider spark.driver.memoryOverhead, which defaults to MAX(DRIVER_MEMORY * 0.10, 384 MB) = 384 MB, so 3 GB + 384 MB ≈ 3.5 GB.
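A rough worked version of that calculation, which matches the 3584 MB reported for the container (this assumes the default overhead and a 512 MB YARN allocation increment, i.e. yarn.scheduler.increment-allocation-mb; both values are assumptions about this particular cluster):
driver heap (DRIVER_MEMORY)                 = 3072 MB
memory overhead = max(3072 * 0.10, 384)     =  384 MB
requested AM container                      = 3456 MB
rounded up to the allocation increment      = 3584 MB ≈ 3.5 GB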

Related

Application master is killed by yarn while running spark job in cluster mode randomly

The error log is as follows:
20/05/10 18:40:47 ERROR yarn.Client: Application diagnostics message: Application application_1588683044535_1067 failed 2 times due to AM Container for appattempt_1588683044535_1067_000002 exited with exitCode: -104
Failing this attempt.Diagnostics: [2020-05-10 18:40:47.661]Container [pid=209264,containerID=container_e142_1588683044535_1067_02_000001] is running 3313664B beyond the 'PHYSICAL' memory limit. Current usage: 1.5 GB of 1.5 GB physical memory used; 3.6 GB of 3.1 GB virtual memory used. Killing container.
Dump of the process-tree for container_e142_1588683044535_1067_02_000001 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 209264 209262 209264 209264 (bash) 0 0 22626304 372 /bin/bash -c LD_LIBRARY_PATH="/cdhparcels/CDH-6.1.1-1.cdh6.1.1.p0.875250/lib/hadoop/../../../CDH-6.1.1-1.cdh6.1.1.p0.875250/lib/hadoop/lib/native:" /usr/java/jdk1.8.0_181-cloudera/bin/java -server -Xmx1024m -Djava.io.tmpdir=/hdfs4/yarn/nm/usercache/aiuat/appcache/application_1588683044535_1067/container_e142_1588683044535_1067_02_000001/tmp -Dspark.yarn.app.container.log.dir=/hdfs16/yarn/container-logs/application_1588683044535_1067/container_e142_1588683044535_1067_02_000001 org.apache.spark.deploy.yarn.ApplicationMaster --class 'com.airtel.spark.execution.driver.SparkDriver' --jar hdfs:///user/aiuat/lib/platform/di-platform-main-1.0.jar --arg 'hdfs://nameservice1/user/aiuat/conf/FMS/irrule/irsparkbatchjobconf.json,hdfs://nameservice1/user/aiuat/conf/FMS/irrule/irruleexecution.json' --properties-file /hdfs4/yarn/nm/usercache/aiuat/appcache/application_1588683044535_1067/container_e142_1588683044535_1067_02_000001/__spark_conf__/__spark_conf__.properties 1> /hdfs16/yarn/container-logs/application_1588683044535_1067/container_e142_1588683044535_1067_02_000001/stdout 2> /hdfs16/yarn/container-logs/application_1588683044535_1067/container_e142_1588683044535_1067_02_000001/stderr
|- 209280 209264 209264 209264 (java) 34135 2437 3845763072 393653 /usr/java/jdk1.8.0_181-cloudera/bin/java -server -Xmx1024m -Djava.io.tmpdir=/hdfs4/yarn/nm/usercache/aiuat/appcache/application_1588683044535_1067/container_e142_1588683044535_1067_02_000001/tmp -Dspark.yarn.app.container.log.dir=/hdfs16/yarn/container-logs/application_1588683044535_1067/container_e142_1588683044535_1067_02_000001 org.apache.spark.deploy.yarn.ApplicationMaster --class com.airtel.spark.execution.driver.SparkDriver --jar hdfs:///user/aiuat/lib/platform/di-platform-main-1.0.jar --arg hdfs://nameservice1/user/aiuat/conf/FMS/irrule/irsparkbatchjobconf.json,hdfs://nameservice1/user/aiuat/conf/FMS/irrule/irruleexecution.json --properties-file /hdfs4/yarn/nm/usercache/aiuat/appcache/application_1588683044535_1067/container_e142_1588683044535_1067_02_000001/__spark_conf__/__spark_conf__.properties
Some of the observations are:
The application master is getting killed. The memory error is in the container of the application master itself, not in the executor containers.
This job is scheduled via Oozie; some instances of the job succeed and some fail randomly without any pattern. The amount of input data is the same in every case.
I have tried most of the solutions suggested on the internet.
mapreduce.map.memory.mb and mapreduce.reduce.memory.mb are already set to 8 GB.
I have also tried increasing the driver memory, executor memory and the overhead memory of both to very high and very low values, and tweaking these configurations, but some instances still failed in every case.
yarn.nodemanager.vmem-pmem-ratio is set to 2.1, the vmem check is disabled and the pmem check is enabled. Unfortunately these configurations can't be changed, as it's a production cluster.
yarn.app.mapreduce.am.resource.mb is already set to 5 GB. yarn.scheduler.maximum-allocation-mb is set to 26 GB.
Some of my other confusions are:
Why is the memory available to the application master container only 1.5 GB, as shown in the logs, if yarn.app.mapreduce.am.resource.mb is set to 5 GB?
As this error comes in the container of the application master itself, and as per my understanding the application master and the Spark driver run in the same JVM, I am concluding that this error is because either the Spark driver memory or the application master memory is not sufficient. Does my conclusion seem correct?
I have fixed this error, so I thought I would answer it here.
In cluster mode, driver memory settings can't be applied at runtime after a SparkSession has already been created, because by then the application master has already been launched and the driver runs inside the YARN application master container. What I was trying to do was pass the driver memory via "spark.driver.memory" after creating a SparkSession. Spark doesn't raise any error in this case and even shows the driver memory as exactly what was provided via this conf in the Environment tab of the Spark UI, which makes identifying the issue even more difficult. The application master memory was taken as the default value of 1 GB instead of the memory I provided, and thus I was getting this error.
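A minimal sketch of where the setting has to go instead, i.e. on the spark-submit command line before the application master is launched (the class, jar and argument are copied from the log above; 4g is only an example value):
spark-submit --master yarn --deploy-mode cluster \
  --driver-memory 4g \
  --class com.airtel.spark.execution.driver.SparkDriver \
  hdfs:///user/aiuat/lib/platform/di-platform-main-1.0.jar \
  hdfs://nameservice1/user/aiuat/conf/FMS/irrule/irsparkbatchjobconf.json,hdfs://nameservice1/user/aiuat/conf/FMS/irrule/irruleexecution.json
Passing --conf spark.driver.memory=4g or putting spark.driver.memory in spark-defaults.conf works for the same reason: it is read before the application master container is requested.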

Application failed 2 times due to AM Container, exited with exitcode -104

I am running a Spark application with two input files and a jar file, which are picked up from an Amazon S3 bucket. I am creating a cluster using the AWS CLI with instance type m5.12xlarge, an instance count of 11, and the following Spark properties:
--deploy-mode cluster
--num-executors 10
--executor-cores 45
--executor-memory 155g
My Spark job ran for some time, then failed and restarted automatically; it ran again for some time and then showed these diagnostics (pulled from the logs):
diagnostics: Application application_1557259242251_0001 failed 2 times due to AM Container for appattempt_1557259242251_0001_000002 exited with exitCode: -104
Failing this attempt.Diagnostics: Container [pid=11779,containerID=container_1557259242251_0001_02_000001] is running beyond physical memory limits. Current usage: 1.4 GB of 1.4 GB physical memory used; 3.5 GB of 6.9 GB virtual memory used. Killing container.
Dump of the process-tree for container_1557259242251_0001_02_000001 :
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Exception in thread "main" org.apache.spark.SparkException: Application application_1557259242251_0001 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1165)
at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1520)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
19/05/07 20:03:35 INFO ShutdownHookManager: Shutdown hook called
19/05/07 20:03:35 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-3deea823-45e5-4a11-a5ff-833b01e6ae79
19/05/07 20:03:35 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-d6c3f8b2-34c6-422b-b946-ad03b1ee77d6
Command exiting with ret '1'
I am not able to figure out what the problem is.
I have tried changing the instance type and lowering the executor memory and executor cores, but the same problem keeps occurring.
Sometimes the same configuration settings terminate the cluster successfully and results are generated, but many times these errors occur.
Can someone please help?
If you are providing more than one Python file to the Spark job, package them into a zip archive and then submit it.
Step 1: How to make a zip file
zip abc.zip file1.py file2.py
Step 2: Execute job with a zip file
spark2-submit --master yarn --deploy-mode cluster --py-files /home/abc.zip /home/main_program_file.py

Spark heap size error even though RAM is 32 GB and JAVA_OPTIONS=-Xmx8g

I have 32 GB of physical memory and my input file is about 30 MB. I try to submit my Spark job in yarn client mode using the command below:
spark-submit --master yarn --packages com.databricks:spark-xml_2.10:0.4.1 --driver-memory 8g ericsson_xml_parsing_version_6_stage1.py
and my executor space is 8g, but I get the error below. I read about the --driver-java-options command line option, but I don't know how to set the Java heap space using it.
Can anyone please help me configure the Java heap memory?
java.lang.OutOfMemoryError: Java heap space
Did you try to configure the executor memory as well, like this: --executor-memory 8g?
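A sketch of the full submit command with both heaps raised (this just combines the question's own command with the suggestion above; the 8g values are examples, not tuned numbers):
spark-submit --master yarn \
  --packages com.databricks:spark-xml_2.10:0.4.1 \
  --driver-memory 8g \
  --executor-memory 8g \
  ericsson_xml_parsing_version_6_stage1.py
Note that in yarn client mode the driver heap is sized with --driver-memory (or spark.driver.memory); Spark does not accept -Xmx passed through --driver-java-options.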

Spark Executors off-heap memory usage keeps increasing

The off-heap memory usage of the 3 Spark executor processes keeps increasing constantly until the boundaries of the physical RAM are hit. This happened two weeks ago, at which point the system came to a grinding halt because it was unable to spawn new processes. At such a moment restarting Spark is the obvious solution. In the collectd memory usage graph below we see two moments when we restarted Spark: last week when we upgraded Spark from 1.4.1 to 1.5.1, and two weeks ago when the physical memory was exhausted.
As can be seen below, the Spark executor process uses approx. 62GB of memory, while the heap size max is set to 20GB. This means the off-heap memory usage is approx. 42GB.
$ ps aux | grep 40724
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
apache-+ 40724 140 47.1 75678780 62181644 ? Sl Nov06 11782:27 /usr/lib/jvm/java-7-oracle/jre/bin/java -cp /opt/spark-1.5.1-bin-hadoop2.4/conf/:/opt/spark-1.5.1-bin-hadoop2.4/lib/spark-assembly-1.5.1-hadoop2.4.0.jar:/opt/spark-1.5.1-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.5.1-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar:/opt/spark-1.5.1-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar -Xms20480M -Xmx20480M -Dspark.driver.port=7201 -Dspark.blockManager.port=7206 -Dspark.executor.port=7202 -Dspark.broadcast.port=7204 -Dspark.fileserver.port=7203 -Dspark.replClassServer.port=7205 -XX:MaxPermSize=256m org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url akka.tcp://sparkDriver#xxx.xxx.xxx.xxx:7201/user/CoarseGrainedScheduler --executor-id 2 --hostname xxx.xxx.xxx.xxx --cores 10 --app-id app-20151106125547-0000 --worker-url akka.tcp://sparkWorker#xxx.xxx.xxx.xxx:7200/user/Worker
$ sudo -u apache-spark jps
40724 CoarseGrainedExecutorBackend
40517 Worker
30664 Jps
$ sudo -u apache-spark jstat -gc 40724
S0C S1C S0U S1U EC EU OC OU PC PU YGC YGCT FGC FGCT GCT
158720.0 157184.0 110339.8 0.0 6674944.0 1708036.1 13981184.0 2733206.2 59904.0 59551.9 41944 1737.864 39 13.464 1751.328
$ sudo -u apache-spark jps -v
40724 CoarseGrainedExecutorBackend -Xms20480M -Xmx20480M -Dspark.driver.port=7201 -Dspark.blockManager.port=7206 -Dspark.executor.port=7202 -Dspark.broadcast.port=7204 -Dspark.fileserver.port=7203 -Dspark.replClassServer.port=7205 -XX:MaxPermSize=256m
40517 Worker -Xms2048m -Xmx2048m -XX:MaxPermSize=256m
10693 Jps -Dapplication.home=/usr/lib/jvm/java-7-oracle -Xms8m
Some info:
We use Spark Streaming lib.
Our code is written in Java.
We run Oracle Java v1.7.0_76
Data is read from Kafka (Kafka runs on different boxes).
Data is written to Cassandra (Cassandra runs on different boxes).
1 Spark master and 3 Spark executors/workers, running on 4 separate boxes.
We recently upgraded Spark from 1.4.1 to 1.5.1, and the memory usage pattern is identical on both versions.
What can be the cause of this ever-increasing off-heap memory use?

Cannot submit Spark app to cluster, stuck on "UNDEFINED"

I use this command to submit a Spark application to the YARN cluster:
export YARN_CONF_DIR=conf
bin/spark-submit --class "Mining"
--master yarn-cluster
--executor-memory 512m ./target/scala-2.10/mining-assembly-0.1.jar
In the Web UI, it is stuck on UNDEFINED.
In the console, it is stuck at:
14/11/12 16:37:55 INFO yarn.Client: Application report from ASM:
application identifier: application_1415704754709_0017
appId: 17
clientToAMToken: null
appDiagnostics:
appMasterHost: example.com
appQueue: default
appMasterRpcPort: 0
appStartTime: 1415784586000
yarnAppState: RUNNING
distributedFinalState: UNDEFINED
appTrackingUrl: http://example.com:8088/proxy/application_1415704754709_0017/
appUser: rain
Update:
Diving into the logs for the container in the Web UI (http://example.com:8042/node/containerlogs/container_1415704754709_0017_01_000001/rain/stderr/?start=0), I found this:
14/11/12 02:11:47 WARN YarnClusterScheduler: Initial job has not accepted
any resources; check your cluster UI to ensure that workers are registered
and have sufficient memory
14/11/12 02:11:47 DEBUG Client: IPC Client (1211012646) connection to
spark.mvs.vn/192.168.64.142:8030 from rain sending #24418
14/11/12 02:11:47 DEBUG Client: IPC Client (1211012646) connection to
spark.mvs.vn/192.168.64.142:8030 from rain got value #24418
I found that this problem has a solution here: http://hortonworks.com/hadoop-tutorial/using-apache-spark-hdp/
The Hadoop cluster must have sufficient memory for the request.
For example, submitting the following job with 1GB memory allocated for
executor and Spark driver fails with the above error in the HDP 2.1 Sandbox.
Reduce the memory asked for the executor and the Spark driver to 512m and
re-start the cluster.
I'm trying this solution and hopefully it will work.
Solutions
Finally I found that it was caused by a memory problem.
It worked when I changed yarn.nodemanager.resource.memory-mb to 3072 (its value was 2048) in the Web UI and restarted the cluster.
I'm very happy to see this.
With 3 GB in the YARN NodeManager, my submit command is:
bin/spark-submit
--class "Mining"
--master yarn-cluster
--executor-memory 512m
--driver-memory 512m
--num-executors 2
--executor-cores 1
./target/scala-2.10/mining-assembly-0.1.jar
