spark tasks fail with error, showing exit status: -100 - apache-spark

The spark job running in yarn mode, shows few tasks failed with following reason:
ExecutorLostFailure (executor 36 exited caused by one of the running tasks) Reason: Container marked as failed: container_xxxxxxxxxx_yyyy_01_000054 on host: ip-xxx-yy-zzz-zz. Exit status: -100. Diagnostics: Container released on a *lost* node
Any idea why is this happening?

There are two main reasons.
It is may because of your memoryOverhead needed by the yarn container is not enough, and the solution is to Increase the spark.executor.memoryOverhead
Possibly, it is because the slave node disk lack space to write tmp data required by spark. check your yarn usercache dir (for EMR, it locates on /mnt/yarn/usercache/),
or type df -h to check your disk remaining space.

Container killed by the framework, either due to being released by the application or being 'lost' due to node failures etc. have a special exit code of -100.
The node failure could be because of not having enough disc space or executor memory.

I understand your cluster is not on AWS but as AWS manager the MR cluster they have released an FAQ
For Glue job: https://aws.amazon.com/premiumsupport/knowledge-center/container-released-lost-node-100-glue/
For EMR: https://aws.amazon.com/premiumsupport/knowledge-center/emr-exit-status-100-lost-node/

Related

Application master is killed by yarn while running spark job in cluster mode randomly

The error log is as follows :
20/05/10 18:40:47 ERROR yarn.Client: Application diagnostics message: Application application_1588683044535_1067 failed 2 times due to AM Container for appattempt_1588683044535_1067_000002 exited with exitCode: -104
Failing this attempt.Diagnostics: [2020-05-10 18:40:47.661]Container [pid=209264,containerID=container_e142_1588683044535_1067_02_000001] is running 3313664B beyond the 'PHYSICAL' memory limit. Current usage: 1.5 GB of 1.5 GB physical memory used; 3.6 GB of 3.1 GB virtual memory used. Killing container.
Dump of the process-tree for container_e142_1588683044535_1067_02_000001 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 209264 209262 209264 209264 (bash) 0 0 22626304 372 /bin/bash -c LD_LIBRARY_PATH="/cdhparcels/CDH-6.1.1-1.cdh6.1.1.p0.875250/lib/hadoop/../../../CDH-6.1.1-1.cdh6.1.1.p0.875250/lib/hadoop/lib/native:" /usr/java/jdk1.8.0_181-cloudera/bin/java -server -Xmx1024m -Djava.io.tmpdir=/hdfs4/yarn/nm/usercache/aiuat/appcache/application_1588683044535_1067/container_e142_1588683044535_1067_02_000001/tmp -Dspark.yarn.app.container.log.dir=/hdfs16/yarn/container-logs/application_1588683044535_1067/container_e142_1588683044535_1067_02_000001 org.apache.spark.deploy.yarn.ApplicationMaster --class 'com.airtel.spark.execution.driver.SparkDriver' --jar hdfs:///user/aiuat/lib/platform/di-platform-main-1.0.jar --arg 'hdfs://nameservice1/user/aiuat/conf/FMS/irrule/irsparkbatchjobconf.json,hdfs://nameservice1/user/aiuat/conf/FMS/irrule/irruleexecution.json' --properties-file /hdfs4/yarn/nm/usercache/aiuat/appcache/application_1588683044535_1067/container_e142_1588683044535_1067_02_000001/__spark_conf__/__spark_conf__.properties 1> /hdfs16/yarn/container-logs/application_1588683044535_1067/container_e142_1588683044535_1067_02_000001/stdout 2> /hdfs16/yarn/container-logs/application_1588683044535_1067/container_e142_1588683044535_1067_02_000001/stderr
|- 209280 209264 209264 209264 (java) 34135 2437 3845763072 393653 /usr/java/jdk1.8.0_181-cloudera/bin/java -server -Xmx1024m -Djava.io.tmpdir=/hdfs4/yarn/nm/usercache/aiuat/appcache/application_1588683044535_1067/container_e142_1588683044535_1067_02_000001/tmp -Dspark.yarn.app.container.log.dir=/hdfs16/yarn/container-logs/application_1588683044535_1067/container_e142_1588683044535_1067_02_000001 org.apache.spark.deploy.yarn.ApplicationMaster --class com.airtel.spark.execution.driver.SparkDriver --jar hdfs:///user/aiuat/lib/platform/di-platform-main-1.0.jar --arg hdfs://nameservice1/user/aiuat/conf/FMS/irrule/irsparkbatchjobconf.json,hdfs://nameservice1/user/aiuat/conf/FMS/irrule/irruleexecution.json --properties-file /hdfs4/yarn/nm/usercache/aiuat/appcache/application_1588683044535_1067/container_e142_1588683044535_1067_02_000001/__spark_conf__/__spark_conf__.properties
Some of the observations are :
Application master is getting killed. The memory error is in container of application master itself, not of executer containers.
This job is scheduled via oozie and some instances of job get succeeded and some fails randomly without any pattern. The amount of input data is same in every case.
I have tried the most of solutions suggested on internet.
yarn.mapareduce.map.mb and yarn.mapareduce.reduce.mb is set to 8gb already.
I have also tried increasing driver memory , executer memory , overhead memory of both to very high value, low value, tweaking with these configurations but some instances still failed in every case.
yarn.nodemanager.vmem-pmem-ratio is set to 2.1 vnem check is disable and pnem check is enabled. Unfortunately these configurations can't be changed as it's a production cluster.
yarn.app.mapreduce.am.resource.mb is set to 5GB already. yarn.scheduler.maximum-allocation-mb is set to 26GB
Some of my other confusions are :
Why is memory available to Application master container only 1.5GB as shown in logs if yarn.app.mapreduce.am.resource.mb is set to 5GB ?
As this error comes in the container of application master itself and as per my understanding , application master and spark driver runs in the same jvm. I am concluding that that this error is because of either spark driver memory or application master memory not being sufficient. Does my conclusion seem correct ?
I have fixed this error. So, I thought I will answer this here.
In case of cluster mode, driver memory configurations can't be given on runtime after a sparksession is already created as application master was already launched and driver runs inside yarn application master container. What I was trying to do is to pass driver memory conf via "spark.driver.memory" after creating a sparksession. Spark doesn't give any error for this case and even shows the driver memory as exactly what was provided via this conf in the environment tab on spark ui page, which makes identifying the issue even more difficult. Application master memory was taken as default value 1GB instead of the memory I provided and thus, I was getting this error.

Application failed 2 times due to AM Container, exited with exitcode -104

I am running a Spark application with two input files and a jar file which is taken up from Amazon S3 bucket. I am creating a cluster using AWS CLI with instance type as m5.12xlarge and instance-count as 11 and spark properties as:
--deploy-mode cluster
--num-executors 10
--executor-cores 45
--executor-memory 155g
My spark job was running for some time and then it failed and restarted automatically and it ran again for some time and then it showed this diagnostics (pulled from the logs)
diagnostics: Application application_1557259242251_0001 failed 2 times due to AM Container for appattempt_1557259242251_0001_000002 exited with exitCode: -104
Failing this attempt.Diagnostics: Container [pid=11779,containerID=container_1557259242251_0001_02_000001] is running beyond physical memory limits. Current usage: 1.4 GB of 1.4 GB physical memory used; 3.5 GB of 6.9 GB virtual memory used. Killing container.
Dump of the process-tree for container_1557259242251_0001_02_000001 :
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Exception in thread "main" org.apache.spark.SparkException: Application application_1557259242251_0001 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1165)
at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1520)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
19/05/07 20:03:35 INFO ShutdownHookManager: Shutdown hook called
19/05/07 20:03:35 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-3deea823-45e5-4a11-a5ff-833b01e6ae79
19/05/07 20:03:35 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-d6c3f8b2-34c6-422b-b946-ad03b1ee77d6
Command exiting with ret '1'
I am not able to figure out what is the problem?
I have tried change the instance type or lowering the executor memory and executor-cores but still the same problem keep on occuring.
Sometimes the same configuration settings terminates the cluster successfully and results are generated but many time these error are generated.
Can someone please help?
If you are providing more than 1 input file to the spark job. Make a jar and then execute it.
Step 1: How to make a zip file
zip abc.zip file1.py file2.py
Step 2: Execute job with a zip file
spark2-submit --master yarn --deploy-mode cluster --py-files /home/abc.zip /home/main_program_file.py

spark-submit runs infinitely - displayed errors: Removal of executor n requested - Asked to remove non-existent executor n+1

I have deployed a spark standalone cluster with one driver and 2 executors, each one running on a separate machine.
Whenever I submit a job to the master using spark-submit --master spark://driver_ip:7077 example/src/main/python/pi.py, It runs infinitely and displays these errors:
BlockManagerMaster:54 - Removal of executor 50 requested
CoarseGrainedSchedulerBackend$DriverEndpoint:54 - Asked to remove non-existent executor 50
BlockManagerMasterEndpoint:54 - Trying to remove executor 50 from BlockManagerMaster.
StandaloneAppClient$ClientEndpoint:54 - Executor updated: app-20181129123913-0003/52 is now RUNNING
StandaloneAppClient$ClientEndpoint:54 - Executor updated: app-20181129123913-0003/51 is now EXITED (Command exited with code 1)
StandaloneSchedulerBackend:54 - Executor app-20181129123913-0003/51 removed: Command exited with code 1
StandaloneAppClient$ClientEndpoint:54 - Executor added: app-20181129123913-0003/53 on worker-20181129120029-10.0.1.101-36599 (10.0.1.101:36599) with 1 core(s)
Each time the number in Removal of executor increments and the program doesn't end. It looks like the executors are constantly refusing the jobs.
Could anyone help me figure out the issue.
Note that I can see that the Spark executors are registered with the Driver In the Spark Manager's web UI.

Running python spark on EMR

We're having a hard time running a python spark job on EMR.
aws emr add-steps --cluster-id j-XXXXXXXX --steps \
Type=CUSTOM_JAR,Name="Spark Program",\
Jar="command-runner.jar",ActionOnFailure=CONTINUE,\
Args=["spark-submit",--deploy-mode,cluster,--master,yarn,s3://XXXXXXX/pi.py,2]
We're running the same pyspark compute pi script as the AWS page suggests
This script runs, but it runs forever calculating pi. On local machine it takes seconds to finish. We've tried client mode as well. On client mode it makes us transfer the files locally.
16/09/20 15:20:32 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1474384831795
final status: UNDEFINED
tracking URL: http://XXXXXXX.ec2.internal:20888/proxy/application_1474381572045_0002/
user: hadoop
16/09/20 15:20:33 INFO Client: Application report for application_1474381572045_0002 (state: ACCEPTED)
Repeats this last command over and over...
Does anyone know how to run the example python spark pi script on EMR without it running forever?
When you see the job in ACCEPTED state forever, it means that it is not actually running but rather is waiting for YARN to have enough resources to run the application. Usually this is because you already have some other YARN application running and taking up resources. The easiest way to find out if this is the case is to look at the YARN ResourceManager on port 8088 of the master node. You can also run the command "yarn application -list" if you have ssh'ed to the master node.

Spark streaming job fails after getting stopped by Driver

I have a spark streaming job which reads in data from Kafka and does some operations on it. I am running the job over a yarn cluster, Spark 1.4.1, which has two nodes with 16 GB RAM each and 16 cores each.
I have these conf passed to the spark-submit job :
--master yarn-cluster --num-executors 3 --driver-memory 4g --executor-memory 2g --executor-cores 3
The job returns this error and finishes after running for a short while :
INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 11,
(reason: Max number of executor failures reached)
.....
ERROR scheduler.ReceiverTracker: Deregistered receiver for stream 0:
Stopped by driver
Updated :
These logs were found too :
INFO yarn.YarnAllocator: Received 3 containers from YARN, launching executors on 3 of them.....
INFO yarn.ApplicationMaster$AMEndpoint: Driver terminated or disconnected! Shutting down.
....
INFO yarn.YarnAllocator: Received 2 containers from YARN, launching executors on 2 of them.
INFO yarn.ExecutorRunnable: Starting Executor Container.....
INFO yarn.ApplicationMaster$AMEndpoint: Driver terminated or disconnected! Shutting down...
INFO yarn.YarnAllocator: Completed container container_e10_1453801197604_0104_01_000006 (state: COMPLETE, exit status: 1)
INFO yarn.YarnAllocator: Container marked as failed: container_e10_1453801197604_0104_01_000006. Exit status: 1. Diagnostics: Exception from container-launch.
Container id: container_e10_1453801197604_0104_01_000006
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:576)
at org.apache.hadoop.util.Shell.run(Shell.java:487)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:753)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 1
What might be the reasons for this? Appreciate some help.
Thanks
can you please show your scala/java code that is reading from kafka? I suspect you probably not creating your SparkConf correctly.
Try something like
SparkConf sparkConf = new SparkConf().setAppName("ApplicationName");
also try running application in yarn-client mode and share the output.
I got the same issue. and I have found 1 solution to fix the issue by removing sparkContext.stop() at the end of main function, leave the stop action for GC.
Spark team has resolved the issue in Spark core, however, the fix has just been master branch so far. We need to wait until the fix has been updated into the new release.
https://issues.apache.org/jira/browse/SPARK-12009

Resources