Application master is randomly killed by YARN while running a Spark job in cluster mode - apache-spark

The error log is as follows:
20/05/10 18:40:47 ERROR yarn.Client: Application diagnostics message: Application application_1588683044535_1067 failed 2 times due to AM Container for appattempt_1588683044535_1067_000002 exited with exitCode: -104
Failing this attempt.Diagnostics: [2020-05-10 18:40:47.661]Container [pid=209264,containerID=container_e142_1588683044535_1067_02_000001] is running 3313664B beyond the 'PHYSICAL' memory limit. Current usage: 1.5 GB of 1.5 GB physical memory used; 3.6 GB of 3.1 GB virtual memory used. Killing container.
Dump of the process-tree for container_e142_1588683044535_1067_02_000001 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 209264 209262 209264 209264 (bash) 0 0 22626304 372 /bin/bash -c LD_LIBRARY_PATH="/cdhparcels/CDH-6.1.1-1.cdh6.1.1.p0.875250/lib/hadoop/../../../CDH-6.1.1-1.cdh6.1.1.p0.875250/lib/hadoop/lib/native:" /usr/java/jdk1.8.0_181-cloudera/bin/java -server -Xmx1024m -Djava.io.tmpdir=/hdfs4/yarn/nm/usercache/aiuat/appcache/application_1588683044535_1067/container_e142_1588683044535_1067_02_000001/tmp -Dspark.yarn.app.container.log.dir=/hdfs16/yarn/container-logs/application_1588683044535_1067/container_e142_1588683044535_1067_02_000001 org.apache.spark.deploy.yarn.ApplicationMaster --class 'com.airtel.spark.execution.driver.SparkDriver' --jar hdfs:///user/aiuat/lib/platform/di-platform-main-1.0.jar --arg 'hdfs://nameservice1/user/aiuat/conf/FMS/irrule/irsparkbatchjobconf.json,hdfs://nameservice1/user/aiuat/conf/FMS/irrule/irruleexecution.json' --properties-file /hdfs4/yarn/nm/usercache/aiuat/appcache/application_1588683044535_1067/container_e142_1588683044535_1067_02_000001/__spark_conf__/__spark_conf__.properties 1> /hdfs16/yarn/container-logs/application_1588683044535_1067/container_e142_1588683044535_1067_02_000001/stdout 2> /hdfs16/yarn/container-logs/application_1588683044535_1067/container_e142_1588683044535_1067_02_000001/stderr
|- 209280 209264 209264 209264 (java) 34135 2437 3845763072 393653 /usr/java/jdk1.8.0_181-cloudera/bin/java -server -Xmx1024m -Djava.io.tmpdir=/hdfs4/yarn/nm/usercache/aiuat/appcache/application_1588683044535_1067/container_e142_1588683044535_1067_02_000001/tmp -Dspark.yarn.app.container.log.dir=/hdfs16/yarn/container-logs/application_1588683044535_1067/container_e142_1588683044535_1067_02_000001 org.apache.spark.deploy.yarn.ApplicationMaster --class com.airtel.spark.execution.driver.SparkDriver --jar hdfs:///user/aiuat/lib/platform/di-platform-main-1.0.jar --arg hdfs://nameservice1/user/aiuat/conf/FMS/irrule/irsparkbatchjobconf.json,hdfs://nameservice1/user/aiuat/conf/FMS/irrule/irruleexecution.json --properties-file /hdfs4/yarn/nm/usercache/aiuat/appcache/application_1588683044535_1067/container_e142_1588683044535_1067_02_000001/__spark_conf__/__spark_conf__.properties
Some of the observations are:
The application master is getting killed. The memory error occurs in the application master's own container, not in the executor containers.
This job is scheduled via Oozie, and some instances succeed while others fail randomly, without any pattern. The amount of input data is the same in every case.
I have tried most of the solutions suggested on the internet:
mapreduce.map.memory.mb and mapreduce.reduce.memory.mb are already set to 8 GB.
I have also tried increasing the driver memory, executor memory, and the memory overhead of both to very high values, then to low values, and tweaking these configurations in between, but some instances still failed in every case.
yarn.nodemanager.vmem-pmem-ratio is set to 2.1, the vmem check is disabled, and the pmem check is enabled. Unfortunately these configurations can't be changed, as it's a production cluster.
yarn.app.mapreduce.am.resource.mb is already set to 5 GB, and yarn.scheduler.maximum-allocation-mb is set to 26 GB.
Some of my other confusions are:
Why is the memory available to the application master container only 1.5 GB, as shown in the logs, if yarn.app.mapreduce.am.resource.mb is set to 5 GB?
Since this error occurs in the application master's own container, and since, as per my understanding, the application master and the Spark driver run in the same JVM, I conclude that this error is caused by either the Spark driver memory or the application master memory not being sufficient. Does my conclusion seem correct?

I have fixed this error, so I thought I would answer it here.
In cluster mode, driver memory settings can't be supplied at runtime after a SparkSession has already been created, because the application master has already been launched by then and the driver runs inside the YARN application master container. What I was trying to do was pass the driver memory via "spark.driver.memory" after creating a SparkSession. Spark doesn't raise any error in this case and even shows the driver memory as exactly the value provided via this conf in the Environment tab of the Spark UI, which makes the issue even harder to identify. The application master memory was taken as the default value of 1 GB instead of the memory I provided, and thus I was getting this error.
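For illustration, here is a minimal PySpark sketch of that pitfall (the app name and the 4g value are made up; only the timing of the setting matters):

from pyspark.sql import SparkSession

# Too late in cluster mode: the YARN application master (and with it the
# driver JVM) is already running by the time this line executes, so the
# value only shows up in the Environment tab of the Spark UI and has no
# effect on the size of the AM container.
spark = (SparkSession.builder
         .appName("driver-memory-pitfall")      # hypothetical app name
         .config("spark.driver.memory", "4g")   # does not resize the already-running driver
         .getOrCreate())

# Instead, set the driver memory before the application starts, e.g.
#   spark-submit --master yarn --deploy-mode cluster --driver-memory 4g ...
# or spark.driver.memory in spark-defaults.conf / the Oozie Spark action
# configuration. Otherwise the AM container is sized from the 1 GB default
# plus memory overhead, which is consistent with the 1.5 GB container in the
# log above.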

Related

Application failed 2 times due to AM Container, exited with exitcode -104

I am running a Spark application with two input files and a jar file, which are taken from an Amazon S3 bucket. I am creating a cluster using the AWS CLI with instance type m5.12xlarge, an instance count of 11, and the following Spark properties:
--deploy-mode cluster
--num-executors 10
--executor-cores 45
--executor-memory 155g
My Spark job ran for some time, then it failed and restarted automatically; it ran again for some time and then showed these diagnostics (pulled from the logs):
diagnostics: Application application_1557259242251_0001 failed 2 times due to AM Container for appattempt_1557259242251_0001_000002 exited with exitCode: -104
Failing this attempt.Diagnostics: Container [pid=11779,containerID=container_1557259242251_0001_02_000001] is running beyond physical memory limits. Current usage: 1.4 GB of 1.4 GB physical memory used; 3.5 GB of 6.9 GB virtual memory used. Killing container.
Dump of the process-tree for container_1557259242251_0001_02_000001 :
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Exception in thread "main" org.apache.spark.SparkException: Application application_1557259242251_0001 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1165)
at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1520)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
19/05/07 20:03:35 INFO ShutdownHookManager: Shutdown hook called
19/05/07 20:03:35 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-3deea823-45e5-4a11-a5ff-833b01e6ae79
19/05/07 20:03:35 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-d6c3f8b2-34c6-422b-b946-ad03b1ee77d6
Command exiting with ret '1'
I am not able to figure out what the problem is.
I have tried changing the instance type and lowering the executor memory and executor cores, but the same problem keeps occurring.
Sometimes the same configuration settings let the cluster terminate successfully and results are generated, but many times these errors occur.
Can someone please help?
If you are providing more than one input file to the Spark job, make a zip and then execute it:
Step 1: How to make a zip file
zip abc.zip file1.py file2.py
Step 2: Execute job with a zip file
spark2-submit --master yarn --deploy-mode cluster --py-files /home/abc.zip /home/main_program_file.py

When does a Spark on YARN application exit with exitCode: -104?

My Spark application reads 3 files of 7 MB, 40 MB, and 100 MB, performs many transformations, and stores the output in multiple directories.
Spark version: 1.5 (CDH)
MASTER_URL=yarn-cluster
NUM_EXECUTORS=15
EXECUTOR_MEMORY=4G
EXECUTOR_CORES=6
DRIVER_MEMORY=3G
My Spark job was running for some time, then it threw the error message below and restarted from the beginning:
18/03/27 18:59:44 INFO avro.AvroRelation: using snappy for Avro output
18/03/27 18:59:47 ERROR yarn.ApplicationMaster: RECEIVED SIGNAL 15: SIGTERM
18/03/27 18:59:47 INFO CuratorFrameworkSingleton: Closing ZooKeeper client.
Once it restarted again it ran for sometime and failed with this error
Application application_1521733534016_7233 failed 2 times due to AM Container for appattempt_1521733534016_7233_000002 exited with exitCode: -104
For more detailed output, check application tracking page:http://entline.com:8088/proxy/application_1521733534016_7233/Then, click on links to logs of each attempt.
Diagnostics: Container [pid=52716,containerID=container_e98_1521733534016_7233_02_000001] is running beyond physical memory limits. Current usage: 3.5 GB of 3.5 GB physical memory used; 4.3 GB of 7.3 GB virtual memory used. Killing container.
Dump of the process-tree for container_e98_1521733534016_7233_02_000001 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 52720 52716 52716 52716 (java) 89736 8182 4495249408 923677 /usr/java/jdk1.7.0_67-cloudera/bin/java -server -Xmx3072m -Djava.io.tmpdir=/apps/hadoop/data04/yarn/nm/usercache/bdbuild/appcache/application_1521733534016_7233/container_e98_1521733534016_7233_02_000001/tmp -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1521733534016_7233/container_e98_1521733534016_7233_02_000001 -XX:MaxPermSize=256m org.apache.spark.deploy.yarn.ApplicationMaster --class com.sky.ids.dovetail.asrun.etl.DovetailAsRunETLMain --jar file:/apps/projects/dovetail_asrun_etl/jars/EntLine-1.0-SNAPSHOT-jar-with-dependencies.jar --arg --app.conf.path --arg application.conf --arg --run_type --arg AUTO --arg --bus_date --arg 2018-03-27 --arg --code_base_id --arg EntLine-1.0-SNAPSHOT --executor-memory 4096m --executor-cores 6 --properties-file /apps/hadoop/data04/yarn/nm/usercache/bdbuild/appcache/application_1521733534016_7233/container_e98_1521733534016_7233_02_000001/__spark_conf__/__spark_conf__.properties
|- 52716 52714 52716 52716 (bash) 2 0 108998656 389 /bin/bash -c LD_LIBRARY_PATH=/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/hadoop/../../../CDH-5.5.1-1.cdh5.5.1.p0.11/lib/hadoop/lib/native: /usr/java/jdk1.7.0_67-cloudera/bin/java -server -Xmx3072m -Djava.io.tmpdir=/apps/hadoop/data04/yarn/nm/usercache/bdbuild/appcache/application_1521733534016_7233/container_e98_1521733534016_7233_02_000001/tmp -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1521733534016_7233/container_e98_1521733534016_7233_02_000001 -XX:MaxPermSize=256m org.apache.spark.deploy.yarn.ApplicationMaster --class 'com.sky.ids.dovetail.asrun.etl.DovetailAsRunETLMain' --jar file:/apps/projects/dovetail_asrun_etl/jars/EntLine-1.0-SNAPSHOT-jar-with-dependencies.jar --arg '--app.conf.path' --arg 'application.conf' --arg '--run_type' --arg 'AUTO' --arg '--bus_date' --arg '2018-03-27' --arg '--code_base_id' --arg 'EntLine-1.0-SNAPSHOT' --executor-memory 4096m --executor-cores 6 --properties-file /apps/hadoop/data04/yarn/nm/usercache/bdbuild/appcache/application_1521733534016_7233/container_e98_1521733534016_7233_02_000001/__spark_conf__/__spark_conf__.properties 1> /var/log/hadoop-yarn/container/application_1521733534016_7233/container_e98_1521733534016_7233_02_000001/stdout 2> /var/log/hadoop-yarn/container/application_1521733534016_7233/container_e98_1521733534016_7233_02_000001/stderr
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Failing this attempt. Failing the application.
As per my CDH configuration:
Container Memory [Amount of physical memory, in MiB, that can be allocated for containers]
yarn.nodemanager.resource.memory-mb = 50655 MiB
Please see the containers running in my driver node
Why are there so many containers running on one node?
I know that container_e98_1521733534016_7880_02_000001 is for my Spark application. I don't know about the other containers. Any idea on that?
Also, I see that the physical memory for container_e98_1521733534016_7880_02_000001 is 3584 MiB, which is close to 3.5 GB.
What does this error mean? When does it usually occur?
What is 3.5 GB of 3.5 GB physical memory? Is it driver memory?
Could someone help me with this issue?
container_e98_1521733534016_7233_02_000001 is the first container started, and given MASTER_URL=yarn-cluster it hosts not only the ApplicationMaster but also the driver of the Spark application.
It appears that the memory setting for the driver, i.e. DRIVER_MEMORY=3G, is too low and you have to bump it up.
Spark on YARN runs two executors by default (see --num-executors), so you'll end up with 3 YARN containers: 000001 for the ApplicationMaster (perhaps with the driver) and 000002 and 000003 for the two executors.
What is 3.5 GB of 3.5 GB physical memory? Is it driver memory?
Since you use yarn-cluster, the driver, the ApplicationMaster, and container_e98_1521733534016_7233_02_000001 are all the same and live in the same JVM. That means the error is about how much memory you assigned to the driver.
My understanding is that you gave DRIVER_MEMORY=3G, which happened to be too little for your processing, and once YARN figured that out it killed the driver (and hence the entire Spark application, as it's not possible to have a Spark application up and running without the driver).
See the document Running Spark on YARN.
A small addition to what @Jacek already wrote to answer the question
"why do you get 3.5 GB instead of 3 GB?"
is that, apart from DRIVER_MEMORY=3G, you need to consider spark.driver.memoryOverhead, which defaults to MAX(DRIVER_MEMORY * 0.10, 384 MB) = 384 MB here; 3 GB + 384 MB ≈ 3.5 GB.
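To make the arithmetic concrete, here is a small Python sketch of how the AM/driver container ends up at 3584 MiB (the overhead formula is Spark's default; the 512 MB YARN allocation increment is an assumption about this particular cluster):

def am_container_mb(driver_memory_mb, yarn_increment_mb=512):
    # Default driver memory overhead: max(10% of driver memory, 384 MB)
    overhead_mb = max(int(driver_memory_mb * 0.10), 384)
    requested_mb = driver_memory_mb + overhead_mb
    # YARN rounds the request up to the next allocation increment
    increments = -(-requested_mb // yarn_increment_mb)
    return increments * yarn_increment_mb

print(am_container_mb(3072))  # 3072 + 384 = 3456, rounded up to 3584 MiB (~3.5 GB)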

SPARK_WORKER_INSTANCES setting not working in Spark Standalone Windows

I'm trying to set up a standalone Spark 2.0 server to process an analytics function in parallel. To do this I want to run 8 workers, with a single core per worker. However, the Spark Master/Worker UI doesn't seem to reflect my configuration.
I'm using:
Standalone Spark 2.0
8 Cores 24gig RAM
Windows Server 2008
pyspark
The spark-env.sh file is configured as follows:
SPARK_WORKER_INSTANCES = 8
SPARK_WORKER_CORES = 1
SPARK_WORKER_MEMORY = 2g
spark-defaults.conf is configured as follows:
spark.cores.max = 8
I start the master:
spark-class org.apache.spark.deploy.master.Master
I start the workers by running this command 8 times within a batch file:
spark-class org.apache.spark.deploy.worker.Worker spark://10.0.0.10:7077
The problem is that the UI shows up as follows:
As you can see, each worker has 8 cores instead of the 1 core I assigned it via the SPARK_WORKER_CORES setting. Also, the memory reflects the entire machine's memory, not the 2g assigned to each worker. How can I configure Spark to run with 1 core and 2g per worker in standalone mode?
I fixed this by adding the cores and memory arguments to the worker itself:
start spark-class org.apache.spark.deploy.worker.Worker --cores 1 --memory 2g spark://10.0.0.10:7077

Spark runs endlessly for Pi example

I just set up Spark and ran the command:
spark-shell --master yarn-client --driver-memory 512m --executor-memory 512m
However, it just keeps endlessly printing out messages like
16/04/25 17:34:46 INFO Client: Application report for application_1460481694166_0125 (state: ACCEPTED)
I read somewhere that I could try to kill the application, but I'm not sure what to kill.
When I try
yarn application -list
I see
Application-Id Application-Name Application-Type User Queue State Final-State Progress Tracking-URL
application_1460481694166_0118 org.apache.spark.examples.SparkPi SPARK root default ACCEPTED UNDEFINED 0% N/A
application_1460481694166_0124 Spark shell SPARK root default ACCEPTED UNDEFINED 0% N/A
application_1460481694166_0120 Spark shell ...
Zeppelin SPARK zeppelin default RUNNING UNDEFINED 10% http://10.0.2.15:4040
application_1460481694166_0117 org.apache.spark.examples.SparkPi SPARK root default ACCEPTED UNDEFINED 0% N/A
application_1460481694166_0123 Spark shell
...
I'm not sure why Zeppelin is showing up because I closed it in my web browser
What do I need to do now?
I'm guessing Zeppelin is still running even though you closed your browser. Closing the browser is not the same as stopping the hosting process; stopping the hosting process is done in the CLI tab that started it. As a last resort, you can yarn application -kill any of the running applications from any tab.
yarn application -kill application_1460481694166_0118
That will kill the (first) Spark application.

Running a simple Spark script on Mesos with Zookeeper

I want to run a simple Spark program, but I am blocked by some errors.
My environment is:
CentOS:6.6
Java: 1.7.0_51
Scala: 2.10.4
Spark: spark-1.4.0-bin-hadoop2.6
Mesos: 0.22.1
Everything is installed and the nodes are up. I now have one Mesos master and one Mesos slave node. My Spark properties are below:
spark.app.id 20150624-185838-2885789888-5050-1291-0005
spark.app.name Spark shell
spark.driver.host 192.168.1.172
spark.driver.memory 512m
spark.driver.port 46428
spark.executor.id driver
spark.executor.memory 512m
spark.executor.uri http://192.168.1.172:8080/spark-1.4.0-bin-hadoop2.6.tgz
spark.externalBlockStore.folderName spark-91aafe3b-01a8-4c86-ac3b-999e278807c5
spark.fileserver.uri http://192.168.1.172:51240
spark.jars
spark.master mesos://zk://192.168.1.172:2181/mesos
spark.mesos.coarse true
spark.repl.class.uri http://192.168.1.172:51600
spark.scheduler.mode FIFO
Now when I start Spark, it comes to the Scala prompt (scala>).
After that I get the following error: mesos task 1 is now TASK_FAILED, Blacklisting Mesos slave value due to too many failures; is Spark installed on it?
How do I resolve this?
With only 900MB and spark.driver.memory = 512m, you will be able to launch the scheduler/REPL, but you won't have enough memory for spark.executor.memory = 512m, so any tasks will fail. Either increasing your VM memory size or reducing the driver/executor memory requirements will help you get around these memory limits.
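As a quick sketch of the arithmetic behind that (using the 900 MB figure above and treating the driver and executor as competing for the same small machine, before any JVM or Mesos overhead, which only makes things worse):

total_mb = 900      # memory available on the VM
driver_mb = 512     # spark.driver.memory
executor_mb = 512   # spark.executor.memory

left_after_driver = total_mb - driver_mb    # 388 MB remaining once the driver is up
print(left_after_driver < executor_mb)      # True: a 512 MB executor cannot fit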
Could you check the Mesos slave logs / task information for more output on why the task failed? You could have a look at :5050.
Probably an unrelated question: do you actually have ZooKeeper:
spark.master mesos://zk://192.168.1.172:2181/mesos
running (as you mentioned you only have one master)?
