Spark on YARN ends with error exitCode=16: how to solve it?

I am using Apache Spark 2.0.0 and Apache Hadoop 2.6.0, and I am trying to run my Spark application on my Hadoop cluster.
I used the following command:
bin/spark-submit --class org.JavaWordCount \
--master yarn \
--deploy-mode cluster \
--driver-memory 512m \
--queue default \
/opt/JavaWordCount.jar \
10
However, YARN ends with an error, exitCode=16:
17/01/25 11:05:49 INFO impl.ContainerManagementProtocolProxy: yarn.client.max-cached-nodemanagers-proxies : 0
17/01/25 11:05:49 INFO impl.ContainerManagementProtocolProxy: Opening proxy : hmaster:59600
17/01/25 11:05:49 ERROR yarn.ApplicationMaster: RECEIVED SIGNAL TERM
17/01/25 11:05:49 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 16, (reason: Shutdown hook called before final status was reported.)
17/01/25 11:05:49 INFO storage.DiskBlockManager: Shutdown hook called
I tried to solve this issue with this topic, but it doesn't give a practical answer.
Does anyone know how to solve this issue?
Thanks in advance.

I just encountered this issue. The JVM is using excess memory. Try adding the following property
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
to the yarn-site.xml of all NodeManagers, then restart them. It worked for me.
Reference: https://issues.apache.org/jira/browse/YARN-4714
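When triaging failures like this one, it can help to pull the ApplicationMaster exit code out of the log and translate it. The sketch below does that for a `Final app status` line. The mapping is reproduced from memory of the constants in Spark 2.x `ApplicationMaster.scala`, and the helper name `parse_am_exit` is hypothetical, so check both against the Spark version you actually run.

```python
import re

# Exit codes as defined in Spark 2.x ApplicationMaster.scala (assumption:
# reproduced from memory; verify against the source for your Spark version).
AM_EXIT_CODES = {
    0: "EXIT_SUCCESS",
    10: "EXIT_UNCAUGHT_EXCEPTION",
    11: "EXIT_MAX_EXECUTOR_FAILURES",
    12: "EXIT_REPORTER_FAILURE",
    13: "EXIT_SC_NOT_INITED",
    14: "EXIT_SECURITY",
    15: "EXIT_EXCEPTION_USER_CLASS",
    16: "EXIT_EARLY",
}

_STATUS_RE = re.compile(r"Final app status: (\w+), exitCode: (\d+)")

def parse_am_exit(line):
    """Return (status, code, name) from an ApplicationMaster log line, or None."""
    m = _STATUS_RE.search(line)
    if m is None:
        return None
    code = int(m.group(2))
    return m.group(1), code, AM_EXIT_CODES.get(code, "UNKNOWN")

# Log line taken from the question above.
log_line = ("17/01/25 11:05:49 INFO yarn.ApplicationMaster: Final app status: "
            "FAILED, exitCode: 16, (reason: Shutdown hook called before final "
            "status was reported.)")
print(parse_am_exit(log_line))
```

An early-exit code only says the ApplicationMaster was stopped before it reported a status. Here it received SIGTERM, which is consistent with the NodeManager killing the container, but the NodeManager log is what would confirm the cause.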


Spark on Yarn Failed to send RPC and Slave lost

I want to deploy Spark 2.3.2 on YARN (Hadoop 2.7.3).
But when I run:
spark-shell
it always raises these errors:
ERROR TransportClient:233 - Failed to send RPC 4858956348523471318 to /10.20.42.194:54288: java.nio.channels.ClosedChannelException
...
ERROR YarnScheduler:70 - Lost executor 1 on dc002: Slave lost
Both dc002 and dc003 raise the Failed to send RPC and Slave lost errors.
I have one master node and two slave nodes. All of them run:
CentOS Linux release 7.5.1804 (Core) with 40 CPUs, 62.6 GB memory and 31.4 GB swap.
My HADOOP_CONF_DIR:
export HADOOP_CONF_DIR=/home/spark-test/hadoop-2.7.3/etc/hadoop
My /etc/hosts:
10.20.51.154 dc001
10.20.42.194 dc002
10.20.42.177 dc003
In the Hadoop and YARN web UIs I can see both the dc002 and dc003 nodes, and I can run a simple MapReduce task on YARN.
But when I run spark-shell or the SparkPi example program with
./spark-submit --deploy-mode client --class org.apache.spark.examples.SparkPi spark-2.3.2-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.3.2.jar 10
the errors are always raised.
I really want to know why those errors happen.
I fixed this problem by changing the yarn-site.xml conf file:
<property>
<name>yarn.nodemanager.pmem-check-enabled</name>
<value>false</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
Try setting this parameter in your code:
spark.conf.set("spark.dynamicAllocation.enabled", "false")
Secondly, when running spark-submit, define parameters such as --executor-memory and --num-executors.
Sample:
spark2-submit --executor-memory 20g --num-executors 15 --class com.executor mapping.jar
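The memory checks and the --executor-memory setting discussed above meet at the container size: YARN compares each container's physical and virtual memory use against limits derived from it. The sketch below estimates those limits for one executor. It assumes the Spark 2.3 default overhead of max(384 MB, 10% of executor memory), a `yarn.scheduler.minimum-allocation-mb` of 1024, and the default `yarn.nodemanager.vmem-pmem-ratio` of 2.1. The function name is hypothetical, and the rounding rule depends on the scheduler in use.

```python
import math

def yarn_container_limits(executor_memory_mb, min_alloc_mb=1024, vmem_pmem_ratio=2.1):
    """Estimate the (physical_mb, virtual_mb) limits YARN enforces on one executor."""
    # Assumed Spark 2.3 default: overhead = max(384 MB, 10% of executor memory).
    overhead_mb = max(384, executor_memory_mb // 10)
    requested_mb = executor_memory_mb + overhead_mb
    # Assumption: the scheduler rounds requests up to a multiple of the minimum allocation.
    physical_mb = math.ceil(requested_mb / min_alloc_mb) * min_alloc_mb
    # The vmem check allows vmem_pmem_ratio times the physical allocation.
    virtual_mb = physical_mb * vmem_pmem_ratio
    return physical_mb, virtual_mb

# Default 1 GB executor (as with a plain spark-shell) and the 20g sample above.
for heap_mb in (1024, 20 * 1024):
    physical, virtual = yarn_container_limits(heap_mb)
    print(heap_mb, physical, round(virtual, 1))
```

If a JVM legitimately needs more off-heap room, raising `spark.executor.memoryOverhead` (Spark 2.3+) or `yarn.nodemanager.vmem-pmem-ratio` is a narrower change than turning the checks off entirely.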

ERROR : User did not initialize spark context

Error log:
TestSuccessfull
2018-08-20 04:52:15 INFO ApplicationMaster:54 - Final app status: FAILED, exitCode: 13
2018-08-20 04:52:15 ERROR ApplicationMaster:91 - Uncaught exception:
java.lang.IllegalStateException: User did not initialize spark context!
at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:498)
at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:345)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply$mcV$sp(ApplicationMaster.scala:260)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply(ApplicationMaster.scala:260)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply(ApplicationMaster.scala:260)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$5.run(ApplicationMaster.scala:800)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:799)
at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:259)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:824)
at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
2018-08-20 04:52:15 INFO SparkContext:54 - Invoking stop() from shutdown hook
Error log on the console after the submit command:
2018-08-20 05:47:35 INFO Client:54 - Application report for application_1534690018301_0035 (state: ACCEPTED)
2018-08-20 05:47:36 INFO Client:54 - Application report for application_1534690018301_0035 (state: ACCEPTED)
2018-08-20 05:47:37 INFO Client:54 - Application report for application_1534690018301_0035 (state: FAILED)
2018-08-20 05:47:37 INFO Client:54 -
client token: N/A
diagnostics: Application application_1534690018301_0035 failed 2 times due to AM Container for appattempt_1534690018301_0035_000002 exited with exitCode: 13
Failing this attempt.Diagnostics: [2018-08-20 05:47:36.454]Exception from container-launch.
Container id: container_1534690018301_0035_02_000001
Exit code: 13
My code:
val sparkConf = new SparkConf().setAppName("Gathering Data")
val sc = new SparkContext(sparkConf)
Submit command:
spark-submit --class spark_basic.Test_Local --master yarn --deploy-mode cluster /home/IdeaProjects/target/Spark-1.0-SNAPSHOT.jar
Description:
I have installed Spark on Hadoop in pseudo-distributed mode.
spark-shell works fine; the problem only occurs when I use cluster mode.
My code also works fine. I am able to print the output, but at the end it gives this error.
I presume your code has a line that sets the master to local:
SparkConf.setMaster("local[*]")
If so, comment out that line and try again, since you are already setting the master to yarn in your command:
/usr/cdh/current/spark-client/bin/spark-submit --class com.test.sparkApp --master yarn --deploy-mode cluster --num-executors 40 --executor-cores 4 --driver-memory 17g --executor-memory 22g --files /usr/cdh/current/spark-client/conf/hive-site.xml /home/user/sparkApp.jar
Finally I got it working with this spark-submit command:
/home/mahendra/Marvaland/SparkEcho/spark-2.3.0-bin-hadoop2.7/bin/spark-submit --master yarn --class spark_basic.Test_Local /home/mahendra/IdeaProjects/SparkTraining/target/SparkTraining-1.0-SNAPSHOT.jar
and this Spark session:
val spark = SparkSession.builder()
.appName("DataETL")
.master("local[1]")
.enableHiveSupport()
.getOrCreate()
Thanks, @cricket_007.
This error may occur if you submit the Spark job like this:
spark-submit --class some.path.com.Main --master yarn --deploy-mode cluster some_spark.jar
(that is, passing master and deploy-mode as CLI arguments) while also calling new SparkContext in your code.
Either get the context with val sc = SparkContext.getOrCreate(), or do not pass the master and deploy-mode arguments to spark-submit if you want to call new SparkContext.
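Since a master hard-coded in the application can conflict with the `--master` passed to spark-submit, it may help to scan the sources before packaging. The following is only a rough sketch: the regular expression and the function name `find_hardcoded_master` are illustrative assumptions, and the check will miss masters set through variables or configuration files.

```python
import re

# Matches setMaster("...") and .master("...") calls with a literal string argument.
_MASTER_RE = re.compile(r'(?:\bsetMaster|\.master)\(\s*"([^"]+)"\s*\)')

def find_hardcoded_master(source):
    """Return (line_number, master_url) for each literal master found in source text."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for m in _MASTER_RE.finditer(line):
            hits.append((lineno, m.group(1)))
    return hits

# Session builder copied from the post above.
scala_source = '''val spark = SparkSession.builder()
.appName("DataETL")
.master("local[1]")
.enableHiveSupport()
.getOrCreate()
'''
print(find_hardcoded_master(scala_source))
```

Any hit is a candidate for removal, or for moving into a spark-submit argument, before running with `--master yarn --deploy-mode cluster`.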

Spark on Mesos Cluster - Task Fails

I'm trying to run a Spark application in a Mesos cluster where I have one master and one slave. The slave has 8GB RAM assigned for Mesos. The master is running the Spark Mesos Dispatcher.
I use the following command to submit a Spark application (which is a streaming application).
spark-submit --master mesos://mesos-master:7077 --class com.verifone.media.ums.scheduling.spark.SparkBootstrapper --deploy-mode cluster scheduling-spark-0.5.jar
And I see the following output, which shows it was submitted successfully.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/09/01 12:52:38 INFO RestSubmissionClient: Submitting a request to launch an application in mesos://mesos-master:7077.
15/09/01 12:52:39 INFO RestSubmissionClient: Submission successfully created as driver-20150901072239-0002. Polling submission state...
15/09/01 12:52:39 INFO RestSubmissionClient: Submitting a request for the status of submission driver-20150901072239-0002 in mesos://mesos-master:7077.
15/09/01 12:52:39 INFO RestSubmissionClient: State of driver driver-20150901072239-0002 is now QUEUED.
15/09/01 12:52:40 INFO RestSubmissionClient: Server responded with CreateSubmissionResponse:
{
"action" : "CreateSubmissionResponse",
"serverSparkVersion" : "1.4.1",
"submissionId" : "driver-20150901072239-0002",
"success" : true
}
However, this fails in Mesos, and when I look at the Spark Cluster UI, I see the following message.
task_id { value: "driver-20150901070957-0001" } state: TASK_FAILED message: "" slave_id { value: "20150831-082639-167881920-5050-4116-S6" } timestamp: 1.441091399975446E9 source: SOURCE_SLAVE reason: REASON_MEMORY_LIMIT 11: "\305-^E\377)N\327\277\361:\351\fm\215\312"
Seems like it is related to memory, but I'm not sure whether I have to configure something here to get this working.
UPDATE
I looked at the mesos logs in the slave, and I see the following message.
E0901 07:56:26.086618 1284 fetcher.cpp:515] Failed to run mesos-fetcher: Failed to fetch all URIs for container '33183181-e91b-4012-9e21-baa37485e755' with exit status: 256
So I thought that this could be because of the Spark Executor URL, so I modified the spark-submit to be as follows and increased memory for both driver and slave, but still I see the same error.
spark-submit \
--master mesos://mesos-master:7077 \
--class com.verifone.media.ums.scheduling.spark.SparkBootstrapper \
--deploy-mode cluster \
--driver-memory 1G \
--executor-memory 4G \
--conf spark.executor.uri=http://d3kbcqa49mib13.cloudfront.net/spark-1.4.1-bin-hadoop2.6.tgz \
scheduling-spark-0.5.jar
UPDATE 2
I got past this point by following @hartem's advice (see comments). Tasks are running now, but the actual Spark application still does not run in the cluster. When I look at the logs I see the following. After the last line, Spark does not seem to proceed any further.
15/09/01 10:33:41 INFO SparkContext: Added JAR file:/tmp/mesos/slaves/20150831-082639-167881920-5050-4116-S8/frameworks/20150831-082639-167881920-5050-4116-0004/executors/driver-20150901103327-0002/runs/47339c12-fb78-43d6-bc8a-958dd94d0ccf/spark-1.4.1-bin-hadoop2.6/../scheduling-spark-0.5.jar at http://192.172.1.31:33666/jars/scheduling-spark-0.5.jar with timestamp 1441103621639
I0901 10:33:41.728466 4375 sched.cpp:157] Version: 0.23.0
I0901 10:33:41.730764 4383 sched.cpp:254] New master detected at master@192.172.1.10:7077
I0901 10:33:41.730908 4383 sched.cpp:264] No credentials provided. Attempting to register without authentication
I had a similar issue: the slave could not find the jar required to run the class (SparkPi). When I gave the HTTP URL of the jar, it worked. The jar has to be placed somewhere reachable from the cluster (such as a distributed file system or an HTTP server), not on the local file system.
/home/centos/spark-1.6.1-bin-hadoop2.6/bin/spark-submit \
--name SparkPiTestApp \
--class org.apache.spark.examples.SparkPi \
--master mesos://xxxxxxx:7077 \
--deploy-mode cluster \
--executor-memory 5G --total-executor-cores 30 \
http://downloads.mesosphere.com.s3.amazonaws.com/assets/spark/spark-examples_2.10-1.4.0-SNAPSHOT.jar 100
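In cluster mode the driver is launched on a cluster node, so the application jar has to be fetchable from there. A quick pre-submit check might look like the sketch below. The set of schemes treated as remotely fetchable is an assumption to adjust for your cluster, `is_cluster_fetchable` is a hypothetical helper, and the `file:` and `hdfs:` URIs in the loop are made-up examples.

```python
from urllib.parse import urlparse

# Assumption: schemes that every node in the cluster can download from.
REMOTE_SCHEMES = {"http", "https", "hdfs", "s3", "s3a", "s3n", "ftp"}

def is_cluster_fetchable(jar_uri):
    """True if the jar URI points at a location other machines can download from."""
    scheme = urlparse(jar_uri).scheme.lower()
    return scheme in REMOTE_SCHEMES

for uri in ("scheduling-spark-0.5.jar",                             # bare local path
            "file:/opt/jars/scheduling-spark-0.5.jar",              # hypothetical local URI
            "hdfs://namenode:9000/jars/scheduling-spark-0.5.jar"):  # hypothetical HDFS URI
    print(uri, is_cluster_fetchable(uri))
```

A bare file name, as in the original submit command, has no scheme at all and is therefore reported as not fetchable.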
Could you please do export GLOG_v=1 before launching the slave and see if there is anything interesting in the slave log? I would also look for stdout and stderr files under the slave working directory and see if they contain any clues.

Spark driver program launching in `cluster` mode failed in a weird way

I'm new to Spark and have run into a problem. When I launch a program on a standalone Spark cluster with this command line:
./spark-submit --class scratch.Pi --deploy-mode cluster --executor-memory 5g --name pi --driver-memory 5g --driver-java-options "-XX:MaxPermSize=1024m" --master spark://bx-42-68:7077 hdfs://bx-42-68:9000/jars/pi.jar
it throws the following error:
15/01/28 19:48:51 INFO Slf4jLogger: Slf4jLogger started
15/01/28 19:48:51 INFO Utils: Successfully started service 'driverClient' on port 59290.
Sending launch command to spark://bx-42-68:7077
Driver successfully submitted as driver-20150128194852-0003
... waiting before polling master for driver state
... polling master for driver state
State of driver-20150128194852-0003 is FAILED
The cluster master outputs the following log:
15/01/28 19:48:52 INFO Master: Driver submitted org.apache.spark.deploy.worker.DriverWrapper
15/01/28 19:48:52 INFO Master: Launching driver driver-20150128194852-0003 on worker worker-20150126133948-bx-42-151-26286
15/01/28 19:48:55 INFO Master: Removing driver: driver-20150128194852-0003
15/01/28 19:48:57 INFO Master: akka.tcp://driverClient@bx-42-68:59290 got disassociated, removing it.
15/01/28 19:48:57 INFO Master: akka.tcp://driverClient@bx-42-68:59290 got disassociated, removing it.
15/01/28 19:48:57 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://driverClient@bx-42-68:59290] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
15/01/28 19:48:57 INFO LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.16.42.68%3A48091-16#-1393479428] was not delivered. [9] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
And the worker that launches the driver program outputs:
15/01/28 19:48:52 INFO Worker: Asked to launch driver driver-20150128194852-0003
15/01/28 19:48:52 INFO DriverRunner: Copying user jar hdfs://bx-42-68:9000/jars/pi.jar to /data11/spark-1.2.0-bin-hadoop2.4/work/driver-20150128194852-0003/pi.jar
Spark assembly has been built with Hive, including Datanucleus jars on classpath
15/01/28 19:48:55 INFO DriverRunner: Launch Command: "/opt/apps/jdk-1.7.0_60/bin/java" "-cp" "/data11/spark-1.2.0-bin-hadoop2.4/work/driver-20150128194852-0003/pi.jar:::/data11/spark-1.2.0-bin-hadoop2.4/sbin/../conf:/data11/spark-1.2.0-bin-hadoop2.4/lib/spark-assembly-1.2.0-hadoop2.4.0.jar:/data11/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar:/data11/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar:/data11/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar" "-XX:MaxPermSize=128m" "-Dspark.executor.memory=5g" "-Dspark.akka.askTimeout=10" "-Dspark.rdd.compress=true" "-Dspark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" "-Dspark.serializer=org.apache.spark.serializer.KryoSerializer" "-Dspark.app.name=YANL" "-Dspark.driver.extraJavaOptions=-XX:MaxPermSize=1024m" "-Dspark.jars=hdfs://bx-42-68:9000/jars/pi.jar" "-Dspark.master=spark://bx-42-68:7077" "-Dspark.storage.memoryFraction=0.6" "-Dakka.loglevel=WARNING" "-XX:MaxPermSize=1024m" "-Xms5120M" "-Xmx5120M" "org.apache.spark.deploy.worker.DriverWrapper" "akka.tcp://sparkWorker@bx-42-151:26286/user/Worker" "scratch.Pi"
15/01/28 19:48:55 WARN Worker: Driver driver-20150128194852-0003 exited with failure
My spark-env.sh is:
export SCALA_HOME=/opt/apps/scala-2.11.5
export JAVA_HOME=/opt/apps/jdk-1.7.0_60
export SPARK_HOME=/data11/spark-1.2.0-bin-hadoop2.4
export PATH=$JAVA_HOME/bin:$PATH
export SPARK_MASTER_IP=`hostname -f`
export SPARK_LOCAL_IP=`hostname -f`
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=10.16.42.68:2181,10.16.42.134:2181,10.16.42.151:2181,10.16.42.150:2181,10.16.42.125:2181 -Dspark.deploy.zookeeper.dir=/spark"
SPARK_WORKER_MEMORY=43g
SPARK_WORKER_CORES=22
And my spark-defaults.conf is:
spark.executor.extraJavaOptions -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
spark.executor.memory 20g
spark.rdd.compress true
spark.storage.memoryFraction 0.6
spark.serializer org.apache.spark.serializer.KryoSerializer
However, when I launch the program in client mode with the following command, it works fine.
./spark-submit --class scratch.Pi --deploy-mode client --executor-memory 5g --name pi --driver-memory 5g --driver-java-options "-XX:MaxPermSize=1024m" --master spark://bx-42-68:7077 /data11/pi.jar
It works in "client" mode but not in "cluster" mode because, according to the Spark 1.2.0 documentation, "cluster" mode is not supported on a standalone cluster:
Alternatively, if your application is submitted from a machine far from the worker machines (e.g. locally on your laptop), it is common to use cluster mode to minimize network latency between the drivers and the executors.
Note that cluster mode is currently not supported for standalone
clusters, Mesos clusters, or python applications.
The "Submitting Applications" section of the Spark documentation clearly states that cluster mode is not available on standalone clusters.
Reference link: http://spark.apache.org/docs/1.2.0/submitting-applications.html
See the "Launching Applications with spark-submit" section at that link.
I think it will help. Thanks.

Cannot submit Spark app to cluster, stuck on "UNDEFINED"

I use this command to submit a Spark application to a YARN cluster:
export YARN_CONF_DIR=conf
bin/spark-submit --class "Mining" \
--master yarn-cluster \
--executor-memory 512m ./target/scala-2.10/mining-assembly-0.1.jar
In the web UI, it is stuck on UNDEFINED.
In the console, it is stuck at:
14/11/12 16:37:55 INFO yarn.Client: Application report from ASM:
application identifier: application_1415704754709_0017
appId: 17
clientToAMToken: null
appDiagnostics:
appMasterHost: example.com
appQueue: default
appMasterRpcPort: 0
appStartTime: 1415784586000
yarnAppState: RUNNING
distributedFinalState: UNDEFINED
appTrackingUrl: http://example.com:8088/proxy/application_1415704754709_0017/
appUser: rain
Update:
Digging into the container logs in the web UI (http://example.com:8042/node/containerlogs/container_1415704754709_0017_01_000001/rain/stderr/?start=0), I found this:
14/11/12 02:11:47 WARN YarnClusterScheduler: Initial job has not accepted
any resources; check your cluster UI to ensure that workers are registered
and have sufficient memory
14/11/12 02:11:47 DEBUG Client: IPC Client (1211012646) connection to
spark.mvs.vn/192.168.64.142:8030 from rain sending #24418
14/11/12 02:11:47 DEBUG Client: IPC Client (1211012646) connection to
spark.mvs.vn/192.168.64.142:8030 from rain got value #24418
I found that this problem has a solution here: http://hortonworks.com/hadoop-tutorial/using-apache-spark-hdp/
The Hadoop cluster must have sufficient memory for the request.
For example, submitting the following job with 1GB memory allocated for
executor and Spark driver fails with the above error in the HDP 2.1 Sandbox.
Reduce the memory asked for the executor and the Spark driver to 512m and
re-start the cluster.
I'm trying this solution and hopefully it will work.
Solution
Finally, I found that it was caused by a memory problem.
It worked when I changed yarn.nodemanager.resource.memory-mb to 3072 (its value was 2048) in the web UI and restarted the cluster.
I'm very happy to see this.
With 3 GB for the YARN NodeManager, my submit command is:
bin/spark-submit \
--class "Mining" \
--master yarn-cluster \
--executor-memory 512m \
--driver-memory 512m \
--num-executors 2 \
--executor-cores 1 \
./target/scala-2.10/mining-assembly-0.1.jar
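The numbers in this final command can be checked with some arithmetic. The sketch below assumes a fixed 384 MB overhead per container (the minimum in Spark releases of that era), a `yarn.scheduler.minimum-allocation-mb` of 1024, and that all containers land on a single NodeManager. None of these values are stated in the post, and `total_yarn_memory_mb` is a hypothetical helper.

```python
import math

def total_yarn_memory_mb(driver_mb, executor_mb, num_executors,
                         overhead_mb=384, min_alloc_mb=1024):
    """Estimate total YARN memory for a yarn-cluster job: one AM/driver plus executors."""
    def container(heap_mb):
        # Assumption: heap + overhead is rounded up to a multiple of the minimum allocation.
        return math.ceil((heap_mb + overhead_mb) / min_alloc_mb) * min_alloc_mb
    return container(driver_mb) + num_executors * container(executor_mb)

# Values from the submit command above.
needed = total_yarn_memory_mb(driver_mb=512, executor_mb=512, num_executors=2)
for node_mb in (2048, 3072):
    verdict = "fits" if needed <= node_mb else "too small"
    print(node_mb, verdict, "- job needs", needed)
```

Under these assumptions the job needs 3072 MB, which matches the value that worked above; with 2048 MB only the driver and one executor would fit.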
