Spark driver program launching in `cluster` mode failed in a weird way - apache-spark

I'm new to Spark. Now I encountered a problem: when I launch a program in a standalone spark cluster while command line:
./spark-submit --class scratch.Pi --deploy-mode cluster --executor-memory 5g --name pi --driver-memory 5g --driver-java-options "-XX:MaxPermSize=1024m" --master spark://bx-42-68:7077 hdfs://bx-42-68:9000/jars/pi.jar
It will throws following error:
15/01/28 19:48:51 INFO Slf4jLogger: Slf4jLogger started
15/01/28 19:48:51 INFO Utils: Successfully started service 'driverClient' on port 59290.
Sending launch command to spark://bx-42-68:7077
Driver successfully submitted as driver-20150128194852-0003
... waiting before polling master for driver state
... polling master for driver state
State of driver-20150128194852-0003 is FAILED
Master of cluster outputs following log:
15/01/28 19:48:52 INFO Master: Driver submitted org.apache.spark.deploy.worker.DriverWrapper
15/01/28 19:48:52 INFO Master: Launching driver driver-20150128194852-0003 on worker worker-20150126133948-bx-42-151-26286
15/01/28 19:48:55 INFO Master: Removing driver: driver-20150128194852-0003
15/01/28 19:48:57 INFO Master: akka.tcp://driverClient#bx-42-68:59290 got disassociated, removing it.
15/01/28 19:48:57 INFO Master: akka.tcp://driverClient#bx-42-68:59290 got disassociated, removing it.
15/01/28 19:48:57 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://driverClient#bx-42-68:59290] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
15/01/28 19:48:57 INFO LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.16.42.68%3A48091-16#-1393479428] was not delivered. [9] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
And the corresponding worker for launching driver program outputs:
15/01/28 19:48:52 INFO Worker: Asked to launch driver driver-20150128194852-0003
15/01/28 19:48:52 INFO DriverRunner: Copying user jar hdfs://bx-42-68:9000/jars/pi.jar to /data11/spark-1.2.0-bin-hadoop2.4/work/driver-20150128194852-0003/pi.jar
Spark assembly has been built with Hive, including Datanucleus jars on classpath
15/01/28 19:48:55 INFO DriverRunner: Launch Command: "/opt/apps/jdk-1.7.0_60/bin/java" "-cp" "/data11/spark-1.2.0-bin-hadoop2.4/work/driver-20150128194852-0003/pi.jar:::/data11/spark-1.2.0-bin-hadoop2.4/sbin/../conf:/data11/spark-1.2.0-bin-hadoop2.4/lib/spark-assembly-1.2.0-hadoop2.4.0.jar:/data11/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar:/data11/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar:/data11/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar" "-XX:MaxPermSize=128m" "-Dspark.executor.memory=5g" "-Dspark.akka.askTimeout=10" "-Dspark.rdd.compress=true" "-Dspark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" "-Dspark.serializer=org.apache.spark.serializer.KryoSerializer" "-Dspark.app.name=YANL" "-Dspark.driver.extraJavaOptions=-XX:MaxPermSize=1024m" "-Dspark.jars=hdfs://bx-42-68:9000/jars/pi.jar" "-Dspark.master=spark://bx-42-68:7077" "-Dspark.storage.memoryFraction=0.6" "-Dakka.loglevel=WARNING" "-XX:MaxPermSize=1024m" "-Xms5120M" "-Xmx5120M" "org.apache.spark.deploy.worker.DriverWrapper" "akka.tcp://sparkWorker#bx-42-151:26286/user/Worker" "scratch.Pi"
15/01/28 19:48:55 WARN Worker: Driver driver-20150128194852-0003 exited with failure
My spark-env.sh is:
export SCALA_HOME=/opt/apps/scala-2.11.5
export JAVA_HOME=/opt/apps/jdk-1.7.0_60
export SPARK_HOME=/data11/spark-1.2.0-bin-hadoop2.4
export PATH=$JAVA_HOME/bin:$PATH
export SPARK_MASTER_IP=`hostname -f`
export SPARK_LOCAL_IP=`hostname -f`
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=10.16.42.68:2181,10.16.42.134:2181,10.16.42.151:2181,10.16.42.150:2181,10.16.42.125:2181 -Dspark.deploy.zookeeper.dir=/spark"
SPARK_WORKER_MEMORY=43g
SPARK_WORKER_CORES=22
And my spark-defaults.conf is:
spark.executor.extraJavaOptions -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
spark.executor.memory 20g
spark.rdd.compress true
spark.storage.memoryFraction 0.6
spark.serializer org.apache.spark.serializer.KryoSerializer
However, when I launch the program with client mode with following command, it works fine.
./spark-submit --class scratch.Pi --deploy-mode client --executor-memory 5g --name pi --driver-memory 5g --driver-java-options "-XX:MaxPermSize=1024m" --master spark://bx-42-68:7077 /data11/pi.jar

The reason why it works in "client" mode and not in "cluster" mode is because there is no support for "cluster" mode in a standalone cluster.(mentioned in the spark documentation).
Alternatively, if your application is submitted from a machine far from the worker machines (e.g. locally on your laptop), it is common to use cluster mode to minimize network latency between the drivers and the executors.
Note that cluster mode is currently not supported for standalone
clusters, Mesos clusters, or python applications.
If you look at "Submitting Applications" section in spark documentation, it is clearly mentioned that the support for cluster mode is not available in standalone clusters.
Reference link : http://spark.apache.org/docs/1.2.0/submitting-applications.html
Go to above link and have a look at "Launching Applications with spark-submit" section.
Think it will help. Thanks.

Related

Spark in Yarn Cluster Mode - Yarn client reports FAILED even when job completes successfully

I am experimenting with running Spark in yarn cluster mode (v2.3.0). We have traditionally been running in yarn client mode, but some jobs are submitted from .NET web services, so we have to keep a host process running in the background when using client mode (HostingEnvironment.QueueBackgroundWorkTime...). We are hoping we can execute these jobs in a more "fire and forget" style.
Our jobs continue to run successfully, but we see a curious entry in the logs where the yarn client that submits the job to the application manager is always reporting failure:
18/11/29 16:54:35 INFO yarn.Client: Application report for application_1539978346138_110818 (state: RUNNING)
18/11/29 16:54:36 INFO yarn.Client: Application report for application_1539978346138_110818 (state: RUNNING)
18/11/29 16:54:37 INFO yarn.Client: Application report for application_1539978346138_110818 (state: FINISHED)
18/11/29 16:54:37 INFO yarn.Client:
client token: Token { kind: YARN_CLIENT_TOKEN, service: }
diagnostics: N/A
ApplicationMaster host: <ip address>
ApplicationMaster RPC port: 0
queue: root.default
start time: 1543510402372
final status: FAILED
tracking URL: http://server.host.com:8088/proxy/application_1539978346138_110818/
user: p800s1
Exception in thread "main" org.apache.spark.SparkException: Application application_1539978346138_110818 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1153)
at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1568)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:892)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
18/11/29 16:54:37 INFO util.ShutdownHookManager: Shutdown hook called
We always create a SparkSession and always return sys.exit(0) (although that appears to be ignored by the Spark framework regardless of how we submit a job). We also have our own internal error logging that routes to Kafka/ElasticSearch. No errors are reported during the job run.
Here's an example of the submit command: spark2-submit --keytab /etc/keytabs/p800s1.ktf --principal p800s1#OURDOMAIN.COM --master yarn --deploy-mode cluster --driver-memory 2g --executor-memory 4g --class com.path.to.MainClass /path/to/UberJar.jar arg1 arg2
This seems to be harmless noise, but I don't like noise that I don't understand. Has anyone experienced something similar?

Spark-yarn ends with an error exitCode=16, how to solve that?

I am using Apache Spark 2.0.0 and Apache Hadoop 2.6.0. I am trying to run my spark application on my hadoop cluster.
I used the command lines:
bin/spark-submit --class org.JavaWordCount \
--master yarn \
--deploy-mode cluster \
--driver-memory 512m \
--queue default \
/opt/JavaWordCount.jar \
10
However, Yarn ends with an error exictCode=16:
17/01/25 11:05:49 INFO impl.ContainerManagementProtocolProxy: yarn.client.max-cached-nodemanagers-proxies : 0
17/01/25 11:05:49 INFO impl.ContainerManagementProtocolProxy: Opening proxy : hmaster:59600
17/01/25 11:05:49 ERROR yarn.ApplicationMaster: RECEIVED SIGNAL TERM
17/01/25 11:05:49 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 16, (reason: Shutdown hook called before final status was reported.)
17/01/25 11:05:49 INFO storage.DiskBlockManager: Shutdown hook called
I tried to solve this issue with this topic, but it doesn't give a pratical answer.
Does anyone know how to solve this isssue ?
Thanks in advance
Just Encountered this issue. Excess memory is being used by JVM. Try adding the property
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
in the yarn-site.xml of all nodemanagers and restart. It worked for me
Refer : https://issues.apache.org/jira/browse/YARN-4714

Spark on Mesos Cluster - Task Fails

I'm trying to run a Spark application in a Mesos cluster where I have one master and one slave. The slave has 8GB RAM assigned for Mesos. The master is running the Spark Mesos Dispatcher.
I use the following command to submit a Spark application (which is a streaming application).
spark-submit --master mesos://mesos-master:7077 --class com.verifone.media.ums.scheduling.spark.SparkBootstrapper --deploy-mode cluster scheduling-spark-0.5.jar
And I see the following output which shows its successfully submitted.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/09/01 12:52:38 INFO RestSubmissionClient: Submitting a request to launch an application in mesos://mesos-master:7077.
15/09/01 12:52:39 INFO RestSubmissionClient: Submission successfully created as driver-20150901072239-0002. Polling submission state...
15/09/01 12:52:39 INFO RestSubmissionClient: Submitting a request for the status of submission driver-20150901072239-0002 in mesos://mesos-master:7077.
15/09/01 12:52:39 INFO RestSubmissionClient: State of driver driver-20150901072239-0002 is now QUEUED.
15/09/01 12:52:40 INFO RestSubmissionClient: Server responded with CreateSubmissionResponse:
{
"action" : "CreateSubmissionResponse",
"serverSparkVersion" : "1.4.1",
"submissionId" : "driver-20150901072239-0002",
"success" : true
}
However, this fails in Mesos, and when I look at the Spark Cluster UI, I see the following message.
task_id { value: "driver-20150901070957-0001" } state: TASK_FAILED message: "" slave_id { value: "20150831-082639-167881920-5050-4116-S6" } timestamp: 1.441091399975446E9 source: SOURCE_SLAVE reason: REASON_MEMORY_LIMIT 11: "\305-^E\377)N\327\277\361:\351\fm\215\312"
Seems like it is related to memory, but I'm not sure whether I have to configure something here to get this working.
UPDATE
I looked at the mesos logs in the slave, and I see the following message.
E0901 07:56:26.086618 1284 fetcher.cpp:515] Failed to run mesos-fetcher: Failed to fetch all URIs for container '33183181-e91b-4012-9e21-baa37485e755' with exit status: 256
So I thought that this could be because of the Spark Executor URL, so I modified the spark-submit to be as follows and increased memory for both driver and slave, but still I see the same error.
spark-submit \
--master mesos://mesos-master:7077 \
--class com.verifone.media.ums.scheduling.spark.SparkBootstrapper \
--deploy-mode cluster \
--driver-memory 1G \
--executor-memory 4G \
--conf spark.executor.uri=http://d3kbcqa49mib13.cloudfront.net/spark-1.4.1-bin-hadoop2.6.tgz \
scheduling-spark-0.5.jar
UPDATE 2
I went past this point by following #hartem's advice (see comments). Tasks are running now, but still, actual Spark application does not run in the cluster. When I look at the logs I see the following. After the last line, seems that Spark does not proceed any further.
15/09/01 10:33:41 INFO SparkContext: Added JAR file:/tmp/mesos/slaves/20150831-082639-167881920-5050-4116-S8/frameworks/20150831-082639-167881920-5050-4116-0004/executors/driver-20150901103327-0002/runs/47339c12-fb78-43d6-bc8a-958dd94d0ccf/spark-1.4.1-bin-hadoop2.6/../scheduling-spark-0.5.jar at http://192.172.1.31:33666/jars/scheduling-spark-0.5.jar with timestamp 1441103621639
I0901 10:33:41.728466 4375 sched.cpp:157] Version: 0.23.0
I0901 10:33:41.730764 4383 sched.cpp:254] New master detected at master#192.172.1.10:7077
I0901 10:33:41.730908 4383 sched.cpp:264] No credentials provided. Attempting to register without authentication
I had similar issue problem was slave could not find the required jar for running the class file(SparkPi). So i gave the http URL of the jar it worked, it requires jar to be placed in distributed system not on local file system.
/home/centos/spark-1.6.1-bin-hadoop2.6/bin/spark-submit \
--name SparkPiTestApp \
--class org.apache.spark.examples.SparkPi \
--master mesos://xxxxxxx:7077 \
--deploy-mode cluster \
--executor-memory 5G --total-executor-cores 30 \
http://downloads.mesosphere.com.s3.amazonaws.com/assets/spark/spark-examples_2.10-1.4.0-SNAPSHOT.jar 100
Could you please do export GLOG_v=1 before launching the slave and see if there is anything interesting in the slave log? I would also look for stdout and stderr files under the slave working directory and see if they contain any clues.

Spark 1.2.1 standalone cluster mode spark-submit is not working

I have 3 node spark cluster
node1 , node2 and node 3
I running below command on node 1 for deploying driver
/usr/local/spark-1.2.1-bin-hadoop2.4/bin/spark-submit --class com.fst.firststep.aggregator.FirstStepMessageProcessor --master spark://ec2-xx-xx-xx-xx.compute-1.amazonaws.com:7077 --deploy-mode cluster --supervise file:///home/xyz/sparkstreaming-0.0.1-SNAPSHOT.jar /home/xyz/config.properties
driver gets launched on node 2 in cluster. but getting exception on node 2 that it is trying to bind to node 1 ip.
2015-02-26 08:47:32 DEBUG AkkaUtils:63 - In createActorSystem, requireCookie is: off
2015-02-26 08:47:32 INFO Slf4jLogger:80 - Slf4jLogger started
2015-02-26 08:47:33 ERROR NettyTransport:65 - failed to bind to ec2-xx.xx.xx.xx.compute-1.amazonaws.com/xx.xx.xx.xx:0, shutting down Netty transport
2015-02-26 08:47:33 WARN Utils:71 - Service 'Driver' could not bind on port 0. Attempting port 1.
2015-02-26 08:47:33 DEBUG AkkaUtils:63 - In createActorSystem, requireCookie is: off
2015-02-26 08:47:33 ERROR Remoting:65 - Remoting error: [Startup failed] [
akka.remote.RemoteTransportException: Startup failed
at akka.remote.Remoting.akka$remote$Remoting$$notifyError(Remoting.scala:136)
at akka.remote.Remoting.start(Remoting.scala:201)
at akka.remote.RemoteActorRefProvider.init(RemoteActorRefProvider.scala:184)
at akka.actor.ActorSystemImpl.liftedTree2$1(ActorSystem.scala:618)
at akka.actor.ActorSystemImpl._start$lzycompute(ActorSystem.scala:615)
at akka.actor.ActorSystemImpl._start(ActorSystem.scala:615)
at akka.actor.ActorSystemImpl.start(ActorSystem.scala:632)
at akka.actor.ActorSystem$.apply(ActorSystem.scala:141)
at akka.actor.ActorSystem$.apply(ActorSystem.scala:118)
at org.apache.spark.util.AkkaUtils$.org$apache$spark$util$AkkaUtils$$doCreateActorSystem(AkkaUtils.scala:121)
at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:54)
at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:53)
at org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1765)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1756)
at org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:56)
at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:33)
at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
Caused by: org.jboss.netty.channel.ChannelException: Failed to bind to: ec2-xx-xx-xx.compute-1.amazonaws.com/xx.xx.xx.xx:0
at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272)
at akka.remote.transport.netty.NettyTransport$$anonfun$listen$1.apply(NettyTransport.scala:393)
at akka.remote.transport.netty.NettyTransport$$anonfun$listen$1.apply(NettyTransport.scala:389)
at scala.util.Success$$anonfun$map$1.apply(Try.scala:206)
at scala.util.Try$.apply(Try.scala:161)
at scala.util.Success.map(Try.scala:206)
kindly suggest
Thanks
after spending lot more time.i got the answer.i did below changes
remove entry of SPARK_LOCAL_IP and SPARK_MASTER_IP
add host name and private ip address of each other nodes in etc/hosts.
use --deploy-mode cluster --supervise
thats all and it works perfectly with fully HA components(Master,Slaves and Driver)
Thanks
Cluster mode is not supported in EC2 1.2 instances where it creates a standalone cluster. Hence you can try removing
--deploy-mode cluster --supervise

Cannot submit Spark app to cluster, stuck on "UNDEFINED"

I use this command to summit spark application to yarn cluster
export YARN_CONF_DIR=conf
bin/spark-submit --class "Mining"
--master yarn-cluster
--executor-memory 512m ./target/scala-2.10/mining-assembly-0.1.jar
In Web UI, it stuck on UNDEFINED
In console, it stuck to
<code>14/11/12 16:37:55 INFO yarn.Client: Application report from ASM:
application identifier: application_1415704754709_0017
appId: 17
clientToAMToken: null
appDiagnostics:
appMasterHost: example.com
appQueue: default
appMasterRpcPort: 0
appStartTime: 1415784586000
yarnAppState: RUNNING
distributedFinalState: UNDEFINED
appTrackingUrl: http://example.com:8088/proxy/application_1415704754709_0017/
appUser: rain
</code>
Update:
Dive into Logs for container in Web UI http://example.com:8042/node/containerlogs/container_1415704754709_0017_01_000001/rain/stderr/?start=0, I found this
14/11/12 02:11:47 WARN YarnClusterScheduler: Initial job has not accepted
any resources; check your cluster UI to ensure that workers are registered
and have sufficient memory
14/11/12 02:11:47 DEBUG Client: IPC Client (1211012646) connection to
spark.mvs.vn/192.168.64.142:8030 from rain sending #24418
14/11/12 02:11:47 DEBUG Client: IPC Client (1211012646) connection to
spark.mvs.vn/192.168.64.142:8030 from rain got value #24418
I found this problem have had solution here http://hortonworks.com/hadoop-tutorial/using-apache-spark-hdp/
The Hadoop cluster must have sufficient memory for the request.
For example, submitting the following job with 1GB memory allocated for
executor and Spark driver fails with the above error in the HDP 2.1 Sandbox.
Reduce the memory asked for the executor and the Spark driver to 512m and
re-start the cluster.
I'm trying this solution and hopefully it will work.
Solutions
Finally I found that it caused by memory problem
It worked when I change yarn.nodemanager.resource.memory-mb to 3072 (its value was 2048) in Web UI of interface and restarted cluster.
I'm very happy to see this
With 3GB in yarn nodemanager, my summit is
bin/spark-submit
--class "Mining"
--master yarn-cluster
--executor-memory 512m
--driver-memory 512m
--num-executors 2
--executor-cores 1
./target/scala-2.10/mining-assembly-0.1.jar`

Resources