org.apache.spark.SparkException: Job aborted due to stage failure: Task in stage failed，Lost task in stage : ExecutorLostFailure (executor 4 lost)

org.apache.spark.SparkException: Job aborted due to stage failure: Task in stage failed，Lost task in stage : ExecutorLostFailure (executor 4 lost) - apache-spark

I build MonoSpark(based on Spark 1.3.1) with JDK 1.7 and Hadoop 2.6.2 by this command (I edited my pom.xml so that the command can work)
./make-distribution.sh --tgz -Phadoop-2.6 -Dhadoop.version=2.6.2
Then, I get a tgz file named 'spark-1.3.1-SNAPSHOT-bin-2.6.2.tgz'.
I put the tgz file on my hadoop cluster which has a master and 4 slaves.
Then, I start the spark by using the command.
$SPARK_HOME/sbin/start-all.sh
The spark works well as there are 4 workers and 1 master. However, when I use spark-submit to run an example:
./bin/spark-submit --class org.apache.spark.examples.JavaWordCount --master spark://master:7077 lib/spark-examples-1.3.1-*-hadoop2.6.2.jar input/README.md
I get this error on my driver like below
......other useless logs.....
19/03/31 22:24:41 ERROR cluster.SparkDeploySchedulerBackend: Asked to remove non-existent executor 2
19/03/31 22:24:46 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor#slave3:55311] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
19/03/31 22:24:50 ERROR scheduler.TaskSchedulerImpl: Lost executor 3 on slave1: remote Akka client disassociated
19/03/31 22:24:54 ERROR scheduler.TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job
.......other useless logs......
Exception in thread "main" 19/03/31 22:24:54 ERROR cluster.SparkDeploySchedulerBackend: Asked to remove non-existent executor 4
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, slave4): ExecutorLostFailure (executor 4 lost)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1325)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1314)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1313)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1313)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:714)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:714)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:714)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1526)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1487)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
The worker node error log is below:
19/03/31 22:25:11 INFO worker.Worker: Asked to launch executor app-20190331222434-0000/2 for JavaWordCount
19/03/31 22:25:19 INFO worker.Worker: Executor app-20190331222434-0000/2 finished with state EXITED message Command exited with code 50 exitStatus 50
19/03/31 22:25:19 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor#slave4:37919] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
19/03/31 22:25:19 INFO actor.LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkWorker/deadLetters] to Actor[akka://sparkWorker/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkWorker%4010.0.2.27%3A35254-2#299045174] was not delivered. [1] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
19/03/31 22:25:19 INFO worker.Worker: Asked to launch executor app-20190331222434-0000/4 for JavaWordCount
19/03/31 22:25:19 INFO worker.ExecutorRunner: Launch command: "/usr/local/java/jdk1.8.0_101/bin/java" "-cp" "/home/zxd/monotask_jdk1.7/spark-1.3.1-SNAPSHOT-bin-2.6.2/sbin/../conf:/home/zxd/monotask_jdk1.7/spark-1.3.1-SNAPSHOT-bin-2.6.2/lib/spark-assembly-1.3.1-SNAPSHOT-hadoop2.6.2.jar:/home/zxd/hadoop/hadoop-2.6.2/etc/hadoop:/home/zxd/monotask_jdk1.7/spark-1.3.1-SNAPSHOT-bin-2.6.2/sbin/../conf:/home/zxd/monotask_jdk1.7/spark-1.3.1-SNAPSHOT-bin-2.6.2/lib/spark-assembly-1.3.1-SNAPSHOT-hadoop2.6.2.jar:/home/zxd/hadoop/hadoop-2.6.2/etc/hadoop" "-Dspark.driver.port=42211" "-Xms512M" "-Xmx512M" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "akka.tcp://sparkDriver#master:42211/user/CoarseGrainedScheduler" "--executor-id" "4" "--hostname" "slave4" "--cores" "4" "--app-id" "app-20190331222434-0000" "--worker-url" "akka.tcp://sparkWorker#slave4:55970/user/Worker"
19/03/31 22:25:32 INFO worker.Worker: Executor app-20190331222434-0000/4 finished with state EXITED message Command exited with code 50 exitStatus 50
19/03/31 22:25:32 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor#slave4:60559] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
19/03/31 22:25:32 INFO actor.LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkWorker/deadLetters] to Actor[akka://sparkWorker/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkWorker%4010.0.2.27%3A35260-3#479615849] was not delivered. [2] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
19/03/31 22:25:32 INFO worker.Worker: Asked to launch executor app-20190331222434-0000/7 for JavaWordCount
19/03/31 22:25:32 INFO worker.ExecutorRunner: Launch command: "/usr/local/java/jdk1.8.0_101/bin/java" "-cp" "/home/zxd/monotask_jdk1.7/spark-1.3.1-SNAPSHOT-bin-2.6.2/sbin/../conf:/home/zxd/monotask_jdk1.7/spark-1.3.1-SNAPSHOT-bin-2.6.2/lib/spark-assembly-1.3.1-SNAPSHOT-hadoop2.6.2.jar:/home/zxd/hadoop/hadoop-2.6.2/etc/hadoop:/home/zxd/monotask_jdk1.7/spark-1.3.1-SNAPSHOT-bin-2.6.2/sbin/../conf:/home/zxd/monotask_jdk1.7/spark-1.3.1-SNAPSHOT-bin-2.6.2/lib/spark-assembly-1.3.1-SNAPSHOT-hadoop2.6.2.jar:/home/zxd/hadoop/hadoop-2.6.2/etc/hadoop" "-Dspark.driver.port=42211" "-Xms512M" "-Xmx512M" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "akka.tcp://sparkDriver#master:42211/user/CoarseGrainedScheduler" "--executor-id" "7" "--hostname" "slave4" "--cores" "4" "--app-id" "app-20190331222434-0000" "--worker-url" "akka.tcp://sparkWorker#slave4:55970/user/Worker"
19/03/31 22:25:32 INFO worker.Worker: Asked to kill executor app-20190331222434-0000/7
19/03/31 22:25:32 INFO worker.ExecutorRunner: Runner thread for executor app-20190331222434-0000/7 interrupted
19/03/31 22:25:32 INFO worker.ExecutorRunner: Killing process!
19/03/31 22:25:32 INFO worker.Worker: Executor app-20190331222434-0000/7 finished with state KILLED exitStatus 143
19/03/31 22:25:32 INFO worker.Worker: Cleaning up local directories for application app-20190331222434-0000
Are there any errors about hadoop version? Maybe I use the wrong hadoop version or jdk version to build Spark.
Hope someone can give me some suggestions, Thanks.

I find some errors in the executor:
java.lang.UnsupportedOperationException: Datanode-side support for getVolumeBlockLocations() must also be enabled in the client configuration.
I set dfs.datanode.hdfs-blocks-metadata.enabled as true in hadoop-site.xml and restart the hadoop cluster. Finally, it works for me.
The error log of executor is in directory: work
cd $SPARK_HOME/work/appxxxx/xx(xx is a number)

Related

All executors finish with state KILLED and exitStatus 1

I am trying to setup a local Spark cluster. I am using Spark 2.4.4 on Windows 10 machine.
To start the master and one worker I do
spark-class org.apache.spark.deploy.master.Master
spark-class org.apache.spark.deploy.worker.Worker 172.17.1.230:7077
After submitting an application to the cluster, it finishes successfully but in the Spark web admin UI it says that the application is KILLED. It's also what I get from worker logs. I have tried running my own examples and examples included in the Spark installation. They all get killed with exitStatus 1.
To start spark JavaSparkPi example from spark installation folder
Spark> spark-submit --master spark://172.17.1.230:7077 --class org.apache.spark.examples.JavaSparkPi .\examples\jars\spark-examples_2.11-2.4.4.jar
Part of the log after finishing calculation outputs
20/01/19 18:55:11 INFO DAGScheduler: Job 0 finished: reduce at JavaSparkPi.java:54, took 4.183853 s
Pi is roughly 3.13814
20/01/19 18:55:11 INFO SparkUI: Stopped Spark web UI at http://Nikola-PC:4040
20/01/19 18:55:11 INFO StandaloneSchedulerBackend: Shutting down all executors
20/01/19 18:55:11 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down
20/01/19 18:55:11 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
20/01/19 18:55:11 WARN TransportChannelHandler: Exception in connection from /172.17.1.230:58560
java.io.IOException: An existing connection was forcibly closed by the remote host
stderr log of the completed application outputs this at the end
20/01/19 18:55:11 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 910 bytes result sent to driver
20/01/19 18:55:11 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 910 bytes result sent to driver
20/01/19 18:55:11 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
The worker log outputs
20/01/19 18:55:06 INFO ExecutorRunner: Launch command: "C:\Program Files\Java\jdk1.8.0_231\bin\java" "-cp" "C:\Users\nikol\Spark\bin\..\conf\;C:\Users\nikol\Spark\jars\*" "-Xmx1024M" "-Dspark.driver.port=58484" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler#Nikola-PC:58484" "--executor-id" "0" "--hostname" "172.17.1.230" "--cores" "12" "--app-id" "app-20200119185506-0001" "--worker-url" "spark://Worker#172.17.1.230:58069"
20/01/19 18:55:11 INFO Worker: Asked to kill executor app-20200119185506-0001/0
20/01/19 18:55:11 INFO ExecutorRunner: Runner thread for executor app-20200119185506-0001/0 interrupted
20/01/19 18:55:11 INFO ExecutorRunner: Killing process!
20/01/19 18:55:11 INFO Worker: Executor app-20200119185506-0001/0 finished with state KILLED exitStatus 1
I have tried with Spark 2.4.4 for Hadoop 2.6 and 2.7. The problem remains in both the cases.
This problem is the same as this one.

ERROR yarn.ApplicationMaster: Uncaught exception: java.util.concurrent.TimeoutException: Futures timed out after 100000 milliseconds [duplicate]

This question already has answers here:
Why does join fail with "java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]"?
(4 answers)
Closed 4 years ago.
I have this problem in my spark application, I use 1.6 spark version, scala 2.10:
17/10/23 14:32:15 ERROR yarn.ApplicationMaster: Uncaught exception:
java.util.concurrent.TimeoutException: Futures timed out after [100000
milliseconds]at
scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107) at
org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:342)
at
org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:197)
at
org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:680)
at
org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:69)
at
org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:68)
at java.security.AccessController.doPrivileged(Native Method) at
javax.security.auth.Subject.doAs(Subject.java:422) at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917)
at
org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:68)
at
org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:678)
at
org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
17/10/23 14:32:15 INFO yarn.ApplicationMaster: Final app status:
FAILED, exitCode: 10, (reason: Uncaught exception:
java.util.concurrent.TimeoutException: Futures timed out after [100000
milliseconds]) 17/10/23 14:32:15 INFO spark.SparkContext: Invoking
stop() from shutdown hook 17/10/23 14:32:15 INFO ui.SparkUI: Stopped
Spark web UI at http://180.21.232.30:43576 17/10/23 14:32:15 INFO
scheduler.DAGScheduler: ShuffleMapStage 27 (show at Linkage.scala:282)
failed in 24.519 s due to Stage cancelled because SparkContext was
shut down 17/10/23 14:32:15 arkListenerJobEnd (18,1508761935656,JobFailed (org.apache.spark.SparkException:Job 18 cancelled because SparkContext was shut down)) 17/10/23 14:32:15 INFO spark.MapOutputTrackerMasterEndpoint:
MapOutputTrackerMasterEndpoint stopped! 17/10/23 14:32:15 INFO
storage.MemoryStore: MemoryStore cleared 17/10/23 14:32:15 INFO
storage.BlockManager: BlockManager stopped 17/10/23 14:32:15 INFO
storage.BlockManagerMaster: BlockManagerMaster stopped 17/10/23
14:32:15 INFO remote.RemoteActorRefProvider$RemotingTerminator:
Shutting down remote daemon.
17/10/23 14:32:15 INFO util.ShutdownHookManager: Shutdown hook
calledBlockquote
I read the articules that this problem and I tried to modify the next parameter without result
--conf spark.yarn.am.waitTime=6000s
--conf spark.sql.broadcastTimeout= 6000
--conf spark.network.timeout=600
Best Regars

Please remove the setMaster(’local’) on the code, because Spark by default uses the YARN cluster manager in EMR.

If you are trying to run your spark job on yarn client/cluster. Don't forget to remove master configuration from your code .master("local[n]").
For submitting spark job on yarn, you need to pass --master yarn --deploy-mode cluster/client.
Having master set as local was giving repeated timeout exception.

spark worker with 32GB or more memory encountered a fatal error

I have three slaves in a standalone Spark Cluster. Each slave has 48GB of RAM. When I assigned more than 31GB (e.g. 32GB or more) of RAM to my executors:
.config("spark.executor.memory", "44g")
the executors were terminated without much information during a join of two large Dataframes. The output message at the Slave driver showed "missing an output location for shuffle":
17/09/21 12:34:18 INFO StandaloneSchedulerBackend: Granted executor ID app-20170921123240-0000/3 on hostPort XXX.XXX.XXX.92:33705 with 6 cores, 44.0 GB RAM
17/09/21 12:34:18 WARN TaskSetManager: Lost task 14.0 in stage 7.0 (TID 124, XXX.XXX.XXX.92, executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/09/21 12:34:18 WARN TaskSetManager: Lost task 5.0 in stage 7.0 (TID 115, XXX.XXX.XXX.92, executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/09/21 12:34:18 WARN TaskSetManager: Lost task 17.0 in stage 7.0 (TID 127, XXX.XXX.XXX.92, executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/09/21 12:34:18 WARN TaskSetManager: Lost task 8.0 in stage 7.0 (TID 118, XXX.XXX.XXX.92, executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/09/21 12:34:18 WARN TaskSetManager: Lost task 2.0 in stage 7.0 (TID 112, XXX.XXX.XXX.92, executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/09/21 12:34:18 WARN TaskSetManager: Lost task 11.0 in stage 7.0 (TID 121, XXX.XXX.XXX.92, executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/09/21 12:34:18 INFO DAGScheduler: Executor lost: 0 (epoch 5)
17/09/21 12:34:18 INFO BlockManagerMaster: Removal of executor 0 requested
17/09/21 12:34:18 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 0
17/09/21 12:34:18 INFO BlockManagerMasterEndpoint: Trying to remove executor 0 from BlockManagerMaster.
17/09/21 12:34:18 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_10_2 !
17/09/21 12:34:18 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_10_11 !
17/09/21 12:34:18 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20170921123240-0000/3 is now RUNNING
17/09/21 12:34:18 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_10_5 !
17/09/21 12:34:18 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_10_8 !
17/09/21 12:34:18 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(0, XXX.XXX.XXX, 34840, None)
17/09/21 12:34:18 INFO BlockManagerMasterEndpoint: Trying to remove executor 0 from BlockManagerMaster.
17/09/21 12:34:18 INFO BlockManagerMaster: Removed 0 successfully in removeExecutor
The log message of Spark Master showed that the executors were "EXITED" and then relaunched:
17/09/21 12:34:18 INFO Master: Removing executor app-20170921123240-0000/0 because it is EXITED
17/09/21 12:34:18 INFO Master: Launching executor app-20170921123240-0000/3 on worker worker-20170921123014-152.83.247.92-33705
The log message of Spark Worker showed that the executor exited with code 134
17/09/21 12:34:18 INFO Worker: Executor app-20170921123240-0000/0 finished with state EXITED message Command exited with code 134 exitStatus 134
The only clue seems to be in the error log of the application, showing a fatal error has been detected by the JRE:
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007fdec0c92a73, pid=11300, tid=0x00007fd3a6951700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_131-b11) (build 1.8.0_131-b11)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.131-b11 mixed mode linux-amd64 )
# Problematic frame:
# V [libjvm.so+0x3ffa73] CardTableExtension::scavenge_contents_parallel(ObjectStartArray*, MutableSpace*, HeapWord*, PSPromotionManager*, unsigned int, unsigned int)+0x5e3
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# If you would like to submit a bug report, please visit:
# http://bugreport.java.com/bugreport/crash.jsp
#
--------------- T H R E A D ---------------
Current thread (0x0000000001c9e800): GCTaskThread [stack: 0x00007fd3a6851000,0x00007fd3a6952000] [id=11308]
siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x0000000000000008
As long as I assign 31GB of RAM (or less) to each executor, my program works just fine. Has anyone encountered such problem before?

44 GB may actually give you a smaller usable heap than 31 GB due to how Java stores object references: For heap sizes over 32 GB the JVM has to switch to 64 bit object references, which means all objects take up more space. More details here: http://java-performance.info/over-32g-heap-java/
My rule of thumb is to either stay below 32 GB or go much higher (say, 50 GB). Usually it is more cost efficient to use multiple JVMs, each having less than 32 GB heap. With 48 GB RAM I would stick to 31 GB heap.

Spark 1.5.1 standalone cluster - wrong Akka remoting config?

Doing my firsts steps with Spark, I'm facing problems submitting jobs to cluster from the application code. Digging the logs, I noticed some periodic WARN messages on master log:
15/10/08 13:00:00 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkDriver#192.168.254.167:64014] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
The problem is that ip address not exist on our network, and wasn't configured anywhere. The same wrong ip is shown on the worker log when it tries execute the task (wrong ip passed to --driver-url):
15/10/08 12:58:21 INFO worker.ExecutorRunner: Launch command: "/usr/java/latest//bin/java" "-cp" "/path/spark/spark-1.5.1-bin-ha
doop2.6/sbin/../conf/:/path/spark/spark-1.5.1-bin-hadoop2.6/lib/spark-assembly-1.5.1-hadoop2.6.0.jar:/path/spark/
spark-1.5.1-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/path/spark/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.ja
r:/path/spark/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/path/hadoop/2.6.0//etc/hadoop/" "-Xms102
4M" "-Xmx1024M" "-Dspark.driver.port=64014" "-Dspark.driver.port=53411" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url"
"akka.tcp://sparkDriver#192.168.254.167:64014/user/CoarseGrainedScheduler" "--executor-id" "39" "--hostname" "192.168.10.214" "--cores" "16" "--app-id" "app-20151008123702-0003" "--worker-url" "akka.tcp://sparkWorker#192.168.10.214:37625/user/Worker"
15/10/08 12:59:28 INFO worker.Worker: Executor app-20151008123702-0003/39 finished with state EXITED message Command exited with code 1 exitStatus 1
Any idea what I did wrong and how can this be fixed?
The java version is 1.8.0_20, and I'm using pre-built Spark binaries.
Thanks!

Maybe it will give you some clues here at my answer to a similar question which is similar question to yours with "Association with remote system has failed"

Can't connect from application to the standalone cluster

I'm trying to connect from application to Spark's standalone cluster. I want to do this on one machine.
I run standalone master server by command:
bash start-master.sh
Then I run one worker by command:
bash spark-class org.apache.spark.deploy.worker.Worker spark://PC:7077 -m 512m
(I allocated 512 MBs for it).
At master’s web UI:
http://localhost:8080
I see, that master and worker are running.
Then I try to connect from application to cluster, with following command:
JavaSparkContext sc = new JavaSparkContext("spark://PC:7077", "myapplication");
When I run application it's crashing with following error message:
4/11/01 22:53:26 INFO client.AppClient$ClientActor: Connecting to master spark://PC:7077...
14/11/01 22:53:26 INFO spark.SparkContext: Starting job: collect at App.java:115
14/11/01 22:53:26 INFO scheduler.DAGScheduler: Got job 0 (collect at App.java:115) with 2 output partitions (allowLocal=false)
14/11/01 22:53:26 INFO scheduler.DAGScheduler: Final stage: Stage 0(collect at App.java:115)
14/11/01 22:53:26 INFO scheduler.DAGScheduler: Parents of final stage: List()
14/11/01 22:53:26 INFO scheduler.DAGScheduler: Missing parents: List()
14/11/01 22:53:26 INFO scheduler.DAGScheduler: Submitting Stage 0 (ParallelCollectionRDD[0] at parallelize at App.java:109), which has no missing parents
14/11/01 22:53:27 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from Stage 0 (ParallelCollectionRDD[0] at parallelize at App.java:109)
14/11/01 22:53:27 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
14/11/01 22:53:42 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
14/11/01 22:53:46 INFO client.AppClient$ClientActor: Connecting to master spark://PC:7077...
14/11/01 22:53:57 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
14/11/01 22:54:06 INFO client.AppClient$ClientActor: Connecting to master spark://PC:7077...
14/11/01 22:54:12 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
14/11/01 22:54:26 ERROR cluster.SparkDeploySchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up.
14/11/01 22:54:26 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
14/11/01 22:54:26 INFO scheduler.DAGScheduler: Failed to run collect at App.java:115
Exception in thread "main" 14/11/01 22:54:26 INFO scheduler.TaskSchedulerImpl: Cancelling stage 0
org.apache.spark.SparkException: Job aborted due to stage failure: All masters are unresponsive! Giving up.
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAnd IndependentStages(DAGScheduler.scala:1033)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017 )
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015 )
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.s cala:633)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.s cala:633)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:633)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAG Scheduler.scala:1207)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
14/11/01 22:54:26 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/metrics/json,null}
14/11/01 22:54:26 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/stage/kill,null}
14/11/01 22:54:26 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/,null}
14/11/01 22:54:26 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/static,null}
14/11/01 22:54:26 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/executors/json,null}
14/11/01 22:54:26 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/executors,null}
14/11/01 22:54:26 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/environment/json,null}
Any ideas what is going on?
P.S. I'm using pre-built version of Spark - spark-1.1.0-bin-hadoop2.4.
Thank You.

Make sure that both the standalone workers and the Spark driver are connected to the Spark master on the exact address listed in its web UI / printed in its startup log message. Spark uses Akka for some of its control-plane communication and Akka can be really picky about hostnames, so these need to match exactly.
There are several options to control which hostnames / network interfaces the driver and master will bind to. Probably the simplest option is to set the SPARK_LOCAL_IP environment variable to control the address that the Master / Driver will bind to. See http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/troubleshooting/connectivity_issues.html for an overview of the other settings that affect network address binding.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

org.apache.spark.SparkException: Job aborted due to stage failure: Task in stage failed，Lost task in stage : ExecutorLostFailure (executor 4 lost) - apache-spark

Related

All executors finish with state KILLED and exitStatus 1

ERROR yarn.ApplicationMaster: Uncaught exception: java.util.concurrent.TimeoutException: Futures timed out after 100000 milliseconds [duplicate]

spark worker with 32GB or more memory encountered a fatal error

Spark 1.5.1 standalone cluster - wrong Akka remoting config?

Can't connect from application to the standalone cluster

Categories

Resources