Spark 1.5.1 standalone cluster - wrong Akka remoting config? - apache-spark

Doing my firsts steps with Spark, I'm facing problems submitting jobs to cluster from the application code. Digging the logs, I noticed some periodic WARN messages on master log:
15/10/08 13:00:00 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkDriver#192.168.254.167:64014] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
The problem is that ip address not exist on our network, and wasn't configured anywhere. The same wrong ip is shown on the worker log when it tries execute the task (wrong ip passed to --driver-url):
15/10/08 12:58:21 INFO worker.ExecutorRunner: Launch command: "/usr/java/latest//bin/java" "-cp" "/path/spark/spark-1.5.1-bin-ha
doop2.6/sbin/../conf/:/path/spark/spark-1.5.1-bin-hadoop2.6/lib/spark-assembly-1.5.1-hadoop2.6.0.jar:/path/spark/
spark-1.5.1-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/path/spark/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.ja
r:/path/spark/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/path/hadoop/2.6.0//etc/hadoop/" "-Xms102
4M" "-Xmx1024M" "-Dspark.driver.port=64014" "-Dspark.driver.port=53411" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url"
"akka.tcp://sparkDriver#192.168.254.167:64014/user/CoarseGrainedScheduler" "--executor-id" "39" "--hostname" "192.168.10.214" "--cores" "16" "--app-id" "app-20151008123702-0003" "--worker-url" "akka.tcp://sparkWorker#192.168.10.214:37625/user/Worker"
15/10/08 12:59:28 INFO worker.Worker: Executor app-20151008123702-0003/39 finished with state EXITED message Command exited with code 1 exitStatus 1
Any idea what I did wrong and how can this be fixed?
The java version is 1.8.0_20, and I'm using pre-built Spark binaries.
Thanks!

Maybe it will give you some clues here at my answer to a similar question which is similar question to yours with "Association with remote system has failed"

Related

org.apache.spark.SparkException: Job aborted due to stage failure: Task in stage failed,Lost task in stage : ExecutorLostFailure (executor 4 lost)

I build MonoSpark(based on Spark 1.3.1) with JDK 1.7 and Hadoop 2.6.2 by this command (I edited my pom.xml so that the command can work)
./make-distribution.sh --tgz -Phadoop-2.6 -Dhadoop.version=2.6.2
Then, I get a tgz file named 'spark-1.3.1-SNAPSHOT-bin-2.6.2.tgz'.
I put the tgz file on my hadoop cluster which has a master and 4 slaves.
Then, I start the spark by using the command.
$SPARK_HOME/sbin/start-all.sh
The spark works well as there are 4 workers and 1 master. However, when I use spark-submit to run an example:
./bin/spark-submit --class org.apache.spark.examples.JavaWordCount --master spark://master:7077 lib/spark-examples-1.3.1-*-hadoop2.6.2.jar input/README.md
I get this error on my driver like below
......other useless logs.....
19/03/31 22:24:41 ERROR cluster.SparkDeploySchedulerBackend: Asked to remove non-existent executor 2
19/03/31 22:24:46 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor#slave3:55311] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
19/03/31 22:24:50 ERROR scheduler.TaskSchedulerImpl: Lost executor 3 on slave1: remote Akka client disassociated
19/03/31 22:24:54 ERROR scheduler.TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job
.......other useless logs......
Exception in thread "main" 19/03/31 22:24:54 ERROR cluster.SparkDeploySchedulerBackend: Asked to remove non-existent executor 4
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, slave4): ExecutorLostFailure (executor 4 lost)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1325)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1314)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1313)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1313)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:714)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:714)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:714)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1526)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1487)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
The worker node error log is below:
19/03/31 22:25:11 INFO worker.Worker: Asked to launch executor app-20190331222434-0000/2 for JavaWordCount
19/03/31 22:25:19 INFO worker.Worker: Executor app-20190331222434-0000/2 finished with state EXITED message Command exited with code 50 exitStatus 50
19/03/31 22:25:19 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor#slave4:37919] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
19/03/31 22:25:19 INFO actor.LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkWorker/deadLetters] to Actor[akka://sparkWorker/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkWorker%4010.0.2.27%3A35254-2#299045174] was not delivered. [1] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
19/03/31 22:25:19 INFO worker.Worker: Asked to launch executor app-20190331222434-0000/4 for JavaWordCount
19/03/31 22:25:19 INFO worker.ExecutorRunner: Launch command: "/usr/local/java/jdk1.8.0_101/bin/java" "-cp" "/home/zxd/monotask_jdk1.7/spark-1.3.1-SNAPSHOT-bin-2.6.2/sbin/../conf:/home/zxd/monotask_jdk1.7/spark-1.3.1-SNAPSHOT-bin-2.6.2/lib/spark-assembly-1.3.1-SNAPSHOT-hadoop2.6.2.jar:/home/zxd/hadoop/hadoop-2.6.2/etc/hadoop:/home/zxd/monotask_jdk1.7/spark-1.3.1-SNAPSHOT-bin-2.6.2/sbin/../conf:/home/zxd/monotask_jdk1.7/spark-1.3.1-SNAPSHOT-bin-2.6.2/lib/spark-assembly-1.3.1-SNAPSHOT-hadoop2.6.2.jar:/home/zxd/hadoop/hadoop-2.6.2/etc/hadoop" "-Dspark.driver.port=42211" "-Xms512M" "-Xmx512M" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "akka.tcp://sparkDriver#master:42211/user/CoarseGrainedScheduler" "--executor-id" "4" "--hostname" "slave4" "--cores" "4" "--app-id" "app-20190331222434-0000" "--worker-url" "akka.tcp://sparkWorker#slave4:55970/user/Worker"
19/03/31 22:25:32 INFO worker.Worker: Executor app-20190331222434-0000/4 finished with state EXITED message Command exited with code 50 exitStatus 50
19/03/31 22:25:32 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor#slave4:60559] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
19/03/31 22:25:32 INFO actor.LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkWorker/deadLetters] to Actor[akka://sparkWorker/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkWorker%4010.0.2.27%3A35260-3#479615849] was not delivered. [2] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
19/03/31 22:25:32 INFO worker.Worker: Asked to launch executor app-20190331222434-0000/7 for JavaWordCount
19/03/31 22:25:32 INFO worker.ExecutorRunner: Launch command: "/usr/local/java/jdk1.8.0_101/bin/java" "-cp" "/home/zxd/monotask_jdk1.7/spark-1.3.1-SNAPSHOT-bin-2.6.2/sbin/../conf:/home/zxd/monotask_jdk1.7/spark-1.3.1-SNAPSHOT-bin-2.6.2/lib/spark-assembly-1.3.1-SNAPSHOT-hadoop2.6.2.jar:/home/zxd/hadoop/hadoop-2.6.2/etc/hadoop:/home/zxd/monotask_jdk1.7/spark-1.3.1-SNAPSHOT-bin-2.6.2/sbin/../conf:/home/zxd/monotask_jdk1.7/spark-1.3.1-SNAPSHOT-bin-2.6.2/lib/spark-assembly-1.3.1-SNAPSHOT-hadoop2.6.2.jar:/home/zxd/hadoop/hadoop-2.6.2/etc/hadoop" "-Dspark.driver.port=42211" "-Xms512M" "-Xmx512M" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "akka.tcp://sparkDriver#master:42211/user/CoarseGrainedScheduler" "--executor-id" "7" "--hostname" "slave4" "--cores" "4" "--app-id" "app-20190331222434-0000" "--worker-url" "akka.tcp://sparkWorker#slave4:55970/user/Worker"
19/03/31 22:25:32 INFO worker.Worker: Asked to kill executor app-20190331222434-0000/7
19/03/31 22:25:32 INFO worker.ExecutorRunner: Runner thread for executor app-20190331222434-0000/7 interrupted
19/03/31 22:25:32 INFO worker.ExecutorRunner: Killing process!
19/03/31 22:25:32 INFO worker.Worker: Executor app-20190331222434-0000/7 finished with state KILLED exitStatus 143
19/03/31 22:25:32 INFO worker.Worker: Cleaning up local directories for application app-20190331222434-0000
Are there any errors about hadoop version? Maybe I use the wrong hadoop version or jdk version to build Spark.
Hope someone can give me some suggestions, Thanks.
I find some errors in the executor:
java.lang.UnsupportedOperationException: Datanode-side support for getVolumeBlockLocations() must also be enabled in the client configuration.
I set dfs.datanode.hdfs-blocks-metadata.enabled as true in hadoop-site.xml and restart the hadoop cluster. Finally, it works for me.
The error log of executor is in directory: work
cd $SPARK_HOME/work/appxxxx/xx(xx is a number)

Why am I getting "Removing worker because we got no heartbeat in 60 seconds" on Spark master

I think I might of stumbled across a bug and wanted to get other people's input. I am running a pyspark application using Spark 2.2.0 in standalone mode. I am doing a somewhat heavy transformation in python inside a flatMap and the driver keeps killing the workers.
Here is what am I seeing:
The master after 60s of not seeing any heartbeat message from the workers it prints out this message to the log:
Removing worker [worker name] because we got no heartbeat in 60
seconds
Removing worker [worker name] on [IP]:[port]
Telling app of
lost executor: [executor number]
I then see in the driver log the following message:
Lost executor [executor number] on [executor IP]: worker lost
The worker then terminates and I see this message in its log:
Driver commanded a shutdown
I have looked at the Spark source code and from what I can tell, as long as the executor is alive it should send a heartbeat message back as it is using a ThreadUtils.newDaemonSingleThreadScheduledExecutor to do this.
One other thing that I noticed while I was running top on one of the workers, is that the executor JVM seems to be suspended throughout this process. There are as many python processes as I specified in the SPARK_WORKER_CORES env variable and each is consuming close to 100% of the CPU.
Anyone have any thoughts on this?
I was facing this same issue, increasing interval worked.
Excerpt from Logs start-all.sh logs
INFO Utils: Successfully started service 'sparkMaster' on port 7077.
INFO Master: Starting Spark master at spark://master:7077
INFO Master: Running Spark version 3.0.1
INFO Utils: Successfully started service 'MasterUI' on port 8080.
INFO MasterWebUI: Bound MasterWebUI to 0.0.0.0, and started at http://master:8080
INFO Master: I have been elected leader! New state: ALIVE
INFO Master: Registering worker slave01:41191 with 16 cores, 15.7 GiB RAM
INFO Master: Registering worker slave02:37853 with 16 cores, 15.7 GiB RAM
WARN Master: Removing worker-20210618205117-slave01-41191 because we got no heartbeat in 60 seconds
INFO Master: Removing worker worker-20210618205117-slave01-41191 on slave01:41191
INFO Master: Telling app of lost worker: worker-20210618205117-slave01-41191
WARN Master: Removing worker-20210618204723-slave02-37853 because we got no heartbeat in 60 seconds
INFO Master: Removing worker worker-20210618204723-slave02-37853 on slave02:37853
INFO Master: Telling app of lost worker: worker-20210618204723-slave02-37853
WARN Master: Got heartbeat from unregistered worker worker-20210618205117-slave01-41191. This worker was never registered, so ignoring the heartbeat.
WARN Master: Got heartbeat from unregistered worker worker-20210618204723-slave02-37853. This worker was never registered, so ignoring the heartbeat.
Solution: add following configs to $SPARK_HOME/conf/spark-defaults.conf
spark.network.timeout 50000
spark.executor.heartbeatInterval 5000
spark.worker.timeout 5000

Executor app finished with state KILLED exitStatus 143

​Hello, I am using Hive 2.3.0 with spark version 2.0.2, When i am trying to run Hive commands on Spark from hive console,
The Job is getting stuck and i have to manually kill it.
Following was the error message in Spark worker Log. Could you please advise if i am doing something wrong.
INFO worker.Worker: Executor app-20171114093447-0000/0 finished with state KILLED exitStatus 143
I was able to fix it by configuring correct EventLog directory in Spark.

Why does Spark exit with exitCode: 16?

I am using Spark 2.0.0 with Hadoop 2.7 and use the yarn-cluster mode. Every time, I get the following error:
17/01/04 11:18:04 INFO spark.SparkContext: Successfully stopped SparkContext
17/01/04 11:18:04 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 16, (reason: Shutdown hook called before final status was reported.)
17/01/04 11:18:04 INFO util.ShutdownHookManager: Shutdown hook called
17/01/04 11:18:04 INFO util.ShutdownHookManager: Deleting directory /tmp/hadoop-hduser/nm-local-dir/usercache/harry/appcache/application_1475261544699_0833/spark-42e40ac3-279f-4c3f-ab27-9999d20069b8
17/01/04 11:18:04 INFO spark.SparkContext: SparkContext already stopped.
However, I do get the correct printed output.
The same code works fine in Spark 1.4.0-Hadoop 2.4.0 where I do not see any exit codes.
This issue .sparkStaging not cleaned if application exited incorrectly https://issues.apache.org/jira/browse/SPARK-17340 started after Spark 1.4 (Affects Version/s: 1.5.2, 1.6.1, 2.0.0)
The issue is: When running Spark (yarn,cluster mode) and killing application
.sparkStaging is not cleaned.
When this issue happened exitCode 16 raised in Spark 2.0.X
ERROR ApplicationMaster: RECEIVED SIGNAL TERM
INFO ApplicationMaster: Final app status: FAILED, exitCode: 16, (reason: Shutdown hook called before final status was reported.)
Is it possible that in your code, something is killing the application?
If so - it shouldn't be seen in Spark 1.4, but should be seen in Spark 2.0.0
Please search your code for "exit" (as if you have such in your code, the error won't be shown in Spark 1.4, but will in Spark 2.0.0)
Seems like excess memory is used by JVM, try adding the property
yarn.nodemanager.vmem-check-enabled to false in your yarn-site.xml
Reference

Can't spark-submit to analytics node on DataStax Enterprise

I have a 6 node cluster, one of those is spark enabled.
I also have a spark job that I would like to submit to the cluster / that node, so I enter the following command
spark-submit --class VDQConsumer --master spark://node-public-ip:7077 target/scala-2.10/vdq-consumer-assembly-1.0.jar
it launches the spark ui on that node, but eventually gets here:
15/05/14 14:19:55 INFO SparkContext: Added JAR file:/Users/cwheeler/dev/git/vdq-consumer/target/scala-2.10/vdq-consumer-assembly-1.0.jar at http://node-ip:54898/jars/vdq-consumer-assembly-1.0.jar with timestamp 1431627595602
15/05/14 14:19:55 INFO AppClient$ClientActor: Connecting to master akka.tcp://sparkMaster#node-ip:7077/user/Master...
15/05/14 14:19:55 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkMaster#node-ip:7077] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
15/05/14 14:20:15 INFO AppClient$ClientActor: Connecting to master akka.tcp://sparkMaster#node-ip:7077/user/Master...
15/05/14 14:20:35 INFO AppClient$ClientActor: Connecting to master akka.tcp://sparkMaster#node-ip:7077/user/Master...
15/05/14 14:20:55 ERROR SparkDeploySchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up.
15/05/14 14:20:55 ERROR TaskSchedulerImpl: Exiting due to error from cluster scheduler: All masters are unresponsive! Giving up.
15/05/14 14:20:55 WARN SparkDeploySchedulerBackend: Application ID is not initialized yet.
Does anyone have any idea what just happened?

Resources