Spark: Association with remote system failed. Reason: Disassociated - apache-spark

I have a standalone Spark job, and every time the job finishes, the warning below appears. I don't really understand what it means or how to resolve it. Any help would be appreciated. Thanks.
WARN [SparkWorker-0 error logger] 2016-10-08 10:18:33,395 SparkWorker-0 - Association with remote system [akka.tcp://sparkExecutor#] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
INFO [] 2016-10-08 10:18:33,406 Logging.scala:59 - Removing executor app-20161008101807-0002/5 because it is EXITED
INFO [] 2016-10-08 10:18:33,407 Logging.scala:59 - Launching executor app-20161008101807-0002/6 on worker worker-20161008093556-
WARN [] 2016-10-08 10:18:33,762 Logging.scala:71 - Got status update for unknown executor app-20161008100608-0001/4
INFO [] 2016-10-08 10:18:33,819 Logging.scala:59 - akka.tcp://sparkDriver#XXX.196.201.23:36340 got disassociated, removing it.
INFO [SparkWorker-0 logger] 2016-10-08 10:18:33,835 SparkWorker-0 - Executor app-20161008100608-0001/0 finished with state KILLED exitStatus 143
WARN [] 2016-10-08 10:18:33,837 Logging.scala:71 - Got status update for unknown executor app-20161008100608-0001/0

This is just the executor saying it cannot talk to anyone. I would check the connection ports and related rules on your firewall.
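To act on the suggestion of checking ports, here is a minimal sketch of a TCP reachability test you could run from one node against another. The helper name `can_connect` is my own, and the host and port in the usage comment (`master`, 7077) are only illustrative; substitute whatever addresses appear in your own logs.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, unreachable hosts and DNS failures.
        return False


# Example usage (hypothetical target, adjust to your cluster):
#   print(can_connect("master", 7077))
```

A result of False only tells you that the connection could not be opened from this machine; it does not say whether a firewall, a closed port, or name resolution is to blame.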


org.apache.spark.SparkException: Job aborted due to stage failure: Task in stage failed,Lost task in stage : ExecutorLostFailure (executor 4 lost)

I built MonoSpark (based on Spark 1.3.1) with JDK 1.7 and Hadoop 2.6.2 using this command (I edited my pom.xml so that the command would work):
./ --tgz -Phadoop-2.6 -Dhadoop.version=2.6.2
That produced a tgz file named 'spark-1.3.1-SNAPSHOT-bin-2.6.2.tgz'.
I put the tgz file on my Hadoop cluster, which has a master and 4 slaves.
Then I started Spark using the command.
Spark starts fine, with 4 workers and 1 master. However, when I use spark-submit to run an example:
./bin/spark-submit --class org.apache.spark.examples.JavaWordCount --master spark://master:7077 lib/spark-examples-1.3.1-*-hadoop2.6.2.jar input/
I get the following error on my driver:
......other useless logs.....
19/03/31 22:24:41 ERROR cluster.SparkDeploySchedulerBackend: Asked to remove non-existent executor 2
19/03/31 22:24:46 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor#slave3:55311] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
19/03/31 22:24:50 ERROR scheduler.TaskSchedulerImpl: Lost executor 3 on slave1: remote Akka client disassociated
19/03/31 22:24:54 ERROR scheduler.TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job
.......other useless logs......
Exception in thread "main" 19/03/31 22:24:54 ERROR cluster.SparkDeploySchedulerBackend: Asked to remove non-existent executor 4
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, slave4): ExecutorLostFailure (executor 4 lost)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1314)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1313)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1313)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:714)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:714)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:714)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1526)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1487)
at org.apache.spark.util.EventLoop$$anon$
The worker node error log is below:
19/03/31 22:25:11 INFO worker.Worker: Asked to launch executor app-20190331222434-0000/2 for JavaWordCount
19/03/31 22:25:19 INFO worker.Worker: Executor app-20190331222434-0000/2 finished with state EXITED message Command exited with code 50 exitStatus 50
19/03/31 22:25:19 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor#slave4:37919] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
19/03/31 22:25:19 INFO actor.LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkWorker/deadLetters] to Actor[akka://sparkWorker/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkWorker%4010.0.2.27%3A35254-2#299045174] was not delivered. [1] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
19/03/31 22:25:19 INFO worker.Worker: Asked to launch executor app-20190331222434-0000/4 for JavaWordCount
19/03/31 22:25:19 INFO worker.ExecutorRunner: Launch command: "/usr/local/java/jdk1.8.0_101/bin/java" "-cp" "/home/zxd/monotask_jdk1.7/spark-1.3.1-SNAPSHOT-bin-2.6.2/sbin/../conf:/home/zxd/monotask_jdk1.7/spark-1.3.1-SNAPSHOT-bin-2.6.2/lib/spark-assembly-1.3.1-SNAPSHOT-hadoop2.6.2.jar:/home/zxd/hadoop/hadoop-2.6.2/etc/hadoop:/home/zxd/monotask_jdk1.7/spark-1.3.1-SNAPSHOT-bin-2.6.2/sbin/../conf:/home/zxd/monotask_jdk1.7/spark-1.3.1-SNAPSHOT-bin-2.6.2/lib/spark-assembly-1.3.1-SNAPSHOT-hadoop2.6.2.jar:/home/zxd/hadoop/hadoop-2.6.2/etc/hadoop" "-Dspark.driver.port=42211" "-Xms512M" "-Xmx512M" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "akka.tcp://sparkDriver#master:42211/user/CoarseGrainedScheduler" "--executor-id" "4" "--hostname" "slave4" "--cores" "4" "--app-id" "app-20190331222434-0000" "--worker-url" "akka.tcp://sparkWorker#slave4:55970/user/Worker"
19/03/31 22:25:32 INFO worker.Worker: Executor app-20190331222434-0000/4 finished with state EXITED message Command exited with code 50 exitStatus 50
19/03/31 22:25:32 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor#slave4:60559] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
19/03/31 22:25:32 INFO actor.LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkWorker/deadLetters] to Actor[akka://sparkWorker/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkWorker%4010.0.2.27%3A35260-3#479615849] was not delivered. [2] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
19/03/31 22:25:32 INFO worker.Worker: Asked to launch executor app-20190331222434-0000/7 for JavaWordCount
19/03/31 22:25:32 INFO worker.ExecutorRunner: Launch command: "/usr/local/java/jdk1.8.0_101/bin/java" "-cp" "/home/zxd/monotask_jdk1.7/spark-1.3.1-SNAPSHOT-bin-2.6.2/sbin/../conf:/home/zxd/monotask_jdk1.7/spark-1.3.1-SNAPSHOT-bin-2.6.2/lib/spark-assembly-1.3.1-SNAPSHOT-hadoop2.6.2.jar:/home/zxd/hadoop/hadoop-2.6.2/etc/hadoop:/home/zxd/monotask_jdk1.7/spark-1.3.1-SNAPSHOT-bin-2.6.2/sbin/../conf:/home/zxd/monotask_jdk1.7/spark-1.3.1-SNAPSHOT-bin-2.6.2/lib/spark-assembly-1.3.1-SNAPSHOT-hadoop2.6.2.jar:/home/zxd/hadoop/hadoop-2.6.2/etc/hadoop" "-Dspark.driver.port=42211" "-Xms512M" "-Xmx512M" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "akka.tcp://sparkDriver#master:42211/user/CoarseGrainedScheduler" "--executor-id" "7" "--hostname" "slave4" "--cores" "4" "--app-id" "app-20190331222434-0000" "--worker-url" "akka.tcp://sparkWorker#slave4:55970/user/Worker"
19/03/31 22:25:32 INFO worker.Worker: Asked to kill executor app-20190331222434-0000/7
19/03/31 22:25:32 INFO worker.ExecutorRunner: Runner thread for executor app-20190331222434-0000/7 interrupted
19/03/31 22:25:32 INFO worker.ExecutorRunner: Killing process!
19/03/31 22:25:32 INFO worker.Worker: Executor app-20190331222434-0000/7 finished with state KILLED exitStatus 143
19/03/31 22:25:32 INFO worker.Worker: Cleaning up local directories for application app-20190331222434-0000
Could this be a Hadoop version problem? Maybe I used the wrong Hadoop or JDK version to build Spark.
I hope someone can give me some suggestions. Thanks.
I found this error in the executor log:
java.lang.UnsupportedOperationException: Datanode-side support for getVolumeBlockLocations() must also be enabled in the client configuration.
I set dfs.datanode.hdfs-blocks-metadata.enabled to true in hadoop-site.xml and restarted the Hadoop cluster. After that, it worked for me.
The executor error log is in the work directory:
cd $SPARK_HOME/work/appxxxx/xx(xx is a number)
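Since the real cause only showed up in the executor's own log, a small script can save some clicking through directories. This is a sketch that assumes the standalone layout described above, where each executor has a `stderr` file under `work/<app id>/<executor number>/`; the function name and the default search strings are my own choices.

```python
import os


def find_error_lines(work_dir, needles=("Exception", "ERROR")):
    """Yield (path, line_number, text) for matching lines in executor stderr files."""
    for root, _dirs, files in os.walk(work_dir):
        if "stderr" not in files:
            continue
        path = os.path.join(root, "stderr")
        # errors="replace" keeps the scan going if a log contains odd bytes.
        with open(path, errors="replace") as fh:
            for number, line in enumerate(fh, start=1):
                if any(needle in line for needle in needles):
                    yield path, number, line.rstrip("\n")


# Example usage:
#   for path, number, text in find_error_lines(os.path.expandvars("$SPARK_HOME/work")):
#       print(f"{path}:{number}: {text}")
```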

Why am I getting "Removing worker because we got no heartbeat in 60 seconds" on Spark master

I think I might have stumbled across a bug and wanted to get other people's input. I am running a PySpark application using Spark 2.2.0 in standalone mode. I am doing a somewhat heavy transformation in Python inside a flatMap, and the driver keeps killing the workers.
Here is what I am seeing:
After 60 seconds without a heartbeat message from the workers, the master prints this to its log:
Removing worker [worker name] because we got no heartbeat in 60 seconds
Removing worker [worker name] on [IP]:[port]
Telling app of lost executor: [executor number]
I then see in the driver log the following message:
Lost executor [executor number] on [executor IP]: worker lost
The worker then terminates and I see this message in its log:
Driver commanded a shutdown
I have looked at the Spark source code and from what I can tell, as long as the executor is alive it should send a heartbeat message back as it is using a ThreadUtils.newDaemonSingleThreadScheduledExecutor to do this.
One other thing I noticed while running top on one of the workers is that the executor JVM seems to be suspended throughout this process. There are as many Python processes as I specified in the SPARK_WORKER_CORES env variable, and each is consuming close to 100% of the CPU.
Anyone have any thoughts on this?
I was facing the same issue; increasing the interval worked.
Excerpt from the logs:
INFO Utils: Successfully started service 'sparkMaster' on port 7077.
INFO Master: Starting Spark master at spark://master:7077
INFO Master: Running Spark version 3.0.1
INFO Utils: Successfully started service 'MasterUI' on port 8080.
INFO MasterWebUI: Bound MasterWebUI to, and started at http://master:8080
INFO Master: I have been elected leader! New state: ALIVE
INFO Master: Registering worker slave01:41191 with 16 cores, 15.7 GiB RAM
INFO Master: Registering worker slave02:37853 with 16 cores, 15.7 GiB RAM
WARN Master: Removing worker-20210618205117-slave01-41191 because we got no heartbeat in 60 seconds
INFO Master: Removing worker worker-20210618205117-slave01-41191 on slave01:41191
INFO Master: Telling app of lost worker: worker-20210618205117-slave01-41191
WARN Master: Removing worker-20210618204723-slave02-37853 because we got no heartbeat in 60 seconds
INFO Master: Removing worker worker-20210618204723-slave02-37853 on slave02:37853
INFO Master: Telling app of lost worker: worker-20210618204723-slave02-37853
WARN Master: Got heartbeat from unregistered worker worker-20210618205117-slave01-41191. This worker was never registered, so ignoring the heartbeat.
WARN Master: Got heartbeat from unregistered worker worker-20210618204723-slave02-37853. This worker was never registered, so ignoring the heartbeat.
Solution: add following configs to $SPARK_HOME/conf/spark-defaults.conf 50000
spark.executor.heartbeatInterval 5000
spark.worker.timeout 5000
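Before changing these values it can help to confirm what a given spark-defaults.conf actually sets. The sketch below parses whitespace-separated `key value` lines and prints the two properties named above. The parser is deliberately simple (it ignores blank lines and `#` comments and does not handle `key=value` syntax), and the sample text simply repeats the values from this answer. As far as I know, a bare number for spark.worker.timeout is read as seconds, so check the units in the documentation for your Spark version before copying numbers.

```python
def parse_spark_defaults(text):
    """Parse 'key value' lines, skipping blank lines and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)  # split on the first run of whitespace
        if len(parts) == 2:
            props[parts[0]] = parts[1].strip()
    return props


# Sample content repeating the values quoted in the answer.
sample = """
# heartbeat-related settings
spark.executor.heartbeatInterval 5000
spark.worker.timeout 5000
"""

props = parse_spark_defaults(sample)
for key in ("spark.executor.heartbeatInterval", "spark.worker.timeout"):
    print(key, "=", props.get(key, "<not set>"))
```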

Mesos Future discarded

I am trying to run a Spark job via Mesos, and it throws an exception:
WARN MesosExternalShuffleClient: Unable to register app with external shuffle service. Please manually remove shuffle data after driver exit
java.lang.UnsupportedOperationException: Unexpected message:
The stderr log shows:
ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
INFO DiskBlockManager: Shutdown hook called
E0911 05:32:34.711486 6619 process.cpp:951] Failed to accept socket: future discarded
In Spark-Defaults.conf
spark.mesos.coarse true 3600s 3600s
Who is killing my application?

Spark Shell Yarn Client Mode - Akka AssociationError

When I launch Spark Shell using:
spark-shell --master yarn --deploy-mode client
I'm getting the following error:
16/03/21 20:52:29 ERROR ErrorMonitor: AssociationError [akka.tcp://sparkDriver#ipaddress10:47915] -> [akka.tcp://sparkExecutor#hostname02:48703]: Error [Association failed with [akka.tcp://sparkExecutor#hostname02:48703]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkExecutor#hostname02:48703]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: No route to host
16/03/21 20:52:29 ERROR ErrorMonitor: AssociationError [akka.tcp://sparkDriver#ipaddress10:47915] -> [akka.tcp://sparkExecutor#hostname02:48703]: Error [Association failed with [akka.tcp://sparkExecutor#hostname02:48703]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkExecutor#hostname02:48703]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: No route to host
16/03/21 20:52:32 ERROR YarnScheduler: Lost executor 3 on hostname01: remote Rpc client disassociated
16/03/21 20:52:32 INFO DAGScheduler: Executor lost: 3 (epoch 0)
16/03/21 20:52:32 INFO BlockManagerMasterEndpoint: Trying to remove executor 3 from BlockManagerMaster.
16/03/21 20:52:32 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(3, hostname01, 37497)
16/03/21 20:52:32 INFO BlockManagerMaster: Removed 3 successfully in removeExecutor
16/03/21 20:52:32 INFO ExecutorAllocationManager: Existing executor 3 has been removed (new total is 0)
The firewall and iptables are turned off. The machines in the cluster can reach each other on all ports.
But I'm puzzled why I'm still getting "akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: No route to host"
Any help would be appreciated.
You probably have a name resolution issue. To confirm this, try using IP addresses rather than hostnames in your settings (for instance, in the slaves file).
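One quick way to test the name-resolution hypothesis is to ask each machine what a hostname resolves to and compare the answers across nodes. This is a minimal sketch; the function name `resolve` is my own, and the hostnames in the usage comment are taken from the log above.

```python
import socket


def resolve(hostname):
    """Return the sorted list of addresses hostname resolves to, or [] on failure."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return sorted({info[4][0] for info in infos})


# Example usage, run on every node and compare the output:
#   for name in ("hostname01", "hostname02"):
#       print(name, resolve(name))
```

If a hostname resolves to different addresses on different nodes, or to a loopback address on the node itself, that would be consistent with the "No route to host" error.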
I have experienced the same problem before. I found that I had mistyped some environment variables, namely SPARK_LOCAL_IP and SPARK_PUBLIC_DNS.
To resolve your problem, you have to:
On all your NodeManager nodes, check in the .bashrc and .bash_profile files that the env variables SPARK_LOCAL_IP and SPARK_PUBLIC_DNS are set to the right values, then restart your NodeManager(s).
On your client machine (where you issue the spark-shell command), set the same env variables to your client machine's IP and hostname.

Datastax Spark Jobs Killed for No Reason

We are using DSE Spark with a 3-node cluster running 5 jobs. We are seeing SIGTERM signals logged in /var/log/spark/worker/worker-0/worker.log, and they are stopping our jobs. We are not seeing any corresponding memory or processor constraints during these times, and no one sent these signals manually.
I've seen a couple of similar issues that trace back to a heap size problem with YARN or Mesos, but since we are using DSE, they didn't seem to be relevant.
Below is a sample of the log from one server that was running 2 of the jobs:
ERROR [SIGTERM handler] 2016-03-26 00:43:28,780 SignalLogger.scala:57 - RECEIVED SIGNAL 15: SIGTERM
ERROR [SIGHUP handler] 2016-03-26 00:43:28,788 SignalLogger.scala:57 - RECEIVED SIGNAL 1: SIGHUP
INFO [Spark Shutdown Hook] 2016-03-26 00:43:28,795 Logging.scala:59 - Killing process!
ERROR [File appending thread for /var/lib/spark/worker/worker-0/app-20160325131848-0001/0/stderr] 2016-03-26 00:43:28,848 Logging.scala:96 - Error writing stream to file /var/lib/spark/worker/worker-0/app-20160325131848-0001/0/stderr Stream closed
at ~[na:1.8.0_71]
at ~[na:1.8.0_71]
at ~[na:1.8.0_71]
at ~[na:1.8.0_71]
at org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:70) ~[spark-core_2.10-]
at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39) [spark-core_2.10-]
at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39) [spark-core_2.10-]
at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39) [spark-core_2.10-]
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1772) [spark-core_2.10-]
at org.apache.spark.util.logging.FileAppender$$anon$ [spark-core_2.10-]
ERROR [File appending thread for /var/lib/spark/worker/worker-0/app-20160325131848-0001/0/stdout] 2016-03-26 00:43:28,892 Logging.scala:96 - Error writing stream to file /var/lib/spark/worker/worker-0/app-20160325131848-0001/0/stdout Stream closed
at ~[na:1.8.0_71]
at ~[na:1.8.0_71]
at ~[na:1.8.0_71]
at ~[na:1.8.0_71]
at org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:70) ~[spark-core_2.10-]
at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39) [spark-core_2.10-]
at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39) [spark-core_2.10-]
at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39) [spark-core_2.10-]
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1772) [spark-core_2.10-]
at org.apache.spark.util.logging.FileAppender$$anon$ [spark-core_2.10-]
ERROR [SIGTERM handler] 2016-03-26 00:43:29,070 SignalLogger.scala:57 - RECEIVED SIGNAL 15: SIGTERM
INFO [] 2016-03-26 00:43:29,079 Logging.scala:59 - Disassociated [akka.tcp://sparkWorker#] -> [akka.tcp://sparkMaster#] Disassociated !
ERROR [] 2016-03-26 00:43:29,080 Logging.scala:75 - Connection to master failed! Waiting for master to reconnect...
INFO [] 2016-03-26 00:43:29,081 Logging.scala:59 - Connecting to master akka.tcp://sparkMaster#
WARN [] 2016-03-26 00:43:29,091 Slf4jLogger.scala:71 - Association with remote system [akka.tcp://sparkMaster#] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
INFO [] 2016-03-26 00:43:29,101 Logging.scala:59 - Disassociated [akka.tcp://sparkWorker#] -> [akka.tcp://sparkMaster#] Disassociated !
ERROR [] 2016-03-26 00:43:29,102 Logging.scala:75 - Connection to master failed! Waiting for master to reconnect...
INFO [] 2016-03-26 00:43:29,102 Logging.scala:59 - Not spawning another attempt to register with the master, since there is an attempt scheduled already.
WARN [] 2016-03-26 00:43:29,323 Slf4jLogger.scala:71 - Association with remote system [akka.tcp://sparkExecutor#] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
INFO [] 2016-03-26 00:43:29,330 Logging.scala:59 - Executor app-20160325132151-0004/0 finished with state EXITED message Command exited with code 129 exitStatus 129
INFO [Spark Shutdown Hook] 2016-03-26 00:43:29,414 Logging.scala:59 - Killing process!
INFO [] 2016-03-26 00:43:29,415 Logging.scala:59 - Executor app-20160325131848-0001/0 finished with state EXITED message Command exited with code 129 exitStatus 129
INFO [Spark Shutdown Hook] 2016-03-26 00:43:29,417 Logging.scala:59 - Killing process!
INFO [] 2016-03-26 00:43:29,422 Logging.scala:59 - Unknown Executor app-20160325132151-0004/0 finished with state EXITED message Worker shutting down exitStatus 129
WARN [] 2016-03-26 00:43:29,425 Slf4jLogger.scala:71 - Association with remote system [akka.tcp://sparkExecutor#] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
WARN [] 2016-03-26 00:43:29,433 Slf4jLogger.scala:71 - Association with remote system [akka.tcp://sparkExecutor#] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
INFO [] 2016-03-26 00:43:29,441 Logging.scala:59 - Executor app-20160325131918-0002/1 finished with state EXITED message Command exited with code 129 exitStatus 129
INFO [] 2016-03-26 00:43:29,448 Logging.scala:59 - Unknown Executor app-20160325131918-0002/1 finished with state EXITED message Worker shutting down exitStatus 129
INFO [Spark Shutdown Hook] 2016-03-26 00:43:29,448 Logging.scala:59 - Shutdown hook called
INFO [Spark Shutdown Hook] 2016-03-26 00:43:29,449 Logging.scala:59 - Deleting directory /var/lib/spark/rdd/spark-28fa2f73-d2aa-44c0-ad4e-3ccfd07a95d2
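The exit statuses in the log above can be read with the common convention that a process terminated by a signal reports 128 plus the signal number. The sketch below decodes a status that way; the function name is my own, and it assumes a POSIX system where Python's signal module knows these signal numbers.

```python
import signal


def signal_from_exit_status(status):
    """Return the signal name if status looks like 128 + signal number, else None."""
    if status <= 128:
        return None
    try:
        return signal.Signals(status - 128).name
    except ValueError:
        # No signal with that number on this platform.
        return None


for status in (143, 129, 50):
    print(status, signal_from_exit_status(status))
```

Read this way, exitStatus 129 lines up with the SIGHUP entry in the log and 143 with SIGTERM, while a status such as 50 is an ordinary exit code rather than a signal.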
The error seems straightforward to me:
Error writing stream to file /var/lib/spark/worker/worker-0/app-20160325131848-0001/0/stdout Stream closed at
Either there is a network issue between your data source (Cassandra) and Spark. Remember that Spark on node1 can and will pull data from Cassandra on node2, though it tries to minimize that.
Or your serialization is having issues. In that case, add this parameter to the Spark configuration to switch to Kryo.
