Spark: Association with remote system failed. Reason: Disassociated - apache-spark

I have a standalone Spark job, and every time the job finishes the warning below appears. I don't really understand what it means or how to fix it. Would be great if you could help. Thanks
WARN [SparkWorker-0 error logger] 2016-10-08 10:18:33,395 SparkWorker-0 ExternalLogger.java:92
- Association with remote system [akka.tcp://sparkExecutor#10.47.183.30:39422] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
INFO [sparkMaster-akka.actor.default-dispatcher-4] 2016-10-08 10:18:33,406 Logging.scala:59 - Removing executor app-20161008101807-0002/5 because it is EXITED
INFO [sparkMaster-akka.actor.default-dispatcher-4] 2016-10-08 10:18:33,407 Logging.scala:59 - Launching executor app-20161008101807-0002/6 on worker worker-20161008093556-10.47.183.121-41649
WARN [sparkMaster-akka.actor.default-dispatcher-4] 2016-10-08 10:18:33,762 Logging.scala:71 - Got status update for unknown executor app-20161008100608-0001/4
INFO [sparkMaster-akka.actor.default-dispatcher-4] 2016-10-08 10:18:33,819 Logging.scala:59 - akka.tcp://sparkDriver#XXX.196.201.23:36340 got disassociated, removing it.
INFO [SparkWorker-0 logger] 2016-10-08 10:18:33,835 SparkWorker-0 ExternalLogger.java:88 - Executor app-20161008100608-0001/0 finished with state KILLED exitStatus 143
WARN [sparkMaster-akka.actor.default-dispatcher-5] 2016-10-08 10:18:33,837 Logging.scala:71 - Got status update for unknown executor app-20161008100608-0001/0

This is just the executor saying it cannot talk to anyone. I would check the connection ports and firewall rules between your nodes.
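If the firewall does turn out to be the culprit, one way to make the rules manageable is to pin the ports Spark uses instead of letting it pick random ephemeral ones. A minimal sketch for $SPARK_HOME/conf/spark-defaults.conf (the port numbers are arbitrary examples, not values from the question; adjust them to whatever your firewall allows):
spark.driver.port        40000
spark.blockManager.port  40010
spark.port.maxRetries    32
With fixed ports like these you only have to open a small, known range between the driver, master and workers.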

Related

org.apache.spark.SparkException: Job aborted due to stage failure: Task in stage failed, Lost task in stage: ExecutorLostFailure (executor 4 lost)

I built MonoSpark (based on Spark 1.3.1) with JDK 1.7 and Hadoop 2.6.2 using this command (I edited my pom.xml so that the command would work):
./make-distribution.sh --tgz -Phadoop-2.6 -Dhadoop.version=2.6.2
This produces a tgz file named 'spark-1.3.1-SNAPSHOT-bin-2.6.2.tgz'.
I put the tgz file on my Hadoop cluster, which has one master and 4 slaves.
Then I start Spark with the command:
$SPARK_HOME/sbin/start-all.sh
Spark works well: there are 4 workers and 1 master. However, when I use spark-submit to run an example:
./bin/spark-submit --class org.apache.spark.examples.JavaWordCount --master spark://master:7077 lib/spark-examples-1.3.1-*-hadoop2.6.2.jar input/README.md
I get the following error on the driver:
......other useless logs.....
19/03/31 22:24:41 ERROR cluster.SparkDeploySchedulerBackend: Asked to remove non-existent executor 2
19/03/31 22:24:46 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor#slave3:55311] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
19/03/31 22:24:50 ERROR scheduler.TaskSchedulerImpl: Lost executor 3 on slave1: remote Akka client disassociated
19/03/31 22:24:54 ERROR scheduler.TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job
.......other useless logs......
Exception in thread "main" 19/03/31 22:24:54 ERROR cluster.SparkDeploySchedulerBackend: Asked to remove non-existent executor 4
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, slave4): ExecutorLostFailure (executor 4 lost)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1325)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1314)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1313)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1313)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:714)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:714)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:714)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1526)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1487)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
The worker node error log is below:
19/03/31 22:25:11 INFO worker.Worker: Asked to launch executor app-20190331222434-0000/2 for JavaWordCount
19/03/31 22:25:19 INFO worker.Worker: Executor app-20190331222434-0000/2 finished with state EXITED message Command exited with code 50 exitStatus 50
19/03/31 22:25:19 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor#slave4:37919] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
19/03/31 22:25:19 INFO actor.LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkWorker/deadLetters] to Actor[akka://sparkWorker/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkWorker%4010.0.2.27%3A35254-2#299045174] was not delivered. [1] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
19/03/31 22:25:19 INFO worker.Worker: Asked to launch executor app-20190331222434-0000/4 for JavaWordCount
19/03/31 22:25:19 INFO worker.ExecutorRunner: Launch command: "/usr/local/java/jdk1.8.0_101/bin/java" "-cp" "/home/zxd/monotask_jdk1.7/spark-1.3.1-SNAPSHOT-bin-2.6.2/sbin/../conf:/home/zxd/monotask_jdk1.7/spark-1.3.1-SNAPSHOT-bin-2.6.2/lib/spark-assembly-1.3.1-SNAPSHOT-hadoop2.6.2.jar:/home/zxd/hadoop/hadoop-2.6.2/etc/hadoop:/home/zxd/monotask_jdk1.7/spark-1.3.1-SNAPSHOT-bin-2.6.2/sbin/../conf:/home/zxd/monotask_jdk1.7/spark-1.3.1-SNAPSHOT-bin-2.6.2/lib/spark-assembly-1.3.1-SNAPSHOT-hadoop2.6.2.jar:/home/zxd/hadoop/hadoop-2.6.2/etc/hadoop" "-Dspark.driver.port=42211" "-Xms512M" "-Xmx512M" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "akka.tcp://sparkDriver#master:42211/user/CoarseGrainedScheduler" "--executor-id" "4" "--hostname" "slave4" "--cores" "4" "--app-id" "app-20190331222434-0000" "--worker-url" "akka.tcp://sparkWorker#slave4:55970/user/Worker"
19/03/31 22:25:32 INFO worker.Worker: Executor app-20190331222434-0000/4 finished with state EXITED message Command exited with code 50 exitStatus 50
19/03/31 22:25:32 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor#slave4:60559] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
19/03/31 22:25:32 INFO actor.LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkWorker/deadLetters] to Actor[akka://sparkWorker/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkWorker%4010.0.2.27%3A35260-3#479615849] was not delivered. [2] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
19/03/31 22:25:32 INFO worker.Worker: Asked to launch executor app-20190331222434-0000/7 for JavaWordCount
19/03/31 22:25:32 INFO worker.ExecutorRunner: Launch command: "/usr/local/java/jdk1.8.0_101/bin/java" "-cp" "/home/zxd/monotask_jdk1.7/spark-1.3.1-SNAPSHOT-bin-2.6.2/sbin/../conf:/home/zxd/monotask_jdk1.7/spark-1.3.1-SNAPSHOT-bin-2.6.2/lib/spark-assembly-1.3.1-SNAPSHOT-hadoop2.6.2.jar:/home/zxd/hadoop/hadoop-2.6.2/etc/hadoop:/home/zxd/monotask_jdk1.7/spark-1.3.1-SNAPSHOT-bin-2.6.2/sbin/../conf:/home/zxd/monotask_jdk1.7/spark-1.3.1-SNAPSHOT-bin-2.6.2/lib/spark-assembly-1.3.1-SNAPSHOT-hadoop2.6.2.jar:/home/zxd/hadoop/hadoop-2.6.2/etc/hadoop" "-Dspark.driver.port=42211" "-Xms512M" "-Xmx512M" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "akka.tcp://sparkDriver#master:42211/user/CoarseGrainedScheduler" "--executor-id" "7" "--hostname" "slave4" "--cores" "4" "--app-id" "app-20190331222434-0000" "--worker-url" "akka.tcp://sparkWorker#slave4:55970/user/Worker"
19/03/31 22:25:32 INFO worker.Worker: Asked to kill executor app-20190331222434-0000/7
19/03/31 22:25:32 INFO worker.ExecutorRunner: Runner thread for executor app-20190331222434-0000/7 interrupted
19/03/31 22:25:32 INFO worker.ExecutorRunner: Killing process!
19/03/31 22:25:32 INFO worker.Worker: Executor app-20190331222434-0000/7 finished with state KILLED exitStatus 143
19/03/31 22:25:32 INFO worker.Worker: Cleaning up local directories for application app-20190331222434-0000
Are these errors related to the Hadoop version? Maybe I used the wrong Hadoop or JDK version to build Spark.
I hope someone can give me some suggestions. Thanks.
I found this error in the executor log:
java.lang.UnsupportedOperationException: Datanode-side support for getVolumeBlockLocations() must also be enabled in the client configuration.
I set dfs.datanode.hdfs-blocks-metadata.enabled to true in hadoop-site.xml and restarted the Hadoop cluster. That fixed it for me.
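For reference, this is roughly what that property looks like in the Hadoop configuration file (a sketch; the property name comes from the error above, and where it lives depends on your installation):
<property>
  <name>dfs.datanode.hdfs-blocks-metadata.enabled</name>
  <value>true</value>
</property>
The exception itself says the flag must also be enabled in the client configuration, so make sure the configuration the Spark executors read contains it as well.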
The executor error logs are under the work directory:
cd $SPARK_HOME/work/appxxxx/xx (xx is a number)
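For example, something like the following shows the most recent executor stderr on a worker (the app and executor ids are placeholders for whatever directories actually appear under work/ on your node):
cd $SPARK_HOME/work
ls
tail -n 100 app-xxxx/x/stderr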

Why am I getting "Removing worker because we got no heartbeat in 60 seconds" on Spark master

I think I might have stumbled across a bug and wanted to get other people's input. I am running a PySpark application using Spark 2.2.0 in standalone mode. I am doing a somewhat heavy transformation in Python inside a flatMap, and the driver keeps killing the workers.
Here is what I am seeing:
After 60 seconds without any heartbeat message from the workers, the master prints this message to its log:
Removing worker [worker name] because we got no heartbeat in 60 seconds
Removing worker [worker name] on [IP]:[port]
Telling app of lost executor: [executor number]
I then see in the driver log the following message:
Lost executor [executor number] on [executor IP]: worker lost
The worker then terminates and I see this message in its log:
Driver commanded a shutdown
I have looked at the Spark source code, and from what I can tell, as long as the executor is alive it should keep sending heartbeat messages, since it uses a ThreadUtils.newDaemonSingleThreadScheduledExecutor to do this.
One other thing I noticed while running top on one of the workers is that the executor JVM seems to be suspended throughout this process. There are as many Python processes as I specified in the SPARK_WORKER_CORES env variable, and each is consuming close to 100% of a CPU.
Anyone have any thoughts on this?
I was facing this same issue; increasing the timeout and heartbeat intervals worked (see the configuration after the logs below).
Excerpt from the start-all.sh logs:
INFO Utils: Successfully started service 'sparkMaster' on port 7077.
INFO Master: Starting Spark master at spark://master:7077
INFO Master: Running Spark version 3.0.1
INFO Utils: Successfully started service 'MasterUI' on port 8080.
INFO MasterWebUI: Bound MasterWebUI to 0.0.0.0, and started at http://master:8080
INFO Master: I have been elected leader! New state: ALIVE
INFO Master: Registering worker slave01:41191 with 16 cores, 15.7 GiB RAM
INFO Master: Registering worker slave02:37853 with 16 cores, 15.7 GiB RAM
WARN Master: Removing worker-20210618205117-slave01-41191 because we got no heartbeat in 60 seconds
INFO Master: Removing worker worker-20210618205117-slave01-41191 on slave01:41191
INFO Master: Telling app of lost worker: worker-20210618205117-slave01-41191
WARN Master: Removing worker-20210618204723-slave02-37853 because we got no heartbeat in 60 seconds
INFO Master: Removing worker worker-20210618204723-slave02-37853 on slave02:37853
INFO Master: Telling app of lost worker: worker-20210618204723-slave02-37853
WARN Master: Got heartbeat from unregistered worker worker-20210618205117-slave01-41191. This worker was never registered, so ignoring the heartbeat.
WARN Master: Got heartbeat from unregistered worker worker-20210618204723-slave02-37853. This worker was never registered, so ignoring the heartbeat.
Solution: add the following configs to $SPARK_HOME/conf/spark-defaults.conf:
spark.network.timeout 50000
spark.executor.heartbeatInterval 5000
spark.worker.timeout 5000
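The first two settings can also be passed per application instead of cluster-wide, for example (same values as above, command trimmed to the relevant flags):
spark-submit \
  --conf spark.network.timeout=50000 \
  --conf spark.executor.heartbeatInterval=5000 \
  ...
spark.worker.timeout, on the other hand, is read by the standalone master and worker daemons, so as far as I can tell it has to live in spark-defaults.conf (or SPARK_MASTER_OPTS) on those hosts and needs a restart of the daemons to take effect.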

Mesos Future discarded

I am trying to run a Spark job via Mesos, and it throws an exception:
WARN MesosExternalShuffleClient: Unable to register app 6c1b7274-960f-47ef-9fa7-1dd06b05d4f1-0010 with external shuffle service. Please manually remove shuffle data after driver exit
Error: java.lang.RuntimeException: java.lang.UnsupportedOperationException: Unexpected message: org.apache.spark.network.shuffle.protocol.mesos.RegisterDriver#88d86a24
and the stderr logs show:
ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
INFO DiskBlockManager: Shutdown hook called
E0911 05:32:34.711486 6619 process.cpp:951] Failed to accept socket: future discarded
In spark-defaults.conf I have:
spark.mesos.coarse true
spark.network.timeout 3600s
spark.shuffle.io.connectionTimeout 3600s
Who is killing my application?

Spark Shell Yarn Client Mode - Akka AssociationError

When I launch Spark Shell using:
spark-shell --master yarn --deploy-mode client
I'm getting the following error:
16/03/21 20:52:29 ERROR ErrorMonitor: AssociationError [akka.tcp://sparkDriver#ipaddress10:47915] -> [akka.tcp://sparkExecutor#hostname02:48703]: Error [Association failed with [akka.tcp://sparkExecutor#hostname02:48703]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkExecutor#hostname02:48703]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: No route to host
]
akka.event.Logging$Error$NoCause$
16/03/21 20:52:29 ERROR ErrorMonitor: AssociationError [akka.tcp://sparkDriver#ipaddress10:47915] -> [akka.tcp://sparkExecutor#hostname02:48703]: Error [Association failed with [akka.tcp://sparkExecutor#hostname02:48703]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkExecutor#hostname02:48703]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: No route to host
]
akka.event.Logging$Error$NoCause$
16/03/21 20:52:32 ERROR YarnScheduler: Lost executor 3 on hostname01: remote Rpc client disassociated
16/03/21 20:52:32 INFO DAGScheduler: Executor lost: 3 (epoch 0)
16/03/21 20:52:32 INFO BlockManagerMasterEndpoint: Trying to remove executor 3 from BlockManagerMaster.
16/03/21 20:52:32 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(3, hostname01, 37497)
16/03/21 20:52:32 INFO BlockManagerMaster: Removed 3 successfully in removeExecutor
16/03/21 20:52:32 INFO ExecutorAllocationManager: Existing executor 3 has been removed (new total is 0)
The firewall and iptables are turned off, and the machines in the cluster can all reach each other on every port.
But I'm puzzled why I'm still getting "akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: No route to host".
Any help, please.
You probably have a name resolution issue. Try using IP addresses in your settings (for instance in the slaves file) rather than hostnames to confirm this hypothesis.
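For instance, $SPARK_HOME/conf/slaves could list raw worker IPs instead of hostnames (these addresses are placeholders, not values from the question):
10.0.0.11
10.0.0.12
10.0.0.13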
I have experienced the same problem before. I found that I had mistyped some environment variables, namely SPARK_LOCAL_IP and SPARK_PUBLIC_DNS.
To resolve your problem, you have to:
On all your NodeManager nodes, check in the .bashrc and .bash_profile files that you have set the environment variables SPARK_LOCAL_IP and SPARK_PUBLIC_DNS to the right values, then restart your NodeManager(s).
On your client machine (where you issue the spark-shell command), set the same environment variables to your client machine's IP and hostname.
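As a sketch, the relevant lines in spark-env.sh or .bashrc might look like this (the IP and hostname are placeholders; use each machine's own values):
export SPARK_LOCAL_IP=192.168.1.10
export SPARK_PUBLIC_DNS=node01.example.com
After changing them, restart the affected daemons so the new values are picked up.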

Datastax Spark Jobs Killed for No Reason

We are using DSE Spark with a 3-node cluster running 5 jobs. We are seeing SIGTERM signals arrive in /var/log/spark/worker/worker-0/worker.log, which stops our jobs. We are not seeing any corresponding memory or processor pressure during these times, and no one issued these signals manually.
I've seen a couple of similar issues that came down to heap-size problems with YARN or Mesos, but since we are using DSE, those didn't seem relevant.
Below is a sample of the log info from 1 server which was running 2 of the jobs:
ERROR [SIGTERM handler] 2016-03-26 00:43:28,780 SignalLogger.scala:57 - RECEIVED SIGNAL 15: SIGTERM
ERROR [SIGHUP handler] 2016-03-26 00:43:28,788 SignalLogger.scala:57 - RECEIVED SIGNAL 1: SIGHUP
INFO [Spark Shutdown Hook] 2016-03-26 00:43:28,795 Logging.scala:59 - Killing process!
ERROR [File appending thread for /var/lib/spark/worker/worker-0/app-20160325131848-0001/0/stderr] 2016-03-26 00:43:28,848 Logging.scala:96 - Error writing stream to file /var/lib/spark/worker/worker-0/app-20160325131848-0001/0/stderr
java.io.IOException: Stream closed
at java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:170) ~[na:1.8.0_71]
at java.io.BufferedInputStream.read1(BufferedInputStream.java:283) ~[na:1.8.0_71]
at java.io.BufferedInputStream.read(BufferedInputStream.java:345) ~[na:1.8.0_71]
at java.io.FilterInputStream.read(FilterInputStream.java:107) ~[na:1.8.0_71]
at org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:70) ~[spark-core_2.10-1.4.1.3.jar:1.4.1.3]
at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39) [spark-core_2.10-1.4.1.3.jar:1.4.1.3]
at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39) [spark-core_2.10-1.4.1.3.jar:1.4.1.3]
at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39) [spark-core_2.10-1.4.1.3.jar:1.4.1.3]
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1772) [spark-core_2.10-1.4.1.3.jar:1.4.1.3]
at org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38) [spark-core_2.10-1.4.1.3.jar:1.4.1.3]
ERROR [File appending thread for /var/lib/spark/worker/worker-0/app-20160325131848-0001/0/stdout] 2016-03-26 00:43:28,892 Logging.scala:96 - Error writing stream to file /var/lib/spark/worker/worker-0/app-20160325131848-0001/0/stdout
java.io.IOException: Stream closed
at java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:170) ~[na:1.8.0_71]
at java.io.BufferedInputStream.read1(BufferedInputStream.java:283) ~[na:1.8.0_71]
at java.io.BufferedInputStream.read(BufferedInputStream.java:345) ~[na:1.8.0_71]
at java.io.FilterInputStream.read(FilterInputStream.java:107) ~[na:1.8.0_71]
at org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:70) ~[spark-core_2.10-1.4.1.3.jar:1.4.1.3]
at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39) [spark-core_2.10-1.4.1.3.jar:1.4.1.3]
at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39) [spark-core_2.10-1.4.1.3.jar:1.4.1.3]
at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39) [spark-core_2.10-1.4.1.3.jar:1.4.1.3]
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1772) [spark-core_2.10-1.4.1.3.jar:1.4.1.3]
at org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38) [spark-core_2.10-1.4.1.3.jar:1.4.1.3]
ERROR [SIGTERM handler] 2016-03-26 00:43:29,070 SignalLogger.scala:57 - RECEIVED SIGNAL 15: SIGTERM
INFO [sparkWorker-akka.actor.default-dispatcher-3] 2016-03-26 00:43:29,079 Logging.scala:59 - Disassociated [akka.tcp://sparkWorker#10.0.1.7:44131] -> [akka.tcp://sparkMaster#10.0.1.7:7077] Disassociated !
ERROR [sparkWorker-akka.actor.default-dispatcher-3] 2016-03-26 00:43:29,080 Logging.scala:75 - Connection to master failed! Waiting for master to reconnect...
INFO [sparkWorker-akka.actor.default-dispatcher-3] 2016-03-26 00:43:29,081 Logging.scala:59 - Connecting to master akka.tcp://sparkMaster#10.0.1.7:7077/user/Master...
WARN [sparkWorker-akka.actor.default-dispatcher-3] 2016-03-26 00:43:29,091 Slf4jLogger.scala:71 - Association with remote system [akka.tcp://sparkMaster#10.0.1.7:7077] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
INFO [sparkWorker-akka.actor.default-dispatcher-3] 2016-03-26 00:43:29,101 Logging.scala:59 - Disassociated [akka.tcp://sparkWorker#10.0.1.7:44131] -> [akka.tcp://sparkMaster#10.0.1.7:7077] Disassociated !
ERROR [sparkWorker-akka.actor.default-dispatcher-3] 2016-03-26 00:43:29,102 Logging.scala:75 - Connection to master failed! Waiting for master to reconnect...
INFO [sparkWorker-akka.actor.default-dispatcher-3] 2016-03-26 00:43:29,102 Logging.scala:59 - Not spawning another attempt to register with the master, since there is an attempt scheduled already.
WARN [sparkWorker-akka.actor.default-dispatcher-4] 2016-03-26 00:43:29,323 Slf4jLogger.scala:71 - Association with remote system [akka.tcp://sparkExecutor#10.0.1.7:49943] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
INFO [sparkWorker-akka.actor.default-dispatcher-3] 2016-03-26 00:43:29,330 Logging.scala:59 - Executor app-20160325132151-0004/0 finished with state EXITED message Command exited with code 129 exitStatus 129
INFO [Spark Shutdown Hook] 2016-03-26 00:43:29,414 Logging.scala:59 - Killing process!
INFO [sparkWorker-akka.actor.default-dispatcher-3] 2016-03-26 00:43:29,415 Logging.scala:59 - Executor app-20160325131848-0001/0 finished with state EXITED message Command exited with code 129 exitStatus 129
INFO [Spark Shutdown Hook] 2016-03-26 00:43:29,417 Logging.scala:59 - Killing process!
INFO [sparkWorker-akka.actor.default-dispatcher-3] 2016-03-26 00:43:29,422 Logging.scala:59 - Unknown Executor app-20160325132151-0004/0 finished with state EXITED message Worker shutting down exitStatus 129
WARN [sparkWorker-akka.actor.default-dispatcher-4] 2016-03-26 00:43:29,425 Slf4jLogger.scala:71 - Association with remote system [akka.tcp://sparkExecutor#10.0.1.7:32874] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
WARN [sparkWorker-akka.actor.default-dispatcher-4] 2016-03-26 00:43:29,433 Slf4jLogger.scala:71 - Association with remote system [akka.tcp://sparkExecutor#10.0.1.7:56212] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
INFO [sparkWorker-akka.actor.default-dispatcher-3] 2016-03-26 00:43:29,441 Logging.scala:59 - Executor app-20160325131918-0002/1 finished with state EXITED message Command exited with code 129 exitStatus 129
INFO [sparkWorker-akka.actor.default-dispatcher-4] 2016-03-26 00:43:29,448 Logging.scala:59 - Unknown Executor app-20160325131918-0002/1 finished with state EXITED message Worker shutting down exitStatus 129
INFO [Spark Shutdown Hook] 2016-03-26 00:43:29,448 Logging.scala:59 - Shutdown hook called
INFO [Spark Shutdown Hook] 2016-03-26 00:43:29,449 Logging.scala:59 - Deleting directory /var/lib/spark/rdd/spark-28fa2f73-d2aa-44c0-ad4e-3ccfd07a95d2
The error seems straightforward to me:
Error writing stream to file /var/lib/spark/worker/worker-0/app-20160325131848-0001/0/stdout java.io.IOException: Stream closed at java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:170)
Either there is a network issue at play here, between your data source (Cassandra) and Spark. Remember that Spark on node1 can and will pull data from node2 of Cassandra, even though it tries to minimize that.
Or your serialization is having issues. In that case, add this parameter to your Spark configuration to switch to Kryo:
spark.serializer=org.apache.spark.serializer.KryoSerializer
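If you would rather set the serializer in code than in spark-defaults.conf, a minimal sketch looks like this (Kryo here is this answer's suggestion, not something the logs prove; the application name is a placeholder):
import org.apache.spark.{SparkConf, SparkContext}

// Build a configuration that switches the serializer to Kryo
val conf = new SparkConf()
  .setAppName("MyApp") // placeholder application name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

val sc = new SparkContext(conf)
You can also register your frequently shuffled classes with conf.registerKryoClasses, although whether serialization is really the cause here is only a hypothesis.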
