Spark OutOfMemoryError - apache-spark

I am getting an OutOfMemoryError when I submit a Spark job that sends a message (675 bytes) to Kafka. The message is delivered to Kafka successfully; the error only shows up when the executors are about to shut down.
Diagnostics: Failing this attempt. Failing the application.
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
start time: 1441611385047
final status: FAILED
Here are the YARN logs:
(1):
INFO cluster.YarnClusterSchedulerBackend: Asking each executor to shut down
WARN thread.QueuedThreadPool: 7 threads could not be stopped
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "sparkDriver-12"
Exception in thread "Thread-3"
(2):
Exception in thread "shuffle-client-4" Exception in thread "shuffle-server-7"
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "shuffle-client-4"
(3):
INFO cluster.YarnClusterSchedulerBackend: Asking each executor to shut down
Exception in thread "LeaseRenewer:user#dom"
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "LeaseRenewer:user#dom"
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "sparkDriver-akka.actor.default-dispatcher-16"
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "sparkDriver-akka.remote.default-remote-dispatcher-6"
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "sparkDriver-akka.remote.default-remote-dispatcher-5"
Exception in thread "Thread-3"
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "Thread-3"
On rare occasions the application shows as SUCCEEDED, but the YARN logs still contain the OutOfMemoryError:
INFO cluster.YarnClusterSchedulerBackend: Asking each executor to shut down
INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorActor: OutputCommitCoordinator stopped!
INFO spark.MapOutputTrackerMasterActor: MapOutputTrackerActor stopped!
INFO storage.MemoryStore: MemoryStore cleared
INFO storage.BlockManager: BlockManager stopped
INFO storage.BlockManagerMaster: BlockManagerMaster stopped
INFO spark.SparkContext: Successfully stopped SparkContext
INFO remote.RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
INFO yarn.ApplicationMaster: Unregistering ApplicationMaster with SUCCEEDED
INFO remote.RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
Exception in thread "Thread-3"
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "Thread-3"

Have you tried increasing MaxPermSize like this?
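A minimal sketch of what that could look like (not the original snippet; these are standard Spark properties, and MaxPermSize only applies on Java 7 and earlier, where the PermGen region exists):
spark-submit \
  --conf spark.driver.extraJavaOptions=-XX:MaxPermSize=512m \
  --conf spark.executor.extraJavaOptions=-XX:MaxPermSize=512m \
  ... rest of your submit command ...
On Java 8 there is no PermGen any more, so a persistent OutOfMemoryError there usually points at the driver/executor heap (spark.driver.memory / spark.executor.memory) instead.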

Related

ERROR yarn.ApplicationMaster: Uncaught exception: java.util.concurrent.TimeoutException: Futures timed out after 100000 milliseconds [duplicate]

Marked as a duplicate of: Why does join fail with "java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]"?
I have this problem in my Spark application (Spark 1.6, Scala 2.10):
17/10/23 14:32:15 ERROR yarn.ApplicationMaster: Uncaught exception:
java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:342)
at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:197)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:680)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:69)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:68)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:68)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:678)
at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
17/10/23 14:32:15 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 10, (reason: Uncaught exception: java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds])
17/10/23 14:32:15 INFO spark.SparkContext: Invoking stop() from shutdown hook
17/10/23 14:32:15 INFO ui.SparkUI: Stopped Spark web UI at http://180.21.232.30:43576
17/10/23 14:32:15 INFO scheduler.DAGScheduler: ShuffleMapStage 27 (show at Linkage.scala:282) failed in 24.519 s due to Stage cancelled because SparkContext was shut down
17/10/23 14:32:15 ... SparkListenerJobEnd(18,1508761935656,JobFailed(org.apache.spark.SparkException: Job 18 cancelled because SparkContext was shut down))
17/10/23 14:32:15 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
17/10/23 14:32:15 INFO storage.MemoryStore: MemoryStore cleared
17/10/23 14:32:15 INFO storage.BlockManager: BlockManager stopped
17/10/23 14:32:15 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
17/10/23 14:32:15 INFO remote.RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
17/10/23 14:32:15 INFO util.ShutdownHookManager: Shutdown hook called
I read articles about this problem and tried modifying the following parameters, without result:
--conf spark.yarn.am.waitTime=6000s
--conf spark.sql.broadcastTimeout=6000
--conf spark.network.timeout=600
Best regards
Please remove the setMaster("local") call from the code; on EMR, Spark uses the YARN cluster manager by default.
If you are trying to run your Spark job on YARN in client or cluster mode, don't forget to remove the master configuration .master("local[n]") from your code.
To submit a Spark job on YARN, you need to pass --master yarn --deploy-mode cluster (or client).
Having the master set to local was what caused the repeated timeout exception (see the sketch below).
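A minimal sketch of both pieces (not the asker's actual code; the app name, class, and jar names are hypothetical):
val conf = new SparkConf().setAppName("MyApp")   // no .setMaster("local[n]") here
val sc = new SparkContext(conf)
and then submit with:
spark-submit --master yarn --deploy-mode cluster --class com.example.MyApp my-app.jar
The master is decided at submit time, so the same jar can run locally or on YARN without code changes.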

ERROR TransportClientFactory: Exception while bootstrapping client

I have a PySpark job which works successfully on a small cluster, but starts to get a lot of the following errors within the first few minutes after it starts up. Any idea how I can solve this? This is with PySpark 2.2.0 on Mesos.
17/09/29 18:54:26 INFO Executor: Running task 5717.0 in stage 0.0 (TID 5717)
17/09/29 18:54:26 INFO CoarseGrainedExecutorBackend: Got assigned task 5813
17/09/29 18:54:26 INFO Executor: Running task 5813.0 in stage 0.0 (TID 5813)
17/09/29 18:54:26 INFO CoarseGrainedExecutorBackend: Got assigned task 5909
17/09/29 18:54:26 INFO Executor: Running task 5909.0 in stage 0.0 (TID 5909)
17/09/29 18:54:56 ERROR TransportClientFactory: Exception while bootstrapping client after 30001 ms
java.lang.RuntimeException: java.util.concurrent.TimeoutException: Timeout waiting for task.
at org.spark_project.guava.base.Throwables.propagate(Throwables.java:160)
at org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:275)
at org.apache.spark.network.sasl.SaslClientBootstrap.doBootstrap(SaslClientBootstrap.java:70)
at org.apache.spark.network.crypto.AuthClientBootstrap.doSaslAuth(AuthClientBootstrap.java:117)
at org.apache.spark.network.crypto.AuthClientBootstrap.doBootstrap(AuthClientBootstrap.java:76)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:244)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182)
at org.apache.spark.rpc.netty.NettyRpcEnv.downloadClient(NettyRpcEnv.scala:366)
at org.apache.spark.rpc.netty.NettyRpcEnv.openChannel(NettyRpcEnv.scala:332)
at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:654)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:467)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$3.apply(Executor.scala:684)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$3.apply(Executor.scala:681)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:681)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:308)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.TimeoutException: Timeout waiting for task.
at org.spark_project.guava.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:276)
at org.spark_project.guava.util.concurrent.AbstractFuture.get(AbstractFuture.java:96)
at org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:271)
... 23 more
17/09/29 18:54:56 ERROR TransportResponseHandler: Still have 1 requests outstanding when connection from /10.0.1.25:37314 is closed
17/09/29 18:54:56 INFO Executor: Fetching spark://10.0.1.25:37314/files/djinn.spark.zip with timestamp 1506711239350

Spark UI's kill is not killing Driver

I am trying to kill my Spark-Kafka streaming job from the Spark UI. It kills the application, but the driver keeps running.
Can anyone help me with this? My other streaming jobs are fine; only this one streaming job gives me this problem every time.
I can't kill the driver through the command line or the Spark UI. The Spark Master is alive.
The output I collected from the logs is:
16/10/25 03:14:25 INFO BlockManagerMaster: Removed 0 successfully in removeExecutor
16/10/25 03:14:25 INFO SparkUI: Stopped Spark web UI at http://***:4040
16/10/25 03:14:25 INFO SparkDeploySchedulerBackend: Shutting down all executors
16/10/25 03:14:25 INFO SparkDeploySchedulerBackend: Asking each executor to shut down
16/10/25 03:14:35 INFO AppClient: Stop request to Master timed out; it may already be shut down.
16/10/25 03:14:35 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/10/25 03:14:35 INFO MemoryStore: MemoryStore cleared
16/10/25 03:14:35 INFO BlockManager: BlockManager stopped
16/10/25 03:14:35 INFO BlockManagerMaster: BlockManagerMaster stopped
16/10/25 03:14:35 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/10/25 03:14:35 INFO SparkContext: Successfully stopped SparkContext
16/10/25 03:14:35 ERROR Inbox: Ignoring error
org.apache.spark.SparkException: Exiting due to error from cluster scheduler: Master removed our application: KILLED
at org.apache.spark.scheduler.TaskSchedulerImpl.error(TaskSchedulerImpl.scala:438)
at org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend.dead(SparkDeploySchedulerBackend.scala:124)
at org.apache.spark.deploy.client.AppClient$ClientEndpoint.markDead(AppClient.scala:264)
at org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anonfun$receive$1.applyOrElse(AppClient.scala:172)
at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
16/10/25 03:14:35 WARN NettyRpcEnv: Ignored message: true
16/10/25 03:14:35 WARN AppClient$ClientEndpoint: Connection to master:7077 failed; waiting for master to reconnect...
16/10/25 03:14:35 WARN AppClient$ClientEndpoint: Connection to master:7077 failed; waiting for master to reconnect...
Get the running driverId from the Spark UI, then POST to the kill endpoint on the Spark master's REST port (6066 by default) to kill the pipeline. I have tested this with Spark 1.6.1:
curl -X POST http://localhost:6066/v1/submissions/kill/driverId
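For example, if the Spark UI shows the driver as driver-20161025031425-0001 (a hypothetical ID), the call would be:
curl -X POST http://<spark-master-host>:6066/v1/submissions/kill/driver-20161025031425-0001
This goes through the standalone master's REST submission server, so it applies to drivers that were submitted in cluster deploy mode.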
Hope it helps...

Spark: Error writing stream to file /XXX/stderr java.io.IOException: Stream closed

My code is as follows:
import org.apache.spark.{SparkConf, SparkContext}

object WordCount extends App {
  val conf = new SparkConf().setAppName("WordCount").setMaster("spark://sparkmaster:7077")
  val sc = new SparkContext(conf)
  // ship the application jar to the executors
  sc.addJar("/home/found/FromWindows/testsSpark/out/artifacts/untitled_jar/untitled.jar")
  val file = sc.textFile("hdfs://192.168.1.101:9000/Texts.txt")
  val count = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
  val res = count.collect()
  res.foreach(println(_))
}
When I run it in local mode, everything is OK. However, when running it on the cluster, it crashes. I get the following error messages:
1. Error message from the IntelliJ console:
16/04/24 19:17:34 WARN TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1, 192.168.1.101): java.lang.ClassNotFoundException: org.klordy.test.WordCount$$anonfun$2
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
...
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
16/04/24 19:17:34 INFO TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0) on executor 192.168.1.101: java.lang.ClassNotFoundException (org.klordy.test.WordCount$$anonfun$2) [duplicate 1]
16/04/24 19:17:34 INFO TaskSetManager: Starting task 0.1 in stage 0.0 (TID 2, 192.168.1.101, ANY, 2133 bytes)
16/04/24 19:17:34 INFO TaskSetManager: Starting task 1.1 in stage 0.0 (TID 3, 192.168.1.101, ANY, 2133 bytes)
16/04/24 19:17:34 INFO TaskSetManager: Lost task 1.1 in stage 0.0 (TID 3) on executor 192.168.1.101: java.lang.ClassNotFoundException (org.klordy.test.WordCount$$anonfun$2) [duplicate 2]
16/04/24 19:17:34 INFO TaskSetManager: Starting task 1.2 in stage 0.0 (TID 4, 192.168.1.101, ANY, 2133 bytes)
16/04/24 19:17:34 ERROR TaskSetManager: Task 1 in stage 0.0 failed 4 times; aborting job
16/04/24 19:17:34 INFO TaskSchedulerImpl: Cancelling stage 0
16/04/24 19:17:34 INFO TaskSchedulerImpl: Stage 0 was cancelled
16/04/24 19:17:34 INFO DAGScheduler: ShuffleMapStage 0 (map at WordCount.scala:17) failed in 2.904 s
16/04/24 19:17:34 INFO DAGScheduler: Job 0 failed: collect at WordCount.scala:18, took 3.103414 s
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 6, 192.168.1.101): java.lang.ClassNotFoundException: org.klordy.test.WordCount$$anonfun$2
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
...
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)
Caused by: java.lang.ClassNotFoundException: org.klordy.test.WordCount$$anonfun$2
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
...
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
16/04/24 19:17:34 INFO SparkContext: Invoking stop() from shutdown hook
16/04/24 19:17:34 INFO TaskSetManager: Lost task 0.3 in stage 0.0 (TID 7) on executor 192.168.1.101: java.lang.ClassNotFoundException (org.klordy.test.WordCount$$anonfun$2) [duplicate 7]
16/04/24 19:17:34 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/04/24 19:17:34 INFO SparkDeploySchedulerBackend: Registered executor: AkkaRpcEndpointRef(Actor[akka.tcp://sparkExecutor#192.168.1.117:45350/user/Executor#2034500302]) with ID 0
16/04/24 19:17:34 INFO SparkUI: Stopped Spark web UI at http://192.168.1.101:4040
16/04/24 19:17:34 INFO DAGScheduler: Stopping DAGScheduler
16/04/24 19:17:34 INFO SparkDeploySchedulerBackend: Shutting down all executors
16/04/24 19:17:34 INFO SparkDeploySchedulerBackend: Asking each executor to shut down
16/04/24 19:17:34 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
2. Message from the Spark logs:
16/04/24 19:17:27 INFO Worker: Asked to launch executor app-20160424191727-0005/1 for WordCount
16/04/24 19:17:27 DEBUG AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: [actor] handled message (1.327376 ms) AkkaMessage(LaunchExecutor(spark://sparkworker:7077,app-20160424191727-0005,1,ApplicationDescription(WordCount),2,1024),false) from Actor[akka://sparkWorker/deadLetters]
16/04/24 19:17:27 INFO SecurityManager: Changing view acls to: root
16/04/24 19:17:27 INFO SecurityManager: Changing modify acls to: root
16/04/24 19:17:27 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
16/04/24 19:17:27 DEBUG SSLOptions: No SSL protocol specified
16/04/24 19:17:27 DEBUG SSLOptions: No SSL protocol specified
16/04/24 19:17:27 DEBUG SSLOptions: No SSL protocol specified
16/04/24 19:17:27 DEBUG SecurityManager: SSLConfiguration for file server: SSLOptions{enabled=false, keyStore=None, keyStorePassword=None, trustStore=None, trustStorePassword=None, protocol=None, enabledAlgorithms=Set()}
16/04/24 19:17:27 DEBUG SecurityManager: SSLConfiguration for Akka: SSLOptions{enabled=false, keyStore=None, keyStorePassword=None, trustStore=None, trustStorePassword=None, protocol=None, enabledAlgorithms=Set()}
16/04/24 19:17:27 INFO ExecutorRunner: Launch command: "/usr/java/jdk1.7.0_51/bin/java" "-cp" "/usr/local/Spark/spark-1.5.1-bin-hadoop2.4/sbin/../conf/:/usr/local/Spark/spark-1.5.1-bin-hadoop2.4/lib/spark-assembly-1.5.1-hadoop2.4.0.jar:/usr/local/Spark/spark-1.5.1-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar:/usr/local/Spark/spark-1.5.1-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar:/usr/local/Spark/spark-1.5.1-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar:/usr/local/Spark/hadoop-2.5.2/etc/hadoop/" "-Xms1024M" "-Xmx1024M" "-Dspark.driver.port=49473" "-XX:MaxPermSize=256m" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "akka.tcp://sparkDriver#192.168.1.101:49473/user/CoarseGrainedScheduler" "--executor-id" "1" "--hostname" "192.168.1.101" "--cores" "2" "--app-id" "app-20160424191727-0005" "--worker-url" "akka.tcp://sparkWorker#192.168.1.101:49838/user/Worker"
16/04/24 19:17:28 DEBUG FileAppender: Started appending thread
16/04/24 19:17:28 DEBUG FileAppender: Opened file /usr/local/Spark/spark-1.5.1-bin-hadoop2.4/work/app-20160424191727-0005/1/stdout
16/04/24 19:17:28 DEBUG FileAppender: Started appending thread
16/04/24 19:17:28 DEBUG FileAppender: Opened file /usr/local/Spark/spark-1.5.1-bin-hadoop2.4/work/app-20160424191727-0005/1/stderr
16/04/24 19:17:34 DEBUG AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: Received RPC message: AkkaMessage(KillExecutor(spark://sparkworker:7077,app-20160424191727-0005,1),false)
16/04/24 19:17:34 INFO Worker: Asked to kill executor app-20160424191727-0005/1
16/04/24 19:17:34 DEBUG AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: [actor] handled message (0.172782 ms) AkkaMessage(KillExecutor(spark://sparkworker:7077,app-20160424191727-0005,1),false) from Actor[akka://sparkWorker/deadLetters]
16/04/24 19:17:34 INFO ExecutorRunner: Runner thread for executor app-20160424191727-0005/1 interrupted
16/04/24 19:17:34 INFO ExecutorRunner: Killing process!
16/04/24 19:17:34 ERROR FileAppender: Error writing stream to file /usr/local/Spark/spark-1.5.1-bin-hadoop2.4/work/app-20160424191727-0005/1/stderr
java.io.IOException: Stream closed
at java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:162)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:272)
at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:70)
at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39)
at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
at org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38)
16/04/24 19:17:34 DEBUG FileAppender: Closed file /usr/local/Spark/spark-1.5.1-bin-hadoop2.4/work/app-20160424191727-0005/1/stderr
16/04/24 19:17:34 DEBUG AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: [actor] received message AkkaMessage(ApplicationFinished(app-20160424191727-0005),false) from Actor[akka://sparkWorker/deadLetters]
16/04/24 19:17:34 DEBUG AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: Received RPC message: AkkaMessage(ApplicationFinished(app-20160424191727-0005),false)
16/04/24 19:17:34 DEBUG AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: [actor] handled message (0.186628 ms) AkkaMessage(ApplicationFinished(app-20160424191727-0005),false) from Actor[akka://sparkWorker/deadLetters]
16/04/24 19:17:35 DEBUG FileAppender: Closed file /usr/local/Spark/spark-1.5.1-bin-hadoop2.4/work/app-20160424191727-0005/1/stdout
16/04/24 19:17:35 DEBUG AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: [actor] received message AkkaMessage(ExecutorStateChanged(app-20160424191727-0005,1,KILLED,None,Some(143)),false) from Actor[akka://sparkWorker/deadLetters]
16/04/24 19:17:35 DEBUG AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: Received RPC message: AkkaMessage(ExecutorStateChanged(app-20160424191727-0005,1,KILLED,None,Some(143)),false)
16/04/24 19:17:35 INFO Worker: Executor app-20160424191727-0005/1 finished with state KILLED exitStatus 143
16/04/24 19:17:35 INFO Worker: Cleaning up local directories for application app-20160424191727-0005
I'm really puzzled by this; any good ideas for solving it?
My cluster environment is: Spark 1.5.1 / Hadoop 2.5.2 / Scala 2.10.4 / JDK 1.7.0_51.
My guess is that sc.addJar("/home/found/FromWindows/testsSpark/out/artifacts/untitled_jar/untitled.jar") fails on your executors, because that path/jar does not exist on the executor nodes. Try putting the jar into HDFS and using an HDFS path to specify it, so that your executors have access to it.
Or even better, use the --jars command line option of spark-submit and specify your jar there; then you don't have to change your code and recompile whenever the jar changes.
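For example (a minimal sketch; the HDFS location and the extra dependency jar below are hypothetical):
// in code: point addJar at a location every executor can read, e.g. HDFS
sc.addJar("hdfs://192.168.1.101:9000/jars/untitled.jar")
or drop sc.addJar entirely and let spark-submit ship the jars, keeping the application jar as the last argument:
spark-submit --master spark://sparkmaster:7077 --class org.klordy.test.WordCount \
  --jars /path/to/extra-dependency.jar \
  /home/found/FromWindows/testsSpark/out/artifacts/untitled_jar/untitled.jar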
The problem has been solved. The main reason was that my project had imported two versions of Scala (one extra version pulled in by sbt). Thanks to everyone who helped me solve this.

Datastax Spark Jobs Killed for No Reason

We are using DSE Spark with a 3-node cluster running 5 jobs. We are seeing SIGTERM signals arrive in /var/log/spark/worker/worker-0/worker.log, which stop our jobs. We are not seeing any corresponding memory or CPU pressure at those times, and no one issued these signals manually.
I've seen a couple of similar issues that come down to a heap size problem with YARN or Mesos, but since we are using DSE, those didn't seem relevant.
Below is a sample of the log output from one server that was running two of the jobs:
ERROR [SIGTERM handler] 2016-03-26 00:43:28,780 SignalLogger.scala:57 - RECEIVED SIGNAL 15: SIGTERM
ERROR [SIGHUP handler] 2016-03-26 00:43:28,788 SignalLogger.scala:57 - RECEIVED SIGNAL 1: SIGHUP
INFO [Spark Shutdown Hook] 2016-03-26 00:43:28,795 Logging.scala:59 - Killing process!
ERROR [File appending thread for /var/lib/spark/worker/worker-0/app-20160325131848-0001/0/stderr] 2016-03-26 00:43:28,848 Logging.scala:96 - Error writing stream to file /var/lib/spark/worker/worker-0/app-20160325131848-0001/0/stderr
java.io.IOException: Stream closed
at java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:170) ~[na:1.8.0_71]
at java.io.BufferedInputStream.read1(BufferedInputStream.java:283) ~[na:1.8.0_71]
at java.io.BufferedInputStream.read(BufferedInputStream.java:345) ~[na:1.8.0_71]
at java.io.FilterInputStream.read(FilterInputStream.java:107) ~[na:1.8.0_71]
at org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:70) ~[spark-core_2.10-1.4.1.3.jar:1.4.1.3]
at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39) [spark-core_2.10-1.4.1.3.jar:1.4.1.3]
at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39) [spark-core_2.10-1.4.1.3.jar:1.4.1.3]
at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39) [spark-core_2.10-1.4.1.3.jar:1.4.1.3]
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1772) [spark-core_2.10-1.4.1.3.jar:1.4.1.3]
at org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38) [spark-core_2.10-1.4.1.3.jar:1.4.1.3]
ERROR [File appending thread for /var/lib/spark/worker/worker-0/app-20160325131848-0001/0/stdout] 2016-03-26 00:43:28,892 Logging.scala:96 - Error writing stream to file /var/lib/spark/worker/worker-0/app-20160325131848-0001/0/stdout
java.io.IOException: Stream closed
at java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:170) ~[na:1.8.0_71]
at java.io.BufferedInputStream.read1(BufferedInputStream.java:283) ~[na:1.8.0_71]
at java.io.BufferedInputStream.read(BufferedInputStream.java:345) ~[na:1.8.0_71]
at java.io.FilterInputStream.read(FilterInputStream.java:107) ~[na:1.8.0_71]
at org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:70) ~[spark-core_2.10-1.4.1.3.jar:1.4.1.3]
at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39) [spark-core_2.10-1.4.1.3.jar:1.4.1.3]
at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39) [spark-core_2.10-1.4.1.3.jar:1.4.1.3]
at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39) [spark-core_2.10-1.4.1.3.jar:1.4.1.3]
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1772) [spark-core_2.10-1.4.1.3.jar:1.4.1.3]
at org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38) [spark-core_2.10-1.4.1.3.jar:1.4.1.3]
ERROR [SIGTERM handler] 2016-03-26 00:43:29,070 SignalLogger.scala:57 - RECEIVED SIGNAL 15: SIGTERM
INFO [sparkWorker-akka.actor.default-dispatcher-3] 2016-03-26 00:43:29,079 Logging.scala:59 - Disassociated [akka.tcp://sparkWorker#10.0.1.7:44131] -> [akka.tcp://sparkMaster#10.0.1.7:7077] Disassociated !
ERROR [sparkWorker-akka.actor.default-dispatcher-3] 2016-03-26 00:43:29,080 Logging.scala:75 - Connection to master failed! Waiting for master to reconnect...
INFO [sparkWorker-akka.actor.default-dispatcher-3] 2016-03-26 00:43:29,081 Logging.scala:59 - Connecting to master akka.tcp://sparkMaster#10.0.1.7:7077/user/Master...
WARN [sparkWorker-akka.actor.default-dispatcher-3] 2016-03-26 00:43:29,091 Slf4jLogger.scala:71 - Association with remote system [akka.tcp://sparkMaster#10.0.1.7:7077] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
INFO [sparkWorker-akka.actor.default-dispatcher-3] 2016-03-26 00:43:29,101 Logging.scala:59 - Disassociated [akka.tcp://sparkWorker#10.0.1.7:44131] -> [akka.tcp://sparkMaster#10.0.1.7:7077] Disassociated !
ERROR [sparkWorker-akka.actor.default-dispatcher-3] 2016-03-26 00:43:29,102 Logging.scala:75 - Connection to master failed! Waiting for master to reconnect...
INFO [sparkWorker-akka.actor.default-dispatcher-3] 2016-03-26 00:43:29,102 Logging.scala:59 - Not spawning another attempt to register with the master, since there is an attempt scheduled already.
WARN [sparkWorker-akka.actor.default-dispatcher-4] 2016-03-26 00:43:29,323 Slf4jLogger.scala:71 - Association with remote system [akka.tcp://sparkExecutor#10.0.1.7:49943] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
INFO [sparkWorker-akka.actor.default-dispatcher-3] 2016-03-26 00:43:29,330 Logging.scala:59 - Executor app-20160325132151-0004/0 finished with state EXITED message Command exited with code 129 exitStatus 129
INFO [Spark Shutdown Hook] 2016-03-26 00:43:29,414 Logging.scala:59 - Killing process!
INFO [sparkWorker-akka.actor.default-dispatcher-3] 2016-03-26 00:43:29,415 Logging.scala:59 - Executor app-20160325131848-0001/0 finished with state EXITED message Command exited with code 129 exitStatus 129
INFO [Spark Shutdown Hook] 2016-03-26 00:43:29,417 Logging.scala:59 - Killing process!
INFO [sparkWorker-akka.actor.default-dispatcher-3] 2016-03-26 00:43:29,422 Logging.scala:59 - Unknown Executor app-20160325132151-0004/0 finished with state EXITED message Worker shutting down exitStatus 129
WARN [sparkWorker-akka.actor.default-dispatcher-4] 2016-03-26 00:43:29,425 Slf4jLogger.scala:71 - Association with remote system [akka.tcp://sparkExecutor#10.0.1.7:32874] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
WARN [sparkWorker-akka.actor.default-dispatcher-4] 2016-03-26 00:43:29,433 Slf4jLogger.scala:71 - Association with remote system [akka.tcp://sparkExecutor#10.0.1.7:56212] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
INFO [sparkWorker-akka.actor.default-dispatcher-3] 2016-03-26 00:43:29,441 Logging.scala:59 - Executor app-20160325131918-0002/1 finished with state EXITED message Command exited with code 129 exitStatus 129
INFO [sparkWorker-akka.actor.default-dispatcher-4] 2016-03-26 00:43:29,448 Logging.scala:59 - Unknown Executor app-20160325131918-0002/1 finished with state EXITED message Worker shutting down exitStatus 129
INFO [Spark Shutdown Hook] 2016-03-26 00:43:29,448 Logging.scala:59 - Shutdown hook called
INFO [Spark Shutdown Hook] 2016-03-26 00:43:29,449 Logging.scala:59 - Deleting directory /var/lib/spark/rdd/spark-28fa2f73-d2aa-44c0-ad4e-3ccfd07a95d2
The error seems straightforward to me:
Error writing stream to file /var/lib/spark/worker/worker-0/app-20160325131848-0001/0/stdout java.io.IOException: Stream closed at java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:170)
Either there is a network issue at play here, between your data source (Cassandra) and Spark. Remember that Spark on node1 can and will pull data from node2 of Cassandra, even though it tries to minimize that.
Or your serialization is having issues. In that case, add this parameter to the Spark configuration to switch to Kryo:
spark.serializer=org.apache.spark.serializer.KryoSerializer
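A minimal sketch of doing the same from code (the registered classes are hypothetical; registration is optional but recommended with Kryo):
val conf = new SparkConf()
  .setAppName("MyJob")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[MyRecord], classOf[MyKey]))
The same setting can also be passed at submit time with --conf spark.serializer=org.apache.spark.serializer.KryoSerializer.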

Resources