Driver stops executors without a reason - apache-spark

I have an application based on Spark Structured Streaming 3 with Kafka that processes user logs. After some time the driver starts killing the executors, and I don't understand why.
The executor logs don't contain any errors. Below are the logs from the executors and the driver.
On the executor 1:
20/08/31 10:01:31 INFO executor.Executor: Finished task 5.0 in stage 791.0 (TID 46411). 1759 bytes result sent to driver
20/08/31 10:01:33 INFO executor.YarnCoarseGrainedExecutorBackend: Driver commanded a shutdown
On the executor 2:
20/08/31 10:14:33 INFO executor.YarnCoarseGrainedExecutorBackend: Driver commanded a shutdown
20/08/31 10:14:34 INFO memory.MemoryStore: MemoryStore cleared
20/08/31 10:14:34 INFO storage.BlockManager: BlockManager stopped
20/08/31 10:14:34 INFO util.ShutdownHookManager: Shutdown hook called
On the driver:
20/08/31 10:01:33 ERROR cluster.YarnScheduler: Lost executor 3 on Executor heartbeat timed out after 130392 ms
20/08/31 10:53:33 ERROR cluster.YarnScheduler: Lost executor 2 on Executor heartbeat timed out after 125773 ms
20/08/31 10:53:33 ERROR cluster.YarnScheduler: Ignoring update with state FINISHED for TID 129308 because its task set is gone (this is likely the result of receiving duplicate task finished status updates) or its executor has been marked as failed.
20/08/31 10:53:33 ERROR cluster.YarnScheduler: Ignoring update with state FINISHED for TID 129314 because its task set is gone (this is likely the result of receiving duplicate task finished status updates) or its executor has been marked as failed.
20/08/31 10:53:33 ERROR cluster.YarnScheduler: Ignoring update with state FINISHED for TID 129311 because its task set is gone (this is likely the result of receiving duplicate task finished status updates) or its executor has been marked as failed.
20/08/31 10:53:33 ERROR cluster.YarnScheduler: Ignoring update with state FINISHED for TID 129305 because its task set is gone (this is likely the result of receiving duplicate task finished status updates) or its executor has been marked as failed.
Has anyone had the same problem and solved it?

Looking at the information at hand (no errors, "Driver commanded a shutdown", and YARN logs showing "state FINISHED"), this seems to be expected behavior.
This typically happens if you forget to await the termination of the Spark streaming query. If you do not conclude your code with a call to awaitTermination() on the streaming query, your streaming application will just shut down after all data has been processed.


Structured Streaming CoarseGrainedExecutorBackend: Driver commanded a shutdown

I am running a Spark Structured Streaming application. I have assigned 10 GB to the driver. The program runs fine for 8 hours, then it gives an error like the following. The executor finishes its tasks and sends the results to the driver, then the driver commands a shutdown. Why? How much memory does the driver need?
20/04/18 19:25:24 INFO CoarseGrainedExecutorBackend: Got assigned task 489524
20/04/18 19:25:24 INFO Executor: Running task 1000.0 in stage 477.0 (TID 489524)
20/04/18 19:25:25 INFO Executor: Finished task 938.0 in stage 477.0 (TID 489492). 4153 bytes result sent to driver
20/04/18 19:25:25 INFO Executor: Finished task 953.0 in stage 477.0 (TID 489499). 3687 bytes result sent to driver
20/04/18 19:25:28 INFO Executor: Finished task 1000.0 in stage 477.0 (TID 489524). 3898 bytes result sent to driver
20/04/18 19:25:29 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
20/04/18 19:25:29 INFO MemoryStore: MemoryStore cleared
20/04/18 19:25:29 INFO BlockManager: BlockManager stopped
20/04/18 19:25:29 INFO ShutdownHookManager: Shutdown hook called
There is no specific memory limit for the driver. The driver can take up to about 40 GB of memory; beyond that, JVM GC causes it to slow down.
In your case, it looks like the driver is getting overwhelmed by the results sent to it by all the executors.
There are a few things you can try:
Ensure there are no collect operations in the driver. They will definitely overwhelm the driver.
Try giving the driver more memory, maybe 18G.
Increase spark.yarn.driver.memoryOverhead to 2G. This is the amount of off-heap memory (in megabytes) to be allocated per driver.
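As a rough illustration of the last two suggestions, the sketch below assembles the corresponding spark-submit arguments in plain Java. The class name DriverMemoryArgs, the build helper, and the jar name are hypothetical. Because the overhead setting is expressed in megabytes, 2G is written as 2048 here. Newer Spark releases use spark.driver.memoryOverhead instead, so check the property name for your version.

```java
import java.util.ArrayList;
import java.util.List;

class DriverMemoryArgs {
    // Builds a spark-submit argument list with the given driver heap (in GB)
    // and driver memory overhead (in MB). The jar name is supplied by the caller.
    public static List<String> build(int driverMemoryGb, int overheadMb, String appJar) {
        List<String> args = new ArrayList<>();
        args.add("spark-submit");
        args.add("--driver-memory");
        args.add(driverMemoryGb + "g");
        args.add("--conf");
        args.add("spark.yarn.driver.memoryOverhead=" + overheadMb);
        args.add(appJar);
        return args;
    }

    public static void main(String[] argv) {
        // 18 GB heap and 2 GB (2 * 1024 MB) overhead, as suggested above.
        // "my-streaming-app.jar" is a placeholder, not a real artifact.
        System.out.println(String.join(" ", build(18, 2 * 1024, "my-streaming-app.jar")));
    }
}
```

These values are starting points taken from the suggestions above, not tuned recommendations.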

Spark Streaming - Stopped worker throws FileNotFoundException

I am running a Spark Streaming application on a cluster composed of three nodes, each one with a worker and three executors (so a total of 9 executors). I am using Spark standalone mode (version 2.1.1).
The application is run with a spark-submit command with the options --deploy-mode client and --conf spark.streaming.stopGracefullyOnShutdown=true.
The submit command is run from one of the nodes, let's call it node 1.
As a fault tolerance test I am stopping the worker on node 2 by calling the script
In executor logs on node 2 I can see several errors related to a FileNotFoundException during a shuffle operation:
ERROR Executor: Exception in task 5.0 in stage 5531241.0 (TID 62488319) /opt/spark/spark-31c5b4b0-56e1-45d2-88dc-772b8712833f/executor-0bad0669-57fe-43f9-a77e-1b69cd284523/blockmgr-2aa295ac-78ca-4df6-ab89-51d422e8860e/1c/shuffle_2074211_5_0.index.ecb8e397-c3a3-4c1a-96ba-e153ed92b05c (No such file or directory)
at Method)
at org.apache.spark.shuffle.IndexShuffleBlockResolver.writeIndexFileAndCommit(IndexShuffleBlockResolver.scala:144)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.executor.Executor$
at java.util.concurrent.ThreadPoolExecutor.runWorker(
at java.util.concurrent.ThreadPoolExecutor$
I can see 4 errors of this kind on the same task in each of the 3 executors on node 2.
In driver logs I can see:
ERROR TaskSetManager: Task 5 in stage 5531241.0 failed 4 times; aborting job
ERROR JobScheduler: Error running job streaming job 1503995015000 ms.1
org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 5531241.0 failed 4 times, most recent failure: Lost task 5.3 in stage 5531241.0 (TID 62488335,, executor 2): /opt/spark/spark-31c5b4b0-56e1-45d2-88dc-772b8712833f/executor-0bad0669-57fe-43f9-a77e-1b69cd284523/blockmgr-2aa295ac-78ca-4df6-ab89-51d422e8860e/1c/shuffle_2074211_5_0.index.9e6148da-6ce2-4de5-94ab-d95db2c8f9f7 (No such file or directory)
This is taking down the application, as expected: the executor reached the spark.task.maxFailures on a single task and the application is then stopped.
I ran different tests and all of them but one ended with the app stopped. My idea is that the behaviour can vary depending on the precise step in the stream process at which I ask the worker to stop. In any case, all other tests failed with the same error described above.
Increasing the parameter spark.task.maxFailures to 8 did not help either, with the TaskSetManager signalling task failed 8 times instead of 4.
What if the worker is killed?
I also ran a different test: I killed the worker and the 3 executor processes on node 2 with the command kill -9. In this case, the streaming app adapted to the remaining resources and kept working.
In driver log we can see the driver noticing the missing executors:
ERROR TaskSchedulerImpl: Lost executor 0 on Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
Then we see a long series of the following errors:
17/08/29 14:43:19 ERROR ReceiverTracker: Deregistered receiver for stream 5: Error starting receiver 5 - Failed to bind to: /X.X.X.X:40001
at org.jboss.netty.bootstrap.ServerBootstrap.bind(
at org.apache.avro.ipc.NettyServer.<init>(
at org.apache.avro.ipc.NettyServer.<init>(
at org.apache.avro.ipc.NettyServer.<init>(
at org.apache.avro.ipc.NettyServer.<init>(
at org.apache.spark.streaming.flume.FlumeReceiver.initServer(FlumeInputDStream.scala:162)
at org.apache.spark.streaming.flume.FlumeReceiver.onStart(FlumeInputDStream.scala:169)
at org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:149)
at org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:131)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:607)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:597)
at org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:2028)
at org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:2028)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.executor.Executor$
at java.util.concurrent.ThreadPoolExecutor.runWorker(
at java.util.concurrent.ThreadPoolExecutor$
Caused by: Cannot assign requested address
at Method)
... 3 more
These errors appear in the log until the killed worker is started again.
Stopping a worker with the dedicated command has an unexpected behaviour: the app should be able to cope with the missing worker, adapting to the remaining resources and continuing to work (as it does in the case of kill).
What are your observations on this issue?

Flink with High-availability with zookeeper: Submitted job is not acknowledged by Job manager

I am trying to run a Flink cluster in high-availability ZooKeeper mode. For functional testing of the HA cluster, I have 5 JobManagers and 1 TaskManager. After starting the ZooKeeper quorum and the Flink cluster, I submit the job to the JobManager but get the following errors:
log4j:WARN No appenders could be found for logger(org.apache.kafka.clients.consumer.ConsumerConfig).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See for more info.
Submitting job with JobID: ac57484f600814326f28c941244a4c94. Waiting for job completion.
Connected to JobManager at Actor[akka.tcp://flink#]
Exception in thread "main" org.apache.flink.client.program.ProgramInvocationException: The program execution failed: Communication with JobManager failed: Job submission to the JobManager timed out. You may increase 'akka.client.timeout' in case the JobManager needs more time to configure and confirm the job submission.
at org.apache.flink.client.program.StandaloneClusterClient.submitJob(
at org.apache.flink.streaming.api.environment.RemoteStreamEnvironment.executeRemotely(
at org.apache.flink.streaming.api.environment.RemoteStreamEnvironment.execute(
at MainAlert.main(
Caused by: org.apache.flink.runtime.client.JobExecutionException: Communication with JobManager failed: Job submission to the JobManager timed out. You may increase 'akka.client.timeout' in case the JobManager needs more time to configure and confirm the job submission.
at org.apache.flink.runtime.client.JobClient.submitJobAndWait(
... 6 more
Caused by: org.apache.flink.runtime.client.JobClientActorSubmissionTimeoutException: Job submission to the JobManager timed out. You may increase 'akka.client.timeout' in case the JobManager needs more time to configure and confirm the job submission.
at org.apache.flink.runtime.client.JobClientActor.handleMessage(
at org.apache.flink.runtime.akka.FlinkUntypedActor.handleLeaderSessionID(
at org.apache.flink.runtime.akka.FlinkUntypedActor.onReceive(
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(
Do I have to set the explicitly or is it something else that is causing this problem?
(6123 is my jobmanager.rpc.port and also recovery.jobmanager.port)

Spark - How to identify a failed Job through 'SparkLauncher'

I am using Spark 2.0 and sometimes my job fails due to problems with its input. For example, I am reading CSV files from an S3 folder based on the date, and if there is no data for the current date, my job has nothing to process, so it throws the following exception. This gets printed in the driver's logs.
Exception in thread "main" org.apache.spark.sql.AnalysisException: Path does not exist: s3n://data/2016-08-31/*.csv;
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:58)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:174)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
16/09/03 10:51:54 INFO SparkContext: Invoking stop() from shutdown hook
16/09/03 10:51:54 INFO SparkUI: Stopped Spark web UI at
16/09/03 10:51:54 INFO StandaloneSchedulerBackend: Shutting down all executors
16/09/03 10:51:54 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down
Spark App app-20160903105040-0007 state changed to FINISHED
However, despite this uncaught exception, my Spark Job status is 'FINISHED'. I would expect it to be in 'FAILED' status because there was an exception. Why is it marked as FINISHED? How can I find out whether the job failed or not?
Note: I am spawning the Spark jobs using SparkLauncher, and listening to state changes through AppHandle. But the state change I receive is FINISHED whereas I am expecting FAILED.
The FINISHED state you see is for the Spark application, not a job. It is FINISHED because the Spark context was able to start and stop properly.
You can see any job information using JavaSparkStatusTracker.
For active jobs nothing additional needs to be done, since the tracker has a getActiveJobIds method.
To get finished or failed jobs, you need to set the job group ID in the thread from which you trigger the Spark execution:
JavaSparkContext sc;
sc.setJobGroup(MY_JOB_ID, "Some description");
Then, whenever you need, you can read the status of each job within the specified job group:
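JavaSparkStatusTracker statusTracker = sc.statusTracker();
// Use the same group ID that was passed to setJobGroup above.
for (int jobId : statusTracker.getJobIdsForGroup(MY_JOB_ID)) {
    final SparkJobInfo jobInfo = statusTracker.getJobInfo(jobId);
    final JobExecutionStatus status = jobInfo.status();
    // Inspect status here, e.g. record whether the job FAILED.
}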
The JobExecutionStatus can be one of RUNNING, SUCCEEDED, FAILED, or UNKNOWN. The last one covers the case where the job has been submitted but not actually started.
Note: all this is available from the Spark driver, which is the jar you are launching using SparkLauncher. So the above code should be placed in the jar.
If you want to detect failures in general from the SparkLauncher side, the application started from the jar can exit with a non-zero code, for example System.exit(1), when it detects a job failure. The Process returned by SparkLauncher::launch has an exitValue method, so you can tell whether it failed.
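The exit-code approach can be sketched as follows. This is a minimal illustration in plain Java rather than Spark API: ExitCodeSketch, exitCodeFor, and describe are hypothetical names, and the Callable stands in for the real Spark work inside the launched jar.

```java
import java.util.concurrent.Callable;

class ExitCodeSketch {
    // Driver side: run the job body and map the outcome to an exit code,
    // 0 when it completes and 1 when it throws an exception.
    public static int exitCodeFor(Callable<?> jobBody) {
        try {
            jobBody.call();
            return 0;
        } catch (Exception e) {
            System.err.println("Job failed: " + e.getMessage());
            return 1;
        }
    }

    // Launcher side: interpret the value reported by Process.exitValue().
    public static String describe(int exitValue) {
        return exitValue == 0 ? "SUCCEEDED" : "FAILED";
    }

    public static void main(String[] args) {
        // Hypothetical stand-ins for a job that succeeds and one that fails.
        int ok = exitCodeFor(() -> "done");
        int bad = exitCodeFor(() -> {
            throw new IllegalStateException("Path does not exist");
        });
        System.out.println(describe(ok));
        System.out.println(describe(bad));
    }
}
```

In the real driver, the computed code would be passed to System.exit, and the launcher would read it from Process.exitValue() after the process has terminated (for instance after waitFor()).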
You can always go to the Spark history server and click on your job ID to get the job details.
If you are using YARN, you can go to the ResourceManager web UI to track your job status.

Rolling upgrade from 1.2.19 to 2.0.15

We have 3 datacenters with 12 Cassandra nodes in each. The current Cassandra version is 1.2.19. We want to migrate to Cassandra 2.0.15. We cannot afford a full downtime, so we need to do a rolling upgrade. As a preliminary check we ran 2 experiments:
Experiment 1
Created a new 2.0.15 node and tried to bootstrap it into the cluster with 10% of the token interval of an already existing node.
The node was unable to join the cluster, failing with "java.lang.RuntimeException: Unable to gossip with any seeds":
(line 584) Exception encountered during startup
java.lang.RuntimeException: Unable to gossip with any seeds
at org.apache.cassandra.gms.Gossiper.doShadowRound(
at org.apache.cassandra.service.StorageService.checkForEndpointCollision(
at org.apache.cassandra.service.StorageService.prepareToJoin(
at org.apache.cassandra.service.StorageService.initServer(
at org.apache.cassandra.service.StorageService.initServer(
at org.apache.cassandra.service.CassandraDaemon.setup(
at org.apache.cassandra.service.CassandraDaemon.activate(
at org.apache.cassandra.service.CassandraDaemon.main(
Experiment 2
Added one 1.2.19 node to the cluster with 10% of the token interval of an already existing node.
When the node was up we stopped it, upgraded it to 2.0.15, then started it again (minor downtime). This time the node joined the cluster and started serving requests correctly.
To check how it behaves under heavier load we tried to move its token to cover 15% of a normal node. Unfortunately the move operation failed with the following exception:
INFO [RMI TCP Connection(1424)-] 2015-07-10 11:37:05,235 (line 982) MOVING: fetching new ranges and streaming old ranges
INFO [RMI TCP Connection(1424)-] 2015-07-10 11:37:05,262 (line 87) [Stream #fc3f4290-26f7-11e5-9988-afe392008597] Executing streaming plan for Moving
INFO [RMI TCP Connection(1424)-] 2015-07-10 11:37:05,262 (line 91) [Stream #fc3f4290-26f7-11e5-9988-afe392008597] Beginning stream session with /
INFO [StreamConnectionEstablisher:1] 2015-07-10 11:37:05,263 (line 218) [Stream #fc3f4290-26f7-11e5-9988-afe392008597] Starting streaming to /
INFO [StreamConnectionEstablisher:1] 2015-07-10 11:37:05,274 (line 173) [Stream #fc3f4290-26f7-11e5-9988-afe392008597] Prepare completed. Receiving 0 files(0 bytes), sending 112 files(6538607891 bytes)
ERROR [STREAM-IN-/] 2015-07-10 11:37:05,303 (line 467) [Stream #fc3f4290-26f7-11e5-9988-afe392008597] Streaming error occurred Connection reset by peer
at Method)
at org.apache.cassandra.streaming.messages.StreamMessage.deserialize(
at org.apache.cassandra.streaming.ConnectionHandler$
ERROR [STREAM-OUT-/] 2015-07-10 11:37:05,312 (line 467) [Stream #fc3f4290-26f7-11e5-9988-afe392008597] Streaming error occurred Connection reset by peer
at Method)
Q1. Is it normal for Cassandra 2.0.15 to fail to bootstrap into a 1.2.19 cluster, as it did in experiment 1? (That is, might it not be supposed to work by design?)
Q2. Is the move token operation supposed to work for a Cassandra 2.0.15 node operating in a 1.2.19 cluster?
Q3. Are there any workarounds or recommendations for doing a proper rolling upgrade in our case?
