Structured Streaming CoarseGrainedExecutorBackend: Driver commanded a shutdown - apache-spark

I am running a Spark Structured Streaming application. I have assigned 10 GB to the driver. The program runs fine for 8 hours, then it gives an error like the following. The executor finishes its tasks and sends the results to the driver, and then the driver commands a shutdown. Why? How much memory does the driver need?
20/04/18 19:25:24 INFO CoarseGrainedExecutorBackend: Got assigned task 489524
20/04/18 19:25:24 INFO Executor: Running task 1000.0 in stage 477.0 (TID 489524)
20/04/18 19:25:25 INFO Executor: Finished task 938.0 in stage 477.0 (TID 489492). 4153 bytes result sent to driver
20/04/18 19:25:25 INFO Executor: Finished task 953.0 in stage 477.0 (TID 489499). 3687 bytes result sent to driver
20/04/18 19:25:28 INFO Executor: Finished task 1000.0 in stage 477.0 (TID 489524). 3898 bytes result sent to driver
20/04/18 19:25:29 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
20/04/18 19:25:29 INFO MemoryStore: MemoryStore cleared
20/04/18 19:25:29 INFO BlockManager: BlockManager stopped
20/04/18 19:25:29 INFO ShutdownHookManager: Shutdown hook called

There is no specific memory limit for the driver. The driver can take up to about 40 GB of memory; beyond that, JVM GC causes it to slow down.
In your case, it looks like the driver is getting overwhelmed by the results that all the executors are sending back to it.
There are a few things you can try:
Make sure there are no collect operations on the driver; collecting large results will definitely overwhelm it.
Try giving the driver more memory, for example 18 GB.
Increase spark.yarn.driver.memoryOverhead to 2 GB: this is the amount of off-heap memory (in megabytes) to be allocated per driver.
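For illustration, a rough sketch of how those suggestions are usually applied, assuming the job is submitted to YARN (the application file name and exact values are placeholders, not a definitive recommendation). Driver memory has to be fixed before the driver JVM starts, so it belongs on the spark-submit command line or in spark-defaults.conf rather than in application code:
spark-submit \
  --master yarn \
  --driver-memory 18g \
  --conf spark.yarn.driver.memoryOverhead=2048 \
  your_streaming_app.py
On older Spark versions spark.yarn.driver.memoryOverhead takes a value in megabytes, as noted above; newer versions use spark.driver.memoryOverhead instead.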

Related

Spark job failing with "Fail to know the executor driver is alive or not", "Cannot find endpoint: spark://CoarseGrainedScheduler@<host:port>"

I'm running a job on a local Spark cluster (pyspark). When I run it with a small dataset it works fine, but once it's large, I get an error. I'm wondering 1. How to find logs from the scheduler process that appears to be crashing, and 2. more generally, what might be going on and how to debug the problem. Thanks in advance. Happy to provide more info.
Here's the error (from what I understand to be the driver logs):
block-manager-ask-thread-pool-224 ERROR BlockManagerMasterEndpoint: Fail to know the executor driver is alive or not.
org.apache.spark.SparkException: Exception thrown in awaitResult:
at...
...
<stacktrace>
...
Caused by: org.apache.spark.rpc.RpcEndpointNotFoundException: Cannot find endpoint: spark://CoarseGrainedScheduler@<host:port>
and then immediately below that
block-manager-ask-thread-pool-224 WARN BlockManagerMasterEndpoint: Error trying to remove shuffle 25. The executor driver may have been lost.
org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply from <host:port>
What I know about my job: I'm using PySpark and running Spark standalone on a local cluster with 72 workers (the machine has 96 cores). Here's my config:
spark:
  master: "local[72]"
  files:
    maxPartitionBytes: 67108864
  sql:
    files:
      maxPartitionBytes: 67108864
  driver:
    memory: "50g"
    maxResultSize: "2g"
    supervise: true
    cores: 72
  log:
    dfsDir: <my/logs/dir>
    persistToDfs:
      enabled: true
  loglevel: "WARN"
  logConf: true
I've set SPARK_LOG_DIR and SPARK_WORKER_LOG_DIR to attempt to see scheduler logs, but I still only see driver (worker?) logs as far as I can tell, with the above error. I'm monitoring memory usage and it doesn't seem like my machine is memory-constrained, but I can't be sure I'm checking at the right moments. The machine has about 1TB of memory and tens of terabytes of free disk space.
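As a side note, one way to confirm which of these settings the driver actually picked up is to print the resolved configuration from inside the job. A minimal PySpark sketch, assuming the SparkSession has already been created by the application:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Dump every Spark property the driver resolved, including the ones it defaulted.
for key, value in sorted(spark.sparkContext.getConf().getAll()):
    print(key, "=", value)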
Thanks in advance!

Driver stops executors without a reason

I have an application based on Spark Structured Streaming 3 with Kafka, which processes some user logs, and after some time the driver starts killing the executors and I don't understand why.
The executors don't contain any errors. I'm leaving below the logs from the executor and the driver.
On the executor 1:
20/08/31 10:01:31 INFO executor.Executor: Finished task 5.0 in stage 791.0 (TID 46411). 1759 bytes result sent to driver
20/08/31 10:01:33 INFO executor.YarnCoarseGrainedExecutorBackend: Driver commanded a shutdown
On the executor 2:
20/08/31 10:14:33 INFO executor.YarnCoarseGrainedExecutorBackend: Driver commanded a shutdown
20/08/31 10:14:34 INFO memory.MemoryStore: MemoryStore cleared
20/08/31 10:14:34 INFO storage.BlockManager: BlockManager stopped
20/08/31 10:14:34 INFO util.ShutdownHookManager: Shutdown hook called
On the driver:
20/08/31 10:01:33 ERROR cluster.YarnScheduler: Lost executor 3 on xxx.xxx.xxx.xxx: Executor heartbeat timed out after 130392 ms
20/08/31 10:53:33 ERROR cluster.YarnScheduler: Lost executor 2 on xxx.xxx.xxx.xxx: Executor heartbeat timed out after 125773 ms
20/08/31 10:53:33 ERROR cluster.YarnScheduler: Ignoring update with state FINISHED for TID 129308 because its task set is gone (this is likely the result of receiving duplicate task finished status updates) or its executor has been marked as failed.
20/08/31 10:53:33 ERROR cluster.YarnScheduler: Ignoring update with state FINISHED for TID 129314 because its task set is gone (this is likely the result of receiving duplicate task finished status updates) or its executor has been marked as failed.
20/08/31 10:53:33 ERROR cluster.YarnScheduler: Ignoring update with state FINISHED for TID 129311 because its task set is gone (this is likely the result of receiving duplicate task finished status updates) or its executor has been marked as failed.
20/08/31 10:53:33 ERROR cluster.YarnScheduler: Ignoring update with state FINISHED for TID 129305 because its task set is gone (this is likely the result of receiving duplicate task finished status updates) or its executor has been marked as failed.
Has anyone had the same problem and solved it?
Looking at the information at hand (no errors, "Driver commanded a shutdown", and YARN logs showing "state FINISHED"), this seems to be expected behavior.
This typically happens if you forget to await the termination of the Spark streaming query. If you do not conclude your code with
query.awaitTermination()
your streaming application will simply shut down after all the data has been processed.
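A minimal PySpark sketch of the pattern (the Kafka broker, topic, sink, and checkpoint path below are placeholders, assuming a Kafka source as in the question):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("UserLogs").getOrCreate()

# Read the user logs from Kafka (placeholder broker and topic).
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "user-logs")
          .load())

# Write to a sink; the console sink is used here only to keep the example small.
query = (events.writeStream
         .format("console")
         .option("checkpointLocation", "/tmp/checkpoints/user-logs")
         .start())

# Without this call the main program returns as soon as the query has started,
# the SparkSession is torn down, and the driver tells the executors to shut down.
query.awaitTermination()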

Spark initial job has not accepted any resources

I'm having trouble getting my program to run on my Spark cluster. I set the cluster up with 1 master and 4 slaves. I started the master, then started the slaves, and they show up in the master's web UI.
I then start a small Python script to check whether jobs can be executed:
from pyspark import * #SparkContext, SparkConf, spark
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql import SQLContext
from files import files
import sys
if __name__ == "__main__":
    appName = 'SparkExample'
    masterUrl = 'spark://10.0.2.55:7077'

    conf = SparkConf()
    conf.setAppName(appName)
    conf.setMaster(masterUrl)
    conf.set("spark.driver.cores","1")
    conf.set("spark.driver.memory","1g")
    conf.set("spark.executor.cores","1")
    conf.set("spark.executor.memory","4g")
    conf.set("spark.python.worker.memory","256m")
    conf.set("spark.cores.max","4")
    conf.set("spark.shuffle.service.enabled","true")
    conf.set("spark.dynamicAllocation.enabled","true")
    conf.set("spark.dynamicAllocation.maxExecutors","1")

    for k,v in conf.getAll():
        print(k+":"+v)

    spark = SparkSession.builder.config(conf=conf).getOrCreate()
    #spark = SparkSession.builder.master(masterUrl).appName(appName).config("spark.executor.memory","1g").getOrCreate()

    l = [('Alice', 1)]
    spark.createDataFrame(l).collect()
    spark.createDataFrame(l, ['name', 'age']).collect()

    print("#############")
    print("Test finished")
    print("#############")
But as soon as I should get something back (line 45: "spark.createDataFrame(l).collect()"), Spark seems to hang. After a while, I see the message:
"WARN TaskSchedulerImpl: Initial job has not accepted any resources: check your cluster UI to ensure that workers are registered and have sufficient resources"
So I check the cluster UI:
Worker Id                              Address          State  Cores        Memory
worker-20171027105227-xx.x.x.x6-35309  10.0.2.56:35309  ALIVE  4 (0 Used)   6.8 GB (0.0 B Used)
worker-20171027110202-xx.x.x.x0-43433  10.0.2.10:43433  ALIVE  16 (1 Used)  30.4 GB (4.0 GB Used)
worker-20171027110746-xx.x.x.x5-45126  10.0.2.65:45126  ALIVE  8 (0 Used)   30.4 GB (0.0 B Used)
worker-20171027110939-xx.x.x.x4-42477  10.0.2.64:42477  ALIVE  16 (0 Used)  30.4 GB (0.0 B Used)
Looks like there are plenty of resources for the small task I created. I can also see the task actually running there. When I click on it, I see that it was launched on 5 executors, and all but one EXITED. When I open the log of one of the exited ones, I see the following error message:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/10/27 16:45:23 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 14443@CODA
17/10/27 16:45:23 INFO SignalUtils: Registered signal handler for TERM
17/10/27 16:45:23 INFO SignalUtils: Registered signal handler for HUP
17/10/27 16:45:23 INFO SignalUtils: Registered signal handler for INT
17/10/27 16:45:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/10/27 16:45:24 INFO SecurityManager: Changing view acls to: root,geissler
17/10/27 16:45:24 INFO SecurityManager: Changing modify acls to: root,geissler
17/10/27 16:45:24 INFO SecurityManager: Changing view acls groups to:
17/10/27 16:45:24 INFO SecurityManager: Changing modify acls groups to:
17/10/27 16:45:24 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root, geissler); groups with view permissions: Set(); users with modify permissions: Set(root, geissler); groups with modify permissions: Set()
17/10/27 16:47:25 ERROR RpcOutboxMessage: Ask timeout before connecting successfully
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1713)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:284)
at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply in 120 seconds. This timeout is controlled by spark.rpc.askTimeout
at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:216)
at scala.util.Try$.apply(Try.scala:192)
at scala.util.Failure.recover(Try.scala:216)
at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:326)
at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:326)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at org.spark_project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
at scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:136)
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
at scala.concurrent.Promise$class.complete(Promise.scala:55)
at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153)
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237)
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:63)
at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:78)
at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55)
at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55)
at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
at scala.concurrent.BatchingExecutor$Batch.run(BatchingExecutor.scala:54)
at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:106)
at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
at scala.concurrent.Promise$class.tryFailure(Promise.scala:112)
at scala.concurrent.impl.Promise$DefaultPromise.tryFailure(Promise.scala:153)
at org.apache.spark.rpc.netty.NettyRpcEnv.org$apache$spark$rpc$netty$NettyRpcEnv$$onFailure$1(NettyRpcEnv.scala:205)
at org.apache.spark.rpc.netty.NettyRpcEnv$$anon$1.run(NettyRpcEnv.scala:239)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.TimeoutException: Cannot receive any reply in 120 seconds
... 8 more
To me, this looks as if the slaves cannot deliver their results back to the master. But I don't know what to do at this point. The slaves are in the same network layer as the master, but on different virtual machines (not Docker containers). Is there a way to check whether they can reach the master server? Are there any configuration settings I overlooked when setting up the cluster?
Spark version: 2.1.2 (on master, nodes and pyspark)
The error here was that the Python script was executed locally. Always launch your Spark scripts through spark-submit; never just run them as a normal program. The same is true for Java Spark programs.
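For example, a launch along these lines (a sketch; the script name is a placeholder and the resource flags mirror the configuration above):
spark-submit \
  --master spark://10.0.2.55:7077 \
  --driver-memory 1g \
  --executor-memory 4g \
  --total-executor-cores 4 \
  spark_example.py
When the script is started this way, spark-submit wires up the driver for you, and the SparkConf inside the script only needs to carry job-level settings.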

spark-cassandra connector in local gives Spark cluster looks down

I am very new to Spark and Cassandra. I am trying a simple Java program where I am trying to add new rows to a Cassandra table using the spark-cassandra-connector provided by DataStax.
I am running DSE on my laptop. Using Java, I am trying to save the data to the Cassandra DB through Spark. The following is the code:
Map<String, String> extra = new HashMap<String, String>();
extra.put("city", "bangalore");
extra.put("dept", "software");
List<User> products = Arrays.asList(new User(1, "vamsi", extra));
JavaRDD<User> productsRDD = sc.parallelize(products);
javaFunctions(productsRDD, User.class).saveToCassandra("test", "users");
When I execute this code I am getting the following error:
16/03/26 20:57:31 INFO client.AppClient$ClientActor: Connecting to master spark://127.0.0.1:7077...
16/03/26 20:57:44 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
16/03/26 20:57:51 INFO client.AppClient$ClientActor: Connecting to master spark://127.0.0.1:7077...
16/03/26 20:57:59 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
16/03/26 20:58:11 ERROR client.AppClient$ClientActor: All masters are unresponsive! Giving up.
16/03/26 20:58:11 ERROR cluster.SparkDeploySchedulerBackend: Spark cluster looks dead, giving up.
16/03/26 20:58:11 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/03/26 20:58:11 INFO scheduler.DAGScheduler: Failed to run runJob at RDDFunctions.scala:48
Exception in thread "main" org.apache.spark.SparkException: Job aborted: Spark cluster looks down
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Looks like you need to fix your Spark configuration...see this:
http://www.datastax.com/dev/blog/common-spark-troubleshooting

Spark on cluster: I would like to know the meaning of the following error and possible causes:

I have the following errors/warnings:
1) WARN AkkaRpcEndpointRef: Error sending message [message = Heartbeat(2,[Lscala.Tuple2;@58149ee3,BlockManagerId(2, 192.168.0.171, 49714))] in 1 attempts
java.util.concurrent.TimeoutException: Futures timed out after [120 seconds]
2) ERROR CoarseGrainedExecutorBackend: Driver 192.168.0.131:41837 disassociated! Shutting down.
I'm running a Spark (v. 1.4.0) app in a cluster of 4 machines in which the driver has less memory (4 GB) than the workers (8 GB each). Is it possible that the driver produces the error because of its workload?
The driver was not able to respond to the executors since it was under stress during the computation.
The problem was solved simply by adding more RAM to the driver.
