Spark executor self-exiting due to driver disassociated in Kubernetes with client deploy-mode - apache-spark

I have a Spark driver running in a Kubernetes pod in client deploy-mode, and it tries to start an executor.
The executor fails with the following error:
{"type":"log", "level":"ERROR", "name":"STREAMING_OTHERS", "time":"2022-01-14T12:29:38.318Z", "timezone":"UTC", "class":"dispatcher-Executor", "method":"spark.executor.CoarseGrainedExecutorBackend.logError(73)", "log":"Executor self-exiting due to : Driver 192-168-39-71.mni-system.pod.cluster.local:40752 disassociated! Shutting down.\n"}
The driver then attempts to start another executor, which fails with the same error, and this repeats over and over.
In the driver pod, I see only the following errors:
22/01/14 12:26:32 ERROR TaskSchedulerImpl: Lost executor 1 on 192.168.43.250:
22/01/14 12:27:16 ERROR TaskSchedulerImpl: Lost executor 2 on 192.168.43.233:
22/01/14 12:27:59 ERROR TaskSchedulerImpl: Lost executor 3 on 192.168.43.221:
22/01/14 12:28:43 ERROR TaskSchedulerImpl: Lost executor 4 on 192.168.43.217:
22/01/14 12:29:27 ERROR TaskSchedulerImpl: Lost executor 5 on 192.168.43.197:
22/01/14 12:30:10 ERROR TaskSchedulerImpl: Lost executor 6 on 192.168.43.237:
22/01/14 12:30:53 ERROR TaskSchedulerImpl: Lost executor 7 on 192.168.43.196:
22/01/14 12:31:42 ERROR TaskSchedulerImpl: Lost executor 8 on 192.168.43.228:
22/01/14 12:32:31 ERROR TaskSchedulerImpl: Lost executor 9 on 192.168.43.254:
22/01/14 12:33:14 ERROR TaskSchedulerImpl: Lost executor 10 on 192.168.43.204:
22/01/14 12:33:57 ERROR TaskSchedulerImpl: Lost executor 11 on 192.168.43.231:
What is wrong? And how can I get executors running correctly?

Looks like the cluster is disconnected. Which platform are you using?

We are using Kubernetes on a private cloud running OpenStack.

Related

Why do I get "executor refused connection"?

Something strange happens during the execution of my code.
When I execute the following line:
sourceList = joinLabelrdd_df.select("x").collect()
I get the following exception. Note that I have enough memory and CPUs.
19/07/14 11:22:34 ERROR TaskSchedulerImpl: Lost executor 5 on 172.16.140.68: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
19/07/14 11:22:34 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20190714111835-0000/5 is now EXITED (Command exited with code 137)
This error causes another exception:
19/07/14 11:22:41 WARN TaskSetManager: Lost task 113.1 in stage 9.0 (TID 2154, 172.16.140.113, executor 9): FetchFailed(null, shuffleId=0, mapId=-1, reduceId=113, message=
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0

Lost executor Spark

I have a long-running Spark job which, after running for hours, failed with the following errors.
18/10/09 03:22:15 ERROR YarnScheduler: Lost executor 547 on ip: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
18/10/09 03:22:15 WARN TaskSetManager: Lost task 750.0 in stage 19.0 (TID 1565492, ip, executor 547): ExecutorLostFailure (executor 547 exited caused by one of the running tasks) Reason: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
18/10/09 03:22:15 WARN TaskSetManager: Lost task 752.0 in stage 19.0 (TID 1565494, ip, executor 547): ExecutorLostFailure (executor 547 exited caused by one of the running tasks) Reason: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
18/10/09 03:22:15 WARN TaskSetManager: Lost task 751.0 in stage 19.0 (TID 1565493, ip, executor 547): ExecutorLostFailure (executor 547 exited caused by one of the running tasks) Reason: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
18/10/09 03:22:15 WARN TaskSetManager: Lost task 754.0 in stage 19.0 (TID 1565496, ip, executor 547): ExecutorLostFailure (executor 547 exited caused by one of the running tasks) Reason: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
18/10/09 03:22:15 WARN TaskSetManager: Lost task 753.0 in stage 19.0 (TID 1565495, ip, executor 547): ExecutorLostFailure (executor 547 exited caused by one of the running tasks) Reason: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
18/10/09 03:22:15 ERROR YarnScheduler: Lost executor 572 on ip: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
18/10/09 03:22:15 INFO DAGScheduler: Executor lost: 547 (epoch 45)
18/10/09 03:22:15 WARN TaskSetManager: Lost task 756.0 in stage 19.0 (TID 1565498, ip, executor 572): ExecutorLostFailure (executor 572 exited caused by one of the running tasks) Reason: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
...
The strange thing is that I can't even see the lost executors in the Executors list to check their logs.
It would be great if someone could help fix the problem.
There are many factors that can cause this, but the summary is the following:
Your master node is unable to reply to a specific executor, and therefore you get the error
Unable to register with external shuffle server due to
There can be different reasons why your master node is unable to reply; it depends on how your code is structured and, if you are using EMR, on the size of your instances.
To solve it:
Increase your master node. For example, if you are using i3.4xlarge, use i3.8xlarge or even i3.16xlarge instead.
Increase the network timeout from 2 minutes to 5 minutes. This is done with the following Spark configuration: spark.network.timeout=300
Increase both the memory and the number of cores of your master node. To increase the number of cores of your master node, set the following configuration: spark.yarn.am.cores=3
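For example, these two settings could be applied when building the SparkSession (a minimal sketch; the application name is an assumption, and the same properties can equally be passed as --conf arguments to spark-submit):
import org.apache.spark.sql.SparkSession

// Sketch: raise the network/RPC timeout and give the YARN application master more cores.
// spark.network.timeout defaults to 120s; spark.yarn.am.cores applies in client deploy-mode.
val spark = SparkSession.builder()
  .appName("my-job")                         // hypothetical application name
  .config("spark.network.timeout", "300s")   // 5 minutes instead of the 2-minute default
  .config("spark.yarn.am.cores", "3")        // more cores for the YARN application master
  .getOrCreate()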
Hope this solves the issue.

Spark worker with 32GB or more memory encountered a fatal error

I have three slaves in a standalone Spark Cluster. Each slave has 48GB of RAM. When I assigned more than 31GB (e.g. 32GB or more) of RAM to my executors:
.config("spark.executor.memory", "44g")
the executors were terminated, without much information, during a join of two large DataFrames. The output message at the driver showed "missing an output location for shuffle":
17/09/21 12:34:18 INFO StandaloneSchedulerBackend: Granted executor ID app-20170921123240-0000/3 on hostPort XXX.XXX.XXX.92:33705 with 6 cores, 44.0 GB RAM
17/09/21 12:34:18 WARN TaskSetManager: Lost task 14.0 in stage 7.0 (TID 124, XXX.XXX.XXX.92, executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/09/21 12:34:18 WARN TaskSetManager: Lost task 5.0 in stage 7.0 (TID 115, XXX.XXX.XXX.92, executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/09/21 12:34:18 WARN TaskSetManager: Lost task 17.0 in stage 7.0 (TID 127, XXX.XXX.XXX.92, executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/09/21 12:34:18 WARN TaskSetManager: Lost task 8.0 in stage 7.0 (TID 118, XXX.XXX.XXX.92, executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/09/21 12:34:18 WARN TaskSetManager: Lost task 2.0 in stage 7.0 (TID 112, XXX.XXX.XXX.92, executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/09/21 12:34:18 WARN TaskSetManager: Lost task 11.0 in stage 7.0 (TID 121, XXX.XXX.XXX.92, executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/09/21 12:34:18 INFO DAGScheduler: Executor lost: 0 (epoch 5)
17/09/21 12:34:18 INFO BlockManagerMaster: Removal of executor 0 requested
17/09/21 12:34:18 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 0
17/09/21 12:34:18 INFO BlockManagerMasterEndpoint: Trying to remove executor 0 from BlockManagerMaster.
17/09/21 12:34:18 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_10_2 !
17/09/21 12:34:18 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_10_11 !
17/09/21 12:34:18 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20170921123240-0000/3 is now RUNNING
17/09/21 12:34:18 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_10_5 !
17/09/21 12:34:18 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_10_8 !
17/09/21 12:34:18 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(0, XXX.XXX.XXX, 34840, None)
17/09/21 12:34:18 INFO BlockManagerMasterEndpoint: Trying to remove executor 0 from BlockManagerMaster.
17/09/21 12:34:18 INFO BlockManagerMaster: Removed 0 successfully in removeExecutor
The Spark Master log showed that the executors were "EXITED" and then relaunched:
17/09/21 12:34:18 INFO Master: Removing executor app-20170921123240-0000/0 because it is EXITED
17/09/21 12:34:18 INFO Master: Launching executor app-20170921123240-0000/3 on worker worker-20170921123014-152.83.247.92-33705
The Spark Worker log showed that the executor exited with code 134:
17/09/21 12:34:18 INFO Worker: Executor app-20170921123240-0000/0 finished with state EXITED message Command exited with code 134 exitStatus 134
The only clue seems to be in the error log of the application, showing a fatal error has been detected by the JRE:
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007fdec0c92a73, pid=11300, tid=0x00007fd3a6951700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_131-b11) (build 1.8.0_131-b11)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.131-b11 mixed mode linux-amd64 )
# Problematic frame:
# V [libjvm.so+0x3ffa73] CardTableExtension::scavenge_contents_parallel(ObjectStartArray*, MutableSpace*, HeapWord*, PSPromotionManager*, unsigned int, unsigned int)+0x5e3
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# If you would like to submit a bug report, please visit:
# http://bugreport.java.com/bugreport/crash.jsp
#
--------------- T H R E A D ---------------
Current thread (0x0000000001c9e800): GCTaskThread [stack: 0x00007fd3a6851000,0x00007fd3a6952000] [id=11308]
siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x0000000000000008
As long as I assign 31GB of RAM (or less) to each executor, my program works just fine. Has anyone encountered such a problem before?
44 GB may actually give you a smaller usable heap than 31 GB due to how Java stores object references: for heap sizes over 32 GB, the JVM has to switch to 64-bit object references, which means all objects take up more space. More details here: http://java-performance.info/over-32g-heap-java/
My rule of thumb is to either stay below 32 GB or go much higher (say, 50 GB). Usually it is more cost efficient to use multiple JVMs, each having less than 32 GB heap. With 48 GB RAM I would stick to 31 GB heap.
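To make the rule of thumb concrete, here is a minimal sketch of a session configuration that stays under the threshold (the application name is an assumption; 31g matches the heap size that worked for you, and 6 cores matches the executor shown in your logs):
import org.apache.spark.sql.SparkSession

// Stay just under the ~32 GB threshold so the JVM keeps using compressed (32-bit) object references.
val spark = SparkSession.builder()
  .appName("large-join")                    // hypothetical application name
  .config("spark.executor.memory", "31g")   // below 32 GB: compressed oops remain enabled
  .config("spark.executor.cores", "6")      // matches the 6 cores per executor in the logs above
  .getOrCreate()

With 48 GB of RAM per worker, a 31 GB executor heap also leaves headroom for the OS and Spark's off-heap overhead.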

Apache Spark Configuration

I'm facing a memory issue while running my Spark job, and I have changed the settings to the maximum memory, but it's still not working. Please check out the following issue:
Command:
spark2-shell --conf "spark.default.parallelism=40" --executor-memory 8g --driver-memory 32g --conf "spark.ui.port=4404" --conf spark.driver.maxResultSize=2048m --conf spark.executor.heartbeatInterval=200s
Error:
ERROR cluster.YarnScheduler: Lost executor 9 on ampanacdddbp01.au.amp.local: Executor heartbeat timed out after 123643 ms
WARN scheduler.TaskSetManager: Lost task 19.0 in stage 0.0 (TID 19, ampanacdddbp01.au.amp.local, executor 9): ExecutorLostFailure (executor 9 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 123643 ms
WARN spark.HeartbeatReceiver: Removing executor 3 with no recent heartbeats: 126935 ms exceeds timeout 120000 ms
ERROR cluster.YarnScheduler: Lost executor 3 on ampanacdddbp01.au.amp.local: Executor heartbeat timed out after 126935 ms
ERROR scheduler.TaskSetManager: Total size of serialized results of 23 tasks (1040.5 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
ERROR scheduler.TaskSetManager: Total size of serialized results of 24 tasks (1085.8 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
Please help me with the configuration and how to fix this "lost executor error".
The default value of the parameter "spark.driver.maxResultSize" is 1g, which is 1024 MB. Since your application is trying to return more serialized results to the driver than this property allows, you are getting this error.
Try changing the value as follows:
Either pass a command-line argument while launching spark-shell, such as "--conf spark.driver.maxResultSize=4g"
Set the value of this property cluster-wide in "conf/spark-defaults.conf" (an example entry is shown after the code below)
Set the property at the SparkContext level as follows:
from pyspark import SparkConf, SparkContext
conf = SparkConf().set('spark.driver.maxResultSize', '4g')
sc = SparkContext(conf=conf)
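For the config-file option above, the corresponding entry in conf/spark-defaults.conf is a single property line, for example (using the same 4g value):
spark.driver.maxResultSize    4g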
Hope it Helps.
Regards,
Neeraj

Spark doesn't seem to be fault tolerant to workers dying

I've got a test Spark cluster going on AWS (1 master + 5 worker machines, all running Spark 2.1.0 with Scala 2.11.8 on m4.2xlarge instances), and I'm running the ALS demo code in a Spark shell to test out performance.
I noticed that when I terminate some of the worker machines (all the while keeping the master up), the workload redistributes to the remaining workers, but when I kill all of the workers, the job usually dies completely instead of patiently waiting for more workers to come along. Is this normal behavior?
My shell session is below. The first few lines are the ALS app, and the rest are the error messages. You'll notice that the first time I kill all workers (executor IDs: 0, 1, 2, 3, 4) the shell waits for more workers to come online, like it's supposed to. Once I do bring up more workers (IDs: 10, 11, 12, 13, 14), the application continues on its way. But when I terminate those new workers as well, the entire job aborts with SparkException: Job aborted due to stage failure.
Is this normal behavior? If not, what am I doing wrong? If so, how can I improve Spark's tolerance to (possibly all) workers dying? Any insight into this would be appreciated.
Spark context Web UI available at http://xxx.xxx.xxx.133:4040
Spark context available as 'sc' (master = spark://xxx.xxx.xxx.133:7077, app id = app-20170222012148-0005).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/
Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_92)
Type in expressions to have them evaluated.
Type :help for more information.
scala> :paste
// Entering paste mode (ctrl-D to finish)
import org.apache.spark.mllib.recommendation._
val data = sc.textFile("s3n://my.bucket/training-set.tsv")
val ratings = data.map(_.split('\t') match { case Array(user, item, rate) =>
Rating(user.toInt, item.toInt, rate.toDouble)
})
val rank = 10
val numIterations = 10
val model = ALS.train(ratings, rank, numIterations, 0.01)
// Exiting paste mode, now interpreting.
[Stage 0:===========> (5 + 19) / 24]
17/02/22 01:23:32 ERROR TaskSchedulerImpl: Lost executor 1 on xxx.xxx.xxx.174: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TaskSetManager: Lost task 11.0 in stage 0.0 (TID 11, xxx.xxx.xxx.174, executor 1): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TaskSetManager: Lost task 16.0 in stage 0.0 (TID 16, xxx.xxx.xxx.174, executor 1): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1, xxx.xxx.xxx.174, executor 1): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TaskSetManager: Lost task 6.0 in stage 0.0 (TID 6, xxx.xxx.xxx.174, executor 1): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TransportChannelHandler: Exception in connection from /xxx.xxx.xxx.118:60180
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:221)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:899)
at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:275)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:652)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:575)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:489)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:451)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:745)
17/02/22 01:23:32 ERROR TaskSchedulerImpl: Lost executor 0 on xxx.xxx.xxx.118: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TaskSetManager: Lost task 16.1 in stage 0.0 (TID 26, xxx.xxx.xxx.118, executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TaskSetManager: Lost task 4.0 in stage 0.0 (TID 4, xxx.xxx.xxx.118, executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TaskSetManager: Lost task 9.0 in stage 0.0 (TID 9, xxx.xxx.xxx.118, executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 ERROR TaskSchedulerImpl: Lost executor 2 on xxx.xxx.xxx.253: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TaskSetManager: Lost task 23.0 in stage 0.0 (TID 23, xxx.xxx.xxx.253, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TaskSetManager: Lost task 8.0 in stage 0.0 (TID 8, xxx.xxx.xxx.253, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TaskSetManager: Lost task 13.0 in stage 0.0 (TID 13, xxx.xxx.xxx.253, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TaskSetManager: Lost task 1.1 in stage 0.0 (TID 25, xxx.xxx.xxx.253, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TaskSetManager: Lost task 18.0 in stage 0.0 (TID 18, xxx.xxx.xxx.253, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TaskSetManager: Lost task 3.0 in stage 0.0 (TID 3, xxx.xxx.xxx.253, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TaskSetManager: Lost task 4.1 in stage 0.0 (TID 30, xxx.xxx.xxx.253, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TaskSetManager: Lost task 14.1 in stage 0.0 (TID 33, xxx.xxx.xxx.253, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 ERROR TaskSchedulerImpl: Lost executor 4 on xxx.xxx.xxx.200: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TaskSetManager: Lost task 19.1 in stage 0.0 (TID 32, xxx.xxx.xxx.200, executor 4): ExecutorLostFailure (executor 4 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TaskSetManager: Lost task 9.1 in stage 0.0 (TID 29, xxx.xxx.xxx.200, executor 4): ExecutorLostFailure (executor 4 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TaskSetManager: Lost task 20.0 in stage 0.0 (TID 20, xxx.xxx.xxx.200, executor 4): ExecutorLostFailure (executor 4 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TaskSetManager: Lost task 5.0 in stage 0.0 (TID 5, xxx.xxx.xxx.200, executor 4): ExecutorLostFailure (executor 4 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TaskSetManager: Lost task 10.0 in stage 0.0 (TID 10, xxx.xxx.xxx.200, executor 4): ExecutorLostFailure (executor 4 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TaskSetManager: Lost task 21.1 in stage 0.0 (TID 28, xxx.xxx.xxx.200, executor 4): ExecutorLostFailure (executor 4 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TaskSetManager: Lost task 15.0 in stage 0.0 (TID 15, xxx.xxx.xxx.200, executor 4): ExecutorLostFailure (executor 4 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TaskSetManager: Lost task 6.1 in stage 0.0 (TID 24, xxx.xxx.xxx.200, executor 4): ExecutorLostFailure (executor 4 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:33 ERROR TaskSchedulerImpl: Lost executor 3 on xxx.xxx.xxx.136: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:33 WARN TaskSetManager: Lost task 4.2 in stage 0.0 (TID 35, xxx.xxx.xxx.136, executor 3): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:33 WARN TaskSetManager: Lost task 17.0 in stage 0.0 (TID 17, xxx.xxx.xxx.136, executor 3): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:33 WARN TaskSetManager: Lost task 2.0 in stage 0.0 (TID 2, xxx.xxx.xxx.136, executor 3): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:33 WARN TaskSetManager: Lost task 16.2 in stage 0.0 (TID 31, xxx.xxx.xxx.136, executor 3): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:33 WARN TaskSetManager: Lost task 7.0 in stage 0.0 (TID 7, xxx.xxx.xxx.136, executor 3): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:33 WARN TaskSetManager: Lost task 14.2 in stage 0.0 (TID 34, xxx.xxx.xxx.136, executor 3): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:33 WARN TaskSetManager: Lost task 11.1 in stage 0.0 (TID 27, xxx.xxx.xxx.136, executor 3): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:33 WARN TaskSetManager: Lost task 12.0 in stage 0.0 (TID 12, xxx.xxx.xxx.136, executor 3): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
[Stage 0:==============================> (13 + 11) / 24]
17/02/22 01:26:29 ERROR TaskSchedulerImpl: Lost executor 13 on xxx.xxx.xxx.136: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:29 ERROR TaskSchedulerImpl: Lost executor 14 on xxx.xxx.xxx.200: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:29 ERROR TaskSchedulerImpl: Lost executor 11 on xxx.xxx.xxx.118: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:29 WARN TaskSetManager: Lost task 20.1 in stage 0.0 (TID 50, xxx.xxx.xxx.118, executor 11): ExecutorLostFailure (executor 11 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:29 WARN TaskSetManager: Lost task 5.1 in stage 0.0 (TID 49, xxx.xxx.xxx.118, executor 11): ExecutorLostFailure (executor 11 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:29 WARN TaskSetManager: Lost task 6.2 in stage 0.0 (TID 45, xxx.xxx.xxx.118, executor 11): ExecutorLostFailure (executor 11 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:29 WARN TaskSetManager: Lost task 10.1 in stage 0.0 (TID 48, xxx.xxx.xxx.118, executor 11): ExecutorLostFailure (executor 11 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:29 WARN TaskSetManager: Lost task 9.2 in stage 0.0 (TID 51, xxx.xxx.xxx.118, executor 11): ExecutorLostFailure (executor 11 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
[Stage 0:==============================> (13 + 8) / 24]
17/02/22 01:26:30 ERROR TaskSchedulerImpl: Lost executor 10 on xxx.xxx.xxx.174: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:30 WARN TaskSetManager: Lost task 2.1 in stage 0.0 (TID 41, xxx.xxx.xxx.174, executor 10): ExecutorLostFailure (executor 10 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:30 WARN TaskSetManager: Lost task 14.3 in stage 0.0 (TID 38, xxx.xxx.xxx.174, executor 10): ExecutorLostFailure (executor 10 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:30 WARN TaskSetManager: Lost task 16.3 in stage 0.0 (TID 40, xxx.xxx.xxx.174, executor 10): ExecutorLostFailure (executor 10 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:30 ERROR TaskSetManager: Task 16 in stage 0.0 failed 4 times; aborting job
17/02/22 01:26:30 WARN TaskSetManager: Lost task 4.3 in stage 0.0 (TID 43, xxx.xxx.xxx.174, executor 10): ExecutorLostFailure (executor 10 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:30 WARN TaskSetManager: Lost task 11.2 in stage 0.0 (TID 37, xxx.xxx.xxx.174, executor 10): ExecutorLostFailure (executor 10 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:30 WARN TaskSetManager: Lost task 9.3 in stage 0.0 (TID 60, xxx.xxx.xxx.174, executor 10): ExecutorLostFailure (executor 10 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:30 WARN TaskSetManager: Lost task 12.1 in stage 0.0 (TID 36, xxx.xxx.xxx.174, executor 10): ExecutorLostFailure (executor 10 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:30 WARN TaskSetManager: Lost task 7.1 in stage 0.0 (TID 39, xxx.xxx.xxx.174, executor 10): ExecutorLostFailure (executor 10 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:30 ERROR TaskSchedulerImpl: Lost executor 12 on xxx.xxx.xxx.253: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:30 WARN TaskSetManager: Lost task 6.3 in stage 0.0 (TID 62, xxx.xxx.xxx.253, executor 12): ExecutorLostFailure (executor 12 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:30 WARN TaskSetManager: Lost task 1.2 in stage 0.0 (TID 56, xxx.xxx.xxx.253, executor 12): ExecutorLostFailure (executor 12 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:30 WARN TaskSetManager: Lost task 20.2 in stage 0.0 (TID 64, xxx.xxx.xxx.253, executor 12): ExecutorLostFailure (executor 12 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:30 WARN TaskSetManager: Lost task 8.1 in stage 0.0 (TID 58, xxx.xxx.xxx.253, executor 12): ExecutorLostFailure (executor 12 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:30 WARN TaskSetManager: Lost task 10.2 in stage 0.0 (TID 61, xxx.xxx.xxx.253, executor 12): ExecutorLostFailure (executor 12 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:30 WARN TaskSetManager: Lost task 5.2 in stage 0.0 (TID 63, xxx.xxx.xxx.253, executor 12): ExecutorLostFailure (executor 12 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:30 WARN TaskSetManager: Lost task 3.1 in stage 0.0 (TID 54, xxx.xxx.xxx.253, executor 12): ExecutorLostFailure (executor 12 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:30 WARN TaskSetManager: Lost task 13.1 in stage 0.0 (TID 57, xxx.xxx.xxx.253, executor 12): ExecutorLostFailure (executor 12 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 16 in stage 0.0 failed 4 times, most recent failure: Lost task 16.3 in stage 0.0 (TID 40, xxx.xxx.xxx.174, executor 10): ExecutorLostFailure (executor 10 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1958)
at org.apache.spark.rdd.RDD.count(RDD.scala:1157)
at org.apache.spark.ml.recommendation.ALS$.train(ALS.scala:694)
at org.apache.spark.mllib.recommendation.ALS.run(ALS.scala:253)
at org.apache.spark.mllib.recommendation.ALS$.train(ALS.scala:340)
at org.apache.spark.mllib.recommendation.ALS$.train(ALS.scala:357)
... 53 elided
scala>
