Why I get executor refused connection? - apache-spark

I have something strange in the execution of my code:
When I execute the following line:
sourceList = joinLabelrdd_df.select("x").collect()
I get the following execption. Noting I have enough memory and cpus.
19/07/14 11:22:34 ERROR TaskSchedulerImpl: Lost executor 5 on 172.16.140.68: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
19/07/14 11:22:34 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20190714111835-0000/5 is now EXITED (Command exited with code 137)
This error cause another exception:
19/07/14 11:22:41 WARN TaskSetManager: Lost task 113.1 in stage 9.0 (TID 2154, 172.16.140.113, executor 9): FetchFailed(null, shuffleId=0, mapId=-1, reduceId=113, message=
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0

Related

Spark executor self-exiting due to driver disassociated in Kubernetes with client deploy-mode

I have Spark driver running in a Kubernetes pod with client deploy-mode and it tries to start an executor.
Executor will fail with error:
{"type":"log", "level":"ERROR", "name":"STREAMING_OTHERS", "time":"2022-01-14T12:29:38.318Z", "timezone":"UTC", "class":"dispatcher-Executor", "method":"spark.executor.CoarseGrainedExecutorBackend.logError(73)", "log":"Executor self-exiting due to : Driver 192-168-39-71.mni-system.pod.cluster.local:40752 disassociated! Shutting down.\n"}
Then driver will attempt to start another executor which fails with same error and this goes on and on.
In the driver pod, I see only following errors:
22/01/14 12:26:32 ERROR TaskSchedulerImpl: Lost executor 1 on 192.168.43.250:
22/01/14 12:27:16 ERROR TaskSchedulerImpl: Lost executor 2 on 192.168.43.233:
22/01/14 12:27:59 ERROR TaskSchedulerImpl: Lost executor 3 on 192.168.43.221:
22/01/14 12:28:43 ERROR TaskSchedulerImpl: Lost executor 4 on 192.168.43.217:
22/01/14 12:29:27 ERROR TaskSchedulerImpl: Lost executor 5 on 192.168.43.197:
22/01/14 12:30:10 ERROR TaskSchedulerImpl: Lost executor 6 on 192.168.43.237:
22/01/14 12:30:53 ERROR TaskSchedulerImpl: Lost executor 7 on 192.168.43.196:
22/01/14 12:31:42 ERROR TaskSchedulerImpl: Lost executor 8 on 192.168.43.228:
22/01/14 12:32:31 ERROR TaskSchedulerImpl: Lost executor 9 on 192.168.43.254:
22/01/14 12:33:14 ERROR TaskSchedulerImpl: Lost executor 10 on 192.168.43.204:
22/01/14 12:33:57 ERROR TaskSchedulerImpl: Lost executor 11 on 192.168.43.231:
What is wrong? And how can I get executors running correctly?
Looks like the cluster is disconnected. Which platform are you using?
We are using Kubernetes running on private cloud running OpenStack

Lost executor Spark

I have a long running job on Spark, which after running for hours failed with the following errors.
18/10/09 03:22:15 ERROR YarnScheduler: Lost executor 547 on ip: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
18/10/09 03:22:15 WARN TaskSetManager: Lost task 750.0 in stage 19.0 (TID 1565492, ip, executor 547): ExecutorLostFailure (executor 547 exited caused by one of the running tasks) Reason: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
18/10/09 03:22:15 WARN TaskSetManager: Lost task 752.0 in stage 19.0 (TID 1565494, ip, executor 547): ExecutorLostFailure (executor 547 exited caused by one of the running tasks) Reason: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
18/10/09 03:22:15 WARN TaskSetManager: Lost task 751.0 in stage 19.0 (TID 1565493, ip, executor 547): ExecutorLostFailure (executor 547 exited caused by one of the running tasks) Reason: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
18/10/09 03:22:15 WARN TaskSetManager: Lost task 754.0 in stage 19.0 (TID 1565496, ip, executor 547): ExecutorLostFailure (executor 547 exited caused by one of the running tasks) Reason: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
18/10/09 03:22:15 WARN TaskSetManager: Lost task 753.0 in stage 19.0 (TID 1565495, ip, executor 547): ExecutorLostFailure (executor 547 exited caused by one of the running tasks) Reason: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
18/10/09 03:22:15 ERROR YarnScheduler: Lost executor 572 on ip: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
18/10/09 03:22:15 INFO DAGScheduler: Executor lost: 547 (epoch 45)
18/10/09 03:22:15 WARN TaskSetManager: Lost task 756.0 in stage 19.0 (TID 1565498, ip, executor 572): ExecutorLostFailure (executor 572 exited caused by one of the running tasks) Reason: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
...
The strange thing is, I can't even see the lost executors on the Executor list for the log.
It would be great if someone can help fix the problem.
There are many factors for this to happen but the summary is the following:
Your master node is unable to reply to a specific executor and therefore gives the error
Unable to register with external shuffle server due to
Why your master node is unable to reply can be of different reasons. Depends on how your code is structured, the size of your instance if you are using EMR.
To solve it
Increase your master node. For example, if you are using i3.4xlarge, instead use i3.8xlarge or even i3.16xlarge.
Increase the network timeout from 2 minutes to 5 minutes. This is done with the following spark configuration: spark.network.timeout=300
Increase both the memory and number of cores of your master node. To increase the number of cores of your master node, set the following configuration. spark.yarn.am.cores=3
Hope this solves the issue.

spark worker with 32GB or more memory encountered a fatal error

I have three slaves in a standalone Spark Cluster. Each slave has 48GB of RAM. When I assigned more than 31GB (e.g. 32GB or more) of RAM to my executors:
.config("spark.executor.memory", "44g")
the executors were terminated without much information during a join of two large Dataframes. The output message at the Slave driver showed "missing an output location for shuffle":
17/09/21 12:34:18 INFO StandaloneSchedulerBackend: Granted executor ID app-20170921123240-0000/3 on hostPort XXX.XXX.XXX.92:33705 with 6 cores, 44.0 GB RAM
17/09/21 12:34:18 WARN TaskSetManager: Lost task 14.0 in stage 7.0 (TID 124, XXX.XXX.XXX.92, executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/09/21 12:34:18 WARN TaskSetManager: Lost task 5.0 in stage 7.0 (TID 115, XXX.XXX.XXX.92, executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/09/21 12:34:18 WARN TaskSetManager: Lost task 17.0 in stage 7.0 (TID 127, XXX.XXX.XXX.92, executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/09/21 12:34:18 WARN TaskSetManager: Lost task 8.0 in stage 7.0 (TID 118, XXX.XXX.XXX.92, executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/09/21 12:34:18 WARN TaskSetManager: Lost task 2.0 in stage 7.0 (TID 112, XXX.XXX.XXX.92, executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/09/21 12:34:18 WARN TaskSetManager: Lost task 11.0 in stage 7.0 (TID 121, XXX.XXX.XXX.92, executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/09/21 12:34:18 INFO DAGScheduler: Executor lost: 0 (epoch 5)
17/09/21 12:34:18 INFO BlockManagerMaster: Removal of executor 0 requested
17/09/21 12:34:18 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 0
17/09/21 12:34:18 INFO BlockManagerMasterEndpoint: Trying to remove executor 0 from BlockManagerMaster.
17/09/21 12:34:18 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_10_2 !
17/09/21 12:34:18 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_10_11 !
17/09/21 12:34:18 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20170921123240-0000/3 is now RUNNING
17/09/21 12:34:18 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_10_5 !
17/09/21 12:34:18 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_10_8 !
17/09/21 12:34:18 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(0, XXX.XXX.XXX, 34840, None)
17/09/21 12:34:18 INFO BlockManagerMasterEndpoint: Trying to remove executor 0 from BlockManagerMaster.
17/09/21 12:34:18 INFO BlockManagerMaster: Removed 0 successfully in removeExecutor
The log message of Spark Master showed that the executors were "EXITED" and then relaunched:
17/09/21 12:34:18 INFO Master: Removing executor app-20170921123240-0000/0 because it is EXITED
17/09/21 12:34:18 INFO Master: Launching executor app-20170921123240-0000/3 on worker worker-20170921123014-152.83.247.92-33705
The log message of Spark Worker showed that the executor exited with code 134
17/09/21 12:34:18 INFO Worker: Executor app-20170921123240-0000/0 finished with state EXITED message Command exited with code 134 exitStatus 134
The only clue seems to be in the error log of the application, showing a fatal error has been detected by the JRE:
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007fdec0c92a73, pid=11300, tid=0x00007fd3a6951700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_131-b11) (build 1.8.0_131-b11)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.131-b11 mixed mode linux-amd64 )
# Problematic frame:
# V [libjvm.so+0x3ffa73] CardTableExtension::scavenge_contents_parallel(ObjectStartArray*, MutableSpace*, HeapWord*, PSPromotionManager*, unsigned int, unsigned int)+0x5e3
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# If you would like to submit a bug report, please visit:
# http://bugreport.java.com/bugreport/crash.jsp
#
--------------- T H R E A D ---------------
Current thread (0x0000000001c9e800): GCTaskThread [stack: 0x00007fd3a6851000,0x00007fd3a6952000] [id=11308]
siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x0000000000000008
As long as I assign 31GB of RAM (or less) to each executor, my program works just fine. Has anyone encountered such problem before?
44 GB may actually give you a smaller usable heap than 31 GB due to how Java stores object references: For heap sizes over 32 GB the JVM has to switch to 64 bit object references, which means all objects take up more space. More details here: http://java-performance.info/over-32g-heap-java/
My rule of thumb is to either stay below 32 GB or go much higher (say, 50 GB). Usually it is more cost efficient to use multiple JVMs, each having less than 32 GB heap. With 48 GB RAM I would stick to 31 GB heap.

Spark doesn't seem to be fault tolerant to workers dying

I've got a test Spark cluster going on AWS (1 master + 5 worker machines, all running Spark 2.1.0 with Scala 2.11.8 on m4.2xlarge instances), and I'm running the ALS demo code in a Spark shell to test out performance.
I noticed when I terminate all worker machines (all the while keeping the master up), the workload redistributes to the remaining workers, but when I kill all the workers, the job usually dies completely instead of patiently waiting for more workers to come along. Is this normal behavior?
My shell session is below. The first few lines are the ALS app, and the rest are the error messages. You'll notice that the first time I kill all workers (executor IDs: 0, 1, 2, 3, 4) the shell waits for more workers to come online, like it's supposed to. Once I do bring up more workers (IDs: 10, 11, 12, 13, 14), the application continues on its way. But when I terminate those new workers as well, the entire job aborts with SparkException: Job aborted due to stage failure.
Is this normal behavior? If not, what am I doing wrong? If so, how can I improve Spark's tolerance to (possibly all) workers dying? Any insight into this would be appreciated.
Spark context Web UI available at http://xxx.xxx.xxx.133:4040
Spark context available as 'sc' (master = spark://xxx.xxx.xxx.133:7077, app id = app-20170222012148-0005).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.1.0
/_/
Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_92)
Type in expressions to have them evaluated.
Type :help for more information.
scala> :paste
// Entering paste mode (ctrl-D to finish)
import org.apache.spark.mllib.recommendation._
val data = sc.textFile("s3n://my.bucket/training-set.tsv")
val ratings = data.map(_.split('\t') match { case Array(user, item, rate) =>
Rating(user.toInt, item.toInt, rate.toDouble)
})
val rank = 10
val numIterations = 10
val model = ALS.train(ratings, rank, numIterations, 0.01)
// Exiting paste mode, now interpreting.
[Stage 0:===========> (5 + 19) / 24]
17/02/22 01:23:32 ERROR TaskSchedulerImpl: Lost executor 1 on xxx.xxx.xxx.174: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TaskSetManager: Lost task 11.0 in stage 0.0 (TID 11, xxx.xxx.xxx.174, executor 1): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TaskSetManager: Lost task 16.0 in stage 0.0 (TID 16, xxx.xxx.xxx.174, executor 1): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1, xxx.xxx.xxx.174, executor 1): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TaskSetManager: Lost task 6.0 in stage 0.0 (TID 6, xxx.xxx.xxx.174, executor 1): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TransportChannelHandler: Exception in connection from /xxx.xxx.xxx.118:60180
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:221)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:899)
at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:275)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:652)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:575)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:489)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:451)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:745)
17/02/22 01:23:32 ERROR TaskSchedulerImpl: Lost executor 0 on xxx.xxx.xxx.118: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TaskSetManager: Lost task 16.1 in stage 0.0 (TID 26, xxx.xxx.xxx.118, executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TaskSetManager: Lost task 4.0 in stage 0.0 (TID 4, xxx.xxx.xxx.118, executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TaskSetManager: Lost task 9.0 in stage 0.0 (TID 9, xxx.xxx.xxx.118, executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 ERROR TaskSchedulerImpl: Lost executor 2 on xxx.xxx.xxx.253: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TaskSetManager: Lost task 23.0 in stage 0.0 (TID 23, xxx.xxx.xxx.253, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TaskSetManager: Lost task 8.0 in stage 0.0 (TID 8, xxx.xxx.xxx.253, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TaskSetManager: Lost task 13.0 in stage 0.0 (TID 13, xxx.xxx.xxx.253, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TaskSetManager: Lost task 1.1 in stage 0.0 (TID 25, xxx.xxx.xxx.253, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TaskSetManager: Lost task 18.0 in stage 0.0 (TID 18, xxx.xxx.xxx.253, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TaskSetManager: Lost task 3.0 in stage 0.0 (TID 3, xxx.xxx.xxx.253, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TaskSetManager: Lost task 4.1 in stage 0.0 (TID 30, xxx.xxx.xxx.253, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TaskSetManager: Lost task 14.1 in stage 0.0 (TID 33, xxx.xxx.xxx.253, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 ERROR TaskSchedulerImpl: Lost executor 4 on xxx.xxx.xxx.200: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TaskSetManager: Lost task 19.1 in stage 0.0 (TID 32, xxx.xxx.xxx.200, executor 4): ExecutorLostFailure (executor 4 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TaskSetManager: Lost task 9.1 in stage 0.0 (TID 29, xxx.xxx.xxx.200, executor 4): ExecutorLostFailure (executor 4 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TaskSetManager: Lost task 20.0 in stage 0.0 (TID 20, xxx.xxx.xxx.200, executor 4): ExecutorLostFailure (executor 4 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TaskSetManager: Lost task 5.0 in stage 0.0 (TID 5, xxx.xxx.xxx.200, executor 4): ExecutorLostFailure (executor 4 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TaskSetManager: Lost task 10.0 in stage 0.0 (TID 10, xxx.xxx.xxx.200, executor 4): ExecutorLostFailure (executor 4 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TaskSetManager: Lost task 21.1 in stage 0.0 (TID 28, xxx.xxx.xxx.200, executor 4): ExecutorLostFailure (executor 4 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TaskSetManager: Lost task 15.0 in stage 0.0 (TID 15, xxx.xxx.xxx.200, executor 4): ExecutorLostFailure (executor 4 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TaskSetManager: Lost task 6.1 in stage 0.0 (TID 24, xxx.xxx.xxx.200, executor 4): ExecutorLostFailure (executor 4 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:33 ERROR TaskSchedulerImpl: Lost executor 3 on xxx.xxx.xxx.136: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:33 WARN TaskSetManager: Lost task 4.2 in stage 0.0 (TID 35, xxx.xxx.xxx.136, executor 3): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:33 WARN TaskSetManager: Lost task 17.0 in stage 0.0 (TID 17, xxx.xxx.xxx.136, executor 3): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:33 WARN TaskSetManager: Lost task 2.0 in stage 0.0 (TID 2, xxx.xxx.xxx.136, executor 3): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:33 WARN TaskSetManager: Lost task 16.2 in stage 0.0 (TID 31, xxx.xxx.xxx.136, executor 3): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:33 WARN TaskSetManager: Lost task 7.0 in stage 0.0 (TID 7, xxx.xxx.xxx.136, executor 3): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:33 WARN TaskSetManager: Lost task 14.2 in stage 0.0 (TID 34, xxx.xxx.xxx.136, executor 3): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:33 WARN TaskSetManager: Lost task 11.1 in stage 0.0 (TID 27, xxx.xxx.xxx.136, executor 3): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:33 WARN TaskSetManager: Lost task 12.0 in stage 0.0 (TID 12, xxx.xxx.xxx.136, executor 3): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
[Stage 0:==============================> (13 + 11) / 24]
17/02/22 01:26:29 ERROR TaskSchedulerImpl: Lost executor 13 on xxx.xxx.xxx.136: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:29 ERROR TaskSchedulerImpl: Lost executor 14 on xxx.xxx.xxx.200: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:29 ERROR TaskSchedulerImpl: Lost executor 11 on xxx.xxx.xxx.118: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:29 WARN TaskSetManager: Lost task 20.1 in stage 0.0 (TID 50, xxx.xxx.xxx.118, executor 11): ExecutorLostFailure (executor 11 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:29 WARN TaskSetManager: Lost task 5.1 in stage 0.0 (TID 49, xxx.xxx.xxx.118, executor 11): ExecutorLostFailure (executor 11 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:29 WARN TaskSetManager: Lost task 6.2 in stage 0.0 (TID 45, xxx.xxx.xxx.118, executor 11): ExecutorLostFailure (executor 11 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:29 WARN TaskSetManager: Lost task 10.1 in stage 0.0 (TID 48, xxx.xxx.xxx.118, executor 11): ExecutorLostFailure (executor 11 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:29 WARN TaskSetManager: Lost task 9.2 in stage 0.0 (TID 51, xxx.xxx.xxx.118, executor 11): ExecutorLostFailure (executor 11 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
[Stage 0:==============================> (13 + 8) / 24]
17/02/22 01:26:30 ERROR TaskSchedulerImpl: Lost executor 10 on xxx.xxx.xxx.174: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:30 WARN TaskSetManager: Lost task 2.1 in stage 0.0 (TID 41, xxx.xxx.xxx.174, executor 10): ExecutorLostFailure (executor 10 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:30 WARN TaskSetManager: Lost task 14.3 in stage 0.0 (TID 38, xxx.xxx.xxx.174, executor 10): ExecutorLostFailure (executor 10 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:30 WARN TaskSetManager: Lost task 16.3 in stage 0.0 (TID 40, xxx.xxx.xxx.174, executor 10): ExecutorLostFailure (executor 10 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:30 ERROR TaskSetManager: Task 16 in stage 0.0 failed 4 times; aborting job
17/02/22 01:26:30 WARN TaskSetManager: Lost task 4.3 in stage 0.0 (TID 43, xxx.xxx.xxx.174, executor 10): ExecutorLostFailure (executor 10 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:30 WARN TaskSetManager: Lost task 11.2 in stage 0.0 (TID 37, xxx.xxx.xxx.174, executor 10): ExecutorLostFailure (executor 10 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:30 WARN TaskSetManager: Lost task 9.3 in stage 0.0 (TID 60, xxx.xxx.xxx.174, executor 10): ExecutorLostFailure (executor 10 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:30 WARN TaskSetManager: Lost task 12.1 in stage 0.0 (TID 36, xxx.xxx.xxx.174, executor 10): ExecutorLostFailure (executor 10 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:30 WARN TaskSetManager: Lost task 7.1 in stage 0.0 (TID 39, xxx.xxx.xxx.174, executor 10): ExecutorLostFailure (executor 10 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:30 ERROR TaskSchedulerImpl: Lost executor 12 on xxx.xxx.xxx.253: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:30 WARN TaskSetManager: Lost task 6.3 in stage 0.0 (TID 62, xxx.xxx.xxx.253, executor 12): ExecutorLostFailure (executor 12 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:30 WARN TaskSetManager: Lost task 1.2 in stage 0.0 (TID 56, xxx.xxx.xxx.253, executor 12): ExecutorLostFailure (executor 12 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:30 WARN TaskSetManager: Lost task 20.2 in stage 0.0 (TID 64, xxx.xxx.xxx.253, executor 12): ExecutorLostFailure (executor 12 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:30 WARN TaskSetManager: Lost task 8.1 in stage 0.0 (TID 58, xxx.xxx.xxx.253, executor 12): ExecutorLostFailure (executor 12 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:30 WARN TaskSetManager: Lost task 10.2 in stage 0.0 (TID 61, xxx.xxx.xxx.253, executor 12): ExecutorLostFailure (executor 12 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:30 WARN TaskSetManager: Lost task 5.2 in stage 0.0 (TID 63, xxx.xxx.xxx.253, executor 12): ExecutorLostFailure (executor 12 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:30 WARN TaskSetManager: Lost task 3.1 in stage 0.0 (TID 54, xxx.xxx.xxx.253, executor 12): ExecutorLostFailure (executor 12 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:30 WARN TaskSetManager: Lost task 13.1 in stage 0.0 (TID 57, xxx.xxx.xxx.253, executor 12): ExecutorLostFailure (executor 12 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 16 in stage 0.0 failed 4 times, most recent failure: Lost task 16.3 in stage 0.0 (TID 40, xxx.xxx.xxx.174, executor 10): ExecutorLostFailure (executor 10 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1958)
at org.apache.spark.rdd.RDD.count(RDD.scala:1157)
at org.apache.spark.ml.recommendation.ALS$.train(ALS.scala:694)
at org.apache.spark.mllib.recommendation.ALS.run(ALS.scala:253)
at org.apache.spark.mllib.recommendation.ALS$.train(ALS.scala:340)
at org.apache.spark.mllib.recommendation.ALS$.train(ALS.scala:357)
... 53 elided
scala>

Spark on yarn mode end with "Exit status: -100. Diagnostics: Container released on a *lost* node"

I am trying to load a database with 1TB data to spark on AWS using the latest EMR. And the running time is so long that it doesn't finished in even 6 hours, but after running 6h30m , I get some error announcing that Container released on a lost node and then the job failed. Logs are like this:
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144178.0 in stage 0.0 (TID 144178, ip-10-0-2-176.ec2.internal): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000006 on host: ip-10-0-2-176.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144181.0 in stage 0.0 (TID 144181, ip-10-0-2-176.ec2.internal): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000006 on host: ip-10-0-2-176.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144175.0 in stage 0.0 (TID 144175, ip-10-0-2-176.ec2.internal): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000006 on host: ip-10-0-2-176.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144213.0 in stage 0.0 (TID 144213, ip-10-0-2-176.ec2.internal): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000006 on host: ip-10-0-2-176.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO scheduler.DAGScheduler: Executor lost: 5 (epoch 0)
16/07/01 22:45:43 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 5 from BlockManagerMaster.
16/07/01 22:45:43 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(5, ip-10-0-2-176.ec2.internal, 43922)
16/07/01 22:45:43 INFO storage.BlockManagerMaster: Removed 5 successfully in removeExecutor
16/07/01 22:45:43 ERROR cluster.YarnClusterScheduler: Lost executor 6 on ip-10-0-2-173.ec2.internal: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO spark.ExecutorAllocationManager: Existing executor 5 has been removed (new total is 41)
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144138.0 in stage 0.0 (TID 144138, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144185.0 in stage 0.0 (TID 144185, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144184.0 in stage 0.0 (TID 144184, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144186.0 in stage 0.0 (TID 144186, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO scheduler.DAGScheduler: Executor lost: 6 (epoch 0)
16/07/01 22:45:43 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 6 from BlockManagerMaster.
16/07/01 22:45:43 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(6, ip-10-0-2-173.ec2.internal, 43593)
16/07/01 22:45:43 INFO storage.BlockManagerMaster: Removed 6 successfully in removeExecutor
16/07/01 22:45:43 ERROR cluster.YarnClusterScheduler: Lost executor 30 on ip-10-0-2-173.ec2.internal: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144162.0 in stage 0.0 (TID 144162, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 30 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO spark.ExecutorAllocationManager: Existing executor 6 has been removed (new total is 40)
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144156.0 in stage 0.0 (TID 144156, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 30 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144170.0 in stage 0.0 (TID 144170, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 30 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144169.0 in stage 0.0 (TID 144169, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 30 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO scheduler.DAGScheduler: Executor lost: 30 (epoch 0)
16/07/01 22:45:43 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1467389397754_0001_01_000024 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
I am pretty sure that my network setting works because I have tried to run this script on the same environment on a much smaller table.
Also, I am aware that somebody posted a question 6 months ago asking for the same issue:spark-job-error-yarnallocator-exit-status-100-diagnostics-container-released but I still have to ask because nobody was answering this question.
Looks like other peoples has the same issue as well, so I just post an answer instead of writing a comment. I am not sure that this would solve the issue but this should be an idea.
If you use spot instance, you should know that spot instance will be shut down if the price is higher than your input, and you will hit this issue. Even if you are just using a spot instance as a slave. So my solution is not using any spot instance for long term running job.
Another idea is to slice the job into many independent steps, so you can save the result of each step as a file on S3. If any error happened, just start from that step by the cached files.
is it dynamic allocation of memory ? I had similar issue I fixed it by going with static allocation by calculating executor memory, executor cores and executors.
Try Static allocation for huge workloads in Spark.
This means your YARN container is down, to debug what happened, you must read YARN logs, use the official CLI yarn logs -applicationId or feel free to use and contribute to my project https://github.com/ebuildy/yoga a YARN viewer as web app.
You should see lot of Worker errors.
I was hitting the same problem. I found some clues in this article on DZone:
https://dzone.com/articles/some-lessons-of-spark-and-memory-issues-on-emr
This one was solved by increasing the number of DataFrame partitions (in this case, from 1,024 to 2,048). That reduced the needed memory per partition.
So I tried to increase the number of DataFrame partitions which solved my issue.
AWS has released this as FAQ
For EMR:
https://aws.amazon.com/premiumsupport/knowledge-center/emr-exit-status-100-lost-node/
For Glue job:
https://aws.amazon.com/premiumsupport/knowledge-center/container-released-lost-node-100-glue/
Amazon has provided their solution, which is handled through resource allocation, and there is no processing method from the perspective of users
For my case, we were using GCP Dataproc cluster with 2 Pre-Emptible (default) Secondary Workers.
This isn't a problem for short running jobs since both primary and secondary workers finished quite quickly.
However, for long running jobs, it was observed that all primary workers finished assigned tasks quite quickly relative to secondary workers.
Due to the pre-emptible nature, containers get lost after running 3 hours for the tasks that were assigned to secondary workers. Thus, resulting in Container losts error.
I would recommend to not use secondary workers for any long running jobs.

Resources