Spark on YARN ends with "Exit status: -100. Diagnostics: Container released on a *lost* node"

I am trying to load a database with 1 TB of data into Spark on AWS using the latest EMR release. The job runs for a very long time: it had not finished after 6 hours, and after 6h30m I got errors saying "Container released on a lost node", and then the job failed. The logs look like this:
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144178.0 in stage 0.0 (TID 144178, ip-10-0-2-176.ec2.internal): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000006 on host: ip-10-0-2-176.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144181.0 in stage 0.0 (TID 144181, ip-10-0-2-176.ec2.internal): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000006 on host: ip-10-0-2-176.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144175.0 in stage 0.0 (TID 144175, ip-10-0-2-176.ec2.internal): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000006 on host: ip-10-0-2-176.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144213.0 in stage 0.0 (TID 144213, ip-10-0-2-176.ec2.internal): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000006 on host: ip-10-0-2-176.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO scheduler.DAGScheduler: Executor lost: 5 (epoch 0)
16/07/01 22:45:43 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 5 from BlockManagerMaster.
16/07/01 22:45:43 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(5, ip-10-0-2-176.ec2.internal, 43922)
16/07/01 22:45:43 INFO storage.BlockManagerMaster: Removed 5 successfully in removeExecutor
16/07/01 22:45:43 ERROR cluster.YarnClusterScheduler: Lost executor 6 on ip-10-0-2-173.ec2.internal: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO spark.ExecutorAllocationManager: Existing executor 5 has been removed (new total is 41)
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144138.0 in stage 0.0 (TID 144138, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144185.0 in stage 0.0 (TID 144185, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144184.0 in stage 0.0 (TID 144184, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144186.0 in stage 0.0 (TID 144186, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO scheduler.DAGScheduler: Executor lost: 6 (epoch 0)
16/07/01 22:45:43 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 6 from BlockManagerMaster.
16/07/01 22:45:43 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(6, ip-10-0-2-173.ec2.internal, 43593)
16/07/01 22:45:43 INFO storage.BlockManagerMaster: Removed 6 successfully in removeExecutor
16/07/01 22:45:43 ERROR cluster.YarnClusterScheduler: Lost executor 30 on ip-10-0-2-173.ec2.internal: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144162.0 in stage 0.0 (TID 144162, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 30 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO spark.ExecutorAllocationManager: Existing executor 6 has been removed (new total is 40)
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144156.0 in stage 0.0 (TID 144156, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 30 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144170.0 in stage 0.0 (TID 144170, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 30 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144169.0 in stage 0.0 (TID 144169, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 30 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO scheduler.DAGScheduler: Executor lost: 30 (epoch 0)
16/07/01 22:45:43 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1467389397754_0001_01_000024 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
I am pretty sure that my network settings work, because I have run this script in the same environment on a much smaller table.
Also, I am aware that somebody posted a question 6 months ago about the same issue (spark-job-error-yarnallocator-exit-status-100-diagnostics-container-released), but I still have to ask because nobody answered it.

It looks like other people have the same issue as well, so I am posting an answer instead of writing a comment. I am not sure that this will solve the issue, but it should give you an idea.
If you use spot instances, you should know that a spot instance is shut down when the spot price rises above your bid, and then you will hit this issue. This happens even if you only use spot instances as slave nodes. So my solution is to not use any spot instances for long-running jobs.
Another idea is to slice the job into many independent steps and save the result of each step as a file on S3. If an error occurs, just restart from the failed step using the saved files.
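The "independent steps with saved results" idea can be sketched roughly as follows. This is only an illustration in plain Python: the local directory stands in for an S3 prefix, JSON stands in for whatever file format the job really writes, and run_steps and the step names are hypothetical names made up for this example.

```python
import json
import os


def run_steps(steps, checkpoint_dir):
    """Run (name, func) steps in order, reusing outputs saved by earlier runs.

    Each func receives the previous step's result and must return a
    JSON-serializable value. Returns the final result and the list of
    step names that actually had to be executed in this run.
    """
    result = None
    executed = []
    for name, func in steps:
        path = os.path.join(checkpoint_dir, name + ".json")
        if os.path.exists(path):
            # A previous run already finished this step: reload its output.
            with open(path) as f:
                result = json.load(f)
            continue
        result = func(result)
        # Write under a temporary name first, then rename, so that a crash
        # never leaves a half-written checkpoint behind.
        tmp_path = path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(result, f)
        os.replace(tmp_path, path)
        executed.append(name)
    return result, executed
```

If a node is lost during the third step, rerunning the same call only recomputes that step, because the first two outputs are already on disk.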

Are you using dynamic allocation? I had a similar issue and fixed it by switching to static allocation, calculating the executor memory, executor cores, and number of executors myself.
Try static allocation for huge workloads in Spark.
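For what it's worth, here is a rough sketch of how such a calculation could look. The numbers are assumptions based on a common rule of thumb (5 cores per executor, one core and 1 GB per node reserved for the OS and daemons, one executor slot left for the application master, about 10% memory overhead per executor), not something taken from this question, and static_allocation is a made-up helper name. Only the Spark property names are real.

```python
import math


def static_allocation(nodes, cores_per_node, mem_gb_per_node,
                      cores_per_executor=5, overhead_fraction=0.10):
    """Rule-of-thumb sizing for a fixed number of executors on YARN."""
    # Assumption: keep one core and 1 GB per node for the OS and daemons.
    usable_cores = cores_per_node - 1
    usable_mem_gb = mem_gb_per_node - 1
    executors_per_node = usable_cores // cores_per_executor
    if executors_per_node < 1:
        raise ValueError("node too small for the requested cores per executor")
    # Assumption: leave one executor slot for the YARN application master.
    num_executors = nodes * executors_per_node - 1
    # A container holds the heap plus off-heap overhead, so the heap has to
    # be smaller than the container's share of the node memory.
    container_gb = usable_mem_gb / executors_per_node
    executor_mem_gb = math.floor(container_gb / (1 + overhead_fraction))
    return {
        "spark.dynamicAllocation.enabled": "false",
        "spark.executor.instances": str(num_executors),
        "spark.executor.cores": str(cores_per_executor),
        "spark.executor.memory": f"{executor_mem_gb}g",
    }


# Hypothetical cluster: 10 worker nodes with 16 cores and 64 GB each.
print(static_allocation(nodes=10, cores_per_node=16, mem_gb_per_node=64))
```

The resulting values would then be passed as --conf options to spark-submit. Treat them as a starting point and adjust after looking at the actual memory usage of the job.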

This means your YARN container is down. To debug what happened, you must read the YARN logs: use the official CLI, yarn logs -applicationId <application ID>, or feel free to use and contribute to my project https://github.com/ebuildy/yoga, a YARN viewer web app.
You should see a lot of worker errors.
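If you only have the driver log, the application ID can be derived from the container IDs in it, since a container ID embeds the cluster timestamp and application number. A small sketch in Python, assuming the log format shown in the question; the helper names are made up for this example:

```python
import re

# Matches e.g. "container_1467389397754_0001_01_000006 on host: <host>. Exit status: -100"
# The optional "e<N>_" part covers container IDs that carry an epoch prefix.
CONTAINER_RE = re.compile(
    r"(container_(?:e\d+_)?(\d+)_(\d+)_\d+_\d+) on host: (\S+)\. Exit status: (-?\d+)"
)


def summarize_lost_containers(log_text):
    """Return (sorted application IDs, number of distinct failed containers per host)."""
    app_ids = set()
    containers_by_host = {}
    for match in CONTAINER_RE.finditer(log_text):
        container_id, cluster_ts, app_seq, host, _status = match.groups()
        app_ids.add(f"application_{cluster_ts}_{app_seq}")
        containers_by_host.setdefault(host, set()).add(container_id)
    counts = {host: len(ids) for host, ids in containers_by_host.items()}
    return sorted(app_ids), counts


def yarn_logs_command(app_id):
    """Build the argument list for the YARN CLI call mentioned above."""
    return ["yarn", "logs", "-applicationId", app_id]
```

Grouping by host also shows quickly whether all failed containers sat on the same few nodes, which would point to lost nodes rather than to a problem in the job itself.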

I was hitting the same problem. I found some clues in this article on DZone:
https://dzone.com/articles/some-lessons-of-spark-and-memory-issues-on-emr
In the article, the problem was solved by increasing the number of DataFrame partitions (in that case, from 1,024 to 2,048), which reduced the memory needed per partition.
So I tried increasing the number of DataFrame partitions, and that solved my issue.
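To make the effect concrete, here is the arithmetic for a 1 TiB input like the one in the question, assuming (optimistically) that the data is spread evenly across partitions. In Spark the change itself is usually a call such as df.repartition(2048); the snippet below is plain Python and only does the math.

```python
TIB = 1024 ** 4
MIB = 1024 ** 2


def mib_per_partition(total_bytes, num_partitions):
    """Average partition size in MiB, assuming evenly distributed data."""
    return total_bytes / num_partitions / MIB


for partitions in (1024, 2048):
    print(partitions, "partitions:", mib_per_partition(TIB, partitions), "MiB each")
```

Doubling the partition count halves the average amount of data each task has to hold at once. With skewed data the largest partition can still be much bigger than the average.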

AWS has published this as an FAQ.
For EMR:
https://aws.amazon.com/premiumsupport/knowledge-center/emr-exit-status-100-lost-node/
For Glue job:
https://aws.amazon.com/premiumsupport/knowledge-center/container-released-lost-node-100-glue/

Amazon has published its solution, which is handled through resource allocation; there is no way to handle it from the user's side.

In my case, we were using a GCP Dataproc cluster with 2 preemptible (the default) secondary workers.
This isn't a problem for short-running jobs, since both primary and secondary workers finish quite quickly.
However, for long-running jobs, we observed that all primary workers finished their assigned tasks quite quickly relative to the secondary workers.
Because of their preemptible nature, the containers for the tasks assigned to secondary workers were lost after running for 3 hours, which resulted in the "Container released on a lost node" error.
I would recommend not using secondary workers for any long-running jobs.

Related

Why do I get an executor refused connection?

I see something strange when executing my code.
When I execute the following line:
sourceList = joinLabelrdd_df.select("x").collect()
I get the following exception, even though I have enough memory and CPUs.
19/07/14 11:22:34 ERROR TaskSchedulerImpl: Lost executor 5 on 172.16.140.68: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
19/07/14 11:22:34 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20190714111835-0000/5 is now EXITED (Command exited with code 137)
This error causes another exception:
19/07/14 11:22:41 WARN TaskSetManager: Lost task 113.1 in stage 9.0 (TID 2154, 172.16.140.113, executor 9): FetchFailed(null, shuffleId=0, mapId=-1, reduceId=113, message=
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0
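A note on the exit code in the log above: by the usual shell convention, an exit code above 128 means the process was terminated by signal number (code - 128), so 137 corresponds to signal 9 (SIGKILL). A SIGKILL on an executor often comes from the kernel OOM killer or from a memory limit being enforced, although that is not the only possible sender. A tiny Python sketch of the decoding, with the signal table limited to a few standard Linux numbers and describe_exit_code being a made-up helper name:

```python
# Standard Linux signal numbers for a few common signals.
SIGNAL_NAMES = {6: "SIGABRT", 9: "SIGKILL", 11: "SIGSEGV", 15: "SIGTERM"}


def describe_exit_code(code):
    """Explain a process exit code using the 128 + signal number convention."""
    if code == 0:
        return "exited normally"
    if code > 128:
        signum = code - 128
        name = SIGNAL_NAMES.get(signum, f"signal {signum}")
        return f"killed by {name}"
    return f"exited with error status {code}"


for code in (1, 134, 137):
    print(code, "->", describe_exit_code(code))
```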

Lost executor Spark

I have a long-running job on Spark which, after running for hours, failed with the following errors.
18/10/09 03:22:15 ERROR YarnScheduler: Lost executor 547 on ip: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
18/10/09 03:22:15 WARN TaskSetManager: Lost task 750.0 in stage 19.0 (TID 1565492, ip, executor 547): ExecutorLostFailure (executor 547 exited caused by one of the running tasks) Reason: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
18/10/09 03:22:15 WARN TaskSetManager: Lost task 752.0 in stage 19.0 (TID 1565494, ip, executor 547): ExecutorLostFailure (executor 547 exited caused by one of the running tasks) Reason: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
18/10/09 03:22:15 WARN TaskSetManager: Lost task 751.0 in stage 19.0 (TID 1565493, ip, executor 547): ExecutorLostFailure (executor 547 exited caused by one of the running tasks) Reason: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
18/10/09 03:22:15 WARN TaskSetManager: Lost task 754.0 in stage 19.0 (TID 1565496, ip, executor 547): ExecutorLostFailure (executor 547 exited caused by one of the running tasks) Reason: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
18/10/09 03:22:15 WARN TaskSetManager: Lost task 753.0 in stage 19.0 (TID 1565495, ip, executor 547): ExecutorLostFailure (executor 547 exited caused by one of the running tasks) Reason: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
18/10/09 03:22:15 ERROR YarnScheduler: Lost executor 572 on ip: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
18/10/09 03:22:15 INFO DAGScheduler: Executor lost: 547 (epoch 45)
18/10/09 03:22:15 WARN TaskSetManager: Lost task 756.0 in stage 19.0 (TID 1565498, ip, executor 572): ExecutorLostFailure (executor 572 exited caused by one of the running tasks) Reason: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
...
The strange thing is that I can't even find the lost executors in the executor list for the log.
It would be great if someone could help fix the problem.

There are many possible causes for this, but in summary:
Your master node is unable to reply to a specific executor and therefore gives the error
Unable to register with external shuffle server due to
Why your master node is unable to reply can have different reasons. It depends on how your code is structured and, if you are using EMR, on the size of your instances.
To solve it:
Increase the size of your master node. For example, if you are using i3.4xlarge, use i3.8xlarge or even i3.16xlarge instead.
Increase the network timeout from 2 minutes to 5 minutes. This is done with the following Spark configuration (the value is in seconds): spark.network.timeout=300
Increase both the memory and the number of cores of your master node. To increase the number of cores, set the following configuration: spark.yarn.am.cores=3
Hope this solves the issue.
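For reference, the two settings above could be passed on the command line roughly like this. The sketch just assembles the spark-submit argument list in Python; my_job.py is a placeholder for your own application, and spark_submit_args is a made-up helper name. Note that spark.yarn.am.cores applies to client mode; in cluster mode the corresponding setting is spark.driver.cores.

```python
def spark_submit_args(app, conf):
    """Build a spark-submit argument list with one --conf per setting."""
    args = ["spark-submit"]
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    args.append(app)
    return args


conf = {
    "spark.network.timeout": "300s",  # 5 minutes instead of the default 120s
    "spark.yarn.am.cores": "3",
}
print(" ".join(spark_submit_args("my_job.py", conf)))
```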

Cassandra process grows in ram usage and is never freed

Using the latest DSE 6.0.1.
All 7 nodes have 64 GB of RAM and are configured to use a max of 32 GB. When we run top, the Cassandra process takes roughly 61% of RAM.
We start the pyspark shell and run a simple count query against Cassandra tables.
After a few seconds we get a lot of:
WARN 2018-07-19 10:52:47,983 org.apache.spark.scheduler.TaskSetManager: Lost task 2.0 in stage 0.0 (TID 1, 10.5.7.56, executor 3): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Command exited with code 1
WARN 2018-07-19 10:52:47,984 org.apache.spark.scheduler.TaskSetManager: Lost task 69.0 in stage 0.0 (TID 78, 10.5.7.56, executor 3): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Command exited with code 1
WARN 2018-07-19 10:52:47,984 org.apache.spark.scheduler.TaskSetManager: Lost task 82.0 in stage 0.0 (TID 99, 10.5.7.56, executor 3): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Command exited with code 1
WARN 2018-07-19 10:52:47,984 org.apache.spark.scheduler.TaskSetManager: Lost task 36.0 in stage 0.0 (TID 36, 10.5.7.56, executor 3): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Command exited with code 1
WARN 2018-07-19 10:52:47,984 org.apache.spark.scheduler.TaskSetManager: Lost task 59.0 in stage 0.0 (TID 57, 10.5.7.56, executor 3): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Command exited with code 1
WARN 2018-07-19 10:52:47,985 org.apache.spark.scheduler.TaskSetManager: Lost task 22.0 in stage 0.0 (TID 15, 10.5.7.56, executor 3): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Command exited with code 1
[Stage 0:===================================================>(1406 + 19) / 1425]
ERROR 2018-07-19 10:53:18,165 org.apache.spark.scheduler.TaskSchedulerImpl: Lost executor 7 on 10.5.7.56: Command exited with code 137
WARN 2018-07-19 10:53:18,166 org.apache.spark.scheduler.TaskSetManager: Lost task 908.0 in stage 0.0 (TID 932, 10.5.7.56, executor 7): ExecutorLostFailure (executor 7 exited caused by one of the running tasks) Reason: Command exited with code 137
WARN 2018-07-19 10:53:18,166 org.apache.spark.scheduler.TaskSetManager: Lost task 1377.0 in stage 0.0 (TID 1240, 10.5.7.56, executor 7): ExecutorLostFailure (executor 7 exited caused by one of the running tasks) Reason: Command exited with code 137
WARN 2018-07-19 10:53:18,167 org.apache.spark.scheduler.TaskSetManager: Lost task 1378.0 in stage 0.0 (TID 1243, 10.5.7.56, executor 7): ExecutorLostFailure (executor 7 exited caused by one of the running tasks) Reason: Command exited with code 137
WARN 2018-07-19 10:53:18,167 org.apache.spark.scheduler.TaskSetManager: Lost task 940.0 in stage 0.0 (TID 965, 10.5.7.56, executor 7): ExecutorLostFailure (executor 7 exited caused by one of the running tasks) Reason: Command exited with code 137
WARN 2018-07-19 10:53:18,167 org.apache.spark.scheduler.TaskSetManager: Lost task 864.0 in stage 0.0 (TID 911, 10.5.7.56, executor 7): ExecutorLostFailure (executor 7 exited caused by one of the running tasks) Reason: Command exited with code 137
At that point the pyspark job fails, and the Cassandra processes on all nodes go up in RAM consumption to about 85% and NEVER go back down, even after exiting the pyspark shell.
The only way to free up the RAM is to restart each process.
UPDATE:
nodetool status output:
Datacenter: Cassandra
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load      Tokens  Owns  Host ID                               Rack
UN  10.5.7.53  2.1 TiB   32      ?     8c441d54-c5ae-42c0-819a-6d6b597c9903  rack1
UN  10.5.7.52  1.79 TiB  32      ?     efdc8d20-72a2-4b99-a88c-4a8c5ed6be93  rack1
UN  10.5.7.55  1.5 TiB   32      ?     3d72fb6b-a45f-46af-abdd-b8b7b13873cf  rack1
UN  10.5.7.54  2.02 TiB  32      ?     886849ff-90ac-4ed0-940e-92d92a54cc49  rack1
UN  10.5.7.51  1.99 TiB  32      ?     7834b960-f074-4139-82de-9444320130d1  rack1
UN  10.5.7.57  1.72 TiB  32      ?     bc195dbd-4449-423b-8527-5cf9669c1bd3  rack1
UN  10.5.7.56  1.89 TiB  32      ?     51e2ad43-fb44-4729-a55f-437ce5b1f505  rack1

spark worker with 32GB or more memory encountered a fatal error

I have three slaves in a standalone Spark cluster. Each slave has 48 GB of RAM. When I assigned more than 31 GB (i.e. 32 GB or more) of RAM to my executors:
.config("spark.executor.memory", "44g")
the executors were terminated without much information during a join of two large DataFrames. The output at the driver showed "missing an output location for shuffle":
17/09/21 12:34:18 INFO StandaloneSchedulerBackend: Granted executor ID app-20170921123240-0000/3 on hostPort XXX.XXX.XXX.92:33705 with 6 cores, 44.0 GB RAM
17/09/21 12:34:18 WARN TaskSetManager: Lost task 14.0 in stage 7.0 (TID 124, XXX.XXX.XXX.92, executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/09/21 12:34:18 WARN TaskSetManager: Lost task 5.0 in stage 7.0 (TID 115, XXX.XXX.XXX.92, executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/09/21 12:34:18 WARN TaskSetManager: Lost task 17.0 in stage 7.0 (TID 127, XXX.XXX.XXX.92, executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/09/21 12:34:18 WARN TaskSetManager: Lost task 8.0 in stage 7.0 (TID 118, XXX.XXX.XXX.92, executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/09/21 12:34:18 WARN TaskSetManager: Lost task 2.0 in stage 7.0 (TID 112, XXX.XXX.XXX.92, executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/09/21 12:34:18 WARN TaskSetManager: Lost task 11.0 in stage 7.0 (TID 121, XXX.XXX.XXX.92, executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/09/21 12:34:18 INFO DAGScheduler: Executor lost: 0 (epoch 5)
17/09/21 12:34:18 INFO BlockManagerMaster: Removal of executor 0 requested
17/09/21 12:34:18 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 0
17/09/21 12:34:18 INFO BlockManagerMasterEndpoint: Trying to remove executor 0 from BlockManagerMaster.
17/09/21 12:34:18 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_10_2 !
17/09/21 12:34:18 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_10_11 !
17/09/21 12:34:18 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20170921123240-0000/3 is now RUNNING
17/09/21 12:34:18 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_10_5 !
17/09/21 12:34:18 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_10_8 !
17/09/21 12:34:18 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(0, XXX.XXX.XXX, 34840, None)
17/09/21 12:34:18 INFO BlockManagerMasterEndpoint: Trying to remove executor 0 from BlockManagerMaster.
17/09/21 12:34:18 INFO BlockManagerMaster: Removed 0 successfully in removeExecutor
The log message of Spark Master showed that the executors were "EXITED" and then relaunched:
17/09/21 12:34:18 INFO Master: Removing executor app-20170921123240-0000/0 because it is EXITED
17/09/21 12:34:18 INFO Master: Launching executor app-20170921123240-0000/3 on worker worker-20170921123014-152.83.247.92-33705
The log message of the Spark Worker showed that the executor exited with code 134:
17/09/21 12:34:18 INFO Worker: Executor app-20170921123240-0000/0 finished with state EXITED message Command exited with code 134 exitStatus 134
The only clue seems to be in the error log of the application, showing a fatal error has been detected by the JRE:
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007fdec0c92a73, pid=11300, tid=0x00007fd3a6951700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_131-b11) (build 1.8.0_131-b11)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.131-b11 mixed mode linux-amd64 )
# Problematic frame:
# V [libjvm.so+0x3ffa73] CardTableExtension::scavenge_contents_parallel(ObjectStartArray*, MutableSpace*, HeapWord*, PSPromotionManager*, unsigned int, unsigned int)+0x5e3
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# If you would like to submit a bug report, please visit:
# http://bugreport.java.com/bugreport/crash.jsp
#
--------------- T H R E A D ---------------
Current thread (0x0000000001c9e800): GCTaskThread [stack: 0x00007fd3a6851000,0x00007fd3a6952000] [id=11308]
siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x0000000000000008
As long as I assign 31 GB of RAM (or less) to each executor, my program works just fine. Has anyone encountered such a problem before?

44 GB may actually give you a smaller usable heap than 31 GB because of how Java stores object references: for heap sizes of 32 GB or more, the JVM can no longer use compressed (32-bit) object references and has to switch to 64-bit references, which means all objects take up more space. More details here: http://java-performance.info/over-32g-heap-java/
My rule of thumb is to either stay below 32 GB or go much higher (say, 50 GB). Usually it is more cost-efficient to use multiple JVMs, each with less than 32 GB of heap. With 48 GB of RAM I would stick to a 31 GB heap.
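Written down as code, the rule of thumb looks something like this. The 32 GB and 50 GB thresholds are simply the numbers from this answer, not exact JVM limits, and recommend_heap_gb is a made-up helper name.

```python
def recommend_heap_gb(requested_gb, compressed_limit_gb=32, worthwhile_gb=50):
    """Apply the "stay below 32 GB or go much higher" rule of thumb."""
    if requested_gb < compressed_limit_gb:
        # Compressed references still apply, so keep the requested size.
        return requested_gb
    if requested_gb < worthwhile_gb:
        # Above the limit but not by enough to pay for the wider references.
        return compressed_limit_gb - 1
    return requested_gb


for heap in (31, 44, 64):
    print(heap, "GB requested ->", recommend_heap_gb(heap), "GB recommended")
```

For the 44 GB setting from the question this gives 31 GB, which matches the value that was observed to work.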

Spark doesn't seem to be fault tolerant to workers dying

I've got a test Spark cluster going on AWS (1 master + 5 worker machines, all running Spark 2.1.0 with Scala 2.11.8 on m4.2xlarge instances), and I'm running the ALS demo code in a Spark shell to test out performance.
I noticed that when I terminate some of the worker machines (all the while keeping the master up), the workload redistributes to the remaining workers, but when I kill all the workers, the job usually dies completely instead of patiently waiting for more workers to come along. Is this normal behavior?
My shell session is below. The first few lines are the ALS app, and the rest are the error messages. You'll notice that the first time I kill all workers (executor IDs: 0, 1, 2, 3, 4) the shell waits for more workers to come online, like it's supposed to. Once I do bring up more workers (IDs: 10, 11, 12, 13, 14), the application continues on its way. But when I terminate those new workers as well, the entire job aborts with SparkException: Job aborted due to stage failure.
Is this normal behavior? If not, what am I doing wrong? If so, how can I improve Spark's tolerance to (possibly all) workers dying? Any insight into this would be appreciated.
Spark context Web UI available at http://xxx.xxx.xxx.133:4040
Spark context available as 'sc' (master = spark://xxx.xxx.xxx.133:7077, app id = app-20170222012148-0005).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/
Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_92)
Type in expressions to have them evaluated.
Type :help for more information.
scala> :paste
// Entering paste mode (ctrl-D to finish)
import org.apache.spark.mllib.recommendation._
val data = sc.textFile("s3n://my.bucket/training-set.tsv")
val ratings = data.map(_.split('\t') match { case Array(user, item, rate) =>
  Rating(user.toInt, item.toInt, rate.toDouble)
})
val rank = 10
val numIterations = 10
val model = ALS.train(ratings, rank, numIterations, 0.01)
// Exiting paste mode, now interpreting.
[Stage 0:===========> (5 + 19) / 24]
17/02/22 01:23:32 ERROR TaskSchedulerImpl: Lost executor 1 on xxx.xxx.xxx.174: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TaskSetManager: Lost task 11.0 in stage 0.0 (TID 11, xxx.xxx.xxx.174, executor 1): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TaskSetManager: Lost task 16.0 in stage 0.0 (TID 16, xxx.xxx.xxx.174, executor 1): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1, xxx.xxx.xxx.174, executor 1): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TaskSetManager: Lost task 6.0 in stage 0.0 (TID 6, xxx.xxx.xxx.174, executor 1): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TransportChannelHandler: Exception in connection from /xxx.xxx.xxx.118:60180
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
    at sun.nio.ch.IOUtil.read(IOUtil.java:192)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
    at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:221)
    at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:899)
    at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:275)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:652)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:575)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:489)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:451)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
    at java.lang.Thread.run(Thread.java:745)
17/02/22 01:23:32 ERROR TaskSchedulerImpl: Lost executor 0 on xxx.xxx.xxx.118: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TaskSetManager: Lost task 16.1 in stage 0.0 (TID 26, xxx.xxx.xxx.118, executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TaskSetManager: Lost task 4.0 in stage 0.0 (TID 4, xxx.xxx.xxx.118, executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TaskSetManager: Lost task 9.0 in stage 0.0 (TID 9, xxx.xxx.xxx.118, executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 ERROR TaskSchedulerImpl: Lost executor 2 on xxx.xxx.xxx.253: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TaskSetManager: Lost task 23.0 in stage 0.0 (TID 23, xxx.xxx.xxx.253, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TaskSetManager: Lost task 8.0 in stage 0.0 (TID 8, xxx.xxx.xxx.253, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TaskSetManager: Lost task 13.0 in stage 0.0 (TID 13, xxx.xxx.xxx.253, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TaskSetManager: Lost task 1.1 in stage 0.0 (TID 25, xxx.xxx.xxx.253, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TaskSetManager: Lost task 18.0 in stage 0.0 (TID 18, xxx.xxx.xxx.253, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TaskSetManager: Lost task 3.0 in stage 0.0 (TID 3, xxx.xxx.xxx.253, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TaskSetManager: Lost task 4.1 in stage 0.0 (TID 30, xxx.xxx.xxx.253, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TaskSetManager: Lost task 14.1 in stage 0.0 (TID 33, xxx.xxx.xxx.253, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 ERROR TaskSchedulerImpl: Lost executor 4 on xxx.xxx.xxx.200: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TaskSetManager: Lost task 19.1 in stage 0.0 (TID 32, xxx.xxx.xxx.200, executor 4): ExecutorLostFailure (executor 4 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TaskSetManager: Lost task 9.1 in stage 0.0 (TID 29, xxx.xxx.xxx.200, executor 4): ExecutorLostFailure (executor 4 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TaskSetManager: Lost task 20.0 in stage 0.0 (TID 20, xxx.xxx.xxx.200, executor 4): ExecutorLostFailure (executor 4 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TaskSetManager: Lost task 5.0 in stage 0.0 (TID 5, xxx.xxx.xxx.200, executor 4): ExecutorLostFailure (executor 4 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TaskSetManager: Lost task 10.0 in stage 0.0 (TID 10, xxx.xxx.xxx.200, executor 4): ExecutorLostFailure (executor 4 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TaskSetManager: Lost task 21.1 in stage 0.0 (TID 28, xxx.xxx.xxx.200, executor 4): ExecutorLostFailure (executor 4 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TaskSetManager: Lost task 15.0 in stage 0.0 (TID 15, xxx.xxx.xxx.200, executor 4): ExecutorLostFailure (executor 4 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:32 WARN TaskSetManager: Lost task 6.1 in stage 0.0 (TID 24, xxx.xxx.xxx.200, executor 4): ExecutorLostFailure (executor 4 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:33 ERROR TaskSchedulerImpl: Lost executor 3 on xxx.xxx.xxx.136: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:33 WARN TaskSetManager: Lost task 4.2 in stage 0.0 (TID 35, xxx.xxx.xxx.136, executor 3): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:33 WARN TaskSetManager: Lost task 17.0 in stage 0.0 (TID 17, xxx.xxx.xxx.136, executor 3): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:33 WARN TaskSetManager: Lost task 2.0 in stage 0.0 (TID 2, xxx.xxx.xxx.136, executor 3): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:33 WARN TaskSetManager: Lost task 16.2 in stage 0.0 (TID 31, xxx.xxx.xxx.136, executor 3): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:33 WARN TaskSetManager: Lost task 7.0 in stage 0.0 (TID 7, xxx.xxx.xxx.136, executor 3): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:33 WARN TaskSetManager: Lost task 14.2 in stage 0.0 (TID 34, xxx.xxx.xxx.136, executor 3): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:33 WARN TaskSetManager: Lost task 11.1 in stage 0.0 (TID 27, xxx.xxx.xxx.136, executor 3): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:23:33 WARN TaskSetManager: Lost task 12.0 in stage 0.0 (TID 12, xxx.xxx.xxx.136, executor 3): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
[Stage 0:==============================> (13 + 11) / 24]
17/02/22 01:26:29 ERROR TaskSchedulerImpl: Lost executor 13 on xxx.xxx.xxx.136: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:29 ERROR TaskSchedulerImpl: Lost executor 14 on xxx.xxx.xxx.200: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:29 ERROR TaskSchedulerImpl: Lost executor 11 on xxx.xxx.xxx.118: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:29 WARN TaskSetManager: Lost task 20.1 in stage 0.0 (TID 50, xxx.xxx.xxx.118, executor 11): ExecutorLostFailure (executor 11 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:29 WARN TaskSetManager: Lost task 5.1 in stage 0.0 (TID 49, xxx.xxx.xxx.118, executor 11): ExecutorLostFailure (executor 11 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:29 WARN TaskSetManager: Lost task 6.2 in stage 0.0 (TID 45, xxx.xxx.xxx.118, executor 11): ExecutorLostFailure (executor 11 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:29 WARN TaskSetManager: Lost task 10.1 in stage 0.0 (TID 48, xxx.xxx.xxx.118, executor 11): ExecutorLostFailure (executor 11 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:29 WARN TaskSetManager: Lost task 9.2 in stage 0.0 (TID 51, xxx.xxx.xxx.118, executor 11): ExecutorLostFailure (executor 11 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
[Stage 0:==============================> (13 + 8) / 24]
17/02/22 01:26:30 ERROR TaskSchedulerImpl: Lost executor 10 on xxx.xxx.xxx.174: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:30 WARN TaskSetManager: Lost task 2.1 in stage 0.0 (TID 41, xxx.xxx.xxx.174, executor 10): ExecutorLostFailure (executor 10 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:30 WARN TaskSetManager: Lost task 14.3 in stage 0.0 (TID 38, xxx.xxx.xxx.174, executor 10): ExecutorLostFailure (executor 10 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:30 WARN TaskSetManager: Lost task 16.3 in stage 0.0 (TID 40, xxx.xxx.xxx.174, executor 10): ExecutorLostFailure (executor 10 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:30 ERROR TaskSetManager: Task 16 in stage 0.0 failed 4 times; aborting job
17/02/22 01:26:30 WARN TaskSetManager: Lost task 4.3 in stage 0.0 (TID 43, xxx.xxx.xxx.174, executor 10): ExecutorLostFailure (executor 10 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:30 WARN TaskSetManager: Lost task 11.2 in stage 0.0 (TID 37, xxx.xxx.xxx.174, executor 10): ExecutorLostFailure (executor 10 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:30 WARN TaskSetManager: Lost task 9.3 in stage 0.0 (TID 60, xxx.xxx.xxx.174, executor 10): ExecutorLostFailure (executor 10 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:30 WARN TaskSetManager: Lost task 12.1 in stage 0.0 (TID 36, xxx.xxx.xxx.174, executor 10): ExecutorLostFailure (executor 10 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:30 WARN TaskSetManager: Lost task 7.1 in stage 0.0 (TID 39, xxx.xxx.xxx.174, executor 10): ExecutorLostFailure (executor 10 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:30 ERROR TaskSchedulerImpl: Lost executor 12 on xxx.xxx.xxx.253: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:30 WARN TaskSetManager: Lost task 6.3 in stage 0.0 (TID 62, xxx.xxx.xxx.253, executor 12): ExecutorLostFailure (executor 12 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:30 WARN TaskSetManager: Lost task 1.2 in stage 0.0 (TID 56, xxx.xxx.xxx.253, executor 12): ExecutorLostFailure (executor 12 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:30 WARN TaskSetManager: Lost task 20.2 in stage 0.0 (TID 64, xxx.xxx.xxx.253, executor 12): ExecutorLostFailure (executor 12 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:30 WARN TaskSetManager: Lost task 8.1 in stage 0.0 (TID 58, xxx.xxx.xxx.253, executor 12): ExecutorLostFailure (executor 12 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:30 WARN TaskSetManager: Lost task 10.2 in stage 0.0 (TID 61, xxx.xxx.xxx.253, executor 12): ExecutorLostFailure (executor 12 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:30 WARN TaskSetManager: Lost task 5.2 in stage 0.0 (TID 63, xxx.xxx.xxx.253, executor 12): ExecutorLostFailure (executor 12 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:30 WARN TaskSetManager: Lost task 3.1 in stage 0.0 (TID 54, xxx.xxx.xxx.253, executor 12): ExecutorLostFailure (executor 12 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/02/22 01:26:30 WARN TaskSetManager: Lost task 13.1 in stage 0.0 (TID 57, xxx.xxx.xxx.253, executor 12): ExecutorLostFailure (executor 12 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 16 in stage 0.0 failed 4 times, most recent failure: Lost task 16.3 in stage 0.0 (TID 40, xxx.xxx.xxx.174, executor 10): ExecutorLostFailure (executor 10 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1958)
at org.apache.spark.rdd.RDD.count(RDD.scala:1157)
at org.apache.spark.ml.recommendation.ALS$.train(ALS.scala:694)
at org.apache.spark.mllib.recommendation.ALS.run(ALS.scala:253)
at org.apache.spark.mllib.recommendation.ALS$.train(ALS.scala:340)
at org.apache.spark.mllib.recommendation.ALS$.train(ALS.scala:357)
... 53 elided
scala>


Are you using dynamic allocation? I had a similar issue and fixed it by switching to static allocation: work out the executor memory, the executor cores, and the number of executors yourself. Try static allocation for huge workloads in Spark.
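The calculation mentioned above can be sketched as follows. This is a rough rule of thumb, not an official formula: the function name, the 5 cores per executor, the 1 core and 1 GB reserved per node, and the 10% / 384 MB memory overhead are all assumptions that should be adjusted for the actual cluster and Spark version.

```python
import math


def size_executors(nodes, cores_per_node, mem_gb_per_node,
                   cores_per_executor=5, overhead_fraction=0.10):
    """Rule-of-thumb static sizing; every default here is an assumption."""
    # Reserve 1 core and 1 GB per node for the OS and Hadoop daemons.
    usable_cores = cores_per_node - 1
    usable_mem_gb = mem_gb_per_node - 1

    executors_per_node = usable_cores // cores_per_executor
    if executors_per_node < 1:
        raise ValueError("node too small for the requested cores per executor")

    # Leave one executor slot free for the YARN ApplicationMaster.
    num_executors = nodes * executors_per_node - 1

    # A YARN container holds heap + overhead, so shrink the heap to fit.
    container_gb = usable_mem_gb / executors_per_node
    min_overhead_gb = 384 / 1024  # assumed minimum overhead of 384 MB
    heap_gb = math.floor(min(container_gb / (1 + overhead_fraction),
                             container_gb - min_overhead_gb))
    if heap_gb < 1:
        raise ValueError("not enough memory per executor")

    return {
        "spark.dynamicAllocation.enabled": "false",
        "spark.executor.instances": str(num_executors),
        "spark.executor.cores": str(cores_per_executor),
        "spark.executor.memory": f"{heap_gb}g",
    }


# Hypothetical cluster: 6 worker nodes, 16 cores and 64 GB each.
for key, value in size_executors(6, 16, 64).items():
    print(f"--conf {key}={value}")
```

Each printed line can be passed to spark-submit or spark-shell as a --conf option. The hardware numbers in the example are made up for illustration.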

This means your YARN container is down. To debug what happened, you must read the YARN logs: use the official CLI (yarn logs -applicationId, followed by your application ID), or feel free to use and contribute to my project https://github.com/ebuildy/yoga, a YARN viewer web app. You should see a lot of Worker errors.
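Before reading the full YARN logs, it can help to see which hosts keep losing executors. The sketch below is only an illustration: it assumes driver output in the "Lost executor N on host: reason" format shown in the logs on this page, and the function name and regular expression are hypothetical helpers, not part of Spark or YARN.

```python
import re
from collections import Counter

# Matches driver lines such as:
#   ... ERROR TaskSchedulerImpl: Lost executor 0 on host: Remote RPC client ...
LOST_EXECUTOR = re.compile(r"Lost executor (\d+) on ([^:\s]+): (.*)")


def summarize_lost_executors(log_text):
    """Count lost executors per host and per reason in driver log text."""
    by_host, by_reason = Counter(), Counter()
    for line in log_text.splitlines():
        match = LOST_EXECUTOR.search(line)
        if match:
            _executor_id, host, reason = match.groups()
            by_host[host] += 1
            # Keep only the first sentence of the reason as the key.
            by_reason[reason.split(". ")[0]] += 1
    return by_host, by_reason


if __name__ == "__main__":
    import sys
    hosts, reasons = summarize_lost_executors(sys.stdin.read())
    for host, count in hosts.most_common():
        print(f"{host}\t{count}")
    for reason, count in reasons.most_common():
        print(f"{reason}\t{count}")
```

Pipe the saved driver output into the script. If one host dominates the counts, start with the NodeManager and container logs for that host.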

AWS has published this as an FAQ. For EMR: https://aws.amazon.com/premiumsupport/knowledge-center/emr-exit-status-100-lost-node/ For Glue jobs: https://aws.amazon.com/premiumsupport/knowledge-center/container-released-lost-node-100-glue/

Amazon has published its solution, which works through resource allocation; there is nothing to handle from the user's side in the job itself.
