Executor lost failure: Host was preempted - apache-spark

My spark streaming application (running on HD Insights) constantly has tasks failing due to the error message below:
ExecutorLostFailure (executor 525 exited unrelated to the running tasks) Reason: Container container_1495825717937_0056_01_000916 on host: 10.0.0.14 was preempted.
Do you have any ideas what I should do here? It is not obvious to me how I should act upon this error message.

Related

Spark: failed to save dataframe to Google Cloud Storage

Final stage of the Spark job is to save 37Gb of data to GCS bucket in avro format. Spark app is run on Dataproc.
My cluster consists of: 15 workers with 4 cores and 15Gb RAM, 1 master with 4 cores and 15Gb RAM.
I use the following code:
df.write.option("forceSchema", schema_str) \
.format("avro") \
.partitionBy('platform', 'cluster') \
.save(f"gs://{output_path}")
Final statistics from executors:
In 4 attempts by Spark to run one of the failed tasks, the error codes I get are:
1/4. java.lang.StackOverflowError
2/4. Job aborted due to stage failure: Task 29 in stage 13.0 failed 4 times, most recent failure: Lost task 29.3 in stage 13.0 (TID 3048, ce-w1.internal, executor 17): ExecutorLostFailure (executor 17 exited caused by one of the running tasks) Reason: Container from a bad node: container_1607696154227_0002_01_000028 on host: ce-w1.internal. Exit status: 50. Diagnostics: [2020-12-11 15:46:19.880]Exception from container-launch.
Container id: container_1607696154227_0002_01_000028
Exit code: 50
[2020-12-11 15:46:19.881]Container exited with a non-zero exit code 50. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
readOrdinaryObject(ObjectInputStream.java:2187)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2405)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2329)
3/4. java.lang.StackOverflowError
4/4. Job aborted due to stage failure: Task 29 in stage 13.0 failed 4 times, most recent failure: Lost task 29.3 in stage 13.0 (TID 3048, ce-w1.internal, executor 17): ExecutorLostFailure (executor 17 exited caused by one of the running tasks) Reason: Container from a bad node: container_1607696154227_0002_01_000028 on host: ce-w1.internal. Exit status: 50. Diagnostics: [2020-12-11 15:46:19.880]Exception from container-launch.
Container id: container_1607696154227_0002_01_000028
Exit code: 50
[2020-12-11 15:46:19.881]Container exited with a non-zero exit code 50. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
readOrdinaryObject(ObjectInputStream.java:2187)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2405)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2329)
Spark UI gives me this:
From UI it's apparent that something is going on with data distribution, but repartitioning gives the same StackOverflow error.
So the two questions I want to ask:
how do I decode the message 'container prelaunch-error' in context of StackOverflow error?
why other actions in the job run safely, despite the same data distribution ?
the problem is not due to your cluster capacities, it is due to the fact that you are working with an avro format and you are forcing spark to write a new schema while saving try to not use the postdefined schema, It will work. If you want to chance the schema just do it before saving via withColumn for example,please to check the number of shuffle too.
df.write.format("avro") \
.partitionBy('platform', 'cluster') \
.save(f"gs://{output_path}")

Why I get executor refused connection?

I have something strange in the execution of my code:
When I execute the following line:
sourceList = joinLabelrdd_df.select("x").collect()
I get the following execption. Noting I have enough memory and cpus.
19/07/14 11:22:34 ERROR TaskSchedulerImpl: Lost executor 5 on 172.16.140.68: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
19/07/14 11:22:34 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20190714111835-0000/5 is now EXITED (Command exited with code 137)
This error cause another exception:
19/07/14 11:22:41 WARN TaskSetManager: Lost task 113.1 in stage 9.0 (TID 2154, 172.16.140.113, executor 9): FetchFailed(null, shuffleId=0, mapId=-1, reduceId=113, message=
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0

Lost executor Spark

I have a long running job on Spark, which after running for hours failed with the following errors.
18/10/09 03:22:15 ERROR YarnScheduler: Lost executor 547 on ip: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
18/10/09 03:22:15 WARN TaskSetManager: Lost task 750.0 in stage 19.0 (TID 1565492, ip, executor 547): ExecutorLostFailure (executor 547 exited caused by one of the running tasks) Reason: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
18/10/09 03:22:15 WARN TaskSetManager: Lost task 752.0 in stage 19.0 (TID 1565494, ip, executor 547): ExecutorLostFailure (executor 547 exited caused by one of the running tasks) Reason: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
18/10/09 03:22:15 WARN TaskSetManager: Lost task 751.0 in stage 19.0 (TID 1565493, ip, executor 547): ExecutorLostFailure (executor 547 exited caused by one of the running tasks) Reason: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
18/10/09 03:22:15 WARN TaskSetManager: Lost task 754.0 in stage 19.0 (TID 1565496, ip, executor 547): ExecutorLostFailure (executor 547 exited caused by one of the running tasks) Reason: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
18/10/09 03:22:15 WARN TaskSetManager: Lost task 753.0 in stage 19.0 (TID 1565495, ip, executor 547): ExecutorLostFailure (executor 547 exited caused by one of the running tasks) Reason: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
18/10/09 03:22:15 ERROR YarnScheduler: Lost executor 572 on ip: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
18/10/09 03:22:15 INFO DAGScheduler: Executor lost: 547 (epoch 45)
18/10/09 03:22:15 WARN TaskSetManager: Lost task 756.0 in stage 19.0 (TID 1565498, ip, executor 572): ExecutorLostFailure (executor 572 exited caused by one of the running tasks) Reason: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
...
The strange thing is, I can't even see the lost executors on the Executor list for the log.
It would be great if someone can help fix the problem.
There are many factors for this to happen but the summary is the following:
Your master node is unable to reply to a specific executor and therefore gives the error
Unable to register with external shuffle server due to
Why your master node is unable to reply can be of different reasons. Depends on how your code is structured, the size of your instance if you are using EMR.
To solve it
Increase your master node. For example, if you are using i3.4xlarge, instead use i3.8xlarge or even i3.16xlarge.
Increase the network timeout from 2 minutes to 5 minutes. This is done with the following spark configuration: spark.network.timeout=300
Increase both the memory and number of cores of your master node. To increase the number of cores of your master node, set the following configuration. spark.yarn.am.cores=3
Hope this solves the issue.

spark tasks fail with error, showing exit status: -100

The spark job running in yarn mode, shows few tasks failed with following reason:
ExecutorLostFailure (executor 36 exited caused by one of the running tasks) Reason: Container marked as failed: container_xxxxxxxxxx_yyyy_01_000054 on host: ip-xxx-yy-zzz-zz. Exit status: -100. Diagnostics: Container released on a *lost* node
Any idea why is this happening?
There are two main reasons.
It is may because of your memoryOverhead needed by the yarn container is not enough, and the solution is to Increase the spark.executor.memoryOverhead
Possibly, it is because the slave node disk lack space to write tmp data required by spark. check your yarn usercache dir (for EMR, it locates on /mnt/yarn/usercache/),
or type df -h to check your disk remaining space.
Container killed by the framework, either due to being released by the application or being 'lost' due to node failures etc. have a special exit code of -100.
The node failure could be because of not having enough disc space or executor memory.
I understand your cluster is not on AWS but as AWS manager the MR cluster they have released an FAQ
For Glue job: https://aws.amazon.com/premiumsupport/knowledge-center/container-released-lost-node-100-glue/
For EMR: https://aws.amazon.com/premiumsupport/knowledge-center/emr-exit-status-100-lost-node/

Spark on yarn mode end with "Exit status: -100. Diagnostics: Container released on a *lost* node"

I am trying to load a database with 1TB data to spark on AWS using the latest EMR. And the running time is so long that it doesn't finished in even 6 hours, but after running 6h30m , I get some error announcing that Container released on a lost node and then the job failed. Logs are like this:
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144178.0 in stage 0.0 (TID 144178, ip-10-0-2-176.ec2.internal): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000006 on host: ip-10-0-2-176.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144181.0 in stage 0.0 (TID 144181, ip-10-0-2-176.ec2.internal): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000006 on host: ip-10-0-2-176.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144175.0 in stage 0.0 (TID 144175, ip-10-0-2-176.ec2.internal): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000006 on host: ip-10-0-2-176.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144213.0 in stage 0.0 (TID 144213, ip-10-0-2-176.ec2.internal): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000006 on host: ip-10-0-2-176.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO scheduler.DAGScheduler: Executor lost: 5 (epoch 0)
16/07/01 22:45:43 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 5 from BlockManagerMaster.
16/07/01 22:45:43 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(5, ip-10-0-2-176.ec2.internal, 43922)
16/07/01 22:45:43 INFO storage.BlockManagerMaster: Removed 5 successfully in removeExecutor
16/07/01 22:45:43 ERROR cluster.YarnClusterScheduler: Lost executor 6 on ip-10-0-2-173.ec2.internal: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO spark.ExecutorAllocationManager: Existing executor 5 has been removed (new total is 41)
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144138.0 in stage 0.0 (TID 144138, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144185.0 in stage 0.0 (TID 144185, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144184.0 in stage 0.0 (TID 144184, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144186.0 in stage 0.0 (TID 144186, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO scheduler.DAGScheduler: Executor lost: 6 (epoch 0)
16/07/01 22:45:43 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 6 from BlockManagerMaster.
16/07/01 22:45:43 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(6, ip-10-0-2-173.ec2.internal, 43593)
16/07/01 22:45:43 INFO storage.BlockManagerMaster: Removed 6 successfully in removeExecutor
16/07/01 22:45:43 ERROR cluster.YarnClusterScheduler: Lost executor 30 on ip-10-0-2-173.ec2.internal: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144162.0 in stage 0.0 (TID 144162, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 30 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO spark.ExecutorAllocationManager: Existing executor 6 has been removed (new total is 40)
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144156.0 in stage 0.0 (TID 144156, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 30 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144170.0 in stage 0.0 (TID 144170, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 30 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144169.0 in stage 0.0 (TID 144169, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 30 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO scheduler.DAGScheduler: Executor lost: 30 (epoch 0)
16/07/01 22:45:43 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1467389397754_0001_01_000024 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
I am pretty sure that my network setting works because I have tried to run this script on the same environment on a much smaller table.
Also, I am aware that somebody posted a question 6 months ago asking for the same issue:spark-job-error-yarnallocator-exit-status-100-diagnostics-container-released but I still have to ask because nobody was answering this question.
Looks like other peoples has the same issue as well, so I just post an answer instead of writing a comment. I am not sure that this would solve the issue but this should be an idea.
If you use spot instance, you should know that spot instance will be shut down if the price is higher than your input, and you will hit this issue. Even if you are just using a spot instance as a slave. So my solution is not using any spot instance for long term running job.
Another idea is to slice the job into many independent steps, so you can save the result of each step as a file on S3. If any error happened, just start from that step by the cached files.
is it dynamic allocation of memory ? I had similar issue I fixed it by going with static allocation by calculating executor memory, executor cores and executors.
Try Static allocation for huge workloads in Spark.
This means your YARN container is down, to debug what happened, you must read YARN logs, use the official CLI yarn logs -applicationId or feel free to use and contribute to my project https://github.com/ebuildy/yoga a YARN viewer as web app.
You should see lot of Worker errors.
I was hitting the same problem. I found some clues in this article on DZone:
https://dzone.com/articles/some-lessons-of-spark-and-memory-issues-on-emr
This one was solved by increasing the number of DataFrame partitions (in this case, from 1,024 to 2,048). That reduced the needed memory per partition.
So I tried to increase the number of DataFrame partitions which solved my issue.
AWS has released this as FAQ
For EMR:
https://aws.amazon.com/premiumsupport/knowledge-center/emr-exit-status-100-lost-node/
For Glue job:
https://aws.amazon.com/premiumsupport/knowledge-center/container-released-lost-node-100-glue/
Amazon has provided their solution, which is handled through resource allocation, and there is no processing method from the perspective of users
For my case, we were using GCP Dataproc cluster with 2 Pre-Emptible (default) Secondary Workers.
This isn't a problem for short running jobs since both primary and secondary workers finished quite quickly.
However, for long running jobs, it was observed that all primary workers finished assigned tasks quite quickly relative to secondary workers.
Due to the pre-emptible nature, containers get lost after running 3 hours for the tasks that were assigned to secondary workers. Thus, resulting in Container losts error.
I would recommend to not use secondary workers for any long running jobs.

Resources