The final stage of the Spark job is to save 37 GB of data to a GCS bucket in Avro format. The Spark app runs on Dataproc.
My cluster consists of 15 workers with 4 cores and 15 GB RAM each, and 1 master with 4 cores and 15 GB RAM.
I use the following code:
df.write.option("forceSchema", schema_str) \
.format("avro") \
.partitionBy('platform', 'cluster') \
.save(f"gs://{output_path}")
Final statistics from executors:
Across the 4 attempts Spark made to run one of the failed tasks, the errors I get are:
1/4. java.lang.StackOverflowError
2/4. Job aborted due to stage failure: Task 29 in stage 13.0 failed 4 times, most recent failure: Lost task 29.3 in stage 13.0 (TID 3048, ce-w1.internal, executor 17): ExecutorLostFailure (executor 17 exited caused by one of the running tasks) Reason: Container from a bad node: container_1607696154227_0002_01_000028 on host: ce-w1.internal. Exit status: 50. Diagnostics: [2020-12-11 15:46:19.880]Exception from container-launch.
Container id: container_1607696154227_0002_01_000028
Exit code: 50
[2020-12-11 15:46:19.881]Container exited with a non-zero exit code 50. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
readOrdinaryObject(ObjectInputStream.java:2187)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2405)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2329)
3/4. java.lang.StackOverflowError
4/4. The same stage failure / ExecutorLostFailure as attempt 2 (exit status 50, container prelaunch error, identical stack trace).
Spark UI gives me this:
From the UI it's apparent that something is wrong with the data distribution, but repartitioning gives the same StackOverflowError.
So the two questions I want to ask are:
How do I interpret the 'container prelaunch error' message in the context of the StackOverflowError?
Why do the other actions in the job run fine, despite the same data distribution?
The problem is not due to your cluster capacity; it comes from the fact that you are writing Avro and forcing Spark to apply a new schema while saving. Try not to use the post-defined schema and it will work. If you want to change the schema, do it before saving, for example via withColumn; a minimal sketch follows the snippet below. Please also check the number of shuffle partitions.
df.write.format("avro") \
.partitionBy('platform', 'cluster') \
.save(f"gs://{output_path}")
Related
We have a Spark job which randomly fails.
The code is pretty simple:
load 2 DataFrames from SQL Server
join them
write the result to MySQL
Total data size is around 10.2 GB.
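For reference, a minimal sketch of that pipeline, assuming plain JDBC reads from SQL Server and a JDBC write to MySQL; the URLs, table names, join key, and credentials are placeholders:
# Sketch only: connection details, table names, and the join key are placeholders.
df_a = spark.read.format("jdbc") \
    .option("url", "jdbc:sqlserver://<host>:1433;databaseName=<db>") \
    .option("dbtable", "dbo.table_a") \
    .option("user", "<user>").option("password", "<password>") \
    .load()

df_b = spark.read.format("jdbc") \
    .option("url", "jdbc:sqlserver://<host>:1433;databaseName=<db>") \
    .option("dbtable", "dbo.table_b") \
    .option("user", "<user>").option("password", "<password>") \
    .load()

joined = df_a.join(df_b, on="id", how="inner")  # placeholder join key

joined.write.format("jdbc") \
    .option("url", "jdbc:mysql://<host>:3306/<db>") \
    .option("dbtable", "result_table") \
    .option("user", "<user>").option("password", "<password>") \
    .mode("overwrite") \
    .save()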
Configuration details:
Master: 1 r4.8xlarge
Slaves: 4 r4.4xlarge
"spark.driver.memory": "57.6G",
"spark.driver.cores": "5",
"spark.driver.memoryOverhead": "25600",
"spark.executor.memory": "54G",
"spark.executor.cores": "5",
"spark.executor.memoryOverhead": "6144",
"spark.executor.instances": "8",
"spark.default.parallelism": "112",
"spark.executor.extraJavaOptions": "-XX:+UseG1GC -XX:+PrintFlagsFinal -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThread=20",
"spark.dynamicAllocation.enabled": "false"
It's really difficult to understand this behavior; when it fails, it's always the same error:
ERROR : Job aborted due to stage failure: Task 54 in stage 12.0 failed 4 times, most recent failure: Lost task 54.3 in stage 12.0 (TID 1011, ip-10-43-67-156.hrlogix.va, executor 111): ExecutorLostFailure (executor 111 exited caused by one of the running tasks) Reason: Container marked as failed: container_1591579744421_0001_01_000124 on host: ip-10-43-67-156.hrlogix.va. Exit status: 137. Diagnostics: Container killed on request. Exit code is 137
Container exited with a non-zero exit code 137
Killed by external signal.
We have already tried playing around with increasing executor-memory, driver-memory, spark.sql.shuffle.partitions, and df.repartition, as mentioned in
https://aws.amazon.com/premiumsupport/knowledge-center/container-killed-on-request-137-emr/
Can anyone help?
Something strange happens during the execution of my code.
When I execute the following line:
sourceList = joinLabelrdd_df.select("x").collect()
I get the following exception. Note that I have enough memory and CPUs.
19/07/14 11:22:34 ERROR TaskSchedulerImpl: Lost executor 5 on 172.16.140.68: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
19/07/14 11:22:34 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20190714111835-0000/5 is now EXITED (Command exited with code 137)
This error causes another exception:
19/07/14 11:22:41 WARN TaskSetManager: Lost task 113.1 in stage 9.0 (TID 2154, 172.16.140.113, executor 9): FetchFailed(null, shuffleId=0, mapId=-1, reduceId=113, message=
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0
I have a long-running Spark job which, after running for hours, failed with the following errors.
18/10/09 03:22:15 ERROR YarnScheduler: Lost executor 547 on ip: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
18/10/09 03:22:15 WARN TaskSetManager: Lost task 750.0 in stage 19.0 (TID 1565492, ip, executor 547): ExecutorLostFailure (executor 547 exited caused by one of the running tasks) Reason: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
18/10/09 03:22:15 WARN TaskSetManager: Lost task 752.0 in stage 19.0 (TID 1565494, ip, executor 547): ExecutorLostFailure (executor 547 exited caused by one of the running tasks) Reason: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
18/10/09 03:22:15 WARN TaskSetManager: Lost task 751.0 in stage 19.0 (TID 1565493, ip, executor 547): ExecutorLostFailure (executor 547 exited caused by one of the running tasks) Reason: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
18/10/09 03:22:15 WARN TaskSetManager: Lost task 754.0 in stage 19.0 (TID 1565496, ip, executor 547): ExecutorLostFailure (executor 547 exited caused by one of the running tasks) Reason: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
18/10/09 03:22:15 WARN TaskSetManager: Lost task 753.0 in stage 19.0 (TID 1565495, ip, executor 547): ExecutorLostFailure (executor 547 exited caused by one of the running tasks) Reason: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
18/10/09 03:22:15 ERROR YarnScheduler: Lost executor 572 on ip: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
18/10/09 03:22:15 INFO DAGScheduler: Executor lost: 547 (epoch 45)
18/10/09 03:22:15 WARN TaskSetManager: Lost task 756.0 in stage 19.0 (TID 1565498, ip, executor 572): ExecutorLostFailure (executor 572 exited caused by one of the running tasks) Reason: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
...
The strange thing is that I can't even see the lost executors in the Executors list, so I can't check their logs.
It would be great if someone could help fix the problem.
There are many factors that can cause this, but the summary is the following:
Your master node is unable to reply to a specific executor, and you therefore get the error
Unable to register with external shuffle server due to
Why your master node is unable to reply can have different reasons; it depends on how your code is structured and, if you are using EMR, on the size of your instances.
To solve it:
Increase your master node. For example, if you are using i3.4xlarge, instead use i3.8xlarge or even i3.16xlarge.
Increase the network timeout from 2 minutes to 5 minutes. This is done with the following spark configuration: spark.network.timeout=300
Increase both the memory and the number of cores of your master node. To increase the number of cores of your master node, set the following configuration: spark.yarn.am.cores=3. A sketch of applying these settings follows below.
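For illustration, a sketch of applying these settings from PySpark; the values are the ones suggested above and the app name is a placeholder. Settings that the YARN application master needs at launch (such as spark.yarn.am.cores in client mode) generally have to be passed at submit time, e.g. via --conf on spark-submit:
from pyspark.sql import SparkSession

# Illustrative values only; tune them to your workload.
spark = (
    SparkSession.builder
    .appName("long-running-job")               # placeholder app name
    .config("spark.network.timeout", "300s")   # raise the network timeout to 5 minutes
    .config("spark.yarn.am.cores", "3")        # more cores for the YARN application master
    .getOrCreate()
)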
Hope this solves the issue.
I am running Hive on Spark on CDH 5.10 and I get the error below. I have checked all the YARN, Hive, and Spark logs, but there is no useful information apart from the following error:
Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 4, xxx.local, executor 1): java.lang.StackOverflowError
Try to set the following parameters before executing your query:
set spark.executor.extraJavaOptions=-Xss16m;
set hive.execution.engine=spark;
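If the same StackOverflowError comes from a plain PySpark job rather than through Hive, the equivalent would be to raise the JVM thread stack size through the Spark conf; a sketch, reusing the 16m value suggested above (driver-side JVM options normally have to be supplied at submit time in client mode):
from pyspark.sql import SparkSession

# Sketch only: -Xss16m mirrors the Hive setting above; adjust the stack size as needed.
spark = (
    SparkSession.builder
    .config("spark.executor.extraJavaOptions", "-Xss16m")
    .config("spark.driver.extraJavaOptions", "-Xss16m")
    .getOrCreate()
)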
My Spark Streaming application (running on HDInsight) constantly has tasks failing due to the error message below:
ExecutorLostFailure (executor 525 exited unrelated to the running tasks) Reason: Container container_1495825717937_0056_01_000916 on host: 10.0.0.14 was preempted.
Do you have any ideas about what I should do here? It is not obvious to me how I should act on this error message.