I am running Hive on Spark on CDH 5.10 and I get the error below. I have checked all the YARN, Hive, and Spark logs, but there is no useful information apart from the following error:
Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 4, xxx.local, executor 1): java.lang.StackOverflowError
Try to set the following parameters before executing your query:
set spark.executor.extraJavaOptions=-Xss16m;
set hive.execution.engine=spark;
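The -Xss16m option raises the stack size of the executor JVM threads, which is what overflows when java.lang.StackOverflowError is thrown, often because of deeply nested plans or expressions. Outside of Hive, a rough PySpark equivalent (a sketch, not specific to CDH; the app name is a placeholder) is to set the same option when building the session:
from pyspark.sql import SparkSession

# Sketch: apply the same 16 MB thread stack size to the executors of a
# standalone PySpark job hitting java.lang.StackOverflowError.
spark = (SparkSession.builder
         .appName("example-app")
         .config("spark.executor.extraJavaOptions", "-Xss16m")
         .getOrCreate())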
I am trying to save a trained PySpark Word2Vec model locally; however, this results in a "Mkdirs failed to create" error.
model.write().overwrite().save("word2vec.model")
22/08/31 12:56:24 WARN TaskSetManager: Lost task 0.0 in stage 421.0
(TID 11440) (100.66.40.74 executor 98): java.io.IOException: Mkdirs
failed to create
file:/home/jovyan/_git/notebooks/word2vec.model/metadata/_temporary/0/_temporary/attempt_202208311256242380698786646139780_0477_m_000000_0
(exists=false, cwd=file:/opt/spark/work-dir) at
org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:458)
The command does create a word2vec.model folder, but it eventually fails. What could the possible issue be here?
The final stage of the Spark job is to save 37 GB of data to a GCS bucket in Avro format. The Spark app runs on Dataproc.
My cluster consists of 15 workers with 4 cores and 15 GB RAM each, and 1 master with 4 cores and 15 GB RAM.
I use the following code:
df.write.option("forceSchema", schema_str) \
    .format("avro") \
    .partitionBy('platform', 'cluster') \
    .save(f"gs://{output_path}")
In the 4 attempts Spark made to run one of the failed tasks, the errors I got were:
1/4. java.lang.StackOverflowError
2/4. Job aborted due to stage failure: Task 29 in stage 13.0 failed 4 times, most recent failure: Lost task 29.3 in stage 13.0 (TID 3048, ce-w1.internal, executor 17): ExecutorLostFailure (executor 17 exited caused by one of the running tasks) Reason: Container from a bad node: container_1607696154227_0002_01_000028 on host: ce-w1.internal. Exit status: 50. Diagnostics: [2020-12-11 15:46:19.880]Exception from container-launch.
Container id: container_1607696154227_0002_01_000028
Exit code: 50
[2020-12-11 15:46:19.881]Container exited with a non-zero exit code 50. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
readOrdinaryObject(ObjectInputStream.java:2187)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2405)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2329)
3/4. java.lang.StackOverflowError
4/4. Job aborted due to stage failure: Task 29 in stage 13.0 failed 4 times, most recent failure: Lost task 29.3 in stage 13.0 (TID 3048, ce-w1.internal, executor 17): ExecutorLostFailure (executor 17 exited caused by one of the running tasks) Reason: Container from a bad node: container_1607696154227_0002_01_000028 on host: ce-w1.internal. Exit status: 50. Diagnostics: [2020-12-11 15:46:19.880]Exception from container-launch.
Container id: container_1607696154227_0002_01_000028
Exit code: 50
[2020-12-11 15:46:19.881]Container exited with a non-zero exit code 50. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
readOrdinaryObject(ObjectInputStream.java:2187)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2405)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2329)
From the Spark UI it's apparent that something is going on with the data distribution, but repartitioning gives the same StackOverflowError.
So these are the two questions I want to ask:
How do I interpret the 'container prelaunch error' message in the context of the StackOverflowError?
Why do the other actions in the job run safely, despite the same data distribution?
The problem is not due to your cluster capacity; it is due to the fact that you are working with the Avro format and forcing Spark to write a new schema while saving. Try not to use the post-defined schema and it will work. If you want to change the schema, do it before saving, for example via withColumn (see the sketch after the snippet below). Please check the number of shuffle partitions too.
df.write.format("avro") \
    .partitionBy('platform', 'cluster') \
    .save(f"gs://{output_path}")
While reading Parquet files in Spark, you may face the problem below.
App > Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 44, 10.23.5.196, executor 2): java.io.EOFException: Reached the end of stream with 193212 bytes left to read
App > at org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:104)
App > at org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127)
App > at org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91)
App > at org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1174)
App > at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:805)
App > at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:301)
App > at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:256)
App > at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:159)
App > at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
App > at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:124)
App > at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:215)
For the following Spark commands:
val df = spark.read.parquet("s3a://.../file.parquet")
df.show(5, false)
For me the above didn't do the trick, but the following did:
--conf spark.hadoop.fs.s3a.experimental.input.fadvise=sequential
Not sure why, but what gave me a hint was this issue and some details about the options here.
I think you can bypass this issue with
--conf spark.sql.parquet.enableVectorizedReader=false
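If you prefer setting these programmatically rather than through spark-submit flags, a PySpark sketch (the S3A path is a placeholder) would be:
from pyspark.sql import SparkSession

# Disable the vectorized Parquet reader and switch S3A to sequential
# fadvise before any Parquet files are read.
spark = (SparkSession.builder
         .config("spark.sql.parquet.enableVectorizedReader", "false")
         .config("spark.hadoop.fs.s3a.experimental.input.fadvise", "sequential")
         .getOrCreate())

df = spark.read.parquet("s3a://your-bucket/path/file.parquet")
df.show(5, False)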
I have a long-running Spark job which, after running for hours, failed with the following errors.
18/10/09 03:22:15 ERROR YarnScheduler: Lost executor 547 on ip: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
18/10/09 03:22:15 WARN TaskSetManager: Lost task 750.0 in stage 19.0 (TID 1565492, ip, executor 547): ExecutorLostFailure (executor 547 exited caused by one of the running tasks) Reason: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
18/10/09 03:22:15 WARN TaskSetManager: Lost task 752.0 in stage 19.0 (TID 1565494, ip, executor 547): ExecutorLostFailure (executor 547 exited caused by one of the running tasks) Reason: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
18/10/09 03:22:15 WARN TaskSetManager: Lost task 751.0 in stage 19.0 (TID 1565493, ip, executor 547): ExecutorLostFailure (executor 547 exited caused by one of the running tasks) Reason: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
18/10/09 03:22:15 WARN TaskSetManager: Lost task 754.0 in stage 19.0 (TID 1565496, ip, executor 547): ExecutorLostFailure (executor 547 exited caused by one of the running tasks) Reason: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
18/10/09 03:22:15 WARN TaskSetManager: Lost task 753.0 in stage 19.0 (TID 1565495, ip, executor 547): ExecutorLostFailure (executor 547 exited caused by one of the running tasks) Reason: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
18/10/09 03:22:15 ERROR YarnScheduler: Lost executor 572 on ip: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
18/10/09 03:22:15 INFO DAGScheduler: Executor lost: 547 (epoch 45)
18/10/09 03:22:15 WARN TaskSetManager: Lost task 756.0 in stage 19.0 (TID 1565498, ip, executor 572): ExecutorLostFailure (executor 572 exited caused by one of the running tasks) Reason: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
...
The strange thing is, I can't even see the lost executors in the Executors list to check their logs.
It would be great if someone could help fix this problem.
There are many factors that can cause this, but in summary:
Your master node is unable to reply to a specific executor and therefore gives the error
Unable to register with external shuffle server due to
Why your master node is unable to reply can have several causes; it depends on how your code is structured and, if you are using EMR, on the size of your instances.
To solve it:
Increase the size of your master node. For example, if you are using i3.4xlarge, use i3.8xlarge or even i3.16xlarge instead.
Increase the network timeout from 2 minutes to 5 minutes. This is done with the following Spark configuration: spark.network.timeout=300
Increase both the memory and the number of cores of your master node. To increase the number of cores, set the following configuration: spark.yarn.am.cores=3 (passed via spark-submit as shown below).
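For example (a sketch; the exact values depend on your workload), both settings can be passed as spark-submit flags; the s suffix just makes the time unit explicit:
--conf spark.network.timeout=300s
--conf spark.yarn.am.cores=3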
Hope this solves the issue.
My Scala program parses logs using a Java object method called parse. It works fine in local[*] mode, but it does not work in either YARN client or cluster mode on Cloudera.
val hivehbaserows = spark.sql("select log msg from hivehbasetable")
hivehbaserows.foreach(x => {
  val line = x.getString(0)
  javaObject.javaparsemethod(line)
})
When I run the above program in local[*] mode it works; however, if I launch it in YARN client or cluster mode, it throws a java.lang.NullPointerException in the executor stages with the error below.
executor 6), java.lang.NullPointerException(null) [duplicate 1]
executor 2), java.lang.NullPointerException(null) [duplicate 2]
executor 2), java.lang.NullPointerException(null) [duplicate 1]
Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0: java.lang.NullPointerException