I have a pipeline in which I apply several transformations to my DataFrame.
It is important to insert checkpoints to keep the execution time acceptable.
However, from time to time I get this error from one of the checkpoints:
Job aborted due to stage failure: Task creation failed: java.io.FileNotFoundException
No such file or directory: /checkpoints/...../rdd-1002/part-00003
Can you please suggest a fix?
Is it similar to this issue, where complex logic overwrites the RDD while a failing process is trying to recover from it?
Spark not able to find checkpointed data in HDFS after executor fails
Related
I have jobs that repartition huge datasets in Parquet format, and the file system used is s3a (S3).
Browsing through the Spark UI, I stumbled upon a job that has uncompleted tasks but is marked as successful.
The different categories of jobs are: i) Active, ii) Completed, iii) Failed.
I am unable to deduce the reason for the uncompleted tasks, nor am I able to tell whether the job actually failed, given that there is a separate category for failed jobs.
How do I resolve this ambiguity?
I am running a job on 9 nodes.
All of them write some information to files using simple writes like the one below:
dfLogging.coalesce(1).write.format('delta').mode('append').save('/dbfs/' + loggingLocation)
However I am receiving this exception:
py4j.protocol.Py4JJavaError: An error occurred while calling
o106.save. : java.util.concurrent.ExecutionException:
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 1 in stage 14.0 failed 1 times, most recent failure: Lost task
1.0 in stage 14.0 (TID 259, localhost, executor driver): org.apache.hadoop.fs.ChecksumException: Checksum error:
file:/dbfs/delta/Logging/_delta_log/00000000000000000063.json at 0
exp: 1179219224 got: -1020415797
It looks to me that, because of concurrency, Spark is somehow failing and generating checksum errors.
Is there any known scenario that may be causing this?
There are a couple of things going on here, and they explain why coalesce may not work.
What coalesce does is essentially combine the partitions on each worker. For example, if you have three workers, you can perform coalesce(3), which consolidates the partitions on each worker.
What repartition does is shuffle the data to increase or decrease the total number of partitions. In your case, since you have more than one worker and need a single output, you have to use repartition(1) so that the data ends up in a single partition before being written out.
Why would coalesce not work?
Spark limits shuffling during coalesce: you cannot perform a full shuffle (across different workers) when you use coalesce, whereas you can when you use repartition, although it is an expensive operation.
Here is the code that would work:
dfLogging.repartition(1).write.format('delta').mode('append').save('/dbfs/' + loggingLocation)
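As a quick illustration (a minimal Scala sketch, not the original job; spark is assumed to be an active SparkSession), both calls end up with one partition, but only repartition performs the full shuffle across workers:

// Minimal sketch: compare coalesce and repartition before a single-file write
val df = spark.range(0, 1000000).toDF("id").repartition(12)

// coalesce merges existing partitions on each worker without a full shuffle
println(df.coalesce(1).rdd.getNumPartitions)     // 1

// repartition performs a full shuffle across workers into a single partition
println(df.repartition(1).rdd.getNumPartitions)  // 1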
I am seeing intermittent exceptions when attempting to write a Dataset to a partition in a hive table.
Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: /user/hive/warehouse/devl_fr9.db/fr9_ftdelivery_cpy_2_4d8eebd3_9691_47ce_8acc_b2a5123dabf6/.spark-staging-d996755c-eb81-4362-a393-31e8387104f0/date_id=20180604/part-00000-d996755c-eb81-4362-a393-31e8387104f0.c000.snappy.parquet for client 10.56.219.20 already exists
If I check HDFS, the relevant path does not exist. I can only assume this is some race condition involving temporary staging files. I am using Spark 2.3.
A possible reason for this issue is that, during a job's execution, a task started writing data to that file and failed.
When a task fails, the data that it had already written is not deleted/purged by Spark (confirmed at least in 2.3 and 2.4). Therefore, when a different executor attempts to re-execute the failed task it will attempt to write to a file with the same name, and you'll get a FileAlreadyExistsException.
In your case, the file that already exists is called part-00000-d996755c-eb81-4362-a393-31e8387104f0.c000, so it's likely that you have a log message in stderr indicating that task 00000 was lost due to failure, something like
WARN TaskSetManager: Lost task 00000 in stage...
If you fix the reason for this failure (probably an OutOfMemoryError, if the issue is intermittent), the FileAlreadyExistsException will likely go away, because the task will no longer fail and leave temporary files behind.
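If the root cause does turn out to be memory related, one common mitigation (a Scala sketch with placeholder values, not a guaranteed fix) is to give the executors more heap and overhead when the session is created:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-partition-write")                // illustrative name only
  .config("spark.executor.memory", "8g")          // placeholder value
  .config("spark.executor.memoryOverhead", "2g")  // placeholder value, Spark 2.3+
  .enableHiveSupport()
  .getOrCreate()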
I am streaming data from Kafka as below:
final JavaPairDStream<String, Row> transformedMessages = rtStream
        .mapToPair(record -> new Tuple2<String, GenericDataModel>(record.key(), record.value()))
        .mapWithState(StateSpec.function(updateDataFunc).numPartitions(32))
        .stateSnapshots();

transformedMessages.foreachRDD(rdd -> {
    // logic goes here
});
I have four worker threads and multiple executors for this application, and I am trying to check the fault tolerance of Spark.
Since we are using mapWithState, Spark is checkpointing data to HDFS, so if any executor/worker goes down, we should be able to recover the lost data (the data lost on the dead executor) and continue with the remaining executors/workers.
So I kill one of the worker nodes to see if the application still runs smoothly, but instead I get a FileNotFoundException for HDFS, as below:
This is a bit odd: Spark checkpointed the data in HDFS at some point, so why is it not able to find it? Obviously HDFS is not deleting any data, so why this exception?
Or am I missing something here?
[ERROR] 2018-08-21 13:07:24,067 org.apache.spark.streaming.scheduler.JobScheduler logError - Error running job streaming job 1534871220000 ms.2
org.apache.spark.SparkException: Job aborted due to stage failure: Task creation failed: java.io.FileNotFoundException: File does not exist: hdfs://mycluster/user/user1/sparkCheckpointData/2db59817-d954-41a7-9b9d-4ec874bc86de/rdd-1005/part-00000
java.io.FileNotFoundException: File does not exist: hdfs://mycluster/user/user1/sparkCheckpointData/2db59817-d954-41a7-9b9d-4ec874bc86de/rdd-1005/part-00000
at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1122)
at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1114)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1114)
at org.apache.spark.rdd.ReliableCheckpointRDD.getPreferredLocations(ReliableCheckpointRDD.scala:89)
at org.apache.spark.rdd.RDD$$anonfun$preferredLocations$1.apply(RDD.scala:273)
at org.apache.spark.rdd.RDD$$anonfun$preferredLocations$1.apply(RDD.scala:273)
at scala.Option.map(Option.scala:146)
at org.apache.spark.rdd.RDD.preferredLocations(RDD.scala:273)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1615)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$1.apply$mcVI$sp(DAGScheduler.scala:1626)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$1.apply(DAGScheduler.scala:1625)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$1.apply(DAGScheduler.scala:1625)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1625)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1623)
Further Update:
I found that the RDD Spark is trying to find in HDFS has already been deleted by the "ReliableRDDCheckpointData" process, which created a new RDD for the checkpoint data.
Somehow the DAG is still pointing to this old RDD. Had there been any reference to this data, it should not have been deleted.
Consider this pipeline of transformations on a Spark stream:
rtStream
    .mapToPair(record -> new Tuple2<String, GenericDataModel>(record.key(), record.value()))
    .mapWithState(StateSpec.function(updateDataFunc).numPartitions(32)).stateSnapshots()
    .foreachRDD(rdd -> {
        if (counter == 1) {
            // convert the RDD to a Dataset and register it as a SQL table named "InitialDataTable"
        } else {
            // convert the RDD to a Dataset and register it as a SQL table named "ActualDataTable"
        }
    });
mapWithState comes with automatic checkpointing of the state data after every batch, so each "rdd" in the foreachRDD block above is checkpointed, and while checkpointing, it overwrites the previous checkpoint (because, obviously, the latest state needs to stay in the checkpoint).
But let's say the user is still using RDD number 1; in my case I register the very first RDD as one table and every subsequent RDD as a different table, so it should not be overwritten. (It is the same as in Java: if something still holds a reference to an object, that object is not eligible for garbage collection.)
Now, when I try to access the table "InitialDataTable", the "rdd" used to create that table is obviously no longer in memory, so Spark goes to HDFS to recover it from the checkpoint, and it does not find it there either, because it was overwritten by the very next RDD. The Spark application then stops, citing the reason:
"org.apache.spark.SparkException: Job aborted due to stage failure: Task creation failed: java.io.FileNotFoundException: File does not exist: hdfs://mycluster/user/user1/sparkCheckpointData/2db59817-d954-41a7-9b9d-4ec874bc86de/rdd-1005/part-00000"
So to resolve this issue, I had to checkpoint the very first RDD explicitly.
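A minimal Scala sketch of that fix (the original code is Java; transformedMessages and counter are the names from the question): the first snapshot RDD is checkpointed explicitly, so the automatic mapWithState checkpoint cannot overwrite the only copy the registered table still depends on.

transformedMessages.foreachRDD { rdd =>
  if (counter == 1) {
    rdd.checkpoint()   // mark this RDD for its own reliable checkpoint
    rdd.count()        // materialize it so the checkpoint files are actually written
    // convert the RDD to a Dataset and register it as "InitialDataTable"
  } else {
    // convert the RDD to a Dataset and register it as "ActualDataTable"
  }
}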
Is there a way to run my Spark program and be shielded from the files underneath changing?
The code starts by reading a parquet file (no errors during the read):
val mappings = spark.read.parquet(S3_BUCKET_PATH + "/table/mappings/")
It then performs transformations on the data, e.g.:
val newTable = mappings.join(anotherTable, 'id)
These transformations take hours (which is another problem).
Sometimes the job finishes; other times it dies with a message similar to the following:
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 6 in stage 1014.0 failed 4 times, most recent failure: Lost task
6.3 in stage 1014.0 (TID 106820, 10.127.251.252, executor 5): java.io.FileNotFoundException: No such file or directory:
s3a://bucket1/table/mappings/part-00007-21eac9c5-yyzz-4295-a6ef-5f3bb13bed64.snappy.parquet
We believe another job is changing the files underneath us, but haven't been able to find the culprit.
This is a complicated problem to solve. If the underlying data changes while you are operating on the same DataFrame, the Spark job will fail. The reason is that when the DataFrame was created, the underlying RDD recorded the location of the data and the DAG associated with it. Now, if the underlying data is suddenly changed by some other job, the RDD has no option but to fail.
One possibility is to enable retries, speculation, etc., but the problem still exists. Generally, if you have a Parquet table that you want to read and write at the same time, partition it by date or time; then writes will go to one partition while reads happen in a different partition.
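For example (a rough sketch; the partitioned path, the date_id column, and the date value are hypothetical), the table could be written partitioned by date so that writers and readers touch different partitions:

import org.apache.spark.sql.functions.current_date

// write each batch into its own date partition (column name and paths are illustrative)
mappings
  .withColumn("date_id", current_date())
  .write
  .mode("append")
  .partitionBy("date_id")
  .parquet(S3_BUCKET_PATH + "/table/mappings_partitioned/")

// readers pin an already-written partition instead of the one being written
val snapshot = spark.read.parquet(S3_BUCKET_PATH + "/table/mappings_partitioned/date_id=2018-08-20/")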
Now, about the join taking a long time: if you read the data from S3, join, and write back to S3 again, performance will be slower, because Hadoop has to fetch the data from S3 first and then perform the operation (the code is not going to the data). Although the network calls are fast, I ran some experiments comparing S3 with EMRFS and found about a 50% slowdown with S3.
One alternative is to copy the data from S3 to HDFS and then run the join there. That shields you from the data being overwritten, and performance will be faster.
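A sketch of that approach (the HDFS staging path is a placeholder): stage a stable copy on HDFS, then join against the copy so later overwrites of the S3 data cannot break the job.

val hdfsCopy = "hdfs:///staging/table/mappings/"   // hypothetical staging location

// snapshot the S3 data onto HDFS once, before the long-running join
spark.read.parquet(S3_BUCKET_PATH + "/table/mappings/")
  .write.mode("overwrite").parquet(hdfsCopy)

// run the join against the stable HDFS copy
val mappingsLocal = spark.read.parquet(hdfsCopy)
val newTable = mappingsLocal.join(anotherTable, "id")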
One last thing: if you are using Spark 2.2, writing to S3 is painfully slow due to the deprecation of DirectOutputCommitter, so that could be another reason for the slowdown.