spark3.2 tmp commit file is not found when commit batch

spark3.2 tmp commit file is not found when commit batch - apache-spark

exception info
firstly, official file /tmp/chenxiaobin/online_sid_sgby_v2/commits/222881 is exist, but spark still to rename a file of temporary version
spark fail to rename tmp commit file, the batch is running long time and namenode has a delay at the same time, hadoop cluster is busy
but i still can not understand why spark will throw the exception
as far as i know, spark is not running more same batch than one at the same time, but that look like a concurrent problem
so does anyone encounter this problem ? how to solve

Related

How are writes managed in Spark with speculation enabled?

Let's say I have a Spark 2.x application, which has speculation enabled (spark.speculation=true), which writes data to a specific location on HDFS.
Now if the task (which writes data to HDFS) takes long, Spark would create a copy of the same task on another executor, and both the jobs would be running in parallel.
How does Spark handle this? Obviously both the tasks shouldn't be trying to write data at the same file location at the same time (which seems to be happening in this case).
Any help would be appreciated.
Thanks

As I understand what is happening in my tasks:
If one of the speculative tasks is finished, the other is killed
When spark kills this task, it deletes temporary file written by this task
So no data will be duplicated
If you choose mode overwrite, some specilative tasks may fail with this exception:
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException):
Failed to CREATE_FILE /<hdfs_path>/.spark-staging-<...>///part-00191-.c000.snappy.parquet
for DFSClient_NONMAPREDUCE_936684547_1 on 10.3.110.14
because this file lease is currently owned by DFSClient_NONMAPREDUCE_-1803714432_1 on 10.0.14.64
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2629)
I will continue to study this situation, so maybe the answer will be more helpful some day

Spark concurrent writes on same HDFS location

I have a spark code which saves a dataframe to a HDFS location (date partitioned location) in Json format using append mode.
df.write.mode("append").format('json').save(hdfsPath)
sample hdfs location : /tmp/table1/datepart=20190903
I am consuming data from upstream in NiFi cluster. Each node in NiFi cluster will create a flow file for consumed data. My spark code is processing that flow file.As NiFi is distributed, my spark code is getting executed from different NiFi nodes in parallel trying to save data into same HDFS location.
I cannot store output of spark job in different directories as my data is partitioned on date.
This process is running daily once from last 14 days and my spark job failed 4 times with different errors.
First Error:
java.io.IOException: Failed to rename FileStatus{path=hdfs://tmp/table1/datepart=20190824/_temporary/0/task_20190824020604_0000_m_000000/part-00000-101aa2e2-85da-4067-9769-b4f6f6b8f276-c000.json; isDirectory=false; length=0; replication=3; blocksize=268435456; modification_time=1566630365451; access_time=1566630365034; owner=hive; group=hive; permission=rwxrwx--x; isSymlink=false} to hdfs://tmp/table1/datepart=20190824/part-00000-101aa2e2-85da-4067-9769-b4f6f6b8f276-c000.json
Second Error:
java.io.FileNotFoundException: File hdfs://tmp/table1/datepart=20190825/_temporary/0 does not exist.
Third Error:
java.io.FileNotFoundException: File hdfs://tmp/table1/datepart=20190901/_temporary/0/task_20190901020450_0000_m_000000 does not exist.
Fourth Error:
java.io.FileNotFoundException: File hdfs://tmp/table1/datepart=20190903/_temporary/0 does not exist.
Following are the problems/issue:
I am not able to recreate this scenario again. How to do that?
On all 4 occasions, errors are related to _temporary directory. Is is because 2 or more jobs are parallelly trying to save the data in same HDFS location and whiling doing that Job A might have deleted _temporary directory of Job B? (Because of the same location and all folders have common name /_directory/0/)
If it is concurrency problem then I can run all NiFi processor from primary node but then I will loose the performance.
Need your expert advice.
Thanks in advance.

It seems the problem is that two spark nodes are independently trying to write to the same place, causing conflicts as the fastest one will clear up the working directory before the second one expects it.
The most straightforward solution may be to avoid this.
As I understand how you use Nifi and spark, the node where Nifi runs also determines the node where spark runs (there is a 1-1 relationship?)
If that is the case you should be able to solve this by routing the work in Nifi to nodes that do not interfere with each other. Check out the load balancing strategy (property of the queue) that depends on attributes. Of course you would need to define the right attribute, but something like directory or table name should go a long way.

Try to enable outputcommitter v2:
spark.conf.set("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
It doesn't use shared temp directory for files , but creates .sparkStaging-<...> independent temp directories for each write
It also speeds up write, but allow some rear hypothetical cases of partial data write
Try to check this doc for more info:
https://spark.apache.org/docs/3.0.0-preview/cloud-integration.html#recommended-settings-for-writing-to-object-stores

Spark failing because S3 files are updated. How to eliminate this error?

My Spark script is failing because the S3 bucket from which the df is drawn gets updated with new files while the script is running. I don't care about the newly arriving files, but apparently Spark does.
I've tried adding the REFRESH TABLE command per the error msg, but that doesn't work because it is impossible to know at execution time when the new files will arrive, and so no way to know where to put that command. I have tried putting that REFRESH command in 4 different places in the script (in other words, invoking it 4 times at different points in the script) - all with the same failure message
Caused by: java.io.FileNotFoundException: No such file or directory '<snipped for posting>.snappy.parquet'
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
I create the df with: df = spark.table('data_base.bal_daily_posts')
So what can I do to make sure that S3 files arriving at S3 post-script-kickoff are ignored and do not error out the script?

Move files you're going to process to a different folder(key) and point spark to work with this folder only

This can happen if the same (e.g. if reading and writing directories are the same, Spark will first delete the files in destination) or other process has updated the files while they are being read.

Spark staging directory race condition when writing to hive partition?

I am seeing intermittent exceptions when attempting to write a Dataset to a partition in a hive table.
Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: /user/hive/warehouse/devl_fr9.db/fr9_ftdelivery_cpy_2_4d8eebd3_9691_47ce_8acc_b2a5123dabf6/.spark-staging-d996755c-eb81-4362-a393-31e8387104f0/date_id=20180604/part-00000-d996755c-eb81-4362-a393-31e8387104f0.c000.snappy.parquet for client 10.56.219.20 already exists
If I check HDFS the relevant path does not exist. I can only assume this is some race condition regarding temp staging files. I am using Spark 2.3

A possible reason for this issue is that, during a job's execution, a task started writing data to that file and failed.
When a task fails, the data that it had already written is not deleted/purged by Spark (confirmed at least in 2.3 and 2.4). Therefore, when a different executor attempts to re-execute the failed task it will attempt to write to a file with the same name, and you'll get a FileAlreadyExistsException.
In your case, the file that already exists is called part-00000-d996755c-eb81-4362-a393-31e8387104f0.c000, so it's likely that you have a log message in stderr indicating that task 00000 was lost due to failure, something like
WARN TaskSetManager: Lost task **00000** in stage...
If you fix the reason for this failure - probably an OutOfMemoryError, if the issue is intermitent - the FileAlreadyExistsException will likely be solved because the task will not fail and leave temporary files behind.

Spark Streaming - Restarting from checkpoint replays last batch

We are trying to build a fault tolerant spark streaming job, there's one problem we are running into. Here's our scenario:
1) Start a spark streaming process that runs batches of 2 mins
2) We have checkpoint enabled. Also the streaming context is configured to either create a new context or build from checkpoint if one exists
3) After a particular batch completes, the spark streaming job is manually killed using yarn application -kill (basically mimicking a sudden failure)
4) The spark streaming job is then restarted from checkpoint
The issue that we are having is that after the spark streaming job is restarted it replays the last successful batch. It always does this, just the last successful batch is replayed, not the earlier batches
The side effect of this is that the data part of that batch is duplicated. We even tried waiting for more than a minute after the last successful batch before killing the process (just in case writing to checkpoint takes time), that didn't help
Any insights? I have not added the code here, hoping someone has faced this as well and can give some ideas or insights. Can post the relevant code as well if that helps. Shouldn't spark streaming checkpoint right after a successful batch so that it is not replayed after a restart? Does it matter where I place the ssc.checkpoint command?

You have the answer in the last line of your question. The placement of ssc.checkpoint() matters. When you restart the job using the saved checkpointing, the job comes up with whatever is being saved. So in your case when you killed the job after the batch is completed, the recent one is the last successful one. By this time, you might have understood that checkpointing is mainly to pick up from where you left off-especially for failed jobs.

There are two things those need to be taken care.
1] Ensure that the same checkpoint directory is being used in getOrCreate streaming context method when you restart the program.
2] Set “spark.streaming.stopGracefullyOnShutdown" to "true". This allows spark to complete processing current data and update the checkpoint directory accordingly. If set to false, it may lead to corrupt data in checkpoint directory.
Note: Please post code snippets if possible. And yes, the placement of ssc.checkpoint does matter.

In Such a scenario, one should ensure that checkpoint directory used in streaming context method is same after restart of Spark application. Hopefully it will help

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

spark3.2 tmp commit file is not found when commit batch - apache-spark

Related

How are writes managed in Spark with speculation enabled?

Spark concurrent writes on same HDFS location

Spark failing because S3 files are updated. How to eliminate this error?

Spark staging directory race condition when writing to hive partition?

Spark Streaming - Restarting from checkpoint replays last batch

Categories

Resources