I have a Spark process that writes to HDFS (Parquet files). My guess is that by default, if Spark has some failure and retries, it could write some files twice (am I wrong?).
But then, what should I do to get idempotence on the HDFS output?
I see 2 situations that should be questioned differently (but please correct me or develop it if you know better):
the failure happens while writing one item: I guess the write is restarted, and thus could write twice if the write to HDFS is not "atomic" with respect to Spark's write call. What are the chances?
the failure happens anywhere, but due to how the execution DAG is built, the restart happens at a task that comes before several write tasks (I am thinking of having to restart before some groupBy, for example), and some of those write tasks were already done. Does Spark's execution guarantee that those tasks won't be called again?
I think it depends on what kind of committer you use for your job and whether that committer is able to undo the failed job or not. For example:
When you use Apache Parquet-formatted output, Spark expects the committer to be a subclass of ParquetOutputCommitter. If you use DirectParquetOutputCommitter, for example to append data, it is not able to undo the job.
If you use ParquetOutputCommitter itself, you can see that it extends FileOutputCommitter and overrides the commitJob(JobContext jobContext) method a bit.
The following is copied from Hadoop: The Definitive Guide:
OutputCommitter API:
The setupJob() method is called before the job is run, and is typically used to perform
initialization. For FileOutputCommitter, the method creates the final output directory,
${mapreduce.output.fileoutputformat.outputdir}, and a temporary working space
for task output, _temporary, as a subdirectory underneath it.
If the job succeeds, the commitJob() method is called, which in the default file-based
implementation deletes the temporary working space and creates a hidden empty marker
file in the output directory called _SUCCESS to indicate to filesystem clients that the job
completed successfully. If the job did not succeed, abortJob() is called with a state object
indicating whether the job failed or was killed (by a user, for example). In the default
implementation, this will delete the job’s temporary working space.
The operations are similar at the task level. The setupTask() method is called before the
task is run, and the default implementation doesn’t do anything, because temporary
directories named for task outputs are created when the task outputs are written.
The commit phase for tasks is optional and may be disabled by returning false from
needsTaskCommit(). This saves the framework from having to run the distributed commit
protocol for the task, and neither commitTask() nor abortTask() is called.
FileOutputCommitter will skip the commit phase when no output has been written by a
task.
If a task succeeds, commitTask() is called, which in the default implementation moves the
temporary task output directory (which has the task attempt ID in its name to avoid
conflicts between task attempts) to the final output path,
${mapreduce.output.fileoutputformat.outputdir}. Otherwise, the framework calls
abortTask(), which deletes the temporary task output directory.
The framework ensures that in the event of multiple task attempts for a particular task,
only one will be committed; the others will be aborted. This situation may arise because
the first attempt failed for some reason — in which case, it would be aborted, and a later,
successful attempt would be committed. It can also occur if two task attempts were
running concurrently as speculative duplicates; in this instance, the one that finished first
would be committed, and the other would be aborted.
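The commit lifecycle quoted above can be sketched as a small simulation on a local filesystem standing in for HDFS. This is a toy model for illustration only, not Hadoop's actual code; all helper names and the layout under "committed" are made up, though the attempt-directory and _SUCCESS ideas follow the description above:

```python
import os
import shutil
import tempfile

# Simulate the FileOutputCommitter flow on a local filesystem.
# "output" stands in for ${mapreduce.output.fileoutputformat.outputdir}.
base = tempfile.mkdtemp()
output = os.path.join(base, "output")

# setupJob: create the final output dir and a _temporary working space.
os.makedirs(os.path.join(output, "_temporary", "0"))

def run_task(task_attempt_id, data):
    # Each attempt writes into its own attempt dir to avoid conflicts.
    attempt_dir = os.path.join(output, "_temporary", "0", task_attempt_id)
    os.makedirs(attempt_dir)
    with open(os.path.join(attempt_dir, "part-00000"), "w") as f:
        f.write(data)
    return attempt_dir

def commit_task(attempt_dir):
    # commitTask: move the attempt dir into the job's committed area (a rename,
    # i.e. a cheap metadata operation on a real file system).
    committed = os.path.join(output, "_temporary", "0", "committed",
                             os.path.basename(attempt_dir))
    os.makedirs(os.path.dirname(committed), exist_ok=True)
    os.rename(attempt_dir, committed)

def abort_task(attempt_dir):
    # abortTask: delete the attempt's temporary output.
    shutil.rmtree(attempt_dir)

def commit_job():
    # commitJob: move committed task output to the final destination,
    # delete _temporary, and drop a _SUCCESS marker.
    committed_root = os.path.join(output, "_temporary", "0", "committed")
    for name in os.listdir(committed_root):
        for fname in os.listdir(os.path.join(committed_root, name)):
            os.rename(os.path.join(committed_root, name, fname),
                      os.path.join(output, fname + "-" + name))
    shutil.rmtree(os.path.join(output, "_temporary"))
    open(os.path.join(output, "_SUCCESS"), "w").close()

# Two attempts of the same task: only one is committed, the other aborted,
# exactly as the quoted text describes for speculative duplicates.
a1 = run_task("attempt_0", "rows")
a2 = run_task("attempt_1", "rows")   # speculative duplicate
commit_task(a1)
abort_task(a2)
commit_job()

print(sorted(os.listdir(output)))    # only the committed output and _SUCCESS remain
```

The key property: nothing becomes visible under the output directory until commitJob runs, and a duplicate attempt never reaches the final path.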
I have a Databricks job that runs many commands and at the end tries to save the results to a folder. However, it failed because it tried to write a file to a folder that did not exist.
I simply created the folder.
However, how can I make it continue where it left off without executing all the previous commands?
I assume that by Databricks job you refer to the way to run non-interactive code in a Databricks cluster.
I do not think that what you ask is possible, namely getting the output of a certain Spark task from a previous job run on Databricks. As pointed out in the other answer, "if job is finished, then all processed data is gone". This has to do with the way Spark works under the hood. If you are curious about this topic, I suggest you start reading this post about Transformations and Actions in Spark.
Although you can think of a few workarounds: for instance, if you are curious about certain intermediate outputs of your job, you could temporarily write your DataFrame/Dataset to some external location. In this way you can easily resume the job from your preferred point by reading one of your checkpoints as input. This approach is a bit clunky and I do not recommend it, but it's a quick and dirty solution you might want to choose if you are in the testing/designing phase.
A more robust solution would involve splitting your job in multiple sub-jobs and setting upstream & downstream dependencies among them. You can do that using Databricks natively (Task dependencies section) or an external scheduler that integrates with Databricks, like Airflow.
In this way, you can split your tasks and have finer-grained control over your application. So, in case of further failures on the writing step, you will be able to easily rerun only the writing.
If the job is finished, then all processed data is gone, unless you write some intermediate state (additional tables, etc.) from which you can continue processing. In most cases, Spark actually executes the code only when it's writing the results of the provided transformations.
So right now you just need to rerun the job.
I'm trying to write about 30k-60k parquet files to S3 using Spark and it's taking a massive amount of time (40+ minutes) due to the S3 rate limit.
I wonder if there is a best practice for doing such a thing. I heard that writing the data to HDFS and then copying it using s3-dist-cp may be faster. I can't understand why. Won't the copy from HDFS take the same amount of time because of the S3 rate limit?
Thanks for your help
There is nothing wrong with this approach and it works absolutely fine in most use cases, but there might be some challenges due to the way files are written to S3.
Two Important Concepts to Understand
S3 (Object Store) != POSIX File System: Rename Operation
A file rename in a POSIX-based file system is a metadata-only operation: only the pointer changes and the file remains as-is on disk. For example, if I have a file abc.txt and I rename it to xyz.txt, the rename is instantaneous and atomic, and xyz.txt's last-modified timestamp remains the same as abc.txt's.
Whereas in AWS S3 (an object store), a file rename under the hood is a copy followed by a delete: the source file is first copied to the destination and then the source file is deleted. So "aws s3 mv" changes the last-modified timestamp of the destination file, unlike a POSIX file system. The metadata here is a key-value store, where the key is the file path and the value is the content of the file, and there is no operation that changes the key in place. The rename time therefore depends on the size of the file. If a directory is renamed (there is no such thing as a directory in S3; for simplicity we can treat a recursive set of files as a directory), it depends on the number of files inside the directory along with the size of each file. So, in a nutshell, rename is a very expensive operation in S3 compared to a normal file system.
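The difference can be illustrated with a toy key-value "object store" model, where rename is necessarily a copy of the full value plus a delete. This is a simplified illustration, not the real S3 API; all names here are made up:

```python
import time

# Toy object store: a dict from key (path) to (content, last_modified).
store = {}

def put(key, content):
    store[key] = (content, time.time())

def rename(src, dst):
    # There is no way to change a key in place: rename is copy + delete,
    # so the cost grows with the object size and the timestamp changes.
    content, _ = store[src]
    put(dst, content)        # full copy of the content
    del store[src]           # then delete the source

put("abc.txt", "x" * 1024)
old_mtime = store["abc.txt"][1]
rename("abc.txt", "xyz.txt")

print("abc.txt" in store)                 # False: the source is gone
print(store["xyz.txt"][1] >= old_mtime)   # True: mtime changed, unlike POSIX
```

In a POSIX file system, the same operation would touch only the directory entry; here the entire value has to move, which is why renaming many large staged files during a commit is so slow on S3.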
S3 Consistency Model
S3 comes with two kinds of consistency: (a) read-after-write and (b) eventual consistency, which in some cases results in file-not-found exceptions: files being added but not listed, or files being deleted but not removed from the listing.
Deep explanation:
Spark leverages Hadoop's FileOutputCommitter implementations to write data. Writing data involves multiple steps: at a high level, staging the output files and then committing them, i.e. writing the final files. The rename step I mentioned earlier happens between the staging and final steps. As you know, a Spark job is divided into multiple stages and sets of tasks, and due to the nature of distributed computing the tasks are prone to failure, so there is also provision to re-launch the same task due to system failure or speculative execution of slow-running tasks. That leads to the concepts of task-commit and job-commit functions. Here we have two readily available algorithms for how job and task commits are done, and neither algorithm is strictly better than the other; it rather depends on where we are committing data.
mapreduce.fileoutputcommitter.algorithm.version=1
commitTask renames the data generated by the task from the task temporary directory to the job temporary directory.
When all tasks are complete, commitJob renames all the data from the job temporary directory to the final destination and at the end creates the _SUCCESS file.
Here the driver does the work of commitJob at the end, so object stores like S3 may take a long time because of the many task temporary files queued up for rename operations (it's not serial, though), and write performance is not optimized. It works pretty well for HDFS, where rename is not expensive and just a metadata change. For AWS S3, each rename during commitJob opens up a huge number of API calls to AWS S3 and might cause issues of unexpected API-call failures if the number of files is high. Or it might not; I have seen both cases for the same job running at two different times.
mapreduce.fileoutputcommitter.algorithm.version=2
commitTask moves the data generated by the task from the task temporary directory directly to the final destination as soon as the task is complete.
commitJob basically writes the _SUCCESS file and doesn't do much else.
From a high level this looks optimized, but it comes with limitations: speculative task execution is not safe, and if any task fails due to corrupt data we might end up with residual data in the final destination that needs cleanup. So this algorithm doesn't give 100% data correctness, and it doesn't work for use cases where we need to append data to existing files. Even though it gives optimized results, it comes with a risk. The reason for the good performance is basically the smaller number of rename operations compared to algorithm 1 (there are still renames). We might also encounter file-not-found exceptions, because commitTask writes the file to a temporary path and immediately renames it, so there is a slight chance of eventual-consistency issues.
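The algorithm is selected through a Hadoop property, which Spark forwards to Hadoop when it is prefixed with spark.hadoop. (shown here both as a spark-submit option and as a spark-defaults.conf line; the default is version 1):

```
# as a spark-submit option
--conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2

# or in spark-defaults.conf
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
```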
Best Practices
Here are a few practices I think we can use while writing Spark data-processing applications:
If you have an HDFS cluster available, write data from Spark to HDFS and then copy it to S3 to persist it. s3-dist-cp can be used to copy data from HDFS to S3 optimally, and this way we avoid all those rename operations. With AWS EMR clusters that run only for the duration of the compute and are terminated afterwards, persisting results this way looks preferable.
Try to avoid writing files and reading them again and again unless there are consumers for the files; Spark is well known for in-memory processing, and careful data persistence/caching in memory will help optimize the application's run time.
Why does Spark, while saving results to a file system, upload result files to a _temporary directory and then move them to the output folder, instead of uploading them directly to the output folder?
A two-stage process is the simplest way to ensure the consistency of the final result when working with file systems.
You have to remember that each executor thread writes its result set independently of the other threads, and writes can be performed at different moments in time or even reuse the same set of resources. At the moment of the write, Spark cannot determine whether all writes will succeed.
In case of failure one can rollback the changes by removing temporary directory.
In case of success one can commit the changes by moving temporary directory.
Another benefit of this model is the clear distinction between writes in progress and finalized output. As a result, it can be easily integrated with simple workflow-management tools, without needing a separate state store or another synchronization mechanism.
This model is simple, reliable, and works well with the file systems for which it was designed. Unfortunately, it doesn't perform that well with object stores, which don't support atomic moves.
We have a Spark cluster with 10 nodes. I have a process that joins a couple of dataframes and then saves the result to an S3 location. We are running in cluster mode. When I call save on the dataframe, does it save from the nodes, or does it collect all the results to the driver and write them from the driver to S3? Is there a way to verify this?
RDD.save() triggers an evaluation of the entire query.
The work is partitioned by the source data (i.e. files) plus any splitting which can be done, pushing individual tasks to available executors, collecting the results, and finally writing them to the destination directory using the cross-node protocol defined in implementations of FileCommitProtocol (generally HadoopMapReduceCommitProtocol), which in turn works with Hadoop's FileOutputCommitter to choreograph the commit.
Essentially:
Tasks write to their task-specific subdirectory under _temporary/$job-attempt/$task-attempt
Tasks say they are ready to write, and the Spark driver tells them to commit vs. abort
In speculative execution or failure conditions, tasks can abort, in which case they delete their temp dir
On a task commit, the task lists the files in its dir and renames them into a job attempt dir, or directly to the destination (v2 protocol)
On a job commit, the driver either lists and renames the files in the job attempt dir (v1 protocol), or is a no-op (v2).
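The v2 hazard mentioned earlier (residual data from a failed job) can be mimicked with a toy local-filesystem model. This is a simplification for illustration only: the real committer renames staged files rather than writing them directly, but the visibility problem is the same:

```python
import os
import tempfile

# Toy illustration of the v2 hazard: task commits land straight in the
# destination, so a job that fails later leaves partial output behind.
base = tempfile.mkdtemp()
dest = os.path.join(base, "dest")
os.makedirs(dest)

def v2_commit_task(files):
    # v2 commitTask: results become visible in the final destination
    # as soon as each individual task finishes.
    for name, data in files:
        with open(os.path.join(dest, name), "w") as f:
            f.write(data)

v2_commit_task([("part-00000", "rows from task 0")])
# Suppose the job now fails before the remaining tasks run. There is no
# job-temporary directory to delete: task 0's files are already visible
# in the destination and must be cleaned up by hand.
print(os.listdir(dest))  # ['part-00000']
```

Under the v1 protocol, the same failure would leave the destination untouched, because nothing is moved out of the job attempt directory until job commit.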
On the question of "writing to S3": if you are using Apache Spark (and not Amazon EMR), then be aware that this list + rename commit mechanism is (a) slow, as renames are really copies, and (b) dangerous, as the eventual consistency of S3, especially list inconsistency, means that files saved by tasks may not be listed, and hence not committed.
At the time of writing (May 2017), the sole committer known to safely commit using s3a or s3n clients is the netflix committer.
There's work underway to pull this into Hadoop and hence spark, but again, in May 2017 it's still a work in progress: demo state only. I speak as the engineer doing this.
To close then: If you want reliable data output writing to S3, consult whoever is hosting your code on EC2. If you are using out-the-box Apache Spark without any supplier-specific code, do not write direct to S3. It may work in tests, but as well as seeing the intermittent failure, you may lose data and not even notice. Statistics are your enemy here: the more work you do, the bigger the datasets, the more tasks that are executed -and so eventually something will go wrong.
From the official doc we can see that:
For accumulator updates performed inside actions only, Spark
guarantees that each task’s update to the accumulator will only be
applied once, i.e. restarted tasks will not update the value. In
transformations, users should be aware of that each task’s update may
be applied more than once if tasks or job stages are re-executed.
I think this means that accumulators should be updated inside actions only, such as rdd.foreachPartition().
Looking at rdd.foreachPartition's API code in PySpark, I find that rdd.foreachPartition(accum_func) is equivalent to:
rdd.mapPartitions(accum_func).mapPartitions(lambda i: [sum(1 for _ in i)]).mapPartitions(lambda x: [sum(x)]).mapPartitions(some_add_func).collect()
It seems that accum_func can run inside transformations (rdd.mapPartitions)?
Thanks a lot for any explanation
if the node running a partition of a map() operation crashes,
Spark will rerun it on another node and even if the node does not crash but is simply
much slower than other nodes, Spark can preemptively launch a “speculative” copy
of the task on another node, and take its result if that finishes.
Even if no nodes fail, Spark may have to rerun a task to rebuild a cached value that falls out of memory. The net result is therefore that the same function may run multiple times on the same data, depending on what happens on the cluster.
For accumulators used in actions, Spark applies each task's update to each accumulator only once. Thus, if we want a reliable absolute-value counter, regardless of failures or multiple evaluations, we must put it inside an action like foreach().
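The retry behavior can be mimicked without a cluster: treat one counter as a transformation-side accumulator (updated on every execution) and another as an action-side accumulator, deduplicated per task the way Spark applies action updates only once. This is a toy model of the semantics, not Spark's actual implementation:

```python
# Toy model of accumulator semantics under task re-execution.
accum_in_transform = 0   # updated in a "transformation": every run counts
accum_in_action = 0      # updated in an "action": applied once per task
applied_tasks = set()    # which tasks already committed their update

def run_task(task_id, rows):
    global accum_in_transform, accum_in_action
    # Transformation-side update: happens on every (re)execution.
    accum_in_transform += len(rows)
    # Action-side update: the driver applies it only once per task,
    # so a retried task does not count twice.
    if task_id not in applied_tasks:
        accum_in_action += len(rows)
        applied_tasks.add(task_id)

run_task(0, [1, 2, 3])
run_task(0, [1, 2, 3])   # speculative or failed-task retry of the same task

print(accum_in_transform)  # 6: the transformation-side counter double-counted
print(accum_in_action)     # 3: the action-side counter stayed correct
```

This is why the docs recommend updating accumulators inside actions such as foreach() when you need an exact count.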