What is the best practice for writing a massive number of files to S3 using Spark? - apache-spark

I'm trying to write about 30k-60k parquet files to S3 using Spark, and it's taking a massive amount of time (40+ minutes) due to the S3 rate limit.
I wonder if there is a best practice for doing such a thing. I heard that writing the data to HDFS and then copying it using s3-dist-cp may be faster, but I can't understand why. Won't the copy from HDFS take the same amount of time because of the S3 rate limit?
Thanks for your help

There is nothing wrong with this approach, and it works absolutely fine in most use cases, but there can be challenges due to the way files are written in S3.
Two Important Concepts to Understand
S3 (Object Store) != POSIX File System: the Rename Operation
A file rename in a POSIX-based file system is a metadata-only operation: only the pointer changes, and the file remains as-is on disk. For example, if I have a file abc.txt and I rename it to xyz.txt, the rename is instantaneous and atomic, and xyz.txt's last-modified timestamp remains the same as abc.txt's last-modified timestamp.
In AWS S3 (an object store), by contrast, a file rename is under the hood a copy followed by a delete: the source file is first copied to the destination, then the source file is deleted. So "aws s3 mv" changes the last-modified timestamp of the destination file, unlike a POSIX file system. The metadata here is a key-value store where the key is the file path and the value is the content of the file, and there is no operation that simply changes the key in place. The cost of a rename therefore depends on the size of the file. For a directory rename (there is no such thing as a directory in S3, but for simplicity we can treat a recursive set of files as a directory), it depends on the number of files inside the directory as well as the size of each file. In a nutshell, a rename is a very expensive operation in S3 compared to a normal file system.
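The POSIX half of this contrast is easy to check locally: since a rename only rewires the directory entry, the file's last-modified timestamp survives. A minimal sketch (local filesystem only; `aws s3 mv`, being a copy + delete, would not preserve it):

```python
import os
import tempfile

def rename_preserves_mtime() -> bool:
    """Rename a file on a POSIX filesystem and report whether mtime survives."""
    with tempfile.TemporaryDirectory() as d:
        src = os.path.join(d, "abc.txt")
        dst = os.path.join(d, "xyz.txt")
        with open(src, "w") as f:
            f.write("hello")
        before = os.stat(src).st_mtime_ns
        os.rename(src, dst)  # metadata-only: no bytes are copied
        return os.stat(dst).st_mtime_ns == before

print(rename_preserves_mtime())  # on a local POSIX filesystem: True
```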
S3 Consistency Model
S3 comes with two kinds of consistency: a. read-after-write and b. eventual consistency, which in some cases results in file-not-found exceptions: files being added but not yet listed, or files being deleted but not yet removed from listings.
Deep explanation:
Spark leverages Hadoop's "FileOutputCommitter" implementations to write data. Writing data involves multiple steps: on a high level, staging the output files and then committing them, i.e. writing the final files. The rename step I mentioned earlier happens between the staging and the final step. As you know, a Spark job is divided into multiple stages and sets of tasks, and due to the nature of distributed computing, tasks are prone to failure, so there is also provision to re-launch the same task after a system failure or to speculatively execute slow-running tasks. That leads to the concepts of task commit and job commit functions. Here we have two readily available algorithms for how job and task commits are done; neither algorithm is better than the other in general, rather it depends on where we are committing the data.
mapreduce.fileoutputcommitter.algorithm.version=1
commitTask renames the data generated by the task from the task temporary directory to the job temporary directory.
When all tasks are complete, commitJob renames all the data from the job temporary directory to the final destination and at the end creates the _SUCCESS file.
Here the driver does the work of commitJob at the end, so object stores like S3 may take a long time because of the large number of task temporary files queued up for rename operations (it's not serial, though), and write performance is not optimized. It can work pretty well for HDFS, where a rename is not expensive and is just a metadata change. For AWS S3, each rename operation during commitJob opens up a huge number of API calls to S3 and may cause issues of unexpected API throttling if the number of files is high. Or it may not; I have seen both cases for the same job running at two different times.
mapreduce.fileoutputcommitter.algorithm.version=2
commitTask moves the data generated by the task from the task temporary directory directly to the final destination as soon as the task is complete.
commitJob basically writes the _SUCCESS file and doesn't do much.
From a high level this looks optimized, but it comes with limitations: speculative task execution cannot be used safely, and if any task fails due to corrupt data, we can end up with residual data in the final destination that needs a clean-up. So this algorithm doesn't give 100% data correctness, and it doesn't work for use cases where we need to append data to existing files. Even though it gives optimized results, it comes with a risk. The reason for the good performance is basically the smaller number of rename operations compared to algorithm 1 (there are still renames). We might also encounter file-not-found exceptions here, because commitTask writes the file to a temporary path and immediately renames it, so there is a slight chance of eventual-consistency issues.
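To make the difference concrete, here is a hedged local-filesystem simulation of the two rename flows (the directory names and layout are simplified illustrations, not Hadoop's actual internals). Algorithm 1 renames each task's output twice, algorithm 2 only once; on S3 each of those renames is really a copy + delete:

```python
import os
import tempfile

def simulate(algorithm: int, num_tasks: int = 3) -> int:
    """Simulate FileOutputCommitter v1/v2 rename flows; return the rename count."""
    renames = 0
    with tempfile.TemporaryDirectory() as root:
        final = os.path.join(root, "final")
        job_tmp = os.path.join(root, "_temporary", "job")
        os.makedirs(final)
        os.makedirs(job_tmp)
        for t in range(num_tasks):
            task_tmp = os.path.join(root, "_temporary", f"task-{t}")
            os.makedirs(task_tmp)
            part = os.path.join(task_tmp, f"part-{t:05d}")
            with open(part, "w") as f:
                f.write("data")
            # commitTask: v1 renames into the job temp dir; v2 goes straight to final
            target = job_tmp if algorithm == 1 else final
            os.rename(part, os.path.join(target, f"part-{t:05d}"))
            renames += 1
        if algorithm == 1:
            # commitJob (done by the driver): rename every file again, job temp -> final
            for name in os.listdir(job_tmp):
                os.rename(os.path.join(job_tmp, name), os.path.join(final, name))
                renames += 1
        open(os.path.join(final, "_SUCCESS"), "w").close()  # job commit marker
    return renames

print(simulate(1), simulate(2))  # 6 3  (v1: two renames per task, v2: one)
```

Counting renames this way shows why algorithm 2 halves the rename traffic, and why the driver-side second pass in algorithm 1 is the bottleneck on object stores where every rename is a copy.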
Best Practices
Here are a few practices I think we can use while writing Spark data-processing applications:
If you have an HDFS cluster available, then write data from Spark to HDFS and copy it to S3 to persist it. s3-dist-cp can be used to copy data from HDFS to S3 optimally. Here we avoid all the rename operations. With AWS EMR running only for the duration of the compute and then being terminated afterwards to persist the result, this approach looks preferable.
Try to avoid writing files and reading them again and again unless there are consumers for the files; Spark is well known for in-memory processing, and careful data persistence/caching in memory will help optimize the application's run time.
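For the first practice above, the EMR workflow typically looks like the following sketch (the bucket name and paths are placeholders, so adjust them to your setup):

```shell
# 1. Write from Spark to HDFS first (fast: HDFS renames are metadata-only)
#    e.g. df.write.parquet("hdfs:///tmp/output")
# 2. Then bulk-copy the committed output to S3 in one parallel step
s3-dist-cp --src hdfs:///tmp/output --dest s3://my-bucket/output
```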

Related

Spark: writing data to a place that is being read from without losing data

Please help me understand how I can write data to the place that is also being read from, without any issues, using EMR and S3.
So I need to read partitioned data, find the old data, delete it, and write the new data back, and I'm thinking about 2 ways here:
Read all the data, apply a filter, and write the data back with save option SaveMode.Overwrite. I see one major issue here: before writing, it will delete the files in S3, so if the EMR cluster goes down for some reason after the deletion but before the write, all the data will be lost. I can use dynamic partition overwrite, but in that situation I would still lose the data from one partition.
Same as above, but write to a temp directory, then delete the original and move everything from temp to the original location. But as this is S3 storage, it doesn't have a move operation and all the files will be copied, which can be a bit pricey (I'm going to work with 200GB of data).
Is there any other way, or am I wrong about how Spark works?
You are not wrong. The process of deleting a record from a table on EMR/Hadoop is painful in the ways you describe and more. It gets messier with failed jobs, small files, partition swapping, slow metadata operations...
There are several formats and file protocols that add transactional capability on top of a table stored in S3. The open Delta Lake (https://delta.io/) format supports transactional deletes, updates, and merge/upsert, and does so very well. You can read & delete (say, for GDPR purposes) like you're describing, and you'll have a transaction log to track what you've done.
On point 2, as long as you have a reasonable number of files, your costs should be modest, with storage charges at ~$23/TB/mo. However, if you end up with too many small files, then the API costs of listing and fetching files can add up quickly. Managed Delta (from Databricks) will help speed up many of the operations on your tables through compaction, data caching, data skipping, and Z-ordering.
Disclaimer, I work for Databricks....

Spark _temporary creation reason

Why does Spark, while saving a result to a file system, upload result files to a _temporary directory and then move them to the output folder, instead of uploading them directly to the output folder?
A two-stage process is the simplest way to ensure consistency of the final result when working with file systems.
You have to remember that each executor thread writes its result set independently of the other threads, and writes can be performed at different moments in time or even reuse the same set of resources. At the moment of the write, Spark cannot determine whether all writes will succeed.
In case of failure one can rollback the changes by removing temporary directory.
In case of success one can commit the changes by moving temporary directory.
Another benefit of this model is the clear distinction between writes in progress and finalized output. As a result, it can easily be integrated with simple workflow-management tools, without the need for a separate state store or other synchronization mechanism.
This model is simple, reliable and works well with file systems for which it has been designed. Unfortunately it doesn't perform that well with object stores, which don't support moves.
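That clear distinction between writes in progress and finalized output usually surfaces as the `_SUCCESS` marker file: a downstream consumer can refuse to read a directory until the commit has finished. A minimal sketch of that check (the helper name is mine, not a Spark API):

```python
import os
import tempfile

def is_committed(output_dir: str) -> bool:
    """Treat output as final only once the committer has written _SUCCESS."""
    return os.path.exists(os.path.join(output_dir, "_SUCCESS"))

with tempfile.TemporaryDirectory() as out:
    # A part file alone means the write may still be in progress
    with open(os.path.join(out, "part-00000"), "w") as f:
        f.write("data")
    print(is_committed(out))   # False: no _SUCCESS yet
    open(os.path.join(out, "_SUCCESS"), "w").close()
    print(is_committed(out))   # True: job commit has finished
```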

Is it possible to retrieve the list of files when a DataFrame is written, or to have Spark store it somewhere?

With a call like
df.write.csv("s3a://mybucket/mytable")
I obviously know where files/objects are written, but because of S3's eventual-consistency guarantees, I can't be 100% sure that getting a listing from that location will return all (or even any) of the files that were just written. If I could get the list of files/objects Spark just wrote, then I could prepare a manifest file for a Redshift COPY command without worrying about eventual consistency. Is this possible, and if so, how?
The spark-redshift library can take care of this for you. If you want to do it yourself you can have a look at how they do it here: https://github.com/databricks/spark-redshift/blob/1092c7cd03bb751ba4e93b92cd7e04cffff10eb0/src/main/scala/com/databricks/spark/redshift/RedshiftWriter.scala#L299
EDIT: I avoid further worry about consistency by using df.coalesce(fileCount) to output a known number of file parts (for Redshift you want a multiple of the slices in your cluster). You can then check how many files are listed in the Spark code and also how many files are loaded in Redshift stl_load_commits.
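Building on that coalesce approach, the Redshift COPY manifest can be constructed from the known part count instead of an S3 listing. A hedged sketch (the exact part-file naming, i.e. suffix and any compression extension, depends on the output format and Spark version, so verify it against one real run):

```python
import json

def build_manifest(base_url: str, file_count: int, suffix: str = ".csv") -> str:
    """Build a Redshift COPY manifest for coalesce(file_count) output.

    Assumes part files are named part-00000, part-00001, ... as Spark does;
    "mandatory": true makes COPY fail loudly if any part is missing.
    """
    entries = [
        {"url": f"{base_url}/part-{i:05d}{suffix}", "mandatory": True}
        for i in range(file_count)
    ]
    return json.dumps({"entries": entries}, indent=2)

print(build_manifest("s3://mybucket/mytable", 2))
```

Because every entry is marked mandatory, a file delayed by eventual consistency causes the COPY to fail and retry rather than silently loading a partial dataset.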
It's good to be aware of the consistency risks; you can hit them in listings, with delayed visibility of newly created objects and with deleted objects still being found.
AFAIK, you can't get a list of the files created, as tasks can generate whatever they want into the task output dir, which is then marshalled (via listing and copy) into the final output dir.
In the absence of a consistency layer atop S3 (S3mper, S3Guard, etc.), you can read & spin for "a bit" to allow the shards to catch up. I have no good idea of what a good value of "a bit" is.
However, if you are calling df.write.csv(), you may have been caught by listing inconsistencies within the committer used to propagate task output to the job dir; that's done in S3A via list + copy.

Spark save: does it collect and save, or save from each node?

We have a Spark cluster with 10 nodes. I have a process that joins a couple of dataframes and then saves the result to an S3 location. We are running in cluster mode. When I call save on the dataframe, does it save from the nodes, or does it collect all the results to the driver and write them from the driver to S3? Is there a way to verify this?
RDD.save() triggers an evaluation of the entire query.
The work is partitioned by the source data (i.e. files) and any splitting which can be done, with individual tasks pushed to available executors; the results are collected and finally written to the destination directory using the cross-node protocol defined in implementations of FileCommitProtocol, generally HadoopMapReduceCommitProtocol, which in turn works with Hadoop's FileOutputCommitter to choreograph the commit.
Essentially:
Tasks write to their task-specific subdir under _temporary/$job-attempt/$task-attempt
Tasks say they are ready to commit; the Spark driver tells them to commit or abort
in speculative execution or failure conditions, tasks can abort, in which case they delete their temp dir
on a commit, the task lists files in its dir and renames them to a job attempt dir, or direct to the destination (v2 protocol)
On a job commit, the driver either lists and renames files in the job attempt dir (v1 protocol), or is a no-op (v2).
On the question of "writing to S3": if you are using Apache Spark (and not Amazon EMR), then be aware that this list + rename commit mechanism is (a) slow, as renames are really copies, and (b) dangerous, as the eventual consistency of S3, especially list inconsistency, means that files saved by tasks may not be listed and hence not committed.
At the time of writing (May 2017), the sole committer known to safely commit using the s3a or s3n clients is the Netflix committer.
There's work underway to pull this into Hadoop and hence spark, but again, in May 2017 it's still a work in progress: demo state only. I speak as the engineer doing this.
To close, then: if you want reliable data output when writing to S3, consult whoever is hosting your code on EC2. If you are using out-of-the-box Apache Spark without any supplier-specific code, do not write directly to S3. It may work in tests, but as well as seeing intermittent failures, you may lose data and not even notice. Statistics are your enemy here: the more work you do, the bigger the datasets and the more tasks that are executed, and so eventually something will go wrong.

Quickly write files made by MultipleTextOutputFormat to cloud store (Azure, S3, etc)

I have a Spark job that takes data from a Hive table, transforms it, and eventually gives me an RDD containing keys of filenames and values of that file's content. I then pass that onto a custom OutputFormat that creates individual files based on those keys. The end result is about 20 million files, each file being about 1-10MB in size.
My issue is now efficiently writing those files to my end destination. I can't put them in HDFS because 20 million small files will quickly grind HDFS to a halt. If I attempt to write directly to my cloud store, it goes very slowly, as it appears that each task uploads each file it gets sequentially. I'm interested in hearing about any technique I can use to speed up this process so that files are uploaded with as much parallelization as possible.
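One general technique for this is to fan the uploads inside each task out over a thread pool, since uploads are I/O-bound, so threads help even in Python. A hedged sketch with the actual upload stubbed out (`upload_one` is a placeholder for your cloud SDK's put/upload call, e.g. run per partition via `rdd.foreachPartition`):

```python
from concurrent.futures import ThreadPoolExecutor

def upload_one(name: str, content: bytes) -> str:
    """Placeholder: replace with your SDK's upload call (e.g. an S3 put_object)."""
    return name  # pretend the upload succeeded and return the object key

def upload_all(files: dict, max_workers: int = 32) -> list:
    """Upload many small files concurrently instead of one at a time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(upload_one, k, v) for k, v in files.items()]
        return [f.result() for f in futures]  # re-raises any upload failure

uploaded = upload_all({f"file-{i}.txt": b"data" for i in range(100)})
print(len(uploaded))  # 100
```

How far to raise `max_workers` depends on the store's request rate limits; past a point, more concurrency just converts slow uploads into throttling errors.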