Quickly write files made by MultipleTextOutputFormat to cloud store (Azure, S3, etc.)

I have a Spark job that takes data from a Hive table, transforms it, and eventually gives me an RDD whose keys are filenames and whose values are those files' contents. I then pass that to a custom OutputFormat that creates individual files based on those keys. The end result is about 20 million files, each about 1-10 MB in size.
My issue now is writing those files efficiently to their end destination. I can't put them in HDFS because 20 million small files will quickly grind HDFS to a halt. If I attempt to write directly to my cloud store, it goes very slowly because each task appears to upload the files it gets sequentially. I'm interested in hearing about any technique I can use to speed up this process so that files are uploaded with as much parallelization as possible.
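One way to get that parallelism (a sketch, not the asker's or an accepted solution) is to fan the uploads out inside each task with a thread pool, so every Spark task pushes many files at once instead of one at a time. The example below assumes boto3 and S3; the bucket, prefix, and worker count are made up, and the same pattern applies to the Azure storage SDK.
from concurrent.futures import ThreadPoolExecutor

def upload_partition(records):
    # Runs on the executor; import boto3 there, not on the driver.
    import boto3
    s3 = boto3.client("s3")  # boto3 clients are safe to share across threads

    def upload_one(record):
        filename, content = record
        # Bucket and prefix are illustrative placeholders.
        s3.put_object(Bucket="my-output-bucket",
                      Key="exports/" + filename,
                      Body=content.encode("utf-8"))

    # 32 concurrent uploads per task; tune to executor cores and bandwidth.
    with ThreadPoolExecutor(max_workers=32) as pool:
        list(pool.map(upload_one, records))

# rdd is assumed to be the (filename, file_content) RDD described above.
# rdd.foreachPartition(upload_partition)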

Related

Continous appending of data on existing tabular data file (CSV, parquet) using PySpark

For a project I need to frequently, but non-periodically, append about one thousand or more data files (tabular data) to one existing CSV or Parquet file with the same schema in Hadoop/HDFS (master=yarn). At the end, I need to be able to do some filtering on the result file to extract a subset of the data.
One dummy file may look like this (very simple example):
id,uuid,price
1,16c533c3-c191-470c-97d9-e1e01ccc3080,46159
2,6bb0917b-2414-4b24-85ca-ae2c2713c9a0,50222
3,7b1fa3f9-2db2-4d93-a09d-ca6609cfc834,74591
4,e3a3f874-380f-4c89-8b3e-635296a70d76,91026
5,616dd6e8-5d05-4b07-b8f2-7197b579a058,73425
6,23e77a21-702d-4c87-a69c-b7ace0626616,34874
7,339e9a7f-efb1-4183-ac32-d365e89537bb,63317
8,fee09e5f-6e16-4d4f-abd1-ecedb1b6829c,6642
9,2e344444-35ee-47d9-a06a-5a8bc01d9eab,55931
10,d5cba8d6-f0e1-49c8-88e9-2cd62cde9737,51792
The number of rows may vary between 10 and about 100,000.
On user request, all input files copied into a source folder should be ingested by an ETL pipeline and appended to the end of one single CSV/Parquet file, or any other appropriate file format (no DB). Data from a single input file may be spread over one, two, or more partitions.
Because the input data files may all have different numbers of rows, I am concerned about getting partitions of different sizes in the resulting CSV/Parquet file. Sometimes all the data may be appended as one new file; sometimes the data is so big that several files are appended.
And because input files may be appended many times, by different users and from different sources, I am also concerned that the resulting CSV/Parquet may contain too many part-files for the NameNode to handle.
I have done some small tests appending data to existing CSV / Parquet files and noticed that for each append, a new file was generated - for example:
df.write.mode('append').csv('/user/applepy/pyspark_partition/uuid.csv')
will append the new data as a new file under 'uuid.csv' (which is actually a directory generated by PySpark containing all the pieces of appended data).
Doing some load tests based on real conditions, I quickly realized that I was generating A LOT of files (several tens of thousands). At some point I had so many files that PySpark was unable to simply count the number of rows (NameNode memory overflow).
So I wonder how to solve this problem. What would be the best practice here? Reading the whole file, appending the data chunk, and saving the data as a new file doesn't seem very efficient here.
NameNode memory overflow
Then increase the heap size of the NameNode.
quickly realized that I was generating A LOT of files
HDFS write operations almost never append to single files. They append "into a directory", and create new files, yes.
From Spark, you can use coalesce and repartition to create larger writer batches.
As you mentioned, you want Parquet, so write that then. That will give you even smaller file sizes in HDFS.
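As a rough illustration of the coalesce/repartition advice above (not part of the original answer), each ingest batch can be compacted into a few part-files before appending; the input path, output path, and partition count below are made up, and an existing SparkSession `spark` is assumed.
# Compact each ingest batch into a few larger part-files before appending.
new_batch = (spark.read
    .option("header", "true")
    .csv("/user/applepy/incoming/*.csv"))

(new_batch
    .coalesce(4)  # at most 4 new part-files per append instead of one per input file
    .write
    .mode("append")
    .parquet("/user/applepy/pyspark_partition/uuid_parquet"))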
or any other appropriate file format (no DB)
HDFS is not really the appropriate tool for this. ClickHouse, Druid, and Pinot are the current real-time ingest / ETL tools being used, especially when data is streamed in "non-periodically" from Kafka.

Is there a limit on the number of CSV files PySpark can read?

I am relatively new to Spark/PySpark, so any help is much appreciated.
Currently we have files being delivered hourly to an Azure Data Lake directory, for example:
hour1.csv
hour2.csv
hour3.csv
I am using Databricks to read the files in that directory using the code below:
sparkdf = spark.read.format("csv").option("recursiveFileLookup", "true").option("header", "true").schema(schema).load(file_location)
Each of the CSV files is about 5 KB, and all have the same schema.
What I am unsure about is how scalable "spark.read" is. Currently we are processing about 2,000 of these small files, and I am worried that there is a limit on the number of files being processed. Is there a limit, say a maximum of 5,000 files, beyond which my code above breaks?
From what I have read online, I believe data size is not an issue with the method above - Spark can read petabytes worth of data (comparatively, our total data size is still very small) - but there is no mention of the number of files it is able to process. Educate me if I am wrong.
Any explanation is very much appreciated.
Thank you.
The limit is your driver's memory.
When reading a directory, the driver lists it (depending on the initial size, it may parallelize the listing to executors, but it collects the results either way).
After having the list of files, it creates tasks for the executors to run.
With that in mind, if the list is too large to fit in the driver's memory, you will have issues.
You can always increase the driver's memory to manage it, or have some preprocessing step that merges the files (GCS has gsutil compose, which can merge files without downloading them).
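As a rough sketch of that preprocessing idea (not from the original answer), the small hourly CSVs can periodically be compacted with Spark itself into a handful of larger files; `compacted_location` and the target file count are made up, while `schema` and `file_location` are reused from the question.
# Periodic compaction sketch: read the many small CSVs once, then rewrite
# them as a few larger files so later jobs have far fewer objects to list.
small_files = (spark.read
    .format("csv")
    .option("recursiveFileLookup", "true")
    .option("header", "true")
    .schema(schema)
    .load(file_location))

(small_files
    .repartition(8)  # target ~8 output files; tune to data volume
    .write
    .mode("overwrite")
    .option("header", "true")
    .csv(compacted_location))  # hypothetical output directory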

Spark: writing data to a place that is being read from without losing data

Please help me understand how I can write data to the place that is also being read from, without any issues, using EMR and S3.
So I need to read partitioned data, find old data, delete it, and write new data back, and I'm thinking about 2 ways here:
Read all data, apply a filter, and write the data back with save option SaveMode.Overwrite. I see one major issue here - before writing, it will delete the files in S3, so if the EMR cluster goes down for some reason after the deletion but before the write, all the data will be lost. I can use dynamic partition overwrite, but in such a situation that would still mean losing the data from one partition.
Same as above, but write to a temp directory, then delete the original and move everything from temp to the original location. But as this is S3 storage, it doesn't have a move operation and all the files would be copied, which can be a bit pricey (I'm going to work with 200 GB of data).
Is there any other way, or am I wrong about how Spark works?
You are not wrong. The process of deleting a record from a table on EMR/Hadoop is painful in the ways you describe and more. It gets messier with failed jobs, small files, partition swapping, slow metadata operations...
There are several formats and file protocols that add transactional capability on top of a table stored in S3. The open Delta Lake format (https://delta.io/) supports transactional deletes, updates, and merge/upsert, and does so very well. You can read & delete (say, for GDPR purposes) like you're describing, and you'll have a transaction log to track what you've done.
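For illustration (not part of the original answer), here is a minimal sketch of that delete-then-append flow with the open-source Delta Lake Python API; the table path and predicate are made up, and the cluster needs the delta-spark package configured.
# Minimal Delta Lake sketch (illustrative path and predicate).
from delta.tables import DeltaTable

path = "s3://my-bucket/events_delta"  # hypothetical table location

# One-time conversion: write the existing data out as a Delta table, e.g.
# df.write.format("delta").partitionBy("dt").save(path)

# Transactional delete of the old data; concurrent readers never see a
# half-deleted table because the change is committed in the transaction log.
table = DeltaTable.forPath(spark, path)
table.delete("dt < '2020-01-01'")

# Append the replacement data in a separate transaction.
# new_df.write.format("delta").mode("append").save(path)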
On point 2, as long as you have a reasonable number of files, your costs should be modest, with storage charges at ~$23/TB/month. However, if you end up with too many small files, then the API costs of listing and fetching files can add up quickly. Managed Delta (from Databricks) will help speed up many of the operations on your tables through compaction, data caching, data skipping, and Z-ordering.
Disclaimer, I work for Databricks....

What is the best practice writing massive amount of files to s3 using Spark

I'm trying to write about 30k-60k Parquet files to S3 using Spark, and it's taking a massive amount of time (40+ minutes) due to the S3 rate limit.
I wonder if there is a best practice for doing such a thing. I have heard that writing the data to HDFS and then copying it using s3-dist-cp may be faster, but I can't understand why - won't the copy from HDFS take the same amount of time because of the S3 rate limit?
Thanks for your help
There is nothing wrong with this approach, and it works absolutely fine in most use cases, but there might be some challenges due to the way files are written in S3.
Two Important Concepts to Understand
S3 (Object Store) != POSIX File System: Rename Operation
A file rename in a POSIX-based file system is a metadata-only operation. Only the pointer changes and the file remains as-is on disk. For example, if I have a file abc.txt and I rename it to xyz.txt, the rename is instantaneous and atomic, and xyz.txt's last-modified timestamp remains the same as abc.txt's last-modified timestamp.
In AWS S3 (an object store), by contrast, a file rename under the hood is a copy followed by a delete: the source file is first copied to the destination and then the source file is deleted. So "aws s3 mv" changes the last-modified timestamp of the destination file, unlike a POSIX file system. The metadata here is a key-value store where the key is the file path and the value is the content of the file, and there is no operation that simply changes the key in place. The rename time therefore depends on the size of the file; for a directory rename (there is nothing actually called a directory in S3, but for simplicity we can treat a recursive set of files under a common prefix as one), it depends on the number of files inside the directory as well as the size of each file. So, in a nutshell, rename is a very expensive operation in S3 compared to a normal file system.
S3 Consistency Model
S3 comes with 2 kinds of consistency: a. read-after-write, b. eventual consistency, which in some cases results in file-not-found exceptions - files being added but not listed yet, or files being deleted but not yet removed from the listing.
Deep explanation:
Spark leverages Hadoop's "FileOutputCommitter" implementations to write data. Writing data involves multiple steps: at a high level, staging the output files and then committing them, i.e. writing the final files. The rename step I was talking about earlier is involved here, when moving from the staging step to the final step. As you know, a Spark job is divided into multiple stages and sets of tasks, and due to the nature of distributed computing the tasks are prone to failure, so there is also a provision to re-launch the same task after a system failure, or to speculatively execute slow-running tasks, which leads to the concepts of task commit and job commit functions. Here we have 2 readily available algorithms for how job and task commits are done, and, having said this, neither algorithm is strictly better than the other; rather, it depends on where we are committing the data.
mapreduce.fileoutputcommitter.algorithm.version=1
commitTask renames the data generated by the task from the task temporary directory to the job temporary directory.
When all the tasks are complete, commitJob renames all the data from the job temporary directory to the final destination and, at the end, creates the _SUCCESS file.
Here the driver does the work of commitJob at the end, so for object stores like S3 this may take a long time because lots of task temporary files are queued up for rename operations (it's not serial, though), and write performance is not optimized. It can work pretty well for HDFS, where rename is not expensive and is just a metadata change. For AWS S3, each rename during commitJob opens up a huge number of API calls to AWS S3 and might cause issues such as unexpected API throttling if the number of files is high - or it might not; I have seen both cases for the same job run at two different times.
mapreduce.fileoutputcommitter.algorithm.version=2
commitTask moves the data generated by the task from the task temporary directory directly to the final destination as soon as the task is complete.
commitJob basically writes the _SUCCESS file and doesn't do much else.
From a high level this looks optimized, but it comes with limitations: speculative task execution cannot be used, and if any task fails due to corrupt data we might end up with residual data in the final destination that needs a clean-up. So this algorithm doesn't guarantee 100% data correctness and doesn't work for use cases where we need to append data to existing files. Even though it gives optimized results, it comes with a risk. The reason for the better performance is basically the smaller number of rename operations compared to algorithm 1 (there are still renames). We might also encounter file-not-found exceptions here, because commitTask writes the file to a temporary path and immediately renames it, so there is a slight chance of eventual-consistency issues.
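For reference (not stated in the original answer), the committer algorithm is a Hadoop setting that Spark passes through its spark.hadoop.* configuration prefix; a minimal way to select version 2, assuming you have weighed the caveats above:
# Select FileOutputCommitter algorithm v2 via Spark's Hadoop configuration.
# Only do this after weighing the caveats above (no speculative execution,
# possible residual files when a task fails).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("s3-write-example")  # illustrative app name
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
    .getOrCreate())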
Best Practices
Here are a few I think we can use while writing Spark data processing applications:
If you have an HDFS cluster available, then write data from Spark to HDFS and copy it to S3 to persist it. s3-dist-cp can be used to copy data from HDFS to S3 optimally, and here we can avoid all those rename operations. With AWS EMR typically running only for the duration of the compute and being terminated afterwards, persisting the results to S3 this way looks preferable (see the sketch after this list).
Try to avoid writing files and reading them again and again unless there are consumers for the files; Spark is well known for in-memory processing, and careful persistence/caching of data in memory will help optimize the application's run time.
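A rough sketch of the first practice above (HDFS staging followed by s3-dist-cp), with made-up paths and assuming `df` is the DataFrame to persist; s3-dist-cp itself runs as a shell command or EMR step after the Spark write, so it appears here only as a comment.
# 1) Write the output to HDFS first, where the committer's renames are cheap
#    metadata operations (paths are illustrative).
df.write.mode("overwrite").parquet("hdfs:///staging/my_table")

# 2) Then bulk-copy the finished files to S3, e.g. as an EMR step or from the
#    master node:
#    s3-dist-cp --src hdfs:///staging/my_table --dest s3://my-bucket/my_table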

What is the fastest way to put a large amount of data on a local file system onto a distributed store?

I have a single local directory on the order of 1 terabyte. It is made up of millions of very small text documents. If I were to iterate through each file sequentially for my ETL, it would take days. What would be the fastest way for me to perform ETL on this data and ultimately load it onto a distributed store like HDFS or a Redis cluster?
Generically: try to use several/many parallel asynchronous streams, one per file. How many will depend on several factors (number of destination endpoints, disk I/O for traversing/reading the data, network buffers, errors and latency...).
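As a very rough illustration of that advice (not from the answer), a local uploader can walk the directory and push files with a pool of worker threads; the source directory, worker count, and the client call are placeholders to be replaced with the real store's API.
import os
from concurrent.futures import ThreadPoolExecutor

SRC_DIR = "/data/local_docs"  # hypothetical source directory

def upload_one(path):
    # Placeholder body: swap in a real client call (WebHDFS, Redis, S3, ...).
    with open(path, "rb") as f:
        payload = f.read()
    # client.put(key_for(path), payload)  # hypothetical destination call
    return len(payload)

def iter_files(root):
    # Walk the tree and yield every file path under root.
    for dirpath, _, names in os.walk(root):
        for name in names:
            yield os.path.join(dirpath, name)

# 64 concurrent uploads; tune to disk I/O, network, and endpoint limits.
with ThreadPoolExecutor(max_workers=64) as pool:
    for _ in pool.map(upload_one, iter_files(SRC_DIR)):
        pass  # could count bytes or collect failures here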

Resources