Continous appending of data on existing tabular data file (CSV, parquet) using PySpark - apache-spark

For a project I need to append frequently but on a non-periodic way about one thousand or more data files (tabular data) on one existing CSV or parquet file with same schema in Hadoop/HDFS (master=yarn). At the end, I need to be able to do some filtering on the result file - to extract subset of data.
One dummy file may look like this (very simple example):
id,uuid,price
1,16c533c3-c191-470c-97d9-e1e01ccc3080,46159
2,6bb0917b-2414-4b24-85ca-ae2c2713c9a0,50222
3,7b1fa3f9-2db2-4d93-a09d-ca6609cfc834,74591
4,e3a3f874-380f-4c89-8b3e-635296a70d76,91026
5,616dd6e8-5d05-4b07-b8f2-7197b579a058,73425
6,23e77a21-702d-4c87-a69c-b7ace0626616,34874
7,339e9a7f-efb1-4183-ac32-d365e89537bb,63317
8,fee09e5f-6e16-4d4f-abd1-ecedb1b6829c,6642
9,2e344444-35ee-47d9-a06a-5a8bc01d9eab,55931
10,d5cba8d6-f0e1-49c8-88e9-2cd62cde9737,51792
Number of rows may vary between 10 and about 100000
On user request, all input files copied in a source folder should be ingested by an ETL pipeline and appended at the end of one single CSV/parquet file or any other appropriate file format (no DB). Data from a single input file may be spread over one, two or more partitions.
Because the input data files may all have different number of rows, I am concerned about getting partitions with different sizes in the resulting CSV/parquet file. Sometimes all the data may be append in one new file. Sometimes the data is so big that several files are appended.
And because input files may be appended a lot of time from different users and different sources, I am also concerned that the result CSV/parquet may contains too much part-files for the namenode to handle them.
I have done some small test appending data on existing CSV / parquet files and noticed that for each appending, a new file was generated - for example:
df.write.mode('append').csv('/user/applepy/pyspark_partition/uuid.csv')
will append the new data as a new file in the file 'uuid.csv' (which is actually a directory generated by pyspark containing all pieces of appended data).
Doing some load tests based on real conditions, I quickly realized that I was generating A LOT of files (several 10-thousands). At some point I got so much files that PySpark was unable to simple count the number of rows (NameNode memory overflow).
So I wonder how to solve this problem. What would be the best practice here? Read the whole file, append the data chunk, same the data in a new file doesn't seems to be very efficient here.

NameNode memory overflow
Then increase the heapsize of the namenode
quickly realized that I was generating A LOT of files
HDFS write operations almost never append to single files. They append "into a directory", and create new files, yes.
From Spark, you can use coalesce and repartition to create larger writer batches.
As you'd mentioned, you wanted parquet, so write that then. That'll cause you to have even smaller file sizes in HDFS.
or any other appropriate file format (no DB)
HDFS is not really the appropriate tool for this. Clickhouse, Druid, and Pinot are the current real time ingest / ETL tools being used, especially when data is streamed in "non periodically" from Kafka

Related

Output Many CSV files, and Combing into one without performance impact with Transform data using mapping data flows via Azure Data Factory

I followed the example below, and all is going well.
https://learn.microsoft.com/en-gb/azure/data-factory/tutorial-data-flow
Below is about the output files and rows:
If you followed this tutorial correctly, you should have written 83
rows and 2 columns into your sink folder.
Below is the result from my example, which is correct having the same number of rows and columns.
Below is the output. Please note that the total number of files is 77, not 83, not 1.
Question:: Is it correct to have so many csv files (77 items)?
Question:: How to combine all files into one file without slowing down the process?
I can create one file by following the link below, which warns of slowing down the process.
How to remove extra files when sinking CSV files to Azure Data Lake Gen2 with Azure Data Factory data flow?
The number of files generated from the process is dependent upon a number of factors. If you've set the default partitioning in the optimize tab on your sink, that will tell ADF to use Spark's current partitioning mode, which will be based on the number of cores available on the worker nodes. So the number of files will vary based upon how your data is distributed across the workers. You can manually set the number of partitions in the sink's optimize tab. Or, if you wish to name a single output file, you can do that, but it will result in Spark coalescing to a single partition, which is why you see that warning. You may find it takes a little longer to write that file because Spark has to coalesce existing partitions. But that is the nature of a big data distributed processing cluster.

use of df.coalesce(1) in csv vs delta table

When saving to a delta table we avoid 'df.coalesce(1)' but when saving to csv or parquet we(my team) add 'df.coalesce(1)'. Is it a common practise? Why? Is it mandatory?
In most cases when I have seen df.coalesce(1) it was done to generate only one file, for example, import CSV file into Excel, or for Parquet file into the Pandas-based program. But if you're doing .coalesce(1), then the write happens via single task, and it's becoming the performance bottleneck because you need to get data from other executors, and write it.
If you're consuming data from Spark or other distributed system, having multiple files will be beneficial for performance because you can write & read them in parallel. By default, Spark writes N files into the directory where N is the number of partitions. As #pltc noticed, this may generate the big number of files that's often not desirable because you'll get performance overhead from accessing them. So we need to have a balance between the number of files and their size - for Parquet and Delta (that is based on Parquet), having the bigger files bring several performance advantages - you read less files, you can get better compression for data inside the file, etc.
For Delta specifically, having .coalesce(1) having the same problem as for other file formats - you're writing via one task. Relying on default Spark behaviour and writing multiple files is beneficial from performance point of view - each node is writing its data in parallel, but you can get too many small files (so you may use .coalesce(N) to write bigger files). For Databricks Delta, as it was correctly pointed by #Kafels, there are some optimizations that will allow to remove that .coalesce(N) and do automatic tuning achieve the best throughput (so called "Optimized Writes"), and create bigger files ("Auto compaction") - but they should be used carefully.
Overall, the topic of optimal file size for Delta is an interesting topic - if you have big files (1Gb is used by default by OPTIMIZE command), you can get better read throughput, but if you're rewriting them with MERGE/UPDATE/DELETE, then big files are bad from performance standpoint, and it's better to have smaller (16-64-128Mb) files, so you can rewrite less data.
TL;DR: it's not mandatory, it depends on the size of your dataframe.
Long answer:
If your dataframe is 10Mb, and you have 1000 partitions for example, each file would be about 10Kb. And having so many small files would reduce Spark performance dramatically, not to mention when you have too many files, you'll eventually reach OS limitation of the number of files. Any how, when your dataset is small enough, you should merge them into a couple of files by coalesce.
However, if your dataframe is 100G, technically you still can use coalesce(1) and save to a single file, but later on you will have to deal with less parallelism when reading from it.

Spark output JSON vs Parquet file size discrepancy

new Spark user here. i wasn't able to find any information about filesize comparison between JSON and parquet output of the same dataFrame via Spark.
testing with a very small data set for now, doing a df.toJSON().collect() and then writing to disk creates a 15kb file. but doing a df.write.parquet creates 105 files at around 1.1kb each. why is the total file size so much larger with parquet in this case than with JSON?
thanks in advance
what you're doing with df.toJSON.collect is you get a single JSON from all your data (15kb in your case) and you save that to disk - this is not something scalable for situations you'd want to use Spark in any way.
For saving parquet you are using spark built-in function and it seems that for some reason you have 105 partitions (probably the result of the manipulation you did) so you get 105 files. Each of these files has the overhead of the file structure and probably stores 0,1 or 2 records. if you want to save a single file you should coalesce(1) before you save (again this just for the toy example you have) so you'd get 1 file. Note that it still might be larger due to the file format overhead (i.e. the overhead might still be larger than the compression benefit)
Conan, it is very hard to answer your question precisely without knowing the nature of the data (you don't even tell amount of row in your DataFrame). But let me speculate.
First. Text files containing JSON usually take more space on disk then parquet. At least when one store millions-billions rows. The reason for that is parquet is highly optimized column based storage format which uses a binary encoding to store your data
Second. I would guess that you have a very small dataframe with 105 partitions (and probably 105 rows). When you store something that small the disk footprint should not bother you but if it does you need to be aware that each parquet file has a relatively sizeable header describing the data you store.

Quickly write files made by MultipleTextOutputFormat to cloud store (Azure, S3, etc)

I have a Spark job that takes data from a Hive table, transforms it, and eventually gives me an RDD containing keys of filenames and values of that file's content. I then pass that onto a custom OutputFormat that creates individual files based on those keys. The end result is about 20 million files, each file being about 1-10MB in size.
My issue is now efficiently writing those files to my end destination. I can't put them in HDFS because 20 million small files will quickly grind HDFS to a halt. If I attempt to write directly to my cloud store it goes very slowly as it appears that each task will upload each file it gets sequentially. I'm interested in hearing about any technique I can use to speed up this process such that files will be uploaded with as much parallelization as possible.

Updating values in apache parquet file

I have a quite hefty parquet file where I need to change values for one of the column. One way to do this would be to update those values in source text files and recreate parquet file but I'm wondering if there is less expensive and overall easier solution to this.
Lets start with basics:
Parquet is a file format that needs to be saved in a file system.
Key questions:
Does parquet support append operations?
Does the file system (namely, HDFS) allow append on files?
Can the job framework (Spark) implement append operations?
Answers:
parquet.hadoop.ParquetFileWriter only supports CREATE and OVERWRITE; there is no append mode. (Not sure but this could potentially change in other implementations -- parquet design does support append)
HDFS allows append on files using the dfs.support.append property
Spark framework does not support append to existing parquet files, and with no plans to; see this JIRA
It is not a good idea to append to an existing file in distributed systems, especially given we might have two writers at the same time.
More details are here:
http://bytepadding.com/big-data/spark/read-write-parquet-files-using-spark/
http://bytepadding.com/linux/understanding-basics-of-filesystem/
There are workarounds, but you need to create your parquet file in a certain way to make it easier to update.
Best practices:
A. Use row groups to create parquet files. You need to optimize how many rows of data can go into a row group before features like data compression and dictionary encoding stop kicking in.
B. Scan row groups one at a time and figure out which row groups need to be updated. Generate new parquet files with amended data for each modified row group. It is more memory efficient to work with one row group's worth of data at a time instead of everything in the file.
C. Rebuild the original parquet file by appending unmodified row groups and with modified row groups generated by reading in one parquet file per row group.
it's surprisingly fast to reassemble a parquet file using row groups.
In theory it should be easy to append to existing parquet file if you just strip the footer (stats info), append new row groups and add new footer with update stats, but there isn't an API / Library that supports it..
Look at this nice blog which can answer your question and provide a method to perform updates using Spark(Scala):
http://aseigneurin.github.io/2017/03/14/incrementally-loaded-parquet-files.html
Copy & Paste from the blog:
when we need to edit the data, in our data structures (Parquet), that are immutable.
You can add partitions to Parquet files, but you can’t edit the data in place.
But ultimately we can mutate the data, we just need to accept that we won’t be doing it in place. We will need to recreate the Parquet files using a combination of schemas and UDFs to correct the bad data.
If you want to incrementally append the data in Parquet (you did n't ask this question, still it would be useful for other readers)
Refer this well written blog:
http://aseigneurin.github.io/2017/03/14/incrementally-loaded-parquet-files.html
Disclaimer: I have n't written those blogs, I just read it and found it might be useful for others.
You must re-create the file, this is the Hadoop way. Especially if the file is compressed.
Another approach, (very common in Big-data), is to do the update on another Parquet (or ORC) file, then JOIN / UNION at query time.
Well, in 2022, I strongly recommend to use a lake house solution, like deltaLake or Apache Iceberg. They will care about that for you.

Resources