Updating values in apache parquet file - apache-spark

I have a quite hefty parquet file where I need to change values for one of the column. One way to do this would be to update those values in source text files and recreate parquet file but I'm wondering if there is less expensive and overall easier solution to this.

Lets start with basics:
Parquet is a file format that needs to be saved in a file system.
Key questions:
Does parquet support append operations?
Does the file system (namely, HDFS) allow append on files?
Can the job framework (Spark) implement append operations?
Answers:
parquet.hadoop.ParquetFileWriter only supports CREATE and OVERWRITE; there is no append mode. (Not sure but this could potentially change in other implementations -- parquet design does support append)
HDFS allows append on files using the dfs.support.append property
Spark framework does not support append to existing parquet files, and with no plans to; see this JIRA
It is not a good idea to append to an existing file in distributed systems, especially given we might have two writers at the same time.
More details are here:
http://bytepadding.com/big-data/spark/read-write-parquet-files-using-spark/
http://bytepadding.com/linux/understanding-basics-of-filesystem/

There are workarounds, but you need to create your parquet file in a certain way to make it easier to update.
Best practices:
A. Use row groups to create parquet files. You need to optimize how many rows of data can go into a row group before features like data compression and dictionary encoding stop kicking in.
B. Scan row groups one at a time and figure out which row groups need to be updated. Generate new parquet files with amended data for each modified row group. It is more memory efficient to work with one row group's worth of data at a time instead of everything in the file.
C. Rebuild the original parquet file by appending unmodified row groups and with modified row groups generated by reading in one parquet file per row group.
it's surprisingly fast to reassemble a parquet file using row groups.
In theory it should be easy to append to existing parquet file if you just strip the footer (stats info), append new row groups and add new footer with update stats, but there isn't an API / Library that supports it..

Look at this nice blog which can answer your question and provide a method to perform updates using Spark(Scala):
http://aseigneurin.github.io/2017/03/14/incrementally-loaded-parquet-files.html
Copy & Paste from the blog:
when we need to edit the data, in our data structures (Parquet), that are immutable.
You can add partitions to Parquet files, but you can’t edit the data in place.
But ultimately we can mutate the data, we just need to accept that we won’t be doing it in place. We will need to recreate the Parquet files using a combination of schemas and UDFs to correct the bad data.
If you want to incrementally append the data in Parquet (you did n't ask this question, still it would be useful for other readers)
Refer this well written blog:
http://aseigneurin.github.io/2017/03/14/incrementally-loaded-parquet-files.html
Disclaimer: I have n't written those blogs, I just read it and found it might be useful for others.

You must re-create the file, this is the Hadoop way. Especially if the file is compressed.
Another approach, (very common in Big-data), is to do the update on another Parquet (or ORC) file, then JOIN / UNION at query time.
Well, in 2022, I strongly recommend to use a lake house solution, like deltaLake or Apache Iceberg. They will care about that for you.

Related

Continous appending of data on existing tabular data file (CSV, parquet) using PySpark

For a project I need to append frequently but on a non-periodic way about one thousand or more data files (tabular data) on one existing CSV or parquet file with same schema in Hadoop/HDFS (master=yarn). At the end, I need to be able to do some filtering on the result file - to extract subset of data.
One dummy file may look like this (very simple example):
id,uuid,price
1,16c533c3-c191-470c-97d9-e1e01ccc3080,46159
2,6bb0917b-2414-4b24-85ca-ae2c2713c9a0,50222
3,7b1fa3f9-2db2-4d93-a09d-ca6609cfc834,74591
4,e3a3f874-380f-4c89-8b3e-635296a70d76,91026
5,616dd6e8-5d05-4b07-b8f2-7197b579a058,73425
6,23e77a21-702d-4c87-a69c-b7ace0626616,34874
7,339e9a7f-efb1-4183-ac32-d365e89537bb,63317
8,fee09e5f-6e16-4d4f-abd1-ecedb1b6829c,6642
9,2e344444-35ee-47d9-a06a-5a8bc01d9eab,55931
10,d5cba8d6-f0e1-49c8-88e9-2cd62cde9737,51792
Number of rows may vary between 10 and about 100000
On user request, all input files copied in a source folder should be ingested by an ETL pipeline and appended at the end of one single CSV/parquet file or any other appropriate file format (no DB). Data from a single input file may be spread over one, two or more partitions.
Because the input data files may all have different number of rows, I am concerned about getting partitions with different sizes in the resulting CSV/parquet file. Sometimes all the data may be append in one new file. Sometimes the data is so big that several files are appended.
And because input files may be appended a lot of time from different users and different sources, I am also concerned that the result CSV/parquet may contains too much part-files for the namenode to handle them.
I have done some small test appending data on existing CSV / parquet files and noticed that for each appending, a new file was generated - for example:
df.write.mode('append').csv('/user/applepy/pyspark_partition/uuid.csv')
will append the new data as a new file in the file 'uuid.csv' (which is actually a directory generated by pyspark containing all pieces of appended data).
Doing some load tests based on real conditions, I quickly realized that I was generating A LOT of files (several 10-thousands). At some point I got so much files that PySpark was unable to simple count the number of rows (NameNode memory overflow).
So I wonder how to solve this problem. What would be the best practice here? Read the whole file, append the data chunk, same the data in a new file doesn't seems to be very efficient here.
NameNode memory overflow
Then increase the heapsize of the namenode
quickly realized that I was generating A LOT of files
HDFS write operations almost never append to single files. They append "into a directory", and create new files, yes.
From Spark, you can use coalesce and repartition to create larger writer batches.
As you'd mentioned, you wanted parquet, so write that then. That'll cause you to have even smaller file sizes in HDFS.
or any other appropriate file format (no DB)
HDFS is not really the appropriate tool for this. Clickhouse, Druid, and Pinot are the current real time ingest / ETL tools being used, especially when data is streamed in "non periodically" from Kafka

How can I extract information from parquet files with Spark/PySpark?

I have to read in N parquet files, sort all the data by a particular column, and then write out the sorted data in N parquet files. While I'm processing this data, I also have to produce an index that will later be used to optimize the access to the data in these files. The index will also be written as a parquet file.
For the sake of example, let's say that the data represents grocery store transactions and we want to create an index by product to transaction so that we can quickly know which transactions have cottage cheese, for example, without having to scan all N parquet files.
I'm pretty sure I know how to do the first part, but I'm struggling with how to extract and tally the data for the index while reading in the N parquet files.
For the moment, I'm using PySpark locally on my box, but this solution will eventually run on AWS, probably in AWS Glue.
Any suggestions on how to create the index would be greatly appreciated.
This is already built into spark SQL. In SQL use "distribute by" or pyspark: paritionBy before writing and it will group the data as you wish on your behalf. Even if you don't use a partitioning strategy Parquet has predicate pushdown that does lower level filtering. (Actually if you are using AWS, you likely don't want to use partitioning and should stick with large files that use predicate pushdown. Specifically because s3 scanning of directories is slow and should be avoided.)
Basically, great idea, but this is already in place.

How to read parquet files in pyspark from s3 bucket whose path is partially unpredictable?

My paths are of the format s3://my_bucket/timestamp=yyyy-mm-dd HH:MM:SS/.
E.g. s3://my-bucket/timestamp=2021-12-12 12:19:27/, however MM:SS part are not predictable, and I am interested in reading the data for a given hour. I tried the following:
df = spark.read.parquet("s3://my-bucket/timestamp=2021-12-12 12:*:*/")
df = spark.read.parquet("s3://my-bucket/timestamp=2021-12-12 12:[00,01-59]:[00,01-59]/")
but they give the error pyspark.sql.utils.IllegalArgumentException: java.net.URISyntaxException.
The problem is your path contains colons :. Unfortunately, it is still not supported. Here are some related tickets:
https://issues.apache.org/jira/browse/SPARK-20061
https://issues.apache.org/jira/browse/HADOOP-14217
and threads:
Struggling with colon ':' in file names
I think the only way is rename these files...
If you want performance.....
I humbly suggest that when you do re-architect this you don't use S3 file lists/directory lists to accomplish this. I suggest you use a Hive table partitioned by hour. (Or you write a job to help migrate data into hours in larger files not small files.)
S3 is a wonderful engine for long term cheap storage. It's not performant, and it is particularly bad at directory listing due to how they implemented it. (And performance only gets worse if there are multiple small files in the directories).
To get some real performance from your job you should use a hive table (Partitioned so the file lookups are done in DynamoDB, and the partition is at the hour level.) or some other groomed file structure that reduces file count/directories listings required.
You will see a large performance boost if you can restructure your data into bigger files without use of file lists.

Spark tagging file names for purpose of possible later deletion/rollback?

I am using Spark 2.4 in AWS EMR.
I am using Pyspark and SparkSQL for my ELT/ETL and using DataFrames with Parquet input and output on AWS S3.
As of Spark 2.4, as far as I know, there is no way to tag or to customize the file name of output files (parquet). Please correct me?
When I store parquet output files on S3 I end up with file names which look like this:
part-43130-4fb6c57e-d43b-42bd-afe5-3970b3ae941c.c000.snappy.parquet
The middle part of the file name looks like it has embedded GUID/UUID :
part-43130-4fb6c57e-d43b-42bd-afe5-3970b3ae941c.c000.snappy.parquet
I would like to know if I can obtain this GUID/UUID value from the PySpark or SparkSQL function at run-time, to log/save/display this value in a text file?
I need to log this GUID/UUID value because I may need to later remove the files with this value as part of their names, for a manual rollback purposes (for example, I may discover a day or a week later that this data is somehow corrupt and needs to be deleted, so all files tagged with GUID/UUID can be identified and removed).
I know that I can partition the table manually on a GUID column but then I end up with too many partitions, so it hurts performance. What I need is to somehow tag the files, for each data load job, so I can identify and delete them easily from S3, hence GUID/UUID value seems like one possible solution.
Open for any other suggestions.
Thank you
Is this with the new "s3a specific committer"? If so, it means that they're using netflix's code/trick of using a GUID on each file written so as to avoid eventual consistency problems. That doesn't help much though.
consider offering a patch to Spark which lets you add a specific prefix to a file name.
Or for Apache Hadoop & Spark (i.e. not EMR), an option for the S3A committers to put that prefix in when they generate temporary filenames.
Short term: well, you can always list the before-and-after state of the directory tree (tip: use FileSystem.listFiles(path, recursive) for speed), and either remember the new files, or rename them (which will be slow: Remembering the new filenames is better)
Spark already writes files with UUID in names. Instead of creating too many partitions you can setup customer file naming (e.g. add some id). May be this is solution for you - https://stackoverflow.com/a/43377574/1251549
Not tried yet (but planning) - https://github.com/awslabs/amazon-s3-tagging-spark-util
In theory, you can tag with jobid (or whatever) and then run something
Both solutions lead to perform multiple s3 list objects API request check tags/filename and delete file one by one.

Design of Spark + Parquet "database"

I've got 100G text files coming in daily, and I wish to create an efficient "database" accessible from Spark. By "database" I mean the ability to execute fast queries on the data (going back about a year), and incrementally add data each day, preferably without read locks.
Assuming I want to use Spark SQL and parquet, what's the best way to achieve this?
give up on concurrent reads/writes and append new data to the existing parquet file.
create a new parquet file for each day of data, and use the fact that Spark can load multiple parquet files to allow me to load e.g. an entire year. This effectively gives me "concurrency".
something else?
Feel free to suggest other options, but let's assume I'm using parquet for now, as from what I've read this will be helpful to many others.
My Level 0 design of this
Use partitioning by date/time (if your queries are based on date/time to avoid scanning of all data)
Use Append SaveMode where required
Run SparkSQL distributed SQL engine so that
You enable querying of the data from multiple clients/applications/users
cache the data only once across all clients/applications/users
Use just HDFS if you can to store all your Parquet files
I have very similar requirement in my system. I would say if load the whole year's data -for 100g one day that will be 36T data ,if you need to load 36TB daily ,that couldn't be fast anyway. better to save the processed daily data somewhere(such as count ,sum, distinct result) and use that to go back for whole year .

Resources