I would like to check how each column of parquet data contributes to total file size / total table size.
I looked through Spark/Databricks commands, parquet-cli, and parquet-tools, and unfortunately none of them seem to provide this information directly. Considering that this is a columnar format, it should be possible to extract it somehow.
So far the closest I have come is running parquet-tools meta, summing up the details by column for each row group within a file, then aggregating this across all files of a table. This means iterating over all parquet files and doing cumbersome parsing of the output.
Maybe there is an easier way?
Your approach is correct. Here is a Python script using DuckDB that reports the overall compressed and uncompressed size of every column in a parquet dataset.
import duckdb

con = duckdb.connect(database=':memory:')
print(con.execute("""
    SELECT
        path_in_schema AS column_name,
        SUM(total_compressed_size) AS total_compressed_size_in_bytes,
        SUM(total_uncompressed_size) AS total_uncompressed_size_in_bytes
    FROM parquet_metadata('D:\\dev\\tmp\\parq_dataset\\*')
    GROUP BY path_in_schema
""").df())
Here, D:\\dev\\tmp\\parq_dataset is a directory consisting of multiple parquet files with the same schema. Something similar should be possible using other libraries such as pyarrow or fastparquet as well.
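For example, a rough equivalent with pyarrow (a sketch only; the directory path and the *.parquet glob are assumptions) walks every file's footer and sums the per-column-chunk sizes across all row groups:

from collections import defaultdict
from pathlib import Path
import pyarrow.parquet as pq

compressed = defaultdict(int)
uncompressed = defaultdict(int)

# Assumes all data files end in .parquet and share the same schema
for path in Path("D:/dev/tmp/parq_dataset").glob("*.parquet"):
    meta = pq.ParquetFile(str(path)).metadata
    for rg in range(meta.num_row_groups):
        for col in range(meta.num_columns):
            chunk = meta.row_group(rg).column(col)   # per-column-chunk statistics from the footer
            compressed[chunk.path_in_schema] += chunk.total_compressed_size
            uncompressed[chunk.path_in_schema] += chunk.total_uncompressed_size

for name in sorted(compressed):
    print(name, compressed[name], uncompressed[name])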
I have to read in N parquet files, sort all the data by a particular column, and then write out the sorted data in N parquet files. While I'm processing this data, I also have to produce an index that will later be used to optimize the access to the data in these files. The index will also be written as a parquet file.
For the sake of example, let's say that the data represents grocery store transactions and we want to create an index by product to transaction so that we can quickly know which transactions have cottage cheese, for example, without having to scan all N parquet files.
I'm pretty sure I know how to do the first part, but I'm struggling with how to extract and tally the data for the index while reading in the N parquet files.
For the moment, I'm using PySpark locally on my box, but this solution will eventually run on AWS, probably in AWS Glue.
Any suggestions on how to create the index would be greatly appreciated.
This is already built into Spark SQL. In SQL use DISTRIBUTE BY, or in PySpark use partitionBy before writing, and it will group the data as you wish on your behalf. Even if you don't use a partitioning strategy, Parquet has predicate pushdown that does lower-level filtering. (Actually, if you are using AWS, you likely don't want to use partitioning and should stick with large files that use predicate pushdown, specifically because S3 scanning of directories is slow and should be avoided.)
Basically, great idea, but this is already in place.
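As a rough sketch of that suggestion (the column names, bucket, and paths are made up for illustration), partitioning the output by product means a query for a single product only has to touch that product's directory:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

transactions = spark.read.parquet("s3://my-bucket/transactions/")   # hypothetical input

(transactions
    .repartition("product")          # DataFrame equivalent of SQL's DISTRIBUTE BY
    .write
    .partitionBy("product")          # one directory per product value
    .mode("overwrite")
    .parquet("s3://my-bucket/transactions_by_product/"))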
On Apache's official website, this is the explanation given for this parameter:
When true, the Parquet data source merges schemas collected from all data files, otherwise the schema is picked from the summary file or a random data file if no summary file is available.
In fact, my question is, what is the summary file?
Apache Parquet uses metadata to store all the information required to load the data from a file, such as column metadata, dictionaries, row groups, and so on.
The format is designed to keep this metadata either embedded in the file itself or stored in a separate file. That separate file is the summary file; in practice these are the _metadata and _common_metadata files that parquet-mr and older Spark versions could write alongside the data files.
A Parquet summary file contains a collection of footers from the actual Parquet data files in a directory. It can be used to skip row groups when reading without fetching the footer from each individual Parquet file, which may be expensive if you have a lot of files and/or are on a blob store.
https://github.com/apache/parquet-mr/blob/65b95fb72be8f5a8a193a6f7bc4560fdcd742fc7/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L407
Parquet is a column-oriented file format, which means the data for a particular column across all rows is stored adjacently. This yields two main benefits: better compression ratios and better query performance.
I ran into an issue in which apparently performing a full outer join with an empty table in Spark SQL results in a much larger file size than simply selecting columns from the other dataset without doing a join.
Basically, I had two datasets, one of which was very large, and the other was empty. I went through and selected nearly all columns from the large dataset, and full-outer-joined it to the empty dataset. Then, I wrote out the resulting dataset to snappy-compressed parquet (I also tried with snappy-compressed orc). Alternatively, I simply selected the same columns from the large dataset and then saved the resulting dataset as snappy-compressed parquet (or orc) as above. The file sizes were drastically different, in that the file from the empty-dataset-join was nearly five times bigger than the simple select file.
I've tried this on a number of different data sets, and get the same results. In looking at the data:
The number of output rows is the same (verified with spark-shell by reading in the output datasets and doing a count)
The schemas are the same (verified with spark-shell, parquet-tools, and orc-tools)
Spot-checking the data looks the same, and I don't see any crazy data in either type of output
I explicitly saved all files with the same (snappy) compression, and the output files are given '.snappy.parquet' extensions by Spark
I understand that doing a join with an empty table is effectively pointless (I was doing so as part of some generic code that always performed a full outer join and sometimes encountered empty datasets). And, I've updated my code so it no longer does this, so the problem is fixed.
Still, I would like to understand why / how this could be happening. So my question is -- why would doing a Spark SQL join with an empty dataset result in a larger file size? And/or any ideas about how to figure out what is making the resulting parquet files so large would also be helpful.
After coming across several other situations where seemingly small differences in the way data was processed resulted in big difference in file size (for seemingly the exact same data), I finally figured this out.
The key here is understanding how parquet (or orc) encodes data using potentially sophisticated formats such as dictionary encoding, run-length encoding, etc. These formats take advantage of data redundancy to make file size smaller, i.e., if your data contains lots of similar values, the file size will be smaller than for many distinct values.
In the case of joining with an empty dataset, an important point is that when spark does a join, it partitions on the join column. So doing a join, even with an empty dataset, may change the way data is partitioned.
In my case, joining with an empty dataset changed the partitioning from one where many similar records were grouped together in each partition, to one where many dissimilar records were grouped in each partition. Then, when these partitions were written out, many similar or dissimilar records were put together in each file. When partitions had similar records, parquet's encoding was very efficient and file size was small; when partitions were diverse, parquet's encoding couldn't be as efficient, and file size was larger -- even though the data overall was exactly the same.
As mentioned above, we ran into several other instances where the same problem showed up -- we'd change one aspect of processing, and would get the same output data, but the file size might be four times larger. One thing that helped in figuring this out was using parquet-tools to look at low-level file statistics.
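One way to restore data locality before writing, so that parquet's dictionary and run-length encoding stay effective, is to repartition and sort by a column with many repeated values. A minimal sketch; the column name "product_id" and the paths are only examples:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

joined = spark.read.parquet("/tmp/joined_output")     # hypothetical dataset after the join

(joined
    .repartition("product_id")               # group similar records into the same partitions again
    .sortWithinPartitions("product_id")      # keep them adjacent within each output file
    .write
    .mode("overwrite")
    .parquet("/tmp/joined_output_compact"))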
I am loading a 5 GB compressed file into memory (on AWS), creating a dataframe in Spark, and trying to split it into smaller dataframes based on the values of 2 columns. Eventually I want to write all these subsets into their respective files.
I just started experimenting in Spark and am still getting used to the data structures. The approach I was trying to follow was something like this:
read the file
sort it by the 2 columns (still not familiar with repartitioning and do not know if it will help)
identify unique list of all values of those 2 columns
iterate through this list
-- create smaller dataframes by filtering on the values in the list
-- write each one out to files
# keep a reference to the sorted dataframe (sort returns a new dataframe)
sorted_df = df.sort("DEVICE_TYPE", "PARTNER_POS")
sorted_df.registerTempTable("temp")
# distinct combinations of the two columns
grp_col = sqlContext.sql("SELECT DEVICE_TYPE, PARTNER_POS FROM temp GROUP BY DEVICE_TYPE, PARTNER_POS")
grp_col.show()
I do not believe this is the cleanest or most efficient way of doing this. I need to write the output to files because there are ETLs that get kicked off in parallel based on it. Any recommendations?
If it's okay that the subsets are nested in a directory hierarchy, then you should consider using Spark's built-in partitioning:
(df.write
    .partitionBy("device_type", "partner_pos")
    .json("/path/to/root/output/dir"))
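Spark lays this out as one subdirectory per value combination (e.g. .../device_type=MOBILE/partner_pos=12/...), and downstream jobs that filter on those columns only read the matching directories thanks to partition pruning, so the parallel ETLs can each pick up just their own subset.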
I have a quite hefty parquet file where I need to change values for one of the columns. One way to do this would be to update those values in the source text files and recreate the parquet file, but I'm wondering whether there is a less expensive and overall easier solution.
Let's start with the basics:
Parquet is a file format that needs to be saved in a file system.
Key questions:
Does parquet support append operations?
Does the file system (namely, HDFS) allow append on files?
Can the job framework (Spark) implement append operations?
Answers:
parquet.hadoop.ParquetFileWriter only supports CREATE and OVERWRITE; there is no append mode. (Not certain, but this could potentially change in other implementations, since the parquet design does support appending.)
HDFS allows append on files using the dfs.support.append property
The Spark framework does not support appending to existing parquet files, and has no plans to; see this JIRA
It is not a good idea to append to an existing file in distributed systems, especially given we might have two writers at the same time.
More details are here:
http://bytepadding.com/big-data/spark/read-write-parquet-files-using-spark/
http://bytepadding.com/linux/understanding-basics-of-filesystem/
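Since appending is off the table, the usual Spark-side answer is simply to read the file, fix the column, and write the result to a new location, then swap it in. A minimal PySpark sketch; the paths, the column name, and the fix-up rule are made up:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("/data/events")                         # hypothetical source file(s)
fixed = df.withColumn("country", F.upper(F.col("country")))     # example fix-up of one column
fixed.write.mode("overwrite").parquet("/data/events_fixed")     # write elsewhere, then swap paths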
There are workarounds, but you need to create your parquet file in a certain way to make it easier to update.
Best practices:
A. Use row groups to create parquet files. You need to optimize how many rows of data can go into a row group before features like data compression and dictionary encoding stop kicking in.
B. Scan row groups one at a time and figure out which row groups need to be updated. Generate new parquet files with amended data for each modified row group. It is more memory efficient to work with one row group's worth of data at a time instead of everything in the file.
C. Rebuild the original parquet file by appending the unmodified row groups together with the modified row groups generated by reading in one parquet file per row group.
It's surprisingly fast to reassemble a parquet file from row groups.
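A sketch of steps B and C with pyarrow (untested; the column name "price" and the update rule are invented, and the price column is assumed to already be a floating-point type):

import pyarrow.compute as pc
import pyarrow.parquet as pq

src = pq.ParquetFile("data.parquet")                             # hypothetical input file
with pq.ParquetWriter("data_updated.parquet", src.schema_arrow) as writer:
    for i in range(src.num_row_groups):
        table = src.read_row_group(i)                            # one row group in memory at a time
        # In a real job you would first check whether this row group needs changes at all
        idx = table.schema.get_field_index("price")
        updated = pc.multiply(table.column("price"), 1.1)        # hypothetical amendment
        table = table.set_column(idx, "price", updated)
        writer.write_table(table)                                # write the (possibly modified) rows back out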
In theory it should be easy to append to an existing parquet file: just strip the footer (stats info), append the new row groups, and add a new footer with updated stats. However, there isn't an API/library that supports it.
Look at this nice blog post, which answers your question and provides a method to perform updates using Spark (Scala):
http://aseigneurin.github.io/2017/03/14/incrementally-loaded-parquet-files.html
Copy & Paste from the blog:
We need to edit the data in our data structures (Parquet files), which are immutable.
You can add partitions to Parquet files, but you can’t edit the data in place.
But ultimately we can mutate the data, we just need to accept that we won’t be doing it in place. We will need to recreate the Parquet files using a combination of schemas and UDFs to correct the bad data.
If you want to incrementally append data in Parquet (you didn't ask this question, but it may still be useful for other readers), refer to this well-written blog post:
http://aseigneurin.github.io/2017/03/14/incrementally-loaded-parquet-files.html
Disclaimer: I haven't written those blog posts; I just read them and found they might be useful for others.
You must re-create the file; this is the Hadoop way, especially if the file is compressed.
Another approach, very common in big data, is to write the update to another Parquet (or ORC) file and then JOIN / UNION at query time.
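A rough illustration of that pattern in PySpark (assuming both datasets share a key column "id" and an "updated_at" timestamp; all names are made up): keep the corrections in a small side table and resolve the latest version of each row at query time.

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

base = spark.read.parquet("/data/base")              # original, immutable parquet files
updates = spark.read.parquet("/data/updates")        # small file holding the corrected rows

latest = Window.partitionBy("id").orderBy(F.col("updated_at").desc())
current = (base.unionByName(updates)
               .withColumn("rn", F.row_number().over(latest))
               .filter("rn = 1")                     # keep only the newest version of each row
               .drop("rn"))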
Well, in 2022, I strongly recommend using a lakehouse solution such as Delta Lake or Apache Iceberg. They will take care of that for you.
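For instance, with Delta Lake an in-place update becomes a single operation. A sketch only: it requires the delta-spark package and a table already stored in Delta format, and the path, column names, and condition are invented.

from delta.tables import DeltaTable
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

table = DeltaTable.forPath(spark, "/data/transactions_delta")    # hypothetical Delta table path
table.update(
    condition=F.col("store_id") == 42,                           # which rows to change
    set={"price": F.col("price") * 1.1},                         # new value for the column
)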