Is gzipped Parquet file splittable in HDFS for Spark? - apache-spark

I get conflicting information when searching and reading answers online on this subject. Can anyone share their experience? I know for a fact that gzipped CSV is not splittable, but maybe the internal structure of Parquet files makes this a totally different case from CSV?

Parquet files with GZIP compression are actually splittable. This is because of the internal layout of Parquet files: they are always splittable, independent of the compression algorithm used.
This is mainly due to the design of Parquet files, which are divided into the following parts:
Each Parquet file consists of several RowGroups; these should be the same size as your HDFS block size.
Each RowGroup consists of one ColumnChunk per column. Each ColumnChunk in a RowGroup has the same number of rows.
ColumnChunks are split into Pages, which are probably between 64 KiB and 16 MiB in size. Compression is done on a per-page basis, so a page is in principle the lowest level of parallelisation a job could work on; in practice, Spark's Parquet reader splits work at RowGroup boundaries.
You can find a more detailed explanation here: https://github.com/apache/parquet-format#file-format
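To make the RowGroup point concrete, here is a minimal pure-Python sketch of how a reader might map RowGroups to input splits. It assumes the "midpoint" rule (a RowGroup belongs to the split that contains its middle byte), which is how I understand the Hadoop Parquet reader to behave; the function name and the simplified offsets are made up for illustration and are not part of any library.

```python
from typing import List, Tuple

MiB = 1024 * 1024


def assign_row_groups(row_groups: List[Tuple[int, int]],
                      file_size: int,
                      split_size: int) -> List[List[int]]:
    """Return, for each input split, the indices of the row groups it reads.

    row_groups holds (start_offset, length) pairs in bytes.
    Assumption: a row group is read by the split containing its midpoint.
    """
    n_splits = max(1, -(-file_size // split_size))  # ceiling division
    splits: List[List[int]] = [[] for _ in range(n_splits)]
    for idx, (start, length) in enumerate(row_groups):
        midpoint = start + length // 2
        splits[min(midpoint // split_size, n_splits - 1)].append(idx)
    return splits


# Four 128 MiB row groups in a 512 MiB file: every split gets work.
aligned = [(i * 128 * MiB, 128 * MiB) for i in range(4)]
print(assign_row_groups(aligned, 512 * MiB, 128 * MiB))  # [[0], [1], [2], [3]]

# One 512 MiB row group: only the split holding the midpoint reads anything.
one_big = [(0, 512 * MiB)]
print(assign_row_groups(one_big, 512 * MiB, 128 * MiB))  # [[], [], [0], []]
```

Note that the codec never appears in this calculation, which is the point of the answer: splitting depends on the RowGroup layout, not on whether the pages are gzip or snappy compressed.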

Related

S3 and Spark: File size and File format best practices

I need to read data (originating from a Redshift table with 5 columns; the total size of the table is on the order of 500 GB to 1 TB) from S3 into Spark via PySpark for a daily batch job.
Are there any best practices around:
Preferred File Formats for how I store my data in S3? (does the format even matter?)
Optimal file size?
Any resources/links that can point me in the right direction would also work.
Thanks!
This blog post has some great info on the subject:
https://mapr.com/blog/tips-and-best-practices-to-take-advantage-of-spark-2-x/
Look at the section titled: Use the Best Data Store for Your Use Case
From personal experience, I prefer using parquet in most scenarios, because I’m usually writing the data out once, and then reading it many times (for analytics).
In terms of number of files, I like to have between 200 and 1,000. This allows clusters of all sizes to read and write in parallel, and makes my reads efficient because with Parquet I can zoom in on just the file I'm interested in. If you have too many files, there is a ton of overhead in Spark keeping track of all the file names and locations, and if you have too few files, it can't parallelize your reads and writes effectively.
File size I have found to be less important than number of files, when using parquet.
EDIT:
Here’s a good section from that blog post that describes why I like to use parquet:
Apache Parquet gives the fastest read performance with Spark. Parquet arranges data in columns, putting related values in close proximity to each other to optimize query performance, minimize I/O, and facilitate compression. Parquet detects and encodes the same or similar data, using a technique that conserves resources. Parquet also stores column metadata and statistics, which can be pushed down to filter columns (discussed below). Spark 2.x has a vectorized Parquet reader that does decompression and decoding in column batches, providing ~ 10x faster read performance.
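As a rough sketch of the file-count rule of thumb from this answer (200 to 1,000 files), the helper below picks a file count from the total data size and clamps it to that range. The 1 GiB target file size and both function names are my own assumptions for illustration; `df` is assumed to be a PySpark DataFrame.

```python
import math

GiB = 1024 ** 3


def choose_file_count(total_bytes, min_files=200, max_files=1000,
                      target_file_bytes=GiB):
    """Pick a number of output files, clamped to [min_files, max_files]."""
    wanted = math.ceil(total_bytes / target_file_bytes)
    return max(min_files, min(max_files, wanted))


def write_parquet(df, path, total_bytes):
    """Write df (assumed to be a pyspark.sql.DataFrame) as Parquet.

    repartition(n) shuffles the data into n partitions, and Spark writes
    roughly one file per partition.
    """
    n = choose_file_count(total_bytes)
    df.repartition(n).write.mode("overwrite").parquet(path)


print(choose_file_count(750 * GiB))   # 750
print(choose_file_count(10 * GiB))    # 200 (clamped up)
print(choose_file_count(5000 * GiB))  # 1000 (clamped down)
```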

Spark 2.x - gzip vs snappy compression for parquet files

I am (for the first time) trying to repartition the data my team is working with to improve our query performance. Our data is currently stored in partitioned .parquet files compressed with gzip. I have been reading that using snappy instead would significantly increase throughput (we query this data daily for our analysis). I still wanted to benchmark the two codecs to see the performance gap with my own eyes. I wrote a simple (Py)Spark 2.1.1 application to carry out some tests. I persisted 50 million records in memory (deserialized) in a single partition, wrote them into a single Parquet file (to HDFS) using the different codecs, and then read the files back to assess the difference. My problem is that I can't see any significant difference for either read or write.
Here is how I wrote my records to HDFS (same thing for the gzip file, just replace 'snappy' with 'gzip') :
persisted_records.write\
    .option('compression', 'snappy')\
    .mode("overwrite")\
    .partitionBy(*partition_cols)\
    .parquet('path_to_dir/test_file_snappy')
And here is how I read my single .parquet file (same thing for the gzip file, just replace 'snappy' with 'gzip') :
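# Note: 'compression' is a write-side option; on read, the codec is taken from the Parquet file metadata.
df_read_snappy = spark.read\
    .option('basePath', 'path_to_dir/test_file_snappy')\
    .option('compression', 'snappy')\
    .parquet('path_to_dir/test_file_snappy')\
    .cache()
# count() is the action that triggers the read (and fills the cache)
df_read_snappy.count()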
I looked at the durations in the Spark UI. For information, the persisted (deserialized) 50 million rows amount to 317.4M. Once written into a single Parquet file, the file weighs 60.5M with gzip and 105.1M with snappy (this is expected, as gzip is supposed to have a better compression ratio). Spark spends 1.7 min (gzip) and 1.5 min (snappy) writing the file (single partition, so a single core has to carry out all the work). Reading times amount to 2.7 min (gzip) and 2.9 min (snappy) on a single core (since we have a single file / HDFS block). This is what I do not understand: where is snappy's higher performance?
Have I done something wrong? Is my "benchmarking protocol" flawed? Is the performance gain there, but I am not looking at the right metrics?
I must add that I am using the default Spark configuration. I did not change anything aside from specifying the number of executors, etc.
Many thanks for your help!
Notice: Spark parquet jar version is 1.8.1

Partitioning strategy in Parquet and Spark

I have a job that reads CSV files, converts them into data frames, and writes them out as Parquet. I am using append mode while writing the data to Parquet. With this approach, each write generates a separate Parquet file. My questions are:
1) If a new file gets appended every time I write data to Parquet, will it impact read performance (as the data is now distributed across partitioned Parquet files of varying length)?
2) Is there a way to generate the Parquet partitions purely based on the size of the data?
3) Do we need to think of a custom partitioning strategy to implement point 2?
I am using Spark 2.3.
1) It will affect read performance if spark.sql.parquet.mergeSchema=true. In this case, Spark needs to visit each file and grab the schema from it. In other cases, I believe it does not affect read performance much.
2) There is no way to generate files purely based on data size. You may use repartition or coalesce. The latter will create uneven output files, but is much more performant.
Also, you have the config spark.sql.files.maxRecordsPerFile or the write option maxRecordsPerFile to prevent overly large files, but usually this is not an issue.
3) Yes, I think Spark has no built-in API to distribute evenly by data size. Column Statistics and Size Estimator may help with this.
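Since there is no size-based option, one workaround is to turn a target file size into a record cap. The sketch below is only an approximation under stated assumptions: the average row size is estimated from a sample you measure yourself (for example, an existing Parquet file and its row count), and both helper names are hypothetical. The maxRecordsPerFile write option itself is the one mentioned in the answer; `df` is assumed to be a PySpark DataFrame.

```python
def max_records_for_target(target_file_bytes, sample_bytes, sample_rows):
    """Estimate how many records fit in a file of roughly target_file_bytes.

    sample_bytes / sample_rows is the average on-disk size of one row,
    measured on a sample that was written with the same codec.
    """
    avg_row_bytes = sample_bytes / sample_rows
    return max(1, int(target_file_bytes // avg_row_bytes))


def append_with_cap(df, path, max_records):
    """Append df (assumed pyspark.sql.DataFrame), capping records per file."""
    (df.write
       .mode("append")
       .option("maxRecordsPerFile", max_records)
       .parquet(path))


# Toy numbers: a 100-byte sample holding 10 rows gives 10 bytes per row,
# so a 1000-byte target file holds about 100 records.
print(max_records_for_target(1000, 100, 10))  # 100
```

This only caps the upper end: files can still come out smaller than the target when a partition holds fewer records than the cap.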

spark mechanism of accessing files larger than (or lesser) than HDFS block size

This is mostly a theoretical query, but it is directly linked to how I should create my files in HDFS, so please bear with me for a bit.
I recently got stuck creating DataFrames for a set of data stored in Parquet (snappy) files sitting on HDFS. Each Parquet file is approximately 250+ MB in size, and there are around 6k files in total. I see this as the reason around 10K tasks are created while building the DF, which obviously runs longer than expected.
I went through some posts that explain why the optimal Parquet file size should be at least 1 GB (https://forums.databricks.com/questions/101/what-is-an-optimal-size-for-file-partitions-using.html),
(Is it better to have one large parquet file or lots of smaller parquet files?).
I want to understand how Spark's processing is affected by the size of the files it is reading. More specifically, do the HDFS block size and whether a file is larger or smaller than the HDFS block size affect how Spark partitions get created? If yes, then how? I need to understand the granular details. If anyone has specific, precise links on this, it would be a great help.
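For what it's worth, here is a simplified pure-Python model of how I understand Spark 2.x to plan read partitions for file-based sources. It is a sketch, not Spark's actual code: it assumes the defaults spark.sql.files.maxPartitionBytes = 128 MB and spark.sql.files.openCostInBytes = 4 MB, and the function names are made up for illustration. Under this model the HDFS block size does not appear at all; the split size comes from the Spark settings and the total input size.

```python
MiB = 1024 * 1024


def max_split_bytes(file_sizes, default_parallelism,
                    max_partition_bytes=128 * MiB, open_cost=4 * MiB):
    """Split size: capped by maxPartitionBytes, floored by openCostInBytes."""
    total = sum(size + open_cost for size in file_sizes)
    bytes_per_core = total // default_parallelism
    return min(max_partition_bytes, max(open_cost, bytes_per_core))


def plan_partitions(file_sizes, default_parallelism,
                    max_partition_bytes=128 * MiB, open_cost=4 * MiB):
    """Return a list of partitions, each a list of chunk sizes in bytes."""
    split = max_split_bytes(file_sizes, default_parallelism,
                            max_partition_bytes, open_cost)
    # 1. Cut every file into chunks of at most `split` bytes.
    chunks = []
    for size in file_sizes:
        offset = 0
        while offset < size:
            chunks.append(min(split, size - offset))
            offset += split
    # 2. Pack chunks, largest first, into partitions of about `split` bytes.
    chunks.sort(reverse=True)
    partitions, current, current_size = [], [], 0
    for chunk in chunks:
        if current and current_size + chunk > split:
            partitions.append(current)
            current, current_size = [], 0
        current.append(chunk)
        current_size += chunk + open_cost  # each chunk also pays the open cost
    if current:
        partitions.append(current)
    return partitions


# 6,000 files of 250 MiB each, default parallelism of 200 (an assumed value):
# every file is cut into a 128 MiB chunk and a 122 MiB chunk.
print(len(plan_partitions([250 * MiB] * 6000, 200)))  # 12000

# Ten 1 MiB files with parallelism 2: small files get packed together.
print(len(plan_partitions([1 * MiB] * 10, 2)))  # 2
```

With 250 MB files the model yields about two tasks per file, which is in the same ballpark as the roughly 10K tasks described above.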

Why so many Parquet files created? Can we not limit Parquet output files?

Why are so many Parquet files created in Spark SQL? Can we not limit the number of Parquet output files?
In general, when you write to Parquet, Spark will write one file (or more, depending on various options) per partition. If you want to reduce the number of files, you can call coalesce on the DataFrame before writing, e.g.:
df.coalesce(20).write.parquet(filepath)
Of course if you have various options (e.g. partitionBy) the number of files can increase dramatically.
Also note that if you coalesce to a very small number of partitions this can become very slow (both because of copying data between the partitions and because of the reduced parallelism if you go to a number small enough). You might also get OOM errors if the data in a single partition is too large (when you coalesce the partitions naturally get bigger).
A couple of things to note:
saveAsParquetFile is deprecated since version 1.4.0. Use write.parquet(path) instead.
Depending on your use case, searching for a specific string on parquet files might not be the most efficient way to go.
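To illustrate why partitionBy can multiply the file count, here is a small sketch. The upper bound assumes every write task holds rows for every partition directory, which is the worst case; both function names are hypothetical, and `df` is assumed to be a PySpark DataFrame. Repartitioning by the partition columns before the write is one common way to end up with a single file per directory, at the cost of a shuffle.

```python
def max_output_files(num_write_tasks, num_partition_dirs=1):
    """Worst-case file count: each task writes one file per directory."""
    return num_write_tasks * num_partition_dirs


def write_one_file_per_dir(df, path, partition_cols):
    """Write df (assumed pyspark.sql.DataFrame) with one file per directory.

    Repartitioning by the same columns puts all rows for a given
    partition value into a single task before the write.
    """
    (df.repartition(*partition_cols)
       .write
       .partitionBy(*partition_cols)
       .parquet(path))


print(max_output_files(200, 365))  # 73000
print(max_output_files(20))        # 20, e.g. after coalesce(20) without partitionBy
```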
