Is predicate pushdown available for compressed Parquet files? - apache-spark

In Spark 2.2, is predicate pushdown available for compressed Parquet files (e.g. GZIP, Snappy)?

Yes, predicate pushdown works on all Parquet files, compressed or not. The important point is that compression in the context of Parquet applies only to the data; the metadata parts of the file are never compressed and are always stored in plain form. This allows any processor working on top of Parquet files to read the statistics of each chunk in a file and then load only the relevant parts of it.
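As a minimal sketch (the path and column name below are assumptions, not taken from the question), you can verify the pushdown via the physical plan: the predicate should appear under PushedFilters regardless of whether the file is Snappy- or GZIP-compressed.

import spark.implicits._  // for the $"col" syntax

// Hypothetical input; the file may be written with any Parquet codec.
val df = spark.read.parquet("/data/events.parquet")
val filtered = df.filter($"year" === 2017)

// The physical plan lists pushed predicates under "PushedFilters: [...]".
filtered.explain()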

Related

Parquet file size doubles after deduplication in Spark

We have a deduplication process that reads Parquet files, drops duplicate records, and writes the distinct DataFrame back as Parquet output files in Spark SQL.
However, the output files are double the size of the original files. We are writing the Parquet output with gzip compression, which is also the compression codec of the original files.
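A minimal sketch of the flow being described, with placeholder paths (the actual job is not shown in the question):

// Read the source Parquet files and drop duplicate records.
val input = spark.read.parquet("/data/orders")
val distinctDf = input.dropDuplicates()

// Write the distinct result back, keeping the original codec (gzip).
distinctDf.write.option("compression", "gzip").parquet("/data/orders_dedup")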

Spark: Avro vs Parquet performance

Now that Spark 2.4 has built-in support for Avro format, I'm considering changing the format of some of the data sets in my data lake - those that are usually queried/joined for entire rows rather than specific column aggregations - from Parquet to Avro.
However, most of the work on top of the data is done via Spark, and to my understanding, Spark's in-memory caching and computations are done on columnar-formatted data. Does Parquet offer a performance boost in this regard, while Avro would incur some sort of data "transformation" penalty? What other considerations should I be aware of in this regard?
Both formats shine under different constraints but have things like strong types with schemas and a binary encoding in common. In its basic form, the difference boils down to this:
Avro is a row-wise format. From this it follows that you can append row-by-row to an existing file. These row-wise appends are then also immediately visible to all readers that work on these files. Avro is best when you have a process that writes into your data lake in a streaming (non-batch) fashion.
Parquet is a columnar format and its files are not appendable. This means that for newly arriving records, you must always create new files. In exchange for this behaviour Parquet brings several benefits. Data is stored in a columnar fashion, and compression and encoding (simple type-aware, low-CPU but highly effective compression) are applied to each column. Thus Parquet files will be much smaller than Avro files. Parquet also writes out basic statistics, so that when you load data you can push down parts of your selection to the I/O layer; only the necessary set of rows is then loaded from disk. As Parquet is already columnar and most in-memory structures are also columnar, loading data from it is in general much faster.
As you already have your data and the ingestion process tuned to write Parquet files, it's probably best for you to stay with Parquet as long as data ingestion (latency) does not become a problem for you.
A typical usage is actually to have a mix of Parquet and Avro. Recent, freshly arrived data is stored as Avro files, as this makes the data immediately available to the data lake. More historic data is transformed on e.g. a daily basis into Parquet files, as they are smaller and more efficient to load but can only be written in batches. While working with this data, you would load both into Spark as a union of two tables, as sketched below. Thus you get the benefit of efficient reads with Parquet combined with the immediate availability of data with Avro. This pattern is often hidden by table formats like Uber's Hudi or Apache Iceberg (incubating), which was started by Netflix.
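A hypothetical sketch of that mixed layout (paths are placeholders; reading Avro natively requires Spark 2.4+ or the external spark-avro package):

// Historic data compacted into Parquet, fresh data landing as Avro.
val historic = spark.read.parquet("/lake/events/historic")
val fresh = spark.read.format("avro").load("/lake/events/fresh")

// Query both as one logical table; unionByName matches columns by name.
val events = historic.unionByName(fresh)
events.createOrReplaceTempView("events")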

Snappy Compression

I am trying to store an Avro file as a Parquet file with Snappy compression. The data does get written as Parquet with a *.snappy.parquet filename, but the file size remains the same. Pasting the code.
CODE:
sqlContext.setConf("spark.sql.parquet.compression.codec","snappy")
orders_avro.write.parquet("/user/cloudera/problem5/parquet-snappy-compress")
Snappy compression is the default in parquet-mr (the library that Spark uses to write Parquet files). So the only thing that changes here is the filename.
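To actually see a size difference, you could write the same DataFrame with a non-default codec such as gzip; the second form below is a per-write alternative in Spark 2.x (the extra output path is made up for illustration):

// Session-wide setting, as in the question, but with gzip instead of snappy.
sqlContext.setConf("spark.sql.parquet.compression.codec", "gzip")
orders_avro.write.parquet("/user/cloudera/problem5/parquet-gzip-compress")

// Equivalent per-write option, without changing the session default.
orders_avro.write.option("compression", "gzip").parquet("/user/cloudera/problem5/parquet-gzip-compress-2")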

Is gzipped Parquet file splittable in HDFS for Spark?

I get confusing messages when searching and reading answers on the internet on this subject. Can anyone share their experience? I know for a fact that gzipped CSV is not splittable, but maybe the internal structure of Parquet files makes it a totally different case for Parquet vs CSV?
Parquet files with GZIP compression are actually splittable. This is because of the internal layout of Parquet files, which are always splittable, independent of the compression algorithm used.
This is mainly due to the design of Parquet files, which are divided into the following parts:
Each Parquet file consists of several RowGroups; these should be the same size as your HDFS block size.
Each RowGroup consists of a ColumnChunk per column. Each ColumnChunk in a RowGroup has the same number of Rows.
ColumnChunks are split into Pages, which are typically between 64 KiB and 16 MiB in size. Compression is done on a per-page basis, so a page is the lowest level of parallelisation a job can work on.
You can find a more detailed explanation here: https://github.com/apache/parquet-format#file-format
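As a rough check (the path is an assumption), Spark will split even a single GZIP-compressed Parquet file into several input partitions, provided the file is large enough to contain multiple RowGroups:

// A single, sufficiently large gzip-compressed Parquet file.
val df = spark.read.parquet("/data/events-gzip.parquet")

// Splits follow the internal Parquet structure, not the compression stream,
// so this can be greater than 1 even for one gzipped file.
println(df.rdd.getNumPartitions)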

Parquet file compression

What would be the most optimized compression logic for Parquet files when using them in Spark? Also, what would be the approximate size of a 1 GB Parquet file after compression with each compression type?
Refer here for the size difference between all the compressed and uncompressed formats.
ORC: If you create an ORC table in Hive, you can't insert into it from Impala, so you have to INSERT in Hive followed by REFRESH table_name in Impala.
Avro: To my knowledge, it is the same as ORC.
Parquet: You can create a table in Hive and insert into it from Impala.
It depends on what kind of data you have; text usually compresses very well, random timestamp or float values not so well.
Have a look at this presentation from the latest Apache Big Data conference, especially slides 15-16, which show the compression results per column on a test dataset. (The rest of the presentation is about the theory and practice of compression applied to the Parquet internal structure.)
In my case compression seemed to have increased the file size, so it essentially made the file larger and unreadable. Parquet, if not fully understood and used on small files, can really suck. So I would advise you to switch to the Avro file format if you can.
You can try the steps below to compress a Parquet file in Spark:
Step 1: Set the compression type by configuring the spark.sql.parquet.compression.codec property:
sqlContext.setConf("spark.sql.parquet.compression.codec","codec")
Step 2: Specify the codec value. The supported codec values are: uncompressed, gzip, lzo, and snappy. The default depends on your Spark version: gzip in older releases, snappy from Spark 2.0 onward.
Then create a DataFrame, say Df, from your data and save it using the command below:
Df.write.parquet("path_destination")
If you check the destination folder now, you will be able to see that the files have been stored with the compression type you specified in Step 2 above.
Please refer to the below link for more details:
https://www.cloudera.com/documentation/enterprise/5-8-x/topics/spark_parquet.html
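Putting the two steps together, a minimal end-to-end sketch (the DataFrame name and paths are placeholders, not from the linked documentation):

// Step 1 + 2: choose a codec for all Parquet writes in this session.
sqlContext.setConf("spark.sql.parquet.compression.codec", "gzip")

// Create a DataFrame from any source and write it out; with gzip the part
// files end up named like part-*.gz.parquet.
val Df = sqlContext.read.json("path_source")
Df.write.parquet("path_destination")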
