Snappy Compression - apache-spark

I am trying to store an Avro file as a Parquet file with Snappy compression. The data gets written as Parquet with a filename ending in .snappy.parquet, but the file size remains the same. Pasting the code below.
CODE:
sqlContext.setConf("spark.sql.parquet.compression.codec","snappy")
orders_avro.write.parquet("/user/cloudera/problem5/parquet-snappy-compress")

Snappy compression is the default in parquet-mr (the library that Spark uses to write Parquet files). So the only thing that changes here is the filename.
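If you want to force the codec for a single write rather than relying on the session-wide setting, the DataFrameWriter also takes a per-write compression option, which overrides spark.sql.parquet.compression.codec. A minimal sketch, assuming orders_avro is the DataFrame loaded from your Avro source:

// Request Snappy explicitly for this write only
orders_avro.write
  .option("compression", "snappy")
  .parquet("/user/cloudera/problem5/parquet-snappy-compress")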

Related

Read compressed JSON in Spark

I have data stored in S3 as UTF-8 encoded JSON files, compressed with either Snappy or LZ4.
I'd like to use Spark to read/process this data, but Spark seems to require the filename suffix (.lz4, .snappy) to understand the compression scheme.
The issue is that I have no control over how the files are named - they will not be written with this suffix, and it is too expensive to rename all such files to include it.
Is there any way for Spark to read these JSON files properly?
For Parquet-encoded files there is the 'parquet.compression' = 'snappy' table property in the Hive Metastore, which seems to solve this problem for Parquet files. Is there something similar for text files?
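For reference, this is the behaviour being described: when the object keys do carry a codec suffix, Hadoop's codec factory picks the decompressor from the extension and a plain read is enough. A minimal sketch, with a placeholder bucket/path and assuming the files were written with Hadoop-compatible codecs:

// Works when files end in .snappy / .lz4 because the codec is chosen from the extension;
// files without a recognised suffix are treated as uncompressed text.
val events = spark.read.json("s3a://my-bucket/events/")   // placeholder bucket and prefix
events.printSchema()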

Parquet file size doubles after deduplication in Spark

We have a deduplication process that reads Parquet files, drops duplicate records, and writes the distinct DataFrame back out as Parquet via Spark SQL.
But the output file is roughly double its original size. We write the Parquet output with gzip compression, which is also the compression codec of the original files.
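For context, a minimal sketch of the pass being described, with placeholder paths; note that dropDuplicates shuffles the rows, so the data lands in the output files in a different order than in the input:

val input = spark.read.parquet("/data/orders")              // placeholder input path
val distinct = input.dropDuplicates()                       // triggers a shuffle
distinct.write
  .option("compression", "gzip")                            // same codec as the input files
  .parquet("/data/orders-deduped")                          // placeholder output path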

hadoop: In which format data is stored in HDFS

I am loading data into HDFS using Spark. How is the data stored in HDFS? Is it encrypted? Is it possible to read the raw HDFS data? What about security for existing data?
I want to understand the details of how the system behaves.
HDFS is a distributed file system that can store data in various formats: plain text (CSV, TSV) as well as Parquet, ORC, JSON, and so on.
When saving data to HDFS from Spark you need to specify the format.
You can't read Parquet files without Parquet tools, but Spark can read them.
The security of HDFS is governed by Kerberos authentication, which you need to set up explicitly.
The default format Spark uses to read and write data is Parquet.
HDFS can store data in many formats and Spark can read them (CSV, JSON, Parquet, etc.). When writing back, specify the format you wish to save the file in, as in the sketch after the commands below.
Reading up on the commands below will help with this:
hadoop fs -ls /user/hive/warehouse
hadoop fs -get (copies files from HDFS to your local file system)
hadoop fs -put (copies files from your local file system to HDFS)
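As a concrete illustration of "specify the format while writing", here is a minimal sketch with placeholder HDFS paths:

val df = spark.read.option("header", "true").csv("hdfs:///user/demo/input")   // plain-text CSV in
df.write.parquet("hdfs:///user/demo/out-parquet")            // binary columnar Parquet out
df.write.json("hdfs:///user/demo/out-json")                  // JSON lines out
df.write.format("orc").save("hdfs:///user/demo/out-orc")     // ORC out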

Is predicate pushdown available for compressed Parquet files?

In Spark 2.2, is predicate pushdown available for compressed Parquet files (e.g. GZIP, Snappy)?
Yes, predicate pushdown works on all Parquet files. The important point is that compression in Parquet applies to the data pages only; the metadata parts of the file (the footer with row-group and column-chunk statistics) are always stored uncompressed. Any engine working on top of Parquet files can therefore read the statistics of each chunk and load only the relevant parts of the file.
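One quick way to see this in Spark 2.2 is the physical plan: predicates that can be pushed down show up under PushedFilters in the FileScan parquet node, regardless of the codec used. A minimal sketch with a placeholder path and column name:

val df = spark.read.parquet("/data/events")          // e.g. snappy- or gzip-compressed Parquet
// The plan shows something like PushedFilters: [IsNotNull(id), GreaterThan(id,100)]
df.filter("id > 100").explain()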

Parquet file compression

What would be the most efficient compression codec for Parquet files when used with Spark? Also, what would be the approximate size of a 1 GB Parquet file after compression with each compression type?
Refer here for the size difference between compressed and uncompressed data:
ORC: If you create an ORC table in Hive you can't insert into it from Impala, so you have to INSERT in Hive followed by REFRESH table_name in Impala.
Avro: To my knowledge, it is the same as ORC.
Parquet: You can create the table in Hive and insert into it from Impala.
It depends on what kind of data you have; text usually compresses very well, random timestamps or float values not so well.
Have a look at this presentation from the latest Apache Big Data conference, especially slides 15-16, which show the compression results per column on a test dataset. (The rest of the presentation is about the theory and practice of compression applied to the Parquet internal structure.)
In my case compression seemed to have increased the file size, essentially making the file larger and unreadable. Parquet, if not fully understood or used on small files, can really hurt, so I would advise you to switch to the Avro file format if you can.
You can try the steps below to compress a Parquet file in Spark:
Step 1: Set the compression type by configuring the spark.sql.parquet.compression.codec property:
sqlContext.setConf("spark.sql.parquet.compression.codec","codec")
Step 2: Specify the codec value. The supported codec values are: uncompressed, gzip, lzo, and snappy. The default is gzip.
Then create a DataFrame, say Df, from your data and save it using the command below:
Df.write.parquet("path_destination")
If you check the destination folder now, you will be able to see that the files have been stored with the compression type you specified in Step 2 above.
Please refer to the below link for more details:
https://www.cloudera.com/documentation/enterprise/5-8-x/topics/spark_parquet.html
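If you want actual numbers for your own 1 GB file, the simplest approach is to write the same DataFrame once per codec and compare the on-disk sizes. A minimal sketch with a placeholder input path, using the codecs listed in Step 2 (lzo is left out because it needs native libraries installed):

val df = spark.read.parquet("/data/sample-1gb")      // placeholder ~1 GB input

for (codec <- Seq("uncompressed", "snappy", "gzip")) {
  df.write.option("compression", codec).parquet(s"/tmp/codec-test/$codec")
}
// Compare the resulting directory sizes, e.g.:
//   hadoop fs -du -s -h /tmp/codec-test/*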
