Parquet file size doubles after deduplication in Spark

We have a deduplication process that reads parquet files, drops duplicate records, and writes the distinct dataframe back out with Spark SQL as parquet output files.
However, the output files are roughly double the size of the originals. We are writing the parquet output with gzip compression, which is also the compression codec of the original files.
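For reference, a minimal sketch of the kind of job described above (the paths and the bare dropDuplicates() call are assumptions, not the actual pipeline):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dedup").getOrCreate()
# keep the original codec so the before/after sizes are comparable
spark.conf.set("spark.sql.parquet.compression.codec", "gzip")

df = spark.read.parquet("/data/input")        # hypothetical input path
deduped = df.dropDuplicates()                 # drop exact duplicate rows
deduped.write.mode("overwrite").parquet("/data/output")  # hypothetical output path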

Related

Is the number of partitions of a parquet file in ADLS the same as the number of partitions after reading it as a dataframe?

I have 3 parquet files in ADLS.
Two of them have 10 sub-parquet files each, and when I read them as a dataframe in Databricks using PySpark, the number of partitions is 10, which is the expected behaviour.
The third one has 172 snappy.parquet files, but when I read it as a dataframe the number of partitions is 89. What is the reason behind this?
I used df.rdd.getNumPartitions() to find the number of partitions of a dataframe.
When reading, Spark tries to create partitions no bigger than the size specified by spark.sql.files.maxPartitionBytes (128 MB by default). Spark looks at the file sizes and takes them into account: when a file is smaller than the desired partition size, a partition is built from multiple files, and when a file is bigger than the desired partition size, it is split into multiple partitions (if the format is splittable, like Parquet).
In your case, it looks like you have a lot of files smaller than the desired partition size, so several of them are packed into each partition.
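A quick way to see the effect (a sketch, assuming an existing SparkSession named spark; the ADLS path is a placeholder):

# default: partitions are capped at ~128 MB and small files are packed together
spark.conf.set("spark.sql.files.maxPartitionBytes", "128MB")
df = spark.read.parquet("abfss://container@account.dfs.core.windows.net/path")  # placeholder path
print(df.rdd.getNumPartitions())

# lowering the cap generally yields more (smaller) partitions on the next read
spark.conf.set("spark.sql.files.maxPartitionBytes", "32MB")
print(spark.read.parquet("abfss://container@account.dfs.core.windows.net/path").rdd.getNumPartitions())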

PySpark is Writing Large Single Parquet Files instead of Partitioned Files

For most of my files, when I read in delimited files and write them out as snappy parquet, Spark executes as I expect and creates multiple partitioned snappy parquet files.
That said, I have some large pipe-separated .out files (25 GB+), and when I read them in:
inputFile = spark.read.load(s3PathIn, format='csv', sep=fileSeparator, quote=fileQuote, escape=fileEscape, inferSchema='true', header='true', multiline='true')
Then output the results to S3:
inputFile.write.parquet(pathOut, mode="overwrite")
I am getting large single snappy parquet files (20 GB+). Is there a reason for this? All my other Spark pipelines generate nicely split files that make querying in Athena more performant, but in these specific cases I am only getting single large files. I am NOT executing any repartition or coalesce commands.
Check how many partitions you have in the inputFile dataframe. It seems like it has a single partition.
It seems like you are just reading a CSV file and then writing it out as a parquet file; check the size of your CSV file, it is probably really large.
inputFile.rdd.getNumPartitions()
If it is 1, try repartitioning the dataframe (a fuller sketch follows below):
inputFile.repartition(10)
# or
inputFile.repartition("col_name")

Snappy Compression

I am trying to store an Avro file as a parquet file with snappy compression. The data gets written as parquet with the filename ending in .snappy.parquet, but the file size remains the same. Pasting the code.
CODE:
sqlContext.setConf("spark.sql.parquet.compression.codec","snappy")
orders_avro.write.parquet("/user/cloudera/problem5/parquet-snappy-compress")
Snappy is already the default compression codec when Spark writes Parquet files, so setting it explicitly changes nothing; the only visible difference here is the filename.
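To confirm the codec setting actually takes effect, one approach (a sketch; the output paths are placeholders modelled on the question) is to write the same dataframe with two other codecs and compare the directory sizes:

sqlContext.setConf("spark.sql.parquet.compression.codec", "gzip")
orders_avro.write.parquet("/user/cloudera/problem5/parquet-gzip-compress")

sqlContext.setConf("spark.sql.parquet.compression.codec", "uncompressed")
orders_avro.write.parquet("/user/cloudera/problem5/parquet-no-compress")

# then compare sizes, e.g.
# hdfs dfs -du -s -h /user/cloudera/problem5/parquet-*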

Concatenate ORC partition files on disk?

I am using Spark 2.3 to convert some CSV data to ORC for use with Amazon Athena; it is working fine! Athena works best with files that are not too small so, after manipulating the data a bit, I am using Spark to coalesce the partitions into a single partition before writing to disk, like so:
df.coalesce(1).write.orc("out.orc", compression='zlib', mode='append')
This results in a single ORC file that is an optimal file size for use with Athena. However, the coalesce step takes a very long time. It adds about 33% to the total amount of time to convert the data!
This is obviously due to the fact that Spark cannot parallelize the coalesce step when saving to a single file. When I create the same number of partitions as there are CPUs available, the ORC write out to disk is much faster!
My question is: can I parallelize the ORC write to disk and then concatenate the files somehow? That would let me parallelize the write and merge the files without having to compress everything on a single CPU.
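One commonly used variant (a sketch, assuming a single output file is still wanted) is to replace coalesce(1) with repartition(1): repartition inserts a shuffle, so the upstream transformations keep running on all cores and only the final write happens in a single task.

# coalesce(1) can pull the whole upstream computation into one task;
# repartition(1) adds a shuffle but keeps the earlier stages parallel
df.repartition(1).write.orc("out.orc", compression='zlib', mode='append')

If a fully parallel write followed by a merge is preferred instead, Hive exposes ALTER TABLE ... CONCATENATE for ORC tables, though that step happens outside Spark itself.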

How to create a DStream from parquet files stored in HDFS?

I want to read a large number of records, stored in HDFS in parquet format, and convert them into a Spark stream so they can be processed in batches (and written to another data store).
Is there any way to convert DataFrames to DStreams?
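One route that is often suggested is not a DStream at all but Spark Structured Streaming's file source, which reads a parquet directory as a stream of micro-batches. A sketch, assuming an existing SparkSession named spark; the schema and HDFS paths are placeholders:

from pyspark.sql.types import StructType, StructField, StringType, LongType

schema = StructType([                     # hypothetical schema for the records
    StructField("id", LongType()),
    StructField("payload", StringType()),
])

stream = (spark.readStream
          .schema(schema)                 # file sources require an explicit schema
          .parquet("hdfs:///data/records"))  # placeholder input path

query = (stream.writeStream
         .format("parquet")               # write each micro-batch to another store
         .option("path", "hdfs:///data/out")            # placeholder
         .option("checkpointLocation", "hdfs:///chk")   # placeholder
         .start())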
