PySpark is Writing Large Single Parquet Files instead of Partitioned Files

For most of my files, when I read in delimited files and write them out to snappy parquet, Spark executes as I expect and creates multiple partitioned snappy parquet files.
That said, I have some large .out files that are pipe-separated (25GB+), and when I read them in:
inputFile = spark.read.load(s3PathIn, format='csv', sep=fileSeparator, quote=fileQuote, escape=fileEscape, inferSchema='true', header='true', multiline='true')
Then output the results to S3:
inputFile.write.parquet(pathOut, mode="overwrite")
I am getting large single snappy parquet files (20GB+). Is there a reason for this? All my other Spark pipelines generate nicely split files that make queries in Athena more performant, but in these specific cases I am only getting single large files. I am NOT executing any repartition or coalesce commands.

Check how many partitions the inputFile dataframe has. It seems like it has a single partition.
It seems like you are just reading a CSV file and then writing it out as a parquet file. Check the size of your CSV file; it seems to be really large.
inputFile.rdd.getNumPartitions()
If it's one, try repartitioning the dataframe:
inputFile.repartition(10)  # or
inputFile.repartition("col_name")

Related

Parquet file size doubles after deduplication in Spark

We have a deduplication process that reads parquet files, drops duplicate records, and writes the distinct dataframe back out as parquet files in Spark SQL.
But the output file size doubles compared to the original. We are writing the parquet with gzip compression, which is also the original files' compression codec.
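The original job isn't shown; a minimal sketch of the described process, assuming a plain dropDuplicates pass and placeholder paths, might look like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder input path; the real job's logic is not shown in the question
df = spark.read.parquet("s3://my-bucket/raw/")

# Drop exact duplicate rows and write back with the same codec as the input
df.dropDuplicates() \
  .write.option("compression", "gzip") \
  .mode("overwrite") \
  .parquet("s3://my-bucket/deduped/")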

Merge multiple parquet files into single file on S3

I don't want to partition or repartition the Spark dataframe, since writing multiple part files gives the best performance. Is there any way I can merge the files after they have been written to S3?
I have used parquet-tools and it merges local files, but I want to do this on S3.
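For illustration only, a commonly suggested workaround (not a file-level merge like parquet-tools performs; it re-reads and rewrites the data) is to read the part files back and coalesce before writing again. The paths here are hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical paths; this rewrites the data as a single file rather than
# merging the existing part files in place
df = spark.read.parquet("s3://my-bucket/output/part-files/")
df.coalesce(1).write.mode("overwrite").parquet("s3://my-bucket/output/merged/")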

How to read ".gz" compressed file using spark DF or DS?

I have a compressed file in .gz format. Is it possible to read the file directly using a Spark DF/DS?
Details: the file is a tab-delimited CSV.
Reading a compressed csv is done in the same way as reading an uncompressed csv file. For Spark version 2.0+ it can be done as follows using Scala (note the extra option for the tab delimiter):
val df = spark.read.option("sep", "\t").csv("file.csv.gz")
PySpark:
df = spark.read.csv("file.csv.gz", sep='\t')
The only extra consideration to take into account is that the gz file is not splittable, therefore Spark needs to read the whole file using a single core, which will slow things down. After the read is done, the data can be shuffled to increase parallelism.
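A short PySpark sketch of that last point, with a placeholder path and partition count: the gzipped file is read by one task, and a repartition afterwards spreads the data across the cluster:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# gzip is not splittable, so this read runs as a single task
df = spark.read.csv("s3://my-bucket/file.csv.gz", sep="\t", header=True)

# Shuffle the data so that downstream stages can use the whole cluster
df = df.repartition(64)
df.write.parquet("s3://my-bucket/file-parquet/")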

PySpark: Writing input files to separate output files without repartitioning

I have a sequence of very large daily gzipped files. I'm trying to use PySpark to re-save all the files in S3 in Parquet format for later use.
If for a single file (for example, 2012-06-01) I do:
dataframe = spark.read.csv('s3://mybucket/input/20120601.gz', schema=my_schema, header=True)
dataframe.write.parquet('s3://mybucket/output/20120601')
it works, but since gzip isn't splittable it runs on a single host and I get no benefit of using the cluster.
I tried reading in a chunk of files at once, and using partitionBy to write the output to daily files like this (for example, reading in a month):
dataframe = spark.read.csv('s3://mybucket/input/201206*.gz', schema=my_schema, header=True)
dataframe.write.partitionBy('dayColumn').parquet('s3://mybucket/output/')
This time, individual files are read on different executors like I want, but the executors later die and the process fails. I believe that, since the files are so large and partitionBy is somehow using unnecessary resources (a shuffle?), it's crashing the tasks.
I don't actually need to re-partition my dataframe since this is just a 1:1 mapping. Is there any way to make each individual task write to a separate, explicitly named parquet output file?
I was thinking something like
def write_file(date):
    # get input/output locations from date
    dataframe = spark.read.csv(input_location, schema=my_schema, header=True)
    dataframe.write.parquet(output_location)

spark.sparkContext.parallelize(my_dates).foreach(write_file)
except this doesn't work since you can't broadcast the spark session to the cluster. Any suggestions?
Writing input files to separate output files without repartitioning
TL;DR This is what your code is already doing.
partitionBy is causing an unnecessary shuffle
No. DataFrameWriter.partitionBy doesn't shuffle at all.
it works, but since gzip isn't splittable
You can:
Drop compression completely - Parquet uses internal compression.
Use splittable compression like bzip2.
Unpack the files to a temporary storage before submitting the job.
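Picking up the bzip2 option from the list above: Spark reads .bz2 files the same way as .gz, but because bzip2 is splittable the read itself can run in parallel. A minimal sketch, where the paths and the stand-in schema are placeholders:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

# Hypothetical schema standing in for the question's my_schema
my_schema = StructType([StructField("dayColumn", StringType()),
                        StructField("value", StringType())])

# bzip2 is a splittable codec, so this read can run as multiple tasks,
# unlike the same file compressed with gzip
dataframe = spark.read.csv("s3://mybucket/input/20120601.bz2",
                           schema=my_schema, header=True)
dataframe.write.parquet("s3://mybucket/output/20120601")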
If you are concerned about the resources used by partitionBy (it might open a larger number of files for each executor thread), you can actually shuffle to improve performance - DataFrame partitionBy to a single Parquet file (per partition). A single file is probably too much, but
dataframe \
    .repartition(n, 'dayColumn', 'someOtherColumn') \
    .write.partitionBy('dayColumn') \
    .save(...)
where someOtherColumn can be chosen to get reasonable cardinality, should improve things.

How does Spark read text format files

I have a dataset in S3 in text format (.gz) and I am using spark.read.csv to read the file into Spark.
This is about 100GB of data, but it contains 150 columns. I am only using 5 columns, so I select just those 5 to reduce the breadth of the data.
For this kind of scenario, does Spark scan the complete 100GB of data, or does it smartly filter only these 5 columns without scanning all the columns (like in columnar formats)?
Any help on this would be appreciated.
imp_feed = spark.read.csv('s3://mys3-loc/input/', schema=impressionFeedSchema, sep='\t') \
    .where(col('dayserial_numeric').between(start_date_imp, max_date_imp)) \
    .select("col1", "col2", "col3", "col4")
Make step 1 of your workflow the process of reading in the CSV file and saving it as snappy-compressed ORC or parquet files.
Then go to whoever creates those files and tell them to stop. At the very least, they should switch to Avro + Snappy, as that makes it easier to split up the initial parse and the migration to a columnar format.
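A hedged sketch of that first suggestion, reusing the reader from the question's snippet (impressionFeedSchema, start_date_imp, and max_date_imp come from the question; the staging path is a placeholder). The full scan of the gzipped text happens once; later jobs read the columnar copy and only scan the columns they select:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Step 1: one full pass over the gzipped text (no column pruning is possible here)
imp_feed = spark.read.csv('s3://mys3-loc/input/', schema=impressionFeedSchema, sep='\t')
imp_feed.write.option("compression", "snappy").parquet('s3://mys3-loc/staged/')

# Later steps read the columnar copy, so only the selected columns are scanned
staged = spark.read.parquet('s3://mys3-loc/staged/') \
    .where(col('dayserial_numeric').between(start_date_imp, max_date_imp)) \
    .select("col1", "col2", "col3", "col4")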
