Merge multiple parquet files into single file on S3 - apache-spark

I don't want to partition or repartition the Spark dataframe before writing, since writing multiple part files gives the best performance. Is there any way I can merge the files after they have been written to S3?
I have used parquet-tools and it can merge local files; I want to do the same on S3.
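One hedged workaround, assuming a second Spark pass over the data is acceptable, is to read the part files back and rewrite them as a single coalesced file; the bucket and prefixes below are hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

src = "s3a://my-bucket/output/parquet/"          # prefix holding the part-* files (hypothetical)
dst = "s3a://my-bucket/output/parquet-merged/"   # destination for the merged copy (hypothetical)

df = spark.read.parquet(src)                         # reads every part file under the prefix
df.coalesce(1).write.mode("overwrite").parquet(dst)  # writes a single part file inside dst

Note that this still produces one part-* file inside a folder rather than a bare object, and it shifts the cost to a second read/write pass instead of merging in place.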

Related

Write parquet files to object store using spark-kubernetes

Working on a project to write parquet files to an object store from the k8s Spark operator, using multiple partition keys. The output location of the Spark jobs is the same for all runs, and the data is written into subfolders based on the partition keys.
What are the options for writing the parquet files without HDFS?
The default S3A committers (algorithm 1 and 2) will not work, since the temporary files in the output location get deleted after each successful write, as data is continuously written to the output location in small batches.
Is there any way to use the directory committer without HDFS?
NFS does not seem to work.
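For reference, a minimal sketch of enabling the S3A "magic" committer, which commits work directly to S3 and does not rely on HDFS for staging (the directory/staging committers, by contrast, generally expect a cluster filesystem to pass commit data between tasks and the driver). This assumes the hadoop-aws and spark-hadoop-cloud jars are on the classpath; the configuration keys follow the Hadoop S3A committer documentation, and the rest of the session setup is omitted:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # use the S3A magic committer instead of the classic file output committer
    .config("spark.hadoop.fs.s3a.committer.name", "magic")
    .config("spark.hadoop.fs.s3a.committer.magic.enabled", "true")
    # route Spark SQL writes through the cloud-aware commit protocol
    .config("spark.sql.sources.commitProtocolClass",
            "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
    .config("spark.sql.parquet.output.committer.class",
            "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
    .getOrCreate()
)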

PySpark is Writing Large Single Parquet Files instead of Partitioned Files

For most of my files, when I read in delimited files and write them out to snappy parquet, Spark executes as I expect and creates multiple partitioned snappy parquet files.
That said, I have some large .out files that are pipe-separated (25GB+), and when I read them in:
inputFile = spark.read.load(s3PathIn, format='csv', sep=fileSeparator, quote=fileQuote, escape=fileEscape, inferSchema='true', header='true', multiline='true')
Then output the results to S3:
inputFile.write.parquet(pathOut, mode="overwrite")
I am getting large single snappy parquet files (20GB+). Is there a reason for this? All my other Spark pipelines generate nicely split files that make queries in Athena more performant, but in these specific cases I am only getting single large files. I am NOT executing any repartition or coalesce commands.
Check how many partitions the inputFile dataframe has; it seems like it has a single partition.
It seems like you are just reading a CSV file and then writing it out as a parquet file. Check the size of your CSV file; it seems like it is really large.
inputFile.rdd.getNumPartitions()
If it is 1, try repartitioning the dataframe:
inputFile.repartition(10)           # by a partition count, or
inputFile.repartition("col_name")   # by a column

Renaming Exported files from Spark Job

We are currently using a Spark job on Databricks which does processing on our data lake in S3.
Once the processing is done we export our result to an S3 bucket using a normal
df.write()
The issue is that when we write the dataframe to S3 the file names are controlled by Spark, but per our agreement we need to rename these files to something meaningful.
Since S3 doesn't have a rename feature, we are currently using boto3 to copy the files to the expected names.
This process is very complex and does not scale as more clients are onboarded.
Do we have any better solution to rename files exported from Spark to S3?
It's not possible to do it directly in Spark's save.
Spark uses the Hadoop file output format, which requires data to be partitioned - that's why you have part- files. If the data is small enough to fit into memory, one workaround is to convert it to a pandas dataframe and save it as CSV from there.
df_pd = df.toPandas()   # collects the data to the driver
df_pd.to_csv("path")    # pandas writes a single CSV file
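For the boto3 copy-and-delete approach the question already uses, a minimal sketch might look like the following; the bucket, prefix, and target key are hypothetical:

import boto3

s3 = boto3.client("s3")
bucket = "my-bucket"                          # hypothetical bucket
prefix = "exports/run_2021/"                  # folder Spark wrote into
target_key = "exports/run_2021/report.csv"    # meaningful name agreed with the client

# find the part- file Spark produced under the prefix
objects = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)["Contents"]
part_key = next(o["Key"] for o in objects if "part-" in o["Key"])

# S3 has no rename, so copy to the new key and delete the original
s3.copy_object(Bucket=bucket, CopySource={"Bucket": bucket, "Key": part_key}, Key=target_key)
s3.delete_object(Bucket=bucket, Key=part_key)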

Saving dataframe as a single file to AWS S3 through Apache-Spark

How do I save a dataframe as a single file to AWS S3? I tried both repartition and coalesce, but it didn't work. The dataframe is saved as multiple part files in a folder; I need it saved as a single file (myfile.csv).
val s3path="s3a://***:***#***/myfile.csv"
df.coalesce(1).write.format("com.databricks.spark.csv").save(s3path)
Thanks
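One common workaround, sketched below in PySpark (the Scala calls are equivalent), is to write the coalesced output to a temporary prefix and then rename the single part file through the Hadoop FileSystem API; the paths are hypothetical and the _jvm/_jsc accessors are PySpark internals, so treat this as a sketch rather than a supported recipe:

tmp_path = "s3a://my-bucket/tmp_myfile"     # hypothetical temporary prefix
final_path = "s3a://my-bucket/myfile.csv"   # hypothetical final object

df.coalesce(1).write.mode("overwrite").option("header", "true").csv(tmp_path)

jvm = spark.sparkContext._jvm
conf = spark.sparkContext._jsc.hadoopConfiguration()
fs = jvm.org.apache.hadoop.fs.Path(tmp_path).getFileSystem(conf)

# locate the single part file and move it to the final name
part = fs.globStatus(jvm.org.apache.hadoop.fs.Path(tmp_path + "/part-*"))[0].getPath()
fs.rename(part, jvm.org.apache.hadoop.fs.Path(final_path))   # copy+delete under s3a
fs.delete(jvm.org.apache.hadoop.fs.Path(tmp_path), True)     # drop the leftover folder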

Do Parquet Metadata Files Need to be Rolled-back?

When Parquet data is written with partitioning on its date column, we get a directory structure like:
/data
    _common_metadata
    _metadata
    _SUCCESS
    /date=1
        part-r-xxx.gzip
        part-r-xxx.gzip
    /date=2
        part-r-xxx.gzip
        part-r-xxx.gzip
If the partition date=2 is deleted without the involvement of Parquet utilities (via the shell or a file browser, etc.), do any of the metadata files need to be rolled back to when there was only the partition date=1?
Or is it OK to delete partitions at will and rewrite them (or not) later?
If you're using DataFrames, there is no need to roll back the metadata files.
For example:
You can write your DataFrame to S3
df.write.partitionBy("date").parquet("s3n://bucket/folderPath")
Then, manually delete one of your partitions (date=1 folder in S3) using S3 browser (e.g. CloudBerry)
Now you can
Load your data and see that the data is still valid, except for the data you had in partition date=1:
sqlContext.read.parquet("s3n://bucket/folderPath").count
Or rewrite your DataFrame (or any other DataFrame with the same schema) using append mode
df2.write.mode("append").partitionBy("date").parquet("s3n://bucket/folderPath")
You can also take a look at this question from the Databricks forum.
