Do Parquet Metadata Files Need to be Rolled-back? - apache-spark

When Parquet data is written with partitioning on its date column, we get a directory structure like:
/data
    _common_metadata
    _metadata
    _SUCCESS
    /date=1
        part-r-xxx.gzip
        part-r-xxx.gzip
    /date=2
        part-r-xxx.gzip
        part-r-xxx.gzip
If the partition date=2 is deleted without the involvement of Parquet utilities (via the shell, a file browser, etc.), do any of the metadata files need to be rolled back to when there was only the partition date=1?
Or is it OK to delete partitions at will and rewrite them (or not) later?

If you're using the DataFrame API, there is no need to roll back the metadata files.
For example:
You can write your DataFrame to S3:
df.write.partitionBy("date").parquet("s3n://bucket/folderPath")
Then manually delete one of your partitions (the date=1 folder in S3) using an S3 browser (e.g. CloudBerry).
Now you can:
Load your data and see that it is still valid, except for the data you had in partition date=1:
sqlContext.read.parquet("s3n://bucket/folderPath").count
Or rewrite your DataFrame (or any other DataFrame with the same schema) using append mode:
df2.write.mode("append").partitionBy("date").parquet("s3n://bucket/folderPath")
You can also take a look at this question on the Databricks forum.
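This works because Spark's partition discovery is driven purely by the directory layout (Hive-style date=N folders), not by the summary files. A minimal sketch of that idea in plain Python, with no Spark involved and purely illustrative paths:

```python
import os
import shutil
import tempfile

# Build a toy partitioned layout: /data/date=1 and /data/date=2
root = tempfile.mkdtemp()
for d in (1, 2):
    part = os.path.join(root, f"date={d}")
    os.makedirs(part)
    with open(os.path.join(part, "part-r-000.gzip"), "w") as f:
        f.write("rows")

# Delete one partition directory "by hand", as with a shell or S3 browser
shutil.rmtree(os.path.join(root, "date=2"))

# Partition discovery is just a directory listing: the survivor is intact
partitions = sorted(e for e in os.listdir(root) if e.startswith("date="))
print(partitions)  # ['date=1']
```

Each date=N folder is an independent unit, which is why deleting one and later appending a replacement does not require touching the other partitions.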

Related

Write parquet files to object store using spark-kubernetes

Working on a project to write Parquet files to an object store from the Kubernetes Spark operator, using multiple partition keys. The output location of the Spark jobs for the Parquet files is the same, and the data is written to subfolders based on the partition keys.
What are the options for writing the Parquet files without HDFS?
The default committers (algorithm versions 1 and 2) will not work, since the temp files in the output location get deleted after each successful write, as data is continuously written to the output location in small batches.
Is there any way to use the directory committer without HDFS?
NFS does not seem to work.

Find out target csv file name for a Spark DataFrame.write.csv() call

In a pyspark session when I do this:
df = spark.read.parquet(file)
df.write.csv('output')
it creates a directory called output with a bunch of files, one of which is a target csv file with unpredictable name, example: part-00006-80ba8022-33cb-4478-aab3-29f08efc160a-c000.csv
Is there a way to know what the output file name is after the .csv() call?
When you read a Parquet file into a DataFrame, it will have some number of partitions, because the storage is distributed. Similarly, when you save that DataFrame as a CSV file, it is saved in a distributed manner, based on the number of partitions the DataFrame had.
The path you provide when writing the CSV becomes a folder, and inside that folder you get multiple partition files. Each file holds a portion of the data, and combining all the partition files gives you the entire content of the CSV.
If you read that folder path back, you see the entire content of the CSV file. This is the default behaviour of Spark and distributed computing.
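Since the part-file name is generated, the usual approach is to look it up after the write rather than try to predict it. A sketch in plain Python (the output directory here is faked rather than produced by Spark, so the part-file name is illustrative):

```python
import glob
import os
import tempfile

# Simulate the directory that df.write.csv('output') produces
out = os.path.join(tempfile.mkdtemp(), "output")
os.makedirs(out)
open(os.path.join(out, "_SUCCESS"), "w").close()
with open(os.path.join(out, "part-00000-80ba8022-c000.csv"), "w") as f:
    f.write("a,b\n1,2\n")

# The name is unpredictable, but the pattern is not: glob for it after the write
parts = glob.glob(os.path.join(out, "part-*.csv"))
print(len(parts))  # 1
```

If you need exactly one file to locate, df.coalesce(1).write.csv('output') produces a single part file, which the glob above then finds unambiguously.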

Use Unmanaged table in Delta lake on Top of ADLS Gen2

I use ADF to ingest data from SQL Server into ADLS Gen2 in Parquet Snappy format, but the size of the file in the sink goes up to 120 GB. This size causes me a lot of problems when I read the file in Spark and join its data with many other Parquet files.
I am thinking of using a Delta Lake unmanaged table with the location pointing to the ADLS location. I am able to create an unmanaged table if I don't specify any partition, using this:
CONVERT TO DELTA parquet.<path to folder containing the Parquet file(s)>
But if I want to partition this file for query optimization:
CONVERT TO DELTA parquet.<path to folder containing the Parquet file(s)> PARTITIONED BY (<partitioned_column> <datatype>)
It gives me an error like the one in the screenshot (see attachment).
Error in Text :-
org.apache.spark.sql.AnalysisException: Expecting 1 partition column(s): [<PARTITIONED_COLUMN>], but found 0 partition column(s): [] from parsing the file name: abfss://mydirectory#myADLS.dfs.core.windows.net/level1/Level2/Table1.parquet.snappy;
There is no way that I can create this Parquet file using ADF with partition details (I am open to suggestions).
Am I using the wrong syntax, or can this even be done?
OK, I found the answer to this. When you convert Parquet files to Delta using the approach above, Delta looks for the correct directory structure, with partition information encoded in the folder names matching the column mentioned in the PARTITIONED BY clause.
For example, I have a folder called /Parent with a partitioned directory structure inside it; the partitioned Parquet files are kept one level down inside the partition folders, which are named like this:
/Parent/Subfolder=0/part-00000-62ef2efd-b88b-4dd1-ba1e-3a146e986212.c000.snappy.parquet
/Parent/Subfolder=1/part-00000-fsgvfabv-b88b-4dd1-ba1e-3a146e986212.c000.snappy.parquet
/Parent/Subfolder=2/part-00000-fbfdfbfe-b88b-4dd1-ba1e-3a146e986212.c000.snappy.parquet
/Parent/Subfolder=3/part-00000-gbgdbdtb-b88b-4dd1-ba1e-3a146e986212.c000.snappy.parquet
In this case, Subfolder is the partition column created inside /Parent.
CONVERT TO DELTA parquet.`/Parent/` PARTITIONED BY (Subfolder INT)
will take this directory structure, convert the whole partitioned dataset to Delta, and store the partition information in the metastore.
Summary: this command only works with already-partitioned Parquet files. To create partitions from a single Parquet file you would have to take a different route.

Renaming Exported files from Spark Job

We are currently using a Spark job on Databricks which processes our data lake in S3.
Once the processing is done, we export our result to an S3 bucket using a normal
df.write()
The issue is that when we write a DataFrame to S3, the file names are controlled by Spark, but per our agreement we need to rename these files to meaningful names.
Since S3 doesn't have a rename feature, we currently use boto3 to copy the file under the expected name and then delete the original.
This process is very complex and does not scale as more clients come on board.
Do we have any better solution to rename exported files from spark to S3 ?
It's not possible to do this directly in Spark's save.
Spark uses the Hadoop file output format, which requires data to be partitioned; that's why you get part- files. If the data is small enough to fit into memory, one workaround is to convert to a pandas DataFrame and save as CSV from there:
df_pd = df.toPandas()
df_pd.to_csv("path")
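If the result has to stay a Spark-written file, the copy-then-delete step described in the question can at least be wrapped in a small helper. A sketch in plain Python against the local filesystem, where os.replace stands in for the boto3 copy_object + delete_object pair on S3 (the function name and paths are illustrative):

```python
import glob
import os
import tempfile

def rename_part_file(out_dir, target_name):
    """Give the single Spark-written part file a meaningful name.

    Locally this is a rename; on S3 the os.replace call would become the
    boto3 copy_object + delete_object pair, since S3 has no rename.
    """
    parts = glob.glob(os.path.join(out_dir, "part-*"))
    if len(parts) != 1:
        raise ValueError(f"expected exactly one part file, found {len(parts)}")
    target = os.path.join(out_dir, target_name)
    os.replace(parts[0], target)
    return target

# Simulate a df.coalesce(1).write.csv(...) output directory
out = tempfile.mkdtemp()
with open(os.path.join(out, "part-00000-abc123-c000.csv"), "w") as f:
    f.write("a,b\n1,2\n")

renamed = rename_part_file(out, "report.csv")
print(os.path.basename(renamed))  # report.csv
```

The single-file check matters: with more than one partition there is no one file to rename, so the job should coalesce to one partition first, or the helper should loop with a naming scheme.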

Unable to Merge Small ORC Files using Spark

I have an external ORC table with a large number of small files, which arrive from the source on a daily basis. I need to merge these files into larger files.
I tried to load the ORC files into Spark and save them with the overwrite method:
val fileName = "/user/db/table_data/" // this table contains multiple partitions on the date column, with small data files
val df = hiveContext.read.format("orc").load(fileName)
df.repartition(1).write.mode(SaveMode.Overwrite).partitionBy("date").orc("/user/db/table_data/")
But mode(SaveMode.Overwrite) is deleting all the data from HDFS, and when I tried without mode(SaveMode.Overwrite) it threw a "file already exists" error.
Can anyone help me to proceed?
As suggested by @Avseiytsev, I stored the merged ORC files in a different folder in HDFS and moved the data to the table path after the job completed.
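That fix can be sketched end-to-end without Spark: the merge job writes to a staging folder, and only after it completes is the staging folder swapped into the table path (plain Python standing in for the HDFS move; all paths are illustrative):

```python
import os
import shutil
import tempfile

base = tempfile.mkdtemp()
table = os.path.join(base, "table_data")            # live table path
staging = os.path.join(base, "table_data_merged")   # side folder for merged output

# Existing table with many small files in a date partition
os.makedirs(os.path.join(table, "date=2020-01-01"))
with open(os.path.join(table, "date=2020-01-01", "small-0.orc"), "w") as f:
    f.write("x")

# 1. The merge job writes its repartitioned output to the staging folder,
#    so the source data is never read and overwritten in the same path.
os.makedirs(os.path.join(staging, "date=2020-01-01"))
with open(os.path.join(staging, "date=2020-01-01", "merged-0.orc"), "w") as f:
    f.write("x")

# 2. Only after the job succeeds, swap the staging folder into the table path
shutil.rmtree(table)
os.replace(staging, table)

print(os.listdir(os.path.join(table, "date=2020-01-01")))  # ['merged-0.orc']
```

The key point is that overwrite-in-place fails because Spark reads lazily: the overwrite deletes the source before it has been fully read, which is why the staging indirection is needed.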
