Unable to Merge Small ORC Files using Spark - apache-spark

I have an external ORC table with a large number of small files, which arrive from the source on a daily basis. I need to merge these files into larger files.
I tried loading the ORC files into Spark and saving them back with the overwrite mode:
val fileName = "/user/db/table_data/" //This table contains multiple partition on date column with small data files.
val df = hiveContext.read.format("orc").load(fileName)
df.repartition(1).write.mode(SaveMode.Overwrite).partitionBy("date").orc("/user/db/table_data/")
But mode(SaveMode.Overwrite) deletes all the existing data from HDFS. When I tried it without mode(SaveMode.Overwrite), it threw an error saying the file already exists.
Can anyone help me to proceed?

As suggested by @Avseiytsev, I stored my merged ORC files in a different folder than the source in HDFS and moved the data to the table path after the job completed.
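A minimal PySpark sketch of that approach, assuming a hypothetical staging path /user/db/table_data_merged/ and that the final swap back into the table path is done with hdfs dfs commands outside of Spark:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

source_path = "/user/db/table_data/"          # partitioned table with many small ORC files
staging_path = "/user/db/table_data_merged/"  # hypothetical staging folder

# Read the small files and write fewer, larger files per date partition into the staging folder
df = spark.read.format("orc").load(source_path)
df.repartition("date") \
  .write.mode("overwrite") \
  .partitionBy("date") \
  .orc(staging_path)

# After the job completes, move the merged files into the table path outside of Spark, e.g.:
#   hdfs dfs -rm -r /user/db/table_data/date=*
#   hdfs dfs -mv /user/db/table_data_merged/date=* /user/db/table_data/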

Related

Find out target csv file name for a Spark DataFrame.write.csv() call

In a pyspark session when I do this:
df = spark.read.parquet(file)
df.write.csv('output')
it creates a directory called output with a bunch of files, one of which is the target csv file with an unpredictable name, for example: part-00006-80ba8022-33cb-4478-aab3-29f08efc160a-c000.csv
Is there a way to know what the output file name is after the .csv() call?
When you read a parquet file into a dataframe, the dataframe has some number of partitions because the data lives in distributed storage. Similarly, when you save that dataframe as a csv file, it is written in a distributed manner based on the number of partitions the dataframe had.
The path you provide when writing the csv becomes a folder, and inside that folder you get one file per partition. Each file holds a portion of the data, and combining all the partition files gives you the entire content of the csv.
If you read that folder path back, you will see the entire content of the csv. This is the default behaviour of Spark and distributed computing in general.
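Spark does not return the generated file name directly, but a small sketch of a common workaround, assuming the output path is on the local filesystem (for HDFS or S3 you would list the directory with the corresponding filesystem API instead), is to list the output directory after the write:

import glob

df = spark.read.parquet(file)
df.coalesce(1).write.csv('output')        # coalesce(1) yields a single part file

# The write creates a directory; pick up the part file(s) it produced
csv_files = glob.glob('output/part-*.csv')
print(csv_files)                          # e.g. ['output/part-00000-<uuid>-c000.csv']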

Pyspark dataframe parquet vs delta : different number of rows

I have data written in Delta format on HDFS. From what I understand, Delta stores the data as parquet and just adds an additional layer over it with advanced features.
But when reading the data with Pyspark, I get a different result depending on whether the dataframe is read with spark.read.parquet() or spark.read.format('delta').load():
df = spark.read.format('delta').load("my_data")
df.count()
> 184511389
df = spark.read.parquet("my_data")
df.count()
> 369022778
As you can see the difference is quite big.
Is there something I misunderstood about delta vs parquet?
Pyspark version is 2.4.
The most probable explanation is that you wrote into the Delta table two times using the overwrite option. Delta is a versioned data format: when you use overwrite, it doesn't delete the previous data, it just writes new files and doesn't delete the old files immediately; they are only marked as deleted in the transaction log that Delta maintains. When you read from Delta, it knows which files are deleted and reads only the current data. Actual deletion of the data files happens when you perform VACUUM on the Delta table.
But when you read the same path as plain Parquet, the reader has no information about deleted files, so it reads everything in the directory and you get twice as many rows.
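A short sketch of how this situation typically arises and how VACUUM reconciles the two counts; df1 and df2 are hypothetical dataframes, and the zero-hour retention is for illustration only (it normally requires relaxing Delta's retention safety check):

from delta.tables import DeltaTable

# Two overwrites into the same Delta path leave the first set of parquet files
# on disk; they are only marked as removed in the Delta transaction log.
df1.write.format("delta").mode("overwrite").save("my_data")
df2.write.format("delta").mode("overwrite").save("my_data")

spark.read.format("delta").load("my_data").count()   # rows of df2 only
spark.read.parquet("my_data").count()                # rows of df1 + df2 (stale files included)

# Physically delete the files the Delta log no longer references
DeltaTable.forPath(spark, "my_data").vacuum(0)

After the vacuum, both read paths see the same files, although reading a Delta directory as plain parquet is still not recommended.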

How to append data in existing AVRO file using Python

I have a dataframe with the same schema, and I need to append its data to an existing AVRO file. I don't want the output written into a folder as part files; for your information, my existing AVRO file is a single file, not part files inside a folder. Can you please help me solve this task?
You can write the data by using the overwrite mode while writing the dataframe.
But part files will still be created, because Spark is a distributed processing engine and each executor writes out its own file based on the amount of data it processes.
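Since Spark always writes part files, a hedged sketch of an alternative that matches the single-file requirement is to collect the rows on the driver and append them with the fastavro library, which can append to an existing Avro file opened in 'a+b' mode; existing.avro, the schema, and the records are assumptions for illustration, and collecting only makes sense for modest data volumes:

from fastavro import writer, parse_schema

# Schema matching the existing file; assumed here for illustration
schema = parse_schema({
    "type": "record",
    "name": "Example",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "value", "type": "string"},
    ],
})

# Rows to append, e.g. collected from a small Spark dataframe:
# new_records = [row.asDict() for row in df.collect()]
new_records = [{"id": 1, "value": "a"}, {"id": 2, "value": "b"}]

# Opening in 'a+b' mode lets fastavro append new blocks to the existing file
with open("existing.avro", "a+b") as out:
    writer(out, schema, new_records)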

Use Unmanaged table in Delta lake on Top of ADLS Gen2

I use ADF to ingest data from SQL Server into ADLS Gen2 in Parquet Snappy format, but the size of the file in the sink goes up to 120 GB. That size causes me a lot of problems when I read this file in Spark and join its data with many other Parquet files.
I am thinking of using a Delta Lake unmanaged table with the location pointing to the ADLS path. I am able to create an unmanaged table if I don't specify any partition, using this:
" CONVERT TO DELTA parquet.PATH TO FOLDER CONTAINING A PARQUET FILE(S)"
But if I want to partition this file for query optimization, I use:
" CONVERT TO DELTA parquet.PATH TO FOLDER CONTAINING A PARQUET FILE(S), PARTITIONED_COLUMN DATATYPE"
It gives me an error like the one below.
Error text:
org.apache.spark.sql.AnalysisException: Expecting 1 partition column(s): [<PARTITIONED_COLUMN>], but found 0 partition column(s): [] from parsing the file name: abfss://mydirectory#myADLS.dfs.core.windows.net/level1/Level2/Table1.parquet.snappy;
There is no way I can create this Parquet file with partition details using ADF (I am open to suggestions).
Am I using the wrong syntax, or can this even be done?
OK, I found the answer to this. When you convert Parquet files to Delta using the above approach, Delta looks for the correct directory structure carrying the partition information, matching the name of the column mentioned in the "PARTITIONED BY" clause.
For example, I have a folder called /Parent. Inside it there is a directory structure with partition information; the partitioned parquet files are kept one level further down inside the partition folders, and the folder names look like this:
/Parent/Subfolder=0/part-00000-62ef2efd-b88b-4dd1-ba1e-3a146e986212.c000.snappy.parquet
/Parent/Subfolder=1/part-00000-fsgvfabv-b88b-4dd1-ba1e-3a146e986212.c000.snappy.parquet
/Parent/Subfolder=2/part-00000-fbfdfbfe-b88b-4dd1-ba1e-3a146e986212.c000.snappy.parquet
/Parent/Subfolder=3/part-00000-gbgdbdtb-b88b-4dd1-ba1e-3a146e986212.c000.snappy.parquet
In this case, Subfolder is the partition column created inside /Parent, and
CONVERT TO DELTA parquet.`/Parent/` PARTITIONED BY (Subfolder INT)
will take this directory structure, convert the whole partitioned dataset to Delta, and store the partition information in the metastore.
Summary: this command only works with partitioned Parquet files that are already laid out in partition folders. To create partitions from a single Parquet file you would have to take a different route, which I can explain later if you are interested ;)
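A hedged sketch of that different route, under the assumption that you first rewrite the single large Parquet file with Spark into a partitioned folder layout and then convert the result; the paths and the partition column part_col are placeholders:

# Rewrite the single large parquet file into a partitioned directory layout,
# then convert that directory to Delta. Paths and column name are placeholders.
src = "abfss://container@account.dfs.core.windows.net/level1/Level2/Table1.parquet.snappy"
dst = "abfss://container@account.dfs.core.windows.net/level1/Level2/Table1_partitioned"

df = spark.read.parquet(src)
df.write.mode("overwrite").partitionBy("part_col").parquet(dst)

# The destination now has the part_col=... folder structure CONVERT TO DELTA expects
spark.sql(f"CONVERT TO DELTA parquet.`{dst}` PARTITIONED BY (part_col INT)")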

Do Parquet Metadata Files Need to be Rolled-back?

When Parquet data is written with partitioning on its date column, we get a directory structure like:
/data
    _common_metadata
    _metadata
    _SUCCESS
    /date=1
        part-r-xxx.gzip
        part-r-xxx.gzip
    /date=2
        part-r-xxx.gzip
        part-r-xxx.gzip
If the partition date=2 is deleted without the involvement of Parquet utilities (via the shell or a file browser, etc.), do any of the metadata files need to be rolled back to the state when there was only the partition date=1?
Or is it ok to delete partitions at will and rewrite them (or not) later?
If you're using DataFrames, there is no need to roll back the metadata files.
For example:
You can write your DataFrame to S3
df.write.partitionBy("date").parquet("s3n://bucket/folderPath")
Then, manually delete one of your partitions (the date=1 folder in S3) using an S3 browser (e.g. CloudBerry).
Now you can either:
Load your data and see that it is still valid, except for the data you had in partition date=1:
sqlContext.read.parquet("s3n://bucket/folderPath").count
Or rewrite your DataFrame (or any other DataFrame with the same schema) using append mode:
df2.write.mode("append").partitionBy("date").parquet("s3n://bucket/folderPath")
You can also take a look at this question from databricks forum.
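For reference, a compact PySpark sketch of the round trip described above, using a hypothetical local path instead of S3 so it can be tried end to end (recent Spark versions do not write the _metadata summary files by default, so this only exercises the delete-and-append flow):

import shutil
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
path = "/tmp/parquet_partition_demo"   # hypothetical path standing in for the S3 bucket

df = spark.createDataFrame([(1, "a"), (1, "b"), (2, "c")], ["date", "value"])
df.write.mode("overwrite").partitionBy("date").parquet(path)

# Delete one partition directly on the filesystem, bypassing any Parquet utilities
shutil.rmtree(f"{path}/date=1")

# The remaining data still reads fine; only the deleted partition's rows are gone
print(spark.read.parquet(path).count())   # 1

# The partition can be rewritten later with append mode
df.filter("date = 1").write.mode("append").partitionBy("date").parquet(path)
print(spark.read.parquet(path).count())   # 3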
