Use Unmanaged table in Delta lake on Top of ADLS Gen2 - apache-spark

I use ADF to ingest the data from SQL server to ADLS GEN2 in a Parquet Snappy format, But the size of the file in sink goes upto 120 GB, The size causes me a lot of problem when I read this file in Spark and join the data from this file with many other Parquet files.
I am thinking to use Delta lake's unmanage table with the location pointing to the ADLS location, I am able to create an UnManaged table if I don't specify any partition using this
" CONVERT TO DELTA parquet.PATH TO FOLDER CONTAINING A PARQUET FILE(S)"
But if I would want to partition this file for query optimization
" CONVERT TO DELTA parquet.PATH TO FOLDER CONTAINING A PARQUET FILE(S), PARTITIONED_COLUMN DATATYPE"
It gives me error like the one mentioned in the screenshot (find the attachment).
Error in Text :-
org.apache.spark.sql.AnalysisException: Expecting 1 partition column(s): [<PARTITIONED_COLUMN>], but found 0 partition column(s): [] from parsing the file name: abfss://mydirectory#myADLS.dfs.core.windows.net/level1/Level2/Table1.parquet.snappy;
There is no way that I can create this Parquet file using ADF with partition details (Am open for suggestions)
Am I giving a wrong Syntax or this can be even done?

Ok, I found the answer to this. While you convert parquet files to delta using the above approach, Delta would look for the correct directory structure with partition information along with the name of the column mentioned in "Partitioned By" clause.
For E.g, I have a folder called /Parent, inside this I have a directory structure with partition information, the partitioned parquet files are kept one level further inside the partitioned folders, the folder names are like this
/Parent/Subfolder=0/part-00000-62ef2efd-b88b-4dd1-ba1e-3a146e986212.c000.snappy.parquet
/Parent/Subfolder=1/part-00000-fsgvfabv-b88b-4dd1-ba1e-3a146e986212.c000.snappy.parquet
/Parent/Subfolder=2/part-00000-fbfdfbfe-b88b-4dd1-ba1e-3a146e986212.c000.snappy.parquet
/Parent/Subfolder=3/part-00000-gbgdbdtb-b88b-4dd1-ba1e-3a146e986212.c000.snappy.parquet
in this case, subfolder is the partitions created inside parent.
CONVERT TO DELTA parquet./Parent/ partitioned by (Subfolder INT)
will just take this directory structure and convert the whole partitioned data to delta and will store the partitioned information in metastore.
Summary:- This command is only to utilize already created partitioned Parquet files. To create partition on single Parquet file you would have to take different route, Which I can explain you later if you are interested ;)

Related

Pyspark dataframe parquet vs delta : different number of rows

I have data written in Delta on HDFS. From what I understand, Delta is storing the data as parquet, just has an additional layer over it with advanced features.
But when reading data with Pyspark, I get a different result if dataframe is read with spark.read.parquet() or spark.read.format('delta').load()
df = spark.read.format('delta').load("my_data")
df.count()
> 184511389
df = spark.read.parquet("my_data")
df.count()
> 369022778
As you can see the difference is quite big.
Is there something I misunderstood about delta vs parquet?
Pyspark version is 2.4.
The most probable explanation is that you wrote into the Delta two times using the overwrite option. But Delta is versioned data format - when you use overwrite, it doesn't delete previous data, it just writes new files, and don't delete files immediately - they are just marked as deleted in the manifest file that Delta uses. And when you read from Delta, it knows which files are deleted, or not, and read only actual data. Actual deletion of the data files happens when you're performing VACUUM on Delta lake.
But when you read with Parquet, it doesn't have information about deleted files, so it reads everything that you have in directory, so you get twice as many rows.

How to write partitioned parquet files to blob storage

I want to load data from On Premise SQL SERVER to blob storage with copy activity in ADF, the target file is parquet, the size of this one is 5 Gb.
The pipeline work well and he wrote one parquet file, now i need to split this file in multiple parquet file to optimise loading data with Poly base and for another uses.
With Spark we can partition file in multiple file by this syntaxe :
df.repartition(5).write.parquet("path")
Short question, short answer.
Partitioned data: https://learn.microsoft.com/en-us/azure/data-factory/how-to-read-write-partitioned-data
Parquet format: https://learn.microsoft.com/en-us/azure/data-factory/format-parquet
Blob storage connector: https://learn.microsoft.com/en-us/azure/data-factory/connector-azure-blob-storage
Hope this helped!

Unable to Merge Small ORC Files using Spark

I have an external ORC table with a large number of the small files, which are coming from the source on daily basis. I need to merge these files into larger files.
I tried to load ORC files to the spark and save with overwrite method
val fileName = "/user/db/table_data/" //This table contains multiple partition on date column with small data files.
val df = hiveContext.read.format("orc").load(fileName)
df.repartition(1).write.mode(SaveMode.Overwrite).partitionBy("date").orc("/user/db/table_data/)
But mode(SaveMode.Overwrite) is deleting all the data from the HDFS. When I tried without mode(SaveMode.Overwrite) method, it was throwing error file already exists.
Can anyone help me to proceed?
As suggested by #Avseiytsev, I have stored by merged orc files in different folder as source in HDFS and moved the data to the table path after the completion of the job.

Create Hive ORC table from ORC files of other server

we have 2 clusters one Map R and another our own. We want created new setup in our own hardware using the Map R data.
I have copied all the orc files from the Map R cluster and followed the same folder structure
Created a orc formatted table with location of #1
then executed this command "MSCK REPAIR TABLE <>"
above steps passed without error, but when i query the partitions then job fails with below error
java.lang.IllegalArgumentException: Buffer size too small. size = 262144 needed = 4958903
at org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.readHeader(InStream.java:193)
at org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.read(InStream.java:238)
Can some one tell me can we create HIVE ORC partition tables directly from the orc files?
My storage is Azure data lake.
According to your description, based on my understanding, I think you want to copy all orc files from a cluster to another and load these orc files as a hive table.
For doing it, please just try to follow the command below to create external table for loading orcfile data.
CREATE EXTERNAL TABLE IF NOT EXSISTS <table name> (<column_name column_type>, ...)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS ORC
LOCATION '<orcfile path>'
If not aware of the columns list of an orc file, you can refer to the Hive manual ORC File Dump Utility to print the ORC file metadata in JSON format via hive --orcfiledump -j -p <location-of-orc-file-or-directory>.

Do Parquet Metadata Files Need to be Rolled-back?

When a Parquet file data is written with partitioning on its date column we get a directory structure like:
/data
_common_metadata
_metadata
_SUCCESS
/date=1
part-r-xxx.gzip
part-r-xxx.gzip
/date=2
part-r-xxx.gzip
part-r-xxx.gzip
If the partition date=2 is deleted without the involvement of Parquet utilities (via the shell or file browser, etc) do any of the metadata files need to be rolled back to when there was only the partition date=1?
Or is it ok to delete partitions at will and rewrite them (or not) later?
If you're using DataFrame there is no need to roll back the metadata files.
For example:
You can write your DataFrame to S3
df.write.partitionBy("date").parquet("s3n://bucket/folderPath")
Then, manually delete one of your partitions (date=1 folder in S3) using S3 browser (e.g. CloudBerry)
Now you can
Load your data and see that the data is still valid except the data you had in partition date=1 sqlContext.read.parquet("s3n://bucket/folderPath").count
Or rewrite your DataFrame (or any other DataFrame with the same schema) using append mode
df2.write.mode("append").partitionBy("date").parquet("s3n://bucket/folderPath")
You can also take a look at this question from databricks forum.

Resources