How to create a Synapse serverless pool partitioned table similar to Hive - Azure

In Hive we can create a partitioned table as:
CREATE EXTERNAL TABLE testdb.test_table(name string, age int)
PARTITIONED BY (dept string)
ROW FORMAT DELIMITED
STORED AS TEXTFILE
LOCATION '/path/to/dataFile/';
for files spread across locations like:
/path/to/dataFile/dept1
/path/to/dataFile/dept2
/path/to/dataFile/dept3
and then we can add the partitions like:
ALTER TABLE testdb.test_table ADD PARTITION (dept='dept1') LOCATION '/path/to/dataFile/dept1';
ALTER TABLE testdb.test_table ADD PARTITION (dept='dept2') LOCATION '/path/to/dataFile/dept2';
ALTER TABLE testdb.test_table ADD PARTITION (dept='dept3') LOCATION '/path/to/dataFile/dept3';
In Azure our files are spread across different folders in a container. I need to create a partitioned external table in the Synapse serverless pool. The syntax I am following is:
CREATE EXTERNAL TABLE [testdb].[test1]
(
    [STUDYID] varchar(2000),
    [SITEID] varchar(2000)
)
WITH
(
    LOCATION = '/<abc_location>/csv/archive/',
    DATA_SOURCE = [datalake],
    FILE_FORMAT = [csv_comma_values]
)
I was checking the Azure docs but didn't find any relevant documentation for this. Is there any way we can achieve something similar to the Hive code?

Partitioned external tables are currently not available in Azure Synapse serverless SQL pools.
The location starts from the root folder of the data source, and by specifying /**, you can return the files from subfolders. The external table will skip files whose names start with an underscore (_) or a period (.).
CREATE EXTERNAL TABLE sample_table
(
    firstname varchar(50),
    lastname varchar(50)
)
WITH (
    LOCATION = '/csv/**',
    DATA_SOURCE = blob_storage,
    FILE_FORMAT = csv_file_format
)
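Closer to the Hive partition-column behavior, the pattern Microsoft documents for serverless pools is a view over OPENROWSET that exposes the partition folder through the filepath() function; filtering on that column prunes which folders are actually read. A minimal sketch, reusing the columns from the Hive example (name, age, dept) and the [datalake] data source from the question — the view name, folder layout, and CSV options are assumptions:
CREATE VIEW dbo.test_table_partitioned
AS
SELECT
    result.filepath(1) AS dept,  -- folder name matched by the first * wildcard
    result.*
FROM OPENROWSET(
        BULK 'path/to/dataFile/*/*.csv',
        DATA_SOURCE = 'datalake',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0',
        HEADER_ROW = TRUE
    )
    WITH (
        name varchar(2000),
        age  int
    ) AS result;
A query such as SELECT * FROM dbo.test_table_partitioned WHERE dept = 'dept2' then only scans the files under the dept2 folder, which is the closest serverless equivalent of ALTER TABLE ... ADD PARTITION.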

Related

How to create an external unmanaged table in Delta Lake in Azure Databricks

I have this scenario where I am reading files from my blob storage and then creating a Delta table in Azure Databricks. When you create a Delta table in Databricks, there are delta files created by default which we can't access.
The other way is to create an unmanaged Delta table and specify your own path to store these delta files. I would like to know how to do this. How can I specify where I want to store my delta files? And what does an external table mean; how can I specify the path for an external table?
I tried the code below and it fails on the create external table command:
spark.conf.set("fs.azure.account.auth.type.xyzstorageaccount.dfs.core.windows.net", "SAS")
spark.conf.set("fs.azure.sas.token.provider.type.xyzstorageaccount.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider")
spark.conf.set("fs.azure.sas.fixed.token.xyzstorageaccount.dfs.core.windows.net", "<sas token>")
%sql
CREATE EXTERNAL TABLE IF NOT EXISTS axytable
LOCATION 'abfss://xyzstorageaccount/tables';
I know I might be doing something wrong here, and I have not completely understood what an external table actually means. Does it stay inside the Databricks cluster, and instead of providing my storage account path, should I provide a Databricks path? How do I create a custom catalog, which also seems to be a requirement here? Also, the code below works and I am able to write to the storage account at the path provided, but I fail to understand it. Any SME who can help me out here?
path = "abfss://xxx#xyzstorageacount.dfs.core.windows.net/XYY"
(DF.writeStream
.format('delta')
.outputMode("append")
.trigger(once=True)
.option("mergeSchema", "true")
.option('checkpointLocation', path+"/bronze_checkpoint")
.start(path + "/myTable"))
This documentation provides a good description of what managed tables are and how they differ from unmanaged tables. In a nutshell, managed tables are created in a "default" location, and both data and table metadata are managed by the Hive metastore or Unity Catalog, so when you drop a table, the actual data is deleted as well. Unmanaged tables are different: only the metadata is controlled by the Hive metastore or Unity Catalog; if you drop the table, only the table definition is dropped, not the data.
You can create an unmanaged table in different ways:
Create it from scratch using the syntax create table <name> (columns definition) using delta location 'path' (doc)
Create a table for existing data using the syntax create table <name> using delta location 'path' (you don't need to provide the column definitions) (doc) — a minimal SQL sketch of both of these forms follows the streaming example below
Provide the path option with the path to the data when saving the table using saveAsTable (for batch) or toTable (streaming). In your example it will be the following:
path = "abfss://xxx#xyzstorageacount.dfs.core.windows.net/XYY"
(DF.writeStream
.format('delta')
.outputMode("append")
.trigger(once=True)
.option("mergeSchema", "true")
.option('checkpointLocation', path+"/bronze_checkpoint")
.option('path', path + "/myTable")
.toTable('<table_name>')
)
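For reference, here is a minimal Spark SQL sketch of the first two options, assuming the target paths already contain (or will contain) Delta files — the table names, container, and storage account below are placeholders, not values from the question:
-- Option 1: unmanaged (external) Delta table created from scratch, with an explicit schema
CREATE TABLE my_new_table (
  id BIGINT,
  name STRING
)
USING DELTA
LOCATION 'abfss://<container>@<storageaccount>.dfs.core.windows.net/tables/my_new_table';

-- Option 2: register Delta data that already exists; the schema is read from the Delta transaction log
CREATE TABLE my_existing_table
USING DELTA
LOCATION 'abfss://<container>@<storageaccount>.dfs.core.windows.net/tables/existing_data';
In both cases dropping the table later removes only the metastore entry; the files under the LOCATION path are left in place.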

How to specify Delta table properties when writing a streaming Spark dataframe

Let's assume I have a streaming dataframe, and I'm writing it to Databricks Delta Lake:
someStreamingDf.writeStream
.format("delta")
.outputMode("append")
.start("targetPath")
and then creating a delta table out of it:
spark.sql("CREATE TABLE <TBL_NAME> USING DELTA LOCATION '<targetPath>'
TBLPROPERTIES ('delta.autoOptimize.optimizeWrite'=true)")
which fails with AnalysisException: The specified properties do not match the existing properties at <targetPath>.
I know I can create a table beforehand:
CREATE TABLE <TBL_NAME> (
-- columns
)
USING DELTA LOCATION "< targetPath >"
TBLPROPERTIES (
"delta.autoOptimize.optimizeWrite" = true,
....
)
and then just write to it, but writing this SQL with all the columns and their types looks like a bit of extra/unnecessary work. So is there a way to specify these TBLPROPERTIES while writing to a delta table (for the first time) and not beforehand?
If you look into the documentation, you can see that you can set the following property:
spark.conf.set(
"spark.databricks.delta.properties.defaults.autoOptimize.optimizeWrite", "true")
and then all newly created tables will have delta.autoOptimize.optimizeWrite set to true.
Another approach: create the table without the option, and then try alter table ... set tblproperties (not tested, though); see the sketch below.
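A minimal sketch of that statement, using the same placeholder table name as above:
ALTER TABLE <TBL_NAME>
SET TBLPROPERTIES ('delta.autoOptimize.optimizeWrite' = 'true');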

Write a Spark DataFrame to an existing Delta table by providing TABLE NAME instead of TABLE PATH

I am trying to write a Spark DataFrame into an existing Delta table.
I have multiple scenarios where I save data into different tables, as shown below.
SCENARIO-01:
I have an existing Delta table and I have to write a DataFrame into that table with the mergeSchema option, since the schema may change for each load.
I am doing this with the command below, by providing the Delta table path:
finalDF01.write.format("delta").option("mergeSchema", "true").mode("append") \
.partitionBy("part01","part02").save(finalDF01DestFolderPath)
I just want to know whether this can be done by providing the existing Delta TABLE NAME instead of the Delta PATH.
This has been resolved by updating the data write command as below:
finalDF01.write.format("delta").option("mergeSchema", "true").mode("append") \
.partitionBy("part01","part02").saveAsTable(finalDF01DestTableName)
Is this the correct way?
SCENARIO 02:
I have to update the existing table if the record already exists, and insert a new record if not.
For this I am currently doing the following:
spark.sql("SET spark.databricks.delta.schema.autoMerge.enabled = true")
DeltaTable.forPath(DestFolderPath)
.as("t")
.merge(
finalDataFrame.as("s"),
"t.id = s.id AND t.name= s.name")
.whenMatched().updateAll()
.whenNotMatched().insertAll()
.execute()
I tried with the script below:
destMasterTable.as("t")
.merge(
vehMasterDf.as("s"),
"t.id = s.id")
.whenNotMatched().insertAll()
.execute()
but I am getting the error below (even with alias instead of as):
error: value as is not a member of String
destMasterTable.as("t")
Here also I am using the Delta table path as the destination. Is there any way we could provide the Delta TABLE NAME instead of the TABLE PATH?
It would be good to provide the TABLE NAME instead of the TABLE PATH, so that if we change the table path later it will not affect the code.
I have not seen anywhere in the Databricks documentation an example of providing a table name along with mergeSchema and autoMerge.
Is it possible to do so?
To use existing data as a table instead of a path, you either need to have used saveAsTable from the beginning, or you can register the existing data in the Hive metastore using the SQL command CREATE TABLE ... USING, like this (the syntax could be slightly different depending on whether you're running on Databricks or OSS Spark, and on the Spark version):
CREATE TABLE IF NOT EXISTS my_table
USING delta
LOCATION 'path_to_existing_data'
After that, you can use saveAsTable with the table name.
For the second question: it looks like destMasterTable is just a String. To refer to an existing table, you need to use the forName function from the DeltaTable object (doc):
DeltaTable.forName(destMasterTable)
.as("t")
...
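If the target table is registered in the metastore, another option is to express the upsert in Spark SQL against the table name rather than through the DeltaTable API. A sketch, assuming the source DataFrame has first been registered as a temporary view; the table and view names are placeholders:
-- run vehMasterDf.createOrReplaceTempView("veh_updates") before this statement
MERGE INTO dest_master_table AS t
USING veh_updates AS s
ON t.id = s.id
WHEN NOT MATCHED THEN
  INSERT *;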

Hive managed tables are not dropped on Azure Data Lake Store

I recently discovered what I call a bug, and I'd like to be sure it is one.
We work on an Azure platform with HDInsight 3.6 and two separate storages: a blob storage and a Data Lake Store.
For most of our work we use Hive.
From what we know, when you drop a managed table the data under this table is dropped too.
To be sure of this we tried the following:
CREATE TABLE test(id String) PARTITIONED BY (part String) STORED AS ORC;
INSERT INTO TABLE test PARTITION(part='part1') VALUES('id1');
INSERT INTO TABLE test PARTITION(part='part2') VALUES('id2');
INSERT INTO TABLE test PARTITION(part='part3') VALUES('id3');
These queries are executed on the default database, i.e. on the blob storage.
The data is correctly stored under the location of the table test: if we check, we have three directories part=* with files under them.
Then I drop the table:
DROP TABLE test;
If we check the database directory there is no longer a directory named test, so the data is indeed dropped, and we expect this to be the correct Hive behavior.
And now here is the trick: for our work we use databases located on a Data Lake Store, and when we use this code:
use database_located_on_adl;
CREATE TABLE test(id String) PARTITIONED BY (part String) STORED AS ORC;
INSERT INTO TABLE test PARTITION(part='part1') VALUES('id1');
INSERT INTO TABLE test PARTITION(part='part2') VALUES('id2');
INSERT INTO TABLE test PARTITION(part='part3') VALUES('id3');
DROP TABLE test;
The table is created and the data is stored correctly, BUT the data is not dropped by the DROP TABLE command...
Am I missing something? Or is this normal behavior?
If anyone sees this old post and has the same issue: our problem was that we were missing write permission on the Hive trash (the /user/hiveUserName/.Trash HDFS folder).
Hope this can help!

How to create an EXTERNAL Spark table from data in HDFS

I have loaded a parquet table from HDFS into a DataFrame:
val df = spark.read.parquet("hdfs://user/zeppelin/my_table")
I now want to expose this table to Spark SQL, but it must be a persistent table because I want to access it from a JDBC connection or other Spark sessions.
A quick way could be to call the df.write.saveAsTable method, but in this case it will materialize the contents of the DataFrame and create a pointer to the data in the Hive metastore, creating another copy of the data in HDFS.
I don't want to have two copies of the same data, so I would like to create something like an external table that points to the existing data.
To create a Spark external table you must specify the "path" option of the DataFrameWriter. Something like this:
df.write.
option("path","hdfs://user/zeppelin/my_mytable").
saveAsTable("my_table")
The problem, though, is that it will empty your HDFS path hdfs://user/zeppelin/my_mytable, eliminating your existing files, and will cause an org.apache.spark.SparkException: Job aborted. This looks like a bug in the Spark API...
Anyway, the workaround (tested in Spark 2.3) is to create the external table from a Spark DDL instead. If your table has many columns, writing the DDL by hand could be a hassle. Fortunately, starting from Spark 2.0, you can call the DDL SHOW CREATE TABLE to let Spark do the hard work. The problem is that you can only run SHOW CREATE TABLE against a table that is already persisted.
If the table is pretty big, I recommend getting a sample of the table, persisting it to another location, and then getting the DDL. Something like this:
// Create a sample of the table
val df = spark.read.parquet("hdfs://user/zeppelin/my_table")
df.limit(1).write.
option("path", "/user/zeppelin/my_table_tmp").
saveAsTable("my_table_tmp")
// Now get the DDL, do not truncate output
spark.sql("SHOW CREATE TABLE my_table_tmp").show(1, false)
You are going to get a DDL like:
CREATE TABLE `my_table_tmp` (`ID` INT, `Descr` STRING)
USING parquet
OPTIONS (
`serialization.format` '1',
path 'hdfs:///user/zeppelin/my_table_tmp')
You would then change this to use the original name of the table and the path to the original data. You can now run the following to create the Spark external table pointing to your existing HDFS data:
spark.sql("""
CREATE TABLE `my_table` (`ID` INT, `Descr` STRING)
USING parquet
OPTIONS (
`serialization.format` '1',
path 'hdfs:///user/zeppelin/my_table')""")
