Hive managed tables are not dropped on Azure Data Lake Store - azure

I recently discovered what I would call a bug, and I'd like to be sure it is one.
We work on an Azure platform with HDInsight 3.6 and two separate storages: a Blob Storage and a Data Lake Store.
For most of our work we use Hive.
From what we know, when you drop a managed table the data under that table is dropped too.
To be sure of this we tried the following:
CREATE TABLE test(id String) PARTITIONED BY (part String) STORED AS ORC ;
INSERT INTO TABLE test PARTITION(part='part1') VALUES('id1') ;
INSERT INTO TABLE test PARTITION(part='part2') VALUES('id2') ;
INSERT INTO TABLE test PARTITION(part='part3') VALUES('id3') ;
These queries are executed on the default database, i.e. on the blob storage.
The data is stored as expected under the location of the table test: if we check, we have three directories part=* with files under them.
Then I drop the table:
DROP TABLE test ;
If we check the database directory there is no longer a directory named test, so the data is dropped as well; this is what we expected and it is the correct Hive behavior.
And here is the trick: for our work we use databases located on a Data Lake Store, and when we use this code:
use database_located_on_adl ;
CREATE TABLE test(id String) PARTITIONED BY (part String) STORED AS ORC ;
INSERT INTO TABLE test PARTITION(part='part1') VALUES('id1') ;
INSERT INTO TABLE test PARTITION(part='part2') VALUES('id2') ;
INSERT INTO TABLE test PARTITION(part='part3') VALUES('id3') ;
DROP TABLE test ;
The table is created and the data is stored correctly, BUT the data is not dropped by the DROP TABLE command...
Am I missing something? Or is this normal behavior?

If anyone sees this old post and has the same issue: our problem was that we were missing write permission on the Hive trash (the /user/hiveUserName/.Trash HDFS folder).
Hope this can help!
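For reference, a hedged sketch of how one might check and grant that write access from the cluster (the trash path is the one mentioned above; driving hdfs dfs from Python via subprocess is an assumption, not something shown in the original post):
import subprocess

trash = "/user/hiveUserName/.Trash"

# Inspect the current permissions on the trash folder itself
subprocess.run(["hdfs", "dfs", "-ls", "-d", trash], check=True)

# Grant write access to the owning user (adjust to your own security model / ACLs)
subprocess.run(["hdfs", "dfs", "-chmod", "-R", "u+w", trash], check=True)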

Related

Azure Data Studio: '_delta_log/*.*' cannot be listed

I'm trying to query my delta tables using Azure Synapse Serverless SQL Pool.
I log in to Azure Data Studio using the SQL admin credentials.
This is a simple query to a table that I'm trying to make:
SELECT
    TOP 100 *
FROM
    OPENROWSET(
        BULK 'https://(...).dfs.core.windows.net/(...)/table/',
        FORMAT = 'DELTA'
    ) AS [result]
I get the error:
Content of directory on path 'https://.../table/_delta_log/*.*' cannot be listed.
If I query any other table, e.g. table_copy, I have no error.
I can query every table I have, except this one.
Following every piece of documentation and every thread I could find, I tried the following:
(IAM) Setting up Storage Blob Contributor, Storage Blob Owner, Storage Queue Data Contributor and Owner
Going into the ACLs and setting up Read, Write, Execute Access and Default permissions for the Managed Identity (Synapse Studio)
Propagating the ACLs to every child
Restoring the default permissions for the folder
Making a copy of the table, deleting the original, and overwriting it again (PySpark):
# Read original table
table_copy = (spark.read.format("delta")
    .option("recursiveFileLookup", "True")
    .load("abfss://...@....dfs.core.windows.net/.../table/"))
# Create a copy of it
(table_copy.write.format('delta')
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .save("abfss://...@....dfs.core.windows.net/.../table_copy/"))
# Remove original one
dbutils.fs.rm('abfss://...@....dfs.core.windows.net/.../table/', recurse=True)
# Overwrite it
(table_copy.write.format('delta')
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .save("abfss://...@....dfs.core.windows.net/.../table/"))
If I make a copy of the table to table_copy, I can read it.
Note that in Azure Synapse UI I can query the table. Outside of it I can't.
It seems like the permission and firewall settings are set up correctly.
One thing you can try is to check that the table is in the correct format (Delta format) and has the correct schema, and also check whether the _delta_log directory was created or not.
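For instance, a minimal check sketch (the abfss path is a placeholder, and it assumes a notebook where dbutils is available; in Synapse you could use mssparkutils.fs.ls instead):
delta_path = "abfss://<container_name>@<storage_account_name>.dfs.core.windows.net/<folder>/table/"

# The table is only readable as Delta if _delta_log exists and contains commit files
for f in dbutils.fs.ls(delta_path + "_delta_log/"):
    print(f.path)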
Try this approach:
First, I don't have any delta table, so I created a sample dataframe df using spark.read.
Then I wrote the dataframe df in delta format to an abfss://<container_name>@<storage_account_name>... path and, in parallel, created a table named test_table using saveAsTable:
table_path = "abfss://<container_name>@<storage_account_name>.dfs.core.windows.net/<folder>"
df.write.format("delta").mode("overwrite").option("path", table_path).saveAsTable("test_table")
You can check test_table and the abfss storage location. I successfully got the data in delta format.
An alternative way is to create a new delta table and copy the data from the old table into the new one, with a query like the sketch below:
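A hedged sketch of such a query (the original answer did not include it; table names here are placeholders):
# Create a fresh delta table and copy the rows over in one statement
spark.sql("CREATE TABLE new_table USING DELTA AS SELECT * FROM old_table")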

How to create an external unmanaged table in delta lake in Azure Databricks

I have a scenario where I am reading files from my blob storage and then creating a delta table in Azure Databricks. When you create a delta table in Databricks, there are delta files which are created by default and which we can't access.
The other way is to create an unmanaged delta table and specify your own path to store these delta files. I would like to know how to do this. How can I specify where I want to store my delta files? And what does this external table mean; how can I specify the path for an external table?
I tried the code below and it fails on the CREATE EXTERNAL TABLE command:
spark.conf.set("fs.azure.account.auth.type.xyzstorageaccount.dfs.core.windows.net", "SAS")
spark.conf.set("fs.azure.sas.token.provider.type.xyzstorageaccount.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider")
spark.conf.set("fs.azure.sas.fixed.token.xyzstorageaccount.dfs.core.windows.net", "<sas token>")
%sql
CREATE EXTERNAL TABLE IF NOT EXISTS axytable
LOCATION 'abfss://xyzstorageaccount/tables';
I know I might be doing something wrong here and I have not completely understood what an external table actually means. Does it stay inside the Databricks cluster, and instead of providing my storage account path, should I provide a Databricks path? How do I create a custom catalog, which also seems to be a requirement here? Also, the code below works and I am able to write to the storage account at the path provided, but I fail to understand it. Any SME who can help me out here?
path = "abfss://xxx#xyzstorageacount.dfs.core.windows.net/XYY"
(DF.writeStream
    .format('delta')
    .outputMode("append")
    .trigger(once=True)
    .option("mergeSchema", "true")
    .option('checkpointLocation', path + "/bronze_checkpoint")
    .start(path + "/myTable"))
This documentation provides a good description of what managed tables are and how they differ from unmanaged tables. In a nutshell, managed tables are created in a "default" location, and both data & table metadata are managed by the Hive metastore or Unity Catalog, so when you drop a table the actual data is deleted as well. Unmanaged tables are different: only the metadata is controlled by the Hive metastore or Unity Catalog, so if you drop the table, only the table definition will be dropped, not the data.
You can create an unmanaged table in different ways (a notebook sketch of the first two follows the streaming example below):
Create it from scratch using the syntax create table <name> (columns definition) using delta location 'path' (doc)
Create a table for existing data using the syntax create table name using delta location 'path' (you don't need to provide a columns definition) (doc)
Provide the path option with the path to the data when saving the table using saveAsTable (for batch) or toTable (streaming). In your example it will be the following:
path = "abfss://xxx#xyzstorageacount.dfs.core.windows.net/XYY"
(DF.writeStream
    .format('delta')
    .outputMode("append")
    .trigger(once=True)
    .option("mergeSchema", "true")
    .option('checkpointLocation', path + "/bronze_checkpoint")
    .option('path', path + "/myTable")
    .toTable('<table_name>')
)
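For completeness, here is a hedged sketch of the first two options run from a notebook (the column list and paths are placeholders, not taken from the original post):
# Option 1: unmanaged (external) table created from scratch with an explicit schema and location
spark.sql("""
    CREATE TABLE IF NOT EXISTS axytable (id INT, name STRING)
    USING DELTA
    LOCATION 'abfss://<container>@<account>.dfs.core.windows.net/tables/axytable'
""")

# Option 2: register a table on top of existing delta data; no column definitions needed
spark.sql("""
    CREATE TABLE IF NOT EXISTS axytable_existing
    USING DELTA
    LOCATION 'abfss://<container>@<account>.dfs.core.windows.net/existing/delta/folder'
""")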

Write spark Dataframe to an existing Delta Table by providing TABLE NAME instead of TABLE PATH

I am trying to write a spark dataframe into an existing delta table.
I have multiple scenarios where I could save data into different tables, as shown below.
SCENARIO-01:
I have an existing delta table and I have to write the dataframe into that table with the option mergeSchema, since the schema may change for each load.
I am doing this with the command below by providing the delta table path:
finalDF01.write.format("delta").option("mergeSchema", "true").mode("append") \
.partitionBy("part01","part02").save(finalDF01DestFolderPath)
I just want to know whether this can be done by providing the existing delta TABLE NAME instead of the delta PATH.
This has been resolved by updating the data write command as below:
finalDF01.write.format("delta").option("mergeSchema", "true").mode("append") \
.partitionBy("part01","part02").saveAsTable(finalDF01DestTableName)
Is this the correct way?
SCENARIO 02:
I have to update the existing table if the record already exists and if not insert a new record.
For this I am currently doing as shown below.
spark.sql("SET spark.databricks.delta.schema.autoMerge.enabled = true")
DeltaTable.forPath(DestFolderPath)
  .as("t")
  .merge(
    finalDataFrame.as("s"),
    "t.id = s.id AND t.name = s.name")
  .whenMatched().updateAll()
  .whenNotMatched().insertAll()
  .execute()
I tried the script below:
destMasterTable.as("t")
  .merge(
    vehMasterDf.as("s"),
    "t.id = s.id")
  .whenNotMatched().insertAll()
  .execute()
but I am getting the below error (even with alias instead of as):
error: value as is not a member of String
destMasterTable.as("t")
Here also I am using the delta table path as the destination. Is there any way we could provide the delta TABLE NAME instead of the TABLE PATH?
It would be good to provide the TABLE NAME instead of the TABLE PATH, so that if we change the table path later it will not affect the code.
I have not seen anywhere in the Databricks documentation an example providing a table name along with mergeSchema and autoMerge.
Is it possible to do so?
To use existing data as a table instead of a path you either need to have used saveAsTable from the beginning, or you can register the existing data in the Hive metastore using the SQL command CREATE TABLE USING, like this (the syntax could be slightly different depending on whether you're running on Databricks or OSS Spark, and depending on the version of Spark):
CREATE TABLE IF NOT EXISTS my_table
USING delta
LOCATION 'path_to_existing_data'
after that, you can use saveAsTable.
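A minimal PySpark sketch of that flow (my_table and the path are placeholders carried over from the snippet above):
# Register the existing delta folder under a table name once...
spark.sql("CREATE TABLE IF NOT EXISTS my_table USING delta LOCATION 'path_to_existing_data'")

# ...then subsequent loads can append by TABLE NAME instead of by path
(finalDF01.write.format("delta")
    .option("mergeSchema", "true")
    .mode("append")
    .partitionBy("part01", "part02")
    .saveAsTable("my_table"))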
For the second question: it looks like destMasterTable is just a String. To refer to an existing table, you need to use the function forName from the DeltaTable object (doc):
DeltaTable.forName(destMasterTable)
.as("t")
...
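In PySpark the equivalent merge by table name would look roughly like this (a hedged sketch; the table name is a placeholder and it assumes the delta.tables Python package is available):
from delta.tables import DeltaTable

(DeltaTable.forName(spark, "<database>.<dest_master_table>")  # refer to the table by NAME, not path
    .alias("t")
    .merge(vehMasterDf.alias("s"), "t.id = s.id")
    .whenNotMatchedInsertAll()
    .execute())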

Spark Hive: temporary tables disappear within session

I create a couple of temporary tables using
hive.executeUpdate("CREATE TEMPORARY TABLE AS SELECT ...")
in Hive from Spark.
I check all tables with
hive.showTables().show()
in the session between each query I perform later (all like INSERT INTO ... SELECT ...) and the temporary tables are being dropped unpredictably.
This is not happening in HiveQL.
Anyone had similar issues?
From the API calls, I think you are using the hortonworks-spark connector.
You have to prefix your tables with databaseschema.table everywhere,
or set the database like this:
hive.setDatabase("default")
then run your CTAS:
hive.executeUpdate("CREATE TEMPORARY TABLE AS SELECT ...")
for example:
val sql = s"create temporary table $tmpTableName like $dbName.$tabName "
and then
INSERT INTO ... SELECT ...
whatever you want to do.
Q: This is not happening in HiveQL.
Anyone had similar issues?
In HiveQL you are working in the same database schema the whole time; that's the reason it works as expected.
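Putting the pieces together, a hedged PySpark sketch of the suggested flow with the Hive Warehouse Connector (the pyspark_llap import and the table names are assumptions, not taken from the original post):
from pyspark_llap import HiveWarehouseSession

# Build the HWC session and pin the database once, so every later statement
# resolves against the same schema instead of an unpredictable default
hive = HiveWarehouseSession.session(spark).build()
hive.setDatabase("default")

hive.executeUpdate("CREATE TEMPORARY TABLE tmp_table LIKE default.source_table")
hive.executeUpdate("INSERT INTO tmp_table SELECT * FROM default.source_table")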

What is the best way to cleanup and recreate databricks delta table?

I am trying to clean up and recreate a databricks delta table for integration tests.
I want to run the tests on a devops agent, so I am using JDBC (Simba driver), but it says statement type "DELETE" is not supported.
When I clean up the underlying DBFS location using the DBFS API "rm -r", it cleans up the table, but the next read after recreation gives an error: A file referenced in the transaction log cannot be found. This occurs when data has been manually deleted from the file system rather than using the table DELETE statement.
Also, if I simply do a DELETE from the delta table, I still see the underlying dbfs directory and the files intact. How can I clean up the delta table as well as the underlying files gracefully?
You can use the VACUUM command to do the clean up. I haven't used it yet.
If you are using spark, you can use the overwriteSchema option to reload the data.
If you can provide more details on how you are using it, that would help.
The full set of steps is as follows:
When you do a DROP TABLE and a DELETE FROM <table name>, the following things happen:
DROP TABLE: drops your table, but the data still resides. (Also, you can't create a new table definition with changes in the schema at the same location.)
DELETE FROM TABLE: deletes data from the table, but the transaction log still resides.
So, Step 1: DROP TABLE schema.Tablename
Step 2: %fs rm -r /mnt/path/where/your/table/definition/is/pointed/fileNames.parquet
Step 3: %fs ls to make sure there is no data and also no transaction log at that location
Step 4: Now re-run your CREATE TABLE statement with any changes you desire, USING delta LOCATION '/mnt/path/where/your/table/definition/is/pointed/fileNames.parquet'
Step 5: Start using the table and verify using %sql desc formatted schema.Tablename
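A hedged notebook sketch of those steps in one place (the column list is a placeholder; the table name and mount path are the ones used above):
table_path = "/mnt/path/where/your/table/definition/is/pointed/fileNames.parquet"

# Step 1: drop the table definition from the metastore
spark.sql("DROP TABLE IF EXISTS schema.Tablename")

# Steps 2-3: remove the data files AND the _delta_log, then confirm the location is empty
dbutils.fs.rm(table_path, recurse=True)
print(dbutils.fs.ls("/mnt/path/where/your/table/definition/is/pointed"))

# Step 4: re-create the table pointing at the (now empty) location
spark.sql(f"CREATE TABLE schema.Tablename (id INT, name STRING) USING delta LOCATION '{table_path}'")

# Step 5: verify the definition
spark.sql("DESC FORMATTED schema.Tablename").show(truncate=False)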
Make sure that you are not creating an external table. There are two types of tables:
1) Managed tables
2) External tables (the location of the dataset is specified)
When you delete a managed table, Spark is responsible for the cleanup of the metadata of that table stored in the metastore and for the cleanup of the data (files) present in that table.
But for an external table, Spark does not own the data, so when you delete an external table, only the metadata present in the metastore is deleted by Spark, and the data (files) which were present in that table are not deleted.
After this, if you confirm that your tables are managed tables and dropping the table still does not delete the files, then you can use the VACUUM command:
VACUUM <databaseName>.<TableName> [RETAIN NUM HOURS]
This will clean up all the uncommitted files from the table's folder.
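For example, from a notebook (the database and table names are placeholders; 168 hours matches the default 7-day retention):
spark.sql("VACUUM mydb.mytable RETAIN 168 HOURS")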
I hope this helps you.
import os

path = "<Your Azure Databricks Delta Lake Folder Path>"
# Remove each delta table folder (and its contents) under that path
for delta_table in os.listdir(path):
    dbutils.fs.rm(path + delta_table, recurse=True)
How to find your <Your Azure Databricks Delta Lake Folder Path>:
Step 1: Go to Databricks.
Step 2: Click Data - Create Table - DBFS. Then, you will find your delta tables.
