What is the best way to clean up and recreate a Databricks Delta table?

I am trying to clean up and recreate a Databricks Delta table for integration tests.
I want to run the tests on a DevOps agent, so I am using JDBC (Simba driver), but it says statement type "DELETE" is not supported.
When I clean up the underlying DBFS location using the DBFS API ("rm -r"), it removes the table, but the next read after recreating it gives an error: A file referenced in the transaction log cannot be found. This occurs when data has been manually deleted from the file system rather than using the table DELETE statement.
Also, if I simply run DELETE on the Delta table, I still see the underlying DBFS directory and the files intact. How can I clean up the Delta table as well as the underlying files gracefully?

You can use the VACUUM command to do the cleanup. I haven't used it yet.
If you are using Spark, you can use the overwriteSchema option to reload the data.
If you can provide more details on how you are using it, that would help.
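A rough sketch of the overwriteSchema suggestion, in case it helps: for a test fixture, overwriting the table in place avoids deleting files by hand. The function name reset_table and the table name in the docstring are made up for illustration; it assumes df is a Spark DataFrame holding the fresh test data.

```python
def reset_table(df, table_name):
    """Replace the contents (and, if needed, the schema) of a Delta table.

    `df` is assumed to be a Spark DataFrame with the fresh test data.
    `table_name` is a hypothetical name such as "test_db.orders".
    """
    (
        df.write.format("delta")
        .mode("overwrite")                  # replace all existing rows
        .option("overwriteSchema", "true")  # let the new schema replace the old one
        .saveAsTable(table_name)
    )
```

The old data files stay on storage until they are vacuumed, but readers only see the new version of the table.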

The steps that work are as follows.
When you run DROP TABLE or DELETE FROM <table name>, the following happens:
DROP TABLE: drops your table, but the data still resides at its location. (Also, you can't create a new table definition with a changed schema at the same location.)
DELETE FROM <table name>: deletes the data from the table, but the transaction log still resides.
So:
Step 1: DROP TABLE schema.Tablename
Step 2: %fs rm -r /mnt/path/where/your/table/definition/is/pointed/fileNames.parquet
Step 3: %fs ls to make sure there is no data and no transaction log left at that location
Step 4: Now re-run your CREATE TABLE statement, with any changes you want, USING delta LOCATION '/mnt/path/where/your/table/definition/is/pointed/fileNames.parquet'
Step 5: Start using the table and verify it with %sql desc formatted schema.Tablename
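The steps above can be sketched as one helper, assuming the code runs on the cluster (notebook or job) where spark and dbutils are available. The function name and all arguments are illustrative, not part of any Databricks API; create_sql is whatever CREATE TABLE statement you would re-run in step 4.

```python
def recreate_delta_table(spark, dbutils, table_name, location, create_sql):
    """Drop a table, remove its files and transaction log, then recreate it.

    All arguments are supplied by the caller; `create_sql` is the full
    CREATE TABLE ... USING delta LOCATION ... statement to re-run.
    """
    # Step 1: remove the table definition from the metastore.
    spark.sql(f"DROP TABLE IF EXISTS {table_name}")

    # Step 2: remove the data files together with the _delta_log directory.
    dbutils.fs.rm(location, True)  # second argument is `recurse`

    # Step 3: confirm nothing is left (ls raises if the path no longer exists).
    try:
        leftovers = dbutils.fs.ls(location)
    except Exception:
        leftovers = []
    if leftovers:
        raise RuntimeError(f"{location} still contains {len(leftovers)} entries")

    # Step 4: recreate the table at the same location.
    spark.sql(create_sql)
```

Because the transaction log is removed together with the data files, the recreated table starts from an empty log rather than one that still references deleted files.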

Make sure that you are not creating an external table. There are two types of tables:
1) Managed tables
2) External tables (the location of the dataset is specified)
When you drop a managed table, Spark is responsible for cleaning up both the table's metadata stored in the metastore and the data (files) in that table.
For an external table, Spark does not own the data, so when you drop it, only the metadata in the metastore is deleted; the data (files) that were in that table are not deleted.
If you have confirmed that your tables are managed tables and dropping them still does not delete the files, you can use the VACUUM command:
VACUUM <databaseName>.<TableName> [RETAIN NUM HOURS]
This removes files in the table's folder that the table no longer references and that are older than the retention threshold.
I hope this helps you.
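To check which kind of table you have before relying on DROP TABLE, something like the following may help. It is only a sketch: the two function names are made up, and it assumes the table is registered in the metastore so that spark.catalog.listTables can see it.

```python
def table_type(spark, database, table):
    """Return the metastore table type, e.g. "MANAGED" or "EXTERNAL"."""
    for entry in spark.catalog.listTables(database):
        if entry.name.lower() == table.lower():
            return entry.tableType
    raise ValueError(f"{database}.{table} not found in the metastore")


def vacuum_statement(database, table, retain_hours=None):
    """Build a VACUUM statement; the RETAIN clause is optional."""
    statement = f"VACUUM {database}.{table}"
    if retain_hours is not None:
        statement += f" RETAIN {retain_hours} HOURS"
    return statement
```

Usage would be along the lines of spark.sql(vacuum_statement("my_db", "my_table", 168)). Note that, by default, Delta refuses a retention shorter than 168 hours unless spark.databricks.delta.retentionDurationCheck.enabled is set to false.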

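path = "<Your Azure Databricks Delta Lake Folder Path>"
# dbutils.fs.ls lists DBFS paths directly; os.listdir only sees the driver's local file system
for delta_table in dbutils.fs.ls(path):
    # recurse=True is needed to delete a non-empty table directory
    dbutils.fs.rm(delta_table.path, recurse=True)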
How to find your <Your Azure Databricks Delta Lake Folder Path>:
Step 1: Go to Databricks.
Step 2: Click Data - Create Table - DBFS. Then, you will find your delta tables.

Related

Azure Data Studio: _delta_log/.' cannot be listed

I'm trying to query my Delta tables using an Azure Synapse Serverless SQL Pool.
I log in to Azure Data Studio using the SQL admin credentials.
This is the simple query I'm trying to run against the table:
SELECT
    TOP 100 *
FROM
    OPENROWSET(
        BULK 'https://(...).dfs.core.windows.net/(...)/table/',
        FORMAT = 'DELTA'
    ) AS [result]
I get the error:
Content of directory on path 'https://.../table/_delta_log/*.*' cannot be listed.
If I query any other table, e.g. table_copy, I get no error.
I can query every table I have except this one.
Following every piece of documentation and every thread I could find, I tried the following:
(IAM) Setting up Storage Blob Contributor, Storage Blob Owner, Storage Queue Data Contributor and Owner
Going into the ACLs and setting Read, Write, Execute under both Access and Default permissions for the Managed Identity (Synapse Studio)
Propagating the ACLs to every child
Restoring the default permissions for the folder
Making a copy of the table, deleting the original, and writing it again (pyspark):
# Read original table
table_copy = (
    spark.read.format("delta")
    .option("recursiveFileLookup", "True")
    .load(f"abfss://...@....dfs.core.windows.net/.../table/")
)
# Create a copy of it
(
    table_copy.write.format('delta')
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .save(f"abfss://...@....dfs.core.windows.net/.../table_copy/")
)
# Remove original one
dbutils.fs.rm('abfss://...@....dfs.core.windows.net/.../table/', recurse=True)
# Overwrite it
(
    table_copy.write.format('delta')
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .save(f"abfss://...@....dfs.core.windows.net/.../table/")
)
If I make a copy of the table to table_copy, I can read it.
Note that in Azure Synapse UI I can query the table. Outside of it I can't.

It seems like the permission and firewall settings are set up correctly.
One thing to check is that the table is in the correct (Delta) format with the correct schema, and that the _delta_log directory has been created.
Try this approach:
I didn't have any Delta table to start with, so I created a sample dataframe df using spark.read.
Then I wrote df in Delta format, in overwrite mode, to an abfss://<container_name>@<storage_account_name>... path and, in the same call, registered a table named test_table using saveAsTable:
table_path = f"abfss://<container_name>@<storage_account_name>.dfs.core.windows.net/<folder>"
df.write.format("delta").mode("overwrite").option("path", table_path).saveAsTable("test_table")
You can check test_table and the abfss storage location. I successfully got the data in Delta format.
Another alternative is to create a new Delta table and copy the data from the old table into it. You can use a query like this:

Cannot drop an unmanaged Delta Lake table through PySpark code

I am trying to drop an unmanaged table, but it only drops its metadata. I am using the following code in Databricks:
spark.sql("DROP TABLE IF EXISTS default.StoresSales")
dbutils.fs.rm("dbfs:/mnt/ext_source/sparkDeltaTables/default.StoresSales",True)
I tried both True and False, but neither removes the files located on the storage. I have to delete the files manually.
The command gives the following output:
java.io.IOException: com.microsoft.azure.storage.StorageException: This operation is not permitted on a non-empty directory.

How to create an external unmanaged table in delta lake in Azure Databricks

I have a scenario where I read files from my blob storage and then create a Delta table in Azure Databricks. When you create a Delta table in Databricks, the Delta files are created in a default location that we can't access.
The other way is to create an unmanaged Delta table and specify your own path for storing these Delta files. I would like to know how to do this. How can I specify where I want to store my Delta files? And what does "external table" mean, and how can I specify the path for an external table?
I tried the code below, and it fails on the CREATE EXTERNAL TABLE command:
spark.conf.set("fs.azure.account.auth.type.xyzstorageaccount.dfs.core.windows.net", "SAS")
spark.conf.set("fs.azure.sas.token.provider.type.xyzstorageaccount.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider")
spark.conf.set("fs.azure.sas.fixed.token.xyzstorageaccount.dfs.core.windows.net", "<sas token>")
%sql
CREATE EXTERNAL TABLE IF NOT EXISTS axytable
LOCATION 'abfss://xyzstorageaccount/tables';
I know I might be doing something wrong here, and I have not completely understood what an external table actually is. Does it stay inside the Databricks cluster, and should I provide a Databricks path instead of my storage account path? How do I create a custom catalog, which also seems to be a requirement here? Also, the code below works and I am able to write to the storage account at the path provided, but I fail to understand why. Can any SME help me out here?
path = "abfss://xxx@xyzstorageacount.dfs.core.windows.net/XYY"
(DF.writeStream
    .format('delta')
    .outputMode("append")
    .trigger(once=True)
    .option("mergeSchema", "true")
    .option('checkpointLocation', path + "/bronze_checkpoint")
    .start(path + "/myTable"))

This documentation provides a good description of what managed tables are and how they differ from unmanaged tables. In a nutshell, managed tables are created in a "default" location, and both the data and the table metadata are managed by the Hive metastore or Unity Catalog, so when you drop a table, the actual data is deleted as well. Unmanaged tables are different in that only the metadata is controlled by the Hive metastore or Unity Catalog: if you drop the table, only the table definition is dropped, not the data.
You can create an unmanaged table in different ways:
Create it from scratch using the syntax create table <name> (columns definition) using delta location 'path' (doc)
Create a table over existing data using the syntax create table name using delta location 'path' (you don't need to provide a columns definition) (doc)
Provide the path option with the path to the data when saving the table using saveAsTable (for batch) or toTable (for streaming). In your example it would be the following:
path = "abfss://xxx@xyzstorageacount.dfs.core.windows.net/XYY"
(DF.writeStream
    .format('delta')
    .outputMode("append")
    .trigger(once=True)
    .option("mergeSchema", "true")
    .option('checkpointLocation', path + "/bronze_checkpoint")
    .option('path', path + "/myTable")
    .toTable('<table_name>')
)
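For the first two options, a minimal sketch of issuing the statement from PySpark could look like this. The helper name and its parameters are invented for illustration; only the create table ... using delta location '...' syntax comes from the list above.

```python
def register_external_delta_table(spark, table_name, location, columns=None):
    """Register an unmanaged Delta table whose files live at `location`.

    `columns` is an optional column definition such as "id INT, name STRING";
    leave it out when Delta data already exists at the location.
    """
    column_clause = f" ({columns})" if columns else ""
    statement = (
        f"CREATE TABLE IF NOT EXISTS {table_name}{column_clause} "
        f"USING DELTA LOCATION '{location}'"
    )
    spark.sql(statement)
    return statement
```

Dropping a table registered this way removes only the definition; the files at the location stay where they are.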

Spark: refresh Delta Table in S3

How can I run the refresh table command on a Delta table in S3?
When I do
deltatable = DeltaTable.forPath(spark, "s3a://test-bucket/delta_table/")
spark.catalog.refreshTable(deltatable)
I am getting the error:
AttributeError: 'DeltaTable' object has no attribute '_get_object_id'
Does the refresh command only work for Hive tables?
Thanks!

OK. That's really the wrong function: spark.catalog.refreshTable (doc) refreshes the table metadata cached inside Spark, and it expects a table name (a string), not a DeltaTable object, which is why you get the AttributeError. It has nothing to do with recovering a Delta table.
To fix this on Delta you need to do something different. Unfortunately, I'm not 100% sure about the right way for the open source Delta implementation; on Databricks we have the FSCK REPAIR TABLE SQL command for that. I would try the following (be careful, make a backup!):
If the removed files were in a recent version, you may try the RESTORE command with spark.sql.files.ignoreMissingFiles set to true.
If the removed files belonged to a specific partition, you can read the table (again with spark.sql.files.ignoreMissingFiles set to true), keep only the data for that partition, and write it back in overwrite mode with the replaceWhere option (doc) containing the matching condition.
Or you can read the whole Delta table (again with spark.sql.files.ignoreMissingFiles set to true) and write it back in overwrite mode. This will of course duplicate your data, but the old files will be removed by VACUUM.
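The last two options could be sketched roughly as below. This is untested against a table that is actually missing files, so treat it as an outline: the function name and parameters are made up, and the path and condition are whatever applies to your table.

```python
def rewrite_ignoring_missing_files(spark, path, replace_where=None):
    """Read a Delta table while skipping missing files, then write it back.

    With `replace_where` (e.g. "part = 1") only the matching rows are read
    and only that slice of the table is overwritten; without it the whole
    table is rewritten.
    """
    # Skip data files that the transaction log references but that are gone.
    spark.conf.set("spark.sql.files.ignoreMissingFiles", "true")

    df = spark.read.format("delta").load(path)
    if replace_where is not None:
        df = df.where(replace_where)

    writer = df.write.format("delta").mode("overwrite")
    if replace_where is not None:
        writer = writer.option("replaceWhere", replace_where)
    writer.save(path)
```

For example, rewrite_ignoring_missing_files(spark, "s3a://test-bucket/delta_table/") rewrites the whole table. Make a backup of the table directory first, as suggested above.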

Write Spark DataFrame to an existing Delta table by providing TABLE NAME instead of TABLE PATH

I am trying to write a Spark dataframe into an existing Delta table.
I have multiple scenarios where I save data into different tables, as shown below.
SCENARIO 01:
I have an existing Delta table and I have to write a dataframe into it with the mergeSchema option, since the schema may change with each load.
I am currently doing this with the command below, providing the Delta table path:
finalDF01.write.format("delta").option("mergeSchema", "true").mode("append") \
.partitionBy("part01","part02").save(finalDF01DestFolderPath)
I just want to know whether this can be done by providing the existing Delta TABLE NAME instead of the Delta PATH.
I resolved this by changing the write command as below.
finalDF01.write.format("delta").option("mergeSchema", "true").mode("append") \
.partitionBy("part01","part02").saveAsTable(finalDF01DestTableName)
Is this the correct way ?
SCENARIO 02:
I have to update the existing table if the record already exists and if not insert a new record.
For this I am currently doing as shown below.
spark.sql("SET spark.databricks.delta.schema.autoMerge.enabled = true")
DeltaTable.forPath(DestFolderPath)
  .as("t")
  .merge(
    finalDataFrame.as("s"),
    "t.id = s.id AND t.name = s.name")
  .whenMatched().updateAll()
  .whenNotMatched().insertAll()
  .execute()
I tried with below script.
destMasterTable.as("t")
  .merge(
    vehMasterDf.as("s"),
    "t.id = s.id")
  .whenNotMatched().insertAll()
  .execute()
but I get the error below (even with alias instead of as).
error: value as is not a member of String
destMasterTable.as("t")
Here, too, I am using the Delta table path as the destination. Is there any way to provide the Delta TABLE NAME instead of the TABLE PATH?
It would be better to provide the TABLE NAME instead of the TABLE PATH, so that changing the table path later does not affect the code.
I have not seen anything in the Databricks documentation about providing a table name along with mergeSchema and autoMerge.
Is it possible to do so?

To use existing data as a table instead of a path, you either need to have used saveAsTable from the beginning, or you can register the existing data in the Hive metastore using the SQL command CREATE TABLE USING, like this (the syntax may differ slightly depending on whether you're running on Databricks or OSS Spark, and on the version of Spark):
CREATE TABLE IF NOT EXISTS my_table
USING delta
LOCATION 'path_to_existing_data'
After that, you can use saveAsTable.
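Put together in PySpark, scenario 01 might look like the sketch below. The helper name and its parameters are hypothetical; the write options are the ones from the question, minus partitionBy, which is left out here on the assumption that the existing table already defines its partitioning.

```python
def append_by_table_name(spark, df, table_name, existing_path=None):
    """Append `df` to a Delta table addressed by name, merging new columns.

    If the data so far only exists as files at `existing_path`, the table is
    registered under `table_name` first.
    """
    if existing_path is not None:
        spark.sql(
            f"CREATE TABLE IF NOT EXISTS {table_name} "
            f"USING delta LOCATION '{existing_path}'"
        )
    (
        df.write.format("delta")
        .option("mergeSchema", "true")  # add any new columns to the table schema
        .mode("append")
        .saveAsTable(table_name)
    )
```

Once the table is registered, existing_path can be omitted on later calls, so the code no longer depends on where the data is stored.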
For the second question: it looks like destMasterTable is just a String. To refer to an existing table, you need to use the forName function of the DeltaTable object (doc):
DeltaTable.forName(destMasterTable)
  .as("t")
  ...
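For anyone doing the same from PySpark rather than Scala, a rough equivalent of the upsert in scenario 02 is sketched below. The two helper names are made up; the merge calls are the Python DeltaTable API, and the delta package (delta-spark) is assumed to be installed wherever upsert_by_table_name is called.

```python
def upsert_into(target, source_df, condition):
    """Merge `source_df` into an already-resolved DeltaTable."""
    (
        target.alias("t")
        .merge(source_df.alias("s"), condition)
        .whenMatchedUpdateAll()     # update rows that satisfy the condition
        .whenNotMatchedInsertAll()  # insert rows that have no match
        .execute()
    )


def upsert_by_table_name(spark, table_name, source_df, condition):
    """Resolve the target by table name instead of by path, then merge."""
    # Imported here so the module still loads where delta-spark is absent.
    from delta.tables import DeltaTable

    upsert_into(DeltaTable.forName(spark, table_name), source_df, condition)
```

A call would look like upsert_by_table_name(spark, "my_db.my_table", finalDataFrame, "t.id = s.id AND t.name = s.name"), where the table name is a placeholder.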
