Cannot Drop the unmannaged Delta lake table through pyspark code - databricks

I am trying to drop a unmanaged table but it only drops its metadata. I am using the following Code in Databricks
spark.sql("DROP TABLE IF EXISTS default.StoresSales")
dbutils.fs.rm("dbfs:/mnt/ext_source/sparkDeltaTables/default.StoresSales",True)
Tried True and False both options but nothing works with the files at located on the Storage. I need to manually delete the files.
The command gives the following Output:
java.io.IOException: com.microsoft.azure.storage.StorageException: This operation is not permitted on a non-empty directory.

Related

Azure Data Studio: _delta_log/*.*' cannot be listed

I'm trying to query my delta tables using Azure Synapse Serverless SQL Pool.
Login in Azure Data Studio using the SQL admin credentials.
This is a simple query to table that I'm trying trying to make:
SELECT
TOP 100 *
FROM
OPENROWSET(
BULK 'https://(...).dfs.core.windows.net/(...)/table/',
FORMAT = 'DELTA'
) AS [result]
I get the error:
Content of directory on path 'https://.../table/_delta_log/*.*' cannot be listed.
If I query any other table, e.g. table_copy I have no error.
I can query every table I have, except this table one.
Following every piece of documentation and threads I find, tried the following:
(IAM) setting up Storage Blob Contributor, Storage Blob Owner, Storage Queue Data Contributor and Owner
Going in ACL setting up Read, Write, Execute Access and Default permissions, for the Managed Identity (Synapse Studio),
Propagating the ACL into every children
Restored the default permissions for the folder
Making a copy of the table, deleting the original, and overwrite it again (pyspark)
# Read original table
table_copy = spark.read.format("delta")
.option("recursiveFileLookup", "True")
.load(f"abfss://...#....dfs.core.windows.net/.../table/")
# Create a copy of it
table_copy.write.format('delta')
.mode("overwrite")
.option("overwriteSchema","true")
.save(f"abfss://...#....dfs.core.windows.net/.../table_copy/")
# Remove original one
dbutils.fs.rm('abfss://...#....dfs.core.windows.net/.../table/',recurse=True)
# Overwrite it
table_copy.write.format('delta')
.mode("overwrite")
.option("overwriteSchema","true")
.save(f"abfss://...#....dfs.core.windows.net/.../table/")
If I make a copy of the table to table_copy, I can read it.
Note that in Azure Synapse UI I can query the table. Outside of it I can't.
It seems like the permission and firewall settings are set up correctly.
One thing you can try and check the table is in correct format (Delta format) and it has correct schema and also check you directory delta_log create or not.
Try this approach:
First I don't have any delta table . so I created sample dataframe df using spark.read.
Then, I overwrite dataframe df into delta format with abfss://<container_name>#<storage_account_name>... path and also parallelly created a table using saveAsTable name: test_table
table_path = f"abfss://<container_name>#<storage_account_name>.dfs.core.windows.net/<folder>"
df.write.format("delta").mode("overwrite").option("path", table_path).saveAsTable("test_table")
You can check test_table and abfss storage location. I successfully got the data in delta format.
Another Alterative way that you can create a new delta table and copy the data from old table to the new delta table. You can use the query like this:

ADF Default Columns

ADF Copy task:
Importing flat files with wildcard *.txt, some files have 18 cols, some have 24.
SQL table sink has 24 cols.
Fails because it does not find a mapping for cols 19-24.
Can i default the mapping of the last 6 cols to NULL when no value is found ?
EDIT:
I copied my source to blob and used a dataflow with schema drift instead. I can connect to my source and can see that it writes parquet files to the staging folder, but after calculating the rows the workflow fails with error:
Operation on target Dataflow1 failed: {"StatusCode":"DFExecutorUserError","Message":"Job failed due to reason: at Sink 'nsodevsynapse': Unable to stage data before write. Check configuration/credentials of storage","Details":"org.apache.hadoop.fs.azure.AzureException: com.microsoft.azure.storage.StorageException: This operation is not permitted on a non-empty directory.\n\tat org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.delete(AzureNativeFileSystemStore.java:2607)\n\tat org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.delete(AzureNativeFileSystemStore.java:2617)\n\tat org.apache.hadoop.fs.azure.NativeAzureFileSystem.deleteFile(NativeAzureFileSystem.java:2657)\n\tat org.apache.hadoop.fs.azure.NativeAzureFileSystem$2.execute(NativeAzureFileSystem.java:2391)\n\tat org.apache.hadoop.fs.azure.AzureFileSystemThreadPoolExecutor.executeParallel(AzureFileSystemThreadPoolExecutor.java:223)\n\tat org.apache.hadoop.fs.azure.NativeAzureFileSystem.deleteWithoutAuth(NativeAzureFileSystem.java:2403)\n\tat org.apache.hadoop.fs.azure.NativeAzureFileSystem.delete(NativeAzureFileSystem.java:2453)\n\tat org.apache.hadoop.fs.azure.NativeAzureFileSystem.delete(NativeAzureFileSystem.java:1936)\n\tat org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter."}
My sink is using a sql account to connect, i can connect to sql using that account. I can write edit SQL tables using that account.
the managed instance has owner permissions on the storage account.
You can load data using the dataflow activity by enabling “Allow schema drift” in source and sink transformations and it will automatically default the values to NULL when not passed.
Source files:
Dataflow:
• In source and Sink, enable "Allow schema drift" if the source schema changes often.
• Add mapping to map all source columns to destination.
Destination SQL table:

AssertionError: assertion failed: No plan for DeleteFromTable In Databricks

Is there any reason this command works well:
%sql SELECT * FROM Azure.Reservations WHERE timestamp > '2021-04-02'
returning 2 rows, while the below:
%sql DELETE FROM Azure.Reservations WHERE timestamp > '2021-04-02'
fails with:
Error in SQL statement: AssertionError: assertion failed: No plan for
DeleteFromTable (timestamp#394 > 1617321600000000)
?
I'm new to Databricks but I'm sure I ran similar command on another table (without WHERE clause). The table is created basing on a Parquet file.
DELETE FROM (and similarly UPDATE, or MERGE) aren't supported on the Parquet files - right now on Databricks it's supported for Delta format. You can convert your parquet files into delta using CONVERT TO DELTA, and then this command will work for you.
Another alternative is to implement it is to read parquet files, filter out the rows that you want to keep, and overwrite your parquet files.
It could be that you are trying to DELETE from a VIEW (in case it is not a parquet file)
Unfortunately, there is no easy way to differentiate between a VIEW and a TABLE in databricks; the only way you can test if it's indeed a view is by:
SHOW VIEWS FROM Azure like 'reser*'
or, if it's a table:
SHOW TABLES FROM Azure like 'reser*'
Show tables syntax
Show views syntax
just delete from the delta
%sql
delete from delta.`/mnt/path`
where x

What is the best way to cleanup and recreate databricks delta table?

I am trying to cleanup and recreate databricks delta table for integration tests.
I want to run the tests on devops agent so i am using JDBC (Simba driver) but it says statement type "DELETE" is not supported.
When i cleanup the underlying DBFS location using DBFS API "rm -r" it cleans up the table but next read after recreate gives an error - A file referenced in the transaction log cannot be found. This occurs when data has been manually deleted from the file system rather than using the table DELETE statement.
Also if i simply do DELETE from delta table on data i still see the underlying dbfs directory and the files intact. How can I clean up the delta as well as underlying files gracefully?
You can use VACUUM command to do the clean up. I haven't used it yet.
If you are using spark, you can use overwriteSchema option to reload the data.
If you can provide the more details on how you are using it, it would be better
The perfect steps are as follows:
When you do a DROP TABLE and DELETE FROM TABLE TABLE NAME the following things happen in :
DROP TABLE : drops your table but data still resides.(Also you can't create a new table definition with changes in schema in the same location.)
DELETE FROM TABLE deletes data from table but transaction log still resides.
So, Step 1 - DROP TABLE schema.Tablename
STEP 2 - %fs rm -r /mnt/path/where/your/table/definition/is/pointed/fileNames.parquet
Step 3 - % fs ls make sure there is no data and also no transaction log at that location
Step 4 : NOW>!!!!! RE_RUN your CREATE TABLE statement with any changes you desire UISNG delta location /mnt/path/where/your/table/definition/is/pointed/fileNames.parquet
Step 5 : Start using the table and verify using %sql desc formatted schema.Tablename
Make sure that you are not creating an external table. There are two types of tables:
1) Managed Tables
2) External Tables (Location for dataset is specified)
When you delete Managed Table, spark is responsible for cleanup of metaData of that table stored in metastore and for cleanup of the data (files) present in that table.
But for external table, spark do not owns the data, so when you delete external table, only metadata present in metastore is deleted by spark and data (files) which were present in that table do not get deleted.
After this if you confirm that your tables are managed tables and still dropping table is not deleting files then you can use VACUUM command:
VACUUM <databaseName>.<TableName> [RETAIN NUM HOURS]
This will cleanup all the uncommitted files from table's folder.
I hope this helps you.
import os
path = "<Your Azure Databricks Delta Lake Folder Path>"
for delta_table in os.listdir(path):
dbutils.fs.rm("<Your Azure Databricks Delta Lake Folder Path>" + delta_table)
How to find your <Your Azure Databricks Delta Lake Folder Path>:
Step 1: Go to Databricks.
Step 2: Click Data - Create Table - DBFS. Then, you will find your delta tables.

Error While Writing into a Hive table from Spark Sql

I am trying to insert data into a Hive External table from Spark Sql.
I am created the hive external table through the following command
CREATE EXTERNAL TABLE tab1 ( col1 type,col2 type ,col3 type) CLUSTERED BY (col1,col2) SORTED BY (col1) INTO 8 BUCKETS STORED AS PARQUET
In my spark job , I have written the following code
Dataset df = session.read().option("header","true").csv(csvInput);
df.repartition(numBuckets, somecol)
.write()
.format("parquet")
.bucketBy(numBuckets,col1,col2)
.sortBy(col1)
.saveAsTable(hiveTableName);
Each time I am running this code I am getting the following exception
org.apache.spark.sql.AnalysisException: Table `tab1` already exists.;
at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:408)
at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:393)
at somepackage.Parquet_Read_WriteNew.writeToParquetHiveMetastore(Parquet_Read_WriteNew.java:100)
You should be specifying a save mode while saving the data in hive.
df.write.mode(SaveMode.Append)
.format("parquet")
.bucketBy(numBuckets,col1,col2)
.sortBy(col1)
.insertInto(hiveTableName);
Spark provides the following save modes:
Save Mode
ErrorIfExists: Throws an exception if the target already exists. If target doesn’t exist write the data out.
Append: If target already exists, append the data to it. If the data doesn’t exist write the data out.
Overwrite: If the target already exists, delete the target. Write the data out.
Ignore: If the target already exists, silently skip writing out. Otherwise write out the data.
You are using the saveAsTable API, which create the table into Hive. Since you have already created the hive table through command, the table tab1 already exists. so when Spark API trying to create it, it throws error saying table already exists, org.apache.spark.sql.AnalysisException: Tabletab1already exists.
Either drop the table and let spark API saveAsTable create the table itself.
Or use the API insertInto to insert into an existing hive table.
df.repartition(numBuckets, somecol)
.write()
.format("parquet")
.bucketBy(numBuckets,col1,col2)
.sortBy(col1)
.insertInto(hiveTableName);

Resources