ADF Default Columns - azure

ADF Copy task:
Importing flat files with wildcard *.txt, some files have 18 cols, some have 24.
SQL table sink has 24 cols.
Fails because it does not find a mapping for cols 19-24.
Can i default the mapping of the last 6 cols to NULL when no value is found ?
EDIT:
I copied my source to blob and used a dataflow with schema drift instead. I can connect to my source and can see that it writes parquet files to the staging folder, but after calculating the rows the workflow fails with error:
Operation on target Dataflow1 failed: {"StatusCode":"DFExecutorUserError","Message":"Job failed due to reason: at Sink 'nsodevsynapse': Unable to stage data before write. Check configuration/credentials of storage","Details":"org.apache.hadoop.fs.azure.AzureException: com.microsoft.azure.storage.StorageException: This operation is not permitted on a non-empty directory.\n\tat org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.delete(AzureNativeFileSystemStore.java:2607)\n\tat org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.delete(AzureNativeFileSystemStore.java:2617)\n\tat org.apache.hadoop.fs.azure.NativeAzureFileSystem.deleteFile(NativeAzureFileSystem.java:2657)\n\tat org.apache.hadoop.fs.azure.NativeAzureFileSystem$2.execute(NativeAzureFileSystem.java:2391)\n\tat org.apache.hadoop.fs.azure.AzureFileSystemThreadPoolExecutor.executeParallel(AzureFileSystemThreadPoolExecutor.java:223)\n\tat org.apache.hadoop.fs.azure.NativeAzureFileSystem.deleteWithoutAuth(NativeAzureFileSystem.java:2403)\n\tat org.apache.hadoop.fs.azure.NativeAzureFileSystem.delete(NativeAzureFileSystem.java:2453)\n\tat org.apache.hadoop.fs.azure.NativeAzureFileSystem.delete(NativeAzureFileSystem.java:1936)\n\tat org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter."}
My sink is using a sql account to connect, i can connect to sql using that account. I can write edit SQL tables using that account.
the managed instance has owner permissions on the storage account.

You can load data using the dataflow activity by enabling “Allow schema drift” in source and sink transformations and it will automatically default the values to NULL when not passed.
Source files:
Dataflow:
• In source and Sink, enable "Allow schema drift" if the source schema changes often.
• Add mapping to map all source columns to destination.
Destination SQL table:

Related

Azure Data Studio: _delta_log/*.*' cannot be listed

I'm trying to query my delta tables using Azure Synapse Serverless SQL Pool.
Login in Azure Data Studio using the SQL admin credentials.
This is a simple query to table that I'm trying trying to make:
SELECT
TOP 100 *
FROM
OPENROWSET(
BULK 'https://(...).dfs.core.windows.net/(...)/table/',
FORMAT = 'DELTA'
) AS [result]
I get the error:
Content of directory on path 'https://.../table/_delta_log/*.*' cannot be listed.
If I query any other table, e.g. table_copy I have no error.
I can query every table I have, except this table one.
Following every piece of documentation and threads I find, tried the following:
(IAM) setting up Storage Blob Contributor, Storage Blob Owner, Storage Queue Data Contributor and Owner
Going in ACL setting up Read, Write, Execute Access and Default permissions, for the Managed Identity (Synapse Studio),
Propagating the ACL into every children
Restored the default permissions for the folder
Making a copy of the table, deleting the original, and overwrite it again (pyspark)
# Read original table
table_copy = spark.read.format("delta")
.option("recursiveFileLookup", "True")
.load(f"abfss://...#....dfs.core.windows.net/.../table/")
# Create a copy of it
table_copy.write.format('delta')
.mode("overwrite")
.option("overwriteSchema","true")
.save(f"abfss://...#....dfs.core.windows.net/.../table_copy/")
# Remove original one
dbutils.fs.rm('abfss://...#....dfs.core.windows.net/.../table/',recurse=True)
# Overwrite it
table_copy.write.format('delta')
.mode("overwrite")
.option("overwriteSchema","true")
.save(f"abfss://...#....dfs.core.windows.net/.../table/")
If I make a copy of the table to table_copy, I can read it.
Note that in Azure Synapse UI I can query the table. Outside of it I can't.
It seems like the permission and firewall settings are set up correctly.
One thing you can try and check the table is in correct format (Delta format) and it has correct schema and also check you directory delta_log create or not.
Try this approach:
First I don't have any delta table . so I created sample dataframe df using spark.read.
Then, I overwrite dataframe df into delta format with abfss://<container_name>#<storage_account_name>... path and also parallelly created a table using saveAsTable name: test_table
table_path = f"abfss://<container_name>#<storage_account_name>.dfs.core.windows.net/<folder>"
df.write.format("delta").mode("overwrite").option("path", table_path).saveAsTable("test_table")
You can check test_table and abfss storage location. I successfully got the data in delta format.
Another Alterative way that you can create a new delta table and copy the data from old table to the new delta table. You can use the query like this:

Cannot Drop the unmannaged Delta lake table through pyspark code

I am trying to drop a unmanaged table but it only drops its metadata. I am using the following Code in Databricks
spark.sql("DROP TABLE IF EXISTS default.StoresSales")
dbutils.fs.rm("dbfs:/mnt/ext_source/sparkDeltaTables/default.StoresSales",True)
Tried True and False both options but nothing works with the files at located on the Storage. I need to manually delete the files.
The command gives the following Output:
java.io.IOException: com.microsoft.azure.storage.StorageException: This operation is not permitted on a non-empty directory.

How to load a table from a databse in synapse to default(spark)?

I am trying to load a table sitting in a database in synapse azure to default(spark) so that i can call the table to run the respective pandas code. However i am not able to do it.
%%spark
val df = spark.read.sqlanalytics("emea***********.rpt.Vw_APInvoices")
df.write.mode("overwrite").saveAsTable("default.t1")
Error:
Error : com.microsoft.spark.sqlanalytics.exception.SQLAnalyticsConnectorException: The specified table does not exist. Please provide a valid table.
at com.microsoft.spark.sqlanalytics.read.SQLAnalyticsReader.readSchema(SQLAnalyticsReader.scala:103)
at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation$.create(DataSourceV2Relation.scala:175)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:204)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
at org.apache.spark.sql.SqlAnalyticsConnector$SQLAnalyticsFormatReader.sqlanalytics(SqlAnalyticsConnector.scala:42)
... 52 elided
The error message clearly says - The specified table does not exist. Please provide a valid table.
Error : com.microsoft.spark.sqlanalytics.exception.SQLAnalyticsConnectorException: The specified table does not exist. Please provide a valid table.
Make sure specified table exists before you running above code.
Reference: Azure Synapse Analytics - Load the NYC Taxi data into the Spark nyctaxi database.

What is the best way to cleanup and recreate databricks delta table?

I am trying to cleanup and recreate databricks delta table for integration tests.
I want to run the tests on devops agent so i am using JDBC (Simba driver) but it says statement type "DELETE" is not supported.
When i cleanup the underlying DBFS location using DBFS API "rm -r" it cleans up the table but next read after recreate gives an error - A file referenced in the transaction log cannot be found. This occurs when data has been manually deleted from the file system rather than using the table DELETE statement.
Also if i simply do DELETE from delta table on data i still see the underlying dbfs directory and the files intact. How can I clean up the delta as well as underlying files gracefully?
You can use VACUUM command to do the clean up. I haven't used it yet.
If you are using spark, you can use overwriteSchema option to reload the data.
If you can provide the more details on how you are using it, it would be better
The perfect steps are as follows:
When you do a DROP TABLE and DELETE FROM TABLE TABLE NAME the following things happen in :
DROP TABLE : drops your table but data still resides.(Also you can't create a new table definition with changes in schema in the same location.)
DELETE FROM TABLE deletes data from table but transaction log still resides.
So, Step 1 - DROP TABLE schema.Tablename
STEP 2 - %fs rm -r /mnt/path/where/your/table/definition/is/pointed/fileNames.parquet
Step 3 - % fs ls make sure there is no data and also no transaction log at that location
Step 4 : NOW>!!!!! RE_RUN your CREATE TABLE statement with any changes you desire UISNG delta location /mnt/path/where/your/table/definition/is/pointed/fileNames.parquet
Step 5 : Start using the table and verify using %sql desc formatted schema.Tablename
Make sure that you are not creating an external table. There are two types of tables:
1) Managed Tables
2) External Tables (Location for dataset is specified)
When you delete Managed Table, spark is responsible for cleanup of metaData of that table stored in metastore and for cleanup of the data (files) present in that table.
But for external table, spark do not owns the data, so when you delete external table, only metadata present in metastore is deleted by spark and data (files) which were present in that table do not get deleted.
After this if you confirm that your tables are managed tables and still dropping table is not deleting files then you can use VACUUM command:
VACUUM <databaseName>.<TableName> [RETAIN NUM HOURS]
This will cleanup all the uncommitted files from table's folder.
I hope this helps you.
import os
path = "<Your Azure Databricks Delta Lake Folder Path>"
for delta_table in os.listdir(path):
dbutils.fs.rm("<Your Azure Databricks Delta Lake Folder Path>" + delta_table)
How to find your <Your Azure Databricks Delta Lake Folder Path>:
Step 1: Go to Databricks.
Step 2: Click Data - Create Table - DBFS. Then, you will find your delta tables.

How to include blob metadata in copy data mapping

I'm working on a ADF v2 pipeline, which copies data from csv blob to Azure SQL database table. For each load I would like to collect source metadata, like source blob name, and save it to a target table as a part of data lineage framework.
My blob source run the following schema:
StoreName,
StoreLocation,
StoreTaxId.
My destination table run the following schema:
StoreName,
StoreLocation,
DwhProcessDate,
DwhSourceName.
I do not know, how to properly include name of the source in the mapping section of Copy Data activity.
For the moment I have:
defined a [Get Metadata1] activity to get references to all blobs that are available from Azure Blob Storage
defined a [ForEach1] activity, iterating through the output of an expression #activity('Get Metadata1').output.childitems
inside the [ForEach1] activity, I have placed [Copy Data1] activity, where I have source and sink sections defined.
What I'm looking for is a way to add extra line to the mapping section, which will samehow bind #item().name to destination column [DwhSourceName]
Thanks for all suggestion on how to achieve this.
Actually,based on my test,you can specify the dymatic content of column key,but you can't set blob metadata as value of columns in copy data mapping at the pipeline run time. Please see the rules mentioned in this document.
You still need to add the FileName column in your source data before the copy activity.Maybe you could use Azure Blob Trigger Function to get the blob file name so hat you could add the FileName column when any data stream into the blob.(Please refer to this case:How Do I get the Name of The inputBlob That Triggered My Azure Function With Python)

Resources