Delta Lake error - MERGE destination only supports Delta sources - apache-spark

I am trying to implement SCD Type 2 in Delta Lake and I am getting the following error: "MERGE destination only supports Delta sources".
Below is the code snippet I am executing.
MERGE INTO stageviews.employeetarget t
USING (
-- The records from the first SELECT statement will have both new & updated records
SELECT id as mergeKey, src.*
FROM stageviews.employeeupdate src
UNION ALL
-- Identify the updated records; setting the mergeKey to NULL forces these rows to NOT MATCH and be INSERTED into the target.
SELECT NULL as mergeKey, src.*
FROM stageviews.employeeupdate src JOIN stageviews.employeetarget tgt
ON src.id = tgt.id
WHERE tgt.ind_flag = "1"
AND sha2(src.EmployeeName,256) <> sha2(tgt.EmployeeName ,256)
) as s
ON t.id = s.mergeKey
WHEN MATCHED AND
( t.ind_flag = "1" AND sha2(t.EmployeeName,256) <> sha2(s.EmployeeName ,256) ) THEN
UPDATE SET t.ind_flag = "0", t.eff_end_date = current_date()-1
WHEN NOT MATCHED THEN
INSERT(t.Id,t.EmployeeName,t.JobTitle,t.BasePay,t.OvertimePay,t.OtherPay,t.Benefits,t.TotalPay,t.TotalPayBenefits,t.Year,t.Notes,t.Agency,t.Status,t.ind_flag,t.create_date,t.update_date,t.eff_start_date,t.eff_end_date)
values(s.Id,s.EmployeeName,s.JobTitle,s.BasePay,s.OvertimePay,s.OtherPay,s.Benefits,s.TotalPay,s.TotalPayBenefits,s.Year,s.Notes,s.Agency,s.Status,s.ind_flag,
current_date(),current_date(),current_date(),to_date('9999-12-31'))

Unfortunately, Databricks only supports MERGE/UPDATE operations on Delta (Delta Lake) tables.
The error message "Error in SQL statement: AnalysisException: MERGE destination only supports Delta sources" indicates that you are trying to run the MERGE against a table that is not a Delta table.
MERGE INTO merges a set of updates, insertions, and deletions based on a source table into a target Delta table.
Reference: Azure Databricks - Merge and SCD Type 2 using Merge.
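If stageviews.employeetarget is not already a Delta table, one way forward (a sketch, assuming the existing table is Parquet-backed and you are on Databricks / Delta Lake SQL) is to convert it in place or recreate it as a Delta table before running the MERGE; the _delta table name below is just a placeholder:

-- Convert the existing table in place (only works if it is Parquet-backed)
CONVERT TO DELTA stageviews.employeetarget;

-- Or recreate the target explicitly as a Delta table (placeholder name)
CREATE TABLE stageviews.employeetarget_delta
USING DELTA
AS SELECT * FROM stageviews.employeetarget;

Note that if stageviews.employeetarget is only a view over non-Delta data, the MERGE has to target an actual Delta table instead.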

Related

Delta table merge on multiple columns

I have a table whose primary key consists of multiple columns, so I need to perform the merge logic on multiple columns:
DeltaTable.forPath(spark, "path")
  .as("data")
  .merge(
    finalDf1.as("updates"),
    "data.column1 = updates.column1 AND data.column2 = updates.column2 AND data.column3 = updates.column3 AND data.column4 = updates.column4 AND data.column5 = updates.column5")
  .whenMatched
  .updateAll()
  .whenNotMatched
  .insertAll()
  .execute()
When I check the data counts, they are not updating as expected.
Could someone help me with this?
Please also try an approach like the one in this example: https://docs.databricks.com/_static/notebooks/merge-in-cdc.html
Create a changes table with additional columns in which you note whether a row is:
- new (to be inserted)
- old (primary key exists) and nothing has changed
- old (primary key exists) but other fields need an update
Then use additional conditions on the merge, for example:
// whenMatched cannot insert; new rows go through whenNotMatched
.whenNotMatched("s.new = true")
.insertAll()
.whenMatched("s.updated = true")
.updateExpr(Map("key" -> "s.key", "value" -> "s.newValue"))
How are you counting your rows?
One thing to keep in mind is that directly reading and counting the Parquet files produced by Delta Lake will potentially give you a different result than reading the rows through the Delta table interface. Remember that Delta keeps a transaction log and supports time travel, so it stores copies of rows as they change over time.
Here's a way to accurately count the current rows in a Delta table:
deltaTable = DeltaTable.forPath(spark, "<path to your delta table>")
deltaTable.toDF().count()
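To illustrate why the two counts can differ, here is a small Spark SQL sketch (the path is a placeholder): reading through the Delta interface returns only the current snapshot, while counting the raw Parquet files may also include stale file versions kept for time travel until they are vacuumed.

-- Current snapshot only
SELECT COUNT(*) FROM delta.`/path/to/delta/table`;
-- May also count stale copies retained in old Parquet files
SELECT COUNT(*) FROM parquet.`/path/to/delta/table`;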

SQL Server: MERGE statement, compare with select data instead of table data

merge into item_set TARGET
using (select '545934' as product_id_01, 4 as set_sort_no, 15 as article_id,
'Note for this item set' as note, 0 as is_deleted) as SOURCE
on TARGET.set_sort_no = SOURCE.set_sort_no and TARGET.product_id_01 = SOURCE.product_id_01
WHEN MATCHED THEN
UPDATE
SET TARGET.article_id = SOURCE.article_id,
TARGET.note = SOURCE.note,
TARGET.is_deleted = SOURCE.is_deleted,
TARGET.version = TARGET.version
WHEN NOT MATCHED THEN
INSERT (product_id_01, set_sort_no, article_id, note, is_deleted, version)
VALUES (SOURCE.product_id_01, SOURCE.set_sort_no, SOURCE.article_id, SOURCE.note, SOURCE.is_deleted, 3);
I have a query as shown above. I would like to know if it is possible to use multiple values (an array of values) instead of the statement below from the query, without using a table:
(select
'545934' as product_id_01,
4 as set_sort_no, 15 as article_id,
'Note for this item set' as note, 0 as is_deleted) as SOURCE
Thanks in advance.
No. MS SQL Server was not designed to support arrays.
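That said, if the goal is simply to feed several inline rows into the MERGE without creating a table, a table value constructor (VALUES ...) can serve as the source. A sketch based on the columns from the question; the second row's values are made up for illustration:

MERGE INTO item_set AS TARGET
USING (VALUES
    ('545934', 4, 15, 'Note for this item set', 0),
    ('545935', 5, 16, 'Another note', 0)  -- hypothetical second row
) AS SOURCE (product_id_01, set_sort_no, article_id, note, is_deleted)
ON TARGET.set_sort_no = SOURCE.set_sort_no AND TARGET.product_id_01 = SOURCE.product_id_01
WHEN MATCHED THEN
    UPDATE SET TARGET.article_id = SOURCE.article_id,
               TARGET.note = SOURCE.note,
               TARGET.is_deleted = SOURCE.is_deleted
WHEN NOT MATCHED THEN
    INSERT (product_id_01, set_sort_no, article_id, note, is_deleted, version)
    VALUES (SOURCE.product_id_01, SOURCE.set_sort_no, SOURCE.article_id, SOURCE.note, SOURCE.is_deleted, 3);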

Merge in Spark SQL - WHEN NOT MATCHED BY SOURCE THEN

I am coding in Python and Spark SQL in Databricks, using Spark 2.4.5.
I have two tables.
Create table IF NOT EXISTS db_xsi_ed_faits_shahgholi_ardalan.Destination
(
id Int,
Name string,
Deleted int
) USING Delta;
Create table IF NOT EXISTS db_xsi_ed_faits_shahgholi_ardalan.Source
(
id Int,
Name string,
Deleted int
) USING Delta;
I need to run a MERGE command between my source and destination. I wrote the command below:
%sql
MERGE INTO db_xsi_ed_faits_shahgholi_ardalan.Destination AS D
USING db_xsi_ed_faits_shahgholi_ardalan.Source AS S
ON (S.id = D.id)
-- UPDATE
WHEN MATCHED AND S.Name <> D.Name THEN
UPDATE SET
D.Name = S.Name
-- INSERT
WHEN NOT MATCHED THEN
INSERT (id, Name, Deleted)
VALUES (S.id, S.Name, S.Deleted)
-- DELETE
WHEN NOT MATCHED BY SOURCE THEN
UPDATE SET
D.Deleted = 1
When I ran this command I got the error below:
It seems that we do not have WHEN NOT MATCHED BY SOURCE in Spark! I need a solution for this.
I wrote this code, but I am still looking for a better approach:
%sql
MERGE INTO db_xsi_ed_faits_shahgholi_ardalan.Destination AS D
USING db_xsi_ed_faits_shahgholi_ardalan.Source AS S
ON (S.id = D.id)
-- UPDATE
WHEN MATCHED AND S.Name <> D.Name THEN
UPDATE SET
D.Name = S.Name
-- INSERT
WHEN NOT MATCHED THEN
INSERT (id, Name, Deleted)
VALUES (S.id, S.Name, S.Deleted)
;
%sql
-- Logical delete
UPDATE db_xsi_ed_faits_shahgholi_ardalan.Destination
SET Deleted = 1
WHERE db_xsi_ed_faits_shahgholi_ardalan.Destination.id in
(
SELECT
D.id
FROM db_xsi_ed_faits_shahgholi_ardalan.Destination AS D
LEFT JOIN db_xsi_ed_faits_shahgholi_ardalan.Source AS S ON (S.id = D.id)
WHERE S.id is null
)
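As a side note, newer Delta Lake releases (roughly Delta Lake 2.4 / recent Databricks runtimes, so not Spark 2.4.5) do support WHEN NOT MATCHED BY SOURCE, in which case the logical delete can live in the original single MERGE. A sketch on an upgraded runtime:

MERGE INTO db_xsi_ed_faits_shahgholi_ardalan.Destination AS D
USING db_xsi_ed_faits_shahgholi_ardalan.Source AS S
ON (S.id = D.id)
WHEN MATCHED AND S.Name <> D.Name THEN
  UPDATE SET D.Name = S.Name
WHEN NOT MATCHED THEN
  INSERT (id, Name, Deleted) VALUES (S.id, S.Name, S.Deleted)
WHEN NOT MATCHED BY SOURCE THEN
  UPDATE SET D.Deleted = 1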

update and insert into Azure data warehouse using Azure data factory pipelines

I'm trying to run an ADF copy pipeline with update and insert statements that are supposed to replace a MERGE statement, basically something like:
UPDATE TARGET
SET ProductName = SOURCE.ProductName,
TARGET.Rate = SOURCE.Rate
FROM Products AS TARGET
INNER JOIN UpdatedProducts AS SOURCE
ON TARGET.ProductID = SOURCE.ProductID
WHERE TARGET.ProductName <> SOURCE.ProductName
OR TARGET.Rate <> SOURCE.Rate
INSERT Products (ProductID, ProductName, Rate)
SELECT SOURCE.ProductID, SOURCE.ProductName, SOURCE.Rate
FROM UpdatedProducts AS SOURCE
WHERE NOT EXISTS
(
SELECT 1
FROM Products
WHERE ProductID = SOURCE.ProductID
)
If the target is an Azure SQL DB I would use this approach: https://www.taygan.co/blog/2018/04/20/upsert-to-azure-sql-db-with-azure-data-factory
but if the target is an Azure SQL Data Warehouse, the stored procedure option doesn't exist! Any suggestions? Do I have to load a staging table first and then run the update and insert statements from stg_table to target_table, or is there a way to do it directly from ADF?
If you can't use a stored procedure, my suggestion would be to create a second copy data transform. Run the pre-script on the second transform and drop the table, since it's a temp table that you created in the first.
BEGIN
MERGE Target AS target_sqldb
USING TempTable AS source_tblstg
ON (target_sqldb.Id= source_tblstg.Id)
WHEN MATCHED THEN
UPDATE SET
[Name] = source_tblstg.Name,
[State] = source_tblstg.State
WHEN NOT MATCHED THEN
INSERT([Name], [State])
VALUES (source_tblstg.Name, source_tblstg.State);
DROP TABLE TempTable;
END

Polybase - maximum reject threshold (0 rows) was reached while reading from an external source: 1 rows rejected out of total 1 rows processed

[Question from customer]
I have the following data in a text file, delimited by |:
A | null , ZZ
C | D
When I run this query using HDInsight:
CREATE EXTERNAL TABLE myfiledata(
col1 string,
col2 string
)
row format delimited fields terminated by '|' STORED AS TEXTFILE LOCATION 'wasb://.....';
I get the following result as expected:
A null , ZZ
C D
But when I run the same query using SQL DW PolyBase, it throws an error:
Query aborted-- the maximum reject threshold (0 rows) was reached while reading from an external source: 1 rows rejected out of total 1 rows processed.
How do I fix this?
Here's my script in SQL DW:
-- Creating external data source (Azure Blob Storage)
CREATE EXTERNAL DATA SOURCE azure_storage1
WITH
(
TYPE = HADOOP
, LOCATION ='wasbs://....blob.core.windows.net'
, CREDENTIAL = ASBSecret
)
;
-- Creating external file format (delimited text file)
CREATE EXTERNAL FILE FORMAT text_file_format
WITH
(
FORMAT_TYPE = DELIMITEDTEXT
, FORMAT_OPTIONS (
FIELD_TERMINATOR ='|'
, USE_TYPE_DEFAULT = TRUE
)
)
;
-- Creating external table pointing to file stored in Azure Storage
CREATE EXTERNAL TABLE [Myfile]
(
Col1 varchar(5),
Col2 varchar(5)
)
WITH
(
LOCATION = '/myfile.txt'
, DATA_SOURCE = azure_storage1
, FILE_FORMAT = text_file_format
)
;
We’re currently working on a way to bubble up the reason for the rejection to the user.
In the meantime, here's what's happening:
The default # of rows allowed to fail schema matching is 0. This means the query fails if even one of the rows you’re loading from /myfile.txt doesn’t match the schema. In Hive, strings can accommodate an arbitrary number of characters, but varchars cannot. In this case it’s failing on the varchar(5) because “null , ZZ” is more than 5 characters.
If you’d like to change the REJECT_VALUE in the CREATE EXTERNAL TABLE call, that will let through the other row – more info can be found here: https://msdn.microsoft.com/library/dn935021(v=sql.130).aspx
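For example, a sketch of the same external table that tolerates a single rejected row instead of aborting the query (widening the varchar columns would be another way to make the row pass):

CREATE EXTERNAL TABLE [Myfile]
(
    Col1 varchar(5),
    Col2 varchar(5)
)
WITH
(
    LOCATION = '/myfile.txt'
    , DATA_SOURCE = azure_storage1
    , FILE_FORMAT = text_file_format
    , REJECT_TYPE = VALUE      -- reject based on an absolute number of rows
    , REJECT_VALUE = 1         -- allow up to 1 dirty row to be rejected
)
;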
It's due to a dirty record for the respective file format; for example, in the case of Parquet, if a column contains '' (an empty string), it won't work and will throw "Query aborted-- the maximum reject threshold ...".
A query on an external table can fail with the error "Query aborted-- the maximum reject threshold was reached while reading from an external source". This indicates that your external data contains dirty records. A data record is considered 'dirty' if the actual data types or number of columns do not match the column definitions of the external table, or if the data doesn't conform to the specified external file format. To fix this, ensure that your external table and external file format definitions are correct and your external data conforms to these definitions. If a subset of the external data records is dirty, you can choose to reject these records for your queries by using the reject options in the CREATE EXTERNAL TABLE DDL.
