Update and insert into Azure Data Warehouse using Azure Data Factory pipelines

I'm trying to run an ADF copy pipeline with update and insert statements that are supposed to replace a MERGE statement. Basically a statement like:
UPDATE TARGET
SET TARGET.ProductName = SOURCE.ProductName,
    TARGET.Rate = SOURCE.Rate
FROM Products AS TARGET
INNER JOIN UpdatedProducts AS SOURCE
    ON TARGET.ProductID = SOURCE.ProductID
WHERE TARGET.ProductName <> SOURCE.ProductName
   OR TARGET.Rate <> SOURCE.Rate;

INSERT INTO Products (ProductID, ProductName, Rate)
SELECT SOURCE.ProductID, SOURCE.ProductName, SOURCE.Rate
FROM UpdatedProducts AS SOURCE
WHERE NOT EXISTS
(
    SELECT 1
    FROM Products
    WHERE ProductID = SOURCE.ProductID
);
If the target were an Azure SQL DB I would use the approach described here: https://www.taygan.co/blog/2018/04/20/upsert-to-azure-sql-db-with-azure-data-factory
But if the target is an ADW, the stored procedure sink option doesn't exist. Any suggestions? Do I have to load a staging table first and then run the update and insert statements from stg_table to target_table, or is there a way to do it directly from ADF?

If you can't use a stored procedure, my suggestion would be to create a second Copy Data activity. Run the pre-copy script on that second activity and drop the table at the end, since it's a temp table that you created in the first one.
BEGIN
    MERGE Target AS target_sqldb
    USING TempTable AS source_tblstg
    ON (target_sqldb.Id = source_tblstg.Id)
    WHEN MATCHED THEN
        UPDATE SET
            [Name] = source_tblstg.Name,
            [State] = source_tblstg.State
    WHEN NOT MATCHED THEN
        INSERT ([Name], [State])
        VALUES (source_tblstg.Name, source_tblstg.State);

    DROP TABLE TempTable;
END
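If the target is Azure SQL Data Warehouse (a dedicated SQL pool) and MERGE is not available there, the same pre-copy script slot on the second copy activity can run the plain UPDATE/INSERT pair from the question instead. A minimal sketch, assuming the first copy activity landed the incoming rows in a staging table named stg_Products (the staging table name is an assumption for this sketch):
-- Hypothetical pre-copy script for the second copy activity (ADW / dedicated SQL pool).
-- Assumes the first copy activity loaded the incoming rows into stg_Products.
UPDATE TARGET
SET TARGET.ProductName = SOURCE.ProductName,
    TARGET.Rate = SOURCE.Rate
FROM Products AS TARGET
INNER JOIN stg_Products AS SOURCE
    ON TARGET.ProductID = SOURCE.ProductID
WHERE TARGET.ProductName <> SOURCE.ProductName
   OR TARGET.Rate <> SOURCE.Rate;

INSERT INTO Products (ProductID, ProductName, Rate)
SELECT SOURCE.ProductID, SOURCE.ProductName, SOURCE.Rate
FROM stg_Products AS SOURCE
WHERE NOT EXISTS (SELECT 1 FROM Products WHERE ProductID = SOURCE.ProductID);

-- Clean up the staging table so the next run starts empty.
TRUNCATE TABLE stg_Products;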

Related

SQL Server: MERGE statement, compare with select data instead of table data

merge into item_set TARGET
using (select '545934' as product_id_01, 4 as set_sort_no, 15 as article_id,
'Note for this item set' as note, 0 as is_deleted) as SOURCE
on TARGET.set_sort_no = SOURCE.set_sort_no and TARGET.product_id_01 = SOURCE.product_id_01
WHEN MATCHED THEN
UPDATE
SET TARGET.article_id = SOURCE.article_id,
TARGET.note = SOURCE.note,
TARGET.is_deleted = SOURCE.is_deleted,
TARGET.version = TARGET.version
WHEN NOT MATCHED THEN
INSERT (product_id_01, set_sort_no, article_id, note, is_deleted, version)
VALUES (SOURCE.product_id_01, SOURCE.set_sort_no, SOURCE.article_id, SOURCE.note, SOURCE.is_deleted, 3);
I have the query shown above. I would like to know whether it is possible to use multiple values (an array of values) instead of the following part of the query, without using a table:
(select
'545934' as product_id_01,
4 as set_sort_no, 15 as article_id,
'Note for this item set' as note, 0 as is_deleted) as SOURCE
Thanks in advance.
No. MS SQL Server was not designed to support arrays.
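That said, if the goal is simply to feed several inline rows into the MERGE without creating a table, a table value constructor in the USING clause can supply them. A minimal sketch, reusing the columns from the question; the second row's values are made up purely for illustration:
MERGE INTO item_set TARGET
USING (VALUES
          ('545934', 4, 15, 'Note for this item set', 0),
          ('545935', 1, 22, 'Another note', 0)   -- made-up second row for illustration
      ) AS SOURCE (product_id_01, set_sort_no, article_id, note, is_deleted)
ON TARGET.set_sort_no = SOURCE.set_sort_no AND TARGET.product_id_01 = SOURCE.product_id_01
WHEN MATCHED THEN
    UPDATE SET TARGET.article_id = SOURCE.article_id,
               TARGET.note = SOURCE.note,
               TARGET.is_deleted = SOURCE.is_deleted
WHEN NOT MATCHED THEN
    INSERT (product_id_01, set_sort_no, article_id, note, is_deleted, version)
    VALUES (SOURCE.product_id_01, SOURCE.set_sort_no, SOURCE.article_id, SOURCE.note, SOURCE.is_deleted, 3);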

Execute stored procedure in Azure Data Platform - Post SQL Scripts

Based on the documentation below,
https://learn.microsoft.com/en-us/azure/data-factory/connector-azure-sql-database
there is a feature to run a post SQL script. Would it be possible to run a stored procedure from there?
I have tried it, it does not seem to be working, and I am currently investigating.
Thanks for your information in advance.
I created a test to prove that a stored procedure can be called from the Post SQL scripts field.
I created two tables:
CREATE TABLE [dbo].[emp](
    id int IDENTITY(1,1),
    [name] [nvarchar](max) NULL,
    [age] [nvarchar](max) NULL
)
CREATE TABLE [dbo].[emp_stage](
    id int,
    [name] [nvarchar](max) NULL,
    [age] [nvarchar](max) NULL
)
I created a stored procedure:
CREATE PROCEDURE [dbo].[spMergeEmpData]
AS
BEGIN
    SET IDENTITY_INSERT dbo.emp ON;

    MERGE [dbo].[emp] AS target
    USING [dbo].[emp_stage] AS source
    ON (target.[id] = source.[id])
    WHEN MATCHED THEN
        UPDATE SET name = source.name,
                   age = source.age
    WHEN NOT MATCHED THEN
        INSERT (id, name, age)
        VALUES (source.id, source.name, source.age);

    -- Turn identity insert back off so it does not leak into the rest of the session.
    SET IDENTITY_INSERT dbo.emp OFF;

    TRUNCATE TABLE [dbo].[emp_stage];
END
I will copy the CSV file into my Azure SQL staging table [dbo].[emp_stage], then use the stored procedure [dbo].[spMergeEmpData] to transfer the data from [dbo].[emp_stage] to [dbo].[emp].
Enter exec [dbo].[spMergeEmpData] in the Post SQL scripts field.
The debug run succeeded, and I can see that all the data ended up in the table [dbo].[emp].

Delta Lake error - MERGE destination only supports Delta sources

I am trying to implement SCD Type 2 in Delta Lake and I am getting the following error: "MERGE destination only supports Delta sources".
Below is the code snippet I am executing.
MERGE INTO stageviews.employeetarget t
USING (
-- The records from the first select statement, will have both new & updated records
SELECT id as mergeKey, src.*
FROM stageviews.employeeupdate src
UNION ALL
-- Identify the updated records & setting the mergeKey to NULL forces these rows to NOT MATCH and be INSERTED into target.
SELECT NULL as mergeKey, src.*
FROM stageviews.employeeupdate src JOIN stageviews.employeetarget tgt
ON src.id = tgt.id
WHERE tgt.ind_flag = "1"
AND sha2(src.EmployeeName,256) <> sha2(tgt.EmployeeName ,256)
) as s
ON t.id = s.mergeKey
WHEN MATCHED AND
( t.ind_flag = "1" AND sha2(t.EmployeeName,256) <> sha2(s.EmployeeName ,256) ) THEN
UPDATE SET t.ind_flag = "0", t.eff_end_date = current_date()-1
WHEN NOT MATCHED THEN
INSERT(t.Id,t.EmployeeName,t.JobTitle,t.BasePay,t.OvertimePay,t.OtherPay,t.Benefits,t.TotalPay,t.TotalPayBenefits,t.Year,t.Notes,t.Agency,t.Status,t.ind_flag,t.create_date,t.update_date,t.eff_start_date,t.eff_end_date)
values(s.Id,s.EmployeeName,s.JobTitle,s.BasePay,s.OvertimePay,s.OtherPay,s.Benefits,s.TotalPay,s.TotalPayBenefits,s.Year,s.Notes,s.Agency,s.Status,s.ind_flag,
current_date(),current_date(),current_date(),to_date('9999-12-31'))
Unfortunately, Databricks only supports MERGE (and updates in general) on Delta (Delta Lake) tables.
The error message Error in SQL statement: AnalysisException: MERGE destination only supports Delta sources indicates that you are trying to run the MERGE against a target table that is not a Delta table.
MERGE applies a set of updates, insertions, and deletions based on a source table to a target Delta table.
Reference: Azure Databricks - Merge and SCD Type 2 using Merge.
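One way forward is to make the target a Delta table before running the MERGE. A minimal sketch, assuming stageviews.employeetarget is currently stored as Parquet (which exact option applies depends on how the table was created; the name employeetarget_delta is made up for this sketch):
-- Option 1: convert an existing Parquet metastore table in place (Databricks).
CONVERT TO DELTA stageviews.employeetarget;

-- Option 2: recreate the target as a Delta table from the existing data,
-- then repoint the MERGE (and any views) at the new table.
CREATE TABLE stageviews.employeetarget_delta
USING DELTA
AS SELECT * FROM stageviews.employeetarget;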

Alternative of sp_depends in Azure Data Warehouse

I need to get the list of tables used in a stored procedure; however, in Azure Data Warehouse sp_depends is not supported.
The other alternative I thought of is to get the stored proc code from INFORMATION_SCHEMA.ROUTINES and then run a script to extract the [schema].[tablename] references from the definition, but the issue there is storing the whole stored proc in a variable: VARCHAR(MAX) has a limit of 8000 characters to store, and if my proc exceeds this limit I won't be able to get the complete table list.
Try using sys.sql_expression_dependencies. The following query may help you:
SELECT ReferencingObjectType = o1.type,
ReferencingObject = SCHEMA_NAME(o1.schema_id)+'.'+o1.name,
ReferencedObject = SCHEMA_NAME(o2.schema_id)+'.'+ed.referenced_entity_name,
ReferencedObjectType = o2.type
FROM sys.sql_expression_dependencies ed
INNER JOIN sys.objects o1
ON ed.referencing_id = o1.object_id
INNER JOIN sys.objects o2
ON ed.referenced_id = o2.object_id
WHERE o1.type in ('P','TR','V', 'TF')
ORDER BY ReferencingObjectType, ReferencingObject
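To narrow the result to the tables referenced by one specific procedure, you can filter on the referencing object's name and restrict the referenced objects to user tables. A small sketch, where [dbo].[MyProc] is a placeholder name, not one from the question:
-- Placeholder procedure name; replace dbo.MyProc with your stored procedure.
SELECT ReferencedTable = SCHEMA_NAME(o2.schema_id) + '.' + ed.referenced_entity_name
FROM sys.sql_expression_dependencies ed
INNER JOIN sys.objects o1 ON ed.referencing_id = o1.object_id
INNER JOIN sys.objects o2 ON ed.referenced_id = o2.object_id
WHERE o1.type = 'P'
  AND SCHEMA_NAME(o1.schema_id) = 'dbo'
  AND o1.name = 'MyProc'
  AND o2.type = 'U';   -- 'U' = user table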

COPY FROM CSV with static fields on Postgres

I'd like to switch an existing system that imports data into a PostgreSQL 9.5 database from CSV files over to something more efficient.
I'd like to use the COPY statement because of its good performance. The problem is that I need one field populated that is not in the CSV file.
Is there a way to have the COPY statement add a static field to all the rows inserted?
The perfect solution would have looked like this:
COPY data(field1, field2, field3='Account-005')
FROM '/tmp/Account-005.csv'
WITH DELIMITER ',' CSV HEADER;
Do you know a way to have that field populated in every row?
My server is running node.js, so I'm also open to any cost-efficient solution that completes the files with node before COPYing them.
Use a temp table to import into. This allows you to:
add/remove/update columns,
add extra literal data,
delete or ignore records (such as duplicates),
before inserting the new records into the actual table.
-- target table
CREATE TABLE data
( id SERIAL PRIMARY KEY
, batch_name varchar NOT NULL
, remote_key varchar NOT NULL
, payload varchar
, UNIQUE (batch_name, remote_key)
-- or::
-- , UNIQUE (remote_key)
);
-- temp table
CREATE TEMP TABLE temp_data
( remote_key varchar -- PRIMARY KEY
, payload varchar
);
COPY temp_data(remote_key,payload)
FROM '/tmp/Account-005'
;
-- The actual insert
-- (you could also filter out or handle duplicates here)
INSERT INTO data(batch_name, remote_key, payload)
SELECT 'Account-005', t.remote_key, t.payload
FROM temp_data t
;
BTW, it is possible to automate the above: put it into a function (or maybe a prepared statement), using the filename and the literal as arguments.
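A minimal sketch of such a function, reusing the data/temp_data layout above; the function name import_batch is made up for this sketch, and COPY from a server-side file still needs the appropriate file-read privileges:
CREATE OR REPLACE FUNCTION import_batch(p_batch_name text, p_path text)
RETURNS void
LANGUAGE plpgsql
AS $$
BEGIN
    -- Recreate the temp table for this run.
    DROP TABLE IF EXISTS temp_data;
    CREATE TEMP TABLE temp_data
    ( remote_key varchar
    , payload    varchar
    );

    -- COPY cannot take parameters, so the file name is spliced in with format().
    EXECUTE format(
        'COPY temp_data(remote_key, payload) FROM %L WITH (FORMAT csv, HEADER)',
        p_path);

    -- Add the static batch name while moving rows into the real table.
    INSERT INTO data(batch_name, remote_key, payload)
    SELECT p_batch_name, t.remote_key, t.payload
    FROM temp_data t;
END;
$$;

-- Usage:
-- SELECT import_batch('Account-005', '/tmp/Account-005.csv');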
Set a default for the column:
alter table data
alter column field3 set default 'Account-005'
Do not mention it in the COPY command:
COPY data(field1, field2) FROM...
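If the static value changes per batch, the default can be reset before each load and dropped afterwards. A small sketch of that sequence, reusing the file path from the question:
alter table data alter column field3 set default 'Account-005';
COPY data(field1, field2) FROM '/tmp/Account-005.csv' WITH DELIMITER ',' CSV HEADER;
alter table data alter column field3 drop default;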
