I am trying to load files from my Azure blob storage into a Snowflake table incrementally. In Snowflake, I then put streams on that table and load the data into the target table.
I am unable to do an incremental load from Azure to Snowflake. I have tried several approaches, but none of them work. I am attaching images of my 2 different pipelines that attempt this.
In this pipeline, I just added 3 extra columns which I wanted in my sink
In this pipeline, I tried creating conditional splits
Neither of these has worked out.
Kindly suggest how to go about this.
You can achieve this by selecting Allow upsert under the Update method in the sink settings.
Below are my repro details:
This is the staging table in Snowflake into which I am loading the incremental data.
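For reference, a minimal sketch of that staging table (the data types here are assumptions; only StateCode, Flag and modified_date matter for this repro):
-- Sketch of the Snowflake staging table used in this repro; types are assumptions.
CREATE TABLE stage_state
(
    StateCode     VARCHAR(10),
    Flag          VARCHAR(10),
    modified_date TIMESTAMP_NTZ
);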
Source file – Incremental data
a) This file contains records that exist in the staging table (StateCode = ‘AK’ & ‘CA’), so these 2 records should be updated in the staging table with new values in Flag.
b) Other 2 records (StateCode = ‘FL’ & ‘AZ’) should be inserted into the staging table.
Dataflow source settings:
Adding a DerivedColumn transformation to add a column modified_date, which is not in the source file but is in the sink table.
Adding an AlterRow transformation: when you use the Upsert method, an AlterRow transformation is required to define the upsert condition.
a) In the condition, you can specify that rows should be upserted only when the unique column (StateCode in my case) is not null.
Adding a sink transformation after AlterRow with the Snowflake staging table as the sink dataset.
a) In the sink settings, set the Update method to Allow upsert and provide the key (unique) column on which the upsert should happen in the sink table.
After you run the pipeline, you can see the results in the sink.
a) As StateCode AK & CA already exist in the table, only the other column values (Flag & modified_date) for those rows are updated.
b) StateCode AZ & FL are not found in the staging (sink) table, so these rows are inserted.
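For reference, the upsert performed by the sink is logically equivalent to a Snowflake MERGE keyed on StateCode. A sketch only, with incremental_source standing in for the rows arriving from the dataflow:
-- Logical equivalent of the dataflow upsert, keyed on StateCode.
-- incremental_source is a placeholder for the incoming rows.
MERGE INTO stage_state AS tgt
USING incremental_source AS src
    ON tgt.StateCode = src.StateCode
WHEN MATCHED THEN
    UPDATE SET Flag = src.Flag, modified_date = src.modified_date
WHEN NOT MATCHED THEN
    INSERT (StateCode, Flag, modified_date)
    VALUES (src.StateCode, src.Flag, src.modified_date);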
I am new to Synapse data flows and have not been able to achieve what might be a simple transformation.
In Dynamics I have a table with over 3.5 million rows that I bring in daily with a simple dataflow (source > sink) into a dedicated SQL database in Synapse. Right now I'm truncating the table and loading all the data with every pipeline run.
I want to insert only new rows (the key is a GUID) and update existing rows where the field versionnumber has changed.
I found this detailed guide for SSIS but have not been able to replicate it in Synapse. I'm not sure what transformations I need.
I have tried with Source > Alter Rows > Sink, providing a definition for upsert in alter rows with only the GUID or with a combination of GUID and versionnumber.
Steps to change the data in the sink to match the source using an ADF dataflow:
Delete the rows that are in the sink (SQL pool) but not in the source.
Upsert (update if present, insert if not) the rows that are in the source into the sink.
I tried to repro this using a sample dataset. Below is the approach.
Initially, the data in the source and sink are as in the image below.
In the source, Name is updated for id=1, a new row is inserted with id=6, and the row with id=3 is deleted.
| id | Name |
|----|-----------|
| 1 | Kaala |
| 2 | Arulmozhi |
| 6 | Rajaraja |
Step 1: Delete the rows
In the dataflow, Source1 is taken with the above data.
Source2 is taken with the Synapse SQL pool table.
An Exists transformation is added to find the ids that are in source2 but not in source1.
Left stream: source2
Right stream: source1
Exists type: Doesn't exist
Exists condition: Source1@id == Source2@id
An Alter Row transformation is added after the Exists transformation, and the condition given is
delete if true()
The sink dataset is the same as the source2 dataset (Synapse SQL pool). In the sink settings, Allow delete is selected as the update method and id is selected as the key column.
The above steps delete the records in the sink that were deleted from the source.
Step 2: Upsert the data
To upsert, a new branch is added from the source1 transformation.
An Alter Row transformation is added and the condition given is upsert if true().
A sink transformation is added with the same SQL pool as the sink dataset. The update method is Allow upsert and the key column is id.
Overall design of dataflow
When the pipeline is run with this dataflow, data is upserted and deleted to match the source.
You can replace the id column in this repro with the GUID + versionnumber column combination and follow the same steps.
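For reference, the dataflow above implements the following logic; a plain T-SQL sketch only, where dbo.target is the SQL pool table and dbo.staging_source stands in for the latest source rows (both names are assumptions):
-- Step 1: delete rows that no longer exist in the source.
DELETE FROM dbo.target
WHERE NOT EXISTS (SELECT 1 FROM dbo.staging_source AS s WHERE s.id = dbo.target.id);
-- Step 2: upsert - update matching rows, then insert the new ones.
UPDATE dbo.target
SET Name = (SELECT s.Name FROM dbo.staging_source AS s WHERE s.id = dbo.target.id)
WHERE EXISTS (SELECT 1 FROM dbo.staging_source AS s WHERE s.id = dbo.target.id);
INSERT INTO dbo.target (id, Name)
SELECT s.id, s.Name
FROM dbo.staging_source AS s
WHERE NOT EXISTS (SELECT 1 FROM dbo.target AS t WHERE t.id = s.id);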
I have a table in SQL and it is copied to ADLS. After copying, new rows were inserted into the SQL table. I want to get only the new rows.
I tried to use a Join transformation, but I couldn't get the output. What is the way to achieve this?
Refer to this link. Using this approach you can get newly added rows from SQL into Data Lake Storage. I reproduced the issue on my side and was able to get the newly added records from the pipeline.
Created two tables in the SQL database named data_source_table and watermarktable.
data_source_table is the one that holds the data, and watermarktable is used for tracking new records based on date.
Created the pipeline as shown below,
In Lookup1, the watermark table is selected so that the old watermark value can be read.
In Lookup2, select Query and use the following to get the new watermark:
select MAX(LastModifytime) as NewWatermarkvalue from data_source_table;
Then in the Copy activity, the source and sink are configured as shown in the images below.
SOURCE:
Query in Source:
select * from data_source_table where LastModifytime > '@{activity('Lookup1').output.firstRow.WatermarkValue}' and LastModifytime <= '@{activity('Lookup2').output.firstRow.NewWatermarkvalue}'
SINK:
The pipeline ran successfully and the data in the SQL table was loaded into the data lake storage file.
New rows were then inserted into data_source_table, and those records were picked up via the Lookup activity on the next run.
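To keep subsequent runs incremental, the stored watermark has to be advanced after each successful copy. Following the linked tutorial's pattern, this can be done with a stored procedure activity after the Copy activity; a minimal sketch (assuming watermarktable has TableName and WatermarkValue columns, as in the tutorial):
-- Advances the stored watermark so the next run only picks up newer rows.
CREATE PROCEDURE usp_write_watermark @LastModifiedtime datetime, @TableName varchar(50)
AS
BEGIN
    UPDATE watermarktable
    SET WatermarkValue = @LastModifiedtime
    WHERE TableName = @TableName;
END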
I am trying to do an incremental data load to Azure SQL from CSV files in ADLS through ADF. The problem I am facing is that Azure SQL generates the primary key column (ID) when the data is inserted, so when the pipeline is re-triggered the data gets duplicated. How do I handle these duplicates? Only the incremental load should be applied each time, but since the primary key column is generated by SQL, there are duplicates on every run. Please help!
You can consider comparing the source and sink data first, excluding the primary key column, and then filtering the rows that were modified and taking them to the sink table.
In the video below I created a hash on top of a few columns from the source and sink and compared them to identify changed data. In the same way, you can check for the changed data first and then load it to the sink table.
https://www.youtube.com/watch?v=i2PkwNqxj1E
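To illustrate the hash idea (this is not the exact code from the video; it assumes the CSV rows are first landed in a staging table, and all table and column names are placeholders), you can compare a hash of the non-key columns between the staged source data and the sink to pick out new and changed rows:
-- Compare a hash of the business columns to find new or changed rows.
-- dbo.stg_source, dbo.target, BusinessKey, Col1 and Col2 are placeholders for illustration.
SELECT s.*
FROM dbo.stg_source AS s
LEFT JOIN dbo.target AS t
    ON t.BusinessKey = s.BusinessKey
WHERE t.BusinessKey IS NULL                                        -- new row
   OR HASHBYTES('SHA2_256', CONCAT_WS('|', s.Col1, s.Col2))
      <> HASHBYTES('SHA2_256', CONCAT_WS('|', t.Col1, t.Col2));    -- changed row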
I want to create an ADF data pipeline that compares both tables and, after the comparison, adds the missing rows from table A to table B:
table A - 100 records
table B - 90 records
add the difference of 10 rows from table A to table B
This is what I tried:
picture1
picture2
if condition 1 - @greaterOrEquals(activity('GetLastModifiedDate').output.lastModified, adddays(utcnow(),-7))
if condition 2 - @and(equals(item().name,'master_data'), greaterOrEquals(activity('GetLastModifiedDate').output.lastModified, adddays(utcnow(),-7)))
The Copy activity has an Upsert mode which I think would help here. Simple instructions:
Create one Copy activity
Set your source database in the Source tab of the Copy activity
Set your target (or sink) database in the Sink tab. Set the mode to Upsert
Specify the interim schema. This is used to create a transient table which holds data during the Upsert
Specify the unique keys for the source and target table in the Key columns section so the Upsert can take place successfully
A simple example:
Failing that, simply use a Copy activity to land the data into a table in your target database and use a Stored Proc activity to implement your more complicated logic.
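If you do use a stored proc for the "add the missing rows from table A to table B" part, the logic can be as small as this sketch (the table names, columns and key column id are assumptions):
-- Insert the rows that exist in TableA but not yet in TableB; id is an assumed key column.
INSERT INTO dbo.TableB (id, SomeColumn)
SELECT a.id, a.SomeColumn
FROM dbo.TableA AS a
WHERE NOT EXISTS (SELECT 1 FROM dbo.TableB AS b WHERE b.id = a.id);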
I have an Excel file as source that needs to be copied into an Azure SQL database using Azure Data Factory.
The ADF pipeline needs to copy rows from the Excel source to the SQL database only if they do not already exist in the database. If a row already exists in the SQL database, then no action needs to be taken.
Looking forward to the most optimized solution.
You can achieve this with an Azure Data Factory data flow by joining the source and sink data and filtering down to the new rows, so that a row is inserted only if it does not already exist in the sink database.
Example:
Connect excel source to source transformation in the data flow.
Source preview:
You can transform the source data if required using the derived column transformation. This is optional.
Add another source transformation and connect it with the sink dataset (Azure SQL database). Here in the Source option, you can select a table if you are comparing all columns of the sink dataset with the source dataset, or you can select query and write the query to select only matching columns.
Source2 output:
Join source1 and source2 transformations using the Join transformation with join type as Left outer join and add the Join conditions based on the requirement.
Join output:
Using filter transformation, filter out the existing rows from the join output.
Filter condition: isNull(source2@Id)==true()
Filter output:
Using the Select transformation, you can remove the duplicate columns (like the source2 columns) from the list. You can also do this in the sink mapping by editing manually and deleting the duplicate mappings.
Add a sink and connect it to the sink dataset (Azure SQL database) to get the required output.
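In SQL terms, the join-plus-filter above is a left anti join: keep only the source rows that find no match in the sink. A sketch of the equivalent query, assuming Id is the matching column and the Excel rows were landed in a staging table (both names are placeholders):
-- Left anti join: keep only rows whose Id has no match in the sink table.
SELECT src.*
FROM dbo.ExcelStaging AS src
LEFT JOIN dbo.SinkTable AS snk
    ON snk.Id = src.Id
WHERE snk.Id IS NULL;   -- same role as the filter isNull(source2@Id)==true()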
You should create this using a Copy activity and a stored procedure as the Sink. Write code in the stored proc (eg MERGE or INSERT ... WHERE NOT EXISTS ...) to handle the record existing or not existing.
An example of a MERGE proc from the documentation:
CREATE PROCEDURE usp_OverwriteMarketing
    @Marketing [dbo].[MarketingType] READONLY,
    @category varchar(256)
AS
BEGIN
    MERGE [dbo].[Marketing] AS target
    USING @Marketing AS source
    ON (target.ProfileID = source.ProfileID AND target.Category = @category)
    WHEN MATCHED THEN
        UPDATE SET State = source.State
    WHEN NOT MATCHED THEN
        INSERT (ProfileID, State, Category)
        VALUES (source.ProfileID, source.State, source.Category);
END
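The proc above expects a matching user-defined table type for the @Marketing parameter; a sketch consistent with the columns it uses (the varchar lengths are assumptions):
-- Table type for the @Marketing table-valued parameter; lengths are assumed.
CREATE TYPE [dbo].[MarketingType] AS TABLE
(
    [ProfileID] varchar(256) NOT NULL,
    [State]     varchar(256) NOT NULL,
    [Category]  varchar(256) NOT NULL
);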
This article runs through the process in more detail.