Here is my situation. I am using the Alteryx ETL tool, where we append new records to Tableau using the 'Overwrite the file' option.
What this does is capture any incoming data in the target, delete the old data, and then publish the results in the Tableau visualisation tool.
So whatever data comes in from the source must overwrite the existing data in the sink table.
How can we achieve this in Azure Data Flow?
If you are writing to a database table, you'll see a sink setting for "truncate table" which will remove all previous rows and replace them with your new rows. Or if you are trying to overwrite just specific rows based on a key, then use an Alter Row transformation and use the "Update" option.
If your requirement is just to copy data from your source to the target and truncate the table before the latest data is copied, then you can simply use a Copy activity in Azure Data Factory. The Copy activity has an option called Pre-copy script, in which you can specify a query to truncate the table and then proceed with copying the latest data.
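For illustration, a minimal pre-copy script could look like the following; dbo.TargetTable is a hypothetical table name, so substitute your own schema and table:

-- Hypothetical pre-copy script: clear the sink table before the Copy activity loads the latest data
TRUNCATE TABLE dbo.TargetTable;
-- If TRUNCATE is not permitted (e.g. foreign key references), a DELETE achieves the same result, just more slowly:
-- DELETE FROM dbo.TargetTable;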
Here is an article by a community volunteer where a similar requirement has been discussed with various approaches - How to truncate table in Azure Data Factory
If your requirement is to transform the data first and then copy it to your target SQL table, truncating the table before you copy the latest transformed data, then you will have to use the Mapping Data Flow activity.
I am setting up a pipeline in Data Factory where the first part of the pipeline needs some pre-processing cleanup. I currently have a script set up to query the rows that need to be deleted and export the results into a CSV.
What I am looking for is essentially the opposite of an upsert copy activity. I would like the procedure to delete the rows in my table based on a matching row.
Apologies in advance if this is an easy solution; I am fairly new to Data Factory and just need help looking in the right direction.
Assuming the source from which you are initially getting the rows is different from the sink, there are multiple ways to achieve this:
If the number of rows is small, you can use a Script activity or a Lookup activity to delete the records from the destination table.
For a larger dataset (given the limitations of the Lookup activity), you can copy the data into a staging table in the destination and use a Script activity to delete the matching rows, as sketched after this list.
If your organization supports the use of data flows, you can use a data flow to achieve it.
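A minimal sketch of the Script activity query for the staging-table approach; dbo.RowsToDelete and dbo.TargetTable are hypothetical names, and the match is assumed to be on a single key column Id:

-- Delete rows from the target that have a matching key in the staging table
DELETE t
FROM dbo.TargetTable AS t
INNER JOIN dbo.RowsToDelete AS s
    ON t.Id = s.Id;

-- Optionally clear the staging table for the next run
TRUNCATE TABLE dbo.RowsToDelete;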
I am trying to do an incremental data load to Azure SQL from CSV files in ADLS through ADF. The problem I am facing is that Azure SQL generates the primary key column (ID) and the data is inserted into Azure SQL, but when the pipeline is re-triggered the data gets duplicated. So how do I handle these duplicates? Only the incremental load should be applied each time, but since the primary key column is generated by SQL there are duplicates on every run. Please help!
You can consider comparing the source and sink data first, excluding the primary key column, and then loading only the rows that were modified into the sink table.
In the video below I created a hash over a few columns from the source and the sink and compared them to identify changed data. In the same way, you can check for the changed data first and then load it into the sink table.
https://www.youtube.com/watch?v=i2PkwNqxj1E
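A minimal sketch of the hash-comparison idea in T-SQL; the tables stg.Source and dbo.Sink, the business key BusinessKey, and the compared columns Col1 and Col2 are all hypothetical placeholders:

-- Hash the comparable columns on both sides, then keep rows that are new or changed in the source
;WITH src AS (
    SELECT BusinessKey,
           HASHBYTES('SHA2_256', CONCAT(Col1, '|', Col2)) AS RowHash
    FROM stg.Source
),
snk AS (
    SELECT BusinessKey,
           HASHBYTES('SHA2_256', CONCAT(Col1, '|', Col2)) AS RowHash
    FROM dbo.Sink
)
SELECT s.BusinessKey
FROM src AS s
LEFT JOIN snk AS k
    ON s.BusinessKey = k.BusinessKey
WHERE k.BusinessKey IS NULL      -- new row
   OR s.RowHash <> k.RowHash;    -- changed row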
I am looking to use Azure Data Factory to import a number of users from a third-party source CSV file into the D365 user table/entity. This is usually straightforward, but on this occasion I have a complication.
The D365 users table/entity is already populated. The source CSV user file will have a mixture of users that are already in the D365 table/entity and others that are not.
What I would like to do is ensure the users in my source file that are already in the D365 table are not copied over as this would create duplicates.
Source CSV file
Existing D365 user table (subset of fields, just to illustrate)
Updated D365 table with two new records added from the source CSV
From what I can see, there are two possible solutions in Azure Data Factory:
Configure the D365 sink to do it, e.g. configure the sink to ignore records that match on a certain column.
Is it possible to configure the sink in some way to accomplish this?
Pull in the D365 table/entity as a source and use it to filter my source CSV to remove user records that already exist in D365, perhaps by using a common field such as fullname to identify such records. This would ensure I only try to import new users.
I have had a look into both methods but have been struggling to find a way to implement them.
I'd like to think the scenario I have outlined above is not uncommon and that there are tried and tested methods to filter out records from a source CSV that already exist in the target D365 table.
I'd appreciate any help/suggestions to achieve this.
You can use either of these two approaches.
Use an Azure data flow and upsert the data to the sink by setting Upsert as the writeBehavior in your Dynamics sink transformation. You can refer to this SO link for information on using the upsert method in Azure Data Factory.
Pull the CSV data as source1 and the D365 table data as source2, and connect both sources to a Join transformation with a left outer join. Then use a Filter transformation to keep only the rows where the source2 (right table) key is NULL. The output of the Filter transformation will contain only the new records, which can be passed directly to the D365 sink transformation. You can refer to this SO thread for a similar process.
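For reference, the same anti-join logic expressed as SQL (illustration only; the actual implementation uses the Join and Filter transformations described above, and the table and column names here are hypothetical):

-- Keep only CSV rows that have no matching D365 user, matched on a common field such as fullname
SELECT csv.*
FROM stg.SourceCsvUsers AS csv
LEFT OUTER JOIN d365.SystemUsers AS u
    ON csv.fullname = u.fullname
WHERE u.fullname IS NULL;   -- NULL on the right side means the user does not exist yet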
When we do a data extract like yours from Synapse into Azure, upsert does not work correctly and often goes into a dead loop.
What we do:
Create a temp table in the target
Select the source table data and extract it into the target temp table
Run a stored procedure to update, insert, and delete rows in the real target table based on the temp table
Here is the stored procedure query; I hope it can help you:
-- Update existing target rows from the matching rows in the Synapse staging table
UPDATE t
SET
     t.bcfsa_teamid = syn.bcfsa_teamid
    ,t.bcfsa_account = syn.bcfsa_account
    ,t.bcfsa_name = syn.bcfsa_name
    ,t.bcfsa_contactteamownerid = syn.bcfsa_contactteamownerid
FROM website.bcfsa_team t
INNER JOIN syn.bcfsa_team syn ON t.id = syn.id;

-- Insert rows that exist in the staging table but not yet in the target
INSERT website.bcfsa_team (id, bcfsa_teamid, bcfsa_account, bcfsa_name, bcfsa_contactteamownerid)
SELECT id, bcfsa_teamid, bcfsa_account, bcfsa_name, bcfsa_contactteamownerid
FROM syn.bcfsa_team
EXCEPT
SELECT id, bcfsa_teamid, bcfsa_account, bcfsa_name, bcfsa_contactteamownerid
FROM website.bcfsa_team;

-- Delete target rows that no longer exist in the staging table
DELETE FROM website.bcfsa_team
WHERE id NOT IN (SELECT id FROM syn.bcfsa_team);

-- Clear the staging table for the next run
TRUNCATE TABLE syn.bcfsa_team;
I am trying to add an additional column to the Azure Data Explorer sink from the source Blob Storage using the "Additional columns" option in the Copy activity "Source" tab, but I am getting the following error:
"Additional columns are not supported for your sink dataset, please create a new dataset to enable additional columns."
When I change the sink dataset to Blob Storage, it works fine and the additional column gets created. Is there anything I am missing here when I use Azure Data Explorer as the sink?
Alternatively, how can I add an additional column to the Azure Data Explorer table as a sink?
Additional Columns
As per the official doc - this feature works with the latest dataset model. If you don't see this option in the UI, try creating a new dataset.
The ADX sink doesn't support altering the table from the Copy activity.
To add a column to the ADX table, run the .alter-merge table command in advance and map the additional column to the target column under the Mapping tab of the Copy activity.
.alter-merge table command
Is there any way we can fetch the max last-modified date from the last processed file and store it in a config table?
From Supported data stores and formats you can see that Salesforce, Salesforce Service Cloud, and Marketing Cloud are supported.
You have to perform the following steps:
Prepare the data store to store the watermark value (a minimal sketch follows this list).
Create a data factory.
Create linked services.
Create source, sink, and watermark datasets.
Create a pipeline.
Run the pipeline.
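For the first step, a minimal sketch of the watermark store, assuming an Azure SQL control table; the table and column names (dbo.watermarktable, TableName, WatermarkValue) are illustrative and should match whatever your pipeline expects:

-- Hypothetical watermark (control) table: one row per source table being loaded incrementally
CREATE TABLE dbo.watermarktable
(
    TableName      NVARCHAR(255) NOT NULL,  -- source table this watermark belongs to
    WatermarkValue DATETIME      NOT NULL   -- last high-watermark that was successfully copied
);

-- Seed it with an initial value so the first run copies everything
INSERT INTO dbo.watermarktable (TableName, WatermarkValue)
VALUES ('data_source_table', '1900-01-01');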
Follow this to set up a linked service with Salesforce in Azure Data Factory.
When copying data from Salesforce, you can use either a SOQL query or a SQL query. Note that these two have different syntax and functionality support, so do not mix them. It is suggested that you use the SOQL query, which is natively supported by Salesforce.
Process for incremental loading, or delta loading, of data through a watermark:
In this case, you define a watermark in your source database. A watermark is a column that has the last updated time stamp or an incrementing key. The delta loading solution loads the changed data between an old watermark and a new watermark. The workflow for this approach is depicted in the following diagram:
ADF will scan all the files from the source store, apply the file filter by their LastModifiedDate, and copy only the new and updated files since the last run to the destination store.
For capabilities, prerequisites, and Salesforce request limits, refer to Copy data from and to Salesforce by using Azure Data Factory.
Refer to the doc Delta copy from a database with a control table. This article describes a template that's available to incrementally load new or updated rows from a database table to Azure by using an external control table that stores a high-watermark value.
This template requires that the schema of the source database contains a timestamp column or incrementing key to identify new or updated rows.
The template contains four activities:
Lookup retrieves the old high-watermark value, which is stored in an external control table.
Another Lookup activity retrieves the current high-watermark value from the source database.
Copy copies only the changes from the source database to the destination store. The query that identifies the changes in the source database is similar to SELECT * FROM Data_Source_Table WHERE TIMESTAMP_Column > 'last high-watermark' AND TIMESTAMP_Column <= 'current high-watermark'.
StoredProcedure writes the current high-watermark value to an external control table for the next delta copy (a sketch of both pieces follows this list).
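A minimal sketch of the delta query and the watermark-update stored procedure, assuming the hypothetical control table dbo.watermarktable above, a source table data_source_table, and a timestamp column LastModifytime; the parameter names of [dbo].[update_watermark] are also assumptions:

-- 1. Delta query used by the Copy activity; in the pipeline the two values come from the Lookup activities
DECLARE @OldWatermark DATETIME = '1900-01-01',        -- value read from dbo.watermarktable
        @NewWatermark DATETIME = SYSUTCDATETIME();    -- current high-watermark from the source

SELECT *
FROM data_source_table
WHERE LastModifytime > @OldWatermark
  AND LastModifytime <= @NewWatermark;
GO

-- 2. Stored procedure the template calls to persist the new high-watermark for the next run
CREATE PROCEDURE dbo.update_watermark
    @LastModifiedtime DATETIME,
    @TableName NVARCHAR(255)
AS
BEGIN
    UPDATE dbo.watermarktable
    SET WatermarkValue = @LastModifiedtime
    WHERE TableName = @TableName;
END;
GO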
Go to the Delta copy from Database template. Create a new connection to the source database that you want to copy data from.
Create connections to the external control table and the stored procedure that you created, and select Use this template.
Choose the available pipeline
For Stored procedure name, choose [dbo].[update_watermark]. Select Import parameter, and then select Add dynamic content.
Click Add dynamic content and type in the query below. This will get the maximum date in your watermark column, which we can use for the delta slice.
You can use this query to fetch the max last-modified date from the last processed data:
select MAX(LastModifytime) as NewWatermarkvalue from data_source_table
or
For files only, you can use Incrementally copy new and changed files based on LastModifiedDate by using the Copy Data tool.
Refer to source: ADF Incremental loading with configuration stored in a table