Azure Data Flow - pre-SQL script with dynamic content

I want to run a pre-SQL script in the sink of a Data Flow. I want to delete existing records for a particular year.
This particular year will come from the source Excel file.
Say I have a file with 2021 data and have loaded that data into the DB. When I rerun the pipeline for the same Excel file, I want to delete the 2021 records in the DB and insert the data fresh. The table may contain data for multiple years, so every time a new file arrives for a particular year I want to delete that year's records and reload the new data.
I can read the year value from a source file column and keep it as a derived column. How can I write a pre-SQL script to delete?
DELETE FROM <table> WHERE year = <sourcefile.year>
How can I do this in a Data Flow?
Please help!

You can use a temporary table in sink 1 and delete the records in sink 2.
Please follow the demonstration below:
This is my SQL table with sample data.
Create two SQL sinks for it from the same source using a new branch.
In the first sink, provide a dataset with the edit table name option for the temporary table.
In sink 1, check Recreate table.
In sink 2, give your SQL table with the pre-SQL script below:
DELETE FROM exceltable WHERE year IN (SELECT year FROM dbo.temp1);
DROP TABLE dbo.temp1;
In the Data Flow settings, give the write order of the sinks.
The temporary table sink should execute first.
This is my result in the SQL table after deleting records.
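To recap, the two sinks together behave roughly like the following sequence against the database (a sketch using the same table names as above; sink 1 performs the temp-table load itself, so no hand-written SQL is needed for that step):

-- 1. Sink 1, with Recreate table enabled, drops and recreates dbo.temp1 and loads the incoming file's rows into it
-- 2. Sink 2's pre-SQL script then removes the matching years from the target and drops the temp table
DELETE FROM exceltable WHERE year IN (SELECT year FROM dbo.temp1);
DROP TABLE dbo.temp1;
-- 3. Sink 2 then inserts the fresh rows from the file into exceltable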

Related

Insert Excel sheet data to an existing Big Query table

We have an existing table on BigQuery that gets updated via a scheduler that checks an FTP server and uploads the newly added data into it.
The issue is that a few days were dropped from the FTP, and now I need to upload that data manually into the table.
Ideally, I don't want to create another table, upload the data into it, and then union the two tables; I was looking for a solution that would insert the sheet data into the main table right away.

Overwrite sql table with new data in Azure Dataflow

Here is my situation. I am using the Alteryx ETL tool, where we basically append new records for Tableau by using the provided 'Overwrite the file' option.
What it does is capture any incoming data to the target, delete the old data, and publish the results in the Tableau visualisation tool.
So whatever data comes in from the source must overwrite the existing data in the sink table.
How can we achieve this in Azure Data Flow?
If you are writing to a database table, you'll see a sink setting for "truncate table" which will remove all previous rows and replace them with your new rows. Or if you are trying to overwrite just specific rows based on a key, then use an Alter Row transformation and use the "Update" option.
If your requirement is just to copy data from your source to the target and truncate the table data before the latest data is copied, then you can just use a copy activity in Azure Data Factory. In the copy activity you have an option called Pre-copy script, in which you can specify a query to truncate the table data and then proceed with copying the latest data.
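For example, the pre-copy script can be as simple as a single TRUNCATE statement (dbo.TargetTable is a hypothetical name; replace it with your own table):

-- Pre-copy script on the copy activity sink (hypothetical table name)
TRUNCATE TABLE dbo.TargetTable;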
Here is an article by a community volunteer where a similar requirement has been discussed with various approaches - How to truncate table in Azure Data Factory
If your requirement is to transform the data first and then copy it to your target SQL table, truncating the table before the latest transformed data is copied, then you will have to use a mapping data flow activity.

Having access to Records that went to Sink from a Copy Activity

The source of my Copy activity is the result of calling a REST API, and the sink is an Azure SQL table into which I insert those records.
Now I want to know which records just got inserted into that sink so I can run some update statements on them. So my question is: how can I know what went into the sink so I can update those records?
Two methods come to mind:
If the table has a timestamp of when the records were inserted, then you could use that value to know which rows were just inserted.
Instead of putting the records directly into the final table, put them in a staging table. Then you can either do the updates as part of the insert that moves them to the final table, or update them in the staging table and then copy them over to the final table. Just remember to truncate the staging table every run so it only has the new records.
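A minimal sketch of the staging-table pattern, assuming hypothetical tables dbo.Staging and dbo.Final with matching columns and a hypothetical Status column that needs updating:

-- Update the freshly loaded rows while they are still isolated in staging (hypothetical column names)
UPDATE dbo.Staging SET Status = 'Processed';

-- Move them into the final table
INSERT INTO dbo.Final (Id, Name, Status)
SELECT Id, Name, Status FROM dbo.Staging;

-- Clear staging so the next run only contains the new records
TRUNCATE TABLE dbo.Staging;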

External Table in Databricks is showing only future date data

I have a Delta table in Databricks and the data is available in ADLS. The data is partitioned by a date column. From 01-06-2022 onwards the data is available in Parquet format in ADLS, but when I query the table in Databricks I can only see data from the latest date onwards each day; older data is not displayed. Every day the data is overwritten to the table path with the partitioned date column.
df.write.format('delta').mode('overwrite').save('{}/{}'.format(DELTALAKE_PATH, table))
Using Overwrite mode will delete past data and add new data. This is the reason for your issue.
df.write.format('delta').mode('append').save('{}/{}'.format(DELTALAKE_PATH, table))
Using append mode will append the new data to the existing data. This keeps your existing data, and when you execute a query it will return past records as well.
You need to use append mode in place of overwrite mode.
Append Mode - Only the new rows appended in the Result Table since the last trigger will be written to the external storage. This is applicable only to queries where existing rows in the Result Table are not expected to change.
Reference - https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#basic-concepts

Filter out duplicate records with azure data factory before importing into Dynamics 365 user table

I am looking to use Azure Data Factory to import a number of users from a third-party source CSV file into the D365 user table/entity. This is usually straightforward, but on this occasion I have a complication.
The D365 users table/entity is already populated. The source CSV user file will have a mixture of users that are already in the D365 table/entity and others that are not.
What I would like to do is ensure the users in my source file that are already in the D365 table are not copied over as this would create duplicates.
Source CSV FILE
Existing D365 User table (subset of fields just to illustrate)
Updated D365 table with two new records added from the source CSV
From what I can see there are two possible solutions in Azure Data Factory
Configure the D365 sink to do it, e.g. configure the sink to ignore records that match on a certain column.
Is it possible to configure the sink in some way to accomplish this?
Pull in the D365 table/entity as a source and use it to filter my source CSV to remove user records that already exist in D365 perhaps by using a common field such as fullname to identify such records. This would ensure I only try to import new users.
I have had a look into both methods but have been struggling to find a way to implement them.
I'd like to think the scenario I have outlined above is not uncommon and that there are tried and tested methods to filter out records from a source CSV that already exist in the target D365 table.
I'd appreciate any help/suggestions to help me achieve this.
You can use either of these 2 approaches.
Use an Azure data flow and upsert the data to the sink using Upsert as your writeBehavior in your Dynamics sink transformation. You can refer to this SO link for information on using the Upsert method in Azure Data Factory.
Pull the CSV data as source1 and the D365 table data as source2, and connect both sources to a join transformation with a left outer join. Then use a filter transformation to filter out the NULL records of source2 (the right table). The output of the filter transformation will be only the new records, which can be passed directly to the D365 sink transformation. You can refer to this SO thread for a similar process.
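The join-and-filter logic in the second approach is equivalent to the following SQL, shown only to illustrate the idea (hypothetical table names, using fullname as the matching key as suggested above):

-- Keep only the CSV users that have no match in the existing D365 user table
SELECT csv.*
FROM SourceCsvUsers AS csv
LEFT OUTER JOIN D365Users AS d
  ON csv.fullname = d.fullname
WHERE d.fullname IS NULL;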
When we do a data extract like yours from Synapse into Azure, upsert does not work correctly and often goes into a dead loop.
What we do:
Create a temp table in the target.
Select the source table data and extract it into the target temp table.
Run a stored procedure to update, insert, and delete in the target real table based on the temp table.
Here is the stored procedure query, hope it can help you:
-- Update existing rows in the target from the staged Synapse data
UPDATE t
SET t.bcfsa_teamid = syn.bcfsa_teamid
   ,t.bcfsa_account = syn.bcfsa_account
   ,t.bcfsa_name = syn.bcfsa_name
   ,t.bcfsa_contactteamownerid = syn.bcfsa_contactteamownerid
FROM website.bcfsa_team t
INNER JOIN syn.bcfsa_team syn ON t.id = syn.id;

-- Insert rows that exist in the staged data but not yet in the target
INSERT INTO website.bcfsa_team (id, bcfsa_teamid, bcfsa_account, bcfsa_name, bcfsa_contactteamownerid)
SELECT id, bcfsa_teamid, bcfsa_account, bcfsa_name, bcfsa_contactteamownerid
FROM syn.bcfsa_team
EXCEPT
SELECT id, bcfsa_teamid, bcfsa_account, bcfsa_name, bcfsa_contactteamownerid
FROM website.bcfsa_team;

-- Delete target rows that no longer exist in the staged data
DELETE FROM website.bcfsa_team
WHERE id NOT IN (SELECT id FROM syn.bcfsa_team);

-- Clear the staging table for the next run
TRUNCATE TABLE syn.bcfsa_team;
