In azure datafactory how to copy data from blob to sql without duplication? - azure

In azure datafactory how to copy data from blob to sql without duplication ie if the pipeline runs at the slice of every 15 min then how to avoid getting duplicate data

The solution isn't automatic, but you can use a Copy Activity, and use a stored procedure in the SQL sink to handle rows that may already exist. Perhaps TSQL Merge statement, or an Insert / Update statement inside.
https://azure.microsoft.com/en-us/documentation/articles/data-factory-copy-activity/
Invoke stored procedure for SQL Sink. When copying data into SQL Server or Azure SQL Database, a user specified stored procedure could be configured and invoked.
Thanks, Jason

I had the same issue and I found that you can add the slice start time and slice end time to your stored procedure and filter the queries using them as any other parameter, that will help you to load the data by slices and not the same data the number of slices you have, hope it's clear enough.
"typeProperties": {
"storedProcedureName": "sp_sample",
"storedProcedureParameters": {
"DateTime": "$$Text.Format('{0:yyyy-MM-dd HH:mm:ss}', SliceStart)"
}
}
https://learn.microsoft.com/en-us/azure/data-factory/data-factory-stored-proc-activity

I had the same problem and found this link to be helpful:
https://www.mssqltips.com/sqlservertip/6365/incremental-file-load-using-azure-data-factory/
In our case, we only add files to blob storage and never modify them after that so the job is to simply pick up new files created within the latest 15 minutes and add them to the SQL container. The Incremental Copy procedure described in the link seems to work great so far.
I can imagine that in some cases you may need to add a stored procedure to act on the SQL container after this, but we did not need it.

Related

Find All the tables related to a stored procedure in Azure Synapse Datawarehouse

Is there a simple way to find out all the tables referenced in a stored procedure in azure analytics data warehouse other than parsing the stored procedure code? I tried few commands like sp_tables, sp_depends but none seems to be working in azure data warehouse.
sys.sql_expression_dependencies is supported in Azure Synapse Analytics, dedicated SQL pools, but only supports tables, views and functions at this time. A simple example:
SELECT * FROM sys.sql_expression_dependencies;
So you are left either parsing sys.sql_modules. Something like this is imperfect (ie doesn't deal with schema name, square brackets, partial matches etc) but could server as a starting point:
SELECT
sm.[definition],
OBJECT_SCHEMA_NAME(t.object_id) schemaName,
OBJECT_NAME(t.object_id) tableName
FROM sys.sql_modules sm
CROSS JOIN sys.tables t
WHERE sm.object_id = OBJECT_ID('dbo.usp_test')
AND sm.[definition] Like '%' + t.name + '%';
I actually use SQL Server Data Tools (SSDT) with dedicated SQL pools so your dependencies can't get out of step and are trackable via the project.

Unable to get scalar value of a query on cosmos db in azure data factory

I am trying to get the count of all records present in cosmos db in a lookup activity of azure data factory. I need this value to do a comparison with other value activity outputs.
The query I used is SELECT VALUE count(1) from c
When I try to preview the data after inserting this query I get an error saying
One or more errors occurred. Unable to cast object of type
'Newtonsoft.Json.Linq.JValue' to type 'Newtonsoft.Json.Linq.JObject'
as shown in the below image:
snapshot of my azure lookup activity settings
Could someone help me in resolving this error and if this is the limitation of azure data factory how can I get the count of all the rows of the cosmos db document using some other ways inside azure data factory?
I reproduce your issue on my side exactly.
I think the count result can't be mapped as normal JsonObject. As workaround,i think you could just use Azure Function Activity(Inside Azure Function method ,you could use SDK to execute any sql as you want) to output your desired result: {"number":10}.Then bind the Azure Function Activity with other activities in ADF.
Here is contradiction right now:
The query sql outputs a scalar array,not other things like jsonObject,or even jsonstring.
However, ADF Look Up Activity only accepts JObject,not JValue. I can't use any convert built-in function here because the query sql need to be produced with correct syntax anyway. I already submitted a ticket to MS support team,but get no luck with this limitation.
I also tried select count(1) as num from c which works in the cosmos db portal. But it still has limitation because the sql crosses partitions.
So,all i can do here is trying to explain the root cause of issue,but can't change the product behaviours.
2 rough ideas:
1.Try no-partitioned collection to execute above sql to produce json output.
2.If the count is not large,try to query columns from db and loop the result with ForEach Activity.
You can use:
select top 1 column from c order by column desc

ADF copy data activity - check for duplicate records before inserting into SQL db

I have a very simple ADF pipeline to copy data from local mongoDB (self-hosted integration environment) to Azure SQL database.
My pipleline is able to copy the data from mongoDB and insert into SQL db.
Currently if I run the pipeline it inserts duplicate data if run multiple times.
I have made _id column as unique in SQL database and now running pipeline throws and error because of SQL constraint wont letting it insert the record.
How do I check for duplicate _id before inserting into SQL db?
should I use Pre-copy script / stored procedure?
Some guidance / directions would be helpful on where to add extra steps. Thanks
Azure Data Factory Data Flow can help you achieve that:
You can follow these steps:
Add two sources: Cosmos db table(source1) and SQL database table(source2).
Using Join active to get all the data from two tables(left join/full join/right join) on Cosmos table.id= SQL table.id.
AlterRow expression to filter the duplicate _id, it not duplicate then insert it.
Then mapping the no-duplicate column to the Sink SQL database table.
Hope this helps.
You Should implement your SQL Logic to eliminate duplicate at the Pre-Copy Script
Currently I got the solution using a Stored Procedure which look like a lot less work as far this requirement is concerned.
I have followed this article:
https://www.cathrinewilhelmsen.net/2019/12/16/copy-sql-server-data-azure-data-factory/
I created table type and used in stored procedure to check for duplicate.
my sproc is very simple as shown below:
SET QUOTED_IDENTIFIER ON
GO
ALTER PROCEDURE [dbo].[spInsertIntoDb]
(#sresults dbo.targetSensingResults READONLY)
AS
BEGIN
MERGE dbo.sensingresults AS target
USING #sresults AS source
ON (target._id = source._id)
WHEN NOT MATCHED THEN
INSERT (_id, sensorNumber, applicationType, place, spaceType, floorCode, zoneCountNumber, presenceStatus, sensingTime, createdAt, updatedAt, _v)
VALUES (source._id, source.sensorNumber, source.applicationType, source.place, source.spaceType, source.floorCode,
source.zoneCountNumber, source.presenceStatus, source.sensingTime, source.createdAt, source.updatedAt, source.updatedAt);
END
I think using stored proc should do for and also will help in future if I need to do more transformation.
Please let me know if using sproc in this case has potential risk in future ?
To remove the duplicates you can use the pre-copy script. OR what you can do is you can store the incremental or new data into a temp table using copy activity and use a store procedure to delete only those Ids from the main table which are in temp table after deletion insert the temp table data into the main table. and then drop the temp table.

How to set up output path while copying data from Azure Cosmos DB to ADLS Gen 2 via Azure Data Factory

I have a cosmos DB collection in the following format:
{
"deviceid": "xxx",
"partitionKey": "key1",
.....
"_ts": 1544583745
}
I'm using Azure Data Factory to copy data from Cosmos DB to ADLS Gen 2. If I copy using a copy activity, it is quite straightforward. However, my main concern is the output path in ADLS Gen 2. Our requirements state that we need to have the output path in a specific format. Here is a sample of the requirement:
outerfolder/version/code/deviceid/year/month/day
Now since deviceid, year, month, day are all in the payload itself I can't find a way to use them except create a lookup activity and use the output of the lookup activity in the copy activity.
And this is how I set the ouput folder using the dataset property:
I'm using SQL API on Cosmos DB to query the data.
Is there a better way I can achieve this?
I think that your way works, but its not the cleanest. What I'd do is create a different variable inside the pipeline for each one: version, code, deviceid, etc. Then, after the lookup you can assign the variables, and finally do the copy activity referencing the pipeline variables.
It may look kind of redundant, but think of someone (or you 2 years from now) having to modify the pipeline and if you are not around (or have forgotten), this way makes it clear how it works, and what you should modify.
Hope this helped!!

Populate Azure Data Factory dataset from query

Cannot find an answer via google, msdn (and other microsoft) documentation, or SO.
In Azure Data Factory you can get data from a dataset by using copy activity in a pipeline. The pipeline definition includes a query. All the queries I have seen in documentation are simple, single table queries with no joins. In this case, a dataset is defined as a table in the database with "TableName"= "mytable". Additionally, one could retrieve data from a stored procedure, presumably allowing more complex sql.
Is there a way to define a more complex query in a pipeline that includes joins and/or transformation logic that alters the data from or pipeline from a query rather than stored procedure. I know that you can specify fields in a dataset, but don't know how to get around the "tablename" property.
If there is a way, what would that method be?
input is on-premises sql server. output is azure sql database.
UPDATED for clarity.
Yes, the sqlReaderQuery can be much more complex than what is provided in the examples, and it doesn't have to only use the Table Name in the Dataset.
In one of my pipelines, I have a Dataset with the TableName "dbo.tbl_Build", but my sqlReaderQuery looks at several tables in that database. Here's a heavily truncated example:
with BuildErrorNodes as (select infoNode.BuildId, ...) as MessageValue from dbo.tbl_BuildInformation2 as infoNode inner join dbo.tbl_BuildInformationType as infoType on (infoNode.PartitionId = infoType), BuildInfo as ...
It's a bit confusing to list a single table name in the Dataset, then use multiple tables in the query, but it works just fine.
There's a way to move data from on-premise SQL to Azure SQL using Data Factory.
You can use Copy Activity, check this code sample for your case specifically GitHub link to the ADF Activity source.
Basically you need create Copy Activity which will have TypeProperties with SqlSource and SqlSink sets look like this:
<!-- language: lang-json -->
"typeProperties": {
"source": {
"type": "SqlSource",
"SqlReaderQuery": "select * from [Source]"
},
"sink": {
"type": "SqlSink",
"WriteBatchSize": 1000000,
"WriteBatchTimeout": "00:05:00"
}
},
Also do mention - you can use not only selects from tables or views, but also [Table-Valued-Functions] will work as well.

Resources