Populate Azure Data Factory dataset from query - azure

Cannot find an answer via google, msdn (and other microsoft) documentation, or SO.
In Azure Data Factory you can get data from a dataset by using copy activity in a pipeline. The pipeline definition includes a query. All the queries I have seen in documentation are simple, single table queries with no joins. In this case, a dataset is defined as a table in the database with "TableName"= "mytable". Additionally, one could retrieve data from a stored procedure, presumably allowing more complex sql.
Is there a way to define a more complex query in a pipeline that includes joins and/or transformation logic that alters the data from or pipeline from a query rather than stored procedure. I know that you can specify fields in a dataset, but don't know how to get around the "tablename" property.
If there is a way, what would that method be?
input is on-premises sql server. output is azure sql database.
UPDATED for clarity.

Yes, the sqlReaderQuery can be much more complex than what is provided in the examples, and it doesn't have to only use the Table Name in the Dataset.
In one of my pipelines, I have a Dataset with the TableName "dbo.tbl_Build", but my sqlReaderQuery looks at several tables in that database. Here's a heavily truncated example:
with BuildErrorNodes as (select infoNode.BuildId, ...) as MessageValue from dbo.tbl_BuildInformation2 as infoNode inner join dbo.tbl_BuildInformationType as infoType on (infoNode.PartitionId = infoType), BuildInfo as ...
It's a bit confusing to list a single table name in the Dataset, then use multiple tables in the query, but it works just fine.

There's a way to move data from on-premise SQL to Azure SQL using Data Factory.
You can use Copy Activity, check this code sample for your case specifically GitHub link to the ADF Activity source.
Basically you need create Copy Activity which will have TypeProperties with SqlSource and SqlSink sets look like this:
<!-- language: lang-json -->
"typeProperties": {
"source": {
"type": "SqlSource",
"SqlReaderQuery": "select * from [Source]"
},
"sink": {
"type": "SqlSink",
"WriteBatchSize": 1000000,
"WriteBatchTimeout": "00:05:00"
}
},
Also do mention - you can use not only selects from tables or views, but also [Table-Valued-Functions] will work as well.

Related

Azure Data Factory Pipeline - Store single-value source query output as a variable to then use in Copy Data activity

I am looking to implement an incremental table loading pipeline in ADF. I want to execute a query to get the latest timestamp from the table in an Azure SQL database. Then, store this value as a variable in ADF so I can then reference it in the "Source" query of a Copy Data activity.
The goal is to only request data from an API with a timestamp greater than the latest timestamp in the SQL table.
Is this functionality possible within ADF pipelines? or do I need to look to Azure functions or Data Flows?
This is definitely possible with Data Factory. You could use the Lookup Activity or a Stored Procedure, but the team just released the new Script Activity:
This will return results like so:
{
"resultSetCount": 1,
"recordsAffected": 0,
"resultSets": [
{
"rowCount": 1,
"rows": [
{
"MaxDate": "2018-03-20"
}
]
...
}
Here is the expression to read this into a variable:
#activity('Script1').output.resultSets[0].rows[0].MaxDate

Dynamic REST calls in Azure Synapse Pipeline

I am making a call to a REST API with Azure Synapse and the return dataset looks something like this:
{
"links": [
{
"rel": "next",
"href": "[myRESTendpoint]?limit=1000&offset=1000"
},
{
"rel": "last",
"href": "[myRESTendpoint]?limit=1000&offset=60000"
},
{
"rel": "self",
"href": "[myRESTendpoint]"
}
],
"count": 1000,
"hasMore": true,
"items": [
{
"links": [],
"closedate": "6/16/2014",
"id": "16917",
"number": "62000",
"status": "H",
"tranid": "0062000"
},...
],
"offset": 0,
"totalResults": 60316
}
I am familiar with making a REST call to a single endpoint that can return all the the data with a single call using a Synapse pipeline, but this particular REST endpoint has a hard limit on only returning 1000 records, but it does give a property named "hasMore".
Is there a way to recursively make rest calls in a Synapse pipeline until the "hasMore" property equals false?
The end goal of this is to sink data to either a dedicated SQL pool or into ADLS2 and transform from there.
I have tried to achieve the same scenario using Azure Data Factory which seems to be more appropriate and easy to achieve the goal "The end goal of this is to sink data to either a dedicated SQL pool or into ADLS2 and transform from there".
As you have to hit the page recursively to fetch 1000 records , you might set it in the following fashion if the response header/response body contain the URL for the next page.
You're less likely to be able to use the functionality if the next page link or query parameter isn't included in the response headers/body.
Alternatively, you may utilise loop logic and do the Copy Activity.
Create two parameters in the Rest Connector:
Fill in the parameters for the RestConnector's relative URL.
Using the Set Variable action, the value of this variable would be increased in a loop. For each cycle, the URL for the Copy Activity is dynamically set.If you want to loop or iterate, you may use the Until activity.
Alternative:
In my experience, the REST connection pagination is quite rigid. Usually put the action within a loop. As a result, to have more control.
FOREACH Loop, here
For those following the thread, I used IpsitaDash-MT's suggestion using the ForEach loop. In the case of this API, when a call is made I get a property returned at the end of the call named "totalResults". Here are the steps I used to achieve what I was looking to do:
Make a dummy call to the API to get the "totalResults" parameter. This is just a call to return the number of results I am looking to get. In the case of this API, the body of the request is a SQL statement, so when the dummy request is made I am only asking for the ID's of the results I am looking to get.
SQL statement example
I then take the property "totalResults" from that request set a dynamic value in the "Items" of the ForEach loop like this:
#range(0,add(div(sub(int(activity('Get Pages Customers').output.totalResults),mod(int(activity('Get Pages Customers').output.totalResults),1000)),1000),1))
NOTE: The API only allows pages of 1000 results, I do some math to get a range of page numbers. I also have to add 1 to the final result to include the last page.
ForEach Loop Settings
In the API I have two parameters that can be passed "limit" and "offset". Since I want all of the data there is no reason to have limit set to anything other than 1000 (the max allowable number). The offset parameter can be set to any number less than or equal to "totalResults" - "limit" and greater than or equal to 0. So I use the range established in step 2 and multiply it out by 1000 to set the offset parameter in the URL.
Setting the offset parameter in the copy data activity
Dynamic value of the Relative URL in the REST connector
NOTE: I found it better to sink the data as JSON into ADLS2 first rather than into a dedicated SQL pool due to the Lookup feature.
Since synapse does not allow nested ForEach loops, I run the data through a data flow to format the data and check for duplicates and updates.
When the data flow is completed it kicks off a lookup activity to get the data that was just processed and pass it into a new pipeline to use another ForEach loop to get the child data for each ID of parent data.
Data Flow and Lookup for child data pipeline

How to set up output path while copying data from Azure Cosmos DB to ADLS Gen 2 via Azure Data Factory

I have a cosmos DB collection in the following format:
{
"deviceid": "xxx",
"partitionKey": "key1",
.....
"_ts": 1544583745
}
I'm using Azure Data Factory to copy data from Cosmos DB to ADLS Gen 2. If I copy using a copy activity, it is quite straightforward. However, my main concern is the output path in ADLS Gen 2. Our requirements state that we need to have the output path in a specific format. Here is a sample of the requirement:
outerfolder/version/code/deviceid/year/month/day
Now since deviceid, year, month, day are all in the payload itself I can't find a way to use them except create a lookup activity and use the output of the lookup activity in the copy activity.
And this is how I set the ouput folder using the dataset property:
I'm using SQL API on Cosmos DB to query the data.
Is there a better way I can achieve this?
I think that your way works, but its not the cleanest. What I'd do is create a different variable inside the pipeline for each one: version, code, deviceid, etc. Then, after the lookup you can assign the variables, and finally do the copy activity referencing the pipeline variables.
It may look kind of redundant, but think of someone (or you 2 years from now) having to modify the pipeline and if you are not around (or have forgotten), this way makes it clear how it works, and what you should modify.
Hope this helped!!

Query from Azure Comos DB and save to Azure Table Storage using Data Factory

I want to save C._ts+C.ttl as one entity in my Azure Table Storage. I do the following query in my Copy Activity:
"typeProperties": {
"source": {
"type": "DocumentDbCollectionSource",
"query": {
"value": "#concat('SELECT (C.ts+C.ttl) FROM C WHERE (C.ttl+C._ts)<= ', string(pipeline().parameters.bufferdays))",
"type": "Expression"
},
"nestingSeparator": "."
},
I dont want to copy all the fields from my source i.e. CosmosDB to my sink i.e. Table Storage. I just want to store the result of this query as one value. How can I do that?
According to my test, I presume the null value you queried because of that the collection level ttl affects each document , but will not generate ttl property within document.
So when you execute SELECT c.ttl,c._ts FROM c , just get below result.
Document level ttl is not defined, just follow collection level ttl.
You need to bulk add ttl property into per document so that you could transfer _ts+ttl caculator results.
Your Copy Activity settings looks good , just add an alias in SQL, or set the name of the field via column mapping.
Hope it helps you.

In azure datafactory how to copy data from blob to sql without duplication?

In azure datafactory how to copy data from blob to sql without duplication ie if the pipeline runs at the slice of every 15 min then how to avoid getting duplicate data
The solution isn't automatic, but you can use a Copy Activity, and use a stored procedure in the SQL sink to handle rows that may already exist. Perhaps TSQL Merge statement, or an Insert / Update statement inside.
https://azure.microsoft.com/en-us/documentation/articles/data-factory-copy-activity/
Invoke stored procedure for SQL Sink. When copying data into SQL Server or Azure SQL Database, a user specified stored procedure could be configured and invoked.
Thanks, Jason
I had the same issue and I found that you can add the slice start time and slice end time to your stored procedure and filter the queries using them as any other parameter, that will help you to load the data by slices and not the same data the number of slices you have, hope it's clear enough.
"typeProperties": {
"storedProcedureName": "sp_sample",
"storedProcedureParameters": {
"DateTime": "$$Text.Format('{0:yyyy-MM-dd HH:mm:ss}', SliceStart)"
}
}
https://learn.microsoft.com/en-us/azure/data-factory/data-factory-stored-proc-activity
I had the same problem and found this link to be helpful:
https://www.mssqltips.com/sqlservertip/6365/incremental-file-load-using-azure-data-factory/
In our case, we only add files to blob storage and never modify them after that so the job is to simply pick up new files created within the latest 15 minutes and add them to the SQL container. The Incremental Copy procedure described in the link seems to work great so far.
I can imagine that in some cases you may need to add a stored procedure to act on the SQL container after this, but we did not need it.

Resources