Azure Data Factory Pipeline - Store single-value source query output as a variable to then use in Copy Data activity

I am looking to implement an incremental table loading pipeline in ADF. I want to execute a query to get the latest timestamp from the table in an Azure SQL database. Then, store this value as a variable in ADF so I can then reference it in the "Source" query of a Copy Data activity.
The goal is to only request data from an API with a timestamp greater than the latest timestamp in the SQL table.
Is this functionality possible within ADF pipelines, or do I need to look to Azure Functions or Data Flows?

This is definitely possible with Data Factory. You could use the Lookup Activity or a Stored Procedure, but the team just released the new Script Activity:
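For example, the Script activity's definition could hold the max-timestamp query roughly like this (the table, column, activity, and linked service names are placeholders, so adjust them to your schema):
{
    "name": "Script1",
    "type": "Script",
    "linkedServiceName": {
        "referenceName": "AzureSqlDatabase1",
        "type": "LinkedServiceReference"
    },
    "typeProperties": {
        "scripts": [
            {
                "type": "Query",
                "text": "SELECT MAX(LastModified) AS MaxDate FROM dbo.MyTable"
            }
        ]
    }
}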
This will return results like so:
{
    "resultSetCount": 1,
    "recordsAffected": 0,
    "resultSets": [
        {
            "rowCount": 1,
            "rows": [
                {
                    "MaxDate": "2018-03-20"
                }
            ]
        }
    ],
    ...
}
Here is the expression to read this into a variable:
@activity('Script1').output.resultSets[0].rows[0].MaxDate
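From there, a rough sketch of wiring it up (the activity and variable names are placeholders): a Set Variable activity stores the value, and the Copy activity's source can then reference @variables('MaxDate') in its dynamic content, for example in a parameterized relative URL or source query for the API call.
{
    "name": "SetMaxDate",
    "type": "SetVariable",
    "dependsOn": [
        { "activity": "Script1", "dependencyConditions": [ "Succeeded" ] }
    ],
    "typeProperties": {
        "variableName": "MaxDate",
        "value": {
            "value": "@activity('Script1').output.resultSets[0].rows[0].MaxDate",
            "type": "Expression"
        }
    }
}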

Related

Azure Data Factory Lookup activity fails to execute CosmosDb query 'Cross partition query only supports 'VALUE ' for aggregates.'

I am trying to configure an Azure Data Factory Lookup activity to get the MAX datetime field value from a CosmosDb container. But unfortunately even the simplest query just doesn't work. The query is:
SELECT max(members.lastModifiedOn) as dt FROM members
In the CosmosDb control panel we see the results:
[
    {
        "dt": "2020-09-01T07:32:03.6733333"
    }
]
But in the Azure Data Factory preview we see nothing but an error:
One or more errors occurred.
Message: {"Errors":["Cross partition query only supports 'VALUE ' for aggregates."]}
The only trick I've found is to execute the query for the Lookup activity in this rather odd format:
SELECT VALUE r FROM (SELECT MAX(m.lastModifiedOn) as lastModifiedOn FROM m) as r
It seems the Azure Data Factory Lookup activity expects an array of objects as the result of the query execution.
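With that wrapper in place, the value can be read from the Lookup output in a later activity; a minimal sketch of the expression, assuming the activity is named Lookup1 and "First row only" is enabled:
@activity('Lookup1').output.firstRow.lastModifiedOn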

How to set up output path while copying data from Azure Cosmos DB to ADLS Gen 2 via Azure Data Factory

I have a cosmos DB collection in the following format:
{
"deviceid": "xxx",
"partitionKey": "key1",
.....
"_ts": 1544583745
}
I'm using Azure Data Factory to copy data from Cosmos DB to ADLS Gen 2. If I copy using a copy activity, it is quite straightforward. However, my main concern is the output path in ADLS Gen 2. Our requirements state that we need to have the output path in a specific format. Here is a sample of the requirement:
outerfolder/version/code/deviceid/year/month/day
Now, since deviceid, year, month and day are all in the payload itself, I can't find a way to use them except to create a Lookup activity and use its output in the Copy activity.
And this is how I set the output folder using the dataset property:
I'm using SQL API on Cosmos DB to query the data.
Is there a better way I can achieve this?
I think that your way works, but it's not the cleanest. What I'd do is create a separate variable inside the pipeline for each one: version, code, deviceid, etc. Then, after the Lookup, you can assign the variables, and finally do the Copy activity referencing the pipeline variables.
It may look kind of redundant, but think of someone (or you, two years from now) having to modify the pipeline when you are not around (or have forgotten): this way it is clear how it works and what should be modified.
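As a rough illustration (the activity and variable names below are made up), each Set Variable activity pulls one value from the Lookup output, and the sink folder path is then built from the variables:
Value of the deviceid variable (Set Variable activity):
@activity('LookupMetadata').output.firstRow.deviceid
Folder path expression on the sink dataset:
@concat('outerfolder/', variables('version'), '/', variables('code'), '/', variables('deviceid'), '/', variables('year'), '/', variables('month'), '/', variables('day'))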
Hope this helped!!

Azure Data factory copy activity failed mapping strings (from csv) to Azure SQL table sink uniqueidentifier field

I have an Azure Data Factory (DF) pipeline that consists of a Copy activity. The Copy activity uses the HTTP connector as source to invoke a REST end-point and returns a CSV stream that sinks to an Azure SQL Database table.
The Copy fails when the CSV contains strings (such as 40f52caf-e616-4321-8ea3-12ea3cbc54e9) which are mapped to a uniqueidentifier field in the target table, with the error message The given value of type String from the data source cannot be converted to type uniqueidentifier of the specified target column.
I have tried wrapping the source string with {}, such as {40f52caf-e616-4321-8ea3-12ea3cbc54e9}, with no success.
The Copy activity works if I change the target table field from uniqueidentifier to nvarchar(100).
I reproduced your issue on my side.
The reason is that the data types of the source and sink don't match. You can check the data type mapping for SQL Server.
Your source data type is string, which maps to nvarchar or varchar, while a uniqueidentifier column in SQL Database requires the GUID type in Azure Data Factory.
So, as a workaround, please configure a SQL Server stored procedure in your SQL sink.
Please follow the steps from this doc:
Step 1: Configure your Sink dataset:
Step 2: Configure the Sink section in the copy activity as follows:
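Since the screenshot is not reproduced here, a rough JSON sketch of what that sink section might contain, assuming a SqlSink type and the procedure/table type names used in the steps below:
"sink": {
    "type": "SqlSink",
    "sqlWriterStoredProcedureName": "convertCsv",
    "sqlWriterTableType": "CsvType"
}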
Step 3: In your database, define the table type with the same name as sqlWriterTableType. Notice that the schema of the table type should be the same as the schema returned by your input data.
CREATE TYPE [dbo].[CsvType] AS TABLE(
    [ID] [varchar](256) NOT NULL
)
Step 4: In your database, define the stored procedure with the same name as SqlWriterStoredProcedureName. It handles input data from your specified source and merges it into the output table. Notice that the parameter name of the stored procedure should be the same as the "tableName" defined in the dataset.
CREATE PROCEDURE convertCsv @ctest [dbo].[CsvType] READONLY
AS
BEGIN
    MERGE [dbo].[adf] AS target
    USING @ctest AS source
    ON (1 = 1)
    WHEN NOT MATCHED THEN
        INSERT (id)
        VALUES (convert(uniqueidentifier, source.ID));
END
Output:
Hope it helps you. If you have any concerns, please feel free to let me know.
There is a way to fix the GUID conversion into the uniqueidentifier SQL column type properly via JSON configuration.
Edit the Copy activity via the Code {} button in the top-right toolbar.
Put:
"translator": {
"type": "TabularTranslator",
"typeConversion": true
}
into the typeProperties block of the Copy activity. This also works if the mapping schema is unspecified / dynamic.
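For context, a rough sketch of where that lands inside the Copy activity definition (the source and sink types below are placeholders for whatever your copy actually uses):
"typeProperties": {
    "source": { "type": "DelimitedTextSource" },
    "sink": { "type": "AzureSqlSink" },
    "translator": {
        "type": "TabularTranslator",
        "typeConversion": true
    }
}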

Is it possible to use U-SQL managed tables as output datasets in Azure Data Factory?

I have a small ADF pipeline that copies a series of files from an Azure Storage Account to an Azure Data Lake account. As a final activity in the pipeline I want to run a U-SQL script that uses the copied files as inputs and outputs the result to a U-SQL managed table.
The U-SQL script basically extracts the data from the copied files, applies some transformation and then INSERTs it into an existing U-SQL managed table.
How (if possible) can I add the U-SQL table as an output dataset in Azure Data Factory?
You cannot currently add a U-SQL internal table as an output dataset in Azure Data Factory (ADF). A similar question came up recently here and the answer from Michael Rys (the "father" of U-SQL) was "I know that the ADF team has a work item to do this for you."
You could, however, use Azure Data Factory to run a parameterised U-SQL script, where the input parameter is the file path. This would have a similar result.
Example pipeline from a recent question:
{
    "name": "ComputeEventsByRegionPipeline",
    "properties": {
        "description": "This is a pipeline to compute events for en-gb locale and date less than 2012/02/19.",
        "activities": [
            {
                "type": "DataLakeAnalyticsU-SQL",
                "typeProperties": {
                    "scriptPath": "adlascripts\\SearchLogProcessing.txt",
                    "scriptLinkedService": "StorageLinkedService",
                    "degreeOfParallelism": 3,
                    "priority": 100,
                    "parameters": {
                        "in": "/input/SearchLog.tsv",
                        "out": "/output/Result.tsv"
                    }
                },
                ...
Basically the U-SQL script goes from:
@searchlog =
    EXTRACT ...
    FROM @in
    USING Extractors.Tsv();
to:
@searchlog =
    EXTRACT ...
    FROM "/input/SearchLog.tsv"
    USING Extractors.Tsv();
which I think achieves the same thing you want.

Populate Azure Data Factory dataset from query

Cannot find an answer via Google, MSDN (and other Microsoft) documentation, or SO.
In Azure Data Factory you can get data from a dataset by using a copy activity in a pipeline. The pipeline definition includes a query. All the queries I have seen in the documentation are simple, single-table queries with no joins. In this case, a dataset is defined as a table in the database with "TableName" = "mytable". Additionally, one could retrieve data from a stored procedure, presumably allowing more complex SQL.
Is there a way to define a more complex query in a pipeline, including joins and/or transformation logic that alters the data, using a query rather than a stored procedure? I know that you can specify fields in a dataset, but I don't know how to get around the "tableName" property.
If there is a way, what would that method be?
The input is an on-premises SQL Server; the output is an Azure SQL database.
UPDATED for clarity.
Yes, the sqlReaderQuery can be much more complex than what is provided in the examples, and it doesn't have to only use the Table Name in the Dataset.
In one of my pipelines, I have a Dataset with the TableName "dbo.tbl_Build", but my sqlReaderQuery looks at several tables in that database. Here's a heavily truncated example:
with BuildErrorNodes as (select infoNode.BuildId, ...) as MessageValue from dbo.tbl_BuildInformation2 as infoNode inner join dbo.tbl_BuildInformationType as infoType on (infoNode.PartitionId = infoType), BuildInfo as ...
It's a bit confusing to list a single table name in the Dataset, then use multiple tables in the query, but it works just fine.
There's a way to move data from on-premises SQL to Azure SQL using Data Factory.
You can use a Copy Activity; check this code sample for your case specifically: GitHub link to the ADF Activity source.
Basically you need to create a Copy Activity whose typeProperties contain SqlSource and SqlSink sections that look like this:
"typeProperties": {
"source": {
"type": "SqlSource",
"SqlReaderQuery": "select * from [Source]"
},
"sink": {
"type": "SqlSink",
"WriteBatchSize": 1000000,
"WriteBatchTimeout": "00:05:00"
}
},
Also worth mentioning: you can use not only SELECTs from tables or views; table-valued functions will work as well.
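For instance, a sketch of a source that reads from a table-valued function rather than a table (the function name here is invented):
"source": {
    "type": "SqlSource",
    "SqlReaderQuery": "SELECT * FROM dbo.ufn_GetRecentBuilds(30)"
}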
