How to set up the output path while copying data from Azure Cosmos DB to ADLS Gen 2 via Azure Data Factory

I have a Cosmos DB collection with documents in the following format:
{
"deviceid": "xxx",
"partitionKey": "key1",
.....
"_ts": 1544583745
}
I'm using Azure Data Factory to copy data from Cosmos DB to ADLS Gen 2. If I copy using a copy activity, it is quite straightforward. However, my main concern is the output path in ADLS Gen 2. Our requirements state that we need to have the output path in a specific format. Here is a sample of the requirement:
outerfolder/version/code/deviceid/year/month/day
Now, since deviceid, year, month, and day are all in the payload itself, I can't find a way to use them other than creating a Lookup activity and using its output in the Copy activity.
And this is how I set the output folder using the dataset property:
I'm using SQL API on Cosmos DB to query the data.
Is there a better way I can achieve this?

I think your way works, but it's not the cleanest. What I'd do is create a separate pipeline variable for each part of the path: version, code, deviceid, etc. Then, after the Lookup, assign the variables, and finally run the Copy activity referencing those pipeline variables.
It may look a bit redundant, but think of someone (or you, two years from now) having to modify the pipeline: if you are not around (or have forgotten the details), this approach makes it clear how the pipeline works and what needs to be changed.
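For the path itself, here is a rough sketch of how the sink dataset could be parameterized. Everything here is a placeholder (the parameter names, the 'outerfolder' file system, the literal 'version/code' segment, the 'LookupDevice' activity name), so adapt it to your pipeline:
"parameters": {
    "deviceid": { "type": "String" },
    "year": { "type": "String" },
    "month": { "type": "String" },
    "day": { "type": "String" }
},
"typeProperties": {
    "location": {
        "type": "AzureBlobFSLocation",
        "fileSystem": "outerfolder",
        "folderPath": {
            "value": "@concat('version/code/', dataset().deviceid, '/', dataset().year, '/', dataset().month, '/', dataset().day)",
            "type": "Expression"
        }
    }
}
In the Copy activity you would then pass values into those dataset parameters, either from the pipeline variables or straight from the Lookup output, e.g. @activity('LookupDevice').output.firstRow.deviceid.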
Hope this helped!!

Related

Azure Data Factory Pipeline - Store single-value source query output as a variable to then use in Copy Data activity

I am looking to implement an incremental table loading pipeline in ADF. I want to execute a query to get the latest timestamp from a table in an Azure SQL database, then store this value as a variable in ADF so I can reference it in the "Source" query of a Copy Data activity.
The goal is to only request data from an API with a timestamp greater than the latest timestamp in the SQL table.
Is this functionality possible within ADF pipelines, or do I need to look at Azure Functions or Data Flows?
This is definitely possible with Data Factory. You could use the Lookup Activity or a Stored Procedure, but the team just released the new Script Activity:
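I don't have your exact query, but the Script activity definition could look roughly like this; the activity name, linked service name, and the MAX(...) query are placeholders for whatever returns your latest timestamp:
{
    "name": "Script1",
    "type": "Script",
    "linkedServiceName": {
        "referenceName": "AzureSqlDatabaseLS",
        "type": "LinkedServiceReference"
    },
    "typeProperties": {
        "scripts": [
            {
                "type": "Query",
                "text": "SELECT MAX(ModifiedDate) AS MaxDate FROM dbo.MyTable"
            }
        ]
    }
}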
This will return results like so:
{
"resultSetCount": 1,
"recordsAffected": 0,
"resultSets": [
{
"rowCount": 1,
"rows": [
{
"MaxDate": "2018-03-20"
}
]
...
}
Here is the expression to read this into a variable:
@activity('Script1').output.resultSets[0].rows[0].MaxDate
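If you want to hold that in a pipeline variable, here is a minimal sketch of a Set Variable activity, assuming a string variable named MaxDate is already defined on the pipeline:
{
    "name": "Set MaxDate",
    "type": "SetVariable",
    "dependsOn": [
        {
            "activity": "Script1",
            "dependencyConditions": [ "Succeeded" ]
        }
    ],
    "typeProperties": {
        "variableName": "MaxDate",
        "value": {
            "value": "@activity('Script1').output.resultSets[0].rows[0].MaxDate",
            "type": "Expression"
        }
    }
}
The Copy activity source query can then reference it with something along the lines of @concat('select * from MyTable where ModifiedDate > ''', variables('MaxDate'), ''''), where MyTable and ModifiedDate are again placeholders.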

Azure Stream Analytics: Regex in Reference Data

I have an Azure Stream Analytics job that uses an Event Hub and reference data in Blob storage as its 2 inputs. The reference data is a CSV that looks something like this:
REGEX_PATTERN,FRIENDLY_NAME
115[1-2]{1}9,Name 1
115[3-9]{1}9,Name 2
I then need to look up an attribute of the incoming Event Hub event against this CSV to get the FRIENDLY_NAME.
The typical way of using reference data is with a JOIN clause, but in this case I cannot use it, because such regex matching is not supported by the LIKE operator.
A UDF is another option, but I cannot seem to find a way to use the reference data CSV inside the function.
Is there any other way of doing this in an Azure Stream Analytics job?
As far as I know, JOIN is not supported in your scenario: the join key has to be a specific value, not a regex.
Thus, a reference data input is not suitable here, because it is meant to be used in ASA SQL like below:
SELECT I1.EntryTime, I1.LicensePlate, I1.TollId, R.RegistrationId
FROM Input1 I1 TIMESTAMP BY EntryTime
JOIN Registration R
ON I1.LicensePlate = R.LicensePlate
WHERE R.Expired = '1'
A join key is required, which is why the reference data input won't help you here.
Your idea is to use a UDF script and load the data inside the UDF to compare against hardcoded regex patterns. That is not easy to maintain. Maybe you could consider my workaround:
1. You said you have different sets of reference data; group them, store each group as a JSON array, and assign a group id to every group. For example:
Group Id 1:
[
{
"REGEX":"115[1-2]{1}9",
"FRIENDLY_NAME":"Name 1"
},
{
"REGEX":"115[3-9]{1}9",
"FRIENDLY_NAME":"Name 2"
}
]
....
2. Add a column carrying the group id in your ASA SQL and set an Azure Function as the output of the job. Inside the Azure Function, read the group id column, load the corresponding group's JSON array, then loop over its rows to match the regex and save the result to the destination.
I think an Azure Function is more flexible than a UDF in an ASA SQL job. Additionally, this solution may be easier to maintain.
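To make step 2 a bit more concrete, the Azure Function output would receive your events as a JSON batch; a hypothetical payload (the field names here are placeholders, and groupid is the added column) might look like this:
[
    {
        "devicecode": "11519",
        "groupid": 1,
        "eventtime": "2019-01-01T10:00:00Z"
    },
    {
        "devicecode": "11579",
        "groupid": 1,
        "eventtime": "2019-01-01T10:00:05Z"
    }
]
The Function would load the regex group for groupid 1, test devicecode against each REGEX, and write the matching FRIENDLY_NAME ("Name 1" for 11519, "Name 2" for 11579) to the destination.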

Unable to get scalar value of a query on Cosmos DB in Azure Data Factory

I am trying to get the count of all records present in Cosmos DB using a Lookup activity in Azure Data Factory. I need this value to compare against other activity outputs.
The query I used is SELECT VALUE count(1) from c
When I try to preview the data after entering this query, I get an error saying:
One or more errors occurred. Unable to cast object of type
'Newtonsoft.Json.Linq.JValue' to type 'Newtonsoft.Json.Linq.JObject'
as shown in the image below (snapshot of my Azure Lookup activity settings).
Could someone help me resolve this error? And if this is a limitation of Azure Data Factory, how can I get the count of all rows in the Cosmos DB collection some other way inside Azure Data Factory?
I reproduced your issue on my side exactly.
I think the count result can't be mapped as a normal JSON object. As a workaround, you could use an Azure Function activity (inside the Azure Function you can use the SDK to execute any SQL you want) to output your desired result, e.g. {"number":10}, and then chain the Azure Function activity to the other activities in ADF.
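For reference, here is a rough sketch of how such an Azure Function activity could be wired into the pipeline; the function name and linked service name are placeholders, and the function itself is the one you would write to run the count query with the SDK:
{
    "name": "GetCosmosCount",
    "type": "AzureFunctionActivity",
    "linkedServiceName": {
        "referenceName": "AzureFunctionLS",
        "type": "LinkedServiceReference"
    },
    "typeProperties": {
        "functionName": "CountDocuments",
        "method": "POST",
        "body": { "collection": "c" }
    }
}
Assuming the function returns {"number":10} as its JSON body, downstream activities can read the value with an expression along the lines of @activity('GetCosmosCount').output.number.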
Here is the contradiction right now:
The SQL query outputs a scalar array, not something like a JSON object or even a JSON string.
However, the ADF Lookup activity only accepts a JObject, not a JValue. I can't use any built-in conversion function here, because the SQL query still has to be produced with correct syntax. I already submitted a ticket to the MS support team, but had no luck with this limitation.
I also tried select count(1) as num from c, which works in the Cosmos DB portal, but it still has a limitation because the SQL crosses partitions.
So all I can do here is try to explain the root cause of the issue; I can't change the product behaviour.
Two rough ideas:
1. Try a non-partitioned collection to execute the above SQL and produce JSON output.
2. If the count is not large, try to query the columns from the DB and loop over the result with a ForEach activity.
You can use:
select top 1 column from c order by column desc

Populate Azure Data Factory dataset from query

I cannot find an answer via Google, MSDN (or other Microsoft documentation), or SO.
In Azure Data Factory you can get data from a dataset by using a Copy activity in a pipeline. The pipeline definition includes a query. All the queries I have seen in the documentation are simple, single-table queries with no joins. In this case, a dataset is defined as a table in the database with "TableName" = "mytable". Additionally, one could retrieve data from a stored procedure, presumably allowing more complex SQL.
Is there a way to define a more complex query in a pipeline, one that includes joins and/or transformation logic that alters the data, using a query rather than a stored procedure? I know that you can specify fields in a dataset, but I don't know how to get around the "TableName" property.
If there is a way, what would that method be?
The input is on-premises SQL Server; the output is an Azure SQL database.
UPDATED for clarity.
Yes, the sqlReaderQuery can be much more complex than what is provided in the examples, and it doesn't have to only use the Table Name in the Dataset.
In one of my pipelines, I have a Dataset with the TableName "dbo.tbl_Build", but my sqlReaderQuery looks at several tables in that database. Here's a heavily truncated example:
with BuildErrorNodes as (select infoNode.BuildId, ...) as MessageValue from dbo.tbl_BuildInformation2 as infoNode inner join dbo.tbl_BuildInformationType as infoType on (infoNode.PartitionId = infoType), BuildInfo as ...
It's a bit confusing to list a single table name in the Dataset, then use multiple tables in the query, but it works just fine.
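As a generic illustration (the tables and columns here are made up, not from my pipeline), the source side of a Copy activity can hold any valid query:
"source": {
    "type": "SqlSource",
    "sqlReaderQuery": "select o.OrderId, o.OrderDate, c.CustomerName from dbo.Orders o inner join dbo.Customers c on o.CustomerId = c.CustomerId where o.OrderDate >= '2017-01-01'"
}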
There's a way to move data from on-premises SQL to Azure SQL using Data Factory.
You can use a Copy Activity; check this code sample for your case specifically: GitHub link to the ADF Activity source.
Basically you need to create a Copy Activity whose typeProperties contain a SqlSource and a SqlSink, which look like this:
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select * from [Source]"
},
"sink": {
"type": "SqlSink",
"writeBatchSize": 1000000,
"writeBatchTimeout": "00:05:00"
}
},
Also worth mentioning: you can use not only selects from tables or views, but table-valued functions will work as well.

In Azure Data Factory, how to copy data from Blob to SQL without duplication?

In Azure Data Factory, how do I copy data from Blob storage to SQL without duplication? That is, if the pipeline runs on a slice every 15 minutes, how do I avoid getting duplicate data?
The solution isn't automatic, but you can use a Copy Activity and a stored procedure in the SQL sink to handle rows that may already exist, perhaps with a T-SQL MERGE statement or an INSERT/UPDATE statement inside.
https://azure.microsoft.com/en-us/documentation/articles/data-factory-copy-activity/
Invoke stored procedure for SQL sink: when copying data into SQL Server or Azure SQL Database, a user-specified stored procedure can be configured and invoked.
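A rough sketch of the sink side of such a Copy activity; the stored procedure name and table type are placeholders, and the procedure itself is where you would put the MERGE / insert-or-update logic:
"sink": {
    "type": "SqlSink",
    "sqlWriterStoredProcedureName": "spUpsertMyTable",
    "sqlWriterTableType": "MyTableType"
}
The copied rows arrive in the procedure as a table-valued parameter of the sqlWriterTableType, and the procedure decides whether to insert or update each row.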
Thanks, Jason
I had the same issue, and I found that you can pass the slice start time and slice end time to your stored procedure and filter the queries using them like any other parameter. That will help you load the data by slices instead of loading the same data once per slice. Hope it's clear enough.
"typeProperties": {
"storedProcedureName": "sp_sample",
"storedProcedureParameters": {
"DateTime": "$$Text.Format('{0:yyyy-MM-dd HH:mm:ss}', SliceStart)"
}
}
https://learn.microsoft.com/en-us/azure/data-factory/data-factory-stored-proc-activity
I had the same problem and found this link to be helpful:
https://www.mssqltips.com/sqlservertip/6365/incremental-file-load-using-azure-data-factory/
In our case, we only add files to blob storage and never modify them after that, so the job is simply to pick up new files created within the last 15 minutes and add them to the SQL container. The Incremental Copy procedure described in the link seems to work great so far.
I can imagine that in some cases you may need to add a stored procedure to act on the SQL container after this, but we did not need it.
