I have an Azure Stream Analytics job that uses an Event Hub and reference data in Blob storage as its two inputs. The reference data is a CSV that looks something like this:
REGEX_PATTERN,FRIENDLY_NAME
115[1-2]{1}9,Name 1
115[3-9]{1}9,Name 2
I then need to look up an attribute of the incoming Event Hub event against this CSV to get the FRIENDLY_NAME.
The typical way of using reference data is with a JOIN clause, but in this case I cannot use it, because such regex matching is not supported by the LIKE operator.
A UDF is another option, but I cannot seem to find a way to use the CSV reference data inside the function.
Is there any other way of doing this in an Azure Stream Analytics job?
As far as I know, JOIN is not supported in your scenario: the join key has to be a specific value, it can't be a regex pattern.
Thus, reference data is not suitable here, because it has to be used in ASA SQL like below:
SELECT I1.EntryTime, I1.LicensePlate, I1.TollId, R.RegistrationId
FROM Input1 I1 TIMESTAMP BY EntryTime
JOIN Registration R
ON I1.LicensePlate = R.LicensePlate
WHERE R.Expired = '1'
A join key is required, which is why the reference data input does not really help you here.
Your idea is to use a UDF script and load the data inside the UDF to compare against hard-coded regex data, but that is not easy to maintain. Maybe you could consider my workaround:
1. You said you have different sets of reference data; group them, store each group as a JSON array, and assign a group id to every group. For example:
Group Id 1:
[
{
"REGEX":"115[1-2]{1}9",
"FRIENDLY_NAME":"Name 1"
},
{
"REGEX":"115[3-9]{1}9",
"FRIENDLY_NAME":"Name 2"
}
]
....
2. Add a column referring to the group id and set an Azure Function as the output of your ASA SQL. Inside the Azure Function, accept the group id column, load the corresponding group's JSON array, then loop through its rows to match the regex and save the data to the destination store.
I think an Azure Function is more flexible than a UDF in an ASA SQL job. In addition, this solution may be easier to maintain.
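To make that concrete, here is a minimal sketch of such a function, assuming an HTTP-triggered Python Azure Function, that your ASA query adds a GroupId column plus an Attribute column to match (both names are just illustrative), and that each group's JSON array sits in a file like group_1.json; adapt the loading and the destination write to wherever you actually keep the groups and the results.

import json
import re
import azure.functions as func

def main(req: func.HttpRequest) -> func.HttpResponse:
    # ASA's Azure Function output posts a JSON array of events.
    events = req.get_json()
    results = []
    for event in events:
        group_id = event["GroupId"]      # illustrative name of the group-id column
        value = event["Attribute"]       # illustrative name of the attribute to match
        # Hypothetical layout: one file per group (group_1.json, group_2.json, ...),
        # each holding the [{"REGEX": ..., "FRIENDLY_NAME": ...}, ...] array shown above.
        with open(f"group_{group_id}.json") as f:
            rows = json.load(f)
        friendly_name = next(
            (row["FRIENDLY_NAME"] for row in rows if re.fullmatch(row["REGEX"], value)),
            None)
        results.append({"Attribute": value, "FRIENDLY_NAME": friendly_name})
        # ...persist each result to the destination store here...
    return func.HttpResponse(json.dumps(results), mimetype="application/json")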
Sorry if this is a bit vague or rambly, I'm still getting to grips with Data Factory and a lot of it seems a bit obtuse...
What I want to do is query my Cosmos Database for a list of Ids of records that need to be updated. For each of these records, I want to call a REST API using the Id (i.e. /Record/{Id}/Details)
I've created a Data Flow that took a string as a parameter and then called the REST API fine.
I then made a pipeline using a Lookup with a query (select c.RecordId from c where...) and pass that into a ForEach with Items set to @activity('Lookup1').output.value.
I then set up the activity of the ForEach to be my Data Flow. From research, I think I'm supposed to set the parameter value to "@item().RecordId", but that gives an error "parameter [name] does not match parameter type 'string'".
I can change the type of the parameter to any (and use toString([parameter]) to cast it), and then when I try to debug it passes the parameter in, but it gives an error of "Job failed due to reason: at (Line 2/Col 14): Datatype any not found".
I'm not sure what the solution is. Is there a way to cast the result of the lookup to an integer or string? Is there a way to narrow an any down? Is there a better way than toString() that would work? Is there a better way than ForEach?
I tried to reproduce a scenario similar to what you are trying.
My sample data in Cosmos DB:
To query the Cosmos database for a list of Ids and call a REST API using the Id for each of these records:
First, I took a Lookup activity in Data Factory and selected the ids where last_name is Bluth.
Its output and settings are as below:
Then I passed the output of the Lookup activity to a ForEach activity.
Then, inside the ForEach activity, I created a Data Flow activity and gave its data source as a REST API. My REST API to call a specific user is https://reqres.in/api/users/2, so I gave the base URL as https://reqres.in/api/users.
Then I created a parameter called demoId with datatype string, and in the relative URL I gave the dynamic value @dataset().demoId.
After this I set the source parameter value to @item().id, since after https://reqres.in/api/users only the id needs to be provided to fetch the data; in your case you can try Record/@{item().id}/Details.
For each id it successfully passes the id to the REST API and fetches the data:
I am trying to get the count of all records present in Cosmos DB in a Lookup activity of Azure Data Factory. I need this value to compare with the outputs of other activities.
The query I used is SELECT VALUE count(1) from c
When I try to preview the data after inserting this query I get an error saying
One or more errors occurred. Unable to cast object of type
'Newtonsoft.Json.Linq.JValue' to type 'Newtonsoft.Json.Linq.JObject'
as shown in the below image:
snapshot of my azure lookup activity settings
Could someone help me resolve this error? And if this is a limitation of Azure Data Factory, how can I get the count of all the rows of the Cosmos DB collection some other way inside Azure Data Factory?
I reproduced your issue on my side exactly.
I think the count result can't be mapped as a normal JSON object. As a workaround, you could use an Azure Function activity (inside the Azure Function you can use the SDK to execute any SQL you want) to output your desired result, e.g. {"number":10}, and then chain the Azure Function activity with the other activities in ADF.
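For example, a minimal sketch of such a function in Python, assuming an HTTP trigger and the azure-cosmos SDK (the account URL, key, database and container names are placeholders):

import json
import azure.functions as func
from azure.cosmos import CosmosClient

def main(req: func.HttpRequest) -> func.HttpResponse:
    # Placeholder connection details -- replace with your own values or app settings.
    client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
    container = client.get_database_client("<database>").get_container_client("<container>")

    # Run the same scalar count query; the SDK has no problem with a scalar result.
    result = list(container.query_items(
        query="SELECT VALUE COUNT(1) FROM c",
        enable_cross_partition_query=True))

    # Return a JSON object, which the ADF Azure Function activity can consume.
    return func.HttpResponse(json.dumps({"number": result[0]}),
                             mimetype="application/json")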
Here is the contradiction right now:
The query SQL outputs a scalar array, not something like a JSON object or even a JSON string.
However, the ADF Lookup activity only accepts a JObject, not a JValue. I can't use any built-in conversion function here, because the query SQL has to be produced with the correct syntax anyway. I already submitted a ticket to the MS support team, but had no luck with this limitation.
I also tried select count(1) as num from c, which works in the Cosmos DB portal, but it still has a limitation because the SQL crosses partitions.
So all I can do here is try to explain the root cause of the issue; I can't change the product behaviour.
Two rough ideas:
1. Try a non-partitioned collection and execute the above SQL to produce JSON output.
2. If the count is not large, try to query the columns from the DB and loop over the result with a ForEach activity.
You can use:
select top 1 column from c order by column desc
I am trying to write SAQL on data which is coming from an Event Hub in JSON format.
The input to the Azure Stream Analytics job is as shown below.
{"ver":"2019-12-28 18:41:45.4184730","Data":"Data01","d":{"IDNUM":"XXXXX01","Time1":"2017-12-20T00:00:00.0000000Z","abc":"610000","efg":"0000","XYZ":"00000","ver":"2017-12-20T18:41:45.4184730Z"}}
{"ver":"2019-12-28 18:41:45.4184730","Data":"Data01","d":{"IDNUM":"XXXXX02","Time1":"2017-12-20T00:00:00.0000000Z","abc":"750000","efg":"0000","XYZ":"90000","ver":"2017-12-20T18:41:45.4184730Z"}}
{"ver":"2017-01-01 06:28:52.5041237","Data":"Data02","d":{"IDNUM":"XXXXX03","acc":-10.7000,"PQR":35.420639038085938,"XYZ":139.95817565917969,"ver":"2017-01-01T06:28:52.5041237Z"}}
{"ver":"2017-01-01 06:28:52.5041237","Data":"Data02","d":{"IDNUM":"XXXXX04","acc":-8.5999,"PQR":35.924240112304688,"XYZ":139.6097412109375,"ver":"2017-01-01T06:28:52.5041237Z"}}
In the first two rows the attribute Time1 is available, whereas in the last two rows the Time1 attribute itself is not present.
I have to store the data into Cosmos DB based on the Time1 attribute in the input data.
Path in the JSON data: input.d.Time1.
I have to store the data that has Time1 in one Cosmos DB container and the data that does not have Time1 in another container.
I tried with the below SAQL.
SELECT [input].ver,
[input].Data,
d.*
INTO [cosmosDB01]
FROM [input] PARTITION BY PartitionId
WHERE [input].Data is not null
AND [input].d.Time1 is not null
SELECT [input].ver,
[input].Data,
d.*
INTO [cosmosDB02]
FROM [input] PARTITION BY PartitionId
WHERE [input].Data is not null
AND [input].d.Time1 is null
Are there any other ways, like an IS EXISTS keyword, in a Stream Analytics query?
To my knowledge, there is no is_exists or is_defined built-in SQL keyword in ASA so far. You have to follow the approach you mentioned in the question to deal with the multiple-outputs scenario.
(Similar case: Azure Stream Analytics How to handle multiple output table?)
Of course, you could submit feedback to the ASA team to push the progress of ASA.
I have a Cosmos DB collection in the following format:
{
"deviceid": "xxx",
"partitionKey": "key1",
.....
"_ts": 1544583745
}
I'm using Azure Data Factory to copy data from Cosmos DB to ADLS Gen 2. If I copy using a copy activity, it is quite straightforward. However, my main concern is the output path in ADLS Gen 2. Our requirements state that we need to have the output path in a specific format. Here is a sample of the requirement:
outerfolder/version/code/deviceid/year/month/day
Now, since deviceid, year, month, and day are all in the payload itself, I can't find a way to use them except to create a Lookup activity and use the output of the Lookup activity in the Copy activity.
And this is how I set the output folder using the dataset property:
I'm using SQL API on Cosmos DB to query the data.
Is there a better way I can achieve this?
I think that your way works, but it's not the cleanest. What I'd do is create a separate variable inside the pipeline for each one: version, code, deviceid, etc. Then, after the Lookup you can assign the variables, and finally do the Copy activity referencing the pipeline variables.
It may look kind of redundant, but think of someone (or you, two years from now) having to modify the pipeline when you are not around (or have forgotten): this way it is clear how it works and what should be modified.
Hope this helped!!
I'm querying Azure table storage using the Azure Storage Explorer. I want to find all messages that contain the given text, like this in T-SQL:
message like '%SysFn%'
Executing the T-SQL gives "An error occurred while processing this request"
What is the equivalent of this query in Azure?
There's no direct equivalent, as there is no wildcard searching. All supported operations are listed here. You'll see eq, gt, ge, lt, le, etc. You could make use of these, perhaps, to look for specific ranges.
Depending on your partitioning scheme, you may be able to select a subset of entities based on a specific partition key and then scan through each entity, examining message to find the specific ones you need (basically a partial partition scan).
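As a rough sketch of that partial partition scan, here is what it could look like in Python with the azure-data-tables package (the connection string, table name and partition key are placeholders):

from azure.data.tables import TableClient

# Placeholder connection details -- point these at your own table.
table = TableClient.from_connection_string("<connection-string>", table_name="<table>")

# Narrow the server-side query to one partition, then filter client-side,
# since the Table service has no 'contains'/wildcard operator.
entities = table.query_entities("PartitionKey eq '<partition-key>'")
matches = [e for e in entities if "SysFn" in str(e.get("message", ""))]

for entity in matches:
    print(entity["RowKey"], entity["message"])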
While an advanced wildcard search isn't strictly possible in Azure Table Storage, you can use a combination of the "ge" and "lt" operators to achieve a "prefix" search. This process is explained in a blog post by Scott Helme here.
Essentially this method uses ASCII incrementing to query Azure Table Storage for any rows whose property begins with a certain string of text. I've written a small PowerShell function that generates the custom filter needed to do a prefix search.
Function Get-AzTableWildcardFilter {
    param (
        [Parameter(Mandatory=$true)]
        [string]$FilterProperty,
        [Parameter(Mandatory=$true)]
        [string]$FilterText
    )
    Begin {}
    Process {
        # Increment the last character of the search text to get the exclusive upper bound of the range
        $SearchArray = ([char[]]$FilterText)
        $SearchArray[-1] = [char](([int]$SearchArray[-1]) + 1)
        $SearchString = ($SearchArray -join '')
    }
    End {
        # Entities whose property starts with $FilterText sort between the two bounds
        Write-Output "($($FilterProperty) ge '$($FilterText)') and ($($FilterProperty) lt '$($SearchString)')"
    }
}
You could then use this function with Get-AzTableRow like this (where $CloudTable is your Microsoft.Azure.Cosmos.Table.CloudTable object):
Get-AzTableRow -Table $CloudTable -CustomFilter (Get-AzTableWildcardFilter -FilterProperty 'RowKey' -FilterText 'foo')
Another option would be to export the logs from Azure Table Storage to CSV. Once you have the CSV you can open it in Excel or any other app and search for the text.
You can export Table Storage data using TableXplorer (http://clumsyleaf.com/products/tablexplorer); it has an option to export the filtered data to CSV.
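If you prefer a scripted export instead of a tool, a small sketch with the azure-data-tables package (connection string and table name are placeholders) that dumps a table to CSV could look like this:

import csv
from azure.data.tables import TableClient

# Placeholder connection details -- point these at your own table.
table = TableClient.from_connection_string("<connection-string>", table_name="<table>")
entities = list(table.list_entities())

# Use the union of all property names as the CSV header.
fieldnames = sorted({key for entity in entities for key in entity})
with open("table-export.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(entities)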