Azure Stream Analytics - Joining on a CSV file returns 0 rows

I have the following query:
SELECT
[VanList].deviceId
,[VanList].[VanName]
events.[timestamp]
,events.externaltemp
,events.internaltemp
,events.humidity
,events.latitude
,events.longitude
INTO
[iot-powerBI]
FROM
[iot-EventHub] as events timestamp by [timestamp]
join [VanList] on events.DeviceId = [VanList].deviceId
Here, [iot-EventHub] is my Event Hub input and [VanList] is a reference list (a CSV file) that has been uploaded to Azure Storage.
I have tried uploading sample data to test the query, but it always returns 0 rows.
Below is a sample of the JSON captured by my Event Hub Input
[
{
"DeviceId":1,
"Timestamp":"2015-06-29T12:15:18.0000000",
"ExternalTemp":9,
"InternalTemp":8,
"Humidity":43,
"Latitude":51.3854942,
"Longitude":-1.12774682,
"EventProcessedUtcTime":"2015-06-29T12:25:46.0932317Z",
"PartitionId":1,
"EventEnqueuedUtcTime":"2015-06-29T12:15:18.5990000Z"
} ]
Below is a sample of my CSV reference data.
deviceId,VanName
1,VAN 1
2,VAN 2
3,Standby Van
Both lists contain a device id of 1, so I am expecting my query to be able to join the two together.
I have tried using both "inner join" and "join" in my query syntax, but neither results in a successful join.
What is wrong with my Stream Analytics query?

Try adding a CAST function in the join. I'm not sure why that works when adding a CREATE TABLE clause for the VanList reference data input doesn't accomplish the same thing, but I think this does (a sketch of that CREATE TABLE definition is shown after the query below).
SELECT
[VanList].deviceId
,[VanList].[VanName]
,events.[timestamp]
,events.externaltemp
,events.internaltemp
,events.humidity
,events.latitude
,events.longitude
INTO
[iot-powerBI]
FROM
[iot-EventHub] as events timestamp by [Timestamp]
join [VanList] on events.DeviceId = cast([VanList].deviceId as bigint)
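For reference, the CREATE TABLE definition mentioned above (which on its own did not accomplish the same thing) would look roughly like this sketch, with the column types assumed from the CSV header and sample data:
CREATE TABLE [VanList] (
    deviceId bigint,
    VanName nvarchar(max)
);
Even with that schema declared, the explicit CAST in the join condition appears to be what makes the reference join match.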

The only thing I can see is that you are missing a comma in your original query (before events.[timestamp]); otherwise it looks correct. I would try recreating the Stream Analytics job. Here is another example that worked for me.
SELECT
countryref.CountryName as Geography,
input.GeographyId as GeographyId
into [country-out]
FROM input timestamp by [TransactionDateTime]
Join countryref
on countryref.GeographyID = input.GeographyId
Input data example
{"pageid":801,"firstname":"Gertrude","geographyid":2,"itemid":2,"itemprice":79.0,"transactiondatetime":"2015-06-30T14:25:51.0000000","creditcardnumber":"2ggnC"}
{"pageid":801,"firstname":"Venice","geographyid":1,"itemid":10,"itemprice":169.0,"transactiondatetime":"2015-06-30T14:25:51.0000000","creditcardnumber":"xLyOp"}
{"pageid":801,"firstname":"Christinia","geographyid":2,"itemid":2,"itemprice":79.0,"transactiondatetime":"2015-06-30T14:25:51.0000000","creditcardnumber":"VuycQ"}
{"pageid":801,"firstname":"Dorethea","geographyid":4,"itemid":2,"itemprice":79.0,"transactiondatetime":"2015-06-30T14:25:51.0000000","creditcardnumber":"tgvQP"}
{"pageid":801,"firstname":"Dwain","geographyid":4,"itemid":4,"itemprice":129.0,"transactiondatetime":"2015-06-30T14:25:51.0000000","creditcardnumber":"O5TwV"}
Country ref data
[
{
"GeographyID":1,
"CountryName":"USA"
},
{
"GeographyID":2,
"CountryName":"China"
},
{
"GeographyID":3,
"CountryName":"Brazil"
},
{
"GeographyID":4,
"CountryName":"Andrews country"
},
{
"GeographyID":5,
"CountryName":"Chile"
}
]
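With the sample input and reference data above, this join should produce output along these lines (derived by hand from the samples, so treat it as illustrative):
China,2
USA,1
China,2
Andrews country,4
Andrews country,4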

Related

How can I ingest data from Apache Avro into the Azure Data Explorer?

For several days I have been trying to ingest Apache Avro formatted data from Blob Storage into Azure Data Explorer.
I am able to reference the top-level JSON keys like $.Body (see the red underlined example in the screenshot below), but when it comes to the nested JSON keys, Azure fails to parse them properly and displays nothing (as seen in the green column: I would expect $.Body.entityId to reference the key "entityId" inside the Body JSON).
Many thanks in advance for any help!
Here is a screenshot of the Azure Data Explorer web interface.
Edit 1
I already tried increasing the "Nested levels" option to 2, but all I got was this error message with no further details. The error message won't even disappear when I decrease the level back to 1; I have to cancel and start the process all over again.
I also noticed that the auto-generated columns have some strange types. They all seem to end up as type string, which also seems a little odd to me.
Edit 2
Here is some kql-Code.
This is the schema of my input .avro file, which I get from my Event Hub capture:
{
SequenceNumber: ...,
Offset: ...,
EnqueuedTimeUTC: ...,
SystemProperties: ...,
Properties: ...,
Body: {
entityId: ...,
eventTime: ...,
messageId: ...,
data: ...
}
}, ...
And with these ingestion commands I can't reference the inner JSON keys; the top-level keys work perfectly fine.
// Create table command
////////////////////////////////////////////////////////////
.create table ['test_table'] (['Body']:dynamic, ['entityId']:string)
// Create mapping command
////////////////////////////////////////////////////////////
.create table ['test_table'] ingestion apacheavro mapping 'test_table_mapping' '[{"column":"Body", "Properties":{"Path":"$.Body"}},{"column":"entityId", "Properties":{"Path":"$.Body.entityId"}}]'
// Ingest data into table command
///////////////////////////////////////////////////////////
.ingest async into table ['test_table'] (h'[SAS URL]') with (format='apacheavro',ingestionMappingReference='test_table_mapping',ingestionMappingType='apacheavro',tags="['503a2cfb-5b81-4c07-8658-639009870862']")
I would love to ingest the inner data fields into separate columns, instead of building a workaround with update policies.
For those having the same issue, here is the workaround we currently use:
First, assume that we want to ingest the contents of the Body field from the avro file to the table avro_destination.
Step 1: Create an ingestion table
.create table avro_ingest(
Body: dynamic
// optional other columns, if you want...
)
Step 2: Create an update policy
.create-or-alter function
with (docstring = 'Convert avro_ingest to avro_destination', folder='ingest')
convert_avro_ingest() {
avro_ingest
| extend entityId = tostring(Body.entityId)
| extend messageId = tostring(Body.messageId)
| extend eventTime = todatetime(Body.eventTime)
| extend data = Body.data
| project entityId, messageId, eventTime, data
}
.alter table avro_destination policy update
#'[{ "IsEnabled": true, "Source": "avro_ingest", "Query": "convert_avro_ingest()", "IsTransactional": false, "PropagateIngestionProperties": true}]'
Step 3: Ingest the .avro files into the avro_ingest table
...as seen in the question, with one column containing the whole Body JSON per entry; a sketch of the retargeted mapping and ingest commands follows below.
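A minimal sketch of those commands, reusing the mapping and ingest syntax from the question but pointing at avro_ingest (the mapping name here is made up, and the SAS URL placeholder is carried over from the question):
// Map only the whole Body into the dynamic column
.create table ['avro_ingest'] ingestion apacheavro mapping 'avro_ingest_mapping' '[{"column":"Body", "Properties":{"Path":"$.Body"}}]'
// Ingest the capture files
.ingest async into table ['avro_ingest'] (h'[SAS URL]') with (format='apacheavro', ingestionMappingReference='avro_ingest_mapping')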
Following the OP's updates
Here is the Avro schema of an Event Hubs capture.
As you can see, Body is of type bytes, so there is practically nothing you can do with it in this form, other than ingesting it as is (as dynamic).
{
"type":"record",
"name":"EventData",
"namespace":"Microsoft.ServiceBus.Messaging",
"fields":[
{"name":"SequenceNumber","type":"long"},
{"name":"Offset","type":"string"},
{"name":"EnqueuedTimeUtc","type":"string"},
{"name":"SystemProperties","type":{"type":"map","values":["long","double","string","bytes"]}},
{"name":"Properties","type":{"type":"map","values":["long","double","string","bytes"]}},
{"name":"Body","type":["null","bytes"]}
]
}
If you take a look at the ingested data, you'll see that the content of Body is arrays of integers.
Those integers are the decimal values of the characters that make up Body.
capture
| project Body
| take 3
Body
[123,34,105,100,34,58,32,34,56,49,55,98,50,99,100,57,45,97,98,48,49,45,52,100,51,53,45,57,48,51,54,45,100,57,55,50,51,55,55,98,54,56,50,57,34,44,32,34,100,116,34,58,32,34,50,48,50,49,45,48,56,45,49,50,84,49,54,58,52,56,58,51,50,46,53,57,54,50,53,52,34,44,32,34,105,34,58,32,48,44,32,34,109,121,105,110,116,34,58,32,50,48,44,32,34,109,121,102,108,111,97,116,34,58,32,48,46,51,57,56,53,52,52,56,55,52,53,57,56,57,48,55,57,55,125]
[123,34,105,100,34,58,32,34,57,53,100,52,100,55,56,48,45,97,99,100,55,45,52,52,57,50,45,98,97,54,100,45,52,56,49,54,97,51,56,100,52,56,56,51,34,44,32,34,100,116,34,58,32,34,50,48,50,49,45,48,56,45,49,50,84,49,54,58,52,56,58,51,50,46,53,57,54,50,53,52,34,44,32,34,105,34,58,32,49,44,32,34,109,121,105,110,116,34,58,32,56,56,44,32,34,109,121,102,108,111,97,116,34,58,32,48,46,54,53,53,51,55,51,51,56,49,57,54,53,50,52,52,49,125]
[123,34,105,100,34,58,32,34,53,50,100,49,102,54,54,53,45,102,57,102,54,45,52,49,50,49,45,97,50,57,99,45,55,55,56,48,102,101,57,53,53,55,48,56,34,44,32,34,100,116,34,58,32,34,50,48,50,49,45,48,56,45,49,50,84,49,54,58,52,56,58,51,50,46,53,57,54,50,53,52,34,44,32,34,105,34,58,32,50,44,32,34,109,121,105,110,116,34,58,32,49,57,44,32,34,109,121,102,108,111,97,116,34,58,32,48,46,52,53,57,54,49,56,54,51,49,51,49,50,50,52,50,50,51,125]
Body can be converted to text using make_string() and then parsed to JSON using todynamic():
capture
| project BodyJSON = todynamic(make_string(Body))
| take 3
BodyJSON
{"id":"817b2cd9-ab01-4d35-9036-d972377b6829","dt":"2021-08-12T16:48:32.5962540Z","i":0,"myint":20,"myfloat":"0.398544874598908"}
{"id":"95d4d780-acd7-4492-ba6d-4816a38d4883","dt":"2021-08-12T16:48:32.5962540Z","i":1,"myint":88,"myfloat":"0.65537338196524408"}
{"id":"52d1f665-f9f6-4121-a29c-7780fe955708","dt":"2021-08-12T16:48:32.5962540Z","i":2,"myint":19,"myfloat":"0.45961863131224223"}
Simply increase "Nested levels" to 2.

Kusto/Azure Data Explorer - How can I partition an external table using a timespan field?

Hoping someone can help..
I am new to Kusto and need to get an external table that reads data from an Azure Blob storage account working, but the one table I have is unique in that the data for the timestamp is split into 2 separate columns, i.e. LogDate and LogTime (see script below).
My data is stored in the following structure in the Azure Storage account container (container is named "employeedata", for example):
{employeename}/{year}/{month}/{day}/{hour}/{minute}.csv, in a simple CSV format.
I know the CSV is good because if I import it into a normal Kusto table, it works perfectly.
My KQL script for the external table creation looks as follows:
.create-or-alter external table EmpLogs (Employee: string, LogDate: datetime, LogTime:timestamp)
kind=blob
partition by (EmployeeName:string = Employee, yyyy:datetime = startofday(LogDate), MM:datetime = startofday(LogDate), dd:datetime = startofday(LogDate), HH:datetime = todatetime(LogTime), mm:datetime = todatetime(LogTime))
pathformat = (EmployeeName "/" datetime_pattern("yyyy", yyyy) "/" datetime_pattern("MM", MM) "/" datetime_pattern("dd", dd) "/" substring(HH, 0, 2) "/" substring(mm, 3, 2) ".csv")
dataformat=csv
(
h@'************************'
)
with (folder="EmployeeInfo", includeHeaders="All")
I am getting the error below constantly, which is not very helpful (redacted from the full error; basically it comes down to the fact that there is a syntax error somewhere):
Syntax error: Query could not be parsed: {
"error": {
"code": "BadRequest_SyntaxError",
"message": "Request is invalid and cannot be executed.",
"#type": "Kusto.Data.Exceptions.SyntaxException",
"#message": "Syntax error: Query could not be parsed: . Query: '.create-or-alter external table ........
I know the todatetime() function works on timespans; I tested it with another table and it created a date similar to the following: 0001-01-01 20:18:00.0000000.
I have tried using the bin() function on the timestamp/LogTime columns, but I get the same error as above. I even tried importing the time value as a string and doing some string manipulation on it; no luck, the same syntax error.
Any help/guidance would be greatly appreciated.
Thank you!!
Currently, there's no way to define an external table partition based on more than one column. If your dataset's timestamp is split between two columns, LogDate:datetime and LogTime:timespan, then the best you can do is use a virtual column for the partition by time:
.create-or-alter external table EmpLogs(Employee: string, LogDate:datetime, LogTime:timespan)
kind=blob
partition by (EmployeeName:string = Employee, PartitionDate:datetime)
pathformat = (EmployeeName "/" datetime_pattern("yyyy/MM/dd/HH/mm", PartitionDate))
dataformat=csv
(
//h@'************************'
)
with (folder="EmployeeInfo", includeHeaders="All")
Now, you can filter by the virtual column and fine tune using LogTime:
external_table("EmpLogs")
| where Employee in ("John Doe", ...)
| where PartitionDate between(datetime(2020-01-01 10:00:00) .. datetime(2020-01-01 11:00:00))
| where LogTime ...

Using the output of a Lookup activity to query the DB and write to a CSV file in a storage account using ADF

My requirement is to use ADF to read data (columnA) from an xlsx/csv file in the storage account, use that (columnA) to query my DB, and write the output of my query, which includes (columnA), to a file in the storage account.
I was able to read the data from the storage account, but I get it as a table. I need to use each entry individually, like select * from table where id=columnA.
The next task, once I am able to read each value, is how to write it to a file.
I used a Lookup activity to read the data from Excel; below is the sample output. I need to use only the SKU number for my query next, and I am not able to proceed with this. Kindly suggest a solution.
I set a variable to the output of the lookup as suggested here https://www.mssqltips.com/sqlservertip/6185/azure-data-factory-lookup-activity-example/ and tried to use that variable in my query, but I get a bad template error when I trigger it.
Please try this:
I created a sample like yours, and there is no need to use Set Variable.
Details:
Below is lookup output:
{
"count": 3,
"value": [
{
"SKU": "aaaa"
},
{
"SKU": "bbbb"
},
{
"SKU": "ccc"
}
]
}
Setting of copy data activity:
Query SQL:
select * from data_source_table where Name = '@{activity('Lookup1').output.value[0].SKU}'
You can also use this SQL, if you need it:
select * from data_source_table where Name in('@{activity('Lookup1').output.value[0].SKU}','@{activity('Lookup1').output.value[1].SKU}','@{activity('Lookup1').output.value[2].SKU}')
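If the number of rows returned by the lookup is not fixed, a common alternative (a sketch, not something shown above) is to place the copy activity inside a ForEach activity whose Items expression is @activity('Lookup1').output.value, and reference the current row in the source query:
select * from data_source_table where Name = '@{item().SKU}'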
This is my test data in my SQL database:
Here is the result:
1,"aaaa",0,2017-09-01 00:56:00.0000000
2,"bbbb",0,2017-09-02 05:23:00.0000000
Hope this can help you.
Update:
You can try to use a Data Flow.
source1 is your csv file, source2 is the SQL database.
This is the setting of the lookup:
Filter condition: !isNull(PersonID) (a column in your SQL database).
Then, use a Select transformation to delete the SKU column.
Finally, output to a single file.

JOIN in Azure Stream Analytics

I have a requirement to validate the values of one column against master data in Stream Analytics.
I have written queries to fetch some data from a blob location, and one of the column values should be validated against master data available in another blob location.
Below is the SAQL I tried. signals1 is the master data in blob storage and signals2 is the data being processed and to be validated:
WITH MASTER AS (
SELECT [signals1].VAL as VAL
FROM [signals1]
)
SELECT
ID,
VAL,
SIG
INTO [output]
FROM signals2
I have to check the VAL from signals2 against the VAL in signals1.
If the VAL in signals2 is present in signals1, then we should write to output.
If the VAL in signals2 is not present in signals1, then that document should be ignored (not written to output).
I tried with a JOIN and a WHERE clause, but it is not working as expected.
Any leads on how to achieve this using JOIN or WHERE?
In case your Signal1 data is the reference input, and Signal2 is the streaming input, you can use something like the following query:
with signals as (select * from Signal2 I join Signal1 R ON I.Val = R.Val)
select * into output from signals
I tested this query locally, and I assumed that your reference data (Signal1) is in the format:
[
{
"Val":"123",
"Data":"temp"
},
{
"Val":"321",
"Data":"humidity"
}
]
And, for example, your Signal2 (the streaming input) is:
{
"Val":"123",
"SIG":"k8s23kk",
"ID":"1234589"
}
Have a look at this query and data samples to see if it can guide you towards the solution.
Side note: you cannot use this join if Signal1 is the streaming data. The way these types of joins work is that you have to use time windowing; without that it is not possible.
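Mapped onto the question's column names, the same pattern would look roughly like the sketch below (it assumes signals1 is configured as a reference data input and signals2 as the streaming input):
WITH MASTER AS (
    SELECT I.ID, I.VAL, I.SIG
    FROM signals2 I
    JOIN signals1 R ON I.VAL = R.VAL
)
SELECT ID, VAL, SIG
INTO [output]
FROM MASTER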

Stream analytics join query not working - return 0 rows

I have two inputs in my Stream Analytics job: one is CSV and the other one is JSON. I am making a join query in Stream Analytics, but it's not working.
This is my query:
SELECT
i1.Serial_No,i2.Customer_Id
FROM input1 i1
JOIN input2 i2
ON
i1.Serial_No = i2.Serial_No
Sample data
Json :
{
"Serial_No":"12345",
"Device_type":"Owned"
}
CSV :
"Serial_No,Customer_Id"
12345,12345
Can anyone please help me with this?
When using a join query in Stream Analytics, the comma delimiter was not working for my CSV input, but the tab delimiter was.
