Azure Stream Analytics Reference Input Join

I am trying to use the query below to join a stream input (deviceinput) with a reference input (refinputpdjson).
These are the inputs for my Stream Analytics job:
Input 1: Stream input from IoT Hub
Input 2: Reference data from Azure Blob Storage
The query:
SELECT
din.EventProcessedUtcTime,
din.deviceid as streamdeviceid,
din.heartrate as streamheartrate,
refin.deviceid as refdeviceid,
refin.patientid as refpatientid
FROM
deviceinput din
TIMESTAMP BY EventProcessedUtcTime
LEFT OUTER JOIN
refinputpdjson refin
ON din.deviceid = refin.deviceid
but it fails with the following error:
The join predicate is not time bounded. JOIN operation between data streams requires specifying max time distances between matching events. Please add DATEDIFF to the JOIN condition. Example: SELECT input1.a, input2.b FROM input1 JOIN input2 ON DATEDIFF(minute, input1, input2) BETWEEN 0 AND 10

As the error shows:
JOIN operation between data streams requires specifying max time distances between matching events.
So you can try something like this:
SELECT
din.EventProcessedUtcTime, din.deviceid as streamdeviceid, din.heartrate as streamheartrate, refin.deviceid as refdeviceid, refin.patientid as refpatientid
FROM deviceinput din TIMESTAMP BY EventProcessedUtcTime
LEFT OUTER JOIN refinputpdjson refin TIMESTAMP BY EventEndUtcTime
ON din.deviceid = refin.deviceid
AND DATEDIFF(minute,din,refin) BETWEEN 0 AND 15
For more detail, you can refer to the documentation.

It was an Azure service bug, and it has been resolved by the MS team.
It is working fine now.
Thank you,
Kamlesh Khollam

Related

Azure Stream analytics how to write window function when getting the data from different devices and all devices have diff fields/properties

I have telemetry coming into IoT Hub from 16 devices, and the devices are all of different types. I want to apply a window function to get the average of the last 10 minutes of data. Since each device has very different parameters, I am unable to define a single query. The device parameters are:
Voltage L1
Voltage L2
Voltage L3
Current L1
Current L2
Current L3
Power
Consumer Energy
Main_Tank_Oil_Level
Calibration_Level
Temperature
Actual Pressure
Actual_Flow
Main_Tank_Oil_Level
Calibration_Level
Temperature
Actual Pressure
Oil_Flow
Each device has multiple parameters.
How do we apply a window function in such a scenario?
From the comments, I understand you actually have multiple logical pipelines inside a single physical stream. You have multiple device types, each requiring a different Azure Stream Analytics (ASA) query. They all share the same input Event Hub/IoT Hub.
There are 3 approaches here.
1 - Multiple ASA jobs
You can create multiple ASA jobs, all reading from the same source Event Hub/IoT Hub. Each one will have a different query addressing a specific device type. The first step of each query will be to filter the stream so it looks only at the device type the job is scoped to (WHERE deviceType = 1 or WHERE VoltageL1 IS NOT NULL).
The jobs will have different output schemas. You can either target a single output across them, in which case it needs to be a "schema-less" service like Event Hub, Cosmos DB, or a storage account, or you can output to different destinations if you're targeting a strongly typed storage component (for example, different tables in a SQL database).
The key here is to define and use a different consumer group for each ASA job in the event hub.
2 - Single job, independent queries
To reduce cost, and if the volume of data allows it, you can put all your queries into a single ASA job. It will look something like this:
--- This step concentrates reads on the input source to reduce consumer group exhaustion
WITH SingleSource AS (
SELECT * FROM MyInput
)
SELECT
DeviceId,
DeviceType,
System.Timestamp() AS WindowEndtime,
AVG(VoltageL1) as AvgVoltageL1,
AVG(VoltageL2) as AvgVoltageL2
INTO MyOutput
FROM SingleSource
WHERE DeviceType = 1 -- (or something like VoltageL1 IS NOT NULL)
GROUP BY DeviceId, DeviceType, TUMBLINGWINDOW(minute,10)
SELECT
DeviceId,
DeviceType,
System.Timestamp() AS WindowEndtime,
AVG(Main_Tank_Oil_Level) as AvgMain_Tank_Oil_Level,
AVG(Calibration_Level_Temperature) as AvgCalibration_Level_Temperature
INTO MyOutput
FROM SingleSource
WHERE DeviceType = 2 -- (or something like Main_Tank_Oil_Level IS NOT NULL)
GROUP BY DeviceId, DeviceType, TUMBLINGWINDOW(minute,10)
...
That's if you want to output everything into the same destination (MyOutput here). Same as for multiple jobs, you can always create multiple outputs and vary the INTO clause.
3 - Single job, conforming query
If you also want to process the data so everything fits on a single schema, you can add a final step to do so:
--- This step concentrates reads on the input source to reduce consumer group exhaustion
WITH SingleSource AS (
SELECT * FROM MyInput
),
Type1 AS (
SELECT
DeviceId,
DeviceType,
System.Timestamp() AS WindowEndtime,
AVG(VoltageL1) as AvgVoltageL1,
AVG(VoltageL2) as AvgVoltageL2
FROM SingleSource
WHERE DeviceType = 1 -- (or something like VoltageL1 IS NOT NULL)
GROUP BY DeviceId, DeviceType, TUMBLINGWINDOW(minute,10)
),
Type2 AS (
SELECT
DeviceId,
DeviceType,
System.Timestamp() AS WindowEndtime,
AVG(Main_Tank_Oil_Level) as AvgMain_Tank_Oil_Level,
AVG(Calibration_Level_Temperature) as AvgCalibration_Level_Temperature
FROM SingleSource
WHERE DeviceType = 2 -- (or something like Main_Tank_Oil_Level IS NOT NULL)
GROUP BY DeviceId, DeviceType, TUMBLINGWINDOW(minute,10)
),
...
AllRecords AS (
SELECT DeviceId, DeviceType,WindowEndtime, AvgVoltageL1, AvgVoltageL2, NULL AS AvgMain_Tank_Oil_Level, NULL AS AvgCalibration_Level_Temperature FROM Type1
UNION ALL
SELECT DeviceId, DeviceType,WindowEndtime, NULL AS AvgVoltageL1, NULL AS AvgVoltageL2, AvgMain_Tank_Oil_Level, AvgCalibration_Level_Temperature FROM Type2
...
)
SELECT * INTO MyOutput FROM AllRecords
That's the best option if you share some common metrics and need to put that into a single SQL table.
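To make the conforming step in option 3 more concrete, here is a minimal Python sketch (not ASA code) of what the AllRecords UNION does: every per-type record is padded onto one shared column set, and metrics a device type does not report become NULL (None here). The conform helper and the sample values are hypothetical; only the column names come from the query above.

```python
# Shared output schema, taken from the column list of the AllRecords step.
COLUMNS = [
    "DeviceId", "DeviceType", "WindowEndtime",
    "AvgVoltageL1", "AvgVoltageL2",
    "AvgMain_Tank_Oil_Level", "AvgCalibration_Level_Temperature",
]


def conform(record):
    """Return the record on the shared schema; missing metrics become None."""
    return {column: record.get(column) for column in COLUMNS}


# Hypothetical aggregates standing in for the output of the Type1 and Type2 steps.
type1 = [{"DeviceId": "d1", "DeviceType": 1,
          "WindowEndtime": "2021-01-01T00:10:00Z",
          "AvgVoltageL1": 229.5, "AvgVoltageL2": 230.1}]
type2 = [{"DeviceId": "d2", "DeviceType": 2,
          "WindowEndtime": "2021-01-01T00:10:00Z",
          "AvgMain_Tank_Oil_Level": 71.0,
          "AvgCalibration_Level_Temperature": 40.2}]

# Equivalent of UNION ALL over the conformed records.
all_records = [conform(r) for r in type1 + type2]
for row in all_records:
    print(row)
```

Every row ends up with the same keys in the same order, which is what a single strongly typed SQL table needs.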

how to implement timeWindow() in Apache Flink's StreamTableEnvironment?

Hi everyone,
I want to use Flink time windows in a StreamTableEnvironment.
I have previously used the timeWindow(Time.seconds()) function with a DataStream that comes from a Kafka topic.
For external reasons I am converting this DataStream to a Table and applying a SQL query with sqlQuery().
I want to do X-second time window aggregations with SQL and then send the result to another Kafka topic.
Data source:
val stream = senv
.addSource(new FlinkKafkaConsumer[String]("flink", new SimpleStringSchema(), properties))
example of previous aggregation:
val windowCounts = stream.keyBy("x").timeWindow(Time.seconds(5), Time.seconds(5))
Current DataTable:
val tableA = tableEnv.fromDataStream(parsed, 'user, 'product, 'amount)
This part should contain a query that performs an aggregation every X seconds:
val result = tableEnv.sqlQuery(
s"SELECT * FROM $tableA WHERE amount > 2".stripMargin)
My aggregation will be more or less count(y) OVER (PARTITION BY x).
Thank you!
Ververica's training for Flink SQL will help you with this. It includes some exercises/examples that cover just this kind of query in the section on Querying Dynamic Tables with SQL.
You'll have to establish the source of timing information for each event, which can be either processing time or event time, after which the query corresponding to stream.keyBy("x").timeWindow(Time.seconds(5), Time.seconds(5)) will be something like this:
SELECT
x,
TUMBLE_END(timestamp, INTERVAL '5' SECOND) AS t,
COUNT(*) AS cnt
FROM Events
GROUP BY
x, TUMBLE(timestamp, INTERVAL '5' SECOND);
For details on how to work with time attributes, see the Introduction to Time Attributes.
And for more detailed documentation on windowing with Flink SQL see the docs on Group Windows.
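As a rough illustration of what the tumbling-window query computes, here is a small Python sketch (a simulation, not Flink code). It assumes events are (x, timestamp-in-seconds) pairs and assigns each one to a fixed, non-overlapping 5-second window, counting per key and window end. The function name tumbling_counts and the sample events are made up for this example.

```python
from collections import Counter


def tumbling_counts(events, size_s=5):
    """Count events per (key, window_end) for tumbling windows of size_s seconds.

    events: iterable of (key, timestamp_in_seconds) pairs.
    Each window covers [start, start + size_s), so window_end is exclusive,
    which mirrors what TUMBLE_END reports.
    """
    counts = Counter()
    for key, ts in events:
        window_start = ts - (ts % size_s)
        counts[(key, window_start + size_s)] += 1
    return dict(counts)


# Hypothetical events: key "a" at seconds 0, 4 and 5, key "b" at second 7.
events = [("a", 0), ("a", 4), ("a", 5), ("b", 7)]
result = tumbling_counts(events)
print(result)
```

The event at second 5 lands in the second window because each window includes its start and excludes its end.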

JOIN in Azure Stream Analytics

I have a requirement to validate the values of one column against master data in Stream Analytics.
I have written queries to fetch some data from a blob location, and one of the column values should be validated against master data available in another blob location.
Below is the SAQL I tried. signals1 is the master data in blob storage, and signals2 is the processed data to be validated:
WITH MASTER AS (
SELECT [signals1].VAL as VAL
FROM [signals1]
)
SELECT
ID,
VAL,
SIG
INTO [output]
FROM signals2
I need to validate VAL from signals2 against VAL in signals1.
If the VAL in signals2 exists in signals1, the document should be written to the output.
If the VAL in signals2 does not exist in signals1, the document should be ignored (not written to the output).
I tried with JOIN and a WHERE clause, but it is not working as expected.
Any leads on how to achieve this using JOIN or WHERE?
If your Signal1 data is the reference input and Signal2 is the streaming input, you can use something like the following query:
with signals as (select * from Signal2 I join Signal1 R ON I.Val = R.Val)
select * into output from signals
I tested this query locally, assuming that your reference data (Signal1) is in the format:
[
{
"Val":"123",
"Data":"temp"
},
{
"Val":"321",
"Data":"humidity"
}
]
And, for example, your Signal2 (the streaming input) is:
{
"Val":"123",
"SIG":"k8s23kk",
"ID":"1234589"
}
Have a look at this query and data samples to see if it can guide you towards the solution.
Side note: you cannot use this join if Signal1 is streaming data. A join between two streams has to be time-bounded (time windowing); without that it is not possible.
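To show the intended effect of the reference join on the sample data above, here is a minimal Python sketch (a simulation, not ASA code). It treats the reference rows as a static lookup table and keeps only stream events whose Val appears in it, which is the inner-join behaviour the question asks for. The function name join_with_reference and the second, non-matching event are assumptions added for illustration.

```python
def join_with_reference(stream_events, reference_rows, key="Val"):
    """Yield stream events enriched with the matching reference row.

    Events whose key is absent from the reference data are dropped,
    as in an inner join against a static lookup table.
    """
    lookup = {row[key]: row for row in reference_rows}
    for event in stream_events:
        ref = lookup.get(event[key])
        if ref is not None:
            # Event fields take precedence if both sides share a column name.
            yield {**ref, **event}


# Reference data (Signal1) and one streaming event (Signal2) from the samples above.
reference = [
    {"Val": "123", "Data": "temp"},
    {"Val": "321", "Data": "humidity"},
]
stream = [
    {"Val": "123", "SIG": "k8s23kk", "ID": "1234589"},
    {"Val": "999", "SIG": "unknown", "ID": "42"},  # hypothetical non-matching event
]

matched = list(join_with_reference(stream, reference))
print(matched)
```

Only the event with Val "123" survives; the event with Val "999" has no master record and is ignored.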

Stream analytics join query not working - return 0 rows

I have two inputs on my Stream Analytics job: one is CSV and the other is JSON. I am just running a join query on Stream Analytics, but it is not working.
This is my query
SELECT
i1.Serial_No,i2.Customer_Id
FROM input1 i1
JOIN input2 i2
ON
i1.Serial_No = i2.Serial_No
Sample data
Json :
{
"Serial_No":"12345",
"Device_type":"Owned"
}
CSV :
"Serial_No,Customer_Id"
12345,12345
Can anyone please help me with this?
When using a join query in Stream Analytics, the comma delimiter does not work for the CSV input, but the tab delimiter does.

Azure Stream Analytics and Event Order

I am trying to set up an ASA job. My use case is that I receive telemetry data from vehicles in the field. They transmit various attributes of the vehicle, e.g. whether the engine is currently on or off.
This data goes to IoT Hub and then ASA consumes it.
My problem is with the inputs arriving out of sequence. For example, in the above diagram #9 came before #8.
Here is my ASA query:
With AllSpeeder AS (
SELECT
telemetry.deviceId as vehicleid,
telemetry.enginestatus as currentenginestatus,
telemetry.datetime as currenttime,
Last(telemetry.containerdt) over (partition by telemetry.vehicleid limit duration(day,7)
when (telemetry.enginestatus = 'On2Off')) as engineontime
FROM
theiot
Timestamp by cast ( telemetry.containerdt as datetime)
where
telemetry.enginestatus = 'Off' or telemetry.enginestatus = 'On2Off'
)
SELECT * INTO theblob FROM AllSpeeder
But the above query (TIMESTAMP BY) did not fix it.
I experimented with Event Order in ASA, but events 8 and 9 above are still not getting reordered.
