Azure Stream Analytics - Joining Two Streaming Sources

I am trying to join two streaming sources that produce the same data output from Event Hub.
I want to find the maximum open price for the stock every 5 minutes and write it to a table. I am interested in both the window time and the time within the 5-minute window at which the stock price was at its maximum.
I used the query below, but it isn't producing any output.
I think I have messed up the join condition.
WITH Source1 AS (
SELECT
System.TimeStamp() as TimeSlot,max([open]) as 'MaxOpenPrice'
FROM
EventHubInputData TIMESTAMP BY TimeSlot
GROUP BY TumblingWindow(minute,5)
),
Source2 AS(
SELECT EventEnqueuedUtcTime,[open]
FROM EventHubInputDataDup TIMESTAMP BY EventEnqueuedUtcTime),
Source3 as (
select Source2.EventEnqueuedUtcTime as datetime,Source1.MaxOpenPrice,System.TimeStamp() as TimeSlot
FROM Source1
JOIN Source2
ON Source2.[Open] = Source1.[MaxOpenPrice] AND DATEDIFF (minute,Source1,Source2) BETWEEN 0 AND 5
)
SELECT datetime,MaxOpenPrice,TimeSlot
INTO EventHubOutPutSQLDB
FROM Source3

The logic is good here. First you identify the maximum value in each 5-minute window, then you look up in the original stream the time at which it happened.
WITH MaxOpen5MinTumbling AS (
SELECT
--TickerId,
System.TimeStamp() AS WindowEnd, --this always returns the end of the window when windowing
MAX([open]) AS 'MaxOpenPrice'
FROM EventHubInputData --no need to timestamp if using ingestion time
GROUP BY TumblingWindow(minute,5)
)
SELECT
--M.TickerId,
M.WindowEnd,
M.MaxOpenPrice,
O.EventEnqueuedUtcTime AS MaxOpenPriceTime
FROM MaxOpen5MinTumbling M
LEFT JOIN EventHubInputData O
ON M.MaxOpenPrice = O.[open]
AND DATEDIFF(minute,M,O) BETWEEN -5 AND 0 --The new timestamp is at the end of the window, you need to look back 5 minutes
--AND M.TickerId = O.TickerId
Note that at this point you could get multiple results per time window if the max price happens multiple times.
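To make the two-step logic concrete outside of Stream Analytics, here is a minimal Python sketch. It assumes windows that exclude their start and include their end, and the helper names (`window_end`, `max_open_per_window`, `t`) and the sample prices are invented for illustration. It also shows the multiple-results case: the first window returns two rows because the maximum occurs twice.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(minutes=5)
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def window_end(ts):
    """End of the 5-minute tumbling window (start, end] that contains ts."""
    elapsed = ts - EPOCH
    windows = -((-elapsed) // WINDOW)  # ceiling division on timedeltas
    return EPOCH + windows * WINDOW


def max_open_per_window(events):
    """events: iterable of (enqueued_time, open_price).

    Returns (window_end, max_open_price, time_of_max) rows, one row per
    event that reached the maximum of its window.
    """
    by_window = defaultdict(list)
    for ts, price in events:
        by_window[window_end(ts)].append((ts, price))

    rows = []
    for end in sorted(by_window):
        top = max(price for _, price in by_window[end])
        rows.extend((end, top, ts) for ts, price in by_window[end] if price == top)
    return rows


def t(hour, minute):
    """Hypothetical helper: a UTC timestamp on an arbitrary sample day."""
    return datetime(2021, 12, 10, hour, minute, tzinfo=timezone.utc)


events = [
    (t(10, 1), 100.0),
    (t(10, 2), 105.0),
    (t(10, 4), 105.0),  # same maximum again within the first window
    (t(10, 6), 99.0),
    (t(10, 9), 101.5),
]
rows = max_open_per_window(events)
for row in rows:
    print(row)
```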

Related

Azure Stream Analytics current day aggregation

I'm quite new to Azure Stream Analytics, but I need to push rolling totals from the start of the day to Power BI (live dashboard) every time a new event arrives at the Azure Stream Analytics job. I've created the following SQL query to calculate this:
SELECT
Factory_Id,
COUNT(0) as events_count,
MAX(event_create_time) as last_event_time,
SUM(event_value) as event_value_total
INTO
[powerbi]
FROM
[eventhub] TIMESTAMP BY event_create_time
WHERE DAY(event_create_time) = DAY(System.Timestamp) and MONTH(event_create_time) = MONTH(System.Timestamp) and YEAR(event_create_time) = YEAR(System.Timestamp)
GROUP BY Factory_Id, SlidingWindow(day,1)
But this didn't give me the desired result: I get the total for the last 24 hours (not only for the current day), and sometimes a record with a later last_event_time has a smaller events_count than a record with an earlier last_event_time. The question is: what am I doing wrong, and how can I achieve the desired outcome?
EDIT following comment: This computes the results for the last 24h, but what's needed is the running sum/count to day (from 00:00 until now). See updated answer below.
I'm wondering if an analytics approach would work better than an aggregation here.
Instead of using a time window, you calculate and emit a record for each event in input:
SELECT
Factory_Id,
COUNT(*) OVER (PARTITION BY Factory_Id LIMIT DURATION (hour, 24)) AS events_count,
system.timestamp() as last_event_time,
SUM(event_value) OVER (PARTITION BY Factory_Id LIMIT DURATION (hour, 24)) as event_value_total
INTO PowerBI
FROM [eventhub] TIMESTAMP BY event_create_time
The only hiccup is for events landing on the same time stamp:
{"Factory_Id" : 1, "event_create_time" : "2021-12-10T10:00:00", "event_value" : 0.1}
{"Factory_Id" : 1, "event_create_time" : "2021-12-10T10:01:00", "event_value" : 2}
{"Factory_Id" : 1, "event_create_time" : "2021-12-10T10:01:00", "event_value" : 10}
{"Factory_Id" : 1, "event_create_time" : "2021-12-10T10:02:00", "event_value" : 0.2}
You won't get a single record on that timestamp:
Factory_Id  events_count  last_event_time               event_value_total
1           1             2021-12-10T10:00:00.0000000Z  0.1
1           2             2021-12-10T10:01:00.0000000Z  2.1
1           3             2021-12-10T10:01:00.0000000Z  12.1
1           4             2021-12-10T10:02:00.0000000Z  12.2
We may want to add a step to the query to deal with it if it's an issue for your dashboard. Let me know!
EDIT following comment
This new version will emit progressive results on a daily tumbling window. To do that, every time we get a new record, we collect the last 24h. Then we remove the rows from the previous day, and re-calculate the new aggregates. To collect properly, we first need to make sure we only have 1 record per timestamp.
-- First we make sure we get only 1 record per timestamp, to avoid duplication in the analytics function below
WITH Collapsed AS (
SELECT
Factory_Id,
system.timestamp() as last_event_time,
COUNT(*) AS C,
SUM(event_value) AS S
FROM [input1] TIMESTAMP BY event_create_time
GROUP BY Factory_Id, system.timestamp()
),
-- Then we build an array at each timestamp, containing all records from the last 24h
Collected as (
SELECT
Factory_Id,
system.timestamp() as last_event_time,
COLLECT() OVER (PARTITION BY Factory_Id LIMIT DURATION (hour, 24)) AS all_events
FROM Collapsed
)
-- Finally we expand the array, removing the rows on the previous day, and aggregate
SELECT
C.Factory_Id,
system.timestamp() as last_event_time,
SUM(U.ArrayValue.C) AS events_count,
SUM(U.ArrayValue.S) AS event_value_total
FROM Collected AS C
CROSS APPLY GETARRAYELEMENTS(C.all_events) AS U
WHERE DAY(U.ArrayValue.last_event_time) = DAY(system.Timestamp())
GROUP BY C.Factory_Id, C.last_event_time, system.timestamp()
Let me know how it goes.
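For readers who want to check the intended result, the same idea can be sketched in plain Python. This is only an illustration under assumptions: it keeps a running accumulator that resets when the date changes instead of re-aggregating a collected array, the function name `running_daily_totals` is hypothetical, and the sample events are invented.

```python
from collections import defaultdict
from datetime import datetime


def running_daily_totals(events):
    """Return one (factory, timestamp, events_count, event_value_total) row
    per distinct (factory, timestamp), counting only the current day."""
    # Step 1: collapse events that share a factory and a timestamp.
    collapsed = defaultdict(lambda: [0, 0.0])
    for e in events:
        key = (e["Factory_Id"], datetime.fromisoformat(e["event_create_time"]))
        collapsed[key][0] += 1
        collapsed[key][1] += e["event_value"]

    # Step 2: walk each factory in time order and reset at the day boundary.
    rows = []
    state = {}  # factory -> (date, count, total)
    for (factory, ts), (count, total) in sorted(collapsed.items()):
        day, day_count, day_total = state.get(factory, (ts.date(), 0, 0.0))
        if day != ts.date():
            day_count, day_total = 0, 0.0
        day_count += count
        day_total += total
        state[factory] = (ts.date(), day_count, day_total)
        rows.append((factory, ts, day_count, day_total))
    return rows


events = [
    {"Factory_Id": 1, "event_create_time": "2021-12-09T23:50:00", "event_value": 5},
    {"Factory_Id": 1, "event_create_time": "2021-12-10T10:00:00", "event_value": 1},
    {"Factory_Id": 1, "event_create_time": "2021-12-10T10:01:00", "event_value": 2},
    {"Factory_Id": 1, "event_create_time": "2021-12-10T10:01:00", "event_value": 10},
    {"Factory_Id": 1, "event_create_time": "2021-12-10T10:02:00", "event_value": 3},
]
rows = running_daily_totals(events)
for row in rows:
    print(row)
```

The event from the previous evening does not contribute to the totals of the next day, and the two events at 10:01 produce a single row.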

How do I group datetimes with a sqlite windowing function?

Let's say I have a table with the following fields:
customerid, transactiontime, transactiontype
I want to group a customer's transactions by time, and select the customerid and the count of those transactions. But rather than simply grouping all transaction times into certain increments (15 min, 30 min, etc.), for which I've seen various solutions here, I'd like to group a set of a customer's transactions based on how soon each transaction occurs after the previous one.
In other words, if any transaction occurs more than 15 minutes after a previous transaction, I'd like it to be grouped separately.
I expect the customer to generate a few transactions close together, and potentially generate a few more later in the day. So if those two sets of transactions occur more than 15, 30 minutes apart, they'll be grouped into separate windows. Is this possible?
Yes, you can do this using a window function in SQLite. This syntax is a bit new to me, but this is how it would start:
select customer_id,
event_start_minute,
sum(subgroup_start) over (order by customer_id, event_start_minute) as subgroup
from (
select customer_id,
event_start_minute,
case
when event_start_minute - lag(event_start_minute) over win > 15
then 1
else 0
end as subgroup_start
from t1
window win as (
partition by customer_id
order by event_start_minute
)
) as groups
order by customer_id, event_start_minute
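As a rough end-to-end check, the approach can be run from Python with the built-in sqlite3 module (window functions need SQLite 3.25 or later). This is a sketch under assumptions: the table name `transactions`, the text timestamp format, and the sample rows are invented, and the gap is measured in seconds via `strftime('%s', ...)`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions "
    "(customerid INTEGER, transactiontime TEXT, transactiontype TEXT)"
)
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [
        (1, "2024-01-01 09:00:00", "buy"),
        (1, "2024-01-01 09:05:00", "buy"),
        (1, "2024-01-01 09:12:00", "sell"),
        (1, "2024-01-01 14:00:00", "buy"),   # more than 15 minutes later
        (1, "2024-01-01 14:10:00", "sell"),
        (2, "2024-01-01 09:00:00", "buy"),
    ],
)

QUERY = """
WITH flagged AS (
    -- 1 when the gap to the customer's previous transaction exceeds 15 minutes
    SELECT customerid,
           transactiontime,
           CASE
               WHEN CAST(strftime('%s', transactiontime) AS INTEGER)
                    - CAST(strftime('%s', LAG(transactiontime) OVER win) AS INTEGER)
                    > 15 * 60
               THEN 1
               ELSE 0
           END AS new_group
    FROM transactions
    WINDOW win AS (PARTITION BY customerid ORDER BY transactiontime)
),
numbered AS (
    -- the running sum of the flags numbers the groups per customer
    SELECT customerid,
           transactiontime,
           SUM(new_group) OVER (
               PARTITION BY customerid ORDER BY transactiontime
           ) AS grp
    FROM flagged
)
SELECT customerid, grp, MIN(transactiontime), COUNT(*)
FROM numbered
GROUP BY customerid, grp
ORDER BY customerid, grp
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

Customer 1 ends up with two groups (three morning transactions, two afternoon ones) and customer 2 with one.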

Azure Stream Analytics : remove duplicates while aggregating

I'm working on a system of temperature and pressure sensors, where my data flows through a Stream Analytics job. There may be duplicate messages sent in because acknowledgements were not received, among other reasons. So my data could be in this format:
DeviceID TimeStamp MeasurementName Value
1 1 temperature 50
1 1 temperature 50
1 2 temperature 60
Note that the 2nd record is a duplicate of the 1st one as DeviceId and Timestamp and MeasurementName are same.
I wish to take an average over 5 min tumbling window for this data in the stream analytics job. So I have this query
SELECT
AVG(Value)
FROM
SensorData
GROUP BY
DeviceId,
MeasurementName,
TumblingWindow(minute, 5)
This query is expected to give me average measurement of temperature and pressure values for each device in 5 min.
In doing this average I need to eliminate duplicates. The actual average is (50+60)/2 = 55.
But the average given by this query will be (50+50+60)/3 = 53.33
How do I tweak this query for the right output?
Thanks in advance.
According to the Query Language Elements in ASA, it seems that DISTINCT is not supported by ASA directly. However, it can be used with COUNT.
So maybe you could refer to the SQL below to get the average of Value without duplicate data.
with temp as
(
select count(distinct DeviceID) AS device,
count(distinct TimeStamp) AS time,
count(distinct MeasurementName) AS name,
Value as v
from jsoninput
group by Value,TumblingWindow(minute, 5)
)
select avg(v) from temp
group by TumblingWindow(minute, 5)
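To make the target result explicit, here is a small Python sketch of de-duplicating before averaging. It is not ASA code: it assumes all readings fall into one 5-minute window, treats two readings as duplicates when DeviceID, TimeStamp and MeasurementName match (as stated in the question), and uses a hypothetical function name.

```python
from collections import defaultdict


def deduplicated_average(readings):
    """readings: iterable of (device_id, timestamp, measurement_name, value).

    Returns {(device_id, measurement_name): average} with duplicates
    (same device, timestamp and measurement name) counted once.
    """
    unique = {}
    for device_id, timestamp, name, value in readings:
        # A later duplicate simply overwrites the earlier identical reading.
        unique[(device_id, timestamp, name)] = value

    groups = defaultdict(list)
    for (device_id, _timestamp, name), value in unique.items():
        groups[(device_id, name)].append(value)

    return {key: sum(values) / len(values) for key, values in groups.items()}


readings = [
    (1, 1, "temperature", 50),
    (1, 1, "temperature", 50),  # duplicate of the first reading
    (1, 2, "temperature", 60),
]
averages = deduplicated_average(readings)
print(averages)  # {(1, 'temperature'): 55.0}
```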

How to determine time stamps for Cassandra queries

One of the values inserted into the table is the current time. I compute the current time using toTimestamp(now()). Now I want to compute the current time minus 90 days and the current time minus 15 days.
My question is: how do I compute the current time minus n days?
Query for the current timestamp:
INSERT INTO TABLE_NAME (col_1, col_2, col_3) VALUES ('val_1', toTimestamp(now()), val_3);
In the above query, val_2 is current timestamp. Current time stamp is determined by
toTimestamp(now())
How do I compute the current time minus 90 days, or the current time minus 2 weeks?
This functionality is not built into CQL.
If you are able to use UDFs, you can do the following (building on the example given in "How to get Last 6 Month data comparing with timestamp column using cassandra query?"):
Enable UDFs as needed by adding or changing this line to true in cassandra.yaml:
enable_user_defined_functions: true
Then add two user defined functions like this:
CREATE FUNCTION dateadd(date timestamp, daydiff int)
CALLED ON NULL INPUT
RETURNS timestamp
LANGUAGE java
AS $$java.util.Calendar c = java.util.Calendar.getInstance();c.setTime(date);c.add(java.util.Calendar.DATE, daydiff);return c.getTime();$$;
CREATE FUNCTION weekadd(date timestamp, weekdiff int)
CALLED ON NULL INPUT
RETURNS timestamp
LANGUAGE java
AS $$java.util.Calendar c = java.util.Calendar.getInstance();c.setTime(date);c.add(java.util.Calendar.DATE, weekdiff*7);return c.getTime();$$;
Select the data from your table like this:
select dateadd(col_2,-90) from TABLE_NAME;
select weekadd(col_2,-2) from TABLE_NAME;
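If UDFs are not an option, the same date arithmetic can be done in the client before the value is sent to Cassandra. The sketch below shows only the arithmetic in Python. The helper names are hypothetical, the reference time is fixed so the result is reproducible, and `timedelta` subtracts exact 24-hour days, which can differ from calendar-based arithmetic across daylight saving changes in a local time zone.

```python
from datetime import datetime, timedelta, timezone


def days_before(moment, days):
    """Timestamp that lies the given number of days before moment."""
    return moment - timedelta(days=days)


def weeks_before(moment, weeks):
    """Timestamp that lies the given number of weeks before moment."""
    return moment - timedelta(weeks=weeks)


# Fixed reference time for the example; in practice this would be
# datetime.now(timezone.utc).
now = datetime(2024, 3, 31, 12, 0, tzinfo=timezone.utc)

ninety_days_ago = days_before(now, 90)
two_weeks_ago = weeks_before(now, 2)

print(ninety_days_ago.isoformat())  # 2024-01-01T12:00:00+00:00
print(two_weeks_ago.isoformat())    # 2024-03-17T12:00:00+00:00
```

The resulting values can then be passed as ordinary timestamp parameters in an INSERT or SELECT.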

How can I count all Azure Event Hub events of the current day using Stream Analytics every 5 minutes?

I need to count all events collected during the current day, from 0:00 to 23:59 UTC, every five minutes.
I'm using a Stream Analytics service with the following query:
SELECT Cast(pid as bigint) as PublisherID,Cast(cid as bigint) as CampaignID, Count(*) as Count
INTO
[SQLTableClicks]
FROM
[Clicks]
GROUP BY pid,cid, TumblingWindow(Day,1)
It works, but it only collects data once a day, and I need to update the info every five minutes.
I think a hopping window is what you need: it will give you a result every 5 minutes, but looking a day back.
Try something like this (I didn't run it, but should give you an idea):
WITH data AS
(
SELECT
Cast(pid as bigint) as PublisherID,
Cast(cid as bigint) as CampaignID,
System.TimeStamp as Time
FROM
[Clicks]
)
SELECT PublisherID, CampaignID, Count(*) as Count
INTO
[SQLTableClicks]
FROM
[data]
WHERE DAY(System.TimeStamp) = DAY(Time)
GROUP BY PublisherID, CampaignID, HoppingWindow(Duration(day, 1), Hop(minute, 5))
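The result the question asks for can be pinned down with a small Python sketch: at every 5-minute tick, count the events per publisher and campaign from 00:00 UTC of that day up to the tick. This only illustrates the expected numbers, not Stream Analytics behaviour. The function name and the sample clicks are invented.

```python
from collections import Counter
from datetime import datetime, timezone


def counts_since_midnight(events, tick):
    """Count events per (pid, cid) from 00:00 UTC of tick's day up to tick.

    events: iterable of (timestamp, pid, cid) tuples.
    """
    midnight = tick.replace(hour=0, minute=0, second=0, microsecond=0)
    return Counter(
        (pid, cid) for ts, pid, cid in events if midnight <= ts <= tick
    )


def utc(day, hour, minute):
    """Hypothetical helper: a UTC timestamp in December 2021."""
    return datetime(2021, 12, day, hour, minute, tzinfo=timezone.utc)


clicks = [
    (utc(9, 23, 58), 1, 10),  # previous day, must not be counted
    (utc(10, 0, 1), 1, 10),
    (utc(10, 0, 3), 1, 10),
    (utc(10, 0, 7), 2, 20),
]

at_0005 = counts_since_midnight(clicks, utc(10, 0, 5))
at_0010 = counts_since_midnight(clicks, utc(10, 0, 10))
print(dict(at_0005))  # {(1, 10): 2}
print(dict(at_0010))  # {(1, 10): 2, (2, 20): 1}
```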
