I'm quite new to Azure Stream Analytics, but I need to push rolling totals from the start of the day to Power BI (a live dashboard) every time a new event arrives at the Azure Stream Analytics job. I've created the following SQL query to calculate this:
SELECT
Factory_Id,
COUNT(0) as events_count,
MAX(event_create_time) as last_event_time,
SUM(event_value) as event_value_total
INTO
[powerbi]
FROM
[eventhub] TIMESTAMP BY event_create_time
WHERE DAY(event_create_time) = DAY(System.Timestamp) and MONTH(event_create_time) = MONTH(System.Timestamp) and YEAR(event_create_time) = YEAR(System.Timestamp)
GROUP BY Factory_Id, SlidingWindow(day,1)
But this didn't give me the desired result: I get the total for the last 24 hours (not only for the current day), and sometimes a record with a bigger last_event_time has a smaller events_count than a record with a smaller last_event_time. The question is: what am I doing wrong, and how can I achieve the desired outcome?
EDIT following comment: this computes the results for the last 24 hours, but what's needed is the running sum/count for the current day (from 00:00 until now). See the updated answer below.
I'm wondering if an analytics approach would work better than an aggregation here.
Instead of using a time window, you calculate and emit a record for each event in the input:
SELECT
Factory_Id,
COUNT(*) OVER (PARTITION BY Factory_Id LIMIT DURATION (hour, 24)) AS events_count,
system.timestamp() as last_event_time,
SUM(event_value) OVER (PARTITION BY Factory_Id LIMIT DURATION (hour, 24)) as event_value_total
INTO PowerBI
FROM [eventhub] TIMESTAMP BY event_create_time
The only hiccup is for events landing on the same timestamp:
{"Factory_Id" : 1, "event_create_time" : "2021-12-10T10:00:00", "event_value" : 0.1}
{"Factory_Id" : 1, "event_create_time" : "2021-12-10T10:01:00", "event_value" : 2}
{"Factory_Id" : 1, "event_create_time" : "2021-12-10T10:01:00", "event_value" : 10}
{"Factory_Id" : 1, "event_create_time" : "2021-12-10T10:02:00", "event_value" : 0.2}
You won't get a single record for that timestamp, but one output record per input event:
Factory_Id | events_count | last_event_time              | event_value_total
1          | 1            | 2021-12-10T10:00:00.0000000Z | 0.1
1          | 2            | 2021-12-10T10:01:00.0000000Z | 2.1
1          | 3            | 2021-12-10T10:01:00.0000000Z | 12.1
1          | 4            | 2021-12-10T10:02:00.0000000Z | 12.2
We may want to add a step to the query to deal with it if it's an issue for your dashboard. Let me know!
EDIT following comment
This new version will emit progressive results on a daily tumbling window. To do that, every time we get a new record, we collect the last 24h. Then we remove the rows from the previous day, and re-calculate the new aggregates. To collect properly, we first need to make sure we only have 1 record per timestamp.
-- First we make sure we get only 1 record per timestamp, to avoid duplication in the analytics function below
WITH Collapsed AS (
SELECT
Factory_Id,
system.timestamp() as last_event_time,
COUNT(*) AS C,
SUM(event_value) AS S
FROM [eventhub] TIMESTAMP BY event_create_time
GROUP BY Factory_Id, system.timestamp()
),
-- Then we build an array at each timestamp, containing all records from the last 24h
Collected as (
SELECT
Factory_Id,
system.timestamp() as last_event_time,
COLLECT() OVER (PARTITION BY Factory_Id LIMIT DURATION (hour, 24)) AS all_events
FROM Collapsed
)
-- Finally we expand the array, removing the rows on the previous day, and aggregate
SELECT
C.Factory_Id,
system.timestamp() as last_event_time,
SUM(U.ArrayValue.C) AS events_count,
SUM(U.ArrayValue.S) AS event_value_total
FROM Collected AS C
CROSS APPLY GETARRAYELEMENTS(C.all_events) AS U
WHERE DAY(U.ArrayValue.last_event_time) = DAY(system.Timestamp())
GROUP BY C.Factory_Id, C.last_event_time, system.timestamp()
Let me know how it goes.
I have a data modeling question. In my application I'm reading data from a few different sensors and storing it in Cassandra. The sensors generate new values at very different rates: Some every other second, some every other month.
Furthermore, the assumption is that a value stays valid until the next one is encountered. Example: Sensor 1 sent a value of 500 at 10s after EPOCH and a value of 1000 at 20s after EPOCH. The valid value for 15s after EPOCH would need to be 500.
Since some rates are going to be high and I don't want unbounded partitions, I want to apply bucketing. I'm thinking about modeling my data like this:
CREATE TABLE sensor_data (
sensor_id text,
some_timing_bucket date,
measured_at time,
value double,
PRIMARY KEY ((sensor_id, some_timing_bucket), measured_at)
) WITH CLUSTERING ORDER BY (measured_at DESC);
The usual queries the application would need to serve are "give me the data of the last 5/15 minutes/1 day", so I would choose the some_timing_bucket accordingly. Maybe even have multiple tables with different bucket sizes.
What I cannot wrap my head around is this: Consider I choose one day as bucketing interval. Now I want to retrieve the current value of a sensor that hasn't updated in ten days. There will be no partition for today, so on my application layer I would need to send nine queries that yield nothing until I have gone far enough back in time to encounter the value that is currently valid. That doesn't sound very efficient and I'd appreciate any input on how to model this.
Side note: This would not be an issue if all data for the same sensor was in the same partition: Just ask for all the points with a timestamp less than the beginning of the ranged query and limit the results to one. But that's not feasible because of the unbounded partition.
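For reference, that infeasible single-partition lookup would be something like this (a hypothetical schema keyed only by sensor_id, with measured_at clustered DESC; the sensor id and timestamp are made up):

-- Works only if one partition held the sensor's full history:
SELECT value
FROM sensor_data
WHERE sensor_id = 'sensor-1'
  AND measured_at < '2022-08-17 08:15:00+0000'
LIMIT 1;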
There is a much simpler way to model your data by using one-day buckets. Something like:
CREATE TABLE sensor_data_by_day (
sensor_id text,
year int,
month int,
day int,
measured_at timestamp,
value double,
PRIMARY KEY ((sensor_id, year, month, day), measured_at)
) WITH CLUSTERING ORDER BY (measured_at DESC)
If a sensor measures a data point every second, then there are 86,400 maximum possible values for a single day (60 secs × 60 mins × 24 hrs). 86K rows per partition is still manageable.
If today is 17 August 2022 and you wanted to retrieve the data for the previous day, the query would be:
SELECT value FROM sensor_data_by_day
WHERE sensor_id = ?
AND year = 2022
AND month = 8
AND day = 16
Assuming it is currently 08:30:00 GMT on the 17th of August (1660725000000 ms since epoch), to retrieve the data for the last 15 minutes (900 secs ago or 1660724100000 ms):
SELECT value FROM sensor_data_by_day
WHERE sensor_id = ?
AND year = 2022
AND month = 8
AND day = 17
AND measured_at > 1660724100000
I think you'll find that it is easier to work with timestamps because it provides a bit more flexibility when it comes to doing range queries.
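For instance, both bounds can be supplied to slice out any window inside a day's partition (a made-up example continuing the timestamps above):

SELECT value FROM sensor_data_by_day
WHERE sensor_id = ?
  AND year = 2022
  AND month = 8
  AND day = 17
  AND measured_at >= 1660716000000  -- 06:00 GMT
  AND measured_at < 1660725000000;  -- 08:30 GMT

Cheers!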
You can do this with a simpler table like this:
CREATE TABLE sensor_data (
sensor_id text,
day_number_from_1970 int,
measured_at timestamp,
value double,
PRIMARY KEY ((sensor_id, day_number_from_1970), measured_at)
) WITH CLUSTERING ORDER BY (measured_at DESC)
and you can query data like that:
SELECT value
FROM sensor_data
WHERE sensor_id = some_sensor_id
AND day_number_from_1970 = day_number
AND measured_at > start_time
AND measured_at < end_time
With a single int column, you will store less data on disk, and queries will work just as well.
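The bucket value itself is just integer division of the epoch timestamp by the milliseconds in a day, computed in the application layer. A made-up lookup for 17 August 2022 (reusing the timestamps from the previous answer):

-- day_number_from_1970 = epoch_ms / 86400000 (integer division);
-- for 2022-08-17, 1660725000000 / 86400000 = 19221.
SELECT value
FROM sensor_data
WHERE sensor_id = 'sensor-1'
  AND day_number_from_1970 = 19221
  AND measured_at > '2022-08-17 08:15:00+0000';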
I am trying to join 2 streaming sources which produce the same data output from Event Hub.
I am trying to find the maximum open price for the stock every 5 minutes and write it to the table. I am interested in the time within the 5-minute window at which the stock reached its maximum, as well as the window time.
I used the query below, but it isn't producing any output.
I think I have messed up the join condition.
WITH Source1 AS (
SELECT
System.TimeStamp() as TimeSlot,max([open]) as 'MaxOpenPrice'
FROM
EventHubInputData TIMESTAMP BY TimeSlot
GROUP BY TumblingWindow(minute,5)
),
Source2 AS(
SELECT EventEnqueuedUtcTime,[open]
FROM EventHubInputDataDup TIMESTAMP BY EventEnqueuedUtcTime),
Source3 as (
select Source2.EventEnqueuedUtcTime as datetime,Source1.MaxOpenPrice,System.TimeStamp() as TimeSlot
FROM Source1
JOIN Source2
ON Source2.[Open] = Source1.[MaxOpenPrice] AND DATEDIFF (minute,Source1,Source2) BETWEEN 0 AND 5
)
SELECT datetime,MaxOpenPrice,TimeSlot
INTO EventHubOutPutSQLDB
FROM Source3
The logic is good here. First you identify the maximum value on each window of 5 minutes, then you look up in the original stream the time at which it happened.
WITH MaxOpen5MinTumbling AS (
SELECT
--TickerId,
System.TimeStamp() AS WindowEnd, --this always returns the end of the window when windowing
MAX([open]) AS 'MaxOpenPrice'
FROM EventHubInputData --no need to timestamp if using ingestion time
GROUP BY TumblingWindow(minute,5)
)
SELECT
--M.TickedId,
M.WindowEnd,
M.MaxOpenPrice,
O.EventEnqueuedUtcTime AS MaxOpenPriceTime
FROM MaxOpen5MinTumbling M
LEFT JOIN EventHubInputData O
ON M.MaxOpenPrice = O.[open]
AND DATEDIFF(minute,M,O) BETWEEN -5 AND 0 --The new timestamp is at the end of the window, you need to look back 5 minutes
--AND M.TickerId = O.TickerId
Note that at this point you could get multiple results per time window if the max price happens multiple times.
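If that's an issue for your output, one alternative (a sketch I haven't tested, using my own names for the CTE and output columns) is to skip the join entirely and let TopOne pick a single event per window:

WITH Tops AS (
    SELECT
        System.TimeStamp() AS WindowEnd,
        TopOne() OVER (ORDER BY [open] DESC) AS MaxOpenEvent --one event per window, so ties collapse
    FROM EventHubInputData
    GROUP BY TumblingWindow(minute,5)
)
SELECT
    WindowEnd,
    MaxOpenEvent.[open] AS MaxOpenPrice,
    MaxOpenEvent.EventEnqueuedUtcTime AS MaxOpenPriceTime
FROM Tops

The trade-off is that you get a single, arbitrary occurrence of the max price per window rather than all of them.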
Let's say I have a table with the following fields:
customerid, transactiontime, transactiontype
I want to group a customer's transactions by time, and select the customerid and the count of those transactions. But rather than simply grouping all transaction times into fixed increments (15 min, 30 min, etc.), for which I've seen various solutions here, I'd like to group a customer's transactions based on how soon each transaction occurs after the previous one.
In other words, if any transaction occurs more than 15 minutes after a previous transaction, I'd like it to be grouped separately.
I expect the customer to generate a few transactions close together, and potentially generate a few more later in the day. So if those two sets of transactions occur more than 15 or 30 minutes apart, they'll be grouped into separate windows. Is this possible?
Yes, you can do this using a window function in SQLite. This syntax is a bit new to me, but this is how it would start:
select customer_id,
event_start_minute,
sum(subgroup_start) over (order by customer_id, event_start_minute) as subgroup
from (
select customer_id,
event_start_minute,
case
when event_start_minute - lag(event_start_minute) over win > 15
then 1
else 0
end as subgroup_start
from t1
window win as (
    partition by customer_id
    order by event_start_minute
)
) as groups
order by customer_id, event_start_minute
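From there, the counts the question asks for are one more aggregation away. A sketch reusing the same placeholder names (t1, customer_id, event_start_minute) as above:

select customer_id,
       subgroup,
       count(*) as transaction_count
from (
    select customer_id,
           event_start_minute,
           sum(subgroup_start) over (order by customer_id, event_start_minute) as subgroup
    from (
        select customer_id,
               event_start_minute,
               case
                   when event_start_minute - lag(event_start_minute) over win > 15
                   then 1
                   else 0
               end as subgroup_start
        from t1
        window win as (
            partition by customer_id
            order by event_start_minute
        )
    ) as flagged
) as sessions
group by customer_id, subgroup
order by customer_id, subgroup;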
I'm working on a system of temperature and pressure sensors, where my data flows through a Stream Analytics job. There may be duplicate messages sent in because of acknowledgements not being received, among various other reasons. So my data could be of the format:
DeviceID TimeStamp MeasurementName Value
1 1 temperature 50
1 1 temperature 50
1 2 temperature 60
Note that the 2nd record is a duplicate of the 1st one, as DeviceID, TimeStamp, and MeasurementName are the same.
I wish to take an average over a 5-minute tumbling window for this data in the Stream Analytics job. So I have this query:
SELECT
AVG(Value)
FROM
SensorData
GROUP BY
DeviceId,
MeasurementName,
TumblingWindow(minute, 5)
This query is expected to give me the average temperature and pressure measurements for each device over 5 minutes.
In computing this average I need to eliminate duplicates. The actual average is (50+60)/2 = 55.
But the average given by this query will be (50+50+60)/3 = 53.33.
How do I tweak this query for the right output?
Thanks in advance.
According to the Query Language Elements in ASA, it seems that DISTINCT is not supported by ASA directly. However, it can be used with COUNT, as shown here.
So maybe you could refer to my SQL below to get the average of Value without duplicate data.
with temp as
(
select count(distinct DeviceID) AS device,
count(distinct TimeStamp) AS time,
count(distinct MeasurementName) AS name,
Value as v
from jsoninput
group by Value,TumblingWindow(minute, 5)
)
select avg(v) from temp
group by TumblingWindow(minute, 5)
Output with your sample data: the average comes out to 55, as expected.
I need to count all events collected during the current day, from 0:00 to 23:59 UTC, every five minutes.
I'm using a Stream Analytics service with the current query:
SELECT Cast(pid as bigint) as PublisherID,Cast(cid as bigint) as CampaignID, Count(*) as Count
INTO
[SQLTableClicks]
FROM
[Clicks]
GROUP BY pid,cid, TumblingWindow(Day,1)
It works, but it only outputs data once a day, and I need the info updated every five minutes.
I think a hopping window is what you need: it will give you a result every 5 minutes, looking a day back.
Try something like this (I didn't run it, but it should give you an idea):
With data as
(
    SELECT
        Cast(pid as bigint) as PublisherID,
        Cast(cid as bigint) as CampaignID,
        System.TimeStamp as Time
    FROM
        [Clicks]
)
SELECT PublisherID, CampaignID, Count(*) as Count
INTO
    [SQLTableClicks]
FROM
    [data]
WHERE (DAY(System.TimeStamp) = DAY(Time))
GROUP BY PublisherID, CampaignID, HoppingWindow(Duration(day, 1), Hop(minute, 5))