Azure Stream Analytics: remove duplicates while aggregating

I'm working on a system of temperature and pressure sensors, where my data flows through a Stream Analytics job. There may be duplicate messages sent because acknowledgements were not received, among other reasons. So my data could be of the format:
DeviceID  TimeStamp  MeasurementName  Value
1         1          temperature      50
1         1          temperature      50
1         2          temperature      60
Note that the 2nd record is a duplicate of the 1st, since DeviceID, TimeStamp, and MeasurementName are all the same.
I wish to take an average over a 5-minute tumbling window of this data in the Stream Analytics job, so I have this query:
SELECT
AVG(Value)
FROM
SensorData
GROUP BY
DeviceId,
MeasurementName,
TumblingWindow(minute, 5)
This query is expected to give me the average temperature and pressure measurements for each device over 5 minutes.
In taking this average I need to eliminate duplicates: the actual average is (50+60)/2 = 55, but the average given by this query will be (50+50+60)/3 = 53.33.
How do I tweak this query to get the right output?
Thanks in advance.

According to the Query Language Elements in ASA, it seems that DISTINCT is not supported by ASA directly. However, it can be used with COUNT, as in COUNT(DISTINCT ...).
So maybe you could refer to my SQL below to get the average of Value without duplicate data.
WITH temp AS
(
    SELECT
        COUNT(DISTINCT DeviceID) AS device,
        COUNT(DISTINCT TimeStamp) AS time,
        COUNT(DISTINCT MeasurementName) AS name,
        Value AS v
    FROM jsoninput
    GROUP BY Value, TumblingWindow(minute, 5)
)
SELECT AVG(v) FROM temp
GROUP BY TumblingWindow(minute, 5)
Output with your sample data: an average of 55, with the duplicate row collapsed.
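As an alternative (a sketch, not from the original answer): collapse exact duplicates first by grouping on the key columns, then average the de-duplicated values in a second step. This assumes duplicates agree on DeviceID, TimeStamp, MeasurementName, and Value, and uses the SensorData input from the question.
WITH Dedup AS
(
    -- Collapse exact duplicates: rows sharing DeviceID/TimeStamp/MeasurementName
    -- carry the same Value, so MIN picks it exactly once per group.
    SELECT
        DeviceID,
        MeasurementName,
        MIN(Value) AS Value
    FROM SensorData
    GROUP BY DeviceID, TimeStamp, MeasurementName, TumblingWindow(minute, 5)
)
SELECT
    DeviceID,
    MeasurementName,
    AVG(Value) AS AvgValue
FROM Dedup
GROUP BY DeviceID, MeasurementName, TumblingWindow(minute, 5)
The second GROUP BY works because the first step emits its rows timestamped at the window end, which ASA assigns to the tumbling window ending at that same instant (window ends are inclusive).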

Related

Cassandra: Data Modeling for event based time series

I have a data modeling question. In my application I'm reading data from a few different sensors and storing it in Cassandra. The sensors generate new values at very different rates: Some every other second, some every other month.
Furthermore, the assumption is that a value stays valid until the next one is encountered. Example: Sensor 1 sent a value of 500 at 10s after EPOCH and a value of 1000 at 20s after EPOCH. The valid value for 15s after EPOCH would need to be 500.
Since some rates are going to be high and I don't want unbounded partitions, I want to apply bucketing. I'm thinking about modeling my data like this:
CREATE TABLE sensor_data (
    sensor_id text,
    some_timing_bucket date,
    measured_at time,
    value double,
    PRIMARY KEY ((sensor_id, some_timing_bucket), measured_at)
) WITH CLUSTERING ORDER BY (measured_at DESC);
The usual queries the application would need to serve are "give me the data of the last 5/15 minutes/1 day", so I would choose the some_timing_bucket accordingly. Maybe even have multiple tables with different bucket sizes.
What I cannot wrap my head around is this: Consider I choose one day as bucketing interval. Now I want to retrieve the current value of a sensor that hasn't updated in ten days. There will be no partition for today, so on my application layer I would need to send nine queries that yield nothing until I have gone far enough back in time to encounter the value that is currently valid. That doesn't sound very efficient and I'd appreciate any input on how to model this.
Side note: This would not be an issue if all data for the same sensor was in the same partition: Just ask for all the points with a timestamp less than the beginning of the ranged query and limit the results to one. But that's not feasible because of the unbounded partition.
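For illustration, that single-partition lookup would be a sketch like the following, assuming a hypothetical schema with PRIMARY KEY (sensor_id, measured_at), measured_at as a timestamp, and the same DESC clustering order (the sensor id is a made-up example):
-- Latest value at or before a point in time; DESC clustering makes LIMIT 1
-- return the newest matching row.
SELECT value FROM sensor_data
WHERE sensor_id = 'sensor-1'
AND measured_at <= '2022-08-17 08:15:00+0000'
LIMIT 1;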
There is a much simpler way to model your data by using one-day buckets. Something like:
CREATE TABLE sensor_data_by_day (
    sensor_id text,
    year int,
    month int,
    day int,
    measured_at timestamp,
    value double,
    PRIMARY KEY ((sensor_id, year, month, day), measured_at)
) WITH CLUSTERING ORDER BY (measured_at DESC);
If a sensor measures a data point every second, then there are at most 86,400 possible values for a single day (60 seconds x 60 minutes x 24 hours). 86K rows per partition is still manageable.
If today is 17 August 2022 and you wanted to retrieve the data for the previous day, the query would be:
SELECT value FROM sensor_data_by_day
WHERE sensor_id = ?
AND year = 2022
AND month = 8
AND day = 16
Assuming it is currently 08:30:00 GMT on the 17th of August (1660725000000 ms since epoch), to retrieve the data for the last 15 minutes (900 secs ago or 1660724100000 ms):
SELECT value FROM sensor_data_by_day
WHERE sensor_id = ?
AND year = 2022
AND month = 8
AND day = 17
AND measured_at > 1660724100000
I think you'll find it easier to work with timestamps because they provide a bit more flexibility when it comes to doing range queries. Cheers!
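As a quick illustration of that flexibility (a sketch, not from the original answer), a sub-day range on the same table can use timestamp literals instead of epoch milliseconds:
-- All values between 08:00 and 08:30 GMT on 17 August 2022
SELECT value FROM sensor_data_by_day
WHERE sensor_id = ?
AND year = 2022 AND month = 8 AND day = 17
AND measured_at >= '2022-08-17 08:00:00+0000'
AND measured_at < '2022-08-17 08:30:00+0000';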
You can do this with a simpler table, like this:
CREATE TABLE sensor_data (
    sensor_id text,
    day_number_from_1970 int,
    measured_at timestamp,
    value double,
    PRIMARY KEY ((sensor_id, day_number_from_1970), measured_at)
) WITH CLUSTERING ORDER BY (measured_at DESC);
and you can query it like this:
SELECT value
FROM sensor_data
WHERE sensor_id = some_sensor_id
AND day_number_from_1970 = day_number
AND measured_at > start_time
AND measured_at < end_time
With a single int bucket column (for example, days since epoch = epoch milliseconds / 86,400,000, computed in the application), you should store less data on disk and still get results efficiently.
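To make that concrete (a sketch with example values; the sensor id is hypothetical): 17 August 2022 is day 1660694400 / 86400 = 19221 since 1970, so a 08:15-08:30 range query becomes:
SELECT value
FROM sensor_data
WHERE sensor_id = 'sensor-1'
AND day_number_from_1970 = 19221 -- 2022-08-17
AND measured_at > '2022-08-17 08:15:00+0000'
AND measured_at < '2022-08-17 08:30:00+0000';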

Azure Stream Analytics current day aggregation

I'm quite new to Azure Stream Analytics, but I need to push rolling totals from the start of the day to Power BI (live dashboard) every time a new event arrives at the Azure Stream Analytics job. I've created the following SQL query to calculate this:
SELECT
Factory_Id,
COUNT(0) as events_count,
MAX(event_create_time) as last_event_time,
SUM(event_value) as event_value_total
INTO
[powerbi]
FROM
[eventhub] TIMESTAMP BY event_create_time
WHERE DAY(event_create_time) = DAY(System.Timestamp)
    AND MONTH(event_create_time) = MONTH(System.Timestamp)
    AND YEAR(event_create_time) = YEAR(System.Timestamp)
GROUP BY Factory_Id, SlidingWindow(day,1)
But this didn't give me the desired result: I get the total for the last 24 hours (not only the current day), and sometimes a record with a later last_event_time has a smaller events_count than a record with an earlier last_event_time. What am I doing wrong, and how can I achieve the desired outcome?
EDIT following comment: this first query computes results for the last 24 hours, but what's needed is the running sum/count for the current day (from 00:00 until now). See the updated answer below.
I'm wondering if an analytics approach would work better than an aggregation here.
Instead of using a time window, you calculate and emit a record for each incoming event:
SELECT
Factory_Id,
COUNT(*) OVER (PARTITION BY Factory_Id LIMIT DURATION (hour, 24)) AS events_count,
system.timestamp() as last_event_time,
SUM(event_value) OVER (PARTITION BY Factory_Id LIMIT DURATION (hour, 24)) as event_value_total
INTO PowerBI
FROM [eventhub] TIMESTAMP BY event_create_time
The only hiccup is for events landing on the same timestamp:
{"Factory_Id" : 1, "event_create_time" : "2021-12-10T10:00:00", "event_value" : 0.1}
{"Factory_Id" : 1, "event_create_time" : "2021-12-10T10:01:00", "event_value" : 2}
{"Factory_Id" : 1, "event_create_time" : "2021-12-10T10:01:00", "event_value" : 10}
{"Factory_Id" : 1, "event_create_time" : "2021-12-10T10:02:00", "event_value" : 0.2}
You won't get a single record on that timestamp:
Factory_Id  events_count  last_event_time                 event_value_total
1           1             2021-12-10T10:00:00.0000000Z    0.1
1           2             2021-12-10T10:01:00.0000000Z    2.1
1           3             2021-12-10T10:01:00.0000000Z    12.1
1           4             2021-12-10T10:02:00.0000000Z    12.3
We may want to add a step to the query to deal with it if it's an issue for your dashboard. Let me know!
EDIT following comment
This new version will emit progressive results on a daily tumbling window. To do that, every time we get a new record, we collect the last 24h. Then we remove the rows from the previous day, and re-calculate the new aggregates. To collect properly, we first need to make sure we only have 1 record per timestamp.
-- First we make sure we get only 1 record per timestamp, to avoid duplication in the analytics function below
WITH Collapsed AS (
SELECT
Factory_Id,
system.timestamp() as last_event_time,
COUNT(*) AS C,
SUM(event_value) AS S
FROM [input1] TIMESTAMP BY event_create_time
GROUP BY Factory_Id, system.timestamp()
),
-- Then we build an array at each timestamp, containing all records from the last 24h
Collected as (
SELECT
Factory_Id,
system.timestamp() as last_event_time,
COLLECT() OVER (PARTITION BY Factory_Id LIMIT DURATION (hour, 24)) AS all_events
FROM Collapsed
)
-- Finally we expand the array, removing the rows on the previous day, and aggregate
SELECT
C.Factory_Id,
system.timestamp() as last_event_time,
SUM(U.ArrayValue.C) AS events_count,
SUM(U.ArrayValue.S) AS event_value_total
FROM Collected AS C
CROSS APPLY GETARRAYELEMENTS(C.all_events) AS U
WHERE DAY(U.ArrayValue.last_event_time) = DAY(system.Timestamp())
GROUP BY C.Factory_Id, C.last_event_time, system.timestamp()
For the sample events above, Collapsed emits one row per timestamp (the two 10:01 events collapse to C = 2, S = 12), and the final step at 10:02 reports events_count = 4 and event_value_total = 12.3 for the day. Let me know how it goes.

Azure Stream Analytics - Joining Two Streaming Source

I am trying to join two streaming sources which produce the same data output from an Event Hub.
I am trying to find the maximum open price for the stock every 5 minutes and write it to a table. I am interested in the time within the 5-minute window at which the stock hit its maximum, as well as the window time.
I used the query below, but it isn't producing any output.
I think I have messed up the join condition.
WITH Source1 AS (
SELECT
System.TimeStamp() as TimeSlot,max([open]) as 'MaxOpenPrice'
FROM
EventHubInputData TIMESTAMP BY TimeSlot
GROUP BY TumblingWindow(minute,5)
),
Source2 AS(
SELECT EventEnqueuedUtcTime,[open]
FROM EventHubInputDataDup TIMESTAMP BY EventEnqueuedUtcTime),
Source3 as (
select Source2.EventEnqueuedUtcTime as datetime,Source1.MaxOpenPrice,System.TimeStamp() as TimeSlot
FROM Source1
JOIN Source2
ON Source2.[Open] = Source1.[MaxOpenPrice] AND DATEDIFF (minute,Source1,Source2) BETWEEN 0 AND 5
)
SELECT datetime,MaxOpenPrice,TimeSlot
INTO EventHubOutPutSQLDB
FROM Source3
The logic is good here. First you identify the maximum value in each 5-minute window, then you look up in the original stream the time at which it happened.
WITH MaxOpen5MinTumbling AS (
SELECT
--TickerId,
System.TimeStamp() AS WindowEnd, --this always return the end of the window when windowing
MAX([open]) AS 'MaxOpenPrice'
FROM EventHubInputData --no need to timestamp if using ingestion time
GROUP BY TumblingWindow(minute,5)
)
SELECT
--M.TickedId,
M.WindowEnd,
M.MaxOpenPrice,
O.EventEnqueuedUtcTime AS MaxOpenPriceTime
FROM MaxOpen5MinTumbling M
LEFT JOIN EventHubInputData O
ON M.MaxOpenPrice = o.[open]
AND DATEDIFF(minute,M,O) BETWEEN -5 AND 0 --The new timestamp is at the end of the window, you need to look back 5 minutes
--AND M.TickerId = O.TickerId
Note that at this point you could get multiple results per time window if the max price happens multiple times.
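If you need exactly one row per window, a possible alternative (a sketch, not from the original answer) is ASA's TopOne() aggregate, which returns a single top record per window and avoids the self-join entirely, assuming the input events carry EventEnqueuedUtcTime as in the question:
SELECT
    System.Timestamp() AS WindowEnd,
    -- TopOne() returns the single record with the highest [open] in the window
    TopOne() OVER (ORDER BY [open] DESC) AS MaxOpenEvent
FROM EventHubInputData
GROUP BY TumblingWindow(minute, 5)
MaxOpenEvent.[open] and MaxOpenEvent.EventEnqueuedUtcTime then carry the maximum price and the time at which it occurred.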

filter for key-value pair in cassandra wide rows

I am trying to model time series data with many sensors (> 50k) in Cassandra. As I would like to filter on multiple sensors at the same time, I thought the following (wide row) schema might be suitable:
CREATE TABLE data(
time timestamp,
session_id int,
sensor text,
value float,
PRIMARY KEY((time, session_id), sensor)
);
If every sensor value were a column in an RDBMS, my query would ideally look like:
SELECT * FROM data WHERE sensor_1 > 10 AND sensor_2 < 2;
Translated to my cassandra schema, I assumed the query might look like:
SELECT * FROM data
WHERE
sensor = 'sensor_1' AND
value > 10 AND
sensor = 'sensor_2' AND
value < 2;
I now have two problems:
1. Cassandra tells me that I can filter on the sensor column only once:
sensor cannot be restricted by more than one relation if it includes an Equal
2. Obviously, the filter on value doesn't make sense at the moment. I wouldn't know how to express the relationship between sensor and value in the query in order to filter on multiple columns in the same (wide) row.
I do know that a solution to the first problem would be to use CQL's IN clause. This however doesn't solve the second problem.
Is this scenario even suitable for cassandra?
Many thanks in advance.
You could try to use the IN clause here, so your query would look like this:
SELECT * FROM data
WHERE time = <time> and session_id = <session id>
AND sensor IN ('sensor_1', 'sensor_2')
AND value > 10 AND value < 2

Presto Cassandra Connector Clustering Index

CQL Execution [returns instantly, presumably using the clustering key index]:
cqlsh:stats> select count(*) from events where month='2015-04' and day = '2015-04-02';
count
-------
5447
Presto Execution [takes around 8 secs]:
presto:default> select count(*) as c from cassandra.stats.events where month = '2015-04' and day = timestamp '2015-04-02';
c
------
5447
(1 row)
Query 20150228_171912_00102_cxzfb, FINISHED, 1 node
Splits: 2 total, 2 done (100.00%)
0:08 [147K rows, 144KB] [17.6K rows/s, 17.2KB/s]
Why does Presto process 147K rows when Cassandra itself responds with just 5447 rows for the same query? (I tried SELECT * too.)
Why is Presto not able to use the clustering key optimization?
I tried all possible values like timestamp, date, and different date formats, with no effect on the number of rows being fetched.
CF Reference:
CREATE TABLE events (
month text,
day timestamp,
test_data text,
some_random_column text,
event_time timestamp,
PRIMARY KEY (month, day, event_time)
) WITH comment='Test Data'
AND read_repair_chance = 1.0;
Added event_time as a constraint too, in response to Dain's answer:
presto:default> select count(*) from cassandra.stats.events where month = '2015-04' and day = timestamp '2015-04-02 00:00:00+0000' and event_time = timestamp '2015-04-02 00:00:34+0000';
_col0
-------
1
(1 row)
Query 20150301_071417_00009_cxzfb, FINISHED, 1 node
Splits: 2 total, 2 done (100.00%)
0:07 [147K rows, 144KB] [21.3K rows/s, 20.8KB/s]
The Presto engine will push down simple WHERE clauses like this to a connector (you can see this in the Hive connector), so the question is why the Cassandra connector does not take advantage of this. To see why, we'll have to look at the code.
The pushdown system first interacts with connectors in the ConnectorSplitManager.getPartitions(ConnectorTableHandle, TupleDomain) method, so looking at the CassandraSplitManager, I see it is delegating the logic to getPartitionKeysSet. This method looks for a range constraint (e.g., x=33 or x BETWEEN 1 AND 10) for every column in the primary key, so in your case, you would need to add a constraint on event_time.
I don't know why the code insists on having a constraint on every column in the primary key, but I'd guess that it is a bug. It should be easy to tweak this code to remove that constraint.
