Let's say I have a table with the following fields:
customerid, transactiontime, transactiontype
I want to group a customer's transactions by time and select the customerid and the count of those transactions. But rather than simply grouping all transaction times into fixed increments (15 min, 30 min, etc.), for which I've seen various solutions here, I'd like to group a customer's transactions based on how soon each transaction occurs after the previous one.
In other words, if any transaction occurs more than 15 minutes after a previous transaction, I'd like it to be grouped separately.
I expect the customer to generate a few transactions close together, and potentially generate a few more later in the day. So if those two sets of transactions occur more than 15 (or 30) minutes apart, they should be grouped into separate windows. Is this possible?
Yes, you can do this using a window function in SQLite. This syntax is a bit new to me, but this is how it would start:
select customer_id,
       event_start_minute,
       sum(subgroup_start) over (order by customer_id, event_start_minute) as subgroup
from (
    select customer_id,
           event_start_minute,
           -- flag rows that start a new group: more than 15 minutes after the previous transaction
           case
               when event_start_minute - lag(event_start_minute) over win > 15
                   then 1
               else 0
           end as subgroup_start
    from t1
    window win as (
        partition by customer_id
        order by event_start_minute
    )
) as groups
order by customer_id, event_start_minute
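To get from those subgroup numbers to what the question asks for (each customer and the count of transactions per gap-separated window), one more level of aggregation should do it. A minimal sketch, keeping the same illustrative names (t1, customer_id, event_start_minute) rather than the question's customerid/transactiontime:
with windowed as (
    select customer_id,
           event_start_minute,
           sum(subgroup_start) over (order by customer_id, event_start_minute) as subgroup
    from (
        select customer_id,
               event_start_minute,
               case
                   when event_start_minute - lag(event_start_minute)
                        over (partition by customer_id order by event_start_minute) > 15
                       then 1
                   else 0
               end as subgroup_start
        from t1
    ) as flagged
)
select customer_id,
       subgroup,
       count(*) as transaction_count
from windowed
group by customer_id, subgroup
order by customer_id, subgroup;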
I am building a saved search in NetSuite (it is currently an Inventory Detail search). Here is what I am trying to do, in simple terms:
Quantity per transaction = pulled from the transaction
Running total quantity = based on quantity per transaction (add/subtract)
Rate per transaction = pulled from the transaction
Running average cost = calculated by running total amount / running total quantity
Amount per transaction = IF transaction amount = 0 THEN quantity per transaction * running average cost ELSE pull transaction amount
Running total amount = calculated by quantity per transaction * (rate per transaction OR running average cost if rate is null)
I am able to get all of the above fields except for the running total amount. I am utilizing a trick I found online: 'SUM/* comment */(x)...'.
Here is what I'm trying and failing at:
SUM/* comment */(
    CASE WHEN {transaction.amount} = 0
        THEN {itemcount} * NVL(
            ROUND(
                (SUM/* comment */({transaction.amount}) OVER (PARTITION BY {item} ORDER BY {transaction.datecreated} ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW))
                / (SUM/* comment */({itemcount}) OVER (PARTITION BY {item} ORDER BY {transaction.datecreated} ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)),
            5),
        0)
        ELSE {transaction.amount}
    END
) OVER (PARTITION BY {item} ORDER BY {transaction.datecreated} ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
I believe I know the reason it's failing: you can't nest a SUM inside a SUM. But I am not sure of another way. I had the idea of declaring a variable to store the average cost and amount in and then calling them, but I don't think you can declare variables in a NetSuite saved search.
I have a data modeling question. In my application I'm reading data from a few different sensors and storing it in Cassandra. The sensors generate new values at very different rates: Some every other second, some every other month.
Furthermore, the assumption is that a value stays valid until the next one is encountered. Example: Sensor 1 sent a value of 500 at 10s after EPOCH and a value of 1000 at 20s after EPOCH. The valid value for 15s after EPOCH would need to be 500.
Since some rates are going to be high and I don't want unbounded partitions, I want to apply bucketing. I'm thinking about modeling my data like this:
CREATE TABLE sensor_data (
sensor_id text,
some_timing_bucket date,
measured_at time,
value double,
PRIMARY KEY ((sensor_id, some_timing_bucket), measured_at)
) WITH CLUSTERING ORDER BY (measured_at DESC);
The usual queries the application would need to serve are "give me the data of the last 5/15 minutes/1 day", so I would choose the some_timing_bucket accordingly. Maybe even have multiple tables with different bucket sizes.
What I cannot wrap my head around is this: Consider I choose one day as bucketing interval. Now I want to retrieve the current value of a sensor that hasn't updated in ten days. There will be no partition for today, so on my application layer I would need to send nine queries that yield nothing until I have gone far enough back in time to encounter the value that is currently valid. That doesn't sound very efficient and I'd appreciate any input on how to model this.
Side note: This would not be an issue if all data for the same sensor was in the same partition: Just ask for all the points with a timestamp less than the beginning of the ranged query and limit the results to one. But that's not feasible because of the unbounded partition.
There is a much simpler way to model your data by using one-day buckets. Something like:
CREATE TABLE sensor_data_by_day (
sensor_id text,
year int,
month int,
day int,
measured_at timestamp,
value double,
PRIMARY KEY ((sensor_id, year, month, day), measured_at)
) WITH CLUSTERING ORDER BY (measured_at DESC)
If a sensor measures a data point every second, then there are 86,400 maximum possible values for a single day (60 secs x 60 mins x 24 hrs). 86K rows per partition is still manageable.
If today is 17 August 2022 and you wanted to retrieve the data for the previous day, the query would be:
SELECT value FROM sensor_data_by_day
WHERE sensor_id = ?
AND year = 2022
AND month = 8
AND day = 16
Assuming it is currently 08:30:00 GMT on the 17th of August (1660725000000 ms since epoch), to retrieve the data for the last 15 minutes (900 secs ago or 1660724100000 ms):
SELECT value FROM sensor_data_by_day
WHERE sensor_id = ?
AND year = 2022
AND month = 8
AND day = 17
AND measured_at > 1660724100000
I think you'll find it easier to work with timestamps because they provide a bit more flexibility when it comes to doing range queries. Cheers!
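As a side note, the DESC clustering order also makes the "value that is currently valid" lookup from the question's side note cheap within a single daily partition; a minimal sketch, where the literal day and timestamp are just examples:
SELECT value FROM sensor_data_by_day
WHERE sensor_id = ?
AND year = 2022
AND month = 8
AND day = 17
AND measured_at <= 1660724100000
LIMIT 1;
If that day's partition happens to be empty, the application still has to step back one day at a time, as described in the question.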
You can do this with a simpler table, like this:
CREATE TABLE sensor_data (
sensor_id text,
day_number_from_1970 int,
measured_at timestamp,
value double,
PRIMARY KEY ((sensor_id, day_number_from_1970), measured_at)
) WITH CLUSTERING ORDER BY (measured_at DESC)
and you can query the data like this:
SELECT value
FROM sensor_data
WHERE sensor_id = some_sensor_id
AND day_number_from_1970 = day_number
AND measured_at > start_time
AND measured_at < end_time
With a single int column for the day bucket, you store less data on disk and still get results quickly.
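For illustration, the bucket is just the epoch timestamp divided by the number of milliseconds in a day, so a query for the 15 minutes before 08:30 GMT on 17 August 2022 might look like this (the sensor id is made up; 19221 = floor(1660725000000 / 86400000)):
SELECT value
FROM sensor_data
WHERE sensor_id = 'sensor-1'
AND day_number_from_1970 = 19221
AND measured_at > 1660724100000
AND measured_at <= 1660725000000;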
I'm quite new to Azure Stream Analytics, but I need to push rolling totals from the start of the day to Power BI (a live dashboard) every time a new event arrives at the Azure Stream Analytics job. I've created the following SQL query to calculate this:
SELECT
Factory_Id,
COUNT(0) as events_count,
MAX(event_create_time) as last_event_time,
SUM(event_value) as event_value_total
INTO
[powerbi]
FROM
[eventhub] TIMESTAMP BY event_create_time
WHERE DAY(event_create_time) = DAY(System.Timestamp) and MONTH(event_create_time) = MONTH(System.Timestamp) and YEAR(event_create_time) = YEAR(System.Timestamp)
GROUP BY Factory_Id, SlidingWindow(day,1)
But this didn't give me the desired result: I get the total for the last 24 hours (not only for the current day), and sometimes a record with a later last_event_time has a smaller events_count than a record with an earlier last_event_time. What am I doing wrong, and how can I achieve the desired outcome?
EDIT following comment: This computes the results for the last 24h, but what's needed is the running sum/count for the current day (from 00:00 until now). See the updated answer below.
I'm wondering if an analytics approach would work better than an aggregation here.
Instead of using a time window, you calculate and emit a record for each event in the input:
SELECT
Factory_Id,
COUNT(*) OVER (PARTITION BY Factory_Id LIMIT DURATION (hour, 24)) AS events_count,
system.timestamp() as last_event_time,
SUM(event_value) OVER (PARTITION BY Factory_Id LIMIT DURATION (hour, 24)) as event_value_total
INTO PowerBI
FROM [eventhub] TIMESTAMP BY event_create_time
The only hiccup is for events landing on the same time stamp:
{"Factory_Id" : 1, "event_create_time" : "2021-12-10T10:00:00", "event_value" : 0.1}
{"Factory_Id" : 1, "event_create_time" : "2021-12-10T10:01:00", "event_value" : 2}
{"Factory_Id" : 1, "event_create_time" : "2021-12-10T10:01:00", "event_value" : 10}
{"Factory_Id" : 1, "event_create_time" : "2021-12-10T10:02:00", "event_value" : 0.2}
You won't get a single record on that timestamp:
Factory_Id  events_count  last_event_time                 event_value_total
1           1             2021-12-10T10:00:00.0000000Z    0.1
1           2             2021-12-10T10:01:00.0000000Z    2.1
1           3             2021-12-10T10:01:00.0000000Z    12.1
1           4             2021-12-10T10:02:00.0000000Z    12.2
We may want to add a step to the query to deal with it if it's an issue for your dashboard. Let me know!
EDIT following comment
This new version will emit progressive results on a daily tumbling window. To do that, every time we get a new record, we collect the last 24h. Then we remove the rows from the previous day, and re-calculate the new aggregates. To collect properly, we first need to make sure we only have 1 record per timestamp.
-- First we make sure we get only 1 record per timestamp, to avoid duplication in the analytics function below
WITH Collapsed AS (
SELECT
Factory_Id,
system.timestamp() as last_event_time,
COUNT(*) AS C,
SUM(event_value) AS S
FROM [eventhub] TIMESTAMP BY event_create_time
GROUP BY Factory_Id, system.timestamp()
),
-- Then we build an array at each timestamp, containing all records from the last 24h
Collected as (
SELECT
Factory_Id,
system.timestamp() as last_event_time,
COLLECT() OVER (PARTITION BY Factory_Id LIMIT DURATION (hour, 24)) AS all_events
FROM Collapsed
)
-- Finally we expand the array, removing the rows on the previous day, and aggregate
SELECT
C.Factory_Id,
system.timestamp() as last_event_time,
SUM(U.ArrayValue.C) AS events_count,
SUM(U.ArrayValue.S) AS event_value_total
FROM Collected AS C
CROSS APPLY GETARRAYELEMENTS(C.all_events) AS U
WHERE DAY(U.ArrayValue.last_event_time) = DAY(system.Timestamp())
GROUP BY C.Factory_Id, C.last_event_time, system.timestamp()
Let me know how it goes.
I'm working on a system of temperature and pressure sensors, where my data flows through a Stream Analytics job. There may be duplicate messages sent in because of acknowledgements not being received, among other reasons. So my data could be of the format:
DeviceID  TimeStamp  MeasurementName  Value
1         1          temperature      50
1         1          temperature      50
1         2          temperature      60
Note that the 2nd record is a duplicate of the 1st one, as DeviceID, TimeStamp, and MeasurementName are the same.
I wish to take an average over a 5 min tumbling window for this data in the Stream Analytics job. So I have this query:
SELECT
AVG(Value)
FROM
SensorData
GROUP BY
DeviceId,
MeasurementName,
TumblingWindow(minute, 5)
This query is expected to give me the average temperature and pressure values for each device over 5 minutes.
In doing this average I need to eliminate duplicates. The actual average is (50+60)/2 = 55.
But the average given by this query will be (50+50+60)/3 = 53.33.
How do I tweak this query for the right output?
Thanks in advance.
According to the Query Language Elements in ASA, it seems that DISTINCT is not supported by ASA directly. However, it can be used together with COUNT.
So maybe you could refer to my SQL below to get the average of Value without duplicate data.
with temp as
(
select count(distinct DeviceID) AS device,
count(distinct TimeStamp) AS time,
count(distinct MeasurementName) AS name,
Value as v
from jsoninput
group by Value,TumblingWindow(minute, 5)
)
select avg(v) from temp
group by TumblingWindow(minute, 5)
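Alternatively, borrowing the same-timestamp collapsing idea from the Stream Analytics answer earlier on this page, a first step grouping by DeviceID, MeasurementName and System.Timestamp() should squash exact duplicates before the 5 minute average is taken. This is only a sketch, with the input name SensorData and the column names taken from the question, and it assumes TimeStamp is a real datetime column:
WITH Deduped AS (
    -- collapse rows that share DeviceID, MeasurementName and the same event timestamp
    SELECT DeviceID, MeasurementName, AVG(Value) AS Value
    FROM SensorData TIMESTAMP BY [TimeStamp]
    GROUP BY DeviceID, MeasurementName, System.Timestamp()
)
SELECT DeviceID, MeasurementName, AVG(Value) AS avg_value
FROM Deduped
GROUP BY DeviceID, MeasurementName, TumblingWindow(minute, 5)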
I have a table in MS ACCESS 2013 that looks like this:
Id    Department    Status       FollowingDept  ActualArea
1000  Thinkerers    Thinking     Thinkerer      Thinkerer
1000  Drawers       OnDrawBoard  Drawers        Drawers
1000  MaterialPlan  To Plan      MaterialPlan   MaterialPlan
1000  Painters      MatNeeded    MaterialPlan
1000  Builders      DrawsNeeded  Drawers
The table follows an ID which has to pass through five departments, each department with at least 5 different statuses.
Each status has a FollowingDept value; for example, the Department Thinkerers has the status MoreCoffeeNow, which means the FollowingDept is Drawers.
All columns except for ActualArea get their values from the feed of a query.
ActualArea is an Expr where I inserted this logic:
Iif(FollowingDept = Department, FollowingDept, "")
My logic is simple, if the FollowingDept and Department coincide, then the ID's ActualArea gets the value from FollowingDept.
But as you can see, there can be cases like my example above, where 3 departments coincide with the FollowingDept. These cases are rare, but I would like to add something like a priority in Access.
Thinkerers has the top priority, then MaterialPlan, then Drawers, then Builders, and lastly Painters. So, following the same example, after ActualArea gets 3 values, Access will execute another query or subquery or whatever, where it will evaluate each value's priority and only leave behind the one with the top priority. So in this example, Thinkerers gets the top priority, and the other two values are eliminated from the ActualArea column.
Please keep in mind there are over 500 different IDs, and each ID is repeated 5 times, so there will be around 2,500 records to evaluate.
You need another table, with the possible values for actualArea and the priorities as numbers, and then you can select with a JOIN and order on the priority:
SELECT TOP 1 d.*, p.priority
FROM departments d
LEFT JOIN priorities p ON d.actualArea = p.dept
WHERE d.id = 1000
AND p.priority IS NOT NULL
ORDER BY p.priority ASC
The IS NOT NULL clause eliminates all of the rows where actualArea is empty. The TOP condition leaves only the row with the top priority.
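For reference, a minimal sketch of that priorities table, filled in with the ranking from the question (the table and column names are only suggestions):
CREATE TABLE priorities (dept TEXT(50), priority INTEGER);
INSERT INTO priorities (dept, priority) VALUES ('Thinkerers', 1);
INSERT INTO priorities (dept, priority) VALUES ('MaterialPlan', 2);
INSERT INTO priorities (dept, priority) VALUES ('Drawers', 3);
INSERT INTO priorities (dept, priority) VALUES ('Builders', 4);
INSERT INTO priorities (dept, priority) VALUES ('Painters', 5);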
You don't seem to have a primary key for your table. If you don't, then I'll give another query in a minute, but I would strongly advise you to go back and add a primary key to the table. It will save you an incredible amount of headache later. I did add such a key to my test table, and it's called pID. This query makes use of that pID to remove the records you need:
DELETE FROM departments WHERE pID NOT IN (
SELECT TOP 1 d.pID
FROM departments d
LEFT JOIN priorities p ON d.actualArea = p.dept
WHERE id = 1000
AND p.priority IS NOT NULL
ORDER BY p.priority ASC
) AND id = 1000
If you can't add a primary key to the data and actualArea is assumed to be unique, then you can just use the actualArea values to perform the delete:
DELETE FROM departments WHERE actualArea NOT IN (
SELECT TOP 1 d.actualArea
FROM departments d
LEFT JOIN priorities p ON d.actualArea = p.dept
WHERE id = 1000
AND p.priority IS NOT NULL
ORDER BY p.priority ASC
) AND id = 1000
If actualArea is not going to be unique, then we'll need to revisit this answer. This answer also assumes that you already have the id number.