I have a table of events that are recorded every minute. I want to be able to filter these events by time period and also aggregate the data by hour/day/etc.
My data model:
create table min_dev_data (
device TEXT,
event_time BIGINT,
hour BIGINT,
day BIGINT,
value DOUBLE,
PRIMARY KEY ((device), event_time)
)
CREATE MATERIALIZED VIEW hour_dev_data AS
SELECT device, event_time, hour, value
FROM min_dev_data
WHERE hour IS NOT NULL AND value IS NOT NULL
and event_time IS NOT NULL AND device IS NOT NULL
PRIMARY KEY ((device), hour, event_time)
My query is:
select hour, sum(value)
from hour_dev_data
where device = 'tst' and event_time < 149000000 group by device, hour;
This fails with the error:
code=2200 [Invalid query] message="PRIMARY KEY column "event_time" cannot be restricted as preceding column "hour" is not restricted"
The only way to make it work is to add ALLOW FILTERING, which makes query performance unpredictable.
How can I change my data model to support this query and avoid ALLOW FILTERING?
You have to proactively produce these results:
create table min_dev_data (
device TEXT,
event_time BIGINT,
hour BIGINT,
day BIGINT,
value DOUBLE,
PRIMARY KEY ((device), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);
create table hour_dev_data (
device TEXT,
hour BIGINT,
day BIGINT,
event_time BIGINT,
value DOUBLE,
PRIMARY KEY ((device), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);
create table day_dev_data (
device TEXT,
day BIGINT,
event_time BIGINT,
value DOUBLE,
PRIMARY KEY ((device), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);
Each table will satisfy ONE granularity only.
Every hour, you query the minute table for the last hour of data for each device with something like:
SELECT * FROM min_dev_data WHERE device = X AND event_time < YYYY
Sum those values at the application level and store the result in the hour table:
INSERT INTO hour_dev_data (device, hour, day, event_time, value) VALUES (....);
And every day you query the hour table to produce the further aggregate data:
SELECT * FROM hour_dev_data WHERE device = X AND event_time < YYYY
Sum at the application level and store this value in the day table.
Please consider adding some form of bucketing, because at a one-minute interval your minute table will develop wide partitions within a couple of months. This should not be a problem if you keep the table in reverse order (as I did) and query only the last couple of hours. But if you want to query back in time as well, then you must definitely use bucketing in your tables, for example as sketched below.
I think you had things pretty much right already, but you need to change your filter on event_time to be a filter on hour.
select hour, sum(value)
from hour_dev_data
where device = 'tst' and hour < 1500000000
group by device, hour;
When you were filtering on event_time, you were implicitly requiring a full scan of the partition, as event_time is clustered after hour. To filter by event_time, every cell would need to be examined to check its event_time. When you filter by hour, it is first in the clustering key, so it can be scanned and filtered efficiently. See this post on ALLOW FILTERING for more on this.
I agree with xmas79 that you probably want to be bucketing at some level, perhaps by month or year depending on your frequency of events. If you're always going to be looking for the most recent values, then setting the clustering key ordering to desc is probably going to be helpful too:
CREATE MATERIALIZED VIEW hour_dev_data3 AS
SELECT device, event_time, hour, value
FROM min_dev_data
WHERE hour IS NOT NULL AND value IS NOT NULL
and event_time IS NOT NULL AND device IS NOT NULL
PRIMARY KEY ((device), hour, event_time)
WITH CLUSTERING ORDER BY (hour DESC);
Scheduling aggregations like xmas79 suggests is going to be more efficient, as the sum is done once rather than on every read. However, it adds more of a maintenance burden, whereas the materialized view handles that for you.
Related
Let's say I have a table schema that has a timestamp for the event:
CREATE TABLE event_bucket_1 (
event_source text,
event_year int,
event_month int,
event_id text,
event_time timestamp,
...
PRIMARY KEY ((event_source, event_year, event_month), event_id)
) WITH CLUSTERING ORDER BY (event_id DESC)
My question is: can I skip adding the event_year and event_month columns and replace them with some kind of function like year(event_time) and month(event_time)? The thinking is that event_year and event_month both duplicate information already in event_time.
No, it is not possible. But, from my understanding, you want to query based on year and month, right? You can accomplish this by replacing event_year and event_month with event_time in your compound key and using time ranges in your queries:
SELECT * FROM event_bucket_1 where event_source='source' and event_time > '2018-06-01 00:00:00' and event_time < '2018-07-01 00:00:00';
No, the partition key needs to be a stored value and, AFAIK, can't be evaluated by a function at query time.
You can try to open a ticket as an improvement for future versions at https://issues.apache.org/jira/secure/Dashboard.jspa. It seems to be a good use case and would fit more scenarios.
I have a device table (say, a 'device' table) which holds static fields along with the current statistics, and another table (say, a 'devicestat' table) which holds the statistics of that device collected every minute, sorted by timestamp, like below.
Example :
CREATE TABLE device(
"partitionId" text,
"deviceId" text,
"name" text,
"totalMemoryInMB" bigint,
"totalCpu" int,
"currentUsedMemoryInMB" bigint,
"totalStorageInMB" bigint,
"currentUsedCpu" int,
"ipAddress" text,
primary key ("partitionId","deviceId"));
CREATE TABLE devicestat(
"deviceId" text,
"timestamp" timestamp,
"totalMemoryInMB" bigint,
"totalCpu" int,
"usedMemoryInMB" bigint,
"totalStorageInMB" bigint,
"usedCpu" int
primary key ("deviceId","timestamp"));
where,
currentUsedMemoryInMB & currentUsedCpu => hold the most recent statistics
usedMemoryInMB & usedCpu => hold both recent and older statistics, keyed by timestamp.
Could somebody suggest the correct approach for the following design?
Whenever I need static data with the most recent statistics, I read from the device table; whenever I need the history of device statistics, I read from the devicestat table.
This looks fine to me, but the only problem is that I need to write the statistics to both tables. In the devicestat table it will be a new entry per timestamp, but in the device table we just update the statistics. What is your thought on this? Does this need to be maintained in only the single stat table, or is it fine to update the most recent stats in the device table too?
In Cassandra the common approach is to have a table (column family) per query, and denormalization is also good practice, so it's OK to keep two tables in this case.
Another way to get the latest stat from the devicestat table is to sort the data by timestamp in descending order:
CREATE TABLE devicestat(
"deviceId" text,
"timestamp" timestamp,
"totalMemoryInMB" bigint,
"totalCpu" int,
"usedMemoryInMB" bigint,
"totalStorageInMB" bigint,
"usedCpu" int
primary key ("deviceId","timestamp"))
WITH CLUSTERING ORDER BY (timestamp DESC);
Then you can query with LIMIT 1 when you know the deviceId:
select * from devicestat where deviceId = 'someId' limit 1;
But if you want to list the latest stats of devices by partitionId, then your approach of updating the device table with the latest stats is correct.
I have a table with timestamps at a 15-minute interval. Is it possible to aggregate or group by hour, with the load field being averaged?
There's a post on materialized views; you can use them to create a copy of the data bucketed by hour, then use the average aggregate function on load. I think CASSANDRA-11871 is what you're really looking for, though: its dependencies (GROUP BY support) have recently been completed, but the ticket itself hasn't been worked on yet.
Kinda just guessing at your schema, but something like this (disclaimer: not really tested):
CREATE TABLE load (
ref_equip text,
ptd_assoc text,
date timestamp,
date_hour bigint,
load float,
PRIMARY KEY ((ref_equip, ptd_assoc), date)
);
CREATE MATERIALIZED VIEW load_by_hour AS
SELECT * FROM load
WHERE ref_equip IS NOT NULL AND ptd_assoc IS NOT NULL AND date_hour IS NOT NULL AND date IS NOT NULL
PRIMARY KEY ((ref_equip, ptd_assoc), date_hour, date);
where date_hour is just the timestamp at hour resolution, i.e. the epoch milliseconds divided by 1000*60*60, computed by the client when doing the insert. You can then select the average:
SELECT avg(load) FROM load_by_hour WHERE ref_equip='blarg' AND ptd_assoc='blargy' AND date_hour = 410632;
Alternatively, something that may just be better to begin with is to store your data partitioned by hour:
CREATE TABLE load (
ref_equip text,
ptd_assoc text,
date timestamp,
date_hour bigint,
load float,
PRIMARY KEY ((ref_equip, ptd_assoc, date_hour), date)
);
I have a Cassandra table like:
CREATE TABLE sensor_data (
sensor VARCHAR,
timestamp timestamp,
value float,
PRIMARY KEY ((sensor), timestamp)
)
And an aggregation table.
CREATE TABLE sensor_data_aggregated (
sensor VARCHAR,
aggregation VARCHAR, /* hour or day */
timestamp timestamp,
min_timestamp timestamp,
min_value float,
max_timestamp timestamp,
max_value float,
avg_value float,
PRIMARY KEY ((sensor, aggregation), timestamp)
)
Is there a possibility of any trigger to fill the "sensor_data_aggregated" table automatically on insert, update, or delete on the "sensor_data" table?
My current solution would be to write a custom trigger with a second commit log, and an application that reads and truncates this log periodically to generate the aggregated data.
But I also found information that DataStax OpsCenter can do this, though no instructions on how.
What would be the best solution for this?
You can implement your own C* trigger for that, which will execute additional queries for your aggregation table after each row insert into sensor_data.
Also, for maintaining the min/max values you can use CAS and C* lightweight transactions, like:
update sensor_data_aggregated
set min_value=123
where
sensor='foo'
and aggregation='bar'
and ts='2015-01-01 00:00:00'
if min_value>123;
using a slightly updated schema ('timestamp' is a reserved keyword in CQL3; you cannot use it unescaped):
CREATE TABLE sensor_data_aggregated (
sensor text,
aggregation text,
ts timestamp,
min_timestamp timestamp,
min_value float,
max_timestamp timestamp,
max_value float,
avg_value float,
PRIMARY KEY ((sensor, aggregation), ts)
)
In the following scenario:
CREATE TABLE temperature_by_day (
weatherstation_id text,
date text,
event_time timestamp,
temperature text,
PRIMARY KEY ((weatherstation_id,date),event_time)
)WITH CLUSTERING ORDER BY (event_time DESC);
INSERT INTO temperature_by_day(weatherstation_id,date,event_time,temperature)
VALUES ('1234ABCD','2013-04-03','2013-04-03 07:01:00','72F');
INSERT INTO temperature_by_day(weatherstation_id,date,event_time,temperature)
VALUES ('1234ABCD','2013-04-03','2013-04-03 08:01:00','74F');
INSERT INTO temperature_by_day(weatherstation_id,date,event_time,temperature)
VALUES ('1234ABCD','2013-04-04','2013-04-04 07:01:00','73F');
INSERT INTO temperature_by_day(weatherstation_id,date,event_time,temperature)
VALUES ('1234ABCD','2013-04-04','2013-04-04 08:01:00','76F');
If I do the following query:
SELECT *
FROM temperature_by_day
WHERE weatherstation_id='1234ABCD'
AND date in ('2013-04-04', '2013-04-03') limit 2;
I noticed that Cassandra's result is ordered in the same sequence as the partition keys in the IN clause. In this case, I'd like to know: is the expected result ALWAYS the two records of the day '2013-04-04'? I.e., does Cassandra respect the order of the IN clause in the ordering of the result, even in a scenario with multiple nodes?