I have a data modeling question. In my application I'm reading data from a few different sensors and storing it in Cassandra. The sensors generate new values at very different rates: Some every other second, some every other month.
Furthermore, the assumption is that a value stays valid until the next one is encountered. Example: Sensor 1 sent a value of 500 at 10s after EPOCH and a value of 1000 at 20s after EPOCH. The valid value for 15s after EPOCH would need to be 500.
Since some rates are going to be high and I don't want unbounded partitions, I want to apply bucketing. I'm thinking about modeling my data like this:
CREATE TABLE sensor_data (
sensor_id text,
some_timing_bucket date,
measured_at time,
value double,
PRIMARY KEY ((sensor_id, some_timing_bucket), measured_at)
) WITH CLUSTERING ORDER BY (measured_at DESC);
The usual queries the application would need to serve are "give me the data of the last 5/15 minutes/1 day", so I would choose the some_timing_bucket accordingly. Maybe even have multiple tables with different bucket sizes.
What I cannot wrap my head around is this: Consider I choose one day as bucketing interval. Now I want to retrieve the current value of a sensor that hasn't updated in ten days. There will be no partition for today, so on my application layer I would need to send nine queries that yield nothing until I have gone far enough back in time to encounter the value that is currently valid. That doesn't sound very efficient and I'd appreciate any input on how to model this.
Side note: This would not be an issue if all data for the same sensor was in the same partition: Just ask for all the points with a timestamp less than the beginning of the ranged query and limit the results to one. But that's not feasible because of the unbounded partition.
There is a much simpler way to model your data by using one-day buckets. Something like:
CREATE TABLE sensor_data_by_day (
sensor_id text,
year int,
month int,
day int,
measured_at timestamp,
value double,
PRIMARY KEY ((sensor_id, year, month, day), measured_at)
) WITH CLUSTERING ORDER BY (measured_at DESC)
If a sensor records a data point every second, a single day holds at most 86,400 values (60 seconds x 60 minutes x 24 hours). Roughly 86K rows per partition is still manageable.
If today is 17 August 2022 and you wanted to retrieve the data for the previous day, the query would be:
SELECT value FROM sensor_data_by_day
WHERE sensor_id = ?
AND year = 2022
AND month = 8
AND day = 16
Assuming it is currently 08:30:00 GMT on the 17th of August (1660725000000 ms since epoch), to retrieve the data for the last 15 minutes (900 secs ago or 1660724100000 ms):
SELECT value FROM sensor_data_by_day
WHERE sensor_id = ?
AND year = 2022
AND month = 8
AND day = 17
AND measured_at > 1660724100000
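The bind values for a "last N minutes" query can be computed on the application side. The following is only a sketch: the function name last_minutes_query_args is hypothetical, and it assumes all timestamps are handled in UTC. It also returns the previous day's bucket when the window crosses midnight, a case the single query above does not cover.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def last_minutes_query_args(now, minutes):
    """Return (day buckets to query, cutoff in ms since epoch).

    Hypothetical helper: each bucket is a (year, month, day) tuple matching
    the partition key columns of sensor_data_by_day.
    """
    cutoff = now - timedelta(minutes=minutes)
    # Integer arithmetic avoids float rounding in the millisecond value.
    cutoff_ms = (cutoff - EPOCH) // timedelta(milliseconds=1)
    buckets = []
    day = cutoff.date()
    while day <= now.date():
        buckets.append((day.year, day.month, day.day))
        day += timedelta(days=1)
    return buckets, cutoff_ms

now = datetime(2022, 8, 17, 8, 30, 0, tzinfo=timezone.utc)
print(last_minutes_query_args(now, 15))  # ([(2022, 8, 17)], 1660724100000)
```

Each returned bucket would be bound to year, month and day in one query, with cutoff_ms bound to the measured_at restriction.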
I think you'll find that it is easier to work with timestamps because it provides a bit more flexibility when it comes to doing range queries. Cheers!
You can do this with a simpler table like this:
CREATE TABLE sensor_data (
sensor_id text,
day_number_from_1970 int,
measured_at timestamp,
value double,
PRIMARY KEY ((sensor_id, day_number_from_1970), measured_at)
) WITH CLUSTERING ORDER BY (measured_at DESC)
You can then query the data like this:
SELECT value
FROM sensor_data
WHERE sensor_id = some_sensor_id
AND day_number_from_1970 = day_number
AND measured_at > start_time
AND measured_at < end_time
With a single int column for the bucket, you should store less data on disk and still get good query performance.
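The day_number_from_1970 value is not produced by Cassandra, so the application has to derive it from the timestamp. A minimal sketch, assuming measured_at is available as milliseconds since the Unix epoch in UTC (the helper names are hypothetical):

```python
MS_PER_DAY = 24 * 60 * 60 * 1000

def day_number_from_1970(measured_at_ms):
    """Whole days elapsed since 1970-01-01 UTC for a millisecond timestamp."""
    return measured_at_ms // MS_PER_DAY

def day_numbers_between(start_ms, end_ms):
    """All day buckets that a time range touches, oldest first."""
    first = day_number_from_1970(start_ms)
    last = day_number_from_1970(end_ms)
    return list(range(first, last + 1))

print(day_number_from_1970(1660725000000))  # 19221
```

A range that spans several days needs one query per bucket returned by day_numbers_between.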
Related
I have different locations where mobile temperature measurements are made at random times (not periodically). The unique identifier of a measurement is its location together with the time of measurement (two measurements at the same location and at the same time are not possible). I need to find all the measurements at a particular location that were recorded in a particular part of the year (something like "spring measurements", "summer measurements"...). For example, I might want to get all the measurements recorded from 2nd to 25th January, regardless of the year of the measurement (NOTE: I do not want the measurements from 2nd to 25th January of a particular year, but for all the years during which measurements were taken!).
I have come up with a data model/table like this:
CREATE TABLE meas (
latitude double,
longitude double,
measurementTime timestamp,
dayOfYear int,
value double,
PRIMARY KEY ((latitude, longitude), measurementTime, dayOfYear)
) WITH CLUSTERING ORDER BY (measurementTime DESC, dayOfYear DESC)
I would like to query:
Give me all the rows where latitude=XXX and longitude=YYY and dayOfYear >= ZZZ and dayOfYear <= WWW
So an example query would be:
SELECT * FROM meas WHERE latitude=46.1 AND longitude=15.1 AND dayOfYear >= 2 AND dayOfYear <= 25;
The problem I am facing is that this query involves filtering (since I am not providing the value of measurementTime), so I need to specify ALLOW FILTERING in the query. However, I would like to avoid filtering to make the query more efficient. Any ideas what the data model should look like?
If you need to query by day of year then you should model your table accordingly:
CREATE TABLE temp_by_location_dayofyear (
...
PRIMARY KEY ((latitude, longitude), dayOfYear)
But the day of the year alone is not a unique clustering key because the same day number occurs in every year. You're better off with both a year and a day as clustering keys, followed by the measurement time so that several measurements taken on the same day do not overwrite each other:
CREATE TABLE temp_by_location_dayofyear (
latitude double,
longitude double,
year int,
dayofyear int,
measurementtime timestamp,
value double,
PRIMARY KEY ((latitude, longitude), year, dayofyear, measurementtime)
);
You would then be able to query the table with:
SELECT * FROM temp_by_location_dayofyear
WHERE latitude = ?
AND longitude = ?
AND year = ?
AND dayofyear >= ?
AND dayofyear <= ?
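The year and dayofyear columns have to be filled in by the application at write time. A small sketch of that derivation, assuming the measurement time is held as milliseconds since the Unix epoch and interpreted in UTC (the function name is hypothetical):

```python
from datetime import datetime, timezone

def year_and_day_of_year(measurement_ms):
    """Split a millisecond timestamp into (year, day of year), in UTC."""
    dt = datetime.fromtimestamp(measurement_ms / 1000, tz=timezone.utc)
    # tm_yday counts from 1, so 1 January is day 1.
    return dt.year, dt.timetuple().tm_yday

print(year_and_day_of_year(1643068800000))  # (2022, 25)
```

Since the query restricts year with an equality, covering every year in which measurements exist would take one query per year.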
But strictly speaking using the day of the year is unnecessary. You were right in the first place to use the measurement time as the clustering key -- you just need to restrict the query based on the timestamp.
Here's the table schema:
CREATE TABLE temp_by_location (
latitude double,
longitude double,
measurementtime timestamp,
value double,
PRIMARY KEY ((latitude, longitude), measurementtime)
);
and you would query the table with:
SELECT * FROM temp_by_location
WHERE latitude = ?
AND longitude = ?
AND measurementtime >= '2022-01-02 +0000'
AND measurementtime <= '2022-01-25 +0000'
Remember that the CQL timestamp type is encoded as the number of milliseconds since Unix epoch so you can extract the date from it. Cheers!
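Because the original question asks for the same calendar window in every year, the timestamp range has to be repeated once per year. The sketch below builds those ranges; the function name yearly_windows is hypothetical, the dates are treated as UTC, and the end bound is exclusive so that the whole last day is included. It assumes the chosen month and day exist in every year (29 February would fail in non-leap years).

```python
from datetime import datetime, timedelta, timezone

def yearly_windows(first_year, last_year, start_md, end_md):
    """Return one (start, end_exclusive) pair of UTC datetimes per year.

    start_md and end_md are (month, day) tuples; end_md is inclusive.
    """
    windows = []
    for year in range(first_year, last_year + 1):
        start = datetime(year, start_md[0], start_md[1], tzinfo=timezone.utc)
        end = datetime(year, end_md[0], end_md[1], tzinfo=timezone.utc)
        windows.append((start, end + timedelta(days=1)))
    return windows

for start, end in yearly_windows(2021, 2022, (1, 2), (1, 25)):
    print(start.isoformat(), end.isoformat())
```

Each pair would be bound to a query of the form measurementtime >= start AND measurementtime < end, and the per-year results merged in the application.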
I have the requirement to forward data at certain intervals from my system to an external system. To do this, I already stored all rows in a table. Already forwarded data should not be exported again.
The idea is to memorize the last export time on client side and export the following records the next time. Old rows are deleted after a successful export.
CREATE TABLE export(
id int,
import_date_time timestamp,
data text,
PRIMARY KEY (id, import_date_time)
) WITH CLUSTERING ORDER BY (import_date_time DESC)
insert into export(id, import_date_time, data) values (1, toUnixTimestamp(now()), 'content')
select * from export where id = 1 and import_date_time > '2017-03-30 16:22:37'
delete from export where id = 1 and import_date_time <= '2017-03-30 16:22:37'
Has anyone already implemented something similar, or do you have a different solution?
If possible, I would prefer not to need an id for the request, because I want to export all data.
If you use a fixed partition key value (id = 1), then every insert, select and delete will hit the same node (if RF=1) over and over. Also, Cassandra creates a tombstone for every delete, and a select query has to read past those tombstones when it merges the data, so your select performance will degrade.
So instead of a fixed value, use a dynamic value like the one below:
CREATE TABLE export(
hour int,
day int,
month int,
year int,
import_date_time timestamp,
data text,
PRIMARY KEY ((hour, day, month, year), import_date_time)
) WITH CLUSTERING ORDER BY (import_date_time DESC);
Here you insert the values of hour, day, month and year extracted from import_date_time.
You need to take care of two cases when selecting data:
The previous export time and the current export time are both in the same hour.
The two times are not in the same hour.
For case one you need only one query; for case two you have to execute two queries (more generally, one per hour bucket that the interval spans).
Example query:
SELECT * FROM export WHERE hour = 16 AND day = 30 AND month = 3 AND year = 2017 AND import_date_time > '2017-03-30 16:22:37';
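The two cases above can be handled uniformly by listing every hour bucket between the previous and the current export time and running one query per bucket. A minimal sketch, assuming both times are naive datetimes in the same time zone as the stored bucket values; hour_buckets is a hypothetical helper name:

```python
from datetime import datetime, timedelta

def hour_buckets(last_export, now):
    """Return (hour, day, month, year) tuples for every hour bucket
    touched by the interval from last_export to now, oldest first."""
    current = last_export.replace(minute=0, second=0, microsecond=0)
    buckets = []
    while current <= now:
        buckets.append((current.hour, current.day, current.month, current.year))
        current += timedelta(hours=1)
    return buckets

last_export = datetime(2017, 3, 30, 16, 22, 37)
print(hour_buckets(last_export, datetime(2017, 3, 30, 17, 5, 0)))
# [(16, 30, 3, 2017), (17, 30, 3, 2017)]
```

The tuple order matches the partition key (hour, day, month, year), so each tuple can be bound directly, together with import_date_time > last_export.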
There are 100s of data points; each data point has its own separate table, with the schema and queries shown below:
Current Schema in SQLite
Table Name: Name of Data Point e.g. Temperature
Column-1: Name: Timestamp Type: TEXT (yyyy-MM-dd HH:mm:ss.ttt format) PRIMARY KEY
Column-2: Name: Value Type: FLOAT
Column-3: Name: Quality Type: TEXT ("GOOD", "BAD")
Queries for SQLite
SELECT * FROM data-point-name;
SELECT * FROM data-point-name WHERE Timestamp BETWEEN timestamp-1 AND timestamp-2;
INSERT INTO data-point-name (Timestamp, Value, Quality) VALUES ("2016-01-01 00:00:05.254", 123.25454, "GOOD"); (this is an example)
Currently I have an SQLite database with one table per data point using the above schema, so essentially I have 100s of tables. This way reads and writes on one data point do not disturb queries running on other data points.
How do I translate this schema for use in Cassandra?
In your case, you can store all your data points in a single table :
CREATE TABLE datapoints (
datatype varchar,
time timestamp,
value float,
quality varchar,
PRIMARY KEY (datatype, time)
);
With this structure, you can run queries like :
SELECT *
FROM datapoints
WHERE datatype = 'data-point-name';
SELECT *
FROM datapoints
WHERE datatype = 'data-point-name'
AND time >= '2016-01-01 00:00:00'
AND time <= '2016-01-02 00:00:00';
But with this structure, Cassandra partitions the data by data point name. If a data point has many values, its partition will become huge and you can run into query performance issues.
You can refine the partitioning by decomposing the time:
CREATE TABLE datapoints (
datatype varchar,
year int,
month int,
day int,
milisecondsinday int,
value float,
quality varchar,
PRIMARY KEY ((datatype, year, month, day), milisecondsinday)
) WITH CLUSTERING ORDER BY (milisecondsinday ASC);
This structure lets Cassandra store the data in smaller partitions than the first example, and it is more efficient if you query your data by day:
SELECT *
FROM datapoints
WHERE datatype = 'data-point-type'
AND year = 2016
AND month = 1
AND day = 1;
To get all points for 'data-point-type' on 2016-01-01 between 00:00 and 01:00:
SELECT *
FROM datapoints
WHERE datatype = 'data-point-type'
AND year = 2016
AND month = 1
AND day = 1
AND milisecondsinday >= 0
AND milisecondsinday <= 3600000;
Of course, you can partition by day (as in the example) or by another time scale (hours, minutes, seconds or milliseconds). Where possible, smaller partitions are better for performance.
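With the day-partitioned table, the application has to split each timestamp into the key columns before inserting. A short sketch of that split, assuming the input uses the same text format as the SQLite example ("2016-01-01 00:00:05.254"); split_timestamp is a hypothetical helper name:

```python
from datetime import datetime

def split_timestamp(text):
    """Split 'YYYY-MM-DD HH:MM:SS.mmm' into (year, month, day, ms in day)."""
    dt = datetime.strptime(text, "%Y-%m-%d %H:%M:%S.%f")
    seconds_in_day = (dt.hour * 60 + dt.minute) * 60 + dt.second
    ms_in_day = seconds_in_day * 1000 + dt.microsecond // 1000
    return dt.year, dt.month, dt.day, ms_in_day

print(split_timestamp("2016-01-01 00:00:05.254"))  # (2016, 1, 1, 5254)
```

The first three values go into the partition key columns and the last one into milisecondsinday, which stays below 86,400,000 and therefore fits an int.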
Hope this can help you.
I have a table called Price in MySQL which looks like this:
+---------+-------------+------+-----+-------------------+-----------------------------+
| Field | Type | Null | Key | Default | Extra |
+---------+-------------+------+-----+-------------------+-----------------------------+
| Current | float(20,3) | YES | | NULL | |
| Time | timestamp | NO | | CURRENT_TIMESTAMP | on update CURRENT_TIMESTAMP |
+---------+-------------+------+-----+-------------------+-----------------------------+
My application requires me to sum and retrieve results from the last 1 hour, 2 hours, and so on up to the last week from now. I am trying to move to Cassandra and want to build a suitable model for my data. Currently I have a table in Cassandra which looks something like this:
CREATE TABLE IF NOT EXISTS HAS.Price (
ID INT,
Current float,
Time timestamp,
Time_uuid timeuuid,
PRIMARY KEY (ID, Time_uuid)
);
This is not logical as it just creates one big table and I don't think this will distribute data to other nodes. I am using a fixed id of 1 here. I believe the logical partition key in my case would be "hour", so that, for example, I can sum all the current values from the last hour, the last 2 hours and so on. In this case I am referring to this post. If I use the hour as the partition key, then all the data for, let's say, the 15th hour of the day will go in this row:
2015-08-06 15:00:00
and the data for the next hour will go to 2015-08-06 16:00:00. However, let's say the current time is 2015-08-06 16:12:43 and I want to select the records from the last hour. What will my query look like, given that part of the data is in 2015-08-06 15:00:00, which will have a different partition key?
Try the following option. (I have corrected the answer.)
Design for your queries. Here are the possible queries I could see, other than ones at minute granularity:
Get sum for day
Get sum for hour
Get sum for last hour (any time on the hour)
CREATE TABLE mykeyspace.price (
day text,
hour text,
inserttime timeuuid,
current float,
PRIMARY KEY ((day, hour), inserttime)
) WITH CLUSTERING ORDER BY (inserttime DESC)
Make 2 inserts for every transaction, like below:
insert into price (day, hour, inserttime, current) VALUES ('20150813', '', now(), 2.00);
insert into price (day, hour, inserttime, current) VALUES ('', '2015081317', now(), 2.00);
Where
day is YYYYMMDD
hour is YYYYMMDDhh (2015081317)
Select query to get the last hour at any minute, using minTimeuuid and maxTimeuuid:
select day, hour, dateOf(inserttime) from price where day = '' and hour IN ('2015081317', '2015081316') and inserttime > maxTimeuuid('2015-08-13 16:20:00-0500') and inserttime < minTimeuuid('2015-08-13 17:20:00-0500');
Note: A range query is not allowed on a partition key. The documentation says you could use the token function, but the results are not predictable.
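The key values used in the two inserts and in the IN clause can be derived from the current time. This is only a sketch with hypothetical helper names, and it assumes the application formats buckets from a naive local datetime in one consistent time zone:

```python
from datetime import datetime, timedelta

def price_row_keys(ts):
    """Partition key pairs (day, hour) for the two inserts of one price."""
    day = ts.strftime("%Y%m%d")      # YYYYMMDD
    hour = ts.strftime("%Y%m%d%H")   # YYYYMMDDhh
    # One row is keyed by day only, the other by hour only.
    return [(day, ""), ("", hour)]

def last_hour_buckets(now):
    """Hour buckets needed to cover the 60 minutes before now."""
    previous = now - timedelta(hours=1)
    return [now.strftime("%Y%m%d%H"), previous.strftime("%Y%m%d%H")]

now = datetime(2015, 8, 13, 17, 20, 0)
print(price_row_keys(now))      # [('20150813', ''), ('', '2015081317')]
print(last_hour_buckets(now))   # ['2015081317', '2015081316']
```

The two strings from last_hour_buckets correspond to the values in the IN clause of the select query, with the timeuuid bounds trimming the result to exactly one hour.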
This is not logical as it just creates one big table and I don't think this will distribute data to other nodes.
Yes, this won't distribute data across your nodes.
Here is what I think the solution should be:
CREATE TABLE IF NOT EXISTS HAS.Price (
Time_uuid timeuuid,
Current float,
PRIMARY KEY (Time_uuid)
);
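Then simply find the time_uuid for the start of the hour and the time_uuid for the end of the hour and write a query like the one below. Note that a range restriction on a partition key column is normally rejected, as pointed out in the other answer, so verify this against your Cassandra version.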
`SELECT * FROM HAS.Price WHERE time_uuid>=cdb36860-4444-11e5-8080-808080808080 AND time_uuid<=f784b8ef-450d-11e5-7f7f-7f7f7f7f7f7f`
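Boundary UUIDs like the two above can also be built on the client. The sketch below is an assumption-laden reconstruction: it packs a millisecond timestamp into a version 1 UUID and fills the clock sequence and node fields with the byte patterns visible in the two example values (80 bytes for the lower bound, 7f bytes for the upper bound). The function names are hypothetical and this is not taken from any driver's source.

```python
import uuid

# 100 ns ticks between the UUID epoch (1582-10-15) and the Unix epoch.
UUID_EPOCH_OFFSET = 0x01B21DD213814000

def _timeuuid(ticks, clock_seq_hi, clock_seq_low, node):
    """Pack a 60-bit tick count into version 1 UUID time fields."""
    time_low = ticks & 0xFFFFFFFF
    time_mid = (ticks >> 32) & 0xFFFF
    time_hi_version = ((ticks >> 48) & 0x0FFF) | 0x1000
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version,
                             clock_seq_hi, clock_seq_low, node))

def min_timeuuid(unix_ms):
    """Smallest boundary UUID for the given millisecond (assumed 0x80 fill)."""
    return _timeuuid(unix_ms * 10_000 + UUID_EPOCH_OFFSET,
                     0x80, 0x80, 0x808080808080)

def max_timeuuid(unix_ms):
    """Largest boundary UUID for the given millisecond (assumed 0x7f fill)."""
    ticks = (unix_ms + 1) * 10_000 + UUID_EPOCH_OFFSET - 1
    return _timeuuid(ticks, 0x7F, 0x7F, 0x7F7F7F7F7F7F)

print(min_timeuuid(1439749800422))
print(max_timeuuid(1439749800422))
```

In CQL itself, the minTimeuuid and maxTimeuuid functions used in the other answer avoid the need for client-side construction.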
CQL Execution [returns instantly, assuming uses clustering key index]:
cqlsh:stats> select count(*) from events where month='2015-04' and day = '2015-04-02';
count
-------
5447
Presto Execution [takes around 8secs]:
presto:default> select count(*) as c from cassandra.stats.events where month = '2015-04' and day = timestamp '2015-04-02';
c
------
5447
(1 row)
Query 20150228_171912_00102_cxzfb, FINISHED, 1 node
Splits: 2 total, 2 done (100.00%)
0:08 [147K rows, 144KB] [17.6K rows/s, 17.2KB/s]
Why does Presto process 147K rows when Cassandra itself responds with just 5447 rows for the same query (I tried select * too)?
Why is Presto not able to use the clustering key optimization?
I tried all possible value types such as timestamp and date, as well as different date formats, but could not see any effect on the number of rows being fetched.
CF Reference:
CREATE TABLE events (
month text,
day timestamp,
test_data text,
some_random_column text,
event_time timestamp,
PRIMARY KEY (month, day, event_time)
) WITH comment='Test Data'
AND read_repair_chance = 1.0;
Added event_time too as a constraint in response to Dain's answer:
presto:default> select count(*) from cassandra.stats.events where month = '2015-04' and day = timestamp '2015-04-02 00:00:00+0000' and event_time = timestamp '2015-04-02 00:00:34+0000';
_col0
-------
1
(1 row)
Query 20150301_071417_00009_cxzfb, FINISHED, 1 node
Splits: 2 total, 2 done (100.00%)
0:07 [147K rows, 144KB] [21.3K rows/s, 20.8KB/s]
The Presto engine will push down simple WHERE clauses like this to a connector (you can see this in the Hive connector), so the question is why the Cassandra connector does not take advantage of this. To see why, we'll have to look at the code.
The pushdown system first interacts with connectors in the ConnectorSplitManager.getPartitions(ConnectorTableHandle, TupleDomain) method, so looking at the CassandraSplitManager, I see it is delegating the logic to getPartitionKeysSet. This method looks for a range constraint (e.g., x=33 or x BETWEEN 1 AND 10) for every column in the primary key, so in your case, you would need to add a constraint on event_time.
I don't know why the code insists on having a constraint on every column in the primary key, but I'd guess that it is a bug. It should be easy to tweak this code to remove that constraint.