How to define keyspaces for time-series data in Cassandra?

There are hundreds of data points, and each data point has its own separate table with the schema and queries shown below:
Current Schema in SQLite
Table Name: Name of the data point, e.g. Temperature
Column-1: Name: Timestamp Type: TEXT (yyyy-MM-dd HH:mm:ss.ttt format) PRIMARY KEY
Column-2: Name: Value Type: FLOAT
Column-3: Name: Quality Type: TEXT ("GOOD", "BAD")
Queries for SQLite
SELECT * FROM data-point-name;
SELECT * FROM data-point-name WHERE Timestamp BETWEEN timesamp-1 AND timestamp-2;
INSERT INTO data-point-name (Timestamp, Value, Quality) VALUES ("2016-01-01 00:00:05.254", 123.25454, "GOOD"); (this is an example)
Currently I have an SQLite db with one table per data point using the above schema, so essentially I have hundreds of tables. This way, reads/writes on one data point do not disturb queries running on other data points.
How to translate this schema to be used in Cassandra?

In your case, you can store all your data points in a single table:
CREATE TABLE datapoints (
datatype text,
time timestamp,
value float,
quality text,
PRIMARY KEY (datatype, time)
);
With this structure, you can run queries like:
SELECT *
FROM datapoints
WHERE datatype = 'data-point-name';
SELECT *
FROM datapoints
WHERE datatype = 'data-point-name'
AND time >= '2016-01-01 00:00:00'
AND time <= '2016-01-02 00:00:00';
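As a small illustration (using the sample row from the question; the literal values are only an example), a write into this table would look like:
-- Example write: one 'Temperature' reading, values taken from the question
INSERT INTO datapoints (datatype, time, value, quality)
VALUES ('Temperature', '2016-01-01 00:00:05.254', 123.25454, 'GOOD');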
But with this structure, Cassandra will partition the data by data point name. If a data point has many values, its partition will become huge and you may run into query performance issues.
You can also refine the partitioning by decomposing the time:
CREATE TABLE datapoints (
datatype text,
year int,
month int,
day int,
milisecondsinday int,
value float,
quality text,
PRIMARY KEY ((datatype, year, month, day), milisecondsinday)
) WITH CLUSTERING ORDER BY (milisecondsinday ASC);
In this case, this structure allows Cassandra to store the data in much smaller partitions than the first example, and it works very well if you query your data by day:
SELECT *
FROM datapoints
WHERE datatype = 'data-point-type'
AND year = 2016
AND month = 1
AND day = 1;
To get all points for 'data-point-type' on 2016-01-01 between 00:00 and 01:00:
SELECT *
FROM datapoints
WHERE datatype = 'data-point-type'
AND year = 2016
AND month = 1
AND day = 1
AND milisecondsinday >= 0
AND milisecondsinday <= 3600000;
Of course, you can partition by day (as in the example) or by another time scale (hours, minutes, seconds, milliseconds). If you can, keeping partitions small is good for performance.
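For writes, the application would decompose each timestamp into the bucket columns before inserting. A hedged sketch using the sample row from the question (00:00:05.254 is 5254 milliseconds after midnight):
-- Example write: the timestamp 2016-01-01 00:00:05.254 decomposed on the client side
INSERT INTO datapoints (datatype, year, month, day, milisecondsinday, value, quality)
VALUES ('Temperature', 2016, 1, 1, 5254, 123.25454, 'GOOD');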
Hope this can help you.

Related

Cassandra data model for intersection of ranges

Assume data with pk (text), start (int), end (int), extra_data (text).
The query is: given a pk (e.g. 'pk1') and a range (e.g. [1000, 2000]), find all rows for 'pk1' which intersect that range. This logically translates to the SQL condition WHERE pk=pk1 AND end>=1000 AND start<=2000 (the intersection condition).
Notice this is NOT the same as the more conventional query of:
all rows for pk1 where start>1000 and start<2000
If I define a table with end as part of the clustering key:
CREATE TABLE test1 (
pk text,
start bigint,
end bigint,
extra_data text,
PRIMARY KEY ((pk), start, end)
)...
Then Cassandra does not allow the query:
select * from test1 where pk='pk1' and start < 2000 and end > 1000;
with "Clustering column "end" cannot be restricted (preceding column "start" is restricted by a non-EQ relation)"
Why does Cassandra not allow further filtering to limit the ranged rows (forcing this filtering to be done application-side on the results)?
A second try would be to remove 'end' from clustering columns:
CREATE TABLE test1 (
pk text,
start bigint,
end bigint,
extra_data text,
PRIMARY KEY ((pk), start)
)...
Then Cassandra warns about the query:
select * from test1 where pk='pk1' and start < 2000 and end > 1000;
with "Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING"
Here I would like to understand if I can safely add the ALLOW FILTERING and be assured Cassandra will perform the scan only of 'pk1'.
Using cqlsh 5.0.1 | Cassandra 3.11.3
Actually, I think you made the fatal mistake of designing your table first and then trying to adapt the application query to fit the table design.
In Cassandra data modelling, the primary principle is to always start by listing all your application queries THEN design a table for each of those application queries -- not the other way around.
Let's say I have an IoT use case where I have sensors collecting temperature readings once a day. If my application needs to retrieve the readings from the last 7 days from a sensor, the app query is:
Get the temperature for the last 7 days for sensor X
Assuming today is October 25, a more SQL-like representation of this app query is:
SELECT temperature FROM table
WHERE sensor = X
AND reading_date >= '2022-10-18'
AND reading_date < '2022-10-25'
This means that we need to design the table such that:
it is partitioned by sensor, and
the data is clustered by date.
The table schema would look like:
CREATE TABLE readings_by_sensor (
sensor text,
reading_date date,
temp float,
PRIMARY KEY (sensor, reading_date)
)
We can then perform a range query on the date:
SELECT temperature FROM readings_by_sensor
WHERE sensor = ?
AND reading_date >= '2022-10-18'
AND reading_date < '2022-10-25'
You don't need two separate columns to represent the start and end of the range, because both bounds can be applied to the single clustering column in one query. Cheers!
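For completeness, a sketch of how a daily reading could be written into this table (the sensor id and temperature are made-up values):
-- Example write: one reading per sensor per day
INSERT INTO readings_by_sensor (sensor, reading_date, temp)
VALUES ('sensor-X', '2022-10-24', 21.5);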

Cassandra: Data Modeling for event based time series

I have a data modeling question. In my application I'm reading data from a few different sensors and storing it in Cassandra. The sensors generate new values at very different rates: Some every other second, some every other month.
Furthermore, the assumption is that a value stays valid until the next one is encountered. Example: Sensor 1 sent a value of 500 at 10s after EPOCH and a value of 1000 at 20s after EPOCH. The valid value for 15s after EPOCH would need to be 500.
Since some rates are going to be high and I don't want unbounded partitions, I want to apply bucketing. I'm thinking about modeling my data like this:
CREATE TABLE sensor_data (
sensor_id text,
some_timing_bucket date,
measured_at time,
value double,
PRIMARY KEY ((sensor_id, some_timing_bucket), measured_at)
) WITH CLUSTERING ORDER BY (measured_at DESC);
The usual queries the application would need to serve are "give me the data of the last 5/15 minutes/1 day", so I would choose the some_timing_bucket accordingly. Maybe even have multiple tables with different bucket sizes.
What I cannot wrap my head around is this: Consider I choose one day as bucketing interval. Now I want to retrieve the current value of a sensor that hasn't updated in ten days. There will be no partition for today, so on my application layer I would need to send nine queries that yield nothing until I have gone far enough back in time to encounter the value that is currently valid. That doesn't sound very efficient and I'd appreciate any input on how to model this.
Side note: This would not be an issue if all data for the same sensor was in the same partition: Just ask for all the points with a timestamp less than the beginning of the ranged query and limit the results to one. But that's not feasible because of the unbounded partition.
There is a much simpler way to model your data by using one-day buckets. Something like:
CREATE TABLE sensor_data_by_day (
sensor_id text,
year int,
month int,
day int,
measured_at timestamp,
value double,
PRIMARY KEY ((sensor_id, year, month, day), measured_at)
) WITH CLUSTERING ORDER BY (measured_at DESC)
If a sensor measures a data point every second, then there are 86,400 maximum possible values for a single day (60 secs x 60 mins x 24 hrs). 86K rows per partition is still manageable.
If today is 17 August 2022 and you wanted to retrieve the data for the previous day, the query would be:
SELECT value FROM sensor_data_by_day
WHERE sensor_id = ?
AND year = 2022
AND month = 8
AND day = 16
Assuming it is currently 08:30:00 GMT on the 17th of August (1660725000000 ms since epoch), to retrieve the data for the last 15 minutes (900 secs ago or 1660724100000 ms):
SELECT value FROM sensor_data_by_day
WHERE sensor_id = ?
AND year = 2022
AND month = 8
AND day = 17
AND measured_at > 1660724100000
I think you'll find that it is easier to work with timestamps because it provides a bit more flexibility when it comes to doing range queries. Cheers!
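As a rough sketch (the sensor id and value are made up), a write would derive the year/month/day bucket from the reading's timestamp on the client side:
-- Example write: a reading taken at 2022-08-17 08:29:45 GMT lands in the (sensor, 2022, 8, 17) partition
INSERT INTO sensor_data_by_day (sensor_id, year, month, day, measured_at, value)
VALUES ('sensor-1', 2022, 8, 17, '2022-08-17 08:29:45+0000', 23.4);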
You can do this with a simpler table like this:
CREATE TABLE sensor_data (
sensor_id text,
day_number_from_1970 int,
measured_at timestamp,
value double,
PRIMARY KEY ((sensor_id, day_number_from_1970), measured_at)
) WITH CLUSTERING ORDER BY (measured_at DESC)
And you can query the data like this:
SELECT value
FROM sensor_data
WHERE sensor_id = some_sensor_id
AND day_number_from_1970 = day_number
AND measured_at > start_time
AND measured_at < end_time
With a single int column, you should have less data on disk and still get your results easily.
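A hedged sketch of a write, assuming the client computes day_number_from_1970 as milliseconds-since-epoch integer-divided by 86,400,000 (so 2022-08-17 is day 19221):
-- Example write: 1660725000000 ms / 86400000 ms-per-day = 19221 (integer division)
INSERT INTO sensor_data (sensor_id, day_number_from_1970, measured_at, value)
VALUES ('sensor-1', 19221, 1660725000000, 23.4);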

How would I model a CQL table to be able to retrieve timeseries data by range of days?

I have different locations where mobile measurements of temperature are made at random times (not periodically). The unique identifier of a measurement is its location together with the time of measurement (two measurements at the same location and at the same time are not possible). I need to find all the measurements at a particular location that were recorded in a particular part of the year (something like "spring measurements", "summer measurements"...). For example, I might want to get all the measurements recorded from 2nd to 25th January, regardless of the year of the measurement (NOTE: I do not want the measurements from 2nd to 25th January of a particular year - but for all the years during which the measurements were taken!).
I have come up with a data model/table like this:
CREATE TABLE meas (
latitude double,
longitude double,
measurementTime timestamp,
dayOfYear int,
value double,
PRIMARY KEY ((latitude, longitude), measurementTime, dayOfYear)
) WITH CLUSTERING ORDER BY (measurementTime DESC, dayOfYear DESC)
I would like to query:
Give me all the rows where latitude=XXX and longitude=YYY and dayOfYear >= ZZZ and dayOfYear <= WWW
So an example query would be:
SELECT * FROM meas WHERE latitude=46.1 AND longitude=15.1 AND dayOfYear >= 2 AND dayOfYear <= 25;
The problem I am facing is that this query involves filtering (since I am not providing the value of the measurementTime) - therefore I need to specify ALLOW FILTERING in the query. However I would like to avoid filtering (to make query more efficient). Any ideas how the data model would look like?
If you need to query by day of year then you should model your table accordingly:
CREATE TABLE temp_by_location_dayofyear (
...
PRIMARY KEY ((latitude, longitude), dayOfYear)
But the "day-of-the-year" is not a unique clustering key because each year has such day. You're better off with a year AND a day as clustering keys:
CREATE TABLE temp_by_location_dayofyear (
latitude double,
longitude double,
year int,
dayofyear int,
measurementtime timestamp,
value double,
PRIMARY KEY ((latitude, longitude), year, dayofyear)
)
You would then be able to query the table with:
SELECT * FROM temp_by_location_dayofyear
WHERE latitude = ?
AND longitude = ?
AND year = ?
AND dayofyear >= ?
AND dayofyear <= ?
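Note that year is a clustering key here, so each query has to fix a single year; a hedged way to cover "all years" is for the application to issue one query per year that has data, e.g.:
-- Example: same day-of-year window, queried once per year
SELECT * FROM temp_by_location_dayofyear
WHERE latitude = 46.1 AND longitude = 15.1
AND year = 2021 AND dayofyear >= 2 AND dayofyear <= 25;
SELECT * FROM temp_by_location_dayofyear
WHERE latitude = 46.1 AND longitude = 15.1
AND year = 2022 AND dayofyear >= 2 AND dayofyear <= 25;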
But strictly speaking using the day of the year is unnecessary. You were right in the first place to use the measurement time as the clustering key -- you just need to restrict the query based on the timestamp.
Here's the table schema:
CREATE TABLE temp_by_location (
latitude double,
longitude double,
measurementtime timestamp,
value double,
PRIMARY KEY ((latitude, longitude), measurementtime)
)
and you would query the table with:
SELECT * FROM temp_by_location
WHERE latitude = ?
AND longitude = ?
AND measurementtime >= '2022-01-02 +0000'
AND measurementtime <= '2022-01-25 +0000'
Remember that the CQL timestamp type is encoded as the number of milliseconds since Unix epoch so you can extract the date from it. Cheers!
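If you also want the calendar date in the result, the native toDate() function can convert the timestamp in the selection clause, roughly like this (a sketch using the coordinates from the question):
-- Example: return the reading date alongside the value
SELECT toDate(measurementtime) AS reading_date, value
FROM temp_by_location
WHERE latitude = 46.1
AND longitude = 15.1
AND measurementtime >= '2022-01-02 +0000'
AND measurementtime <= '2022-01-25 +0000';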

Cassandra export/forward data only once

I have the requirement to forward data at certain intervals from my system to an external system. To do this, I already stored all rows in a table. Already forwarded data should not be exported again.
The idea is to memorize the last export time on the client side and export the subsequent records the next time. Old rows are deleted after a successful export.
CREATE TABLE export(
id int,
import_date_time timestamp,
data text,
PRIMARY KEY (id, import_date_time)
) WITH CLUSTERING ORDER BY (import_date_time DESC)
insert into export(id, import_date_time, data) values (1, toUnixTimestamp(now()), 'content')
select * from export where id = 1 and import_date_time > '2017-03-30 16:22:37'
delete from export where id = 1 and import_date_time <= '2017-03-30 16:22:37'
Has anyone already implemented something similar, or do you have a different solution?
If possible, I would prefer not to need an id for the request, because I want to export all data.
If you use a fixed partition key value (id = 1), then all the inserts, selects and deletes will happen on the same node (if RF=1) over and over. Also, for every delete Cassandra creates a tombstone entry, and when you execute a select query Cassandra has to merge each of those entries, so your select query performance will degrade.
So instead of using a fixed value, use a dynamic partition key like the one below:
CREATE TABLE export(
hour int,
day int,
month int,
year int,
import_date_time timestamp,
data text,
PRIMARY KEY ((hour, day, month, year), import_date_time)
) WITH CLUSTERING ORDER BY (import_date_time DESC);
Here you insert the values of hour, day, month and year extracted from import_date_time.
When selecting data, you need to take care of two cases:
Previous export time and current export time are both within the same hour.
The two times are not within the same hour.
For case one you only need one query; for case two you have to execute two queries.
Example query for case one:
SELECT * FROM export WHERE hour = 16 AND day = 30 AND month = 3 AND year = 2017 AND import_date_time > '2017-03-30 16:22:37';
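For case two (a hedged sketch, assuming the previous export was at 16:22:37 and the current export runs at 17:10:00 on the same day), you would query the two hour partitions separately:
-- Example: remainder of the previous export's hour
SELECT * FROM export WHERE hour = 16 AND day = 30 AND month = 3 AND year = 2017 AND import_date_time > '2017-03-30 16:22:37';
-- Example: the current hour, up to the current export time
SELECT * FROM export WHERE hour = 17 AND day = 30 AND month = 3 AND year = 2017 AND import_date_time <= '2017-03-30 17:10:00';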

Query min partition key based on date range (clustering key)

I have a table Foo in Cassandra with 4 columns: foo_id bigint, date timestamp, ref_id bigint, type int.
Here the partition key is foo_id; the clustering keys are date DESC, ref_id and type.
I want to write a CQL query which is the equivalent of the SQL below
select min(foo_id) from foo where date >= '2016-04-01 00:00:00+0000'
I wrote the following CQL
select foo_id from foo where
foo_id IN (-9223372036854775808, 9223372036854775807)
and date >= '2016-04-01 00:00:00+0000';
but this returns empty results.
Then I tried
select foo_id from foo where
token(foo_id) > -9223372036854775808
and token(foo_id) < 9223372036854775807
and date >= '2016-04-01 00:00:00+0000';
but this results in error
Unable to execute CSQL Script on 'Cassandra'. Cannot execute this query
as it might involve data filtering and thus may have unpredictable
performance. If you want to execute this query despite performance
unpredictability, use ALLOW FILTERING.
I don't want to use ALLOW FILTERING, but I want the minimum foo_id at the start of the specified date.
You should probably denormalize your data and create a new table for the purpose. I propose something like:
CREATE TABLE foo_reverse (
year int,
month int,
day int,
foo_id bigint,
date timestamp,
ref_id bigint,
type int,
PRIMARY KEY ((year, month, day), foo_id)
)
To get the minimum foo_id you would query that table by something like:
SELECT * FROM foo_reverse WHERE year = 2016 AND month = 4 AND day = 1 LIMIT 1;
That table would allow you to query on a "per day" basis. You can change the partition key to better reflect your needs. Beware of the potential hot spots you (and I) could create by choosing an inappropriate time range for the partitions.
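Because foo_reverse is a denormalized copy, the application has to write to both tables; a minimal sketch (the values are made up, and the original foo table is assumed to keep the schema described in the question):
-- Example: keep foo and foo_reverse in sync on every write
INSERT INTO foo (foo_id, date, ref_id, type)
VALUES (42, '2016-04-01 10:15:00+0000', 7, 1);
INSERT INTO foo_reverse (year, month, day, foo_id, date, ref_id, type)
VALUES (2016, 4, 1, 42, '2016-04-01 10:15:00+0000', 7, 1);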
