Query columns based on datetime in Cassandra

We are trying to create and query a column family (CF) whose entries consist of two datetimes and an integer, e.g.:
03-22-2012 10.00, 03-22-2012 10.30, 100
03-22-2012 10.30, 03-22-2012 11.00, 50
03-22-2012 11.00, 03-22-2012 11.30, 200
How do I model the above structure in Cassandra and perform the following queries via Hector?
select * from <CF> where datetime1 > 03-22-2012 10.00 and datetime2 < 03-22-2012 10.30
select * from <CF> where datetime1 > 03-22-2012 10.00 and datetime2 < 03-22-2012 11.00
select * from <CF> where datetime = 03-22-2012 (i.e. for the entire day)

This is a great introduction to working with dates and times in Cassandra: Basic Time Series with Cassandra.
In short, use timestamps (or v1 UUIDs) as your column names and set the comparator to LongType (or TimeUUIDType) in order to get chronological sorting of the columns. It's then easy to get a slice of data between two points in time.
Your question isn't totally clear about this, but if you want to get all events that happened during a given range of time of day regardless of the date, then you will want to structure your data differently. In this case, column names may be CompositeType(LongType, AsciiType), where the first component is a normal timestamp mod 86400 (the number of seconds in a day), and the second component is the date or something else that changes over time, like a full timestamp. You would also want to break up the row in this case, perhaps dedicating a different row to each hour.
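For the simple chronological case, here is a rough CQL 3 sketch of the same idea (the table and column names are invented for illustration): a timeuuid clustering column gives chronological sorting, so the slice between two points in time is a single range query.

CREATE TABLE events (
    source text,
    event_time timeuuid,
    value int,
    PRIMARY KEY ((source), event_time)
);

-- slice of columns between two points in time
SELECT * FROM events
WHERE source = 'server1'
  AND event_time > maxTimeuuid('2012-03-22 10:00:00+0000')
  AND event_time < minTimeuuid('2012-03-22 11:00:00+0000');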

Unfortunately there is no way to do this easily with just one column family in Cassandra. The problem is that you want Cassandra to sort based on two different things: datetime1 and datetime2.
The obvious structure for this would be to make your columns composites of the form Composite(TimeUUID, TimeUUID, Integer). In this case they will be sorted by datetime1, then datetime2, then the integer.
But you will always get the ordering based on datetime1 and not on datetime2 (though if two entries have the same datetime1, just those entries will be ordered by datetime2).
A possible workaround would be to have two column families with duplicate data (or indeed two rows for each logical row): one row where columns are written as (datetime1:datetime2:integer) and the other where they are written as (datetime2:datetime1:integer). You can then do a multiget slice operation on these two rows and combine the data before handing it off to the caller:
final MultigetSliceQuery<String, Composite, String> query = HFactory.createMultigetSliceQuery(
        keyspace, StringSerializer.get(), CompositeSerializer.get(), StringSerializer.get());
query.setColumnFamily("myColumnFamily");
// one row is arranged by datetime1, the other by datetime2
query.setKeys("myRow.arrangedByDateTime1", "myRow.arrangedByDateTime2");
query.setRange(new Composite(startTime), new Composite(endTime), false, Integer.MAX_VALUE);
final QueryResult<Rows<String, Composite, String>> queryResult = query.execute();
final Rows<String, Composite, String> rows = queryResult.get();

Related

Cassandra: Data Modeling for event based time series

I have a data modeling question. In my application I'm reading data from a few different sensors and storing it in Cassandra. The sensors generate new values at very different rates: Some every other second, some every other month.
Furthermore, the assumption is that a value stays valid until the next one is encountered. Example: Sensor 1 sent a value of 500 at 10s after EPOCH and a value of 1000 at 20s after EPOCH. The valid value for 15s after EPOCH would need to be 500.
Since some rates are going to be high and I don't want unbounded partitions, I want to apply bucketing. I'm thinking about modeling my data like this:
CREATE TABLE sensor_data (
    sensor_id text,
    some_timing_bucket date,
    measured_at time,
    value double,
    PRIMARY KEY ((sensor_id, some_timing_bucket), measured_at)
) WITH CLUSTERING ORDER BY (measured_at DESC);
The usual queries the application would need to serve are "give me the data of the last 5/15 minutes/1 day", so I would choose the some_timing_bucket accordingly. Maybe even have multiple tables with different bucket sizes.
What I cannot wrap my head around is this: Consider I choose one day as bucketing interval. Now I want to retrieve the current value of a sensor that hasn't updated in ten days. There will be no partition for today, so on my application layer I would need to send nine queries that yield nothing until I have gone far enough back in time to encounter the value that is currently valid. That doesn't sound very efficient and I'd appreciate any input on how to model this.
Side note: This would not be an issue if all data for the same sensor was in the same partition: Just ask for all the points with a timestamp less than the beginning of the ranged query and limit the results to one. But that's not feasible because of the unbounded partition.
There is a much simpler way to model your data by using one-day buckets. Something like:
CREATE TABLE sensor_data_by_day (
    sensor_id text,
    year int,
    month int,
    day int,
    measured_at timestamp,
    value double,
    PRIMARY KEY ((sensor_id, year, month, day), measured_at)
) WITH CLUSTERING ORDER BY (measured_at DESC);
If a sensor measures a data point every second, then there are 86,400 maximum possible values for a single day (60 seconds × 60 minutes × 24 hours). 86K rows per partition is still manageable.
If today is 17 August 2022 and you wanted to retrieve the data for the previous day, the query would be:
SELECT value FROM sensor_data_by_day
WHERE sensor_id = ?
AND year = 2022
AND month = 8
AND day = 16
Assuming it is currently 08:30:00 GMT on the 17th of August (1660725000000 ms since epoch), to retrieve the data for the last 15 minutes (900 secs ago or 1660724100000 ms):
SELECT value FROM sensor_data_by_day
WHERE sensor_id = ?
AND year = 2022
AND month = 8
AND day = 17
AND measured_at > 1660724100000
I think you'll find that it is easier to work with timestamps because it provides a bit more flexibility when it comes to doing range queries. Cheers!
You can do this with a simpler table, like this:
CREATE TABLE sensor_data (
    sensor_id text,
    day_number_from_1970 int,
    measured_at timestamp,
    value double,
    PRIMARY KEY ((sensor_id, day_number_from_1970), measured_at)
) WITH CLUSTERING ORDER BY (measured_at DESC);
and you can query data like this:
SELECT value
FROM sensor_data
WHERE sensor_id = some_sensor_id
  AND day_number_from_1970 = day_number
  AND measured_at > start_time
  AND measured_at < end_time;
With a single int bucket column you store less data on disk and still get the results just as easily.
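As a concrete illustration (the sensor id and times are made up), the bucket is simply the Unix timestamp in days, computed client-side as epoch milliseconds divided by 86,400,000; for 17 August 2022 that is 1660694400000 / 86400000 = 19221:

-- day_number_from_1970 = epoch_millis / 86400000, computed in the application
SELECT value
FROM sensor_data
WHERE sensor_id = 'sensor-1'
  AND day_number_from_1970 = 19221
  AND measured_at > '2022-08-17 08:15:00+0000'
  AND measured_at < '2022-08-17 08:30:00+0000';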

How to determine time stamps for Cassandra queries

One of the values inserted into the table is the current time, which I compute using toTimestamp(now()). Now I want to compute the current time minus 90 days, and the current time minus 15 days.
My question is: how do I compute current time minus n days?
Query using the current timestamp:
INSERT INTO TABLE_NAME (col_1, col_2, col_3) VALUES ('val_1', toTimestamp(now()), val_3);
In the above query, col_2 holds the current timestamp, determined by
toTimestamp(now())
How do I compute current time - 90 days, or current time - 2 weeks?
This functionality is not built into CQL.
If you are able to use UDFs, you can do the following (building on the example given in How to get Last 6 Month data comparing with timestamp column using cassandra query?):
Enable UDFs if necessary by setting this line to true in cassandra.yaml:
enable_user_defined_functions: true
Then add two user defined functions like this:
CREATE FUNCTION dateadd(date timestamp, daydiff int)
CALLED ON NULL INPUT
RETURNS timestamp
LANGUAGE java
AS $$
    java.util.Calendar c = java.util.Calendar.getInstance();
    c.setTime(date);
    c.add(java.util.Calendar.DATE, daydiff);
    return c.getTime();
$$;

CREATE FUNCTION weekadd(date timestamp, weekdiff int)
CALLED ON NULL INPUT
RETURNS timestamp
LANGUAGE java
AS $$
    java.util.Calendar c = java.util.Calendar.getInstance();
    c.setTime(date);
    c.add(java.util.Calendar.DATE, weekdiff * 7);
    return c.getTime();
$$;
Select the data from your table like this:
SELECT dateadd(col_2, -90) FROM TABLE_NAME;
SELECT weekadd(col_2, -2) FROM TABLE_NAME;

filter for key-value pair in cassandra wide rows

I am trying to model time series data with many sensors (> 50k) with cassandra. As I would like to do filtering on multiple sensors at the same time, I thought using the following (wide row) schema might be suitable:
CREATE TABLE data(
time timestamp,
session_id int,
sensor text,
value float,
PRIMARY KEY((time, session_id), sensor)
);
If every sensor value was a column in an RDBMS, my query would ideally look like:
SELECT * FROM data WHERE sensor_1 > 10 AND sensor_2 < 2;
Translated to my cassandra schema, I assumed the query might look like:
SELECT * FROM data
WHERE
sensor = 'sensor_1' AND
value > 10 AND
sensor = 'sensor_2' AND
value < 2;
I now have two problems:
1. Cassandra tells me that I can filter on the sensor column only once: "sensor cannot be restricted by more than one relation if it includes an Equal".
2. Obviously, the filter on value doesn't make sense at the moment. I wouldn't know how to express the relationship between sensor and value in the query in order to filter on multiple columns in the same (wide) row.
I do know that a solution to the first question would be to use CQL's IN clause. This however doesn't solve the second problem.
Is this scenario even suitable for cassandra?
Many thanks in advance.
You could try to use the IN clause here, so your query would look like this:
SELECT * FROM data
WHERE time = <time> AND session_id = <session id>
  AND sensor IN ('sensor_1', 'sensor_2');
Note that a single restriction on value cannot express a different condition per sensor (value > 10 for sensor_1 but value < 2 for sensor_2), so you would fetch the rows for both sensors and apply the per-sensor value filters on the client side.

fetching timeseries/range data in cassandra

I am new to Cassandra and trying to see if it fits my data query needs. I am populating test data in a table and fetching them using cql client in Golang.
I am storing time series data in Cassandra, sorted by timestamp. I store data on a per-minute basis.
Schema is like this:
parent: string
child: string
bytes: int
val2: int
timestamp: date/time
I need to answer queries where a timestamp range is provided and a child name is given. The result needs to be the bytes value in that time range (a single value, not a series). I made a primary key (child, timestamp). I followed this approach rather than the column-family/comparator-type approach with timeuuid, since that is not supported in CQL.
Since the data stored in every timestamp(every minute) is the accumulated value, when I get a range query for time t1 to t2, I need to find the bytes value at t2, bytes value at t1 and subtract the 2 values before returning. This works fine if t1 and t2 actually had entries in the table. If they do not, I need to find those times between (t1, t2) that have data and return the difference.
One approach I can think of is to "select * from tablename WHERE timestamp <= t2 AND timestamp >= t1;" and then find the difference between the first and last entry in the array of rows returned. Is this the best way to do it? Since MIN and MAX queries are not supported, is there a way to find the maximum timestamp in the table less than a given value? Thanks for your time.
Are you storing each entry as a new row with a different partition key (the first column in the primary key)? If so, select * from x where f < a and f > b is a cluster-wide query, which will cause you problems. Consider adding a "fake" partition key, or use a partition key per date/week/month etc., so that your queries hit a single partition.
Also, your queries in cassandra are >= and <= even if you specify > and <. If you need strictly greater than or less than, you'll need to filter client side.
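A rough sketch of the bucketed approach (table and column names are invented for illustration), with one partition per child per day and rows clustered by timestamp in descending order, so "the latest accumulated value at or before time t" is a single-partition query:

CREATE TABLE data_by_day (
    child text,
    day text,            -- e.g. '2014-06-01', one partition per child per day
    ts timestamp,
    bytes int,
    PRIMARY KEY ((child, day), ts)
) WITH CLUSTERING ORDER BY (ts DESC);

-- maximum timestamp at or before t: first row in descending order
SELECT ts, bytes FROM data_by_day
WHERE child = 'child-1' AND day = '2014-06-01'
  AND ts <= '2014-06-01 12:00:00+0000'
LIMIT 1;

Running this once for t2 and once for t1 yields the two accumulated bytes values to subtract, even when neither timestamp has an exact entry.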

Reading the most recent updated row in cassandra

I have a use case and want suggestions on the below.
Structure :
Rowkey_1:
    Column1 = value1;
    Column2 = value2;
Rowkey_2:
    Column1 = value1;
    Column2 = value2;
" Suppose i am writing 1000 rows into cassandra with each row having couple of columns. After sometime i update only 100 rows and make changes for column values ".
-> when i read data from cassandra i only want to get these 100 updated rows and not the entire row key information.
Is there a way to say to cassandra like give me all row keys from start - > end where time in between "Time_start" to "Time_end"
in SQL Lingo -- > select from "" to "" where time between "time_start" and "time_end".
P.S. i read Basic Time Series with Cassandra where it says you can annotate rowkey like the below
Inserting data — {:key => ‘server1-load-20110306′, :column_name => TimeUUID(now), :column_value => 0.75}
Here the column family has TimeUUID columns.
My question is can you annotate you rowkey with date and time like this : { :key ==> 2012-11-18 16:00:15 }
OR any other way to get only the recent updated rows.
Any suggestion/ guidance really appreciated.
You can't do range queries on keys unless you use ByteOrderedPartitioner, which you shouldn't use. The way to do this is by writing known sentinel values as keys, such as a timestamp representing the beginning of the day. Then you can do the column slice by time.
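A minimal sketch of that sentinel-key pattern in CQL terms (the names are invented): the application writes an extra index entry on every update, with the day as the sentinel key and the update time as the clustering column, so a slice by time returns only the recently updated row keys:

CREATE TABLE updates_by_day (
    day text,               -- sentinel key, e.g. '2012-11-18'
    updated_at timeuuid,
    row_key text,
    PRIMARY KEY ((day), updated_at)
);

-- row keys updated since 16:00 on that day
SELECT row_key FROM updates_by_day
WHERE day = '2012-11-18'
  AND updated_at > maxTimeuuid('2012-11-18 16:00:00+0000');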
