How to retrieve a date range from Cassandra

I have a very simple table to store a collection of IDs by date range:
CREATE TABLE schedule_range (
start_date timestamp,
end_date timestamp,
schedules set<text>,
PRIMARY KEY ((start_date, end_date)));
I was hoping to be able to query it by a date range
SELECT *
FROM schedule_range
WHERE start_date >= 'xxx'
AND end_date < 'yyy'
Unfortunately it doesn't work this way. I've tried a few different approaches and each one fails for a different reason.
How should I store the IDs to be able to get them all by a date range?

In Cassandra, range operators (>, <) can be used only on a clustering column, and only when every preceding clustering column is restricted with an equality. Partition key columns must be restricted with = (or IN). In your schema both start_date and end_date make up the partition key, so neither accepts a range. If you are not tied to that schema, there are other options.
One approach is to use Apache Spark. Some projects build an abstraction layer in Spark over Cassandra and let you run operations on Cassandra data such as joins, arbitrary filters, group-bys ...
Check these projects:
Stratio Deep
Datastax Connector

With the table below, a query that somewhat resembles yours works because 1) it puts no range condition on the partition key start_date; only EQ and IN relations are supported on the partition key. 2) Greater-than and less-than comparisons on a clustering column are restricted to filters that select a contiguous ordering of rows. Restricting id, the first clustering column of the compound key, with an equality achieves the latter. Note that the query leaves the partition key unrestricted, which is why it needs ALLOW FILTERING.
create table schedule_range2(start_date timestamp, end_date timestamp, id int, schedules set<text>, primary key (start_date, id, end_date));
insert into schedule_range2 (start_date, id, end_date, schedules) VALUES ('2014-02-03 04:05', 1, '2014-02-04 04:00', {'event1', 'event2'});
insert into schedule_range2 (start_date, id, end_date, schedules) VALUES ('2014-02-05 04:05', 1, '2014-02-06 04:00', {'event3', 'event4'});
select * from schedule_range2 where id=1 and end_date >='2014-02-04 04:00' and end_date < '2014-02-06 04:00' ALLOW FILTERING;
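The restriction rules described in these answers can be modeled as a small check. This is only a simplified sketch for illustration, not Cassandra's actual validation logic; the function name `filtering_reason` and the operator labels ("=", "IN", "range") are assumptions made up for this example.

```python
def filtering_reason(partition_key, clustering, restrictions):
    """Return None when the WHERE clause fits the rules, else a short reason.

    `restrictions` maps a column name to "=", "IN" or "range".
    Simplified model for illustration; not Cassandra's actual validation.
    """
    # Every partition key column must be pinned with = or IN.
    for col in partition_key:
        if restrictions.get(col) not in ("=", "IN"):
            return f"partition key column {col} needs = or IN"

    unrestricted = None  # first clustering column left without a restriction
    ranged = None        # clustering column that already carries a range
    for col in clustering:
        op = restrictions.get(col)
        if op is None:
            unrestricted = unrestricted or col
            continue
        if unrestricted is not None:
            return f"{col} is restricted but preceding column {unrestricted} is not"
        if ranged is not None:
            return f"{col} follows the range-restricted column {ranged}"
        if op == "range":
            ranged = col
    return None


# Original table: both columns form the partition key, so a range is rejected.
print(filtering_reason(["start_date", "end_date"], [],
                       {"start_date": "range", "end_date": "range"}))
# schedule_range2 with the partition key pinned: nothing to complain about.
print(filtering_reason(["start_date"], ["id", "end_date"],
                       {"start_date": "=", "id": "=", "end_date": "range"}))
```

A query that the check rejects may still run with ALLOW FILTERING, as the select above does, at the cost of scanning more data than it returns.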

Related

How to sum up cassandra counter grouping by only one column in the primary key set?

I am trying to keep track of the number of events of each type that occurred in one-hour buckets of time, and then sum the counts per category over arbitrary time ranges. So, I created a table like this:
CREATE TABLE IF NOT EXISTS sensor_activity_stats(
sensor_id text,
datetime_hour_bucket timestamp,
activity_type text,
activity_count counter,
PRIMARY KEY ((sensor_id), datetime_hour_bucket, activity_type)
)
WITH CLUSTERING ORDER BY(datetime_hour_bucket DESC, activity_type ASC);
I would like to be able to achieve this kind of query:
SELECT datetime_hour_bucket, activity_type, SUM(activity_count) as count
FROM sensor_activity_stats
WHERE sensor_id=:sensorId
AND datetime_hour_bucket >= :fromDate AND datetime_hour_bucket < :untilDate
GROUP BY activity_type
Cassandra complains because grouping must follow the order of the primary key columns. And if I change the order, I won't be able to run a range query across all activity_type values.
Some notes:
I am grouping by hours because some users could ask me to show the data in different timezones and I want to be able to perform a decent conversion.
The activity_type has low cardinality; however, I cannot be sure I'll always be able to predict its possible values.
Right now my solution is to query all the data in the range and perform the aggregation myself in code. Have you faced a similar situation, and what was your solution? Would you suggest a different way of querying or arranging the data?
I hope you've already found a solution to your problem; in any case, here is an approach you can try.
First, you can change the CREATE TABLE statement to reorder the key fields:
CREATE TABLE IF NOT EXISTS sensor_activity_stats(
sensor_id text,
datetime_hour_bucket timestamp,
activity_type text,
activity_count counter,
-- a counter column cannot be part of the primary key
PRIMARY KEY ((sensor_id), activity_type, datetime_hour_bucket)
)
WITH CLUSTERING ORDER BY(activity_type ASC, datetime_hour_bucket DESC);
Then, in the query, you can add the field "datetime_hour_bucket" to the GROUP BY clause. Because the range on datetime_hour_bucket skips activity_type, the query needs ALLOW FILTERING:
SELECT datetime_hour_bucket, activity_type, SUM(activity_count) as count
FROM sensor_activity_stats
WHERE sensor_id=:sensorId
AND datetime_hour_bucket >= :fromDate AND datetime_hour_bucket < :untilDate
GROUP BY activity_type, datetime_hour_bucket
ALLOW FILTERING;
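If server-side grouping is not an option, the client-side aggregation mentioned in the question can be sketched roughly as follows. The row shape (a tuple of bucket, activity type and count) and the function name `sum_counts_by_type` are assumptions made for this example; rows would come from whatever driver call fetches the time range.

```python
from collections import defaultdict
from datetime import datetime


def sum_counts_by_type(rows, from_date, until_date):
    """Sum activity counts per activity type over [from_date, until_date).

    `rows` is an iterable of (datetime_hour_bucket, activity_type,
    activity_count) tuples; this shape is an assumption for the sketch.
    """
    totals = defaultdict(int)
    for bucket, activity_type, count in rows:
        # Same half-open range as the CQL query: >= fromDate and < untilDate.
        if from_date <= bucket < until_date:
            totals[activity_type] += count
    return dict(totals)


sample_rows = [
    (datetime(2020, 1, 1, 10), "open", 3),
    (datetime(2020, 1, 1, 11), "open", 2),
    (datetime(2020, 1, 1, 11), "close", 5),
    (datetime(2020, 1, 1, 13), "open", 7),  # outside the requested range
]
print(sum_counts_by_type(sample_rows,
                         datetime(2020, 1, 1, 10),
                         datetime(2020, 1, 1, 12)))
```

Keeping the hourly buckets intact until this step also leaves room for regrouping them per timezone before summing.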

Delete records in Cassandra table based on time range

I have a Cassandra table with schema:
CREATE TABLE IF NOT EXISTS TestTable(
documentId text,
sequenceNo bigint,
messageData blob,
clientId text,
PRIMARY KEY(documentId, sequenceNo))
WITH CLUSTERING ORDER BY(sequenceNo DESC);
Is there a way to delete the records that were inserted within a given time range? I know that internally Cassandra must be using some timestamp to track the insertion time of each record, which would be used by features like TTL.
Since there is no explicit column for insertion timestamp in the given schema, is there a way to use the implicit timestamp or is there any better approach?
There is never any update to the records after insertion.
It's an interesting question...
All columns that aren't part of the primary key have a so-called WriteTime, which can be retrieved with the writetime(column_name) function of CQL (warning: it doesn't work with collection columns, and returns null for UDTs!). But because CQL has no nested queries, you will need to write a program that fetches the data, filters the entries by WriteTime, and deletes the entries whose WriteTime is older than your threshold. (Note that the value of writetime is in microseconds, not milliseconds as in CQL's timestamp type.)
The easiest way is to use Spark Cassandra Connector's RDD API, something like this:
import com.datastax.spark.connector._
// writetime is expressed in microseconds since the epoch
val timestamp = someDate.toInstant.getEpochSecond * 1000000L
val oldData = sc.cassandraTable(srcKeyspace, srcTable)
.select("prk1", "prk2", "reg_col".writeTime as "writetime")
.filter(row => row.getLong("writetime") < timestamp)
// delete the matching rows by primary key
oldData.deleteFromCassandra(srcKeyspace, srcTable,
keyColumns = SomeColumns("prk1", "prk2"))
where prk1, prk2, ... are all the components of the primary key (documentId and sequenceNo in your case), and reg_col is any of the "regular" columns of the table that isn't a collection or UDT (for example, clientId). It's important that the list of primary key columns is the same in select and in deleteFromCassandra.
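Since the question asks about a time range rather than a single threshold, the unit conversion and the range filter can be sketched in plain Python as below. The helper names and the row shape (documentId, sequenceNo, writetime in microseconds) are assumptions for illustration; the rows themselves would come from a query that selects writetime(clientId) alongside the key columns.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_writetime_micros(moment):
    """Convert a timezone-aware datetime to microseconds since the epoch."""
    if moment.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    delta = moment - EPOCH
    # Integer arithmetic avoids floating-point rounding on large values.
    return (delta.days * 86400 + delta.seconds) * 1_000_000 + delta.microseconds


def keys_written_between(rows, start, end):
    """Return the primary keys of rows written in [start, end).

    `rows` is an iterable of (document_id, sequence_no, writetime_micros)
    tuples; this shape is an assumption for the sketch.
    """
    low = to_writetime_micros(start)
    high = to_writetime_micros(end)
    return [(doc, seq) for doc, seq, written in rows if low <= written < high]
```

The returned keys can then be fed to one DELETE per key, or to the deleteFromCassandra call shown above.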

Get last row in table of time series?

I am already able to get the last row of time-series table as:
SELECT * from myapp.locations WHERE organization_id=1 and user_id=15 and date='2017-2-22' ORDER BY unix_time DESC LIMIT 1;
That works fine. However, I am wondering about the performance overhead of executing ORDER BY when the rows are already sorted; I only use it to get the last row. Is it an overhead in my case?
If I don't use ORDER BY, I always get the first row in the table, so I thought I might be able to INSERT in another way, e.g. always insert at the beginning of the table instead of the end.
Any advice? Shall I use ORDER BY without worrying about performance?
Just define your clustering key order as DESC, as in the schema below:
CREATE TABLE locations (
organization_id int,
user_id int,
date text,
unix_time bigint,
lat double,
long double,
PRIMARY KEY ((organization_id, user_id, date), unix_time)
) WITH CLUSTERING ORDER BY (unix_time DESC);
So by default your data will be sorted by unix_time DESC, and you don't need to specify the order in the query.
Now you can just use the query below to get the last row:
SELECT * from myapp.locations WHERE organization_id = 1 and user_id = 15 and date = '2017-2-22' LIMIT 1;
If your query pattern for that table is always ORDER BY unix_time DESC, then you are in a reverse-order time-series scenario, and I would say your model is inaccurate (not wrong).
There's no reason not to store the records in reverse order by adding WITH CLUSTERING ORDER BY (unix_time DESC) to the table definition. In my opinion, ORDER BY unix_time DESC will perform at most on par with a table explicitly designed for this use case (and I think it will perform worse).

CQL query on 'validFrom/validTo timestamps'

I'm currently trying to model a column family that has two timestamps specifying whether an entry is valid (or 'active') at a given date (typically execution time).
No big issue with traditional SQL, 64 gigs of RAM and some indices, we're doing that quite often with our SQL server.
However, in CQL I haven't managed to model this scenario and write valid queries for it.
My basic model is (I skipped the PK definition!)
create table myTable(
id uuid,
validFrom timeuuid,
validTo timeuuid,
someInformationalData varChar
);
Some explanations:
Because a validity date is not unique, I need a composite key; in my final application this is going to be a usergroup reference (which would be an ideal partition key).
validFrom/validTo are designed to be optional, but I could deal with that by using boundary values (1970, 2038) for 'null' values passed through the persistence layer.
I tried various combinations of partitioning/clustering keys, but none of them resulted in valid CQL:
-- only active results
select *
from
myTable
where
validFrom < now()
and
validTo > now()
I'm quite new to the NoSQL/CQL world and am struggling a bit with converting some of our applications. I could do it in memory, but I'm afraid this could become a bottleneck at some point...
Not sure if this kind of 'I have no idea what I'm doing' yell is appropriate, but any kind of help would be appreciated. :)
Edit: here's one of the approaches I've been messing around with:
drop table if exists myTable;
create table myTable(
id int,
datefrom timeuuid,
dateto timeuuid,
someColumns varChar,
primary key((id,datefrom),dateto)
);
create index if not exists my_idx on myTable(datefrom);
insert into myTable(id, datefrom,dateto,somecolumns)
values(0,minTimeuuid('1970-01-01 00:00:00'),minTimeuuid('2020-01-01 00:00:00'),'test');
insert into myTable(id,datefrom,dateto,somecolumns)
values(1,minTimeuuid('1970-01-01 00:00:00'),minTimeuuid('2012-01-01 00:00:00'),'test2');
select * from myTable where dateto > now() allow filtering;
-- invalid ("A column of a partition key can be restricted only if the preceding one is restricted by an Equal relation.")
select * from myTable where datefrom < now() and dateto > now() allow filtering;
The first query limits my result (the row with validTo = 2012-01-01 is filtered out), but I wasn't able to work out a schema that supports both restrictions in the WHERE clause.
If I understand your problem, what you are looking for is a way to run a range query based on the timestamp. Basically to be able to do this, your model will have to have the timestamp component as part of the clustering key:
create table myTable(
eventType uuid,
ts timestamp,
val text,
PRIMARY KEY (eventType, ts)
);
The above will allow you to run a query like: SELECT eventType, val from myTable where eventType = 'your_event' and ts >= 'start_ts' and ts < 'end_ts'.
What you need to remember is that the clustering keys dictate the order on disk, which makes it possible to run queries like the one above efficiently. You can read more about this in the SELECT section of the CQL spec.
Cassandra has no Now() that behaves like the one in SQL databases: CQL's now() returns a timeuuid, not a timestamp. Pass today's date explicitly instead of using Now().
In the WHERE clause you can use only columns that are part of the primary key or that have a secondary index.
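The in-memory fallback mentioned in the question can be sketched as follows: fetch the rows of one partition (for example one usergroup) and keep only those valid at a given instant. The dictionary keys `valid_from` and `valid_to` and the function name `active_rows` are assumptions for this example, and a missing bound is treated here as open-ended instead of using the 1970/2038 boundary values.

```python
from datetime import datetime, timezone


def active_rows(rows, at):
    """Keep the rows whose validity interval contains the instant `at`.

    Each row is a dict with 'valid_from' and 'valid_to' holding a datetime
    or None; None means "no bound" (an assumption for this sketch).
    """
    result = []
    for row in rows:
        starts_ok = row["valid_from"] is None or row["valid_from"] < at
        ends_ok = row["valid_to"] is None or row["valid_to"] > at
        if starts_ok and ends_ok:
            result.append(row)
    return result


utc = timezone.utc
sample = [
    {"id": 0, "valid_from": datetime(1970, 1, 1, tzinfo=utc),
     "valid_to": datetime(2020, 1, 1, tzinfo=utc)},
    {"id": 1, "valid_from": datetime(1970, 1, 1, tzinfo=utc),
     "valid_to": datetime(2012, 1, 1, tzinfo=utc)},
    {"id": 2, "valid_from": None, "valid_to": None},
]
print([r["id"] for r in active_rows(sample, datetime(2015, 6, 1, tzinfo=utc))])
```

This stays cheap as long as a single partition holds a modest number of rows, which is the bottleneck concern raised in the question.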

Cassandra CQL - clustering order with multiple clustering columns

I have a column family with primary key definition like this:
...
PRIMARY KEY ((website_id, item_id), user_id, date)
which will be queried using queries such as:
SELECT * FROM myCF
WHERE website_id = 30 AND item_id = 10
AND user_id = 0 AND date > 'some_date' ;
However, I'd like to keep my column family ordered by date only, so that SELECT date FROM myCF ; would return the most recently inserted date.
Because of the order of the clustering columns, what I get is ordered by user_id and then by date.
If I change the primary key definition to:
PRIMARY KEY ((website_id, item_id), date, user_id)
I can no longer run the same query, as date must be restricted if user_id is.
I thought there might be some way to say:
...
PRIMARY KEY ((website_id, shop_id), store_id, date)
) WITH CLUSTERING ORDER BY (store_id RANDOMPLEASE, date DESC) ;
But it doesn't seem to exist. Worse, maybe this is completely stupid and I don't get why.
Is there any way of achieving this? Am I missing something?
Many thanks!
Your query example restricts user_id so that should work with the second table format. But if you are actually trying to run queries like
SELECT * FROM myCF
WHERE website_id = 30 AND item_id = 10
AND date > 'some_date'
then you need an additional table created to handle those queries; it would be ordered only by date and not by user_id:
Create Table LookupByDate ... PRIMARY KEY ((website_id, item_id), date)
In addition to your primary query, if all you are trying to get is "return the most recent inserted date", you may not need an additional table. You can use a "static column" to store the last update time per partition (see CASSANDRA-6561).
It probably won't help in your particular case (since I imagine your list of all users is unmanageably large), but if the condition on the first clustering column matches one of a relatively small set of values, then you can use IN.
SELECT * FROM myCF
WHERE website_id = 30 AND item_id = 10
AND user_id IN ? AND date > 'some_date'
Don't use IN on the partition key because this will create an inefficient query that hits multiple nodes putting stress on the coordinator node. Instead, execute multiple asynchronous queries in parallel. But IN on a clustering column is absolutely fine.
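The advice to run one query per partition in parallel, instead of a single IN on the partition key, can be sketched like this. The `fetch_one` callable is hypothetical and stands in for a driver call that runs the single-partition query; a driver's own asynchronous API would typically replace the thread pool used here.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_partitions(fetch_one, partition_keys, max_workers=8):
    """Run fetch_one(key) for every partition key in parallel.

    Returns a dict mapping each key to the rows returned for it.
    `fetch_one` is a hypothetical callable wrapping one single-partition query.
    """
    keys = list(partition_keys)
    if not keys:
        return {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input keys.
        return dict(zip(keys, pool.map(fetch_one, keys)))


def fake_fetch(key):
    """Stand-in for a real query; returns one made-up row per partition."""
    website_id, item_id = key
    return [{"website_id": website_id, "item_id": item_id}]


print(fetch_partitions(fake_fetch, [(30, 10), (30, 11)]))
```

Each single-partition query can be routed straight to a replica that owns the data, so no single coordinator has to gather rows from many nodes.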
