Retrieve rows from last 24 hours - cassandra

I have a table with the following definition (other fields removed):
CREATE TABLE if NOT EXISTS request_audit (
user_id text,
request_body text,
lookup_timestamp TIMESTAMP,
PRIMARY KEY ((user_id), lookup_timestamp)
) WITH CLUSTERING ORDER BY ( lookup_timestamp DESC);
I create a record with the following:
INSERT INTO request_audit (user_id, lookup_timestamp, request_body) VALUES (?, toTimestamp(now()), ?);
I am trying to retrieve all rows from the last 24 hours, but I am having trouble with the timestamp.
I have tried
SELECT * from request_audit WHERE user_id = '1234' AND lookup_timestamp > toTimestamp(now() - "1 day" )
and various other ways of trying to take a day away from the query.

Cassandra has very limited support for date operations. What you need is a custom function to do the date math.
Inspired by this question:
How to get Last 6 Month data comparing with timestamp column using cassandra query?
You can write a UDF (user-defined function) to do the date operation.
CREATE FUNCTION dateAdd(date timestamp, day int)
CALLED ON NULL INPUT
RETURNS timestamp
LANGUAGE java
AS
$$java.util.Calendar c = java.util.Calendar.getInstance();
c.setTime(date);
c.add(java.util.Calendar.DAY_OF_MONTH, day);
return c.getTime();$$ ;
Remember that you have to enable UDFs in the config file, cassandra.yaml, if that is possible in your environment:
enable_user_defined_functions: true
Once that is done, this query works:
SELECT * from request_audit WHERE user_id = '1234' AND lookup_timestamp > dateAdd(dateof(now()), -1)

You can't do this directly in CQL, as it doesn't support this kind of expression. If you're running the query from cqlsh, you can substitute the desired date with the output of something like this:
date --date='-1 day' '+%F %T%z'
and execute the query with that value.
If you're invoking this from a program, use the language's date/time library to compute the timestamp for one day ago; the details depend on the language you're using.
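As an illustration of the client-side approach, here is a minimal Python sketch using only the standard library. The helper name cql_timestamp_days_ago is made up for this example, and it assumes you want the cutoff expressed in UTC in the same format the date command above produces.

```python
from datetime import datetime, timedelta, timezone

def cql_timestamp_days_ago(days, now=None):
    """Return a 'YYYY-MM-DD HH:MM:SS+0000' literal for `days` days before `now`.

    `now` defaults to the current UTC time; it is a parameter only so the
    function can be exercised with a fixed clock.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days)
    return cutoff.strftime("%Y-%m-%d %H:%M:%S%z")

# Build the query from the question with the computed cutoff.
cutoff = cql_timestamp_days_ago(1)
query = (
    "SELECT * FROM request_audit "
    "WHERE user_id = '1234' AND lookup_timestamp > '" + cutoff + "'"
)
print(query)
```

In a real program you would normally bind the cutoff as a query parameter through your driver instead of concatenating it into the statement; the string form is shown here only because it matches the cqlsh workflow above.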

Related

How to determine time stamps for Cassandra queries

One of the values inserted into the table is the current time, which I compute using toTimestamp(now()). Now I want to compute the current time minus 90 days and the current time minus 15 days.
My question is: how do I compute the current time minus n days?
Query for the current timestamp:
INSERT INTO TABLE_NAME (col_1, col_2, col_3) VALUES ('val_1', toTimestamp(now()), val_3);
In the above query, the second value (val_2) is the current timestamp. The current timestamp is determined by
toTimestamp(now())
How do I compute current time - 90 days and current time - 2 weeks?
This functionality is not built into CQL.
If you are able to use UDFs, you can do the following (building on the example given here:
How to get Last 6 Month data comparing with timestamp column using cassandra query?):
Enable UDFs by adding this line to cassandra.yaml, or changing it to true:
enable_user_defined_functions: true
Then add two user-defined functions like this:
CREATE FUNCTION dateadd(date timestamp, daydiff int)
CALLED ON NULL INPUT
RETURNS timestamp
LANGUAGE java
AS $$java.util.Calendar c = java.util.Calendar.getInstance();c.setTime(date);c.add(java.util.Calendar.DATE, daydiff);return c.getTime();$$;
CREATE FUNCTION weekadd(date timestamp, weekdiff int)
CALLED ON NULL INPUT
RETURNS timestamp
LANGUAGE java
AS $$java.util.Calendar c = java.util.Calendar.getInstance();c.setTime(date);c.add(java.util.Calendar.DATE, weekdiff*7);return c.getTime();$$;
Select the data from your table like this:
select dateadd(col_2,-90) from TABLE_NAME;
select weekadd(col_2,-2) from TABLE_NAME;
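If UDFs cannot be enabled, the same offsets can be computed in the client before the value is bound to a query. A minimal Python sketch, assuming UTC timestamps; the function names shift and to_epoch_millis are invented for this example and are not part of any driver.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def shift(ts, days=0, weeks=0):
    """Return `ts` moved by the given number of days and weeks (negative = past)."""
    return ts + timedelta(days=days, weeks=weeks)

def to_epoch_millis(ts):
    """Convert an aware datetime to milliseconds since the Unix epoch.

    Integer division of timedeltas avoids floating-point rounding.
    """
    return (ts - EPOCH) // timedelta(milliseconds=1)

now = datetime.now(timezone.utc)
ninety_days_ago = shift(now, days=-90)
two_weeks_ago = shift(now, weeks=-2)
print(to_epoch_millis(ninety_days_ago), to_epoch_millis(two_weeks_ago))
```

Milliseconds since the epoch is the resolution a CQL timestamp stores, so the integer form is a convenient value to bind.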

Cassandra export/forward data only once

I have a requirement to forward data at certain intervals from my system to an external system. To do this, I store all rows in a table. Data that has already been forwarded should not be exported again.
The idea is to remember the last export time on the client side and export only the newer records the next time. Old rows are deleted after a successful export.
CREATE TABLE export(
id int,
import_date_time timestamp,
data text,
PRIMARY KEY (id, import_date_time)
) WITH CLUSTERING ORDER BY (import_date_time DESC);
insert into export(id, import_date_time, data) values (1, toUnixTimestamp(now()), 'content');
select * from export where id = 1 and import_date_time > '2017-03-30 16:22:37';
delete from export where id = 1 and import_date_time <= '2017-03-30 16:22:37';
Has anyone already implemented something similar, or do you have a different solution?
If possible, I would prefer not to need an id for the request, because I want to export all data.
If you use a fixed partition key value (id = 1), then all inserts, selects and deletes will hit the same node (if RF=1) over and over. Also, for every delete Cassandra creates a tombstone entry, and when you execute a select query Cassandra has to merge those entries, so your select performance will degrade.
So instead of a fixed value, use a dynamic value like the one below:
CREATE TABLE export(
hour int,
day int,
month int,
year int,
import_date_time timestamp,
data text,
PRIMARY KEY ((hour, day, month, year), import_date_time)
) WITH CLUSTERING ORDER BY (import_date_time DESC);
Here you insert the values of hour, day, month and year extracted from import_date_time.
You need to take care of two cases when selecting data:
The previous export time and the current export time are in the same hour.
The two times are not in the same hour.
For case one you need only one query; for case two you have to execute one query per hour bucket involved (two queries if the times fall in adjacent hours).
Example query:
SELECT * FROM export WHERE hour = 16 AND day = 30 AND month = 3 AND year = 2017 AND import_date_time > '2017-03-30 16:22:37';
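The two cases can be handled uniformly by listing every hour bucket between the previous and the current export time and running one query per bucket. A minimal Python sketch of that bucket enumeration; the function name hour_buckets is hypothetical, and it assumes both times are expressed in the same time zone as the values stored in the partition key columns.

```python
from datetime import datetime, timedelta

def hour_buckets(previous_export, current_export):
    """Return (hour, day, month, year) for each hour bucket touched by the
    interval from previous_export to current_export, oldest first."""
    # Start at the top of the hour containing the previous export time.
    cursor = previous_export.replace(minute=0, second=0, microsecond=0)
    buckets = []
    while cursor <= current_export:
        buckets.append((cursor.hour, cursor.day, cursor.month, cursor.year))
        cursor += timedelta(hours=1)
    return buckets

previous = datetime(2017, 3, 30, 16, 22, 37)
current = datetime(2017, 3, 30, 17, 5, 0)
for hour, day, month, year in hour_buckets(previous, current):
    # One SELECT per bucket; the import_date_time restriction stays the same.
    print(hour, day, month, year)
```

Only the first bucket can contain rows older than the previous export time, but keeping the import_date_time > ... restriction on every query is harmless and keeps the loop simple.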

CQL query on 'validFrom/validTo timestamps'

I'm currently trying to model a column family that has two timestamps specifying whether an entry is valid (or 'active') at a given date (typically execution time).
No big issue with traditional SQL, 64 gigs of RAM and some indices, we're doing that quite often with our SQL server.
However, in CQL I haven't managed to model this scenario and write valid queries for it.
My basic model is (I skipped the PK definition!)
create table myTable(
id uuid,
validFrom timeuuid,
validTo timeuuid,
someInformationalData varChar
);
Some explanations:
Because a validity date is not unique, I need a composite key; in my final application this is going to be a usergroup reference (which would be an ideal partition key).
validFrom/validTo are designed to be optional, but I could deal with that by using boundary values (1970, 2038) for 'null' values passed through the persistence layer.
I tried various combinations of partitioning/clustering keys, but none of them resulted in valid CQL.
-- only active results
select *
from
myTable
where
validFrom < now()
and
validTo > now()
I'm quite new to the NoSQL/CQL world and am struggling a bit with converting some of our applications. I could do it in memory, but I'm afraid this could become a bottleneck at some point...
Not sure if this kind of 'I have no idea what I'm doing' yell is appropriate, but any kind of help would be appreciated. :)
Edit: Here's one of the approaches I've been messing around with:
drop table if exists myTable;
create table myTable(
id int,
datefrom timeuuid,
dateto timeuuid,
someColumns varChar,
primary key((id,datefrom),dateto)
);
create index if not exists my_idx on myTable(datefrom);
insert into myTable(id, datefrom,dateto,somecolumns)
values(0,minTimeuuid('1970-01-01 00:00:00'),minTimeuuid('2020-01-01 00:00:00'),'test');
insert into myTable(id,datefrom,dateto,somecolumns)
values(1,minTimeuuid('1970-01-01 00:00:00'),minTimeuuid('2012-01-01 00:00:00'),'test2');
select * from myTable where dateto > now() allow filtering;
-- invalid ("A column of a partition key can be restricted only if the preceding one is restricted by an Equal relation.")
select * from myTable where datefrom < now() and dateto > now() allow filtering;
The first query limits my result (the row with 'validTo=2012-01-01' is filtered out), but I wasn't able to work out a schema that supports both restrictions in the WHERE clause.
If I understand your problem, you are looking for a way to run a range query based on the timestamp. To be able to do this, your model has to include the timestamp as part of the clustering key:
create table myTable(
eventType uuid,
ts timestamp,
val text,
PRIMARY KEY (eventType, ts)
);
The above will allow you to run a query like: SELECT eventType, val from myTable where eventType = 'your_event' and ts >= 'start_ts' and ts < 'end_ts'.
What you need to remember is that the clustering keys dictate the order on disk, which makes it possible to run queries like the one above efficiently. You can read more details about this in the SELECT section of the CQL spec.
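The effect of that ordering can be illustrated with a small Python analogy. This is not how Cassandra is implemented; it only assumes that rows inside one partition are kept sorted by the clustering column, so a range restriction becomes two binary searches and a contiguous slice. The name range_scan and the integer timestamps are made up for the example.

```python
from bisect import bisect_left

def range_scan(partition, start_ts, end_ts):
    """Return rows with start_ts <= ts < end_ts from a list of (ts, val)
    tuples that is already sorted by ts, like rows within one partition."""
    # A 1-tuple (ts,) sorts before any (ts, val) with the same ts, so
    # bisect_left finds the first row whose ts is >= the bound.
    lo = bisect_left(partition, (start_ts,))
    hi = bisect_left(partition, (end_ts,))
    return partition[lo:hi]

# One partition (one eventType), rows sorted by ts.
partition = [(1, "a"), (3, "b"), (5, "c"), (7, "d")]
print(range_scan(partition, 3, 7))
```

A restriction on a column that is not part of that sort order has no such slice to jump to, which is why it needs a different table, an index, or filtering.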
There is no Now() in Cassandra that works like the one in SQL databases: CQL's now() returns a timeuuid, not a timestamp, so for a timestamp column you have to supply the date explicitly instead of using Now().
In the WHERE clause you can use columns that are part of the primary key or that have a secondary index.

Cassandra CQL - clustering order with multiple clustering columns

I have a column family with primary key definition like this:
...
PRIMARY KEY ((website_id, item_id), user_id, date)
which will be queried using queries such as:
SELECT * FROM myCF
WHERE website_id = 30 AND item_id = 10
AND user_id = 0 AND date > 'some_date' ;
However, I'd like to keep my column family ordered by date only, so that a query such as SELECT date FROM myCF ; would return the most recently inserted date.
Due to the order of clustering columns, what I get is an order per user_id then per date.
If I change the primary key definition to:
PRIMARY KEY ((website_id, item_id), date, user_id)
I can no longer run the same query, as date must be restricted if user_id is.
I thought there might be some way to say:
...
PRIMARY KEY ((website_id, shop_id), store_id, date)
) WITH CLUSTERING ORDER BY (store_id RANDOMPLEASE, date DESC) ;
But it doesn't seem to exist. Worse, maybe this is completely stupid and I don't get why.
Is there any way of achieving this? Am I missing something?
Many thanks!
Your query example restricts user_id so that should work with the second table format. But if you are actually trying to run queries like
SELECT * FROM myCF
WHERE website_id = 30 AND item_id = 10
AND date > 'some_date'
then you need an additional table created to handle those queries; it would be ordered only by date and not by user_id:
Create Table LookupByDate ... PRIMARY KEY ((website_id, item_id), date)
In addition to your primary query, if all you are trying to get is "the most recent inserted date", you may not need an additional table. You can use a static column to store the last update time per partition; see CASSANDRA-6561.
It probably won't help in your particular case (since I imagine your list of all users is unmanageably large), but if the condition on the first clustering column matches one of a relatively small set of values, then you can use IN.
SELECT * FROM myCF
WHERE website_id = 30 AND item_id = 10
AND user_id IN ? AND date > 'some_date'
Don't use IN on the partition key because this will create an inefficient query that hits multiple nodes putting stress on the coordinator node. Instead, execute multiple asynchronous queries in parallel. But IN on a clustering column is absolutely fine.
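The "multiple queries in parallel" idea can be sketched in Python as below. The function fetch_partition is a hypothetical stand-in for whatever single-partition query your driver executes (it just returns a fake row here so the example runs without a cluster), and the thread pool is only one way to get concurrency; many drivers offer their own asynchronous execution call that you would use instead.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_partition(website_id, item_id):
    """Hypothetical placeholder for one single-partition SELECT.

    A real implementation would run something like
    SELECT * FROM myCF WHERE website_id = ? AND item_id = ?
    through the driver and return its rows.
    """
    return [{"website_id": website_id, "item_id": item_id}]

def fetch_many(partition_keys, max_workers=8):
    """Run one query per partition key concurrently and merge the rows,
    preserving the order of partition_keys."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fetch_partition, w, i) for w, i in partition_keys]
        return [row for future in futures for row in future.result()]

rows = fetch_many([(30, 10), (30, 11), (31, 10)])
print(len(rows))
```

Each query then goes to the replicas that own that partition, instead of one coordinator fanning out a single IN query.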

Selecting timeuuid columns corresponding to a specific date

Short version: Is it possible to query for all timeuuid columns corresponding to a particular date?
More details:
I have a table defined as follows:
CREATE TABLE timetest(
key uuid,
activation_time timeuuid,
value text,
PRIMARY KEY(key,activation_time)
);
I have populated this with a single row, as follows (f0532ef0-2a15-11e3-b292-51843b245f21 is a timeuuid corresponding to the date 2013-09-30 22:19:06+0100):
insert into timetest (key, activation_time, value) VALUES (7daecb80-29b0-11e3-92ec-e291eb9d325e, f0532ef0-2a15-11e3-b292-51843b245f21, 'some value');
And I can query for that row as follows:
select activation_time,dateof(activation_time) from timetest where key=7daecb80-29b0-11e3-92ec-e291eb9d325e
which results in the following (using cqlsh)
 activation_time                      | dateof(activation_time)
--------------------------------------+--------------------------
 f0532ef0-2a15-11e3-b292-51843b245f21 | 2013-09-30 22:19:06+0100
Now let's assume there's a lot of data in my table and I want to retrieve all rows where activation_time corresponds to a particular date, say 2013-09-30 22:19:06+0100.
I would have expected to be able to query for the range of all timeuuids between minTimeuuid('2013-09-30 22:19:06+0100') and maxTimeuuid('2013-09-30 22:19:06+0100') but this doesn't seem possible (the following query returns zero rows):
select * from timetest where key=7daecb80-29b0-11e3-92ec-e291eb9d325e and activation_time>minTimeuuid('2013-09-30 22:19:06+0100') and activation_time<=maxTimeuuid('2013-09-30 22:19:06+0100');
It seems I need to use a hack whereby I increment the second date in my query (by a second) to catch the row(s), i.e.,
select * from timetest where key=7daecb80-29b0-11e3-92ec-e291eb9d325e and activation_time>minTimeuuid('2013-09-30 22:19:06+0100') and activation_time<=maxTimeuuid('2013-09-30 22:19:07+0100');
This feels wrong. Am I missing something? Is there a cleaner way to do this?
The CQL documentation discusses timeuuid functions but it's pretty short on gte/lte expressions with timeuuids, beyond:
The min/maxTimeuuid example selects all rows where the timeuuid column, t, is strictly later than 2013-01-01 00:05+0000 but strictly earlier than 2013-02-02 10:00+0000. The t >= maxTimeuuid('2013-01-01 00:05+0000') does not select a timeuuid generated exactly at 2013-01-01 00:05+0000 and is essentially equivalent to t > maxTimeuuid('2013-01-01 00:05+0000').
p.s. the following query also returns zero rows:
select * from timetest where key=7daecb80-29b0-11e3-92ec-e291eb9d325e and activation_time<=maxTimeuuid('2013-09-30 22:19:06+0100');
and the following query returns the row(s):
select * from timetest where key=7daecb80-29b0-11e3-92ec-e291eb9d325e and activation_time>minTimeuuid('2013-09-30 22:19:06+0100');
I'm sure the problem is that cqlsh does not display milliseconds for your timestamps.
So the real timestamp is something like '2013-09-30 22:19:06.123+0100'.
When you call maxTimeuuid('2013-09-30 22:19:06+0100'), the milliseconds are missing, so zero is assumed and it is the same as calling maxTimeuuid('2013-09-30 22:19:06.000+0100').
And as 22:19:06.123 > 22:19:06.000, the record is filtered out.
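This can be checked for the timeuuid from the question by decoding the timestamp embedded in it on the client. A small Python sketch using only the standard uuid module; the helper name timeuuid_to_datetime is made up for this example.

```python
import uuid
from datetime import datetime, timedelta, timezone

# Number of 100-nanosecond intervals between the UUID epoch (1582-10-15)
# and the Unix epoch (1970-01-01).
UUID_EPOCH_OFFSET = 0x01B21DD213814000

def timeuuid_to_datetime(value):
    """Decode the timestamp of a version 1 (time-based) UUID as a UTC datetime."""
    u = uuid.UUID(value)
    if u.version != 1:
        raise ValueError("not a time-based (version 1) UUID")
    # u.time is the 60-bit timestamp in 100 ns units; // 10 gives microseconds.
    micros = (u.time - UUID_EPOCH_OFFSET) // 10
    return datetime(1970, 1, 1, tzinfo=timezone.utc) + timedelta(microseconds=micros)

print(timeuuid_to_datetime("f0532ef0-2a15-11e3-b292-51843b245f21"))
```

For the timeuuid in the question this decodes to 21:19:06.591 UTC, i.e. 22:19:06.591+0100, so an upper bound built from the whole-second literal 22:19:06 lies before the row.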
Not directly related to the answer, but as an addition to @dimas's answer:
cqlsh (version 5.0.1) seems to show the milliseconds now:
system.dateof(id)
---------------------------------
2016-06-03 02:42:09.990000+0000
2016-05-28 17:07:30.244000+0000

Resources