How to use basic arithmetic operations in the WHERE clause of CQL?

I am trying to write a CQL query which looks like:
select * from mytable
WHERE timestamp >= unixTimestampOf(maxTimeuuid('2016-03-01 00:00:00')) /1000
and timestamp <= unixTimestampOf(minTimeuuid('2016-03-31 23:59:59')) / 1000
and am getting this error:
Invalid syntax at line 1, char XXX
If I change the query to
select * from mytable
WHERE timestamp >= unixTimestampOf(maxTimeuuid('2016-03-01 00:00:00'))
and timestamp <= unixTimestampOf(minTimeuuid('2016-03-31 23:59:59'))
it does not give any error, but obviously I don't get the desired result, since unixTimestampOf returns milliseconds whereas my timestamp column stores seconds. How can I fix this?

This JIRA ticket is what you're probably looking for: Operator functionality in CQL. It's still unresolved, but it is moving along. One of its subtasks, Add support for arithmetic operators, was recently marked as resolved and is being merged in Cassandra 3.12, but it works for numeric data types only. I don't think it will work out of the box with your schema; you'll probably need to change the timestamps to numeric data types.
HTH.
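Until then, a common workaround is to do the arithmetic client-side and bind the precomputed second-resolution bounds as parameters. A minimal Python sketch (the table and column names follow the question; the helper name is hypothetical):

```python
from datetime import datetime, timezone

# Hypothetical helper: compute second-resolution bounds in the application,
# since CQL can't divide unixTimestampOf(...) by 1000 in the WHERE clause.
def month_bounds_seconds(start: str, end: str):
    fmt = "%Y-%m-%d %H:%M:%S"
    lo = int(datetime.strptime(start, fmt).replace(tzinfo=timezone.utc).timestamp())
    hi = int(datetime.strptime(end, fmt).replace(tzinfo=timezone.utc).timestamp())
    return lo, hi

lo, hi = month_bounds_seconds("2016-03-01 00:00:00", "2016-03-31 23:59:59")
query = "SELECT * FROM mytable WHERE timestamp >= %s AND timestamp <= %s"
# session.execute(query, (lo, hi))  # with a cassandra-driver session
print(lo, hi)  # 1456790400 1459468799
```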

Related

Optimization of a query which uses arithmetic operations in WHERE clause

I need to retrieve records where the expiration date is today. The expiration date is calculated dynamically using two other fields (startDate and durationDays):
SELECT * FROM subscription WHERE startDate + durationDays < currentDate()
Does it make sense to add two indexes for these two columns? Or should I consider adding a new column expirationDate and create an index for it only?
I'm wondering how Cassandra handles such a filter as in my example. Does it do a full scan?
First of all, your question is predicated on CQL's ability to perform (date) arithmetic. It cannot.
> SELECT * FROM subscription WHERE startDate + durationDays < currentDate();
SyntaxException: line 1:43 no viable alternative at input '+' (SELECT * FROM subscription WHERE [startDate] +...)
Secondly the currentDate() function does not exist in Cassandra 3.11.4.
> SELECT currentDate() FROM system.local;
InvalidRequest: Error from server: code=2200 [Invalid query] message="Unknown function 'currentdate'"
That does work in Cassandra 4.0, but as it has not been released yet, you really shouldn't be using it.
So let's assume that you've created your secondary indexes on startDate and durationDays and you're just querying on those, without any arithmetic.
Does it execute a full table scan?
ABSOLUTELY.
The reason is that a query based solely on secondary index columns does not have a partition key. Therefore, it has to search for those values on all partitions on all nodes. In a large cluster, your query would likely time out.
Also, when it finds matching data, it has to keep querying, as those values are not unique; it's entirely possible that there are several results to be returned. Carlos is 100% correct in advising you to rebuild your table based on what you want to query.
Recommendations:
Try not to build a table with secondary indexes. Like ever.
If you have to build a table with secondary indexes, try to have a partition key in your WHERE clause to keep the query isolated to a single node.
Any filtering on dynamic (computed) values needs to be done on the application side.
In your case, it might make more sense to create a column called expirationDate, do your date arithmetic in your app, and then INSERT that value into your table.
You'll also want to follow the "time bucket" pattern for handling time series data (which is what this appears to be). Say that month works as a "bucket" (it may or may not for your use case). PRIMARY KEY ((month),expirationDate,id) would be a good key. This way, all the subscriptions for a particular month are stored together, clustered by expirationDate, with id on the end to act as a tie-breaker for uniqueness.
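To illustrate, here's a minimal Python sketch of doing the date arithmetic in the app before the INSERT. The helper names and the month-bucket format are assumptions for illustration, not anything Cassandra prescribes:

```python
from datetime import date, timedelta

# Hypothetical helpers: derive expirationDate and a month "bucket" in the
# application, then write both on INSERT instead of computing them in CQL.
def expiration(start_date: date, duration_days: int) -> date:
    return start_date + timedelta(days=duration_days)

def month_bucket(d: date) -> str:
    return d.strftime("%Y-%m")

exp = expiration(date(2020, 1, 10), 30)
print(exp, month_bucket(exp))  # 2020-02-09 2020-02
# insert = ("INSERT INTO subscription (month, expirationDate, id, startDate, durationDays) "
#           "VALUES (%s, %s, %s, %s, %s)")
# session.execute(insert, (month_bucket(exp), exp, sub_id, start, days))
```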
One of the main differences between Cassandra and relational databases is that table definitions depend on the queries that will be run against them. The conditions on how the data will be retrieved (the WHERE clause) should be covered by the primary key, as that will perform better than an index on the table.
There are multiple resources regarding the read path and the quirks of primary keys vs. indexes; this talk from the Cassandra Summit may be useful.

Unable to coerce to a formatted date - Cassandra timestamp type

I have values stored in a timestamp column of a Cassandra table in the format
2018-10-27 11:36:37.950000+0000 (GMT).
I get Unable to coerce '2018-10-27 11:36:37.950000+0000' to a formatted date (long) when I run below query to get data.
select create_date from test_table where create_date='2018-10-27 11:36:37.950000+0000' allow filtering;
How to get the query working if the data is already stored in the table (of format, 2018-10-27 11:36:37.950000+0000) and also perform range (>= or <=) operations on create_date column?
I also tried create_date='2018-10-27 11:36:37.95Z' and
create_date='2018-10-27 11:36:37.95'.
Is it possible to perform filtering on this kind of timestamp type data?
P.S. Using cqlsh to run query on cassandra table.
In the first case, the problem is that you specify the timestamp with microseconds, while Cassandra operates with milliseconds - try removing the last three digits: .950 instead of .950000 (see this document for details). Timestamps are stored inside Cassandra as a 64-bit number and are formatted when printing results using the format specified by the datetimeformat option of cqlshrc (see doc). Dates without an explicit timezone require that a default timezone be specified in cqlshrc.
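If you're building the timestamp string in application code, here is a small Python sketch of trimming microseconds down to the millisecond precision Cassandra accepts (the helper name is hypothetical):

```python
from datetime import datetime, timezone

# Hypothetical helper: format a datetime with exactly three fractional
# digits (milliseconds), since Cassandra's timestamp type rejects the
# six-digit microsecond form.
def to_cql_timestamp(dt: datetime) -> str:
    # %f yields 6 digits of microseconds; keep only the first 3
    s = dt.strftime("%Y-%m-%d %H:%M:%S.%f")
    return s[:-3] + dt.strftime("%z")

ts = datetime(2018, 10, 27, 11, 36, 37, 950000, tzinfo=timezone.utc)
print(to_cql_timestamp(ts))  # 2018-10-27 11:36:37.950+0000
```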
Regarding your question about filtering the data - this query will work only for small amounts of data; on bigger data sizes it will most probably time out, as it needs to scan all data in the cluster. Also, the data won't be sorted correctly, because sorting happens only inside a single partition.
If you want to perform such queries, then the Spark Cassandra Connector may be the better choice, as it can efficiently select the required data, and then you can perform sorting, etc., although it will require many more resources.
I recommend taking the DS220 course from DataStax Academy to understand how to model data for Cassandra.
This works for me (note lowercase mm for minutes - MM is the month):
var datetime = DateTime.UtcNow.ToString("yyyy-MM-dd HH:mm:ss");
var query = $"SET updatedat = '{datetime}' WHERE ...

How to get current year on cassandra

How can I get just a part of the current date in Cassandra? In my particular case I need to get just the year.
For the moment I got this far: select dateof(now()) from system.local;
But I could not find any function to get just the year in the documentation
https://docs.datastax.com/en/dse/5.1/cql/cql/cql_reference/refCqlFunction.html#refCqlFunction__toTimestamp
I'm new to Cassandra, so this may be a silly question.
The safe way would be to return a timestamp and parse out the year client-side.
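For example, a minimal Python sketch of the client-side approach (assuming the driver has already deserialized the timestamp column into a datetime):

```python
from datetime import datetime, timezone

# Client-side sketch: fetch a timestamp (e.g. SELECT totimestamp(now())
# FROM system.local) and take the year from the deserialized datetime.
def year_of(ts: datetime) -> int:
    return ts.year

row_ts = datetime(2017, 12, 20, 21, 18, 37, tzinfo=timezone.utc)  # stand-in for a driver result
print(year_of(row_ts))  # 2017
```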
Natively, Cassandra does not have any functions that can help with this. However, you can write a user-defined function (UDF) to accomplish this:
First, user defined functions are disabled by default. You'll need to adjust that setting in your cassandra.yaml and restart your node(s).
enable_user_defined_functions: true
NOTE: This setting defaults to off for a reason. Although Cassandra 3.x has some safeguards in place to protect against malicious code, it is a good idea to leave this turned off unless you both need it and know what you are doing. And even then, you'll want to keep an eye on any UDFs that get defined.
Now I'll create my function using Java, from within cqlsh:
cassdba#cqlsh:stackoverflow> CREATE OR REPLACE FUNCTION year (input DATE)
RETURNS NULL ON NULL INPUT RETURNS TEXT
LANGUAGE java AS 'return input.toString().substring(0,4);';
Note that there are a number of ways (and types) to query the current date/time:
cassdba#cqlsh:stackoverflow> SELECT todate(now()) as date,
totimestamp(now()) as timestamp, now() as timeuuid FROM system.local;
date | timestamp | timeuuid
------------+---------------------------------+-------------------------------------
2017-12-20 | 2017-12-20 21:18:37.708000+0000 | 58167cc1-e5cb-11e7-9765-a98c427e8248
(1 rows)
To return just the year, I can call my year function on the todate(now()) column:
SELECT stackoverflow.year(todate(now())) as year FROM system.local;
year
------
2017
(1 rows)

Select all from column family not returning all

I'm having inconsistent results when I attempt to select * from a column family. For fun, I ran the same query in a loop and then counted the result set returned. No matter what I do, it seems to vary (sometimes as much as +/- 150 rows for every ~4,000).
rSet = session.execute(SimpleStatement('SELECT * FROM colFam', consistency_level=ConsistencyLevel.QUORUM))
Using this query results in different row counts returned. The table in question isn't going to hold millions of rows - being able to accurately select * from it is important. Running the query directly in CQLShell yields similar weirdness.
Is there something inherent in CQL/Cassandra that I haven't yet learned about that prevents Cassandra from being able to return an accurate representation of *? Or is my Google fu just failing me?

Cassandra time range query

Before you downvote I would like to state that I looked at all of the similar questions but I am still getting the dreaded "PRIMARY KEY column cannot be restricted" error.
Here's my table structure:
CREATE TABLE IF NOT EXISTS events (
id text,
name text,
start_time timestamp,
end_time timestamp,
parameters blob,
PRIMARY KEY (id, name, start_time, end_time)
);
And here's the query I am trying to execute:
SELECT * FROM events WHERE name = ? AND start_time >= ? AND end_time <= ?;
I am really stuck at this. Can anyone tell me what I am doing wrong?
Thanks,
Deniz
This is a query you need to remodel your data for, or else use a distributed analytics platform (like Spark). The partition key, id, determines how your data is distributed through the database. Since it is not specified in this query, a full table scan would be required to find the necessary rows. The Cassandra design team has decided that they would rather you not run a query at all than run a query which will not scale.
Basically whenever you see "COLUMN cannot be restricted" It means that the query you have tried to perform cannot be done efficiently on the table you created.
To run the query anyway, use the ALLOW FILTERING clause:
SELECT * FROM analytics.events WHERE name = ? AND start_time >= ? AND end_time <= ? ALLOW FILTERING;
The general rule for queries is that you have to pass at least all partition key columns; then you can add each clustering key in the order they're defined. So in order for you to make this work you'd need to add where id = x in there.
However, it appears what this error message is implying is that once you select 'start_time > 34', that's as far "down the chain" as you're allowed to go; otherwise it would require the "potentially too costly" ALLOW FILTERING flag. So it has to be equality-only down to at most one < > combo on a single column, all in the name of speed. This works (though it doesn't give a range query):
SELECT * FROM events WHERE name = 'a' AND start_time = 33 and end_time <= 34 and id = '35';
If you're looking for events "happening at minute y", a different data model might be possible, like adding a row for each minute an event is ongoing, or bucketing based on hour. See also https://stackoverflow.com/a/48755855/32453
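As a rough illustration of the "row per minute" remodel, here's a hypothetical Python helper that enumerates the minute buckets an event would be written under:

```python
from datetime import datetime, timedelta

# Hypothetical sketch of the "row per minute" remodel: for every minute an
# event is ongoing, emit one bucket key, so "events happening at minute y"
# becomes a lookup on a single partition instead of a range scan.
def minute_buckets(start: datetime, end: datetime):
    t = start.replace(second=0, microsecond=0)
    buckets = []
    while t <= end:
        buckets.append(t)
        t += timedelta(minutes=1)
    return buckets

bs = minute_buckets(datetime(2020, 1, 1, 12, 0, 30),
                    datetime(2020, 1, 1, 12, 3, 10))
print([b.strftime("%H:%M") for b in bs])  # ['12:00', '12:01', '12:02', '12:03']
```

The trade-off is write amplification: one INSERT per minute of event duration, in exchange for cheap single-partition reads.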
