Is it possible to express inequality in the WHERE clause of a CQL statement? - cassandra

I want to SELECT stuff WHERE value is not NAN. How can I do it? I tried several options:
WHERE value != NAN
WHERE value is not NAN
WHERE value == value
None of these attempts succeeded.
I see that it is possible to write WHERE value = NAN, but is there a way to express inequality?

As you noted, none of the alternatives you tried work today:
although the != operator is recognized by the parser, it is unfortunately not supported in the WHERE clause. This is true for both Cassandra and Scylla. I opened https://github.com/scylladb/scylladb/issues/12736 as a feature request in Scylla to add support for !=.
The IS NOT ... syntax is not relevant - it is only supported in the specific form IS NOT NULL, and even that is not supported in WHERE (see https://github.com/scylladb/scylladb/issues/8517).
WHERE value = value (note that a single equals sign is the SQL and CQL syntax, not '==' as in C) is currently not supported: you can only check the equality of a column against a constant, not the equality of two columns. Again, this is true for both Cassandra and Scylla. Scylla is in the process of improving the power of WHERE expressions, and at the end of this process this sort of expression will be supported.
I think your best solution today is simply to read all the data and filter out NaN yourself, in the client. The performance loss should be minimal - just the network overhead - because even if Scylla did this filtering for you, it would still need to read all the data from disk and filter it; it cannot get this inequality check "for free". This is unlike an equality check (WHERE value = 3), where Scylla can jump directly to the position of value = 3 (if "value" is the partition key or clustering key) and read only that. This efficiency concern is why Scylla and Cassandra have historically supported the equality operator but not the inequality operator.
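As a minimal sketch of that client-side approach, in Scala with the DataStax Java driver (the keyspace, table, and column names ks.readings, id, and value are hypothetical):

import scala.collection.JavaConverters._
import com.datastax.driver.core.Cluster

val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect()

// Read all the rows, then do the inequality check the server cannot do for us.
val rows = session.execute("SELECT id, value FROM ks.readings").asScala
val nonNan = rows.filter(row => !row.getDouble("value").isNaN)
nonNan.foreach(row => println(s"${row.getInt("id")} -> ${row.getDouble("value")}"))

cluster.close()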

Cassandra is designed for OLTP workloads, so reads are optimised for retrieving specific partitions, with filters of the form:
SELECT ... FROM ... WHERE partition_key = ?
A query with an inequality filter is retrieving "everything except partition X" and is not really OLTP, because Cassandra has to perform a full table scan to check every record that does NOT match the filter. This kind of query does not scale, so it is not supported.
As far as I'm aware, the inequality operator (!=) only works in the conditional (IF) section of lightweight transactions, which applies only to UPDATE and DELETE, not SELECT statements. For example:
UPDATE ... SET ... WHERE ... IF condition
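A concrete sketch of that form (the table and column names here are hypothetical):

UPDATE users
SET email = 'new@example.com'
WHERE id = 42
IF email != 'old@example.com';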
If you have a complex search use case, you should look at using Elasticsearch or Apache Solr on top of Cassandra. If you have an analytics use case, consider using Apache Spark to query the data in Cassandra. Cheers!

Related

How to check length of text field of Cassandra table

There is a field 'name' in our Cassandra database whose data type is 'text'.
How do I retrieve the rows where the length of the 'name' field is greater than some number, using a Cassandra query?
As was pointed out in the comment, it's easy to add a user-defined function and use it to retrieve the length of the text field, but the catch is that you can't use a user-defined function in the WHERE condition (see CASSANDRA-8488).
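For example, a minimal sketch (assuming user-defined functions are enabled in cassandra.yaml, and a hypothetical table my_table):

CREATE FUNCTION IF NOT EXISTS text_length (input text)
RETURNS NULL ON NULL INPUT
RETURNS int
LANGUAGE java
AS 'return input.length();';

-- This works in the SELECT list...
SELECT name, text_length(name) FROM my_table;

-- ...but this is rejected (CASSANDRA-8488):
-- SELECT name FROM my_table WHERE text_length(name) > 10;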
Even if it were possible, a query with only this condition would be a bad query for Cassandra, as it would need to go through all the data in the database and filter it out. For such tasks, tools like Spark are usually used - you can read the data via the Spark Cassandra Connector and apply the necessary filtering conditions. This still involves reading all of the data from the database and then filtering it, which is considerably slower than normal CQL queries, but at least it is automatically parallelized.
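A minimal sketch of that Spark approach (assuming an existing SparkContext sc and hypothetical keyspace/table names):

import com.datastax.spark.connector._

// Read the whole table, then filter by text length on the Spark side.
val longNames = sc.cassandraTable("my_keyspace", "my_table")
  .filter(row => row.getString("name").length > 10)
longNames.collect().foreach(println)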

Querying on column which is not a part of a PK or a secondary index

Please help me to resolve a confusion. A Cassandra book claims that attempts to query based on a column that is not part of the PK should fail (with no secondary index on that column either). However, when I try it, I see this warning:
Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING
Once I append ALLOW FILTERING to my query, there is no more error. I understand the implications for performance - however, this clearly contradicts what is written in the book. Was this feature added later, or did the book's authors simply miss it?
I think it is great that you have a textbook to guide you through important NoSQL concepts, but don't rely on it alone, as Cassandra is open source and constantly updated by the community. Online resources such as the official Apache documentation are a much better option for up-to-date information and tutorials on new and existing features.
Although ALLOW FILTERING does exist, it is still recommended to use a different table construction (e.g. making the column part of the key) or to create an INDEX to keep queries fast.
AFAIK, Cassandra has had ALLOW FILTERING since version 1.
Also, to explain ALLOW FILTERING, as per the DataStax documentation:
Let’s take for example the following table:
CREATE TABLE blogs (blogId int,
time1 int,
time2 int,
author text,
content text,
PRIMARY KEY(blogId, time1, time2));
If you execute the following query:
SELECT * FROM blogs;
Cassandra will return you all the data that the table blogs contains.
If you now want only the data at a specified time1, you will naturally add an equality condition on the column time1:
SELECT * FROM blogs WHERE time1 = 1418306451235;
In response, you will receive the following error message:
Bad Request: Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING.
Cassandra knows that it might not be able to execute the query in an efficient way. It is therefore warning you: “Be careful. Executing this query as such might not be a good idea as it can use a lot of your computing resources”.
The only way Cassandra can execute this query is by retrieving all the rows from the table blogs and then by filtering out the ones which do not have the requested value for the time1 column.
If your table contains, for example, 1 million rows and 95% of them have the requested value for the time1 column, the query will still be relatively efficient and you should use ALLOW FILTERING.
On the other hand, if your table contains 1 million rows and only 2 rows contain the requested value for the time1 column, your query is extremely inefficient: Cassandra will load 999,998 rows for nothing. If the query is used often, it is probably better to add an index on the time1 column.
Unfortunately, Cassandra has no way to differentiate between the two cases above, as they depend on the data distribution of the table. Cassandra is therefore warning you and relying on you to make the right choice.
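If you decide an index is the right choice for the second case, it is a one-liner, and the earlier query then runs without ALLOW FILTERING (a sketch using the blogs table above; the index name is illustrative):

CREATE INDEX blogs_time1_idx ON blogs (time1);
SELECT * FROM blogs WHERE time1 = 1418306451235;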
Thanks,
Harry

Selecting from multiple tables in Cassandra CQL

So I have two tables in the query I am using:
SELECT
R.dst_ap, B.name
FROM airports as A, airports as B, routes as R
WHERE R.src_ap = A.iata
AND R.dst_ap = B.iata;
However it is throwing the error:
mismatched input 'as' expecting EOF (..., B.name FROM airports [as] A...)
Is there any way I can do what I am attempting to do (which is how it works relationally) in Cassandra CQL?
The short answer is that there are no joins in Cassandra. Period. So using SQL-style JOIN syntax will yield an error similar to the one you posted above.
The idea with Cassandra (or any distributed database) is to ensure that your queries can be served by a single node (cutting down on network time). There really isn't a way to guarantee that data from different tables can be queried from a single node. For this reason, distributed joins are typically seen as an anti-pattern. To that end, Cassandra simply doesn't allow them.
In Cassandra you need to take a query-based modeling approach. So you could solve this by building a table from your post-join result set, consisting of the desired combinations of dst_ap and name. You would have to find an appropriate way to partition this table, but ultimately you would want to build it based on A) the result set you expect to see and B) the properties you expect to filter on in your WHERE clause.
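A sketch of what such a denormalized table might look like, assuming you most often look up destinations (and their names) by source airport; the table name and partitioning here are illustrative, not prescriptive:

CREATE TABLE routes_by_src_ap (
    src_ap text,      -- partition key: the airport you query by
    dst_ap text,
    dst_name text,    -- denormalized from airports.name at write time
    PRIMARY KEY (src_ap, dst_ap)
);

SELECT dst_ap, dst_name FROM routes_by_src_ap WHERE src_ap = 'LAX';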

Order of results in Cassandra

I have two questions about query results in Cassandra.
When I make a "full" select of a table in Cassandra (i.e. SELECT * FROM table), is it guaranteed that the results will be returned in increasing order of partition tokens?
For instance, having the following table:
create table users(id int, name text, primary key(id));
Is it guaranteed that the following query will return the results with increasing values in the token column?
select token(id), id from users;
If so, is it also guaranteed if the data is distributed to multiple nodes in the cluster?
If the answer to the above question is 'yes', is it still valid if we use a secondary index? For instance, if we had the following index:
create index on users(name);
and we query the table by using the index:
select token(id), id from users where name = 'xyz';
is there any guarantee regarding the order of results?
The motivation for the above questions is whether the token is the right thing to use in order to implement paging and/or resuming of broken longer "data exports".
EDIT: There are multiple resources on the net that state that the order matches the token order (e.g. the description of partitioner results or this DataStax page):
Without a partition key specified in the WHERE clause, the actual order of the result set then becomes dependent on the hashed values of userid.
However, the order of results is not specified in the official Cassandra documentation, e.g. for the SELECT statement.
Is it guaranteed that the following query will return the results with increasing values in the token column?
Yes, it is.
If so, is it also guaranteed if the data is distributed to multiple nodes in the cluster?
The data distribution is orthogonal to the ordering of the retrieved data; there is no relationship between them.
If the answer to the above question is 'yes', is it still valid if we use a secondary index?
Yes, even if you query data using a secondary index (be it SASI or the native implementation), the returned results will always be sorted in token order. Why? The technical explanation is given in my blog post here: http://www.doanduyhai.com/blog/?p=13191#cluster_read_path
That's the main reason why SASI is not a good fit if you want the search to return data ordered by some column values. Only a real search-engine integration (like DataStax Enterprise Search) can give you the correct ordering, because it bypasses the cluster read-path layer.
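Given that guarantee, resuming by token is a common paging pattern; a sketch against the users table above (the literal token is just a placeholder for the last value you saw):

SELECT token(id), id FROM users LIMIT 1000;

-- Resume from where the previous page stopped:
SELECT token(id), id FROM users WHERE token(id) > -3485513579396041028 LIMIT 1000;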

Spark Cassandra connector - Range query on partition key

I'm evaluating the spark-cassandra-connector and I'm struggling to get a range query on a partition key to work.
According to the connector's documentation, it seems it's possible to do server-side filtering on the partition key using the equality or IN operators, but unfortunately my partition key is a timestamp, so I cannot use them.
So I tried using Spark SQL with the following query ('timestamp' is the partition key):
select * from datastore.data where timestamp >= '2013-01-01T00:00:00.000Z' and timestamp < '2013-12-31T00:00:00.000Z'
Although the job spawns 200 tasks, the query is not returning any data.
I can also confirm that there is data to be returned, since running the query in cqlsh (doing the appropriate conversion using the 'token' function) DOES return data.
I'm using Spark 1.1.0 in standalone mode. Cassandra is 2.1.2 and the connector is the 'b1.1' branch. The Cassandra driver is the DataStax 'master' branch.
The Cassandra cluster is overlaid on the Spark cluster, with 3 servers and a replication factor of 1.
Here is the job's full log
Any clue anyone?
Update: When trying to do server-side filtering based on the partition key (using the CassandraRDD.where method), I get the following exception:
Exception in thread "main" java.lang.UnsupportedOperationException: Range predicates on partition key columns (here: timestamp) are not supported in where. Use filter instead.
But unfortunately I don't know what "filter" is...
I think the CassandraRDD error is telling you that the query you are trying to run is not allowed in Cassandra, and that you have to load the whole table into a CassandraRDD and then apply a Spark filter operation on that RDD.
So your code (in Scala) should look something like this:
val fmt = new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSX")
val start = fmt.parse("2013-01-01T00:00:00.000Z")
val end = fmt.parse("2013-12-31T00:00:00.000Z")
// Load the whole table, then filter the timestamp range on the Spark side.
val cassRDD = sc.cassandraTable("keyspace_name", "table_name")
  .filter(row => { val ts = row.getDate("timestamp"); !ts.before(start) && ts.before(end) })
If you are interested in making this type of query, you might want to take a look at other Cassandra connectors, like the one developed by Stratio.
You have several options to get the solution you are looking for.
The most powerful one would be to use Lucene indexes integrated with Cassandra by Stratio, which allow you to search by any indexed field on the server side. Your write time will increase but, on the other hand, you will be able to query any time range. You can find further information about Lucene indexes in Cassandra here. This extended version of Cassandra is fully integrated into the deep-spark project, so you can take advantage of the Lucene indexes in Cassandra through it. I would recommend using Lucene indexes when you are executing a restricted query that retrieves a small-to-medium result set; if you are going to retrieve a big piece of your data set, you should use the third option below.
Another approach, depending on how your application works, might be to truncate your timestamp field so you can look it up using an IN operator. The problem is that, as far as I know, you can't use the spark-cassandra-connector for that: you would have to use the direct Cassandra driver, which is not integrated with Spark, or you can have a look at the deep-spark project, where a new feature allowing this is about to be released very soon. Your query would look something like this:
select * from datastore.data where timestamp IN ('2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04', ... , '2013-12-31')
But, as I said before, I don't know if that fits your needs, since you might not be able to truncate your data and group it by date/time.
The last option you have, but the least efficient, is to bring the full data set to your Spark cluster and apply a filter on the RDD.
Disclaimer: I work for Stratio :-) Don't hesitate to contact us if you need any help.
I hope it helps!
