CQL syntax error: "mismatched input 'and' expecting ')'" - cassandra

I am executing the query below:
SELECT * FROM test
WHERE (
Column11 in ('Value1','Value2','Value3')
AND Column12 in ('Value11','Value22','Value32')
AND Column13 = 'Value99'
);
This gives the following error:
mismatched input 'and' expecting ')' (...,'Value3') [and]...)
But when I execute the above query without the outer parentheses, it works fine.
SELECT * FROM test
WHERE Column11 in ('Value1','Value2','Value3')
AND Column12 in ('Value11','Value22','Value32')
AND Column13 = 'Value99' ;
Is there any way to execute the first query? I want to add multiple OR-separated clauses and prepare a big query.

This is just parentheses enclosing the conditions that form the WHERE clause. Subquery or not, the CQL parser does not allow extra parentheses.
I want to add multiple OR-separated clauses and prepare a big query.
Due to Cassandra's underlying engineering choices with regard to data distribution and read path, OR is not a valid CQL keyword.
Cassandra requires you to model tables based on the anticipated query patterns. When you run a query, the goal should be to ensure that it can be served by a single node in the cluster. OR based logic tends to be more open-ended, and not focused on precise key values.
tl;dr:
CQL != SQL. This sounds to me like more of a use case for Postgres or MariaDB.
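If you still need OR-style behavior, one client-side workaround is to run each OR branch as its own query and merge the results in the application. A minimal sketch with the Python driver (the contact point, keyspace, and the assumption that the filtered columns are part of the primary key are all hypothetical):

from cassandra.cluster import Cluster

# Hypothetical contact point and keyspace.
session = Cluster(["127.0.0.1"]).connect("my_keyspace")

# Each OR branch becomes its own query.
branches = [
    "SELECT * FROM test WHERE Column11 IN ('Value1','Value2','Value3') "
    "AND Column12 IN ('Value11','Value22','Value32') AND Column13 = 'Value99'",
    "SELECT * FROM test WHERE Column13 = 'Value88'",
]

rows, seen = [], set()
for cql in branches:
    for row in session.execute(cql):
        key = tuple(row)  # rows are named tuples; dedupe rows matched by several branches
        if key not in seen:
            seen.add(key)
            rows.append(row)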

Related

Is it possible to express inequality in the WHERE clause of a CQL statement?

I want to SELECT rows WHERE value is not NAN. How can I do this? I tried different options:
WHERE value != NAN
WHERE value is not NAN
WHERE value == value
None of these attempts succeeded.
I see that it is possible to write WHERE value = NAN, but is there a way to express inequality?
As you noted, none of the alternatives you tried work today:
Although the != operator is recognized by the parser, it is unfortunately not supported in the WHERE clause. This is true for both Cassandra and Scylla. I opened https://github.com/scylladb/scylladb/issues/12736 as a feature request in Scylla to add support for !=.
The IS NOT ... syntax is not relevant - it is only supported in the specific form IS NOT NULL, and even that is not supported in WHERE (see https://github.com/scylladb/scylladb/issues/8517).
WHERE value = value (note that a single equals sign is the SQL and CQL syntax, not '==' as in C) is currently not supported: you can only check the equality of a column to a constant, not the equality of two columns. Again, this is true for both Cassandra and Scylla. Scylla is now in the process of improving the power of WHERE expressions, and at the end of this process this sort of expression will be supported.
I think your best solution today is to read all the data and filter out the NaNs yourself, in the client. The performance loss should be minimal - just the network overhead - because even if Scylla did this filtering for you, it would still need to read the data from disk and filter it; it cannot get this inequality check "for free". This is unlike the equality check (WHERE value = 3), where Scylla can jump directly to the position of value = 3 (if value is the partition key or clustering key) and read only that. This efficiency concern is why Scylla and Cassandra have historically supported the equality operator but not the inequality operator.
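As an illustration, the client-side filtering could look like this (a sketch; the driver setup and table/column names are assumptions):

import math
from cassandra.cluster import Cluster

# Hypothetical contact point, keyspace, and table.
session = Cluster(["127.0.0.1"]).connect("my_keyspace")

# Read the rows and drop NaNs in the client, since
# WHERE value != NAN is not supported server-side.
rows = session.execute("SELECT key, value FROM my_table")
non_nan = [r for r in rows if r.value is not None and not math.isnan(r.value)]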
Cassandra is designed for OLTP workloads, so reads are optimised for retrieving specific partitions, where the filter is of the form:
SELECT ... FROM ... WHERE partition_key = ?
A query with an inequality filter retrieves "everything except partition X" and is not really OLTP, because Cassandra has to perform a full table scan to check every record against the filter. This kind of query does not scale, so it is not supported.
As far as I'm aware, the inequality operator (!=) only works in the conditional section of lightweight transactions, which applies only to UPDATE and DELETE, not SELECT statements. For example:
UPDATE ... SET ... WHERE ... IF condition
If you have a complex search use case, you should look at using Elasticsearch or Apache Solr on top of Cassandra. If you have an analytics use case, consider using Apache Spark to query the data in Cassandra. Cheers!

SparkSQL equivalent for SQL compiled statement with variables

I need to execute SparkSQL statements in an efficient manner, e.g. compile once, execute many times (with different parameter values).
For a simple SQL example:
select * from my_table where year=:1
where :1 is a bind variable, so the statement is compiled once and executed N times (with different values). I need the SparkSQL equivalent.
Things like:
year = 2020
df_result = spark.sql("select * from my_table where year={0}".format(year))
are not what I expect, since these are not really bind variables, just one specific instantiated statement.
Depending on where your data are stored, your cluster resources, the size of the table, etc., you might consider caching the entire table; that will at least prevent Spark from having to read off disk/blob storage on every execution of the query:
# Cache the table once (if not already cached), then filter the cached DataFrame.
catalog = sparkSession.catalog
df_my_table = sparkSession.table("my_table")
if not catalog.isCached("my_table"):
    df_my_table.cache()
df_result = df_my_table.filter("year = '" + str(year) + "'")
There may be many ways to do this better depending on your architecture, but I'm sticking to a 100% Spark-based solution here.
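One more option worth checking against your Spark version: PySpark 3.4 and later accept named parameters in spark.sql, which is the closest built-in analogue to bind variables, although Spark still re-plans the query on each call rather than reusing a compiled statement:

# Requires PySpark >= 3.4 (parameterized SQL support).
year = 2020
df_result = spark.sql(
    "SELECT * FROM my_table WHERE year = :year",
    args={"year": year},
)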

How to change Spark GroupBy/OrderBy comparator to deal with encrypted data

I'm doing a university project in which I am trying to make Spark SQL work over encrypted data (with my own algorithms). I implemented some functions that compare two encrypted values for equality and order, and I am using UDF/UDAF functions to execute them.
For example, if I want to execute this query:
SELECT count(SALARY) FROM table1 WHERE age > 20
I convert this one into:
SELECT mycount_udf(SALARY) FROM table1 WHERE myfilter_udf(greater_udf(age,20))
where mycount_udf, myfilter_udf, and greater_udf are UDAFs and UDFs implemented to handle my functions over encrypted data.
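For reference, the shape of this rewrite in PySpark terms (my real implementation is in Java, and the comparison below is just a plaintext placeholder for the encrypted one):

from pyspark.sql import SparkSession
from pyspark.sql.types import BooleanType

spark = SparkSession.builder.getOrCreate()

def greater_encrypted(a, b):
    # Placeholder: the real version compares ciphertexts with an
    # order-revealing scheme instead of plaintext '>'.
    return a > b

spark.udf.register("greater_udf", greater_encrypted, BooleanType())

# Rewritten filter (the count UDAF is omitted for brevity):
result = spark.sql("SELECT count(SALARY) FROM table1 WHERE greater_udf(age, 20)")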
However, I am facing a problem when I want to execute queries like ORDER BY/GROUP BY. The internals of these operators use equality and ordering operations to execute the query. To execute queries correctly over my encrypted values, I have to change the comparators inside ORDER BY/GROUP BY so that they use my UDF comparators (equality_udf, greater_udf, etc.).
If I encrypt:
x = 5 => encrypted_x = KSKFA92
y = 6 => encrypted_y = A9283NA
As 5 < 6, greater_udf(5,6) will return False. So I have to use this comparator inside ORDER BY (SORT) to execute the query correctly, because Spark doesn't know the values are encrypted; when it compares encrypted_x with encrypted_y using == or a comparator between Spark DataTypes, it will produce a wrong result.
Is there any way to do this without changing the Spark GROUP BY/ORDER BY source code? It seems to me that it is not possible with UDF/UDAFs. I am using Java for this work.

COUNT(*) vs. COUNT(1) performance in Cassandra

According to the docs:
A SELECT expression using COUNT(*) returns the number of rows that matched the query. Alternatively, you can use COUNT(1) to get the same result.
Are there any performance benefits (as in RDBMSes) from using the latter approach?
There is no difference between COUNT(*) and COUNT(1). COUNT(1) exists, I think, just for backwards compatibility with some older tooling. In the parser, selectCountClause returns an empty RawSelector list regardless of the contents, but if the argument is a number other than 1, or anything other than '*', it will throw an exception.
You might want to avoid COUNT in general if you are worried about performance. Instead, use a counter or maintain the count at a higher level.
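For example, a counter-based approach might look like this (a sketch; the keyspace and table names are hypothetical):

from cassandra.cluster import Cluster

# Hypothetical contact point and keyspace; assumes a counter table:
#   CREATE TABLE row_counts (table_name text PRIMARY KEY, n counter);
session = Cluster(["127.0.0.1"]).connect("my_keyspace")

# Bump the counter alongside every insert into the data table.
session.execute(
    "UPDATE row_counts SET n = n + 1 WHERE table_name = %s", ("test",)
)

# Reading the count is then a single-partition lookup, not a scan.
count = session.execute(
    "SELECT n FROM row_counts WHERE table_name = %s", ("test",)
).one().n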

Selecting from multiple tables in Cassandra CQL

So I have two tables in the query I am using:
SELECT
R.dst_ap, B.name
FROM airports as A, airports as B, routes as R
WHERE R.src_ap = A.iata
AND R.dst_ap = B.iata;
However, it throws this error:
mismatched input 'as' expecting EOF (..., B.name FROM airports [as] A...)
Is there any way I can do what I am attempting (which is how it works relationally) in Cassandra CQL?
The short answer is that there are no joins in Cassandra. Period. So using SQL-based JOIN syntax will yield an error similar to what you posted above.
The idea with Cassandra (or any distributed database) is to ensure that your queries can be served by a single node (cutting down on network time). There really isn't a way to guarantee that data from different tables could be queried from a single node. For this reason, distributed joins are typically seen as an anti-pattern. To that end, Cassandra simply doesn't allow them.
In Cassandra you need to take a query-based modeling approach. So you could solve this by building a table from your post-join result set, consisting of desired combinations of dst_ap and name. You would have to find an appropriate way to partition this table, but ultimately you would want to build it based on A) the result set you expect to see and B) the properties you expect to filter on in your WHERE clause.
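For example, a denormalized table for this query pattern might look like the following (a sketch; the keyspace, table name, and partitioning choice are assumptions to adapt to your own access pattern):

from cassandra.cluster import Cluster

# Hypothetical contact point and keyspace.
session = Cluster(["127.0.0.1"]).connect("travel")

# Pre-joined result set, partitioned by the source airport
# we expect to filter on.
session.execute("""
    CREATE TABLE IF NOT EXISTS destinations_by_src_airport (
        src_ap   text,
        dst_ap   text,
        dst_name text,
        PRIMARY KEY (src_ap, dst_ap)
    )
""")

# Populate at write time (or via an ETL job) instead of joining at read time.
session.execute(
    "INSERT INTO destinations_by_src_airport (src_ap, dst_ap, dst_name) "
    "VALUES (%s, %s, %s)",
    ("LAX", "JFK", "John F. Kennedy International"),
)

# The former join becomes a single-partition query:
rows = session.execute(
    "SELECT dst_ap, dst_name FROM destinations_by_src_airport WHERE src_ap = %s",
    ("LAX",),
)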
