IN operator for non prime attributes - cassandra

Can I use the IN operator for non-primary key attributes in Cassandra? Or any other methods to alternative instead of using IN in the query?
SELECT * FROM abc WHERE domain IN ('domain1','domain2') allow filtering;
Error from server: code=2200 [Invalid query] message="IN predicates on non-primary-key columns (domain) is not yet supported"

can you recommend any alternative for IN operator in cassandra for my purpose?
Remember, it is still possible to use the IN operator on a partition key. It wouldn't say it's recommended, but it should be fine with a small number of parameters. First though, you'd have to rebuild your table, or build a new table to support that query.
As I don't know your exact table definition, I'm going make some assumptions (like abc_id).
If you recreated your table similar to this:
CREATE TABLE abc_by_domain (
domain TEXT,
abc_id TEXT,
value TEXT,
PRIMARY KEY (domain,abc_id));
Now write some data to it, and then this works:
SELECT * FROM abc_by_domain WHERE domain IN ('domain1','domain2');
domain | abc_id | value
---------+--------+---------
domain1 | 1 | 1st row
domain2 | 2 | 2nd row
(2 rows)
Notes:
I assumed a current, single primary key of abc_id. Essentially, I made that a clustering key to ensure that the underlying rows now partitioned by domain would still be unique. In your case, take whatever key column enforces uniquness in the abc table, and use that as a clustering key to accomplish the same thing.
As per my warning above, this is known as a "multi-key query" which is an anti-pattern in Cassandra. The problem, is that Cassandra cannot guarantee that data on two partitions will be on the same node, so it essentially picks a coordinator and runs two queries behind the scenes. For two parameters, it's probably not too bad. But I would try to keep that in single digits.

Related

Optimization of a query which uses arithmetic operations in WHERE clause

I need to retrieve records where the expiration date is today. The expiration date is calculated dynamically using two other fields (startDate and durationDays):
SELECT * FROM subscription WHERE startDate + durationDays < currentDate()
Does it make sense to add two indexes for these two columns? Or should I consider adding a new column expirationDate and create an index for it only?
SELECT * FROM subscription WHERE startDate + durationDays < currentDate()
I'm wondering how does Cassandra handle such a filter as in my example? Does it make a full scan?
First of all, your question is predicated on CQL's ability to perform (date) arithmetic. It cannot.
> SELECT * FROM subscription WHERE startDate + durationDays < currentDate();
SyntaxException: line 1:43 no viable alternative at input '+' (SELECT * FROM subscription WHERE [startDate] +...)
Secondly the currentDate() function does not exist in Cassandra 3.11.4.
> SELECT currentDate() FROM system.local;
InvalidRequest: Error from server: code=2200 [Invalid query] message="Unknown function 'currentdate'"
That does work in Cassandra 4.0, which as it has not been released yet, you really shouldn't be using.
So let's assume that you've created your secondary indexes on startDate and durationDays and you're just querying on those, without any arithmetic.
Does it execute a full table scan?
ABSOLUTELY.
The reason, is that querying solely on secondary index columns does not have a partition key. Therefore, it has to search for these values on all partitions on all nodes. In a large cluster, your query would likely time out.
Also, when it finds matching data, it has to keep querying. As those values are not unique; it's entirely possible that there are several results to be returned. Carlos in 100% correct is advising you to rebuild your table based on what you want to query.
Recommendations:
Try not to build a table with secondary indexes. Like ever.
If you have to build a table with secondary indexes, try to have a partition key in your WHERE clause to keep the query isolated to a single node.
Any filtering on dynamic (computed) values needs to be done on the application side.
In your case, it might make more sense to create a column called expirationDate, do your date arithmetic in your app, and then INSERT that value into your table.
You'll also want follow the "time bucket" pattern for handling time series data (which is what this appears to be). Say that month works as a "bucket" (it may or may not for your use case). PRIMARY KEY ((month),expirationDate,id) would be a good key. This way, all the subscriptions for a particular month are stored together, clustered by expirationDate, with id on the end to act as a tie-breaker for uniqueness.
One of the main differences between Cassandra and relational databases is that the definition of the tables depend on the query that will be used. The conditional of how the data will be retrieved (WHERE statement) should be included in the primary key as it will perform better than an index on the table.
There are multiple resources regarding the read path, and the quirks of primary keys vs indexes, this talk from the Cassandra Summit may be useful.

Why Secondary Index ( = ?) and Clustering Columns (order by) CANNOT be used together for CQL Query?

EDIT: a related jira ticket
A query in pattern select * from <table> where <partition_keys> = ? and <secondary_index_column> = ? order by <first_clustering_column> desc does not work, with error msg:
InvalidRequest: Error from server: code=2200 [Invalid query] message="ORDER BY with 2ndary indexes is not supported."
From the structure of index table, above query include the partition key, and first two cluster columns in the index table. Also, note that without order by clause, the result is sorted by clustering column as CLUSTERING ORDER.
Is there any way to make the query work? If not, why?
Data in Cassandra is naturally stored based on the sort order of Clustering Columns.
Secondary index in Cassandra is way different than a corresponding index in relation database. Its local per node, which means that its contents aren't known to other nodes of the cluster. So sorting by this index is highly impossible. Also within the node, the secondary indexes are holding just pointers to corresponding partition key.
If you need sorting to be performed by Cassandra, have them as clustering columns. Otherwise you can sort them in code, after you retrieve the results.
Also secondary indexes aren't ideal for Cassandra and definitely a better model is to not have them in first place, to save some headache for future.

Where and Order By Clauses in Cassandra CQL

I am new to NoSQL database and have just started using apache Cassandra. I created a simple table "emp" with primary key on "empno" column. This is a simple table as we always get in Oracle's default scott schema.
Now I loaded data using the COPY command and issued query Select * from emp order by empno but I was surprised that CQL did not allow Order by on empno column (which is PK). Also when I used Where condition, it did not allow any inequality operations on empno column (it said only EQ or IN conditions are allowed). It also did not allowed Where and Order by on any other column, as they were not used in PK, and did not have an index.
Can someone please help me what should I do if I want to keep empno unique in the table and want a query results in Sorted order of empno?
(My version is:
cqlsh:demodb> show version
[cqlsh 5.0.1 | Cassandra 2.2.0 | CQL spec 3.3.0 | Native protocol v4]
)
There are two parts to a PRIMARY KEY in Cassandra:
partition key(s)
clustering key(s)
PRIMARY KEY (partitionKey1,clusteringKey1,clusteringKey2)
or
PRIMARY KEY ((partitionKey1,partitionKey2),clusteringKey1,clusteringKey2)
The partition key determines which node(s) your data is stored on. The clustering key determines the order of the data within your partition key.
In CQL, the ORDER BY clause is really only used to reverse the defined sort direction of your clustering order. As for the columns themselves, you can only specify the columns defined (and in that exact order...no skipping) in your CLUSTERING ORDER BY clause at table creation time. So you cannot pick arbitrary columns to order your result set at query-time.
Cassandra achieves performance by using the clustering keys to sort your data on-disk, thereby only returning ordered rows in a single read (no random reads). This is why you must take a query-based modeling approach (often duplicating your data into multiple query tables) with Cassandra. Know your queries ahead of time, and build your tables to serve them.
Select * from emp order by empno;
First of all, you need a WHERE clause. It's ok to query without it, if you're working with a relational database. With Cassandra, you should do your best to avoid unbound SELECT queries. Besides, Cassandra can only enforce a sort order within a partition, so querying without a WHERE clause won't return data in the order you want, anyway.
Secondly, as I mentioned above, you need to define clustering keys. If you want to order your result set by empno, then you must find another column to define as your partition key. Try something like this:
CREATE TABLE emp_by_dept (
empno text,
dept text,
name text,
PRIMARY KEY (dept,empno)
) WITH CLUSTERING ORDER BY (empno ASC);
Now, I can query employees by department, and they will be returned to me ordered by empno:
SELECT * FROM emp_by_dept WHERE dept='IT';
But to be clear, you will not be able to query every row in your table, and have it ordered by a single column. The only way to get meaningful order into your result sets, is first partition your data in a way that makes sense to your business case. Running an unbound SELECT will return all of your rows (assuming that the query doesn't time-out while trying to query every node in your cluster), but result set ordering can only be enforced within a partition. So you have to restrict by partition key in order for that to make any sense.
My apologies for self-promoting, but last year I wrote an article for DataStax called We Shall Have Order!, in which I addressed how to solve these types of problems. Give it a read and see if it helps.
Edit for additional questions:
From your answer I concluded 2 things about Cassandra:
(1) There is no
way of getting a result set which is only order by a column that has
been defined as Unique.
(2) When we define a PK
(partition-key+clustering-key), then the results will always be order
by Clustering columns within any fixed partition key (we must restrict
to one partition-key value), that means there is no need of ORDER BY
clause, since it cannot ever change the order of rows (the order in
which rows are actually stored), i.e. Order By is useless.
1) All PRIMARY KEYs in Cassandra are unique. There's no way to order your result set by your partition key. In my example, I order by empno (after partitioning by dept). – Aaron 1 hour ago
2) Stopping short of saying that ORDER BY is useless, I'll say that its only real use is to switch your sort direction between ASC and DESC.
I created an index on "empno" column of "emp" table, it is still not
allowing ORDER BY empno. So, what Indexes are for? are they only for
searching records for specific value of index key?
You cannot order a result set by an indexed column. Secondary indexes are (not the same as their relational counterparts) really only useful for edge-case, analytics-based queries. They don't scale, so the general recommendation is not to use secondary indexes.
Ok, that simply means that one table cannot be used for getting
different result sets with different conditions and different sorting
order.
Correct.
Hence for each new requirement, we need to create a new table.
IT means if we have a billion rows in a table (say Sales table), and
we need sum of sales (1) Product-wise, (2) Region-wise, then we will
duplicate all those billion rows in 2 tables with one in clustering
order of Product, the other in clustering order of Region,. and even
if we need to sum sales per Salesman_id, then we build a 3rd table,
again putting all those billion rows? is it sensible?
It's really up to you to decide how sensible it is. But lack of query flexibility is a drawback of Cassandra. To get around it you can keep creating query tables (I.E., trading disk for performance). But if it gets to a point where it becomes ungainly or difficult to manage, then it's time to think about whether or not Cassandra is really the right solution.
EDIT 20160321
Hi Aaron, you said above "Stopping short of saying that ORDER BY is useless, I'll say that its only real use is to switch your sort direction between ASC and DESC."
But i found even that is not correct. Cassandra only allows ORDER by in the same direction as we define in the "CLUSTERING ORDER BY" caluse of CREATE TABLE. If in that clause we define ASC, it allows only order by ASC, and vice versa.
Without seeing an error message, it's hard to know what to tell you on that one. Although I have heard of queries with ORDER BY failing when you have too many rows stored in a partition.
ORDER BY also functions a little odd if you specify multiple columns to sort by. If I have two clustering columns defined, I can use ORDER BY on the first column indiscriminately. But as soon as I add the second column to the ORDER BY clause, my query only works if I specify both sort directions the same (as the CLUSTERING ORDER BY definition) or both different. If I mix and match, I get this:
InvalidRequest: code=2200 [Invalid query] message="Unsupported order by relation"
I think that has to do with how the data is stored on-disk. Otherwise Cassandra would have more work to do in preparing result sets. Whereas if it requires everything to either to match or mirror the direction(s) specified in the CLUSTERING ORDER BY, it can just relay a sequential read from disk. So it's probably best to only use a single column in your ORDER BY clause, for more predictable results.
Adding a redux answer as the accepted one is quite long.
Order by is currently only supported on the clustered columns of the PRIMARY KEY
and when the partition key is restricted by an Equality or an IN operator in where clause.
That is if you have your primary key defined like this :
PRIMARY KEY ((a,b),c,d)
Then you will be able to use the ORDER BY when & only when your query has :
a where clause with all the primary key restricted either by an equality operator (=) or an IN operator such as :
SELECT * FROM emp WHERE a = 1 AND b = 'India' ORDER BY c,d;
SELECT * FROM emp WHERE a = 1 AND b = 'India' ORDER BY c;
These two query are the only valid ones.
Also this query would not work :
SELECT * FROM emp WHERE a = 1 AND b = 'India' ORDER BY d,c;
because order by currently only support the ordering of columns following their declared order in the PRIMARY KEY that is in primary key definition c has been declared before d and the query violates the ordering by placing d first.

Cassandra - querying on clustering keys

I am just getting start on Cassandra and I was trying to create tables with different partition and clustering keys to see how they can be queried differently.
I created a table with primary key of the form - (a),b,c where a is the partition key and b,c are clustering key.
When querying I noticed that the following query:
select * from tablename where b=val;
results in:
Cannot execute this query as it might involve data filtering and thus
may have unpredictable performance. If you want to execute this query
despite the performance unpredictability, use ALLOW FILTERING
And using "ALLOW FILTERING" gets me what I want (even though I've heard its bad for performance).
But when I run the following query:
select * from tablename where c=val;
It says:
PRIMARY KEY column "c" cannot be restricted (preceding column "b" is either not restricted or by a non-EQ relation)
And there is no "ALLOW FILTERING" option at all.
MY QUESTION IS - Why are all clustering keys not treated the same? column b which is adjacent to the partition key 'a' is given an option of 'allow filtering' which allows querying on it while querying on column 'c' does not seem possible at all (given the way this table is laid out).
ALLOW FILTERING gets cassandra to scan through all SSTables and get the data out of it when the partition key is missing, then why cant we do the same column c?
It's not that clustering keys are not treated the same, it's that you can't skip them. This is because Cassandra uses the clustering keys to determine on-disk sort order within a partition.
To add to your example, assume PRIMARY KEY ((a),b,c,d). You could run your query (with ALLOW FILTERING) by specifying just b, or b and c. But it wouldn't allow you to specify c and d (skipping b) or b and d (skipping c).
And as a side node, if you really want to be able to query by only b or only c, then you should support those queries with additional tables designed as such. ALLOW FILTERING is a band-aid, and is not something you should ever do in a production Cassandra deployment.

Cassandra query on Map - Contains Clause [duplicate]

This question already has answers here:
SELECT Specific Value from map
(3 answers)
Closed 7 years ago.
I am trying to query a table containing Map. Is it possible to apply contains clause on map data type table?
CREATE TABLE data.Table1 (
fetchDataMap map<text, frozen<Config>>,
userId text ,
PRIMARY KEY(userId)
);
Getting following Error:
cqlsh> SELECT * FROM data.Table1 WHERE fetchDataMap CONTAINS '233322554843924';
InvalidRequest: code=2200 [Invalid query] message="No secondary indexes on
the restricted columns support the provided operators: "
Please enlighten me with better query approach on this requirement.
For this to work, you have to create a secondary index on the map. But, you first have to ask yourself if you want to index your map keys or values (cannot do both). Given your CQL statement, I'll assume that you want to index your map key (and we'll go from there).
CREATE INDEX table1_fetchMapKey ON table1(KEYS(fetchDataMap));
After inserting some data (making a guess as to what your Config UDT looks like), I can SELECT with a slightly modified version of your CQL query above:
aploetz#cqlsh:stackoverflow> SELECT * FROm table1 WHERE
fetchDataMap CONTAINS KEY '233322554843924';
userid | fetchdatamap
--------+------------------------------------------------------------
B26354 | {'233322554843924': {key: 'location', value: '~/scripts'}}
(1 rows)
Note that I cannot in good conscience provide you with this solution, without passing along a link to the DataStax doc When To Use An Index. Secondary indexes are known to not perform well. So I can only imagine that a secondary index on a collection would perform worse, but I suppose that really depends on the relative cardinality. If it were me, I would re-model my table to avoid using a secondary index, if at all possible.

Resources