Query using composite keys, other than Row Key in Cassandra - cassandra

I want to query data filtering by composite keys other than Row Key in CQL3.
These are my queries:
CREATE TABLE grades (id int,
date timestamp,
subject text,
status text,
PRIMARY KEY (id, subject, status, date)
);
When I try and access the data,
SELECT * FROM grades where id = 1098; //works fine
SELECT * FROM grades where subject = 'English' ALLOW FILTERING; //works fine
SELECT * FROM grades where status = 'Active' ALLOW FILTERING; //gives an error
Bad Request: PRIMARY KEY part status cannot be restricted (preceding part subject is either not restricted or by a non-EQ
relation)
Just to experiment, I shuffled the keys around keeping 'id' as my Primary Row Key always. I am always ONLY able to query using either the Primary Row key or the second key, considering above example, if I swap subjects and status in Primary Key list, I can then query with status but I get similar error if I try to do by subject or by time.
Am I doing something wrong? Can I not query data using any other composite key in CQL3?
I'm using Cassandra 1.2.6 and CQL3.

That looks all normal behavior according to Cassandra Composite Key model (http://www.datastax.com/docs/1.2/cql_cli/cql/SELECT). Cassandra data model aims (and this is a general NoSQL way of thinking) at granting that queries are performant, that comes to the expense of "restrictions" on the way you store and index your data, and then how you query it, namely you "always need to restrict the preceding part of subject" on the primary key.
You cannot swap elements on the primary key list on the queries (that is more a SQL way of thinking). You always need to "Constraint"/"Restrict" the previous element of the primary key if you are to use multiple elements of the composite key. This means that if you have composite key = (id, subject, status, date) and want to query "status", you will need to restrict "id" and/or "subject" ("or" is possible in case you use "allow filtering", i.e., you can restrict only "subject" and do not need to restrict "id"). So, if you want to query on "status" you will b able to query in two different ways:
select * from grades where id = '1093' and subject = 'English' and status = 'Active';
Or
select * from grades where subject = 'English' and status = 'Active' allow filtering;
The first is for a specific "student", the second for all the "students" on the subject in status = "Active".

Related

Why am I getting this error when I run the query?

When attempting to perform this query:
select race_name from sport_app.month_category_runner where race_type = 'URBAN RACE 10K' and club = 'CORNELLA ATLETIC';
I get the following error:
Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING
It is an exercise, so I am not allowed to use ALLOW FILTERING.
So I have created two indexes in this way:
create index raceTypeIndex ON sport_app.month_category_runner(race_type);
create index clubIndex ON sport_app.month_category_runner(club);
But I keep getting the same error, am I missing something, or is there an alternative?
Table Structure:
CREATE TABLE month_category_runner (month text,
category text,
runner_id text,
club text,
race_name text,
race_type text,
race_date timestamp,
total_runners int,
net_time time,
PRIMARY KEY (month, category, runner_id, race_name, net_time));
Note if you add the "ALLOW FILTERING" the query will run on all the nodes of Cassandra cluster and can have a large impact on all nodes.
The recommendation is to add the partition as condition of your query, to allow the query to be executed on needed nodes only.
Example:
select race_name from month_category_runner where month = 'may' and club = 'CORNELLA ATLETIC';
select race_name from month_category_runner where month = 'may' and race_type = 'URBAN RACE 10K';
select race_name from month_category_runner where month = 'may' and race_type = 'URBAN RACE 10K' and club = 'CORNELLA ATLETIC' ALLOW FILTERING;
Your primary key is composed by (month, category, runner_id, race_name, net_time) and the column month is the partition, so this column must be on your query filter as i showed in example.
The query that you want to do using two columns that are not in primary key despite the index column exist, you need to use the ALLOW FILTERING that can have performance impact;
The other option is create a new table where the primary key contains theses columns.

Cassandra cql - sorting unique IDs by time

I am writting messaging chat system, similar to FB messaging. I did not find the way, how to effectively store conversation list (each row different user with last sent message most recent on top). If I list conversations from this table:
CREATE TABLE "conversation_list" (
"user_id" int,
"partner_user_id" int,
"last_message_time" time,
"last_message_text" text,
PRIMARY KEY ("user_id", "partner_user_id")
)
I can select from this table conversations for any user_id. When new message is sent, we can simply update the row:
UPDATE conversation_list SET last_message_time = '...', last_message_text='...' WHERE user_id = '...' AND partner_user_id = '...'
But it is sorted by clustering key of course. My question: How to create list of conversations, which is sorted by last_message_time, but partner_user_id will be unique for given user_id?
If last_message_time is clustering key and we delete the row and insert new (to keep partner_user_id unique), I will have many so many thumbstones in the table.
Thank you.
A slight change to your original model should do what you want:
CREATE TABLE conversation_list (
user_id int,
partner_user_id int,
last_message_time timestamp,
last_message_text text,
PRIMARY KEY ((user_id, partner_user_id), last_message_time)
) WITH CLUSTERING ORDER BY (last_message_time DESC);
I combined "user_id" and "partner_user_id" into one partition key. "last_message_time" can be the single clustering column and provide sorting. I reversed the default sort order with the CLUSTERING ORDER BY to make the timestamps descending. Now you should be able to just insert any time there is a message from a user to a partner id.
The select now will give you the ability to look for the last message sent. Like this:
SELECT last_message_time, last_message_text
FROM conversation_list
WHERE user_id= ? AND partner_user_id = ?
LIMIT 1

cassandra Select on indexed columns and with IN clause for the PRIMARY KEY are not supported

In Cassandra, I'm using the cql:
select msg from log where id in ('A', 'B') and filter1 = 'filter'
(where id is the partition key and filter1 is a secondary index and filter1 cannot be used as a cluster column)
This gives the response:
Select on indexed columns and with IN clause for the PRIMARY KEY are not supported
How can I change CQL to prevent this?
You would need to split that up into separate queries of:
select msg from log where id = 'A' and filter1 = 'filter';
and
select msg from log where id = 'B' and filter1 = 'filter';
Due to the way data is partitioned in Cassandra, CQL has a lot of seemingly arbitrary restrictions (to discourage inefficient queries and also because they are complex to implement).
Over time I think these restrictions will slowly be removed, but for now we have to work around them. For more details on the restrictions, see A deep look at the CQL where clause.
Another option, is that you could build a table specifically for this query (a query table) with filter1 as a partition key and id as a clustering key. That way, your query works and you avoid having a secondary index all-together.
aploetz#cqlsh:stackoverflow> CREATE TABLE log
(filter1 text,
id text,
msg text,
PRIMARY KEY (filter1, id));
aploetz#cqlsh:stackoverflow> INSERT INTO log (filter1, id, msg)
VALUES ('filter','A','message A');
aploetz#cqlsh:stackoverflow> INSERT INTO log (filter1, id, msg)
VALUES ('filter','B','message B');
aploetz#cqlsh:stackoverflow> INSERT INTO log (filter1, id, msg)
VALUES ('filter','C','message C');
aploetz#cqlsh:stackoverflow> SELECT msg FROM log
WHERE filter1='filter' AND id IN ('A','B');
msg
-----------
message A
message B
(2 rows)
You would still be using an "IN" which isn't known to perform well either. But you would also be specifying a partition key, so it might perform better than expected.

cassandra, select via a non primary key

I'm new with cassandra and I met a problem. I created a keyspace demodb and a table users. This table got 3 columns: id (int and primary key), firstname (varchar), name (varchar).
this request send me the good result:
SELECT * FROM demodb.users WHERE id = 3;
but this one:
SELECT * FROM demodb.users WHERE firstname = 'francois';
doesn't work and I get the following error message:
InvalidRequest: code=2200 [Invalid query] message="No secondary indexes on the restricted columns support the provided operators: "
This request also doesn't work:
SELECT * FROM users WHERE firstname = 'francois' ORDER BY id DESC LIMIT 5;
InvalidRequest: code=2200 [Invalid query] message="ORDER BY with 2ndary indexes is not supported."
Thanks in advance.
This request also doesn't work:
That's because you are mis-understanding how sort order works in Cassandra. Instead of using a secondary index on firstname, create a table specifically for this query, like this:
CREATE TABLE usersByFirstName (
id int,
firstname text,
lastname text,
PRIMARY KEY (firstname,id));
This query should now work:
SELECT * FROM usersByFirstName WHERE firstname='francois'
ORDER BY id DESC LIMIT 5;
Note, that I have created a compound primary key on firstname and id. This will partition your data on firstname (allowing you to query by it), while also clustering your data by id. By default, your data will be clustered by id in ascending order. To alter this behavior, you can specify a CLUSTERING ORDER in your table creation statement:
WITH CLUSTERING ORDER BY (id DESC)
...and then you won't even need an ORDER BY clause.
I recently wrote an article on how clustering order works in Cassandra (We Shall Have Order). It explains this, and covers some ordering strategies as well.
There is one constraint in cassandra: any field you want to use in the where clause has to be the primary key of the table or there must be a secondary index on it. So you have to create an index to firstname and only after that you can use firstname in the where condition and you will get the result you were expecting.

Cassandra CQL - clustering order with multiple clustering columns

I have a column family with primary key definition like this:
...
PRIMARY KEY ((website_id, item_id), user_id, date)
which will be queried using queries such as:
SELECT * FROM myCF
WHERE website_id = 30 AND item_id = 10
AND user_id = 0 AND date > 'some_date' ;
However, I'd like to keep my column family ordered by date only, such as SELECT date FROM myCF ; would return the most recent inserted date.
Due to the order of clustering columns, what I get is an order per user_id then per date.
If I change the primary key definition to:
PRIMARY KEY ((website_id, item_id), date, user_id)
I can no longer run the same query, as date must be restricted is user_id is.
I thought there might be some way to say:
...
PRIMARY KEY ((website_id, shop_id), store_id, date)
) WITH CLUSTERING ORDER BY (store_id RANDOMPLEASE, date DESC) ;
But it doesn't seem to exist. Worst, maybe this is completely stupid and I don't get why.
Is there any ways of achieving this? Am I missing something?
Many thanks!
Your query example restricts user_id so that should work with the second table format. But if you are actually trying to run queries like
SELECT * FROM myCF
WHERE website_id = 30 AND item_id = 10
AND date > 'some_date'
Then you need an additional table which is created to handle those queries, it would only order on Date and not on user id
Create Table LookupByDate ... PRIMARY KEY ((website_id, item_id), date)
In addition to your primary query, if all you try to get is "return the most recent inserted date", you may not need an additional table. You can use "static column" to store the last update time per partition. CASSANDRA-6561
It probably won't help your particular case (since I imagine your list of all users is unmanagably large), but if the condition on the first clustering column is matching one of a relatively small set of values then you can use IN.
SELECT * FROM myCF
WHERE website_id = 30 AND item_id = 10
AND user_id IN ? AND date > 'some_date'
Don't use IN on the partition key because this will create an inefficient query that hits multiple nodes putting stress on the coordinator node. Instead, execute multiple asynchronous queries in parallel. But IN on a clustering column is absolutely fine.

Resources