Order by created date In Cassandra - cassandra

I have a problem with ordering data in a Cassandra database.
This is my table structure:
CREATE TABLE posts (
id uuid,
created_at timestamp,
comment_enabled boolean,
content text,
enabled boolean,
meta map<text, text>,
post_type tinyint,
summary text,
title text,
updated_at timestamp,
url text,
user_id uuid,
PRIMARY KEY (id, created_at)
) WITH CLUSTERING ORDER BY (created_at DESC)
When I run this query, I get the following message:
Query:
select * from posts order by created_at desc;
Message:
ORDER BY is only supported when the partition key is restricted by an EQ or an IN.
This query, on the other hand, returns data without sorting:
select * from posts

There are a couple of things you need to understand.
In your case the partition key is "id" and the clustering key is "created_at".
That means each row is stored in a partition chosen by the hash of "id" (the default partitioner is Murmur3). Inside that partition the rows are sorted by your clustering key, in your case "created_at".
So if you query data from that table, the results come back sorted by the clustering key, and the default sort direction is the one you specified when creating the table. However, there is a gotcha.
If you do not specify the partition key in the WHERE clause, the order of the result set depends on the hashed values of the partition key (in your case id).
So to get the posts in that specific order, you have to specify the partition key like this:
select * from posts WHERE id = ? order by created_at desc;
Note:
It is not necessary to specify the ORDER BY clause on a query if your desired sort direction (ASC or DESC) already matches the CLUSTERING ORDER in the table definition.
So the query above is essentially the same as:
select * from posts WHERE id = ?
You can read more about this here http://www.datastax.com/dev/blog/we-shall-have-order

The error message is pretty clear: you cannot use ORDER BY unless the WHERE clause restricts the partition key with an EQ or an IN. This is by design.
The data you get when running without a WHERE clause actually is ordered, not by your clustering key, but by the result of applying the token function to your partition key. You can verify the order by issuing:
SELECT token(id), id, created_at, user_id FROM posts;
where the token function arguments exactly match your PARTITION KEY.
I suggest you read this and this to understand what you can and can't do.
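To make the two levels of ordering concrete, here is a small Python model of an unrestricted scan. It is only a sketch: stand_in_token is a hypothetical hash used in place of the real Murmur3 token, and the string ids stand in for the uuid column.

```python
import hashlib
from datetime import datetime


def stand_in_token(partition_key: str) -> int:
    """Stand-in for token(id). This is NOT Murmur3, just a stable hash."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


def full_scan_order(rows):
    """Model of SELECT * without WHERE: partitions in token order,
    rows inside each partition in created_at DESC order."""
    # Python's sort is stable, so sort by the clustering key first
    # and by the partition token second.
    by_created = sorted(rows, key=lambda r: r["created_at"], reverse=True)
    return sorted(by_created, key=lambda r: stand_in_token(r["id"]))


rows = [
    {"id": "post-a", "created_at": datetime(2020, 1, 1)},
    {"id": "post-b", "created_at": datetime(2020, 1, 3)},
    {"id": "post-a", "created_at": datetime(2020, 1, 2)},
    {"id": "post-c", "created_at": datetime(2020, 1, 4)},
]

for row in full_scan_order(rows):
    print(stand_in_token(row["id"]), row["id"], row["created_at"])
```

Rows sharing an id stay together and are newest first, but nothing orders one partition relative to another by created_at, which is why the full scan looks unsorted.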

Related

Cassandra duplicate tables for different partition keys?

I have the following table, called inbox_items:
USE zwoop_chat;
CREATE TABLE IF NOT EXISTS inbox_items (
postId text,
userId text,
partnerId text,
fromUserId text,
fromNickName text,
fromAvatar text,
toUserId text,
toNickName text,
toAvatar text,
unread int static,
lastMessage text,
lastMessageDate timestamp,
PRIMARY KEY ((postId, userId), lastMessageDate)
) WITH CLUSTERING ORDER BY (lastMessageDate DESC);
The problem with this table is that I want to query it, both by postId and userId, as well as by userId only.
In other words, I have an inbox per post, but I have an inbox per user as well.
Afaik there is no good way to achieve this because:
The partition key uniquely determines the node where the data is stored, i.e. every partition key column must be restricted in the WHERE clause.
A secondary index is not a good fit for columns with high cardinality (in this case, postId has high cardinality).
The solution I currently see is to duplicate the table with different keys.
This feels like such an overkill though.
Is there a better solution I'm missing?
Assuming that partitioning by userId alone would not generate partitions that are too large, you can partition by userId and put postId in the clustering key. You said:
"The problem with this table is that I want to query it, both by postId and userId, as well as by userId only."
So in this case you do not need postId in the partition key, only in the clustering key. The only issue is if you also intend to query by postId alone, but that was not mentioned.
If partitioning by userId results in partitions that are too large, there are additional bucketing techniques available.
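One common bucketing approach is to add a time bucket to the partition key, so that a user's inbox is split into one partition per month. The sketch below assumes a hypothetical partition key of (userId, bucket) with a year-month string as the bucket; neither the column nor the helper names come from the question.

```python
from datetime import datetime, timezone


def inbox_partition_key(user_id: str, last_message_date: datetime) -> tuple:
    """Return the hypothetical (userId, bucket) partition key.
    Expects a timezone-aware datetime; the bucket is the UTC year-month."""
    utc = last_message_date.astimezone(timezone.utc)
    return (user_id, utc.strftime("%Y-%m"))


def buckets_between(start: datetime, end: datetime) -> list:
    """Year-month buckets covering [start, end], newest first."""
    buckets = []
    year, month = end.year, end.month
    while (year, month) >= (start.year, start.month):
        buckets.append(f"{year:04d}-{month:02d}")
        month -= 1
        if month == 0:
            year, month = year - 1, 12
    return buckets


key = inbox_partition_key("user-1", datetime(2024, 5, 17, tzinfo=timezone.utc))
print(key)
print(buckets_between(datetime(2023, 11, 5), datetime(2024, 2, 1)))
```

The reader then queries one bucket at a time, starting with the newest, and stops once it has enough rows. The trade-off is that reading a whole inbox now takes several queries instead of one.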

CQL query delete if not in list

I am trying to delete all rows in the table where the partition key is not in a list of guids.
Here's my table definition.
CREATE TABLE cloister.major_user (
user_id uuid,
user_handle text,
avatar text,
created_at timestamp,
email text,
email_verified boolean,
first_name text,
last_name text,
last_updated_at timestamp,
profile_type text,
PRIMARY KEY (user_id, user_handle)
) WITH CLUSTERING ORDER BY (user_handle ASC)
I want to retain certain user_ids and delete the rest. The following options have failed.
delete from juna_user where user_id ! in (0d70272c-8d24-43d0-9b2d-c62100b0e28e,0b7c0841-3a18-4c03-a211-f75690c93815,e96ba860-72cf-44d5-a6bd-5a9ec58827e3,729d7973-d4c4-42fb-94c4-d1ffd03b74cd,3bffa0c6-8b98-4f0c-bd7c-22d0662ab0a2)
delete from juna_user where user_id not in (0d70272c-8d24-43d0-9b2d-c62100b0e28e,0b7c0841-3a18-4c03-a211-f75690c93815,e96ba860-72cf-44d5-a6bd-5a9ec58827e3,729d7973-d4c4-42fb-94c4-d1ffd03b74cd,3bffa0c6-8b98-4f0c-bd7c-22d0662ab0a2)
delete from juna_user where user_id not in (0d70272c-8d24-43d0-9b2d-c62100b0e28e,0b7c0841-3a18-4c03-a211-f75690c93815,e96ba860-72cf-44d5-a6bd-5a9ec58827e3,729d7973-d4c4-42fb-94c4-d1ffd03b74cd,3bffa0c6-8b98-4f0c-bd7c-22d0662ab0a2) ALLOW FILTERING
What am I doing wrong?
CQL supports only the IN condition (see docs). You need to explicitly specify which primary keys or partition keys to delete. You can't use a NOT IN condition, because it could potentially match a huge amount of data. If you need to do that, you have to generate the list of entries to delete yourself; you can do that with the Spark Cassandra Connector, for example.
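For a table small enough to enumerate from a client, the same idea can be sketched in plain Python: read all user_id values, subtract the ones to keep, and issue one DELETE per remaining partition key. The helper names are hypothetical, the list of all ids is assumed to have been fetched already, and the statements are only built here, not executed.

```python
from uuid import UUID


def ids_to_delete(all_ids, keep_ids):
    """Return every id from all_ids that is not in keep_ids, in input order."""
    keep = {UUID(str(k)) for k in keep_ids}
    return [u for u in (UUID(str(x)) for x in all_ids) if u not in keep]


def delete_statements(ids, table="major_user"):
    """Build one (statement, parameters) pair per partition key to delete."""
    statement = f"DELETE FROM {table} WHERE user_id = ?"
    return [(statement, (user_id,)) for user_id in ids]


all_ids = [
    "0d70272c-8d24-43d0-9b2d-c62100b0e28e",
    "0b7c0841-3a18-4c03-a211-f75690c93815",
    "11111111-2222-3333-4444-555555555555",  # made-up id that is not kept
]
keep_ids = [
    "0d70272c-8d24-43d0-9b2d-c62100b0e28e",
    "0b7c0841-3a18-4c03-a211-f75690c93815",
]

for statement, params in delete_statements(ids_to_delete(all_ids, keep_ids)):
    print(statement, params)
```

For large tables, enumerating every partition key from a single client is itself expensive, which is where a tool like the Spark Cassandra Connector comes in.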

How to choose proper tables structure in cassandra?

Suppose I have a table with the following structure
create table tasks (
user_id uuid,
name text,
task_id uuid,
description text,
primary key ((user_id), name, task_id)
);
It allows me to get all tasks for a user sorted by name ascending. I also added task_id to the primary key to avoid upserts. The following query works
select * from tasks where user_id = ?
as well as
select * from tasks where user_id = ? and name > ?
However, I cannot get a task with a specific task_id. For example, the following query fails
select * from tasks where user_id = ? and task_id = ?
with this error
PRIMARY KEY column "task_id" cannot be restricted as preceding column "name" is not restricted
It requires the name column to be specified, but at that point I have only task_id (from the URL, for example) and user_id (from the session).
How should I create this table to support both queries? Or do I need to create a separate table for the second case? What is the common pattern in this situation?
You could simply add one more redundant column, taskId, with the same value as task_id and create a secondary index on taskId.
Then you can query with user_id = ? and taskId = ?
PRIMARY KEY column "task_id" cannot be restricted as preceding column "name" is not restricted
You are seeing this error because CQL does not permit queries to skip primary key components.
How should I create this table to support both queries? Or do I need to create a separate table for the second case? What is the common pattern in this situation?
As you suspect, the typical way that problems like this are solved with Cassandra is that an additional table is created for each query. In this case, recreating the table with a PRIMARY KEY designed to match your additional query pattern would simply look like this:
create table tasks_by_user_and_task (
user_id uuid,
name text,
task_id uuid,
description text,
primary key ((user_id), task_id)
);
You could simply add one more redundant column, taskId, with the same value as task_id and create a secondary index on taskId.
While I am usually not a fan of using secondary indexes, in this case it may perform ok. The reason is that you would still be restricting your query by partition key, which would eliminate the need to examine additional nodes. The drawback (as Undefined_variable pointed out) is that you cannot create a secondary index on a primary key component, so you would need to duplicate that column (and apply the index to the non-primary key column) to get that solution to work.
It might be a good idea to model and test both solutions for performance.
If you have the extra disk space, the best method would be to replicate the data in a second table. You should avoid using secondary indexes in production. Your application would, of course, need to write to both these tables. But Cassandra is darn good at making that efficient.
create table tasks_by_name (
user_id uuid,
name text,
task_id uuid,
description text,
primary key ((user_id), name, task_id)
);
create table tasks_by_id (
user_id uuid,
name text,
task_id uuid,
description text,
primary key ((user_id), task_id)
);
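On the application side, writing to both tables can be as simple as building the same insert twice. The following is a minimal sketch with a hypothetical task_writes helper; it only produces (statement, parameters) pairs and leaves execution, for example inside a logged batch, to whatever driver you use.

```python
from uuid import uuid4


def task_writes(user_id, name, task_id, description):
    """Build the two inserts that keep tasks_by_name and tasks_by_id in sync."""
    columns = "(user_id, name, task_id, description)"
    params = (user_id, name, task_id, description)
    return [
        (f"INSERT INTO tasks_by_name {columns} VALUES (?, ?, ?, ?)", params),
        (f"INSERT INTO tasks_by_id {columns} VALUES (?, ?, ?, ?)", params),
    ]


writes = task_writes(uuid4(), "write report", uuid4(), "quarterly numbers")
for statement, params in writes:
    print(statement, params)
```

One thing to keep in mind: name is part of the primary key in tasks_by_name, so renaming a task means deleting the old row there and inserting a new one, while tasks_by_id can simply be overwritten.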

Ordering result with an IN restriction on Pk

I'm inserting messages from multiple users and want to get the last messages belonging to a sub-selection of users, sorted in descending order by the "created" field.
MY TABLE (fetching messages via "user_id IN (...)"):
CREATE TABLE users (
user_id timeuuid,
created timestamp,
msg text,
PRIMARY KEY ((user_id), created)
)
WITH CLUSTERING ORDER BY (created DESC)
MY QUERY:
cqlsh:fb> SELECT user_id,created,msg FROM posts WHERE user_id IN (657818d6-9c7e-11e5-b392-978fb134d9c9,e2028f98-9c57-11e5-b96c-8863dfc615b7);
The result is sorted by "created" only within the set of messages for each "user_id".
However, I want the whole result to be sorted as "created DESC", i.e. the "LAST ONE" should be on top.
Please advise. Thank you!
You can achieve your goal by adding an ORDER BY clause to your query. For example:
SELECT user_id,created,msg FROM posts WHERE user_id IN (657818d6-9c7e-11e5-b392-978fb134d9c9,e2028f98-9c57-11e5-b96c-8863dfc615b7) order by created DESC;
Here are some different query cases and an explanation of their results.
Case 1:
If you provide neither an ORDER BY clause nor an IN clause, it works as follows by default.
SELECT user_id,created,msg FROM posts;
In the above query user_id is the partition key, and the order of partition keys depends on your partitioner. So if your partitioner is Murmur3Partitioner, the partitions are returned in hash order of user_id, and within each user_id the rows are returned in descending order of created.
Case 2:
If you provide the partition key in an IN clause but no ORDER BY clause.
SELECT user_id,created,msg FROM posts WHERE user_id IN (657818d6-9c7e-11e5-b392-978fb134d9c9,e2028f98-9c57-11e5-b96c-8863dfc615b7);
Then the partitions are returned in the given order (the order in which the user_id values appear in the IN clause), and within each partition the rows are in descending order of created.
You can dig deeper if you are aware of how CQL3 maps to Cassandra's internal data structure (refer to the slides from 43 onward). However, passing the partition key in an IN clause has a performance impact. Take that into account when designing your schema.
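Depending on the Cassandra version and the driver's paging settings, ORDER BY combined with an IN restriction on the partition key may be rejected unless paging is disabled, in which case the sort has to happen client side. Since each partition already comes back in created DESC order, a merge is enough. This is a minimal sketch with a hypothetical merge_newest_first helper and made-up rows:

```python
import heapq
from datetime import datetime


def merge_newest_first(*partitions):
    """Merge per-user row lists that are each already sorted by created DESC."""
    return list(
        heapq.merge(*partitions, key=lambda row: row["created"], reverse=True)
    )


user_a = [
    {"user_id": "a", "created": datetime(2015, 12, 7, 12, 0), "msg": "a2"},
    {"user_id": "a", "created": datetime(2015, 12, 7, 10, 0), "msg": "a1"},
]
user_b = [
    {"user_id": "b", "created": datetime(2015, 12, 7, 11, 0), "msg": "b1"},
]

for row in merge_newest_first(user_a, user_b):
    print(row["created"], row["user_id"], row["msg"])
```

heapq.merge does not re-sort its inputs, so this relies on the CLUSTERING ORDER BY (created DESC) of the table; with one query per user_id it also avoids the IN clause altogether.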

cassandra primary key column cannot be restricted

I am using Cassandra for the first time in a web app and I have run into a query problem.
Here is my table:
CREATE TABLE vote (
doodle_id uuid,
user_id uuid,
schedule_id uuid,
vote int,
PRIMARY KEY ((doodle_id), user_id, schedule_id)
);
On every request, I specify my partition key, doodle_id.
For example, I can run this without any problems:
select * from vote where doodle_id = c4778a27-f2ca-4c96-8669-15dcbd5d34a7 and user_id = 97a7378a-e1bb-4586-ada1-177016405142;
But with the last request I made:
select * from vote where doodle_id = c4778a27-f2ca-4c96-8669-15dcbd5d34a7 and schedule_id = c37df0ad-f61d-463e-bdcc-a97586bea633;
I got the following error:
Bad Request: PRIMARY KEY column "schedule_id" cannot be restricted (preceding column "user_id" is either not restricted or by a non-EQ relation)
I'm new to Cassandra, but correct me if I'm wrong: in a composite primary key, the first part is the PARTITION KEY, which is mandatory so that Cassandra knows where to look for the data.
The other parts are the CLUSTERING KEY, used to sort the data.
But I still don't get why my first request works and the second one doesn't.
If anyone could help, I would greatly appreciate it.
In Cassandra, you should design your data model to suit your queries. Therefore the proper way to support your second query (queries by doodle_id and schedule_id, but not necessarily with user_id) is to create a new table to handle that specific query. This table will be pretty much the same, except the PRIMARY KEY will be slightly different:
CREATE TABLE votebydoodleandschedule (
doodle_id uuid,
user_id uuid,
schedule_id uuid,
vote int,
PRIMARY KEY ((doodle_id), schedule_id, user_id)
);
Now this query will work:
SELECT * FROM votebydoodleandschedule
WHERE doodle_id = c4778a27-f2ca-4c96-8669-15dcbd5d34a7
AND schedule_id = c37df0ad-f61d-463e-bdcc-a97586bea633;
This gets you around having to specify ALLOW FILTERING. Relying on ALLOW FILTERING is never a good idea, and is certainly not something that you should do in a production cluster.
The clustering key is also used to find the columns within a given partition. With your model, you'll be able to query by:
doodle_id
doodle_id/user_id
doodle_id/user_id/schedule_id
user_id using ALLOW FILTERING
user_id/schedule_id using ALLOW FILTERING
You can think of your primary key as a file path, doodle_id#123/user_id#456/schedule_id#789, where all data is stored in the deepest folder (i.e. schedule_id#789). When you query, you have to indicate the subfolder/subtree from which to start searching.
Your second query doesn't work because of how columns are organized within a partition. Cassandra cannot get a continuous slice of columns in the partition because they are interleaved.
You should invert the primary key order (doodle_id, schedule_id, user_id) to be able to run your query.
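The file path analogy boils down to a prefix rule, which can be sketched in a few lines of Python. This is a deliberately simplified model with a hypothetical can_restrict helper: it only considers equality restrictions on clustering columns and ignores ALLOW FILTERING, secondary indexes and range queries.

```python
def can_restrict(clustering_columns, eq_restricted):
    """True if the equality-restricted columns form a prefix of the clustering key."""
    restricted = set(eq_restricted)
    unknown = restricted - set(clustering_columns)
    if unknown:
        raise ValueError(f"not clustering columns: {sorted(unknown)}")
    # A valid restriction uses exactly the first len(restricted) clustering columns.
    return set(clustering_columns[: len(restricted)]) == restricted


vote = ["user_id", "schedule_id"]                     # original table
votebydoodleandschedule = ["schedule_id", "user_id"]  # inverted table

print(can_restrict(vote, ["user_id"]))                         # first query
print(can_restrict(vote, ["schedule_id"]))                     # second query
print(can_restrict(votebydoodleandschedule, ["schedule_id"]))  # after inverting
```

In this model the second query skips user_id, so it is not a prefix of the original key, while the same restriction is a valid prefix once schedule_id comes first.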
