Ordering result with an IN restriction on Pk - cassandra

I'm inserting messages from multiple users and want to get last messages belonging to a sub-selection of users sorted in a descending order by the the "created" field.
MY TABLE (fetching messages via "user_id IN (...)"):
CREATE TABLE users (
user_id timeuuid,
created timestamp,
msg text,
PRIMARY KEY ((user_id), created)
)
WITH CLUSTERING ORDER BY (created DESC)
MY QUERY:
cqlsh:fb> SELECT user_id,created,msg FROM posts WHERE user_id IN (657818d6-9c7e-11e5-b392-978fb134d9c9,e2028f98-9c57-11e5-b96c-8863dfc615b7);
The result is sorted by "created" only within a set of messages per each "user_id", see below (red line dividing 2 different users).
However I want the whole result be sorted as "created DESC", i.e. "LAST ONE" should be on the top.
Please advise. Thank you!

You can achieve your goal by adding order by clause in your query, For example,
SELECT user_id,created,msg FROM posts WHERE user_id IN (657818d6-9c7e-11e5-b392-978fb134d9c9,e2028f98-9c57-11e5-b96c-8863dfc615b7) order by created DESC;
Here are the some different cases of query and their result explanation.
Case 1:
If you didn't provide the order by clause and IN clause, then it will work as following by default.
SELECT user_id,created,msg FROM posts;
The partition key order is based on your partitioner type. In the above query user_id is the partition key, those partition key order is based on your partitioner type. So if your partitioner is Murmur3Partitioner then the result will be retrieved in hash order of user_id and the created column will be retrieved in descending order with respect to the user_id.
Case 2:
If you provide the partition key in IN clause but not ORDER BY clause.
SELECT user_id,created,msg FROM posts WHERE user_id IN (657818d6-9c7e-11e5-b392-978fb134d9c9,e2028f98-9c57-11e5-b96c-8863dfc615b7);
Then the result will be retrieved in the given order (order in which the query has in clause) of user_id partition key and created column will be in descending order with respect to the partition key.
You can dig more information if you are aware of how CQL3 Maps to Cassandra's Internal Data Structure.(Refer slides from 43). But how ever if the partition key is passed IN clause it will impact in the performance. Design your schema by considering these impacts too.

Related

Cassandra CLUSTERING ORDER with updates [performance]

With Cassandra it is possible to specify the cluster ordering on a table with a particular column.
CREATE TABLE myTable (
user_id INT,
message TEXT,
modified DATE,
PRIMARY KEY ((user_id), modified)
)
WITH CLUSTERING ORDER BY (modified DESC);
Note: In this example, there is one message per user_id (intended)
Given this table my understanding is that the query's performance will be better in cases where recent data is queried.
However, if one where to make updates to the "modified" column does it add extra overhead on the server to "re-order" and is that overhead vs query performance significant?
In other words given this table would it perform better if the "CLUSTERING ORDER BY (modified DESC)" was dropped?
UPDATE: Updated the invalid CQL by adding modified to primary key, however, the original questions still stand.
In order to make modified a clustering column, it needs to be defined in the primary key.
CREATE TABLE myTable (
user_id INT,
message TEXT,
modified DATE,
PRIMARY KEY ((user_id), modified)
)
WITH CLUSTERING ORDER BY (modified DESC);
This way, your data will be sorted primarily by the hashed value of the user_id, and within each user_id by modified. You don't need to drop the "WITH CLUSTERING ORDER BY (modified DESC)"
Moving the comment as an answer, as reply of the updated question:
if one where to make updates to the "modified" column does it add
extra overhead on the server to "re-order" and is that overhead vs
query performance significant?
If modified is defined as part of the clustering key, you won't be able to update that record, but you will be able to add as many records as needed, each time with a different modified date.
Cassandra is an append-only database engine: this means that any update to the records will add a new record with a different timestamp, a select will consider the records with the latest timestamp. This means that there is no "re-order" operation.
Dropping or creating the clustering order should be defined in base of the query of how the information will be retrieved, if you are going to use only the latest records of that user_id, it makes sense to have the clustering order as you defined it.
in your data model user_id is a rowkey/shardkey/partition key (userid) that is important for data locality and the clustering column (modified) specifies the order that the data is arranged inside the partition. combination of these two keys makes the primary key.
Even in RDBS world, updating PK is avoidble for sake of data integrity.
however in cassandra there is no constraints/relation between column families/tables.
Assigning exact same values to Pk fields(userid,modified) will result in update the existing record else it will add set of fields.
refence:
https://www.datastax.com/dev/blog/we-shall-have-order

Order by created date In Cassandra

i have problem with ordering data in cassandra Database.
this is my table structure:
CREATE TABLE posts (
id uuid,
created_at timestamp,
comment_enabled boolean,
content text,
enabled boolean,
meta map<text, text>,
post_type tinyint,
summary text,
title text,
updated_at timestamp,
url text,
user_id uuid,
PRIMARY KEY (id, created_at)
) WITH CLUSTERING ORDER BY (created_at DESC)
and when i run this query, i got the following message:
Query:
select * from posts order by created_at desc;
message:
ORDER BY is only supported when the partition key is restricted by an EQ or an IN.
Or this query return data without sorting:
select * from posts
There are couple of things you need to understand,
In your case the partition key is "id" and the clustering key is "created_at".
what that essentially means is any row will be stored in a partition based on the hash of "id"(depending on your hashing scheme by default it is Murmur3), now inside that partition the data is sorted based on your clustering key, in your case "created_at".
So if you query some data from that table by default the results which come are sorted based on your clustering order and the default sort order is the one which you specify while creating the table. However there is a gotcha there.
If yo do not specify the partition key in the WHERE clause, the actual order of the result set then becomes dependent on the hashed values of partition key(in your case id).
So in order to get the posts by that specific order. you have to specify the partition key like this
select * from posts WHERE id=1 order by created_at desc;
Note:
It is not necessary to specify the ORDER BY clause on a query if your desired sort direction (“ASCending/DESCending”) already matches the CLUSTERING ORDER in the table definition.
So essentially the above query is same as
select * from posts WHERE id=1
You can read more about this here http://www.datastax.com/dev/blog/we-shall-have-order
The error message is pretty clear: you cannot ORDER BY without restricting the query with a WHERE clause. This is by design.
The data you get when running without a WHERE clause actually are ordered, not with your clustering key, but by applying the token function to your partition key. You can verify the order by issuing:
SELECT token(id), id, created_at, user_id FROM posts;
where the token function arguments exactly match your PARTITION KEY.
I suggest you to read this and this to understand what you can/can't do.

Get first row for each partition key in Cassandra

I am considering Cassandra as an intermediate storage during my ETL job to perform data deduplication.
Let's imagine I have a stream of events, each of them have some business entity id, timestamp and some value. I need to get only latest value in terms of in-event timestamp for each business key, but events may come unordered.
My idea was to create staging table with business id as a partition key and timestamp as a clustering key:
CREATE TABLE sample_keyspace.table1_copy1 (
id uuid,
time timestamp,
value text,
PRIMARY KEY (id, time)
) WITH CLUSTERING ORDER BY ( time DESC )
Now if I insert some data in this table I can get latest value for some given partition key:
select * from table1 where id = 96b29b4b-b60b-4be9-9fa3-efa903511f2d limit 1;
But that would require to issue such query for every business key I'm interested in.
Is there some effective way I could do it in CQL?
I know we have an ability to list all available partition keys (by select distinct id from table1). So if I look into storage model of Cassandra, getting first row for each partition key should not be too hard.
Is that supported?
If you're using a version after 3.6, there is an option on your query named PER PARTITION LIMIT (CASSANDRA-7017) which you can set to 1. This won't auto complete in cqlsh until 3.10 with CASSANDRA-12803.
SELECT * FROM table1 PER PARTITION LIMIT 1;
In a word: no.
The partitioning key is why Cassandra can work essentially any amount of data: It decides where to put/look for data using the hash of the partitioning key. That is why CQL SELECTs always need to do an equality filter on the entire partitioning key. In order to find the first time for each id, Cassandra would have to ask all nodes for any partition of the data, then perform a complex operation on each of them. Relational databases allow this, Cassandra does not. All it allows are full table scans (SELECT * from table1), or partition scans (SELECT DISTINCT id FROM table1), but those cannot* be linked to any complex operation.
*) I am omitting ALLOW FILTERING here, since it does not help in this context.

cassandra primary key column cannot be restricted

I am using Cassandra for the first time in a web app and I got a query problem.
Here is my tab :
CREATE TABLE vote (
doodle_id uuid,
user_id uuid,
schedule_id uuid,
vote int,
PRIMARY KEY ((doodle_id), user_id, schedule_id)
);
On every request, I indicate my partition key, doodle_id.
For example I can make without any problems :
select * from vote where doodle_id = c4778a27-f2ca-4c96-8669-15dcbd5d34a7 and user_id = 97a7378a-e1bb-4586-ada1-177016405142;
But on the last request I made :
select * from vote where doodle_id = c4778a27-f2ca-4c96-8669-15dcbd5d34a7 and schedule_id = c37df0ad-f61d-463e-bdcc-a97586bea633;
I got the following error :
Bad Request: PRIMARY KEY column "schedule_id" cannot be restricted (preceding column "user_id" is either not restricted or by a non-EQ relation)
I'm new with Cassandra, but correct me if I'm wrong, in a composite primary key, the first part is the PARTITION KEY which is mandatory to allow Cassandra to know where to look for data.
Then the others parts are CLUSTERING KEY to sort data.
But I still don't get why my first request is working and not the second one ?
If anyone could help it will be a great pleasure.
In Cassandra, you should design your data model to suit your queries. Therefore the proper way to support your second query (queries by doodle_id and schedule_id, but not necessarilly with user_id), is to create a new table to handle that specific query. This table will be pretty much the same, except the PRIMARY KEY will be slightly different:
CREATE TABLE votebydoodleandschedule (
doodle_id uuid,
user_id uuid,
schedule_id uuid,
vote int,
PRIMARY KEY ((doodle_id), schedule_id, user_id)
);
Now this query will work:
SELECT * FROM votebydoodleandschedule
WHERE doodle_id = c4778a27-f2ca-4c96-8669-15dcbd5d34a7
AND schedule_id = c37df0ad-f61d-463e-bdcc-a97586bea633;
This gets you around having to specify ALLOW FILTERING. Relying on ALLOW FILTERING is never a good idea, and is certainly not something that you should do in a production cluster.
The clustering key is also used to find the columns within a given partition. With your model, you'll be able to query by:
doodle_id
doodle_id/user_id
doodle_id/user_id/schedule_id
user_id using ALLOW FILTERING
user_id/schedule_id using ALLOW FILTERING
You can see your primary key as a file path doodle_id#123/user_id#456/schedule_id#789 where all data is stored in the deepest folder (ie schedule_id#789). When you're querying you have to indicate the subfolder/subtree from where you start searching.
Your 2nd query doesn't work because of how columns are organized within partition. Cassandra can not get a continuous slice of columns in the partition because they are interleaved.
You should invert the primary key order (doodle_id, schedule_id, user_id) to be able to run your query.

Primary Key related CQL3 Queries cases & errors when sorting

I have two issues while querying Cassandra:
Query 1
> select * from a where author='Amresh' order by tweet_id DESC;
Order by with 2ndary indexes is not supported
What I learned: secondary indexes are made to be used only with a WHERE clause and not ORDER BY? If so, then how can I sort?
Query 2
> select * from a where user_id='xamry' ORDER BY tweet_device DESC;
Order by currently only supports the ordering of columns following their
declared order in the PRIMARY KEY.
What I learned: The ORDER BY column should be in the 2nd place in the primary key, maybe? If so, then what if I need to sort by multiple columns?
Table:
CREATE TABLE a(
user_id varchar,
tweet_id varchar,
tweet_device varchar,
author varchar,
body varchar,
PRIMARY KEY(user_id,tweet_id,tweet_device)
);
INSERT INTO a (user_id, tweet_id, tweet_device, author, body)
VALUES ('xamry', 't1', 'web', 'Amresh', 'Here is my first tweet');
INSERT INTO a (user_id, tweet_id, tweet_device, author, body)
VALUES ('xamry', 't2', 'sms', 'Saurabh', 'Howz life Xamry');
INSERT INTO a (user_id, tweet_id, tweet_device, author, body)
VALUES ('mevivs', 't1', 'iPad', 'Kuldeep', 'You der?');
INSERT INTO a (user_id, tweet_id, tweet_device, author, body)
VALUES ('mevivs', 't2', 'mobile', 'Vivek', 'Yep, I suppose');
Create index user_index on a(author);
To answer your questions, let's focus on your choice of primary key for this table:
PRIMARY KEY(user_id,tweet_id,tweet_device)
As written, the user_id will be used as the partition key, which distributes your data around the cluster but also keeps all of the data for the same user ID on the same node. Within a single partition, unique rows are identified by the pair (tweet_id, tweet_device) and those rows will be automatically ordered by tweet_id because it is the second column listed in the primary key. (Or put another way, the first column in the PK that is not a part of the partition key determines the sort order of the partition.)
Query 1
The WHERE clause is author='Amresh'. Note that this clause does not involve any of the columns listed in the primary key; instead, it is filtering using a secondary index on author. Since the WHERE clause does not specify an exact value for the partition key column (user_id) using the index involves scanning all cluster nodes for possible matches. Results cannot be sorted when they come from more than one replica (node) because that would require holding the entire result set on the coordinator node before it could return any results to the client. The coordinator can't know what is the real "first" result row until it has confirmed that it has received and sorted every possible matching row.
If you need the information for a specific author name, separate from user ID, and sorted by tweet ID, then consider storing the data again in a different table. The data design philosophy with Cassandra is to store the data in the format you need when reading it and to actually denormalize (store redundant information) as necessary. This is because in Cassandra, writes are cheap (though it places the burden of managing multiple copies of the same logical data on the application developer).
Query 2
Here, the WHERE clause is user_id = 'xamry' which happens to be the partition key for this table. The good news is that this will go directly to the replica(s) holding this partition and not bother asking the other nodes. However, you cannot ORDER BY tweet_device because of what I explained at the top of this answer. Cassandra stores rows (within a single partition) sorted by a single column, generally the second column in the primary key. In your case, you can access data for user_id = 'xamry' ORDER BY tweet_id but not ordered by tweet_device. The answer, if you really need the data sorted by device, is the same as for Query 1: store it in a table where that is the second column in the primary key.
If, when looking up the tweets by user_id you only ever need them sorted by device, simply flip the order of the last two columns in your primary key. If you need to be able to sort either way, store the data twice in two different tables.
The Cassandra storage engine does not offer multi-column sorting other than the order of columns listed in your primary key.

Resources