CQL query delete if not in list - cassandra

I am trying to delete all rows in the table where the partition key is not in a list of guids.
Here's my table definition.
CREATE TABLE cloister.major_user (
user_id uuid,
user_handle text,
avatar text,
created_at timestamp,
email text,
email_verified boolean,
first_name text,
last_name text,
last_updated_at timestamp,
profile_type text,
PRIMARY KEY (user_id, user_handle)
) WITH CLUSTERING ORDER BY (user_handle ASC)
I want to retain certain user_ids and delete the rest. The following options have failed.
delete from juna_user where user_id ! in (0d70272c-8d24-43d0-9b2d-c62100b0e28e,0b7c0841-3a18-4c03-a211-f75690c93815,e96ba860-72cf-44d5-a6bd-5a9ec58827e3,729d7973-d4c4-42fb-94c4-d1ffd03b74cd,3bffa0c6-8b98-4f0c-bd7c-22d0662ab0a2)
delete from juna_user where user_id not in (0d70272c-8d24-43d0-9b2d-c62100b0e28e,0b7c0841-3a18-4c03-a211-f75690c93815,e96ba860-72cf-44d5-a6bd-5a9ec58827e3,729d7973-d4c4-42fb-94c4-d1ffd03b74cd,3bffa0c6-8b98-4f0c-bd7c-22d0662ab0a2)
delete from juna_user where user_id not in (0d70272c-8d24-43d0-9b2d-c62100b0e28e,0b7c0841-3a18-4c03-a211-f75690c93815,e96ba860-72cf-44d5-a6bd-5a9ec58827e3,729d7973-d4c4-42fb-94c4-d1ffd03b74cd,3bffa0c6-8b98-4f0c-bd7c-22d0662ab0a2) ALLOW FILTERING
What am I doing wrong?

CQL supports only the IN condition (see docs). You need to specify explicitly which primary keys or partition keys to delete; you can't use a NOT IN condition, because it could potentially match a huge amount of data. If you need to do that, you have to generate the list of entries to delete yourself. You can do that with the Spark Cassandra Connector, for example.
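Generating the list of entries to delete can be sketched in plain Python. This is only an illustration under assumptions: build_deletes is a hypothetical helper, the full set of partition keys is assumed to have been collected already by a full scan (with Spark or otherwise), and the table name is the one used in the failed queries above.

```python
from uuid import UUID


def build_deletes(all_ids, keep_ids, table="juna_user"):
    """Return one DELETE statement per partition key that is not in keep_ids.

    all_ids  -- every user_id found by a full scan (assumed to be available)
    keep_ids -- the user_ids that must survive
    """
    keep = {UUID(str(k)) for k in keep_ids}
    to_delete = sorted({UUID(str(i)) for i in all_ids} - keep)
    return [f"DELETE FROM {table} WHERE user_id = {i};" for i in to_delete]


# Example: two ids to keep plus one made-up id that should be removed.
keep = [
    "0d70272c-8d24-43d0-9b2d-c62100b0e28e",
    "0b7c0841-3a18-4c03-a211-f75690c93815",
]
scanned = keep + ["11111111-1111-1111-1111-111111111111"]
statements = build_deletes(scanned, keep)
```

In practice you would probably run the deletes through a driver with prepared statements rather than building strings; the sketch only shows the set difference that CQL will not compute for you.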

Related

Cassandra Order By Updated At

I'm trying to build a Cassandra schema to represent chat.
The one thing I can't seem to figure out is how to query the most recently updated rooms (similar to the list view in most chat apps).
Fields desired in list view ordered by updated_at desc
*room id
room title
room image
*user
*updated_at
*message entry
*message type
*metadata
Current Tables
Create TYPE user(
id uuid,
name text,
avatar text
);
CREATE TABLE rooms(
id uuid,
"name" text,
image text,
users set<user>,
archived boolean,
created_at timestamp,
updated_at timestamp,
PRIMARY KEY(id)
);
CREATE TABLE messages(
room_id uuid,
message_id timeuuid,
user user,
message_type int,
entry text,
metadata map<text, text>,
PRIMARY KEY(room_id, message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);
CREATE TABLE rooms_by_user(
user_id uuid,
room_id uuid,
PRIMARY KEY(user_id, room_id)
);
Possible solutions that I can come up with:
Duplicate all room details to each message
allows easy query with SELECT * FROM messages PER PARTITION LIMIT 1
this would be a lot of duplicate data per message...
Query latest messages which user belongs to get room ids then query rooms
This doesn't seem to be the cassandra way?
Is there a better way to model my data?
Looking at the schema, it seems you need a relational database.
In Cassandra you usually use one table per query, which means you should design each table around the query it will serve.
Also, you can only query by partition key, or by partition key plus clustering columns.
So in order to query by updated_at, you need to make that column a clustering column. And keep in mind that in Cassandra you cannot alter keys.
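What that advice implies for writes can be sketched with an in-memory stand-in. RoomsByUser is a hypothetical model, not a driver API; it assumes a table keyed by (user_id, updated_at, room_id). Because key columns cannot be changed in place, bumping a room to the top means removing the old row and inserting a new one.

```python
class RoomsByUser:
    """In-memory stand-in for a table with PRIMARY KEY (user_id, updated_at, room_id)."""

    def __init__(self):
        self.rows = set()  # each row is (user_id, updated_at, room_id)

    def touch(self, user_id, room_id, old_ts, new_ts):
        # Key columns cannot be updated: drop the old row, then add the new one.
        self.rows.discard((user_id, old_ts, room_id))
        self.rows.add((user_id, new_ts, room_id))

    def latest(self, user_id, limit=10):
        # Equivalent of reading one partition with updated_at in descending order.
        mine = [r for r in self.rows if r[0] == user_id]
        mine.sort(key=lambda r: r[1], reverse=True)
        return [(room_id, ts) for _, ts, room_id in mine[:limit]]


table = RoomsByUser()
table.touch("u1", "room-a", None, 1)  # first message in room-a
table.touch("u1", "room-b", None, 2)  # first message in room-b
table.touch("u1", "room-a", 1, 3)     # new message moves room-a to the top
recent = table.latest("u1")
```

The caller has to know the previous updated_at to remove the old row, which is one reason this kind of table is usually maintained alongside the rooms table rather than instead of it.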

Updating Primary Key value Cassandra

I have a table with node_id, node_name and data. My requirement is to support getByID and getByName, so I have made id and name the primary keys. But I also sometimes need to update the name.
I know Cassandra does not allow updating primary keys or having non-primary-key columns in the WHERE clause.
How can I achieve this?
I did consider deleting the record first, and then inserting again with the same id and new name. But I read that this would create tombstones and affect the performance.
Use only node_id as the primary key. To implement getByName, create a materialized view (see "Materialized views in Cassandra").
create table users_by_id_name(
id int,
createdOn bigint, -- timestamp in millisec
name text,
age int,
primary key (id, createdOn, name) -- createdOn first, so that LIMIT 1 returns the latest version
) WITH CLUSTERING ORDER BY (createdOn DESC, name DESC);
Use the table definition above to insert users.
Insert query:
insert into users_by_id_name (id,createdOn,name,age) values (1,100,'darthvedar',28);
To update the user, insert the row again with the same user id and the updated name and createdOn value.
insert into users_by_id_name (id,createdOn,name,age) values (1,200,'obi-wan-kenobi',28);
To select the user, use the queries below.
Select by user id:
select * from users_by_id_name where id=1 limit 1;
Select user by name:
select * from users_by_id_name where name='obi-wan-kenobi' ALLOW FILTERING;
Another way is to use a secondary index on the user name. Since the user name is probably not going to change too frequently, a secondary index is also a good option.
Edit after comments:
If you have very frequent updates on the user name, it would be better to use two different tables.
create table users_by_id(
id int,
name text,
age int,
primary key (id)
);
create table users_by_name(
id int,
name text,
age int,
primary key (name)
);
While inserting, write to both tables using a batch statement.
Hope this will help.

Order by created date In Cassandra

I have a problem with ordering data in a Cassandra database.
This is my table structure:
CREATE TABLE posts (
id uuid,
created_at timestamp,
comment_enabled boolean,
content text,
enabled boolean,
meta map<text, text>,
post_type tinyint,
summary text,
title text,
updated_at timestamp,
url text,
user_id uuid,
PRIMARY KEY (id, created_at)
) WITH CLUSTERING ORDER BY (created_at DESC)
When I run this query, I get the following message:
Query:
select * from posts order by created_at desc;
message:
ORDER BY is only supported when the partition key is restricted by an EQ or an IN.
And this query returns data without sorting:
select * from posts
There are a couple of things you need to understand.
In your case the partition key is "id" and the clustering key is "created_at".
What that essentially means is that any row will be stored in a partition based on the hash of "id" (the hash depends on your partitioner; by default it is Murmur3). Inside that partition the data is sorted by your clustering key, in your case "created_at".
So if you query data from that table, by default the results come back sorted by your clustering order, and the default sort order is the one you specified when creating the table. However, there is a gotcha.
If you do not specify the partition key in the WHERE clause, the actual order of the result set depends on the hashed values of the partition key (in your case id).
So in order to get the posts in that specific order, you have to specify the partition key like this:
select * from posts WHERE id=1 order by created_at desc;
Note:
It is not necessary to specify the ORDER BY clause on a query if your desired sort direction (“ASCending/DESCending”) already matches the CLUSTERING ORDER in the table definition.
So essentially the above query is same as
select * from posts WHERE id=1
You can read more about this here http://www.datastax.com/dev/blog/we-shall-have-order
The error message is pretty clear: you cannot ORDER BY without a WHERE clause that restricts the partition key with an EQ or an IN. This is by design.
The data you get when running without a WHERE clause actually are ordered, not with your clustering key, but by applying the token function to your partition key. You can verify the order by issuing:
SELECT token(id), id, created_at, user_id FROM posts;
where the token function arguments exactly match your PARTITION KEY.
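The resulting order can be imitated with a small Python sketch. It rests on loud assumptions: token here is a stand-in built on MD5, not Cassandra's Murmur3 partitioner, and full_scan is a hypothetical helper. It only shows the shape of the result: partitions appear in token order, and rows are sorted by created_at descending inside each partition.

```python
import hashlib


def token(partition_key):
    """Stand-in for the partitioner hash (NOT Murmur3): a signed 64-bit integer."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


def full_scan(rows):
    """Order (id, created_at) rows the way an unrestricted SELECT would return them.

    Partitions come out in token order; inside a partition the clustering
    order (created_at DESC) applies.
    """
    return sorted(rows, key=lambda row: (token(row[0]), -row[1]))


rows = [("a", 1), ("b", 5), ("a", 3), ("b", 2), ("c", 9)]
scanned = full_scan(rows)
```

Rows that share an id stay together and keep their clustering order, but the order of the ids themselves follows the hash and has nothing to do with created_at.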
I suggest you to read this and this to understand what you can/can't do.

How to choose proper tables structure in cassandra?

Suppose I have table with the following structure
create table tasks (
user_id uuid,
name text,
task_id uuid,
description text,
primary key ((user_id), name, task_id)
);
It allows me to get all tasks for user sorted by name ascending. Also I added task_id to primary key to avoid upserts. The following query holds
select * from tasks where user_id = ?
as well as
select * from tasks where user_id = ? and name > ?
However, I cannot get task with specific task_id. For example, following query crashes
select * from tasks where user_id = ? and task_id = ?
with this error
PRIMARY KEY column "task_id" cannot be restricted as preceding column "name" is not restricted
It requires name column to be specified, but at the moment I have only task_id (from url, for example) and user_id (from session).
How should I create this table to perform both queries? Or I need create separate table for second case? What is the common pattern in this situation?
You could simply add one more redundant column taskId with same value as task_id and create a secondary index on taskId.
Then you can query with user_id=? and taskId=?
PRIMARY KEY column "task_id" cannot be restricted as preceding
column "name" is not restricted
You are seeing this error because CQL does not permit queries to skip primary key components.
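The rule can be modelled with a few lines of Python. This is a deliberately simplified sketch: can_restrict is a hypothetical helper, it assumes the partition key is already fully restricted, and it ignores the further rules about range restrictions and ALLOW FILTERING. It only checks that the restricted clustering columns form a prefix of the clustering key.

```python
def can_restrict(clustering_columns, restricted):
    """Return True if the restricted columns form a prefix of the clustering key.

    clustering_columns -- clustering columns in PRIMARY KEY order
    restricted         -- clustering columns used in the WHERE clause
    """
    restricted = set(restricted)
    unknown = restricted - set(clustering_columns)
    if unknown:
        raise ValueError(f"not clustering columns: {sorted(unknown)}")
    skipped = False
    for column in clustering_columns:
        if column in restricted:
            if skipped:
                # A preceding clustering column was left unrestricted.
                return False
        else:
            skipped = True
    return True


# tasks table from the question: primary key ((user_id), name, task_id)
by_task_only = can_restrict(["name", "task_id"], ["task_id"])
by_name_only = can_restrict(["name", "task_id"], ["name"])
```

With the question's table, restricting only task_id is rejected while restricting only name is accepted, which matches the behaviour described above.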
How should I create this table to perform both queries? Or I need create separate table for second case? What is the common pattern in this situation?
As you suspect, the typical way that problems like this are solved with Cassandra is that an additional table is created for each query. In this case, recreating the table with a PRIMARY KEY designed to match your additional query pattern would simply look like this:
create table tasks_by_user_and_task (
user_id uuid,
name text,
task_id uuid,
description text,
primary key ((user_id), task_id)
);
You could simply add one more redundant column taskId with same value as task_id and create a secondary index on taskId.
While I am usually not a fan of using secondary indexes, in this case it may perform OK. The reason is that you would still be restricting your query by partition key, which would eliminate the need to examine additional nodes. The drawback (as Undefined_variable pointed out) is that you cannot create a secondary index on a primary key component, so you would need to duplicate that column (and apply the index to the non-primary key column) to get that solution to work.
It might be a good idea to model and test both solutions for performance.
If you have the extra disk space, the best method would be to replicate the data in a second table. You should avoid using secondary indexes in production. Your application would, of course, need to write to both these tables. But Cassandra is darn good at making that efficient.
create table tasks_by_name (
user_id uuid,
name text,
task_id uuid,
description text,
primary key ((user_id), name, task_id)
);
create table tasks_by_id (
user_id uuid,
name text,
task_id uuid,
description text,
primary key ((user_id), task_id)
);

What would you change in this Cassandra data model?

I have a task to create a social feed (news feed). I don't think I need to explain the standard functionality; it is all like FB.
I chose Apache Cassandra as the solution and designed a Posts table for storing information about users' posts:
CREATE TABLE Posts (
post_id uuid,
post_at timestamp,
user_id text,
name varchar,
category set<text>,
link varchar,
image set<varchar>,
video set<varchar>,
content map<text, text>,
private boolean,
PRIMARY KEY ((post_id, user_id), post_at)
)
WITH CLUSTERING ORDER BY (post_at DESC) COMPACT STORAGE;
The next table contains the ids of users' posts:
CREATE TABLE posts_user (
post_id bigint,
post_at timestamp,
user_id bigint,
PRIMARY KEY ((post_id), post_at, user_id)
)
WITH CLUSTERING ORDER BY (post_at DESC) AND COMPACT STORAGE;
What do you think, is it good? What would you change in this data model?
There are a couple of questions and a couple of improvements that jump out.
COMPACT STORAGE is deprecated now (if you want to take advantage of CQL 3 features). I do not think that you can create your table Posts as you have defined above since it uses CQL 3 features (collections) with COMPACT STORAGE as well as declaring more than one column that is not part of the primary key.
posts_user has completely different key types than Posts does. I am not clear on what the relationship between the two tables is, but I imagine that post_id is supposed to be consistent between them, whereas you have it as a uuid in one table and a bigint in the other. There are also discrepancies with the other fields.
Assuming post_id is unique and represents the id of an individual post, it is strange to have it as the first part of a compound primary key in the Posts table since if you know the post_id then you can already uniquely access the record. Furthermore, as it is part of the partition key it also prevents you from doing wider selects of multiple posts and taking advantage of your post_at ordering.
The common method to fix this is to create a dedicated index table to sort the data the way you want.
E.g.
CREATE TABLE posts (
id uuid,
created timestamp,
user_id uuid,
name text,
...
PRIMARY KEY (id)
);
CREATE TABLE posts_by_user_index (
user_id uuid,
post_id uuid,
post_at timestamp,
PRIMARY KEY (user_id,post_at,post_id)
) WITH CLUSTERING ORDER BY (post_at DESC, post_id ASC);
Or more comprehensively:
CREATE TABLE posts_by_user_sort_index (
user_id uuid,
post_id uuid,
sort_field text,
sort_value text,
PRIMARY KEY ((user_id,sort_field),sort_value,post_id)
);
However, in your case if you only wish to select the data one way, then you can get away with using your posts table to do the sorting:
CREATE TABLE posts (
id uuid,
post_at timestamp,
user_id uuid,
name text,
...
PRIMARY KEY (user_id,post_at,id)
) WITH CLUSTERING ORDER BY (post_at DESC, id ASC);
It will just make it more complicated if you wish to add additional indexes later since you will need to index each post not just by its post id, but by its user and post_at fields as well.
