Cassandra Order By Updated At

I'm trying to build a Cassandra schema to represent chat.
The one thing I can't seem to figure out is how to query the most recently updated rooms (similar to the list view in most chat apps).
Fields desired in the list view, ordered by updated_at desc:
*room id
room title
room image
*user
*updated_at
*message entry
*message type
*metadata
Current Tables
CREATE TYPE user(
id uuid,
name text,
avatar text
);
CREATE TABLE rooms(
id uuid,
"name" text,
image text,
users set<frozen<user>>,
archived boolean,
created_at timestamp,
updated_at timestamp,
PRIMARY KEY(id)
);
CREATE TABLE messages(
room_id uuid,
message_id timeuuid,
user user,
message_type int,
entry text,
metadata map<text, text>,
PRIMARY KEY(room_id, message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);
CREATE TABLE rooms_by_user(
user_id uuid,
room_id uuid,
PRIMARY KEY(user_id, room_id)
);
Possible solutions I can come up with:
Duplicate all room details onto each message
allows an easy query with SELECT * FROM messages PER PARTITION LIMIT 1
but this would be a lot of duplicated data per message...
Query the latest messages for the rooms the user belongs to, collect the room ids, then query rooms
This doesn't seem to be the cassandra way?
Is there a better way to model my data?

Looking at the schema, it reads like a relational model. In Cassandra you usually use one table per query, which means you should design each table around the query it needs to serve.
Also, you can only query by partition key, or by partition key plus clustering columns.
So in order to query by updated_at, you need to make that column a clustering column. And keep in mind that in Cassandra you cannot alter primary key columns after the table is created.
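A minimal sketch of such a query table (the name rooms_by_user_updated and the denormalized columns are my own, not from the question). Because updated_at is part of the key, bumping a room to the top of the list means deleting the old row and inserting a new one whenever a message arrives:

```cql
-- Hypothetical query table: one partition per user, rows ordered so the
-- most recently active rooms come first.
CREATE TABLE rooms_by_user_updated (
    user_id    uuid,
    updated_at timestamp,
    room_id    uuid,
    room_title text,
    room_image text,
    last_entry text,
    PRIMARY KEY (user_id, updated_at, room_id)
) WITH CLUSTERING ORDER BY (updated_at DESC);

-- On each new message, for every member of the room:
BEGIN BATCH
    DELETE FROM rooms_by_user_updated
        WHERE user_id = ? AND updated_at = ? AND room_id = ?;  -- old updated_at
    INSERT INTO rooms_by_user_updated
        (user_id, updated_at, room_id, room_title, room_image, last_entry)
        VALUES (?, ?, ?, ?, ?, ?);                             -- new updated_at
APPLY BATCH;

-- List view query:
SELECT * FROM rooms_by_user_updated WHERE user_id = ? LIMIT 20;
```

The delete/insert pair targets the same partition, so this is a cheap single-partition batch.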

Related

How to update denormalized tables in Cassandra

I'm new to the Cassandra world. I'm trying to create a chat application for my users (traders). I have the table below to find traders by room_id.
CREATE TABLE IF NOT EXISTS traders_by_room (
room_id uuid,
trader_id uuid,
trader_username text,
trader_mongo_id text,
trader_profile_url text,
trader_profile_image text,
PRIMARY KEY (room_id, trader_id)
);
CREATE TABLE IF NOT EXISTS traders (
trader_id uuid,
trader_username text,
trader_mongo_id text,
trader_profile_url text,
trader_profile_image text,
PRIMARY KEY (trader_id)
);
Now the issue is: let's say a particular trader updates his/her trader_profile_image in the traders table; the same then needs to be updated in the traders_by_room table. But to update traders_by_room we must provide room_id in the WHERE clause, which might not be known, since we are just updating the profile picture for the trader. I couldn't really think of a solution.
Any help would mean a lot.
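A common pattern for this (sketched below with a hypothetical rooms_by_trader reverse-lookup table, my own naming) is to keep an index from trader to room, read the room ids first, then fan the update out to every affected partition:

```cql
-- Hypothetical reverse lookup so a trader's rooms can be found.
CREATE TABLE IF NOT EXISTS rooms_by_trader (
    trader_id uuid,
    room_id   uuid,
    PRIMARY KEY (trader_id, room_id)
);

-- 1. Update the canonical row.
UPDATE traders SET trader_profile_image = ? WHERE trader_id = ?;

-- 2. Look up which rooms the trader belongs to.
SELECT room_id FROM rooms_by_trader WHERE trader_id = ?;

-- 3. For each room_id returned, update the denormalized copy.
UPDATE traders_by_room SET trader_profile_image = ?
    WHERE room_id = ? AND trader_id = ?;
```

Step 3 assumes trader_id is a clustering column of traders_by_room; the fan-out runs client-side, since the rooms live in different partitions.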

CQL query delete if not in list

I am trying to delete all rows in the table where the partition key is not in a list of guids.
Here's my table definition.
CREATE TABLE cloister.major_user (
user_id uuid,
user_handle text,
avatar text,
created_at timestamp,
email text,
email_verified boolean,
first_name text,
last_name text,
last_updated_at timestamp,
profile_type text,
PRIMARY KEY (user_id, user_handle)
) WITH CLUSTERING ORDER BY (user_handle ASC)
I want to retain certain user_ids and delete the rest. The following options have failed.
delete from juna_user where user_id ! in (0d70272c-8d24-43d0-9b2d-c62100b0e28e,0b7c0841-3a18-4c03-a211-f75690c93815,e96ba860-72cf-44d5-a6bd-5a9ec58827e3,729d7973-d4c4-42fb-94c4-d1ffd03b74cd,3bffa0c6-8b98-4f0c-bd7c-22d0662ab0a2)
delete from juna_user where user_id not in (0d70272c-8d24-43d0-9b2d-c62100b0e28e,0b7c0841-3a18-4c03-a211-f75690c93815,e96ba860-72cf-44d5-a6bd-5a9ec58827e3,729d7973-d4c4-42fb-94c4-d1ffd03b74cd,3bffa0c6-8b98-4f0c-bd7c-22d0662ab0a2)
delete from juna_user where user_id not in (0d70272c-8d24-43d0-9b2d-c62100b0e28e,0b7c0841-3a18-4c03-a211-f75690c93815,e96ba860-72cf-44d5-a6bd-5a9ec58827e3,729d7973-d4c4-42fb-94c4-d1ffd03b74cd,3bffa0c6-8b98-4f0c-bd7c-22d0662ab0a2) ALLOW FILTERING
What am I doing wrong?
CQL supports only the IN condition (see the docs). You need to explicitly specify which primary keys or partition keys to delete; you can't use a NOT IN condition, because it could potentially match a huge amount of data. If you need to do that, you have to generate the list of entries to delete yourself - you can do that using the Spark Cassandra Connector, for example.
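A minimal sketch of that approach in plain CQL (assuming the set of keys to delete has already been computed client-side from a full key scan):

```cql
-- 1. Enumerate all partition keys (paged through a driver, client-side).
SELECT user_id FROM cloister.major_user;

-- 2. Client-side: subtract the guids you want to keep from that result.

-- 3. Delete the remainder explicitly; IN on the partition key is allowed.
DELETE FROM cloister.major_user
    WHERE user_id IN (?, ?, ?);
```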

Cassandra update previous rows after insert

I have this schema in cassandra:
create table if not exists
converstation_events(
timestamp timestamp,
sender_id bigint,
conversation_id bigint,
message_type varchar,
message text,
primary key ((conversation_id), sender_id, message_type, timestamp));
And there is a message_type with value conversation_ended, is there a way to denormalise the data so I can do queries on those conversations that have already ended?
I've thought about having an extra field that can be updated by a trigger when a conversation_ended message hits the system, does this make sense?
In Cassandra you need to model your data in a way that answers your questions. It's not like an RDBMS where you create your model first and then write your queries. So think backwards...
When you do a query in Cassandra (for the most part) you need to query by the partition key, and you can use your clustering key(s) to filter or select ranges.
Your converstation_events table will give you answers about a conversation, filtering by sender, type and time. Note that if you want to filter by time you must also include sender_id and message_type in the query.
But you want all conversations of a given type, so you'll need another table to answer this query. If you want all the conversations that are conversation_ended, you could create a second table mapping message type to conversation, like:
CREATE TABLE conversation_by_message_type (
message_type varchar,
conversation_id bigint,
timestamp timestamp,
primary key ((message_type), timestamp, conversation_id));
On the client side you'll have to add a record to conversation_by_message_type any time you insert a converstation_events row with a message_type you might want to look up later. I included timestamp in this table so you can sort or filter by time, or by time and conversation_id.
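The dual write could be sketched as a logged batch (whether you want the coordination cost of a logged batch here is a design choice; separate unlogged writes are cheaper if occasional drift is acceptable):

```cql
BEGIN BATCH
    -- Canonical event row.
    INSERT INTO converstation_events
        (conversation_id, sender_id, message_type, timestamp, message)
        VALUES (?, ?, 'conversation_ended', ?, ?);
    -- Lookup row so ended conversations can be found by type.
    INSERT INTO conversation_by_message_type
        (message_type, timestamp, conversation_id)
        VALUES ('conversation_ended', ?, ?);
APPLY BATCH;
```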
To find all the ended conversations you could do queries like
<ids> = select conversation_id from conversation_by_message_type where message_type = 'conversation_ended'
select * from conversation_events where conversation_id IN (<ids>)

Cassandra modeling with a read/unread status for a message inbox, CQL

I'm trying to find the best data model for a message inbox application. Messages should appear in an order where the 'unread' ones come first, and then, as the user scrolls, the 'read' messages follow. Within both categories I want to sort the messages by arrival time. Something like priority inbox in Gmail.
The first schema I thought to use is :
CREATE TABLE inbox
(userId uuid,
messageId timeuuid,
data blob,
isRead boolean,
PRIMARY KEY(userId, isRead, messageId))
WITH CLUSTERING ORDER BY (isRead ASC, messageId DESC);
So my data is sorted first by the boolean field and then by time. Now I can easily read my 'unread' messages first, and once they are exhausted, start reading the 'read' messages.
The problem is that I can't update any message's status, since isRead is part of the primary key. I can do a delete and then an insert in a batch operation; both writes target the same partition.
Another solution will be :
CREATE TABLE inbox
(userId uuid,
messageId timeuuid,
data blob,
isRead boolean,
PRIMARY KEY((userId, isRead), messageId))
WITH CLUSTERING ORDER BY (messageId DESC)
This gives a partition per status, and I gain very easy access, but does that mean I have to deal with transactions? When a message is read I have to delete it from the 'unread' partition and insert it into the 'read' partition, and those are different partitions.
another version for the partition key can be :
PRIMARY KEY(userId, messageId)
and then I would add a secondary index on isRead. My queries will always be on a certain user, not a group of users.
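That third option could be sketched as follows (a sketch only; a secondary index on a low-cardinality boolean is generally discouraged at scale, though restricting the query to a single partition keeps the index lookup local):

```cql
CREATE TABLE inbox (
    userId    uuid,
    messageId timeuuid,
    data      blob,
    isRead    boolean,
    PRIMARY KEY (userId, messageId)
) WITH CLUSTERING ORDER BY (messageId DESC);

CREATE INDEX inbox_isread_idx ON inbox (isRead);

-- Unread first, then read, both newest-first within one user's partition:
SELECT * FROM inbox WHERE userId = ? AND isRead = false;
SELECT * FROM inbox WHERE userId = ? AND isRead = true;
```

Updating a status here is a plain single-column UPDATE, since isRead is no longer part of the key.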
Any ideas on what is better? Or any other modeling ideas?
You can create a table referencing your messages by id, for example:
CREATE TABLE inbox_messages
(inbox_id uuid,
userId uuid,
messageId timeuuid,
data blob,
isRead boolean,
PRIMARY KEY(inbox_id));
This table stores your data and is where you perform update operations.
Then create other tables for searching, like:
CREATE TABLE inbox_by_user
(inbox_id uuid,
userId uuid,
messageId timeuuid,
isRead boolean,
PRIMARY KEY((userId, isRead), messageId))
WITH CLUSTERING ORDER BY (messageId DESC);
Search for the desired records in this table, then update both tables.
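Marking a message as read under a two-table layout could look like the sketch below (assuming a canonical table inbox_messages and a search table inbox_by_user, my own naming). The unread and read rows live in different partitions, so the change is not atomic without a logged batch:

```cql
BEGIN BATCH
    -- Remove from the 'unread' partition...
    DELETE FROM inbox_by_user
        WHERE userId = ? AND isRead = false AND messageId = ?;
    -- ...re-insert under the 'read' partition...
    INSERT INTO inbox_by_user (userId, isRead, messageId, inbox_id)
        VALUES (?, true, ?, ?);
    -- ...and flip the flag on the canonical row.
    UPDATE inbox_messages SET isRead = true WHERE inbox_id = ?;
APPLY BATCH;
```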

What would you change in such a Cassandra data model?

I have a task to create a social feed (news feed). I think there's no need to explain the standard functionality - it all works like Facebook.
I chose Apache Cassandra and designed a Posts table for storing information about users' posts:
CREATE TABLE Posts (
post_id uuid,
post_at timestamp,
user_id text,
name varchar,
category set<text>,
link varchar,
image set<varchar>,
video set<varchar>,
content map<text, text>,
private boolean,
PRIMARY KEY ((post_id, user_id), post_at)
)
WITH CLUSTERING ORDER BY (post_at DESC) AND COMPACT STORAGE;
The next table contains the ids of users' posts:
CREATE TABLE posts_user (
post_id bigint,
post_at timestamp,
user_id bigint,
PRIMARY KEY ((post_id), post_at, user_id)
)
WITH CLUSTERING ORDER BY (post_at DESC) AND COMPACT STORAGE;
What do you think - is it good? What would you change in such a data model?
There are a couple of questions and a couple of improvements that jump out.
COMPACT STORAGE is deprecated now (if you want to take advantage of CQL 3 features). I do not think that you can create your table Posts as you have defined above since it uses CQL 3 features (collections) with COMPACT STORAGE as well as declaring more than one column that is not part of the primary key.
posts_user has completely different key types than Posts does. I am not clear on what the relationship between the two tables is, but I imagine that post_id is supposed to be consistent between them, whereas you have it as a uuid in one table and a bigint in the other. There are also discrepancies with the other fields.
Assuming post_id is unique and represents the id of an individual post, it is strange to have it as the first part of a compound primary key in the Posts table since if you know the post_id then you can already uniquely access the record. Furthermore, as it is part of the partition key it also prevents you from doing wider selects of multiple posts and taking advantage of your post_at ordering.
The common method to fix this is to create a dedicated index table to sort the data the way you want.
E.g.
CREATE TABLE posts (
id uuid,
created timestamp,
user_id uuid,
name text,
...
PRIMARY KEY (id)
);
CREATE TABLE posts_by_user_index (
user_id uuid,
post_id uuid,
post_at timestamp,
PRIMARY KEY (user_id, post_at, post_id)
) WITH CLUSTERING ORDER BY (post_at DESC);
Or more comprehensively:
CREATE TABLE posts_by_user_sort_index (
user_id uuid,
post_id uuid,
sort_field text,
sort_value text,
PRIMARY KEY ((user_id,sort_field),sort_value,post_id)
);
However, in your case if you only wish to select the data one way, then you can get away with using your posts table to do the sorting:
CREATE TABLE posts (
id uuid,
post_at timestamp,
user_id uuid,
name text,
...
PRIMARY KEY (user_id, post_at, id)
) WITH CLUSTERING ORDER BY (post_at DESC);
It will just make it more complicated if you wish to add additional indexes later since you will need to index each post not just by its post id, but by its user and post_at fields as well.
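Reading a user's feed through the index table is then a two-step query (a sketch; collecting the ids and issuing the second statement is done by the client or driver):

```cql
-- 1. Page through the newest post ids for a user.
SELECT post_id FROM posts_by_user_index
    WHERE user_id = ? LIMIT 20;

-- 2. Fetch the full post bodies by id.
SELECT * FROM posts WHERE id IN (?, ?, ?);
```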
