Cassandra update previous rows after insert

I have this schema in cassandra:
create table if not exists converstation_events (
    timestamp timestamp,
    sender_id bigint,
    conversation_id bigint,
    message_type varchar,
    message text,
    primary key ((conversation_id), sender_id, message_type, timestamp)
);
There is a message_type with the value conversation_ended. Is there a way to denormalise the data so I can query the conversations that have already ended?
I've thought about adding an extra field that gets updated by a trigger when a conversation_ended message hits the system. Does this make sense?

In Cassandra you need to model your data in a way that answers your questions. It's not like an RDBMS where you create your model first and then write your queries. So think backwards...
When you do a query in Cassandra (for the most part) you need to query by the partition key, and you can use your clustering key(s) to filter or to select ranges.
Your converstation_events table will give you answers about a conversation, filtering by sender, type and time. Note that if you want to filter by time you must also include sender_id and message_type in the query.
But you want all conversations of a given type, so you'll need another table to answer that query. If you want all the conversations that are conversation_ended, you could create a second table that maps message type to conversation, like:
create table if not exists conversation_by_message_type (
    message_type varchar,
    conversation_id bigint,
    timestamp timestamp,
    primary key ((message_type), timestamp, conversation_id)
);
On the client side you'll have to add a record to conversation_by_message_type any time you insert a converstation_events row with a message_type you might want to look up. I have timestamp in this table so you can sort or filter by time, or by time and conversation_id.
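For example, here is a minimal sketch of that dual write (the values are purely illustrative); the application supplies the same timestamp to both inserts and can wrap them in a logged batch so the two tables stay in sync:
BEGIN BATCH
    INSERT INTO converstation_events (conversation_id, sender_id, message_type, timestamp, message)
    VALUES (42, 7, 'conversation_ended', '2024-05-01 12:00:00+0000', 'Bye!');
    INSERT INTO conversation_by_message_type (message_type, timestamp, conversation_id)
    VALUES ('conversation_ended', '2024-05-01 12:00:00+0000', 42);
APPLY BATCH;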
To find all the ended conversations you could run queries like:
<ids> = select conversation_id from conversation_by_message_type where message_type = 'conversation_ended';
select * from converstation_events where conversation_id IN (<ids>);

Related

Cassandra Order By Updated At

I'm trying to build a Cassandra schema to represent chat.
The one thing I can't seem to figure out is how to query the most recently updated rooms (similar to most chat apps' list view).
Fields desired in list view ordered by updated_at desc
*room id
room title
room image
*user
*updated_at
*message entry
*message type
*metadata
Current Tables
CREATE TYPE user (
    id uuid,
    name text,
    avatar text
);
CREATE TABLE rooms (
    id uuid,
    "name" text,
    image text,
    users set<frozen<user>>,
    archived boolean,
    created_at timestamp,
    updated_at timestamp,
    PRIMARY KEY (id)
);
CREATE TABLE messages (
    room_id uuid,
    message_id timeuuid,
    user frozen<user>,
    message_type int,
    entry text,
    metadata map<text, text>,
    PRIMARY KEY (room_id, message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);
CREATE TABLE rooms_by_user (
    user_id uuid,
    room_id uuid,
    PRIMARY KEY (user_id, room_id)
);
Possible solutions that I can come up with:
Duplicate all room details onto each message
This allows an easy query with SELECT * FROM messages PER PARTITION LIMIT 1, but it would mean a lot of duplicate data per message...
Query the latest messages for the rooms the user belongs to, get the room ids, then query rooms
This doesn't seem to be the Cassandra way?
Is there a better way to model my data?
By looking at the schema, it looks like you need a relational database.
In Cassandra you usually use one table per query, which means you should design each table around the query it will serve.
Also, you can only query by partition key, or by partition key plus clustering column(s).
So in order to query by updated_at, you need to make that column a clustering column. And keep in mind that in Cassandra you cannot alter a table's primary key after it is created.
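As an illustration only (none of these table or column names come from the original answer), a common pattern for the "rooms ordered by last activity" list view is a per-user query table clustered by the update time, rewritten by the application whenever a room receives a message:
-- Hypothetical query table: one row per (user, room), newest activity first.
CREATE TABLE rooms_by_user_activity (
    user_id uuid,
    updated_at timestamp,
    room_id uuid,
    room_title text,
    room_image text,
    last_entry text,
    last_message_type int,
    PRIMARY KEY ((user_id), updated_at, room_id)
) WITH CLUSTERING ORDER BY (updated_at DESC, room_id ASC);

-- List view: a user's rooms, most recently active first.
SELECT * FROM rooms_by_user_activity WHERE user_id = ? LIMIT 20;
Because updated_at is part of the clustering key it cannot be updated in place; on each new message the application deletes the old (user_id, updated_at, room_id) row and inserts a fresh one.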

Cassandra Update query | timestamp column as clustering key

I have a table in cassandra with following schema:
CREATE TABLE user_album_entity (
    userId text,
    albumId text,
    updateDateTimestamp timestamp,
    albumName text,
    description text,
    PRIMARY KEY ((userId), updateDateTimestamp)
);
The query required to get data would be: where userId = xxx order by updateDateTimestamp. Hence the schema has updateDateTimestamp as a clustering column.
The problem comes when updating a column of the table. The query is: update the album information for the user where user id = xxx. But as per the spec, for an update query I would need the exact value of updateDateTimestamp, which in a real-world scenario an application would never send.
What is the answer to such problems? I believe this is a very common use case where the select query requires ordering on a timestamp. Any help is much appreciated.
The problem is that your table structure allows the same album to have multiple records with the only difference being the timestamp (the clustering key).
Three possible solutions:
Remove the clustering key and sort your data at application level.
Remove the clustering key and add a Secondary Index to the timestamp field.
Remove the clustering key and create a Materialized View to perform the query (a sketch follows below).
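As a rough sketch of the third option only (this is not from the original answer, and it assumes albumId becomes the clustering key of the base table):
-- Base table without the timestamp in the key: one row per (userId, albumId).
CREATE TABLE user_album_entity (
    userId text,
    albumId text,
    updateDateTimestamp timestamp,
    albumName text,
    description text,
    PRIMARY KEY ((userId), albumId)
);

-- View ordered by the timestamp; Cassandra moves the view row automatically
-- whenever updateDateTimestamp changes in the base table.
CREATE MATERIALIZED VIEW user_album_by_time AS
    SELECT * FROM user_album_entity
    WHERE userId IS NOT NULL AND albumId IS NOT NULL AND updateDateTimestamp IS NOT NULL
    PRIMARY KEY ((userId), updateDateTimestamp, albumId)
    WITH CLUSTERING ORDER BY (updateDateTimestamp DESC);
The application can then update an album with UPDATE user_album_entity SET albumName = ?, updateDateTimestamp = ? WHERE userId = ? AND albumId = ? without knowing the old timestamp, and read the ordered list from the view. Keep in mind that materialized views are flagged as experimental in recent Cassandra releases.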
If your use case is such that each partition will contain exactly one row, then you can model your table like this:
CREATE TABLE user_album_entity (
    userId text,
    albumId text static,
    updateDateTimestamp timestamp,
    albumName text static,
    description text static,
    PRIMARY KEY ((userId), updateDateTimestamp)
);
Modelling the table this way enables the update query to be done as follows:
UPDATE user_album_entity SET albumId = 'updatedAlbumId' WHERE userId = 'xyz'
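To make the mechanics concrete (the values below are illustrative, not from the original answer): because albumId, albumName and description are static, they belong to the partition as a whole and can be written with just the partition key.
-- Record a new timestamped entry; the static columns are set for the whole partition.
INSERT INTO user_album_entity (userId, updateDateTimestamp, albumId, albumName, description)
VALUES ('xyz', '2024-05-01 12:00:00+0000', 'album-1', 'Holidays', 'First version');

-- Later: change the album info without knowing any timestamp.
UPDATE user_album_entity SET albumName = 'Holidays 2024' WHERE userId = 'xyz';

-- Reads come back ordered by updateDateTimestamp within the partition.
SELECT * FROM user_album_entity WHERE userId = 'xyz';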
Hope this helps.

Cassandra data modeling timestamps

I have a fairly simple data model. I am tracking events for users based on timestamps. I'm converting a JSON object which has this schema:
userID: {
    event: [
        { timestamp: data },
        { timestamp: data }
    ]
}
I have come up with two Cassandra schemas.
The first:
CREATE TABLE users ( guid uuid, date timestamp, events varchar, PRIMARY KEY(guid, date) );
The second:
CREATE TABLE users ( guid uuid PRIMARY KEY, date timestamp, events map<text, text> );
Either one would work, requiring the data to be a stringified JSON object. My query will be returning all data from a user in a given time range. Which model makes more sense, or is there a better way to go about this?
The second approach won't allow you to do queries by time range since you don't have date as a clustering column. So you might want to do this:
CREATE TABLE users (
    guid uuid,
    date timestamp,
    events map<text, text>,
    PRIMARY KEY (guid, date)
);
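With date as a clustering column, the time-range query described in the question becomes straightforward, for example (the uuid and dates are placeholders):
SELECT date, events FROM users
WHERE guid = 123e4567-e89b-12d3-a456-426614174000
  AND date >= '2024-01-01' AND date < '2024-02-01';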
How you want to define the events field depends on what's in there and how you need to access it. If you access small parts of it often, you might want to break events in the map out into separate rows by making the event key another clustering column like this:
CREATE TABLE users (
    guid uuid,
    date timestamp,
    event_type text,
    event_value text,
    PRIMARY KEY (guid, date, event_type)
);
It's hard to give more specific advice since you didn't describe your use case in terms of what queries you want to run and the volume of data, number of users, etc.
As Jim said, the second schema does not allow querying by timestamp since the timestamp is not part of the primary key.
He suggested a valid solution, but I would also suggest using a timeuuid instead of a plain uuid plus timestamp (a timeuuid provides both an id and a timestamp at the same time) if you can. However, if you sometimes need to get users by id only, then Jim's solution is probably the best:
PRIMARY KEY(guid, date, event_type)

cassandra primary key column cannot be restricted

I am using Cassandra for the first time in a web app and I got a query problem.
Here is my table:
CREATE TABLE vote (
    doodle_id uuid,
    user_id uuid,
    schedule_id uuid,
    vote int,
    PRIMARY KEY ((doodle_id), user_id, schedule_id)
);
On every request, I indicate my partition key, doodle_id.
For example, I can run this without any problems:
select * from vote where doodle_id = c4778a27-f2ca-4c96-8669-15dcbd5d34a7 and user_id = 97a7378a-e1bb-4586-ada1-177016405142;
But when I ran this last query:
select * from vote where doodle_id = c4778a27-f2ca-4c96-8669-15dcbd5d34a7 and schedule_id = c37df0ad-f61d-463e-bdcc-a97586bea633;
I got the following error:
Bad Request: PRIMARY KEY column "schedule_id" cannot be restricted (preceding column "user_id" is either not restricted or by a non-EQ relation)
I'm new to Cassandra, but correct me if I'm wrong: in a composite primary key, the first part is the PARTITION KEY, which is mandatory so Cassandra knows where to look for the data.
The other parts are the CLUSTERING KEYS, which sort the data within the partition.
But I still don't get why my first query works and the second one doesn't.
If anyone could help, it would be a great pleasure.
In Cassandra, you should design your data model to suit your queries. Therefore the proper way to support your second query (by doodle_id and schedule_id, but not necessarily user_id) is to create a new table to handle that specific query. This table will be pretty much the same, except the PRIMARY KEY will be slightly different:
CREATE TABLE votebydoodleandschedule (
    doodle_id uuid,
    user_id uuid,
    schedule_id uuid,
    vote int,
    PRIMARY KEY ((doodle_id), schedule_id, user_id)
);
Now this query will work:
SELECT * FROM votebydoodleandschedule
WHERE doodle_id = c4778a27-f2ca-4c96-8669-15dcbd5d34a7
AND schedule_id = c37df0ad-f61d-463e-bdcc-a97586bea633;
This gets you around having to specify ALLOW FILTERING. Relying on ALLOW FILTERING is never a good idea, and is certainly not something that you should do in a production cluster.
The clustering key is also used to find the columns within a given partition. With your model, you'll be able to query by the following (example queries after the list):
doodle_id
doodle_id/user_id
doodle_id/user_id/schedule_id
user_id using ALLOW FILTERING
user_id/schedule_id using ALLOW FILTERING
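For illustration, those access paths translate to queries like these (with ? standing in for bound values):
SELECT * FROM vote WHERE doodle_id = ?;                                      -- doodle_id
SELECT * FROM vote WHERE doodle_id = ? AND user_id = ?;                      -- doodle_id/user_id
SELECT * FROM vote WHERE doodle_id = ? AND user_id = ? AND schedule_id = ?;  -- full key
SELECT * FROM vote WHERE user_id = ? ALLOW FILTERING;                        -- works, but avoid in production
SELECT * FROM vote WHERE user_id = ? AND schedule_id = ? ALLOW FILTERING;    -- same caveat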
You can see your primary key as a file path, doodle_id#123/user_id#456/schedule_id#789, where all data is stored in the deepest folder (i.e. schedule_id#789). When you query, you have to indicate the subfolder/subtree from which you start searching.
Your second query doesn't work because of how columns are organized within the partition: Cassandra cannot read a continuous slice of columns in the partition because they are interleaved.
You would have to invert the primary key order to (doodle_id, schedule_id, user_id) to be able to run that query.

Cassandra Latest value on each key

I am trying to model a history stream of users... Think Twitter; each record has three columns:
userid, posted_time, message
If my primary use case is retrieving the latest tweet from every user, is there a simple way to model this in Cassandra?
In SQL, it would be:
select * from t where (userid, posted_time) in (select userid, max(posted_time) from t group by userid);
But I don't think it is possible in Cassandra.
For Cassandra you must denormalize your data so the queries you expect to perform are efficient. So, if you want to find the time of the latest update by a user, you should have a table that has the user ID as its key and the time of the latest update as the value.
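A minimal sketch of that idea (the table and values here are illustrative, not from the answer): one row per user, overwritten on every new post.
CREATE TABLE latest_post_by_user (
    userid varchar PRIMARY KEY,
    posted_time timeuuid,
    message text
);

-- Each new post simply upserts the user's single row:
INSERT INTO latest_post_by_user (userid, posted_time, message)
VALUES ('alice', now(), 'hello world');

-- Latest post for a given user:
SELECT * FROM latest_post_by_user WHERE userid = 'alice';
Retrieving the latest post of every user still means reading every partition, though.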
CREATE TABLE tblname (
    userid varchar,
    posted_time timeuuid,
    message text,
    PRIMARY KEY (userid, posted_time)
) WITH CLUSTERING ORDER BY (posted_time DESC);
With this schema,
SELECT * FROM tblname WHERE userid = 'id' LIMIT 1;
will give the latest record for that user.
You can also order the result by posted_time.
