I'm trying to create a chat app that uses the Cassandra database. I want to fetch the messages sent between two users (private chat). What is the best, most efficient way to write the CQL query for this? What are your ideas?
my user message table structure.
Everything that has been said is true.
Base your data model on your queries.
DS220 on academy.datastax.com is a full-fledged course that explains how to do this.
Now, regarding your sample chat, here is a proposal:
DROP TABLE IF EXISTS sample_chat;
CREATE TABLE IF NOT EXISTS sample_chat (
fromuser text,
touser text,
message_id timeuuid,
body text,
PRIMARY KEY ((fromuser, touser), message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);
INSERT INTO sample_chat(fromuser, touser, message_id, body) VALUES('Cedrick', 'Hasan', now(), 'Hi Hasan do you like Cassandra');
INSERT INTO sample_chat(fromuser, touser, message_id, body) VALUES('Hasan', 'Cedrick', now(), 'Yeah Cassandra rocks');
INSERT INTO sample_chat(fromuser, touser, message_id, body) VALUES('Cedrick', 'Hasan', now(), 'Take ds220 and give us some feedback');
select fromuser, touser, body from sample_chat;
Rationale:
You want to retrieve a chat based on fromuser and touser, so together they make a good partition key. A chat between two users should stay under 100,000 rows, which is the commonly recommended maximum for a partition.
You want your items ordered by time with the latest first (to display only the last messages in the chat). You also want your messages to be unique. A timeuuid is therefore a good type for message_id: you can easily extract the time from it, and it ensures uniqueness.
Avoid using time, or any other term that collides with an existing CQL keyword, as a column name.
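To illustrate the point about extracting time from a timeuuid, here is a minimal client-side sketch in Python. A timeuuid is a version 1 UUID, so the standard uuid module can decode it; the helper name timeuuid_to_datetime is made up for this example, and uuid.uuid1() only stands in for a value that Cassandra would generate with now().

```python
import uuid
from datetime import datetime, timezone

# Number of 100-nanosecond intervals between the UUID epoch (1582-10-15)
# and the Unix epoch (1970-01-01).
UUID_EPOCH_OFFSET = 0x01B21DD213814000


def timeuuid_to_datetime(u: uuid.UUID) -> datetime:
    """Return the UTC creation time embedded in a version 1 UUID."""
    if u.version != 1:
        raise ValueError("only version 1 (time-based) UUIDs carry a timestamp")
    unix_seconds = (u.time - UUID_EPOCH_OFFSET) / 1e7
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc)


# Stand-in for a message_id produced by now() on the server (assumption).
message_id = uuid.uuid1()
sent_at = timeuuid_to_datetime(message_id)
```

On the server side, CQL can do the same conversion with toTimestamp(message_id) in a SELECT.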
In Cassandra, table design should be driven by the query or queries you are going to execute. In this case, how are you going to request the data: by the fromuser column, by the touser column, or by both?
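One consequence of a (fromuser, touser) partition key is that Cedrick's messages to Hasan and Hasan's replies to Cedrick live in two different partitions. If the query is "show the whole conversation between two users", one possible approach (an assumption, not something the posts above prescribe) is to normalise the pair before writing and reading, and to store the actual sender in a regular column. The function name conversation_key is hypothetical.

```python
from typing import Tuple


def conversation_key(user_a: str, user_b: str) -> Tuple[str, str]:
    """Order the two participants so both directions map to the same key."""
    first, second = sorted((user_a, user_b))
    return first, second


# Both directions of the chat resolve to the same partition key values.
key_1 = conversation_key("Cedrick", "Hasan")
key_2 = conversation_key("Hasan", "Cedrick")
```

With this convention a single partition holds both sides of the chat, at the cost of an extra sender column in the table.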
Datastax Academy has the course DS220, which can be a good starting point for learning data modeling for Cassandra.
In Cassandra, you have to work on the data model and optimise your queries according to the Datastax or Cassandra recommendations. Depending on your app, you should also tune the Cassandra configuration for heavy writes and reads. To get good performance, consider not only the database but also the OS, the network, and so on.
Related
For my chat table design in cassandra I have the following scheme:
USE zwoop_chat;
CREATE TABLE IF NOT EXISTS public_messages (
chatRoomId text,
date timestamp,
fromUserId text,
fromUserNickName text,
message text,
PRIMARY KEY ((chatRoomId, fromUserId), date)
) WITH CLUSTERING ORDER BY (date ASC);
The following query:
SELECT * FROM public_messages WHERE chatroomid=? LIMIT 20
Results in the typical message:
Cannot execute this query as it might involve data filtering and thus
may have unpredictable performance. If you want to execute this query
despite the performance unpredictability, use ALLOW FILTERING;
Obviously I'm doing something wrong with the partitioning here.
I'm not experienced with Cassandra and a bit confused by online suggestions that Cassandra will perform an entire table scan, which I don't really understand. Why would I want to fetch an entire table?
Another suggestion I read about is to create partitioning, e.g. to fetch the latest per day. But this doesn't work for me. You don't know when the latest chat message occurred.
Could be last day, last hour, or last week or month for that matter.
I'm pretty much used to sql or nosql like mongo, but this simple use case seems to be a problem for Cassandra. So what is the recommended approach here?
Edit:
It seems that it is common practise to add a bucket integer.
Let's say I create a bucket per 50 messages, is there a way to auto-increment it when the bucket is full?
I would prefer not having to do a fetch of MAX bucket and calculate when the bucket is full. Seems like bad performance for doing inserts.
Also it seems like a bad idea to manage the buckets in Java. Things like app restarts or load balancing would require extra logic.
(I currently use Java Spring JPA for Cassandra).
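One way to avoid an auto-incrementing counter is to derive the bucket from the message timestamp, so that every writer computes the same value without reading any shared state. The sketch below assumes a monthly bucket and a hypothetical partition key of (chatRoomId, bucket); both are illustrative choices, and the function names are made up. A reader that wants the latest 20 messages would query the current bucket first and step backwards until it has enough rows.

```python
from datetime import datetime, timezone
from typing import Iterator


def month_bucket(ts: datetime) -> int:
    """Derive a bucket such as 202405 from a message timestamp."""
    ts = ts.astimezone(timezone.utc)
    return ts.year * 100 + ts.month


def buckets_newest_first(start: datetime, max_buckets: int) -> Iterator[int]:
    """Yield month buckets walking backwards from `start`."""
    ts = start.astimezone(timezone.utc)
    year, month = ts.year, ts.month
    for _ in range(max_buckets):
        yield year * 100 + month
        month -= 1
        if month == 0:
            year, month = year - 1, 12
```

The bucket size is a trade-off: a quiet room may need several lookups to find 20 messages, while a busy room keeps its partitions bounded.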
It works without bucketing using the following table design:
USE zwoop_chat;
CREATE TABLE IF NOT EXISTS public_messages (
chatRoomId text,
date timestamp,
fromUserId text,
fromUserNickName text,
message text,
PRIMARY KEY ((chatRoomId), date)
) WITH CLUSTERING ORDER BY (date DESC);
I had to remove fromUserId from the partition key; I assume every partition key column has to be included in the WHERE clause to avoid the error.
The jpa query:
publicMessageRepository.findFirst20ByPkChatRoomIdOrderByPkDateDesc(chatRoomId);
I have 5 Tables:
users_by_id
users_by_username
users_by_email
users_by_likes
users_by_followers
I have to write 5 statements every time a user registers. Is that expensive or bad?
INSERT INTO users_by_id (...) values (..)
INSERT INTO users_by_email (...) values (..)
INSERT INTO users_by_username (...) values (..)
INSERT INTO users_by_likes (...) values (..)
INSERT INTO users_by_followers (...) values (..)
The second question: if I update users_by_id, I have to write 5 update statements. Is there another solution? Or is that not so bad?
Cassandra advocates denormalizing your data and creating the data model according to your queries. You have to design your data model so that it satisfies all the queries with good performance. For performance (due to its architecture and design), Cassandra expects writes and reads to use the partition key.
It is not expensive to write 5 inserts for the same set of data into 5 different tables. Your reads will perform better, and as the data size grows to web scale you will be glad you decided to create 5 tables and write to all of them.
You can explore materialized views, but remember that they are an experimental feature, so you need to understand them properly and check their open issues first.
I would recommend studying Cassandra data modeling; that will make things easier to grasp.
Cassandra is designed to be a write-intensive database, so do not hesitate to duplicate your data. Always design tables for the read queries. If one table satisfies one query, it is a fine design.
As for your second question: design your tables so that you do not have to update them. Always think in terms of inserting new values.
For example, below table design
CREATE TABLE user_by_email (
email text,
timestamp timestamp,
name text,
fullname text,
userId text,
PRIMARY KEY (email,timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);
INSERT INTO user_by_email (email, timestamp, name, fullname, userId) VALUES (?, toTimestamp(now()), ?, ?, ?);
With this design, reading the first row of a partition returns the most recently inserted value. Additionally, the design keeps a change history for that key.
Also consider how often values like user id, email or username actually have to be updated: rarely.
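The insert-only idea can be sketched with a small in-memory model in Python. This is not driver code: the dictionary only imitates a partition keyed by email whose rows are kept in timestamp DESC order, and the function names are invented for the example.

```python
from collections import defaultdict
from datetime import datetime, timezone

# email (partition key) -> list of (timestamp, row), kept newest first,
# imitating CLUSTERING ORDER BY (timestamp DESC).
history_by_email = defaultdict(list)


def insert_version(email: str, timestamp: datetime, name: str) -> None:
    """Append a new version instead of updating the existing row."""
    rows = history_by_email[email]
    rows.append((timestamp, {"name": name}))
    rows.sort(key=lambda r: r[0], reverse=True)


def latest(email: str):
    """Equivalent in spirit to SELECT ... WHERE email = ? LIMIT 1."""
    rows = history_by_email[email]
    return rows[0][1] if rows else None


insert_version("a@example.com", datetime(2024, 1, 1, tzinfo=timezone.utc), "old name")
insert_version("a@example.com", datetime(2024, 6, 1, tzinfo=timezone.utc), "new name")
```

Here latest("a@example.com") returns the June row, while the January row stays available as history.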
I'm facing a dilemma that my small knowledge of Cassandra doesn't allow me to solve.
I have an index table used to retrieve data for an item (a notification) using an external id. However, the data contained in that table (in this case the status of the notification) is modified, so I need to update the index table as well. Here is the table design:
TABLE notification_by_external_id (
external_id text,
partition_key_date text,
id uuid,
status text,
...
PRIMARY KEY (external_id, partition_key_date, id)
);
TABLE notification (
partition_key_date text,
status text,
id uuid,
...
PRIMARY KEY (partition_key_date, status, id)
);
The problem is that when I want to update the notification status (and hence the notification_by_external_id table), I don't have access to the external ID.
So far I have come up with 2 solutions, neither of which seems optimal, and I can't decide which one to go with.
Solution 1
Create an index on notification_by_external_id.id, but this will obviously be a high-cardinality column. There can be several external IDs for each notification, but we're talking about something around 5-10 to one at most.
Solution 2
Create a table
TABLE external_id_notification (
notification_id uuid,
external_id text,
PRIMARY KEY (notification_id, external_id)
);
but that would mean making one extra read operation (and, of course, maintaining another table), which I understand is also bad practice.
The thing to understand about secondary indexes is that their scalability issue lies not in the number of rows in the table but in the number of nodes in your cluster. A select on an indexed column means that every single node has to process the query and respond to it, even though each node can process the select efficiently on its own.
Use secondary indexes for administrative purposes only (i.e. you on cqlsh). Do not use them in production.
That being said, you could duplicate all the information into your external_id_notification table. That would remove the need for an extra read operation. I know that relational databases taught you that duplicate data is bad (what if it differs?) and that you should always normalize. But you are not on a relational database. Denormalization is a thing, and on Cassandra you should always go for it unless you absolutely cannot.
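The lookup-table flow from Solution 2 can be sketched as an in-memory model in Python. The two dictionaries only imitate external_id_notification and notification_by_external_id; the function names and sample values are invented, and no real driver calls are shown.

```python
import uuid
from collections import defaultdict

# notification id -> set of external ids (imitates external_id_notification)
external_ids_by_notification = defaultdict(set)
# (external_id, partition_key_date, id) -> status
# (imitates notification_by_external_id)
status_by_external_id = {}


def register(notification_id, partition_key_date, external_id, status):
    """Write the index row and the reverse lookup row together."""
    external_ids_by_notification[notification_id].add(external_id)
    key = (external_id, partition_key_date, notification_id)
    status_by_external_id[key] = status


def update_status(notification_id, partition_key_date, new_status):
    """One lookup by notification id, then one write per external id."""
    for external_id in external_ids_by_notification[notification_id]:
        key = (external_id, partition_key_date, notification_id)
        status_by_external_id[key] = new_status
```

With 5-10 external ids per notification, a status change costs one partition read plus a handful of writes, all addressed by partition key.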
Cassandra data modeling follows the principle that "Denormalization and duplication of data is a fact of life with Cassandra". But one of the downsides of denormalized data is that it makes updates very hard. For example, if I have three tables catering for different queries, selecting is fine. However, what if my app wants to update a username and I need to update all three tables? The update on the first table looks OK. What about the other two? Are the updates going to be very expensive? How should I handle this case?
CREATE TABLE users_by_username (
username text PRIMARY KEY,
email text,
age int
);
CREATE TABLE users_by_email (
email text PRIMARY KEY,
username text,
age int
);
CREATE TABLE groups (
groupname text,
username text,
email text,
age int,
hash_prefix int,
PRIMARY KEY ((groupname, hash_prefix), username)
);
This is a typical problem I see when people try to put a relational model that is updated over time into Cassandra. Cassandra is a great database, and for what it does, it works wonders. There are many features that enable all kinds of different data models, and you can cover almost all use cases. But looking at your use case, the question is: why would you use Cassandra for a relational model?
If you really want to make Cassandra cover your use case, you will have to do a lot of different operations at the application level just to execute updates and keep your data in a consistent state.
After watching a few YouTube clips, it looks like an update in Cassandra is a simple write that appends a record to the commit log on the file system. The data is then put into the memtable on the Cassandra server, and an acknowledgement is sent to the client straight away, so the update call finishes. This makes updates fast for clients.
The whole compaction process happens afterwards, including flushing, sequential writing and merging based on the timestamp.
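The write path described above can be illustrated with a deliberately simplified toy model in Python. It is a sketch under heavy assumptions: one node, no replication, a tiny memtable limit chosen for the example, and a commit log that is never truncated. The class and method names are invented and do not correspond to Cassandra internals.

```python
class Node:
    """Toy model of the write path: commit log, memtable, flushed tables."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []   # append-only record of every write
        self.memtable = {}     # key -> (timestamp, value), held in memory
        self.sstables = []     # flushed, immutable tables, oldest first
        self.memtable_limit = memtable_limit

    def write(self, key, value, timestamp):
        # An update is just another write: append, store in memory, ack.
        self.commit_log.append((key, value, timestamp))
        current = self.memtable.get(key)
        if current is None or timestamp >= current[0]:
            self.memtable[key] = (timestamp, value)
        if len(self.memtable) >= self.memtable_limit:
            self.flush()
        return "ack"

    def flush(self):
        # Write the memtable out as a new immutable table.
        if self.memtable:
            self.sstables.append(dict(self.memtable))
            self.memtable = {}

    def read(self, key):
        # Merge all versions and keep the one with the newest timestamp.
        tables = self.sstables + [self.memtable]
        candidates = [t[key] for t in tables if key in t]
        return max(candidates, key=lambda c: c[0])[1] if candidates else None

    def compact(self):
        # Merge flushed tables, keeping the newest version of each key.
        merged = {}
        for table in self.sstables:
            for key, (ts, value) in table.items():
                if key not in merged or ts >= merged[key][0]:
                    merged[key] = (ts, value)
        self.sstables = [merged] if merged else []
```

The point of the model is that write() never reads or rewrites old data; reconciling old and new versions is deferred to read() and compact().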
I am new to Cassandra and trying to build a data model for a messaging system. I found a few solutions, but none of them exactly matches my requirements. There are two main requirements:
Get a list of last messages for a particular user, from all other users, sorted by time.
Get a list of messages for one-to-one message history, sorted by time as well.
I thought of something like this,
CREATE TABLE chat (
to_user text,
from_user text,
time text,
msg text,
PRIMARY KEY((to_user,from_user),time)
) WITH CLUSTERING ORDER BY (time DESC);
But this design has a few issues: I won't be able to satisfy the first requirement, since this design requires passing from_user as well. It would also be inefficient as the number of (to_user, from_user) pairs increases.
You are right. That one table won't satisfy both queries, so you will need two tables. One for each query. This is a core concept with Cassandra data modeling. Query driven design.
So the query looking for messages to a user:
CREATE TABLE chat (
to_user text,
from_user text,
time text,
msg text,
PRIMARY KEY((to_user),time)
) WITH CLUSTERING ORDER BY (time DESC);
Messages from a user to another user.
CREATE TABLE chat (
to_user text,
from_user text,
time text,
msg text,
PRIMARY KEY((to_user),from_user,time)
) WITH CLUSTERING ORDER BY (from_user ASC, time DESC);
Slight difference from yours: from_user is a clustering column and not part of the partition key. This minimizes the number of select queries needed in application code.
It's possible to use the second table to satisfy both queries, but you will have to supply the 'from_user' to use a range query on time.
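To make the effect of the clustering columns concrete, here is a small in-memory model in Python of a table partitioned by to_user and clustered by from_user, then time. It is only an illustration: the dictionary stands in for partitions, the function names are invented, and times are ISO-style strings so that text ordering matches chronological ordering.

```python
from collections import defaultdict

# to_user (partition key) -> rows of (from_user, time, msg), kept in
# clustering order: from_user ascending, then time descending.
chat_by_to_user = defaultdict(list)


def insert(to_user, from_user, time, msg):
    rows = chat_by_to_user[to_user]
    rows.append((from_user, time, msg))
    rows.sort(key=lambda r: r[1], reverse=True)  # time DESC
    rows.sort(key=lambda r: r[0])                # stable sort: from_user ASC


def messages_to(to_user):
    """All messages for a user, in the order the partition stores them."""
    return list(chat_by_to_user[to_user])


def history(to_user, from_user, since=None):
    """One-to-one history; the time range only works once from_user is fixed."""
    return [
        row for row in chat_by_to_user[to_user]
        if row[0] == from_user and (since is None or row[1] >= since)
    ]


insert("Hasan", "Cedrick", "2024-01-01T10:00", "hi")
insert("Hasan", "Bob", "2024-01-01T11:00", "hello")
insert("Hasan", "Cedrick", "2024-01-01T12:00", "are you there?")
```

Note that messages_to("Hasan") comes back grouped by sender rather than globally sorted by time, which is why the first requirement is better served by the table clustered on time alone or by sorting on the client.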