How should I, or should I not, use Cassandra and Redis together to build a scalable one-on-one chat application? - cassandra

Up until now I have used MySQL to do pretty much everything, but I don't like the thought of sharding my data manually and maintaining all of that.
I want to build a one-on-one chat application like Facebook and WhatsApp, like the picture below:
So we have two parts here: the right part, which is just all messages in a chat thread, and the left part, which shows chat threads with information from the last message and your chat partner's information, such as name and image.
So far this is what I have:
Cassandra is really good at writing and reading, but not so good at deleting data, because of tombstones. And you don't want to set gc_grace_seconds to 0, because if a node is down while deletes occur, the deleted rows can come back to life after a repair, so you would have to wipe all data from the node before it re-enters the cluster. Anyway, as I understand it, Cassandra would be perfect for the right part of this chat app: messages are stored and ordered by their insertion time, and that ordering never changes. You just write and read, which is what Cassandra is good at.
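For reference, gc_grace_seconds is a per-table setting (the default is 864000 seconds, i.e. 10 days); a minimal sketch of tuning it on the message table defined below:
ALTER TABLE message_by_thread_id WITH gc_grace_seconds = 864000;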
I have this type and table to store messages for the right part:
CREATE TYPE user_data_for_message (
    from_id INT,
    to_id INT,
    from_username TEXT,
    to_username TEXT,
    from_image_name TEXT,
    to_image_name TEXT
);
CREATE TABLE message_by_thread_id (
    message_id TIMEUUID,
    thread_id UUID,
    user_data FROZEN<user_data_for_message>,
    message TEXT,
    created_time INT,
    is_viewed BOOLEAN,
    PRIMARY KEY (thread_id, message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);
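Reading the right part is then a single-partition query; a minimal sketch, assuming the schema above, that fetches the 20 newest messages of a thread:
SELECT message_id, user_data, message, created_time, is_viewed
FROM message_by_thread_id
WHERE thread_id = 123e4567-e89b-12d3-a456-426655440000
LIMIT 20;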
Before I insert a new message, if the thread_id is not provided by the client, I can check whether a thread between the two users exists. I can store that information like this:
CREATE TABLE message_thread_by_user_ids (
    thread_id UUID,
    user1_id INT,
    user2_id INT,
    PRIMARY KEY (user1_id, user2_id)
);
I could store two rows for every thread, with user1 and user2 in reversed order, so that I only need one read to check for existence (see the batch sketch below). Since I don't want to check Cassandra for a thread's existence before every insert, I could first check in Redis, since it is in memory and much faster.
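A logged batch could keep the two mirrored rows in sync; a minimal sketch, assuming the table above:
BEGIN BATCH
    INSERT INTO message_thread_by_user_ids (user1_id, user2_id, thread_id)
    VALUES (1, 2, 123e4567-e89b-12d3-a456-426655440000);
    INSERT INTO message_thread_by_user_ids (user1_id, user2_id, thread_id)
    VALUES (2, 1, 123e4567-e89b-12d3-a456-426655440000);
APPLY BATCH;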
I could save the same information in Redis like this (not two-way as I did in Cassandra, but one-way to save memory; we can do two GETs to check for it):
SET user:1:user:2:message_thread_id 123e4567-e89b-12d3-a456-426655440000
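The existence check is then two reads, one per key ordering:
GET user:1:user:2:message_thread_id
GET user:2:user:1:message_thread_id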
So before I send a message, I could first check in Redis whether a thread between the two users exists. If it is not found in Redis, I could check in Cassandra (in case the Redis server was down at some point and did not save it). If a thread exists, I just use that thread_id to insert the new message; if not, I create the thread and insert it into the table:
message_thread_by_user_ids
Then insert it into Redis with the SET command above, and finally insert the message into:
message_by_thread_id
Ok, now comes the tricky part. The left part of the chat does not have a static sort order; the ordering changes all the time. If a conversation gets a new message, that conversation goes to the top. I have not found a good way to model this in Cassandra without doing deletes and inserts: I would have to delete a row and then re-insert it for the table to reorder. Deleting and inserting a row every time I send a message does not sound like a good idea to me, but I might be wrong; I am not experienced with Cassandra.
So my thought was that I could use Redis for that left part. The only problem is that if the Redis server goes down, the most recent chat conversations on the left side will be lost, even though the chats themselves will be preserved in Cassandra. Users would need to send a new message for the conversation to appear again.
I thought I could do this in Redis in the following way:
Every time a user sends a message, for example if user 1 sends a message to user 2, I could do this:
ZADD user:1:message_thread_ids 1510624312 123e4567-e89b-12d3-a456-426655440000
ZADD user:2:message_thread_ids 1510624312 123e4567-e89b-12d3-a456-426655440000
The sorted set will keep track of the ids of the threads with the most recently active conversations, sorted by unix timestamp.
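Fetching a user's 20 most recently active threads is then a single command, reading the set from the highest (newest) score down:
ZREVRANGE user:1:message_thread_ids 0 19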
But then there is another problem: every time I load this window, I have to do a ZRANGE to get, for example, the 20 most recent conversations on the left side, and then do 20 single SELECT statements with LIMIT 1 in Cassandra to get the information about the last message sent, which is perhaps not so efficient. I thought I could save the information for the last message of the 20 most recently active conversations in Redis with HMSET, with the most relevant fields: the message itself trimmed down to 60 characters, the last_message timestamp, from_username, to_username, from_id, to_id, from_image, to_image, and message_id.
HMSET thread:123e4567-e89b-12d3-a456-426655440000 <... message info ...>
But now I have to keep track of and delete the hashes in Redis that are no longer relevant, since I don't want to keep more than the most recent 20 (see the trimming sketch below); it would eat up memory fast. I will get the most recent 20 from Redis, and thus from memory, and if a user scrolls down, I will get 10 at a time from Cassandra. The other problem is that if the Redis server goes down, I might lose a conversation from the left side of the app if the conversation is completely new.
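Trimming the sorted set to the 20 most recent threads can be done in one command, which removes everything except the 20 highest-scored members; the per-thread hashes could alternatively be given a TTL with EXPIRE instead of being deleted by hand:
ZREMRANGEBYRANK user:1:message_thread_ids 0 -21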
I thought that with this approach I could get a lot of writes per second on the Cassandra side by just adding new nodes, and Redis can do something like 200,000-800,000 operations per second, so the deletes and additions to the sorted sets should not be a limitation. Since there will be some back and forth with the Redis server, I could either pipeline the Redis commands or write Lua scripts so that I can send the instructions to Redis in one go.
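A minimal sketch of such a Lua script, assuming the key layout above (the trim to 20 is my assumption); it updates both users' sorted sets and trims them in one round trip:
-- KEYS[1], KEYS[2]: the two users' sorted-set keys
-- ARGV[1]: unix timestamp of the message, ARGV[2]: thread id
redis.call('ZADD', KEYS[1], ARGV[1], ARGV[2])
redis.call('ZADD', KEYS[2], ARGV[1], ARGV[2])
-- keep only the 20 most recent threads per user
redis.call('ZREMRANGEBYRANK', KEYS[1], 0, -21)
redis.call('ZREMRANGEBYRANK', KEYS[2], 0, -21)
return 1
It would be invoked with EVAL (or loaded once with SCRIPT LOAD and called via EVALSHA), passing the two keys followed by the timestamp and thread id as arguments.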
Is this a good idea? How can I solve this issue of the left side of the app that shows active conversations? Is it a good idea to do this in Redis like I suggested or should I do it differently?

Both are good solutions. But where could the bottlenecks be?
1) Redis is limited by memory and cannot exceed it. Also, when the server shuts down, you lose any data that was not persisted.
2) When it comes to scaling, Redis uses a master-slave topology with sharding, whereas Cassandra uses a ring topology where every node is equal for writes and reads.
In my opinion, I would rather use Cassandra, knowing it isn't as fast as Redis but is fast enough and very easy to scale.
How do your users write to each other? I assume you do this with a WebSocket, don't you? If yes, just track the socket ID and remove it when the socket disconnects.
Another question: where and how do you retrieve the friend IDs for a certain person (the left side of your picture)?

Related

Tally unread (chat) messages in database

My goal is to create daily reports for users about chat messages they've missed/not read yet. Right now, all data is stored in ScyllaDB, and that is working out well for the most part. But when it comes to these reports, I've no idea whether there is a good way to achieve that without changing the database system.
Thing is, I don't want to query the unread messages for each user. (I could do that, because messages have a timeuuid I can compare with a last_read timestamp, but it's slow because it means multiple queries for every single user there is.) Therefore, I tried to create a dedicated table for the reporting:
CREATE TABLE missed_messages (
    user uuid,
    channel uuid,
    count_start_time timestamp,
    missed_count int,
    PRIMARY KEY (channel, user)
);
Once a new message arrives in the channel, I can retrieve all users in that channel (from another table). My idea was to increment missed_count, or decrement it in case a message was deleted (and its creation timestamp is > count_start_time; I figure I could achieve that with an IF condition on the update). Once a user reads their messages, I reset count_start_time to the current date and missed_count to 0.
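A minimal sketch of such a conditional read-then-write, assuming the reporting table above; the IF clause makes it a lightweight transaction, so a concurrent update will not be silently overwritten:
-- read the current count first
SELECT missed_count FROM missed_messages WHERE channel = ? AND user = ?;
-- write back the new value, guarded against concurrent changes
UPDATE missed_messages SET missed_count = 5
WHERE channel = ? AND user = ?
IF missed_count = 4;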
But several issues arise here:
Since I can't use a Counter, my updates aren't atomic. But I think I could live with that.
For the reasons below, it would be ideal if I could just delete a row once messages get read, instead of resetting the timestamp and counter. But I've read that many deletions might cause performance issues (and I'm also not sure what would happen if the entry gets recreated after a short period because new messages arrive in the channel again).
The real bummer: since I did not want to iterate over all users on the system in the first place, I don't want to iterate over all entries here either. The naive idea would be to query with WHERE missed_count > 0, but missed_count isn't part of the clustering key, so to my understanding that's not feasible.
Since I have to paginate, it could happen that I get the missed messages for a single user in different chunks. I mean, it could happen that I report to user1 that he has unread messages from channel1 first, and later that he has unread messages from channel2. That means additional overhead if I want to avoid multiple reports for the same user.
Is there a way I could structure my table to solve that problem, especially to query only entries with missed_count > 0 or to utilize row deletion? Or is my goal beyond the design of Cassandra/ScyllaDB?
Thanks in advance!

Why collections shouldn't be used for unbounded data?

From Cassandra docs:
A collection is appropriate if the data for collection storage is limited. If the data has unbounded growth potential, like messages sent or sensor events registered every second, do not use collections.
Instead, use a table with a compound primary key where data is stored in the clustering columns.
I'm trying to understand why this is the case.
Let's say I have a messaging app, and instead of using PrimaryKey(chatId, timestamp, messageId) I'd use something like PrimaryKey(chatId) with a messages column, where messages is a list of the messages in a chat.
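For concreteness, a sketch of the two options (the table and column names are just illustrative):
-- recommended: compound primary key, one row per message
CREATE TABLE messages_by_chat (
    chat_id UUID,
    created_at TIMEUUID,
    message_id UUID,
    message TEXT,
    PRIMARY KEY (chat_id, created_at, message_id)
);
-- the collection approach: the whole history in a single cell
CREATE TABLE chats (
    chat_id UUID PRIMARY KEY,
    messages LIST<TEXT>
);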
Understand what? You want to add the entire chat history, i.e. all the messages, in a single column of a single row? Would you do that in a regular SQL DB? No: there would be a table where each message is its own row.
Apart from the fact that you would lose all ability to query the messages in the proposed schema, the size of that one key would balloon to the point where the ops required for the cluster become a nightmare.

Best way to run a script for a large userbase?

I have users stored in a PostgreSQL database (~10 M) and I want to send all of them emails.
Currently I have written a Node.js script which basically fetches users 1000 at a time (OFFSET and LIMIT in SQL) and queues the requests in RabbitMQ. This seems clumsy to me: if the node process fails at any point, I have to restart it (I am currently keeping track of the number of users skipped per query, and can restart from the previous offset found in the logs). This might lead to some users receiving duplicate emails and some not receiving any. I could create a new table with a column indicating whether the email has been sent to that person, but in my current situation I can't: I can neither create a new table nor add a new column to the existing one. (Seems like an idempotency problem to me?)
How would you approach this problem? Do you think compound indexes might help? Please explain.
The best way to handle this is indeed to store who received an email, so there's no chance of doing it twice.
If you can't add tables or columns to your existing database, just create a new database for this purpose. If you want to be able to recover from crashes, you need to store who got the email somewhere, so if you have hard restrictions on not storing this in your main database, get creative with another storage mechanism.
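A minimal sketch of what that could look like: a separate tracking table, plus keyset pagination instead of OFFSET/LIMIT so a restart picks up exactly where it left off (table and column names are mine, and :last_seen_id / :user_id stand in for bind parameters):
-- in a separate database you are allowed to create
CREATE TABLE email_sent (
    user_id bigint PRIMARY KEY,
    sent_at timestamptz NOT NULL DEFAULT now()
);
-- main database: fetch the next batch by id, not by offset
SELECT id, email FROM users
WHERE id > :last_seen_id
ORDER BY id
LIMIT 1000;
-- after each send, record it; the primary key makes retries idempotent
INSERT INTO email_sent (user_id) VALUES (:user_id)
ON CONFLICT (user_id) DO NOTHING;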

Sequential numbering in the cloud

Ok, so a simple task such as generating a sequential number has caused us an issue in the cloud.
When you have more than one server, it gets harder and harder to guarantee that the numbers allocated on different servers do not clash.
We are using Azure servers if it helps.
We thought about using the app cache, but you cannot guarantee it will be updated between servers.
We are limited to using:
a SQL table with an identity column
or
some peer to peer method between servers
or
use a blob store and utilise its locks to store the most up-to-date number (this could have scaling issues)
I just wondered if anyone has an idea of a solution to resolve this?
Surely it's a simple problem and must have been solved by now.
If you can live with a use case where the numbers you get from this central location are not always sequential (but are guaranteed to be unique), I would suggest considering the following pattern. I've helped a large e-commerce client implement this, since they needed unique int PKs to synchronize back on premises:
Create a queue, and create a small always-running process that populates this queue with sequential integers (this process should remember which number it generated last and keep replenishing the pool with more numbers once the queue gets close to empty).
Now you can have your code first poll the next number from the queue, delete it from the queue, and then attempt to save it into the SQL Azure database. In case of failure, all you'll have is a "hole" in your sequential numbers. In scenarios with frequent inserts, you may be saving things out of order to the database (two processes poll from the queue, one polls first but saves last, and the PKs saved to the database are no longer sequential).
The biggest downside is that you now have to maintain/monitor a process that replenishes the pool of PK's.
After reading this, I would not trust an identity column.
I think the best way is, before the insert, to get the last stored id and increment it by one (programmatically). Another option is to create a trigger, but it could be a mess if you receive a lot of concurrent requests on the DB or if your table has millions of records.
create trigger trigger_name
on table_name
after insert
as
declare @seq int
set @seq = (select max(id) + 1 from table_name)
update table_name
set table_name.id = @seq
from table_name
inner join inserted
on table_name.id = inserted.id
More info:
http://msdn.microsoft.com/en-us/library/windowsazure/ee336242.aspx
If you're worried about scaling the number generation when using blobs, you can use the SnowMaker library, which is available on GitHub and NuGet. It gets around the scale problem by retrieving blocks of ids into a local cache. This guarantees that the ids are unique, but not necessarily sequential if you have more than one server. I'm not sure if that would achieve what you're after.

Hector Cassandra Data Retrieval

Is there any way to get all the data from a column family or from a keyspace?
I can't think of a way of doing this without knowing every single key for every single entry made to the database.
My problem is that I'm trying to create a Twitter clone where each message has its own id, and store those in the same keyspace in the same column family.
But then how do I get them back? I'd have to keep track of every single id, and that can't possibly work.
Any help/ideas would be appreciated.
You can retrieve all data from a column family using get_range_slices, setting the range start and end to the same value to indicate that you want all data.
See the Cassandra FAQ
See http://aquiles.codeplex.com/discussions/278245 for a Thrift example.
Haven't yet found a handy Hector example but I think it uses RangeSlicesQuery...
However, it's not clear why you want to do this. For this sort of application you would normally look up messages by ID, and use an index to determine which IDs you need, for example by storing a row for each user that lists all their messages. In the messages column family you might have something like:
MsgID0001 -> time: 1234567, text: Hello world
MsgID0300 -> time: 3456789, text: LOL ROTFL
And then in a "user2msg" column family, store the message IDs, perhaps using timestamp column names so the messages are stored sorted in time order:
UserID001 -> 1234567: MsgID0001, 3456789: MsgID0300
This can then be used to look up a particular user's messages, possibly filtered by time.
You'd then also need further column families to store user profiles etc.
Perhaps you need to add more detail to your question?
Update in response to comment: Yes, if you have one message per row, you have to retrieve each message individually. But what is your alternative? Retrieving all messages is only useful for doing batch processing of messages, not for (for example) showing a user their recent messages. Bear in mind that retrieving all messages could take a very long time. You have not explained why you want to retrieve all messages and what you are going to do with them all. How many messages are you expecting to have?
One possibility is to denormalise, i.e. in the row for each user, store the entire message contents, so you don't have to do a separate lookup step for each message. This doubles the amount of storage required, however.
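Using the same notation as above, the denormalised per-user row might then look like this (a sketch, carrying the message bodies directly under timestamp column names):
UserID001 -> 1234567: Hello world, 3456789: LOL ROTFL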
The answer I was looking for is CQL, Cassandra's query language. It works similarly to SQL, which is what I need for the function I'm after.
This link has some excellent tutorials.
