Tally unread (chat) messages in database - cassandra

My goal is to create daily reports for users about chat messages they've missed or not read yet. Right now all data is stored in ScyllaDB, and that is working out well for the most part. But when it comes to these reports, I have no idea whether there's a good way to achieve this without changing the database system.
The thing is, I don't want to query the unread messages for each user. (I could do that, because messages have a timeuuid I can compare with a last_read timestamp, but it's slow because it means multiple queries for every single user there is.) Therefore, I tried to create a dedicated table for the reporting:
CREATE TABLE missed_messages (    -- table name assumed; it was missing in the original
    user uuid,
    channel uuid,
    count_start_time timestamp,
    missed_count int,
    PRIMARY KEY (channel, user)
);
Once a new message arrives in the channel, I can retrieve all users in that channel (from another table). My idea was to increment missed_count, or decrement it in case a message was deleted (and its creation timestamp is > count_start_time; I figure I could achieve that with an IF condition on the update). Once a user reads his messages, I reset count_start_time to the current date and missed_count to 0.
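For illustration, a minimal sketch of that read-modify-write with an IF condition, using the Node.js cassandra-driver (the table name missed_messages, keyspace, and connection details are assumptions):

// Hypothetical sketch: emulate increment/decrement with a lightweight-transaction
// compare-and-set loop, since missed_count cannot be a counter column here.
const cassandra = require('cassandra-driver');
const client = new cassandra.Client({
  contactPoints: ['127.0.0.1'],
  localDataCenter: 'datacenter1',
  keyspace: 'chat', // assumed keyspace
});

async function bumpMissedCount(channel, user, delta) {
  for (;;) {
    const rs = await client.execute(
      'SELECT missed_count FROM missed_messages WHERE channel = ? AND user = ?',
      [channel, user], { prepare: true });
    const row = rs.first();
    if (!row) {
      // First message for this (channel, user): create the row atomically.
      const ins = await client.execute(
        'INSERT INTO missed_messages (channel, user, count_start_time, missed_count) ' +
        'VALUES (?, ?, toTimestamp(now()), ?) IF NOT EXISTS',
        [channel, user, delta], { prepare: true });
      if (ins.first()['[applied]']) return;
      continue; // another writer created the row first; re-read and retry
    }
    // The IF clause makes this a compare-and-set: the update only applies if
    // missed_count is still the value we read; otherwise we loop and retry.
    const res = await client.execute(
      'UPDATE missed_messages SET missed_count = ? ' +
      'WHERE channel = ? AND user = ? IF missed_count = ?',
      [row.missed_count + delta, channel, user, row.missed_count], { prepare: true });
    if (res.first()['[applied]']) return;
  }
}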
But several issues arise here:
Since I can't use a counter (a counter table can't contain regular columns like count_start_time), my updates aren't atomic. But I think I could live with that.
For the reasons below, it would be ideal if I could just delete a row once messages get read, instead of resetting the timestamp and counter. But I've read that many deletions might cause performance issues (and I'm also not sure what would happen if the entry gets recreated after a short period because new messages arrive in the channel again).
The real bummer: since I did not want to iterate over all users on the system in the first place, I don't want to iterate over all entries here either. The naive idea would be to query with WHERE missed_count > 0, but missed_count isn't part of the clustering key, so as far as I understand that's not feasible.
Since I have to paginate, it could happen that I get the missed messages for a single user in different chunks. I mean, it could happen that I report to user1 that he has unread messages from channel1 first, and only later that he also has unread messages from channel2. That means additional overhead if I want to avoid multiple reports for the same user.
Is there a way I could structure my table to solve this problem, especially to query only entries with missed_count > 0 or to utilize row deletion? Or is my goal beyond the design of Cassandra/ScyllaDB?
Thanks in advance!

Related

DynamoDB sorting through data

Everywhere I look, the web is telling me to never use scan() in DynamoDB.
It uses all your capacity units, there's a 1 MB response size limit, etc.
I’ve looked at querying, but that doesn’t achieve what I want either.
How am I supposed to parse through my table?
Here is my setup:
I have a table “people” with rows of people.
I have attributes “email” (partition key), “fName”, “lName”, “displayName”, “passwordHash”, and “subscribed”.
subscribed is either true or false, and I need to sort through every person who is subscribed.
I can’t use a sort key because all emails are unique…
It is my understanding that DynamoDB data is sorted like follows:
primary key 1
    sort key 1
        item 1
    sort key 2
        item 2
primary key 2
    sort key 1
    ...etc...
So setting subscribed as a sort key would not work… I would still need to loop through every primary key.
Right now I am just getting every item with a filterExpression to check if someone is subscribed.
If they are, they pass. But what happens when I have hundreds of users whose data eclipses 1 MB?
I wouldn't get every subscribed user in that case, and sending repeated requests with the start key to fetch every megabyte of data is too tedious for the processor and would slow the server down significantly.
Are there any recommendations for how I should go about getting every subscribed user?
Note: subscribed cannot be the primary key with the email as a sort key, because I have instances where I need just the user, which is easy to access if the email is the primary key.
Right now I am just getting every item with a filterExpression to check if someone is subscribed. If they are, they pass. But what happens when I have hundreds of users whose data eclipses 1 MB?
GetItem for single person lookups
You should ideally be using a GetItem here, providing the user's email as the key, and then checking whether they are subscribed or not. Scanning to see if an individual is subscribed is not scalable in any way.
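A minimal sketch of that lookup with the AWS SDK for JavaScript (the table name people and attribute names come from the question; everything else is an assumption):

// Hypothetical sketch: a point lookup by partition key instead of a Scan.
const AWS = require('aws-sdk');
const docClient = new AWS.DynamoDB.DocumentClient();

async function isSubscribed(email) {
  const { Item } = await docClient.get({
    TableName: 'people',
    Key: { email },
    ProjectionExpression: 'subscribed',
  }).promise();
  // With the sparse-index variant below, subscribed is the string "true".
  return !!Item && (Item.subscribed === true || Item.subscribed === 'true');
}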
Pagination
When data exceeds 1MB you simply paginate:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Scan.html
Are there any recommendations for how I should go about getting every subscribed user?
Sparse Indexes
For this use case it's best to use a sparse index: set subscribed="true" only if it's true; if it's false, don't set the attribute at all (you must also use a string, as a boolean can't be used as a key attribute).
Once you do so, you can create a GSI on the attribute subscribed. Now only the items that are true are contained in your GSI, making it sparse. A Scan on that index is now as efficient as possible, although a single-valued partition key will limit throughput capacity to 1000 WCU.
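A hedged sketch of scanning such a sparse index, including the pagination mentioned above (the index name subscribed-index is an assumption):

// Hypothetical sketch: scan a sparse GSI that only contains subscribed users,
// following LastEvaluatedKey so responses larger than 1 MB are paginated.
const AWS = require('aws-sdk');
const docClient = new AWS.DynamoDB.DocumentClient();

async function getSubscribedUsers() {
  const users = [];
  let ExclusiveStartKey;
  do {
    const page = await docClient.scan({
      TableName: 'people',
      IndexName: 'subscribed-index', // assumed GSI name
      ExclusiveStartKey,
    }).promise();
    users.push(...page.Items);
    ExclusiveStartKey = page.LastEvaluatedKey; // undefined once the scan completes
  } while (ExclusiveStartKey);
  return users;
}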
Making things scalable
An even better way is to create an attribute called GSI_PK and assign it a random number, then use subscribed as the sort key, again as a string and only set when true. This means your index will not become a bottleneck that limits your throughput to 1000 WCU due to a single value being the partition key.
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-indexes-general-sparse-indexes.html
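A sketch of that sharded layout (the shard count, attribute names, and index name are all assumptions): writes scatter a random shard id into GSI_PK, and reads fan out one Query per shard:

// Hypothetical write-sharding sketch: the GSI partition key is a random shard
// id, and the sparse subscribed="true" attribute is the GSI sort key.
const AWS = require('aws-sdk');
const docClient = new AWS.DynamoDB.DocumentClient();
const NUM_SHARDS = 10; // assumed shard count

// On write: spread items across shards so no single GSI partition runs hot.
function withShard(item) {
  return { ...item, GSI_PK: String(Math.floor(Math.random() * NUM_SHARDS)) };
}

// On read: one Query per shard, merged client-side.
async function getSubscribedUsersSharded() {
  const queries = [];
  for (let shard = 0; shard < NUM_SHARDS; shard++) {
    queries.push(docClient.query({
      TableName: 'people',
      IndexName: 'sharded-subscribed-index', // assumed GSI name
      KeyConditionExpression: 'GSI_PK = :p AND subscribed = :t',
      ExpressionAttributeValues: { ':p': String(shard), ':t': 'true' },
    }).promise());
  }
  const pages = await Promise.all(queries);
  return pages.flatMap(p => p.Items);
}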

Best way to run a script for large userbase?

I have users stored in a PostgreSQL database (~10 M) and I want to send all of them emails.
Currently I have written a Node.js script which basically fetches users 1000 at a time (OFFSET and LIMIT in SQL) and queues the requests in RabbitMQ. This seems clumsy to me: if the Node process fails at any time, I have to restart it (I am currently keeping track of the number of users skipped per query, and can restart from the previous offset found in the logs). This might lead to some users receiving duplicate emails and some not receiving any. I could create a new table with a column indicating whether an email has been sent to that person or not, but in my current situation I can't do so: I can neither create a new table nor add a new column to the existing one. (Seems like an idempotency problem to me?)
How would you approach this problem? Do you think compound indexes might help? Please explain.
The best way to handle this is indeed to store who received an email, so there's no chance of doing it twice.
If you can't add tables or columns to your existing database, just create a new database for this purpose. If you want to be able to recover from crashes, you will need to store who got the email somewhere so if you are given hard restrictions on not storing this in your main database, get creative with another storage mechanism.
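For illustration, a hypothetical Node.js sketch of that approach (the users table columns, the sent_emails table living in a second database, and the publish function are all assumptions):

// Hypothetical sketch: keyset pagination over the main users table plus a
// separate tracking database, so the job can crash and resume without
// sending anyone a duplicate email.
const { Client } = require('pg'); // two Client instances: usersDb, trackingDb

async function enqueueAllEmails(usersDb, trackingDb, publish) {
  let lastId = 0;
  for (;;) {
    // Keyset pagination: stable and cheap, unlike ever-growing OFFSETs.
    const { rows } = await usersDb.query(
      'SELECT id, email FROM users WHERE id > $1 ORDER BY id LIMIT 1000',
      [lastId]
    );
    if (rows.length === 0) break;
    for (const user of rows) {
      // Record-before-publish makes retries safe: the INSERT succeeds only
      // the first time this user is seen, so duplicates are skipped.
      const res = await trackingDb.query(
        'INSERT INTO sent_emails (user_id) VALUES ($1) ON CONFLICT DO NOTHING',
        [user.id]
      );
      if (res.rowCount === 1) await publish(user.email);
    }
    lastId = rows[rows.length - 1].id;
  }
}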

How should I or should not use Cassandra and Redis together to build a scalable one on one chat application?

Up until now I have used MySQL to do pretty much everything, but I don't like the thought of sharding my data manually and maintaining all of that for now.
I want to build a one on one chat application that is like Facebook and WhatsApp like the picture below:
So we have two parts here. The right part which is just all messages in a chat thread, and the left part that shows chat threads with information from the last message, and your chat partners information such as name and image and so on.
So far this is what I have:
Cassandra is really good at writing and reading, but not so much at deleting data, because of tombstones. And you don't want to set gc_grace_seconds to 0, because if a node goes down while deletes occur, those deleted rows might come back to life when repair is done; to avoid that you would have to delete all data from the node before it re-enters the cluster. Anyway, as I understand it, Cassandra would be perfect for the right part of this chat app, since messages will be stored and ordered by their insertion time, and that sorting never changes. You just write and read, which is what Cassandra is good at.
I have these tables to store messages for the right part:
CREATE TYPE user_data_for_message (
    from_id INT,
    to_id INT,
    from_username TEXT,
    to_username TEXT,
    from_image_name TEXT,
    to_image_name TEXT
);

CREATE TABLE message_by_thread_id (
    message_id TIMEUUID,
    thread_id UUID,
    user_data FROZEN<user_data_for_message>,
    message TEXT,
    created_time INT,
    is_viewed BOOLEAN,
    PRIMARY KEY (thread_id, message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);
Before I insert a new message, if the thread_id is not provided by the client, I can check whether a thread between the two users exists. I can store that information like this:
CREATE TABLE message_thread_by_user_ids (
    thread_id UUID,
    user1_id INT,
    user2_id INT,
    PRIMARY KEY (user1_id, user2_id)
);
I could store two rows for every thread, with user1 and user2 in reversed order, so that I only need one read to check for existence. Since I don't want to check for the existence of a thread before every insert, I could first check in Redis whether a thread exists between the users, since it is in memory and much faster.
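A minimal sketch of that mirrored insert with the Node.js cassandra-driver (keyspace and connection details assumed; a logged batch keeps the two rows consistent with each other):

// Hypothetical sketch: store the thread id under both (user1, user2) and
// (user2, user1) so a single read finds it regardless of argument order.
const cassandra = require('cassandra-driver');
const client = new cassandra.Client({
  contactPoints: ['127.0.0.1'],
  localDataCenter: 'datacenter1',
  keyspace: 'chat', // assumed keyspace
});

async function createThread(user1Id, user2Id, threadId) {
  const query =
    'INSERT INTO message_thread_by_user_ids (user1_id, user2_id, thread_id) VALUES (?, ?, ?)';
  await client.batch([
    { query, params: [user1Id, user2Id, threadId] },
    { query, params: [user2Id, user1Id, threadId] },
  ], { prepare: true });
}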
I could save the same information in Redis too, like this (not two-way as I did in Cassandra, but one-way to save memory; we can do two GETs to check for it):
SET user:1:user:2:message_thread_id 123e4567-e89b-12d3-a456-426655440000
So before I send a message, I could first check in Redis whether there exists a thread between the two users. If not found in Redis, I could check in Cassandra (in case the Redis server was down at some point and did not save it). If a thread exists, I just use that thread_id to insert the new message; if not, I create the thread and insert it in the table:
message_thread_by_user_ids
Insert it in Redis with the SET command above. And then finally insert the message in:
message_by_thread_id
OK, now comes the tricky part. The left part of the chat does not have a static sort order; the ordering changes all the time. If a conversation has a new message, that conversation goes to the top. I have not found a good way to model this in Cassandra without doing deletes and inserts: I would have to delete a row and then insert it for the table to reorder the rows. Deleting and inserting a row every time I send a message does not sound like a good idea to me, but I might be wrong; I am not experienced with Cassandra.
So my thought was that I could use Redis for that left part. The only problem is that if the Redis server goes down, the most recent chat conversations on the left side will be lost, even though the chats themselves will be preserved in Cassandra. Users would need to resend a message for the conversation to appear again.
I thought I could do this in Redis in the following way:
Every time a user sends a message, for example if user 1 sends a message to user 2, I could do this:
ZADD user:1:message_thread_ids 1510624312 123e4567-e89b-12d3-a456-426655440000
ZADD user:2:message_thread_ids 1510624312 123e4567-e89b-12d3-a456-426655440000
The sorted set will keep track of the id of the threads with most recently active conversations sorted by unix timestamp.
But then another problem is that every time I load this window, I have to do a ZRANGE to get, for example, the 20 most recent conversations on the left side, and then do 20 single SELECT statements with LIMIT 1 in Cassandra to get the information about the last message sent, which is perhaps not so efficient. I thought I could save the information about the last message of the 20 most recently active conversations in Redis with HMSET, with the most relevant fields such as the message itself (trimmed down to 60 characters), the last_message timestamp, from_username, to_username, from_id, to_id, from_image, to_image, and message_id.
HMSET thread:123e4567-e89b-12d3-a456-426655440000 <... message info ...>
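The read path for the left pane could then look roughly like this; a hypothetical sketch with the node_redis client (key names follow the commands above):

// Hypothetical sketch: newest 20 thread ids from the sorted set, then all
// cached previews fetched in a single pipelined round trip.
const redis = require('redis');
const r = redis.createClient();

function loadLeftPane(userId, render) {
  r.zrevrange(`user:${userId}:message_thread_ids`, 0, 19, (err, threadIds) => {
    if (err) throw err;
    const m = r.multi();
    threadIds.forEach((id) => m.hgetall(`thread:${id}`));
    m.exec((execErr, previews) => {
      if (execErr) throw execErr;
      render(previews); // one hash (or null) per thread id
    });
  });
}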
But now I have to keep track of and delete the hash maps from Redis that are no longer relevant, since I don't want to keep more than the 20 most recent ones; that would eat up memory fast. I will get the 20 most recent from Redis, in memory, and if a user scrolls down I will fetch 10 at a time from Cassandra. The other problem is that if the Redis server goes down, I might lose a conversation from the left side of the app if it is a completely new conversation.
I thought that with this approach I can get a lot of writes per second on the Cassandra side by just adding new nodes, and Redis can do something like 200,000-800,000 operations per second, so doing deletes and additions on the sorted set should not be a limitation. Since there will be some back and forth with the Redis server, I could try to either pipeline the Redis commands or write Lua scripts, so that I can send the instructions to Redis in one go.
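For that write path, a hedged node_redis sketch of batching the commands into a single MULTI round trip (field names are illustrative):

// Hypothetical sketch: one pipelined round trip that bumps both users'
// conversation lists and refreshes the cached preview of the last message.
const redis = require('redis');
const r = redis.createClient();

function recordMessage(fromId, toId, threadId, sentAt, preview) {
  r.multi()
    .zadd(`user:${fromId}:message_thread_ids`, sentAt, threadId)
    .zadd(`user:${toId}:message_thread_ids`, sentAt, threadId)
    .hmset(`thread:${threadId}`, {
      last_message: preview.slice(0, 60), // trimmed as described above
      last_message_time: String(sentAt),
      from_id: String(fromId),
      to_id: String(toId),
    })
    .exec((err) => { if (err) throw err; });
}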
Is this a good idea? How can I solve this issue of the left side of the app that shows active conversations? Is it a good idea to do this in Redis like I suggested or should I do it differently?
Both are good solutions. But where could the bottlenecks be?
1) Redis is limited by memory and cannot exceed it. Also, when the server shuts down, you lose your data.
2) When it comes to scaling, Redis uses a master-slave topology with sharding, whereas Cassandra uses a ring topology where every node is equal for writes and reads.
In my opinion, I would rather use Cassandra, knowing it isn't as fast as Redis but fast enough, and very easy to scale.
Is this a good idea? How can I solve this issue of the left side of the app that shows active conversations? Is it a good idea to do this in Redis like I suggested or should I do it differently?
How do your users write to each other? I assume you do this with a websocket, don't you? If so, just track the socket ID and remove it when the socket disconnects.
Another question is: where and how do you retrieve the friend IDs for a certain person (the left side of your picture)?

Periodic checks in Node.js and MongoDB (searching for a missing record)

I'm receiving periodic reports from a bunch of devices and storing them in a MongoDB database. They come in roughly every 20-30 seconds. I would like to detect when a device has not sent a report for some time (for example, the last report is more than 3 minutes old) and then send an email or trigger some other mechanism.
So, the issue is how to check for the missing event in the most correct manner. I considered a cron job and a bunch of timers, one per device record.
A cron job looks OK, but I fear that running a full scan query will overload the server/db and cause performance issues. Is there any kind of database structure that could aid this (some kind of index, maybe)?
Timers are probably the simpler solution, but I worry about how many timers I can create, because I could end up with quite a number of devices.
Can anybody give me an advice what is the best approach to this? Thanks in advance.
Do you use Redis or something similar on this server? Set the device ID as a key with any value, e.g. 1. Expire the key in 2-3 minutes and update the expiration every time the device connects. Then fire a cron job to check whether any ID is missing. This should be super fast.
Alternatively you may use MongoDB's expiring (TTL) collections instead of Redis, but in this case you will have to do a bunch of round trips to the DB server. http://docs.mongodb.org/manual/tutorial/expire-data/
Update:
As you do not know in advance which IDs you will be looking for, this rather complicates the matter. Another option is to keep a log in a separate MongoDB collection with the timestamp of the last ping you got from each device.
Index the timestamps and query .find({timestamp: {$lt: Date.now() - 60 * 1000}}) to get the list of stale devices.
It's very important that you update the existing document rather than create a new one on each ping, so that if you have 10 devices connected you have exactly 10 documents in this collection. That's why you need a separate collection for this log.
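A minimal sketch of that pattern in the mongo shell (the collection name device_pings and the 3-minute threshold are assumptions):

// Hypothetical sketch: one document per device, upserted on every ping,
// plus an indexed query that finds devices that have gone quiet.
db.device_pings.createIndex({ timestamp: 1 });

// On each ping: upsert so the device's single document is updated in place.
db.device_pings.updateOne(
  { deviceId: deviceId },
  { $set: { timestamp: Date.now() } },
  { upsert: true }
);

// Cron job: anything not heard from in the last 3 minutes is stale.
db.device_pings.find({ timestamp: { $lt: Date.now() - 3 * 60 * 1000 } });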
There's a great article on time series data. I hope you find it useful http://blog.mongodb.org/post/65517193370/schema-design-for-time-series-data-in-mongodb
An index on deviceid+timestamp handles this neatly.
Use distinct() to get your list of devices
For each device d,
db.events.find({ deviceid: d }).sort({ timestamp : -1 }).limit(1)
gives you the most recent event, whose timestamp you can now compare with the current time.

Strategies for checking inactivity on Azure

I have a table in Azure Table Storage, with rows that are regularly updated by various processes. I want to efficiently monitor when rows haven't been updated within a specific time period, and to cause alerts to be generated if that occurs.
Most task scheduler implementations I've seen for Azure work by making sure only one worker will perform a given job at a time. However, setting up a scheduled task that waits n minutes and then queries the latest timestamp to determine whether action should be taken seems inefficient, since the work won't be spread across workers. It also seems generally inefficient to have to poll so many records.
An example use of this would be to send an email to a user that hasn't logged into a web site in the last 30 days. Assume that the number of users is a "large number" for the purposes of producing an efficient algorithm.
Does anyone have any recommendations for strategies that could be used to check for recent activity without forcing only one worker to do the job?
Keep a LastActive table with a timestamp as the row key (DateTime.UtcNow.Ticks.ToString("d19")). Update it with a batch transaction that deletes the old row and inserts the new row.
Now the query for inactive users is just something like from user in LastActive where user.PartitionKey == string.Empty && user.RowKey < (DateTime.UtcNow - TimeSpan.FromDays(30)).Ticks.ToString("d19") select user. That will be quite efficient for any size table.
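A hedged Node.js sketch of the same idea (the legacy azure-storage package's TableQuery, the table name, and the epoch-offset constant are the assumptions here; BigInt keeps the 100 ns tick arithmetic exact):

// Hypothetical sketch: build the zero-padded "d19" tick string in JavaScript
// and query for rows whose last-active tick count is older than the cutoff.
const azure = require('azure-storage');
const tableSvc = azure.createTableService();

// Ticks between 0001-01-01 and the Unix epoch, in 100 ns units.
const TICKS_AT_UNIX_EPOCH = 621355968000000000n;

function toD19(unixMillis) {
  return (BigInt(unixMillis) * 10000n + TICKS_AT_UNIX_EPOCH)
    .toString().padStart(19, '0');
}

const cutoff = toD19(Date.now() - 30 * 24 * 60 * 60 * 1000);
const query = new azure.TableQuery()
  .where('PartitionKey eq ?', '')
  .and('RowKey lt ?', cutoff); // d19 padding makes string order match numeric order

tableSvc.queryEntities('LastActive', query, null, (err, result) => {
  if (err) throw err;
  // result.entries: users inactive for 30+ days
});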
Depending on what you're going to do with that information, you might want to then put a message on a queue and then delete the row (so it doesn't get noticed again the next time you check). Multiple workers can now pull those queue messages and take action.
I'm confused about your desire to do this on multiple worker instances... you presumably want to act on an inactive user only once, so you want only one instance to do the check. (The work of sending emails or whatever else you're doing can then be spread about by using a queue, but that initial check should be done by exactly one instance.)
