Best way to run a script for large userbase?

I have users stored in a PostgreSQL database (~10 M) and I want to send all of them emails.
Currently I have a Node.js script that fetches users 1000 at a time (OFFSET and LIMIT in SQL) and queues the requests in RabbitMQ. This seems clumsy to me: if the Node process fails at any point, I have to restart it (I currently keep track of the number of users skipped per query, so I can resume from the last offset found in the logs). This might lead to some users receiving duplicate emails and some receiving none. I could create a new table with a column indicating whether the email has been sent to that person, but in my current situation I can't do so: I can neither create a new table nor add a new column to the existing table. (This seems to me like an idempotency problem?)
How would you approach this problem? Do you think compound indexes might help? Please explain.

The best way to handle this is indeed to store who received an email, so there's no chance of sending it twice.
If you can't add tables or columns to your existing database, just create a new database for this purpose. To recover from crashes you need to store who got the email somewhere, so if you have hard restrictions against storing this in your main database, get creative with another storage mechanism.
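A minimal sketch of that idea, with in-memory stand-ins so it runs without a database or broker. Everything named here is hypothetical: `users` plays the role of the PostgreSQL table, `sentLog` the separate store of who was emailed, and `enqueue` the RabbitMQ publish. Paging by "id greater than the last one seen" instead of OFFSET is my own assumption, not something the answer prescribes.

```javascript
// Hypothetical stand-in for: SELECT ... WHERE id > $1 ORDER BY id LIMIT $2
// (assumes `users` is already sorted by id).
function fetchBatch(users, afterId, limit) {
  return users.filter((u) => u.id > afterId).slice(0, limit);
}

// Walks all users in batches and enqueues each one at most once.
// `failAfter` only exists to simulate a crash part-way through.
function runCampaign(users, sentLog, enqueue, batchSize, failAfter = Infinity) {
  let cursor = 0;
  let processed = 0;
  for (;;) {
    const batch = fetchBatch(users, cursor, batchSize);
    if (batch.length === 0) return;
    for (const user of batch) {
      if (processed++ >= failAfter) throw new Error("simulated crash");
      if (!sentLog.has(user.id)) {
        // A crash between these two steps can still cause one duplicate;
        // swapping them trades that for one possibly missed email.
        enqueue(user);
        sentLog.add(user.id);
      }
      cursor = user.id;
    }
  }
}
```

After a crash you simply run it again with the same `sentLog`: users already recorded are skipped, so the restart point no longer has to be reconstructed from logs.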

Related

Tally unread (chat) messages in database

My goal is to create daily reports for users about chat messages they've missed or not read yet. Right now, all data is stored in ScyllaDB, and that is working out well for the most part. But when it comes to these reports, I have no idea whether there is a good way to achieve that without changing the database system.
The thing is, I don't want to query the unread messages for each user. (I could do that, because messages have a timeuuid I can compare with a last_read timestamp, but it's slow because it means multiple queries for every single user.) Therefore, I tried to create a dedicated table for the reporting:
CREATE TABLE missed_messages (  -- table name is a placeholder; the original statement omits it
    user uuid,
    channel uuid,
    count_start_time timestamp,
    missed_count int,
    PRIMARY KEY (channel, user)
)
Once a new message arrives in the channel, I can retrieve all users in that channel (from another table). My idea was to increment missed_count, or decrement it when a message is deleted (provided its creation timestamp is > count_start_time; I figure I could achieve that with an IF condition on the update). Once a user reads their messages, I reset count_start_time to the current date and missed_count to 0.
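To make the intended rules concrete, here is a small in-memory model of one (channel, user) row. It is only a sketch of the logic described above, not ScyllaDB code; the function and field names are made up for illustration.

```javascript
// Hypothetical in-memory model of a single (channel, user) row.
function newRow(now) {
  return { countStartTime: now, missedCount: 0 };
}

// A new message in the channel: one more missed message for this user.
function onMessageCreated(row) {
  row.missedCount += 1;
}

// A deleted message only lowers the count if it was counted in the first
// place, i.e. it was created after count_start_time.
function onMessageDeleted(row, messageCreatedAt) {
  if (messageCreatedAt > row.countStartTime && row.missedCount > 0) {
    row.missedCount -= 1;
  }
}

// The user read the channel: restart counting from now.
function onRead(row, now) {
  row.countStartTime = now;
  row.missedCount = 0;
}
```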
But several issues arise here:
Since I can't use a counter column, my updates aren't atomic. But I think I could live with that.
For the reasons below, it would be ideal if I could just delete a row once messages get read instead of resetting the timestamp and counter. But I've read that many deletions might cause performance issues (and I'm also not sure what would happen if the entry gets recreated after a short period because new messages arrive in the channel again).
The real bummer: since I did not want to iterate over all users on the system in the first place, I don't want to iterate over all entries here either. The naive idea would be to query with WHERE missed_count > 0, but missed_count isn't part of the clustering key, so to my understanding that's not feasible.
Since I have to paginate, I could get the missed messages for a single user in different chunks. That is, I might report to user1 that he has unread messages from channel1 first, and later that he has unread messages from channel2. That means additional overhead if I want to avoid multiple reports for the same user.
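The overhead in the last point amounts to something like the following: every page has to be folded into a per-user structure before any report can go out. This is only a sketch with assumed row fields (`user`, `channel`, `missedCount`), and it deliberately shows the cost, since the whole result ends up in memory.

```javascript
// Folds paginated rows into one entry per user, skipping rows with nothing missed.
// `pages` is an array of pages, each page an array of rows (hypothetical shape).
function collectUnreadByUser(pages) {
  const byUser = new Map();
  for (const page of pages) {
    for (const row of page) {
      if (row.missedCount <= 0) continue;
      if (!byUser.has(row.user)) byUser.set(row.user, []);
      byUser.get(row.user).push({ channel: row.channel, missedCount: row.missedCount });
    }
  }
  return byUser;
}
```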
Is there a way I could structure my table to solve that problem, especially how to query only entries with missed_count > 0 or to utilize row deletion? Or is my goal beyond the design of Cassandra/ScyllaDB?
Thanks in advance!

Redis and Postgresql synchronization (online users status)

In a Node.js application I have to maintain a "who was online in the last N minutes" state. Since there are potentially thousands of online users, I decided, for performance reasons, not to update my PostgreSQL user table for this task.
I chose Redis to manage the online status. It's very easy and efficient.
But now I want to run complex queries against the user table, sorted by online status.
I was thinking of creating an online table filled every minute from a Redis snapshot, but I'm not sure it's the best solution.
After the table is filled, will the next query referencing the online table take a big hit from rebuilding or loading the new indexes?
Does anyone know a better solution?

I had to solve almost exactly this issue, but I took a different approach because I didn't like the problems caused by trying to mix Redis and Postgres.
My solution was to collect the online data in a queue (ZeroMQ in my case, but any queueing system should work, as would a stream processing facility like Amazon Kinesis, the alternative I looked at). I then inserted the data in batches into a second table (not the users table). I never delete from or update that table; only inserts and queries are allowed.
Doing things this way preserved the ability to do joins between the last-online data and the users table without bogging down the database or creating many updates on the users table. It has the side effect of giving us a lot of other useful data.
One thing I have thought about when considering other solutions to this problem: your users table is transactional data (OLTP), while the latest online information is really analytics data (OLAP). So if you have a data warehouse, data lake, big data platform, or whatever the term of the week is for storing and querying this type of data, that may be a better solution.
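A rough sketch of the batching part, assuming a hypothetical `insertBatch` callback that would perform one multi-row INSERT into the append-only table. The class and function names are mine, and the "last seen" lookup is done in memory here only to show what the query over that table has to compute.

```javascript
// Buffers "user was online" events and hands them over in batches.
class OnlineEventBatcher {
  constructor(insertBatch, maxBatchSize) {
    this.insertBatch = insertBatch; // hypothetical: one multi-row INSERT per call
    this.maxBatchSize = maxBatchSize;
    this.buffer = [];
  }

  record(userId, seenAt) {
    this.buffer.push({ userId, seenAt });
    if (this.buffer.length >= this.maxBatchSize) this.flush();
  }

  // Also meant to be called on a timer so small batches do not wait forever.
  flush() {
    if (this.buffer.length === 0) return;
    const rows = this.buffer;
    this.buffer = [];
    this.insertBatch(rows);
  }
}

// What a "latest online time per user" query over the insert-only rows returns.
function lastSeenByUser(rows) {
  const latest = new Map();
  for (const { userId, seenAt } of rows) {
    if (!latest.has(userId) || latest.get(userId) < seenAt) latest.set(userId, seenAt);
  }
  return latest;
}
```

Because rows are only ever appended, the same user shows up many times; the join against the users table works on the per-user maximum.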

Robot's Tracker Threads and Display

Application: The proposed application has a TCP server able to handle several connections with the robots.
I chose to work with a database rather than files, so I'm using a SQLite db to save information about the robots and their full history, robot models, tasks, etc.
The robots send us various data such as odometry, task information, and so on.
I create a thread for every new robot connection to handle the messages and update the robots' information in the database. Now let's talk about my problems:
The application has to show information about the robots in real time. I was thinking about using QSqlQueryModel, setting the right query and then showing it in a QTableView, but I ran into some problems to think about:
Problem number 1: Some of the information to show in the QTableView is not in the database. I store the current consumption and the actual charge (as capacity) in the database, but I also want to show the remaining battery time. How can I add that column, with the right behaviour (the math implemented), to my table view?
Problem number 2: I will be receiving messages every second from each robot, so updating the db and the GUI (reloading the query) each time may not be the best solution when many robots are connected. Is it better to update the table view directly and only write to the db every minute or so? If I use this method I can't use QSqlQueryModel to update the table view, so what approach do you recommend?
Thanks
SancheZ

I have run into a similar problem before; my conclusion was that QSqlQueryModel is not the best option for display purposes. You may want to do some processing on the query results, or create, remove, or change display data based on the result for a fancier GUI. I think the best approach is to implement your own delegates and override the view-related methods (e.g. setData on the model, setEditorData on the delegate).
This way you have control over all your columns and a direct pairing of the raw data with its display equivalent (i.e. Qt::EditRole, Qt::UserRole).
Yes, it is better to update your view in real time and run a batch write at a lower frequency to update the database. In general the app is the middle layer and the db is the bottom layer for data monitoring, unless you use an in-memory database with a shared cache.
EDIT: One important point: you cannot usefully run updates from multiple threads (you can, but SQLite blocks the thread until it gets the lock), so it is best to run updates from a single thread.
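The two suggestions can be sketched independently of Qt. The example below is in JavaScript rather than C++ and makes assumptions the question does not state: charge is in ampere-hours, consumption in amperes, and remaining time is simply charge divided by consumption. All names are hypothetical.

```javascript
// Derived column: remaining battery time in hours (assumed units: Ah and A).
function remainingHours(chargeAh, consumptionA) {
  if (consumptionA <= 0) return Infinity; // idle or charging: no finite estimate
  return chargeAh / consumptionA;
}

// Keeps the latest state per robot for the view and remembers which robots
// changed since the last database write.
class RobotStateCache {
  constructor() {
    this.latest = new Map();
    this.dirty = new Set();
  }

  // Called for every incoming message (e.g. once per second per robot).
  update(robotId, state) {
    this.latest.set(robotId, state);
    this.dirty.add(robotId);
  }

  // What the table view would display for one robot, including the derived column.
  rowFor(robotId) {
    const s = this.latest.get(robotId);
    return { robotId, ...s, remainingHours: remainingHours(s.chargeAh, s.consumptionA) };
  }

  // Called at a lower frequency from the single writer: returns changed rows once.
  drainDirty() {
    const rows = [...this.dirty].map((id) => ({ robotId: id, ...this.latest.get(id) }));
    this.dirty.clear();
    return rows;
  }
}
```

The view reads from the cache on every message, while the single writer thread calls `drainDirty()` every minute or so and persists only the latest state of each robot that changed.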

How to create a appropriate database model for the IM

Recently we have been developing the IM feature for our app, and we want to save the chat records with Core Data. The strategy we came up with is:
every account has a separate SQLite file.
every chat has a separate table (dynamically created, refer to this article); however, the table structure is the same, such as:
sender_id
msg_id
content
msg_send_time
...
If we put all the chat messages in one table, we would fetch the records by "fromid and toid" to get the records of a specific dialog. However, with thousands upon thousands of messages in this table, we suspect the fetch request would be very slow, so we create a specific table for each dialog.
So, is there a better solution for this problem?

Creating "tables" for conversations dynamically is a very bad idea. It creates so much overhead that it will make your code completely inefficient.
Instead, use a single entity (not a table, mind you; Core Data is not a database) to capture the messages. Filter by user IDs.
This will perform without a glitch with hundreds of thousands of messages, far more than should be stored or displayed on a mobile device.
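As a language-neutral illustration (in JavaScript, not Core Data code), the single-entity approach boils down to one collection of messages plus a filter on the two user IDs. The field names `fromId`, `toId`, and `msgSendTime` are assumptions based on the question; in Core Data the filter would be expressed as a predicate on a fetch request.

```javascript
// Returns the dialog between users a and b, oldest message first.
// `messages` stands in for all instances of the single message entity.
function messagesBetween(messages, a, b) {
  return messages
    .filter(
      (m) => (m.fromId === a && m.toId === b) || (m.fromId === b && m.toId === a)
    )
    .sort((x, y) => x.msgSendTime - y.msgSendTime);
}
```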

Processing a stream in Node where action depends on asynchronous calls

I am trying to write a node program that takes a stream of data (using xml-stream) and consolidates it and writes it to a database (using mongoose). I am having problems figuring out how to do the consolidation, since the data may not have hit the database by the time I am processing the next record. I am trying to do something like:
on order data being read from stream
    look to see if customer exists on mongodb collection
    if customer exists
        add the order to the document
    else
        create the customer record with just this order
    save the customer
My problem is that two 'nearby' orders for a customer cause duplicate customer records to be written, since the first one hasn't been written before the second one checks to see if it is there.
In theory I think I could get around the problem by pausing the xml-stream, but there is a bug preventing me from doing this.

I'm not sure this is the best option, but using an async queue is what I ended up doing.
Around the same time, a pull request that allows pausing was added to xml-stream (which is what I was using to process the stream).
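For illustration, here is a hand-rolled serial queue built on a promise chain. It is not the `async` library's queue and not necessarily what I used, just a minimal sketch of the same idea: each order is fully handled before the next one starts. The fake store and `handleOrder` are hypothetical stand-ins for the mongoose calls.

```javascript
// Runs tasks strictly one after another, in the order they were enqueued.
function createSerialQueue() {
  let tail = Promise.resolve();
  return function enqueue(task) {
    const result = tail.then(() => task());
    tail = result.catch(() => {}); // keep the chain alive after a failed task
    return result;
  };
}

const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical stand-in for the MongoDB collection, with artificial latency.
function createFakeStore() {
  const docs = [];
  return {
    docs,
    async findCustomer(id) {
      await delay(5);
      return docs.find((d) => d.customerId === id) || null;
    },
    async saveNew(doc) {
      await delay(5);
      docs.push(doc);
    },
  };
}

// The check-then-write logic from the question.
async function handleOrder(store, order) {
  const existing = await store.findCustomer(order.customerId);
  if (existing) {
    existing.orders.push(order.orderId);
  } else {
    await store.saveNew({ customerId: order.customerId, orders: [order.orderId] });
  }
}
```

In the stream handler you would call `enqueue(() => handleOrder(store, order))` instead of calling `handleOrder` directly, so the second of two nearby orders only looks up the customer after the first has been saved.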

Is there a unique field on the customer object in the data coming from the stream? You could add a unique constraint to your mongoose schema to prevent duplicates at the database level.
When creating new customers, add some fallback logic to handle the case where you try to create a customer but that same customer is created by another save at the same time. When this happens, retry: first fetch the other customer, then add the order to the fetched customer document.
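A minimal sketch of that fallback, using an in-memory store that mimics a unique index by throwing an error with code 11000 (MongoDB's duplicate key error code). The store and its methods are hypothetical stand-ins, not the mongoose API.

```javascript
// Hypothetical store that enforces uniqueness on customerId.
function createUniqueStore() {
  const docs = new Map();
  return {
    docs,
    async find(customerId) {
      return docs.get(customerId) || null;
    },
    async insert(doc) {
      await Promise.resolve(); // yield, so two inserts can interleave
      if (docs.has(doc.customerId)) {
        const err = new Error("duplicate key");
        err.code = 11000; // same code MongoDB uses for duplicate key errors
        throw err;
      }
      docs.set(doc.customerId, doc);
    },
    async addOrder(customerId, orderId) {
      docs.get(customerId).orders.push(orderId);
    },
  };
}

// Create the customer if missing; if another save won the race, fall back to
// adding the order to the customer that now exists.
async function upsertOrder(store, order) {
  const existing = await store.find(order.customerId);
  if (existing) {
    await store.addOrder(order.customerId, order.orderId);
    return;
  }
  try {
    await store.insert({ customerId: order.customerId, orders: [order.orderId] });
  } catch (err) {
    if (err.code !== 11000) throw err; // only handle the duplicate case
    await store.addOrder(order.customerId, order.orderId);
  }
}
```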
