Periodic checks in node.js and mongodb (searching for missing records)

I'm receiving periodic reports from a bunch of devices and storing them in a MongoDB database. They arrive roughly every 20-30 seconds. However, I would like to detect when a device has not sent a report for some time (for example, its last report is more than 3 minutes old) and then send an email or trigger some other mechanism.
So the issue is how to check for the missing reports in the cleanest manner. I have considered a cron job and a set of timers, one per device record.
A cron job looks OK, but I fear that running a full scan query will overload the server/DB and cause performance issues. Is there any kind of database structure that could help with this (some kind of index, maybe)?
Timers are probably the simpler solution, but I'm not sure how many timers I can create, since I may end up with quite a number of devices.
Can anybody advise on the best approach to this? Thanks in advance.

Do you use Redis or something similar on this server? Set the device ID as a key with any value, e.g. 1. Expire the key in 2-3 minutes and refresh the expiration every time the device reports in. Then run a cron job to check which IDs are missing. This should be super fast.
You could also use MongoDB's TTL collections instead of Redis, but in that case you will have to make a bunch of round trips to the DB server. http://docs.mongodb.org/manual/tutorial/expire-data/
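A rough sketch of this in Node.js, assuming the ioredis client (the key prefix and the 3-minute TTL are just examples, and you still need a known list of device IDs to check against):

const Redis = require('ioredis');
const redis = new Redis();

// Call this every time a device reports in: the key silently
// disappears if the device stays quiet for 3 minutes.
async function markAlive(deviceId) {
  await redis.set(`device:alive:${deviceId}`, 1, 'EX', 180);
}

// Cron job: which of the known devices no longer have a live key?
async function findStale(knownDeviceIds) {
  const stale = [];
  for (const id of knownDeviceIds) {
    if (!(await redis.exists(`device:alive:${id}`))) stale.push(id);
  }
  return stale;
}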
Update:
Since you do not know in advance which IDs you will be looking for, that complicates the matter. Another option is to keep a log in a separate MongoDB collection with the timestamp of the last ping you got from each device.
Index the timestamps and query .find({timestamp: {$lt: Date.now() - 60 * 1000}}) to get a list of stale devices.
It's very important that you update the existing document rather than create a new one on each ping, so that if you have 10 devices connected you have exactly 10 documents in this collection. That's why you need a separate collection for this log.
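A minimal sketch of such a last-ping collection with the official Node.js mongodb driver (collection and field names, and the 3-minute cutoff, are illustrative):

const { MongoClient } = require('mongodb');

const client = new MongoClient('mongodb://localhost:27017');
const pings = client.db('telemetry').collection('device_pings');

async function setup() {
  await client.connect();
  // Index the timestamp so the stale-device scan stays cheap.
  await pings.createIndex({ timestamp: 1 });
}

// One document per device, overwritten (upserted) on every report.
// Timestamp is stored as epoch milliseconds, matching the query above.
async function recordPing(deviceId) {
  await pings.updateOne(
    { _id: deviceId },
    { $set: { timestamp: Date.now() } },
    { upsert: true }
  );
}

// Cron job: devices whose last ping is older than 3 minutes.
function findStaleDevices() {
  return pings.find({ timestamp: { $lt: Date.now() - 3 * 60 * 1000 } }).toArray();
}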
There's a great article on schema design for time series data that you may find useful: http://blog.mongodb.org/post/65517193370/schema-design-for-time-series-data-in-mongodb

An index on deviceid+timestamp handles this neatly.
Use distinct() to get your list of devices.
For each device d,
db.events.find({ deviceid: d }).sort({ timestamp : -1 }).limit(1)
gives you the most recent event, whose timestamp you can now compare with the current time.
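For example, with the Node.js driver (the events collection comes from the query above; the cutoff is illustrative):

const { MongoClient } = require('mongodb');

const client = new MongoClient('mongodb://localhost:27017');
const events = client.db('telemetry').collection('events');

async function findSilentDevices(maxAgeMs) {
  const cutoff = new Date(Date.now() - maxAgeMs);
  const silent = [];
  for (const deviceid of await events.distinct('deviceid')) {
    // Most recent event for this device; the compound index keeps this cheap.
    const latest = await events
      .find({ deviceid })
      .sort({ timestamp: -1 })
      .limit(1)
      .next();
    // The < comparison works whether timestamp is a Date or epoch milliseconds.
    if (latest && latest.timestamp < cutoff) silent.push(deviceid);
  }
  return silent;
}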

Related

Tally unread (chat) messages in database

My goal is to create daily reports for users about chat messages they've missed / not read yet. Right now all data is stored in ScyllaDB, and that is working out well for the most part. But when it comes to these reports, I have no idea whether there is a good way to achieve them without changing the database system.
The thing is, I don't want to query the unread messages for each user individually. (I could do that, because messages have a timeuuid I can compare with a last_read timestamp, but it's slow because it means multiple queries for every single user there is.) Therefore, I tried to create a dedicated table for the reporting:
CREATE TABLE unread_counts (  -- "unread_counts" is just a placeholder name
    user uuid,
    channel uuid,
    count_start_time timestamp,
    missed_count int,
    PRIMARY KEY (channel, user)
);
Once a new message arrives in a channel, I can retrieve all users in that channel (from another table). My idea was to increment missed_count, or decrement it in case a message gets deleted (and its creation timestamp is > count_start_time; I figure I could achieve that with an IF condition on the update). Once a user reads his messages, I reset count_start_time to the current date and missed_count to 0.
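Roughly, the per-message update I have in mind looks like this (just a sketch with the Node.js cassandra-driver; the keyspace name is a placeholder and it assumes the row for the (channel, user) pair already exists):

const cassandra = require('cassandra-driver');

const client = new cassandra.Client({
  contactPoints: ['127.0.0.1'],
  localDataCenter: 'datacenter1',
  keyspace: 'chat',
});

// Bump the unread counter for one (channel, user) pair.
// Since missed_count is not a counter column, this is a read followed by a
// conditional write (LWT), i.e. not atomic.
async function bumpMissedCount(channel, user) {
  const row = (await client.execute(
    'SELECT missed_count FROM unread_counts WHERE channel = ? AND user = ?',
    [channel, user],
    { prepare: true }
  )).first();
  const current = row ? row.missed_count : 0;

  await client.execute(
    'UPDATE unread_counts SET missed_count = ? WHERE channel = ? AND user = ? IF missed_count = ?',
    [current + 1, channel, user, current],
    { prepare: true }
  );
}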
But several issues arise here:
Since I can't use a counter column, my updates aren't atomic. But I think I could live with that.
For the reasons below, it would be ideal if I could just delete a row once messages get read, instead of resetting the timestamp and counter. But I've read that many deletions might cause performance issues (and I'm also not sure what would happen if the entry gets recreated after a short period because new messages arrive in the channel again).
The real bummer: since I did not want to iterate over all users on the system in the first place, I don't want to iterate over all entries here either. The naive idea would be to query with WHERE missed_count > 0, but missed_count isn't part of the clustering key, so as far as I understand that's not feasible.
Since I have to paginate, it could happen that I get the missed messages for a single user in different chunks. I mean, it could happen that I first report to user1 that he has unread messages from channel1, and only later that he also has unread messages from channel2. That means additional overhead if I want to avoid multiple reports for the same user.
Is there a way I could structure my table to solve this problem, in particular to query only entries with missed_count > 0 or to make use of row deletion? Or is my goal beyond the design of Cassandra/ScyllaDB?
Thanks in advance!

Best way to run a script for large userbase?

I have users stored in a PostgreSQL database (~10 M) and I want to send all of them emails.
Currently I have written a Node.js script which fetches users 1000 at a time (OFFSET and LIMIT in SQL) and queues the requests in RabbitMQ. This seems clumsy to me: if the Node process fails at any point, I have to restart it (I am currently logging the number of users skipped per query and can resume from the previous offset found in the logs). This might lead to some users receiving duplicate emails and some not receiving any. I could create a new table with a column indicating whether the email has been sent to that person, but in my current situation I can't: I can neither create a new table nor add a new column to the existing one. (Seems like an idempotency problem to me?)
How would you approach this problem? Do you think compound indexes might help? Please explain.
The best way to handle this is indeed to store who has already received an email, so there's no chance of sending it twice.
If you can't add tables or columns to your existing database, just create a new database for this purpose. If you want to be able to recover from crashes, you need to store who got the email somewhere, so if you are under hard restrictions not to store this in your main database, get creative with another storage mechanism.
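For example, one way to get creative under those restrictions is a Redis set of user IDs that have already been queued, so the script can crash and resume safely (this is only a sketch; the pg/ioredis libraries and the users columns are assumptions):

const { Pool } = require('pg');
const Redis = require('ioredis');

const pool = new Pool({ connectionString: 'postgres://localhost/app' });
const redis = new Redis();

async function queueAllEmails(publish) {
  let lastId = 0;
  for (;;) {
    // Page by primary key instead of OFFSET, so restarts and large offsets stay cheap.
    const { rows } = await pool.query(
      'SELECT id, email FROM users WHERE id > $1 ORDER BY id LIMIT 1000',
      [lastId]
    );
    if (rows.length === 0) break;

    for (const user of rows) {
      // SADD returns 1 only the first time, so a restart never queues a duplicate.
      if (await redis.sadd('emails:queued', user.id)) {
        await publish(user.email); // e.g. publish to RabbitMQ
      }
    }
    lastId = rows[rows.length - 1].id;
  }
}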

What's the proper way to keep track of changes to documents so that clients can poll for deltas?

I'm storing key-value documents in a Mongo collection, while multiple clients push updates to this collection (posting to an API endpoint) at a very fast pace (updates will come in faster than once per second).
I need to expose another endpoint so that a watcher can poll for all changes, in delta format, since its last poll. Each diff must have a sequence number and/or timestamp.
What I'm thinking is:
For each update I calculate a diff and store it.
I store each diff in a Mongo collection, with the current timestamp (using the Node Date object).
On each poll for changes: get all diffs from the collection, delete them, and return them.
The questions are:
Is it safe to use timestamps for sequencing changes?
Should I be using Mongo to store all diffs as changes are coming or some kind of message queue would be a better solution?
thanks!
On each poll for changes: get all diffs from the collection, delete them, and return them.
This sounds terribly fragile. What if the client didn't receive the data (it crashed or the network dropped in the middle of the response)? It retries the request, but oops, now it doesn't see anything. What I would do instead is have the client remember the last version it saw and ask for updates like this:
GET /thing/:id/deltas?after_version=XYZ
When it receives a new batch of deltas, it takes the last version in that batch and updates its cached value.
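A minimal Express sketch of that endpoint, assuming the diffs live in a deltas collection and the client passes the last _id it has seen as after_version (all names are illustrative):

const express = require('express');
const { MongoClient, ObjectId } = require('mongodb');

const app = express();
const client = new MongoClient('mongodb://localhost:27017');
const deltas = client.db('app').collection('deltas');

app.get('/thing/:id/deltas', async (req, res) => {
  const query = { thingId: req.params.id };
  // Only return diffs newer than the version the client already has.
  if (req.query.after_version) {
    query._id = { $gt: new ObjectId(req.query.after_version) };
  }
  // _id order roughly follows insertion time, so it doubles as a sequence number.
  const batch = await deltas.find(query).sort({ _id: 1 }).toArray();
  res.json(batch);
});

client.connect().then(() => app.listen(3000));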
Is it safe to use timestamps for sequencing changes?
I think so. An ObjectId already contains a timestamp, so you might use just that; there is no need for a separate time field.
Should I be using Mongo to store all diffs as changes are coming or some kind of message queue would be a better solution?
That depends on your requirements. Mongo should work well here, especially if you'll be cleaning out old data.
at a very fast pace (updates will come in faster than once per second)
By modern standards, 1 per second is nothing. 10 per second - same. 10k per second - now we're talking.

Running query on database after a document/row is of certain age

What is the best practice for running a database query after any document in a collection reaches a certain age?
Let's say this is a Node.js web system with MongoDB, with a collection of posts. After a new post is inserted, it should be updated with some data after 60 minutes.
Would a cron job that checks all posts with (age < one hour) every minute or two be the best solution? What would be the least stressful solution if this system has >10,000 active users?
Some ideas:
Create a second collection as a queue, with a "time to update" field containing the time at which the source record needs to be updated. Index it, and scan for values older than "now" (sketched below).
Alternatively, include the field mentioned above in the original document and index it the same way.
You could then just clear the value when done, or reset it to 60 minutes in the future, depending on the desired behavior (rather than repeatedly inserting and deleting documents in the collection).
By keeping the update collection distinct, you have a better chance of always keeping the entire working set of queued updates in memory (compared to storing the update info in your posts).
I'd kick off the update not as a web request to the same instance of Node but as a separate process, so as not to block user requests.
As to how you schedule it -- that's up to you and your architecture and what's best for your system. There's no right "best" answer, especially if you have multiple web servers or a sharded data system.
You might use a capped collection, although you'd run the risk of losing records that still need to be updated (though you'd gain performance).
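A sketch of the queue-collection idea above with the Node.js mongodb driver (the collection name and the exact update applied to the post are illustrative):

const { MongoClient } = require('mongodb');

const client = new MongoClient('mongodb://localhost:27017');
const db = client.db('app');
const queue = db.collection('post_update_queue');

async function setup() {
  await client.connect();
  await queue.createIndex({ updateAt: 1 });
}

// Called right after a post is inserted.
function scheduleUpdate(postId) {
  return queue.insertOne({ postId, updateAt: new Date(Date.now() + 60 * 60 * 1000) });
}

// Separate worker process, run every minute or two.
async function processDueUpdates() {
  const due = await queue.find({ updateAt: { $lte: new Date() } }).toArray();
  for (const job of due) {
    await db.collection('posts').updateOne(
      { _id: job.postId },
      { $set: { matured: true } } // stand-in for whatever the 60-minute update is
    );
    await queue.deleteOne({ _id: job._id });
  }
}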

Users last-access time with CouchDB

I am new to CouchDB, but that is not really related to the problem. The question is simple, yet the answer is not clear to me.
For example: Boris was on the site 5 seconds ago, and Ivan, viewing Boris's profile, can see that.
How do I correctly implement this feature (users' last-access time)?
The problem is that if we update the user's profile document in CouchDB (for example a last_access_time property) each time a page is refreshed, then we have the most relevant information (with MySQL we did it this way), but on the other hand the document's _rev will be somewhere around 100000++ by the end of the day.
So, how do you do that, or do you have any other ideas?
This is not a full answer but a possible optimization. It will work in addition to any other answers here.
Instead of writing the latest timestamp on every hit, update it only when it has drifted by more than some interval, e.g. 5 seconds or 60 seconds.
Assume a user refreshes every second for a day. That is 86,400 updates. But if you only update the timestamp at 5-second intervals, that drops to 17,280; at 60-second intervals it is 1,440.
You can do this on the client side: when you want to update the timestamp, fetch the current document and check the old timestamp. If it is less than 5 seconds old, don't do anything; otherwise, update it normally.
You can also do it on the server side by writing an _update function in CouchDB, which you can call like e.g. POST /db/_design/my_app/_update/last-access/the_doc_id?time=2011-01-31T05:05:31.872Z. The update function does the same thing: it checks the old timestamp and either does nothing or updates the document, depending on the elapsed time.
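Such an _update function (stored under the "my_app" design document as the "last-access" handler from the URL above) might look roughly like this; the 5-second threshold and the last_access_time field name are illustrative:

function (doc, req) {
  // Update handlers return [newDoc, responseBody]; returning null as the
  // first element tells CouchDB not to write anything.
  if (!doc) return [null, 'missing'];
  var newTime = new Date(req.query.time);
  var oldTime = doc.last_access_time ? new Date(doc.last_access_time) : 0;
  if (oldTime && (newTime - oldTime) < 5000) {
    return [null, 'fresh enough, not updated'];
  }
  doc.last_access_time = req.query.time;
  return [doc, 'updated'];
}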
If a (large) part of the document is relatively static and a (small) part is highly dynamic, I would consider splitting it into two separate documents.
Another option is to use something better suited to a high write throughput of small pieces of data, such as Redis or possibly MongoDB, and (if necessary) have a background task occasionally write the info to CouchDB.
CouchDB has no problem with rapid document updates. Just do it, like you would with MySQL. A high _rev is not a problem.
The only thing is, you have to be responsible about your couch from day 1. All CouchDB users must do this anyway; you may just have to do it sooner. (Applications with few updates have a lower risk of filling the disk, so their developers can postpone this work.)
Poll your database and run compaction if it needs it (based on file size, document count, or update seq number); see the sketch below.
Poll your views and run compaction on them too.
Always have enough disk capacity and I/O bandwidth to support compaction. Mathematical worst case: you need 2x the database size and 2x the write speed; however, most applications require less. Since you are updating documents rather than adding new ones, you will need far less.
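A rough sketch of that polling against CouchDB's HTTP API (the database name, credentials, and the 2x size threshold are illustrative; fetch assumes Node 18+):

const COUCH = 'http://localhost:5984';
const DB = 'users';
const AUTH = 'Basic ' + Buffer.from('admin:password').toString('base64');

async function compactIfNeeded() {
  const info = await (await fetch(`${COUCH}/${DB}`, { headers: { Authorization: AUTH } })).json();
  // CouchDB 2+ reports sizes.file / sizes.active; older versions use disk_size / data_size.
  const fileSize = info.sizes ? info.sizes.file : info.disk_size;
  const dataSize = info.sizes ? info.sizes.active : info.data_size;

  // Compact when the file is much larger than the live data it holds.
  if (fileSize > dataSize * 2) {
    await fetch(`${COUCH}/${DB}/_compact`, {
      method: 'POST',
      headers: { Authorization: AUTH, 'Content-Type': 'application/json' },
    });
  }
}

setInterval(compactIfNeeded, 60 * 60 * 1000); // check hourly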
