What's the proper way to keep track of changes to documents so that a client can poll for deltas? - node.js

I'm storing key-value documents in a MongoDB collection, while multiple clients push updates to this collection (posting to an API endpoint) at a very fast pace (updates will come in faster than once per second).
I need to expose another endpoint so that a watcher can poll all changes, in delta format, since last poll. Each diff must have a sequence number and/or timestamp.
What I'm thinking is:
For each update I calculate a diff and store it.
I store each diff in a MongoDB collection, with the current timestamp (using the Node Date object)
On each poll for changes: get all diffs from the collection, delete them and return.
The questions are:
Is it safe to use timestamps for sequencing changes?
Should I be using Mongo to store all diffs as changes are coming or some kind of message queue would be a better solution?
thanks!

On each poll for changes: get all diffs from the collection, delete them and return.
This sounds terribly fragile. What if the client didn't receive the data (it crashed or the network dropped in the middle of receiving the response)? It retries the request, but oops, the deltas are gone. What I would do instead is have the client remember the last version it saw and ask for updates like this:
GET /thing/:id/deltas?after_version=XYZ
When it receives a new batch of deltas, it takes the last version in that batch and updates its cached value.
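
A minimal sketch of such an endpoint, assuming Express and the official mongodb Node.js driver; the deltas collection, its fields (docId, version, diff) and the increasing version counter are illustrative choices, not anything prescribed above:

// Hypothetical route: return every delta recorded after the version the client last saw.
// Assumes a `deltas` collection of { docId, version, diff } documents, where `version`
// is an increasing sequence number assigned when the diff is stored.
const express = require('express');
const { MongoClient } = require('mongodb');

const app = express();
const client = new MongoClient('mongodb://localhost:27017');

app.get('/thing/:id/deltas', async (req, res) => {
  const afterVersion = Number(req.query.after_version) || 0;
  const deltas = await client
    .db('mydb')
    .collection('deltas')
    .find({ docId: req.params.id, version: { $gt: afterVersion } })
    .sort({ version: 1 })
    .toArray();
  res.json(deltas); // the client stores the highest `version` it received and sends it next time
});

client.connect().then(() => app.listen(3000));

Nothing is deleted on read here; old deltas can be cleaned up separately (for example with a TTL index), so a retried request simply returns the same batch again.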
Is it safe to use timestamps for sequencing changes?
I think so. The ObjectId already contains a timestamp, so you could use just that; there's no need for a separate time field.
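
For reference, the embedded timestamp can be read back from the _id with the driver's ObjectId, and sorting on _id gives insertion order per process (a small sketch; note the embedded timestamp has only one-second resolution, so it cannot distinguish several updates within the same second):

const { ObjectId } = require('mongodb');

const id = new ObjectId();
console.log(id.getTimestamp()); // creation time, one-second resolution

// Return deltas newer than the last ObjectId the client saw, oldest first:
// db.collection('deltas').find({ _id: { $gt: lastSeenId } }).sort({ _id: 1 })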
Should I be using Mongo to store all diffs as changes are coming or some kind of message queue would be a better solution?
Depends on your requirements. Mongo should work well here, especially if you'll be cleaning out old data.
at a very fast pace (updates will come in faster than once per second)
By modern standards, 1 per second is nothing. 10 per second - same. 10k per second - now we're talking.

Related

Couchdb watch changes feed in clustered mode returning random changes for the same since value

According to the internet, you make a request to /_changes?since=0&limit=1, do what you want with the change, then use the last_seq value, pass it to since, and request again.
My problem is, this skips changes. You can keep requesting /_changes?since=0&limit=1 and get a different change over and over, only occasionally actually getting the first change to the database. Sometimes you get the 7th change, or the 4th, etc. If you then repeat using the last_seq value, it skips ahead further; as far as I can tell, it never goes back and gets the changes it skipped.
Is there a proper way to periodically watch a couchdb changes feed without using the sockets method instead when using clusters?
What we have right now is a php script that runs on a cron task and requests the last 1000 changes, then it works through them and syncs up SQL databases to match what was in couchdb. With couchdb skipping changes, this is a big problem.
The CouchDB 2.x documentation states:
"The results returned by _changes are partially ordered. In other words, the order is not guaranteed to be preserved for multiple calls."
So, when you call /_changes?since=0&limit=1 you may obtain a different result each time, as the order is not guaranteed.
The _changes response contains a pending attribute with the number of elements left out of the response. If you take the last_seq value from the last request and use it as the since parameter in the next request, you'll get the next batch of changes, and the pending value decreases accordingly.
Also, you should pay attention to this documentation note:
If the specified replicas of the shards in any given since value are unavailable, alternative replicas are selected, and the last known checkpoint between them is used. If this happens, you might see changes again that you have previously seen. Therefore, an application making use of the _changes feed should be ‘idempotent’, that is, able to receive the same data multiple times, safely.
Reading changes in batches is a recommendation of the CouchDB Replication Protocol, used by CouchDB-compatible clients such as Cloudant Sync, so the approach you described should be correct.
Please don't use the numeric part of the change seq to infer that changes were missed, as this number is computed from cluster state, which may vary between calls. You can check this answer for more detail.
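
A sketch of that batched polling loop in Node.js, assuming the built-in fetch of Node 18+ and a hypothetical applyChange function that idempotently syncs one change into SQL:

// Poll the _changes feed in batches, always passing the previous last_seq as `since`.
// applyChange must be idempotent: the feed may replay changes you have already processed.
const BASE = 'http://localhost:5984/mydb';

async function pollChanges(since = '0') {
  while (true) {
    const url = `${BASE}/_changes?since=${encodeURIComponent(since)}&limit=1000&include_docs=true`;
    const body = await (await fetch(url)).json();
    for (const change of body.results) {
      await applyChange(change); // hypothetical: upsert into the SQL database
    }
    since = body.last_seq; // opaque token: store it as-is, never do arithmetic on it
    if (body.pending === 0) {
      await new Promise((resolve) => setTimeout(resolve, 30 * 1000)); // caught up, wait before the next poll
    }
  }
}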

Node.js - Scaling with Redis atomic updates

I have a Node.js app that performs the following:
get data from Redis
perform a calculation on the data
write the new result back to Redis
This process may take place several times per second. The issue I now face is that I wish to run multiple instances of this process, and I am obviously seeing out-of-date data being written back, because each node updates after another one has already read the last value.
How would I make the above process atomic?
I cannot add the operation to a transaction within Redis as I need to get the data (which would force a commit) before I can process and update.
Can anyone advise?
Apologies for the lack of clarity with the question.
After further reading, I can indeed use transactions; the part I was struggling to understand is that I need to separate the read from the update, wrap only the update in the transaction, and use WATCH on the read. This causes the update transaction to fail if another update has taken place in the meantime.
So the workflow is:
WATCH key
GET key
MULTI
SET key
EXEC
Hopefully this is useful for anyone else looking to do an atomic get-and-update.
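
A minimal sketch of that exact workflow, here using the ioredis client (just one possible client; the node_redis API differs slightly). With ioredis, exec() resolves to null when the watched key was modified, which is the signal to retry:

const Redis = require('ioredis');
const redis = new Redis(); // localhost:6379 by default

// Optimistic-locking update: WATCH -> GET -> compute -> MULTI/SET/EXEC, retry on conflict.
async function atomicUpdate(key, compute) {
  for (;;) {
    await redis.watch(key);               // WATCH key
    const current = await redis.get(key); // GET key
    const next = compute(current);        // your calculation on the data
    const result = await redis
      .multi()                            // MULTI
      .set(key, next)                     // SET key
      .exec();                            // EXEC
    if (result !== null) return next;     // committed
    // null: another client wrote the key between WATCH and EXEC, so try again
  }
}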
Redis supports atomic transactions http://redis.io/topics/transactions

Periodic checks in node.js and mongodb (searching for missing record)

I'm receiving periodic reports from a bunch of devices and storing them in a MongoDB database. They come in roughly every 20-30 seconds. However, I would like to detect when a device has not sent a report for some time (for example, the last report is more than 3 minutes old), and then send an email or trigger some other mechanism.
So the issue is how to check for the missing event in the most correct manner. I considered a cron job, and also a bunch of timers, one per device record.
A cron job looks OK, but I fear that running a full-scan query will overload the server/DB and cause performance issues. Is there any kind of database structure that could help here (some kind of index, maybe)?
Timers are probably the simpler solution, but I am not sure how many timers I can create, because there could be quite a number of devices.
Can anybody give me an advice what is the best approach to this? Thanks in advance.
Do you use Redis or something similar on this server? Set the device ID as a key with any value, e.g. 1. Expire the key in 2-3 minutes and refresh the expiration every time the device connects. Then run a cron job that checks which IDs have gone missing. This should be super fast.
Also, you could use MongoDB's expiring (TTL) collections instead of Redis, but in that case you will have to do a bunch of round trips to the DB server. http://docs.mongodb.org/manual/tutorial/expire-data/
Update:
As you do not know which IDs you will be looking for, this rather complicates the matter. Another option is to keep a log in a separate MongoDB collection with a timestamp of the last ping you got from each device.
Index the timestamp and query .find({timestamp: {$lt: Date.now() - 60 * 1000}}) to get a list of stale devices (that query assumes the timestamp is stored as a millisecond number; compare against new Date(Date.now() - 60 * 1000) if you store Date objects).
It's very important that you update the existing document rather than create a new one on each ping, so that if you have 10 devices connected you have exactly 10 documents in this collection. That's why you need a separate collection for this log.
There's a great article on time series data. I hope you find it useful http://blog.mongodb.org/post/65517193370/schema-design-for-time-series-data-in-mongodb
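
A sketch of that last-ping log with the Node.js mongodb driver; the heartbeats collection and lastSeen field are illustrative names, and lastSeen is stored as a Date here (so the stale query compares against new Date(...)):

// One document per device, updated in place on every ping.
// Call client.connect() once at startup before using these helpers.
const { MongoClient } = require('mongodb');
const client = new MongoClient('mongodb://localhost:27017');
const heartbeats = client.db('mydb').collection('heartbeats');

// Run whenever a report arrives:
async function recordPing(deviceId) {
  await heartbeats.updateOne(
    { _id: deviceId },
    { $set: { lastSeen: new Date() } },
    { upsert: true } // creates the document on the device's first ping
  );
}

// Cron job: devices silent for more than 3 minutes (keep an index on lastSeen).
async function findStaleDevices() {
  const cutoff = new Date(Date.now() - 3 * 60 * 1000);
  return heartbeats.find({ lastSeen: { $lt: cutoff } }).toArray();
}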
An index on deviceid+timestamp handles this neatly.
Use distinct() to get your list of devices
For each device d,
db.events.find({ deviceid: d }).sort({ timestamp : -1 }).limit(1)
gives you the most recent event, whose timestamp you can now compare with the current time.

Running query on database after a document/row is of certain age

What is the best practice for running a database query after any document in a collection becomes a certain age?
Let's say this is a Node.js web system with MongoDB, with a collection of posts. After a new post is inserted, it should be updated with some data 60 minutes later.
Would a cron job that checks all posts with (age < one hour) every minute or two be the best solution? What would be the least stressful solution if this system has more than 10,000 active users?
Some ideas:
Create a second collection as a queue with a "time to update" field, which would contain the time at which the source record needs to be updated. Index it, and scan through looking for values older than "now" (see the sketch after this list).
Include the field mentioned above in the original document and index it the same way
You could just clear the value when done or reset it to the next 60 minutes depending on behavior (rather than inserting/deleting/inserting documents into the collection).
By keeping the update-collection distinct, you have a better chance of always keeping the entire working set of queued updates in memory (compared to storing the update info in your posts).
I'd kick off the update not as a web request to the same instance of Node but as a separate process, so as not to block user requests.
As to how you schedule it -- that's up to you and your architecture and what's best for your system. There's no right "best" answer, especially if you have multiple web servers or a sharded data system.
You might use a capped collection, although you'd run the risk of losing records that still need to be updated (though you'd gain performance).
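
A sketch of the queue-collection idea from the first suggestion above, written as a separate worker process; the collection and field names (updateQueue, updateAt, enriched) are made up for illustration:

// Enqueue a follow-up update for one hour after the post is inserted,
// then scan the queue periodically. Index updateAt so the scan stays cheap.
// Call client.connect() once at startup.
const { MongoClient } = require('mongodb');
const client = new MongoClient('mongodb://localhost:27017');
const db = client.db('mydb');

async function enqueue(postId) {
  await db.collection('updateQueue').insertOne({
    postId,
    updateAt: new Date(Date.now() + 60 * 60 * 1000),
  });
}

async function processDue() {
  const due = await db.collection('updateQueue')
    .find({ updateAt: { $lte: new Date() } })
    .toArray();
  for (const job of due) {
    await db.collection('posts').updateOne(
      { _id: job.postId },
      { $set: { enriched: true } } // stand-in for whatever data the post needs after 60 minutes
    );
    await db.collection('updateQueue').deleteOne({ _id: job._id });
  }
}

setInterval(() => processDue().catch(console.error), 60 * 1000); // every minute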

Users last-access time with CouchDB

I am new to CouchDB, but that is not related to the problem. The question is simple, yet not clear to me.
For example: Boris was on the site 5 seconds ago, and Ivan, viewing Boris's profile, sees that.
How to correctly implement this feature (users last-access time)?
The problem is that if we update the user's profile document in CouchDB (e.g. the last_access_time property) each time a page is refreshed, then we will have the most up-to-date information (with MySQL we did it this way), but on the other hand the document's _rev will be somewhere around 100000+ by the end of the day.
So, how do you do that or do you have any ideas?
This is not a full answer but a possible optimization. It will work in addition to any other answers here.
Instead of writing the latest timestamp on every hit, update the timestamp only if it is more than e.g. 5 seconds, or 60 seconds, out of date.
Assume a user refreshes every second for a day. That is 86,400 updates. But if you only update the timestamp at 5-second intervals, that is 17,280; at 60-second intervals it is 1,440.
You can do this on the client side. When you want to update the timestamp, fetch the current document and check the old timestamp. If it is less than 5 seconds old, don't do anything. Otherwise, update it normally.
You can also do it on the server side. Write an _update function in CouchDB, which you can query like e.g. POST /db/_design/my_app/_update/last-access/the_doc_id?time=2011-01-31T05:05:31.872Z. The update function will do the same thing: check the old timestamp, and either do nothing, or update it, depending on the elapsed time.
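
A sketch of such an update handler, stored under updates["last-access"] in a design document; the 60-second threshold and the last_access_time field name are just example choices:

// CouchDB update function: only writes (and bumps _rev) if the stored timestamp
// is more than 60 seconds older than the one passed in the query string.
function (doc, req) {
  if (!doc) {
    return [null, 'not found'];
  }
  var newTime = new Date(req.query.time).getTime();
  var oldTime = doc.last_access_time ? new Date(doc.last_access_time).getTime() : 0;
  if (newTime - oldTime < 60 * 1000) {
    return [null, 'unchanged']; // returning a null doc means no write and no new _rev
  }
  doc.last_access_time = req.query.time;
  return [doc, 'updated'];
}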
If a (large) part of a document is relatively static and a (small) part is highly dynamic, I would consider splitting it into two different documents.
Another option might be to use something more suited to the high write throughput of small pieces of data of that nature such as Redis or possibly MongoDB, and (if necessary) have a background task to occasionally write the info to CouchDB.
CouchDB has no problem with rapid document updates. Just do it, like MySQL. High _rev is no problem.
The only thing is, you have to be responsible about your couch from day 1. All CouchDB users must do this anyway, however you may have to do it sooner. (Applications with few updates have lower risk of a full disk, so developers can postpone this work.)
Poll your database and run compaction if it needs it (based on size, document count, seq_id number); a sketch of such a check follows below.
Poll your views and run compaction too
Always have enough disk capacity and i/o bandwidth to support compaction. Mathematical worst-case: you need 2x the database size, and 2x the write speed; however, most applications require less. Since you are updating documents, not adding them, you will need way less.
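
A sketch of that polling check against the CouchDB HTTP API, using Node 18+'s built-in fetch; the fragmentation threshold is an arbitrary example and the _compact call requires admin credentials:

// Trigger database compaction when the file on disk is much larger than the live data.
const BASE = 'http://localhost:5984/mydb';

async function maybeCompact() {
  const info = await (await fetch(BASE)).json();
  // CouchDB 2.x+ reports sizes.active / sizes.file; 1.x used data_size / disk_size.
  const active = info.sizes ? info.sizes.active : info.data_size;
  const onDisk = info.sizes ? info.sizes.file : info.disk_size;
  if (onDisk > active * 2) { // example threshold: more than half the file is garbage
    await fetch(`${BASE}/_compact`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' }, // admin credentials needed in practice
    });
  }
}

setInterval(() => maybeCompact().catch(console.error), 60 * 60 * 1000); // hourly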

Resources