MongoDb + Mongoose QueryStream - Following document changes - node.js

I'm trying to make use of Mongoose and its querystream in a scheduling application, but maybe I'm misunderstanding how it works. I've read this question here on SO [Mongoose QueryStream new results] and it seems I'm correct, but someone please explain:
If I'm filtering a query like so -
Model.find().stream()
when I add or change something that matches the .find(), it should emit a data event, correct? Or am I completely wrong in my understanding of this issue?
For example, I'm trying to look at some data like so:
Events.find({'title':/^word/}).stream();
I'm changing titles in the mongodb console, and not seeing any changes.
Can anyone explain why?

Your understanding is indeed incorrect, as a stream is just an output stream of the current query response and not something that "listens for new data" by itself. The returned result here is basically just a node streaming interface, an optional alternative to a "cursor", or indeed to the direct translation to an array that mongoose methods perform by default.
So a "stream" does not just "follow" anything. It is really just another way of dealing with the normal results of a query, but in a way that does not "slurp" all of the results into memory at once. It instead uses event listeners to process each result as it is fetched from the server cursor.
What you are in fact talking about is a "tailable cursor", or some variant thereof. In basic MongoDB operations, a "tailable cursor" can be implemented on a capped collection. This is a special type of collection with specific rules, so it might not suit your purposes. They are intended for "insert only" operations which is typically suited to event queues.
On a model that is using a capped collection ( and only where a capped collection has been set ) then you implement like this:
var query = Events.find({'title':/^word/}).sort({ "$natural": -1}).limit(1);
var stream = query.tailable({ "awaitdata": true}).stream();
// fires on data received
stream.on("data",function(data) {
console.log(data);
});
The "awaitdata" there is just as an important option as the "tailable" option itself, as it is the main thing that tells the query cursor to remain "active" and "tail" the additions to the collection that meet the query conditions. But your collection must be "capped" for this to work.
An alternate and more advanced approach to this is to do something like the meteor distribution does, where the "capped collection" being tailed is in fact the MongoDB oplog. This requires a replica set configuration; however, just as meteor does out of the box, there is nothing wrong with having a single node as a replica set in itself. It's just not wise to do so in production.
This is more advanced than a simple answer, but the basic concept is that since the "oplog" is a capped collection you are able to "tail" it for all write operations on the database. That event data is then inspected to determine details such as whether the collection you want to watch has been written to. The data can then be used to query the new information and do something like return the updated or new results to a client via a websocket or similar.
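A rough sketch of that oplog-tailing concept with the native driver follows; it assumes a replica set, and the 'mydb.events' namespace and connection string are illustrative:
// Rough sketch: tail the oplog (a capped collection) for writes to one namespace.
const { MongoClient } = require('mongodb');

async function tailOplog() {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const oplog = client.db('local').collection('oplog.rs');
  const cursor = oplog.find(
    { ns: 'mydb.events' }, // a real implementation would also filter on ts to skip history
    { tailable: true, awaitData: true, noCursorTimeout: true }
  );
  for await (const entry of cursor) {
    // entry.op is 'i' (insert), 'u' (update) or 'd' (delete); entry.o holds the data
    console.log(entry.op, entry.o);
  }
}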
But a stream in itself is just a stream. To "follow" the changes on a collection you either need to implement it as capped, or consider implementing a process based on watching the changes in the oplog as described.

Related

What is the difference between a changeStream and tailable cursor in MongoDB

I am trying to determine what the difference is between a changestream:
https://docs.mongodb.com/manual/changeStreams
https://docs.mongodb.com/manual/reference/method/db.collection.watch/
which looks like so:
const changeStream = collection.watch();
changeStream.next(function(err, next) {
expect(err).to.equal(null);
client.close();
done();
});
and a tailable cursor:
https://docs.mongodb.com/manual/core/tailable-cursors/
which looks like so:
const cursor = coll.find(self.query || query)
.addCursorFlag('tailable', true)
.addCursorFlag('awaitData', true) // true or false?
.addCursorFlag('noCursorTimeout', true)
.addCursorFlag('oplogReplay', true)
.setCursorOption('numberOfRetries', Number.MAX_VALUE)
.setCursorOption('tailableRetryInterval', 200);
const strm = cursor.stream(); // Node.js transform stream
do they have a different use case? when would it be good to use one over the other?
Change Streams (available in MongoDB v3.6+) are a feature that allows you to access real-time data changes without the complexity and risk of tailing the oplog. Key benefits of change streams over tailing the oplog are:
Utilise the built-in MongoDB Role-Based Access Control. Applications can only open change streams against collections they have read access to. This gives refined and specific authorisation.
Provide a well-defined, reliable API. The change events output returned by change streams is well documented, and all of the official MongoDB drivers follow the same specification when implementing the change streams interface.
Change events returned by change streams have been committed to at least a majority of the replica set. This means the change events sent to the client are durable; applications don't need to handle data rollback in the event of failover.
Provide a total ordering of changes across shards by utilising a global logical clock. MongoDB guarantees that the order of changes is preserved, so change events can be safely interpreted in the order received. For example, a change stream cursor opened against a 3-shard sharded cluster returns change events respecting the total order of those changes across all three shards.
Due to the ordering characteristic, change streams are also inherently resumable. The _id of the change event output is a resume token. The official MongoDB drivers automatically cache this resume token, and in the case of a transient network error the driver will retry once. Additionally, applications can resume manually by utilising the parameter resume_after. See also Resume a Change Stream.
Utilise the MongoDB aggregation pipeline. Applications can modify the change events output. Currently there are five pipeline stages available to modify the event output. For example, change event outputs can be filtered (server side) before being sent out, using the $match stage. See Modify Change Stream Output for more information.
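For example, a filtered change stream with the node driver might look like this; it assumes an already-connected db handle, and the 'events' collection name and filter are illustrative:
// Illustrative: filter change events server-side with a $match stage.
const pipeline = [
  { $match: { operationType: 'insert', 'fullDocument.title': /^word/ } }
];
const changeStream = db.collection('events').watch(pipeline);

changeStream.on('change', (change) => {
  // only inserts whose title starts with "word" arrive here
  console.log(change.fullDocument);
});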
when would it be good to use one over the other?
If your MongoDB deployment is version 3.6+, I would recommend utilising MongoDB Change Streams over tailing the oplog.
You may also find Change Streams Production Recommendations a useful resource.
With a tailable cursor (on the oplog), you follow ALL changes to all collections. With a changeStream, you see only changes to the selected collection: much less traffic and more reliable.

Are MongoDB queries client-side operations?

Let's say I have a document
{ "_id" : ObjectId("544946347db27ca99e20a95f"), "nameArray": [{"id":1 , first_name: "foo"}]
Now I need to push an element into nameArray using $push. How does the document update in that case? Does the document get retrieved on the client, the update happen on the client, and the changes then get reflected to the MongoDB database server? Or is the entire operation carried out in the MongoDB database?
What you are asking here is if MongoDB operations are client-side operations. The short answer is NO.
In MongoDB a query targets a specific collection of documents, as mentioned in the documentation, and a collection is a group of MongoDB documents which exists within a single database. Collections are simply what tables are in an RDBMS. So if a query targets a specific collection, it means the operations are performed at the database level, thus server-side. The same thing applies for data modification and aggregation operations.
Sometimes your operations may involve client-side processing because MongoDB doesn't provide a way to achieve what you want out of the box. Generally speaking, you only need that type of processing when you want to modify your documents' structure in the collection or change your fields' type. In such a situation, you will need to retrieve your documents and perform your modification using bulk operations.
See the documentation:
Your array is inserted into the existing array as one element. If the array does not exist it is created. If the target is not an array the operation fails.
There is nothing stated like "retrieving the element to the client and updating it there". So the operation is completely done on the database server side. I don't know of any operation that works in the way you described, unless you chain a query, with a modification of the item on your client, and an update. But those are two separate operations, not one single command.
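To illustrate, the whole update is one server-side command; the collection name here is assumed:
// Illustrative: $push runs entirely on the server in a single command.
db.getCollection('people').updateOne(
  { _id: ObjectId("544946347db27ca99e20a95f") },
  { $push: { nameArray: { id: 2, first_name: "bar" } } }
);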

Handling conflict in find, modify, save flow in MongoDB with Mongoose

I would like to update a document in a flow that involves reading another collection and complex modifications, so the update operators in findAndModify() cannot serve my purpose.
Here's what I have:
Collection.findById(id, function (err, doc) {
// read from other collection, validation
// modify fields in doc according to user input
// (with decent amount of logic)
doc.save(function (err, doc) {
if (err) {
return res.json(500, { message: err });
}
return res.json(200, doc);
});
});
My worry is that this flow might cause a conflict if multiple clients happen to modify the same document.
It is said here that:
Operations on a single document are always atomic with MongoDB databases
I'm a bit confused about what "Operations" means.
Does this mean that findById() will acquire a lock until doc is out of scope (after the response is sent), so there wouldn't be conflicts? (I don't think so)
If not, how to modify my code to support multiple clients knowing that they will modify Collection?
Will Mongoose report conflict if it occurs?
How to handle the possible conflict? Is it possible to manually lock the Collection?
I see suggestions to use Mongoose's versionKey (or timestamp) and retry for stale documents
Don't use MongoDB altogether...
Thanks.
EDIT
Thanks @jibsales for the pointer; I now use Mongoose's versionKey (a timestamp will also work) to avoid committing conflicts.
aaronheckmann — Mongoose v3 part 1 :: Versioning
See this sample code:
https://gist.github.com/anonymous/9dc837b1ef2831c97fe8
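For reference, a minimal sketch of that retry-on-stale-document flow (not the gist's exact code; the VersionError check is an assumption, and depending on the Mongoose version you may need to bump the version or a timestamp manually):
// Minimal sketch: optimistic concurrency, retrying when the document went stale.
function updateWithRetry(id, applyChanges, retries, callback) {
  Collection.findById(id, function (err, doc) {
    if (err) return callback(err);
    applyChanges(doc); // the complex, multi-collection-informed modification
    doc.save(function (err, saved) {
      if (err && err.name === 'VersionError' && retries > 0) {
        // another writer committed first: reload a fresh copy and try again
        return updateWithRetry(id, applyChanges, retries - 1, callback);
      }
      callback(err, saved);
    });
  });
}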
Operations refers to reads/writes. Bear in mind that MongoDB is not an ACID-compliant data layer, and if you need true ACID compliance, you're better off picking another tech. That said, you can achieve atomicity and isolation via the Two Phase Commit technique outlined in this article in the MongoDB docs. This is no small undertaking, so be prepared for some heavy lifting as you'll need to work with the native driver instead of Mongoose. Again, my ultimate suggestion is to not drink the NoSQL koolaid if you need transaction support, which it sounds like you do.
When MongoDB receives a request to update a document, it will lock the database until it has completed the operation. Any other requests that MongoDB receives will wait until the locking operation has completed and the database is unlocked. This lock/wait behavior is automatic, so there aren't any conflicts to handle. You can find a lot more information about this behavior in the Concurrency section of the FAQ.
See jibsales answer for links to MongoDB's recommended technique for doing multi-document transactions.
There are a couple of NoSQL databases that do full ACID transactions, which would make your life a lot easier. FoundationDB is one such database. Data is stored as Key-Value but it supports multiple data models through layers.
Full disclosure: I'm an engineer at FoundationDB.
In my case, the mistake was trying to query a dynamic field with the upsert option. This guide helped me: How to solve error E11000 duplicate
Per the above guide, you're probably making one of two mistakes:
Upserting a document with findOneAndUpdate() where the query matches on a non-unique field.
Inserting many new documents in one go without using "ordered = false".

How to account for a failed write or add process in Mongodb

So I've been trying to wrap my head around this one for weeks, but I just can't seem to figure it out. MongoDB isn't equipped to deal with rollbacks as we typically understand them (i.e. when a client adds information to the database, like a username for example, but quits in the middle of the registration process; now the DB is left with some "hanging" information that isn't associated with anything). How can MongoDB handle that? Or if no one can answer that question, maybe they can point me to a source/example that can? Thanks.
MongoDB does not support transactions: you can't perform atomic multi-statement transactions to ensure consistency. You can only perform an atomic operation on a single document at a time. When dealing with NoSQL databases you need to validate your data as much as you can; they seldom complain about anything. There are some workarounds or patterns to achieve SQL-like transactions. For example, in your case, you can store the user's information in a temporary collection, check the data's validity, and store it in the users collection afterwards.
This should be straightforward, but things get more complicated when we deal with multiple documents. In this case, you need to create a designated collection for transactions. For instance,
transaction collection document:
{
    _id: ...,
    state: "new_transaction",
    value1: ..., // values from document_1 before updating document_1
    value2: ...  // values from document_2 before updating document_2
}
// update document 1
// update document 2
Ooohh!! something went wrong while updating document 1 or 2? No worries, we can still restore the old values from the transaction collection.
This pattern is known as compensation to mimic the transactional behavior of SQL.
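A hedged sketch of that compensation flow with the native driver follows; all names are illustrative and it assumes both documents exist, so it is not a drop-in implementation:
// Illustrative compensation sketch: save "before" images, update, restore on failure.
async function updateTwoDocuments(db, filter1, changes1, filter2, changes2) {
  const transactions = db.collection('transactions');
  const documents = db.collection('documents');

  // record the pre-update state so we can compensate later
  const before1 = await documents.findOne(filter1);
  const before2 = await documents.findOne(filter2);
  const { insertedId } = await transactions.insertOne({
    state: 'new_transaction',
    value1: before1,
    value2: before2
  });

  try {
    await documents.updateOne(filter1, { $set: changes1 });
    await documents.updateOne(filter2, { $set: changes2 });
    await transactions.updateOne({ _id: insertedId }, { $set: { state: 'committed' } });
  } catch (err) {
    // something went wrong: restore the old values from the transaction record
    await documents.replaceOne(filter1, before1);
    await documents.replaceOne(filter2, before2);
    await transactions.updateOne({ _id: insertedId }, { $set: { state: 'rolled_back' } });
    throw err;
  }
}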

how best to 'tail -f' a large collection in mongo through meteor?

I have a collection in a mongo database to which I append logging-type information. I'm trying to figure out the most efficient/simplest method to "tail -f" that in a meteor app: as a new document is added to the collection, it should be sent to the client, who should append it to the end of the current set of documents in the collection.
The client isn't going to be sent, nor keep, all of the documents in the collection; likely just the last ~100 or so.
Now, from a Mongo perspective, I don't see a way of saying "the last N documents in the collection" such that we wouldn't need to apply any sort at all. It seems like the best option available is doing natural sort descending, then a limit call, so something like what's listed in the mongo doc on $natural
db.collection.find().sort( { $natural: -1 } )
So, on the server side AFAICT the way of publishing this 'last 100 documents' Meteor collection would be something like:
Meteor.publish('logmessages', function () {
return LogMessages.find({}, { sort: { $natural: -1 }, limit: 100 });
});
Now, from a 'tail -f' perspective, this seems to have the right effect of sending the 'last 100 documents' to the client, but does so in the wrong order (the newest document would be at the start of the Meteor collection instead of at the end).
On the client side, this seems to mean needing to (unfortunately) reverse the collection. Now, I don't see a reverse() in the Meteor Collection docs and sorting by $natural: 1 doesn't work on the client (which seems reasonable, since there's no real Mongo context). In some cases, the messages will have timestamps within the documents and the client could sort by that to get the 'natural order' back, but that seems kind of hacky.
In any case, it feels like I'm likely missing a much simpler way have a live 'last 100 documents inserted into the collection' collection published from mongo through meteor. :)
Thanks!
EDIT - looks like if I change the collection in Mongo to a capped collection, then the server could create a tailable cursor to efficiently (and quickly) get notified of new documents added to the collection. However, it's not clear to me if/how to get the server to do so through a Meteor collection.
An alternative that seems a little less efficient but doesn't require switching to a capped collection (AFAICT) is using Smart Collections which does tailing of the oplog so at least it's event-driven instead of polling, and since all the operations in the source collection will be inserts, it seems like it'd still be pretty efficient. Unfortunately, AFAICT I'm still left with the sorting issues since I don't see how to define the server side collection as 'last 100 documents inserted'. :(
If there is a way of creating a collection in Mongo as a query of another ("materialized view" of sorts), then maybe I could create a log-last-100 "collection view" in Mongo, and then Meteor would be able to just publish/subscribe the entire pseudo-collection?
For insert-only data, $natural should get you the same results as indexing on timestamp and sorting so that's a good idea. The reverse thing is unfortunate; I think you have a couple choices:
use $natural and do the reverse yourself
add timestamp, still use $natural
add timestamp, index by time, sort
'#1' - For 100 items, doing the reverse client-side should be no problem even for mobile devices, and that will off-load it from the server. You can use .fetch() to convert to an array and then reverse it to maintain order without needing to use timestamps (see the sketch after this list). You'll be playing in normal array-land though; no more nice mini-mongo features, so do any filtering first before reversing.
'#2' - This one is interesting because you don't have to use an index but you can still use the timestamp on the client to sort the records. This gives you the benefit of staying in mini-mongo-land.
'#3' - Costs space on the db but it's the most straightforward
If you don't need the capabilities of mini-mongo (or are comfortable doing array filtering yourself) then #1 is probably best.
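For what it's worth, here is a minimal client-side sketch of option #1. It assumes the 'logmessages' publication above and that documents arrive in insertion order, which is exactly the caveat being discussed; the 'text' field is an assumed name:
// Minimal sketch of option #1: filter in mini-mongo first, then fetch and reverse.
// After .fetch() this is a plain array; mini-mongo reactivity no longer applies.
Meteor.subscribe('logmessages');
var messages = LogMessages.find({}).fetch().reverse();
messages.forEach(function (msg) {
  console.log(msg.text); // newest message now prints last, like tail -f
});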
Unfortunately MongoDB doesn't have views, so you can't do your log-last-100 view idea (although that would be a nice feature).
Beyond the above, keep an eye on your subscription life-cycle so users don't continually pull down log updates in the background when not viewing the log. I could see that quickly becoming a performance killer.