We've been using Mongoose in Node.js/Express for some time, and one of the things we're not clear about is what happens when you have a query using find and a large result set of documents. For example, let's say you wanted to iterate through all your users to do some low-priority background processing.
let cursor = User.find({}).cursor();
cursor.on('data', function(user) {
    // do some processing here
});
My understanding is that cursor.on('data') doesn't block. Therefore, if you have, say, 100,000 users, you would overwhelm the system trying to process 100,000 documents nearly simultaneously. There does not seem to be a "next" or other method to regulate our ability to consume the documents.
How do you process large document result sets?
Mongoose actually does have a .next() method for cursors! Check out the Mongoose documentation. Here is a snapshot of the Example section as of this answer:
// There are 2 ways to use a cursor. First, as a stream:
Thing.
  find({ name: /^hello/ }).
  cursor().
  on('data', function(doc) { console.log(doc); }).
  on('end', function() { console.log('Done!'); });

// Or you can use `.next()` to manually get the next doc in the stream.
// `.next()` returns a promise, so you can use promises or callbacks.
var cursor = Thing.find({ name: /^hello/ }).cursor();
cursor.next(function(error, doc) {
  console.log(doc);
});

// Because `.next()` returns a promise, you can use co
// to easily iterate through all documents without loading them
// all into memory.
co(function*() {
  const cursor = Thing.find({ name: /^hello/ }).cursor();
  for (let doc = yield cursor.next(); doc != null; doc = yield cursor.next()) {
    console.log(doc);
  }
});
With the above in mind, it's possible that your data set could grow to be quite large and difficult to work with. It may be a good idea for you to consider using MongoDB's aggregation pipeline for simplifying the processing of large data sets. If you use a replica set, you can even set a readPreference to direct your large aggregation queries to secondary nodes, ensuring that the primary node's performance remains largely unaffected. This would shift the burden from your server to less-critical secondary database nodes.
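For example, here is a minimal sketch of routing an aggregation to a secondary with Mongoose (the pipeline and field names are just placeholders, not taken from the question):

// Hypothetical example: count users per status without touching the primary.
// Assumes a replica set and a `status` field on the user documents.
const counts = await User.aggregate([
  { $match: { active: true } },
  { $group: { _id: '$status', count: { $sum: 1 } } }
]).read('secondaryPreferred'); // send this query to a secondary node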
If your data set is particularly large and you perform the same calculations on the same documents repeatedly, you could even consider storing precomputed aggregation results in a "base" document and then applying all unprocessed documents on top of that "base" as a "delta"; that is, you reduce your computations to "every change since the last saved computation". A sketch of this idea follows.
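This is a rough sketch of that base-plus-delta idea, assuming your documents carry an updatedAt timestamp and using a hypothetical ReportModel collection to hold the saved results:

// All model, field, and helper names below are assumptions for illustration.
let base = await ReportModel.findOne({ name: 'userStats' });
if (!base) {
  base = new ReportModel({ name: 'userStats', totals: {}, computedAt: new Date(0) });
}

// Only fold in documents that changed since the last saved computation.
const changed = User.find({ updatedAt: { $gt: base.computedAt } }).cursor();
for await (const user of changed) {
  base.totals = applyUser(base.totals, user); // applyUser() is a hypothetical reducer
}

base.computedAt = new Date();
await base.save();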
Finally, there's also the option of load balancing. You could have multiple application servers for processing and have a load balancer distributing requests roughly evenly between them to prevent any one server from becoming overwhelmed.
There are quite a few options available to you for avoiding a scenario where your system becomes overwhelmed by all of the data processing. The strategies you should employ will depend largely on your particular use case. In this case, however, it seems as though this is a hypothetical question, so the additional strategies noted probably will not be things you need to concern yourself with. For now, stick with the .next() calls and you should be fine.
I just found a "modern" way of doing this using for await:
for await (const doc of User.find().cursor()) {
  console.log(doc.name);
}
I am using this on 4M+ docs in a single collection, and it worked fine for me.
Here is the mongoose documentation if you want to refer to it.
With async/await this has become easy; we can now write:
const cursor = model.find({}).cursor();

for await (const doc of cursor) {
  // carry out any operation on the document
  console.log(doc);
}
Related
Every night at 12pm I fetch all of the users from my Firestore database with this code:
const usersRef = db.collection('users');
const snapshot = await usersRef.get();

snapshot.forEach(doc => {
  let docData = doc.data();
  // some code and evaluations
});
I just want to know if this is a reliable way to read through all of the data each night without overloading the system. For instance, if I have 50k users and I want to update their info each night on the server, will this require a lot of memory server-side? Also, is there a better way to handle what I am attempting to do, given the general premise of updating the users' data each night?
Your code is loading all documents in a collection in one go. Even on a server, that will at some point run out of memory.
You'll want to instead read a limited number of documents, process those documents, then read and process the next batch of documents, until you're done. This is known as paginating through data with queries, and it ensures you can handle any number of documents, instead of only the number that fits into memory.
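Here is a minimal sketch of that batching pattern with the Firestore Node SDK, assuming a batch size of 500 (the per-document processing is left as a placeholder):

const usersRef = db.collection('users');
let lastDoc = null;

while (true) {
  // Order by document ID so pagination is stable, and fetch one batch at a time.
  let query = usersRef.orderBy('__name__').limit(500);
  if (lastDoc) query = query.startAfter(lastDoc);

  const snapshot = await query.get();
  if (snapshot.empty) break;

  snapshot.forEach(doc => {
    let docData = doc.data();
    // some code and evaluations
  });

  lastDoc = snapshot.docs[snapshot.docs.length - 1];
}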
The aim is to synchronize fields from certain collections to Elasticsearch: every change in MongoDB should also be reflected in Elasticsearch. I've looked at the various packages, for example River, but unfortunately it didn't work out for me, so I'm trying without it. Is using change streams the right approach?
How could you solve this more elegantly? The data must be synchronized on every change (insert, update, delete) to Elasticsearch, for several collections, but differently for each one (only certain fields per collection). Unfortunately, I don't have the experience to solve this in a way that doesn't take much effort when a collection or its fields are added or removed.
const res = await client.connect();
const changeStream = res.watch();

changeStream.on('change', (data) => {
  // check the change (is the change in the right database / collection?)
  // parse it
  // push it to the Elasticsearch server
});
I hope you can help me, thanks in advance :)
Yes, it will work, but you have to handle the following scenarios:

Your Node.js process goes down while MongoDB updates are ongoing. You can use the resume token and keep track of it, so that once your process comes back up it can resume from where it left off.

Inserting a single document on every change. This will be overwhelming for Elasticsearch and might result in slow inserts, which will eventually cause sync lag between Mongo and Elastic. It is better to collect multiple documents from the change stream and insert them with a bulk API operation.
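A rough sketch of both ideas combined, assuming the official MongoDB and @elastic/elasticsearch Node clients, plus hypothetical loadResumeToken()/saveResumeToken() helpers that persist the token somewhere durable (for example in a small Mongo collection):

// Sketch only: loadResumeToken()/saveResumeToken() are hypothetical helpers.
async function startSync(mongoClient, esClient) {
  const collection = mongoClient.db('mydb').collection('mycollection');
  const resumeAfter = await loadResumeToken();

  const changeStream = collection.watch([], {
    fullDocument: 'updateLookup',           // include the full doc on updates
    ...(resumeAfter ? { resumeAfter } : {}) // resume where we left off, if possible
  });

  let buffer = [];
  const BATCH_SIZE = 500; // flush threshold, tune for your workload

  changeStream.on('change', async (change) => {
    buffer.push(change);
    if (buffer.length < BATCH_SIZE) return;

    const events = buffer;
    buffer = [];

    // One bulk request instead of one index call per change.
    const body = events.flatMap((evt) => [
      { index: { _index: 'mycollection', _id: String(evt.documentKey._id) } },
      evt.fullDocument // for deletes you would emit a { delete: ... } action instead
    ]);
    await esClient.bulk({ body }); // v7-style option; in client v8 it is `operations`

    // Persist the last processed token so a restart can resume from here.
    await saveResumeToken(events[events.length - 1]._id);
  });
}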
To perform a join-like operation, we can use either GraphQL with DataLoader or Mongoose's populate to achieve the same end.
Before asking my question, I would like to give the following example of Tasks/Activities (none of this code is tested; it is given just for the example's sake):
Task {
  _id,
  title,
  description,
  activities: [{ // of Activity type
    _id,
    title
  }]
}
In mongoose, we can retrieve the activities related to a task with the populate method, with something like this:
const task = await TaskModel.findById(taskId).populate('activities');
Using GraphQL and Dataloader, we can have the same result with something like:
const DataLoader = require('dataloader');

// DataLoader batch functions receive an array of keys and must return an array
// of results in the same order, so we group the activities per task id.
const getActivitiesByTask = async (taskIds) => {
  const activities = await ActivityModel.find({ task: { $in: taskIds } });
  return taskIds.map((id) =>
    activities.filter((activity) => String(activity.task) === String(id))
  );
};

const dataloaders = () => ({
  activitiesByTask: new DataLoader(getActivitiesByTask),
});

// ...
// SET the dataloaders in the context
// ...
//------------------------------------------
// In another file

const resolvers = {
  Query: {
    Task: (_, { id }) => TaskModel.findById(id),
  },
  Task: {
    activities: (task, _, context) => context.dataloaders.activitiesByTask.load(task._id),
  },
};
I tried to see if there is any article that demonstrates which way is better regarding performance, resource exhaustion, etc., but I failed to find any comparison of the two methods.
Any insight would be helpful, thanks!
It's important to note that dataloaders are not just an interface for your data models. While dataloaders are touted as a "simplified and consistent API over various remote data sources", their main benefit when coupled with GraphQL comes from being able to implement caching and batching within the context of a single request. This sort of functionality is important in APIs that deal with potentially redundant data (think about querying users and each user's friends; there's a huge chance of refetching the same user multiple times).
On the other hand, mongoose's populate method is really just a way of aggregating multiple MongoDB requests. In that sense, comparing the two is like comparing apples and oranges.
A more fair comparison might be using populate as illustrated in your question as opposed to adding a resolver for activities along the lines of:
activities: (task, _, context) => Activity.find().where('_id').in(task.activities)
Either way, the question comes down to whether you load all the data in the parent resolver or let the resolvers further down do some of the work. Because resolvers are only called for fields that are included in the request, there is a potentially major impact on performance between these two approaches.
If the activities field is requested, both approaches will make the same number of roundtrips between the server and the database, so the difference in performance will probably be marginal. However, your request might not include the activities field at all. In that case, the activities resolver will never be called, and we can save one or more database requests by creating a separate activities resolver and doing the work there.
On a related note...
From what I understand, aggregating queries in MongoDB using something like $lookup is generally less performant than just using populate (some discussion on that point can be found here). In the context of relational databases, however, there are additional considerations to weigh between the above approaches, because your initial fetch in the parent resolver could be done using joins, which will generally be much faster than making separate db requests. That means that, at the expense of making the no-activities-field queries slower, you can make the other queries significantly faster.
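For reference, a $lookup-based alternative to populate('activities') might look roughly like this (collection and field names are assumptions based on the schema above):

// Hypothetical $lookup equivalent of populate('activities') for a single task.
// Note: aggregate() does not cast taskId to an ObjectId automatically.
const [task] = await TaskModel.aggregate([
  { $match: { _id: new mongoose.Types.ObjectId(taskId) } },
  {
    $lookup: {
      from: 'activities',       // the Activity collection name is assumed
      localField: 'activities', // the array of Activity ids on the task
      foreignField: '_id',
      as: 'activities'
    }
  }
]);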
I have a MongoDB database where multiple Node processes read and write documents. I would like to know how I can make it so that only one process can work on a document at a time (some sort of locking) that is freed after the process has finished updating that entry.
My application should do the following:
Walk through each entry one by one with a cursor.
(Lock the entry so no other processes can work with it)
Fetch information from a third-party site.
Calculate new information and update the entry.
(Unlock the document)
Also after unlocking the document there will be no need for other processes to update it for a few hours.
Later on I would like to set up multiple MongoDB clusters to reduce the load on the databases, so the solution should apply to both single and multiple database servers, or at least to setups using multiple Mongo servers.
An elegant solution that doesn't involve locks is:
Add a version property to your document.
When updating the document, increment the version property.
When updating the document, include the last read version in the find query. If your document has been updated elsewhere, the find query will yield no results and your update will fail.
If your update fails, you can retry the operation.
I have used this pattern with great success in the past.
Example
Imagine you have a document {_id: 123, version: 1}.
Imagine now you have 3 Mongo clients concurrently doing db.collection.findAndModify({ query: { _id: 123, version: 1 }, update: { $inc: { version: 1 } } });.
The first update will apply; the rest will fail. Why? Because version is now 2, and the query included version: 1.
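Here is a sketch of that same optimistic-locking retry loop from Node with Mongoose (the model name and the fetchFromThirdParty() helper are assumptions):

// Read the document, do the external work, then update it only if the version
// we read is still current; otherwise retry from the top.
async function processEntry(id) {
  for (let attempt = 0; attempt < 5; attempt++) {
    const entry = await EntryModel.findById(id);
    const newInfo = await fetchFromThirdParty(entry); // hypothetical helper

    const result = await EntryModel.updateOne(
      { _id: id, version: entry.version },              // only match the version we read
      { $set: { info: newInfo }, $inc: { version: 1 } } // bump the version atomically
    );

    // Older driver versions report this as result.nModified.
    if (result.modifiedCount === 1) return; // our update won; we're done

    // Someone else updated the document first; loop and retry.
  }
  throw new Error('Could not update entry after 5 attempts');
}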
Per MongoDB documentation:
isolated: "Prevents a write operation that affects multiple documents from yielding to other reads or writes once the first document is written... The $isolated operator causes write operations to acquire an exclusive lock on the collection... [and] will make WiredTiger single-threaded for the duration of the operation."

So if you are updating multiple documents, you could first get the data from the third-party API, parse the info into an array, for example, and then use something like this in the Mongo shell:
db.foo.update(
{ status : "A" , $isolated : 1 },
{ $set: { < your key >: < your info >}}, //use the info in your array
{ multi: true }
)
Or if you have to update the documents one by one, you could use findAndModify() or updateOne() from the Node.js driver for MongoDB. Please note that, per the MongoDB documentation, 'When modifying a single document, both findAndModify() and the update() method atomically update the document.'
An example of updating one by one: first connect to mongod with the Node.js driver, then connect to the third-party API using Node's request module, for example, get and parse the data, and then use the data to modify your documents, something like below:
var request = require('request');
var MongoClient = require('mongodb').MongoClient,
    test = require('assert');

MongoClient.connect('mongodb://localhost:27017/test', function(err, db) {
    var collection = db.collection('simple_query');
    collection.find().forEach(
        function(doc) {
            request('http://www.google.com', function(error, response, body) {
                console.log('body:', body); // parse body for your info
                collection.findAndModify({
                    <query based on your doc>
                }, {
                    $set: { < your key >: < your info > }
                })
            });
        },
        function(err) {
        }
    );
});
I encountered this question today and feel like it's been left open. First, findAndModify really seems like the way to go about it, but I found vulnerabilities in both of the suggested answers:

In Treefish Zhang's answer: if you run multiple processes in parallel, they will query the same documents, because at the beginning you use "find" and not "findAndModify"; you use "findAndModify" only after the processing is done, so during processing the document is still not updated and other processes can query it as well.

In arboreal84's answer: what happens if the process crashes in the middle of handling the entry? If you update the version while querying and the process then crashes, you have no clue whether the operation succeeded or not.
Therefore, I think the most reliable approach would be to have multiple fields:
version
locked:[true/false],
lockedAt:[timestamp] (optional - in case the process crashed and was not able to unlock, you may want to retry after x amount of time)
attempts:0 (optional - increment this if you want to know how many process attempts were done, good to count retries)
Then, for your code (a sketch follows these steps):
findAndModify: where version=oldVersion and locked=false, set locked=true, lockedAt=now
process the entry
if the process succeeded, set locked=false, version=newVersion
if the process failed, set locked=false
optional: for retrying after a TTL you can also query by "or locked=true and lockedAt < now - ttl"
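Here is a sketch of that claim/process/release cycle with Mongoose (the model name, the fetchFromThirdParty() helper, and the TTL value are assumptions; the optional TTL retry clause is folded into the query):

const LOCK_TTL_MS = 60 * 60 * 1000; // retry entries whose lock is older than 1 hour

// Atomically claim one unlocked (or stale-locked) entry.
const entry = await EntryModel.findOneAndUpdate(
  {
    $or: [
      { locked: false },
      { locked: true, lockedAt: { $lt: new Date(Date.now() - LOCK_TTL_MS) } }
    ]
  },
  { $set: { locked: true, lockedAt: new Date() }, $inc: { attempts: 1 } },
  { new: true }
);

if (entry) {
  try {
    const info = await fetchFromThirdParty(entry); // hypothetical helper
    entry.set({ ...info, locked: false, version: entry.version + 1 });
    await entry.save();
  } catch (err) {
    // Processing failed: release the lock so another worker can retry later.
    await EntryModel.updateOne({ _id: entry._id }, { $set: { locked: false } });
  }
}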
And regarding:
i have a vps in new york and one in hong kong and i would like to
apply the lock on both database servers. So those two vps servers wont
perform the same task at any chance.
I think the answer to this depends on why you need 2 database servers and why they have the same entries:
If one of them is a secondary in cross-region replication for high availability, findAndModify will query the primary, since writing to a secondary replica is not allowed, so you don't need to worry about the 2 servers being in sync (it might have latency issues, though, but you'll have those anyway since you're communicating between 2 regions).
If you want it just for sharding and horizontal scaling, there is no need to worry about it, because each shard will hold different entries; therefore the entry lock is relevant to just one shard.
Hope it will help someone in the future
relevant questions:
MongoDB as a queue service?
Can I trust a MongoDB collection as a task queue?
I'm working with mongoose in node.
I'm making requests to retrieve a collection of items from a remote database. In order to get a full report, I need to parse the whole collection, which is a large set.
I want to avoid anything like:
model.find({}, function(err, data) {
    // process the whole bunch of data at once
});
For now, I use a recursive approach in which I accumulate results into a local variable, and afterwards I send back information about the process as the response.
app.get('/process/it/', (req, res) => {
    var processed_data = [];

    function resolve(procdata) {
        res.json({ status: "ok", items: procdata.length });
    }

    function handler(data, procdata, start, n) {
        if (data.length === 0) {
            resolve(procdata);
        } else {
            // do something with data: push into processed_data
            procdata.push(whatever);

            // fetch the next page
            mongoose.model('model').find({}, function(err, data) {
                handler(data, procdata, start + n, n);
            }).skip(start + n).limit(n);
        }
    }

    var start = 0;
    var mysize = 100;

    // first call
    mongoose.model('model').find({}, function(err, data) {
        handler(data, processed_data, start, mysize);
    }).skip(start).limit(mysize);
});
Is there any approach or solution that provides a performance advantage, or simply a better way to achieve this?
Any help would be appreciated.
The solution depends on the use case.
If the data, once processed, doesn't change often, you can maybe have a secondary database that holds the processed data.
You can load unprocessed data from the primary database using pagination, the way you're doing right now, and all processed data can be loaded from the secondary database in a single query.
It is fine as long as your data set is not too big, though performance could be low. When it gets to the gigabyte level, your application will simply break, because the machine won't have enough memory to hold your data before sending it to the client, and sending gigabytes of report data will take a lot of time too. Here are some suggestions:
Try aggregating your data with Mongo's aggregation framework instead of doing it in your application code
Try to break the report data into smaller reports
Pre-generate report data, store it somewhere (another collection perhaps), and simply send it to the client when they need to see it (a sketch of this follows below)
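As a rough illustration of the first and third suggestions combined, you could periodically pre-compute the report with the aggregation framework and write the output to a separate collection (the pipeline, field, and collection names are assumptions), then serve that collection directly:

// Run this periodically (e.g. from a cron job) to pre-compute the report.
await mongoose.model('model').aggregate([
    { $match: { processed: true } },                               // assumed filter
    { $group: { _id: '$category', total: { $sum: '$amount' } } },  // assumed fields
    { $merge: { into: 'report_summaries' } }                       // requires MongoDB 4.2+
]);

// The endpoint then just reads the small, pre-computed collection.
app.get('/process/it/', async (req, res) => {
    const report = await mongoose.connection.collection('report_summaries').find({}).toArray();
    res.json({ status: "ok", items: report.length, report });
});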