Firestore: fetching all documents on a Node.js server. Scalability

Every night at 12 AM I fetch all of the users from my Firestore database with this code:
const usersRef = db.collection('users');
const snapshot = await usersRef.get();
snapshot.forEach(doc => {
  let docData = doc.data();
  // some code and evaluations
});
I just want to know whether this is a reliable way to read through all of the data each night without overloading the system. For instance, if I have 50k users and I want to update their info each night on the server, will this require a lot of memory server-side? Also, is there a better way to handle what I am attempting, with the general premise of updating the users' data each night?

Your code is loading all documents in the collection in one go. Even on a server, that will at some point run out of memory.
You'll want to instead read a limited number of documents, process them, then read and process the next batch, and so on until you're done. This is known as paginating through data with queries, and it ensures you can handle any number of documents instead of only the number that fits into memory.
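As a minimal sketch of that pattern with the Firebase Admin SDK: order the query by document ID and use startAfter with the last document of the previous page. The page size of 500 and the processAllUsers wrapper are illustrative choices, not anything from the original code:
const admin = require('firebase-admin');
admin.initializeApp();
const db = admin.firestore();

async function processAllUsers() {
  const pageSize = 500; // tune to your memory budget
  let lastDoc = null;

  while (true) {
    let query = db.collection('users')
      .orderBy(admin.firestore.FieldPath.documentId())
      .limit(pageSize);
    if (lastDoc) query = query.startAfter(lastDoc);

    const snapshot = await query.get();
    if (snapshot.empty) break;

    for (const doc of snapshot.docs) {
      const docData = doc.data();
      // some code and evaluations, as in the original loop
    }

    // remember the last document so the next page starts after it
    lastDoc = snapshot.docs[snapshot.docs.length - 1];
  }
}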

Related

Change stream in Node.js for Elasticsearch

The aim is to synchronize fields from certain collections to Elasticsearch. Every change in MongoDB should also be applied to Elasticsearch. I've looked at the different packages, for example River, but unfortunately it didn't work out for me, so I'm trying without it. Is a change stream the right approach for this?
How could this be solved more elegantly? The data must be synchronized to Elasticsearch on every change (insert, update, delete), for several collections, but differently for each one (only certain fields per collection). Unfortunately, I don't have the experience to solve this in a way that doesn't take much effort when a collection or fields are added or removed.
const res = await client.connect();
const changeStream = res.watch();
changeStream.on('change', (data) => {
  // check the change (is the change in the right database / collection?)
  // parse
  // push it to the Elasticsearch server
});
I hope you can help me, thanks in advance :)
Yes, it will work, but you have to handle the following scenarios:
1. Your Node.js process goes down while MongoDB updates are ongoing. You can use the resume token and keep track of it, so that once your process comes back up it can resume from where it left off.
2. Inserting a single document on each change. This will be overwhelming for Elasticsearch and might result in slow inserts, which will eventually cause sync lag between MongoDB and Elasticsearch. It's better to collect multiple documents from the change stream and insert them with a bulk API operation (see the sketch below).
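A rough sketch of one way that could look, assuming the official mongodb driver and a 7.x @elastic/elasticsearch client (newer clients take operations instead of body in bulk calls); the database, collection, index, and field names are placeholders:
const { MongoClient } = require('mongodb');
const { Client } = require('@elastic/elasticsearch');

const mongo = new MongoClient('mongodb://localhost:27017');
const es = new Client({ node: 'http://localhost:9200' });

const BATCH_SIZE = 500;         // flush to Elasticsearch once this many changes are buffered
const FLUSH_INTERVAL_MS = 5000; // ...or after this much time, whichever comes first

async function run() {
  await mongo.connect();
  const coll = mongo.db('mydb').collection('mycollection');
  const syncState = mongo.db('mydb').collection('sync_state');

  // load the last saved resume token, if any
  const state = await syncState.findOne({ _id: 'mycollection' });
  const options = { fullDocument: 'updateLookup' }; // so updates carry the full document
  if (state) options.resumeAfter = state.token;
  const changeStream = coll.watch([], options);

  let buffer = [];

  async function flush() {
    if (buffer.length === 0) return;
    const batch = buffer;
    buffer = [];

    // build bulk operations: index on insert/update, delete on delete;
    // only the fields you want to sync are copied across
    const body = batch.flatMap(change => {
      const id = change.documentKey._id.toString();
      if (change.operationType === 'delete') {
        return [{ delete: { _index: 'mycollection', _id: id } }];
      }
      const { fieldA, fieldB } = change.fullDocument || {}; // placeholder field names
      return [{ index: { _index: 'mycollection', _id: id } }, { fieldA, fieldB }];
    });
    await es.bulk({ body });

    // persist the resume token of the last processed change
    const lastToken = batch[batch.length - 1]._id;
    await syncState.updateOne(
      { _id: 'mycollection' },
      { $set: { token: lastToken } },
      { upsert: true }
    );
  }

  changeStream.on('change', change => {
    buffer.push(change);
    if (buffer.length >= BATCH_SIZE) flush().catch(console.error);
  });
  setInterval(() => flush().catch(console.error), FLUSH_INTERVAL_MS);
}

run().catch(console.error);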

Mongoose Cursors with Many Documents and Load

We've been using Mongoose in Node.js/Express for some time, and one of the things we are not clear about is what happens when you have a query using find and a large result set of documents. For example, let's say you wanted to iterate through all your users to do some low-priority background processing.
let cursor = User.find({}).cursor();
cursor.on('data', function(user) {
  // do some processing here
});
My understanding is that cursor.on('data') doesn't block. Therefore, if you have let's say 100,000 users, you would overwhelm the system trying to process 100,000 people nearly simultaneously. There does not seem to be a "next" or other method to regulate our ability to consume the documents.
How do you process large document result sets?
Mongoose actually does have a .next() method for cursors! Check out the Mongoose documentation. Here is a snapshot of the Example section as of this answer:
// There are 2 ways to use a cursor. First, as a stream:
Thing.
  find({ name: /^hello/ }).
  cursor().
  on('data', function(doc) { console.log(doc); }).
  on('end', function() { console.log('Done!'); });

// Or you can use `.next()` to manually get the next doc in the stream.
// `.next()` returns a promise, so you can use promises or callbacks.
var cursor = Thing.find({ name: /^hello/ }).cursor();
cursor.next(function(error, doc) {
  console.log(doc);
});

// Because `.next()` returns a promise, you can use co
// to easily iterate through all documents without loading them
// all into memory.
co(function*() {
  const cursor = Thing.find({ name: /^hello/ }).cursor();
  for (let doc = yield cursor.next(); doc != null; doc = yield cursor.next()) {
    console.log(doc);
  }
});
With the above in mind, it's possible that your data set could grow to be quite large and difficult to work with. It may be a good idea for you to consider using MongoDB's aggregation pipeline for simplifying the processing of large data sets. If you use a replica set, you can even set a readPreference to direct your large aggregation queries to secondary nodes, ensuring that the primary node's performance remains largely unaffected. This would shift the burden from your server to less-critical secondary database nodes.
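As a hedged illustration of that aggregation idea (the lastLogin and country fields and the 30-day window are made up for the example, not part of the original question):
// Let MongoDB pre-aggregate on a secondary so the primary stays responsive.
const results = await User.aggregate([
  { $match: { lastLogin: { $gte: new Date(Date.now() - 30 * 24 * 60 * 60 * 1000) } } },
  { $group: { _id: '$country', activeUsers: { $sum: 1 } } },
])
  .read('secondaryPreferred') // route this pipeline to a secondary node
  .exec();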
If your data set is particularly large and you perform the same calculations on the same documents repeatedly, you could even consider storing precomputed aggregation results in a "base" document and then apply all unprocessed documents on top of that "base" as a "delta"--that is, you can reduce your computations down to "every change since the last saved computation".
Finally, there's also the option of load balancing. You could have multiple application servers for processing and have a load balancer distributing requests roughly evenly between them to prevent any one server from becoming overwhelmed.
There are quite a few options available to you for avoiding a scenario where your system becomes overwhelmed by all of the data processing. The strategies you should employ will depend largely on your particular use case. In this case, however, it seems as though this is a hypothetical question, so the additional strategies noted probably will not be things you need to concern yourself with. For now, stick with the .next() calls and you should be fine.
I just found a "modern" way of doing this using for await:
for await (const doc of User.find().cursor()) {
  console.log(doc.name);
}
I am using this for 4M+ docs in one single collection, and it has worked fine for me.
Here is the Mongoose documentation if you want to refer to it.
With async/await it has become easy. We can now have:
const cursor = model.find({}).cursor();
for await (const doc of cursor) {
  // carry out any operation
  console.log(doc);
}
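If the per-document work calls out to something that can be overwhelmed, a hedged variation is to batch inside the same loop so you hold only one small batch in memory and cap concurrency; BATCH_SIZE and doSomething are placeholders here, not part of the original answers:
const BATCH_SIZE = 100; // tune to what the downstream system can absorb

let batch = [];
for await (const doc of model.find({}).cursor()) {
  batch.push(doc);
  if (batch.length >= BATCH_SIZE) {
    await Promise.all(batch.map(d => doSomething(d))); // doSomething is hypothetical
    batch = [];
  }
}
if (batch.length > 0) {
  await Promise.all(batch.map(d => doSomething(d))); // flush the final partial batch
}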

Does waterline.js cache the collection, and if it does, is it blowing up the server by taking up a lot of RAM?

We've recently crossed 130K documents in one of our collections. Since then we've been facing a higher memory consumption issue with Node.js. We're using the Sails waterline.js ORM for querying MongoDB. Any call made to the db through the Waterline API, for example Model.create, triggers the increase, and the Node process keeps consuming RAM until ~1.8 GB, then it blows up and restarts. I have been trying to debug this issue for the past week and could not find any solution. Please help.
When I deleted all the collection data, the server did not show any memory consumption. But bringing back the 130K docs creates the issue again.
For example, I have a user registration endpoint /user. It calls the following models in a row:
let user = await User.create(data);
let model2 = await Model2.create(userdata);
let model3 = await Model3.create(model2Data);
let model4 = await Model4.create(data2);
Note that none of these models holds much data; the model with the 130K documents is a different one.
I took heap dumps of the before and after states of the Node VM. Examining them in Chrome dev tools, I found a lot of db data loaded into memory (it belongs to a different model/collection called estimates), but our /user endpoint never calls or interacts with those models. So I suppose it's Waterline or something else.
It turned out there was a Waterline association between the User collection and other collections, including the one with so much data. Removing the redundant associations between User and the other collections fixed the memory leak.
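For anyone hitting the same thing, a hedged sketch of what such an association looks like in a Sails/Waterline model definition; the estimates attribute and the owner key are illustrative, not the poster's actual schema:
// api/models/User.js
module.exports = {
  attributes: {
    name: { type: 'string' },
    email: { type: 'string' },

    // An association like this links User to the large `estimate` collection.
    // The answer above describes removing such redundant associations
    // (or only populating them explicitly when needed) to stop the memory growth.
    estimates: {
      collection: 'estimate',
      via: 'owner', // `owner` is a hypothetical attribute on the Estimate model
    },
  },
};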

node.js: processing a big collection of data

I'm working with Mongoose in Node.
I'm making requests to retrieve a collection of items from a remote database. In order to get a full report, I need to parse the whole collection, which is a large set.
I want to avoid getting anywhere close to things like:
model.find({}, function(err, data) {
  // process the bunch of data
});
For now, I use a recursive approach in which I feed a local variable, and later I send back information about the process as the response:
app.get('/process/it/', (req, res) => {
  const pageSize = 100;
  const processedData = [];

  function resolve(procdata) {
    res.json({ status: "ok", items: procdata.length });
  }

  function handler(err, data, procdata, start, n) {
    // do something with this page of data: push into processed_data
    // (placeholder for the real processing)
    data.forEach(item => procdata.push(item));

    if (data.length < n) {
      // short or empty page: we've reached the end
      resolve(procdata);
    } else {
      // fetch and process the next page
      mongoose.model('model')
        .find({})
        .skip(start + n)
        .limit(n)
        .exec((err2, next) => handler(err2, next, procdata, start + n, n));
    }
  }

  // first call
  mongoose.model('model')
    .find({})
    .skip(0)
    .limit(pageSize)
    .exec((err, data) => handler(err, data, processedData, 0, pageSize));
});
Is there any approach or solution that provides a performance advantage, or simply a better way to achieve this?
Any help would be appreciated.
The solution depends on the use case.
If the data, once processed, doesn't change often, you can maybe have a secondary database which holds the processed data.
You can load unprocessed data from the primary database using pagination the way you're doing right now, and all processed data can be loaded from the secondary database in a single query.
It is fine as long as your data set is not too big, though performance could be low. When it gets to the gigabyte level, your application will simply break because the machine won't have enough memory to hold your data before sending it to the client, and sending gigabytes of report data will take a lot of time too. Here are some suggestions:
- Try aggregating your data with MongoDB's aggregation framework instead of doing it in your application code (see the sketch after this list)
- Try to break the report data into smaller reports
- Pre-generate the report data, store it somewhere (another collection, perhaps), and simply send it to the client when they need to see it
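As a hedged illustration of the first suggestion only, assuming hypothetical category and amount fields on the model from the question, the report summary could be computed by MongoDB instead of in Node:
app.get('/process/it/', async (req, res) => {
  // Let MongoDB do the grouping and summing instead of pulling every
  // document into the Node process. `category` and `amount` are
  // made-up field names for illustration.
  const report = await mongoose.model('model').aggregate([
    { $group: { _id: '$category', total: { $sum: '$amount' }, count: { $sum: 1 } } },
    { $sort: { total: -1 } },
  ]);

  res.json({ status: 'ok', items: report.length, report });
});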

Efficiency of MongoDB's db.collection.distinct() for every user vs saving as a db entry and retrieving results

In my Node.js app I query MongoDB for the distinct values of a particular database field. This returns an array of roughly 3000 values.
Every user must get this data for every session (as it's integral to running the app).
I'm wondering whether it's more efficient (and faster) to do this for every single user:
db.collection.distinct("value", function(err, data) {
  // save the data in a variable
});
Or whether I should do a server-side loading of the distinct values (say, once a day), then save it as a db entry for every user to retrieve, like this:
// Server-side:
db.collection.distinct("value", function(err, data) {
  // save the data to MongoDB as a document
});

// Client-side:
db.serverInfo.find({ name: "uniqueEntries" }, function(err, data) {
  // Save to browser as a variable
});
I've tested this myself and can't notice much of a difference, but I'm the only one using the app at the moment. Once I get 10/100/1000/10,000 users I'm wondering which will be best to use here.
If you have an index on this field, MongoDB should be able to return the result of the distinct() operation using only the index, which should make it fast enough.
But, as with all performance questions, profiling is the best way to be sure; in the case of MongoDB, use the explain option to see what's happening under the covers.
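A minimal sketch of checking that in the mongo shell, assuming the field is literally called value as in the snippets above:
// Create an index on the field so distinct() can be served from the index.
db.collection.createIndex({ value: 1 });

// Ask the planner how the distinct is executed; look for a DISTINCT_SCAN
// (index-only) stage rather than a COLLSCAN in the output.
db.collection.explain("executionStats").distinct("value");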
