node.js: process a big collection of data

I'm working with Mongoose in Node.
I'm making requests to retrieve a collection of items from a remote database. In order to build a full report, I need to walk through the whole collection, which is a large data set.
I want to avoid anything like this:
model.find({}, function(err, data) {
    // process the whole bunch of data at once
})
For now, I use a recursive approach in which I accumulate results in a local variable and, once everything is processed, send back a summary as the response.
app.get('/process/it/', (req, res) => {
    var processed_data = [];

    function resolve(procdata) {
        res.json({ status: "ok", items: procdata.length });
    }

    // Process one page, then either finish or fetch the next page.
    function handler(data, procdata, start, n) {
        // do something with data: push into procdata
        data.forEach(function(item) {
            procdata.push(item); // or whatever you derive from each item
        });

        if (data.length < n) {
            // short (or empty) page: the collection is exhausted
            resolve(procdata);
        } else {
            mongoose.model('model').find({})
                .skip(start + n)
                .limit(n)
                .exec(function(err, nextPage) {
                    handler(nextPage, procdata, start + n, n);
                });
        }
    }

    var start = 0;
    var mysize = 100;

    // first call
    mongoose.model('model').find({})
        .skip(start)
        .limit(mysize)
        .exec(function(err, data) {
            handler(data, processed_data, start, mysize);
        });
})
Is there any approach or solution that offers a performance advantage, or simply a better way to achieve this?
Any help would be appreciated.

The solution depends on the use case.
If the data doesn't change often once it has been processed, you could keep a secondary database which holds the processed data.
You can load unprocessed data from the primary database using pagination, the way you're doing right now, and load all of the processed data from the secondary database in a single query, as sketched below.
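For illustration, a minimal sketch of that split, assuming the app and mongoose objects from the question and two hypothetical models, RawItem and ProcessedReport (a second collection stands in for the secondary database here; none of these names come from the question):
// Hypothetical models; the names are illustrative, not from the question.
const RawItem = mongoose.model('RawItem');
const ProcessedReport = mongoose.model('ProcessedReport');

// Background job: page through the raw items once and store the summary.
async function rebuildReport(pageSize) {
    const summary = [];
    for (let start = 0; ; start += pageSize) {
        const page = await RawItem.find({}).skip(start).limit(pageSize).exec();
        page.forEach(item => summary.push(item._id)); // or whatever you derive per item
        if (page.length < pageSize) break;
    }
    await ProcessedReport.updateOne(
        { name: 'full-report' },
        { $set: { items: summary, generatedAt: new Date() } },
        { upsert: true }
    );
}

// Request handler: a single query against the pre-computed data.
app.get('/process/it/', async (req, res) => {
    const report = await ProcessedReport.findOne({ name: 'full-report' }).exec();
    res.json({ status: 'ok', items: report ? report.items.length : 0 });
});
The rebuild job can run on a schedule or whenever the raw data changes, so the request handler never has to walk the whole collection.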

This is fine as long as your data set is not too big, though performance may still be low. When it reaches the gigabyte level, your application will simply break, because the machine won't have enough memory to hold the data before sending it to the client, and sending gigabytes of report data will take a lot of time too. Here are some suggestions:
Try aggregating your data with the MongoDB aggregation framework instead of doing it in your application code (see the sketch after this list)
Try to break the report into smaller reports
Pre-generate the report data, store it somewhere (another collection, perhaps), and simply send it to the client when they need to see it
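As a rough illustration of the first suggestion, a minimal aggregation sketch; the grouping field 'category', the numeric field 'amount' and the model name are assumptions for illustration only:
// Let MongoDB group and summarize server-side instead of pulling every
// document into Node. 'category' and 'amount' are assumed fields.
mongoose.model('model').aggregate([
    { $group: {
        _id: '$category',
        count: { $sum: 1 },
        total: { $sum: '$amount' }
    } },
    { $sort: { count: -1 } }
]).exec(function(err, report) {
    if (err) return console.error(err);
    // 'report' is already the summarized data, small enough to send as JSON
    console.log(report);
});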

Related

change stream in NodeJs for elasticsearch

The aim is to synchronize fields from certain collections to Elasticsearch. Every change in MongoDB should also be applied to Elasticsearch. I've looked at the different packages, for example River, but unfortunately it didn't work out for me, so I'm trying without it. Are change streams the right approach for this?
How could you solve this more elegantly? The data must be synchronized to Elasticsearch on every change (insert, update, delete), for several collections, but differently for each one (only certain fields per collection). Unfortunately, I don't have the experience to build this in a way that doesn't require much effort when a collection or fields are added or removed.
const res = await client.connect();
const changeStream = res.watch();
changeStream.on('change', (data) => {
    // check the change (is it in the right database / collection?)
    // parse it
    // push it to the Elasticsearch server
});
I hope you can help me, thanks in advance :)
Yes, it will work, but you have to handle the following scenarios:
Your Node.js process goes down while MongoDB updates are ongoing. You can use the resume token and keep track of it, so that once your process comes back up it can resume from where it left off.
Inserting a single document on each change. This will be overwhelming for Elasticsearch and might result in slow inserts, which will eventually cause sync lag between Mongo and Elastic. It's better to collect multiple documents from the change stream and insert them with a bulk API operation.
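A minimal sketch of both points, assuming the official mongodb driver, a 7.x-style @elastic/elasticsearch client, and a hypothetical resume_tokens collection; the database, collection and host names are illustrative:
const { MongoClient } = require('mongodb');
const { Client: ElasticClient } = require('@elastic/elasticsearch'); // 7.x-style client assumed

const mongo = new MongoClient('mongodb://localhost:27017');   // assumed connection string
const elastic = new ElasticClient({ node: 'http://localhost:9200' });

async function run() {
    await mongo.connect();
    const db = mongo.db('mydb');                               // assumed database name
    const tokenStore = db.collection('resume_tokens');         // hypothetical token store

    // Resume from the last saved token, if there is one.
    const saved = await tokenStore.findOne({ _id: 'users-sync' });
    const options = { fullDocument: 'updateLookup' };
    if (saved) options.resumeAfter = saved.token;
    const changeStream = db.collection('users').watch([], options);

    let batch = [];
    changeStream.on('change', async (change) => {
        batch.push(change);
        if (batch.length < 100) return;                        // collect before flushing

        // One bulk request instead of one request per change.
        const body = batch.flatMap((c) => [
            { index: { _index: 'users', _id: String(c.documentKey._id) } },
            c.fullDocument || {}
        ]);
        await elastic.bulk({ body });

        // Persist the resume token only after a successful flush.
        await tokenStore.updateOne(
            { _id: 'users-sync' },
            { $set: { token: batch[batch.length - 1]._id } },
            { upsert: true }
        );
        batch = [];
    });
}

run().catch(console.error);
In a real sync you would also flush the batch on a timer, so small trickles of changes still reach Elasticsearch, and map delete events to bulk delete actions.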

Should I use redis like this: req.session.surname = 'toto' or like this: client.set('surname', 'toto')?

What's up! I am using Redis with Express and Node.js. When looking at how to insert or retrieve data from Redis, I saw two ways. One looks like this:
req.session.surname = 'toto'
console.log(req.session.surname)
and the other way looks like this:
client.set('surname', 'toto')
client.get('surname', (err, data) => {
    console.log(data)
})
Is there a difference between these two methods ?
Thanks for any help. Cheers !
There is no major difference between these two methods. In the first one the data goes through the session store, so you could use any other store such as MongoDB if you need more reliability (since Redis keeps the data in RAM only, there is a possibility of losing it). The second one just sets and gets the desired value by key, for general usage where 100% reliability is not needed. You may also face issues when processing requests concurrently, as an in-memory store like Redis gives you no concurrency control out of the box.
If you need 100% reliability (i.e. you don't want to lose data easily) you can go with MongoDB. In MongoDB the data is stored persistently, and you can control concurrency as well.
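For context, a minimal sketch of the two forms side by side, assuming express-session, connect-redis and a node-redis v4 client (the exact packages and wiring are assumptions, not from the question):
const express = require('express');
const session = require('express-session');
const RedisStore = require('connect-redis')(session);  // connect-redis v6-style API (assumed)
const { createClient } = require('redis');

const client = createClient();                          // node-redis v4 (assumed)
client.connect().catch(console.error);

const app = express();

// Form 1: req.session.* is per-visitor data, persisted by the session
// middleware into whatever store you give it (Redis here, could be another).
app.use(session({
    store: new RedisStore({ client }),
    secret: 'keyboard cat',
    resave: false,
    saveUninitialized: false
}));

app.get('/form1', (req, res) => {
    req.session.surname = 'toto';                       // scoped to this visitor's session
    res.send(req.session.surname);
});

// Form 2: a plain key in Redis, shared by every request and every user.
app.get('/form2', async (req, res) => {
    await client.set('surname', 'toto');
    res.send(await client.get('surname'));
});
One thing the sketch makes visible is the scoping: the session form keys the data by the visitor's session ID through the store, while the plain set/get is a single key shared by every request.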

Caching relational data using redis

I'm building a small social network (users have posts and posts have comments, very basic), using a clustered Node.js server and Redis as a distributed cache.
My approach to caching a user's posts is to keep a sorted set that contains all of the user's post ids ordered by rating (which should be updated every time someone adds a like or comment), and the actual objects stored as hashes.
So the flow to get a user's posts should look like this:
1. using zrange to get a range of ids from the sorted set.
2. using multi/exec and hgetall to fetch all the objects at once.
I have a couple of questions:
1. Regarding performance, will my approach scale as the cache grows bigger, or should I maybe use Lua scripts or something similar?
2. If I continue with the current approach, where should I save the sorted set in case Redis crashes? If I use Redis persistence it will affect overall performance, so I thought about using a dedicated Redis server for the sets (I searched whether it is possible to back up only part of the Redis data but didn't find anything about it).
My approach => getTopObjects({userID}, 0, 20) :
self.zrange = function(setID, start, stop, multi)
{
    return execute(this, "zrange", [setID, start, stop], multi);
};
self.zrevrange = function(setID, start, stop, multi)
{
    // highest scores first, used to read the top-rated post ids
    return execute(this, "zrevrange", [setID, start, stop], multi);
};
self.getObject = function(key, multi)
{
    return execute(this, "hgetall", key, multi);
};
self.getObjects = function(keys)
{
    let multi = this.client.multi();
    let promiseArray = [];
    for (var i = 0, len = keys.length; i < len; i++)
    {
        promiseArray.push(this.getObject(keys[i], multi));
    }
    return execute(this, "exec", [], multi).then(function(results)
    {
        //TODO: do something with the result.
        return Promise.all(promiseArray);
    });
};
self.getTopObjects = function(setID, start, stop)
{
    //TODO: validate the range
    let thisArg = this;
    return this.zrevrange(setID, start, stop).then(function(keys)
    {
        return thisArg.getObjects(keys);
    });
};
It's an interesting intellectual exercise, but in my opinion this is classic premature optimization.
1) It's probably way too early to have even introduced redis, let alone be thinking about whether redis is fast enough. Your social network is almost certainly just fine up to about 1,000 users running off raw SQL queries against Mysql / Postgres / Random RDS. If it starts to slow down, get data on slow running queries and fix them with query optimizations and appropriate indexes. That'll get you past 10,000 users.
2) Now you can start introducing redis. In general, I'd encourage you to think about your redis as purely caching and not permanent storage; it shouldn't matter if it gets blown away, it just means your site is slower for the next few seconds because your users are getting their page loads from SQL queries instead of redis hits (each query re-populating that user's sorted list of posts in redis, of course).
Your strategy and example code for using redis seem fine to me, but until you have actual data on how users use your site (which may be drastically different than your current expectations), it's simply impossible to know what types of SQL indexes you will need, what keys and lists are ideal for caching in redis, etc.
I faced similar issues and needed a way to query the data more efficiently. I can't say for sure, but I've heard that Redis, being single threaded, blocks the main thread when running Lua scripts, and I'm sure that's not good for a social networking site. I've heard about Tarantool and it looks promising; I'm currently trying to wrap my head around it.
If you are concerned about your cache growing too big, I think most social networks keep about two weeks' worth of data in each user's cache; anything older than two weeks gets deleted, and you implement a scrolling feature that works with pagination: once the user scrolls down, fetch the next two weeks' worth of data and add it back to memory only for that specific user (don't forget to set a new TTL on the newly added data). This helps keep your cache size lean. A rough sketch of the trimming step is shown below.
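As a sketch of that trimming step, assuming a node-redis v4 client and a per-user sorted set scored by post timestamp (which differs from the rating-scored set in the question):
// Illustrative only: a per-user sorted set keyed by post timestamp, plus a TTL.
const TWO_WEEKS_MS = 14 * 24 * 60 * 60 * 1000;

async function trimUserTimeline(client, userID) {
    const key = 'user:' + userID + ':timeline';         // hypothetical key name

    // Drop every entry whose score (a timestamp) is older than two weeks.
    await client.zRemRangeByScore(key, 0, Date.now() - TWO_WEEKS_MS);

    // Refresh the TTL so an inactive user's cache eventually disappears on its own.
    await client.expire(key, 14 * 24 * 60 * 60);
}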
When Redis, or whatever in-memory data tool you use, crashes, you simply reload the data back into memory; they all have features for saving data to files as a backup. I'm thinking of implementing another database layer, say Cassandra or MongoDB, that holds each user's timeline since inception. Sure, this creates more overhead, because you have to keep three data layers (e.g. MySQL, Redis and MongoDB) in sync!
If this looks like a lot of work, feel free to use a third-party service to host your in-memory data; at least you can sleep easy, but it's going to cost you.
That said, this is highly opinionated. I got tired of people telling me to wait until my site explodes with users, or of the so-called premature optimization reply you got :)

Cloud Functions Http Request return cached Firebase database

I'm new to Node.js and Cloud Functions for Firebase, so I'll try to be specific with my question.
I have a Firebase database with objects that include a "score" field. I want the data to be retrieved based on that field, which can be done easily on the client side.
The issue is that, if the database grows big, I'm worried it will either take too long to return or consume a lot of resources. That's why I was thinking of an HTTP service using Cloud Functions that stores a cache with the top N objects and updates itself, via a listener, whenever the score of any object changes.
Then, the client side just has to call something like https://myexampleprojectroute/givemethetoplevels to receive a JSON with the top N levels.
Is this reasonable? If so, how can I approach it? Which structures do I need for this cache, and how do I return them in JSON format via HTTP?
At the moment I'll keep doing it client side, but I'd really like to have this both for performance and learning purposes.
Thanks in advance.
EDIT:
In the end I did not implement the optimization. The reasons are, first, that the Firebase database does not expose a "child count", so with my newbie JavaScript knowledge I didn't find a way to implement it. Second, and most important, I'm pretty sure it won't scale up to millions of entries, having at most 10K, and Firebase has rules for optimizing sorted reads. For more information please check out this link.
Also, I'll post a simple code snippet for retrieving data from your database via an HTTP request using Cloud Functions, in case someone is looking for it. Hope this helps!
// Simple test function to retrieve a JSON object from the DB
// Warning: no security measures are used, such as authentication, request-method checks, etc.
exports.request_all_levels = functions.https.onRequest((req, res) => {
    const ref = admin.database().ref('CustomLevels');
    ref.once('value').then(function(snapshot) {
        res.status(200).send(JSON.stringify(snapshot.val()));
    }).catch(function(error) {
        res.status(500).send(error.toString());
    });
});
You're duplicating data upon writes, to gain better read performance. That's a completely reasonable approach. In fact, it is so common in NoSQL databases to keep such derived data structures that it even has a name: denormalization.
A few things to keep in mind:
While Cloud Functions run in a more predictable environment than the average client, the resources are still limited. So reading a huge list of items to determine the latest 10 items is still a suboptimal approach. For simple operations, you'll want to keep the derived data structure up to date for every write operation.
So if you have a "latest 10" and a new item comes in, you remove the oldest item and add the new one. With this approach you have at most 11 items to consider, compared to having your Cloud Function query the list of items for the latest 10 upon every write, which is a O(something-with-n) operation.
Same for an averaging operation: you'll find a moving average to be most performant, because it doesn't require any of the previous data.
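As a rough illustration of that incremental approach with Cloud Functions for the Realtime Database; the paths, the "latest 10" node and the assumption that keys are push ids are all illustrative, not from the question:
// Hypothetical trigger: keep a small '/latestLevels' node with the 10 most
// recent entries, updated incrementally on each new write instead of
// re-reading the whole list.
exports.maintain_latest_levels = functions.database
    .ref('/CustomLevels/{levelId}')
    .onCreate((snapshot, context) => {
        const latestRef = admin.database().ref('latestLevels');
        return latestRef.child(context.params.levelId).set(snapshot.val())
            .then(() => latestRef.once('value'))
            .then(latest => {
                // At most 11 items to consider: drop anything beyond the newest 10.
                const keys = Object.keys(latest.val() || {}).sort(); // push ids sort by creation time (assumption)
                const updates = {};
                keys.slice(0, Math.max(0, keys.length - 10))
                    .forEach(key => { updates[key] = null; });       // setting null deletes the child
                return Object.keys(updates).length ? latestRef.update(updates) : null;
            });
    });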

Efficiency of MongoDB's db.collection.distinct() for every user vs saving as a db entry and retrieving results

In my Node.js app I query MongoDB for the distinct values of a particular field. This returns an array of roughly 3000 values.
Every user must get this data for every session (as it's integral to running the app).
I'm wondering whether it's more efficient (and faster) to do this for every single user:
db.collection.distinct({"value"}, function(data){
// save the data in a variable
})
Or whether I should do a server-side loading of the distinct values (say, once a day), then save it as a db entry for every user to retrieve, like this:
// Server-side:
db.collection.distinct({"value"}, function(data){
// save the data to MongoDB as a document
})
// Client-side:
db.serverInfo.find({name: "uniqueEntries"}, function(data){
// Save to browser as a variable
})
I've tested this myself and can't notice much of a difference, but I'm the only one using the app at the moment. Once I have 10/100/1,000/10,000 users, I'm wondering which will work best here.
If you have an index on this field, MongoDB should be able to return the result of the distinct() operation using only the index, which should make it fast enough.
But, as with all performance questions, profiling is the best way to be sure; in the case of MongoDB, use the explain option to see what's happening under the covers.
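For example, in the mongo shell (the collection and field names are taken from the snippets above, the rest is illustrative):
// Create an index on the field so the distinct() can be answered from the index alone.
db.collection.createIndex({ value: 1 })

// Inspect the plan; a covered distinct shows a DISTINCT_SCAN stage.
db.collection.explain("executionStats").distinct("value")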
