Best way to perform queries on a large array in NodeJS - node.js

I currently have a table in my Postgres database with about 115k rows that I feel is too slow for my serverless functions. The only thing I need that table for is to look up values using operators like ILIKE, and I believe the network round-trip is slowing things down a lot.
My thought was to take the table and turn it into a JavaScript array of objects, since it rarely, if ever, changes. Now I have it in a file such as array.ts, and inside is:
export default [
{}, {}, {},...
]
What is the best way to query this huge array? Is it best to just use the .filter function? I am currently trying to import the array and filter it, but it seems to just hang and never actually complete. MUCH slower than the current DB approach, so I am unsure if this is the right approach.

Make the database faster
As people have commented, it's likely that the database will actually perform better than anything else given that databases are good at indexing large data sets. It may just be a case of adding the right index, or changing the way your serverless functions handle the connection pool.
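For ILIKE '%...%' filters in particular, a trigram index is usually the first thing to try. A rough sketch, assuming the @databases/pg client (which pairs with the @databases/sqlite example further down); the items table and item_name column are illustrative assumptions:

// Rough sketch only: a pg_trgm index lets Postgres serve ILIKE '%foo%' filters from an
// index rather than a sequential scan. Requires rights to create the extension.
const createConnectionPool = require('@databases/pg');
const {sql} = require('@databases/pg');

const db = createConnectionPool(process.env.DATABASE_URL);

async function addTrigramIndex() {
    await db.query(sql`CREATE EXTENSION IF NOT EXISTS pg_trgm`);
    await db.query(sql`
        CREATE INDEX IF NOT EXISTS items_name_trgm_idx
        ON items USING gin (item_name gin_trgm_ops)
    `);
}

addTrigramIndex().then(() => db.dispose());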
Make local files faster
If you want to do it without the database, there are a couple of things that will make a big difference:
Read the file and then use JSON.parse, do not use require(...)
JavaScript is much slower to parse than JSON. You can therefore make things load much faster by storing the data as a .json file and parsing it with JSON.parse, rather than having Node parse it as a JavaScript module.
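A minimal sketch, assuming the table has been exported to a data.json file instead of array.ts:

const fs = require('fs');

// Parsing JSON is much cheaper than having Node parse a ~115k-element JavaScript module.
const items = JSON.parse(fs.readFileSync('./data.json', 'utf8'));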
Find a way to split up the data
Especially in a serverless environment, you're unlikely to need all the data for every request, and the serverless function will probably only serve a few requests before it is shut down and a new one is started.
If you could split your files up such that you typically only need to load an array of 1,000 or so items, things will run much faster.
Depending on the size of the objects, you might consider having a file that contains only the id of the objects & the fields needed to filter them, then having a separate file for each object so you can load the full object after you have filtered.
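A hedged sketch of that layout; the file names and the id/item_name fields are assumptions for illustration:

const fs = require('fs');
const path = require('path');

// index.json holds only [{id, item_name}, ...], so it stays small enough to load quickly.
const index = JSON.parse(fs.readFileSync('./index.json', 'utf8'));

function search(pattern) {
    const regex = new RegExp(pattern, 'i'); // rough stand-in for ILIKE
    return index
        .filter(entry => regex.test(entry.item_name))
        .map(entry =>
            // Load the full object only for the rows that matched the filter.
            JSON.parse(fs.readFileSync(path.join('items', `${entry.id}.json`), 'utf8'))
        );
}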
Use a local database
If the issue is genuinely the network latency, and you can't find a good way to split up the files, you could try using a local database engine.
@databases/sqlite can be used to query an SQLite database file that you could pre-populate with your array of values and index appropriately.
const openDatabase = require('@databases/sqlite');
const {sql} = require('@databases/sqlite');

const db = openDatabase('mydata.db');

async function query(pattern) {
    // Return the rows; without this return the caller would get undefined.
    return db.query(sql`SELECT * FROM items WHERE item_name LIKE ${pattern}`);
}

query('%foo%').then(results => console.log(results));
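For completeness, a rough sketch of a one-off build step that pre-populates that file from the existing array (the items table and its columns are illustrative assumptions):

const openDatabase = require('@databases/sqlite');
const {sql} = require('@databases/sqlite');
const items = require('./array').default; // the existing exported array (compiled from array.ts)

async function buildDatabase() {
    const db = openDatabase('mydata.db');
    await db.query(sql`CREATE TABLE IF NOT EXISTS items (id INTEGER PRIMARY KEY, item_name TEXT)`);
    // Note: a plain index only helps prefix patterns like 'foo%'; for '%foo%' searches SQLite's FTS5 is worth a look.
    await db.query(sql`CREATE INDEX IF NOT EXISTS idx_items_name ON items (item_name)`);
    for (const item of items) {
        await db.query(sql`INSERT INTO items (id, item_name) VALUES (${item.id}, ${item.item_name})`);
    }
}

buildDatabase();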

Related

Is it faster to use aggregation or manually filter through data with nodejs and mongoose?

I'm at a crossroads trying to decide what methodology to use. Basically, I have a MongoDB collection and I want to query it with specific params provided by the user, then I want to group the response according to the value of some of those parameters. For example, let's say my collection is animals, and if I query all animals I get something like this
[
  {type: "Dog", age: 3, name: "Kahla"},
  {type: "Cat", age: 6, name: "mimi"},
  ...
]
Now I would like to return to the user a response that is grouped by the animal type, so that I end up with something like
{
  Dogs: [...dog docs],
  Cats: [...cat docs],
  Cows: [...],
}
So basically I have 2 ways of doing this. One is to just use Model.find() and fetch all the animals that match my specific queries, such as age or any other field, and then manually filter and format my JSON response before sending it back to the user with res.json({}) (I'm using Express, btw).
Or I can use Mongo's aggregation framework and $group to do this at the query level, hence returning from the DB an already grouped response to my request. The only inconvenience I've found with this so far is how the response is formatted, which ends up looking more like this
[
  {
    "_id": "Dog",
    "docs": [{dog docs...}]
  },
  {
    "_id": "Cat",
    "docs": [{...}]
  }
]
The overall result is BASICALLY the same, but the formatting of the response is quite different, and my front-end client needs to adjust to how I'm sending the response. I don't really like the array of objects from the aggregation, and prefer a JSON-like object response with key names corresponding to the arrays as I see fit.
So the real question here is whether there is one significant advantage of one way over the other? Is the aggregation framework so fast that it will scale well if my collection grows to huge numbers? Is filtering through the data with javascript and mapping the response so I can shape it to my liking a very inefficient process, and hence it's better to use aggregation and adapt the front end to this response shape?
I'm assuming that by faster you mean the least time to serve a request. With that said, let's divide the time required to process your request into:
Asynchronous Operations (Network Operations, File read/write etc)
Synchronous Operations
Synchronous operations are usually much faster than asynchronous ones (this also depends on the nature of the operation and the amount of data being processed). For example, if you loop over an iterable (e.g. an Array or Map) with fewer than 1,000 items, it won't take more than a few milliseconds.
On the other hand, asynchronous operations take more time. For example, if you run an HTTP request it will take at least a couple of milliseconds to get the response.
When you query MongoDB with Mongoose, it's an asynchronous call and it will take more time. So running more queries against the database will make your API slower. MongoDB aggregation can help you reduce the total number of queries, which may help you make your APIs faster. But the catch is that aggregations are usually slower than normal find requests.
In summary, if you can filter and group the data manually without adding any extra DB queries, it's going to be faster.
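For illustration, a minimal sketch of the single-find-then-group-in-JavaScript option; the Animal model, the route and the pluralised keys (Dogs, Cats) are assumptions mirroring the shapes shown in the question:

const express = require('express');
const mongoose = require('mongoose');

const Animal = mongoose.model('Animal', new mongoose.Schema({type: String, age: Number, name: String}));
const app = express();

app.get('/animals', async (req, res) => {
    // One round-trip to MongoDB; apply whatever filters the user provided here.
    const animals = await Animal.find({}).lean();

    // Group in memory into { Dogs: [...], Cats: [...] } before responding.
    const grouped = animals.reduce((acc, animal) => {
        const key = `${animal.type}s`;
        (acc[key] = acc[key] || []).push(animal);
        return acc;
    }, {});

    res.json(grouped);
});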

Caching relational data using redis

I'm building a small social network (users have posts and posts have comments - very basic), using clustered nodejs server and redis as a distributed cache.
My approach to caching users' posts is to have a sorted set that contains all the user's post ids ordered by rate (which should be updated every time someone adds a like or comment), and the actual objects stored as hash objects.
So the get user's posts flow should look like this:
1. using zrange to get a range of ids from the sorted set.
2. using multi/exec and hgetall to fetch all the objects at once.
I have a couple of questions:
1. in regards to performance, will my approach scale when the cache size gets bigger, or should I maybe use Lua or something?
2. if I continue with the current approach, where should I save the sorted set in case of a Redis crash? If I use Redis persistence this will affect the overall performance, so I thought about using a dedicated Redis server for the sets (I searched for whether it is possible to back up only part of the Redis data but didn't find anything about it).
My approach => getTopObjects({userID}, 0, 20) :
self.zrevrange = function (setID, start, stop, multi) {
    return execute(this, "zrevrange", [setID, start, stop], multi);
};

self.getObject = function (key, multi) {
    return execute(this, "hgetall", key, multi);
};

self.getObjects = function (keys) {
    // queue one hgetall per key on a single multi, then execute them all at once
    let multi = this.client.multi();
    let promiseArray = [];
    for (var i = 0, len = keys.length; i < len; i++) {
        promiseArray.push(this.getObject(keys[i], multi));
    }
    return execute(this, "exec", [], multi).then(function (results) {
        //TODO: do something with the result.
        return Promise.all(promiseArray);
    });
};

self.getTopObjects = function (setID, start, stop) {
    //TODO: validate the range
    let thisArg = this;
    return this.zrevrange(setID, start, stop).then(function (keys) {
        return thisArg.getObjects(keys);
    });
};
It's an interesting intellectual exercise, but in my opinion this is classic premature optimization.
1) It's probably way too early to have even introduced redis, let alone be thinking about whether redis is fast enough. Your social network is almost certainly just fine up to about 1,000 users running off raw SQL queries against Mysql / Postgres / Random RDS. If it starts to slow down, get data on slow running queries and fix them with query optimizations and appropriate indexes. That'll get you past 10,000 users.
2) Now you can start introducing redis. In general, I'd encourage you to think about your redis as purely caching and not permanent storage; it shouldn't matter if it gets blown away, it just means your site is slower for the next few seconds because your users are getting their page loads from SQL queries instead of redis hits (each query re-populating that user's sorted list of posts in redis, of course).
Your strategy and example code for using redis seem fine to me, but until you have actual data on how users use your site (which may be drastically different than your current expectations), it's simply impossible to know what types of SQL indexes you will need, what keys and lists are ideal for caching in redis, etc.
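As a rough illustration of that cache-aside idea (simplified to a single JSON value per user for brevity; ioredis and the getPostsFromSql helper are assumptions):

const Redis = require('ioredis');
const redis = new Redis();

async function getUserPosts(userID) {
    const key = `posts:${userID}`;
    const cached = await redis.get(key);
    if (cached) return JSON.parse(cached);

    // Cache miss: fall back to SQL and re-populate redis for the next request.
    const posts = await getPostsFromSql(userID); // hypothetical SQL query
    await redis.set(key, JSON.stringify(posts), 'EX', 60); // short TTL; losing it only costs one SQL query
    return posts;
}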
I faced similar issues; I needed a way to query the data more efficiently. I can't say for sure, but I've heard that Redis, being single-threaded, blocks the main thread when running Lua scripts, and I'm sure that's not good for a social networking site. I've heard about Tarantool and it looks promising; I'm currently trying to wrap my head around it.
If you are concerned about your cache size growing, I think most social networks keep two weeks' worth of data in each user's cache; anything older than two weeks gets deleted, and you simply implement a scrolling feature that works with pagination. Once the user scrolls down, fetch the next two weeks' worth of data and add it back into memory, only for that specific user (and don't forget to set a new TTL on the newly added data). This helps keep your cache size lean.
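A rough sketch of that trimming idea, assuming ioredis and a per-user sorted set whose scores are timestamps (names are illustrative):

const Redis = require('ioredis');
const redis = new Redis();

const TWO_WEEKS_SECONDS = 14 * 24 * 60 * 60;

async function addToTimeline(userID, postID) {
    const key = `timeline:${userID}`;
    const now = Date.now();
    await redis.zadd(key, now, postID);
    // Drop entries older than two weeks so the per-user cache stays lean.
    await redis.zremrangebyscore(key, '-inf', now - TWO_WEEKS_SECONDS * 1000);
    // Refresh the TTL so inactive users eventually fall out of the cache entirely.
    await redis.expire(key, TWO_WEEKS_SECONDS);
}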
If Redis, or whatever in-memory data tool you are using, crashes, you simply reload the data back into memory. They all have features for saving data to files as a backup. I'm thinking of implementing another database layer, say Cassandra or MongoDB, that holds each user's timeline since inception. Sure, this creates more overhead because you have to keep three data layers (e.g. MySQL, Redis and MongoDB) in sync!
If this looks like a lot of work, feel free to use a third-party service to host your in-memory data; at least you can sleep easy, but it's gonna cost you.
That said, this is highly opinionated. I got tired of people telling me to wait until my site explodes with users, or giving the so-called premature optimization reply you got :)

Non-blocking insert into database with node js

Part of my Node.js app includes reading a file and, after some (lightweight, row by row) processing, inserting these records into the database.
The original code did just that. The problem is that the file may contain a crazy number of records, which are inserted row by row. According to some tests I did, a file of 10,000 rows completely blocks the app for some 10 seconds.
My considerations were:
Bulk create the whole object at once. This means reading the file, preparing the object by doing some calculation for each row, pushing it to the final object, and in the end using Sequelize's bulkCreate. There were two downsides:
A huge insert can be as blocking as thousands of single-row inserts.
This may make it hard to generate reports for rows that were not inserted.
Bulk create in smaller, reasonable objects. This means reading the file, iterating over each n (e.g. 2000) rows, doing the calculations and adding them to an object, then using Sequelize's bulkCreate for that object (see the sketch after this question). Object preparation and the bulkCreate would run asynchronously. The downside:
Setting the object length seems arbitrary.
Also it seems like an artifice on my side, while there might be existing and proven solutions for this particular situation.
Moving this part of the code into another process. Ideally limiting CPU usage to reasonable levels for that process (I don't know if it can be done or if it is smart).
Simply creating a new process for this (and other blocking parts of the code).
This is not the 'help me write some code' type of question. I have already looked around, and it seems there is enough documentation. But I would like to invest in an efficient solution, using the proper tools. Other ideas are welcome.
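For reference, a minimal sketch of the chunked bulkCreate option; the Record model, the prepareRow helper and the chunk size are all illustrative assumptions:

const {setImmediate} = require('timers/promises');
const {Record} = require('./models'); // hypothetical Sequelize model
const prepareRow = row => row;        // placeholder for the per-row calculation

async function insertInChunks(rows, chunkSize = 2000) {
    const failedChunks = [];
    for (let i = 0; i < rows.length; i += chunkSize) {
        const chunk = rows.slice(i, i + chunkSize).map(prepareRow);
        try {
            await Record.bulkCreate(chunk);
        } catch (err) {
            failedChunks.push({offset: i, error: err.message}); // for reporting rows that were not inserted
        }
        // Yield back to the event loop between batches so other requests can be served.
        await setImmediate();
    }
    return failedChunks;
}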

is mongo stored javascript good solution to multiple DB inserts in nodejs

I have a piece of functionality in a MEAN stack app that does multiple collection inserts and creations. If I do that in plain Mongoose, it's going to be multiple Mongo calls and it might be slow.
Can I use MongoDB stored JavaScript for this? Pass some values to the Mongo JavaScript and it can do all the things from there.
Is it a suggested approach?
The recommended way to do lots of inserts is to use the Bulk Write Operations feature. You can define a set of inserts to be done as a single batch, then pass them all to MongoDB in one go.
However, that is really only appropriate for jobs such as a big data take-on, where you are importing a large number of similar records in one go. If you are running a normal application where there might be inserts, updates, deletes and reads in varying proportions and at varying rates, you would be better off letting Mongoose submit them as individual queries, and making sure your server hardware can cope.
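A minimal sketch of that kind of batch, assuming a Mongoose model named Item and an array of plain objects called docs (both illustrative):

const mongoose = require('mongoose');
const Item = mongoose.model('Item', new mongoose.Schema({name: String}));

async function insertBatch(docs) {
    // One round-trip to MongoDB instead of one call per document.
    return Item.bulkWrite(docs.map(doc => ({insertOne: {document: doc}})));
}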

Cloud Functions Http Request return cached Firebase database

I'm new to Node.js and Cloud Functions for Firebase; I'll try to be specific with my question.
I have a firebase-database with objects including a "score" field. I want the data to be retrieved based on that, and that can be done easily on the client side.
The issue is that, if the database grows big, I'm worried that it will either take too long to return and/or consume a lot of resources. That's why I was thinking of an HTTP service using Cloud Functions to store a cache with the top N objects, which would keep itself updated via a listener whenever the score of any object changes.
Then, client side just has to call something like https://myexampleprojectroute/givemethetoplevels to receive a Json with the top N levels.
Is it reasonable? If so, how can I approach that? Which structures do I need for this cache, and how do I return them in JSON format via HTTP?
At the moment I'll keep doing it client side but I'd really like to have that both for performance and learning purpose.
Thanks in advance.
EDIT:
In the end I did not implement the optimization. The reason is, first, that the Firebase database does not provide a "child count", so with my newbie JavaScript knowledge I didn't find a way to implement that. Second, and most important, I'm pretty sure it won't scale up to millions, having at most 10K entries, and Firebase has rules for optimizing sorted reads. For more information please check out this link.
Also, I'll post a simple code snippet to retrieve data from your database via http request using cloud-functions in case someone is looking for it. Hope this helps!
// Simple test function to retrieve a JSON object from the DB
// Warning: no security measures are used, such as authentication, request method checks, etc.
const functions = require('firebase-functions');
const admin = require('firebase-admin');
admin.initializeApp();

exports.request_all_levels = functions.https.onRequest((req, res) => {
    const ref = admin.database().ref('CustomLevels');
    ref.once('value').then(function (snapshot) {
        res.status(200).send(JSON.stringify(snapshot.val()));
    });
});
You're duplicating data upon writes, to gain better read performance. That's a completely reasonable approach. In fact, it is so common in NoSQL databases to keep such derived data structures that it even has a name: denormalization.
A few things to keep in mind:
While Cloud Functions run in a more predictable environment than the average client, the resources are still limited. So reading a huge list of items to determine the latest 10 items is still a suboptimal approach. For simple operations like this, you'll want to keep the derived data structure up to date on every write operation.
So if you keep a "latest 10" list and a new item comes in, you remove the oldest item and add the new one. With this approach you have at most 11 items to consider, compared to having your Cloud Function query the whole list of items for the latest 10 upon every write, which is an O(something-with-n) operation (see the sketch below).
Same for an averaging operation: you'll find a moving average to be most performant, because it doesn't require any of the previous data.
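A rough sketch of that first point, keeping a denormalized top-N list updated on every write; the paths (CustomLevels, TopLevels) and the score field follow the question, everything else is illustrative:

const functions = require('firebase-functions');
const admin = require('firebase-admin');
admin.initializeApp();

const TOP_N = 10;

exports.updateTopLevels = functions.database
    .ref('/CustomLevels/{levelId}')
    .onWrite((change, context) => {
        const level = change.after.val();
        if (level === null) return null; // deletion: ignored here for brevity

        return admin.database().ref('TopLevels').transaction(current => {
            const top = current || {};
            top[context.params.levelId] = level;
            // At most TOP_N + 1 candidates to consider; keep the TOP_N highest scores.
            const kept = Object.entries(top)
                .sort((a, b) => b[1].score - a[1].score)
                .slice(0, TOP_N);
            return Object.fromEntries(kept);
        });
    });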
