MongoDB + NodeJS: MapReduce or manual calculation - node.js

I am creating a REST API in NodeJS that connects to MongoDB, runs a MapReduce, and stores the results in a different collection.
The code is pretty simple. It takes a User ID, gets all other users who are related to this user somehow using some algorithm, and then, for each one, calculates a likeness percentage. With 50k users in the test database, this MapReduce takes around 200-800ms, which is ideal for me. If this were to get popular and receive hundreds of concurrent requests like this, I'm pretty sure that would no longer be the case. I understand that MongoDB might need to be sharded at that point.
The other option is to do a normal find(), loop over the cursor, and apply the same logic. It takes about the same amount of time as the MapReduce. I considered it as a way to move the heavy lifting of the calculations to the client side (NodeJS) rather than the server side (MongoDB), as MapReduce does. Does this idea have any merit? My thinking was that this way I could scale the API horizontally behind a load balancer.

It would be better to keep the heavy lifting off the server that processes each request and put it on the database.
If you have 1000 requests and 200 of them require the calculation, the other 800 can be processed as normal by the server, so long as Mongo does the calculation with mapReduce or aggregation.
If you instead run the calculations manually on your Node server, all requests will be affected, because the same server has to do the heavy lifting.
Mongo is quite efficient at aggregation, and I would expect the same of mapReduce.
I recently moved a lot of logic from my server onto MongoDB where I could, and it made a world of difference.

Related

How to avoid database from being hit hard when API is getting bursted?

I have an API which allows other microservices to call on to check whether a particular product exists in the inventory. The API takes in only one parameter which is the ID of the product.
The API is served through API Gateway in Lambda and it simply queries against a Postgres RDS to check for the product ID. If it finds the product, it returns the information about the product in the response. If it doesn't, it just returns an empty response. The SQL is basically this:
SELECT * FROM inventory where expired = false and product_id = request.productId;
However, the problem is that many services are calling this particular API very heavily to check the existence of products. Not only that, the calls often come in bursts. I assume those services loop through a list of product IDs and check for their existence individually, hence the burst.
The number of concurrent calls to the API results in many queries to the database. The rate can burst beyond 30 queries per second, and there can be a few hundred thousand requests to fulfil. The queries are mostly the same, except for the product ID in the WHERE clause. The column is indexed and a query takes only 5-8ms on average. Still, connections to the database occasionally time out when the rate gets too high.
I'm using Sequelize as my ORM, and the error I get when it times out is SequelizeConnectionAcquireTimeoutError. There is a good chance that the burst rate was too high and maxed out the connection pool too.
Some options I have considered:
Using a cache layer. But I have noticed that, most of the time, 90% of the product IDs in the requests are not repeated. This would mean that 90% of the time, it would be a cache miss and it will still query against the database.
Auto scale up the database. But because the calls are bursty and I don't know when they may come, the autoscaling won't complete in time to avoid the time out. Moreover, the query is a very simple select statement and the CPU of the RDS instance hardly crosses 80% during the bursts. So I doubt scaling it would do much too.
What other techniques can I use to keep the database from being hit hard when the API gets bursts of calls that are mostly unique and difficult to cache?
Use a cache loaded at boot time
You can load all the necessary columns into an in-memory data store (Redis). Every update in the database must then also be applied to the cached data (for example by a cron job).
Problems: memory overhead; keeping the cache updated
Limit DB calls
Create a buffer for IDs. Store n IDs and then make one query for all of them, or empty the buffer every m seconds.
Problems: client response time; extra processing of the query result
Change your database
Use a NoSQL database for this data. According to this article and this one, I think choosing a NoSQL database is a better idea.
Problems: multiple data stores
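The buffering idea under "Limit db calls" could look roughly like the sketch below. It is only an illustration: IdBatcher, lookupMany, maxSize and maxWaitMs are hypothetical names, and lookupMany is assumed to run one query for many IDs (for example with WHERE product_id IN (...)) and return a Map from ID to row.

```javascript
// Collects product IDs and resolves them with one lookup per batch.
// `lookupMany` is a hypothetical function: it takes an array of IDs and
// returns a Promise of a Map from ID to row.
class IdBatcher {
  constructor(lookupMany, { maxSize = 50, maxWaitMs = 20 } = {}) {
    this.lookupMany = lookupMany;
    this.maxSize = maxSize;
    this.maxWaitMs = maxWaitMs;
    this.pending = []; // entries of { id, resolve, reject }
    this.timer = null;
  }

  // Returns a Promise for the row of a single ID (undefined if not found).
  get(id) {
    return new Promise((resolve, reject) => {
      this.pending.push({ id, resolve, reject });
      if (this.pending.length >= this.maxSize) {
        this.flush(); // buffer is full: query now
      } else if (this.timer === null) {
        this.timer = setTimeout(() => this.flush(), this.maxWaitMs);
      }
    });
  }

  // Sends everything buffered so far as one query.
  flush() {
    clearTimeout(this.timer);
    this.timer = null;
    const batch = this.pending;
    this.pending = [];
    if (batch.length === 0) return;
    const ids = [...new Set(batch.map((entry) => entry.id))];
    this.lookupMany(ids).then(
      (rows) => batch.forEach((entry) => entry.resolve(rows.get(entry.id))),
      (err) => batch.forEach((entry) => entry.reject(err))
    );
  }
}

// Usage with a fake lookup that "finds" every even ID.
const fakeLookupMany = async (ids) =>
  new Map(ids.filter((id) => id % 2 === 0).map((id) => [id, { product_id: id }]));
const batcher = new IdBatcher(fakeLookupMany, { maxSize: 3 });
Promise.all([1, 2, 4].map((id) => batcher.get(id))).then((rows) => console.log(rows));
```

The trade-off is the one listed above: each caller waits up to maxWaitMs longer, and the batched result has to be split back out per request.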
Start with a covering index to handle your query. You might create an index like this for your table:
CREATE INDEX inv_lkup ON inventory (product_id, expired) INCLUDE (col, col, col);
Mention all the columns in your SELECT in the index, either in the main list of indexed columns or in the INCLUDE clause. Then the DBMS can satisfy your query completely from the index. It's faster.
You could start using AWS Lambda throttling to handle this problem. But for that to work, the consumers of your API will need to retry when they get 429 responses. That might be super-inconvenient.
Sorry to say, you may need to stop using Lambda. Ordinary web servers have good mechanisms for managing burst workloads.
They have an incoming connection (TCP/IP listen) queue. Each new request coming in lands in that queue, where it waits until the server software accepts the connection. When the server is busy, requests wait in that queue. Under high load, they wait a bit longer. In nodejs's case, if you use clustering there's just one of these incoming connection queues, and all the processes in the cluster use it.
The server software you run (to handle your API) has a pool of connections to your DBMS. That pool has a maximum number of connections in it. As your server software handles each request, it awaits a connection from the pool. If no connection is immediately available, the request handling pauses until one is available and then continues. This too smooths out the requests to the DBMS. (Be aware that each process in a nodejs cluster has its own pool.)
Paradoxically, a smaller DBMS connection pool can improve overall performance, by avoiding too many concurrent SELECTs (or other queries) on the DBMS.
This kind of server configuration can be scaled out: a load balancer will do. So will a server with more cores and more nodejs cluster processes. An elastic load balancer can also add new server VMs when necessary.
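The smoothing effect of a bounded pool can be sketched without a real database. BoundedPool below is a hypothetical, simplified stand-in for a driver's connection pool, not the Sequelize or pg implementation: at most `size` tasks run at once and the rest wait their turn.

```javascript
// A tiny stand-in for a connection pool: at most `size` tasks run at once,
// the rest wait in a FIFO queue until a slot is released.
class BoundedPool {
  constructor(size) {
    this.size = size;
    this.active = 0;
    this.waiting = [];
    this.peak = 0; // highest concurrency observed, for illustration
  }

  async run(task) {
    if (this.active >= this.size) {
      // No free slot: park this request until release() wakes it.
      await new Promise((resolve) => this.waiting.push(resolve));
    } else {
      this.active++;
    }
    this.peak = Math.max(this.peak, this.active);
    try {
      return await task();
    } finally {
      this.release();
    }
  }

  release() {
    const next = this.waiting.shift();
    if (next) {
      next(); // hand the slot directly to the next waiter
    } else {
      this.active--;
    }
  }
}

// Usage: a burst of five simulated queries against a pool of two.
const pool = new BoundedPool(2);
const fakeQuery = (id) => new Promise((resolve) => setTimeout(() => resolve(id), 5));
Promise.all([1, 2, 3, 4, 5].map((id) => pool.run(() => fakeQuery(id)))).then((ids) =>
  console.log(ids, 'peak concurrency:', pool.peak)
);
```

However large the burst, the DBMS never sees more than `size` concurrent queries; the excess simply waits in the application.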

Nodejs application hangs on heavy requests

I am using a Node.js Express server with pg-promise. Some of my database queries take a long time to return a result. For such queries I set a 3-second timeout that fails the promise if the pg-promise query takes longer, and the server returns an error. However, if I send subsequent requests with the same (heavy) queries, the application hangs and takes time to start processing the new request. It does not throw any error, which makes it difficult to debug. What could cause the Node application to hang?
Whenever somebody asks about queries taking too long to execute from the very start, it always points to a misunderstanding of the fundamentals of developing and implementing database services.
Such issues typically stem from the following problems:
Bad database design, or lack of essential performance considerations
Bad query execution planning, i.e. use of very inefficient query logic
Bad use of the connection pool, i.e. the database connectivity issues
Combinations of the above
So when you try to address such a huge pool of possible problems with a brief problem description and no code examples, you will never get a usable answer. The question is far too broad, and answering it would require covering too many topics related to writing database services.
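For reference, the kind of timeout described in the question is often written with Promise.race. The sketch below is an assumption about how such a wrapper might look, not the asker's code and not a diagnosis; withTimeout and slowQuery are hypothetical, and the slow query is simulated with a timer rather than pg-promise. It shows one property worth keeping in mind: a timeout built this way only stops waiting and does not cancel the underlying work.

```javascript
// Rejects if `promise` has not settled within `ms` milliseconds.
// Note: this only stops *waiting*; it does not cancel the underlying work.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Hypothetical slow query, simulated with a timer instead of a real database.
let finishedQueries = 0;
function slowQuery(durationMs) {
  return new Promise((resolve) =>
    setTimeout(() => {
      finishedQueries++;
      resolve('rows');
    }, durationMs)
  );
}

withTimeout(slowQuery(50), 10)
  .catch((err) => console.log('request failed:', err.message))
  .then(() => new Promise((resolve) => setTimeout(resolve, 80)))
  .then(() => console.log('queries that still ran to completion:', finishedQueries));
```

Whether a real query keeps occupying a pooled connection after such a timeout depends on how the timeout is implemented, which the question does not show.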

Is database access a blocking operation in Node.js?

I went through this article and the following passage raised a question:
QUEUED INPUTS If you’re receiving a high amount of concurrent data, your database can become a bottleneck. As depicted above, Node.js can easily handle the concurrent connections themselves. But because database access is a blocking operation (in this case), we run into trouble.
Isn't DB access an asynchronous operation in Node.js? E.g. I usually perform all possible data transformations using MongoDB aggregation to minimize the impact on Node.js. Or am I getting things wrong?
That is why callbacks came into the picture. This is the real purpose of callbacks: we don't know how long the DB will take to process the aggregation, so the driver hands the result to a callback when it is ready instead of making Node.js wait. In that sense, DB access in Node.js is asynchronous.
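A small sketch makes the ordering visible. fakeAggregate is a hypothetical stand-in that simulates a database call with a timer; it is not a real driver API, but it delivers its result through a callback in the same style.

```javascript
// Simulated database call: the "query" completes later and reports its
// result through a callback. `fakeAggregate` is a hypothetical stand-in.
function fakeAggregate(pipeline, callback) {
  setTimeout(() => callback(null, [{ stages: pipeline.length }]), 20);
}

const order = [];

order.push('query sent');
fakeAggregate([{ $match: {} }, { $group: { _id: null } }], (err, result) => {
  if (err) throw err;
  order.push('result received');
  console.log(order.join(' -> '));
});
// Execution continues immediately; the event loop is free for other work.
order.push('handling other requests');
```

The database itself can still be the bottleneck when many queries pile up, but the Node.js process is not blocked while it waits.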

How can I "break up" a long running server side function in a Meteor app?

I have, as part of a Meteor application, a server side that gets POST messages of information to feed to the web client via inserts/updates to a Collection. So far so good. However, sometimes these updates can be rather large (50K records a go, every 5 seconds). I was having a hard time keeping up with this until I started using the batch-insert package and then the low-level batch.find.update() and batch.execute() from Mongo.
However, there is still a good amount of processing going on even with 50K records (it does some calculations, analytics, etc). I would LOVE to be able to "thread" that logic so the main event loop can continue along. However, I am not sure there is a real easy way to create "real" threads for this within Meteor. So barring that, I would like to know the best / proper way of at least "batching" the work so that every N (say 1K or so) records I can release the event loop back to process other events (like some client side DDP messages and the like). Then do another 1K records, etc. until all the records are done.
I am THINKING the solution lies within using Fibers/Futures -- which appear to be the Meteor way -- but I am not sure whether that is correct or whether lower-level tools like "setTimeout()" and/or "setImmediate()" are more appropriate.
TIA!
Meteor is not a one size fits all tool. I think you should decouple your meteor application from your batch processing. Set up a separate meteor instance, or better yet set up a pure node.js server to handle these requests and batch processes. It would look like this:
Create a node.js instance that connects to the same mongo database using the mongodb package (https://www.npmjs.com/package/mongodb).
Use express if you're using node.js to handle the post requests (https://www.npmjs.com/package/express).
Do the batch processing/inserts/updates in this instance.
The updates in mongo will be reflected in meteor very quickly. I had a similar situation and used a node server to do some batch data collection and then pass it into a Cassandra database. I then used Pig Latin to run some batch operations on that data, and then inserted it into mongo. My meteor application would reactively display the new data pretty much instantaneously.
You can call this.unblock() inside a server method so that subsequent method calls from the same client can start running without waiting for this one to finish. See the example below.
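Meteor.methods({
  longMethod: function () {
    // Let later method calls from this client run without waiting for this one.
    this.unblock();
    // Simulate an hour of work (Meteor._sleepForMs is an internal helper).
    Meteor._sleepForMs(1000 * 60 * 60);
  }
});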
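The batching idea raised in the question, processing N records and then releasing the event loop, can be sketched in plain Node.js with setImmediate(). This is a generic sketch, not Meteor-specific: processInChunks and processRecord are hypothetical names, and server code that relies on Fibers may need a different wrapper.

```javascript
// Processes `records` in chunks of `chunkSize`, yielding to the event loop
// between chunks so other callbacks (timers, I/O, DDP messages) can run.
// `processRecord` is a hypothetical per-record function.
async function processInChunks(records, chunkSize, processRecord) {
  const results = [];
  for (let start = 0; start < records.length; start += chunkSize) {
    const chunk = records.slice(start, start + chunkSize);
    for (const record of chunk) {
      results.push(processRecord(record));
    }
    // Yield: the rest of the loop resumes on a later event loop iteration.
    await new Promise((resolve) => setImmediate(resolve));
  }
  return results;
}

// Usage: 50K dummy records, 1K at a time.
const records = Array.from({ length: 50000 }, (_, i) => i);
processInChunks(records, 1000, (n) => n * 2).then((results) =>
  console.log(results.length, 'records processed')
);
```

Smaller chunks keep the server more responsive at the cost of more scheduling overhead; the right size depends on how expensive each record is.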

Increase speed of CouchDB's _changes feed while applying a filter

I'm having trouble with poor performance on CouchDB's _changes feed when there are multiple observers.
I have CouchDB running inside a virtual machine on a laptop, and multiple iOS clients are consuming _changes?feed=continuous on one of the databases over the network, using CouchDB's HTTP API. As the number of clients increases, the speed at which the changes come through is slowed to a crawl.
N.B. I'm actually communicating with CouchDB via an Apache reverse proxy, which is compressing the responses.
And I'm also noticing that, while applying a filter to the feed, it will often go long periods without delivering any changes to the HTTP stream. Almost as if I'm waiting for it to check a batch of documents that don't meet my filter.
Are there any settings I can enable or optimisations I can make that will help speed this all up?
The increase in latency with the number of consumers of a filtered _changes feed is no surprise once you realize that, for each change, CouchDB has to ask the query server to evaluate the filter() function. Apparently it doesn't cache the results, so it has to perform this operation for each consumer.
Something you could try is dropping the filter parameter and using include_docs=true instead. This way the feed producer wouldn't have to ask the query server to evaluate each change, which should make it more responsive. Of course, this comes at the price of significantly increasing the amount of data transferred in the feed, and you have to duplicate the filter() function logic on the client side. It's not ideal, but I think it's worth a shot.
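Duplicating the filter on the client could look like the sketch below. It assumes the continuous feed delivers one JSON object per line with the document under `doc` when include_docs=true is set; filterChangeLines, the sample lines and the `type` and `title` fields are made up for illustration.

```javascript
// Parses lines from a continuous _changes feed requested with
// include_docs=true and keeps only changes whose document passes `predicate`.
// `predicate` plays the role of the server-side filter() function.
function filterChangeLines(lines, predicate) {
  const kept = [];
  for (const line of lines) {
    if (line.trim() === '') continue; // heartbeats arrive as empty lines
    const change = JSON.parse(line);
    if (change.doc && predicate(change.doc)) {
      kept.push(change);
    }
  }
  return kept;
}

// Sample feed lines; the document fields (`type`, `title`) are made up.
const sampleLines = [
  '{"seq":1,"id":"a","changes":[{"rev":"1-x"}],"doc":{"_id":"a","type":"note","title":"one"}}',
  '',
  '{"seq":2,"id":"b","changes":[{"rev":"1-y"}],"doc":{"_id":"b","type":"task","title":"two"}}',
  '{"seq":3,"id":"c","changes":[{"rev":"2-z"}],"deleted":true,"doc":{"_id":"c","_deleted":true}}',
];

const notes = filterChangeLines(sampleLines, (doc) => doc.type === 'note');
console.log(notes.map((change) => change.id));
```

In a real client the lines would come from the HTTP stream as they arrive rather than from an array.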
