Prevent concurrent processing in NodeJS

I need NodeJS to prevent concurrent operations for the same requests. From what I understand, if NodeJS receives multiple requests, this is what happens:
REQUEST1 ---> DATABASE_READ
REQUEST2 ---> DATABASE_READ
DATABASE_READ complete ---> EXPENSIVE_OP() --> REQUEST1_END
DATABASE_READ complete ---> EXPENSIVE_OP() --> REQUEST2_END
This results in two expensive operations running. What I need is something like this:
REQUEST1 ---> DATABASE_READ
DATABASE_READ complete ---> DATABASE_UPDATE
DATABASE_UPDATE complete ---> REQUEST2 ---> DATABASE_READ --> REQUEST2_END
---> EXPENSIVE_OP() --> REQUEST1_END
This is what it looks like in code. The problem is the window between when the app starts reading the cache value and when it finishes writing to it. During this window, the concurrent requests don't know that there is already one request with the same itemID running.
app.post("/api", async function(req, res) {
const itemID = req.body.itemID
// See if itemID is processing
const processing = await DATABASE_READ(itemID)
// Due to how NodeJS works,
// from this point in time all requests
// to /api?itemID="xxx" will have processing = false
// and will conduct expensive operations
if (processing == true) {
// "Cheap" part
// Tell client to wait until itemID is processed
} else {
// "Expensive" part
DATABASE_UPDATE({[itemID]: true})
// All requests to /api at this point
// are still going here and conducting
// duplicate operations.
// Only after DATABASE_UPDATE finishes,
// all requests go to the "Cheap" part
DO_EXPENSIVE_THINGS();
}
}
Edit
Of course I can do something like this:
const lockedIDs = {}
app.post("/api", function(req, res) {
    const itemID = req.body.itemID
    const locked = lockedIDs[itemID] ? true : false // sync equivalent to async DATABASE_READ(itemID)
    if (locked) {
        // Tell client to wait until itemID is processed
        // No need to do expensive operations
    } else {
        lockedIDs[itemID] = true // sync equivalent to async DATABASE_UPDATE({[itemID]: true})
        // Do expensive operations
        // itemID is now "locked", so subsequent requests will not go here
    }
})
lockedIDs here behaves like an in-memory, synchronous key-value store. That is fine if there is just one server. But what if there are multiple server instances? Then I need separate cache storage, like Redis, and I can only access Redis asynchronously. So this will not work, unfortunately.
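Worth noting: that window disappears if the check and the write are a single atomic command. Redis's SET with the NX flag does exactly that, so the fact that access is asynchronous stops mattering. A minimal sketch, assuming the node-redis v4 client; the lock key name and expiry are illustrative:

const { createClient } = require("redis");
const redisClient = createClient();
redisClient.connect(); // connect once at app startup

app.post("/api", async function(req, res) {
    const itemID = req.body.itemID
    // SET ... NX PX runs atomically on the Redis server: it returns "OK" if the
    // key was newly set, or null if it already existed, so there is no window
    // between the read and the write.
    const acquired = await redisClient.set(`lock:${itemID}`, "1", { NX: true, PX: 60000 })
    if (!acquired) {
        // Tell client to wait until itemID is processed
    } else {
        // This request owns the lock, so the expensive part runs exactly once
        DO_EXPENSIVE_THINGS();
    }
})

Whoever gets "OK" back owns the lock; every concurrent request sees null and takes the cheap path.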

Ok, let me take a crack at this.
So, the problem I'm having with this question is that you've abstracted the problem so much that it's really hard to help you optimize. It's not clear what your "long running process" is doing, and what it is doing will affect how to solve the challenge of handling multiple concurrent requests. What's your API doing that you're worried about consuming resources?
From your code, at first I guessed that you're kicking off some kind of long-running job (e.g. file conversion or something), but then some of the edits and comments make me think that it might be just a complex query against the database which requires a lot of calculations to get right and so you want to cache the query results. But I could also see it being something else, like a query against a bunch of third party APIs that you're aggregating or something. Each scenario has some nuance that changes what's optimal.
That said, I'll explain the 'cache' scenario and you can tell me if you're more interested in one of the other solutions.
Basically, you're in the right ballpark for the cache already. If you haven't already, I'd recommend looking at cache-manager, which simplifies your boilerplate a little for these scenarios (and lets you set cache invalidation and even have multi-tier caching). The piece that you're missing is that you essentially should always respond with whatever you have in the cache, and populate the cache outside the scope of any given request. Using your code as a starting point, something like this (leaving off all the try..catches and error checking and such for simplicity):
// A GET is OK here, because no matter what we're firing back a response quickly,
// and semantically this is a query
app.get("/api", async function(req, res) {
    const itemID = req.query.itemID
    // In this case, I'm assuming you have a cache object that basically gets whatever
    // is cached in your cache storage and can set new things there too.
    let item = await cache.get(itemID)
    // Item isn't in the cache at all, so this is the very first attempt.
    if (!item) {
        // Go ahead and let the client know we'll get to it later. 202 Accepted should
        // be fine, but pick your own status code to let them know it's in process.
        // Other good options include 503 Service Unavailable with a Retry-After
        // header, and 420 Enhance Your Calm (non-standard, but funny).
        res.status(202).send({ id: itemID });
        // Put an empty object in there so we know it's working on it.
        await cache.set(itemID, {});
        // Start the long-running process, which should update the cache when it's done.
        await populateCache(itemID);
        return;
    }
    // Here we have an item in the cache, but it's not done processing. Maybe you
    // could just check to see if it's an empty object or not, but I'm assuming
    // that we've set up a boolean flag on the cached object for when it's done.
    if (!item.processed) {
        // The client should try again later like above. Exit early. You could
        // alternatively send the partial item, an empty object, or a message.
        return res.status(202).send({ id: itemID });
    }
    // If we get here, the item is in the cache and done processing.
    return res.send(item);
})
Now, I don't know precisely what all your stuff does, but if it's me, populateCache from above is a pretty simple function that just calls whatever service we're using to do the long-running work and then puts it into the cache.
async function populateCache(itemId) {
    const item = await service.createThisWorkOfArt(itemId);
    await cache.set(itemId, item);
    return;
}
Let me know if that's not clear or if your scenario is really different from what I'm guessing.
As mentioned in the comments, this approach will cover most normal issues you might have with your described scenario, but it will still allow two requests to both fire off the long-running process if they come in faster than the write to your cache store (e.g. Redis). I judge the odds of that happening to be pretty low, but if you're really concerned about it, then the next, more paranoid version of this would be to remove the long-running process code from your web API altogether. Instead, your API just records that someone requested that work to happen; if there's nothing in the cache, respond as I did above, but completely remove the block that actually calls populateCache.
Instead, you would have a separate worker process running that would periodically (how often depends on your business case) check the cache for unprocessed jobs and kick off the work for processing them. By doing it this way, even if you have 1000's of concurrent requests for the same item, you can ensure that you're only processing it one time. The downside of course is that you add whatever the periodicity of the check is to the delay in getting the fully processed data.
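For illustration, the worker in that more paranoid setup could be as small as the following sketch; cache.unprocessedIDs() is a hypothetical helper for however "find unprocessed jobs" looks in your cache store, and populateCache is the same function as above:

// Hypothetical worker process: poll the cache for unprocessed items and
// populate them one at a time, so each item is only ever processed once
// no matter how many requests asked for it.
async function workerLoop() {
    const pendingIDs = await cache.unprocessedIDs(); // hypothetical lookup
    for (const itemID of pendingIDs) {
        await populateCache(itemID);
    }
    // Re-check on a delay; the right periodicity depends on your business case.
    setTimeout(workerLoop, 5000);
}
workerLoop();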

You could create a local Map object (in memory, for synchronous access) whose keys are the itemIDs currently being processed. You could make the value for each key a promise that resolves with the result from whoever previously processed that key. I think of this like a gatekeeper: it keeps track of which itemIDs are being processed.
This scheme tells future requests for the same itemID to wait and does not block other requests - I thought that was important rather than just using a global lock on all requests related to itemID processing.
Then, as part of your processing, you first check the local Map object. If that key is in there, then it's currently being processed. You can then just await the promise from the Map object to see when it's done being processed and get any result from prior processing.
If it's not in the Map object, then it's not being processed now and you can immediately put it in Map to mark it as "in process". If you set a promise as the value, then you can resolve that promise with whatever result you get from this processing of the object.
Any other requests that come along will end up just waiting on that promise and you will thus only process this ID once. The first one to start with that ID will process it and all other requests that come along while it's processing will use the same shared result (thus saving the duplication of your heavy computation).
I tried to code up an example, but did not really understand what your pseudo-code was trying to do well enough to offer a code example.
Systems like this have to have perfect error handling so that all possible error paths handle the Map and promise embedded in the Map properly.
Based on your fairly light pseudo-code example, here's a similar pseudo code example that illustrates the above concept:
const itemInProcessCache = new Map();

app.get("/api", async function(req, res) {
    const itemID = req.query.itemID
    let gate = itemInProcessCache.get(itemID);
    if (gate) {
        gate.then(val => {
            // use cached result here from previous processing
        }).catch(err => {
            // decide what to do when previous processing had an error
        });
    } else {
        let p = DATABASE_UPDATE({[itemID]: true}).then(result => {
            // expensive processing done
            // return the final value so any others waiting on the gate can just use it
            // decide if you want to clear this item from itemInProcessCache or not
        }).catch(err => {
            // error on expensive processing
            // remove from the gate cache because we didn't get a result
            // expensive processing will have to be done by someone else
            itemInProcessCache.delete(itemID);
        });
        // mark this item as being processed
        itemInProcessCache.set(itemID, p);
    }
});
Note: This relies on the single-threadedness of node.js. No other request can get started until the request handler here returns so that itemInProcessCache.set(itemID, p); gets called before any other requests for this itemID could get started.
Also, I don't know databases very well, but this seems very much like a feature that a good multi-user database might have built in or have supporting features that makes this easier since it's not an uncommon idea to not want to have multiple requests all trying to do the same database work (or worse yet, trouncing each other's work).

Related

Does a backend endpoint with long awaits within it block other endpoints?

My backend has a few endpoints. Most of them return some JSON to the customer and are pretty fast; however, one of them takes a very long time to process. It takes an image URL from the request body, manipulates that image to get a new one, and once the image is processed it uploads it to a server in order to get back a URL, and only then can it use the URL to make an order.
Getting the enhanced image and uploading it to the server (to get back the URL) take a long time, like a good 3 seconds each, if not more. I don't want the "order" endpoint to block the other endpoints, if that is something that would happen. Each order is independent of the previous or the next one, and I don't care how long it takes to process one, as long as it doesn't disrupt and block the event loop.
For now this is my code:
app.post("/order", async (req,res) => {
AIEnhancedImage = await enhance(req.body.image)
url = await uploadImageToServer(AIEnhancedImage)
order(url)
}
app.get("/A"), async (req,res) => {
...
}
app.get("/B"), async (req,res) => {
...
}
app.get("/C"), async (req,res) => {
...
}
My question is: if another endpoint is hit, will that endpoint be blocked by the "order" one if an order is processing?
If so, what is a better implementation to make sure the order endpoint is processed bit by bit instead of all at once?
This doubt probably arises from my lack of knowledge about the event loop. What I hope is that the code from the order endpoint will be added to the event loop but be processed independently and at the same time as requests to other endpoints. The blocking part would only be within that endpoint, so it wouldn't significantly affect the performance of the other endpoints.
The answer is it depends.
Is the code below CPU intensive or IO intensive?
AIEnhancedImage = await enhance(req.body.image)
url = await uploadImageToServer(AIEnhancedImage)
order(url)
Only one piece of your JavaScript can be running inside the event loop at a time. So if you are doing some CPU-intensive task, nothing else can run until that task finishes.
Think of it like this: of whatever custom code you write, only one thing can run at a time.
But if you are doing an IO-based task, Node.js will use a special worker pool to process and wait for the IO. So while Node.js waits for the IO, it will pick something else off the event loop and try to process it.
https://nodejs.org/en/docs/guides/event-loop-timers-and-nexttick/
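If enhance() turns out to be CPU-bound, one way to keep the event loop free is to move that work into a worker thread. A rough sketch using Node's built-in worker_threads module; enhance-worker.js is a hypothetical file that performs the heavy work and posts the result back:

const { Worker } = require("worker_threads");

// Runs the CPU-heavy enhancement off the main thread so /A, /B and /C
// stay responsive while an order is being processed.
function enhanceInWorker(image) {
    return new Promise((resolve, reject) => {
        const worker = new Worker("./enhance-worker.js", { workerData: image });
        worker.once("message", resolve); // the worker posts the enhanced image back
        worker.once("error", reject);
    });
}

// Inside enhance-worker.js (hypothetical), the heavy function runs and reports back:
// const { parentPort, workerData } = require("worker_threads");
// parentPort.postMessage(enhance(workerData));

IO-bound steps like uploadImageToServer can stay on the main thread; awaiting them already lets other requests run.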

How to handle multiple post requests at the same time while saving one of them on the db?

I am getting n post requests (on each webhook trigger) from a webhook. The data is identical on all requests that come from the same trigger - they all have the same 'orderId'. I'm interested in saving only one of these requests, so on each endpoint hit I'm checking if this specific orderId exists as a row in my database, otherwise - create it.
        if (await orderIdExists === null) {
            await Order.create({
                userId,
                status: PENDING,
                price,
                ...
            });
            await sleep(3000)

            function sleep(ms) {
                return new Promise((resolve) => {
                    setTimeout(resolve, ms);
                });
            }
        }
        return res.status(HttpStatus.OK).send({success: true})
    } catch (error) {
        return res.status(HttpStatus.INTERNAL_SERVER_ERROR).send({success: false})
    }
} else {
    return res.status(HttpStatus.UNAUTHORIZED).send(responseBuilder(false, responseErrorCodes.INVALID_API_KEY, {}, req.t));
}
}
The problem is that before Sequelize manages to save the newly created order in the db (all of the n post requests reach the endpoint within a second or less), I already get another endpoint hit from the other n post requests while orderIdExists still equals null, so it ends up creating more identical orders. One (not so good) solution is to make orderId unique in the db, which prevents the creation of an order with the same orderId, but the insert is still attempted, which results in skipped auto-increment IDs in the db. Any idea would be greatly appreciated.
P.S. As you can see, I tried adding a 'sleep' function, to no avail.
Your database is failing to complete its save operation before the next request arrives. The problem is similar to the Dogpile Effect or a "cache slam".
This requires some more thinking about how you are framing the problem: in other words the "solution" will be more philosophical and perhaps have less to do with code, so your results on StackOverflow may vary.
The "sleep" solution is no solution at all: there's no guarantee how long the database operation might take or how long you might wait before another duplicate request arrives. As a rule of thumb, any time "sleep" is deployed as a "solution" to problems of concurrency, it usually is the wrong choice.
Let me posit two possible ways of dealing with this:
Option 1: write-only: i.e. don't try to "solve" this by reading from the database before you write to it. Just keep the pipeline leading into the database as dumb as possible and keep writing. E.g. consider a "logging" table that just stores whatever the webhook throws at it -- don't try to read from it, just keep inserting (or upserting). If you get 100 ping-backs about a specific order, so be it: your table would log it all and if you end up with 100 rows for a single orderId, let some other downstream process worry about what to do with all that duplicated data. Presumably, Sequelize is smart enough (and your database supports whatever process locking) to queue up the operations and deal with write repetitions.
An upsert operation here would be helpful if you do want to have a unique constraint on the orderId (this seems sensible, but you may be aware of other considerations in your particular setup).
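As a concrete sketch of option 1 with Sequelize, findOrCreate leans on the database to decide who wins the race. This assumes a unique constraint on orderId and reuses the field names from the question:

// With a unique constraint on orderId, the database, not the app code,
// guarantees that concurrent requests cannot both insert.
const [order, created] = await Order.findOrCreate({
    where: { orderId },
    defaults: { userId, status: PENDING, price },
});
if (!created) {
    // A concurrent request already saved this order; nothing more to do.
}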
Option 2: use a queue. This is decidedly more complex, so weigh carefully whether or not your use-case justifies the extra work. Instead of writing data immediately to the database, throw the webhook data into a queue (e.g. a first-in-first-out FIFO queue). Ideally, you would want to choose a queue that supports de-duplication so that messages coming out of the queue are guaranteed to be unique, but that implies state, and that usually relies on a database of some sort, which is sort of the problem to begin with.
The most important thing a queue would do for you is it would serialize the messages so you can deal with them one at a time (instead of multiple database operations kicking off concurrently). You can upsert data into the database when you read a message out of the queue. If the webhook keeps firing and more messages enter the queue, that's fine because the queue forces them all to line up single-file and you can handle each insertion one at a time. You'll know that each database operation has completed before it moves on to the next message so you never "slam" the DB. In other words, putting a queue in front of the database will allow it to handle data when the database is ready instead of whenever a webhook comes calling.
The idea of a queue here is similar to what a semaphore accomplishes. Note that your database interface may already implement a kind of queue/pool under-the-hood, so weigh this option carefully: don't reinvent a wheel.
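To illustrate the serialization idea in its smallest form: within a single process, a promise chain can act as a FIFO queue, since each job only starts after the previous one settles. A multi-instance deployment would need an external queue instead:

// Each job runs only after the prior one has settled, so database
// writes never overlap within this process.
let tail = Promise.resolve();
function enqueue(job) {
    tail = tail.then(job, job); // run after the previous job, success or failure
    return tail;
}

// Usage in the webhook handler; saveOrderIfMissing is a hypothetical upsert:
// enqueue(() => saveOrderIfMissing(payload));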
Hope those ideas are useful.
You saved my time, #Everett and #april-henig. I found that saving directly into the database led to duplicate records. Storing records in an object and dealing with one record at a time helped me a lot.
Maybe I should share my solution; perhaps some may find it useful in future.
Create an empty object to save successful requests:
export const queueAllSuccessCallBack = {};
Save the POST request in the object:
if (status === 'success') { // I only handle requests that succeeded
    const findKeyTransaction = queueAllSuccessCallBack[client_reference_id];
    if (!findKeyTransaction) { // check the id has not been added already, to avoid duplicates
        queueAllSuccessCallBack[client_reference_id] = {
            transFound,
            body,
        }; // save the new request id as the key and whatever data you want as the value
    }
}
Access the object to save into the database:
const keys = Object.keys(queueAllSuccessCallBack);
keys.forEach(async (key) => {
    ...
    // Do extra checks if you want to do so
    // Or save directly into the database
});

When should I split a task into tinier asynchronous tasks?

I'm writing a personal project in Node and I'm trying to figure out when a task should be split asynchronously. Let's say I have this "4-Step-Task". The steps are not very expensive (the most expensive is the one that iterates over an array of objects, trying to match a URL with a RegExp, and the array probably won't have more than 20 or 30 objects).
part1().then(y => {
    doTheSecondPart
}).then(z => {
    doTheThirdPart
}).then(c => {
    doTheFourthPart
});
The other way would be just executing one part after another, but then nothing else will progress until the whole task is done. With the above approach, other tasks can progress at least a little bit between each part.
Is there any criterion for when this approach should be preferred over a classic synchronous one?
Sorry for my bad English; it's not my native language.
All you've described is synchronous code that isn't very long to run. First off, there's no reason to even use promises for that type of code. Secondly, there's no reason to break it up into chunks. All you would be doing with either of those choices is making the code more complicated to write, more complicated to test and more complicated to understand and it would also run slower. All of those are undesirable.
If you force even synchronous code into a promise, then a .then() handler will give some other code a chance to run between .then() handlers, but only certain types of events can be run there because processing a resolved promise is one of the highest priority things to do in the event queue system. It won't, for example, allow another incoming http request arriving on your server to start to run.
If you truly wanted to allow other requests to run and so on, you would be better off just putting the code (without promises) into a WorkerThread and letting it run there and then communicate back the result via messaging. If you wanted to keep it in the main thread, but let any other code run, you'd probably have to use a short setTimeout() delay to truly let all possible other types of tasks run in between.
So, if this code doesn't take much time to run, there's just really no reason to mess with complicating it. Just let it run in the fastest, simplest way.
If you want more concrete advice, then please show some actual code and provide some timing information about how long it takes to run. Iterating through an array of 20-30 objects is nothing in the general scheme of things and is not a reason to rewrite it into timesliced pieces.
As for code that iterates over an array/list of items doing matching against some string, this is exactly what the Express web server framework does on every incoming URL to find the matching routes. That is not a slow thing to do in Javascript.
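For completeness, if a loop really were long enough to matter, the setTimeout() time-slicing mentioned above looks roughly like this sketch; processItem is a hypothetical stand-in for the per-element work:

// Yield to the event loop between chunks so timers and incoming requests
// get a chance to run. Only worth it for genuinely long-running loops.
function yieldToEventLoop() {
    return new Promise(resolve => setTimeout(resolve, 0));
}

async function processInChunks(items, chunkSize = 100) {
    for (let i = 0; i < items.length; i += chunkSize) {
        for (const item of items.slice(i, i + chunkSize)) {
            processItem(item); // hypothetical per-item work
        }
        await yieldToEventLoop();
    }
}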
Asynchronous programming is a better fit for code that must respond to events – for example, any kind of graphical UI. An example of a situation where programmers use async but shouldn't is any code that can focus entirely on data processing and can accept a “stop-the-world” block while waiting for data to download.
I use it extensively with a REST API server, as we have no idea how long a request can take for the server to respond. So, in order not to "block the app" while waiting for the server response, async requests are most useful.
part1().then(y => {
    doTheSecondPart
}).then(z => {
    doTheThirdPart
}).then(c => {
    doTheFourthPart
});
What you have described in your sample is much more of a synchronous, procedural process that would not necessarily allow your interface to keep working while your algorithm is busy with a process.
In the case of a server call, if you are still waiting for the server to respond, the algorithm using then is still using up resources and won't free your app up to run any other user interface events while it's waiting for the process to reach the next then statement.
You should use async/await in instances where you are waiting for a user event or for a server to respond, but do not want your app to hang while waiting for server data...
async function wait() {
    await new Promise(resolve => setTimeout(resolve, 2000));
    console.log("awaiting for server once !!")
    return 10;
}

async function wait2() {
    await new Promise(resolve => setTimeout(resolve, 3000));
    console.log("awaiting for server twice !!")
    return 10;
}

async function f() {
    let promise = new Promise((resolve, reject) => {
        setTimeout(() => resolve("done!"), 1000)
    });
    let result = await promise; // wait until the promise resolves (*)
    console.log(result); // "done!"
    let result2 = await wait();
    let result3 = await wait2();
}

f();
This sample should help you gain a basic understanding of how async/await works, and here are a few resources for researching it further:
Promises and Async
Mozilla References

Correct way to use MongoDB node.js `insertMany` so that it does not block

Within a node.js application, I wanted to use insertMany to insert a lot of documents (well, actually, around 10'000). I encountered the following issue: while insertMany (called with await) is running, the node.js process does not process anything else from the event loop until the insertMany call has finished.
Is this expected behaviour? How would I do this "the right way", so that my service would still process requests in the meantime? I would have expected the await insertMany to automatically enable this, as it's async, but it seems this is not the case.
Code snippet:
exports.writeOrg = async (req, res, next) => {
    logger.debug('orgs.writeOrg()');
    // ...
    try {
        // ...
        logger.debug('Starting processing of data.');
        const newOrgDocs = await processLdapUsers(tenantId, ldapUsers);
        logger.debug('Processing of data finished.');
        const orgModel = getOrgModel(tenantId);
        // Now delete the entire collection
        logger.debug(`Delete entire org collection in tenant ${tenantId}`);
        await orgModel.deleteMany({});
        // And add the new org information; this replaces what was there before
        logger.debug(`Inserting org structure to tenant ${tenantId}`);
        // This is the call which seems to block: --->
        await orgModel.insertMany(newOrgDocs);
        // <---
        logger.debug(`Finished inserting org structure to tenant ${tenantId}`);
        // ...
    } catch (err) {
        // ...
        // error handling
    }
}
The writeOrg function is a regular express request handler; the payload is a JSON array with typically 1000-20000 records; in the test case I have 6000 records with a total JSON size of around 6 MB. Writing locally takes just around 1.5s, writing to MongoDB Atlas (cheapest tier for testing) takes around 20 seconds, which is when this problem occurs.
Workaround: If I split up the data into smaller chunks, e.g. 50 records at a time, the event loop processes some data from time to time for other requests. But still, as the insertMany function is an async function call, I wasn't expecting this to be necessary.
There are multiple issues which make this rather slow, and the most important one is actually something which is not mentioned in the question: I am using Mongoose as an "ORM" wrapper for Mongo DB. I wasn't aware that this could have such a substantial impact on the runtime.
What happens here, after checking out the actual runtime with the Chrome node.js debugging tools, is that Mongoose wraps and validates each single document in the array, and this takes substantial time.
The BSON conversion also takes time, but the document wrapping is what takes up the most time.
This means: Mongoose is not super suitable for fast inserting (or reading, FWIW); if you need speed, going directly for the native Mongo DB driver is the way to go. If your needs for pure speed are not that big, and you want the convenience of Mongoose, Mongoose can add value by doing validations and adding defaults and things like that.
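If pure speed matters, two escape hatches are worth knowing about: Model.collection exposes the underlying native driver collection, which skips Mongoose's document wrapping entirely, and recent Mongoose versions accept a lean option on insertMany that skips hydrating each document (check the docs for your version). A sketch:

// Native driver insert: no Mongoose casting, validation, or hydration.
await orgModel.collection.insertMany(newOrgDocs);

// Or keep Mongoose but skip per-document hydration, if your version supports it:
await orgModel.insertMany(newOrgDocs, { lean: true });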

Avoiding race condition in node.js web app

I'm wondering what would be the way to design a web service like this:
Say I have a server listening for requests. It receives some key and checks if it's cached (for example using some DB); if it's not, it does some processing, generates the answer, stores it in the cache DB and returns the answer to the client.
This seems to work OK, but what happens if two clients request the same non-existent key? In that case a race condition would happen, and it would look like
client 1 -> check cache DB -> generate answer -> store in cache -> reply to client
client 2 -> check cache DB -> generate answer -> store in cache -> reply to client
One way to avoid this issue would be using a UNIQUE constraint in the DB, so whenever the second answer is generated and written to the DB, some error happens. This is fine but seems more like a patch than a real solution. Especially, imagine a case where generating the answer takes a lot of processing; then something else would be preferable.
One option I can think of is using job queues, so whenever a key is received, the key is either appended to an existing job, or a new job is added to the queue.
I've been playing with node.js for some weeks and I'm surprised that I haven't found examples showing this kind of use case. So I'm wondering if this is an acceptable solution for cases like this, or something better exists?
Here is how you can do that in a single-process setup:
var Emitter = require('events').EventEmitter;
var requests = Object.create(null);

function getSomething (key, callback) {
    var request = requests[key];
    if (!request) {
        request = requests[key] = new Emitter;
        getSomethingActually(key, function (err, result) {
            delete requests[key];
            if (err) return request.emit('error', err);
            request.emit('result', result);
        });
    }
    request.once('result', function (result) {
        callback(null, result);
    });
    request.once('error', function (err) {
        callback(err);
    });
}
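Usage then looks like any other callback API; concurrent callers for the same key share a single underlying computation:

// Both callbacks receive the result of one getSomethingActually() call.
getSomething('item42', function (err, result) {
    if (err) return console.error(err);
    console.log('first caller:', result);
});
getSomething('item42', function (err, result) {
    if (err) return console.error(err);
    console.log('second caller:', result);
});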
If you want to scale this, you need to use some external storage plus an event bus, like Redis.
You should be using job queues (or some other sort of offloading jobs) either way. Processing-intensive tasks should always be taken out of your main Node application (either by a queue, spawning it as a separate process, etc) or else it will block the event loop, thus blocking all other requests.
This being said, if you choose to use some sort of queue that can have a unique constraint, such as a Postgres-backed queue, and set a unique constraint on the key, duplicates will never be inserted into the work queue, so they will never be processed twice. You can simply ignore a unique constraint error in this case.
Note that it is still possible, though very unlikely, to have a sequence of events like this:
1. a request checks the 'cache' for key x, gets a miss
2. a worker completes the answer for key x, inserts it into the 'cache', removes x from the queue
3. the request that received a miss for key x adds it to the queue
4. a worker pulls key x from the queue, starts computation
After this (admittedly unlikely) sequence of events, the second worker would get an error inserting the key. In my opinion, this is an unlikely enough event that adding a unique key constraint and simply ignoring a unique constraint violation error on the second worker is a viable option.
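Ignoring the violation on that second worker might look like the following sketch, assuming Postgres (whose unique_violation error code is 23505) and a hypothetical insertIntoQueue helper:

try {
    await insertIntoQueue(key); // hypothetical insert into a table with UNIQUE(key)
} catch (err) {
    // 23505 = Postgres unique_violation: another worker already handled this key.
    if (err.code !== '23505') throw err;
}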
