When should I use worker-threads? - node.js

I am currently working on a NestJS backend that provides REST endpoints for my frontend. In some endpoints I receive, e.g., an array of elements that I need to process.
Concrete Example:
I receive an array of 50 elements. For each element I need to make an SQL request, so I have to loop over the array and do the work in SQL.
I always ask myself: at what number of elements should I use, for example, worker threads so that I don't block the event loop?
Maybe I have misunderstood the blocking of the event loop and someone can enlighten me.

I don't think that you'll need worker threads in this scenario. As long as the SQL queries are executed asynchronously, i.e. the query calls do not block, you will be fine. You can use Promise.all to speed up the processing of the loop, as the queries will be executed in parallel, e.g.
const dbQueryPromises = [];
for (const entry of data) {
  dbQueryPromises.push(dbConnection.query(buildQuery(entry)));
}
await Promise.all(dbQueryPromises);
If, however, your code performs computation-heavy operations inside the loop, then you should consider worker threads, as long-running operations on your call stack will block the event loop.
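For reference, here is a minimal sketch of how offloading might look with the built-in worker_threads module (heavyComputation.js is a hypothetical worker script that sends its result back with parentPort.postMessage):
const { Worker } = require('worker_threads');

function runInWorker(payload) {
  return new Promise((resolve, reject) => {
    // The worker receives its input via workerData and runs on a separate
    // thread, so the event loop stays free while it crunches numbers.
    const worker = new Worker('./heavyComputation.js', { workerData: payload });
    worker.on('message', resolve);
    worker.on('error', reject);
    worker.on('exit', (code) => {
      if (code !== 0) reject(new Error(`Worker stopped with exit code ${code}`));
    });
  });
}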

Only use them if you need to do CPU-intensive tasks on large amounts of data. With a SharedArrayBuffer they also let you avoid the serialization step for the data. 50 elements is not enough, I believe.
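To illustrate that serialization point, a hedged sketch (process.js is a hypothetical worker script): memory backed by a SharedArrayBuffer is shared with the worker by reference, while plain objects are structured-cloned, i.e. serialized, for each worker.
const { Worker } = require('worker_threads');

// Both threads see the same underlying memory; the buffer is not
// copied or serialized when it crosses the thread boundary.
const sab = new SharedArrayBuffer(1024 * Float64Array.BYTES_PER_ELEMENT);
const shared = new Float64Array(sab);
const worker = new Worker('./process.js', { workerData: sab });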

Related

NodeJS: Reading and writing to a shared object from multiple promises

I'm trying to find out if there could be an issue when accessing an object from multiple promises, e.g.:
let obj = {test: 0}
let promisesArr = []
for (let n = 0; n < 10; n++) {
  promisesArr.push(promiseFunc(obj))
}
Promise.all(promisesArr)
// Then the promise would be something like this
function promiseFunc(obj) {
  return new Promise(async (resolve, reject) => {
    // read from the shared object
    let read = obj.test
    // write/modify the shared object
    obj.test++
    // Do some async op with the read data
    await asyncFunc(read)
    // resolves and gets called again later
    resolve()
  })
}
From what I can see and have tested, there would not be an issue: even though processing is asynchronous, there seems to be no race condition. But maybe I'm missing something.
The only issue I can see is writing to the object, then doing some I/O operation, and then reading while expecting what was written before to still be there.
I'm not modifying the object after other async operations, only at the start, but there are several promises doing the same. Once they resolve they get called again and the cycle starts over.
Race conditions in Javascript with multiple asynchronous operations depend entirely upon the application logic of what exactly you're doing.
Since you don't show any real code in that regard, we can't really say whether you have a race condition liability here or not.
There is no generic answer to your question.
Yes, there can be race conditions among multiple asynchronous operations accessing the same data.
OR
The code can be written appropriately such that no race condition occurs.
It all depends upon the exact, real code. I can show you code with two promise-based asynchronous operations that absolutely causes a race condition, and I can show you code with two promise-based asynchronous operations that does not. Susceptibility to race conditions depends on precisely what the code is doing and how it is written.
Pure and simple access to a value in a shared object does not by itself cause a race condition because the main thread in Javascript is single threaded and non-interruptible so any single synchronous Javascript statement is thread-safe by itself. What matters is what you do with that data and how that code is written.
Here's an example of something that is susceptible to race conditions if there are other operations that can also change shareObj.someProperty:
let localValue = shareObj.someProperty;
let data = await doSomeAsyncOperation();
shareObj.someProperty = localValue + data.someProperty;
Whereas this does not cause a race condition:
let data = await doSomeAsyncOperation();
shareObj.someProperty += data.someProperty;
The second does not cause its own race condition because it updates the shared data atomically, whereas the first gets the value, stores it locally, then waits for an asynchronous operation to complete, which is an opportunity for other code to modify the shared variable without this local function knowing about it.
FYI, this is very similar to classic database issues. If you get a value from a database (which is always an asynchronous operation in nodejs), then increment it, then write the value back, that's subject to race conditions because others can be trying to read/modify the value at the same time and you can stomp on each other's changes. So, instead, you have to use an operation built into the database that updates the value atomically.
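For example, a hedged sketch of the database case, assuming a node-postgres-style client called db and a hypothetical counters table (inside an async function):
// Race-prone: read the value, compute in JS across an await, write it back.
// Another client can update the row between the two queries.
const { rows } = await db.query('SELECT value FROM counters WHERE id = $1', [id]);
await db.query('UPDATE counters SET value = $1 WHERE id = $2', [rows[0].value + 1, id]);

// Safe: let the database perform the increment as one atomic statement.
await db.query('UPDATE counters SET value = value + 1 WHERE id = $1', [id]);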
For your own variables in Javascript, the problem is a little bit simpler than the generic database issue because reading and writing a Javascript variable is atomic. You just have to make sure you don't hold onto a value that you will later modify and write back across an asynchronous operation.

In Node.js, how to limit simultaneous access to part of the code?

There is one part of my code that I want to run only once at a time, with no parallel execution.
I found semaphore:
var sem = require('semaphore')(1);
sem.take(function(){
  //xxxxxxxx
  sem.leave()
})
If I do this, it seems the code between sem.take and sem.leave will only be executed once at a time. But is there another way to do this? Because as a Node user, I am not very happy to touch this sort of "multithread" thing.
My problem is the following code, which I want to run only once at a time. The function can take up to 10 seconds because I am waiting for the response of run_cmd (basically a wrapped child.exec).
But if I refresh the page, two processes will run simultaneously, and the stdout from the previous process can affect the coming process.
How to solve this?
on('connection', async function(){
  let res = await run_cmd('xxxx')
  res = parse_first_res(res)
  res = await run_cmd('zzz' + res)
  res = parse_second_res('xxyy' + res)
  // ...
})
Node.js is essentially single-threaded, so as long as the code is synchronous, no parallel processing can happen and you are safe.
However if the code is not synchronous (e.g. you do some database calls) then you need some sort of locking/semaphore. There's no escape.
Also note that locks/semaphores do not guarantee that a function will be called only once; they only guarantee that no parallel processing will happen. To ensure that the function is called only once, you need an additional flag and an if statement.
Because as a Node user, I am not very happy to touch this sort of "multithread" thing.
This is not about multi- versus single-threading; it is about parallel processing. It doesn't matter whether you are a Node user or not: you will have to deal with race conditions at some point.
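If you'd rather not pull in the semaphore package, a minimal promise-chain lock can serialize the critical section. This is only a sketch, reusing the question's run_cmd and parse_first_res names:
let last = Promise.resolve();

function withLock(criticalSection) {
  // Each caller chains onto the previous one, so critical sections run
  // strictly one after another even when connections arrive concurrently.
  const run = last.then(() => criticalSection());
  last = run.catch(() => {}); // keep the chain alive after failures
  return run;
}

// Usage: a second page refresh now waits for the first run to finish.
withLock(async () => {
  let res = await run_cmd('xxxx');
  res = parse_first_res(res);
  await run_cmd('zzz' + res);
});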

Concurrency between Meteor.setTimeout and Meteor.methods

In my Meteor application, to implement a turn-based multiplayer game server, the clients receive the game state via publish/subscribe and can call a Meteor method sendTurn to send turn data to the server (they cannot update the game state collection directly).
var endRound = function(gameRound) {
  // check if gameRound has already ended /
  // if round results have already been determined
  // --> yes: do nothing
  // --> no:
  //   determine round results
  //   update collection
  //   create next gameRound
};
Meteor.methods({
  sendTurn: function(turnParams) {
    // find gameRound data
    // validate turnParams against gameRound
    // store turn (update "gameRound" collection object)
    // have all clients sent in turns for this round?
    //   yes --> call "endRound"
    //   no --> wait for other clients to send turns
  }
});
To implement a time limit, I want to wait for a certain time period (to give clients time to call sendTurn), and then determine the round result - but only if the round result has not already been determined in sendTurn.
How should I implement this time limit on the server?
My naive approach to implement this would be to call Meteor.setTimeout(endRound, <roundTimeLimit>).
Questions:
What about concurrency? I assume I should update collections synchronously (without callbacks) in sendTurn and endRound, but would this be enough to eliminate race conditions? (Reading the 4th comment on the accepted answer to this SO question, about synchronous database operations also yielding, I doubt it.)
In that regard, what does "per request" mean in the Meteor docs in my context (the function endRound called by a client method call and/or in server setTimeout)?
"In Meteor, your server code runs in a single thread per request, not in the asynchronous callback style typical of Node."
In a multi-server / clustered environment, (how) would this work?
Great question, and it's trickier than it looks. First off I'd like to point out that I've implemented a solution to this exact problem in the following repos:
https://github.com/ldworkin/meteor-prisoners-dilemma
https://github.com/HarvardEconCS/turkserver-meteor
To summarize, the problem basically has the following properties:
Each client sends in some action on each round (you call this sendTurn)
When all clients have sent in their actions, run endRound
Each round has a timer that, if it expires, automatically runs endRound anyway
endRound must execute exactly once per round regardless of what clients do
Now, consider the properties of Meteor that we have to deal with:
Each client can have exactly one outstanding method call to the server at a time (unless this.unblock() is called inside a method); subsequent method calls wait for the first.
All timeout and database operations on the server can yield to other fibers
This means that whenever a method call goes through a yielding operation, values in Node or the database can change. This can lead to the following potential race conditions (these are just the ones I've fixed, but there may be others):
In a 2-player game, for example, two clients call sendTurn at exactly the same time. Both call a yielding operation to store the turn data. Both methods then check whether 2 players have sent in their turns, find the affirmative, and endRound gets run twice.
A player calls sendTurn right as the round times out. In that case, endRound is called by both the timeout and the player's method, resulting in it running twice again.
Incorrect fixes to the above problems can result in starvation where endRound never gets called.
You can approach this problem in several ways, either synchronizing in Node or in the database.
Since only one Fiber can actually change values in Node at a time, if you don't call a yielding operation you are guaranteed to avoid possible race conditions. So you can cache things like the turn states in memory instead of in the database. However, this requires that the caching is done correctly and doesn't carry over to clustered environments.
Move the endRound code outside of the method call itself, using something else to trigger it. This is the approach I've taken which ensures that only the timer or the final player triggers the end of the round, not both (see here for an implementation using observeChanges).
In a clustered environment you will have to synchronize using only the database, probably with conditional update operations and atomic operators. Something like the following:
var currentVal;
while (true) {
  currentVal = Foo.findOne(id).val; // yields
  if (Foo.update({_id: id, val: currentVal}, {$inc: {val: 1}}) > 0) {
    // Operation went as expected
    // (your code here, e.g. endRound)
    break;
  } else {
    // Race condition detected, try again
  }
}
The above approach is primitive and probably results in bad database performance under high loads; it also doesn't handle timers, but I'm sure with some thinking you can figure out how to extend it to work better.
You may also want to see this timers code for some other ideas. I'm going to extend it to the full setting that you described once I have some time.
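For the second approach above, here is a hedged sketch of what a single server-side trigger might look like. GameRounds, turnCount, and playerCount are hypothetical names; the real implementations live in the repos linked above:
// One observer on the server decides when the round ends, so method calls
// only record turns and never race each other to run endRound.
GameRounds.find({ ended: false }).observeChanges({
  changed: function (id, fields) {
    if (fields.turnCount >= playerCount) {
      endRound(id); // endRound should still guard itself, e.g. with a conditional update
    }
  }
});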

Patterns for asynchronous but sequential requests

I have been writing a lot of NodeJS recently, and that has forced me to attack some problems from a different perspective. I was wondering what patterns have developed for the problem of processing chunks of data sequentially (rather than in parallel) in an asynchronous request environment, but I haven't been able to find anything directly relevant.
So to summarize the problem:
I have a list of data stored in an array format that I need to process.
I have to send this data to a service asynchronously, but the service will only accept a few at a time.
The data must be processed sequentially to meet the restrictions on the service, meaning making a number of parallel asynchronous requests is not allowed
Working in this domain, the simplest pattern I've come up with is a recursive one. Something like
function processData(data, start, step, callback){
  if (start < data.length){
    var chunk = data.slice(start, start + step);
    queryService(chunk, start, step, function(e, d){
      // Assume no errors
      // Could possibly do some matching between d and 'data' here to
      // update data with anything that the service may have returned
      processData(data, start + step, step, callback);
    });
  } else {
    callback(data);
  }
}
Conceptually, this should step through each chunk, but it feels unintuitive and complex. I feel like there should be a simpler way of doing this. Does anyone have a pattern they tend to follow when approaching this kind of problem?
My first thought would be to rely on object encapsulation: create an object that contains all of the information about what needs to be processed, along with the relevant data about what has been and is being processed. The callback function then just calls a 'next' method on the object, which in turn starts processing the next piece of data and updates the object's state. Essentially it works like an asynchronous for-loop.
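In modern Node this pattern collapses into a plain loop. A hedged sketch with async/await, assuming queryService takes an error-first callback and can be promisified with util.promisify:
const { promisify } = require('util');
const queryServiceAsync = promisify(queryService);

async function processData(data, step) {
  for (let start = 0; start < data.length; start += step) {
    const chunk = data.slice(start, start + step);
    // Each chunk completes before the next request is made,
    // satisfying the service's sequential restriction.
    await queryServiceAsync(chunk, start, step);
  }
  return data;
}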

GPars report status on large number of async functions and wait for completion

I have a parser, and after gathering the data for a row, I want to fire an async function and let it process the row while the main thread continues on and gets the next row.
I've seen this post: How do I execute two tasks simultaneously and wait for the results in Groovy? but I'm not sure it is the best solution for my situation.
What I want to do is, after all the rows are read, wait for all the async functions to finish before I go on. One concern with using a collection of Promises is that the list could be large (100,000+).
Also, I want to report status as we go. And finally, I'm not sure I want to automatically wait for a timeout (like on a get()), because the file could be huge; however, I do want to allow the user to kill the process for various reasons.
So what I've done for now is record the number of rows parsed (as they occur via rowsRead), then use a callback from the Promise to record another row being finished processing, like this:
def promise = processRow(row)
promise.whenBound {
  rowsProcessed.incrementAndGet()
}
Where rowsProcessed is an AtomicInteger.
Then in the code invoked at the end of the sheet, after all parsing is done and I'm waiting for the processing to finish, I'm doing this:
boolean test = true
while (test) {
  Thread.sleep(1000) // No need to pound the CPU with this check
  println "read: ${sheet.rowsRead}, processed: ${sheet.rowsProcessed.get()}"
  if (sheet.rowsProcessed.get() == sheet.rowsRead) {
    test = false
  }
}
The nice thing is, I don't have an explosion of Promise objects here, just a simple count to check. But I'm not sure sleeping every so often is as efficient as checking the get() on each Promise object.
So, my questions are:
If I used the collection of Promises instead, would a get() react and return if the thread executing the while loop above was interrupted with Thread.interrupt()?
Would using the collection of Promises and calling get() on each be more efficient than trying to sleep and check every so often?
Is there another, better approach that I haven't considered?
Thanks!
A call to allPromises*.get() will throw an InterruptedException if the waiting (main) thread gets interrupted.
Yes; since the promises have been created anyway, grouping them in a list should not impose additional memory requirements, in my opinion.
The suggested solutions with a CountDownLatch or a Phaser are IMO much more suitable than busy waiting.
An alternative to an AtomicInteger is to use a CountDownLatch. It avoids both the sleep and the large collection of Promise objects. You could use it like this:
latch = new CountDownLatch(sheet.rowsRead)
...
def promise = processRow(row)
promise.whenBound {
  latch.countDown()
}
...
while (!latch.await(1, TimeUnit.SECONDS)) {
  println "read: ${sheet.rowsRead}, processed: ${sheet.rowsRead - latch.count}"
}
