Node.js worker_threads memory overflow(may be) - node.js

When i use worker_threads to handle a lot of complex logic that irrelevant with the main thread, i found that memory on the server very high.
Below is part of my simplified code.
main.js
const worker = new Worker(process.cwd() + "/worker.js")
// My business will repeat cycle infinitely, this code just an example
for (let i = 0; i < 1000000; i++) {
worker.postMessage(i)
}
woker.js
parentPort.on("message", async data => {
// a log of logic....
})
When I run this scripts on the server, the memory will keep increasing, Maybe the thread is not shut down?
I tried using "worker.removeAllListeners()", resourceLimits option and third party library "workerpool", but still didn't solve my problem.
what should i do or using other method to solve this problem? Thanks for answering me!

Related

Best way to trigger worker_thread OOM exception in Node.js

I would like to test code that reports when a worker_thread OOMs. Or at least that the parent is OK despite the thread crashing due to OOM reasons. I would like to specifically test Node.js killing a worker_thread.
I'm not even sure this is particularly testable since the environment in which Node.js is running seems to make a difference. Setting low old generation size (docs) using resource limits does not behave the way I thought it would, it seems Node.js and the OS are doing a lot of clever things to keep the process from blowing up in memory. The closest I have gotten is that both Node.js AND the worker are killed. But again, this is not the behaviour I was expecting - Node.js should be killing the worker.
Is there any way to reliably test for something like this? Currently running v16.13.2.
[EDIT]
Here is some sample JS for the worker:
const { isMainThread } = require('worker_threads');
if (!isMainThread) {
let as = [];
for (let i = 0; i < 100; i++) {
let a = '';
for (let i = 0; i < 9000000; i++) a += i.toString();
as.push(a);
}
}
OK, it looks like adding maxYoungGenerationSizeMb in my case is resulting in the behaviour I was looking for: e.g. { maxOldGenerationSizeMb: 10, maxYoungGenerationSizeMb: 10 }.

Does v8/Node actually garbage collect during function calls? - Or is this a sailsJS memory leak

I am creating a sailsJS webserver with a background task that needs to run continuously (if the server is idle). - This is a task to synchronize a database with some external data and pre-cache data to speed up requests.
I am using sails version 1.0. Tthe adapter is postgresql (adapter: 'sails-postgresql'), adapter version: 1.0.0-12
Now while running this application I noticed a major problem: it seems that after some time the application inexplicably crashes with an out of heap memory error. (I can't even catch this, the node process just quits).
While I tried to hunt for a memory leak I tried many different approaches, and ultimately I can reduce my code to the following function:
async DoRun(runCount=0, maxCount=undefined) {
while (maxCount === undefined || runCount < maxCount) {
this.count += 1;
runCount += 1;
console.log(`total run count: ${this.count}`);
let taskList;
try {
this.active = true;
taskList = await Task.find({}).populate('relatedTasks').populate('notBefore');
//taskList = await this.makeload();
} catch (err) {
console.error(err);
this.active = false;
return;
}
}
}
To make it "testable" I reduced the heap size allowed to be used by the application: --max-old-space-size=100; With this heapsize it always crashes about around 2000 runs. However even with an "unlimited" heap it crashes after a few (ten)thousand runs.
Now to further test this I commented out the Task.find() command and implimented a dummy that creates the "same" result".
async makeload() {
const promise = new Promise(resolve => {
setTimeout(resolve, 10, this);
});
await promise;
const ret = [];
for (let i = 0; i < 10000; i++) {
ret.push({
relatedTasks: [],
notBefore: [],
id: 1,
orderId: 1,
queueStatus: 'new',
jobType: 'test',
result: 'success',
argData: 'test',
detail: 'blah',
lastActive: new Date(),
updatedAt: Date.now(),
priority: 2 });
}
return ret;
}
This runs (so far) good even after 20000 calls, with 90 MB of heap allocated.
What am I doing wrong in the first case? This let me to believe that sails is having a memory leak? Or is node unable to free the database connections somehow?
I can't seem to see anything that is blatantly "leaking" here? As I can see in the log this.count is not a string so it's not even leaking there (same for runCount).
How can I progress from this point?
EDIT
Some further clarifications/summary:
I run on node 8.9.0
Sails version 1.0
using sails-postgresql adapter (1.0.0-12) (beta version as other version doesn't work with sails 1.0)
I run with the flag: --max-old-space-size=100
Environment variable: node_env=production
It crashes after approx 2000-2500 runs when in production environment (500 when in debug mode).
I've created a github repository containing a workable example of the code;
here. Once again to see the code at any point "soon" set the flag --max-old-space-size=80 (Or something alike)
I don't know anything about sailsJS, but I can answer the first half of the question in the title:
Does V8/Node actually garbage collect during function calls?
Yes, absolutely. The details are complicated (most garbage collection work is done in small incremental chunks, and as much as possible in the background) and keep changing as the garbage collector is improved. One of the fundamental principles is that allocations trigger chunks of GC work.
The garbage collector does not care about function calls or the event loop.

Node.js Spawning multiple threads within a class method

How can I run a single method multiple times multi-threaded when called as a method of a class?
At first I tried to use the cluster module, but I realize it just re-runs the whole process from the start, rightfully so.
How can I achieve something like what's outlined below?
I want a class's method to spawn n processes, and when the parallel tasks are completed, I can resolve a promise which the method returns.
The problem with the code below is that calling cluster.fork() will fork index.js process.
index.js
const Person = require('./Person.js');
var Mary = new Person('Mary');
Mary.run(5).then(() => {...});
console.log('I should only run once, but I am called 5 times too many');
Person.js
const cluster = require('cluster');
class Person{
run(distance){
var completed = 0;
return new Promise((resolve, reject) => {
for(var i = 0; i < distance; i++) {
// run a separate process for each
cluster.fork().send(i).on('message', message => {
if (message === 'completed') { ++completed; }
if (completed === distance) { resolve(); }
});
}
});
}
}
I think the short answer is impossible. It's even worse - this has nothing to do with js. To multi (process or thread) in your particular problem you will essentially need a copy of the object in every thread, since it needs (maybe) access to fields - in this case you would need to either initialize it in every thread or share memory. That last one I don't think is provided in cluster, and not trivial in other languages in every use case.
If the calculation is independent of the Person I suggest you extract it, and use the usual (in index.js):
if(cluster.isWorker) {
//Use the i for calculation
} else {
//Create Person, then fork children in for loop
}
You then collect the results and change the Person as needed. You will be copying index.js, but this is standard and you only run what you need.
The problem is if results are dependent on Person. If these are constant for all i you can still send them to your forks independently. Otherwise what you have is the only way to fork. In general forking in cluster is not meant for methods, but for the app itself, which is the standard forking behavior.
Another solution
Following your comment, I suggest you checkout child_process.execFile or child_process.exec on same file.
This way you can spawn a totally independent process on the fly. Now instead of calling cluster.fork you call execFile. You can use either the exit code or stdout as return values (stderr etc.). Promise is now replaced with:
var results = []
for(var i = 0; i < distance; i++) {
// run a separate process for each
results.push(child_process.execFile().child.execFile('node', 'mymethod.js`,i]));
}
//... catch the exit event from all results or return a callback using results.
Inside mymethod.js Have your code that takes i and returns what you want either in the exit code or through stdout, both properties of the returned child_process. This is a bit un-node.js-y since you're waiting on asynchronous calls, but you're requirements are non standard. Since I'm not sure how you use this perhaps returning a callback with the array is a better idea.

Node.js Cluster: Managing Workers

we're diving deeper in Node.js architecture, to achieve fully understanding, how to scale our application.
Clear solution is cluster usage https://nodejs.org/api/cluster.html. Everything seems to be fine, apart of workers management description:
Node.js does not automatically manage the number of workers for you, however. It is your responsibility to manage the worker pool for your application's needs.
I was searching, how to really manage the workers, but most solutions, says:
Start so many workers as you've got cores.
But I would like to dynamically scale up or down my workers count, depending on current load on server. So if there is load on server and queue is getting longer, I would like to start next worker. In another way, when there isn't so much load, I would like to shut down workers (and leave f.e. minimum 2 of them).
The ideal place, will be for me Master Process queue, and event when new Request is coming to Master Process. On this place we can decide if we need next worker.
Do you have any solution or experience with managing workers from Master Thread in Cluster? Starting and killing them dynamically?
Regards,
Radek
following code will help you to understand to create cluster on request basis.
this program will genrate new cluster in every 10 request.
Note: you need to open http://localhost:8000/ and refresh the page for increasing request.
var cluster = require('cluster');
var http = require('http');
var numCPUs = require('os').cpus().length;
var numReqs = 0;
var initialRequest = 10;
var maxcluster = 10;
var totalcluster = 2;
if (cluster.isMaster) {
// Fork workers.
for (var i = 0; i < 2; i++) {
var worker = cluster.fork();
console.log('cluster master');
worker.on('message', function(msg) {
if (msg.cmd && msg.cmd == 'notifyRequest') {
numReqs++;
}
});
}
setInterval(function() {
console.log("numReqs =", numReqs);
isNeedWorker(numReqs) && cluster.fork();
}, 1000);
} else {
console.log('cluster one initilize');
// Worker processes have a http server.
http.Server(function(req, res) {
res.writeHead(200);
res.end("hello world\n");
// Send message to master process
process.send({ cmd: 'notifyRequest' });
}).listen(8000);
}
function isNeedWorker(numReqs) {
if( numReqs >= initialRequest && totalcluster < numCPUs ) {
initialRequest = initialRequest + 10;
totalcluster = totalcluster + 1;
return true;
} else {
return false;
}
}
To manually manage your workers, you need a messaging layer to facilitate inter process communication. With IPC master and worker can communicate effectively, by default and architecture stand point this behavior is already implemented in the process module native. However i find the native implementation not flexible or robust enough to handle horizontal scaling due to network requests.
One obvious solution Redis as a message broker to facilitate this method of master and slave communication. However this solution also as its faults , which is context latency, directly linked to command and reply.
Further research led me to RabbitMQ,great fit for distributing time-consuming tasks among multiple workers.The main idea behind Work Queues (aka: Task Queues) is to avoid doing a resource-intensive task immediately and having to wait for it to complete. Instead we schedule the task to be done later. We encapsulate a task as a message and send it to the queue. A worker process running in the background will pop the tasks and eventually execute the job. When you run many workers the tasks will be shared between them.
To implement a robust server , read this link , it may give some insights. Link

Node.js Synchronous Library Code Blocking Async Execution

Suppose you've got a 3rd-party library that's got a synchronous API. Naturally, attempting to use it in an async fashion yields undesirable results in the sense that you get blocked when trying to do multiple things in "parallel".
Are there any common patterns that allow us to use such libraries in an async fashion?
Consider the following example (using the async library from NPM for brevity):
var async = require('async');
function ts() {
return new Date().getTime();
}
var startTs = ts();
process.on('exit', function() {
console.log('Total Time: ~' + (ts() - startTs) + ' ms');
});
// This is a dummy function that simulates some 3rd-party synchronous code.
function vendorSyncCode() {
var future = ts() + 50; // ~50 ms in the future.
while(ts() <= future) {} // Spin to simulate blocking work.
}
// My code that handles the workload and uses `vendorSyncCode`.
function myTaskRunner(task, callback) {
// Do async stuff with `task`...
vendorSyncCode(task);
// Do more async stuff...
callback();
}
// Dummy workload.
var work = (function() {
var result = [];
for(var i = 0; i < 100; ++i) result.push(i);
return result;
})();
// Problem:
// -------
// The following two calls will take roughly the same amount of time to complete.
// In this case, ~6 seconds each.
async.each(work, myTaskRunner, function(err) {});
async.eachLimit(work, 10, myTaskRunner, function(err) {});
// Desired:
// --------
// The latter call with 10 "workers" should complete roughly an order of magnitude
// faster than the former.
Are fork/join or spawning worker processes manually my only options?
Yes, it is your only option.
If you need to use 50ms of cpu time to do something, and need to do it 10 times, then you'll need 500ms of cpu time to do it. If you want it to be done in less than 500ms of wall clock time, you need to use more cpus. That means multiple node instances (or a C++ addon that pushes the work out onto the thread pool). How to get multiple instances depends on your app strucuture, a child that you feed the work to using child_process.send() is one way, running multiple servers with cluster is another. Breaking up your server is another way. Say its an image store application, and mostly is fast to process requests, unless someone asks to convert an image into another format and that's cpu intensive. You could push the image processing portion into a different app, and access it through a REST API, leaving the main app server responsive.
If you aren't concerned that it takes 50ms of cpu to do the request, but instead you are concerned that you can't interleave handling of other requests with the processing of the cpu intensive request, then you could break the work up into small chunks, and schedule the next chunk with setInterval(). That's usually a horrid hack, though. Better to restructure the app.

Resources