macOS Catalina freezing and crashing after running a Node.js load-test script

I wrote up a simple load-testing script that runs N hits to an HTTP endpoint over M async parallel lanes. Each lane waits for its previous request to finish before starting a new one. For my specific use case, the script randomly picks a numeric "width" parameter to add to the URL each time. The endpoint returns between 200 KB and 900 KB of image data on each request, depending on the width parameter, but my script does not care about this data and simply relies on garbage collection to clean it up.
const fetch = require('node-fetch');

const MIN_WIDTH = 200;
const MAX_WIDTH = 1600;
const loadTestUrl = `
http://load-testing-server.com/endpoint?width={width}
`.trim();

async function fetchAll(url) {
  const res = await fetch(url, {
    method: 'GET'
  });
  if (!res.ok) {
    throw new Error(res.statusText);
  }
}

async function doSingleRun(runs, id) {
  const runStart = Date.now();
  console.log(`(id = ${id}) - Running ${runs} times...`);
  for (let i = 0; i < runs; i++) {
    const start = Date.now();
    const width = Math.floor(Math.random() * (MAX_WIDTH - MIN_WIDTH)) + MIN_WIDTH;
    try {
      const result = await fetchAll(loadTestUrl.replace('{width}', `${width}`));
      const duration = Date.now() - start;
      console.log(`(id = ${id}) - Width ${width} Success. ${i + 1}/${runs}. Duration: ${duration}`)
    } catch (e) {
      const duration = Date.now() - start;
      console.log(`(id = ${id}) - Width ${width} Error fetching. ${i + 1}/${runs}. Duration: ${duration}`, e)
    }
  }
  console.log(`(id = ${id}) - Finished run. Duration: ` + (Date.now() - runStart));
}

(async function () {
  const RUNS = 200;
  const parallelRuns = 10;
  const promises = [];
  const parallelRunStart = Date.now();
  console.log(`Running ${parallelRuns} parallel runs`)
  for (let i = 0; i < parallelRuns; i++) {
    promises.push(doSingleRun(RUNS, i))
  }
  await Promise.all(promises);
  console.log(`Finished parallel runs. Duration ${Date.now() - parallelRunStart}`)
})();
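(For reference: as written, fetchAll never reads the response body; a variant that explicitly drains and discards it, assuming node-fetch v2, would look like the sketch below. This is illustrative only and not part of the original script.)
// Sketch (node-fetch v2): explicitly consume the response body so the
// image data is read off the socket and discarded immediately.
async function fetchAndDrain(url) {
  const res = await fetch(url, { method: 'GET' });
  if (!res.ok) {
    throw new Error(res.statusText);
  }
  await res.buffer(); // read the body and let it go out of scope
}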
When I run this in Node 14.17.3 on my MacBook Pro running macOS 10.15.7 (Catalina), with even a modest parallel-lane count of 3, after about 120 (x 3) hits of the endpoint the following happens in succession:
1. Console output ceases in the terminal for the script, indicating the script has halted.
2. Other applications such as my browser are unable to make network connections.
3. Within 1-2 minutes, other applications on my machine begin to slow down and eventually freeze up.
4. My entire system crashes with a kernel panic and the machine reboots.
panic(cpu 2 caller 0xffffff7f91ba1ad5): userspace watchdog timeout: remoted connection watchdog expired, no updates from remoted monitoring thread in 60 seconds, 30 checkins from thread since monitoring enabled 640 seconds ago after load
service: com.apple.logd, total successful checkins since load (642 seconds ago): 64, last successful checkin: 10 seconds ago
service: com.apple.WindowServer, total successful checkins since load (610 seconds ago): 60, last successful checkin: 10 seconds ago
I can very easily stop the progression of these symptoms by pressing Ctrl+C in the script's terminal and force-quitting it. Everything quickly gets back to normal, and I can repeat the experiment multiple times before letting it crash my machine.
I've monitored Activity Monitor during the progression: there is very little (~1%) CPU usage and memory usage reaches maybe 60-70 MB, though it is pretty evident that network activity peaks during the script's run.
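(As an aside, the memory figure can also be checked from inside the script itself; the snippet below is purely illustrative and uses the standard process.memoryUsage() API, with an arbitrary 5-second interval.)
// Sketch: log the process's resident set size every 5 seconds while the test runs.
setInterval(() => {
  const rssMb = (process.memoryUsage().rss / 1024 / 1024).toFixed(1);
  console.log(`RSS: ${rssMb} MB`);
}, 5000);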
In my search for others with this problem there were only two Stack Overflow articles that came close:
node.js hangs other programs on my mac
Node script causes system freeze when uploading a lot of files
Anyone have any idea why this would happen? It seems very dangerous that a single app/script could so easily bring down a machine without being killed first by the OS.

Related

How to perform recurring long-running background tasks in a Node.js web server

I'm working on a node.js web server using express.js that should offer a dashboard to monitor database servers.
The architecture is quite simple:
a gatherer retrieves the information in a predefined interval and stores the data
express.js listens to user requests and shows a dashboard based on the stored data
I'm now wondering how best to implement the gatherer so that it does not block the main loop. The simplest solution seems to be a setTimeout-based approach, but I was wondering what the "proper" way to architect this would be?
Your concern is your information-gathering step. It probably is not as CPU-intensive as it seems. Because it's a monitoring app, it probably gathers information by contacting other machines, something like this.
async function gather () {
  const results = []
  let result
  result = await getOracleMetrics ('server1')
  results.push(result)
  result = await getMySQLMetrics ('server2')
  results.push(result)
  result = await getMySQLMetrics ('server3')
  results.push(result)
  await storeMetrics(results)
}
This is not a cpu-intensive function. (If you were doing a fast Fourier transform on an image, that would be a cpu-intensive function.)
It spends most of its time awaiting results, and then a little time storing them. Using async / await gives you the illusion it runs synchronously. But, each await yields the main loop to other things.
You might invoke it every minute, something like this. The .then().catch() stuff invokes it asynchronously.
setInterval (
  function go () {
    gather()
      .then()
      .catch(console.error)
  }, 1000 * 60) // every minute
If you do actually have some cpu-intensive computation to do, you have a few choices.
offload it to a worker thread (a sketch follows after the chunking example below).
break it up into short chunks, with sleeps between them.
const sleep = function sleep (howLong) {
  return new Promise(function (resolve) {
    setTimeout(() => { resolve() }, howLong)
  })
}

async function gather () {
  for (let chunkNo = 0; chunkNo < 100; chunkNo++) {
    doComputationChunk(chunkNo)
    await sleep(1)
  }
}
That sleep() function yields to the main loop by waiting for a timeout to expire.
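For the first option, here is a minimal sketch using Node's built-in worker_threads module. The file name compute-worker.js and the heavyCompute function are made up for illustration.
// main thread: sketch of offloading a CPU-heavy job to a worker thread
const { Worker } = require('worker_threads');

function runInWorker(input) {
  return new Promise((resolve, reject) => {
    const worker = new Worker('./compute-worker.js', { workerData: input });
    worker.once('message', resolve); // result posted back by the worker
    worker.once('error', reject);
  });
}

// compute-worker.js would contain something like:
// const { parentPort, workerData } = require('worker_threads');
// parentPort.postMessage(heavyCompute(workerData)); // heavyCompute is your CPU-bound code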
None of this is debugged, sorry to say.
For recurring tasks I prefer to use node-schedule and schedule the jobs on app start-up.
In case you don't want to run CPU-expensive tasks on the main thread, you can always run the code below in a worker thread instead.
Here are two examples, one with a recurrence rule and one with an interval in minutes using a cron expression:
app.js
let mySheduler = require('./mysheduler.js');
mySheduler.sheduleRecurrence();
// And/Or
mySheduler.sheduleInterval();
mysheduler.js
/* INFO: Require node-schedule for starting jobs of scheduled tasks */
var schedule = require('node-schedule');

/* INFO: Helper for constructing a cron expression */
function getCronExpression(minutes) {
  if (minutes < 60) {
    return `*/${minutes} * * * *`;
  }
  else {
    let hours = (minutes - minutes % 60) / 60;
    let minutesRemainder = minutes % 60;
    // Note: for an exact number of hours, run at minute 0
    // ("*/0" is not a valid minute field in most cron implementations).
    if (minutesRemainder === 0) {
      return `0 */${hours} * * *`;
    }
    return `*/${minutesRemainder} */${hours} * * *`;
  }
}

module.exports = {
  sheduleRecurrence: () => {
    // Schedule a job @ 01:00 AM every day (Mo-Su)
    var rule = new schedule.RecurrenceRule();
    rule.hour = 1;
    rule.minute = 0;
    rule.second = 0;
    rule.dayOfWeek = new schedule.Range(0, 6);
    var dailyJob = schedule.scheduleJob(rule, function () {
      /* INFO: Put your database ops or other routines here */
      // ...
      // ..
      // .
    });
    // INFO: Verbose output to check if the job was scheduled:
    console.log(`JOB:\n${dailyJob}\n HAS BEEN SCHEDULED..`);
  },
  sheduleInterval: () => {
    let intervalInMinutes = 60;
    let cronExpression = getCronExpression(intervalInMinutes);
    // INFO: Define a unique job name in case you want to cancel it
    let uniqueJobName = "myIntervalJob"; // should be unique
    // INFO: Schedule the job
    var job = schedule.scheduleJob(uniqueJobName, cronExpression, function () {
      /* INFO: Put your database ops or other routines here */
      // ...
      // ..
      // .
    })
    // INFO: Verbose output to check if the job was scheduled:
    console.log(`JOB:\n${job}\n HAS BEEN SCHEDULED..`);
  }
}
In case you want to cancel a job, you can use its unique job-name:
function cancelCronJob(uniqueJobName) {
  /* INFO: Get job instance for cancelling the scheduled task/job */
  let current_job = schedule.scheduledJobs[uniqueJobName];
  if (!current_job) {
    /* INFO: Cron job not found (already cancelled or unknown) */
    console.log(`CRON JOB WITH UNIQUE NAME: '${uniqueJobName}' UNDEFINED OR ALREADY CANCELLED..`);
  }
  else {
    /* INFO: Cron job found and cancelled */
    console.log(`CANCELLING CRON JOB WITH UNIQUE NAME: '${uniqueJobName}'`)
    current_job.cancel();
  }
};
In my example the recurrence and the interval are hardcoded; obviously you can also pass the recurrence rule or the interval as an argument to the respective function.
As per your comment:
'When looking at the implementation of node-schedule it feels like a thin layer on top of setTimeout..'
Actually, node-schedule uses long-timeout (https://www.npmjs.com/package/long-timeout), so you are right: it's basically a convenience layer on top of timeouts.
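A rough sketch of the idea behind long-timeout (not its actual source): a plain setTimeout stores its delay as a 32-bit signed integer, so anything above 2147483647 ms (roughly 24.8 days) overflows, and the workaround is to re-arm shorter timeouts until the full delay has elapsed.
// Sketch: chain setTimeout calls so delays longer than ~24.8 days still work.
const MAX_DELAY = 2147483647; // largest delay setTimeout handles reliably

function longTimeout(fn, delay) {
  if (delay > MAX_DELAY) {
    setTimeout(() => longTimeout(fn, delay - MAX_DELAY), MAX_DELAY);
  } else {
    setTimeout(fn, delay);
  }
}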

How can the AWS Lambda concurrent execution limit be reached?

UPDATE
The original test code below is largely correct, but in Node.js the various AWS services should be set up a bit differently, as per the SDK link provided by @Michael-sqlbot
// manager
const AWS = require("aws-sdk")
const https = require('https');
const agent = new https.Agent({
  maxSockets: 498 // workers hit this level; expect plus 1 for the manager instance
});

const lambda = new AWS.Lambda({
  apiVersion: '2015-03-31',
  region: 'us-east-2', // Initial concurrency burst limit = 500
  httpOptions: {  // <--- replace the default of 50 (https) by
    agent: agent  // <--- plugging the modified Agent into the service
  }
})
// NOW begin the manager handler code
// NOW begin the manager handler code
In planning for a new service, I am doing some preliminary stress testing. After reading about the 1,000 concurrent execution limit per account and the initial burst rate (which in us-east-2 is 500), I was expecting to achieve at least the 500 burst concurrent executions right away. The screenshot below of CloudWatch's Lambda metric shows otherwise. I cannot get past 51 concurrent executions no matter what mix of parameters I try. Here's the test code:
// worker
exports.handler = async (event) => {
  // declare sleep promise
  const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
  // return after one second
  let nStart = new Date().getTime()
  await sleep(1000)
  return new Date().getTime() - nStart; // report the exact ms the sleep actually took
};

// manager
exports.handler = async (event) => {
  const invokeWorker = async () => {
    try {
      let lambda = new AWS.Lambda() // NO! DO NOT DO THIS, SEE UPDATE ABOVE
      var params = {
        FunctionName: "worker-function",
        InvocationType: "RequestResponse",
        LogType: "None"
      };
      return await lambda.invoke(params).promise()
    }
    catch (error) {
      console.log(error)
    }
  };

  try {
    let nStart = new Date().getTime()
    let aPromises = []
    // invoke workers
    for (var i = 1; i <= 3000; i++) {
      aPromises.push(invokeWorker())
    }
    // record time to complete spawning
    let nSpawnMs = new Date().getTime() - nStart
    // wait for the workers to ALL return
    let aResponses = await Promise.all(aPromises)
    // sum all the actual sleep times
    const reducer = (accumulator, response) => { return accumulator + parseInt(response.Payload) };
    let nTotalWorkMs = aResponses.reduce(reducer, 0)
    // show me
    let nTotalET = new Date().getTime() - nStart
    return {
      jobsCount: aResponses.length,
      spawnCompletionMs: nSpawnMs,
      spawnCompletionPct: `${Math.floor(nSpawnMs / nTotalET * 10000) / 100}%`,
      totalElapsedMs: nTotalET,
      totalWorkMs: nTotalWorkMs,
      parallelRatio: Math.floor(nTotalET / nTotalWorkMs * 1000) / 1000
    }
  }
  catch (error) {
    console.log(error)
  }
};
Response:
{
  "jobsCount": 3000,
  "spawnCompletionMs": 1879,
  "spawnCompletionPct": "2.91%",
  "totalElapsedMs": 64546,
  "totalWorkMs": 3004205,
  "parallelRatio": 0.021
}
Request ID:
"43f31584-238e-4af9-9c5d-95ccab22ae84"
Am I hitting a different limit that I have not mentioned? Is there a flaw in my test code? I was attempting to hit the limit here with 3,000 workers, but there was NO throttling encountered, which I guess is due to the Asynchronous invocation retry behaviour.
Edit: There is no VPC involved on either Lambda; the setting in the select input is "No VPC".
Edit: Showing Cloudwatch before and after the fix
There were a number of potential suspects, particularly due to the fact that you were invoking Lambda from Lambda, but your focus on consistently seeing a concurrency of 50 — a seemingly arbitrary limit (and a suspiciously round number) — reminded me that there's an anti-footgun lurking in the JavaScript SDK:
In Node.js, you can set the maximum number of connections per origin. If maxSockets is set, the low-level HTTP client queues requests and assigns them to sockets as they become available.
Here of course, "origin" means any unique combination of scheme + hostname, which in this case is the service endpoint for Lambda in us-east-2 that the SDK is connecting to in order to call the Invoke method, https://lambda.us-east-2.amazonaws.com.
This lets you set an upper bound on the number of concurrent requests to a given origin at a time. Lowering this value can reduce the number of throttling or timeout errors received. However, it can also increase memory usage because requests are queued until a socket becomes available.
...
When using the default of https, the SDK takes the maxSockets value from the globalAgent. If the maxSockets value is not defined or is Infinity, the SDK assumes a maxSockets value of 50.
https://docs.aws.amazon.com/sdk-for-javascript/v2/developer-guide/node-configuring-maxsockets.html
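For reference, that same documentation also shows a global form of the override; a sketch of raising the limit SDK-wide instead of per service client (the 500 value is illustrative):
// Sketch (aws-sdk v2): set maxSockets for every service client at once.
const AWS = require('aws-sdk');
const https = require('https');

AWS.config.update({
  httpOptions: {
    agent: new https.Agent({ maxSockets: 500 }) // default would otherwise be 50
  }
});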
Lambda concurrency is not the only factor that decides how scalable your functions are. If your Lambda function is running within a VPC, it will require an ENI (Elastic Network Interface), which allows ethernet traffic to and from the container (Lambda function).
It's possible your throttling occurred due to too many ENIs being requested (50 at a time). You can check this by viewing the logs of the manager Lambda function and looking for an error message when it's trying to invoke one of the child containers. If the error looks something like the following, you'll know ENIs are your issue.
Lambda was not able to create an ENI in the VPC of the Lambda function because the limit for Network Interfaces has been reached.

Node.js Calling functions as quickly as possible without going over some limit

I have multiple functions that call different api endpoints, and I need to call them as quickly as possible without going over some limit (20 calls per second for example). My current solution is to have a delay and call the function once every 50 milliseconds for the example I gave, but I would like to call them as quickly as possible and not just space out the calls equally with the rate limit.
function-rate-limit solved a similar problem for me. It spreads calls to your function out over time without dropping any. It still allows instantaneous calls until the rate limit is reached, so no latency is introduced under normal circumstances.
Example from function-rate-limit docs:
var rateLimit = require('function-rate-limit');

// limit to 2 executions per 1000ms
var start = Date.now()
var fn = rateLimit(2, 1000, function (x) {
  console.log('%s ms - %s', Date.now() - start, x);
});

for (var y = 0; y < 10; y++) {
  fn(y);
}
results in:
10 ms - 0
11 ms - 1
1004 ms - 2
1012 ms - 3
2008 ms - 4
2013 ms - 5
3010 ms - 6
3014 ms - 7
4017 ms - 8
4017 ms - 9
You can try using queue from async. Be careful when doing this, it essentially behaves like a while(true) in other languages:
const async = require('async');

const concurrent = 10; // At most 10 concurrent ops
const tasks = Array(concurrent).fill().map((e, i) => i);

let pushBack; // let's create a ref to a lambda function

const myAsyncFunction = (task) => {
  // TODO: Swap with the actual implementation
  return Promise.resolve(task);
};

const q = async.queue((task, cb) => {
  myAsyncFunction(task)
    .then((result) => {
      pushBack(task);
      cb(null, result);
    })
    .catch((err) => cb(err, null));
}, tasks.length);

pushBack = (task) => q.push(task);

q.push(tasks);
What's happening here? We are saying "hey run X tasks in parallel" and after each task gets completed, we put it back in the queue which is the equivalent of saying "run X tasks in parallel forever"
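If the requirement is specifically "no more than 20 calls per second" rather than "no more than N in flight", a small hand-rolled limiter can pace the starts while still allowing bursts. This is only a sketch under that assumption, not a drop-in library:
// Sketch: allow up to `limit` task starts within any rolling one-second window.
function makeRateLimiter(limit) {
  const starts = []; // timestamps of recent starts
  return async function run(task) {
    const now = Date.now();
    while (starts.length && now - starts[0] >= 1000) starts.shift(); // drop old entries
    if (starts.length >= limit) {
      // wait until the oldest start falls out of the window, then re-check
      await new Promise((resolve) => setTimeout(resolve, 1000 - (now - starts[0])));
      return run(task);
    }
    starts.push(Date.now());
    return task();
  };
}

// const limited = makeRateLimiter(20);
// limited(() => callSomeApiEndpoint());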

Inconsistent request behavior in Node when requesting large number of links?

I am currently using this piece of code to connect to a massive list of links (a total of 2458 links, dumped at https://pastebin.com/2wC8hwad) to get feeds from numerous sources, and to deliver them to users of my program.
It basically splits one massive array into multiple batches (arrays), then forks a process to handle a batch, requesting each stored link and checking for a 200 status code. Only when a batch is complete is the next batch sent for processing, and when it's all done the forked process is disconnected. However, I'm seeing apparently inconsistent behavior with this logic, particularly in the part where it requests the links.
const req = require('./request.js')
const process = require('child_process')
const linkList = require('./links.json')

let processor
console.log(`Total length: ${linkList.length}`) // 2458 links

const batchLength = 400
const batchList = [] // Contains batches (arrays) of links
let currentBatch = []

for (var i in linkList) {
  if (currentBatch.length < batchLength) currentBatch.push(linkList[i])
  else {
    batchList.push(currentBatch)
    currentBatch = []
    currentBatch.push(linkList[i])
  }
}
if (currentBatch.length > 0) batchList.push(currentBatch)

console.log(`Batch list length by default is ${batchList.length}`)
// cutDownBatchList(1)
console.log(`New batch list length is ${batchList.length}`)

const startTime = new Date()
getBatchIsolated(0, batchList)
let failCount = 0

function getBatchIsolated (batchNumber) {
  console.log('Starting batch #' + batchNumber)
  let completedLinks = 0
  const currentBatch = batchList[batchNumber]
  if (!processor) processor = process.fork('./request.js')
  for (var u in currentBatch) { processor.send(currentBatch[u]) }
  processor.on('message', function (linkCompletion) {
    if (linkCompletion === 'failed') failCount++
    if (++completedLinks === currentBatch.length) {
      if (batchNumber !== batchList.length - 1) setTimeout(getBatchIsolated, 500, batchNumber + 1)
      else finish()
    }
  })
}

function finish () {
  console.log(`Completed, time taken: ${((new Date() - startTime) / 1000).toFixed(2)}s. (${failCount}/${linkList.length} failed)`)
  processor.disconnect()
}

function cutDownBatchList (maxBatches) {
  for (var r = batchList.length - 1; batchList.length > maxBatches && r >= 0; r--) {
    batchList.splice(r, 1)
  }
  return batchList
}
Below is request.js, using needle. (However, for some strange reason it may completely hang up on a particular site indefinitely - in that case, I just use this workaround)
const needle = require('needle')

function connect (link, callback) {
  const options = {
    timeout: 10000,
    read_timeout: 8000,
    follow_max: 5,
    rejectUnauthorized: true
  }
  const request = needle.get(link, options)
    .on('header', (statusCode, headers) => {
      if (statusCode === 200) callback(null, link)
      else request.emit('err', new Error(`Bad status code (${statusCode})`))
    })
    .on('err', err => callback(err, link))
}

process.on('message', function (linkRequest) {
  connect(linkRequest, function (err, link) {
    if (err) {
      console.log(`Couldn't connect to ${link} (${err})`)
      process.send('failed')
    } else process.send('success')
  })
})
In theory, I think this should perform perfectly fine: it spawns a separate process to handle the dirty work in sequential batches so it's not overloaded, and it's very scalable. However, when using the full list of 2458 links with a total of 7 batches, I often get massive "socket hang up" errors on random batches on almost every trial, similar to what would happen if I requested all the links at once.
If I cut the number of batches down to 1 using the function cutDownBatchList, it performs perfectly fine on almost every trial. This is all happening on a Linux Debian VPS with two 3.1 GHz vCores and 4 GB RAM from OVH, on Node v6.11.2.
One thing I also noticed is that if I increase the timeout to 30000 (30 sec) in request.js for 7 batches, it works as intended; however, it works perfectly fine with a much lower timeout when I cut it down to 1 batch. If I try to do all 2458 links at once with a higher timeout, I also face no issues (which basically makes this mini algorithm useless if I can't cut down the timeout by handling links in batches). This all goes back to the inconsistent-behavior issue.
The best TLDR I can do: trying to request a bunch of links in sequential batches in a forked child process succeeds almost every time with a lower number of batches, but fails consistently with the full number of batches, even though the behavior should be the same since it's handled in isolated batches.
Any help would be greatly appreciated in solving this issue as I just cannot for the life of me figure it out!

How to find out the % CPU usage for Node.js process?

Is there a way to find out the CPU usage in % for a Node.js process via code, so that when the application is running on the server and detects that CPU usage exceeds a certain %, it can raise an alert or log to the console?
On *nix systems you can get process stats by reading the /proc/[pid]/stat virtual file.
For example, this will check the CPU usage every ten seconds and print to the console if it's over 20%. It works by checking the number of CPU ticks used by the process and comparing the value to a second measurement made one second later. The difference is the number of ticks used by the process during that second. The times in /proc/[pid]/stat are in clock ticks (normally 100 per second, per sysconf(_SC_CLK_TCK)), so dividing the delta by the tick rate gives the fraction of one core used over that one-second window.
var fs = require('fs');

// Clock ticks per second; see sysconf(_SC_CLK_TCK), typically 100 on Linux.
var TICKS_PER_SECOND = 100;

var getUsage = function (cb) {
  fs.readFile("/proc/" + process.pid + "/stat", function (err, data) {
    var elems = data.toString().split(' ');
    var utime = parseInt(elems[13]);
    var stime = parseInt(elems[14]);
    cb(utime + stime);
  });
}

setInterval(function () {
  getUsage(function (startTime) {
    setTimeout(function () {
      getUsage(function (endTime) {
        var delta = endTime - startTime; // ticks used during ~1 second
        var percentage = 100 * (delta / TICKS_PER_SECOND);
        if (percentage > 20) {
          console.log("CPU Usage Over 20%!");
        }
      });
    }, 1000);
  });
}, 10000);
Try looking at this code: https://github.com/last/healthjs
Network service for getting CPU of remote system and receiving CPU usage alerts...
Health.js serves 2 primary modes: "streaming mode" and "event mode". Streaming mode allows a client to connect and receive streaming CPU usage data. Event mode enables Health.js to notify a remote server when CPU usage hits a certain threshold. Both modes can be run simultaneously...
You can use the os module now.
var os = require('os');
var loads = os.loadavg();
This gives you the load average for the last 1, 5, and 15 minutes.
This doesn't give you the CPU usage as a percentage, though.
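If a percentage is wanted, the same os module can be sampled twice; the sketch below computes the overall (system-wide, not per-process) CPU usage over a one-second window:
// Sketch: system-wide CPU % from two os.cpus() samples taken one second apart.
var os = require('os');

function cpuTimes() {
  return os.cpus().reduce(function (acc, cpu) {
    for (var type in cpu.times) {
      acc.total += cpu.times[type];
      if (type === 'idle') acc.idle += cpu.times[type];
    }
    return acc;
  }, { idle: 0, total: 0 });
}

var first = cpuTimes();
setTimeout(function () {
  var second = cpuTimes();
  var idleDelta = second.idle - first.idle;
  var totalDelta = second.total - first.total;
  console.log('CPU usage: ' + (100 * (1 - idleDelta / totalDelta)).toFixed(1) + '%');
}, 1000);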
Use the Node process.cpuUsage() function (introduced in Node v6.1.0).
It shows the CPU time spent on your Node process. Example taken from the docs:
const previousUsage = process.cpuUsage();
// { user: 38579, system: 6986 }

// spin the CPU for 500 milliseconds
const startDate = Date.now();
while (Date.now() - startDate < 500);

// At this moment you can expect a result of 100%
// Time is *1000 because cpuUsage is in us (microseconds)
const usage = process.cpuUsage(previousUsage);
const result = 100 * (usage.user + usage.system) / ((Date.now() - startDate) * 1000)
console.log(result);

// set 2 sec "non-busy" timeout
setTimeout(function () {
  console.log(process.cpuUsage(previousUsage));
  // { user: 514883, system: 11226 } ~ 0.5 sec
  // here you can expect a result of about 20% (0.5s busy of 2.5s total runtime,
  // relative to previousUsage, the first value taken about 2.5s ago)
}, 2000);
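To get the alerting behaviour the question asks for, one option is to sample process.cpuUsage() on an interval and compare the CPU time used to the wall-clock time that passed. A sketch (the 20% threshold and 10-second interval are arbitrary):
// Sketch: warn when this process used more than 20% of one core during the last interval.
const INTERVAL_MS = 10000;
let lastUsage = process.cpuUsage();
let lastTime = Date.now();

setInterval(() => {
  const usage = process.cpuUsage(lastUsage); // delta since last sample, in microseconds
  const elapsedMs = Date.now() - lastTime;
  const percent = 100 * ((usage.user + usage.system) / 1000) / elapsedMs;
  if (percent > 20) {
    console.log(`CPU usage over 20%! (${percent.toFixed(1)}%)`);
  }
  lastUsage = process.cpuUsage();
  lastTime = Date.now();
}, INTERVAL_MS);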
See node-usage for tracking process CPU and memory usage (not the system's).
Another option is to use the node-red-contrib-os package.
