Memory leak in a node.js crawler application

Memory leak in a node.js crawler application - node.js

For over a month I'm struggling with a very annoying memory leak issue and I have no clue how to solve it.
I'm writing a general purpose web crawler based on: http, async, cheerio and nano. From the very beginning I've been struggling with memory leak which was very difficult to isolate.
I know it's possible to do a heapdump and analyse it with Google Chrome but I can't understand the output. It's usually a bunch of meaningless strings and objects leading to some anonymous functions telling me exactly nothing (it might be lack of experience on my side).
Eventually I came to a conclusion that the library I had been using at the time (jQuery) had issues and I replaced it with Cheerio. I had an impression that Cheerio solved the problem but now I'm sure it only made it less dramatic.
You can find my code at: https://github.com/lukaszkujawa/node-web-crawler. I understand it might be lots of code to analyse but perhaps I'm doing something stupid which can be obvious strait away. I'm suspecting the main agent class which does HTTP requests https://github.com/lukaszkujawa/node-web-crawler/blob/master/webcrawler/agent.js from multiple "threads" (with async.queue).
If you would like to run the code it requires CouchDB and after npm install do:
$ node crawler.js -c conf.example.json
I know that Node doesn't go crazy with garbage collection but after 10min of heavy crawling used memory can go easily over 1GB.
(tested with v0.10.21 and v0.10.22)

For what it's worth, Node's memory usage will grow and grow even if your actual used memory isn't very large. This is for optimization on behalf of the V8 engine. To see your real memory usage (to determine if there is actually a memory leak) consider dropping this code (or something like it) into your application:
setInterval(function () {
if (typeof gc === 'function') {
gc();
}
applog.debug('Memory Usage', process.memoryUsage());
}, 60000);
Run node --expose-gc yourApp.js. Every minute there will be a log line indicating real memory usage immediately after a forced garbage collection. I've found that watching the output of this over time is a good way to determine if there is a leak.
If you do find a leak, the best way I've found to debug it is to eliminate large sections of your code at a time. If the leak goes away, put it back and eliminate a smaller section of it. Use this method to narrow it down to where the problem is occurring. Closures are a common source, but also check for anywhere else references may not be cleaned up. Many network applications will attach handlers for sockets that aren't immediately destroyed.

Related

v8 memory spike (rss) when defining more than 1000 function (does not reproduce when using --jitless)

I have a simple node app with 1 function that defines 1000+ functions inside it (without running them).
When I call this function (the wrapper) around 200 times the RSS memory of the process spikes from 100MB to 1000MB and immediately goes down. (The memory spike only happens after around 200~ calls, before that all the calls do not cause a memory spike, and all the calls after do not cause a memory spike)
This issue is happening to us in our node server in production, and I was able to reproduce it in a simple node app here:
https://github.com/gileck/node-v8-memory-issue
When I use --jitless pr --no-opt the issue does not happen (no spikes). but obviously we do not want to remove all the v8 optimizations in production.
This issue must be some kind of a specific v8 optimization, I tried a few other v8 flags but non of them fix the issue (only --jitless and --no-opt fix it)
Anyone knows which v8 optimization could cause this?
Update:
We found that --no-concurrent-recompilation fix this issue (No memory spikes at all).
but still, we can't explain it.
We are not sure why it happens and which code changes might fix it (without the flag).
As one of the answers suggests, moving all the 1000+ function definitions out of the main function will solve it, but then those functions will not be able to access the context of the main function which is why they are defined inside it.
Imagine that you have a server and you want to handle a request.
Obviously, The request handler is going to run many times as the server gets a lot of requests from the client.
Would you define functions inside the request handler (so you can access the request context in those functions) or define them outside of the request handler and pass the request context as a parameter to all of them? We chose the first option... what do you think?

anyone knows which v8 optimization could cause this?
Load Elimination.
I guess it's fair to say that any optimization could cause lots of memory consumption in pathological cases (such as: a nearly 14 MB monster of a function as input, wow!), but Load Elimination is what causes it in this particular case.
You can see for yourself when your run with --turbo-stats (and optionally --turbo-filter=foo to zoom in on just that function).
You can disable Load Elimination if you feel that you must. A preferable approach would probably be to reorganize your code somewhat: defining 2,000 functions is totally fine, but the function defining all these other functions probably doesn't need to be run in a loop long enough until it gets optimized? You'll avoid not only this particular issue, but get better efficiency in general, if you define functions only once each.
There may or may not be room for improving Load Elimination in Turbofan to be more efficient for huge inputs; that's a longer investigation and I'm not sure it's worth it (compared to working on other things that likely show up more frequently in practice).
I do want to emphasize for any future readers of this that disabling optimization(s) is not generally a good rule of thumb for improving performance (or anything else), on the contrary; nor are any other "secret" flags needed to unlock "secret" performance: the default configuration is very carefully optimized to give you what's (usually) best. It's a very rare special case that a particular optimization pass interacts badly with a particular code pattern in an input function.

Why does v8 report duplicate module strings in heap in my jest tests?

In the process of upgrading node (16.1.x => 16.5.0), I observed that I'm getting OOM issues from jest. In troubleshooting, I'm periodically taking heap snapshots. I'm regularly seeing entries in "string" for module source (same shallow/retained size). In this example screenshot, you can see that the exact same module (React) is listed 2x. Sometimes, the module string is listed even 4x for any given source module.
Upon expansion, it says "system / Map", which suggests to me I think? that theres some v8 wide reference to this module string? That makes sense--maybe. node has a require cache, jest has a module cache, v8 and node i'd assume... share module references? The strings and compiled code buckets do increase regularly, but I expect them to get GC'd. In fact, I can see that many do--expansion of the items show the refs belonging to GC Roots. But I suspect something is holding on to these module references, and I fear it's not at the user level, but at the tooling level. This is somewhat evidenced by observation that only the node.js upgrade induces the OOM failure mode.
Why would my jest test have multiple instances of the same module (i am using --runInBand, so I don't expect multiple workers)
What tips would you offer to diagnose further?
I do show multiple VM Contexts, which I think makes sense--I suppose jest is running some test suites in some sort of isolation.
I do not have a reproduction--I am looking for discussion, best-know-methods, diagnostic ideas.

I can offer some thoughts:
"system / Map" does not mean "some v8 wide reference". "Map" is the internal name for "hidden class", which you may have heard of. The details don't even matter here; TL;DR: some internal thing, totally normal, not a sign of a problem.
Having several copies of the same string on the heap is also quite normal, because strings don't get deduplicated by default. So if you run some string-producing operation twice (such as: reading an external file), you'll get two copies of the string. I have no idea what jest does under the hood, but it's totally conceivable that running tests in parallel in mostly-isolated environments has a side effect of creating duplicate strings. That may be inefficient in a sense, but as long as they get GC'ed after a while, it's not really a problem.
If the specific hypothesis implied above (there are several tests in each file, and jest creates an in-memory copy of the entire file for each executing test) holds, then a possible mitigation might be to split your test files into smaller chunks (1.8MB is quite a lot for a single file). I don't have much confidence in this, but maybe it'd be easy for you to try it and see.
More generally: in the screenshot, there are 36MB of memory used by strings. That's far from being an OOM reason.
It might be insightful to measure the memory consumption of both Node versions. If, for example, it used to consume 4GB and now crashes when it reaches 2GB, that would indicate that the limit has changed. If it used to consume 2GB and now crashes when it reaches 4GB, that would imply that something major has changed. If it used to consume 1.98GB and now crashes when it reaches 2.0GB, then chances are something tiny has changed and you just happened to get lucky with the old version.
Until contradicting evidence turns up, I would operate under the assumption that the resource consumption is normal and simply must be accommodated. You could try giving Node more memory, or reducing the number of parallel test executions.

This seems like a known issue of Jest at Node JS v16.11.0+ and has already been reported to GitHub.

How can I automatically test for memory leaks in Node?

I have some code in a library that has in the past leaked badly, and I would like to add regression tests to avoid that in the future. I understand how to find memory leaks manually, by looking at memory usage profiles or Valgrind, but I have had trouble writing automatic tests for them.
I tried using global.gc() followed by process.memoryUsage() after running the operation I was checking for leaks, then doing this repeatedly to try to establish a linear relationship between number of operations and memory usage, but there seems to be noise in the memory usage numbers that makes this hard to measure accurately.
So, my question is this: is there an effective way to write a test in Node that consistently passes when an operation leaks memory, and fails when it does not leak memory?
One wrinkle that I should mention is that the memory leaks were occurring in a C++ addon, and some of the leaked memory was not managed by the Node VM, so I was measuring process.memoryUsage().rss.

Automating and logging information to test for memory leaks in node js.
There is a great module called memwatch-next.
npm install --save memwatch-next
Add to app.js:
const memwatch = require('memwatch-next');
// ...
memwatch.on('leak', (info) => {
// Some logging code...
console.error('Memory leak detected:\n', info);
});
This will allow you to automatically measure if there is a memory leak.
Now to put it to a test:
Good tool for this is Apache jMeter. More information here.
If you are using http you can use jMeter to soak test the application's end points.
SOAK testing is done to verify system's stability and performance characteristics over an extended period of time, its good when you are looking for memory leaks, connection leaks etc.
Continuous integration software:
Prior to deployment to production if you are using a software for continuous integration like Jenkins, you can make a Jenkins job to do this for you, it will test the application with parameters provided after the test will ether deploy the application or report that there is a memory leak. ( Depending on your Jenkins job configuration )
Hope it helps, update me on how it goes;
Good luck,

Given some arbitrary program, is it always possible to determine if it will ever terminate? The halting problem describes this. Consider the following program:
function collatz(n){
if(n==1)
return;
if(n%2==0)
return collatz(n/2);
else
return collatz(3*n+1);
}
The same idea can be applied to data in memory. It's not always possible to identify what memory isn't needed anymore and can thus be garbage collected. There is also the case of the program being designed to consume a lot of memory in some situation. The only known option is coming up with some heuristic like you have done, but it will most likely result in false positives and negatives. It may be easier to determine the root cause of the leak so it can be corrected.

Profiling heapdumps in Chrome Developer Tools (memory leak)

I'm having bit trouble with a NodeJS/Express/React application that is on production as we speak.
The problem is, that it keeps climbing up on memory usage and it just doesn't stop. It is slow and steady, and eventually Node crashes. I have several heapdumps that I have been creating with the help of node-heapdump, however, I don't know how to properly identify the leak.
I will share an image of my snapshot. Please note that I sorted by shallow size so supposedly one of those objects/types that appear on top must be the problem:
As I can see below, there is this "Promis in #585" that I see in many places and that could be the one, but I'm unable to identify that line, function or component.
Anybody could help? I can share more screenshots if you want.
Thanks.

I found the problem.
I'm using React Body Classname in my app so when we load different routes we can change the body class from client side. This npm module needs to be used with the Rewind() funcion when you do server side rendering in order to avoid memory leaks:
This is the module I'm talking about:
https://github.com/iest/react-body-classname
And, in order to avoid the memory leak, we are calling:
BodyClassName.rewind()
In the render function of our main App.js container component. This way, doesn't matter what url a user is landing on, Rewind() will always be called and so the data that can be garbage collected will be properly freed in the future.
Now our app stays at a nice and steady 120mb of memory usage.
Thanks anyway :D

NodeJS Memory Leak when using VM to execute untrusted code

I am using the NodeJS VM Module to run untrusted code safely. I have noticed a huge memory leak that takes about 10M of memory on each execution and does not release it. Eventually, my node process ends up using 500M+ of memory. After some digging, I traced the problem to the constant creation of VMs. To test my theory, I commented out the code that creates the VMs. Sure enough, the memory usage dropped dramatically. I then uncommented the code again and placed global.gc() calls strategically around the problem areas and ran node with the--expose-gc flag. This reduced my memory usage dramatically and retained the functionality.
Is there a better way of cleaning up VMs after I am done using it?
My next approach is the cache the vm containing the given unsafe code and reusing it if it I see the unsafe code again (Background:I am letting users write their own parsing function for blocks of text, thus, the unsafe code be executed frequently or executed once and never seen again).
Some reference code.
async.each(items,function(i,cb){
// Initialize context...
var context = vm.createContext(init);
// Execute untrusted code
var captured = vm.runInContext(parse, context);
// This dramatically improves the usage, but isn't
// part of the standard API
// global.gc();
// Return Result via a callback
cb(null,captured);
});

When I see this right this was fixed in v5.9.0, see this PR. It appears that in those cases both node core maintainer nor programmers can do much - that we pretty much have to wait for a upstream fix in v8.
So no, you can't do anything more about it. Catching this bug was good though!

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string