Reducing Node memory usage when making HTTP requests in a loop

Reducing Node memory usage when making HTTP requests in a loop - node.js

I've set up a simple loop to poll an IronMQ messaging system, and everything works fine... except that memory usage increases more and more until it finally stabilizes at over 250MB. I've read that it's normal for Node to use more memory over time when run in a (sort of) recursive loop like this, even when running setTimeout and doing nothing, but I still don't understand the exact mechanics behind this behavior, or whether there is any way to control it. When making HTTP requests within the loop, memory usage more than doubles.
The code is running on a Heroku worker with a limit of 512MB RAM, leaving no breathing room to use cluster to use the rest of the available CPU cores. The memory usage can increase slowly or extremely quickly, depending on the jobs that run after receiving the messages.
This is the simplest code that reproduces this.
(function loop() {
request.get('http://www.example.com', function(err, request, body) {
if (err) console.log(err);
setTimeout(loop, 200);
});
})();
I've tried many, many ways of restructuring this code to prevent memory from increasing so high, but nothing has made any changes. Only the received HTTP response seems to have any effect on the upper limit of RAM used.
Is there a way to rewrite this entirely, or am I stuck with V8's behavior? All examples I've found use the same basic structure for infinite async loops, from kue to the async library.

Related

Node running many POST requests at once

I want to make a bunch of post requests at once from my node.js script I have the same payload for every one of the requests and would want to simply spin them all off at once in some kind of loop and then make them write to a DB upon completion.
Sample Code to demonstrate the logic:
for (var i = 0; i < 10; i++) {
makePostRequest(updateDB);
}
function makePostRequest (callback) {
http.post("test", function (result) {callback(result)});
}
function updateDB(result) {
db.put({result: result});
}
The question is, is it better to do this using multithreading or does the fact that node will run these post request asynchronously take care of it and result in the same performance? In other words, will these post requests all happen asynchronously in a same way you would expect a multhreaded program behave and result in the same performance?
My goal is to minimize execution time.

I think the improvement in execution time you'll see from multi-threading through child processes here will be minimal. As you said, the request are already executed asynchronously.
In fact, you may see a decrease in performance as n, the number of concurrent requests, increases.
To be sure, you could run a benchmark.

In your example, you send 10 requests in parallel. I think that this number is small and will not cause a problem.
There are some things that you have to pay attention to.
There is a maximum of concurrent outgoing connections for your server
How many concurrent requests is the remote server able to serve without slowing down?
If you launch the requests with a for loop, the for loop runs to completion before the requests are effectively sent out.
If you have to launch a really big number of requests, they will be faster completed, if you limit the maximal number of concurrent requests. Also think about about throttling the launch, by using setTimeout between calls to makePostRequest.
I don't know what are the thresholds where you need to take these precautions.

I decided to go with the logic that I described earlier in the question. Reasoning being is that node runs its async calls separately on a different thread so that its not blocking the main thread. his means that running POST async will behave similarly to what I wanted from multithreading

Memory leak in a node.js crawler application

For over a month I'm struggling with a very annoying memory leak issue and I have no clue how to solve it.
I'm writing a general purpose web crawler based on: http, async, cheerio and nano. From the very beginning I've been struggling with memory leak which was very difficult to isolate.
I know it's possible to do a heapdump and analyse it with Google Chrome but I can't understand the output. It's usually a bunch of meaningless strings and objects leading to some anonymous functions telling me exactly nothing (it might be lack of experience on my side).
Eventually I came to a conclusion that the library I had been using at the time (jQuery) had issues and I replaced it with Cheerio. I had an impression that Cheerio solved the problem but now I'm sure it only made it less dramatic.
You can find my code at: https://github.com/lukaszkujawa/node-web-crawler. I understand it might be lots of code to analyse but perhaps I'm doing something stupid which can be obvious strait away. I'm suspecting the main agent class which does HTTP requests https://github.com/lukaszkujawa/node-web-crawler/blob/master/webcrawler/agent.js from multiple "threads" (with async.queue).
If you would like to run the code it requires CouchDB and after npm install do:
$ node crawler.js -c conf.example.json
I know that Node doesn't go crazy with garbage collection but after 10min of heavy crawling used memory can go easily over 1GB.
(tested with v0.10.21 and v0.10.22)

For what it's worth, Node's memory usage will grow and grow even if your actual used memory isn't very large. This is for optimization on behalf of the V8 engine. To see your real memory usage (to determine if there is actually a memory leak) consider dropping this code (or something like it) into your application:
setInterval(function () {
if (typeof gc === 'function') {
gc();
}
applog.debug('Memory Usage', process.memoryUsage());
}, 60000);
Run node --expose-gc yourApp.js. Every minute there will be a log line indicating real memory usage immediately after a forced garbage collection. I've found that watching the output of this over time is a good way to determine if there is a leak.
If you do find a leak, the best way I've found to debug it is to eliminate large sections of your code at a time. If the leak goes away, put it back and eliminate a smaller section of it. Use this method to narrow it down to where the problem is occurring. Closures are a common source, but also check for anywhere else references may not be cleaned up. Many network applications will attach handlers for sockets that aren't immediately destroyed.

Are callbacks for requests a bad practice in node.js?

Imagine you want to download an image or a file, this would be the first way the internet will teach you to go ahead:
request(url, function(err, res, body) {
fs.writeFile(filename, body);
});
But doesn't this accumulate all data in body, filling the memory?
Would a pipe be totally more efficient?
request(url).pipe(fs.createWriteStream(filename));
Or is this handled internally in a similar matter, buffering the stream anyway, making this irrelevant?
Furthermore, if I want to use the callback but not the body (because you can still pipe), will this memory buffer still be filled?
I am asking because the first (callback) method allows me to chain downloads in stead of launching them in parallel(*), but I don't want to fill a buffer I'm not gonna use either. So I need the callback if I don't want to resort to something fancy like async just to use queue to prevent this.
(*) Which is bad because if you just request too many files before they are complete, the async nature of request will cause node to choke to death in an overdose of events and memory loss. First you'll get these:
"possible EventEmitter memory leak detected. 11 listeners added. Use emitter.setMaxListeners() to increase limit."
And when stretching it, 500 piped requests will fill your memory up and crash node. That's why you need the callback in stead of the pipe, so you know when to start the next file.

But doesn't this accumulate all data in body, filling the memory?
Yes, many operations such as your first snippet buffer data into memory for processing. Yes this uses memory, but it is at least convenient and sometimes required depending on how you intend to process that data. If you want to load an HTTP response and parse the body as JSON, that is almost always done via buffering, although it's possible with a streaming parser, it is much more complicated and usually unnecessary. Most JSON data is not sufficiently large such that streaming is a big win.
Or is this handled internally in a similar matter, making this irrelevant?
No, APIs that provide you an entire piece of data as a string use buffering and not streaming.
However, multimedia data, yes, you cannot realistically buffer it to memory and thus streaming is more appropriate. Also that data tends to be opaque (you don't parse it or process it), which is also good for streaming.
Streaming is nice when circumstances permit it, but that doesn't mean there's anything necessarily wrong with buffering. The truth is buffering is how the vast majority of things work most of the time. In the big picture, streaming is just buffering 1 chunk at a time and capping them at some size limit that is well within the available resources. Some portion of the data needs to go through memory at some point if you are going to process it.
Because if you just request too many files one by one, the async nature of request will cause node to choke to death in an overdose of events and memory loss.
Not sure exactly what you are stating/asking here, but yes, writing effective programs requires thinking about resources and efficiency.
See also substack's rant on streaming/pooling in the hyperquest README.

I figured out a solution that renders the questions about memory irrelevant (although I'm still curious).
if I want to use the callback but not the body (because you can still pipe), will this memory buffer still be filled?
You don't need the callback from request() in order to know when the request is finished. The pipe() will close itself when the stream 'ends'. The close emits an event and can be listened for:
request(url).pipe(fs.createWriteStream(filename)).on('close', function(){
next();
});
Now you can queue all your requests and download files one by one.
Of course you can vacuum the internet using 8 parallel requests all the time with libraries such as async.queue, but if all you want to do is get some files with a simple script, async is probably overkill.
Besides, you're not gonna want to max out your system resources for a single trick on a multi-user system anyway.

Node.js async parallel - what consequences are?

There is code,
async.series(tasks, function (err) {
return callback ({message: 'tasks execution error', error: err});
});
where, tasks is array of functions, each of it peforms HTTP request (using request module) and calling MongoDB API to store the data (to MongoHQ instance).
With my current input, (~200 task to execute), it takes
[normal mode] collection cycle: 1356.843 sec. (22.61405 mins.)
But simply trying change from series to parallel, it gives magnificent benefit. The almost same amount of tasks run in ~30 secs instead of ~23 mins.
But, knowing that nothing is for free, I'm trying to understand what the consequences of that change? Can I tell that number of open sockets will be much higher, more memory consumption, more hit to DB servers?
Machine that I run the code is only 1GB of RAM Ubuntu, so I so that app hangs there one time, can it be caused by lacking of resources?

Your intuition is correct that the parallelism doesn't come for free, but you certainly may be able to pay for it.
Using a load testing module (or collection of modules) like nodeload, you can quantify how this parallel operation is affecting your server to determine if it is acceptable.
Async.parallelLimit can be a good way of limiting server load if you need to, but first it is important to discover if limiting is necessary. Testing explicitly is the best way to discover the limits of your system (eachLimit has a different signature, but could be used as well).
Beyond this, common pitfalls using async.parallel include wanting more complicated control flow than that function offers (which, from your description doesn't seem to apply) and using parallel on too large of a collection naively (which, say, may cause you to bump into your system's file descriptor limit if you are writing many files). With your ~200 request and save operations on 1GB RAM, I would imagine you would be fine as long as you aren't doing much massaging in the event handlers, but if you are experiencing server hangs, parallelLimit could be a good way out.
Again, testing is the best way to figure these things out.

I would point out that async.parallel executes multiple functions concurrently not (completely) parallely. It is more like virtual parallelism.
Executing concurrently is like running different programs on a single CPU core, via multitasking/scheduling. True parallel execution would be running different program on each core of multi-core CPU. This is important as node.js has single-threaded architecture.
The best thing about node is that you don't have to worry about I/O. It handles I/O very efficiently.
In your case you are storing data to MongoDB, is mostly I/O. So running them parallely will use up your network bandwidth and if reading/writing from disk then disk bandwidth too. Your server will not hang because of CPU overload.
The consequence of this would be that if you overburden your server, your requests may fail. You may get EMFILE error (Too many open files). Each socket counts as a file. Usually connections are pooled, meaning to establish connection a socket is picked from the pool and when finished return to the pool. You can increase the file descriptor with ulimit -n xxxx.
You may also get socket errors when overburdened like ECONNRESET(Error: socket hang up), ECONNREFUSED or ETIMEDOUT. So handle them with properly. Also check the maximum number of simultaneous connections for mongoDB server too.
Finally the server can hangup because of garbage collection. Garbage collection kicks in after your memory increases to a certain point, then runs periodically after some time. The max heap memory V8 can have is around 1.5 GB, so expect GC to run frequently if its memory is high. Node will crash with process out of memory if asking for more, than that limit. So fix the memory leaks in your program. You can look at these tools.

The main downside you'll see here is a spike in database server load. That may or may not be okay depending on your setup.
If your database server is a shared resource then you will probably want to limit the parallel requests by using async.eachLimit instead.

you'll realize the difference if multiple users connect:
in this case the processor can handle multiple operations
asynch tries to run several operations of multiple users relative equal
T = task
U = user
(T1.U1 = task 1 of user 1)
T1.U1 => T1.U2 => T2.U1 => T8.U3 => T2.U2 => etc
this is the oposite of atomicy (so maybe watch for atomicy on special db operations - but thats another topic)
so maybe it is faster to use:
T2.U1 before T1.U1
- this is no problem until
T2.U1 is based on T1.U1
- this is preventable by using callbacks/ or therefore are callbacks
...hope this is what you wanted to know... its a bit late here

Does use of recursive process.nexttick let other processes or threads work?

Technically when we execute the following code(recursive process.nexttick), the CPU usage would get to 100% or near. The question is the imagining that I'm running on a machine with one CPU and there's another process of node HTTP server working, how does it affect it?
Does the thread doing recursive process.nexttick let the HTTP server work at all?
If we have two threads of recursive process.nexttick, do they both get 50% share?
Since I don't know any machine with one core cannot try it. And since my understanding of time sharing of the CPU between threads is limited in this case, I don't how should try it with machines that have 4 cores of CPU.
function interval(){
process.nextTick(function(){
someSmallSyncCode();
interval();
})
}
Thanks

To understand whats going on here, you have to understand a few things about node's event loop as well as the OS and CPU.
First of all, lets understand this code better. You call this recursive, but is it?
In recursion we normally think of nested call stacks and then when the computation is done (reaching a base case), the stack "unwinds" back to the point of where our recursive function was called.
While this is a method that calls itself (indirectly through a callback), the event loop skews what is actually going on.
process.nextTick takes a function as a callback and puts it first at the list of stuff to be done on the next go-around of the event loop. This callback is then executed and when it is done, you once again register the same callback. Essentially, the key difference between this and true recursion is that our call stack never gets more than one call deep. We never "unwind" the stack, we just have lots of small short stacks in succession.
Okay, so why does this matter?
When we understand the event loop better and what is really going on, we can better understand how system resources are used. By using process.nextTick in this fashion, you are assuring there is ALWAYS something to do on the event loop, which is why you get high cpu usage (but you knew that already). Now, if we were to suppose that your HTTP server were to run in the SAME process as the script, such as below
function interval(){
process.nextTick(doIntervalStuff)
}
function doIntervalStuff() {
someSmallSyncCode();
interval();
}
http.createServer(function (req, res) {
doHTTPStuff()
}).listen(1337, "127.0.0.1");
then how does the CPU usage get split up between the two different parts of the program? Well thats hard to say, but if we understand the event loop, we can at least guess.
Since we use process.nextTick, the doIntervalStuff function will be run every time at the "start" of the event loop, however, if there is something to be done for the HTTP server (like handle a connection) then we know that will get done before the next time the event loop starts, and remember, due to the evented nature of node, this could be handling any number of connections on one iteration of the event loop. What this implies, is that at least in theory, each function in the process gets what it "needs" as far as CPU usage, and then the process.nextTick functions uses the rest. While this isn't exactly true (for example, your bit of blocking code would mess this up), it is a good enough model to think about.
Okay now (finally) on to your real question, what about separate processes?
Interestingly enough, the OS and CPU are often times also very "evented" in nature. Whenever a processes wants to do something (like in the case of node, start an iteration of the event loop), it makes a request to the OS to be handled, the OS then shoves this job in a ready queue (which is prioritized) and it executes when the CPU scheduler decides to get around to it. This is once again is an oversimplified model, but the core concept to take away is that much like in node's event loop, each process gets what it "needs" and then a process like your node app tries to execute whenever possible by filling in the gaps.
So when your node processes says its taking 100% of cpu, that isn't accurate, otherwise, nothing else would ever be getting done and the system would crash. Essentially, its taking up all the CPU it can but the OS still determines other stuff to slip in.
If you were to add a second node process that did the same process.nextTick, the OS would try to accommodate both processes and, depending on the amount of work to be done on each node process's event loop, the OS would split up the work accordingly (at least in theory, but in reality would probably just lead to everything slowing down and system instability).
Once again, this is very oversimplified, but hopefully it gives you an idea of what is going on. That being said, I wouldn't recommend using process.nextTick unless you know you need it, if doing something every 5 ms is acceptable, using a setTimeout instead of process.nextTick will save oodles on cpu usage.
Hope that answers your question :D

No. You don't have to artificially pause your processes to let others do their work, your operating system has mechanisms for that. In fact, using process.nextTick in this way will slow your computer down because it has a lot of overhead.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string