I'm facing an issue with my application servers. Assume there are two nodes behind the load balancer.
Suddenly one of those nodes becomes unhealthy.
When I logged in to that instance, there were no logs coming from pm2.
Then I checked its CPU, and it was very high.
Please guide me on how I can fix this issue, or suggest any way to debug it.
Check out flame graphs to see where your Node app is CPU bound.
You can also use the new debugging system in Node 6.3 (--inspect) to debug with the full power of Chrome DevTools.
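As a rough illustration of both approaches (app.js simply stands in for your entry point; these are standard Node.js flags):
node --prof app.js                  # write a V8 CPU profile (isolate-*.log) that third-party tools can render as a flame graph
node --prof-process isolate-*.log   # summarise where CPU time is being spent
node --inspect app.js               # then open chrome://inspect in Chrome and attach DevTools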
PM2 has some limited protection for runaway issues like this via the max-memory-restart option. Typically, high CPU will also correlate with high memory usage and this option can be used to restart your app when it begins consuming large amounts of memory (which in your case may or may not be the correct moment but it should help).
--max-memory-restart <memory> specify max memory amount used to autorestart (in octet or use syntax like 100M)
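As a sketch only (app.js and the 300M threshold are placeholders, not values from the question):
pm2 start app.js --max-memory-restart 300M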
I'm running a Node.js Express application in production. After a few hours of running, in a heap snapshot I can see more than 10 huge TLSWrap objects per worker (these are the largest objects in the application).
Some Technical Aspects
I'm running forever with the cluster module (2 workers).
The application runs inside an AWS EC2 large instance.
Most of the tasks per request are getting data from redis and sending some requests (events) to another server.
Normal memory usage: ~450MB, after a few hours suddenly: 3.5GB (then there is too much latency and my load balancer removes this machine). See Memory usage graph.
Normal CPU usage: 16%, during the memory leak: 99%.
What I've Tried Already
Code refactoring with memory-leak problems in mind (closures, big objects, and minimal string concatenation).
Upgrading node all the way through v0.12.7, v4.1.1, v4.1.2 and v4.2.0.
Some Interesting Insights
The growth of memory usage is not linear but exponential; it happens suddenly and very fast.
I have both permanent instances and also auto-scaling instances (same type) and this memory leak occurs at the same time on all machines.
Traffic (# requests) is not higher than usual during the memory leak.
I've read that sometimes these problems can be the result of letting the application keep running after an uncaughtException, but my uncaughtException handler just logs the error and then immediately calls process.exit(). Isn't that the same as when node crashes and forever automatically restarts it?
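For reference, a minimal sketch of such a handler (the logging call here is just a placeholder for my real logger):
process.on('uncaughtException', function (err) {
  console.error('Uncaught exception:', err.stack || err); // log the error
  process.exit(1);                                        // exit immediately so forever restarts the process
});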
I have another application that's:
Running from the same AWS EC2 AMI.
Has larger number of requests per second.
Has the uncaughtException handler (with process.exit()), too.
But no memory leaks at all!
Any ideas?
Thanks,
I believe that your memory leak is caused by something other than the TLSWrap objects, probably in your application layer.
According to this recently closed node issue, https://github.com/nodejs/node/issues/4250, TLSWrap has been incorrectly reporting its size as a large number (a pointer cast to an int). The actual size of TLSWrap objects is much smaller.
I was also seeing very large TLSWrap objects in my heapdumps, but after upgrading to node 5.3.0 (which includes the fix, https://github.com/nodejs/node/pull/4268), I can confirm that they are now correctly shown as quite small in my heapdumps.
I have a gameserver.js file that is well over 100 KB in size. I kept checking my task manager after each refresh in my browser and saw node.exe's memory usage rise with every refresh. I'm using the ws module (https://github.com/websockets/ws) and figured, you know what, there is most likely a memory leak in my code somewhere...
So to double-check and isolate the issue, I created a test.js file and put in the default ws code block:
var WebSocketServer = require('ws').Server
  , wss = new WebSocketServer({ port: 9300 });

wss.on('connection', function connection(ws) {
  ws.on('message', function incoming(message) {
    console.log('received: %s', message);
  });
});
And started it up:
Now, I check node.exe's memory usage:
The part that confuses me is the increase:
If I refresh the browser tab that connects to this WebSocket server on port 9300 and then look back at my task manager, it shows:
Which is now at: 14,500 K.
And it keeps on rising upon each refresh, so theoretically if I keep just refreshing it will go through the roof. Is this intended? Is there a memory leak in the ws module somewhere maybe? The whole reason I ask is because I thought maybe in a few minutes or when the user closes the browser it will go back down, but it doesn't.
And the core reason why I wanted to do this test because I figured I had a memory leak issue in my personal code somewhere and just wanted to check if it wasn't me, or vice versa. Now I'm stumped.
Seeing an increased memory footprint in a Node.js application is completely normal behaviour. Node.js constantly analyses your running code, generates optimised code, reverts to unoptimised code (if needed), etc. All this requires quite a lot of memory, even for the most simple of applications (Node.js itself is in large part written in JavaScript that undergoes the same optimisations/deoptimisations as your own code).
Additionally, a process may be granted more memory when it needs it, but many operating systems remove that allocated memory from the process only when they decide it is needed elsewhere (i.e. by another process). So an application can, in peaks, consume 1 GB of RAM, then garbage collection kicks in, usage drops to 500 MB, but the process may still keep the 1 GB.
Detecting presence of memory leaks
To properly analyse memory usage and memory leaks, you must use Node.js's process.memoryUsage().
You should set up an interval that dumps this memory usage into a file, e.g. every second, then apply some "stress" to your application over several seconds (e.g. for web servers, issue several thousand requests). Then take a look at the results and see whether the memory just keeps increasing or follows a steady pattern of increasing/decreasing.
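A minimal sketch of such a sampler (the file name and the one-second interval are arbitrary choices):
var fs = require('fs');
setInterval(function () {
  // Append rss/heapTotal/heapUsed (in bytes) with a timestamp, once per second.
  fs.appendFile('memory-usage.log', Date.now() + ' ' + JSON.stringify(process.memoryUsage()) + '\n', function () {});
}, 1000);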
Detecting source of memory leaks
The best tool for this is likely node-heapdump. You use it with the Chrome debugger; a minimal usage sketch follows the steps below.
Start your application and apply initial stress (this is to generate optimised code and "warm-up" your application)
While the app is idle, generate a heapdump
Perform a single, additional operation (e.g. one more request) that you suspect will likely cause a memory leak - this is probably the trickiest part, especially for large apps
Generate another heapdump
Load both heapdumps into Chrome debugger and compare them - if there is a memory leak, you will see that there are some objects that were allocated during that single request but were not released afterwards
Inspect the object to determine where the leak occurs
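The sketch mentioned above, assuming the node-heapdump package (the file names are arbitrary):
var heapdump = require('heapdump');

// After warm-up, while the app is idle:
heapdump.writeSnapshot('before.heapsnapshot');

// ...perform the single suspect operation, then:
heapdump.writeSnapshot('after.heapsnapshot');

// Load both .heapsnapshot files into the Chrome DevTools Memory tab and use the
// "Comparison" view to find objects allocated between the two dumps but never released.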
I had the opportunity to investigate a reported memory leak in the Sails.js framework - you can see a detailed description of the analysis (including pretty graphs, etc.) on that issue.
There is also a detailed article about working with heapdumps by StrongLoop - I suggest having a look at it.
The garbage collector is not called all the time because it blocks your process. So V8 launches GC when it thinks it's necessary.
To find out whether you have a memory leak, I propose firing up the GC manually after every request just to see if your memory is still going up. Normally, if you don't have a memory leak, your memory should not keep increasing, because the GC will clean up all unused objects. If your memory is still going up after a GC call, you have a memory leak.
To launch the GC manually you can do the following, but be careful! Don't use this in production; this is just a way to clean up your memory and see if you have a memory leak.
Launch Node.js like this:
node --expose-gc --always-compact test.js
It will expose the garbage collector and force it to be aggressive. Call this method to run the GC:
global.gc();
Call this method after each hit on your server and see whether the GC cleans up the memory or not.
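A rough sketch of what that could look like in an Express app (assuming the process was started with --expose-gc; the logging is just for illustration):
app.use(function (req, res, next) {
  res.on('finish', function () {
    global.gc();                                                        // force a full GC after the response is sent
    console.log('heapUsed after GC:', process.memoryUsage().heapUsed);  // should stay roughly flat if nothing leaks
  });
  next();
});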
You can also do two heapdumps of your process before and after request to see the difference.
Don't use this in production or in your project. It is just a way to see if you have a memory leak or not.
Background
I have a relatively simple Node.js application (essentially just expressjs + mongoose). It is currently running in production on an Ubuntu Server and serves about 20,000 page views per day.
Initially the application was running on a machine with 512 MB memory. Upon noticing that the server would essentially crash every so often I suspected that the application might be running out of memory, which was the case.
I have since moved the application to a server with 1 GB of memory. I have been monitoring the application and within a few minutes the application tends to reach about 200-250 MB of memory usage. Over longer periods of time (say 10+ hours) it seems that the amount keeps growing very slowly (I'm still investigating that).
I have since been trying to figure out what is consuming the memory. I have been going through my code and have not found any obvious memory leaks (for example, unclosed db connections and such).
Tests
I have implemented a handy heapdump function using node-heapdump and I have now enabled --expose-gc to be able to manually trigger garbage collection. From time to time I try triggering a manual GC to see what happens with the memory usage, but it seems to have no effect whatsoever.
I have also tried analysing heapdumps from time to time - but I'm not sure if what I'm seeing is normal or not. I do find it slightly suspicious that there is one entry with 93% of the retained size - but it just points to "builtins" (not really sure what that signifies).
Upon inspecting the 2nd highest retained size (Buffer) I can see that it links back to the same "builtins" via a setTimeout function in some Native Code. I suspect it is cache or https related (_cache, slabBuffer, tls).
Questions
Does this look normal for a Node JS application?
Is anyone able to draw any sort of conclusion from this?
What exactly is "builtins" (does it refer to builtin js types)?
This problem is killing the stability of my production servers.
To recap, the basic idea is that my node server(s) sometimes intermittently slow down, sometimes resulting in Gateway Timeouts. As best as I can tell from my logs, something is blocking the node thread (meaning that the incoming request is not accepted), but I cannot for the life of me figure out what.
The problem ranges in severity. Sometimes what should be <100ms requests take ~10 seconds to complete; sometimes they never even get accepted by the node server at all. In short, it is as though some random task is working and blocking the node thread for a period of time, thus slowing down (or even blocking) incoming requests; the one thing I can say for sure is that the need-to-fix-symptom is a "Gateway Timeout".
The issue comes and goes without warning. I have not been able to correlate it against CPU usage, RAM usage, uptime, or any other relevant statistic. I've seen the servers handle a large load fine, and then have this error with a small load, so it does not even appear to be load-related. It is not unusual to see the error around 1am PST, which is the smallest load time of the day! Restarting the node app does seem to maybe make the problem go away for a while, but that really doesn't tell me much. I do wonder if it might be a bug in node.js... not very comforting, considering it is killing my production servers.
The first thing I did was to make sure I had upgraded node.js to the latest (0.8.12), as well as all my modules (here they are). Of course, I also have plenty of error catchers in place. I'm not doing anything funky like printing out lots to the console or writing to lots of files.
At first, I thought it was outbound HTTP requests blocking the incoming socket, because the express middleware was not even picking up the inbound request, but I gave up the theory because it looks like the node thread itself became busy.
Next, I went through all my code with JSHint and fixed literally every single warning, including a few accidental globals (forgetting to write "var"), but this didn't help.
After that, I assumed that perhaps I was running out of memory. But, my heap snapshots via nodetime are looking pretty good now (described below).
Still thinking that memory might be an issue, I took a look at garbage collection. I enabled the --nouse-idle-notification flag and did some more code optimization to null out objects when they were no longer needed.
Still convinced that memory was the issue, I added the --expose-gc flag and executed the gc(); command every minute. This did not change anything, except to occasionally make requests a bit slower perhaps.
In a desperate attempt, I setup the "cluster" module to use 2 workers and automatically restart them every 30 min. Still, no luck.
I increased the ulimit to over 10,000 and kept an eye on the open files. There seem to be < 300 open files (or sockets) per node.js app, and increasing the ulimit thus had no impact.
I've been logging my server with nodetime and here's the gist of it:
CentOS 5.2 running on the Amazon Cloud (m1.large instance)
Greater than 5000 MB free memory at all times
Less than 150 MB heap size at all times
CPU usage is less than 60% at all times
I've also checked my MongoDB servers, which have <5% CPU usage and no requests are taking > 100ms to complete, so I highly doubt there's a bottleneck.
I've wrapped (almost) all my code using Q-promises (see code sample), and of course have avoided Sync() calls like the plague. I've tried to replicate the issue on my testing server (OSX), but have had little luck. Of course, this may be just because the production servers are being used by so many people in so many unpredictable ways that I simply cannot replicate via stress tests...
Many months after I first asked this question, I found the answer.
In a nutshell, the problem was that I was not piping a big asset when transferring it from one server to another. In other words, I was downloading an image from one server before uploading it to an S3 bucket. Instead of streaming the download into the upload, I downloaded the file into memory and then uploaded it.
I'm not sure why this did not show up as a memory spike, or elsewhere in my statistics.
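For illustration, the streaming approach looks roughly like this (a hedged sketch using https and the AWS SDK v2; the URL, bucket and key are placeholders):
var https = require('https');
var AWS = require('aws-sdk');
var s3 = new AWS.S3();

// Pipe the download straight into the S3 upload instead of buffering the whole file in memory.
https.get('https://example.com/big-image.jpg', function (res) {
  s3.upload({ Bucket: 'my-bucket', Key: 'big-image.jpg', Body: res }, function (err) {
    if (err) console.error('upload failed', err);
  });
});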
My guess is Mongoose. If you are storing large payloads in Mongo, Mongoose can be pretty slow due to how it builds the Mongoose objects. See https://github.com/LearnBoost/mongoose/issues/950 for more details on the problem. If this is the problem you wouldn't see it in Mongo itself since the query returns quickly, but object instantiation could take 75x the query time.
Try setting up timers (process.hrtime()) around where the Mongoose objects are being created to see if that might be the problem. If it is, I would switch to using the node Mongo driver directly instead of going through Mongoose.
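A small sketch of that kind of timing (the model and query are placeholders):
var start = process.hrtime();
User.find({ active: true }, function (err, users) {
  var diff = process.hrtime(start);   // [seconds, nanoseconds] elapsed since start
  console.log('query + hydration took %d ms', diff[0] * 1e3 + diff[1] / 1e6);
});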
You are heavily leaking memory, try setting every object to null as soon as you don't need it anymore! Read this.
More information about hunting down memory leaks can be found here.
Give special attention to having multiple references to the same object and check if you have circular references, those are a pain to debug but will help you very much.
Try invoking the garbage collector manually every minute or so (I don't know if you can do this in node.js because I'm more of a C++ and PHP coder). From my years of experience working with C++ I can tell you the most likely cause of your application slowing down over time is memory leaks; find them and plug them, and you'll be ok!
Also, assuming you're not caching and/or processing images, audio or video in memory or anything like that, a 150 MB heap is a lot! That could be hundreds of thousands or even millions of small objects.
You don't have to be running out of memory for your application to slow down... just searching for free memory with that many objects already allocated is a huge job for the memory allocator, it takes a lot of time to allocate each new object and as you leak more and more memory that time only increases.
Is "--nouse-idle-connection" a mistake? do you really mean "--nouse_idle_notification".
I think it's maybe some issues about gc with too many tiny objects.
Node is a single process, so watching the busiest CPU core is much more important than watching the overall load.
When your program is slow, you can execute "gdb node <pid>" and then "bt" to see what node is busy doing.
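Roughly (the node binary path and the PID are placeholders for your own):
gdb /usr/local/bin/node 12345   # attach to the running node process
(gdb) bt                        # print the native backtrace of whatever it is busy doing
(gdb) detach
(gdb) quit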
What I'd do is set up a parallel node instance on the same server with some kind of echo service and test that one. If it runs fine, you narrow the problem down to your program code (and not a scheduler/OS-level problem). Then, step by step, include the modules and test again. Certainly this is a lot of work, takes a long time, and I don't know if it is doable on your system.
If you need to get this working now, you can go the NASA redundancy route:
Bring up a second copy of your production servers, and put a proxy in front of them which routes each request to both stacks and returns the first response. I don't recommend this as a perfect long-term solution but it should help significantly reduce issues in production now, and help you gather log data that you could replay to recreate the issues on non-production servers.
Obviously, this is straightforward for read requests, but more complex for commands which write to the db.
We had a similar problem with our Node.js server. It didn't scale well for weeks, and we had tried almost everything, just as you had. Our problem was the implicit backlog value, which is set very low for highly concurrent environments.
http://nodejs.org/api/http.html#http_server_listen_port_hostname_backlog_callback
Setting the backlog to a significantly higher value (e.g. 10000), as well as tuning networking in our kernel (/etc/sysctl.conf on Linux) as described in the manual section, helped a lot. Since then we haven't had any timeouts in our Node.js server.
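A minimal sketch of passing a larger backlog (the port and the 10000 value are just examples):
var http = require('http');
var server = http.createServer(function (req, res) { res.end('ok'); });

// server.listen(port, hostname, backlog, callback) - the third argument is the pending-connection queue length.
// On Linux, also raise net.core.somaxconn (e.g. in /etc/sysctl.conf), otherwise the kernel caps the backlog.
server.listen(8080, '0.0.0.0', 10000, function () {
  console.log('listening with a backlog of 10000');
});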