Bit of back story:
My Node app, called Sleepychat (in production, hosted on Heroku), recently started reporting Memory Limit Reached errors (at 512 MB). Luckily, Heroku tolerates this and won't crash the app until the error has occurred five times, so that's good. However, I find it strange that this started only after a recent push to Heroku which involved nothing more than adding a page. My theory was that Heroku might have used a newer version of Node on the update, but it seems that isn't the case: I downgraded and the error still occurs.
My source can be found on GitHub. Line 185 is the connection code, and line 1553 is the disconnect code.
Now, I've had some time to run through the code using New Relic and Node Inspector to profile it, and it seems that any connection, whether allowed or immediately refused and disconnected, allocates a small amount of memory which isn't released upon disconnection.
For example, I navigate to localhost, the page connects on document ready, and New Relic shows a slight increase in memory, usually about 500 KB (which still seems large to me, but I'm guessing that's mostly the socket object). When I disconnect, no change occurs in New Relic. What's more interesting is that I can spam page refreshes to rapidly spike memory usage. Earlier I mentioned that it doesn't matter whether or not the connection is refused; by this I meant that after 3 connections within 15 seconds of each other, the site will immediately refuse and disconnect new connections.
Despite this, page refreshes (which automatically connect) still increase RAM. I've looked through the code many times to see whether I'm failing to release something, but as far as I can tell everything should be managed by the scope, so I have no idea what's leaking.
I've noticed some odd behavior since adding New Relic, though. Firstly, once Sleepychat hits its 512 MB limit, Heroku reports a Memory Limit Reached error, but New Relic displays as if a large chunk (~100 MB) had been released; despite that display, Heroku later reports even higher usage on its next error. Secondly, New Relic reports a drop in memory usage whenever I use Node Inspector to record heap allocations, but usage goes right back up once I stop the recording.
Here's a snapshot file from Node Inspector (actually an allocation timeline); ten seconds in is when I start spamming page refreshes. Additionally, here's an image of what New Relic reports.
Snapshot: Snapshot File
New Relic:
I have no idea what's going on. Anyone have any ideas?
Related
I am hosting a game website on Heroku that runs on a NodeJS Server. Clients are connected via sockets using the package socket.io.
Once in a while, when the garbage collection cycle is triggered, connected clients experience severe lag and, often, disconnections. Clients experience this as delayed incoming chat and delayed inputs to the game.
When I look into the logs, I find error messages relating to garbage collection; please see the attached logs below. When these GC events happen, they sometimes cause massive memory spikes to the point where the app exceeds its allowed 0.5 GB of RAM and is killed by Heroku. Lately, however, the memory spikes don't occur as often, but the severe lag on the client side still happens around once or twice a day.
One aspect of the lag shows up in chat. When a user types a message in "All Chat" (or any chat channel), the server currently console.log()'s it to standard out. I happened to be watching the logs live during one spike event and noticed that chat was being output to the terminal in real time with no delay; however, clients (I was also on the website myself as a client) received these messages in a very delayed fashion.
I found online a Node.js bug (which I think has been fixed) that would cause severe lag when too much was being console.log()ed to the screen, so I ran a stress test by sending 1,000 messages per second from the client for a minute. I could not reproduce the spike.
I have read many guides on finding memory leaks, inspecting the stack, etc., but I'm very unsure how to run these tests on a live Heroku server. I suspect that my game objects, on closing, are not being cleared out immediately but are instead all being cleared at once, causing the memory spikes, but I am not confident. I don't know how best to debug this. It is also difficult for me to catch this happening live, as it only happens when 30+ people are logged in (which isn't often, as this is still a fairly small site).
The error messages include references to the circular-json module I use, and I also suspect that this may be causing infinite callbacks on itself somehow and not clearing out correctly, but I am not sure.
For reference, here is a copy of the source code: LINK
Here is a snippet of the memory when a spike happens:
Memory spike
Crash log 1: HERE
Crash log 2: HERE
Is there a way I can simulate sockets or simulate the live server's environment (i.e. connected clients) locally?
Any advice on how to approach or debug this problem would be greatly appreciated. Thank you.
Something to consider is that console.log will increase memory usage; if you are logging verbosely with large amounts of data, this can accumulate. Looking quickly at the log, it seems you are running out of memory. That would mean the app starts writing to disk (swapping), which is slower, and garbage collection will also run, spiking CPU.
This could mean a memory-leak due to resources not being killed/closed and simply accumulating. Debugging this can be a PITA.
Node uses up to about 1.5 GB by default to keep long-lived objects around. It seems you're on a 512 MB container, so it's best to configure the web app to start like:
web: node --optimize_for_size --max_old_space_size=460 server.js
While you need to get to the bottom of the leak, you can also increase availability by running more than one worker and more than one Node instance, using socket.io-redis to keep the instances in sync. I highly recommend this route.
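A minimal sketch of wiring up the adapter, assuming socket.io 1.x with the socket.io-redis package and a local Redis instance (the host and port here are placeholders):

var io = require('socket.io')(3000);
var redisAdapter = require('socket.io-redis');

// Every Node instance pointed at the same Redis server shares
// broadcasts and rooms, so clients on different workers stay in sync.
io.adapter(redisAdapter({ host: 'localhost', port: 6379 }));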
Some helpful content on Node.js memory on Heroku.
You can also spin up multiple connections via a Node script using socket.io-client to interact with your local dev server, monitor the memory locally, and add logging to ensure connections are being closed correctly.
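A rough sketch of such a script, assuming the dev server is a socket.io server on localhost:3000 (the connection count and timings are arbitrary):

var io = require('socket.io-client');

// Open 100 connections, hold them for 5 seconds, then close them all.
// Watch the server's memory before, during, and after: it should fall
// back towards the baseline once the sockets disconnect.
var sockets = [];
for (var i = 0; i < 100; i++) {
  sockets.push(io('http://localhost:3000', { forceNew: true }));
}

setTimeout(function () {
  sockets.forEach(function (socket) {
    socket.disconnect();
  });
  console.log('all ' + sockets.length + ' sockets disconnected');
}, 5000);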
I ended up managing to track my "memory leak" down. It turns out I was saving the games (in JSONified strings) to the database too frequently and the server/database couldn't keep up. I've reduced the frequency of game saves and I haven't had any issues.
The tips provided by Samuel were very helpful too.
I have a gameserver.js file that is well over 100 KB in size. I kept checking my task manager after each browser refresh and saw node.exe's memory usage rise with every refresh. I'm using the ws module (https://github.com/websockets/ws) and figured, you know what, there is most likely a memory leak in my code somewhere...
So to double check and isolate the issue I created a test.js file and put in the default ws code block:
var WebSocketServer = require('ws').Server;
var wss = new WebSocketServer({ port: 9300 });

wss.on('connection', function connection(ws) {
  ws.on('message', function incoming(message) {
    console.log('received: %s', message);
  });
});
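For reference, the browser side of a test like this can be as simple as the following (a hypothetical test page; the question doesn't show it):

// Opens a fresh connection on every page load/refresh.
var ws = new WebSocket('ws://localhost:9300');
ws.onopen = function () {
  ws.send('hello from the browser');
};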
And started it up:
Now, I check node.exe's memory usage:
The incremental growth is the part that confuses me:
If I refresh the browser page that connects to this port-9300 WebSocket server and then look back at my task manager, it shows:
Which is now at: 14,500 K.
And it keeps rising with each refresh, so theoretically, if I just keep refreshing, it will go through the roof. Is this intended? Is there perhaps a memory leak in the ws module somewhere? The whole reason I ask is that I thought it would go back down after a few minutes or when the user closes the browser, but it doesn't.
And the core reason I wanted to run this test was that I figured I had a memory leak somewhere in my personal code and wanted to check whether it was me or not. Now I'm stumped.
Seeing an increased memory footprint in a Node.js application is completely normal behaviour. Node.js constantly analyses your running code, generates optimised code, reverts to unoptimised code (if needed), etc. All this requires quite a lot of memory, even for the simplest of applications (Node.js itself is in large part written in JavaScript, which goes through the same optimisations/deoptimisations as your own code).
Additionally, a process may be granted more memory when it needs it, but many operating systems only take allocated memory back from the process when they decide it is needed elsewhere (i.e. by another process). So an application can, at peak, consume 1 GB of RAM; then garbage collection kicks in and usage drops to 500 MB, but the process may still keep the whole 1 GB.
Detecting presence of memory leaks
To properly analyse memory usage and memory leaks, you must use Node.js's process.memoryUsage().
You should set up an interval that dumps this memory usage into a file, e.g. every second, then apply some "stress" to your application over several seconds (for a web server, issue several thousand requests). Then take a look at the results and see whether the memory just keeps increasing or follows a steady pattern of increase and decrease.
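A minimal sketch of such a sampler, assuming output to a local memory.log file (the filename and the one-second interval are arbitrary):

var fs = require('fs');

// Append a timestamped sample every second. heapUsed is the number
// to watch for leaks; rss also counts memory the OS has granted
// but V8 may no longer be using.
setInterval(function () {
  var mem = process.memoryUsage();
  fs.appendFile(
    'memory.log',
    Date.now() + ' rss=' + mem.rss + ' heapUsed=' + mem.heapUsed + '\n',
    function (err) { if (err) console.error(err); }
  );
}, 1000);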
Detecting source of memory leaks
The best tool for this is likely node-heapdump, used together with the Chrome debugger. The workflow is as follows (a minimal sketch of taking the dumps appears after the list):
Start your application and apply initial stress (this is to generate optimised code and "warm-up" your application)
While the app is idle, generate a heapdump
Perform a single, additional operation (e.g. one more request) that you suspect is likely to cause a memory leak - this is probably the trickiest part, especially for large apps
Generate another heapdump
Load both heapdumps into Chrome debugger and compare them - if there is a memory leak, you will see that there are some objects that were allocated during that single request but were not released afterwards
Inspect the object to determine where the leak occurs
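A minimal sketch of steps 2 and 4, using node-heapdump's writeSnapshot API (the filename is arbitrary):

var heapdump = require('heapdump');

// Writes a V8 heap snapshot to disk; load the resulting
// .heapsnapshot files in Chrome DevTools and diff them.
heapdump.writeSnapshot(Date.now() + '.heapsnapshot', function (err, filename) {
  if (err) console.error(err);
  else console.log('heap dump written to ' + filename);
});

With the module loaded, you can also trigger a snapshot externally on Linux by sending the process a USR2 signal (kill -USR2 <pid>), which is handy for dumping a running server at exactly the right moment.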
I had the opportunity to investigate a reported memory leak in the Sails.js framework - you can see a detailed description of the analysis (including pretty graphs, etc.) on this issue.
There is also a detailed article about working with heapdumps by StrongLoop - I suggest to have a look at it.
The garbage collector is not called all the time because it blocks your process. So V8 launches GC when it thinks it's necessary.
To find out whether you have a memory leak, I propose firing the GC manually after every request just to see if your memory is still going up. Normally, if you don't have a memory leak, your memory should not keep increasing, because the GC will clean up all unused objects. If your memory is still going up after a GC call, you have a memory leak.
To launch the GC manually you can do the following - but be careful! Don't use this in production; it is just a way to clean up your memory and see whether you have a memory leak.
Launch Node.js like this:
node --expose-gc --always-compact test.js
It will expose the garbage collector and force it to be aggressive. Call this method to run the GC:
global.gc();
Call this method after each hit on your server and see whether the GC cleans up the memory or not.
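A rough sketch of this with a bare http server (a hypothetical stand-in for your own app; it requires the --expose-gc flag shown above):

var http = require('http');

http.createServer(function (req, res) {
  res.end('ok');

  // Force a full GC after the response, then log what survives.
  // If heapUsed keeps climbing across requests, something leaks.
  global.gc();
  console.log('heapUsed after GC: ' + process.memoryUsage().heapUsed);
}).listen(8080);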
You can also do two heapdumps of your process before and after request to see the difference.
Don't use this in production or in your project. It is just a way to see if you have a memory leak or not.
Background
I have a relatively simple node js application (essentially just expressjs + mongoose). It is currently running in production on an Ubuntu Server and serves about 20,000 page views per day.
Initially the application was running on a machine with 512 MB memory. Upon noticing that the server would essentially crash every so often I suspected that the application might be running out of memory, which was the case.
I have since moved the application to a server with 1 GB of memory. I have been monitoring the application and within a few minutes the application tends to reach about 200-250 MB of memory usage. Over longer periods of time (say 10+ hours) it seems that the amount keeps growing very slowly (I'm still investigating that).
I have since been trying to figure out what is consuming the memory. I have been going through my code and have not found any obvious memory leaks (for example, unclosed DB connections and such).
Tests
I have implemented a handy heapdump function using node-heapdump, and I have enabled --expose-gc so that I can manually trigger garbage collection. From time to time I trigger a manual GC to see what happens with the memory usage, but it seems to have no effect whatsoever.
I have also tried analysing heapdumps from time to time, but I'm not sure whether what I'm seeing is normal. I do find it slightly suspicious that there is one entry with 93% of the retained size - but it just points to "builtins" (I'm not really sure what that signifies).
Upon inspecting the 2nd highest retained size (Buffer), I can see that it links back to the same "builtins" via a setTimeout function in some native code. I suspect it is cache- or HTTPS-related (_cache, slabBuffer, tls).
Questions
Does this look normal for a Node JS application?
Is anyone able to draw any sort of conclusion from this?
What exactly is "builtins" (does it refer to builtin js types)?
My Node.js application occasionally fails with a "Segmentation fault" and I am at a loss as to how to diagnose the cause. As mentioned below, I have dramatically reduced the failures' frequency by raising MaxListeners, but I have not made the problem go away.
The application runs on a BeagleBone Black under Node v0.8.22 and uses Socket.io to communicate realtime data to browser pages displayed on the BBB's LCD display. It collects data from a sensor connected via I2C using the korevec/node-i2c library. I have, however, isolated that library and still have failures.
The failures generally occur after I have streamed data to the client for a while, though they will occasionally (rarely) happen at other times as well. This is not surprising, as my app uses socket.io to communicate on almost all pages, but the streaming page runs at a much higher volume.
I am getting the below message:
(node) warning: possible EventEmitter memory leak detected. 11 listeners added. Use emitter.setMaxListeners() to increase limit.
Trace
and have been doing so since day 1. I was also seeing the symptoms of a memory leak, but since I raised MaxListeners, memory usage has stayed constant. The failure rate after making this change went down dramatically but has not gone away completely. I am using socket.io on top of http and raised MaxListeners for both socket.io and http.
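For reference, raising the limit looks like the following; which emitter needs it depends on where the warning's stack trace points (the value 20 is arbitrary, and 0 disables the check entirely):

var http = require('http');
var server = http.createServer();

// Raise the threshold at which the EventEmitter leak warning fires.
// Note this only hides the symptom; listeners that are added but
// never removed still accumulate and leak.
server.setMaxListeners(20);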
How does one go about diagnosing this problem? Is the memory leak error related? I can post code but there is quite a bit of it and I am not sure what parts are most relevant.
Thanks for any help,
Cheers, Will
Our website is in .NET, but with some old ASP and 32-bit libraries in it too. It had been working fine for a while (two years). But for the past month, we have seen the following error on our IIS7 server, which we have been unable to track down and fix:
"Faulting application w3wp.exe, version 7.0.6001.18000, time stamp 0x47919413, faulting module kernel32.dll, version 6.0.6001.18215, time stamp 0x4995344f, exception code 0xe053534f, fault offset 0x0002f328, process id 0x%9, application start time 0x%10."
We are able to reproduce the error:
One of our .ASPX pages starts loading, executing code and queries (we have response.flush() all over the page to track where the code breaks), then it suddenly stops and we get the above error in IIS.
The page stops loading and, without the response.flush(), it's not redirecting to our error.aspx page (as configured in web.config)
The error does NOT happen all the time. Sometimes it happens 3 times in a row; then it works fine for 15 minutes non-stop with a proper redirection to error.aspx.
The error we get then is a classic: "Either BOF or EOF is True, or the current record has been deleted."
When the error occurs, the page hangs, and all other sessions on the same computer, from any browser, have hanging web pages as well (BTW, we only allow one worker process while we are testing). From other computers, the site loads fine.
I can recycle the application pool, kill w3wp.exe, or restart IIS - nothing will do. The only way to successfully load the page again is to restart MS SQL Server, which handles our session state. I don't know why this is, but we guessed that the session cookies on the users' browsers point to a thread which was not terminated properly (due to the above crash) and IIS is waiting for it to terminate before processing more code (?). If someone can explain this better, that would be really helpful. Is there a timeout we can set to "terminate" threads? Is it an MS SQL-related issue?
I have also looked at the private and virtual memory usage, because I think our code is not the most efficient and I am certain we have remaining memory leaks. However, I have seen the page crash even when both private and virtual memory were still quite low (under 100 MB each).
I have used Debug Diag and WinDbg as indicated here: http://blogs.msdn.com/b/tess/archive/2009/03/20/debugging-a-net-crash-with-rules-in-debug-diag.aspx, but we are not able to make windbg work, this is what we are trying to do at the moment.
If someone could help us or point us toward the right direction that would be really great, thank you.
"Either BOF or EOF is True, or the current record has been deleted" means the table is empty and you are attempting to do a MoveNext. So check for eof before you do any moves.
IIS is notorious for throwing kernel errors in w3wp.exe like this one. All your errors in session state are just symptoms of the crashed process. Multiple APP pools won't help much - they just spread the error around.
I'd wager it is SQL deadlocks due to your user environment changing. A deadlock causes a 10-second lag while SQL tries to determine which query to kill off; one wins, one loses. The loser gets back a pointer to an unexpectedly empty table, you try a move, and the crash follows. You could maybe point your DB at an ODBC connection and turn on tracing, or figure out a way to get SQL to log it.
I had all the same symptoms as above in Perl. I was able to write a wrapper fn() that performed all SQL queries and logged every statement, its parameters, and any errors to disk to track down the problem. It was deadlocks; we were then able to code in auto-retry, and eventually we recoded the query order and scanned columns to eliminate the deadlocks.
It's entirely possible one of your referenced/linked assemblies somewhere has randomly become corrupt on disk (it can happen). Can you try to replicate the problem on a new, clean machine with the same specs and fresh installs of the latest xyz drivers you're using?
I solved a mysterious problem that took me months to isolate this way. Clean, new machines with the same specs and prerequired drivers would work just fine - only some older machines with the same specs were failing consistently. I ended up uninstalling everything (IIS, ASP.NET, .NET, database and client) and starting from scratch. The root cause, when I isolated it, was that the DB client driver was corrupt on the older machines (all the older machines were clones of each other, so I assume they were cloned after the corruption occurred), and it seemed to be messing with the .NET memory space even when I wasn't calling it directly. I have yet to even reply to my "help me debug this monster" post with this answer, because I doubted it would ever help anyone.
We started receiving this error after installing Windows updates on a Windows Server 2008 R2 machine. The Windows Process Activation Service (WAS) installs some additional site bindings that caused issues for our setup.
We removed net.tcp, net.pipe, net.msmq, and msmq.formatname bindings from our website and no longer got the faulting application exception.
This is probably an edge case, but just in case someone comes here using MVCMailer: I was getting this same error due to the .SendAsync() method on the mailers.
I switched them all to .Send() and the crashing stopped.
See this SO answer for ways to use the mailer asynchronously and avoid the crash (allegedly - I did not personally implement it).