How to debug a NodeJS blocked event loop?

We have a NodeJS/Express server running in production, and occasionally, all requests are getting blocked. The web requests are being received, but not processed (and they eventually all time out). After a few minutes, it'll begin accepting requests again, but then almost immediately begin blocking like before.
We've been trying to reproduce the issue locally, but we can't reproduce it or determine the cause. My guess is that the event loop is getting blocked by a synchronous operation that either takes too long to complete or never completes at all.
Are there any ways to debug a live production system and figure out what's causing the block? I've searched, but could only find solutions for local development. Is my best option to look back at the logs, find the last request that completed before the blocking started, and debug from there?
Using Node 6.2.2, Express 4.13.4, and running on Heroku.
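
One low-overhead way to confirm this diagnosis in production is to measure event-loop lag directly: schedule a repeating timer and log whenever it fires much later than requested. A minimal sketch (the interval and threshold below are arbitrary choices, not anything from the original setup):

// Event-loop lag monitor (sketch): a 500 ms interval timer should
// fire roughly on time; if it fires much later, the loop was blocked
// by synchronous work in the meantime.
const CHECK_INTERVAL_MS = 500;  // arbitrary sampling interval
const LAG_THRESHOLD_MS = 200;   // arbitrary "worth logging" cutoff

let last = Date.now();
setInterval(() => {
  const lag = Date.now() - last - CHECK_INTERVAL_MS;
  if (lag > LAG_THRESHOLD_MS) {
    console.warn(`event loop blocked for ~${lag} ms`);
  }
  last = Date.now();
}, CHECK_INTERVAL_MS);

Logging the lag alongside request logs makes it possible to correlate a stall with whatever request was being handled when it began.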

Related

REST API In Node Deployed as Azure App Service 500 Internal Server Errors

I have looked at the request trace for several requests that resulted in the same outcome.
What will happen is I'll get a HttpModule="iisnode", Notification="EXECUTE_REQUEST_HANDLER", HttpStatus=500, HttpReason="Internal Server Error", HttpSubstatus=1013, ErrorCode="The pipe has been ended. (0x6d)"
This is a production API. Fewer than 1% of requests get this result but it's not the requests themselves - I can reissue the same request and it'll work.
I log telemetry for every API request - basics on the way in, things like http status and execution time as the response is on its way out.
None of the requests that get this error appear in the telemetry, which makes me think something is happening somewhere between IIS and iisnode.
If anyone has resolved this or has solid thoughts on how to pin down what the root issue is I'd appreciate it.
Well, for me, what's described here covered the bulk of the issue: github.com/Azure/iisnode/issues/57. Setting keepAliveTimeout to 0 on the Express server reduced the 500s significantly.
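
For reference, that setting lives on the HTTP server object that Express's listen() returns, not on the Express app itself. A minimal sketch (note that keepAliveTimeout exists on http.Server from Node 8 onward):

const express = require('express');
const app = express();

const server = app.listen(process.env.PORT || 3000);
// Disable the keep-alive timeout entirely, per the iisnode issue above.
server.keepAliveTimeout = 0;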
Once the majority of the "noise" was eliminated, it was much easier to correlate the remaining 500s with things I could see in my logs. For example, I'm using a third-party node package to resize images, and a couple of the "images" that were loaded into the system weren't actually images. Instead of gracefully throwing an exception, the package seems to exit the running node process. True story. So on Azure it would get restarted, but while that was happening, requests would get a 500 Internal Server Error.

Node app keeps crashing due to exhausted memory after migrated to Windows environment

I am working on a React site that was originally built by someone else. The app uses the WordPress REST API for handling content.
Currently the live app sits on an nginx server running node v6, and it has been working just fine. However, I now have to move the app over to an IIS environment (not by choice) and have had nothing but problems with it. I've finally gotten the app to run as expected, which is great, but now I'm running into an issue with node's memory becoming exhausted.
While debugging this issue I noticed the server's firewall was polling the home route every 5-10 seconds, which fired an API request to the WordPress API each time. The API would then return a pretty large JSON object of data.
So my conclusion was that the firewall was polling the home route too often, which was killing the memory, because the app had to constantly fire API requests and load in huge sets of data over and over.
My solution was to set up a polling route on the node (Express) server that just returns a 200 response and nothing else. This seemed to fix the issue, as the app went from crashing every few hours to lasting over two days. However, after about two days the app crashed again with another memory error. The error looked like this:
Since the app lasted much longer with the polling route added, I assume the firewall polling was/is in fact my issue here. But now that the polling route is in and the app still crashed after a couple of days, I have no idea what to do, which is why I am asking for help.
I am very unfamiliar with working on Windows so I don't know if there are any memory restrictions or any obvious things I could do to help prevent this issue.
Some other notes: I have tried increasing --max-old-space-size to about 8000, but it didn't seem to do anything, so I don't know if I am implementing it wrong. I have tried the following commands when starting the app:
Start-Process npm -ArgumentList "run server-prod --max-old-space-size=8192" -WorkingDirectory C:\node\prod
And when I used forever to handle the process
forever start -o out.log -e error.log .\lib\server\server.js -c "node --max_old_space_size=8000"
Any help on what the issue could be, or tips on what I should look for, would be great. Again, I am very new to working on Windows, so maybe there is just something I am missing.
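
Two quick diagnostics may help narrow this down (a sketch; none of this is from the original setup). First, log V8's configured heap limit at startup: if --max-old-space-size never reaches the node process (easy to get wrong when the flag is passed through npm or forever), the limit will still show the default. Second, sample process.memoryUsage() periodically to correlate heap growth with traffic.

const v8 = require('v8');

// If --max-old-space-size took effect, heap_size_limit reflects it;
// otherwise this prints V8's default limit.
const limitMb = v8.getHeapStatistics().heap_size_limit / 1024 / 1024;
console.log(`V8 heap limit: ${limitMb.toFixed(0)} MB`);

// Log memory once a minute to see whether usage climbs steadily
// (a leak) or spikes with bursts of requests.
setInterval(() => {
  const { rss, heapUsed, heapTotal } = process.memoryUsage();
  console.log(`rss=${(rss / 1048576).toFixed(1)} MB ` +
              `heapUsed=${(heapUsed / 1048576).toFixed(1)} MB ` +
              `heapTotal=${(heapTotal / 1048576).toFixed(1)} MB`);
}, 60 * 1000);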

NodeJS request pipe

I'm struggling with a technical issue, and because I'm pretty new to the NodeJS world, I don't think I have the right practices and tools to solve it.
Using the well-known request module, I'm making a streaming proxy from a remote server to the client. Almost everything works properly up to a certain point: if there are too many requests at the same time, the server no longer responds. It does still receive the client requests, but it is unable to get through the stream processing and serve the content.
What I'm currently doing:
Creating a server with the http module's http.createServer
Getting the remote URL from a PHP script using exec
Instantiating the stream
How I did it:
http://pastebin.com/a2ZX5nRr
I tried to investigate the pooling options but did not understand everything; likewise, the pool's maxSockets option was added recently, but it did not help me. I had also previously set http.globalAgent.maxSockets to Infinity, but I read that this hasn't been limited by default in Node.js for a while, so that doesn't help either.
See here: https://nodejs.org/api/http.html#http_http_globalagent
I also read this: Nodejs Max Socket Pooling Settings, but I'm wondering what the difference is between a custom agent and the global one.
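
Roughly, the difference is scope: http.globalAgent is the shared connection pool used by every request in the process that doesn't specify its own agent, while a custom agent is a private pool (with its own maxSockets) used only by the requests you explicitly pass it to. A minimal sketch, assuming plain http calls (example.com is a placeholder):

const http = require('http');

// Private pool: only requests given this agent share these sockets.
const proxyAgent = new http.Agent({ keepAlive: true, maxSockets: 100 });

http.get({ host: 'example.com', path: '/', agent: proxyAgent }, (res) => {
  res.resume(); // drain the response so its socket returns to the pool
});

// Shared pool: every agent-less request in the process is affected.
http.globalAgent.maxSockets = Infinity;

A custom agent keeps tuning (and exhaustion) contained to one part of the app instead of changing behavior process-wide.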
I believed it could be coming from the server, but I tested on both a very small one and a bigger one, and it wasn't coming from there. I think it's definitely coming from my app, which needs to be better designed. Indeed, each time I restart the app instance, it works again. Also, if I start a fork of the server on another port while the original one is stuck and not serving anything, the fork works. So it might not be about resources.
Do you have any clue, tools or something that may help me to understand and debug what is going on?
An npm module that can help handle streams properly:
https://www.npmjs.com/package/pump
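
A minimal sketch of how pump could replace a bare source.pipe(res) in a proxy like the one above (the URL and server shape are placeholders, not the pastebin code). Unlike pipe on its own, pump destroys both streams and removes the listeners when either side errors or closes early, so a dead client can't leave the remote transfer dangling:

const http = require('http');
const request = require('request');
const pump = require('pump');

http.createServer((req, res) => {
  const remoteUrl = 'http://example.com/some/file'; // placeholder
  pump(request(remoteUrl), res, (err) => {
    if (err) console.error('proxy stream failed:', err.message);
  });
}).listen(8080);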
I made a few tests, and I think I've found what I was looking for: the unpipe mechanism. More info here:
https://nodejs.org/api/stream.html#stream_readable_unpipe_destination
You can read this too; it led me to understand a few things about pipes remaining open when the target fails:
http://www.bennadel.com/blog/2679-how-error-events-affect-piped-streams-in-node-js.htm
So here is what I've done: I'm currently unpiping when the stream's end event is fired. However, I guess you could do this in different ways; it depends on how you want to handle things, but you may also unpipe on an error from the source or the target.
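
For illustration, a hedged sketch of that manual approach (upstream and remoteUrl are hypothetical names, not the pastebin code):

const upstream = request(remoteUrl); // the remote stream being proxied
upstream.pipe(res);

// The approach described above: detach once the source ends.
upstream.on('end', () => upstream.unpipe(res));

// Also detach on failure of either side, so a dead response doesn't
// keep the remote transfer attached.
upstream.on('error', () => upstream.unpipe(res));
res.on('close', () => {
  upstream.unpipe(res);
  upstream.abort(); // request's method for cancelling the transfer
});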
Edit: I still have issues; it seems that the stream is now unpiping when it does not have to. I'll have to double-check this.

Putting a Load on Node

We have a C# Web API server and a Node Express server. We make hundreds of requests from the C# server to a route on the Node server. The route on the Node server does intensive work and often doesn't return for 6-8 seconds.
Making hundreds of these requests simultaneously seems to cause the Node server to fail. Errors in the Node server output include either socket hang up or ECONNRESET. The error from the C# side says
No connection could be made because the target machine actively refused it.
This error occurs after processing an unpredictable number of the requests, which leads me to think it is simply overloading the server. Using a Thread.Sleep(500) on the C# side allows us to get through more requests, and fiddling with the wait there leads to more or less success, but thread sleeping is rarely if ever the right answer, and I think this case is no exception.
Are we simply putting too much stress on the Node server? Can this only be solved with load balancing or some form of clustering? If there is another alternative, what might it look like?
One path I'm starting to explore is the node-toobusy module. If I return a 503 though, what should be the process in the following code? Should I Thread.Sleep and then re-submit the request?
It sounds like your node.js server is getting overloaded.
The route on the Node server does intensive work and often doesn't return for 6-8 seconds.
This is a bad smell: if your node process is doing intense computation, it blocks the event loop until that computation is complete and can't handle any other requests. You should probably have it do that computation in a worker process, which will run on another CPU core if available. cluster is the Node built-in module that lets you do that, so I'll point you there.
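
A minimal cluster sketch (the port and handler are placeholders): the master forks one worker per CPU, so a long synchronous computation blocks only the worker it runs in while the others keep accepting connections:

const cluster = require('cluster');
const http = require('http');
const os = require('os');

if (cluster.isMaster) {
  os.cpus().forEach(() => cluster.fork());
  cluster.on('exit', () => cluster.fork()); // replace a crashed worker
} else {
  http.createServer((req, res) => {
    // the 6-8 second computation would live here, per worker
    res.end(`handled by pid ${process.pid}\n`);
  }).listen(3000);
}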
One path I'm starting to explore is the node-toobusy module. If I return a 503 though, what should be the process in the following code? Should I Thread.Sleep and then re-submit the request?
That depends on your application and your expected load. You may want to retry once or twice if it's likely that things will cool down in that time, but for your API you probably just want to return a 503 from the C# side as well: better to let the client know the server is too busy and let it make its own decision than to keep retrying on its behalf.
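
On the Node side, the node-toobusy idea is usually wired up as the first middleware, shedding load before any real work starts. A sketch, assuming Express and the toobusy-js package (the maintained fork of node-toobusy):

const toobusy = require('toobusy-js');
const express = require('express');
const app = express();

// Shed load before any routes run.
app.use((req, res, next) => {
  if (toobusy()) {
    res.status(503).send('Server too busy; retry later.');
  } else {
    next();
  }
});

toobusy.maxLag(200); // event-loop lag (ms) treated as "too busy"; arbitrary

app.listen(3000);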

Is it possible to pause cherrypy server in order to update static files / db without stopping it?

I have an internal CherryPy server that serves static files and answers XML-RPC requests. All works fine, but 1-2 times a day I need to update these static files and the database. Of course, I could just stop the server, run the update, and start the server again. But this is not very clean, since all the other code that communicates with the server via XML-RPC would see disconnects, and users would see "can't connect" in their browsers. It also adds complexity: I would need some external start/stop/update code, while all the updates could perfectly well be done within the CherryPy server itself.
Is it possible to somehow "pause" CherryPy programmatically so that it serves a static "busy" page while I update the data, without the fear that someone is downloading file A right when I update file B, which they want next, leaving them with mismatched file versions?
I have tried to implement this programmatically, but there is a problem: CherryPy is multithreaded (and this is good), so even if I introduce a global "busy" flag, I need some way to wait for all threads to complete their existing tasks before I can update the data. I can't find such a way :(.
CherryPy's engine controls such things. When you call engine.stop(), the HTTP server shuts down, but first it waits for existing requests to complete. This mode is designed to allow for debugging to occur while not serving requests. See this state machine diagram. Note that stop is not the same as exit, which really stops everything and exits the process.
You could call stop, then manually start up an HTTP server again with a different app to serve a "busy" page, then make your edits, then stop the interim server, then call engine.start() and engine.block() again and be on your way. Note that this will mean a certain amount of downtime as the current requests finish and the new HTTP server takes over listening on the socket, but that will guarantee all current requests are done before you start making changes.
Alternately, you could write a bit of WSGI middleware which usually passes requests through unchanged, but when tripped returns a "busy" page. Current requests would still be allowed to complete, so there might be a period in which you're not sure if your edits will affect requests that are in progress. How to write WSGI middleware doesn't fit very well in an SO reply; search for resources like this one. When you're ready to hook it up in CherryPy, see http://docs.cherrypy.org/dev/concepts/config.html#wsgi
