NodeJS Performance Issue

I'm running an API server using NodeJS 6.10.3 LTS on Ubuntu 14.04 (trusty). I've noticed that my API server tops out at ~600 reqs/min running on a c4.large EC2 instance. By "tops out" I mean that I see the CPU go up to 100%. Note: I know that I'm not fully utilizing the instance because I'm not using the cluster module, but that's OK for now.
I took a .cpuprofile dump of my API server for 10 seconds, and noticed that every second, for ~300ms, the profiler shows my NodeJS code is sitting (idle).
Does anyone know what that (idle) implies? Is it a GC issue? Or is it an internal (to V8) lock that I'm triggering? Any help or pointers to tools to help debug this would be nice. I'm working on anonymizing some of the stack traces in the cpuprofile so I can share it.
The packages I'm using are ExpressJS 4, Couchbase NodeJS SDK, Socket.IO mainly. The codepaths are mainly reading requests, and pushing to Couchbase. And finally querying couchbase via Views API, and pushing some aggregated data on a Socket.IO channel. So all pretty I/O async friendly stuff. I've made sure that I'm not calling any synchronous functions. There are no patterns of function calls before the (idle) in the cpu profile.

It could also just be I/O wait, meaning none of the sockets have data ready to read yet and so the time is spent idle. If you are using a load testing library you should check that the requests are evenly distributed within a second.
Take a look at https://www.npmjs.com/package/gc-stats to check GC data. There are flags to increase heap space, and to change when GC runs, if the problem turns out to be GC related.
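If you would rather not add a package, Node's built-in perf_hooks module can give a rough picture of both GC pauses and event loop idle time. The following is only a sketch and assumes a considerably newer Node than 6.10.3 (perf_hooks arrived in the 8.x line, and performance.eventLoopUtilization() in 12.19/14.10). The helper names createGcTracker and sampleEventLoop are made up for this example.

```javascript
const { PerformanceObserver, performance } = require('perf_hooks');

// Collects GC pause statistics reported by V8 through perf_hooks.
function createGcTracker() {
  const stats = { count: 0, totalMs: 0, maxMs: 0 };
  const observer = new PerformanceObserver((list) => {
    for (const entry of list.getEntries()) {
      stats.count += 1;
      stats.totalMs += entry.duration; // duration is in milliseconds
      stats.maxMs = Math.max(stats.maxMs, entry.duration);
    }
  });
  observer.observe({ entryTypes: ['gc'] });
  return { stats, stop: () => observer.disconnect() };
}

// Reports how long the event loop was idle (waiting for I/O) versus active.
// Pass the `current` value of an earlier sample to get the delta since then.
function sampleEventLoop(previous) {
  const current = performance.eventLoopUtilization();
  const delta = previous
    ? performance.eventLoopUtilization(current, previous)
    : current;
  return {
    current,
    idleMs: delta.idle,
    activeMs: delta.active,
    utilization: delta.utilization,
  };
}
```

Sampling both once per second under load should show whether the ~300ms gaps line up with GC pauses (large maxMs) or with a high idle share, which would point toward plain I/O wait.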

Related

Debugging an Out of Memory Error in Node.js

I'm currently working on a Node.js project and my server keeps running out of memory. It has happened 4 times in the last 2 weeks, usually after about 10,000 requests. This project is live and has real users.
I am using
NodeJS 16
Google Cloud Platform's App Engine (instances have 2048mb of memory)
Express as my server framework
TypeORM as database ORM (database is postgres hosted on separate GCP SQL instance)
I have installed the GCP profiling tools and have captured the app running out of memory, but I'm not quite sure how to use the results. It almost looks like there is a memory leak in the _handleDataRow function within the pg client library. I am currently using version 8.8.0 of the library (8.9.0 was just released a few weeks ago and doesn't mention fixing any memory leaks in the release notes).
I'm a bit stuck with what I should do at this point.
Any suggestions or advice would be greatly appreciated! Thanks.
Update: I have also cross-posted to reddit and someone there helped me determine that the issue is related to large queries with many joins. I was able to reproduce the issue, and will report back here once I am able to solve it.

When using App Engine, a great place to start looking for "why" a problem occurred in your app is through the Logs Explorer. Particularly, if you know the time-frame of when the issues started escalating or when the crash occurred.
Although based on your Memory Usage graph, it's a slow leak. So a top-to-bottom approach of your back-end is really necessary to try and pin-point the culprit. I would go through the whole stack and look for things like Globals that are set and not cleaned up, promises that are not being returned, large result-sets from the database that are bottle-necking the server, perhaps from a scheduled task.
Looking at the 2pm - 2:45pm range on the right-hand of the graph, I would narrow the Logs Explorer down to that exact time-frame. Then I would look for the processes or endpoints that are being utilized most frequently in that time-frame as well as the ones that are taking the most memory to get a good starting point.
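To make that correlation easier, it can help to write memory figures into the same logs at a fixed interval, so they show up next to the request entries in Logs Explorer. A minimal sketch using only process.memoryUsage(); the function names, the one-minute default and the JSON shape are assumptions for illustration.

```javascript
const BYTES_PER_MB = 1024 * 1024;

// Converts a process.memoryUsage()-shaped object into megabytes (one decimal place).
function formatMemoryUsage(usage = process.memoryUsage()) {
  const toMb = (bytes) => Math.round((bytes / BYTES_PER_MB) * 10) / 10;
  return {
    rssMb: toMb(usage.rss),
    heapUsedMb: toMb(usage.heapUsed),
    heapTotalMb: toMb(usage.heapTotal),
    externalMb: toMb(usage.external),
  };
}

// Logs memory usage on a fixed interval and returns a function that stops it.
// unref() prevents the timer from keeping the process alive on its own.
function startMemoryLogger(intervalMs = 60000, log = console.log) {
  const timer = setInterval(() => {
    log(JSON.stringify({ at: new Date().toISOString(), ...formatMemoryUsage() }));
  }, intervalMs);
  timer.unref();
  return () => clearInterval(timer);
}
```

Calling startMemoryLogger() once at startup is enough. A heapUsedMb value that keeps climbing between the timestamps of specific endpoints gives a first hint about where to look.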

node.js CPU usage spikes

I have an express.js app running in cluster mode (nodejs cluster module) on a linux production server. I'm using PM2 to manage this app. It usually uses less than 3% CPU. However, sometimes the CPU usage spikes up to 100% for a short period of time (less than a minute). I'm not able to reproduce this issue. This only happens once a day or two.
Is there any way to find out which function or route is causing the sudden CPU spikes, using PM2? Thanks.

I think some requests in your application trigger slow synchronous execution.
Add a middleware that logs every incoming request, store the logs in Elasticsearch, and look for requests with long response times. Alternatively, use New Relic (the easy way, but it costs more money).
Use blocked-at to find slow synchronous execution. If you detect any, try moving that work to worker threads or use the workerpool library.
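As a rough illustration of the logging idea, here is an Express-style middleware that reports slow requests, plus an event loop delay monitor built on perf_hooks. This is not the blocked-at API; it is a stand-in using only Node built-ins (monitorEventLoopDelay needs Node 11.10 or newer), and the function names and thresholds are assumptions.

```javascript
const { monitorEventLoopDelay } = require('perf_hooks');

// Express-style middleware (req, res, next) that reports requests
// taking at least thresholdMs from arrival to the 'finish' event.
function slowRequestLogger(thresholdMs = 500, log = console.warn) {
  return function (req, res, next) {
    const startedAt = process.hrtime.bigint();
    res.on('finish', () => {
      const durationMs = Number(process.hrtime.bigint() - startedAt) / 1e6;
      if (durationMs >= thresholdMs) {
        log({ method: req.method, url: req.url, durationMs });
      }
    });
    next();
  };
}

// Samples event loop delay; a large maximum suggests synchronous code
// blocked the loop at some point since monitoring started.
function startLoopDelayMonitor(resolutionMs = 20) {
  const histogram = monitorEventLoopDelay({ resolution: resolutionMs });
  histogram.enable();
  return {
    maxDelayMs: () => histogram.max / 1e6, // histogram values are in nanoseconds
    stop: () => histogram.disable(),
  };
}
```

With Express this would be registered as app.use(slowRequestLogger(500)) before the routes. Comparing the logged URLs with the times when maxDelayMs() jumps can narrow down which route blocks the loop.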

My answer is based purely on my experience with the topic.
Before going to production, run local tests such as:
stress testing
longevity testing
For both tests, try a tool like JMeter, where you can configure one or more endpoints and run load against them over a period of time while monitoring CPU and memory usage.
If everything is fine, also try stopping the test and calling the API manually to observe its behavior. This will help you spot a memory leak in the APIs themselves.
Is your app running .map() or .reduce() over huge arrays?
Does your app work significantly better after a reboot?
If yes, then you should suspect that the Express app has a memory leak and the garbage collector is struggling to clean up the mess.
If possible, try rewriting the app using fastify. Personally, this did not make my app much faster, but it was able to handle 1.5x more requests.

how to use all cpu with nodejs

We have a production chat app built in socketio/nodejs.
We use express.
Nodejs is a bit old : 10.21.0
SocketIO in 3.1.1
Our computer is a VM with 4vCPU and 16 GB RAM.
We use pm2 to manage starting node app with env variables.
We are facing an issue when there are about 500 users in the chat and they are writing. Bandwidth usage is around 250 Mbps in upload (but we have 10G so that is not an issue). The issue begins here: our logs fill up with connections/disconnections, and pm2 restarts the app.
Looking in more detail by launching "pm2 monit", we can see that only one processor is used, and it is above 100% most of the time.
We read some documentation about clustering (cluster + fork). It seems to be interesting but in our case when we tested it, it's like we had few chat apps so for the same "chat room", users are in different workers so it's not OK.
Do you have an idea how we can fix that and use all processors/cores?
We are already thinking of starting by upgrading nodejs.
Thanks
Niko

Since Node.js is always single-threaded (aside from worker threads), upgrading Node won't get you much anywhere (aside from newer Nodes shipping newer V8 engines, which might be faster).
it's like we had few chat apps so for the same "chat room", users are in different workers so it's not OK.
This sounds like you've architected your app to use global variables or in-process state like that for these shared rooms. If you want to use cluster or PM2's multiple process mode, that state will need to live somewhere else, maybe a second Node application or, say, a Redis server.
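To illustrate what "state living somewhere else" can mean, here is a sketch where the cluster primary relays chat messages between workers over Node's built-in IPC channel. Note this is a different technique from the Redis option mentioned above and only works on a single machine. The message shape and all function names are hypothetical, and cluster.isPrimary needs Node 16+ (older versions call it isMaster).

```javascript
const cluster = require('cluster');
const os = require('os');

// Forwards a message to every worker except the one that sent it.
// Returns the number of workers the message was sent to.
function relayToOtherWorkers(workers, senderId, message) {
  let delivered = 0;
  for (const worker of workers) {
    if (worker.id !== senderId) {
      worker.send(message);
      delivered += 1;
    }
  }
  return delivered;
}

// Primary process: fork the workers and relay whatever each one reports.
function startPrimary(workerCount = os.cpus().length) {
  for (let i = 0; i < workerCount; i += 1) {
    const worker = cluster.fork();
    worker.on('message', (message) => {
      relayToOtherWorkers(Object.values(cluster.workers), worker.id, message);
    });
  }
}

// Worker process: publish local chat messages to the primary and
// hand messages from other workers to the given callback.
function startWorker(onRemoteMessage) {
  process.on('message', onRemoteMessage);
  return (message) => process.send(message);
}
```

A worker would call the function returned by startWorker() whenever one of its own sockets sends a chat message, and emit to its local sockets inside onRemoteMessage. That way every worker sees every message for a room, whichever worker the user is connected to.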

Garbage collection causes lag on connected sockets (NodeJS Server)

I am hosting a game website on Heroku that runs on a NodeJS Server. Clients are connected via sockets using the package socket.io.
Once in a while when the garbage collection cycle is triggered, connected clients would experience severe lag and often, disconnections. This is experienced by the clients through delayed incoming chat and delayed inputs to the game.
When I look into the logs, I find error messages relating to the Garbage Collection. Please see the attached logs below. When these GC events happen, sometimes it causes massive memory spikes to the point where the app would exceed its allowed 0.5GB RAM and would be killed by Heroku. Lately however, the memory spikes don't occur as often, but the severe lag on the client side still happens around once or twice a day.
One aspect of the lag is through the chat. When a user types a message through "All Chat" (and any chat channel), the server currently console.log()'s it to the standard out. I happened to be watching the logs live one time during a spike event and noticed that chat being outputted to the terminal was in real time with no delay, however clients (I was also on the website myself as a client) received these messages in a very delayed fashion.
I have found online a NodeJS bug (that I think was fixed) that would cause severe lag when too much was being logged with console.log, so I ran a stress test by sending 1000 messages per second from the client for a minute. I could not reproduce the spike.
I have read many guides on finding memory leaks, inspecting the stack etc. but I'm very unsure how to run these tests on a live Heroku server. I have suspicions that my game objects on closing, are not being immediately cleared out and are all being cleared at once, causing the memory spikes, but I am not confident. I don't know how to best debug this. It is also difficult for me to catch this happening live as it only happens when more than 30+ people are logged in (Doesn't happen often as this is still a fairly small site).
The error messages include references to the circular-json module I use, and I also suspect that this may be causing infinite callbacks on itself somehow and not clearing out correctly, but I am not sure.
For reference, here is a copy of the source code: LINK
Here is a snippet of the memory when a spike happens:
Memory spike
Crash log 1: HERE
Crash log 2: HERE
Is there a way I can simulate sockets or simulate the live server's environment (i.e. connected clients) locally?
Any advice on how to approach or debug this problem would be greatly appreciated. Thank you.

Something to consider is that console.log will increase memory usage. If you are logging verbosely with large amounts of data, this can accumulate. Looking quickly at the log, it seems you are running out of memory? That would mean the app starts writing to disk, which is slower, and it will also trigger garbage collection, spiking the CPU.
This could mean a memory leak due to resources not being killed/closed and simply accumulating. Debugging this can be a PITA.
By default, Node allows around 1.5GB for long-lived objects. It seems like you are on a 500MB container, so it is best to configure the web app to start like this:
web: node --optimize_for_size --max_old_space_size=460 server.js
While you need to get to the bottom of the leak, you can also increase availability by running more than one worker and more than one node instance, using socket.io-redis to keep the instances in sync. I highly recommend this route.
Some helpful content on Nodejs memory on Heroku.
You can also spin up multiple connections from a Node script using socket.io-client to interact with your local dev server, monitor the memory locally, and add logging to ensure connections are being closed correctly.
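To check that the --max_old_space_size flag from the Procfile line actually took effect, you can ask V8 for its heap limit at startup. A small sketch using the built-in v8 module; the helper names are made up, and the reported limit is usually a bit higher than the flag value because it also covers other heap spaces.

```javascript
const v8 = require('v8');

// Returns V8's current heap size limit in megabytes.
function heapLimitMb() {
  return Math.round(v8.getHeapStatistics().heap_size_limit / (1024 * 1024));
}

// Returns the fraction of the heap limit currently in use (between 0 and 1).
function heapPressure() {
  const { used_heap_size, heap_size_limit } = v8.getHeapStatistics();
  return used_heap_size / heap_size_limit;
}
```

Logging heapLimitMb() once on boot and heapPressure() periodically makes it easier to see in the Heroku logs how close the process gets to the limit before a lag spike.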

I ended up managing to track my "memory leak" down. It turns out I was saving the games (in JSONified strings) to the database too frequently and the server/database couldn't keep up. I've reduced the frequency of game saves and I haven't had any issues.
The tips provided by Samuel were very helpful too.

Node.js app has periodic slowness and/or timeouts (does not accept incoming requests)

This problem is killing the stability of my production servers.
To recap, the basic idea is that my node server(s) sometimes intermittently slow down, sometimes resulting in Gateway Timeouts. As best as I can tell from my logs, something is blocking the node thread (meaning that the incoming request is not accepted), but I cannot for the life of me figure out what.
The problem ranges in severity. Sometimes what should be <100ms requests take ~10 seconds to complete; sometimes they never even get accepted by the node server at all. In short, it is as though some random task is working and blocking the node thread for a period of time, thus slowing down (or even blocking) incoming requests; the one thing I can say for sure is that the need-to-fix-symptom is a "Gateway Timeout".
The issue comes and goes without warning. I have not been able to correlate it against CPU usage, RAM usage, uptime, or any other relevant statistic. I've seen the servers handle a large load fine, and then have this error with a small load, so it does not even appear to be load-related. It is not unusual to see the error around 1am PST, which is the smallest load time of the day! Restarting the node app does seem to maybe make the problem go away for a while, but that really doesn't tell me much. I do wonder if it might be a bug in node.js... not very comforting, considering it is killing my production servers.
The first thing I did was to make sure I had upgraded node.js to the latest (0.8.12), as well as all my modules (here they are). Of course, I also have plenty of error catchers in place. I'm not doing anything funky like printing out lots to the console or writing to lots of files.
At first, I thought it was outbound HTTP requests blocking the incoming socket, because the express middleware was not even picking up the inbound request, but I gave up the theory because it looks like the node thread itself became busy.
Next, I went through all my code with JSHint and fixed literally every single warning, including a few accidental globals (forgetting to write "var"), but this didn't help.
After that, I assumed that perhaps I was running out of memory. But, my heap snapshots via nodetime are looking pretty good now (described below).
Still thinking that memory might be an issue, I took a look at garbage collection. I enabled the --nouse-idle-notification flag and did some more code optimization to NULL objects when they were not needed.
Still convinced that memory was the issue, I added the --expose-gc flag and executed the gc(); command every minute. This did not change anything, except to occasionally make requests a bit slower perhaps.
In a desperate attempt, I set up the "cluster" module to use 2 workers and automatically restart them every 30 min. Still, no luck.
I increased the ulimit to over 10,000 and kept an eye on the open files. There seem to be < 300 open files (or sockets) per node.js app, and increasing the ulimit thus had no impact.
I've been logging my server with nodetime and here's the gist of it:
CentOS 5.2 running on the Amazon Cloud (m1.large instance)
Greater than 5000 MB free memory at all times
Less than 150 MB heap size at all times
CPU usage is less than 60% at all times
I've also checked my MongoDB servers, which have <5% CPU usage and no requests are taking > 100ms to complete, so I highly doubt there's a bottleneck.
I've wrapped (almost) all my code using Q-promises (see code sample), and of course have avoided Sync() calls like the plague. I've tried to replicate the issue on my testing server (OSX), but have had little luck. Of course, this may be just because the production servers are being used by so many people in so many unpredictable ways that I simply cannot replicate via stress tests...

Many months after I first asked this question, I found the answer.
In a nutshell, the problem was that I was not piping a big asset when transferring it from one server to another. In other words, I was downloading an image from one server, before uploading it to a S3 bucket. Instead of streaming the download into the upload, I downloaded the file into memory, and then uploaded it.
I'm not sure why this did not show up as a memory spike, or elsewhere in my statistics.
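For anyone hitting the same thing, the difference looks roughly like this. The sketch uses stream.pipeline (Node 10+) and works with any readable source and writable destination; the S3 upload side is not shown, and streamCopy is a name invented for this example.

```javascript
const { pipeline, Transform } = require('stream');

// Streams `source` into `destination` chunk by chunk instead of buffering
// the whole asset in memory. Calls back with (err, bytesCopied).
function streamCopy(source, destination, callback) {
  let bytes = 0;
  // Pass-through stage that only counts the bytes flowing by.
  const counter = new Transform({
    transform(chunk, _encoding, done) {
      bytes += chunk.length;
      done(null, chunk);
    },
  });
  // pipeline() handles backpressure and destroys all streams on error.
  pipeline(source, counter, destination, (err) => callback(err, bytes));
}
```

In the download case, the source would be the HTTP response stream and the destination whatever writable stream the upload client exposes, so only a few chunks are in memory at any moment.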

My guess is Mongoose. If you are storing large payloads in Mongo, Mongoose can be pretty slow due to how it builds the Mongoose objects. See https://github.com/LearnBoost/mongoose/issues/950 for more details on the problem. If this is the problem you wouldn't see it in Mongo itself since the query returns quickly, but object instantiation could take 75x the query time.
Try setting up timers (process.hrtime()) before and after the Mongoose objects are created to see if that might be the problem. If it is, I would switch to using the node Mongo driver directly instead of going through Mongoose.
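A minimal version of such a timer, assuming the code you want to measure can be wrapped in a synchronous function (timeSync is just an illustrative name):

```javascript
// Runs fn() and returns its result together with the elapsed
// wall-clock time in milliseconds, measured with process.hrtime().
function timeSync(fn) {
  const start = process.hrtime();
  const result = fn();
  const [seconds, nanoseconds] = process.hrtime(start);
  return { result, elapsedMs: seconds * 1e3 + nanoseconds / 1e6 };
}
```

Wrapping the step that turns raw documents into Mongoose objects and logging elapsedMs next to the query time should show quickly whether instantiation dominates.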

You are heavily leaking memory, try setting every object to null as soon as you don't need it anymore! Read this.
More information about hunting down memory leaks can be found here.
Pay special attention to multiple references to the same object and check whether you have circular references. Those are a pain to debug, but fixing them will help you very much.
Try invoking the garbage collector manually every minute or so (I don't know if you can do this in node.js because I'm more of a C++ and PHP coder). From my years of experience working with C++ I can tell you the most likely cause of your application slowing down over time is memory leaks. Find them and plug them, and you'll be OK!
Also assuming you're not caching and/or processing images, audio or video in memory or anything like that 150M heap is a lot! Those could be hundreds of thousands or even millions of small objects.
You don't have to be running out of memory for your application to slow down... just searching for free memory with that many objects already allocated is a huge job for the memory allocator, it takes a lot of time to allocate each new object and as you leak more and more memory that time only increases.

Is "--nouse-idle-connection" a mistake? Do you really mean "--nouse_idle_notification"?
I think it may be a GC issue caused by too many tiny objects.
Node runs in a single process, so watching the busiest CPU core is much more important than watching the overall load.
When your program is slow, you can run "gdb node pid" and then "bt" to see what node is busy doing.

What I'd do is set up a parallel node instance on the same server with some kind of echo service and test that one. If it runs fine, you have narrowed down your problem to your program code (and not a scheduler/OS-level problem). Then, step by step, include the modules and test again. Certainly this is a lot of work and takes a long time, and I don't know if it is doable on your system.

If you need to get this working now, you can go the NASA redundancy route:
Bring up a second copy of your production servers, and put a proxy in front of them which routes each request to both stacks and returns the first response. I don't recommend this as a perfect long-term solution but it should help significantly reduce issues in production now, and help you gather log data that you could replay to recreate the issues on non-production servers.
Obviously, this is straight-forward for read requests, but more complex for commands which write to the db.

We had a similar problem with our Node.js server. It didn't scale well for weeks and we had tried almost everything, as you had. Our problem was the default backlog value, which is very low for highly concurrent environments.
http://nodejs.org/api/http.html#http_server_listen_port_hostname_backlog_callback
Setting the backlog to a significantly higher value (e.g. 10000), as well as tuning networking in our kernel (/etc/sysctl.conf on Linux) as described in the manual section, helped a lot. Since then we haven't had any timeouts in our Node.js server.
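For reference, passing the backlog looks roughly like this with the built-in http module. The startServer wrapper and its defaults are assumptions for illustration. Node's own default backlog is 511, and on Linux the kernel caps the effective value at net.core.somaxconn, which is why the sysctl tuning matters as well.

```javascript
const http = require('http');

// Starts an HTTP server with an explicit listen backlog and calls
// onListening(server) once the socket is bound.
function startServer(handler, { port = 3000, backlog = 10000 } = {}, onListening) {
  const server = http.createServer(handler);
  server.listen({ port, backlog }, () => onListening(server));
  return server;
}
```

An Express app can be passed as the handler, for example startServer(app, { port: 8080, backlog: 10000 }, () => console.log('listening')).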
