We're facing some challenging performance issues with some queries taking an extremely long time to respond, often in excess of 20s and frequently timing out at 60s. These extreme timings cannot be reproduced in local dev environments, even when using load testing tools like graphql-bench and autocannon. These same queries can also complete in ~500ms.
We have a federated graph setup with an instance of Apollo Gateway (running ~0.24) and 2 instances of Apollo Server (running 3.x) providing subgraphs. All 3 use NestJS 7 on top of Apollo Server + express. Our resolvers use Typeorm to interface with a reasonably large and complex MSSQL database. The database is hugely OP and resource monitors never get over 10% RAM & CPU.
The API servers are hosted on AWS ECS and are generously resources. Like the DB server, the CPU and RAM usage average out at 10-20%, although the CPU does spike to >100% sometimes.
Tracing in Apollo Studio appears to show some high level resolvers "hanging" or "waiting" for long periods of time. ie. 27s out of a 30s response.
We do have some N+1 issues and we've made some limited use of Dataloader in places to resolve those issues, although we can definitely implement it more. I'm not convined that the N+1 issues are responsible for the extreme timings though due to the wild range of those timing and the inconsistencies of them.
Does this scenario sound famililar to anyone? Does anyone have any ideas or pointers how we might go about tracking down the root issue and debugging it? We're a bit stumped tbh, especially as the issues seem to be impossible to repro locally.
Related
I'm currently working on a Node.js project and my server keeps running out of memory. It has happened 4 times in the last 2 weeks, usually after about 10,000 requests. This project is live and has real users.
I am using
NodeJS 16
Google Cloud Platform's App Engine (instances have 2048mb of memory)
Express as my server framework
TypeORM as database ORM (database is postgres hosted on separate GCP SQL instance)
I have installed the GCP profiling tools and have captured the app running out of memory, but I'm not quite sure how to use the results. It almost looks like there is a memory leak in the _handleDataRow function within the pg client library. I am currently using version 8.8.0 of the library (8.9.0 was just released a few weeks ago and doesn't mention fixing any memory leaks in the release notes).
I'm a bit stuck with what I should do at this point.
Any suggestions or advice would be greatly appreciated! Thanks.
Update: I have also cross-posted to reddit and someone there helped me determine that issue is related to large queries with many joins. I was able to reproduce the issue, and will report back here once I am able to solve it.
When using App Engine, a great place to start looking for "why" a problem occurred in your app is through the Logs Explorer. Particularly, if you know the time-frame of when the issues started escalating or when the crash occurred.
Although based on your Memory Usage graph, it's a slow leak. So a top-to-bottom approach of your back-end is really necessary to try and pin-point the culprit. I would go through the whole stack and look for things like Globals that are set and not cleaned up, promises that are not being returned, large result-sets from the database that are bottle-necking the server, perhaps from a scheduled task.
Looking at the 2pm - 2:45pm range on the right-hand of the graph, I would narrow the Logs Explorer down to that exact time-frame. Then I would look for the processes or endpoints that are being utilized most frequently in that time-frame as well as the ones that are taking the most memory to get a good starting point.
I am currently using express.js as the main backend. However, I have found that fastify is generally faster in performance.
But the downside of fastify is that it has a relatively slow startup time.
I am curious that would it slow down the autoscaling of Cloudrun?
I've seen that Cloudrun autoscales when the usage is over 60%.
In this case, I am thinking slow startup time can delay the response while autoscaling which can be a reason not to use fastify. How does this exactly work?
A slow cold start doesn't influence the autoscaller. The service scale according to the CPU usage and the number of queries.
If you track the number of created instance with benchmarks, you can see that you can have suddenly 50 concurrent requests, 5 to 10 instances are created by the Cloud Run autoscaller. Why? Because if the traffic suddenly increase with a lot of concurrent request, it can means that slope can continue and you can have soon 100 or 200 concurrent request and Cloud Run service is prepared to absorb that traffic.
If you scale slowly to 50 concurrent requests, you can have only 1 or 2 instances set up.
Anyway, it's just a thing that I noted in my previous tests. In addition, keep in mind that, if you have a sustained traffic, the cold start is a marginal case. You will lost a few milliseconds "rarely", and it doesn't imply a framework change.
My recommendation is to keep what is the best and the most efficient for you (you cost at least 100 times more than the Cloud Run cost!!)
So I have hosted asp.net core 2.2 web service on Azure(S2 plan). The problem is that my application sometimes getting high CPU usage(almost 99%). What I have done for now - checked process explorer on azure. I see there a lot of processes who are consuming CPU. Maybe someone knows if it's okay for these processes consume CPU?
Currently, I don't have an idea where do they come from. Maybe it's normal to have them here.
Shortly about my application:
Currently, there is not much traffic. 500-600 request in a day. Most of the request is used to communicate with MS SQL by querying records, adding, etc.
As well I am using MS Websocket, but high CPU happens when no WebSocket client is connected to web service, so I hardly believe that it's a cause. I tried to use apache ab for load testing, but there isn't any pattern, that after one request's load test, I would get high CPU. So sometimes happens, sometimes don't during load testing.
So I just update screenshot of processes, I see that lots of threads are being locked/used during the time when fluent migrator start running its logging.
Update*
I will remove fluent migrator logging middleware from Configure method. Will look forward with the situation.
UPDATE**
So I removed logging of FluentMigrator. Until now I didn't notice any CPU usage over 90%.
But still, I am confused. My CPU usage is spinning. Is it health CPU usage graph or not?
Also, I tried to make a load test on the websocket server.
I made a script that calls some functions of WebSocket every 100ms from 6-7 clients. So every 100ms there are 7 calls to WebSocket server from different clients, every function within itself queries some data/insert (approximately 3-4 queries of every WebSocket function).
What I did notice, on Azure S1 DTU 20 after 2min I am getting out of SQL pool connections, If I increase DTU to 100, it handles 7 clients properly without any errors of 'no connection pool'.
So the first question: is it a normal CPU spinning?
Second: should I get an error message of 'no SQL connection free' using this kind of load test on DTU 10 Azure SQL. I am afraid that when creating a scoped service on singleton WebSocket Service I am leaking connections.
This topic gets too long, maybe I should move it to a new topic?
-
At this stage I would say you need to profile your application and figure out what areas of your code are CPU intensive. In the past I have used dotTrace, this highlighted methods which are the most expensive with a call tree.
Once you know what areas of your code base are the least efficient, you can begin to refactor them so that they are more efficient. This could simply be changing some small operations, adding caching for queries or using distributed locking for example.
I believe the reason the other DLLs are showing CPU usage is because your code calling methods which are within those DLLs.
This is kind of a multi-tiered question in which my end goal is to establish the best way to setup my server which will be hosting a website as well as a service (using Socket.io) for an iOS (and eventually an Android) app. Both the app service and the website are going to be written in node.js as I need high concurrency and scaling for the app server and I figured whilst I'm at it may as well do the website in node because it wouldn't be that much different in terms of performance than something different like Apache (from my understanding).
Also the website has a lower priority than the app service, the app service should receive significantly higher traffic than the website (but in the long run this may change). Money isn't my greatest priority here, but it is a limiting factor, I feel that having a service that has 99.9% uptime (as 100% uptime appears to be virtually impossible in the long run) is more important than saving money at the compromise of having more down time.
Firstly I understand that having one node process per cpu core is the best way to fully utilise a multi-core cpu. I now understand after researching that running more than one per core is inefficient due to the fact that the cpu has to do context switching between the multiple processes. How come then whenever I see code posted on how to use the in-built cluster module in node.js, the master worker creates a number of workers equal to the number of cores because that would mean you would have 9 processes on an 8 core machine (1 master process and 8 worker processes)? Is this because the master process usually is there just to restart worker processes if they crash or end and therefore does so little it doesnt matter that it shares a cpu core with another node process?
If this is the case then, I am planning to have the workers handle providing the app service and have the master worker handle the workers but also host a webpage which would provide statistical information on the server's state and all other relevant information (like number of clients connected, worker restart count, error logs etc). Is this a bad idea? Would it be better to have this webpage running on a separate worker and just leave the master worker to handle the workers?
So overall I wanted to have the following elements; a service to handle the request from the app (my main point of traffic), a website (fairly simple, a couple of pages and a registration form), an SQL database to store user information, a webpage (probably locally hosted on the server machine) which only I can access that hosts information about the server (users connected, worker restarts, server logs, other useful information etc) and apparently nginx would be a good idea where I'm handling multiple node processes accepting connection from the app. After doing research I've also found that it would probably be best to host on a VPS initially. I was thinking at first when the amount of traffic the app service would be receiving will most likely be fairly low, I could run all of those elements on one VPS. Or would it be best to have them running on seperate VPS's except for the website and the server status webpage which I could run on the same one? I guess this way if there is a hardware failure and something goes down, not everything does and I could run 2 instances of the app service on 2 different VPS's so if one goes down the other one is still functioning. Would this just be overkill? I doubt for a while I would need multiple app service instances to support the traffic load but it would help reduce the apparent down time for users.
Maybe this all depends on what I value more and have the time to do? A more complex server setup that costs more and maybe a little unnecessary but guarantees a consistent and reliable service, or a cheaper and simpler setup that may succumb to downtime due to coding errors and server hardware issues.
Also it's worth noting I've never had any real experience with production level servers so in some ways I've jumped in the deep end a little with this. I feel like I've come a long way in the past half a year and feel like I'm getting a fairly good grasp on what I need to do, I could just do with some advice from someone with experience that has an idea with what roadblocks I may come across along the way and whether I'm causing myself unnecessary problems with this kind of setup.
Any advice is greatly appreciated, thanks for taking the time to read my question.
First - a little bit about my background: I have been programming for some time (10 years at this point) and am fairly competent when it comes to coding ideas up. I started working on web-application programming just over a year ago, and thankfully discovered nodeJS, which made web-app creation feel a lot more like traditional programming. Now, I have a node.js app that I've been developing for some time that is now running in production on the web. My main confusion stems from the fact that I am very new to the world of the web development, and don't really know what's important and what isn't when it comes to monitoring my application.
I am using a Joyent SmartMachine, and looking at the analytics options that they provide is a little overwhelming. There are so many different options and configurations, and I have no clue what purpose each analytic really serves. For the questions below, I'd appreciate any answer, whether it's specific to Joyent's Cloud Analytics or completely general.
QUESTION ONE
Right now, my main concern is to figure out how my application is utilizing the server that I have it running on. I want to know if my application has the right amount of resources allocated to it. Does the number of requests that it receives make the server it's on overkill, or does it warrant extra resources? What analytics are important to look at for a NodeJS app for that purpose? (using both MongoDB and Redis on separate servers if that makes a difference)
QUESTION TWO
What other statistics are generally really important to look at when managing a server that's in production? I'm used to programs that run once to do something specific (e.g. a raytracer that finishes running once it has computed an image), as opposed to web-apps which are continuously running and interacting with many clients. I'm sure there are many things that are obvious to long-time server administrators that aren't to newbies like me.
QUESTION THREE
What's important to look at when dealing with NodeJS specifically? What are statistics/analytics that become particularly critical when dealing with the single-threaded event loop of NodeJS versus more standard server systems?
I have other questions about how databases play into the equation, but I think this is enough for now...
We have been running node.js in production nearly an year starting from 0.4 and currenty 0.8 series. Web app is express 2 and 3 based with mongo, redis and memcached.
Few facts.
node can not handle large v8 heap, when it grows over 200mb you will start seeing increased cpu usage
node always seem to leak memory, or at least grow large heap size without actually using it. I suspect memory fragmentation, as v8 profiling or valgrind shows no leaks in js space nor resident heap. Early 0.8 was awful in this respect, rss could be 1GB with 50MB heap.
hanging requests are hard to track. We wrote our middleware to monitor these especially as our app is long poll based
My suggestions.
use multiple instances per machine, at least 1 per cpu. Balance with haproxy, nginx or such with session affinity
write midleware to report hanged connections, ie ones that code never responded or latency was over threshold
restart instances often, at least weekly
write poller that prints out memory stats with process module one per minute
Use supervisord and fabric for easy process management
Monitor cpu, reported memory stats and restart on threshold
Whichever the type of web app, NodeJS or otherwise, load testing will answer whether your application has the right amount of server resources. A good website I recently found for this is Load Impact.
The real question to answer is WHEN does the load time begin to increase as the number of concurrent users increase? A tipping point is reached when you get to a certain number of concurrent users, after which the server performance will start to degrade. So load test according to how many users you expect to reach your website in the near future.
How can you estimate the amount of users you expect?
Installing Google Analytics or another analytics package on your pages is a must! This way you will be able to see how many daily users are visiting your website, and what is the growth of your visits from month-to-month which can help in predicting future expected visits and therefore expected load on your server.
Even if I know the number of users, how can I estimate actual load?
The answer is in the F12 Development Tools available in all browsers. Open up your website in any browser and push F12 (or for Opera Ctrl+Shift+I), which should open up the browser's development tools. On Firefox make sure you have Firebug installed, on Chrome and Internet Explorer it should work out of the box. Go to the Net or Network tab and then refresh your page. This will show you the number of HTTP requests, bandwidth usage per page load!
So the formula to work out daily server load is simple:
Number of HTTP requests per page load X the average number of pages load per user per day X Expected number of concurrent users = Total HTTP Requests to Server per Day
And...
Number of MBs transferred per page load X the average number of pages load per user per day X Expected number of concurrent users = Total Bandwidth Required per Day
I've always found it easier to calculate these figures on a daily basis and then extrapolate it to weeks and months.
Node.js is single threaded so you should definitely start a process for every cpu your machine has. Cluster is by far the best way to do this and has the added benefit of being able to restart died workers and to detect unresponsive workers.
You also want to do load testing until your requests start timing out or exceed what you consider a reasonable response time. This will give you a good idea of the upper limit your server can handle. Blitz is one of the many options to have a look at.
I have never used Joyent's statistics, but NodeFly and their node-nodefly-gcinfo is a great tools to monitor node processes.