How to fix a memory leak when switching between databases with Mongoose & MongoDB? - node.js

I've identified a memory leak in an application I'm working on, which causes it to crash after a while due to being out of memory. Fortunately we're running it on Kubernetes, so the other replicas and an automatic reboot of the crashed pod keep the software running without downtime. I'm worried about potential data loss or data corruption though.
The memory leak is seemingly tied to HTTP requests. According to the memory usage graph, memory usage increases more rapidly during the day when most of our users are active.
In order to find the memory leak, I've attached the Chrome debugger to an instance of the application running on localhost. I made a heap snapshot and then I ran a script to trigger 1000 HTTP requests. Afterwards I triggered a manual garbage collection and made another heap snapshot. Then I opened a comparison view between the two snapshots.
According to the debugger, the increase of memory usage has been mainly caused by 1000 new NativeConnection objects. They remain in memory and thus accumulate over time.
I think this is caused by our architecture. We're using the following stack:
Node 10.22.0
Express 4.17.1
MongoDB 4.0.20 (hosted by MongoDB Atlas)
Mongoose 5.10.3
Depending on the request origin, we need to connect to a different database name. To achieve that we added some Express middleware in between that switches between databases, like so:
On boot we connect to the database cluster with mongoose.createConnection(uri, options). This sets up a connection pool.
On every HTTP request we obtain a connection to the right database with connection.useDb(dbName).
After obtaining the connection we register the Mongoose models with connection.model(modelName, modelSchema).
Do you have any ideas on how we can fix the memory leak, while still being able to switch between databases? Thanks in advance!
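One possible direction, sketched under assumptions: if each call to useDb creates a new connection object that stays referenced (which would match the 1000 NativeConnection objects in the snapshot comparison), then resolving each database name once and reusing the result should stop the growth. In the sketch below, createDbResolver, dbMiddleware, schemas and resolveDbName are hypothetical names, not part of Mongoose or Express; rootConnection is assumed to be the object returned by mongoose.createConnection(uri, options). Mongoose also documents a useCache option on useDb; whether it is available depends on your version, so check the docs for the one you run.

```javascript
// Sketch: keep one connection object per database name instead of
// creating a new one on every request.
// `rootConnection` is assumed to come from mongoose.createConnection();
// `schemas` is a hypothetical map of model name -> Mongoose schema.
function createDbResolver(rootConnection, schemas) {
  const cache = new Map(); // dbName -> connection with models registered

  return function getDb(dbName) {
    let db = cache.get(dbName);
    if (!db) {
      db = rootConnection.useDb(dbName);
      // Register the models once per database, not once per request.
      for (const [modelName, schema] of Object.entries(schemas)) {
        db.model(modelName, schema);
      }
      cache.set(dbName, db);
    }
    return db;
  };
}

// Express middleware. `resolveDbName` is a hypothetical function that
// maps a request (for example its origin) to a database name.
function dbMiddleware(getDb, resolveDbName) {
  return function (req, res, next) {
    req.db = getDb(resolveDbName(req));
    next();
  };
}
```

With this in place the number of connection objects is bounded by the number of distinct database names rather than by the number of requests. Repeating the heap snapshot comparison after 1000 requests would confirm whether the NativeConnection count stays flat.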

Related

Apart from memory and CPU leaks, what could cause a Node.js server to go down?

I have a Node.js (Express.js) server that acts as a BFF for my React.js website. I use Node.js for SSR, for proxying some requests, and for caching some pages in Redis. Recently I found that my server goes down from time to time. Uptime is about 2 days. After a restart everything is fine, then response time grows from hour to hour. I have resource monitoring on this server, and I see that it doesn't have problems with RAM or CPU: it uses about 30% of RAM and 20% of CPU.
Unfortunately it's a big production site and I can't make a minimal reproducible example, because I don't know the cause of the error :(
Apart from memory and CPU leaks, what could cause a Node.js server to go down?
I need at least a direction to search in.
UPDATE1:
"Went down" means Kubernetes kills the container after 3 failed liveness checks (a GET request to the root / of the website).
My site doesn't use any DB connection, but it calls lots of third-party APIs: about 6 API requests per GET / request from the browser.
UPDATE2:
Thanks for your answers, guys.
To understand what happens inside my GET / request, I added OpenTelemetry to my server. In the slow and timed-out GET / requests I saw long API requests with very large tcp.connect and tls.connect durations.
I think it happens due to a lack of connections or something like that. I think Mostafa Nazari is right.
I will create a patch and apply it within the next couple of days, and then report whether the problem is gone.
I solved the problem.
It really was a lack of connections. I added connection reuse for node-fetch via keepAlive and a lot of caching to save connections. And it works.
Thanks for all your answers. They are all right, but the most helpful thing was adding OpenTelemetry to my server to understand what exactly happens inside a request.
For other people with these problems, I strongly recommend adding telemetry to your project as a first step.
https://opentelemetry.io/
PS: I can't mark two replies as the answer. Joe's is the most detailed and Mostafa Nazari's is the most relevant to my problem. Both could be the "best answer".
Thanks for the help, guys.
Gradual growth of response time suggests some kind of leak.
If CPU and memory consumption are excluded, other potentially limiting resources include:
File descriptors - when your server forgets to close files. Monitor the number of files in /proc//fd/* to confirm this. See what those files are and find which code misbehaves.
Directory listing - even a temporary directory holding a lot of files takes time to scan, and if your application lists temporary files without removing them, you will be in trouble quickly.
Zombie processes - just monitor the total number of processes on the server.
Firewall rules (some Docker network magic may in theory cause this on the host system) - monitor the length of the output of "iptables -L" or "iptables-save", or the equivalent on modern kernels. Rare condition.
Memory fragmentation - this may happen in languages with garbage collection, but it often leaves traces such as "Can not allocate memory" in logs. Rare condition, hard to fix. Export some health metrics and make your k8s restart your pod preemptively.
Application bugs/implementation problems. This really depends on internal logic, that is, on what is going on inside the app. There may be some data structure that fills up with data over time in some tricky way, so that an operation becomes O(N) instead of O(1). Really hard to track down unless you have managed to reproduce the condition in a lab/test environment.
API calls from the frontend shift from shorter ones to more CPU-hungry ones. Monitor the distribution of API call types over time.
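The file descriptor check from the list above can also be done from inside the Node.js process. This is a minimal sketch, assuming a Linux host where /proc/self/fd lists the current process's descriptors; countOpenFds and startFdMonitor are hypothetical helper names, and the count is approximate because reading the directory briefly opens a descriptor of its own.

```javascript
const fs = require('fs');

// Count entries in a process's fd directory. Linux only: /proc does not
// exist on macOS or Windows, in which case this returns null.
function countOpenFds(fdDir = '/proc/self/fd') {
  try {
    return fs.readdirSync(fdDir).length;
  } catch (err) {
    return null;
  }
}

// Log the count periodically; a value that only ever grows suggests a leak.
function startFdMonitor(intervalMs = 60000, log = console.log) {
  const timer = setInterval(() => {
    log(`open file descriptors: ${countOpenFds()}`);
  }, intervalMs);
  timer.unref(); // do not keep the process alive just for monitoring
  return timer;
}
```

Calling startFdMonitor() once at boot is enough to get a trend line in the logs. Exporting the same number as a metric would let you plot it next to response time.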
Here are some of the many possibilities of why your server may go down:
Memory leaks: The server may eventually fail if a Node.js application is leaking memory, as you stated in your post above. This may occur if the application keeps adding new objects to memory without cleaning them up.
Unhandled exceptions: The server may crash if an exception is thrown in the application code and is not caught. To prevent this, ensure that all exceptions are handled properly.
Third-party libraries: If the application uses any third-party libraries, the server may experience problems as a result. Before using them, consider examining their resource usage, versions, or updates.
Network connection: The server's network connection may have issues if the server is sending a lot of queries to third-party APIs or if the connection is unstable. Verify that the server is handling connections, timeouts, and retries appropriately.
Connection to the database: Even though your server doesn't use any DB connections, it's a good idea to look for any stale connections to databases that could be problematic.
High volumes of traffic: The server may experience performance issues if it is receiving a lot of traffic. Make sure the server is set up to handle a lot of traffic, making use of load balancing, caching, and other speed enhancement methods. Cloudflare is always a good option ;)
Concurrent requests: Performance problems may arise if the server is managing a lot of concurrent requests. Check whether the server is set up correctly to handle several requests at once, using tools like a connection pool, a thread pool, or other concurrency management strategies.
(Credit goes to my System Analysis and Design course slides)
Every incoming or outgoing web request acquires 2 file descriptors. Because the number of FDs is limited, the OS eventually refuses to open new sockets, which causes "Timeout Error" on clients. You can easily check the number of open FDs with sudo ls -la /proc/_PID_/fd/ | tail -n +4 | wc -l, where _PID_ is the Node.js PID. If this value keeps rising, you have a connection leak.
I guess you need to do the following to prevent a connection leak:
Make sure you are closing the HTTP connections of your outgoing API calls (this depends on how you open them; some libraries manage this and you just need to configure them).
Cache your outgoing API calls (if possible) to reduce their number.
For your outgoing API calls, use a connection pool. It manages the number of open HTTP connections, reuses already-opened connections, and so on.
Review your code so that you can serve a request faster than now (for example, run your API calls in parallel instead of awaiting them one by one or nesting them). Anything that makes your response faster helps prevent this situation.
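As a rough illustration of the connection pool point, Node's built-in http.Agent and https.Agent can keep sockets open and cap how many are opened per host. The sketch below is one way to set that up, not the configuration the asker ended up with; agentFor is a hypothetical helper and the maxSockets value is a placeholder.

```javascript
const http = require('http');
const https = require('https');

// Shared agents: sockets are kept open and reused across requests, and
// maxSockets caps how many connections are opened per host.
// The numbers here are placeholders, not recommendations.
const agentOptions = { keepAlive: true, maxSockets: 50 };
const httpAgent = new http.Agent(agentOptions);
const httpsAgent = new https.Agent(agentOptions);

// Pick the agent that matches the URL's protocol.
function agentFor(url) {
  return new URL(url).protocol === 'https:' ? httpsAgent : httpAgent;
}

// Example with node-fetch, which accepts an `agent` option:
//   const res = await fetch(url, { agent: agentFor(url) });
```

Reusing sockets avoids a new TCP and TLS handshake for each outgoing call. The important part is that the agents are created once and shared, not created per request.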

Apollo Client + NextJS memory leak, InMemoryCache

The official Apollo and NextJS recommendation is to create a new ApolloClient instance each time a GraphQL request is executed when SSR is used.
This shows good results for memory usage: memory grows by some amount and then the garbage collector resets it to the initial level.
The problem is that the initial memory usage level constantly grows, and the debugger shows that the leak is caused by the "InMemoryCache" object that is attached to the ApolloClient instance as cache storage.
We tried to use the same "InMemoryCache" instance for all new Apollo instances and tried to disable caching by customizing the policies in "defaultOptions", but the leak is still present.
Is it possible to turn off the cache completely? Something like setting a "false" value for the "cache" option in the ApolloClient initialization? Or maybe it's a known problem with a known solution that can be solved by customizing the "InMemoryCache"?
We tried numerous options, such as forcing cache garbage collection and evicting objects from the cache, but nothing helped; the leak is still there.
Thank you!

How can I instrument and log my KnexJS transactions?

I have a serious problem in production causing the application to become unresponsive and output the following error:
Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call?
A running hypothesis is some operations are holding onto long-running Knex transactions. Enough of them to reach the pool size, basically.
Is there a way to query the KnexJS API for how many pool connections are in use at any one time? Unfortunately since KnexJS occupies the max pool settings from the config, it can be hard to know how many are actually in use. From the postgres end, it seems like KnexJS is idling on all of its connections when they are not in use.
Is there a good way to instrument Knex transaction and transacting with some kind of middleware or hook? Another useful thing is to log the callstack of any transaction (or any longer than, say, 7 seconds). One challenge is I have calls to Knex transaction and transacting throughout my project. Maybe it's a long shot.
Any advice is greatly appreciated.
System Information
KnexJS version: 0.12.6 (we will update in the next month)
Database + version: Postgres 9.6
OS: Heroku Linux (Ubuntu?)
The easiest way to see what's happening at the connection pool level is to run knex with the DEBUG=knex:* environment variable set, which prints quite a lot of debug info about what's happening inside knex. Those logs show, for example, when connections are fetched from the pool and returned to it, and every query that is run.
There are a couple of global events that you can use to hook into every query, but there aren't any for hooking into transactions. Here is a related question where I have written some example code showing how to measure transaction durations with query hooks, though: Tracking DB querying time - Bookshelf/knex. It probably leaks some memory, so it's not a very production-ready solution, but for your debugging purposes it might be helpful.
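Since there is no transaction hook, one option is to do the bookkeeping yourself around each transaction call. The sketch below is library-agnostic and only an illustration: TransactionTracker and tracked are hypothetical names, not part of the Knex API, and it assumes you can wrap the places where transactions are started.

```javascript
// Library-agnostic bookkeeping for open transactions. Nothing here is
// part of the Knex API; begin() and end() are hypothetical hooks that
// you would call yourself around each transaction.
class TransactionTracker {
  constructor() {
    this.open = new Map(); // id -> { label, startedAt, stack }
    this.nextId = 1;
  }

  begin(label, now = Date.now()) {
    const id = this.nextId++;
    // Capture the call stack now, while it still points at the caller.
    this.open.set(id, { label, startedAt: now, stack: new Error(label).stack });
    return id;
  }

  end(id) {
    this.open.delete(id);
  }

  // Number of transactions currently open.
  get inUse() {
    return this.open.size;
  }

  // Transactions that have been open for longer than thresholdMs.
  olderThan(thresholdMs, now = Date.now()) {
    return [...this.open.values()].filter((t) => now - t.startedAt > thresholdMs);
  }
}

// Wrap any function that returns a promise (for example a call that
// starts a transaction) so it is tracked until it settles.
async function tracked(tracker, label, fn) {
  const id = tracker.begin(label);
  try {
    return await fn();
  } finally {
    tracker.end(id);
  }
}
```

A periodic timer could then log tracker.inUse and print the label and stack of everything returned by tracker.olderThan(7000), which covers the "longer than 7 seconds" case from the question. The map only holds open transactions, so it should not grow unless transactions really are left open.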

aws kernel is killing my node app

Problem :
I am running a test of my Mongoose query, but the kernel kills my Node app for out-of-memory reasons.
Flow scenario for a single request:
GET request -> READ the user's document (e.g. schema) [this schema has a ref to the user schema in one of its fields] -> COMPILE/REARRANGE the output of the query read from MongoDB [this involves filtering and looping over the data] into the response format required by the client -> UPDATE a field of this document and SAVE it back to MongoDB -> UPDATE REDIS -> SEND the response [the compiled response above] back to the requesting client
** The above fails when 100 concurrent customers do the same...
MEM - goes very low (<10MB)
CPU - MAX (>98%)
What I could figure out is that the rate of reads and writes is choking MongoDB by queuing all requests, thereby delaying Node.js, which causes such drastic CPU and MEM values, and finally the app gets killed by the kernel.
Please suggest how I should proceed to achieve concurrency in such flows...
You've now met the Linux OOM Killer. Basically, all Linux kernels (not just Amazon's) need to take action when they've run out of RAM, so they need to find a process to kill. Generally, this is the process that has been asking for the most memory.
Your 3 main options are:
Add swap space. You can create a swapfile on the root disk if it has enough space, or create a small EBS volume, attach it to the instance, and configure it as swap.
Move to an instance type with more RAM.
Decrease your memory usage on the instance, either by stopping/killing unused processes or reconfiguring your app.
Option 1 is probably the easiest for short-term debugging. For production performance, you'd want to look at optimizing your app's memory usage or getting an instance with more RAM.

Nodejs application memory usage tracking and clean up on exit

"A Node application is an instance of a Node Process Object".
Is there a way to clear local memory on the server every time the Node application exits?
[By application exit I mean when an individual user of the website closes the tab in the browser.]
node.js is a single process that serves all your users. There is no specific memory associated with a given user other than any state that you yourself in your own node.js code might be storing locally in your node.js server on behalf of a given user. If you have some memory like that, then the typical ways to know when to clear out that state are as follows:
Offer a specific logout option in the web page and when the user logs out, you clear their state from memory. This doesn't catch all ways the user might disappear so this would typically be done in conjunction with other options.
Have a recurring timer (say every 10 minutes) that automatically clears any state from a user who has not made a web request within the last hour (or however long you want the time set to). This also requires you to keep a timestamp for each user each time they access something on the site which is easy to do in a middleware function.
Have all your client pages keep a webSocket connection to the server and when that webSocket connection has been closed and not re-established for a few minutes, then you can assume that the user no longer has any page open to your site and you can clear their state from memory.
Don't store user state in memory. Instead, use a persistent database with good caching. Then, when the user is no longer using your site, their state info will just age out of the database cache gracefully.
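The logout and recurring-timer options above can be sketched roughly as follows. UserStateStore is a hypothetical class written for illustration, and how a user id is obtained from a request is left out because it depends on your session setup.

```javascript
// In-memory per-user state with idle expiry.
class UserStateStore {
  constructor(maxIdleMs) {
    this.maxIdleMs = maxIdleMs;
    this.entries = new Map(); // userId -> { state, lastSeen }
  }

  // Call on every request (for example from a middleware) to refresh the
  // user's timestamp and get their state object.
  touch(userId, now = Date.now()) {
    const entry = this.entries.get(userId) || { state: {}, lastSeen: now };
    entry.lastSeen = now;
    this.entries.set(userId, entry);
    return entry.state;
  }

  // Call on explicit logout.
  remove(userId) {
    this.entries.delete(userId);
  }

  // Drop state for users who have been idle longer than maxIdleMs.
  // Returns how many entries were removed.
  sweep(now = Date.now()) {
    let removed = 0;
    for (const [userId, entry] of this.entries) {
      if (now - entry.lastSeen > this.maxIdleMs) {
        this.entries.delete(userId);
        removed++;
      }
    }
    return removed;
  }
}
```

With a one-hour idle limit, something like setInterval(() => store.sweep(), 10 * 60 * 1000) would run the cleanup every 10 minutes, matching the numbers used in the list.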
Note: Tracking overall memory usage in node.js is not a trivial task, so it's important to know exactly what you are measuring if you're tracking this. Overall process memory usage is a combination of memory that is actually being used and memory that was previously used, is currently available for reuse, but has not been given back to the OS. You obviously need to be able to track memory that is actually in use by node.js, not just memory that has been allocated to the process. A heap snapshot is one of the typical ways to track what is actually being used, not just what is allocated from the OS.
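For a quick look at the difference between memory allocated to the process and memory in use on the JavaScript heap, process.memoryUsage() reports both. A small sketch (memorySummary is a hypothetical helper name):

```javascript
// Summarize process.memoryUsage() in megabytes.
// rss is the total memory held by the process; heapUsed is what is
// currently occupied on the V8 heap, heapTotal what V8 has reserved.
function memorySummary() {
  const toMb = (bytes) => Math.round((bytes / 1024 / 1024) * 10) / 10;
  const { rss, heapTotal, heapUsed, external } = process.memoryUsage();
  return {
    rssMb: toMb(rss),
    heapTotalMb: toMb(heapTotal),
    heapUsedMb: toMb(heapUsed),
    externalMb: toMb(external),
  };
}
```

Logging this periodically shows the trend, but it does not say which objects hold the memory; for that a heap snapshot is still needed.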
