Apollo Client + NextJS memory leak, InMemoryCache - memory-leaks

The official Apollo and Next.js recommendation is to create a new ApolloClient instance for each GraphQL request when SSR is used.
This works reasonably well for memory usage: memory grows by some amount and then the garbage collector brings it back down to the initial level.
The problem is that this baseline memory level keeps growing, and the debugger shows the leak is caused by the "InMemoryCache" object attached to the ApolloClient instance as its cache storage.
We tried reusing the same "InMemoryCache" instance for all new Apollo instances, and tried disabling caching by customizing the policies in "defaultOptions", but the leak is still present.
Is it possible to turn the cache off completely? Something like setting a "false" value for the "cache" option when initializing ApolloClient? Or maybe this is a known problem with a known solution that could be fixed by customizing the "InMemoryCache"?
We tried numerous options, such as forcing cache garbage collection, evicting objects from the cache, and so on, but nothing helped; the leak is still there.
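For reference, this is roughly how we create the per-request client and what we tried (a simplified sketch; the endpoint and the exact option values are illustrative):

```js
import { ApolloClient, HttpLink, InMemoryCache } from '@apollo/client';

// One shared cache instance, reused across per-request clients (one of the things we tried).
const sharedCache = new InMemoryCache();

function createApolloClient() {
  return new ApolloClient({
    ssrMode: true,
    link: new HttpLink({ uri: 'https://example.com/graphql' }), // illustrative endpoint
    cache: sharedCache,
    // Attempt to bypass caching entirely.
    defaultOptions: {
      query: { fetchPolicy: 'no-cache' },
      watchQuery: { fetchPolicy: 'no-cache' },
    },
  });
}

// We also tried forcing cleanup after each request, with no effect on the leak:
// sharedCache.gc();
// sharedCache.evict({ id: 'ROOT_QUERY' });
```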
Thank you!

Related

Apart from memory and CPU leaks, what could cause a Node.js server to go down?

I have a Node.js (Express.js) server acting as a BFF for my React.js website. I use Node.js for SSR, proxying some requests, and caching some pages in Redis. Recently I've found that my server goes down from time to time; its uptime is about 2 days. After a restart everything is fine, then the response time grows from hour to hour. I have resource monitoring on this server, and I can see that it doesn't have problems with RAM or CPU: it uses about 30% of RAM and 20% of CPU.
Unfortunately it's a big production site and I can't make a minimal reproducible example, because I don't know what causes this error :(
Apart from memory and CPU leaks, what could cause a Node.js server to go down?
I need at least a direction to search in.
UPDATE1:
"went down" - its when kubernetes kills container due 3 failed life checks (GET request to a root / of website)
My site don't use any BD connection but call lots of 3rd party API's. About 6 API requests due one GET/ request from browser
UPDATE2:
Thanks for your answers, guys.
To understand what happens inside my GET / requests, I added OpenTelemetry to my server. In long-running and timed-out GET / requests I saw slow API calls with very large tcp.connect and tls.connect spans.
I think it happens because of a lack of connections, or something like that. I think Mostafa Nazari is right.
I will create a patch and apply it within the next couple of days, and then I'll say whether the problem is gone.
I solved the problem.
It really was a lack of connections. I added connection reuse for node-fetch via keepAlive, plus a lot of caching to save connections. And it works.
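Roughly, the fix looks like this (a sketch; the URL and agent settings are illustrative, not my exact values):

```js
const https = require('https');
const fetch = require('node-fetch');

// Reuse TCP/TLS connections for outgoing API calls instead of opening a new one per request.
const keepAliveAgent = new https.Agent({
  keepAlive: true,  // keep sockets open and reuse them
  maxSockets: 50,   // cap concurrent connections per host (a small pool)
});

async function callApi(path) {
  const res = await fetch(`https://api.example.com${path}`, { agent: keepAliveAgent });
  return res.json();
}
```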
Thanks for all your answers. They are all right, but the most helpful thing was adding OpenTelemetry to my server to understand what exactly happens inside a request.
For other people with this problem, I strongly recommend adding telemetry to your project as a first step.
https://opentelemetry.io/
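A minimal tracing setup for a Node.js server looks roughly like this (a sketch; the console exporter is just for illustration, a real setup would export to a collector):

```js
// Load this file before the rest of the server (e.g. node -r ./tracing.js server.js).
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { ConsoleSpanExporter } = require('@opentelemetry/sdk-trace-node');

const sdk = new NodeSDK({
  traceExporter: new ConsoleSpanExporter(),
  // Auto-instruments http, express, dns, etc., so outgoing API calls show up as spans.
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```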
PS: I can't mark two replies as the answer. Joe's is the most detailed and Mostafa Nazari's is the most relevant to my problem. They could both be the "best answer".
Thanks for the help, guys.
Gradual growth of response time suggests some kind of leak.
If CPU and memory consumption are excluded, other potentially limiting resources include:
File descriptors - when your server forgets to close files. Monitor the number of files in /proc/<PID>/fd/* to confirm this (see the sketch after this list). See what those files are and find which code misbehaves.
Directory listings - even a temporary directory holding a lot of files takes some time to scan, and if your application fails to remove some temporary files and then lists that directory, you will be in trouble quickly.
Zombie processes - just monitor the total number of processes on the server.
Firewall rules (some Docker network magic may in theory cause this on the host system) - monitor the length of the output of "iptables -L" or "iptables-save" or the equivalent on modern kernels. Rare condition.
Memory fragmentation - this may happen in languages with garbage collection, but it often leaves traces like "Cannot allocate memory" in the logs. Rare condition, hard to fix. Export some health metrics and make your k8s restart your pod preemptively.
Application bugs/implementation problems - this really depends on the internal logic, i.e. what is going on inside the app. There may be some data structure that gets filled with data over time in some tricky way, becoming O(N) instead of O(1). Really hard to track down, unless you manage to reproduce the condition in a lab/test environment.
API calls from the frontend shift to shorter but more CPU-hungry ones - monitor the distribution of API call types over time.
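For the file descriptor check, a Node.js process on Linux can watch its own FD count (a small sketch; the interval and logging are arbitrary choices):

```js
const fs = require('fs');

// Count this process's open file descriptors by listing /proc/self/fd (Linux only).
function openFdCount() {
  return fs.readdirSync('/proc/self/fd').length;
}

// A steadily rising number points at an FD / connection leak.
setInterval(() => {
  console.log(`open file descriptors: ${openFdCount()}`);
}, 60000);
```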
Here are some of the many possibilities of why your server may go down:
Memory leaks: The server may eventually fail if the Node.js application is leaking memory, as you stated in your post above. This can occur if the application keeps adding new objects to memory without cleaning them up appropriately.
Unhandled exceptions: The server may crash if an exception is thrown in the application code and not caught. To avoid this, ensure that all exceptions are handled properly.
Third-party libraries: If the application uses any third-party libraries, the server may experience problems as a result. Before using them, consider examining their resource usage, versions, and updates.
Network connection: The server's network connection may have issues if the server is sending a lot of queries to third-party APIs or if the connection is unstable. Verify that the server handles connections, timeouts, and retries appropriately.
Database connections: Even though your server doesn't use any DB connections, it's a good idea to look for any stale database connections that could be problematic.
High volumes of traffic: The server may experience performance issues if it is receiving a lot of traffic. Make sure the server is set up to handle high traffic, using load balancing, caching, and other speed-enhancement techniques. Cloudflare is always a good option ;)
Concurrent requests: Performance problems may arise if the server is handling a lot of concurrent requests. Check that the server is set up to handle several requests at once, using tools like a connection pool, a thread pool, or other concurrency-management strategies.
(Credit goes to my System Analysis and Design course slides)
With any incoming/outgoing web request, two file descriptors are acquired. Since there is a limit on the number of FDs, the OS will not let a new socket be opened once that limit is reached, and this situation causes a "Timeout Error" on clients. You can easily check the number of open FDs with sudo ls -la /proc/_PID_/fd/ | tail -n +4 | wc -l, where _PID_ is the Node.js PID. If this value keeps rising, you have a connection leak.
I guess you need to do the following to prevent a connection leak:
make sure you are closing the HTTP connections of your outgoing API calls (this depends on how you open them; some libraries manage this for you and you just need to configure them)
cache your outgoing API calls (if possible) to reduce the number of calls
for your outgoing API calls, use a connection pool; this manages the number of open HTTP connections, reuses already-opened connections, and so on
review your code so that you can serve each request faster than you do now (for example, make your API calls in parallel instead of awaiting them sequentially or nesting them); anything you do to make your response faster helps prevent this situation (see the sketch after this list)
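For the last point, the difference between sequential and parallel calls looks roughly like this (a sketch; the helper names are illustrative):

```js
// The three API calls are independent, so they can run at the same time.
async function handleRequest(id, { getUser, getOrders, getPreferences }) {
  // Sequential version: total time is the sum of the three calls.
  // const user = await getUser(id);
  // const orders = await getOrders(id);
  // const prefs = await getPreferences(id);

  // Parallel version: total time is roughly the slowest of the three calls,
  // so each request (and its connections) is held open for less time.
  const [user, orders, prefs] = await Promise.all([
    getUser(id),
    getOrders(id),
    getPreferences(id),
  ]);
  return { user, orders, prefs };
}
```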

How to fix a memory leak when switching between databases with Mongoose & MongoDB?

I've identified a memory leak in an application I'm working on, which causes it to crash after a while due to being out of memory. Fortunately we're running it on Kubernetes, so the other replicas and an automatic reboot of the crashed pod keep the software running without downtime. I'm worried about potential data loss or data corruption though.
The memory leak is seemingly tied to HTTP requests. According to the memory usage graph, memory usage increases more rapidly during the day when most of our users are active.
In order to find the memory leak, I've attached the Chrome debugger to an instance of the application running on localhost. I made a heap snapshot and then I ran a script to trigger 1000 HTTP requests. Afterwards I triggered a manual garbage collection and made another heap snapshot. Then I opened a comparison view between the two snapshots.
According to the debugger, the increase of memory usage has been mainly caused by 1000 new NativeConnection objects. They remain in memory and thus accumulate over time.
I think this is caused by our architecture. We're using the following stack:
Node 10.22.0
Express 4.17.1
MongoDB 4.0.20 (hosted by MongoDB Atlas)
Mongoose 5.10.3
Depending on the request origin, we need to connect to a different database name. To achieve that we added some Express middleware in between that switches between databases, like so:
On boot we connect to the database cluster with mongoose.createConnection(uri, options). This sets up a connection pool.
On every HTTP request we obtain a connection to the right database with connection.useDb(dbName).
After obtaining the connection we register the Mongoose models with connection.model(modelName, modelSchema).
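In code, that setup looks roughly like this (a simplified sketch; the schema, header name, and environment variable are illustrative, not our exact code):

```js
const express = require('express');
const mongoose = require('mongoose');

const app = express();

// On boot: one connection to the cluster, which sets up the connection pool.
const baseConnection = mongoose.createConnection(process.env.MONGO_URI, {
  useNewUrlParser: true,
  useUnifiedTopology: true,
});

const userSchema = new mongoose.Schema({ name: String });

// On every request: switch to the right database and register the models on it.
app.use((req, res, next) => {
  const dbName = req.get('x-origin'); // illustrative: derived from the request origin
  const connection = baseConnection.useDb(dbName);
  req.models = { User: connection.model('User', userSchema) };
  next();
});
```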
Do you have any ideas on how we can fix the memory leak, while still being able to switch between databases? Thanks in advance!

Arangodb foxx-application poor performance

I have a serious issue with a custom Foxx application.
About the app
The application is a customized path-finding algorithm for graphs, optimized for public transport. On init it loads all the necessary data into a JavaScript variable and then traverses it. This is faster than accessing the DB each time.
The issue
When I access the application through the API for the first time, it is fast, e.g. 300 ms. But when I make the exact same request a second time, it is very slow, e.g. 7000 ms.
Can you please help me with this? I have no idea where to look for bugs.
Without knowing more about the app & the code, I can only speculate about reasons.
Potential reason #1: development mode.
If you are running ArangoDB in development mode, then the init procedure is run for each Foxx route request, making precalculation of values useless.
You can spot whether or not you're running in development mode by inspecting the arangod logs. If you are in development mode, there will be a log message about that.
Potential reason #2: JavaScript variables are per thread
You can run ArangoDB and thus Foxx with multiple threads, each having thread-local JavaScript variables. If you issue a request to a Foxx route, then the server will pick a random thread to answer the request.
If the JavaScript variable is still empty in this thread, it may need to be populated first (this will be your init call).
For the next request, again a random thread will be picked for execution. If the JavaScript variable is already populated in this thread, then the response will be fast. If the variable needs to be populated, then response will be slow.
After a few requests (at least as many as configured in --server.threads startup option), the JavaScript variables in each thread should have been initialized and the response times should be the same.
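To illustrate the pattern (a minimal Foxx sketch with an illustrative collection and route, not the asker's code): data cached in a module-scope variable is populated once per JavaScript context, so the first request handled by a given thread pays the init cost.

```js
'use strict';
const createRouter = require('@arangodb/foxx/router');
const db = require('@arangodb').db;

const router = createRouter();
module.context.use(router);

// Module-scope variable: lives per JavaScript context/thread, not per request.
let graphData = null;

function loadGraphData() {
  if (!graphData) {
    // Expensive init: load everything into memory once (illustrative query).
    graphData = db._query('FOR s IN stops RETURN s').toArray();
  }
  return graphData;
}

router.get('/route', (req, res) => {
  const data = loadGraphData(); // fast if this context already has the data
  res.send({ stops: data.length });
});
```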

Connecting to the new Azure Caching (DataCache, DataCacheFactory, & Connection Pooling)

The Windows Azure Caching Document says
"If possible, store and reuse the same DataCacheFactory object to conserve memory and optimize performance."
Has anyone seen any metrics or any quantification of how expensive this is?
One argument is that
"MaxConnectionsToServer setting... determines the number of chennels per DataCacheFactory that are opened to the cache cluster."
So if MaxConnectionsToServer = 1 and DataCacheFactory is a singleton in your app, then you've effectively synchronized all requests to your web server!
However, there is a lot of indication that DataCacheFactory should be a singleton (i.e. put in Application_OnStart).
This is critical and I can't believe it is not in the Microsoft documentation. Is the DataCacheFactory treated the same in AppFabric, Azure Shared Caching, and Azure Caching? I just have a difficult time believing that Microsoft designed caching in a way that requires a singleton factory object. This is like requiring anyone that uses SqlConnection to have a singleton SqlConnectionFactory object in their application.
So, considering a relatively average web app (For example, 1,000s of requests per hour, ~ 100 objects in cache, the average request accesses 5 cached objects):
By default (and by recommendation), how many factory objects should there be at one time?
How long does it take to create a DataCacheFactory reference?
How long does it take to create a DataCache reference?
Should there be only 1 DataCacheFactory object per app and only 1 DataCache reference per request?
EDIT (answers in progress):
(1/2). Let Azure connection pooling handle the Factory objects
(3). Still testing...
(4). Still trying to figure out if I should re-use DataCache references
How about that: Microsoft did document best practices, and it does involve connection pooling! Although it was not easy to find (at least for me).
It appears that the answer is simply to not use the DataCacheFactory object when implementing the newer Azure Caching, and to just access the DataCache object directly:
"There are also new overloads to the DataCache constructor that make
it simpler to create a cache client. In the past, it was always
necessary to create a DataCacheFactory object that returns the target
cache. Now it is possible to create the cache with the DataCache
constructor directly. The following example creates a client to the
default cache from the default section of the configuration file."
DataCache cache = new DataCache();
And to use connection pooling:
"With the latest Windows Azure SDK, connection pooling is enabled by
default when you define your cache settings in the application or web
configuration files. Because of this default behavior, it is important
to set the size of the connection pool correctly. The connection pool
size is configured with the maxConnectionsToServer attribute on the
dataCacheClient element."
I wish Microsoft gave some guidance on how to configure maxConnectionsToServer correctly, but that can be determined through testing. The automatic connection pooling with the new Azure Caching is pretty cool :)
I'm assuming you're referring to the Shared Caching Service (previously known as Azure AppFabric Cache).
There is no cost per individual connection. However, when you purchase a cache account, you're paying not only for the size of the cache account but also for a particular number of connections.
The smallest cache account has 10 connections per hour, while the most expensive one allows 160 concurrent connections. Thus, if you're concerned that you may run out of connections given the size of your account, it may be prudent to be careful about how many connections you open from your app.
More details
http://msdn.microsoft.com/en-us/library/windowsazure/hh697522.aspx

Azure DataCache MaxConnectionToServer

I am using the AppFabricCacheSessionStoreProvider and occasionally get the error
ErrorCode:SubStatus:There is a temporary failure. Please retry later.
(The request failed, because you exceeded quota limits for this hour.
If you experience this often, upgrade your subscription to a higher
one). Additional Information : Throttling due to resource :
Connections.
I am using a basic 128 MB cache with a web role that has two instances. What is the default MaxConnectionToServer value if it is not set? I think when I fire up a staging instance as well, it can cause this error (4 simultaneous instances). Will setting MaxConnectionToServer to a higher value make it better or worse? I believe the 128 MB cache has a limit of 5 connections, so should I set it to 1, which would mean only 4 connections could be used? The cache is not used elsewhere in the app.
The default for MaxConnectionToServer is 1, so you shouldn't have to change this setting, but if you do set it to 1 explicitly, it will prevent anyone else looking at your config from getting confused. If you set it to a higher value, then you will see this problem more often.
The cache session provider seems to be a little slow at disposing of its connections to the cache when it no longer needs them. This means that if you're running a number of instances close to the limit for your cache size, you do seem to see this error. You're correct that a 128 MB cache only allows 5 concurrent connections. If you want to avoid this problem at the moment, the only solution I'm aware of is to buy the next cache size up.
