I am running a TomEE server (a Tomcat-based application server) on Java 8 and using Titan 1.0.0 with a Cassandra backend and an Elasticsearch index as my database. In my development environment with a small number of user connections everything works perfectly, but on my production server with many connected users and heavy database access we have a memory leak, probably caused by the Titan graph cache or something else Titan-related. I have configured a 4 GB heap, and after more than 10 hours of running, the allocated heap memory grows to its maximum (~300 MB per hour). At that point the GC (garbage collector) can no longer reclaim memory and runs continuously, which makes the server instance unresponsive.
Using VisualVM I did some memory profiling; these are my screenshots:
Any suggestions on how we can fix this, or how to investigate the issue in more detail? Maybe some GC parameters can help us in this case?
I have seen those problems before on Titan 1.0 with Cassandra. Two things to check:
Opening And Closing The Graph
Are you opening a different transaction on the graph per user, or a different graph per user? I.e. are you doing:
(1)
// User 1 logs in
graph = TitanFactory.open(config);
doStuffWithGraph(graph);
// User 2 logs in
graph = TitanFactory.open(config);
doStuffWithGraph(graph);
or
(2)
graph = TitanFactory.open(config);
// User 1 logs in
doStuffWithGraph(graph);
// User 2 logs in
doStuffWithGraph(graph);
Approach (1) means that each user gets their own connection to the graph with their own graph object. This is very heavy and leads to more memory usage.
Approach (2) means that every user uses the same connection but different transactions. This is preferable in my opinion. Note: This is assuming users are on different threads.
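A minimal sketch of approach (2) in a multi-threaded server, assuming the graph is opened once at startup and shared; each request thread then works in its own thread-bound transaction and commits it when done (the class name, the properties file path, and doStuffWithGraph are placeholders, not Titan API):

import com.thinkaurelius.titan.core.TitanFactory;
import com.thinkaurelius.titan.core.TitanGraph;

public class GraphAccess {
    // Opened once at application startup and shared by every request thread.
    private static final TitanGraph graph =
            TitanFactory.open("conf/titan-cassandra-es.properties"); // placeholder config path

    // Called on each user's request thread; the work runs in that thread's own transaction.
    public static void handleUserRequest() {
        doStuffWithGraph(graph);
        graph.tx().commit();   // commit (or roll back) only this thread's transaction
    }

    private static void doStuffWithGraph(TitanGraph graph) {
        // placeholder for the per-user work from the snippets above
    }
}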
Long-Lived Transactions
This is the problem I had, which resulted in GC problems similar to yours. I simply kept transactions alive for too long. To speed up querying, Titan caches a lot, and I don't think it clears that cache unless the transaction is closed. So ideally you should have something like:
graph = TitanFactory.open(config);
// User 1 logs in
doStuffWithGraph(graph);
graph.tx().close();
// User 2 logs in
doStuffWithGraph(graph);
graph.tx().close();
where each transaction is closed after a user is done with it.
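A sketch of the same idea with the close guarded, so a transaction is never left open (and its cache never retained) if the user's work throws; doStuffWithGraph is the placeholder from the snippets above:

TitanGraph graph = TitanFactory.open(config);

// User 1 logs in
try {
    doStuffWithGraph(graph);   // reads/writes happen in the current thread's transaction
    graph.tx().commit();       // committing closes the transaction, which lets its cache be cleared
} catch (Exception e) {
    graph.tx().rollback();     // discard the transaction on failure instead of leaving it open
}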
Related
I've identified a memory leak in an application I'm working on, which causes it to crash after a while due to being out of memory. Fortunately we're running it on Kubernetes, so the other replicas and an automatic reboot of the crashed pod keep the software running without downtime. I'm worried about potential data loss or data corruption though.
The memory leak is seemingly tied to HTTP requests. According to the memory usage graph, memory usage increases more rapidly during the day when most of our users are active.
In order to find the memory leak, I've attached the Chrome debugger to an instance of the application running on localhost. I made a heap snapshot and then I ran a script to trigger 1000 HTTP requests. Afterwards I triggered a manual garbage collection and made another heap snapshot. Then I opened a comparison view between the two snapshots.
According to the debugger, the increase of memory usage has been mainly caused by 1000 new NativeConnection objects. They remain in memory and thus accumulate over time.
I think this is caused by our architecture. We're using the following stack:
Node 10.22.0
Express 4.17.1
MongoDB 4.0.20 (hosted by MongoDB Atlas)
Mongoose 5.10.3
Depending on the request origin, we need to connect to a different database. To achieve that we added some Express middleware that switches between databases, like so:
On boot we connect to the database cluster with mongoose.createConnection(uri, options). This sets up a connection pool.
On every HTTP request we obtain a connection to the right database with connection.useDb(dbName).
After obtaining the connection we register the Mongoose models with connection.model(modelName, modelSchema).
Do you have any ideas on how we can fix the memory leak, while still being able to switch between databases? Thanks in advance!
We currently have a production Node.js application that has been underperforming for a while. The application is a live bidding platform and also runs timed auctions. The part of the system running live sales is perfect and works as required. We have noticed the problem while running our timed sales, where items in a sale have timers and finish incrementally, and if someone bids within the last set time, the timer is extended by X seconds.
The issue I have found occurs during the period when a timed sale is finishing (which can go on for hours), with lots closing 60 seconds apart and extending if users bid in the last 10 seconds. We were able to connect via the devtools and I have taken heap memory exports to see what is going on, but all indications point to stream writables and buffers. So my question is: what am I doing wrong? See below a screenshot of a heap memory export:
As you can see from the above, there is a lot of memory being used; in this specific case it was using 1473 MB of physical RAM. We saw this rise very quickly (within 30 minutes), and each increment seemed to be larger than the last. When it hit 3.5 GB it was incrementing at around 120 MB each second; as it got higher, around 5 GB, it was incrementing at 500 MB per second. It got to around 6 GB, then the worker crashed (it has a max heap size of 8 GB), and we were a process down.
So let me tell you about the platform. It is, as I said earlier, a bidding platform; it uses Node (v11.3.0) and is clustered using the built-in cluster module. It spawns 4 workers and has the main process (so 5 altogether). The system accepts bids, checks other bids, calculates who is winning, and pushes updates to the connected clients via Redis PUB/SUB, which is then broadcast to that worker's connected users.
All data is stored in Redis, and MySQL is used to refresh data into Redis, as Redis has performed 10x faster than MySQL was able to.
The way this works is that on connection a small session is created against the connection; this is then used to authenticate the user (via a message sent from the client). All message events are sent to a handler which dispatches them to the correct command; these commands are all defined as async functions and run asynchronously.
This has no issue at small scale, but we had over 250 connections, were seeing the above behaviour, and are unsure where to look for a fix. We noticed when opening the top object that it was connected to buffer.js and stream_writable.js as well. I can also see that all references are connected to system / JSArrayBufferData and all refer back to these; there are lots of objects, and we have been unable to fix this issue.
We think it is one of the following:
We log lots of information to the console and to a file using fs.writeFile in append mode. We did some research and saw that writing to the console can be a cause of this kind of behaviour.
It is the get-lots function, which outputs all the lots for that page (currently set to 50) every time an item finishes; so when a timer ends it requests a full page load of all the items on that page, instead of only adding the new lots.
There is something else happening here that we are unaware of; maybe an external library we are using is not releasing a reference.
I have listed the libraries of interest that we require here:
"bluebird": "^3.5.1", (For promisifying the redis library)
"colors": "^1.2.5", (Used on every console.log (we call logs for everything that happens this can be around 50 every few seconds.)
"nodejs-websocket": "^1.7.1", (Our websocket library)
"redis": "^2.8.0", (Our redis client)
Anyway, if there is anything painfully obvious I would love to hear it, as everything I have followed online and in other Stack Overflow questions does not relate closely enough to the issue we are facing.
Problem:
I am executing a test of my Mongoose query, but the kernel kills my Node app for out-of-memory reasons.
Flow scenario for a single request:
/GET REQUEST -> READ a document of the user (e.g. schema) [this schema has a ref to the user schema in one of its fields] -> COMPILE/REARRANGE the output of the query read from MongoDB [this involves filtering and looping over the data] into the response format required by the client -> UPDATE a field of this document and SAVE it back to MongoDB -> UPDATE REDIS -> SEND the response [the compiled response above] back to the requesting client.
The above fails when 100 concurrent customers do the same:
MEM - goes very low (<10MB)
CPU - MAX (>98%)
What I could figure out is that the rate at which reads and writes occur is choking MongoDB by queuing all requests, thereby delaying Node.js, which causes such drastic CPU and memory values, and finally the app gets killed by the kernel.
Please suggest how I should proceed to achieve concurrency in such flows.
You've now met the Linux OOM killer. Basically, all Linux kernels (not just Amazon's) need to take action when they run out of RAM, so they need to find a process to kill. Generally, this is the process that has been asking for the most memory.
Your 3 main options are:
Add swap space. You can create a swapfile on the root disk if it has enough space, or create a small EBS volume, attach it to the instance, and configure it as swap.
Move to an instance type with more RAM.
Decrease your memory usage on the instance, either by stopping/killing unused processes or reconfiguring your app.
Option 1 is probably the easiest for short-term debugging. For production performance, you'd want to look at optimizing your app's memory usage or getting an instance with more RAM.
"A Node application is an instance of a Node Process Object".link
Is there a way in which local memory on the server can be cleared every time the node application exits?
[By application exit I mean when an individual user of the website closes the browser tab.]
node.js is a single process that serves all your users. There is no memory specific to a given user other than any state that you yourself, in your own node.js code, might be storing in your node.js server on behalf of that user. If you have memory like that, then the typical ways to know when to clear out that state are as follows:
Offer a specific logout option in the web page and, when the user logs out, clear their state from memory. This doesn't catch all the ways the user might disappear, so it would typically be done in conjunction with the other options.
Have a recurring timer (say every 10 minutes) that automatically clears any state from a user who has not made a web request within the last hour (or however long you want the window to be). This also requires you to keep a timestamp for each user, updated each time they access something on the site, which is easy to do in a middleware function.
Have all your client pages keep a WebSocket connection to the server; when that WebSocket connection has been closed and not re-established for a few minutes, you can assume that the user no longer has any page open to your site and clear their state from memory.
Don't store user state in memory. Instead, use a persistent database with good caching. Then, when the user is no longer using your site, their state info will just age out of the database cache gracefully.
Note: Tracking overall memory usage in node.js is not a trivial task, so it's important to know exactly what you are measuring. Overall process memory usage is a combination of memory that is actually in use and memory that was previously used, is currently available for reuse, but has not been given back to the OS. You need to track memory that is actually in use by node.js, not just memory the process has been allocated. A heap snapshot is one of the typical ways to see what is actually being used, not just what has been allocated from the OS.
I am using Hazelcast version 3.3.1.
I have a 9-node cluster running on AWS using c3.2xlarge servers.
I am using a distributed executor service and a distributed map.
Distributed executor service uses a single thread.
Distributed map is configured with no replication and no near-cache and stores about 1 million objects of size 1-2kb using Kryo serializer.
My use case is as follows:
All 9 nodes constantly execute a synchronous remote operation on the distributed executor service and generate about 20k hits per second (~2k per node).
Invocations are executed using Hazelcast API: com.hazelcast.core.IExecutorService#executeOnKeyOwner.
Each operation accesses the distributed map on the node owning the partition, does some calculation using the stored object, and stores the object back into the map (for that I use the get and set API of the IMap object), roughly as sketched below.
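In rough terms the task looks like the following sketch (class, map, and method names are simplified placeholders; the real code uses executeOnKeyOwner, while the sketch uses the Callable variant submitToKeyOwner so the caller can wait for the result):

import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.HazelcastInstanceAware;
import com.hazelcast.core.IMap;
import java.io.Serializable;
import java.util.concurrent.Callable;

// Runs on the member that owns 'key'; note that get and set are still two separate map operations.
public class RecalculateTask implements Callable<Object>, HazelcastInstanceAware, Serializable {
    private final String key;
    private transient HazelcastInstance hazelcast;

    public RecalculateTask(String key) {
        this.key = key;
    }

    @Override
    public void setHazelcastInstance(HazelcastInstance hazelcast) {
        this.hazelcast = hazelcast;
    }

    @Override
    public Object call() {
        IMap<String, Object> map = hazelcast.getMap("myMap"); // placeholder map name
        Object value = map.get(key);          // load
        Object updated = recalculate(value);  // modify (placeholder for the real calculation)
        map.set(key, updated);                // store
        return updated;
    }

    private Object recalculate(Object value) {
        return value; // placeholder
    }
}

// Caller side (synchronous, as described above):
// executorService.submitToKeyOwner(new RecalculateTask(key), key).get();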
Every once in a while Hazelcast encounters timeout exceptions such as:
com.hazelcast.core.OperationTimeoutException: No response for 120000 ms. Aborting invocation! BasicInvocationFuture{invocation=BasicInvocation{ serviceName='hz:impl:mapService', op=GetOperation{}, partitionId=212, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeout=60000, target=Address[172.31.44.2]:5701, backupsExpected=0, backupsCompleted=0}, response=null, done=false} No response has been received! backups-expected:0 backups-completed: 0
In some cases I see map partitions start to migrate, which makes things even worse; nodes constantly leave and re-join the cluster, and the only way I can overcome the problem is by restarting the entire cluster.
I am wondering what may cause Hazelcast to block a map-get operation for 120 seconds?
I am pretty sure it's not network related since other services on the same servers operate just fine.
Also note that the servers are mostly idle (~70%).
Any feedback on my use case would be highly appreciated.
Why don't you make use of an entry processor? This is also sent to the machine owning the partition, and the load, modify, store cycle is done automatically and atomically, so there are no race problems. It will probably outperform the current approach significantly since less remoting is involved.
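A minimal sketch of that idea, assuming Hazelcast 3.x's AbstractEntryProcessor (the map name and the calculation are placeholders):

import com.hazelcast.core.IMap;
import com.hazelcast.map.AbstractEntryProcessor;
import java.util.Map;

// Executed on the partition thread of the member owning the key,
// so the read-modify-write happens atomically with no separate get/set round trips.
public class RecalculateEntryProcessor extends AbstractEntryProcessor<String, Object> {

    @Override
    public Object process(Map.Entry<String, Object> entry) {
        Object value = entry.getValue();        // load the stored object
        Object updated = recalculate(value);    // modify (placeholder for the real calculation)
        entry.setValue(updated);                // store; setValue also drives backup processing
        return null;
    }

    private Object recalculate(Object value) {
        return value; // placeholder
    }
}

// Caller side:
// IMap<String, Object> map = hazelcastInstance.getMap("myMap");
// map.executeOnKey(key, new RecalculateEntryProcessor());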
The fact that the map.get is not returning for 120 seconds is indeed very confusing. If you switch to Hazelcast 3.5, we added some logging/debugging facilities for this: the slow operation detector (executing side) and the slow invocation detector (caller side), which should give you some insight into what is happening.
Do you see any Health monitor logs being printed?