Gremlin-Server takes too much memory and hangs - garbage-collection

I'm using gremlin-server (v3.02) with titan-hbase. I'm using the default configuration settings. The server has 8 GB of memory and 4 cores.
After a few hours of work, the server stops responding to query requests.
The request load on the server is NOT high; it is low to medium (a few requests per hour, maybe fewer).
When checking the last gremlin-server log messages, I see they are about an HBase session timeout and retries to reconnect to HBase.
The server CPU and memory are at 90-100% at this point.
JDK 1.8.0_45-b14 64bit on Red Hat
Using jstat -gc I can see that all its time is spent in GC, and the old generation is at 100%.
I have set "-Xmx 8g", but virtual memory in htop goes up to 12g. After a few tests with different -Xmx values, I see that virtual memory always reaches about "-Xmx + 4g".
jmap -histo shows about 2g of [B (byte[]), with a gig for CacheRelation and a gig for CacheVertex.
After restarting gremlin-server, everything is back to normal and works again.
Any ideas?
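
The two jstat observations above (old generation full, nearly all time spent in GC) can be quantified from raw `jstat -gc` output. The following is a minimal sketch, assuming the JDK 8 column layout where OC/OU are old generation capacity and usage in KB and GCT is cumulative GC time in seconds. The sample text is hypothetical and was not taken from the server in question; the function names are illustrative only.

```python
def parse_jstat_gc(text):
    """Parse `jstat -gc` output (one header row, then sample rows) into dicts."""
    lines = [ln.split() for ln in text.strip().splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]
    return [dict(zip(header, map(float, row))) for row in rows]


def old_gen_percent(sample):
    """Old generation occupancy: OU (used KB) as a percentage of OC (capacity KB)."""
    return 100.0 * sample["OU"] / sample["OC"]


def gc_time_fraction(first, last, interval_seconds):
    """Rough fraction of the interval spent in GC; GCT is a cumulative counter."""
    return (last["GCT"] - first["GCT"]) / interval_seconds


# Hypothetical samples taken 10 seconds apart (e.g. `jstat -gc <pid> 10s`).
SAMPLE = """
 S0C    S1C    S0U    S1U      EC       EU        OC         OU       MC     MU    CCSC   CCSU   YGC     YGCT    FGC    FGCT     GCT
1024.0 1024.0  0.0    0.0   8192.0   8192.0   20480.0    20480.0   4864.0 4096.0 512.0  384.0     100    1.000   50   100.000  101.000
1024.0 1024.0  0.0    0.0   8192.0   8192.0   20480.0    20470.0   4864.0 4096.0 512.0  384.0     100    1.000   55   109.500  110.500
"""

samples = parse_jstat_gc(SAMPLE)
print("old gen %:", round(old_gen_percent(samples[-1]), 1))
print("GC time fraction:", round(gc_time_fraction(samples[0], samples[-1], 10.0), 2))
```

An old generation that stays near 100% after repeated full collections, together with a GC time fraction close to 1, would be consistent with the heap being filled by live objects such as the cache entries seen in the histogram.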

Related

MariaDB keeps exceeding innodb_buffer_pool_size

I have a backend server with 1G RAM for my HTTP server and my MariaDB.
I noticed the database keeps getting killed by OOM once or twice a day. Most of the time the OOM is triggered by the HTTP server, but not always.
I tried limiting innodb_buffer_pool_size many times; it is currently at 64M, but the process is still taking 40% to 60% of the server's memory.
How do I find the reason for this memory usage? It appears to be some kind of memory leak, because it keeps increasing throughout the day.
The database usually starts out consuming about 7% to 9% of memory.
MariaDB version 10.5

Node web app running in Fargate crashes under load with memory and CPU relatively untaxed

We are running a Koa web app in 5 Fargate containers. They are pretty straightforward CRUD/REST APIs with Koa over Mongo Atlas. We started doing capacity testing and noticed that the Node servers started to slow down significantly with plenty of headroom left on CPU (sitting at 30%), memory (sitting at or below 20%), and Mongo (still returning in < 10ms).
To further test this, we removed the Mongo operations and just hammered our health-check endpoints. We did see a lot of throughput, but significant degradation occurred at 25% CPU and Node actually crashed at 40% CPU.
Our Fargate tasks (containers) are CPU: 2048 (2 "virtual CPUs") and Memory: 4096 (4 gigs).
We raised our ulimit nofile to 64000 and also set the max-old-space-size to 3.5 GB. This didn't result in a significant difference.
We also don't see significant latency in our load balancer.
My expectation is that CPU or memory would climb much higher before the system began experiencing issues.
Any ideas where a bottleneck might exist?
The main issue here was that we were running containers with 2 CPUs. Since Node only effectively uses 1 CPU, there was always a certain amount of CPU allocation that was never used. The ancillary overhead never got the container to 100%. So Node would be overwhelmed on its 1 CPU while the other was basically idle. This resulted in our autoscaling alarms never getting triggered.
So we adjusted to 1-CPU containers with more horizontal scale-out (i.e. more instances).
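
The arithmetic behind this can be sketched as follows. This is an illustration only, assuming the container CPU metric is an average across all vCPUs; the 10% overhead on the second vCPU and the 70% alarm threshold are hypothetical numbers, not values from our setup.

```python
def container_cpu_percent(per_vcpu_busy):
    """Average utilization across all vCPUs (assumed to be how the metric is reported)."""
    return 100.0 * sum(per_vcpu_busy) / len(per_vcpu_busy)


ALARM_THRESHOLD = 70.0  # hypothetical autoscaling threshold, in percent

# Node saturates one vCPU (1.0); the second vCPU only carries minor overhead (0.1).
two_vcpu = container_cpu_percent([1.0, 0.1])
# The same saturated Node process in a 1-vCPU container.
one_vcpu = container_cpu_percent([1.0])

print("2 vCPUs:", round(two_vcpu, 1), "alarm fires:", two_vcpu >= ALARM_THRESHOLD)
print("1 vCPU: ", round(one_vcpu, 1), "alarm fires:", one_vcpu >= ALARM_THRESHOLD)
```

Under these assumptions the 2-vCPU container tops out a little above 50% even when Node is fully saturated, so a threshold above that level is never crossed.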

GC in Server Mode Not Collecting the Memory

An IIS-hosted WCF service is consuming a large amount of memory, around 18 GB, and the server has slowed down.
I analyzed a minidump file and it shows only 1 GB of active objects. I understand the GC is not clearing the memory, and the GC must be running in server mode on a 64-bit system. Any idea why the whole computer is stalling and the app is taking so much memory?
The GC was running in server mode, which had been configured for better performance. I understand that server mode improves performance because collections are triggered less frequently when plenty of memory is available, and because server mode has a higher limit on memory usage. The problem here was that when the process reached that high limit, the CLR triggered the GC, which tried to clear the whole 18 GB of memory in one shot. It used 90% of system resources, and the remaining applications lagged.
We tried restarting, but it took forever, so we had to kill the process. Now, in workstation mode, the GC runs smooth and clean. The only difference is that response time shows some delay due to GC after 1.5 GB of allocation.
One more piece of info: .NET 4.5 has a GC revision that resolves this issue.

CPU usage 350% while running DSE 4.x

VM configuration is - CentOS 6.2, 64-bit, 8 GB RAM Quad Core CPU.
There is about 1 GB of data and possibly 20 tables in the C* setup I have. When I try to start DSE after rebooting the VM, it takes a long time to start. So I ran the top command and found that the CPU usage was shooting up to 350%.
Please see the screenshot attached.
Requesting pointers from the experts here: how can the CPU usage shoot up to more than 100%, or does the number indicate something else?
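
For reference, the per-process %CPU figure in top on Linux is, in its default mode, relative to a single core, so a multi-threaded process on a quad-core machine can show up to 400%. A minimal sketch of the conversion, assuming that default mode and the 4 cores described above (the function names are illustrative):

```python
def cores_busy(top_cpu_percent):
    """Number of cores' worth of CPU time the process is using."""
    return top_cpu_percent / 100.0


def share_of_machine(top_cpu_percent, cores):
    """The same reading expressed as a percentage of the whole machine."""
    return top_cpu_percent / cores


print(cores_busy(350.0))           # cores' worth of work
print(share_of_machine(350.0, 4))  # percent of total capacity on 4 cores
```

Read this way, 350% on four cores means the process keeps about three and a half cores busy.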

Java OutOfMemoryError in Windows Azure Virtual Machine

When I run my Java applications on a Windows Azure Ubuntu 12.04 VM with four 1.6 GHz cores and 7G RAM, I get the following out of memory error after a few minutes.
java.lang.OutOfMemoryError: GC overhead limit exceeded
I have a swap size of 15G, and the max heap size is set to 2G. I am using Oracle Java 1.6. Increasing the max heap size only delays the out of memory error.
It seems the JVM is not doing garbage collection.
However, when I run the above Java application on my local Windows 8 PC (Core i7) with the same JVM parameters, it runs fine. The heap size never exceeds 1G.
Is there any extra setting on a Windows Azure Linux VM for running Java apps?
On Azure VM, I used the following JVM parameters
-XX:+HeapDumpOnOutOfMemoryError
to get a heap dump. The heap dump shows an actor mailbox and Camel messages are taking up all the 2G.
In my Akka application, I have used Akka Camel Redis to publish processed messages to a Redis channel.
The out of memory error goes away when I stub out the above Camel Actor. It looks as though the Akka Camel Redis Actor is not performant on the VM, which has a slower CPU clock speed than my Xeon CPU.
Shing
The GC throws this exception when too much time is spent in garbage collection without collecting anything. I believe the default settings are 98% of CPU time being spent on GC with only 2% of heap being recovered.
This is to prevent applications from running for an extended period of time while making no progress because the heap is too small.
You can turn this off with the command line option -XX:-UseGCOverheadLimit
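
The rule described above can be sketched as a simple check. This is a simplified illustration that follows the wording of this answer (more than 98% of time in GC, less than 2% of heap recovered); it is not HotSpot's actual implementation, and the function and constant names are made up for the example.

```python
GC_TIME_LIMIT = 0.98       # fraction of time spent in GC (per the description above)
HEAP_RECOVERED_MIN = 0.02  # fraction of the heap a collection should recover


def gc_overhead_limit_exceeded(gc_seconds, total_seconds,
                               heap_before, heap_after, heap_max):
    """True when almost all time goes to GC and almost nothing is reclaimed."""
    time_fraction = gc_seconds / total_seconds
    recovered_fraction = (heap_before - heap_after) / heap_max
    return time_fraction > GC_TIME_LIMIT and recovered_fraction < HEAP_RECOVERED_MIN


# Hypothetical 2G heap (sizes in MB): 9.9 of 10 seconds in GC, 8 MB reclaimed.
print(gc_overhead_limit_exceeded(9.9, 10.0, 2048, 2040, 2048))
# Healthy case: 2 of 10 seconds in GC, about half the heap reclaimed.
print(gc_overhead_limit_exceeded(2.0, 10.0, 2048, 1000, 2048))
```

Turning the check off does not free any memory; if the heap is genuinely full of live objects, as with the mailbox and Camel messages in the heap dump above, the application would be expected to keep struggling until the backlog itself is addressed.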

Resources