Java OutOfMemoryError in Windows Azure Virtual Machine - azure

When I run my Java application on a Windows Azure Ubuntu 12.04 VM, with four 1.6 GHz cores and 7 GB of RAM, I get the following out-of-memory error after a few minutes:
java.lang.OutOfMemoryError: GC overhead limit exceeded
I have a 15 GB swap, and the max heap size is set to 2 GB. I am using Oracle Java 1.6. Increasing the max heap size only delays the out-of-memory error.
It seems the JVM is not doing garbage collection.
However, when I run the same Java application on my local Windows 8 PC (Core i7) with the same JVM parameters, it runs fine; the heap size never exceeds 1 GB.
Is there any extra setting on a Windows Azure Linux VM for running Java apps?

On the Azure VM, I used the JVM parameter
-XX:+HeapDumpOnOutOfMemoryError
to get a heap dump. The heap dump shows that an actor mailbox and Camel messages are taking up all of the 2 GB.
In my Akka application, I use Akka Camel Redis to publish processed messages to a Redis channel.
The out-of-memory error goes away when I stub out the Camel actor. It looks as though the Akka Camel Redis actor cannot keep up on the VM, which has a slower CPU clock speed than my Xeon CPU.
Shing

The GC throws this exception when too much time is spent in garbage collection while recovering very little memory. By default, the limit is hit when more than 98% of the total time is spent in GC and less than 2% of the heap is recovered.
This exists to prevent applications from running for an extended period of time while making no progress because the heap is too small.
You can turn the check off with the command-line option -XX:-UseGCOverheadLimit.
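The 98%/2% ratio is tracked internally by the JVM, but you can watch the same raw numbers yourself through the GC MXBeans. A small self-contained sketch (not specific to the asker's application) that prints per-collector counts and the rough fraction of uptime spent in GC:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcOverheadProbe {
    public static void main(String[] args) {
        long gcMillis = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // getCollectionTime() is the accumulated elapsed time (ms) this
            // collector has spent; -1 means the value is unavailable.
            System.out.printf("%s: %d collections, %d ms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            gcMillis += Math.max(gc.getCollectionTime(), 0);
        }
        long uptime = ManagementFactory.getRuntimeMXBean().getUptime();
        System.out.printf("~%.1f%% of %d ms uptime spent in GC%n",
                uptime > 0 ? 100.0 * gcMillis / uptime : 0.0, uptime);
    }
}
```

If this percentage climbs toward the high nineties while the heap stays near -Xmx, the "GC overhead limit exceeded" error is close behind.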

Related

Netty webclient memory leak in tomcat server

I am observing a swap memory issue on our Tomcat servers, which are installed on Linux machines. When I tried to collect a heap dump, I got this while analyzing it:
16 instances of "io.netty.buffer.PoolArena$HeapArena", loaded by "org.apache.catalina.loader.ParallelWebappClassLoader # 0x7f07994aeb58" occupy 201,697,824 (15.40%) bytes.
I have seen in the blog post Memory accumulated in netty PoolChunk that adding -Dio.netty.allocator.type=unpooled showed a significant reduction in memory. Where do we need to add this property on our Tomcat servers?
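One common place to put JVM system properties for Tomcat, assuming a standard Tomcat layout (the exact path depends on your installation), is a bin/setenv.sh file, which catalina.sh sources on startup:

```shell
# bin/setenv.sh -- sourced by catalina.sh on startup; create it if absent
CATALINA_OPTS="$CATALINA_OPTS -Dio.netty.allocator.type=unpooled"
```

Using CATALINA_OPTS (rather than JAVA_OPTS) keeps the property out of the separate JVM that runs the shutdown command.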

G1GC and Permgen

I have doubts about which metrics I should use to allocate memory for the permgen.
I'm having crashes because the permgen is full. My server has 32 GB of memory for the heap and 512 MB for the permgen; are there any metrics or recommendations for sizing the permgen? Another doubt is about the GC: I configured G1GC because, from what I had researched, it was one of the best options, but I noticed that it demands more heap memory. Is there a better GC for a server under heavy load that needs precise collection, or would it be the same?
CentOS operating system
Java 7
tomcat 7
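There is no universal metric for sizing the permgen: it holds class metadata, so it scales with the number of loaded classes (and with webapp redeploys), not with heap traffic. A hedged starting point for Java 7 (these are real HotSpot flags, but the values are only illustrative) is to raise the limit and capture a dump on failure so you can see what is filling it:

```
-XX:PermSize=512m
-XX:MaxPermSize=1g
-XX:+HeapDumpOnOutOfMemoryError
```

One caveat worth knowing: on Java 7, G1 unloads classes only during a full GC, so a classloader leak (for example from repeated webapp redeploys on Tomcat) will eventually fill the permgen no matter how large you make it.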

Why does the yarn node manager die when running spark application?

I am running a spark-java application on yarn with dynamic allocation enabled. The Yarn Node Manager halts, and I see java.lang.OutOfMemoryError: GC overhead limit exceeded in the Node Manager logs.
Naturally, I increased the memory for the Node Manager from 1G to 2G and then to 4G and I still see the same issues.
The strange thing is that this app used to work well on the Cloudera cluster; now that we have switched to Hortonworks, I see these issues.
When looking at Grafana charts for the node manager, I can see that the node that has died was using only 60% of its heap.
One side question: is it normal for Spark to use Netty and NIO simultaneously? I ask because I see things like:
ERROR server.TransportRequestHandler (TransportRequestHandler.java:lambda$respond$2(226)) - Error sending result ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=2003440799655, chunkIndex=197}, buffer=FileSegmentManagedBuffer{file=/folder/application_1643748238319_0002/blockmgr-70443374-6f01-4960-90f9-b045f87798af/0f/shuffle_0_516_0.data, offset=55455, length=1320}} to /xxx.xxx.xxx.xxx:xxxx; closing connection
java.nio.channels.ClosedChannelException
at org.spark_project.io.netty.channel.AbstractChannel$AbstractUnsafe.close(...)(Unknown Source)
Anyway, I see the OutOfMemoryError exception in several scenarios.
YarnUncaughtExceptionHandler
yarn.YarnUncaughtExceptionHandler (YarnUncaughtExceptionHandler.java:uncaughtException(51)) - Thread Thread[Container Monitor,5,main] threw an Error. Shutting down now...
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.io.BufferedReader.<init>(BufferedReader.java:105)
at java.io.BufferedReader.<init>(BufferedReader.java:116)
at org.apache.hadoop.yarn.util.ProcfsBasedProcessTree.constructProcessInfo(ProcfsBasedProcessTree.java:528)
at org.apache.hadoop.yarn.util.ProcfsBasedProcessTree.updateProcessTree(ProcfsBasedProcessTree.java:211)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl$MonitoringThread.run(ContainersMonitorImpl.java:457)
TransportRequestHandler Error
java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.apache.spark.network.util.ByteArrayWritableChannel.<init>(ByteArrayWritableChannel.java:32)
at org.apache.spark.network.sasl.SaslEncryption$EncryptedMessage.<init>(SaslEncryption.java:160)
at org.apache.spark.network.sasl.SaslEncryption$EncryptionHandler.write(SaslEncryption.java:87)
and
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
at com.sun.crypto.provider.CipherCore.update(CipherCore.java:666)
at com.sun.crypto.provider.DESedeCipher.engineUpdate(DESedeCipher.java:228)
at javax.crypto.Cipher.update(Cipher.java:1795)
Long Pause
util.JvmPauseMonitor (JvmPauseMonitor.java:run(205)) - Detected pause in JVM or host machine (eg GC): pause of approximately 1326ms
The main reason for this issue is that your containers are configured with more memory than the machine physically has. In addition, the number of vcores should be aligned with the real count, vcores = (CPUs * cores per CPU). If you set 16 GB but your physical machine has only 8 GB, your container will try to allocate 16 GB, and YARN will kill the container due to OOM.
Check these settings in YARN:
yarn.nodemanager.resource.memory-mb=(value for a single machine memory, not for the sum of all machines)
yarn.nodemanager.resource.cpu-vcores=(cpu * cores) and for all vcores related params
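These properties normally live in yarn-site.xml on each worker node. The values below are only illustrative, for a hypothetical machine with 8 GB of RAM and 8 vcores, leaving headroom for the OS and the NodeManager itself:

```xml
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>6144</value>
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>8</value>
</property>
```

After changing them, restart the NodeManager on each node so the new limits take effect.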

JVM GC behaviour on heap dump and unnecessary heap usage

We have problems tuning the memory management of our JVMs. The very same application runs in several pods on a k8s cluster, but one pod's JVM heap usage rises to ~95%, and when we try to get a heap dump on that VM, somehow a GC runs and the heap usage drops suddenly, leaving us with a tiny heap dump.
I think the old space has grown unnecessarily, and the GC did not reclaim memory (for nearly 15 hours). Unfortunately we can't see what is occupying the space, because the heap dump is very small once the GC is forced.
All 3 pods have a memory limit of 1500m, and
here is the JVM heap usage percentage graph (3 pods, green being the problematic one):
Details:
openjdk 15.0.1 2020-10-20
OpenJDK Runtime Environment AdoptOpenJDK (build 15.0.1+9)
OpenJDK 64-Bit Server VM AdoptOpenJDK (build 15.0.1+9, mixed mode, sharing)
JVM Parameters:
-XX:MaxRAMPercentage=75
-XX:InitialRAMPercentage=75
-server
-Xshare:off
-XX:MaxMetaspaceSize=256m
-Dsun.net.inetaddr.ttl=60
-XX:-OmitStackTraceInFastThrow
-XX:+ShowCodeDetailsInExceptionMessages
The questions are:
Why is a full GC triggered when we try to get a heap dump?
Why does the GC not reclaim memory, letting the application run with heap usage between ~70% and ~95%, when the JVM could work perfectly well with only 10%?
What can be done to force the JVM to GC more aggressively to avoid this situation? And should that even be done in a production environment?
A JVM heap dump can be taken in two modes:
live objects - this mode executes a full GC alongside the heap dump. This is the default.
all objects - the heap dump includes all objects on the heap, both reachable and unreachable.
The heap dump mode can usually be chosen via a tool-specific option.
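The two modes correspond to the boolean `live` argument of the HotSpot diagnostic MXBean, the same switch that `jmap -dump:live,format=b,file=...` exposes on the command line. A minimal self-contained sketch that takes an all-objects dump programmatically:

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.File;
import java.lang.management.ManagementFactory;

public class HeapDumper {
    // live = true runs a full GC first and keeps only reachable objects
    // (which is what produced the asker's tiny dump); live = false keeps
    // everything, including garbage that has not been collected yet.
    static File dump(boolean live) {
        try {
            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            File out = File.createTempFile(live ? "live" : "all", ".hprof");
            out.delete(); // dumpHeap refuses to overwrite an existing file
            bean.dumpHeap(out.getAbsolutePath(), live);
            return out;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        File all = dump(false);
        System.out.println("Wrote " + all.length() + " bytes to " + all);
        all.delete();
    }
}
```

For the asker's situation, running `jmap -dump:format=b,file=heap.hprof <pid>` without the `live` option avoids the forced full GC and keeps the unreachable objects in the dump.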
Answering your questions
Why is a full GC triggered when we try to get a heap dump?
Answered above.
Why does the GC not reclaim memory, letting the application run with heap usage between ~70% and ~95%, when the JVM could work perfectly well with only 10%?
Reclaiming memory requires CPU resources and impacts application latency. While the JVM is operating within its memory limits, it will mostly avoid expensive GC.
The recent rise of containers is driving some changes in the JVM GC department, but the statement above still holds for the default GC configuration.
What can be done to force the JVM to GC more aggressively to avoid this situation? And should that even be done in a production environment?
The original question lacks a concrete problem statement, but some general advice:
manage memory limits per container (the JVM derives its heap size from the container limits unless they are overridden explicitly)
forcing GC periodically is possible, though it is unlikely to be a solution to any real problem
G1GC has a wide range of tuning options relevant for containers
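As a sketch of that last point, these are real HotSpot flags; the values are only illustrative, not a recommendation for the asker's workload:

```
-XX:+UseG1GC
-XX:MaxRAMPercentage=75.0
-XX:InitiatingHeapOccupancyPercent=35
-XX:MaxGCPauseMillis=200
```

MaxRAMPercentage sizes the heap from the container memory limit (the asker already sets this), InitiatingHeapOccupancyPercent makes G1 start its concurrent marking cycle earlier so memory is reclaimed sooner, and MaxGCPauseMillis is G1's pause-time goal.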

GC in Server Mode Not Collecting the Memory

An IIS-hosted WCF service is consuming a large amount of memory, around 18 GB, and the server has slowed down.
I analyzed a mini dump file and it shows only 1 GB of live objects. I understand that the GC is not clearing the memory and that the GC must be running in server mode on a 64-bit system. Any idea why the whole computer is stalling and the app is taking so much memory?
The GC was running in server mode; it was configured that way for better performance. I understand that the GC running in server mode gives a performance improvement because collections are not triggered as frequently when plenty of memory is available, and server mode has a higher limit on memory usage. The problem here was that when the process reached that high limit, the CLR triggered a GC that tried to clear the huge 18 GB of memory in one shot, so it was using 90% of the system's resources and the other applications were lagging.
We tried restarting, but it was taking forever, so we had to kill the process. With workstation-mode GC it now runs smooth and clean. The only difference is some delay in response time due to GC kicking in after about 1.5 GB of allocation.
One more piece of info: .NET 4.5 includes GC revisions that resolve this issue.
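For reference, server vs. workstation GC in .NET is selected through the runtime configuration. For a standalone app this goes in the app's .config file; for IIS-hosted services the effective setting comes from the ASP.NET/app-pool configuration, so treat this as a sketch of the standalone form only:

```xml
<configuration>
  <runtime>
    <!-- false = workstation GC; true = server GC -->
    <gcServer enabled="false"/>
  </runtime>
</configuration>
```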
