We have two CDH clusters on the same version (CDH-5.5.2-1.cdh5.5.2.p0.4), and the ResourceManager of each cluster has the same configuration.
One of the ResourceManagers runs well: its heap usage stays at a roughly constant value (e.g. 800 MB) over time.
The other one, however, throws an OOM exception and exits after about 15 days. When we use 'jmap -F -histo' to dump its JVM heap histogram, we see that the total size of char[] objects keeps growing over time until the process finally throws OOM.
Below is the key information from the JVM dumps of both the good RM and the OOM RM:
dump cmd: jmap -F -histo pid
A) JVM dump of the good RM in cluster A
[we see 400,000+ char[] instances taking 60+ MB of heap][1]
B) JVM dump of the backup (OOM) RM in cluster B
[we see 300,000+ char[] instances but taking 400+ MB of heap][2]
Any help will be appreciated.
We dumped the heap today (jmap -F -dump:file=file.dump_result pid) and used MAT (Memory Analyzer Tool) to analyse the dump file. We found that the instance variable applications (a java.util.concurrent.ConcurrentHashMap) in org.apache.hadoop.yarn.server.resourcemanager.RMActiveServiceContext eats up a lot of memory:
call hierarchy information
instance variable: applications
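For reference, the char[] growth itself can be confirmed by diffing two histograms taken some days apart, using the same jmap command as above (the pid is a placeholder):

jmap -F -histo <rm_pid> > histo_day01.txt
# ...a few days later...
jmap -F -histo <rm_pid> > histo_day10.txt
diff histo_day01.txt histo_day10.txt | grep '\[C'    # char[] shows up as [C in the histogram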
Related
I need some advice on an issue I am facing with k8s 1.14 and running gitlab pipelines on it. Many jobs are failing with exit code 137 errors, and I found that it means the container is being terminated abruptly.
Cluster information:
Kubernetes version: 1.14
Cloud being used: AWS EKS
Node: C5.4xLarge
After digging in, I found the below logs:
**kubelet: I0114 03:37:08.639450** 4721 image_gc_manager.go:300] [imageGCManager]: Disk usage on image filesystem is at 95% which is over the high threshold (85%). Trying to free 3022784921 bytes down to the low threshold (80%).
**kubelet: E0114 03:37:08.653132** 4721 kubelet.go:1282] Image garbage collection failed once. Stats initialization may not have completed yet: failed to garbage collect required amount of images. Wanted to free 3022784921 bytes, but freed 0 bytes
**kubelet: W0114 03:37:23.240990** 4721 eviction_manager.go:397] eviction manager: timed out waiting for pods runner-u4zrz1by-project-12123209-concurrent-4zz892_gitlab-managed-apps(d9331870-367e-11ea-b638-0673fa95f662) to be cleaned up
**kubelet: W0114 00:15:51.106881** 4781 eviction_manager.go:333] eviction manager: attempting to reclaim ephemeral-storage
**kubelet: I0114 00:15:51.106907** 4781 container_gc.go:85] attempting to delete unused containers
**kubelet: I0114 00:15:51.116286** 4781 image_gc_manager.go:317] attempting to delete unused images
**kubelet: I0114 00:15:51.130499** 4781 eviction_manager.go:344] eviction manager: must evict pod(s) to reclaim ephemeral-storage
**kubelet: I0114 00:15:51.130648** 4781 eviction_manager.go:362] eviction manager: pods ranked for eviction:
1. runner-u4zrz1by-project-10310692-concurrent-1mqrmt_gitlab-managed-apps(d16238f0-3661-11ea-b638-0673fa95f662)
2. runner-u4zrz1by-project-10310692-concurrent-0hnnlm_gitlab-managed-apps(d1017c51-3661-11ea-b638-0673fa95f662)
3. runner-u4zrz1by-project-13074486-concurrent-0dlcxb_gitlab-managed-apps(63d78af9-3662-11ea-b638-0673fa95f662)
4. prometheus-deployment-66885d86f-6j9vt_prometheus(da2788bb-3651-11ea-b638-0673fa95f662)
5. nginx-ingress-controller-7dcc95dfbf-ld67q_ingress-nginx(6bf8d8e0-35ca-11ea-b638-0673fa95f662)
And then the pods get terminated resulting in the exit code 137s.
Can anyone help me understand the reason and a possible solution to overcome this?
Thank you :)
Exit code 137 does not necessarily mean OOMKilled. It indicates that the container received SIGKILL (some external interrupt or the kernel 'oom-killer' [out of memory]).
If the pod got OOMKilled, you will see the lines below when you describe the pod:
State: Terminated
Reason: OOMKilled
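For example (the pod name is a placeholder; the namespace is taken from the logs above):

kubectl describe pod <pod-name> -n gitlab-managed-apps
# look at the State / Last State / Reason fields of the container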
Edit on 2/2/2022
I see that you added **kubelet: I0114 03:37:08.639450** 4721 image_gc_manager.go:300] [imageGCManager]: Disk usage on image filesystem is at 95% which is over the high threshold (85%). Trying to free 3022784921 bytes down to the low threshold (80%). and 'must evict pod(s) to reclaim ephemeral-storage' from the log. This usually happens when application pods write a lot to disk, such as log files. Admins can configure at what disk usage percentage the kubelet starts evicting.
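As a sketch only (the values shown mirror the thresholds in the log; on EKS these are typically passed through the node bootstrap's kubelet extra args, so the exact mechanism depends on how the nodes are provisioned), the relevant kubelet flags look like:

--image-gc-high-threshold=85
--image-gc-low-threshold=80
--eviction-hard=nodefs.available<10%,imagefs.available<15%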
137 means that k8s killed the container for some reason (maybe it didn't pass its liveness probe).
Code 137 is 128 + 9 (SIGKILL): the process was killed by an external signal.
The typical causes for this error code are the system running out of RAM or a failed health check.
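To tell the two causes apart, a couple of hedged checks (pod name is a placeholder): kernel oom-killer activity shows up in the node's kernel log, while a failed liveness probe shows up in the pod's events:

dmesg -T | grep -i 'killed process'                       # run on the node; lists oom-killer victims
kubectl describe pod <pod-name> | grep -i -A3 'liveness'  # probe failures appear under Events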
I was able to solve the problem.
The nodes initially had a 20 GB EBS volume on a c5.4xlarge instance type. I increased the EBS volume to 50 GB and then 100 GB, but that did not help, as I kept seeing the error below:
"Disk usage on image filesystem is at 95% which is over the high
threshold (85%). Trying to free 3022784921 bytes down to the low
threshold (80%). "
I then changed the instance type to c5d.4xlarge, which has 400 GB of local cache storage, and gave it 300 GB of EBS. This solved the error.
Some of the gitlab jobs were for java applications that were eating up a lot of that cache space and writing a lot of logs.
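For anyone debugging the same symptom, the filesystem the kubelet warning refers to can be inspected directly on the node (paths assume a default docker-based AMI):

df -h /var/lib/docker                                                    # the image filesystem from the warning
sudo du -sh /var/lib/docker/containers/* 2>/dev/null | sort -h | tail    # per-container logs and writable layers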
Detailed Exit code 137
It denotes that the process was terminated by an external signal.
The number 137 is the sum of two numbers: 128 + x, where x is the signal number sent to the process that caused it to terminate.
In the example, x equals 9, which is the number of the SIGKILL signal, meaning the process was killed forcibly.
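The same arithmetic can be reproduced in any shell (this is just an illustration, nothing gitlab-specific):

sleep 300 &
kill -9 $!        # send SIGKILL (signal 9) to the background job
wait $!
echo $?           # prints 137 = 128 + 9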
Hope this helps better.
Check the Jenkins master node's memory and CPU profile. In my case, the master was under high memory and CPU utilization, and slaves were getting restarted with 137.
We have a Java microservice in our application which is connected to Postgres as well as Phoenix. We are using Spring Boot 2.x.
The problem is that when we run endurance testing on our application for about 8 hours, we observe that the used heap keeps increasing even though we applied the recommended suggestions for the VM arguments; it looks like a memory leak. We analysed the heap dump, but the root cause is not clear to us. Can some experts help based on the results?
The VM arguments that we are actually using are:
-XX:ConcGCThreads=8 -XX:+DisableExplicitGC -XX:InitialHeapSize=536870912 -XX:InitiatingHeapOccupancyPercent=45 -XX:MaxGCPauseMillis=1000 -XX:MaxHeapFreeRatio=70 -XX:MaxHeapSize=536870912 -XX:MinHeapFreeRatio=40 -XX:ParallelGCThreads=16 -XX:+PrintAdaptiveSizePolicy -XX:+PrintGC -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:StringDeduplicationAgeThreshold=1 -XX:+UseCompressedClassPointers -XX:+UseCompressedOops -XX:+UseG1GC -XX:+UseStringDeduplication
We expect the used heap to stay flat in the GC log; however, memory is not released and consumption keeps increasing.
Heap Dump:
GC graph:
I'm not sure which tool you are using above, but I would be looking for the dominator hierarchy in the heap. Eclipse MAT is a good tool to analyse heap dumps and it can point you in the direction of what's actually holding the memory and you can decide if you want to categorise it as a leak or not. Regardless of the label you attach, if the application is going to crash after a while because it runs out of memory, then it is a problem.
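It also helps to let the JVM write a dump automatically if it does eventually fail; these are standard HotSpot flags you can add alongside the ones you already have (the path is just an example):

-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/heapdump.hprof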
This blog also discusses diagnosing this type of problem.
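One relatively low-overhead way to see what is actually accumulating during the 8-hour run is to take periodic class histograms of live objects and compare them afterwards (a sketch; <pid> is a placeholder):

# every 30 minutes, record a histogram of live objects for later comparison
while true; do
  jcmd <pid> GC.class_histogram > histo_$(date +%Y%m%d_%H%M).txt
  sleep 1800
done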
We have a 12-node Cassandra cluster across 2 different datacenters. We are migrating data from a SQL DB to Cassandra through a .NET application, and there is another .NET app that reads data from Cassandra. Recently we have been seeing one node or another going down (nodetool status shows DN and the service is stopped on it). Below is the output of nodetool status. We have to start the service again to get it working, but it stops again.
https://ibb.co/4P1T453
Path to the log: https://pastebin.com/FeN6uDGv
So in looking through your pastebin, I see a few things that can be adjusted.
First I'm reasonably sure that this is your primary issue:
Unable to lock JVM memory (ENOMEM). This can result in part of the JVM being swapped out,
especially with mmapped I/O enabled. Increase RLIMIT_MEMLOCK or run Cassandra as root.
From GNU Error Codes:
Macro: int ENOMEM
“Cannot allocate memory.” The system cannot allocate more virtual
memory because its capacity is full.
-Xms12G, -Xmx12G, -Xmn3000M,
How much RAM is on your instance? From what I'm seeing your node is dying from an OOM (Out of Memory error). My guess is that you're designating too much RAM to the heap, and there isn't enough for the OS/page-cache. In fact, I wouldn't designate much more than 50%-60% of RAM to the heap.
For example, I mostly build instances on 16GB of RAM, and I've found that a 10GB max heap is about as high as you'd want to go on that.
-XX:+UseParNewGC, -XX:+UseConcMarkSweepGC
In fact, as you're using CMS GC, I wouldn't go higher than 8GB for max heap size.
Maximum number of memory map areas per process (vm.max_map_count) 65530 is too low,
recommended value: 1048575, you can change it with sysctl.
This means you haven't adjusted your limits.conf or sysctl.conf. Check through the guide (DSE 6.0 - Recommended Production Settings), but generally it's a good idea to add the following to these files:
/etc/limits.conf
* - memlock unlimited
* - nofile 100000
* - nproc 32768
* - as unlimited
/etc/sysctl.conf
vm.max_map_count = 1048575
Note: After adjusting sysctl.conf, you'll want to run a sudo sysctl -p or reboot.
Is swap disabled? : false,
You will want to disable swap. If Cassandra starts swapping contents of RAM to disk, things will get really slow. Run a swapoff -a and then edit /etc/fstab and remove any swap entries.
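A minimal sketch of that (double-check /etc/fstab afterwards, since the sed pattern simply comments out any line containing a swap field):

sudo swapoff -a
sudo sed -i.bak '/\sswap\s/ s/^/#/' /etc/fstab    # comments out swap entries; .bak keeps a backup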
tl;dr; Summary
Set your initial and max heap sizes to 8GB (heap new size is fine).
Modify your limits.conf and sysctl.conf files appropriately.
Disable swap.
It's also a good idea to get on the latest version of 3.11 (3.11.4).
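As a sketch, assuming the heap is still set in cassandra-env.sh (some 3.11 installs use jvm.options with -Xms/-Xmx instead, and the path varies by install), the summary above translates to something like:

# cassandra-env.sh
MAX_HEAP_SIZE="8G"
HEAP_NEWSIZE="3000M"    # keeping the existing new-gen size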
Hope this helps!
Currently I'm running an application in Tomcat 7 with the following jvm arguments:
-Dcatalina.home=E:\Tomcat
-Dcatalina.base=E:\Tomcat
-Djava.endorsed.dirs=E:\Tomcat\endorsed
-Djava.io.tmpdir=E:\TomcatE\temp
-Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager
-Djava.util.logging.config.file=E:\Tomcat\conf\logging.properties
-XX:MaxPermSize=512m
-XX:PermSize=512m
-XX:+UseConcMarkSweepGC
-XX:NewSize=7g
-XX:MaxTenuringThreshold=31
-XX:CMSInitiatingOccupancyFraction=90
-XX:+UseCMSInitiatingOccupancyOnly
-XX:SurvivorRatio=6
-XX:TargetSurvivorRatio=90
-verbose:gc -XX:+PrintGCDetails
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintGCDateStamps
-Xloggc:E:\Tomcat7\gc.log
I'm using CMS as the garbage collector and the behavior seems very strange. Even with 13 GB of old generation, when a major collection is performed (I guess at 90% occupancy -> -XX:CMSInitiatingOccupancyFraction=90), CMS is not able to clean up a large amount of objects (at least 7 GB remains occupied). I don't believe the application has that many long-lived objects (not sure!). Isn't CMS supposed to release much more space? Or could it be something related to fragmentation?
Because of this behavior I'm getting frequent CMS cycles... which I would like to reduce.
Even though I'm using a low-pause GC, the application sometimes stops for 15-30 seconds... How can I decrease pause times with CMS?
Would it be a good idea to have more JVMs instead of a single one with 20GB of heap?
Thanks a lot
First, you can dump the heap to a file with:
jmap -dump:live,format=b,file=heap.bin ${pid}
and then find the long-lived objects with MAT.
Second, because the heap size is bigger than 8 GB, you can try the Garbage-First (G1) collector.
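For example (illustrative only; the sizes are placeholders, and the CMS-specific flags above would be removed), the switch could look like:

-XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xms20g -Xmx20g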
I have apache-tomcat as my web server.
I want to check what heap size is allocated to the JVM on Linux.
Also, where can I modify it?
A simple way on Linux is to run the following:
ps -ef |grep tomcat
Look for the starting and maximum JVM memory:
-Xms1024m -Xmx4096m
In this case it is allocating 1G on startup and the Maximum is 4G.
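If the -Xms/-Xmx flags don't appear on the command line, you can also ask the running JVM what values it actually resolved (the pid is a placeholder; jcmd ships with the JDK):

jcmd <pid> VM.flags | tr ' ' '\n' | grep -E 'InitialHeapSize|MaxHeapSize'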
You can easily check the heap memory allocation using JConsole. If the path to your JRE/JDK is set up correctly on the system, you should be able to start it with the command jconsole from anywhere.
For managing your heap memory allocation you can have a look here: http://javahowto.blogspot.com/2006/06/6-common-errors-in-setting-java-heap.html
The heap size used by Tomcat is defined in its configuration.
This is the place where you can both check and change it.
If you're unsure about where this configuration is saved, I'd suggest looking at the Tomcat documentation where this is explained together with all configuration values.
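If it isn't set anywhere yet, one common convention (assuming a standard catalina.sh startup rather than the Windows service wrapper) is a setenv.sh next to catalina.sh, which catalina.sh sources automatically:

# $CATALINA_BASE/bin/setenv.sh
export CATALINA_OPTS="-Xms1024m -Xmx4096m"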
If you need more information from the server but cannot log into it interactively (or don't have a GUI or JMX set up etc) you can include javamelody in your POM file/libs and it will create a page at host:8080//monitoring with all kinds of good information, including heap size, GC statistics and permgen size.
This is NOT a safe thing to leave running in a production environment - if you need it all the time at least lock it down!