Application is taking too much time to process JNI weak references during the remark phase of G1GC

The application is running into unexpected behavior due to these long GCs, and I am trying to bring the GC pause time below 500 ms.
Snippet from GC logs:
2020-03-17T16:50:04.505+0900: 1233.742: [GC remark
2020-03-17T16:50:04.539+0900: 1233.776: [GC ref-proc
2020-03-17T16:50:04.539+0900: 1233.776: [SoftReference, 0 refs, 0.0096740 secs]
2020-03-17T16:50:04.549+0900: 1233.786: [WeakReference, 3643 refs, 0.0743530 secs]
2020-03-17T16:50:04.623+0900: 1233.860: [FinalReference, 89 refs, 0.0100470 secs]
2020-03-17T16:50:04.633+0900: 1233.870: [PhantomReference, 194 refs, 9 refs, 0.0168580 secs]
2020-03-17T16:50:04.650+0900: 1233.887: [JNI Weak Reference, 0.9726330 secs], 1.0839410 secs], 1.1263670 secs]
The application is running on Java 7 with the following JVM options:
CommandLine flags: -XX:+AggressiveOpts -XX:GCLogFileSize=52428800 -XX:+HeapDumpOnOutOfMemoryError -XX:InitialHeapSize=4294967296
-XX:+ManagementServer -XX:MaxHeapSize=8589934592 -XX:MaxPermSize=805306368 -XX:MaxTenuringThreshold=15 -XX:NewRatio=5
-XX:NumberOfGCLogFiles=30 -XX:+OptimizeStringConcat -XX:PermSize=268435456 -XX:+PrintGC -XX:+PrintGCDateStamps
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintReferenceGC -XX:+UseCompressedOops
-XX:+UseFastAccessorMethods -XX:+UseG1GC -XX:+UseGCLogFileRotation -XX:+UseStringCache
Changing parameters such as NewRatio, MaxTenuringThreshold, and InitialHeapSize changes the frequency of these long GCs, but one or two still occur.
Is there any way to figure out what is contributing to the long processing time of the JNI weak references?
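Two standard HotSpot flags that are not in the flag set above can help narrow this down (a diagnostic sketch, not a guaranteed fix for the JNI weak phase):
-XX:+PrintGCApplicationStoppedTime (prints the total stop-the-world time per safepoint, so the remark pauses can be compared against every other pause)
-XX:+ParallelRefProcEnabled (uses multiple GC threads for java.lang.ref reference processing during remark; whether it also covers the JNI Weak Reference sub-phase on a given Java 7 build would need to be verified)
Beyond flags, the JNI Weak Reference time generally scales with the number of JNI weak global references held by native code, so reviewing which native libraries and agents the process loads is another angle.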

Related

High Number of CMS mark remark pauses even though Old gen is not half full

I am trying to understand the cause of the high number of CMS mark and remark pauses (other phases as well), averaging around 700 ms, even though the old gen is not even half full. Following are the GC configuration and stats from GCViewer.
-Xms3g
-Xmx3g
-XX:NewSize=1800m
-XX:MaxNewSize=1800m
-XX:MaxPermSize=256m
-XX:SurvivorRatio=8
-XX:+UseConcMarkSweepGC
-XX:+CMSClassUnloadingEnabled
Summary using GC Viewer: http://i.imgur.com/0IIbNUr.png
GC Log
152433.761: [GC [1 CMS-initial-mark: 284761K(1302528K)] 692884K(2961408K), 0.3367298 secs] [Times: user=0.33 sys=0.00, real=0.34 secs]
152434.098: [CMS-concurrent-mark-start]
152434.417: [CMS-concurrent-mark: 0.318/0.318 secs] [Times: user=1.38 sys=0.02, real=0.32 secs]
152434.417: [CMS-concurrent-preclean-start]
152434.426: [CMS-concurrent-preclean: 0.008/0.009 secs] [Times: user=0.02 sys=0.00, real=0.01 secs]
152434.426: [CMS-concurrent-abortable-preclean-start]
CMS: abort preclean due to time 152439.545: [CMS-concurrent-abortable-preclean: 4.157/5.119 secs] [Times: user=5.82 sys=0.20, real=5.12 secs]
152439.549: [GC[YG occupancy: 996751 K (1658880 K)]152439.550: [Rescan (parallel) , 0.5383841 secs]152440.088: [weak refs processing, 0.0070783 secs]152440.095: [class unloading, 0.0777632 secs]152440.173: [scrub symbol & string tables, 0.0416825 secs] [1 CMS-remark: 284761K(1302528K)] 1281512K(2961408K), 0.6771800 secs] [Times: user=3.35 sys=0.02, real=0.68 secs]
152440.227: [CMS-concurrent-sweep-start]
152440.613: [CMS-concurrent-sweep: 0.382/0.386 secs] [Times: user=0.39 sys=0.01, real=0.39 secs]
152440.613: [CMS-concurrent-reset-start]
152440.617: [CMS-concurrent-reset: 0.004/0.004 secs] [Times: user=0.01 sys=0.00, real=0.00 secs]
152441.719: [GC [1 CMS-initial-mark: 284757K(1302528K)] 1320877K(2961408K), 0.7720557 secs] [Times: user=0.78 sys=0.01, real=0.77 secs]
152442.492: [CMS-concurrent-mark-start]
CMS remark has to scan the young generation; since your young generation is so large, this takes some time. Depending on the Java version (which you did not specify!), you may have to enable parallel remarking (CMSParallelRemarkEnabled).
Enabling CMSScavengeBeforeRemark may also reduce the amount of memory that needs to be scanned during remarks.
And simply shrinking the new generation and taking the hit of a few more promotions that then get cleaned out by the concurrent old gen GC may work too.
I don't think incremental mode fixes anything here; it just alters the behavior of CMS so drastically that it masks your original issue.
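A sketch of the flags mentioned above (standard HotSpot options; availability and defaults depend on the exact Java version, so treat this as a starting point rather than a confirmed fix):
-XX:+CMSParallelRemarkEnabled (use multiple GC threads for the remark pause)
-XX:+CMSScavengeBeforeRemark (run a young collection immediately before remark so that less of the young generation has to be scanned)
Shrinking the young generation, for example lowering -XX:NewSize/-XX:MaxNewSize from the current 1800m, reduces what remark has to scan at the cost of more promotions into the old generation.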

Extremely long pause times for concurrent mode failure and promotion failure

I'm trying to troubleshoot extremely long pause times when using the CMS collector. I'm using Java 1.6.0u20 and planning an upgrade to 1.7.0u71 but we are stuck right now on this older version.
I'm wondering if anyone has any insight into these long "real" pauses.
The machine is a VM, but there are only 2 VMs on the ESX host and they are using less than the total number of cores and RAM available, so swapping shouldn't be an issue, but I'm not 100% sure. Any tips related to running a JVM on a VM would be appreciated as well.
Increasing the heap doesn't help - we started with 1gb on the throughput collector and went to 1.5, 2, 4, 5, 6, ... just last night I increased the heap size to 10gb. The problem always remains with larger or smaller new sizes, etc.
Here is a concurrent mode failure:
2014-11-13T09:36:12.805-0700: 34537.287: [GC 34537.288: [ParNew: 2836628K->2836628K(3058944K), 0.0000296 secs]34537.288: [CMS: 3532075K->1009314K(6989824K), 298.2601836 secs] 6368704K->1009314K(10048768K), [CMS Perm : 454750K->105512K(524288K)], 298.2603873 secs] [Times: user=5.89 sys=31.00, real=297.67 secs]
Total time for which application threads were stopped: 298.2647309 seconds
Here is a promotion failure:
2014-11-13T11:23:30.395-0700: 40974.985: [GC 40974.985: [ParNew (promotion failed)
Desired survivor size 223739904 bytes, new threshold 7 (max 7)
- age 1: 126097168 bytes, 126097168 total
: 3058944K->2972027K(3058944K), 1.6271403 secs]40976.612: [CMS: 6369748K->1735350K(6989824K), 26.6789774 secs] 9103364K->1735350K(10048768K), [CMS Perm : 129283K->105970K(524288K)], 28.3063205 secs] [Times: user=8.05 sys=2.08, real=28.38 secs]
Total time for which application threads were stopped: 28.3069287 seconds
Why are the "real" times so much longer than the cpu/kernel times??
[Times: user=5.89 sys=31.00, real=297.67 secs]
[Times: user=8.05 sys=2.08, real=28.38 secs]
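One thing worth ruling out given the VM setup described above (not from the original post): when real time dwarfs user+sys, the GC threads were mostly waiting rather than computing, which usually points at the OS or the hypervisor (swapping, ballooning, stolen CPU) rather than at the collector itself. A quick check while a pause is happening:
vmstat 5
Non-zero si/so (swap-in/swap-out) columns during a pause would indicate the guest is paging; the ESX-level memory ballooning and host swap counters for the VM are worth checking as well.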

Full GC duration difference in same JVM

Why does the amount of time to complete a full GC vary significantly in the same JVM?
We have an 8 GB heap on a Sun JVM.
Sometimes it is 13 seconds, and rarely (once a week) it is 530 seconds. These long full GCs are causing communication issues in our clustered environment. Is a difference in resource availability (such as CPU cycles not being available) when the full GC occurs causing this issue? Would changing our GC parameters help? Please find our GC parameters below.
example:
157858.158: [Full GC 157858.158: [Tenured: 5567918K->2718558K(5593088K), 13.4078854 secs] 7042362K->2718558K(7689728K), [Perm : 202405K->202405K(524288K)], 13.4079752 secs]
683185.700: [Full GC 683185.700: [Tenured: 5584345K->2461609K(5593088K), 536.8253698 secs] 7028566K->2461609K(7689728K), [Perm : 242259K->242259K(524288K)], 536.8254562 secs]
Environment:
We are running an application on SAP NetWeaver Server with a Sun JVM.
java -version
java version "1.4.2_19-rev"
Java(TM) Platform, Standard Edition for Business (build 1.4.2_19-rev-b0
Java HotSpot(TM) 64-Bit Server VM (build 1.4.2_19-rev-b07, mixed mode)
JVM parameters:
-Xmx8192M
-Xms8192M
-XX:PermSize=512M
-XX:MaxPermSize=512M
-XX:NewSize=2730M
-XX:MaxNewSize=2730M
-Djco.jarm=1
-XX:SurvivorRatio=2
-XX:TargetSurvivorRatio=90
-XX:MaxTenuringThreshold=10
-XX:SoftRefLRUPolicyMSPerMB=1
-XX:+DisableExplicitGC
-XX:+UseParNewGC
-XX:+UseTLAB
-XX:+HandlePromotionFailure
-XX:ParallelGCThreads=32
-verbose:gc
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-XX:+PrintTenuringDistribution
-Xss2M
-XX:CompilerThreadStackSize=4096
-Djava.awt.headless=true
-Dsun.io.useCanonCaches=false
-Djava.security.policy=./java.policy
-Djava.security.egd=file:/dev/urandom
-Dorg.omg.CORBA.ORBClass=com.sap.engine.system.ORBProxy
-Dorg.omg.CORBA.ORBSingletonClass=com.sap.engine.system.ORBSingletonProxy
-Djavax.rmi.CORBA.PortableRemoteObjectClass=com.sap.engine.system.PortableRemoteObjectProxy
-Dvr2m.meta.directory.class=com.vendavo.core.util.VenMetaDir
-Dvr2m.home=E:\Vendavo
-Djasper.reports.compile.class.path=E:\<>\jasperreports\v1.2.5\jasperreports -1.2.5.jar;E:\<dsaf>iReport\v1.2.4\iReport- 1.2.4.jar;E:\<dsaf>\iReport\v1.2.4\itext-1.3.1.jar;E:\<dsaf>\classes\jars\abc.jar;
-Dvr2m.cluster.mynodename=n1_server101
-XX:+HeapDumpOnCtrlBreak
-XX:+HeapDumpOnOutOfMemoryError
Below is the tenuring distribution around the same long full GC. Not sure of the way to send the complete GC logs.
Desired survivor size 644087808 bytes, new threshold 10 (max 10)
- age 1: 17299744 bytes, 17299744 total
- age 2: 4327344 bytes, 21627088 total
- age 3: 2152536 bytes, 23779624 total
- age 4: 1291104 bytes, 25070728 total
- age 5: 2277184 bytes, 27347912 total
- age 6: 8323128 bytes, 35671040 total
- age 9: 1859888 bytes, 37530928 total
- age 10: 2849376 bytes, 40380304 total
: 1465272K->39817K(2096640K), 0.0317708 secs] 7042426K->5619506K(7689728K), 0.0318546 secs]
682873.961: [GC 682873.961: [ParNew
Desired survivor size 644087808 bytes, new threshold 10 (max 10)
- age 1: 17629648 bytes, 17629648 total
- age 2: 1937560 bytes, 19567208 total
- age 3: 4322600 bytes, 23889808 total
- age 4: 2051048 bytes, 25940856 total
- age 5: 910360 bytes, 26851216 total
- age 6: 2237400 bytes, 29088616 total
- age 7: 8322776 bytes, 37411392 total
- age 10: 1859936 bytes, 39271328 total
: 1437577K->38693K(2096640K), 0.0363818 secs] 7017266K->5621199K(7689728K), 0.0364742 secs]
683032.408: [GC 683032.408: [ParNew
Desired survivor size 644087808 bytes, new threshold 10 (max 10)
- age 1: 27372472 bytes, 27372472 total
- age 2: 414904 bytes, 27787376 total
- age 3: 1828208 bytes, 29615584 total
- age 4: 4318504 bytes, 33934088 total
- age 5: 2051520 bytes, 35985608 total
- age 6: 760512 bytes, 36746120 total
- age 7: 2153392 bytes, 38899512 total
- age 8: 8322232 bytes, 47221744 total
: 1436453K->46460K(2096640K), 0.0555022 secs] 7018959K->5630806K(7689728K), 0.0555993 secs]
683185.700: [Full GC 683185.700: [Tenured: 5584345K->2461609K(5593088K), 536.8253698 secs] 7028566K->2461609K(7689728K), [Perm : 242259K->242259K(524288K)], 536.8254562 secs]
684682.569: [GC 684682.569: [ParNew
This looks like "promotion failure" syndrom.
GC has two cycles
young GC or minor GC, collects garbage in young space
full GC collects both young and old (if Mark Sweep Compact algorithm is chosen as in your configuration)
Young GC promotes some live objects to old space
if they have been tenured long enough
or in case if too much objects have survived and survivor subspace of young space cannot accommodate them
Normally JVM estimates amount of free space in old space it needs for young collection and starts full GC if free memory is low.
But if that estimate is wrong free space in old space can get exhausted in the middle of young collection.
In this case JVM have to rollback young collection and start full collection. During work of copy collection (used for young collection) object graph is not consistent and mark sweep compact cannot start until graph is fixed to consistent state.
Unfortunately this roll back could take order of magnitude longer than normal full GC.
This problem is typical for Concurrent Mark Sweep collector, but may affect Mark Sweep Compact too.
UPDATE
Looking at your GC logs and tenuring distribution, I would suggest reducing the young space (and thus the scale of a potential "rollback").
Judging from your logs, cutting the young space down by a factor of three may be reasonable: -XX:NewSize=900M -XX:MaxNewSize=900M
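A quick back-of-the-envelope check of the posted flags (my arithmetic, not part of the original answer): with -XX:NewSize=2730M and -XX:SurvivorRatio=2, each survivor space is about 2730M / 4 ≈ 682.5 MB, and -XX:TargetSurvivorRatio=90 gives a desired survivor size of roughly 0.9 × 682.5 MB ≈ 614 MB, which matches the "Desired survivor size 644087808 bytes" lines in the logs. The tenuring distributions above show only around 40 MB of surviving objects, so the young generation is far larger than the live set it carries, which is why shrinking it is a low-risk change.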
Upgrading the JVM would be another good option (the failure-prediction logic has likely been improved since the 1.4 days).
Below are a few links related to GC in the HotSpot JVM:
Understanding GC pauses in JVM, HotSpot's minor GC
Garbage collection in HotSpot JVM
How to tame java GC pauses? Surviving 16GiB heap and greater

how to find application suspension time from GC log files

I am new to garbage collection. Please help me get answers to the following questions, with clear explanations.
I want to find the application suspension time and suspension count from the GC log files for different JVMs:
SUN
JRockit
IBM
of different versions.
A. For SUN I am using the JVM options
-Xloggc:gc.log -verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC -XX:+UseParNewGC -XX:+PrintGCDetails -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode
B. For JRockit I am using the JVM options
-Xms100m -Xmx100m -Xns50m -Xss200k -Xgc:genconpar -Xverbose:gc -Xverboselog:gc_jrockit.log
My questions are:
Q1. What is the suspension time of an application, and why does it occur?
Q2. How can I tell from the logs that a suspension occurred?
Q3. Does the suspension time of an application equal the sum of the GC times?
E.g.:
2013-09-06T23:35:23.382-0700: [GC 150.505: [ParNew
Desired survivor size 50331648 bytes, new threshold 2 (max 15)
- age 1: 28731664 bytes, 28731664 total
- age 2: 28248376 bytes, 56980040 total
: 688128K->98304K(688128K), 0.2166700 secs] 697655K->163736K(10387456K), 0.2167900 secs] [Times: user=0.44 sys=0.04, real=0.22 secs]
2013-09-06T23:35:28.044-0700: 155.167: [GC 155.167: [ParNew
Desired survivor size 50331648 bytes, new threshold 15 (max 15)
- age 1: 22333512 bytes, 22333512 total
- age 2: 27468336 bytes, 49801848 total
: 688128K->71707K(688128K), 0.0737140 secs] 753560K->164731K(10387456K), 0.0738410 secs] [Times: user=0.30 sys=0.02, real=0.07 secs]
suspensionTime = 0.2167900 secs + 0.0738410 secs
i. If yes, do I need to add up the times for every GC that occurs?
ii. If no, please explain in detail, with example logs for the different collectors, which entries count as a suspension and which do not.
Q4. Can we say the GC times "0.2167900, 0.0738410" are equal to GC pauses, i.e. TotalGCPause = 0.2167900 + 0.0738410?
Q5. Can we calculate the suspension time using only the above flags, or do we need to include extra flags like -XX:+PrintGCApplicationStoppedTime for SUN?
Q6. I have seen a tool, dynaTrace, that calculates the suspension time and count for SUN without using the flag -XX:+PrintGCApplicationStoppedTime.
If you want the most precise information about the amount of time your application was stopped due to GC activity, you should go with -XX:+PrintGCApplicationStoppedTime.
-XX:+PrintGCApplicationStoppedTime enables the printing of the amount of time application threads have been stopped as the result of an internal HotSpot VM operation (GC and safe-point operations).
But for practical daily usage, the information provided by the GC logs is sufficient. You can use the approach described in your question 3 to determine the time spent in GC.
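As a concrete example, for the HotSpot/SUN case the extra flag is simply appended to the existing set (a sketch; the exact output wording can vary slightly between versions):
-Xloggc:gc.log -verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -XX:+PrintGCApplicationStoppedTime
The log then contains "Total time for which application threads were stopped: ..." lines, like the ones visible in the concurrent-mode-failure logs quoted earlier on this page, and summing those lines gives the suspension time and count directly instead of adding up individual GC pause times.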

Why did CMS (concurrent mode failure) happen here?

Operating System: Red Hat Linux 4.8
CPU Info: Intel(R) Xeon(R) CPU 5160 @ 3.00GHz x 16
JDK version: "1.5.0_16"
JVM Parameter:
-server
-Xmx1024m
-Xms1024m
-XX:NewSize=256m
-XX:MaxNewSize=256m
-XX:PermSize=128m
-XX:MaxPermSize=128m
-XX:SurvivorRatio=8
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-XX:+UseConcMarkSweepGC
-XX:+UseCMSCompactAtFullCollection
-XX:CMSFullGCsBeforeCompaction=5
-XX:CMSInitiatingOccupancyFraction=60
-XX:CMSMaxAbortablePrecleanTime=5
-XX:+CMSPermGenSweepingEnabled
-XX:+CMSClassUnloadingEnabled
-XX:MaxGCPauseMillis=1500
JVM GC Log:
945188.489: [GC 945188.489: [ParNew: 224543K->14968K(235968K), 0.0506680 secs] 552200K->344514K(1022400K), 0.0507700 secs]
945242.102: [GC 945242.102: [ParNew: 224760K->15374K(235968K), 0.0632410 secs] 554306K->346710K(1022400K), 0.0633450 secs]
945270.397: [GC 945270.402: [ParNew: 225163K->225163K(235968K), 0.0000230 secs]945270.402: [CMS (concurrent mode failure)[Unloading class sun.reflect.GeneratedSerializationConstructorAccessor70]
[Unloading class sun.reflect.GeneratedSerializationConstructorAccessor58]
[Unloading class sun.reflect.GeneratedSerializationConstructorAccessor38]
[Unloading class sun.reflect.GeneratedSerializationConstructorAccessor62]
[Unloading class sun.reflect.GeneratedSerializationConstructorAccessor54]
[Unloading class sun.reflect.GeneratedSerializationConstructorAccessor74]
[Unloading class sun.reflect.GeneratedSerializationConstructorAccessor53]
[Unloading class sun.reflect.GeneratedSerializationConstructorAccessor73]
[Unloading class sun.reflect.GeneratedSerializationConstructorAccessor64]
[Unloading class sun.reflect.GeneratedSerializationConstructorAccessor39]
[Unloading class sun.reflect.GeneratedSerializationConstructorAccessor59]
[Unloading class sun.reflect.GeneratedSerializationConstructorAccessor51]
[Unloading class sun.reflect.GeneratedSerializationConstructorAccessor42]
[Unloading class sun.reflect.GeneratedSerializationConstructorAccessor48]
[Unloading class sun.reflect.GeneratedSerializationConstructorAccessor76]
[Unloading class sun.reflect.GeneratedSerializationConstructorAccessor52]
[Unloading class sun.reflect.GeneratedSerializationConstructorAccessor57]
[Unloading class sun.reflect.GeneratedSerializationConstructorAccessor61]
[Unloading class sun.reflect.GeneratedSerializationConstructorAccessor56]
[Unloading class sun.reflect.GeneratedSerializationConstructorAccessor55]
[Unloading class sun.reflect.GeneratedSerializationConstructorAccessor63]
[Unloading class sun.reflect.GeneratedSerializationConstructorAccessor60]
[Unloading class sun.reflect.GeneratedSerializationConstructorAccessor40]
[Unloading class sun.reflect.GeneratedSerializationConstructorAccessor65]
: 331336K->71676K(786432K), 13.8120660 secs] 556499K->71676K(1022400K), 13.8122360 secs]
945289.234: [GC 945289.234: [ParNew: 209792K->2581K(235968K), 0.0065240 secs] 281468K->74257K(1022400K), 0.0066160 secs]
945324.703: [GC 945324.703: [ParNew: 212373K->3829K(235968K), 0.0081040 secs] 284049K->75506K(1022400K), 0.0082040 secs]
Why did the CMS concurrent mode failure happen here?
The old generation seems to be: 331336K->71676K(786432K)
Concurrent mode failure, as defined:
The message "concurrent mode failure" signifies that the concurrent collection of the tenured generation did not finish before the tenured generation became full.
In other words, the new generation is filling up too fast; it is overflowing into the tenured generation, but CMS could not clear out the tenured generation in the background.
In your case, at 945270.397,
ParNew: 225163K->225163K(235968K) shows the young generation was full and could not clear any objects at all.
Update
A log similar to yours is explained here; the explanation says:
This shows that a ParNew collection was requested, but was not attempted. (The reason is that it was estimated that there was not enough space in the CMS generation to promote the worst-case surviving young generation objects.) We name this failure a "full promotion guarantee failure". As a result, the concurrent mode of CMS is interrupted and a full GC is invoked.
So as I see it, a full GC over the roughly 225 MB of young objects as well as the roughly 331 MB in the tenured generation takes 13 seconds and gets the heap down to about 71 MB, but this was a result of the concurrent mode failure.
Suggestion
If you are really creating so many old objects, then you probably need a bigger heap.
Or try reducing -XX:CMSInitiatingOccupancyFraction from 60, but I don't think that will make much of a difference.
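A sketch of that tuning direction (example values only; flag availability should be checked against the 1.5.0_16 build in use):
-Xmx1536m -Xms1536m (more headroom for the tenured generation, if memory allows)
-XX:CMSInitiatingOccupancyFraction=50 -XX:+UseCMSInitiatingOccupancyOnly (start the concurrent cycle earlier and make the JVM honor exactly that threshold instead of its own heuristics)
Starting CMS earlier gives the background collection more time to finish before the tenured generation fills, which is precisely the race that "concurrent mode failure" reports.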
