Full GC, PSPermGen not cleaned - garbage-collection

My Java EE server had been working nicely, but then within 10 minutes full GCs started to occur more and more frequently, until finally the application was stopped almost all the time due to GC. PSPermGen was not released.
My JVM settings are:
set JAVA_OPTS=%JAVA_OPTS% -Xms4g -Xmx4g -XX:MaxPermSize=512m -XX:NewRatio=3
2012-09-05T14:03:10.394+0100: 94287.753: [Full GC [PSYoungGen: 843584K->0K(947200K)] [ParOldGen: 3077347K->3117145K(3145728K)] 3920931K->3117145K(4092928K) [PSPermGen: 181533K->181521K(186944K)], 10.9564398 secs] [Times: user=286.14 sys=0.19, real=10.97 secs]
Total time for which application threads were stopped: 10.9678339 seconds
Application time: 0.0023102 seconds
Total time for which application threads were stopped: 0.0088344 seconds
Application time: 0.3052301 seconds
Total time for which application threads were stopped: 0.0085634 seconds
Application time: 0.1125068 seconds
2012-09-05T14:03:21.798+0100: 94299.158: [Full GC [PSYoungGen: 842024K->22409K(947200K)] [ParOldGen: 3117145K->3145232K(3145728K)] 3959170K->3167641K(4092928K) [PSPermGen: 181521K->181521K(186752K)], 11.4649901 secs] [Times: user=372.58 sys=0.11, real=11.47 secs]
Total time for which application threads were stopped: 11.4757898 seconds
Application time: 0.0706553 seconds
Total time for which application threads were stopped: 0.0102510 seconds
Application time: 0.3951514 seconds
2012-09-05T14:03:33.748+0100: 94311.110: [Full GC [PSYoungGen: 843584K->34503K(947200K)] [ParOldGen: 3145232K->3141687K(3145728K)] 3988816K->3176190K(4092928K) [PSPermGen: 181521K->181521K(186112K)], 10.9699419 secs] [Times: user=369.43 sys=0.14, real=10.97 secs]
Total time for which application threads were stopped: 10.9806713 seconds
Application time: 0.0027075 seconds
Any clue what the reason could be? Is it a memory leak, or can the JVM be tuned better?

From the log, a few things are clear. Either the system genuinely needs so much memory that the tenured generation cannot be cleared, resulting in a constant consumption of about 3.1GB. Only you can answer that part.
Or there is a memory leak. A leak may or may not be the cause, because the old gen used space stays constant at around 3.145GB; with a memory leak this figure usually keeps increasing.
A longer log would probably help. If the old gen occupancy after each Full GC increases over time, then it is most likely a leak.
If it stays constant, then the application is genuinely short of the memory it needs.
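To compare the two cases over a longer log, one could pull the old gen occupancy after each Full GC out of the log and watch its trend. The following is only a minimal sketch: the function name is made up for illustration, and the regular expression assumes the Parallel GC line format shown in the question.

```python
import re

# Matches the ParOldGen part of a Parallel GC "Full GC" line:
#   [ParOldGen: <before>K-><after>K(<capacity>K)]
OLD_GEN = re.compile(r"\[ParOldGen: (\d+)K->(\d+)K\((\d+)K\)\]")


def old_gen_after_full_gc(log_lines):
    """Return (after_kb, capacity_kb) for every Full GC line found."""
    samples = []
    for line in log_lines:
        if "Full GC" not in line:
            continue
        match = OLD_GEN.search(line)
        if match:
            _before, after, capacity = map(int, match.groups())
            samples.append((after, capacity))
    return samples


# Shortened copies of the three Full GC lines from the question.
log = [
    "[Full GC [PSYoungGen: 843584K->0K(947200K)] [ParOldGen: 3077347K->3117145K(3145728K)] 3920931K->3117145K(4092928K)",
    "Total time for which application threads were stopped: 10.9678339 seconds",
    "[Full GC [PSYoungGen: 842024K->22409K(947200K)] [ParOldGen: 3117145K->3145232K(3145728K)] 3959170K->3167641K(4092928K)",
    "[Full GC [PSYoungGen: 843584K->34503K(947200K)] [ParOldGen: 3145232K->3141687K(3145728K)] 3988816K->3176190K(4092928K)",
]

samples = old_gen_after_full_gc(log)
for after, capacity in samples:
    print(f"old gen after Full GC: {after}K ({100 * after / capacity:.1f}% of capacity)")
```

In this short extract the old gen is already at its capacity after every Full GC, so the extract alone cannot show a trend; running the same extraction over the hours before the problem started would be more telling.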

Related

GHC per thread GC strategy

I have a Scotty api server which constructs an Elasticsearch query, fetches results from ES and renders the json.
Compared to other servers like Phoenix and Gin, I'm getting higher CPU utilization and throughput when serving ES responses with BloodHound, but Gin and Phoenix were orders of magnitude better than Scotty in memory efficiency.
Stats for Scotty
wrk -t30 -c100 -d30s "http://localhost:3000/filters?apid=1&hfa=true"
Running 30s test @ http://localhost:3000/filters?apid=1&hfa=true
  30 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   192.04ms  305.45ms   1.95s    83.06%
    Req/Sec   133.42    118.21     1.37k    75.54%
68669 requests in 30.10s, 19.97MB read
Requests/sec: 2281.51
Transfer/sec: 679.28KB
These stats are from my Mac with GHC 7.10.1 installed
Processor info 2.5GHz i5
Memory info 8GB 1600 MHz DDR3
I am quite impressed by lightweight thread based concurrency of GHC but memory efficiency remains a big concern.
Profiling memory usage yielded me the following stats
39,222,354,072 bytes allocated in the heap
277,239,312 bytes copied during GC
522,218,848 bytes maximum residency (14 sample(s))
761,408 bytes maximum slop
1124 MB total memory in use (0 MB lost due to fragmentation)
                                  Tot time (elapsed)  Avg pause  Max pause
Gen 0     373 colls,   373 par     2.802s    0.978s    0.0026s    0.0150s
Gen 1      14 colls,    13 par     0.534s    0.166s    0.0119s    0.0253s
Parallel GC work balance: 42.38% (serial 0%, perfect 100%)
TASKS: 18 (1 bound, 17 peak workers (17 total), using -N4)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.001s ( 0.008s elapsed)
MUT time 31.425s ( 36.161s elapsed)
GC time 3.337s ( 1.144s elapsed)
EXIT time 0.000s ( 0.001s elapsed)
Total time 34.765s ( 37.314s elapsed)
Alloc rate 1,248,117,604 bytes per MUT second
Productivity 90.4% of total user, 84.2% of total elapsed
gc_alloc_block_sync: 27215
whitehole_spin: 0
gen[0].sync: 8919
gen[1].sync: 30902
Phoenix never took more than 150 MB, while Gin took much lower memory.
I believe that GHC uses mark and sweep strategy for GC. I also believe it would have been better to use per thread incremental GC strategy akin to Erlang VM for better memory efficiency.
And by interpreting Don Stewart's answer to a related question there must be some way to change the GC strategy in GHC.
I also noted that the memory usage remained stable and pretty low when the concurrency level was low, so I think memory usage balloons only when concurrency is pretty high.
Any ideas or pointers on how to solve this issue?
http://community.haskell.org/~simonmar/papers/local-gc.pdf
This paper by Simon Marlow describes per-thread local heaps, and claims that this was implemented in GHC. It's dated 2011. I can't be sure if this is what the current version of GHC actually does (i.e., did this go into the release version of GHC, is it still the current status quo, etc.), but it seems my recollection wasn't completely made up.
I will also point out the section of the GHC manual that explains the settings you can twiddle to adjust the garbage collector:
https://downloads.haskell.org/~ghc/latest/docs/html/users_guide/runtime-control.html#rts-options-gc
In particular, by default GHC uses a 2-space collector, but adding the -c RTS option makes it use a slightly slower 1-space collector, which should eat less RAM. (I'm entirely unclear which generation(s) this information applies to.)
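A copying (2-space) collector needs room for both the space it copies from and the space it copies to, so total memory of roughly twice the live data is the expected ballpark. As a rough sanity check, and not something taken from the GHC documentation, the figures from the question's +RTS -s output can be compared like this. The function name is made up, and it assumes the "MB" in the RTS summary means 1024 * 1024 bytes.

```python
def memory_to_residency_ratio(total_memory_mb, max_residency_bytes):
    """Ratio of total memory in use to peak live data, from `+RTS -s` output."""
    # Assumption: the RTS reports "MB" as 1024 * 1024 bytes.
    max_residency_mb = max_residency_bytes / (1024 * 1024)
    return total_memory_mb / max_residency_mb


# Figures from the Scotty run in the question:
#   522,218,848 bytes maximum residency
#   1124 MB total memory in use
ratio = memory_to_residency_ratio(1124, 522_218_848)
print(f"total memory in use is about {ratio:.2f}x the maximum residency")
```

A ratio a little above 2 is consistent with the copying collector being the main cost here, which is why a 1-space collector might help. Whether it helps in practice would have to be measured.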
I get the impression Simon Marlow is the guy who does most of the RTS stuff (including the garbage collector), so if you can find him on IRC, he's the guy to ask if you want the direct truth...

Optimize Haskell GC usage

I am running a long-lived Haskell program that holds on to a lot of memory. Running with +RTS -N5 -s -A25M (size of my L3 cache) I see:
715,584,711,208 bytes allocated in the heap
390,936,909,408 bytes copied during GC
4,731,021,848 bytes maximum residency (745 sample(s))
76,081,048 bytes maximum slop
7146 MB total memory in use (0 MB lost due to fragmentation)
                                  Tot time (elapsed)  Avg pause  Max pause
Gen 0   24103 colls, 24103 par    240.99s   104.44s    0.0043s    0.0603s
Gen 1     745 colls,   744 par   2820.18s   619.27s    0.8312s    1.3200s
Parallel GC work balance: 50.36% (serial 0%, perfect 100%)
TASKS: 18 (1 bound, 17 peak workers (17 total), using -N5)
SPARKS: 1295 (1274 converted, 0 overflowed, 0 dud, 0 GC'd, 21 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 475.11s (454.19s elapsed)
GC time 3061.18s (723.71s elapsed)
EXIT time 0.27s ( 0.50s elapsed)
Total time 3536.57s (1178.41s elapsed)
Alloc rate 1,506,148,218 bytes per MUT second
Productivity 13.4% of total user, 40.3% of total elapsed
The GC time is 87% of the total run time! I am running this on a system with a massive amount of RAM, but when I set a high -H value the performance was worse.
It seems that both -H and -A control the size of gen 0, but what I would really like to do is increase the size of gen 1. What is the best way to do this?
As Carl suggested, you should check your code for space leaks. I'll assume that your program really requires a lot of memory for good reason.
The program spent 2820.18s doing major GC. You can lower this by reducing either memory usage (ruled out by the assumption above) or the number of major collections. You have a lot of free RAM, so you can try the -Ffactor option:
-Ffactor
[Default: 2] This option controls the amount of memory reserved for
the older generations (and in the case of a two space collector the size
of the allocation area) as a factor of the amount of live data. For
example, if there was 2M of live data in the oldest generation when we
last collected it, then by default we'll wait until it grows to 4M before
collecting it again.
In your case there is ~3G of live data. By default, a major GC will be triggered when the heap grows to 6G. With -F3 it will be triggered when the heap grows to 9G, saving you ~1000s of CPU time.
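The arithmetic behind those thresholds can be written out as a small sketch. The function name is hypothetical, the ~3G live data figure is the one used above, and the model is just the "factor times live data" rule from the quoted -F description.

```python
def major_gc_threshold_gb(live_data_gb, factor=2.0):
    """Old-generation size at which the next major GC is triggered,
    following the quoted -F description: factor * live data."""
    return live_data_gb * factor


live = 3.0  # ~3G of live data, the figure used in the answer

for factor in (2, 3):
    threshold = major_gc_threshold_gb(live, factor)
    # Headroom: how much the old generation can grow before the next major GC.
    headroom = threshold - live
    print(f"-F{factor}: major GC at ~{threshold:.0f}G, headroom ~{headroom:.0f}G")
```

Under this simplified model, going from -F2 to -F3 doubles the headroom between major collections, so fewer of them are needed for the same amount of promoted data. The actual saving depends on how the program allocates.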
If most of the live data is static (i.e. it never changes or changes only slowly), then you may be interested in a stable heap. The idea is to exclude long-lived data from major GC. This can be achieved, for example, using compact normal forms, though that is not merged into GHC yet.

Why is the Java G1 gc spending so much time scanning RS?

I'm currently evaluating the G1 garbage collector and how it performs for our application. Looking at the gc-log, I noticed a lot of collections have very long "Scan RS" phases:
7968.869: [GC pause (mixed), 10.27831700 secs]
[Parallel Time: 10080.8 ms]
(...)
[Scan RS (ms): 4030.4 4034.1 4032.0 4032.0
Avg: 4032.1, Min: 4030.4, Max: 4034.1, Diff: 3.7]
[Object Copy (ms): 6038.5 6033.3 6036.7 6037.1
Avg: 6036.4, Min: 6033.3, Max: 6038.5, Diff: 5.2]
(...)
[Eden: 19680M(19680M)->0B(20512M) Survivors: 2688M->2624M Heap:
75331M(111904M)->51633M(115744M)]
[Times: user=40.49 sys=0.02, real=10.28 secs]
All the removed log rows show runtimes in single-digit ms.
I think most of the time should be spent in copying, right? What could be the reason Scan RS takes so long? Any ideas on how to tweak the G1-settings?
The JVM was started with
-Xms40960M -Xmx128G -XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails -Xloggc:gc.log
Edit: Oh, I forgot... I'm using Java 7u25
Update:
I noticed two other weird things:
16187.740: [GC concurrent-mark-start]
16203.934: [GC pause (young), 2.89871800 secs]
(...)
16218.455: [GC pause (young), 4.61375100 secs]
(...)
16237.441: [GC pause (young), 4.46131800 secs]
(...)
16257.785: [GC pause (young), 4.73922600 secs]
(...)
16275.417: [GC pause (young), 3.87863400 secs]
(...)
16291.505: [GC pause (young), 3.72626400 secs]
(...)
16307.824: [GC pause (young), 3.72921700 secs]
(...)
16325.851: [GC pause (young), 3.91060700 secs]
(...)
16354.600: [GC pause (young), 5.61306000 secs]
(...)
16393.069: [GC pause (young), 17.50453200 secs]
(...)
16414.590: [GC concurrent-mark-end, 226.8497670 sec]
The concurrent GC run is continuing while parallel runs are being performed. I'm not sure if that's intended, but it kinda seems wrong to me. Admittedly, this is an extreme example, but I do see this behaviour all over my log.
Another thing is that my JVM process grew to 160g. Considering a heap-size of 128g, that's a rather large overhead. Is this to be expected, or is G1 leaking memory? Any ideas on how to find that out?
PS: I'm not really sure if I should've made new questions for the updates... if any of you think that this would be beneficial, tell me ;)
Update 2:
I guess the G1 really may be leaking memory: http://printfdebugger.tumblr.com/post/19142660766/how-i-learned-to-love-cms-and-had-my-heart-broken-by-g1
As this is a deal-breaker for now, I'm not going to spend more time on playing with this.
Things I haven't tried yet are configuring the region size (-XX:G1HeapRegionSize) and lowering the heap occupancy threshold (-XX:InitiatingHeapOccupancyPercent).
Let's see.
1 - First clues
It looks like your GC was configured to use 4 threads (or you have 4 vCPUs, but that is unlikely given the size of the heap). That is quite low for a 128GB heap; I was expecting more.
The GC events seem to happen at 25+ second intervals. However, the log extract you gave does not mention the number of regions that were processed.
=> By any chance, did you specify pause time goals to G1GC (-XX:MaxGCPauseMillis=N) ?
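One way to arrive at a thread estimate like the one above is to divide the CPU time by the wall-clock time of the pause (counting the four per-worker values on the Scan RS line gives the same number). This is only a rough sketch with a made-up function name; it assumes the [Times: ...] format shown in the question.

```python
import re

# Matches the timing summary at the end of a HotSpot GC log entry.
TIMES = re.compile(r"user=([\d.]+) sys=([\d.]+), real=([\d.]+) secs")


def effective_gc_parallelism(times_line):
    """Approximate number of busy GC threads as (user + sys) / real."""
    match = TIMES.search(times_line)
    if match is None:
        raise ValueError("no 'user=... sys=..., real=... secs' entry found")
    user, sys_time, real = map(float, match.groups())
    return (user + sys_time) / real


# Timing line of the mixed pause from the question.
g1_pause = "[Times: user=40.49 sys=0.02, real=10.28 secs]"
parallelism = effective_gc_parallelism(g1_pause)
print(f"roughly {parallelism:.1f} threads were busy during the pause")
```

A ratio close to the number of GC worker threads also suggests the workers were running the whole time rather than waiting, for example on swapped-out memory.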
2 - Long Scan RSet time
"Scan RSet" means the time the GC spent in scanning the Remembered Sets. Remembered Set of a region contains cards that correspond to the references pointing into that region. This phase scans those cards looking for the references pointing into all the regions of the collection set.
So here, we have one more question :
=> How many regions were processed during that particular collection (i.e. how big is the CSet)
3 - Long Object Copy time
The copy time, as the name suggests, is the time spent by each worker thread copying live objects from the regions in the Collection Set to other regions.
Such a long copy time can suggest that a lot of regions were processed, and that you may want to reduce that number. It could also suggest swapping, but this is very unlikely given your user/real values at the end of the log.
4 - Now what to do
You should check in the GC log the number of regions that were processed. Correlate this number with your region size and deduce the amount of memory that was scanned.
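The correlation between region count and memory is plain multiplication, sketched below. The helper names are made up, and the 32 MB region size is an assumption, since the log extract does not show it. The old regions that a mixed collection adds to the CSet are not visible in the extract either, so only the young part can be estimated here.

```python
def regions_for(size_mb, region_size_mb):
    """Number of regions needed to hold size_mb, rounded up."""
    return -(-size_mb // region_size_mb)


def cset_size_mb(region_count, region_size_mb):
    """Amount of memory covered by region_count regions."""
    return region_count * region_size_mb


REGION_SIZE_MB = 32  # assumption: not shown in the log extract

# Eden and survivor sizes before the mixed pause, from the question's log.
eden_regions = regions_for(19680, REGION_SIZE_MB)
survivor_regions = regions_for(2688, REGION_SIZE_MB)
young_regions = eden_regions + survivor_regions

print(f"eden: {eden_regions} regions, survivors: {survivor_regions} regions")
print(f"young part of the CSet: {cset_size_mb(young_regions, REGION_SIZE_MB)} MB")
```

With the real region count from the log, cset_size_mb gives the amount of memory the pause had to process.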
You can then set a smaller pause time goal (for instance, 500ms using -XX:MaxGCPauseMillis=500). This will:
- increase the number of GC events,
- reduce the amount of memory freed per GC cycle,
- reduce the STW pauses during YGC.
Hope that helps !
Sources :
https://blogs.oracle.com/poonam/entry/understanding_g1_gc_logs
http://www.oracle.com/webfolder/technetwork/tutorials/obe/java/G1GettingStarted/index.html
http://jvm-options.tech.xebia.fr/

Young generation is tiny despite NewRatio=2

At some point my application starts to create a lot of temporary arrays, this is expected behaviour, and I want to give a lot of space to Young Generation, so temporary arrays don't get promoted to the Tenured Generation.
JVM options:
java -Xmx240g -XX:+UseConcMarkSweepGC -XX:NewRatio=2 -XX:+PrintGCTimeStamps -verbose:gc -XX:+PrintGCDetails
At some point my GC log starts looking like this:
800.020: [GC 800.020: [ParNew: 559514K->257K(629120K), 0.1486790 secs] 95407039K->94847783K(158690816K), 0.1487540 secs] [Times: user=3.34 sys=0.05, real=0.15 secs]
800.202: [GC 800.202: [ParNew: 559489K->246K(629120K), 0.1665870 secs] 95407015K->94847777K(158690816K), 0.1666610 secs] [Times: user=3.79 sys=0.00, real=0.17 secs]
800.402: [GC 800.402: [ParNew: 559478K->257K(629120K), 0.1536610 secs] 95407009K->94847788K(158690816K), 0.1537290 secs] [Times: user=3.48 sys=0.02, real=0.15 secs]
I'm very confused by the fact that the Young Generation size is 629120K (=629M), while I expect it to be approx. 1/2 (because NewRatio=2) of the Tenured Generation size, which is 158690816K (=158G). The Tenured Generation size corresponds with NewRatio and Xms as expected, i.e. it is 2/3 of the total heap size.
JVM version:
java version "1.7.0_21"
Java(TM) SE Runtime Environment (build 1.7.0_21-b11)
Java HotSpot(TM) 64-Bit Server VM (build 23.21-b01, mixed mode)
Update:
I believe that at this point (800 sec of running time) program has peak temporary array usage.
If program does not go beyond 629M Young generation size, does it mean I should increase NewRatio? Let's assume I'm planning to give more workload to the program, and I expect that temporary arrays volume to permanent arrays volume ratio will be the same.
I ran the program with NewRatio=8 before, and gc log consists mostly of lines like these:
800.004: [GC 800.004: [ParNew: 186594K->242K(209664K), 0.1059450 secs] 95345881K->95159529K(126655428K), 0.1060110 secs] [Times: user=2.41 sys=0.00, real=0.10 secs]
800.122: [GC 800.122: [ParNew: 186610K->221K(209664K), 0.1073210 secs] 95345897K->95159522K(126655428K), 0.1073900 secs] [Times: user=2.37 sys=0.07, real=0.11 secs]
800.240: [GC 800.240: [ParNew: 186589K->221K(209664K), 0.1026210 secs] 95345890K->95159524K(126655428K), 0.1026870 secs] [Times: user=2.34 sys=0.00, real=0.10 secs]
800.357: [GC 800.357: [ParNew: 186589K->218K(209664K), 0.1043130 secs] 95345892K->95159527K(126655428K), 0.1043810 secs] [Times: user=2.30 sys=0.07, real=0.10 secs]
It makes me think that NewRatio currently has an impact on the Young generation size, but it shouldn't, because the Young generation is currently far below 1/9 of the heap size.
Update 2: It's a huge scientific calculation and my solution needs up to 240GB of memory. It is not a memory leak and algorithm is the best I was able to come up with.
Could you try running with AdaptiveSizePolicy disabled, using -XX:-UseAdaptiveSizePolicy?
This should allow you to preset the size of each heap area and will disable dynamic changes to their sizes at runtime.
First of all, check the question "What is the meaning of the -XX:NewRatio and -XX:OldSize JVM flags?":
The NewRatio is the ratio of young generation to old generation (e.g. value 2 means max size of old will be twice the max size of young, i.e. young can get up to 1/3 of the heap).
and you will see your young generation size is correct.
I also recommend checking the question "Are ratios between spaces/generations in the Java Heap constant?" and setting static sizes from the very beginning using
-Xms240g -Xmx240g -XX:NewSize=120g -XX:MaxNewSize=120g -XX:-UseAdaptiveSizePolicy
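The NewRatio arithmetic quoted above can be checked with a few lines. This is only a sketch of the upper bound implied by the ratio; the function name is made up, and the JVM is free to size the young generation well below that bound while the heap has not grown to its maximum.

```python
def max_young_size_gb(heap_gb, new_ratio):
    """Upper bound on the young generation implied by NewRatio:
    old = new_ratio * young, so young = heap / (new_ratio + 1)."""
    return heap_gb / (new_ratio + 1)


HEAP_GB = 240  # -Xmx240g from the question

for new_ratio in (2, 8):
    young = max_young_size_gb(HEAP_GB, new_ratio)
    old = HEAP_GB - young
    print(f"NewRatio={new_ratio}: young up to {young:.1f}G, old up to {old:.1f}G")

# ParNew capacity reported in the question's log for NewRatio=2, in GB (1024-based).
observed_young_gb = 629120 / (1024 * 1024)
print(f"observed ParNew capacity: {observed_young_gb:.2f}G")
```

The observed capacity is far below the bound in both runs, which is why fixing the sizes explicitly with NewSize and MaxNewSize, as in the flags above, is worth trying.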

Why does it take three Full GC to garbage collect permgen?

What are the reasons why it would take three successive "Full GC" before perm gen is garbage collected?
The first GC got the heap down from 2.4gb to 761mb, but fails to substantially GC perm gen, though it does appear to recover 6K.
We'll ignore the young generation collection.
The second Full GC does very little for the heap, as expected since the server was lightly loaded at the time. The odd thing is that it did NOTHING for perm gen.
The third Full GC finally takes perm gen from its max of 524mb down to 141mb.
Here's the unedited snippet from the GC logs:
2012-12-07T19:46:40.731-0600: [Full GC [CMS: 2474402K->761372K(2804992K), 4.6386780 secs] 2606228K->761372K(3111680K), [CMS Perm : 524286K->524280K(524288K)], 4.6387670 secs] [Times: user=4.68 sys=0.00, real=4.63 secs]
2012-12-07T19:46:45.374-0600: [GC [ParNew
Desired survivor size 17432576 bytes, new threshold 6 (max 6)
- age 1: 65976 bytes, 65976 total
: 1552K->8827K(306688K), 0.0199700 secs] 762925K->770200K(3111680K), 0.0200340 secs] [Times: user=0.08 sys=0.00, real=0.02 secs]
2012-12-07T19:46:45.395-0600: [Full GC [CMS: 761372K->752917K(2804992K), 3.7379280 secs] 770212K->752917K(3111680K), [CMS Perm : 524287K->524287K(524288K)], 3.7380180 secs] [Times: user=3.77 sys=0.00, real=3.74 secs]
2012-12-07T19:46:49.135-0600: [Full GC [CMS: 752917K->693347K(2804992K), 3.2845870 secs] 752917K->693347K(3111680K), [CMS Perm : 524287K->141759K(524288K)], 3.2846780 secs] [Times: user=3.32 sys=0.00, real=3.29 secs]
System info and GC flags:
Java 1.7.0_07, 64-Bit Server, Ubuntu 12.04
-Xms3g -Xmx3g -XX:PermSize=512m -XX:MaxPermSize=512m
-XX:+UseConcMarkSweepGC
EDIT: we have two app servers; the second one exhibited slightly different behavior: there were only two Full GC entries.
2012-12-07T20:36:31.097-0600: [Full GC [CMS: 2307424K->753901K(2804992K), 5.0783720 secs] 2394279K->753901K(3111680K), [CMS Perm : 524280K->524121K(524288K)], 5.0784780 secs] [Times: user=5.12 sys=0.00, real=5.08 secs]
2012-12-07T20:36:36.178-0600: [Full GC [CMS: 753901K->695698K(2804992K), 3.4488560 secs] 755266K->695698K(3111680K), [CMS Perm : 524121K->140568K(524288K)], 3.4489690 secs] [Times: user=3.48 sys=0.00, real=3.45 secs]
So it looks like the young generation collection was significant. Perhaps, in our particular setup, two successive Full GCs with no other GC (i.e. no young generation GC) in between are required to garbage collect perm gen. I've dug a lot, but I haven't found any discussion of this behavior.
It would not surprise me if the concurrent collections of the heap and perm gen did not influence each other, especially as the heap collection is already a complex operation by itself; that would explain why the perm gen is only collected the second time. I'm mainly guessing, though.
It might be interesting to get more details on what's actually collected in the perm gen (unloaded classes, strings?). -XX:+PrintGCDetails would help, and maybe -verbose:class.
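Before turning on more logging, the existing log can at least be summarised per Full GC. Here is a minimal sketch with a made-up function name, assuming the CMS log format shown in the question; it only reports how much perm gen each Full GC reclaimed, not what was reclaimed.

```python
import re

# Matches the perm gen part of a CMS "Full GC" line:
#   [CMS Perm : <before>K-><after>K(<capacity>K)]
PERM = re.compile(r"\[CMS Perm : (\d+)K->(\d+)K\((\d+)K\)\]")


def perm_gen_reclaimed(log_lines):
    """Return the KB reclaimed from perm gen by each Full GC line."""
    reclaimed = []
    for line in log_lines:
        match = PERM.search(line)
        if "Full GC" in line and match:
            before, after, _capacity = map(int, match.groups())
            reclaimed.append(before - after)
    return reclaimed


# The three Full GC lines from the first server, timestamps removed.
log = [
    "[Full GC [CMS: 2474402K->761372K(2804992K), 4.6386780 secs] 2606228K->761372K(3111680K), [CMS Perm : 524286K->524280K(524288K)], 4.6387670 secs]",
    "[Full GC [CMS: 761372K->752917K(2804992K), 3.7379280 secs] 770212K->752917K(3111680K), [CMS Perm : 524287K->524287K(524288K)], 3.7380180 secs]",
    "[Full GC [CMS: 752917K->693347K(2804992K), 3.2845870 secs] 752917K->693347K(3111680K), [CMS Perm : 524287K->141759K(524288K)], 3.2846780 secs]",
]

for number, kb in enumerate(perm_gen_reclaimed(log), start=1):
    print(f"Full GC {number}: perm gen reclaimed {kb}K")
```

Combined with -verbose:class output, the timestamps of the Full GC that finally frees perm gen could then be matched against the class unloading messages.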

Resources