Optimize Haskell GC usage - haskell

I am running a long-lived Haskell program that holds on to a lot of memory. Running with +RTS -N5 -s -A25M (size of my L3 cache) I see:
715,584,711,208 bytes allocated in the heap
390,936,909,408 bytes copied during GC
4,731,021,848 bytes maximum residency (745 sample(s))
76,081,048 bytes maximum slop
7146 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 24103 colls, 24103 par 240.99s 104.44s 0.0043s 0.0603s
Gen 1 745 colls, 744 par 2820.18s 619.27s 0.8312s 1.3200s
Parallel GC work balance: 50.36% (serial 0%, perfect 100%)
TASKS: 18 (1 bound, 17 peak workers (17 total), using -N5)
SPARKS: 1295 (1274 converted, 0 overflowed, 0 dud, 0 GC'd, 21 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 475.11s (454.19s elapsed)
GC time 3061.18s (723.71s elapsed)
EXIT time 0.27s ( 0.50s elapsed)
Total time 3536.57s (1178.41s elapsed)
Alloc rate 1,506,148,218 bytes per MUT second
Productivity 13.4% of total user, 40.3% of total elapsed
The GC time is 87% of the total run time! I am running this on a system with a massive amount of RAM, but when I set a high -H value the performance was worse.
It seems that both -H and -A control the size of gen 0, but what I would really like to do is increase the size of gen 1. What is the best way to do this?

As Carl suggested, you should check your code for space leaks. I'll assume that your program really does require a lot of memory for good reason.
The program spent 2820.18s doing major GC. You can lower that either by reducing memory usage (ruled out by the assumption above) or by reducing the number of major collections. You have a lot of free RAM, so you can try the -Ffactor option:
-Ffactor
[Default: 2] This option controls the amount of memory reserved for
the older generations (and in the case of a two space collector the size
of the allocation area) as a factor of the amount of live data. For
example, if there was 2M of live data in the oldest generation when we
last collected it, then by default we'll wait until it grows to 4M before
collecting it again.
In your case there is ~3G of live data. By default, a major GC is triggered when the heap grows to 6G. With -F3 it is triggered when the heap grows to 9G, saving you ~1000s of CPU time.
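The arithmetic behind those figures can be sketched as follows. This is only a restatement of the rule quoted from the manual (collect again once the old generation reaches factor × live data); the function names are made up for illustration and the real RTS applies further heuristics that are ignored here.

```haskell
-- Rough model of the -F rule quoted above. The names below are
-- hypothetical helpers, not part of any GHC API.

-- | Heap size (in GiB) at which the next major GC would be triggered,
-- given the -F factor and the live data found by the last major GC.
majorGcThreshold :: Double -> Double -> Double
majorGcThreshold factor liveGiB = factor * liveGiB

-- | Room (in GiB) the old generation has to grow between two major GCs.
headroom :: Double -> Double -> Double
headroom factor liveGiB = majorGcThreshold factor liveGiB - liveGiB

main :: IO ()
main = mapM_ report [2, 3, 4]
  where
    liveGiB = 3 -- the answer's estimate of live data
    report f =
      putStrLn $
        "-F" ++ show f
          ++ ": major GC at " ++ show (majorGcThreshold f liveGiB)
          ++ " GiB, headroom " ++ show (headroom f liveGiB) ++ " GiB"
```

Going from -F2 to -F3 doubles the headroom in this model, so, assuming data is promoted at a steady rate, roughly half as many major collections would be needed.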
If most of the live data is static (i.e. it never changes or changes only slowly), then a stable heap may be of interest. The idea is to exclude long-lived data from major GC. This can be achieved with, for example, compact normal forms, though that work had not been merged into GHC at the time of writing (compact regions later shipped with GHC 8.2 in the ghc-compact package).

Related

Understanding `-Rghc-timing` output

Basically the title - if I run stack ghc -- SomeFile.hs -Rghc-timing, and then receive the following output:
<<ghc: 32204977120 bytes, 418 GCs, 589465960/3693483304 avg/max bytes residency (15 samples), 8025M in use, 0.001 INIT (0.000 elapsed), 10.246 MUT (10.327 elapsed), 21.465 GC (23.670 elapsed) :ghc>>
Does that mean:
When compiling, GHC used a total of 8,025 MB of memory
When compiling, GHC took a total of around 33 seconds in wall-clock time to complete
Basically, I want to make sure that it's as I think it is - that GHC's compilation time and memory usage is being measured, rather than anything to do with the program at runtime.
Thank you!
Yes, this line shows the statistics for the GHC compiler itself, while it was compiling your code. It is unrelated to the "runtime" performance of the resulting compiled program. The meaning of the various statistics is documented in the manual under the -t option.
And yes, while compiling your program, GHC allocated a maximum of 8025MB of memory from the operating system and took about 34 seconds of wall-clock time (24 in the garbage collector and 10 in the mutator).
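To check the wall-clock figure mechanically, the elapsed and CPU times can be pulled out of the summary line and added up. The sketch below is a minimal illustration, assuming the line has exactly the whitespace-separated layout shown in the question; elapsedTimes and cpuTimes are hypothetical helper names, not part of GHC.

```haskell
import Data.List (isPrefixOf)
import Data.Maybe (mapMaybe)
import Text.Read (readMaybe)

-- | Every "(N elapsed)" figure in a <<ghc: ... :ghc>> line, in order.
elapsedTimes :: String -> [Double]
elapsedTimes line = mapMaybe pick (zip ws (drop 1 ws))
  where
    ws = words line
    pick (tok, next)
      | "(" `isPrefixOf` tok && "elapsed" `isPrefixOf` next = readMaybe (drop 1 tok)
      | otherwise = Nothing

-- | CPU-time figures: the number directly before INIT, MUT and GC.
cpuTimes :: String -> [Double]
cpuTimes line = mapMaybe pick (zip ws (drop 1 ws))
  where
    ws = words line
    pick (tok, next)
      | next `elem` ["INIT", "MUT", "GC"] = readMaybe tok
      | otherwise = Nothing

-- The line from the question, copied verbatim.
sample :: String
sample =
  "<<ghc: 32204977120 bytes, 418 GCs, 589465960/3693483304 avg/max bytes residency (15 samples), "
    ++ "8025M in use, 0.001 INIT (0.000 elapsed), 10.246 MUT (10.327 elapsed), "
    ++ "21.465 GC (23.670 elapsed) :ghc>>"

main :: IO ()
main = do
  putStrLn ("wall-clock seconds: " ++ show (sum (elapsedTimes sample)))
  putStrLn ("CPU seconds:        " ++ show (sum (cpuTimes sample)))
```

The elapsed sum is the number to quote for "how long did compilation take"; the CPU sum can differ from it in either direction depending on parallelism and waiting.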

How do I measure fragmentation in Hotspot's Metaspace?

I'm looking into debugging an "OutOfMemoryError: Metaspace" error in my application. Right before the OOME I see the following in the gc logs:
{Heap before GC invocations=6104 (full 39):
par new generation total 943744K, used 0K [...)
eden space 838912K, 0% used [...)
from space 104832K, 0% used [...)
to space 104832K, 0% used [...)
concurrent mark-sweep generation total 2097152K, used 624109K [...)
Metaspace used 352638K, capacity 487488K, committed 786432K, reserved 1775616K
class space used 36291K, capacity 40194K, committed 59988K, reserved 1048576K
2015-08-11T20:34:13.303+0000: 105892.129: [Full GC (Last ditch collection) 105892.129: [CMS: 624109K->623387K(2097152K), 3.4208207 secs] 624109K->623387K(3040896K), [Metaspace: 352638K->352638K(1775616K)], 3.4215100 secs] [Times: user=3.42 sys=0.00, real=3.42 secs]
Heap after GC invocations=6105 (full 40):
par new generation total 943744K, used 0K [...)
eden space 838912K, 0% used [...)
from space 104832K, 0% used [...)
to space 104832K, 0% used [...)
concurrent mark-sweep generation total 2097152K, used 623387K [...)
Metaspace used 352638K, capacity 487488K, committed 786432K, reserved 1775616K
class space used 36291K, capacity 40194K, committed 59988K, reserved 1048576K
}
From what I can see, Metaspace capacity isn't even nearing the committed size (in this case, -XX:MaxMetaspaceSize=768m). So I suspect fragmentation of Metaspace causing the allocator to fail to find a new chunk for the new classloader.
I'm aware of -XX:PrintFLSStatistics but that only covers CMS, not native memory.
So my question is: is there a debugging aid similar to PrintFLSStatistics available for Hotspot's native memory?
This is using Java HotSpot(TM) 64-Bit Server VM (25.45-b02) for linux-amd64 JRE (1.8.0_45-b14).
I've just looked into the implementation of the Metaspace in HotSpot. The Metaspace is divided into chunks and managed using a freelist. So fragmentation is indeed a possible reason for your problem.
I've also looked through the flags of the HotSpot VM (-XX:+UnlockDiagnosticVMOptions -XX:+PrintFlagsFinal); there is no such flag in the release version.
However, there is a dump() method in the Metaspace class which seems to be triggered by setting the -XX:+TraceMetadataChunkAllocation flag. There is also -XX:+TraceMetavirtualspaceAllocation, which sounds like it could be of interest to you. However, those are "develop" flags, meaning you need a debug build of the VM.
@loonytune's answer works just fine, but I want to provide a little bit more detail:
For context, "The Metaspace" is a collection of metaspaces, one per class loader. Each metaspace holds a list of VirtualSpace objects out of which Metachunks of different sizes are allocated. These chunks hold MetaBlocks, which are the real containers for metadata.
I need a debug JRE to run those flags, so following this tutorial I checked out the openjdk repository (I renamed the checkout to vm because the build scripts seem to take issue with the jdk8 folder name), ran
~/vm$ bash configure --enable-debug
~/vm$ DISABLE_HOTSPOT_OS_VERSION_CHECK=ok make all
and used the resulting vm/build/linux-x86_64-normal-server-fastdebug/images/j2re-image as my java runtime.
The log lines generated look like this:
VirtualSpaceNode::take_from_committed() not available 8192 words space @ 0x00007fee4cdb9350 128K, 94% used [0x00007fedf5e22000, 0x00007fedf5f13000, 0x00007fedf5f22000, 0x00007fedf6022000)
Which indicates that the current VirtualSpace is full and can't hold another chunk of the requested 8192 word size. This will cause this metaspace to switch to another VirtualSpace.
ChunkManager::chunk_freelist_allocate: 0x00007fee4c0c39f8 chunk 0x00007fee15397400 size 128 count 0 Free chunk total 7680 count 15
ChunkManager::chunk_freelist_allocate: 0x00007fee4c0c39f8 chunk 0x00007fedf6021000 size 512 count 14 Free chunk total 7168 count 14
This happens when a new Metachunk is allocated. In the first case it is 128 words big and uses up the list of small chunks. As you can see, the next request goes to the medium-sized chunks (of size 512) and leaves 14 chunks free in total. Once the free total reaches 0, a Full GC is needed to increase the total Metaspace size.
Note that specifying -verbose gets you even more output from the above two flags.

GHC per thread GC strategy

I have a Scotty api server which constructs an Elasticsearch query, fetches results from ES and renders the json.
Compared to other servers like Phoenix and Gin, I'm getting higher CPU utilization and throughput for serving ES responses by using BloodHound, but Gin and Phoenix were far better than Scotty in memory efficiency.
Stats for Scotty
wrk -t30 -c100 -d30s "http://localhost:3000/filters?apid=1&hfa=true"
Running 30s test @ http://localhost:3000/filters?apid=1&hfa=true
30 threads and 100 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 192.04ms 305.45ms 1.95s 83.06%
Req/Sec 133.42 118.21 1.37k 75.54%
68669 requests in 30.10s, 19.97MB read
Requests/sec: 2281.51
Transfer/sec: 679.28KB
These stats are from my Mac with GHC 7.10.1 installed
Processor info 2.5GHz i5
Memory info 8GB 1600 MHz DDR3
I am quite impressed by lightweight thread based concurrency of GHC but memory efficiency remains a big concern.
Profiling memory usage yielded me the following stats
39,222,354,072 bytes allocated in the heap
277,239,312 bytes copied during GC
522,218,848 bytes maximum residency (14 sample(s))
761,408 bytes maximum slop
1124 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 373 colls, 373 par 2.802s 0.978s 0.0026s 0.0150s
Gen 1 14 colls, 13 par 0.534s 0.166s 0.0119s 0.0253s
Parallel GC work balance: 42.38% (serial 0%, perfect 100%)
TASKS: 18 (1 bound, 17 peak workers (17 total), using -N4)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.001s ( 0.008s elapsed)
MUT time 31.425s ( 36.161s elapsed)
GC time 3.337s ( 1.144s elapsed)
EXIT time 0.000s ( 0.001s elapsed)
Total time 34.765s ( 37.314s elapsed)
Alloc rate 1,248,117,604 bytes per MUT second
Productivity 90.4% of total user, 84.2% of total elapsed
gc_alloc_block_sync: 27215
whitehole_spin: 0
gen[0].sync: 8919
gen[1].sync: 30902
Phoenix never took more than 150 MB, while Gin used far less memory still.
I believe that GHC uses a mark-and-sweep strategy for GC. I also believe it would have been better to use a per-thread incremental GC strategy akin to the Erlang VM's for better memory efficiency.
Judging from Don Stewart's answer to a related question, there must be some way to change the GC strategy in GHC.
I also noted that the memory usage remained stable and pretty low when the concurrency level was low, so I think memory usage balloons only when concurrency is pretty high.
Any ideas or pointers for solving this issue?
http://community.haskell.org/~simonmar/papers/local-gc.pdf
This paper by Simon Marlow describes per-thread local heaps, and claims that this was implemented in GHC. It's dated 2011. I can't be sure if this is what the current version of GHC actually does (i.e., did this go into the release version of GHC, is it still the current status quo, etc.), but it seems my recollection wasn't completely made up.
I will also point out the section of the GHC manual that explains the settings you can twiddle to adjust the garbage collector:
https://downloads.haskell.org/~ghc/latest/docs/html/users_guide/runtime-control.html#rts-options-gc
In particular, by default GHC uses a 2-space collector, but adding the -c RTS option makes it use a slightly slower 1-space collector, which should eat less RAM. (According to the user's guide, -c selects a compacting algorithm for the oldest generation.)
I get the impression Simon Marlow is the guy who does most of the RTS stuff (including the garbage collector), so if you can find him on IRC, he's the guy to ask if you want the direct truth...

Understanding output of GHC's +RTS -t -RTS option

I'm benchmarking the memory consumption of a Haskell program compiled with GHC. In order to do so, I run the program with the following command line arguments: +RTS -t -RTS. Here's an example output:
<<ghc: 86319295256 bytes, 160722 GCs, 53963869/75978648 avg/max bytes residency (386 samples), 191M in use, 0.00 INIT (0.00 elapsed), 152.69 MUT (152.62 elapsed), 58.85 GC (58.82 elapsed) :ghc>>.
According to the ghc manual, the output shows:
The total number of bytes allocated by the program over the whole run.
The total number of garbage collections performed.
The average and maximum "residency", which is the amount of live data in bytes. The runtime can only determine the amount of live data during a major GC, which is why the number of samples corresponds to the number of major GCs (and is usually relatively small).
The peak memory the RTS has allocated from the OS.
The amount of CPU time and elapsed wall clock time while initialising the runtime system (INIT), running the program itself (MUT, the mutator), and garbage collecting (GC).
Applied to my example, it means that my program shuffles 82320 MiB (bytes divided by 1024^2) around, performs 160722 garbage collections, has a 51MiB/72MiB average/maximum memory residency, allocates at most 191M memory in RAM and so on ...
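The unit conversions in that reading can be reproduced with a few lines. This is a minimal sketch; toMiB is a hypothetical helper, and treating the "191M in use" figure as MiB is an assumption about how the RTS prints it.

```haskell
-- | Convert a byte count to MiB (bytes divided by 1024^2).
toMiB :: Integer -> Double
toMiB bytes = fromIntegral bytes / (1024 * 1024)

main :: IO ()
main = do
  -- Figures copied from the +RTS -t line in the question.
  let allocated = 86319295256
      avgResidency = 53963869
      maxResidency = 75978648
      inUseMiB = 191 :: Double -- assumed to already be in MiB
  putStrLn ("allocated:         " ++ show (toMiB allocated) ++ " MiB")
  putStrLn ("average residency: " ++ show (toMiB avgResidency) ++ " MiB")
  putStrLn ("maximum residency: " ++ show (toMiB maxResidency) ++ " MiB")
  -- Memory obtained from the OS that is not live data at the peak.
  putStrLn ("in use minus max residency: " ++ show (inUseMiB - toMiB maxResidency) ++ " MiB")
```

The last line is the "remaining space" the question asks about: memory the RTS holds beyond the live data.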
Now I want to know, what »The average and maximum "residency", which is the amount of live data in bytes« is compared to »The peak memory the RTS has allocated from the OS«? And also: What uses the remaining space of roughly 120M?
I was pointed here for more information, but that does not state clearly, what I want to know. Another source (5.4.4 second item) hints that the 120M memory is used for garbage collection. But that is too vague – I need a quotable information source.
So please, is there anyone who could answer my questions with good sources as proofs?
Kind regards!
The "resident" size is how much live Haskell data you have. The amount of memory actually allocated from the OS may be higher.
The RTS allocates memory in "blocks". If your program needs 7.3 blocks of RAM, the RTS has to allocate 8 blocks, 0.7 of which is empty space.
The default garbage collection algorithm is a 2-space collector. That is, when space A fills up, it allocates space B (which is totally empty) and copies all the live data out of space A and into space B, then deallocates space A. That means that, for a while, you're using 2x as much RAM as is actually necessary. (I believe there's a switch somewhere to use a 1-space algorithm which is slower but uses less RAM.)
There is also some overhead for managing threads (especially if you have lots), and there might be a few other things.
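The same two numbers (live data versus memory taken from the OS) can also be read from inside a running program through GHC.Stats. The following is a minimal sketch, assuming base 4.10 or later (GHC 8.2+) and a binary built with -rtsopts and started with +RTS -T -RTS; nonLiveBytes is a hypothetical helper, and the two peaks it compares need not have occurred at the same moment.

```haskell
import Data.Word (Word64)
import GHC.Stats (RTSStats (..), getRTSStats, getRTSStatsEnabled)
import System.Mem (performMajorGC)

-- | Memory obtained from the OS beyond the peak live data (never negative).
nonLiveBytes :: Word64 -> Word64 -> Word64
nonLiveBytes inUse live = inUse - min inUse live

main :: IO ()
main = do
  enabled <- getRTSStatsEnabled
  if not enabled
    then putStrLn "RTS statistics are disabled; rerun with +RTS -T -RTS"
    else do
      -- Do some allocation, then force a major GC so that residency
      -- has been sampled at least once.
      print (sum [1 .. 1000000 :: Int])
      performMajorGC
      stats <- getRTSStats
      let live = max_live_bytes stats
          inUse = max_mem_in_use_bytes stats
      putStrLn ("max live bytes (residency):   " ++ show live)
      putStrLn ("max bytes in use (from OS):   " ++ show inUse)
      putStrLn ("difference:                   " ++ show (nonLiveBytes inUse live))
      putStrLn ("major GCs (residency samples): " ++ show (major_gcs stats))
```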
I don't know how much you already know about GC technology, but you can try reading these:
http://research.microsoft.com/en-us/um/people/simonpj/papers/parallel-gc/par-gc-ismm08.pdf
http://www.mm-net.org.uk/workshop190404/GHC%27s_Garbage_Collector.ppt

How to measure sequential and parallel runtimes of Haskell program

I am measuring the Haskell program from this question to produce the following summary table of runtimes and speedups, which I can then plot in a graph.
#Cores  Runtimes  Speedups
                  Absolute  Relative
Seq     ?         ..        ..
1       3.712     ..        ..
2       1.646     ..        ..
First question
While the runtimes on 1 and 2 cores are taken by compiling the program with the -threaded flag on ([3] and [4] below), I am not sure which time to take for the sequential one ([1] or [2] below):
should it be the time obtained by compiling without the -threaded flag, or
that obtained with the flag on but then NOT specifying any number of cores i.e. with no -Nx
Compiling without -threaded flag
$ ghc --make -O2 test.hs
[1] $ time ./test ## number of core = 1
102334155
real 0m4.194s
user 0m0.015s
sys 0m0.046s
Compiling with -threaded flag
$ ghc --make -O2 test.hs -threaded -rtsopts
[2] $ time ./test ## number of core = not sure?
102334155
real 0m3.547s
user 0m0.000s
sys 0m0.078s
[3] $ time ./test +RTS -N1 ## number of core = 1
102334155
real 0m3.712s
user 0m0.016s
sys 0m0.046s
[4] $ time ./test +RTS -N2 ## number of core = 2
102334155
real 0m1.646s
user 0m0.016s
sys 0m0.046s
Second question
As can be seen from above, I am using the time command to measure the runtimes. I am taking the 'real' time. But if I run the program with the -sstderr flag on, I get more detailed information:
$ ghc --make -O2 test.hs -rtsopts
$ ./test +RTS -sstderr
102334155
862,804 bytes allocated in the heap
2,432 bytes copied during GC
26,204 bytes maximum residency (1 sample(s))
19,716 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Generation 0: 1 collections, 0 parallel, 0.00s, 0.00s elapsed
Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed
INIT time 0.00s ( 0.00s elapsed)
MUT time 3.57s ( 3.62s elapsed)
GC time 0.00s ( 0.00s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 3.57s ( 3.62s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 241,517 bytes per MUT second
Productivity 100.0% of total user, 98.6% of total elapsed
I believe that the -sstderr provides a more accurate time which I should use instead of the time command. Am I correct? Also, which of the 'Total time' (3.57s or 3.62s) should I use?
And finally, any general advice/good practice while taking measurements like this? I am aware that there are some packages that allow us to benchmark our program, but I am mainly interested in taking the measurements manually (or using a script to do that for me).
Also: the runtimes are the median of running the program 3 times.
I would use -N1 for the single-core time. I believe that also constrains the GC to use one core (which seems fitting for the benchmark, I think?), but others may know more.
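Once a sequential baseline is chosen, the two speedup columns are simple ratios. The sketch below assumes the usual definitions, which the question does not spell out: absolute speedup divides the sequential runtime by the runtime on p cores, relative speedup divides the -N1 runtime by the runtime on p cores. Using measurement [1] (the build without -threaded) as the sequential baseline is likewise only an assumption made for illustration.

```haskell
-- | Speedup of a run taking `t` seconds against a baseline of `base` seconds.
speedup :: Double -> Double -> Double
speedup base t = base / t

main :: IO ()
main = do
  let tSeq = 4.194 -- [1]: built without -threaded (assumed sequential baseline)
      t1 = 3.712   -- [3]: -threaded, +RTS -N1
      t2 = 1.646   -- [4]: -threaded, +RTS -N2
      row label t =
        putStrLn $
          label ++ "  runtime " ++ show t
            ++ "  absolute " ++ show (speedup tSeq t)
            ++ "  relative " ++ show (speedup t1 t)
  row "1 core " t1
  row "2 cores" t2
```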
As for your second question, the answer to benchmarking in Haskell is nearly always to use criterion. Criterion will allow you to time one run of the program, and you can then wrap it in a script which runs the program with -N1, -N2, etc. Taking the median of 3 runs is okay as a very quick and rough indicator, but if you want to rely on the results then you'll need a lot more runs than that. Criterion runs your code enough and performs the appropriate statistics to give you a sensible average time, as well as confidence intervals and standard deviation (and it tries to correct for how busy your machine is). I know you asked about best practice for doing it yourself, but Criterion already embodies a lot of it: use clock time, benchmark a lot, and as you realised, don't just take a simple mean of the results.
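For the manual route the question asks about, a small driver can time repeated runs and report the median. This is a rough sketch rather than a substitute for Criterion: it assumes base 4.11 or later for GHC.Clock and the process package, the executable path and run count are placeholders, and each measurement includes process start-up time.

```haskell
import Control.Monad (replicateM)
import Data.List (sort)
import GHC.Clock (getMonotonicTime)
import System.Process (callProcess)

-- | Wall-clock seconds taken by one run of an external command.
timeRun :: FilePath -> [String] -> IO Double
timeRun exe args = do
  start <- getMonotonicTime
  callProcess exe args
  end <- getMonotonicTime
  pure (end - start)

-- | Median of a non-empty list (mean of the two middle values when the
-- length is even).
median :: [Double] -> Double
median xs
  | null xs = error "median: empty list"
  | odd n = sorted !! mid
  | otherwise = (sorted !! (mid - 1) + sorted !! mid) / 2
  where
    sorted = sort xs
    n = length xs
    mid = n `div` 2

main :: IO ()
main = do
  -- "./test" and the count of 10 runs are placeholders.
  times <- replicateM 10 (timeRun "./test" ["+RTS", "-N2", "-RTS"])
  putStrLn ("median wall-clock seconds: " ++ show (median times))
```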
Criterion requires very little change to your program if you want to benchmark the whole thing. Add this:
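import Criterion.Main
main :: IO ()
-- Recent criterion versions (1.0 and later) need the IO action wrapped,
-- e.g. with whnfIO, to turn it into a Benchmarkable.
main = defaultMain [bench "My program" (whnfIO oldMain)]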
where oldMain is whatever your main function used to be.
