I have a Scotty API server which constructs an Elasticsearch query, fetches results from ES, and renders the JSON.
Compared to other servers like Phoenix and Gin, I'm getting higher CPU utilization and throughput serving ES responses using BloodHound, but Gin and Phoenix were orders of magnitude better than Scotty in memory efficiency.
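For reference, the handler has roughly this shape (a minimal sketch only: the endpoint and parameter names are illustrative, and plain http-client stands in here for BloodHound):

{-# LANGUAGE OverloadedStrings #-}
import Web.Scotty
import Network.HTTP.Client
import Control.Monad.IO.Class (liftIO)
import qualified Data.Text.Lazy as TL

main :: IO ()
main = do
  mgr <- newManager defaultManagerSettings
  scotty 3000 $
    get "/filters" $ do
      apid <- param "apid" :: ActionM TL.Text
      hfa  <- param "hfa"  :: ActionM TL.Text
      -- build the ES query from apid/hfa (details elided) and forward it
      req  <- liftIO $ parseUrl "http://localhost:9200/filters/_search"
      resp <- liftIO $ httpLbs req mgr
      setHeader "Content-Type" "application/json"
      raw (responseBody resp)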
Stats for Scotty
wrk -t30 -c100 -d30s "http://localhost:3000/filters?apid=1&hfa=true"
Running 30s test @ http://localhost:3000/filters?apid=1&hfa=true
30 threads and 100 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 192.04ms 305.45ms 1.95s 83.06%
Req/Sec 133.42 118.21 1.37k 75.54%
68669 requests in 30.10s, 19.97MB read
Requests/sec: 2281.51
Transfer/sec: 679.28KB
These stats are from my Mac with GHC 7.10.1 installed.
Processor: 2.5 GHz i5
Memory: 8 GB 1600 MHz DDR3
I am quite impressed by GHC's lightweight-thread-based concurrency, but memory efficiency remains a big concern.
Profiling memory usage yielded the following stats:
39,222,354,072 bytes allocated in the heap
277,239,312 bytes copied during GC
522,218,848 bytes maximum residency (14 sample(s))
761,408 bytes maximum slop
1124 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 373 colls, 373 par 2.802s 0.978s 0.0026s 0.0150s
Gen 1 14 colls, 13 par 0.534s 0.166s 0.0119s 0.0253s
Parallel GC work balance: 42.38% (serial 0%, perfect 100%)
TASKS: 18 (1 bound, 17 peak workers (17 total), using -N4)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.001s ( 0.008s elapsed)
MUT time 31.425s ( 36.161s elapsed)
GC time 3.337s ( 1.144s elapsed)
EXIT time 0.000s ( 0.001s elapsed)
Total time 34.765s ( 37.314s elapsed)
Alloc rate 1,248,117,604 bytes per MUT second
Productivity 90.4% of total user, 84.2% of total elapsed
gc_alloc_block_sync: 27215
whitehole_spin: 0
gen[0].sync: 8919
gen[1].sync: 30902
Phoenix never took more than 150 MB, while Gin used far less.
I believe that GHC uses a mark-and-sweep GC strategy. I also believe a per-thread incremental GC strategy, akin to the Erlang VM's, would be better for memory efficiency.
And, interpreting Don Stewart's answer to a related question, there must be some way to change the GC strategy in GHC.
I also noted that memory usage remained stable and quite low when the concurrency level was low, so I think memory usage balloons only when concurrency is fairly high.
Any ideas/pointers on how to solve this issue?
http://community.haskell.org/~simonmar/papers/local-gc.pdf
This paper by Simon Marlow describes per-thread local heaps, and claims that this was implemented in GHC. It's dated 2011. I can't be sure if this is what the current version of GHC actually does (i.e., did this go into the release version of GHC, is it still the current status quo, etc.), but it seems my recollection wasn't completely made up.
I will also point out the section of the GHC manual that explains the settings you can twiddle to adjust the garbage collector:
https://downloads.haskell.org/~ghc/latest/docs/html/users_guide/runtime-control.html#rts-options-gc
In particular, by default GHC uses a 2-space collector, but adding the -c RTS option makes it use a slightly slower 1-space collector, which should eat less RAM. (I'm entirely unclear which generation(s) this information applies to.)
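For example, assuming the binary was linked with -rtsopts, an invocation like this would enable it alongside the stats output (the binary name filters-server is illustrative):
./filters-server +RTS -N4 -c -s -RTS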
I get the impression Simon Marlow is the guy who does most of the RTS stuff (including the garbage collector), so if you can find him on IRC, he's the guy to ask if you want the direct truth...
Related
Basically the title: if I run stack ghc -- SomeFile.hs -Rghc-timing and then receive the following output:
<<ghc: 32204977120 bytes, 418 GCs, 589465960/3693483304 avg/max bytes residency (15 samples), 8025M in use, 0.001 INIT (0.000 elapsed), 10.246 MUT (10.327 elapsed), 21.465 GC (23.670 elapsed) :ghc>>
Does that mean:
When compiling, GHC used a total of 8,025 MB of memory
When compiling, GHC took a total of around 33 seconds in wall-clock time to complete
Basically, I want to make sure that it's as I think it is - that GHC's compilation time and memory usage is being measured, rather than anything to do with the program at runtime.
Thank you!
Yes, this line shows the statistics for the GHC compiler itself, while it was compiling your code. It is unrelated to the "runtime" performance of the resulting compiled program. The meaning of the various statistics is documented in the manual under the -t option, here.
And yes, while compiling your program, GHC allocated a maximum of 8025MB of memory from the operating system and took about 34 seconds of wall-clock time (24 in the garbage collector and 10 in the mutator).
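If you want the full breakdown rather than the one-line summary, you can pass RTS options to GHC itself, since the compiler is an ordinary GHC-compiled program; something like this should work (I haven't verified the exact stack incantation):
stack ghc -- SomeFile.hs +RTS -s -RTS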
I have a simple Node.js application (app A) on Windows that listens on a port and, as soon as it receives a request, posts to another server (app B) and records the response in MongoDB.
App A (single thread, no cluster implemented yet) processes around 35 requests per second (measured using locust.io). Below is the profiling information for app A. 97.8% of the time is taken by shared libraries, of which 93.5% is due to ntdll.dll. Is this normal, or a potential bottleneck that can be fixed?
[Summary]:
ticks total nonlib name
6023 2.0% 87.8% JavaScript
0 0.0% 0.0% C++
502 0.2% 7.3% GC
300434 97.8% Shared libraries
834 0.3% Unaccounted
[Shared libraries]:
ticks total nonlib name
287209 93.5% C:\windows\SYSTEM32\ntdll.dll
12907 4.2% C:\Program Files\nodejs\node.exe
144 0.0% C:\windows\system32\KERNELBASE.dll
133 0.0% C:\windows\system32\KERNEL32.DLL
25 0.0% C:\windows\system32\WS2_32.dll
15 0.0% C:\windows\system32\mswsock.dll
1 0.0% C:\windows\SYSTEM32\WINMM.dll
[Bottom up (heavy) profile]:
Note: percentage shows a share of a particular caller in the total
amount of its parent calls.
Callers occupying less than 2.0% are not shown.
ticks parent name
287209 93.5% C:\windows\SYSTEM32\ntdll.dll
6705 2.3% C:\Program Files\nodejs\node.exe
831 12.4% LazyCompile: <anonymous> C:\opt\acuity\proxy\nodejs\node_modules\mongoose\node_modules\mongodb-core\lib\topologies\server.js:786:54
826 99.4% LazyCompile: *Callbacks.emit
In a typical application (a mixture of CPU-bound and I/O-bound workloads), the more the process is able to run within the CPU slots allotted to it, the better: we would see more CPU consumption in user space, as opposed to kernel space.
In Node.js, I/O is deferred until it is ready to be actioned, and when it is, we can see increased activity in OS space. If the heavy usage of ntdll.dll is for carrying out I/O, I would say this is not a concern, but rather an indication of a well-performing system.
Do you have a heavy (bottom-up) profile showing the split-up inside ntdll.dll? If it points to Win32 APIs / helper functions that facilitate I/O, then I would say this is a good sign for your system.
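For reference, the V8 tick-profiler workflow that produces output like the one above is roughly this (app.js stands for your entry point; the log file name varies by Node version):
node --prof app.js
node --prof-process isolate-0x*-v8.log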
I am running a long-lived Haskell program that holds on to a lot of memory. Running with +RTS -N5 -s -A25M (size of my L3 cache) I see:
715,584,711,208 bytes allocated in the heap
390,936,909,408 bytes copied during GC
4,731,021,848 bytes maximum residency (745 sample(s))
76,081,048 bytes maximum slop
7146 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 24103 colls, 24103 par 240.99s 104.44s 0.0043s 0.0603s
Gen 1 745 colls, 744 par 2820.18s 619.27s 0.8312s 1.3200s
Parallel GC work balance: 50.36% (serial 0%, perfect 100%)
TASKS: 18 (1 bound, 17 peak workers (17 total), using -N5)
SPARKS: 1295 (1274 converted, 0 overflowed, 0 dud, 0 GC'd, 21 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 475.11s (454.19s elapsed)
GC time 3061.18s (723.71s elapsed)
EXIT time 0.27s ( 0.50s elapsed)
Total time 3536.57s (1178.41s elapsed)
Alloc rate 1,506,148,218 bytes per MUT second
Productivity 13.4% of total user, 40.3% of total elapsed
The GC time is 87% of the total run time! I am running this on a system with a massive amount of RAM, but when I set a high -H value, performance was worse.
It seems that both -H and -A control the size of gen 0, but what I would really like to do is increase the size of gen 1. What is the best way to do this?
As Carl suggested, you should check your code for space leaks. I'll assume that your program really does require a lot of memory for good reason.
The program spent 2820.18s doing major GC. You can lower this by reducing either memory usage (not the case, by the above assumption) or the number of major collections. You have a lot of free RAM, so you can try the -Ffactor option:
-Ffactor
[Default: 2] This option controls the amount of memory reserved for
the older generations (and in the case of a two space collector the size
of the allocation area) as a factor of the amount of live data. For
example, if there was 2M of live data in the oldest generation when we
last collected it, then by default we'll wait until it grows to 4M before
collecting it again.
In your case there is ~3G of live data. By default, a major GC will be triggered when the heap grows to 6G. With -F3 it will be triggered when the heap grows to 9G, saving you ~1000s of CPU time.
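That is, keeping the other flags from your original run (myprog is a stand-in for your binary, which needs to have been linked with -rtsopts):
./myprog +RTS -N5 -A25M -F3 -s -RTS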
If most of the live data is static (i.e. never changes or changes slowly), then you will be interested in a stable heap. The idea is to exclude long-lived data from major GC. It can be achieved e.g. using compact normal forms, though that is not merged into GHC yet.
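For what it's worth, the proposed API looks roughly like this (a sketch based on the in-progress ghc-compact work, so names may change; the lookup table here is made up):

import qualified Data.Map.Strict as M
import GHC.Compact (compact, getCompact)

main :: IO ()
main = do
  -- hypothetical long-lived, mostly-static data
  let table = M.fromList [(k, show k) | k <- [1 :: Int .. 1000000]]
  region <- compact table        -- deep-copies the table into a compact region
  let table' = getCompact region -- the major GC scans the region as one object
  print (M.lookup 123456 table')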
I'm benchmarking the memory consumption of a Haskell program compiled with GHC. To do so, I run the program with the following command-line arguments: +RTS -t -RTS. Here's an example output:
<<ghc: 86319295256 bytes, 160722 GCs, 53963869/75978648 avg/max bytes residency (386 samples), 191M in use, 0.00 INIT (0.00 elapsed), 152.69 MUT (152.62 elapsed), 58.85 GC (58.82 elapsed) :ghc>>.
According to the ghc manual, the output shows:
The total number of bytes allocated by the program over the whole run.
The total number of garbage collections performed.
The average and maximum "residency", which is the amount of live data in bytes. The runtime can only determine the amount of live data during a major GC, which is why the number of samples corresponds to the number of major GCs (and is usually relatively small).
The peak memory the RTS has allocated from the OS.
The amount of CPU time and elapsed wall clock time while initialising the runtime system (INIT), running the program itself (MUT, the mutator), and garbage collecting (GC).
Applied to my example, this means that my program allocates 82321 MiB over the whole run (bytes divided by 1024^2), performs 160722 garbage collections, has a 51 MiB / 72 MiB average/maximum memory residency, has at most 191M of memory allocated from the OS, and so on...
Now I want to know: what is »the average and maximum "residency", which is the amount of live data in bytes« compared to »the peak memory the RTS has allocated from the OS«? And also: what uses the remaining space of roughly 120M?
I was pointed here for more information, but that does not clearly state what I want to know. Another source (5.4.4, second item) hints that the 120M of memory is used for garbage collection. But that is too vague – I need a quotable source.
So please, is there anyone who could answer my questions with good sources as proof?
Kind regards!
The "resident" size is how much live Haskell data you have. The amount of memory actually allocated from the OS may be higher.
The RTS allocates memory in "blocks". If your program needs 7.3 blocks of RAM, the RTS has to allocate 8 blocks, 0.7 of which is empty space.
The default garbage collection algorithm is a 2-space collector. That is, when space A fills up, it allocates space B (which is totally empty) and copies all the live data out of space A and into space B, then deallocates space A. That means that, for a while, you're using 2x as much RAM as is actually necessary. (I believe there's a switch somewhere to use a 1-space algorithm which is slower but uses less RAM.)
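As a back-of-the-envelope check against the numbers in the question (assuming the copying collector roughly doubles peak residency): 72 MiB maximum residency × 2 ≈ 144 MiB, and block rounding plus GC bookkeeping plausibly accounts for much of the remaining gap up to the 191M figure.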
There is also some overhead for managing threads (especially if you have lots), and there might be a few other things.
I don't know how much you already know about GC technology, but you can try reading these:
http://research.microsoft.com/en-us/um/people/simonpj/papers/parallel-gc/par-gc-ismm08.pdf
http://www.mm-net.org.uk/workshop190404/GHC%27s_Garbage_Collector.ppt
Here is a picture of CPU monitoring provided by the VisualVM profiler. I'm confused because I can't understand what the percentages mean. As you can see, for CPU the number is 24.1. 24.1 of what? Overall CPU time? And for GC it's 21.8, the same question. What is 100% in both cases? Please clarify this data.
CPU usage is the total CPU usage of your system. The GC activity shows how much of the CPU is spent by the GC threads.
In your example, it looks like the GC was executing and thus contributing to the majority of the total CPU usage in the system. After the GC finished, there was no CPU used.
You should check whether this observation is consistent with your GC logs; the logs will show you the activity around the 19:50 mark.
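If GC logging isn't enabled yet, JDK-8-style flags such as these will produce such a log (the jar name is illustrative; newer JDKs use -Xlog:gc* instead):
java -Xloggc:gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps -jar yourapp.jar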