How to measure sequential and parallel runtimes of a Haskell program

I am taking measurements of the Haskell program from this question to produce the following table of runtimes and speedups so I can plot them in a graph.
#Cores   Runtime (s)   Speedup (absolute)   Speedup (relative)
Seq      ?             ..                   ..
1        3.712         ..                   ..
2        1.646         ..                   ..
First question
While the runtimes on 1 and 2 cores are taken by compiling the program with the -threaded flag on ([3] and [4] below), I am not sure which time to take for the sequential one ([1] or [2] below):
should it be the time obtained by compiling without the -threaded flag, or
that obtained with the flag on but then NOT specifying any number of cores i.e. with no -Nx
Compiling without -threaded flag
$ ghc --make -O2 test.hs
[1] $ time ./test ## number of cores = 1
102334155
real 0m4.194s
user 0m0.015s
sys 0m0.046s
Compiling with -threaded flag
$ ghc --make -O2 test.hs -threaded -rtsopts
[2] $ time ./test ## number of cores = not sure?
102334155
real 0m3.547s
user 0m0.000s
sys 0m0.078s
[3] $ time ./test +RTS -N1 ## number of cores = 1
102334155
real 0m3.712s
user 0m0.016s
sys 0m0.046s
[4] $ time ./test +RTS -N2 ## number of cores = 2
102334155
real 0m1.646s
user 0m0.016s
sys 0m0.046s
Second question
As can be seen from above, I am using the time command to measure the runtimes. I am taking the 'real' time. But if I run the program with the -sstderr flag on, I get more detailed information:
$ ghc --make -O2 test.hs -rtsopts
$ ./test +RTS -sstderr
102334155
862,804 bytes allocated in the heap
2,432 bytes copied during GC
26,204 bytes maximum residency (1 sample(s))
19,716 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Generation 0: 1 collections, 0 parallel, 0.00s, 0.00s elapsed
Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed
INIT time 0.00s ( 0.00s elapsed)
MUT time 3.57s ( 3.62s elapsed)
GC time 0.00s ( 0.00s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 3.57s ( 3.62s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 241,517 bytes per MUT second
Productivity 100.0% of total user, 98.6% of total elapsed
I believe that the -sstderr provides a more accurate time which I should use instead of the time command. Am I correct? Also, which of the 'Total time' (3.57s or 3.62s) should I use?
And finally, any general advice/good practice while taking measurements like this? I am aware that there are some packages that allow us to benchmark our program, but I am mainly interested in taking the measurements manually (or using a script to do that for me).
Also: the runtimes are the median of running the program 3 times.

I would use -N1 for the single-core time. I believe that also constrains the GC to use one core (which seems fitting for the benchmark, I think?), but others may know more.
As for your second question, the answer to benchmarking in Haskell is nearly always to use criterion. Criterion will allow you to time one run of the program, and you can then wrap it in a script which runs the program with -N1, -N2, etc. Taking the median of 3 runs is okay as a very quick and rough indicator, but if you want to rely on the results then you'll need a lot more runs than that. Criterion runs your code enough times and performs the appropriate statistics to give you a sensible average time, as well as confidence intervals and standard deviation (and it tries to correct for how busy your machine is). I know you asked about best practice for doing it yourself, but criterion already embodies a lot of it: use clock time, benchmark a lot, and, as you realised, don't just take a simple mean of the results.
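One way to do that wrapping, sketched in Haskell rather than shell (nothing here is from the original answer: ./test is the binary from the question, assumed to be built with -threaded -rtsopts, and the process package is assumed to be available):
import Control.Monad (forM_)
import System.Process (callProcess)

-- Run the benchmark binary once per core count and let criterion (or +RTS -s)
-- print its own statistics for each run.
main :: IO ()
main =
  forM_ [1 :: Int, 2, 4] $ \n -> do
    putStrLn $ "=== running with -N" ++ show n ++ " ==="
    callProcess "./test" ["+RTS", "-N" ++ show n, "-RTS"]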
Criterion requires very little change to your program if you want to benchmark the whole thing. Add this:
import Criterion.Main
main :: IO ()
main = defaultMain [bench "My program" (whnfIO oldMain)]
where oldMain is whatever your main function used to be.
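If you would rather benchmark just the core computation instead of the whole main, the same defaultMain pattern works with a pure function via nf; a sketch, where fib is a hypothetical stand-in for whatever your program computes:
import Criterion.Main

-- Hypothetical function under test; substitute your own.
fib :: Int -> Integer
fib n = go n 0 1
  where
    go 0 a _ = a
    go k a b = go (k - 1) b (a + b)

main :: IO ()
main = defaultMain
  [ bgroup "fib"
      [ bench "fib 1000"  (nf fib 1000)   -- nf forces the result to normal form
      , bench "fib 10000" (nf fib 10000)
      ]
  ]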

Related

Understanding `-Rghc-timing` output

Basically the title: if I run stack ghc -- SomeFile.hs -Rghc-timing and then receive the following output:
<<ghc: 32204977120 bytes, 418 GCs, 589465960/3693483304 avg/max bytes residency (15 samples), 8025M in use, 0.001 INIT (0.000 elapsed), 10.246 MUT (10.327 elapsed), 21.465 GC (23.670 elapsed) :ghc>>
Does that mean:
When compiling, GHC used a total of 8,025 MB of memory
When compiling, GHC took a total of around 33 seconds in wall-clock time to complete
Basically, I want to make sure that it's as I think it is - that GHC's compilation time and memory usage is being measured, rather than anything to do with the program at runtime.
Thank you!
Yes, this line shows the statistics for the GHC compiler itself, while it was compiling your code. It is unrelated to the "runtime" performance of the resulting compiled program. The meaning of the various statistics is documented in the manual under the -t option, here.
And yes, while compiling your program, GHC allocated a maximum of 8025 MB of memory from the operating system and took about 34 seconds of wall-clock time (24 in the garbage collector and 10 in the mutator).

GHC per thread GC strategy

I have a Scotty api server which constructs an Elasticsearch query, fetches results from ES and renders the json.
Compared to other servers like Phoenix and Gin, I'm getting higher CPU utilization and throughput serving ES responses via Bloodhound, but Gin and Phoenix were orders of magnitude better than Scotty in memory efficiency.
Stats for Scotty
wrk -t30 -c100 -d30s "http://localhost:3000/filters?apid=1&hfa=true"
Running 30s test @ http://localhost:3000/filters?apid=1&hfa=true
  30 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   192.04ms  305.45ms   1.95s    83.06%
    Req/Sec   133.42    118.21     1.37k    75.54%
  68669 requests in 30.10s, 19.97MB read
Requests/sec:   2281.51
Transfer/sec:    679.28KB
These stats are from my Mac with GHC 7.10.1 installed.
Processor: 2.5 GHz i5
Memory: 8 GB 1600 MHz DDR3
I am quite impressed by GHC's lightweight thread-based concurrency, but memory efficiency remains a big concern.
Profiling memory usage yielded the following stats:
39,222,354,072 bytes allocated in the heap
277,239,312 bytes copied during GC
522,218,848 bytes maximum residency (14 sample(s))
761,408 bytes maximum slop
1124 MB total memory in use (0 MB lost due to fragmentation)
                                   Tot time (elapsed)  Avg pause  Max pause
  Gen  0     373 colls,   373 par    2.802s   0.978s     0.0026s    0.0150s
  Gen  1      14 colls,    13 par    0.534s   0.166s     0.0119s    0.0253s
Parallel GC work balance: 42.38% (serial 0%, perfect 100%)
TASKS: 18 (1 bound, 17 peak workers (17 total), using -N4)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.001s ( 0.008s elapsed)
MUT time 31.425s ( 36.161s elapsed)
GC time 3.337s ( 1.144s elapsed)
EXIT time 0.000s ( 0.001s elapsed)
Total time 34.765s ( 37.314s elapsed)
Alloc rate 1,248,117,604 bytes per MUT second
Productivity 90.4% of total user, 84.2% of total elapsed
gc_alloc_block_sync: 27215
whitehole_spin: 0
gen[0].sync: 8919
gen[1].sync: 30902
Phoenix never took more than 150 MB, while Gin used even less memory.
I believe that GHC uses a mark-and-sweep strategy for GC. I also believe a per-thread incremental GC strategy akin to the Erlang VM's would have been better for memory efficiency.
And, going by Don Stewart's answer to a related question, there must be some way to change the GC strategy in GHC.
I also noted that memory usage remained stable and pretty low when the concurrency level was low, so I think memory usage balloons only when concurrency is quite high.
Any ideas/pointers to solve this issue?
http://community.haskell.org/~simonmar/papers/local-gc.pdf
This paper by Simon Marlow describes per-thread local heaps, and claims that this was implemented in GHC. It's dated 2011. I can't be sure if this is what the current version of GHC actually does (i.e., did this go into the release version of GHC, is it still the current status quo, etc.), but it seems my recollection wasn't completely made up.
I will also point out the section of the GHC manual that explains the settings you can twiddle to adjust the garbage collector:
https://downloads.haskell.org/~ghc/latest/docs/html/users_guide/runtime-control.html#rts-options-gc
In particular, by default GHC uses a copying (two-space) collector, but adding the -c RTS option makes it use a slightly slower compacting collector for the oldest generation, which should eat less RAM.
I get the impression Simon Marlow is the guy who does most of the RTS stuff (including the garbage collector), so if you can find him on IRC, he's the guy to ask if you want the direct truth...

Optimize Haskell GC usage

I am running a long-lived Haskell program that holds on to a lot of memory. Running with +RTS -N5 -s -A25M (size of my L3 cache) I see:
715,584,711,208 bytes allocated in the heap
390,936,909,408 bytes copied during GC
4,731,021,848 bytes maximum residency (745 sample(s))
76,081,048 bytes maximum slop
7146 MB total memory in use (0 MB lost due to fragmentation)
                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     24103 colls, 24103 par   240.99s  104.44s     0.0043s    0.0603s
  Gen  1       745 colls,   744 par  2820.18s  619.27s     0.8312s    1.3200s
Parallel GC work balance: 50.36% (serial 0%, perfect 100%)
TASKS: 18 (1 bound, 17 peak workers (17 total), using -N5)
SPARKS: 1295 (1274 converted, 0 overflowed, 0 dud, 0 GC'd, 21 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 475.11s (454.19s elapsed)
GC time 3061.18s (723.71s elapsed)
EXIT time 0.27s ( 0.50s elapsed)
Total time 3536.57s (1178.41s elapsed)
Alloc rate 1,506,148,218 bytes per MUT second
Productivity 13.4% of total user, 40.3% of total elapsed
The GC time is 87% of the total run time! I am running this on a system with a massive amount of RAM, but when I set a high -H value the performance was worse.
It seems that both -H and -A control the size of gen 0, but what I would really like to do is increase the size of gen 1. What is the best way to do this?
As Carl suggested, you should check your code for space leaks. I'll assume that your program really requires a lot of memory for a good reason.
The program spent 2820.18s doing major GC. You can lower that by reducing either memory usage (not the case, by the assumption above) or the number of major collections. You have a lot of free RAM, so you can try the -Ffactor option:
-Ffactor
[Default: 2] This option controls the amount of memory reserved for
the older generations (and in the case of a two space collector the size
of the allocation area) as a factor of the amount of live data. For
example, if there was 2M of live data in the oldest generation when we
last collected it, then by default we'll wait until it grows to 4M before
collecting it again.
In your case there is ~3G of live data. By default a major GC will be triggered when the heap grows to 6G. With -F3 it will be triggered when the heap grows to 9G, saving you ~1000s of CPU time.
If most of the live data is static (e.g. never changes or changes slowly), then you will be interested in a stable heap. The idea is to exclude long-living data from major GC. It can be achieved e.g. using compact normal forms, though it is not merged into GHC yet.
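For reference, compact regions did land later, in GHC 8.2, as the ghc-compact package (module GHC.Compact). A minimal sketch of moving a large static structure out of the reach of major GC; the Map here is just a hypothetical stand-in for your long-lived data:
import GHC.Compact (compact, getCompact)
import qualified Data.Map.Strict as Map

main :: IO ()
main = do
  -- Build the large, effectively static structure once ...
  let staticTable = Map.fromList [(i, show i) | i <- [1 :: Int .. 1000000]]
  -- ... then copy it into a compact region.  Data inside the region is not
  -- traced by subsequent collections, so major GCs no longer pay for it.
  region <- compact staticTable
  let table = getCompact region
  print (Map.lookup 12345 table)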

Understanding output of GHC's +RTS -t -RTS option

I'm benchmarking the memory consumption of a Haskell program compiled with GHC. In order to do so, I run the program with the following command line arguments: +RTS -t -RTS. Here's an example output:
<<ghc: 86319295256 bytes, 160722 GCs, 53963869/75978648 avg/max bytes residency (386 samples), 191M in use, 0.00 INIT (0.00 elapsed), 152.69 MUT (152.62 elapsed), 58.85 GC (58.82 elapsed) :ghc>>.
According to the ghc manual, the output shows:
The total number of bytes allocated by the program over the whole run.
The total number of garbage collections performed.
The average and maximum "residency", which is the amount of live data in bytes. The runtime can only determine the amount of live data during a major GC, which is why the number of samples corresponds to the number of major GCs (and is usually relatively small).
The peak memory the RTS has allocated from the OS.
The amount of CPU time and elapsed wall clock time while initialising the runtime system (INIT), running the program itself (MUT, the mutator), and garbage collecting (GC).
Applied to my example, it means that my program allocates 82321 MiB in total (bytes divided by 1024^2), performs 160722 garbage collections, has a 51 MiB / 72 MiB average/maximum residency, has at most 191M allocated from the OS, and so on ...
Now I want to know how »the average and maximum "residency", which is the amount of live data in bytes« relates to »the peak memory the RTS has allocated from the OS«. And also: what uses the remaining space of roughly 120M?
I was pointed here for more information, but that does not clearly state what I want to know. Another source (5.4.4, second item) hints that the 120M of memory is used for garbage collection. But that is too vague – I need a quotable information source.
So please, is there anyone who could answer my questions with good sources as proofs?
Kind regards!
The "resident" size is how much live Haskell data you have. The amount of memory actually allocated from the OS may be higher.
The RTS allocates memory in "blocks". If your program needs 7.3 blocks of RAM, the RTS has to allocate 8 blocks, 0.7 of which is empty space.
The default garbage collection algorithm is a 2-space (copying) collector. That is, when space A fills up, it allocates space B (which is totally empty) and copies all the live data out of space A and into space B, then deallocates space A. That means that, for a while, you're using 2x as much RAM as is actually necessary. (The -c RTS option switches the oldest generation to a compacting, in-place algorithm, which is slower but uses less RAM.)
There is also some overhead for managing threads (especially if you have lots), and there might be a few other things.
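If you want to see the two figures from inside the program rather than from the summary line, the RTS exposes them through GHC.Stats; a sketch, assuming GHC 8.2 or later and that the program is run with +RTS -T so the statistics are collected:
import GHC.Stats (RTSStats (..), getRTSStats, getRTSStatsEnabled)

main :: IO ()
main = do
  enabled <- getRTSStatsEnabled
  if not enabled
    then putStrLn "re-run with +RTS -T to enable RTS statistics"
    else do
      stats <- getRTSStats
      -- max_live_bytes is the "residency": live Haskell data at the last
      -- major GC.  max_mem_in_use_bytes is the "in use" figure: memory the
      -- RTS has actually taken from the OS (blocks, copying space, thread
      -- stacks, slop), which is why it is the larger of the two.
      putStrLn $ "max live bytes:       " ++ show (max_live_bytes stats)
      putStrLn $ "max mem in use bytes: " ++ show (max_mem_in_use_bytes stats)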
I don't know how much you already know about GC technology, but you can try reading these:
http://research.microsoft.com/en-us/um/people/simonpj/papers/parallel-gc/par-gc-ismm08.pdf
http://www.mm-net.org.uk/workshop190404/GHC%27s_Garbage_Collector.ppt

Is the UNIX `time` command accurate enough for benchmarks? [closed]

Let's say I wanted to benchmark two programs: foo.py and bar.py.
Are a couple thousand runs and the respective averages of time python foo.py and time python bar.py adequate for profiling and comparing their speed?
Edit: Additionally, if the execution of each program was sub-second (assume it wasn't for the above), would time still be okay to use?
time produces good enough times for benchmarks that run over one second; otherwise the time spent exec()ing the process may be large compared to its run time.
However, when benchmarking you should watch out for context switching. That is, another process may be using CPU thus contending for CPU with your benchmark and increasing its run time. To avoid contention with other processes you should run a benchmark like this:
sudo chrt -f 99 /usr/bin/time --verbose <benchmark>
Or
sudo chrt -f 99 perf stat -ddd <benchmark>
sudo chrt -f 99 runs your benchmark in FIFO real-time class with priority 99, which makes your process the top priority process and avoids context switching (you can change your /etc/security/limits.conf so that it doesn't require a privileged process to use real-time priorities).
It also makes time report all the available stats, including the number of context switches your benchmark incurred, which should normally be 0, otherwise you may like to rerun the benchmark.
perf stat -ddd is even more informative than /usr/bin/time and displays such information as instructions-per-cycle, branch and cache misses, etc.
And it is better to disable the CPU frequency scaling and boost, so that the CPU frequency stays constant during the benchmark to get consistent results.
Nowadays, imo, there is no reason to use time for benchmarking purposes. Use perf stat instead. It gives you much more useful information, and it can repeat the benchmarking process any given number of times and do statistics on the results, i.e. calculate variance and mean value. This is much more reliable and just as simple to use as time:
perf stat -r 10 -d <your app and arguments>
The -r 10 will run your app 10 times and do statistics over it. -d outputs some more data, such as cache misses.
So while time might be reliable enough for long-running applications, it definitely is not as reliable as perf stat. Use that instead.
Addendum: If you really want to keep using time, at least don't use the bash builtin; use the real command in verbose mode:
/usr/bin/time -v <some command with arguments>
The output is then e.g.:
Command being timed: "ls"
User time (seconds): 0.00
System time (seconds): 0.00
Percent of CPU this job got: 0%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.00
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 1968
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 93
Voluntary context switches: 1
Involuntary context switches: 2
Swaps: 0
File system inputs: 8
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
Especially note how this is capable of measuring the peak RSS, which is often enough if you want to compare the effect of a patch on the peak memory consumption. I.e. use that value to compare before/after and if there is a significant decrease in the RSS peak, then you did something right.
Yes, time is accurate enough. And you'll only need to run your programs a dozen times or so (provided each run lasts more than a second, or at least a significant fraction of a second, i.e. more than 200 milliseconds). Of course, the file system will be hot (i.e. small files will already be cached in RAM) for most runs (except the first), so take that into account.
The reason you want the timed run to last at least a few tenths of a second is the accuracy and granularity of the time measurement. Don't expect better than a hundredth of a second of accuracy (you need a special kernel option to get it down to one millisecond).
From inside the application, you could use clock, clock_gettime, gettimeofday, getrusage or times (they surely have Python equivalents).
Don't forget to read the time(7) man page.
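For completeness, the same idea from inside a Haskell program, since that is the context of the rest of this page; a sketch using getCPUTime and getMonotonicTime from base (GHC 8.4 or later is assumed for GHC.Clock), with the workload being a hypothetical placeholder:
import System.CPUTime (getCPUTime)
import GHC.Clock (getMonotonicTime)

-- Hypothetical workload; replace with the real computation.
work :: Int
work = sum [1 .. 10000000]

main :: IO ()
main = do
  wall0 <- getMonotonicTime          -- wall-clock seconds, as a Double
  cpu0  <- getCPUTime                -- CPU time in picoseconds, as an Integer
  print work
  wall1 <- getMonotonicTime
  cpu1  <- getCPUTime
  putStrLn $ "wall: " ++ show (wall1 - wall0) ++ " s"
  putStrLn $ "cpu:  " ++ show (fromIntegral (cpu1 - cpu0) / 1e12 :: Double) ++ " s"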
Yes. The time command gives both elapsed time as well as consumed CPU. The latter is probably what you should focus on, unless you're doing a lot of I/O. If elapsed time is important, make sure the system doesn't have other significant activity while running your test.
