What is GHC doing when run with the -N (parallel) flag?

I've written the following test application:
main = print $ sum $ map (read . show) [1 .. 10^7]
When I run it with and without the -N flag, I get the following results:
$ ghc -O2 -threaded -rtsopts -o test test.hs
...
$ time ./test +RTS -s
50000005000000
real 0m12.411s
user 0m12.367s
sys 0m0.040s
$ time ./test +RTS -s -N12
50000005000000
real 0m22.702s
user 1m14.904s
sys 0m12.608s
It seems like GHC decides to honour the -N12 flag by distributing the calculation over different cores (with very bad results), but I can't find any documentation about how exactly it decides to do so when the code doesn't contain explicit instructions. Is there some documentation that I'm missing?
I have GHC version 8.6.5.
Garbage collection statistics:
$ ghc -O2 -threaded -rtsopts -o test test.hs
...
$ time ./test +RTS -s
50000005000000
54,332,520,712 bytes allocated in the heap
53,571,832 bytes copied during GC
56,824 bytes maximum residency (2 sample(s))
29,192 bytes maximum slop
0 MB total memory in use (0 MB lost due to fragmentation)
                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     52088 colls,     0 par    0.154s   0.150s     0.0000s    0.0001s
  Gen  1         2 colls,     0 par    0.000s   0.000s     0.0001s    0.0001s
TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.000s ( 0.000s elapsed)
MUT time 12.250s ( 12.249s elapsed)
GC time 0.155s ( 0.151s elapsed)
EXIT time 0.001s ( 0.010s elapsed)
Total time 12.406s ( 12.410s elapsed)
Alloc rate 4,435,169,879 bytes per MUT second
Productivity 98.7% of total user, 98.7% of total elapsed
real 0m12.411s
user 0m12.367s
sys 0m0.040s
$ time ./test +RTS -s -N12
50000005000000
54,332,687,840 bytes allocated in the heap
214,001,248 bytes copied during GC
183,360 bytes maximum residency (2 sample(s))
146,696 bytes maximum slop
0 MB total memory in use (0 MB lost due to fragmentation)
                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     52088 colls, 52088 par   20.219s   0.975s     0.0000s    0.0001s
  Gen  1         2 colls,     1 par    0.001s   0.000s     0.0001s    0.0002s
Parallel GC work balance: 0.15% (serial 0%, perfect 100%)
TASKS: 26 (1 bound, 25 peak workers (25 total), using -N12)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.007s ( 0.003s elapsed)
MUT time 67.281s ( 21.720s elapsed)
GC time 20.221s ( 0.975s elapsed)
EXIT time 0.002s ( 0.003s elapsed)
Total time 87.511s ( 22.701s elapsed)
Alloc rate 807,549,654 bytes per MUT second
Productivity 76.9% of total user, 95.7% of total elapsed
real 0m22.702s
user 1m14.904s
sys 0m12.608s

GHC doesn't automatically parallelize code. (The runtime system itself may take advantage of multiple threads for initialization, giving a small, fixed performance improvement at startup, but that's the only thing that happens "automatically".)
So your code is running sequentially. As noted in some of the comments, the bizarre performance problem is probably the parallel garbage collector.
Parallel GC has been observed to perform very poorly on certain workloads when running on large numbers of capabilities. See issue #14981, for example. Of course, that issue talks about 32- or 64-core machines.
However, I have observed very poor performance with the default runtime GC settings even on relatively small numbers of cores. For example, using your test case and GHC version, I get similarly poor performance on my 8-core, 16-thread Intel i9-9980HK laptop with -N12 or more. Here is a comparison of a 1-capability and a 12-capability run. Compile it:
$ cat test.hs
main = print $ sum $ map (read . show) [1 .. 10^7]
$ stack ghc --resolver=lts-14.27 -- -fforce-recomp -O2 -threaded -rtsopts -o test test.hs
[1 of 1] Compiling Main ( test.hs, test.o )
Linking test ...
Run it on one capability:
$ time ./test +RTS -N1
50000005000000
real 0m10.803s
user 0m10.770s
sys 0m0.037s
Run it on twelve:
$ time ./test +RTS -N12
50000005000000
real 0m15.655s
user 0m52.103s
sys 0m7.019s
To see that parallel GC is at fault, we can switch to sequential GC:
$ time ./test +RTS -N12 -qg
50000005000000
real 0m11.175s
user 0m11.066s
sys 0m0.120s
I had assumed that this poor parallel GC performance was related to exceeding the number of physical cores, but your experience suggests it can happen at around 12 capabilities even without exceeding the physical core count.
Instead of disabling parallel GC entirely, it is worth experimenting with the runtime system's garbage collector controls. The effects can be startling. For example, increasing the generation-0 allocation area from its default of 1m to 4m results in a big improvement:
$ time ./test +RTS -N12 -A4m
50000005000000
real 0m12.485s
user 0m25.219s
sys 0m2.053s
and going even higher to 16m eliminates the performance problem entirely, at least for this simple test case.
$ time ./test +RTS -N12 -A16m
50000005000000
real 0m11.481s
user 0m11.775s
sys 0m0.126s
I get similar improvements by switching to compaction for the oldest generation:
$ time ./test +RTS -N12 -c
50000005000000
real 0m11.125s
user 0m11.043s
sys 0m0.089s
Of course, running the parallel GC with a reduced number of GC threads may also help:
$ time ./test +RTS -N12 -qn4
50000005000000
real 0m14.092s
user 0m18.961s
sys 0m3.031s
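These knobs can also be combined. For example, a larger allocation area together with fewer GC threads is worth trying; I haven't timed this particular combination, so treat it as a suggestion to experiment with rather than a measured result:
$ ./test +RTS -N12 -A16m -qn4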

Related

Is there an equivalent for time([some command]) for checking peak memory usage of a bash command?

I want to figure out how much memory a specific command uses but I'm not sure how to check for the peak memory of the command. Is there anything like the time([command]) usage but for memory?
Basically, I'm going to have to run an interactive queue using SLURM, then test a command for a program I need to use for a single sample, see how much memory was used, then submit a bunch of jobs using that info.
Yes: GNU time is a program that runs a command and reports, among other things, its maximum resident set size. It is not to be confused with the time shell keyword, which only shows real/user/sys times. On Arch Linux you have to install it with pacman -S time; it's a separate package.
$ command time -v echo 1
1
Command being timed: "echo 1"
User time (seconds): 0.00
System time (seconds): 0.00
Percent of CPU this job got: 0%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.00
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 1968
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 90
Voluntary context switches: 1
Involuntary context switches: 1
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
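If you only want the peak-memory figure rather than the full report, GNU time's -f option with the %M conversion (maximum resident set size, in kilobytes) prints just that; ./myprog below is a placeholder for whatever command you want to measure:
$ command time -f "Max RSS: %M kB" ./myprog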
Note:
$ type time
time is a shell keyword
$ time -V
bash: -V: command not found
real 0m0.002s
user 0m0.000s
sys 0m0.002s
$ command time -V
time (GNU Time) 1.9
$ /bin/time -V
time (GNU Time) 1.9
$ /usr/bin/time -V
time (GNU Time) 1.9

Why is sort (the Linux command) slow when reading from a pipe?

I have a large text file of ~8GB on which I need to do some simple filtering and then sort all the rows. I am on a 28-core machine with an SSD and 128GB of RAM. I have tried:
Method 1
awk '...' myBigFile | sort --parallel=56 > myBigFile.sorted
Method 2
awk '...' myBigFile > myBigFile.tmp
sort --parallel 56 myBigFile.tmp > myBigFile.sorted
Surprisingly, Method 1 takes 11.5 minutes while Method 2 takes less than 2 minutes (0.75 + 1). Why is sorting so slow when reading from a pipe? Is it not parallelized?
EDIT
The awk step and myBigFile are not important; the experiment is reproducible simply with seq 1 10000000 | sort --parallel 56 (thanks to @Sergei Kurenkov), and I also observed a roughly six-fold speed improvement with the un-piped version on my machine.
When reading from a pipe, sort assumes that the file is small, and for small files parallelism isn't helpful. To get sort to utilize parallelism you need to tell it to allocate a large main memory buffer using -S. In this case the data file is about 8GB, so you can use -S8G. However, at least on your system with 128GB of main memory, method 2 may still be faster.
This is because sort in Method 2 can tell from the size of the file that it is huge, and it can seek within the file (neither of which is possible with a pipe). Further, since you have so much memory relative to these file sizes, the data in myBigFile.tmp need not be written to disk before awk exits, so sort can read the file from the page cache rather than from disk. So the principal difference between Method 1 and Method 2 (on a machine like yours with lots of memory) is that in Method 2 sort knows the file is huge and can easily divide up the work (possibly using seek, but I haven't looked at the implementation), whereas in Method 1 sort has to discover that the data is huge, and it cannot parallelize reading the input since it can't seek in a pipe.
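For example, something along these lines (the awk filter is elided here, as in the question) should let the piped sort use a large in-memory buffer and actually parallelize:
awk '...' myBigFile | sort --parallel=56 -S8G > myBigFile.sorted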
I think sort does not use threads when reading from a pipe.
I used the command below for your first case. It shows that sort uses only 1 CPU even though it is told to use 4; atop also shows that there is only one thread in sort:
/usr/bin/time -v bash -c "seq 1 1000000 | sort --parallel 4 > bf.txt"
I used this command for your second case. It shows that sort uses 2 CPUs; atop also shows that there are four threads in sort:
/usr/bin/time -v bash -c "seq 1 1000000 > tmp.bf.txt && sort --parallel 4 tmp.bf.txt > bf.txt"
In your first scenario sort is an I/O-bound task: it makes lots of read syscalls on stdin. In your second scenario sort uses mmap to read the file, which avoids making it I/O-bound.
Below are results for the first and second scenarios:
$ /usr/bin/time -v bash -c "seq 1 10000000 | sort --parallel 4 > bf.txt"
Command being timed: "bash -c seq 1 10000000 | sort --parallel 4 > bf.txt"
User time (seconds): 35.85
System time (seconds): 0.84
Percent of CPU this job got: 98%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:37.43
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 9320
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 2899
Voluntary context switches: 1920
Involuntary context switches: 1323
Swaps: 0
File system inputs: 0
File system outputs: 459136
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
$ /usr/bin/time -v bash -c "seq 1 10000000 > tmp.bf.txt && sort --parallel 4 tmp.bf.txt > bf.txt"
Command being timed: "bash -c seq 1 10000000 > tmp.bf.txt && sort --parallel 4 tmp.bf.txt > bf.txt"
User time (seconds): 43.03
System time (seconds): 0.85
Percent of CPU this job got: 175%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:24.97
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 1018004
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 2445
Voluntary context switches: 299
Involuntary context switches: 4387
Swaps: 0
File system inputs: 0
File system outputs: 308160
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
You make more system calls if you use the pipe.
seq 1000000 | strace sort --parallel=56 2>&1 >/dev/null | grep read | wc -l
2059
Without the pipe the file is mapped into memory.
seq 1000000 > input
strace sort --parallel=56 input 2>&1 >/dev/null | grep read | wc -l
33
Kernel calls are in most cases the bottleneck. That is the reason sendfile was invented.

Calculate the average of several "time" commands in Linux

I'm profiling a program on Linux using the "time" command. The problem is that its output is not very statistically relevant, as it only runs the program once. Is there a tool or a way to get an average of several "time" runs, possibly together with statistical information such as the deviation?
Here is a script I wrote to do something similar to what you are looking for. It runs the provided command 10 times, logging the real, user CPU and system CPU times to a file and echoing them after each command's output. It then uses awk to provide averages of each of the 3 columns in the file, but does not (yet) include standard deviation; a sketch that adds it follows the script.
#!/bin/bash
# Run the given command 10 times, appending real/user/sys times to a temp file.
rm -f /tmp/mtime.$$
for x in {1..10}
do
  /usr/bin/time -f "real %e user %U sys %S" -a -o /tmp/mtime.$$ "$@"
  tail -1 /tmp/mtime.$$
done
# Average the three columns.
awk '{ et += $2; ut += $4; st += $6; count++ } END { printf "Average:\nreal %.3f user %.3f sys %.3f\n", et/count, ut/count, st/count }' /tmp/mtime.$$
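For the missing standard deviation, one possible extension is a second awk pass over the same /tmp/mtime.$$ log after the averaging step; a sketch (population standard deviation):
awk '{ r[NR]=$2; u[NR]=$4; s[NR]=$6; et+=$2; ut+=$4; st+=$6 }
     END { n = NR
           for (i = 1; i <= n; i++) { dr += (r[i]-et/n)^2; du += (u[i]-ut/n)^2; ds += (s[i]-st/n)^2 }
           printf "Std dev:\nreal %.3f user %.3f sys %.3f\n", sqrt(dr/n), sqrt(du/n), sqrt(ds/n) }' /tmp/mtime.$$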
Use hyperfine.
For example:
hyperfine 'sleep 0.3'
This will run the command sleep 0.3 multiple times and then output something like this:
hyperfine 'sleep 0.3'
Benchmark #1: sleep 0.3
Time (mean ± σ): 306.7 ms ± 3.0 ms [User: 2.8 ms, System: 3.5 ms]
Range (min … max): 301.0 ms … 310.9 ms 10 runs
perf stat does this for you with the -r (--repeat=<n>) option, reporting the average and variance.
e.g. using a short awk loop to simulate some work, short enough that CPU frequency ramp-up and other startup overhead might be a factor (see "Idiomatic way of performance evaluation?"), although it seems my CPU ramped up to 3.9 GHz pretty quickly, averaging 3.82 GHz.
$ perf stat -r5 awk 'BEGIN{for(i=0;i<1000000;i++){}}'
Performance counter stats for 'awk BEGIN{for(i=0;i<1000000;i++){}}' (5 runs):
37.90 msec task-clock # 0.968 CPUs utilized ( +- 2.18% )
1 context-switches # 31.662 /sec ( +-100.00% )
0 cpu-migrations # 0.000 /sec
181 page-faults # 4.776 K/sec ( +- 0.39% )
144,802,875 cycles # 3.821 GHz ( +- 0.23% )
343,697,186 instructions # 2.37 insn per cycle ( +- 0.05% )
93,854,279 branches # 2.476 G/sec ( +- 0.04% )
29,245 branch-misses # 0.03% of all branches ( +- 12.79% )
0.03917 +- 0.00182 seconds time elapsed ( +- 4.63% )
(The +- percentages at the right of each line show the run-to-run variation.)
You can use taskset -c3 perf stat ... to pin the task to a specific core (#3 in that case) if you have a single-threaded task and want to minimize context-switches.
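For instance (./myprog is a placeholder for your own single-threaded program):
$ taskset -c 3 perf stat -r5 ./myprog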
By default, perf stat uses hardware perf counters to profile things like instructions, core clock cycles (not the same thing as time on modern CPUs), and branch misses. This has pretty low overhead, especially with the counters in "counting" mode instead of perf record causing interrupts to statistically sample hot spots for events.
You could use -e task-clock to just use that event without using HW perf counters. (Or if your system is in a VM, or you didn't change the default /proc/sys/kernel/perf_event_paranoid, perf might not be able to ask the kernel to program any anyway.)
For more about perf, see
https://www.brendangregg.com/perf.html
https://perf.wiki.kernel.org/index.php/Main_Page
For programs that print output, it looks like this:
$ perf stat -r5 echo hello
hello
hello
hello
hello
hello
Performance counter stats for 'echo hello' (5 runs):
0.27 msec task-clock # 0.302 CPUs utilized ( +- 4.51% )
...
0.000890 +- 0.000411 seconds time elapsed ( +- 46.21% )
For a single run (the default with no -r), perf stat will show the elapsed time and user/sys times. But -r doesn't average those, for some reason.
Like the commenter above mentioned, it sounds like you may want to use a loop to run your program multiple times to get more data points. You can use GNU time (/usr/bin/time) with the -o option to write its results to a text file, like so:
/usr/bin/time -o output.txt myprog
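A minimal loop along those lines (myprog again stands for your program; GNU time's -a flag appends successive runs to the same file):
for i in {1..10}; do /usr/bin/time -a -o output.txt myprog; done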

Haskell WinGHC Running Program with Performance Statistics

Using WinGHC, how can I run my program with the +RTS -sstderr option to get statistics like compile time, memory usage, or anything else of interest?
Currently I am using the command line: ghc -rtsopts -O3 -prof -auto-all Main.hs
Using WinGHCi, set your active directory to where your program is, e.g.:
Prelude> :cd C:\Haskell
Then, when you are in the right directory (as per the previous step), enter:
Prelude> :! ghc +RTS -s -RTS -O2 -prof -fprof-auto Main.hs
Replace 'Main.hs' with your program's file name.
This will then print the runtime statistics for the compilation itself, as follows:
152,495,848 bytes allocated in the heap
36,973,728 bytes copied during GC
10,458,664 bytes maximum residency (5 sample(s))
1,213,516 bytes maximum slop
21 MB total memory in use (0 MB lost due to fragmentation)
                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       241 colls,     0 par    0.14s    0.20s     0.0008s    0.0202s
  Gen  1         5 colls,     0 par    0.08s    0.12s     0.0233s    0.0435s
...
Other flags can be set to see other statistics.
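If what you want are the same statistics for a run of the compiled program itself (as in the first question above), compile with -rtsopts and pass the RTS flags to the resulting executable instead; a sketch, assuming GHC produced Main.exe:
Prelude> :! ghc -O2 -rtsopts Main.hs
Prelude> :! Main.exe +RTS -sstderr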

Why is Cygwin so slow?

I ran a script on Ubuntu and timed it:
$ time ./merger
./merger 0.02s user 0.03s system 99% cpu 0.050 total
It took less than a second.
But when I run it under Cygwin:
$ time ./merger
real 3m22.407s
user 0m0.367s
sys 0m0.354s
It takes more than 3 minutes.
Why does this happen? What can I do to increase the execution speed under Cygwin?
As others have already mentioned, Cygwin's implementation of fork, and process spawning on Windows in general, are slow.
Using this fork() benchmark, I get the following results:
rr-@cygwin:~$ ./test 1000
Forked, executed and destroyed 1000 processes in 5.660011 seconds.
rr-@arch:~$ ./test 1000
Forked, executed and destroyed 1000 processes in 0.142595 seconds.
rr-@debian:~$ ./test 1000
Forked, executed and destroyed 1000 processes in 1.141982 seconds.
Using time (for i in {1..10000};do cat /dev/null;done) to benchmark process-spawning performance, I get the following results:
rr-@work:~$ time (for i in {1..10000};do cat /dev/null;done)
(...) 19.11s user 38.13s system 87% cpu 1:05.48 total
rr-@arch:~$ time (for i in {1..10000};do cat /dev/null;done)
(...) 0.06s user 0.56s system 18% cpu 3.407 total
rr-@debian:~$ time (for i in {1..10000};do cat /dev/null;done)
(...) 0.51s user 4.98s system 21% cpu 25.354 total
Hardware specifications:
cygwin: Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz
arch: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
debian: Intel(R) Core(TM)2 Duo CPU T5270 @ 1.40GHz
So, as you can see, no matter what you use, Cygwin performs worse. It loses hands down even against weaker hardware (Cygwin vs. Debian in this benchmark, as per this comparison).
