Making sense from GHC profiler - haskell

I'm trying to make sense from GHC profiler. There is a rather simple app, which uses werq and lens-aeson libraries, and while learning about GHC profiling, I decided to play with it a bit.
Using different options (time tool, +RTS -p -RTS and +RTS -p -h) I acquired entirely different numbers of my memory usage. Having all those numbers, I'm now completely lost trying to understand what is going on, and how much memory the app actually uses.
This situation reminds me the phrase by Arthur Bloch: "A man with a watch knows what time it is. A man with two watches is never sure."
Can you, please, suggest me, how I can read all those numbers, and what is the meaning of each of them.
Here are the numbers:
time -l reports around 19M
#/usr/bin/time -l ./simple-wreq
...
3.02 real 0.39 user 0.17 sys
19070976 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
21040 page reclaims
0 page faults
0 swaps
0 block input operations
0 block output operations
71 messages sent
71 messages received
2991 signals received
43 voluntary context switches
6490 involuntary context switches
Using +RTS -p -RTS flag reports around 92M. Although it says "total alloc" it seems strange to me, that a simple app like this one can allocate and release 91M
# ./simple-wreq +RTS -p -RTS
# cat simple-wreq.prof
Fri Oct 14 15:08 2016 Time and Allocation Profiling Report (Final)
simple-wreq +RTS -N -p -RTS
total time = 0.07 secs (69 ticks # 1000 us, 1 processor)
total alloc = 91,905,888 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
main.g Main 60.9 88.8
MAIN MAIN 24.6 2.5
decodeLenient/look Data.ByteString.Base64.Internal 5.8 2.6
decodeLenientWithTable/fill Data.ByteString.Base64.Internal 2.9 0.1
decodeLenientWithTable.\.\.fill Data.ByteString.Base64.Internal 1.4 0.0
decodeLenientWithTable.\.\.fill.\ Data.ByteString.Base64.Internal 1.4 0.1
decodeLenientWithTable.\.\.fill.\.\.\.\ Data.ByteString.Base64.Internal 1.4 3.3
decodeLenient Data.ByteString.Base64.Lazy 1.4 1.4
individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 443 0 24.6 2.5 100.0 100.0
main Main 887 0 0.0 0.0 75.4 97.4
main.g Main 889 0 60.9 88.8 75.4 97.4
object_ Data.Aeson.Parser.Internal 925 0 0.0 0.0 0.0 0.2
jstring_ Data.Aeson.Parser.Internal 927 50 0.0 0.2 0.0 0.2
unstream/resize Data.Text.Internal.Fusion 923 600 0.0 0.3 0.0 0.3
decodeLenient Data.ByteString.Base64.Lazy 891 0 1.4 1.4 14.5 8.1
decodeLenient Data.ByteString.Base64 897 500 0.0 0.0 13.0 6.7
....
+RTS -p -h and hp2ps show me the following picture and two numbers: 114K in the header and something around 1.8Mb on the graph.
And, just in case, here is the app:
module Main where
import Network.Wreq
import Control.Lens
import Data.Aeson.Lens
import Control.Monad
main :: IO ()
main = replicateM_ 10 g
where
g = do
r <- get "http://httpbin.org/get"
print $ r ^. responseBody
. key "headers"
. key "User-Agent"
. _String
UPDATE 1: Thank everyone for incredible good responses. As was suggested, I add +RTS -s output, so the entire picture builds up for everyone who read it.
#./simple-wreq +RTS -s
...
128,875,432 bytes allocated in the heap
32,414,616 bytes copied during GC
2,394,888 bytes maximum residency (16 sample(s))
355,192 bytes maximum slop
7 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 194 colls, 0 par 0.018s 0.022s 0.0001s 0.0022s
Gen 1 16 colls, 0 par 0.027s 0.031s 0.0019s 0.0042s
UPDATE 2: The size of the executable:
#du -h simple-wreq
63M simple-wreq

A man with a watch knows what time it is. A man with two watches is never sure.
Ah, but what do does two watches show? Are both meant to show the current time in UTC? Or is one of them supposed to show the time in UTC, and the other one the time on a certain point on Mars? As long as they are in sync, the second scenario wouldn't be a problem, right?
And that is exactly what is happening here. You compare different memory measurements:
the maximum residency
the total amount of allocated memory
The maximum residency is the highest amount of memory your program ever uses at a given time. That's 19MB. However, the total amount of allocated memory is a lot more, since that's how GHC works: it "allocates" memory for objects that are garbage collected, which is almost everything that's not unpacked.
Let us inspect a C example for this:
int main() {
int i;
char * mem;
for(i = 0; i < 5; ++i) {
mem = malloc(19 * 1000 * 1000);
free(mem);
}
return 0;
}
Whenever we use malloc, we will allocate 19 megabytes of memory. However, we free the memory immediately after. The highest amount of memory we ever have at one point is therefore 19 megabytes (and a little bit more for the stack and the program itself).
However, in total, we allocate 5 * 19M, 95M total. Still, we could run our little program with just 20 megs of RAM fine. That's the difference between total allocated memory and maximum residency. Note that the residency reported by time is always at least du <executable>, since that has to reside in memory too.
That being said, the easiest way to generate statistics is -s, which will show how what was the maximum residency from the Haskell's program point of view. In your case, it will be the 1.9M, the number in your heap profile (or double the amount due to profiling). And yeah, Haskell executables tend to get extremely large, since libraries are statically linked.

time -l is displaying the (resident, i.e. not swapped out) size of the process as seen by the operating system (obviously). This includes twice the maximum size of the Haskell heap (due to the way that GHC's GC works), plus anything else allocated by the RTS or other C libraries, plus the code of your executable itself plus the libraries it depends on, etc. I'm guessing in this case the primary contributor to the 19M is the size of your exectuable.
total alloc is the total amount allocated onto the Haskell heap. It is not at all a measure of maximum heap size (which is what people usually mean by "how much memory is my program using"). Allocation is very cheap and allocation rates of around 1GB/s are typical for a Haskell program.
The number in the header of the hp2ps output "114,272 bytes x seconds" is something completely different again: it is the integral of the graph, and is measured in bytes * seconds, not in bytes. For example if your program holds onto a 10 MB structure for 4 seconds then that will cause this number to increase by 40 MB*s.
The number around 1.8 MB shown in the graph is the actual maximum size of the Haskell heap, which is probably the number you're most interested in.
You've omitted the most useful source of numbers about your program's execution, which is running it with +RTS -s (this doesn't even require it to have been built with profiling).

Related

Optimal number of threads for GNU parallel

I think I have a fairly basic question. I just discovered the GNU parallel package and I think my workflow can really benefit from it!
I am using a loop which loops through my read files and generates the desired output. The command that is excecuted for each read looks something like this:
STAR --runThreadN 8 --genomeDir star_index/ --readFilesIn R1.fq R2.fq
As you can see I specified 8 threads, which is the amount of threads my virtual machine has.
My question now is this following:
If I use GNU parallel with a command like this:
cat reads| parallel -j 3 STAR --runThreadN 8 --genomeDir star_index/ --readFilesIn {}_R1.fq {}_R2.fq
Can my virtual machine handle the number of threads I specified, if I execute 3 jobs in parallel?
Or do I need 24 threads (3*8 threads) to properly excecute this command?
Im sorry if this is a basic question, I am very new to the field and any help is much appreciated!
The best advice is simply: Try different values and measure.
In parallelization there are sooo many factors that can affect the results: Disk I/O, shared CPU cache, and shared RAM bandwidth just to name three.
top is your friend when measuring. If you can manage to get all CPUs to have <5% idle you are unlikely to go any faster - no matter what you do.
top - 14:49:10 up 10 days, 5:48, 123 users, load average: 2.40, 1.72, 1.67
Tasks: 751 total, 3 running, 616 sleeping, 8 stopped, 4 zombie
%Cpu(s): 17.3 us, 6.2 sy, 0.0 ni, 76.2 id, 0.3 wa, 0.0 hi, 0.0 si, 0.0 st
GiB Mem : 31.239 total, 1.441 free, 21.717 used, 8.081 buff/cache
GiB Swap: 117.233 total, 104.146 free, 13.088 used. 4.706 avail Mem
This machine is 76.2% idle. If your processes use loads of CPU then starting more processes in parallel here may help. If they use loads of disk I/O it may or may not help. Only way to know is to test and measure.
top - 14:51:00 up 10 days, 5:50, 124 users, load average: 3.41, 2.04, 1.78
Tasks: 759 total, 8 running, 619 sleeping, 8 stopped, 4 zombie
%Cpu(s): 92.8 us, 6.9 sy, 0.0 ni, 0.1 id, 0.0 wa, 0.0 hi, 0.2 si, 0.0 st
GiB Mem : 31.239 total, 1.383 free, 21.772 used, 8.083 buff/cache
GiB Swap: 117.233 total, 104.146 free, 13.087 used. 4.649 avail Mem
This machine is 0.1% idle. Starting more processes is unlikely to make things go faster.
So increase the parallelization until idle time hits a minimum or until average processing time hits a minimum (--joblog my.log can be useful to see how long a job takes).
And yes: GNU Parallel is likely to speed-up bioinformatics (being written by a fellow bioinformatician).
Consider reading GNU Parallel 2018 (paper: http://www.lulu.com/shop/ole-tange/gnu-parallel-2018/paperback/product-23558902.html download: https://doi.org/10.5281/zenodo.1146014) Read at least chapter 1+2. It should take you less than 20 minutes. Your command line will love you for it.

File sizes are reported differently

Why are file sizes all different?
In Windows 10 I can see all of these sizes:
11,116 KB
10.8 MB
11,382,240 Bytes
11,382,784 Bytes
If I use the Console Window:
D:\My Programs\2017\MeetSchedAssist\Inno\Output>dir *.exe
Volume in drive D is DATA
Volume Serial Number is A8B0-A5C6
Directory of D:\My Programs\2017\MeetSchedAssist\Inno\Output
03/04/2018 08:50 11,382,240 MeetSchedAssistSetup.exe
1 File(s) 11,382,240 bytes
0 Dir(s) 719,837,487,104 bytes free
D:\My Programs\2017\MeetSchedAssist\Inno\Output>
I understand that perhaps on the physical media it has to round it to physically take a certain amount of space, but that line above:
Size: 10.8 MB (11,382,240 bytes)
Huh? Why does it not say 11.38 MB?
Once upon a time it has been defined that
1 kB = 1024 B
1 MB = 1024 kB
If you divide your bytes figure all the way down to MB, you'll get all those figures.
Now that they noticed that many people tend to walk into that trap, they have redefined the unit multiples and defined new ones
1 kiB = 1024 B
1 MiB = 1024 kiB
1 kB = 1000 B
1 MB = 1000 kB
but this scheme is not so widespread (seems to be more common with total size specs of storage media).
Funny sidenote: I guess I am not the only one who has learned it the old way and now mixes it up with the current definition all the time. I'd say problems like this are the root cause for humanity being mostly conservatively oriented.

Excessive Gen 2 Free Blocks in Crash Dump

On inspecting a crash dump file for an out of memory exception reported by a client the results of !DumpHeap -stat showed that 575MB of memory is being taken up by 45,000 objects of type "Free" most of which I assume would have to reside in Gen 2 due to the size.
The first places I looked for problems were the large object heap (LOH) and pinned objects. The large object heap with free space included was only 70MB so that wasn't the issue and running !gchandles showed:
GC Handle Statistics:
Strong Handles: 155
Pinned Handles: 265
Async Pinned Handles: 8
Ref Count Handles: 163
Weak Long Handles: 0
Weak Short Handles: 0
Other Handles: 0
which is a very small number of handles (around 600) compared to the number of free objects (45,000). To me this rules out the free blocks being caused by pinning.
I also looked into the free blocks themselves to see if maybe they had a consistent size, but on inspection the sizes varied widely and went from just short of 5MB to only around 12 bytes or so.
Any help would be appreciated! I am at a loss since there is fragmentation but no signs of it being cause by the two places that I know to look which are the large object heap (LOH) and pinned handles.
In which generation are the free objects?
I assume would have to reside in Gen 2 due to the size
Size is not related to generations. To find out in which generation the free blocks reside, you can follow these steps:
From !dumpheap -stat -type Free get the method table:
0:003> !dumpheap -stat -type Free
total 7 objects
Statistics:
MT Count TotalSize Class Name
00723538 7 100 Free
Total 7 objects
From !eeheap -gc, get the start addresses of the generations.
0:003> !eeheap -gc
Number of GC Heaps: 1
generation 0 starts at 0x026a1018
generation 1 starts at 0x026a100c
generation 2 starts at 0x026a1000
ephemeral segment allocation context: none
segment begin allocated size
026a0000 026a1000 02731ff4 0x00090ff4(593908)
Large object heap starts at 0x036a1000
segment begin allocated size
036a0000 036a1000 036a3250 0x00002250(8784)
Total Size 0x93244(602692)
------------------------------
GC Heap Size 0x93244(602692)
Then dump the free objects only of a specific generation by passing the start and end address (e.g. of generation 2 here):
0:003> !dumpheap -mt 00723538 0x026a1000 0x026a100c
Address MT Size
026a1000 00723538 12 Free
026a100c 00723538 12 Free
total 2 objects
Statistics:
MT Count TotalSize Class Name
00723538 2 24 Free
Total 2 objects
So in my simple case, there are 2 free objects in generation 2.
Pinned handles
600 pinned objects should not cause 45.000 free memory blocks. Still from my experience, 600 pinned handles are a lot. But first, check in which generation the free memory blocks reside.

Linux(Ubuntu) load average higher than total-true-utilization?

I have a dell pd2950(2x4core) server running Ubuntu server 12.04LTS. And there's a VLC encoder instance running. Recently I updated the script(VLM) for VLC to increase quality and this means I'm increasing the CPU utilization too. So I started to tune the script to avoid exceeding maximum utilization. I use top to monitor the CPU utilization. I found that the load average is higher than 100%(I have 8-cores totally so 8.00 is 100%) but there's still 20-35% is idle, like:
top - 21:41:19 up 2 days, 17:15, 1 user, load average: 9.20, 9.65, 8.80
Tasks: 148 total, 1 running, 147 sleeping, 0 stopped, 0 zombie
Cpu(s): 32.8%us, 0.7%sy, 29.7%ni, 36.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 1982680k total, 1735672k used, 247008k free, 126284k buffers
Swap: 0k total, 0k used, 0k free, 774228k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
9715 wilson RT 0 2572m 649m 13m S 499 33.5 13914:44 vlc
11663 wilson 20 0 17344 1328 964 R 2 0.1 0:02.00 top
1 root 20 0 24332 2264 1332 S 0 0.1 0:01.06 init
2 root 20 0 0 0 0 S 0 0.0 0:00.09 kthreadd
3 root 20 0 0 0 0 S 0 0.0 0:27.05 ksoftirqd/0
4 root 20 0 0 0 0 S 0 0.0 0:00.00 kworker/0:0
5 root 0 -20 0 0 0 S 0 0.0 0:00.00 kworker/0:0H
To confirm my CPU(s) don't have Hyper-Thread, I tried:
wilson#server:/$ nproc
8
And to reduce the sampling deviation cause by refresh time, I also tried:
wilson#server:/$ top -d 0.1
I looked at the number %id for a long time, it haven't been lower than 14.
I also tried:
wilson#server:/$ uptime
21:57:20 up 2 days, 17:31, 1 user, load average: 9.03, 9.12, 9.35
The 1m load average often reach 14-15. So I'm wondering what's wrong with my system? Has anyone ever have this problem?
More information:
I'm using VLC with x264 codec to encode a live HTTP stream(application/octet-stream). It use ffmpeg(libavc) to decode and output as Apple HLS(.ts segment). I found this problem after I added arguments for x264:
level=41,ref=5,b-adapt=2,direct=auto,me=umh,subq=8,rc-lookahead=60,analyse=all
This almost equal to preset=slower. And as you can see, my VLC in running in real-time. The parameter is:
wilson#server:/$ chrt -p -f 99 vlc-wrapper
There does not appear to be anything wrong with your system. What is wrong seems to be your understanding of CPU accounting. In particular, load average has nearly nothing at all to do with CPU usage. Load average is based on the number of processes that are ready to run (not waiting on I/O, network, keyboard input, etc...), if there is an available CPU for them to be scheduled on. While it's true that, given an 8 core system, if all 8 cores are 100% busy with a single CPU-bound thread each, your load average should be around 8.00, it is entirely possible to have a load average of 200.0 with near-0% CPU utilization. All that would indicate is you have 200 processes that are ready to run, but as soon as they get scheduled, they do almost nothing before they go back to waiting for input of some sort.
Your top output shows that vlc seems to be using roughly the equivalent of 5 of your cores, but it doesn't indicate whether you have 5 cores at 100% each, or if all 8 cores are at 62.5% each. All of the other processes listed by top also contribute to your load average, as well as CPU usage. In particular, top running with a short delay like your example of 0.1 seconds, will probably increase your load average by almost 1 itself, even though, overall, it's not using a lot of CPU time.
Read this:
Understanding load average vs. cpu usage
If the load average is at 7, with 4 hyper-threaded processors, shouldn't that means that the CPU is working to about 7/8 capacity?
No it just means that you have 7 running processes in the job queue on average.
But I think that we can't use load average as a reference number to determine system is overload or not. So that I wonder if there's a kernel-level cpu utitlization statistical tools or not?(why kernel level because reduce performance loss)

How to find CPU Usage in google profiler

I am using Google CPU Profiling tool.
http://google-perftools.googlecode.com/svn/trunk/doc/cpuprofile.html
On the documentation it is given
Analyzing Text Output
Text mode has lines of output that look like this:
14 2.1% 17.2% 58 8.7% std::_Rb_tree::find
Here is how to interpret the columns:
Number of profiling samples in this
function
Percentage of profiling
samples in this function
Percentage
of profiling samples in the functions
printed so far
Number of profiling
samples in this function and its
callees
Percentage of profiling
samples in this function and its
callees
Function name
But I am not able to understand which columns tell me exact or percentage CPU usages of function ?
How to get CPU uses of a function suing google profile ?
Text mode has lines of output that look like this:
It will have a lot of lines, for example, collect profile:
$ CPUPROFILE=a.pprof LD_PRELOAD=./libprofiler.so ./a.out
The program a.out is the same as here: Kcachegrind/callgrind is inaccurate for dispatcher functions?
Then analyze it with pprof top command:
$ pprof ./a.out a.pprof
Using local file ./a.out.
Using local file a.pprof.
Welcome to pprof! For help, type 'help'.
(pprof) top
Total: 185 samples
76 41.1% 41.1% 76 41.1% do_4
51 27.6% 68.6% 51 27.6% do_3
37 20.0% 88.6% 37 20.0% do_2
21 11.4% 100.0% 21 11.4% do_1
0 0.0% 100.0% 185 100.0% __libc_start_main
0 0.0% 100.0% 185 100.0% dispatcher
0 0.0% 100.0% 34 18.4% first2
0 0.0% 100.0% 42 22.7% inner2
0 0.0% 100.0% 68 36.8% last2
0 0.0% 100.0% 185 100.0% main
So, what is here: the total sample count is 185; and the Frequency is the default (1 sample every 10 ms; or 100 samples per second). Then total runtime is ~ 1.85 second.
First column is the number of samples, which was taken when a.out works in the given function. If we divide it by Frequency, we will get total time estimation of given function, e.g. do_4 runs for ~0.8 sec
Second column is the sample count in given function divided by total count, or the percentage of this function in total program run time. So do_4 is the slowest function (41% of total program time) and do_1 is only 11% of program runtime. I think you are interested in this column.
Third column is the sum of current and preceding lines; so we can know that 2 slowest functions, do_4 and do_3 totally accounted for 68% of total run time (41%+27%)
4rd and 5th columns are like first and second; but these one will account not only samples of the given function itself, but also samples of all functions called from given, both directly and indirectly. You can see, that main and all called from it is 100% of total run time (because main is the program itself; or root of calltree of program) and last2 with its children is 36.8% of runtime (its children in my program are: half of calls to do_4 and half of calls to do_3 = 41.1 + 27.6 /2 = 69.7/2 ~= 34% + some time in the function itself)
PS: there are some other useful pprof commands, like callgrind or gv which shows graphic representation of call tree with profiling information added.

Resources