Java 6 G1GC - Repeated full GC

We have a Java application which is configured to use Java 6 with the G1GC
collector. Recently we observed that Full GC gets stuck in repeated attempts
even though there is enough memory (per the GC logs). Has anyone come across
such a scenario with the Java 6 and G1GC combination?
GC Config:
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=10M -XX:-OmitStackTraceInFastThrow -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=72 -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false
GC Logs:
9 172169595.9 172169595.9 172169595.9 172169595.9]
[Other: 0.4 ms]
[Clear CT: 0.8 ms]
[Other: 2.1 ms]
[Choose CSet: 0.0 ms]
[ 6526M->6456M(14848M)]
[Times: user=0.20 sys=0.00, real=0.02 secs]
2018-03-08T15:51:22.822+0000: 172169.607: [Full GC 6456M->4080M(14848M), 6.7084910 secs]
[Times: user=11.75 sys=0.02, real=6.70 secs]
172177.358: [GC pause (young), 0.01637700 secs]
[Parallel Time: 13.9 ms]
[GC Worker Start Time (ms): 172177359.4 172177359.5 172177359.5 172177359.5 172177359.5 172177359.5 172177359.6 172177359.6 172177359.6 172177359.6 172177359.7 172177359.7 172177359.7 172177359.7 172177359.8 172177359.8 172177359.8 172177359.8]
[Update RS (ms): 2.5 1.5 2.2 2.2 2.3 2.2 2.3 2.1 2.2 2.5 1.5 1.5 1.0 1.5 2.3 1.6 1.9 2.2
Avg: 2.0, Min: 1.0, Max: 2.5]
[Processed Buffers : 12 106 62 1 1 1 1 1 1 83 113 149 62 136 1 76 1 1
Sum: 808, Avg: 44, Min: 1, Max: 149]
[Ext Root Scanning (ms): 4.6 5.6 5.5 5.4 5.2 5.4 5.3 5.3 5.0 4.4 5.4 5.3 5.9 5.3 5.1 5.2 5.2 5.1
Avg: 5.2, Min: 4.4, Max: 5.9]
[Mark Stack Scanning (ms): 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Avg: 0.0, Min: 0.0, Max: 0.0]
[Scan RS (ms): 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Avg: 0.0, Min: 0.0, Max: 0.0]
[Object Copy (ms): 6.3 6.3 5.7 5.8 5.8 5.7 5.7 5.9 6.1 6.3 6.3 6.3 6.3 6.3 5.7 6.3 6.0 5.7
Avg: 6.0, Min: 5.7, Max: 6.3]
[Termination (ms): 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Avg: 0.0, Min: 0.0, Max: 0.0]
[Termination Attempts : 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Sum: 18, Avg: 1, Min: 1, Max: 1]
[GC Worker End Time (ms): 172177372.9 172177372.9 172177372.9 172177372.9 172177372.9 172177372.9 172177372.9 172177372.9 172177372.9 172177372.9 172177372.9 172177372.9 172177372.9 172177372.9 172177372.9 172177372.9 172177372.9 172177372.9]
[Other: 0.6 ms]
[Clear CT: 0.3 ms]
[Other: 2.2 ms]
[Choose CSet: 0.0 ms]
[ 6543M->6473M(14848M)]
[Times: user=0.22 sys=0.00, real=0.01 secs]
2018-03-08T15:51:30.598+0000: 172177.384: [Full GC 6473M->4098M(14848M), 6.7136020 secs]
[Times: user=11.72 sys=0.01, real=6.71 secs]
172185.070: [GC pause (young), 0.02292200 secs]
[Parallel Time: 19.0 ms]
[GC Worker Start Time (ms): 172185071.9 172185072.0 172185072.0 172185072.1 172185072.1 172185072.2 172185072.2 172185072.2 172185072.2 172185072.3 172185072.3 172185072.3 172185072.3 172185072.4 172185072.4 172185072.4 172185072.5 172185072.5]
[Update RS (ms): 5.4 5.5 4.1 5.2 9.9 3.7 5.0 4.2 4.5 3.8 3.7 4.7 3.8 3.7 4.8 4.5 3.8 5.5
Avg: 4.8, Min: 3.7, Max: 9.9]
[Processed Buffers : 3 2 3 2 2 109 3 2 2 58 68 100 107 69 62 121 97 2
Sum: 812, Avg: 45, Min: 2, Max: 121]
[Ext Root Scanning (ms): 5.8 5.5 5.7 5.3 6.0 5.5 5.4 5.6 5.2 5.3 5.4 4.5 5.3 5.3 4.3 4.5 5.1 5.1
Avg: 5.3, Min: 4.3, Max: 6.0]
[Mark Stack Scanning (ms): 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Avg: 0.0, Min: 0.0, Max: 0.0]
[Scan RS (ms): 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Avg: 0.0, Min: 0.0, Max: 0.1]

I also think you should update your JVM. However, I understand that sometimes this is not so easy. Here you can see some improvements in G1 when JDK 8 is compared with JDK 7:
Source: https://www.redhat.com/en/blog/part-1-introduction-g1-garbage-collector
A few humongous objects may not cause a problem, but a steady
allocation of them can lead to significant heap fragmentation and a
noticeable performance impact. Prior to JDK8u40, humongous objects
could only be collected through a Full GC so the potential for this to
impact JDK7 and early JDK8 users is very high. This is why it’s
critical to understand both the size of objects your application
produces and what G1 is defining for region size. Even in the latest
JDK8, if you are doing significant numbers of humongous allocations,
it is a good idea to evaluate and tune away as many as possible
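For context, G1 treats any allocation of half a region or more as humongous. One hedged mitigation (not part of the original configuration) is to set the region size explicitly so that the largest commonly allocated objects stay below that threshold, for example:
-XX:+UseG1GC -XX:G1HeapRegionSize=16m
G1HeapRegionSize accepts powers of two between 1m and 32m; whether 16m is the right value depends on the object sizes your application actually produces.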
Furthermore, within the same article you can see this:
Finally and unfortunately, G1 also has to deal with the dreaded Full
GC. While G1 is ultimately trying to avoid Full GC’s, they are still a
harsh reality especially in improperly tuned environments. Given that
G1 is targeting larger heap sizes, the impact of a Full GC can be
catastrophic to in-flight processing and SLAs. One of the primary
reasons is that Full GCs are still a single-threaded operation in
G1. Looking at causes, the first, and most avoidable, is related to
Metaspace.
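If class-metadata growth turns out to be the trigger, a hedged starting point (again, not part of the original configuration) is to size that space up front so it does not have to expand via Full GCs, e.g. on Java 8+:
-XX:MetaspaceSize=512m -XX:MaxMetaspaceSize=512m
On the Java 6/7 builds discussed here the equivalent space is PermGen, sized with -XX:PermSize and -XX:MaxPermSize.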
By the way, it seems the newest version of Java (10) is going to include a G1 with the capability of executing Full GCs in parallel.
https://www.opsian.com/blog/java-10-with-g1/
Java 10 reduces Full GC pause times by iteratively improving on its
existing algorithm. Until Java 10 G1 Full GCs ran in a single thread.
That’s right - your 32 core server and its 128GB will stop and pause
until a single thread takes out the garbage.

Related

Improving memory usage during serialization (Data.Binary)

I'm still kinda new to Haskell and learning new things every day. My problem is too-high memory usage during serialization using the Data.Binary library. Maybe I'm just using the library the wrong way, but I can't figure it out.
The actual idea is that I read binary data from disk, add new data, and write everything back to disk. Here's the code:
module Main where

import Data.Binary
import System.Environment
import Data.List (foldl')

data DualNo = DualNo Int Int deriving (Show)

instance Data.Binary.Binary DualNo where
    put (DualNo a b) = do
        put a
        put b
    get = do
        a <- get
        b <- get
        return (DualNo a b)

-- read DualNo values from HDD
readData :: FilePath -> IO [DualNo]
readData filename = do
    no <- decodeFile filename :: IO [DualNo]
    return no

-- write DualNo values to HDD
writeData :: [DualNo] -> String -> IO ()
writeData no filename = encodeFile filename (no :: [DualNo])

writeEmptyDataToDisk :: String -> IO ()
writeEmptyDataToDisk filename = writeData [] filename

-- feed the list with a new dataset
feedWithInputData :: [DualNo] -> [(Int, Int)] -> [DualNo]
feedWithInputData existData newData = foldl' func existData newData
  where
    func dataset (a, b) = DualNo a b : dataset

main :: IO ()
main = do
    [newInputData, toPutIntoExistingData] <- System.Environment.getArgs
    if toPutIntoExistingData == "empty"
        then writeEmptyDataToDisk "myData.dat"
        else return ()
    loadedData <- readData "myData.dat"
    newData <- return (case newInputData of
        "dataset1" -> feedWithInputData loadedData dataset1
        "dataset2" -> feedWithInputData loadedData dataset2
        _          -> feedWithInputData loadedData dataset3)
    writeData newData "myData.dat"

dataset1 = zip [1..100000] [2,4..200000]
dataset2 = zip [5,10..500000] [3,6..300000]
dataset3 = zip [4,8..400000] [6,12..600000]
I'm pretty sure there's a lot to improve in this code, but my biggest problem is the memory usage with big datasets.
I profiled my program with GHC.
$ ghc -O2 --make -prof -fprof-auto -auto-all -caf-all -rtsopts -fforce-recomp Main.hs
$ ./Main dataset1 empty +RTS -p -sstderr
165,085,864 bytes allocated in the heap
70,643,992 bytes copied during GC
12,298,128 bytes maximum residency (7 sample(s))
424,696 bytes maximum slop
35 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 306 colls, 0 par 0.035s 0.035s 0.0001s 0.0015s
Gen 1 7 colls, 0 par 0.053s 0.053s 0.0076s 0.0180s
INIT time 0.001s ( 0.001s elapsed)
MUT time 0.059s ( 0.062s elapsed)
GC time 0.088s ( 0.088s elapsed)
RP time 0.000s ( 0.000s elapsed)
PROF time 0.000s ( 0.000s elapsed)
EXIT time 0.003s ( 0.003s elapsed)
Total time 0.154s ( 0.154s elapsed)
%GC time 57.0% (57.3% elapsed)
Alloc rate 2,781,155,968 bytes per MUT second
Productivity 42.3% of total user, 42.5% of total elapsed
Looking at the prof-file:
Tue Apr 12 18:11 2016 Time and Allocation Profiling Report (Final)
Main +RTS -p -sstderr -RTS dataset1 empty
total time = 0.06 secs (60 ticks @ 1000 us, 1 processor)
total alloc = 102,613,008 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
put Main 48.3 53.0
writeData Main 30.0 18.8
dataset1 Main 13.3 23.4
feedWithInputData Main 6.7 0.0
feedWithInputData.func Main 1.7 4.7
individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 68 0 0.0 0.0 100.0 100.0
main Main 137 0 0.0 0.0 86.7 76.6
feedWithInputData Main 150 1 6.7 0.0 8.3 4.7
feedWithInputData.func Main 154 100000 1.7 4.7 1.7 4.7
writeData Main 148 1 30.0 18.8 78.3 71.8
put Main 155 100000 48.3 53.0 48.3 53.0
readData Main 147 0 0.0 0.1 0.0 0.1
writeEmptyDataToDisk Main 142 0 0.0 0.0 0.0 0.1
writeData Main 143 0 0.0 0.1 0.0 0.1
CAF:main1 Main 133 0 0.0 0.0 0.0 0.0
main Main 136 1 0.0 0.0 0.0 0.0
CAF:main2 Main 132 0 0.0 0.0 0.0 0.0
main Main 139 0 0.0 0.0 0.0 0.0
writeEmptyDataToDisk Main 140 1 0.0 0.0 0.0 0.0
writeData Main 141 1 0.0 0.0 0.0 0.0
CAF:main7 Main 131 0 0.0 0.0 0.0 0.0
main Main 145 0 0.0 0.0 0.0 0.0
readData Main 146 1 0.0 0.0 0.0 0.0
CAF:dataset1 Main 123 0 0.0 0.0 5.0 7.8
dataset1 Main 151 1 5.0 7.8 5.0 7.8
CAF:dataset4 Main 122 0 0.0 0.0 5.0 7.8
dataset1 Main 153 0 5.0 7.8 5.0 7.8
CAF:dataset5 Main 121 0 0.0 0.0 3.3 7.8
dataset1 Main 152 0 3.3 7.8 3.3 7.8
CAF:main4 Main 116 0 0.0 0.0 0.0 0.0
main Main 138 0 0.0 0.0 0.0 0.0
CAF:main6 Main 115 0 0.0 0.0 0.0 0.0
main Main 149 0 0.0 0.0 0.0 0.0
CAF:main3 Main 113 0 0.0 0.0 0.0 0.0
main Main 144 0 0.0 0.0 0.0 0.0
CAF GHC.Conc.Signal 107 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding 103 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding.Iconv 101 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Handle.FD 94 0 0.0 0.0 0.0 0.0
CAF GHC.IO.FD 86 0 0.0 0.0 0.0 0.0
Now I add further data:
$ ./Main dataset2 myData.dat +RTS -p -sstderr
343,601,008 bytes allocated in the heap
175,650,728 bytes copied during GC
34,113,936 bytes maximum residency (8 sample(s))
971,896 bytes maximum slop
78 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 640 colls, 0 par 0.082s 0.083s 0.0001s 0.0017s
Gen 1 8 colls, 0 par 0.140s 0.141s 0.0176s 0.0484s
INIT time 0.001s ( 0.001s elapsed)
MUT time 0.138s ( 0.139s elapsed)
GC time 0.221s ( 0.224s elapsed)
RP time 0.000s ( 0.000s elapsed)
PROF time 0.000s ( 0.000s elapsed)
EXIT time 0.006s ( 0.006s elapsed)
Total time 0.370s ( 0.370s elapsed)
%GC time 59.8% (60.5% elapsed)
Alloc rate 2,485,518,518 bytes per MUT second
Productivity 39.9% of total user, 39.8% of total elapsed
Looking at the new prof-file:
Tue Apr 12 18:15 2016 Time and Allocation Profiling Report (Final)
Main +RTS -p -sstderr -RTS dataset2 myData.dat
total time = 0.14 secs (139 ticks @ 1000 us, 1 processor)
total alloc = 213,866,232 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
put Main 41.0 50.9
writeData Main 25.9 18.0
get Main 25.2 16.8
dataset2 Main 4.3 11.2
readData Main 1.4 0.8
feedWithInputData.func Main 1.4 2.2
individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 68 0 0.0 0.0 100.0 100.0
main Main 137 0 0.0 0.0 95.7 88.8
feedWithInputData Main 148 1 0.7 0.0 2.2 2.2
feedWithInputData.func Main 152 100000 1.4 2.2 1.4 2.2
writeData Main 145 1 25.9 18.0 66.9 68.9
put Main 153 200000 41.0 50.9 41.0 50.9
readData Main 141 0 1.4 0.8 26.6 17.6
get Main 144 0 25.2 16.8 25.2 16.8
CAF:main1 Main 133 0 0.0 0.0 0.0 0.0
main Main 136 1 0.0 0.0 0.0 0.0
CAF:main7 Main 131 0 0.0 0.0 0.0 0.0
main Main 139 0 0.0 0.0 0.0 0.0
readData Main 140 1 0.0 0.0 0.0 0.0
CAF:dataset2 Main 126 0 0.0 0.0 0.7 3.7
dataset2 Main 149 1 0.7 3.7 0.7 3.7
CAF:dataset6 Main 125 0 0.0 0.0 2.2 3.7
dataset2 Main 151 0 2.2 3.7 2.2 3.7
CAF:dataset7 Main 124 0 0.0 0.0 1.4 3.7
dataset2 Main 150 0 1.4 3.7 1.4 3.7
CAF:$fBinaryDualNo1 Main 120 0 0.0 0.0 0.0 0.0
get Main 143 1 0.0 0.0 0.0 0.0
CAF:main4 Main 116 0 0.0 0.0 0.0 0.0
main Main 138 0 0.0 0.0 0.0 0.0
CAF:main6 Main 115 0 0.0 0.0 0.0 0.0
main Main 146 0 0.0 0.0 0.0 0.0
CAF:main5 Main 114 0 0.0 0.0 0.0 0.0
main Main 147 0 0.0 0.0 0.0 0.0
CAF:main3 Main 113 0 0.0 0.0 0.0 0.0
main Main 142 0 0.0 0.0 0.0 0.0
CAF GHC.Conc.Signal 107 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding 103 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding.Iconv 101 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Handle.FD 94 0 0.0 0.0 0.0 0.0
CAF GHC.IO.FD 86 0 0.0 0.0 0.0 0.0
The more often I add new data, the higher the memory usage becomes. I mean, it's clear that I need more memory for a bigger dataset. But isn't there a better solution to this problem (like gradually writing data back to disk)?
Edit:
Actually the most important thing, that bothers me, is the following observation:
I run the program for the first time and add new data to an existing (empty) file on my disk.
The size of the saved file on my disk is: 1.53 MByte.
But (looking at the first prof-file) the program allocated more than 102 MByte. More than 50% was allocated by the put function from the Data.Binary package.
I run the program a second time and add new data to an existing (not empty) file on my disk.
The size of the saved file on my disk is 3.05 MByte.
But (looking at the second prof-file) the program allocated more than 213 MByte. More than 66% was allocated by the put and get functions together.
=> Conclusion: In the first example I needed 102/1.53 = 66 times more memory running the program than space for the binary file on my disk.
In the second example I needed 213/3.05 = 69 times more memory running the program than space for the binary file on my disk.
Question:
Is the Data.Binary package so efficient (and awesome) at serialization that it can decrease the needed space to such an extent?
Analogous question:
Do I really need so much more memory for loading the data in my program than the same data takes up in a binary file on disk?
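One note on the numbers: total alloc in the .prof file (like "bytes allocated in the heap" in the -s output) is the cumulative amount allocated over the whole run, not the peak held at once; the -s output above reports about 12 MB maximum residency for the first run. Still, appending only the new records would avoid decoding and re-encoding the whole file on every run. Below is a minimal, hedged sketch of that idea (my own code, not from the post), assuming the DualNo type and Binary instance above; note that it uses its own record-at-a-time framing, which is not interchangeable with the length-prefixed [DualNo] files that encodeFile writes:
import qualified Data.ByteString.Lazy as BL
import Data.Binary
import Data.Binary.Get (isEmpty, runGet)
import Data.Binary.Put (runPut)

-- append only the new records to the end of the file
appendRecords :: FilePath -> [DualNo] -> IO ()
appendRecords path = BL.appendFile path . runPut . mapM_ put

-- read records until end of input (no length prefix, unlike encodeFile)
readAllRecords :: FilePath -> IO [DualNo]
readAllRecords path = runGet go <$> BL.readFile path
  where
    go = do
        done <- isEmpty
        if done then return [] else (:) <$> get <*> go
Independently of that, making the fields strict and unpacked (data DualNo = DualNo {-# UNPACK #-} !Int {-# UNPACK #-} !Int) shrinks each in-memory record from several boxed heap objects to a single three-word constructor, which should narrow the gap between in-memory size and on-disk size.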

Apache Spark - one Spark core divided into several CPU cores

I have a question about Apache Spark. I set up an Apache Spark standalone cluster on my Ubuntu desktop. Then I wrote two lines in the spark-env.sh file: SPARK_WORKER_INSTANCES=4 and SPARK_WORKER_CORES=1. (I found that export is not necessary in the spark-env.sh file if I start the cluster after editing it.)
I wanted to have 4 worker instances on my single desktop and let them occupy 1 CPU core each. The result was like this:
top - 14:37:54 up 2:35, 3 users, load average: 1.30, 3.60, 4.84
Tasks: 255 total, 1 running, 254 sleeping, 0 stopped, 0 zombie
%Cpu0 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 : 1.7 us, 0.3 sy, 0.0 ni, 98.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 41.6 us, 0.0 sy, 0.0 ni, 58.4 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu4 : 0.3 us, 0.0 sy, 0.0 ni, 99.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu5 : 0.3 us, 0.3 sy, 0.0 ni, 99.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu6 : 59.0 us, 0.0 sy, 0.0 ni, 41.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu7 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 16369608 total, 11026436 used, 5343172 free, 62356 buffers
KiB Swap: 16713724 total, 360 used, 16713364 free. 2228576 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
10829 aaaaaa 20 0 42.624g 1.010g 142408 S 101.2 6.5 0:22.78 java
10861 aaaaaa 20 0 42.563g 1.044g 142340 S 101.2 6.7 0:22.75 java
10831 aaaaaa 20 0 42.704g 1.262g 142344 S 100.8 8.1 0:24.86 java
10857 aaaaaa 20 0 42.833g 1.315g 142456 S 100.5 8.4 0:26.48 java
1978 aaaaaa 20 0 1462096 186480 102652 S 1.0 1.1 0:34.82 compiz
10720 aaaaaa 20 0 7159748 1.579g 32008 S 1.0 10.1 0:16.62 java
1246 root 20 0 326624 101148 65244 S 0.7 0.6 0:50.37 Xorg
1720 aaaaaa 20 0 497916 28968 20624 S 0.3 0.2 0:02.83 unity-panel-ser
2238 aaaaaa 20 0 654868 30920 23052 S 0.3 0.2 0:06.31 gnome-terminal
I think the java processes in the first 4 lines are the Spark workers. If that's correct, it's nice that there are four Spark workers, each using 1 physical core (e.g., 101.2%).
But I see that 5 physical cores are used. Among them, CPU0, CPU3, and CPU7 are fully used. I think one Spark worker is using each of those physical cores. That's fine.
However, the usage levels of CPU2 and CPU6 are 41.6% and 59.0%, respectively. They add up to 100.6%, and I think one worker's job is distributed across those 2 physical cores.
With SPARK_WORKER_INSTANCES=4 and SPARK_WORKER_CORES=1, is this a normal situation? Or is this a sign of some errors or problems?
This is perfectly normal behavior. Whenever Spark uses the term core, it actually means either a process or a thread, and neither is bound to a single core or processor.
In any multitasking environment, processes are not executed continuously. Instead, the operating system is constantly switching between different processes, with each one getting only a small share of the available processor time.

What is `MAIN`? (ghc profiling)

I built a big old project, Pugs, with GHC 7.10.1 using stack build (I wrote my own stack.yaml). Then I ran stack build --library-profiling --executable-profiling and .stack-work/install/x86_64-osx/nightly-2015-06-26/7.10.1/bin/pugs -e 'my $i=0; for (1..100_000) { $i++ }; say $i' +RTS -pa, which produced the following pugs.prof file.
Fri Jul 10 00:10 2015 Time and Allocation Profiling Report (Final)
pugs +RTS -P -RTS -e my $i=0; for (1..10_000) { $i++ }; say $i
total time = 0.60 secs (604 ticks @ 1000 us, 1 processor)
total alloc = 426,495,472 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc ticks bytes
MAIN MAIN 92.2 90.6 557 386532168
CAF Pugs.Run 2.8 5.2 17 22191000
individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc ticks bytes
MAIN MAIN 287 0 92.2 90.6 100.0 100.0 557 386532168
listAssocOp Pugs.Parser.Operator 841 24 0.0 0.0 0.0 0.0 0 768
nassocOp Pugs.Parser.Operator 840 24 0.0 0.0 0.0 0.0 0 768
lassocOp Pugs.Parser.Operator 839 24 0.0 0.0 0.0 0.0 0 768
rassocOp Pugs.Parser.Operator 838 24 0.0 0.0 0.0 0.0 0 768
postfixOp Pugs.Parser.Operator 837 24 0.0 0.0 0.0 0.0 0 768
termOp Pugs.Parser.Operator 824 24 0.0 0.5 0.7 1.2 0 2062768
insert Data.HashTable.ST.Basic 874 1 0.0 0.0 0.0 0.0 0 152
checkOverflow Data.HashTable.ST.Basic 890 1 0.0 0.0 0.0 0.0 0 80
readDelLoad Data.HashTable.ST.Basic 893 0 0.0 0.0 0.0 0.0 0 184
writeLoad Data.HashTable.ST.Basic 892 0 0.0 0.0 0.0 0.0 0 224
readLoad Data.HashTable.ST.Basic 891 0 0.0 0.0 0.0 0.0 0 184
_values Data.HashTable.ST.Basic 889 1 0.0 0.0 0.0 0.0 0 0
_keys Data.HashTable.ST.Basic 888 1 0.0 0.0 0.0 0.0 0 0
.. snip ..
MAIN costs 92.2% of the time; however, I don't know what MAIN means. What does the MAIN label mean?
I was in the same spot a few days ago. What I deduced is the same thing: MAIN is expressions without annotations. Its counts shrink significantly if you add -fprof-auto and -caf-all. Those options will also let you find a lot of interesting things happening in your code.
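For reference, a hedged example of how those options might be passed when compiling directly with GHC (with stack they would typically go into the package's ghc-options); -fprof-cafs is the newer spelling of -caf-all:
$ ghc --make -prof -fprof-auto -fprof-cafs Main.hs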

Haskell small CPU leak

I’m experiencing small CPU leaks using GHC 7.8.3 and Yesod 1.4.9.
When I run my site with time and stop it (Ctrl+C) after 1 minute without doing anything (just run, no requests at all), it has consumed 1 second of CPU time, which represents approximately 1.7% of a CPU.
$ time mysite
^C
real 1m0.226s
user 0m1.024s
sys 0m0.060s
If I disable the idle garbage collector, it drops to 0.35 second (0.6% of CPU). Though it’s better, it still consumes CPU without doing anything.
$ time mysite +RTS -I0 # Disable idle GC
^C
real 1m0.519s
user 0m0.352s
sys 0m0.064s
$ time mysite +RTS -I0
^C
real 4m0.676s
user 0m0.888s
sys 0m0.468s
$ time mysite +RTS -I0
^C
real 7m28.282s
user 0m1.452s
sys 0m0.976s
Compared to a cat command waiting indefinitely for something on the standard input:
$ time cat
^C
real 1m1.349s
user 0m0.000s
sys 0m0.000s
Is there anything else in Haskell that consumes CPU in the background?
Is it a leak from Yesod?
Or is it something that I have done in my program? (I have only added handler functions; I don’t do parallel computation.)
Edit 2015-05-31 19:25
Here’s the execution with the -s flag:
$ time mysite +RTS -I0 -s
^C 23,138,184 bytes allocated in the heap
4,422,096 bytes copied during GC
2,319,960 bytes maximum residency (4 sample(s))
210,584 bytes maximum slop
6 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 30 colls, 0 par 0.00s 0.00s 0.0001s 0.0003s
Gen 1 4 colls, 0 par 0.03s 0.04s 0.0103s 0.0211s
TASKS: 5 (1 bound, 4 peak workers (4 total), using -N1)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 0.86s (224.38s elapsed)
GC time 0.03s ( 0.05s elapsed)
RP time 0.00s ( 0.00s elapsed)
PROF time 0.00s ( 0.00s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 0.90s (224.43s elapsed)
Alloc rate 26,778,662 bytes per MUT second
Productivity 96.9% of total user, 0.4% of total elapsed
gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0
real 3m44.447s
user 0m0.896s
sys 0m0.320s
And with profiling:
$ time mysite +RTS -I0
^C 23,024,424 bytes allocated in the heap
19,367,640 bytes copied during GC
2,319,960 bytes maximum residency (94 sample(s))
211,312 bytes maximum slop
6 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 27 colls, 0 par 0.00s 0.00s 0.0002s 0.0005s
Gen 1 94 colls, 0 par 1.09s 1.04s 0.0111s 0.0218s
TASKS: 5 (1 bound, 4 peak workers (4 total), using -N1)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 1.00s (201.66s elapsed)
GC time 1.07s ( 1.03s elapsed)
RP time 0.00s ( 0.00s elapsed)
PROF time 0.02s ( 0.02s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 2.09s (202.68s elapsed)
Alloc rate 23,115,591 bytes per MUT second
Productivity 47.7% of total user, 0.5% of total elapsed
gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0
real 3m22.697s
user 0m2.088s
sys 0m0.060s
mysite.prof:
Sun May 31 19:16 2015 Time and Allocation Profiling Report (Final)
mysite +RTS -N -p -s -h -i0.1 -I0 -RTS
total time = 0.05 secs (49 ticks @ 1000 us, 1 processor)
total alloc = 17,590,528 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
MAIN MAIN 98.0 93.7
acquireSeedSystem.\.\ System.Random.MWC 2.0 0.0
toByteString Data.Serialize.Builder 0.0 3.9
individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 5684 0 98.0 93.7 100.0 100.0
createSystemRandom System.Random.MWC 11396 0 0.0 0.0 2.0 0.3
withSystemRandom System.Random.MWC 11397 0 0.0 0.1 2.0 0.3
acquireSeedSystem System.Random.MWC 11399 0 0.0 0.0 2.0 0.2
acquireSeedSystem.\ System.Random.MWC 11401 1 0.0 0.2 2.0 0.2
acquireSeedSystem.\.\ System.Random.MWC 11403 1 2.0 0.0 2.0 0.0
sndS Data.Serialize.Put 11386 21 0.0 0.0 0.0 0.0
put Data.Serialize 11384 21 0.0 0.0 0.0 0.0
unPut Data.Serialize.Put 11383 21 0.0 0.0 0.0 0.0
toByteString Data.Serialize.Builder 11378 21 0.0 3.9 0.0 4.0
flush.\ Data.Serialize.Builder 11393 21 0.0 0.0 0.0 0.0
withSize Data.Serialize.Builder 11388 0 0.0 0.0 0.0 0.0
withSize.\ Data.Serialize.Builder 11389 21 0.0 0.0 0.0 0.0
runBuilder Data.Serialize.Builder 11390 21 0.0 0.0 0.0 0.0
runBuilder Data.Serialize.Builder 11382 21 0.0 0.0 0.0 0.0
unstream/resize Data.Text.Internal.Fusion 11372 174 0.0 0.1 0.0 0.1
CAF GHC.IO.Encoding 11322 0 0.0 0.0 0.0 0.0
CAF GHC.IO.FD 11319 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Handle.FD 11318 0 0.0 0.2 0.0 0.2
CAF GHC.Event.Thread 11304 0 0.0 0.0 0.0 0.0
CAF GHC.Conc.Signal 11292 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding.Iconv 11288 0 0.0 0.0 0.0 0.0
CAF GHC.TopHandler 11284 0 0.0 0.0 0.0 0.0
CAF GHC.Event.Control 11271 0 0.0 0.0 0.0 0.0
CAF Main 11263 0 0.0 0.0 0.0 0.0
main Main 11368 1 0.0 0.0 0.0 0.0
CAF Application 11262 0 0.0 0.0 0.0 0.0
CAF Foundation 11261 0 0.0 0.0 0.0 0.0
CAF Model 11260 0 0.0 0.1 0.0 0.3
unstream/resize Data.Text.Internal.Fusion 11375 35 0.0 0.1 0.0 0.1
CAF Settings 11259 0 0.0 0.1 0.0 0.2
unstream/resize Data.Text.Internal.Fusion 11370 20 0.0 0.1 0.0 0.1
CAF Database.Persist.Postgresql 6229 0 0.0 0.3 0.0 0.9
unstream/resize Data.Text.Internal.Fusion 11373 93 0.0 0.6 0.0 0.6
CAF Database.PostgreSQL.Simple.Transaction 6224 0 0.0 0.0 0.0 0.0
CAF Database.PostgreSQL.Simple.TypeInfo.Static 6222 0 0.0 0.0 0.0 0.0
CAF Database.PostgreSQL.Simple.Internal 6219 0 0.0 0.0 0.0 0.0
CAF Yesod.Static 6210 0 0.0 0.0 0.0 0.0
CAF Crypto.Hash.Conduit 6193 0 0.0 0.0 0.0 0.0
CAF Yesod.Default.Config2 6192 0 0.0 0.0 0.0 0.0
unstream/resize Data.Text.Internal.Fusion 11371 1 0.0 0.0 0.0 0.0
CAF Yesod.Core.Internal.Util 6154 0 0.0 0.0 0.0 0.0
CAF Text.Libyaml 6121 0 0.0 0.0 0.0 0.0
CAF Data.Yaml 6120 0 0.0 0.0 0.0 0.0
CAF Data.Yaml.Internal 6119 0 0.0 0.0 0.0 0.0
unstream/resize Data.Text.Internal.Fusion 11369 1 0.0 0.0 0.0 0.0
CAF Database.Persist.Quasi 6055 0 0.0 0.0 0.0 0.0
unstream/resize Data.Text.Internal.Fusion 11376 1 0.0 0.0 0.0 0.0
CAF Database.Persist.Sql.Internal 6046 0 0.0 0.0 0.0 0.0
unstream/resize Data.Text.Internal.Fusion 11377 6 0.0 0.0 0.0 0.0
CAF Data.Pool 6036 0 0.0 0.0 0.0 0.0
CAF Network.HTTP.Client.TLS 6014 0 0.0 0.0 0.0 0.0
CAF System.X509.Unix 6010 0 0.0 0.0 0.0 0.0
CAF Crypto.Hash.MD5 5927 0 0.0 0.0 0.0 0.0
CAF Data.Serialize 5873 0 0.0 0.0 0.0 0.0
put Data.Serialize 11385 1 0.0 0.0 0.0 0.0
CAF Data.Serialize.Put 5872 0 0.0 0.0 0.0 0.0
withSize Data.Serialize.Builder 11387 1 0.0 0.0 0.0 0.0
CAF Data.Serialize.Builder 5870 0 0.0 0.0 0.0 0.0
flush Data.Serialize.Builder 11392 1 0.0 0.0 0.0 0.0
toByteString Data.Serialize.Builder 11391 0 0.0 0.0 0.0 0.0
defaultSize Data.Serialize.Builder 11379 1 0.0 0.0 0.0 0.0
defaultSize.overhead Data.Serialize.Builder 11381 1 0.0 0.0 0.0 0.0
defaultSize.k Data.Serialize.Builder 11380 1 0.0 0.0 0.0 0.0
CAF Crypto.Random.Entropy.Unix 5866 0 0.0 0.0 0.0 0.0
CAF Network.HTTP.Client.Manager 5861 0 0.0 0.0 0.0 0.0
unstream/resize Data.Text.Internal.Fusion 11374 3 0.0 0.0 0.0 0.0
CAF System.Random.MWC 5842 0 0.0 0.0 0.0 0.0
coff System.Random.MWC 11405 1 0.0 0.0 0.0 0.0
ioff System.Random.MWC 11404 1 0.0 0.0 0.0 0.0
acquireSeedSystem System.Random.MWC 11398 1 0.0 0.0 0.0 0.0
acquireSeedSystem.random System.Random.MWC 11402 1 0.0 0.0 0.0 0.0
acquireSeedSystem.nbytes System.Random.MWC 11400 1 0.0 0.0 0.0 0.0
createSystemRandom System.Random.MWC 11394 1 0.0 0.0 0.0 0.0
withSystemRandom System.Random.MWC 11395 1 0.0 0.0 0.0 0.0
CAF Data.Streaming.Network.Internal 5833 0 0.0 0.0 0.0 0.0
CAF Data.Scientific 5728 0 0.0 0.1 0.0 0.1
CAF Data.Text.Array 5722 0 0.0 0.0 0.0 0.0
CAF Data.Text.Internal 5718 0 0.0 0.0 0.0 0.0
Edit 2015-06-01 08:40
You can browse source code at the following repository → https://github.com/Zigazou/Ouep
Found a related bug in the Yesod bug tracker. Ran my program like this:
myserver +RTS -I0 -RTS Development
And now idle CPU usage is down to almost nothing, compared to 14% or so before (ARM computer). The -I0 option (that's I and zero) turns off periodic garbage collection, which defaults to 0.3 secs I think. Not sure about the implications for app responsiveness or memory usage, but for me at least this is definitely the culprit.
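If turning the idle GC off entirely feels too drastic, a hedged alternative (assuming the binary was linked with -rtsopts) is to lengthen the idle-GC interval instead of removing it, for example:
myserver +RTS -I10 -RTS Development
Here -I10 asks the RTS to run an idle collection only after 10 seconds of inactivity rather than the default 0.3 seconds.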

How to profile libraries?

Is there any hidden option that will put cost centres in libraries? Currently I have set up my profiling like this:
cabal:
ghc-prof-options: -O2
-threaded
-fexcess-precision
-fprof-auto
-rtsopts
"-with-rtsopts=-N -p -s -h -i0.1"
exec:
# cabal sandbox init
# cabal install --enable-library-profiling --enable-executable-profiling
# cabal configure --enable-library-profiling --enable-executable-profiling
# cabal run
This works and creates the expected .prof file, .hp file and the summary when the program finishes.
The problem is that the .prof file doesn't contain anything that doesn't belong to the current project. My guess is that there is probably an option that will put cost centres in external library code?
My guess is that there is probably an option that will put cost centres in external library code?
Well, not by default. You need to add the cost centres when you compile the dependency. However, you can add -fprof-auto to the GHC options during cabal install:
$ cabal sandbox init
$ cabal install --ghc-option=-fprof-auto -p --enable-executable-profiling
Example
An example using the code from this question, where that code is contained in SO.hs:
$ cabal sandbox init
$ cabal install vector -p --ghc-options=-fprof-auto
$ cabal exec -- ghc --make SO.hs -prof -fprof-auto -O2
$ ./SO /usr/share/dict/words +RTS -s -p
$ cat SO.prof
Tue Dec 2 15:01 2014 Time and Allocation Profiling Report (Final)
Test +RTS -s -p -RTS /usr/share/dict/words
total time = 0.70 secs (698 ticks @ 1000 us, 1 processor)
total alloc = 618,372,952 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
letterCount Main 40.3 24.3
letterCount.letters1 Main 13.2 18.2
basicUnsafeWrite Data.Vector.Primitive.Mutable 10.0 12.1
basicUnsafeWrite Data.Vector.Unboxed.Base 7.2 7.3
basicUnsafeRead Data.Vector.Primitive.Mutable 5.4 4.9
>>= Data.Vector.Fusion.Util 5.0 13.4
basicUnsafeIndexM Data.Vector.Unboxed.Base 4.9 0.0
basicUnsafeIndexM Data.Vector.Primitive 2.7 4.9
basicUnsafeIndexM Data.Vector.Unboxed.Base 2.3 0.0
letterCount.letters1.\ Main 2.0 2.4
>>= Data.Vector.Fusion.Util 1.9 6.1
basicUnsafeWrite Data.Vector.Unboxed.Base 1.7 0.0
letterCount.\ Main 1.3 2.4
readByteArray# Data.Primitive.Types 0.3 2.4
basicUnsafeNew Data.Vector.Primitive.Mutable 0.0 1.2
individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 72 0 0.0 0.0 100.0 100.0
main Main 145 0 0.1 0.2 99.9 100.0
main.counts Main 148 1 0.0 0.0 99.3 99.6
letterCount Main 149 1 40.3 24.3 99.3 99.6
basicUnsafeFreeze Data.Vector.Unboxed.Base 257 1 0.0 0.0 0.0 0.0
primitive Control.Monad.Primitive 259 1 0.0 0.0 0.0 0.0
basicUnsafeFreeze Data.Vector.Primitive 258 1 0.0 0.0 0.0 0.0
letterCount.\ Main 256 938848 1.3 2.4 1.3 2.4
basicUnsafeWrite Data.Vector.Unboxed.Base 252 938848 1.3 0.0 5.0 6.1
basicUnsafeWrite Data.Vector.Primitive.Mutable 253 938848 3.7 6.1 3.7 6.1
writeByteArray# Data.Primitive.Types 255 938848 0.0 0.0 0.0 0.0
primitive Control.Monad.Primitive 254 938848 0.0 0.0 0.0 0.0
basicUnsafeRead Data.Vector.Unboxed.Base 248 938848 0.7 0.0 6.6 7.3
basicUnsafeRead Data.Vector.Primitive.Mutable 249 938848 5.4 4.9 5.9 7.3
readByteArray# Data.Primitive.Types 251 938848 0.3 2.4 0.3 2.4
primitive Control.Monad.Primitive 250 938848 0.1 0.0 0.1 0.0
>>= Data.Vector.Fusion.Util 243 938848 0.0 0.0 0.0 0.0
basicUnsafeIndexM Data.Vector.Unboxed.Base 242 938848 0.0 0.0 0.0 0.0
basicUnsafeIndexM Data.Vector.Unboxed.Base 237 938848 4.9 0.0 11.7 10.9
>>= Data.Vector.Fusion.Util 247 938848 1.9 6.1 1.9 6.1
basicUnsafeIndexM Data.Vector.Unboxed.Base 238 938848 2.3 0.0 5.0 4.9
basicUnsafeIndexM Data.Vector.Primitive 239 938848 2.7 4.9 2.7 4.9
indexByteArray# Data.Primitive.Types 240 938848 0.0 0.0 0.0 0.0
>>= Data.Vector.Fusion.Util 236 938849 3.4 7.3 3.4 7.3
unId Data.Vector.Fusion.Util 235 938849 0.0 0.0 0.0 0.0
basicLength Data.Vector.Unboxed.Base 234 1 0.0 0.0 0.0 0.0
basicLength Data.Vector.Primitive.Mutable 233 1 0.0 0.0 0.0 0.0
basicUnsafeCopy Data.Vector.Unboxed.Base 222 1 0.0 0.0 0.0 0.0
basicUnsafeCopy Data.Vector.Primitive 223 1 0.0 0.0 0.0 0.0
unI# Data.Primitive.ByteArray 226 3 0.0 0.0 0.0 0.0
basicLength Data.Vector.Unboxed.Base 214 1 0.0 0.0 0.0 0.0
basicLength Data.Vector.Primitive 215 1 0.0 0.0 0.0 0.0
basicUnsafeNew Data.Vector.Unboxed.Base 212 1 0.0 0.0 0.0 0.0
primitive Control.Monad.Primitive 220 1 0.0 0.0 0.0 0.0
basicUnsafeNew Data.Vector.Primitive.Mutable 216 1 0.0 0.0 0.0 0.0
sizeOf Data.Primitive 217 1 0.0 0.0 0.0 0.0
sizeOf# Data.Primitive.Types 218 1 0.0 0.0 0.0 0.0
unI# Data.Primitive.Types 219 1 0.0 0.0 0.0 0.0
basicLength Data.Vector.Unboxed.Base 211 1 0.0 0.0 0.0 0.0
letterCount.len Main 178 1 0.0 0.0 0.0 0.0
letterCount.letters1 Main 177 1 13.2 18.2 30.9 41.3
basicUnsafeFreeze Data.Vector.Unboxed.Base 204 1 0.0 0.0 0.0 0.0
basicUnsafeFreeze Data.Vector.Unboxed.Base 210 1 0.0 0.0 0.0 0.0
primitive Control.Monad.Primitive 207 1 0.0 0.0 0.0 0.0
basicUnsafeFreeze Data.Vector.Primitive 206 1 0.0 0.0 0.0 0.0
basicUnsafeFreeze Data.Vector.Unboxed.Base 205 1 0.0 0.0 0.0 0.0
basicUnsafeFreeze Data.Vector.Primitive 208 0 0.0 0.0 0.0 0.0
basicUnsafeSlice Data.Vector.Unboxed.Base 200 1 0.0 0.0 0.0 0.0
basicUnsafeSlice Data.Vector.Unboxed.Base 203 1 0.0 0.0 0.0 0.0
basicUnsafeSlice Data.Vector.Unboxed.Base 201 1 0.0 0.0 0.0 0.0
basicUnsafeSlice Data.Vector.Primitive.Mutable 202 1 0.0 0.0 0.0 0.0
basicUnsafeWrite Data.Vector.Unboxed.Base 193 938848 7.2 7.3 14.2 13.4
basicUnsafeWrite Data.Vector.Unboxed.Base 198 938848 0.0 0.0 0.0 0.0
basicUnsafeWrite Data.Vector.Unboxed.Base 194 938848 0.4 0.0 7.0 6.1
basicUnsafeWrite Data.Vector.Primitive.Mutable 195 938848 6.3 6.1 6.6 6.1
writeByteArray# Data.Primitive.Types 197 938848 0.3 0.0 0.3 0.0
primitive Control.Monad.Primitive 196 938848 0.0 0.0 0.0 0.0
letterCount.letters1.\ Main 192 938848 2.0 2.4 2.0 2.4
>>= Data.Vector.Fusion.Util 191 938848 1.6 6.1 1.6 6.1
unId Data.Vector.Fusion.Util 190 938849 0.0 0.0 0.0 0.0
upperBound Data.Vector.Fusion.Stream.Size 180 1 0.0 0.0 0.0 0.0
basicUnsafeNew Data.Vector.Unboxed.Base 179 1 0.0 0.0 0.0 1.2
basicUnsafeNew Data.Vector.Unboxed.Base 189 1 0.0 0.0 0.0 0.0
primitive Control.Monad.Primitive 187 1 0.0 0.0 0.0 0.0
basicUnsafeNew Data.Vector.Primitive.Mutable 182 1 0.0 0.0 0.0 0.0
basicUnsafeNew Data.Vector.Unboxed.Base 181 1 0.0 0.0 0.0 1.2
basicUnsafeNew Data.Vector.Primitive.Mutable 183 0 0.0 1.2 0.0 1.2
sizeOf Data.Primitive 184 1 0.0 0.0 0.0 0.0
sizeOf# Data.Primitive.Types 185 1 0.0 0.0 0.0 0.0
unI# Data.Primitive.Types 186 1 0.0 0.0 0.0 0.0
printCounts Main 146 1 0.4 0.2 0.4 0.2
basicUnsafeIndexM Data.Vector.Unboxed.Base 266 256 0.0 0.0 0.0 0.0
basicUnsafeIndexM Data.Vector.Primitive 267 0 0.0 0.0 0.0 0.0
indexByteArray# Data.Primitive.Types 268 256 0.0 0.0 0.0 0.0
basicUnsafeIndexM Data.Vector.Primitive 265 256 0.0 0.0 0.0 0.0
>>= Data.Vector.Fusion.Util 264 256 0.0 0.0 0.0 0.0
unId Data.Vector.Fusion.Util 263 256 0.0 0.0 0.0 0.0
basicLength Data.Vector.Unboxed.Base 262 1 0.0 0.0 0.0 0.0
basicLength Data.Vector.Primitive 261 1 0.0 0.0 0.0 0.0
CAF Main 143 0 0.0 0.0 0.0 0.0
main Main 144 1 0.0 0.0 0.0 0.0
main.counts Main 150 0 0.0 0.0 0.0 0.0
letterCount Main 151 0 0.0 0.0 0.0 0.0
basicUnsafeIndexM Data.Vector.Unboxed.Base 244 0 0.0 0.0 0.0 0.0
>>= Data.Vector.Fusion.Util 245 0 0.0 0.0 0.0 0.0
basicUnsafeIndexM Data.Vector.Unboxed.Base 246 0 0.0 0.0 0.0 0.0
primitive Control.Monad.Primitive 224 1 0.0 0.0 0.0 0.0
basicUnsafeFreeze Data.Vector.Unboxed.Base 173 1 0.0 0.0 0.0 0.0
primitive Control.Monad.Primitive 175 1 0.0 0.0 0.0 0.0
basicUnsafeFreeze Data.Vector.Primitive 174 1 0.0 0.0 0.0 0.0
basicUnsafeSlice Data.Vector.Unboxed.Base 171 1 0.0 0.0 0.0 0.0
basicUnsafeSlice Data.Vector.Primitive.Mutable 172 1 0.0 0.0 0.0 0.0
basicUnsafeWrite Data.Vector.Unboxed.Base 167 256 0.0 0.0 0.0 0.0
basicUnsafeWrite Data.Vector.Primitive.Mutable 168 256 0.0 0.0 0.0 0.0
writeByteArray# Data.Primitive.Types 170 256 0.0 0.0 0.0 0.0
primitive Control.Monad.Primitive 169 256 0.0 0.0 0.0 0.0
>>= Data.Vector.Fusion.Util 165 256 0.0 0.0 0.0 0.0
unId Data.Vector.Fusion.Util 164 257 0.0 0.0 0.0 0.0
basicUnsafeNew Data.Vector.Unboxed.Base 156 1 0.0 0.0 0.0 0.0
primitive Control.Monad.Primitive 162 1 0.0 0.0 0.0 0.0
basicUnsafeNew Data.Vector.Primitive.Mutable 157 1 0.0 0.0 0.0 0.0
sizeOf Data.Primitive 158 1 0.0 0.0 0.0 0.0
sizeOf# Data.Primitive.Types 159 1 0.0 0.0 0.0 0.0
unI# Data.Primitive.Types 160 1 0.0 0.0 0.0 0.0
upperBound Data.Vector.Fusion.Stream.Size 153 1 0.0 0.0 0.0 0.0
elemseq Data.Vector.Unboxed.Base 152 1 0.0 0.0 0.0 0.0
printCounts Main 147 0 0.0 0.0 0.0 0.0
CAF Data.Vector.Internal.Check 142 0 0.0 0.0 0.0 0.0
doBoundsChecks Data.Vector.Internal.Check 213 1 0.0 0.0 0.0 0.0
doUnsafeChecks Data.Vector.Internal.Check 155 1 0.0 0.0 0.0 0.0
doInternalChecks Data.Vector.Internal.Check 154 1 0.0 0.0 0.0 0.0
CAF Data.Vector.Fusion.Util 141 0 0.0 0.0 0.0 0.0
return Data.Vector.Fusion.Util 241 1 0.0 0.0 0.0 0.0
return Data.Vector.Fusion.Util 166 1 0.0 0.0 0.0 0.0
CAF Data.Vector.Unboxed.Base 136 0 0.0 0.0 0.0 0.0
basicUnsafeCopy Data.Vector.Unboxed.Base 227 0 0.0 0.0 0.0 0.0
basicUnsafeCopy Data.Vector.Primitive 228 0 0.0 0.0 0.0 0.0
basicUnsafeCopy.sz Data.Vector.Primitive 229 1 0.0 0.0 0.0 0.0
sizeOf Data.Primitive 230 1 0.0 0.0 0.0 0.0
sizeOf# Data.Primitive.Types 231 1 0.0 0.0 0.0 0.0
unI# Data.Primitive.Types 232 1 0.0 0.0 0.0 0.0
CAF Data.Primitive.MachDeps 128 0 0.0 0.0 0.0 0.0
sIZEOF_INT Data.Primitive.MachDeps 161 1 0.0 0.0 0.0 0.0
CAF Text.Printf 118 0 0.0 0.0 0.0 0.0
CAF GHC.Conc.Signal 112 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Handle.FD 109 0 0.1 0.0 0.1 0.0
CAF GHC.IO.Encoding 99 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding.Iconv 98 0 0.0 0.0 0.0 0.0
CAF GHC.IO.FD 95 0 0.0 0.0 0.0 0.0
Unfortunately, you cannot state --ghc-option=… as a flag at the dependencies.
You also need -prof.
The GHC User's Guide says: "There are a few other profiling-related compilation options. Use them in addition to -prof. These do not have to be used consistently for all modules in a program."
