Improving memory usage during serialization (Data.Binary) - haskell

I'm still kind of new to Haskell and learning new things every day. My problem is excessive memory usage during serialization with the Data.Binary library. Maybe I'm just using the library the wrong way, but I can't figure it out.
The idea is that I read binary data from disk, add new data, and write everything back to disk. Here's the code:
module Main where

import Data.Binary
import System.Environment
import Data.List (foldl')

data DualNo = DualNo Int Int deriving (Show)

instance Data.Binary.Binary DualNo where
  put (DualNo a b) = do
    put a
    put b
  get = do
    a <- get
    b <- get
    return (DualNo a b)
-- read DualNo from HDD
readData :: FilePath -> IO [DualNo]
readData filename = do
  no <- decodeFile filename :: IO [DualNo]
  return no

-- write DualNo to HDD
writeData :: [DualNo] -> String -> IO ()
writeData no filename = encodeFile filename (no :: [DualNo])

writeEmptyDataToDisk :: String -> IO ()
writeEmptyDataToDisk filename = writeData [] filename

-- feed the list with a new dataset
feedWithInputData :: [DualNo] -> [(Int, Int)] -> [DualNo]
feedWithInputData existData newData = foldl' func existData newData
  where
    func dataset (a, b) = DualNo a b : dataset
main :: IO ()
main = do
  [newInputData, toPutIntoExistingData] <- System.Environment.getArgs
  if toPutIntoExistingData == "empty"
    then writeEmptyDataToDisk "myData.dat"
    else return ()
  loadedData <- readData "myData.dat"
  let newData = case newInputData of
        "dataset1" -> feedWithInputData loadedData dataset1
        "dataset2" -> feedWithInputData loadedData dataset2
        _          -> feedWithInputData loadedData dataset3
  writeData newData "myData.dat"

dataset1 = zip [1..100000] [2,4..200000]
dataset2 = zip [5,10..500000] [3,6..300000]
dataset3 = zip [4,8..400000] [6,12..600000]
I'm pretty sure there's a lot to improve in this code, but my biggest problem is the memory usage with big datasets.
I profiled my program with GHC:
$ ghc -O2 --make -prof -fprof-auto -auto-all -caf-all -rtsopts -fforce-recomp Main.hs
$ ./Main dataset1 empty +RTS -p -sstderr
165,085,864 bytes allocated in the heap
70,643,992 bytes copied during GC
12,298,128 bytes maximum residency (7 sample(s))
424,696 bytes maximum slop
35 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 306 colls, 0 par 0.035s 0.035s 0.0001s 0.0015s
Gen 1 7 colls, 0 par 0.053s 0.053s 0.0076s 0.0180s
INIT time 0.001s ( 0.001s elapsed)
MUT time 0.059s ( 0.062s elapsed)
GC time 0.088s ( 0.088s elapsed)
RP time 0.000s ( 0.000s elapsed)
PROF time 0.000s ( 0.000s elapsed)
EXIT time 0.003s ( 0.003s elapsed)
Total time 0.154s ( 0.154s elapsed)
%GC time 57.0% (57.3% elapsed)
Alloc rate 2,781,155,968 bytes per MUT second
Productivity 42.3% of total user, 42.5% of total elapsed
Looking at the prof-file:
Tue Apr 12 18:11 2016 Time and Allocation Profiling Report (Final)
Main +RTS -p -sstderr -RTS dataset1 empty
total time = 0.06 secs (60 ticks @ 1000 us, 1 processor)
total alloc = 102,613,008 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
put Main 48.3 53.0
writeData Main 30.0 18.8
dataset1 Main 13.3 23.4
feedWithInputData Main 6.7 0.0
feedWithInputData.func Main 1.7 4.7
individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 68 0 0.0 0.0 100.0 100.0
main Main 137 0 0.0 0.0 86.7 76.6
feedWithInputData Main 150 1 6.7 0.0 8.3 4.7
feedWithInputData.func Main 154 100000 1.7 4.7 1.7 4.7
writeData Main 148 1 30.0 18.8 78.3 71.8
put Main 155 100000 48.3 53.0 48.3 53.0
readData Main 147 0 0.0 0.1 0.0 0.1
writeEmptyDataToDisk Main 142 0 0.0 0.0 0.0 0.1
writeData Main 143 0 0.0 0.1 0.0 0.1
CAF:main1 Main 133 0 0.0 0.0 0.0 0.0
main Main 136 1 0.0 0.0 0.0 0.0
CAF:main2 Main 132 0 0.0 0.0 0.0 0.0
main Main 139 0 0.0 0.0 0.0 0.0
writeEmptyDataToDisk Main 140 1 0.0 0.0 0.0 0.0
writeData Main 141 1 0.0 0.0 0.0 0.0
CAF:main7 Main 131 0 0.0 0.0 0.0 0.0
main Main 145 0 0.0 0.0 0.0 0.0
readData Main 146 1 0.0 0.0 0.0 0.0
CAF:dataset1 Main 123 0 0.0 0.0 5.0 7.8
dataset1 Main 151 1 5.0 7.8 5.0 7.8
CAF:dataset4 Main 122 0 0.0 0.0 5.0 7.8
dataset1 Main 153 0 5.0 7.8 5.0 7.8
CAF:dataset5 Main 121 0 0.0 0.0 3.3 7.8
dataset1 Main 152 0 3.3 7.8 3.3 7.8
CAF:main4 Main 116 0 0.0 0.0 0.0 0.0
main Main 138 0 0.0 0.0 0.0 0.0
CAF:main6 Main 115 0 0.0 0.0 0.0 0.0
main Main 149 0 0.0 0.0 0.0 0.0
CAF:main3 Main 113 0 0.0 0.0 0.0 0.0
main Main 144 0 0.0 0.0 0.0 0.0
CAF GHC.Conc.Signal 107 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding 103 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding.Iconv 101 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Handle.FD 94 0 0.0 0.0 0.0 0.0
CAF GHC.IO.FD 86 0 0.0 0.0 0.0 0.0
Now I add further data:
$ ./Main dataset2 myData.dat +RTS -p -sstderr
343,601,008 bytes allocated in the heap
175,650,728 bytes copied during GC
34,113,936 bytes maximum residency (8 sample(s))
971,896 bytes maximum slop
78 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 640 colls, 0 par 0.082s 0.083s 0.0001s 0.0017s
Gen 1 8 colls, 0 par 0.140s 0.141s 0.0176s 0.0484s
INIT time 0.001s ( 0.001s elapsed)
MUT time 0.138s ( 0.139s elapsed)
GC time 0.221s ( 0.224s elapsed)
RP time 0.000s ( 0.000s elapsed)
PROF time 0.000s ( 0.000s elapsed)
EXIT time 0.006s ( 0.006s elapsed)
Total time 0.370s ( 0.370s elapsed)
%GC time 59.8% (60.5% elapsed)
Alloc rate 2,485,518,518 bytes per MUT second
Productivity 39.9% of total user, 39.8% of total elapsed
Looking at the new prof-file:
Tue Apr 12 18:15 2016 Time and Allocation Profiling Report (Final)
Main +RTS -p -sstderr -RTS dataset2 myData.dat
total time = 0.14 secs (139 ticks @ 1000 us, 1 processor)
total alloc = 213,866,232 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
put Main 41.0 50.9
writeData Main 25.9 18.0
get Main 25.2 16.8
dataset2 Main 4.3 11.2
readData Main 1.4 0.8
feedWithInputData.func Main 1.4 2.2
individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 68 0 0.0 0.0 100.0 100.0
main Main 137 0 0.0 0.0 95.7 88.8
feedWithInputData Main 148 1 0.7 0.0 2.2 2.2
feedWithInputData.func Main 152 100000 1.4 2.2 1.4 2.2
writeData Main 145 1 25.9 18.0 66.9 68.9
put Main 153 200000 41.0 50.9 41.0 50.9
readData Main 141 0 1.4 0.8 26.6 17.6
get Main 144 0 25.2 16.8 25.2 16.8
CAF:main1 Main 133 0 0.0 0.0 0.0 0.0
main Main 136 1 0.0 0.0 0.0 0.0
CAF:main7 Main 131 0 0.0 0.0 0.0 0.0
main Main 139 0 0.0 0.0 0.0 0.0
readData Main 140 1 0.0 0.0 0.0 0.0
CAF:dataset2 Main 126 0 0.0 0.0 0.7 3.7
dataset2 Main 149 1 0.7 3.7 0.7 3.7
CAF:dataset6 Main 125 0 0.0 0.0 2.2 3.7
dataset2 Main 151 0 2.2 3.7 2.2 3.7
CAF:dataset7 Main 124 0 0.0 0.0 1.4 3.7
dataset2 Main 150 0 1.4 3.7 1.4 3.7
CAF:$fBinaryDualNo1 Main 120 0 0.0 0.0 0.0 0.0
get Main 143 1 0.0 0.0 0.0 0.0
CAF:main4 Main 116 0 0.0 0.0 0.0 0.0
main Main 138 0 0.0 0.0 0.0 0.0
CAF:main6 Main 115 0 0.0 0.0 0.0 0.0
main Main 146 0 0.0 0.0 0.0 0.0
CAF:main5 Main 114 0 0.0 0.0 0.0 0.0
main Main 147 0 0.0 0.0 0.0 0.0
CAF:main3 Main 113 0 0.0 0.0 0.0 0.0
main Main 142 0 0.0 0.0 0.0 0.0
CAF GHC.Conc.Signal 107 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding 103 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding.Iconv 101 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Handle.FD 94 0 0.0 0.0 0.0 0.0
CAF GHC.IO.FD 86 0 0.0 0.0 0.0 0.0
The more new data I add, the higher the memory usage becomes. It's clear that a bigger dataset needs more memory, but isn't there a better solution for this problem (like gradually writing the data back to disk)?
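One way to avoid re-encoding the whole list on every run is to store records back-to-back without the list's length prefix, so new data can simply be appended to the file. A minimal sketch of that idea (my own helper names, not from the code above; you then can no longer use decodeFile/encodeFile directly on the file):

```haskell
import Data.Binary
import Data.Binary.Get (runGetOrFail)
import qualified Data.ByteString.Lazy as BL

data DualNo = DualNo Int Int deriving (Show)

instance Binary DualNo where
  put (DualNo a b) = put a >> put b
  get = DualNo <$> get <*> get

-- Append only the new records: O(new data) instead of O(whole file).
appendRecords :: FilePath -> [DualNo] -> IO ()
appendRecords fp = BL.appendFile fp . BL.concat . map encode

-- Decode records one by one until the file is exhausted.
readRecords :: FilePath -> IO [DualNo]
readRecords fp = go <$> BL.readFile fp
  where
    go bs
      | BL.null bs = []
      | otherwise  = case runGetOrFail get bs of
          Left (_, _, err)   -> error err
          Right (rest, _, x) -> x : go rest
```

Because nothing is length-prefixed, appending never touches the existing data, and the lazy read lets consumers process records incrementally instead of materializing the whole list at once.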
Edit:
Actually, the thing that bothers me most is the following observation:
I run the program for the first time and add new data to an existing (empty) file on my disk.
The size of the saved file on disk is 1.53 MB.
But (looking at the first prof-file) the program allocated more than 102 MB. More than 50% of that was allocated by the put function from the Data.Binary package.
I run the program a second time and add new data to the existing (non-empty) file.
The size of the saved file on disk is 3.05 MB.
But (looking at the second prof-file) the program allocated more than 213 MB. More than 66% was allocated by the put and get functions together.
=> Conclusion: in the first example the program allocated 102/1.53 = 66 times more memory than the binary file occupies on disk; in the second example, 213/3.05 = 69 times more.
Question:
Is the Data.Binary serialization format so efficient (and awesome) that it can shrink the data to such an extent?
Analogous question:
Do I really need so much more memory for the data loaded in my program than the same data occupies in a binary file on disk?
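A rough back-of-envelope sketch of why the two numbers differ so much (my own estimate, assuming a 64-bit GHC and fully evaluated data): on disk, Data.Binary writes each Int as 8 big-endian bytes, so a DualNo is 16 bytes (100,000 of them plus a length prefix is indeed about 1.53 MB). On the heap, each list element needs a cons cell, a DualNo constructor, and two boxed Ints. Note also that the profiler's "total alloc" is cumulative allocation over the whole run, not memory held at one time; the "maximum residency" figures above (12 MB and 34 MB) are the better measure of footprint.

```haskell
-- Rough per-element sizes, assuming a 64-bit GHC (8-byte words).
-- Heap: (:) cons cell (3 words) + DualNo box (3 words)
--       + two boxed Ints (2 words each) = 10 words.
heapBytesPerElem :: Int
heapBytesPerElem = (3 + 3 + 2 * 2) * 8   -- 80 bytes

-- Disk: Data.Binary's Int instance writes 8 bytes per Int.
diskBytesPerElem :: Int
diskBytesPerElem = 2 * 8                 -- 16 bytes

overhead :: Double
overhead = fromIntegral heapBytesPerElem / fromIntegral diskBytesPerElem
</imports>
```

So a factor of roughly 5x is inherent in the boxed in-memory representation alone, before counting the transient ByteString chunks that encoding builds.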

Related

How can I determine if a function is being memoized in Haskell?

I've got a Haskell program whose performance is non-linear (worse than O(n)).
I'm trying to investigate whether memoization is taking place on a function; can I verify this? I'm familiar with GHC profiling, but I'm not too sure which values I should be looking at.
A workaround is to just plug in some values and observe the execution time, but that's not ideal.
As far as I know there is no automatic memoization in Haskell.
That said, there seems to be an optimization in GHC that caches the values of parameterless definitions (CAFs) like the following:
rightTriangles = [ (a, b, c)
                 | c <- [1..]
                 , b <- [1..c]
                 , a <- [1..b]
                 , a^2 + b^2 == c^2 ]
If you try out the following in GHCi twice, you'll see that the second call is much faster:
ghci > take 500 rightTriangles
Not really an answer, but it should still be helpful: memoization does not seem to make a difference in the profiling output in terms of function "entries". This is demonstrated with the following basic example:
module Main where

fib :: Int -> Int
fib 0 = 0
fib 1 = 1
fib n = fib (n-1) + fib (n-2)

fibmemo :: Int -> Int
fibmemo = (map fib [0 ..] !!)

main :: IO ()
main = do
  putStrLn "Begin.."
  print $ fib 10
  -- print $ fibmemo 10
With the above code the profiling output is:
individual inherited
COST CENTRE MODULE SRC no. entries %time %alloc %time %alloc
MAIN MAIN <built-in> 119 0 0.0 1.3 0.0 100.0
CAF Main <entire-module> 237 0 0.0 1.0 0.0 1.2
main Main Main.hs:(12,1)-(14,16) 238 1 0.0 0.2 0.0 0.2
fib Main Main.hs:(5,1)-(7,29) 240 177 0.0 0.0 0.0 0.0
CAF GHC.Conc.Signal <entire-module> 230 0 0.0 1.2 0.0 1.2
CAF GHC.IO.Encoding <entire-module> 220 0 0.0 5.4 0.0 5.4
CAF GHC.IO.Encoding.Iconv <entire-module> 218 0 0.0 0.4 0.0 0.4
CAF GHC.IO.Handle.FD <entire-module> 210 0 0.0 67.7 0.0 67.7
CAF GHC.IO.Handle.Text <entire-module> 208 0 0.0 0.2 0.0 0.2
main Main Main.hs:(12,1)-(14,16) 239 0 0.0 22.6 0.0 22.6
If we instead comment out fib 10 and uncomment fibmemo 10, we get:
individual inherited
COST CENTRE MODULE SRC no. entries %time %alloc %time %alloc
MAIN MAIN <built-in> 119 0 0.0 1.2 0.0 100.0
CAF Main <entire-module> 237 0 0.0 1.0 0.0 2.9
fibmemo Main Main.hs:9:1-29 240 1 0.0 1.6 0.0 1.6
fib Main Main.hs:(5,1)-(7,29) 242 177 0.0 0.0 0.0 0.0
main Main Main.hs:(12,1)-(15,20) 238 1 0.0 0.2 0.0 0.2
fibmemo Main Main.hs:9:1-29 241 0 0.0 0.0 0.0 0.0
CAF GHC.Conc.Signal <entire-module> 230 0 0.0 1.2 0.0 1.2
CAF GHC.IO.Encoding <entire-module> 220 0 0.0 5.3 0.0 5.3
CAF GHC.IO.Encoding.Iconv <entire-module> 218 0 0.0 0.4 0.0 0.4
CAF GHC.IO.Handle.FD <entire-module> 210 0 0.0 66.6 0.0 66.6
CAF GHC.IO.Handle.Text <entire-module> 208 0 0.0 0.2 0.0 0.2
main Main Main.hs:(12,1)-(15,20) 239 0 0.0 22.2 0.0 22.2
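Worth noting about the example above: as written, fibmemo does not actually memoize the recursion, because fib calls itself directly rather than going through the list. That is why both profiles show 177 entries for fib, which is exactly the call count of the naive fib 10 (T(n) = T(n-1) + T(n-2) + 1 gives T(10) = 177). A sketch of a variant (my naming) whose recursive calls do go through the shared list, so the entries count stays linear:

```haskell
-- The recursion is routed through the shared list `fibs'`, so each
-- element is computed at most once and then reused.
fibs' :: [Integer]
fibs' = map f [0 ..]
  where
    f 0 = 0
    f 1 = 1
    f n = fibs' !! (n - 1) + fibs' !! (n - 2)

fibMemo :: Int -> Integer
fibMemo = (fibs' !!)
```

Profiling this version versus the naive one makes the difference in "entries" visible, which is the signal the question is after.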

What is `MAIN` ? (ghc profiling)

I built an old, big project, Pugs, with GHC 7.10.1 using stack build (I wrote my own stack.yaml). Then I ran stack build --library-profiling --executable-profiling and .stack-work/install/x86_64-osx/nightly-2015-06-26/7.10.1/bin/pugs -e 'my $i=0; for (1..100_000) { $i++ }; say $i' +RTS -pa, which produced the following pugs.prof file.
Fri Jul 10 00:10 2015 Time and Allocation Profiling Report (Final)
pugs +RTS -P -RTS -e my $i=0; for (1..10_000) { $i++ }; say $i
total time = 0.60 secs (604 ticks @ 1000 us, 1 processor)
total alloc = 426,495,472 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc ticks bytes
MAIN MAIN 92.2 90.6 557 386532168
CAF Pugs.Run 2.8 5.2 17 22191000
individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc ticks bytes
MAIN MAIN 287 0 92.2 90.6 100.0 100.0 557 386532168
listAssocOp Pugs.Parser.Operator 841 24 0.0 0.0 0.0 0.0 0 768
nassocOp Pugs.Parser.Operator 840 24 0.0 0.0 0.0 0.0 0 768
lassocOp Pugs.Parser.Operator 839 24 0.0 0.0 0.0 0.0 0 768
rassocOp Pugs.Parser.Operator 838 24 0.0 0.0 0.0 0.0 0 768
postfixOp Pugs.Parser.Operator 837 24 0.0 0.0 0.0 0.0 0 768
termOp Pugs.Parser.Operator 824 24 0.0 0.5 0.7 1.2 0 2062768
insert Data.HashTable.ST.Basic 874 1 0.0 0.0 0.0 0.0 0 152
checkOverflow Data.HashTable.ST.Basic 890 1 0.0 0.0 0.0 0.0 0 80
readDelLoad Data.HashTable.ST.Basic 893 0 0.0 0.0 0.0 0.0 0 184
writeLoad Data.HashTable.ST.Basic 892 0 0.0 0.0 0.0 0.0 0 224
readLoad Data.HashTable.ST.Basic 891 0 0.0 0.0 0.0 0.0 0 184
_values Data.HashTable.ST.Basic 889 1 0.0 0.0 0.0 0.0 0 0
_keys Data.HashTable.ST.Basic 888 1 0.0 0.0 0.0 0.0 0 0
.. snip ..
MAIN costs 92.2% of the time; however, I don't know what MAIN means. What does the MAIN label mean?
I was in the same spot a few days ago. What I deduced is the same thing: MAIN is expressions without annotations. Its counts shrink significantly if you add "-fprof-auto" and "-caf-all". Those options will also let you find a lot of interesting things happening in your code.

Haskell small CPU leak

I’m experiencing small CPU leaks using GHC 7.8.3 and Yesod 1.4.9.
When I run my site with time and stop it (Ctrl+C) after 1 minute without doing anything (just running, no requests at all), it has consumed 1 second of CPU time. That represents approximately 1.7% of a CPU.
$ time mysite
^C
real 1m0.226s
user 0m1.024s
sys 0m0.060s
If I disable the idle garbage collector, it drops to 0.35 seconds (0.6% of CPU). Though that's better, it still consumes CPU while doing nothing.
$ time mysite +RTS -I0 # Disable idle GC
^C
real 1m0.519s
user 0m0.352s
sys 0m0.064s
$ time mysite +RTS -I0
^C
real 4m0.676s
user 0m0.888s
sys 0m0.468s
$ time mysite +RTS -I0
^C
real 7m28.282s
user 0m1.452s
sys 0m0.976s
Compared to a cat command waiting indefinitely for something on the standard input:
$ time cat
^C
real 1m1.349s
user 0m0.000s
sys 0m0.000s
Is there anything else in Haskell that consumes CPU in the background?
Is it a leak from Yesod?
Or is it something that I have done in my program? (I have only added handler functions; I don’t do any parallel computation.)
Edit 2015-05-31 19:25
Here’s the execution with the -s flag:
$ time mysite +RTS -I0 -s
^C 23,138,184 bytes allocated in the heap
4,422,096 bytes copied during GC
2,319,960 bytes maximum residency (4 sample(s))
210,584 bytes maximum slop
6 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 30 colls, 0 par 0.00s 0.00s 0.0001s 0.0003s
Gen 1 4 colls, 0 par 0.03s 0.04s 0.0103s 0.0211s
TASKS: 5 (1 bound, 4 peak workers (4 total), using -N1)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 0.86s (224.38s elapsed)
GC time 0.03s ( 0.05s elapsed)
RP time 0.00s ( 0.00s elapsed)
PROF time 0.00s ( 0.00s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 0.90s (224.43s elapsed)
Alloc rate 26,778,662 bytes per MUT second
Productivity 96.9% of total user, 0.4% of total elapsed
gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0
real 3m44.447s
user 0m0.896s
sys 0m0.320s
And with profiling:
$ time mysite +RTS -I0
^C 23,024,424 bytes allocated in the heap
19,367,640 bytes copied during GC
2,319,960 bytes maximum residency (94 sample(s))
211,312 bytes maximum slop
6 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 27 colls, 0 par 0.00s 0.00s 0.0002s 0.0005s
Gen 1 94 colls, 0 par 1.09s 1.04s 0.0111s 0.0218s
TASKS: 5 (1 bound, 4 peak workers (4 total), using -N1)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 1.00s (201.66s elapsed)
GC time 1.07s ( 1.03s elapsed)
RP time 0.00s ( 0.00s elapsed)
PROF time 0.02s ( 0.02s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 2.09s (202.68s elapsed)
Alloc rate 23,115,591 bytes per MUT second
Productivity 47.7% of total user, 0.5% of total elapsed
gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0
real 3m22.697s
user 0m2.088s
sys 0m0.060s
mysite.prof:
Sun May 31 19:16 2015 Time and Allocation Profiling Report (Final)
mysite +RTS -N -p -s -h -i0.1 -I0 -RTS
total time = 0.05 secs (49 ticks @ 1000 us, 1 processor)
total alloc = 17,590,528 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
MAIN MAIN 98.0 93.7
acquireSeedSystem.\.\ System.Random.MWC 2.0 0.0
toByteString Data.Serialize.Builder 0.0 3.9
individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 5684 0 98.0 93.7 100.0 100.0
createSystemRandom System.Random.MWC 11396 0 0.0 0.0 2.0 0.3
withSystemRandom System.Random.MWC 11397 0 0.0 0.1 2.0 0.3
acquireSeedSystem System.Random.MWC 11399 0 0.0 0.0 2.0 0.2
acquireSeedSystem.\ System.Random.MWC 11401 1 0.0 0.2 2.0 0.2
acquireSeedSystem.\.\ System.Random.MWC 11403 1 2.0 0.0 2.0 0.0
sndS Data.Serialize.Put 11386 21 0.0 0.0 0.0 0.0
put Data.Serialize 11384 21 0.0 0.0 0.0 0.0
unPut Data.Serialize.Put 11383 21 0.0 0.0 0.0 0.0
toByteString Data.Serialize.Builder 11378 21 0.0 3.9 0.0 4.0
flush.\ Data.Serialize.Builder 11393 21 0.0 0.0 0.0 0.0
withSize Data.Serialize.Builder 11388 0 0.0 0.0 0.0 0.0
withSize.\ Data.Serialize.Builder 11389 21 0.0 0.0 0.0 0.0
runBuilder Data.Serialize.Builder 11390 21 0.0 0.0 0.0 0.0
runBuilder Data.Serialize.Builder 11382 21 0.0 0.0 0.0 0.0
unstream/resize Data.Text.Internal.Fusion 11372 174 0.0 0.1 0.0 0.1
CAF GHC.IO.Encoding 11322 0 0.0 0.0 0.0 0.0
CAF GHC.IO.FD 11319 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Handle.FD 11318 0 0.0 0.2 0.0 0.2
CAF GHC.Event.Thread 11304 0 0.0 0.0 0.0 0.0
CAF GHC.Conc.Signal 11292 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding.Iconv 11288 0 0.0 0.0 0.0 0.0
CAF GHC.TopHandler 11284 0 0.0 0.0 0.0 0.0
CAF GHC.Event.Control 11271 0 0.0 0.0 0.0 0.0
CAF Main 11263 0 0.0 0.0 0.0 0.0
main Main 11368 1 0.0 0.0 0.0 0.0
CAF Application 11262 0 0.0 0.0 0.0 0.0
CAF Foundation 11261 0 0.0 0.0 0.0 0.0
CAF Model 11260 0 0.0 0.1 0.0 0.3
unstream/resize Data.Text.Internal.Fusion 11375 35 0.0 0.1 0.0 0.1
CAF Settings 11259 0 0.0 0.1 0.0 0.2
unstream/resize Data.Text.Internal.Fusion 11370 20 0.0 0.1 0.0 0.1
CAF Database.Persist.Postgresql 6229 0 0.0 0.3 0.0 0.9
unstream/resize Data.Text.Internal.Fusion 11373 93 0.0 0.6 0.0 0.6
CAF Database.PostgreSQL.Simple.Transaction 6224 0 0.0 0.0 0.0 0.0
CAF Database.PostgreSQL.Simple.TypeInfo.Static 6222 0 0.0 0.0 0.0 0.0
CAF Database.PostgreSQL.Simple.Internal 6219 0 0.0 0.0 0.0 0.0
CAF Yesod.Static 6210 0 0.0 0.0 0.0 0.0
CAF Crypto.Hash.Conduit 6193 0 0.0 0.0 0.0 0.0
CAF Yesod.Default.Config2 6192 0 0.0 0.0 0.0 0.0
unstream/resize Data.Text.Internal.Fusion 11371 1 0.0 0.0 0.0 0.0
CAF Yesod.Core.Internal.Util 6154 0 0.0 0.0 0.0 0.0
CAF Text.Libyaml 6121 0 0.0 0.0 0.0 0.0
CAF Data.Yaml 6120 0 0.0 0.0 0.0 0.0
CAF Data.Yaml.Internal 6119 0 0.0 0.0 0.0 0.0
unstream/resize Data.Text.Internal.Fusion 11369 1 0.0 0.0 0.0 0.0
CAF Database.Persist.Quasi 6055 0 0.0 0.0 0.0 0.0
unstream/resize Data.Text.Internal.Fusion 11376 1 0.0 0.0 0.0 0.0
CAF Database.Persist.Sql.Internal 6046 0 0.0 0.0 0.0 0.0
unstream/resize Data.Text.Internal.Fusion 11377 6 0.0 0.0 0.0 0.0
CAF Data.Pool 6036 0 0.0 0.0 0.0 0.0
CAF Network.HTTP.Client.TLS 6014 0 0.0 0.0 0.0 0.0
CAF System.X509.Unix 6010 0 0.0 0.0 0.0 0.0
CAF Crypto.Hash.MD5 5927 0 0.0 0.0 0.0 0.0
CAF Data.Serialize 5873 0 0.0 0.0 0.0 0.0
put Data.Serialize 11385 1 0.0 0.0 0.0 0.0
CAF Data.Serialize.Put 5872 0 0.0 0.0 0.0 0.0
withSize Data.Serialize.Builder 11387 1 0.0 0.0 0.0 0.0
CAF Data.Serialize.Builder 5870 0 0.0 0.0 0.0 0.0
flush Data.Serialize.Builder 11392 1 0.0 0.0 0.0 0.0
toByteString Data.Serialize.Builder 11391 0 0.0 0.0 0.0 0.0
defaultSize Data.Serialize.Builder 11379 1 0.0 0.0 0.0 0.0
defaultSize.overhead Data.Serialize.Builder 11381 1 0.0 0.0 0.0 0.0
defaultSize.k Data.Serialize.Builder 11380 1 0.0 0.0 0.0 0.0
CAF Crypto.Random.Entropy.Unix 5866 0 0.0 0.0 0.0 0.0
CAF Network.HTTP.Client.Manager 5861 0 0.0 0.0 0.0 0.0
unstream/resize Data.Text.Internal.Fusion 11374 3 0.0 0.0 0.0 0.0
CAF System.Random.MWC 5842 0 0.0 0.0 0.0 0.0
coff System.Random.MWC 11405 1 0.0 0.0 0.0 0.0
ioff System.Random.MWC 11404 1 0.0 0.0 0.0 0.0
acquireSeedSystem System.Random.MWC 11398 1 0.0 0.0 0.0 0.0
acquireSeedSystem.random System.Random.MWC 11402 1 0.0 0.0 0.0 0.0
acquireSeedSystem.nbytes System.Random.MWC 11400 1 0.0 0.0 0.0 0.0
createSystemRandom System.Random.MWC 11394 1 0.0 0.0 0.0 0.0
withSystemRandom System.Random.MWC 11395 1 0.0 0.0 0.0 0.0
CAF Data.Streaming.Network.Internal 5833 0 0.0 0.0 0.0 0.0
CAF Data.Scientific 5728 0 0.0 0.1 0.0 0.1
CAF Data.Text.Array 5722 0 0.0 0.0 0.0 0.0
CAF Data.Text.Internal 5718 0 0.0 0.0 0.0 0.0
Edit 2015-06-01 08:40
You can browse source code at the following repository → https://github.com/Zigazou/Ouep
Found a related bug in the Yesod bug tracker. Ran my program like this:
myserver +RTS -I0 -RTS Development
And now idle CPU usage is down to almost nothing, compared to 14% or so before (on an ARM computer). The -I0 option (that's I and zero) turns off periodic garbage collection, which defaults to 0.3 seconds I think. I'm not sure about the implications for app responsiveness or memory usage, but for me at least this was definitely the culprit.

how to optimize this Haskell program?

I use the following code to memoize the total stopping time of the Collatz function, using a State monad to cache input-result pairs.
Additionally, the snd part of the state keeps track of the input value that maximizes the output; the goal is to find the input value under one million that maximizes the total stopping time. (The problem can be found on Project Euler.)
import Control.Applicative
import Control.Arrow
import Control.Monad.State
import qualified Data.Map.Strict as M

collatz :: Integer -> Integer
collatz n = if odd n
  then 3 * n + 1
  else n `div` 2

memoCollatz :: Integer
            -> State (M.Map Integer Int, (Integer, Int)) Int
memoCollatz 1 = return 1
memoCollatz n = do
  result <- gets (M.lookup n . fst)
  case result of
    Nothing -> do
      l <- succ <$> memoCollatz (collatz n)
      let update p@(_, curMaxV) =
            if l > curMaxV
              then (n, l)
              else p
      modify (M.insert n l *** update)
      return l
    Just v -> return v

main :: IO ()
main = print $ snd (execState (mapM_ memoCollatz [1..limit]) (M.empty, (1, 1)))
  where
    limit = 1000000
The program works fine but is really slow, so I want to spend some time figuring out how to make it faster.
I took a look at the profiling chapter of RWH, but I have no clue what the problem is.
I compiled it using ghc -O2 -rtsopts -prof -auto-all -caf-all -fforce-recomp and ran it with +RTS -s -p; here is the result:
6,633,397,720 bytes allocated in the heap
9,357,527,000 bytes copied during GC
2,616,881,120 bytes maximum residency (15 sample(s))
60,183,944 bytes maximum slop
5274 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 10570 colls, 0 par 3.36s 3.36s 0.0003s 0.0013s
Gen 1 15 colls, 0 par 7.03s 7.03s 0.4683s 3.4337s
INIT time 0.00s ( 0.00s elapsed)
MUT time 4.02s ( 4.01s elapsed)
GC time 10.39s ( 10.39s elapsed)
RP time 0.00s ( 0.00s elapsed)
PROF time 0.00s ( 0.00s elapsed)
EXIT time 0.16s ( 0.16s elapsed)
Total time 14.57s ( 14.56s elapsed)
%GC time 71.3% (71.3% elapsed)
Alloc rate 1,651,363,842 bytes per MUT second
Productivity 28.7% of total user, 28.7% of total elapsed
And the .prof file:
total time = 4.08 secs (4080 ticks @ 1000 us, 1 processor)
total alloc = 3,567,324,056 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
memoCollatz Main 84.9 91.9
memoCollatz.update Main 10.5 0.0
main Main 2.4 5.8
collatz Main 2.2 2.3
individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 52 0 0.0 0.0 100.0 100.0
main Main 105 0 0.0 0.0 0.0 0.0
CAF:main1 Main 102 0 0.0 0.0 0.0 0.0
main Main 104 1 0.0 0.0 0.0 0.0
CAF:main2 Main 101 0 0.0 0.0 0.0 0.0
main Main 106 0 0.0 0.0 0.0 0.0
CAF:main4 Main 100 0 0.0 0.0 0.0 0.0
main Main 107 0 0.0 0.0 0.0 0.0
CAF:main5 Main 99 0 0.0 0.0 94.4 86.7
main Main 108 0 1.4 0.9 94.4 86.7
memoCollatz Main 113 0 82.4 85.8 92.9 85.8
memoCollatz.update Main 115 2168610 10.5 0.0 10.5 0.0
CAF:main10 Main 98 0 0.0 0.0 5.1 11.0
main Main 109 0 0.4 2.7 5.1 11.0
memoCollatz Main 112 3168610 2.5 6.0 4.7 8.3
collatz Main 114 2168610 2.2 2.3 2.2 2.3
CAF:main11 Main 97 0 0.0 0.0 0.5 2.2
main Main 110 0 0.5 2.2 0.5 2.2
main.limit Main 111 1 0.0 0.0 0.0 0.0
CAF GHC.Conc.Signal 94 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding 89 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding.Iconv 88 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Handle.FD 82 0 0.0 0.0 0.0 0.0
What I can see is that the garbage collector is taking too much time and that the program spends most of its time in memoCollatz.
And here are two screenshots from heap profiling:
I expect the memory usage to increase and then drop rapidly because the program does its memoization with a Map, but I'm not sure what causes the rapid drop in the graph (maybe it's a bug in the visualization?).
I want to know how to analyze these tables/graphs and how they indicate the real problem.
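One thing worth checking before rewriting the algorithm (a sketch of my own, not from the question): the lazy State monad's modify accumulates a chain of unevaluated state-update thunks, and over a million iterations that chain alone can account for multi-GB residency. modify' (the strict variant, available in mtl >= 2.2) forces the new state at each step. A toy demonstration of the difference in shape:

```haskell
import Control.Monad.State

-- With the strict modify' the accumulator is forced on every step,
-- instead of building a long chain of pending (+ i) thunks as the
-- lazy modify would.
sumTo :: Int -> Int
sumTo n = execState (mapM_ (\i -> modify' (+ i)) [1 .. n]) 0
```

In the program above, the corresponding change would be modify' (M.insert n l *** update); the map itself is already Data.Map.Strict, so forcing the pair keeps the whole state evaluated.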
The Haskell Wiki contains a couple of different solutions to this problem: (link)
The fastest solution there uses an Array to memoize the results. On my machine it runs in about 1 second, with a maximum residency of about 35 MB.
Below is a version which runs in about 0.3 seconds and uses a quarter of the memory of the Array version, but it runs in the IO monad.
There are trade-offs between all of the different versions, and you have to decide which one you consider acceptable.
{-# LANGUAGE BangPatterns #-}
import Data.Array.IO
import Control.Monad

collatz :: Int -> Int
collatz x
  | even x    = div x 2
  | otherwise = 3 * x + 1

solve :: Int -> IO (Int, Int)
solve n = do
  arr <- newArray (1, n) 0 :: IO (IOUArray Int Int)
  writeArray arr 1 1
  let eval :: Int -> IO Int
      eval x =
        if x > n
          then fmap (1+) $ eval (collatz x)
          else do
            d <- readArray arr x
            if d == 0
              then do
                d <- fmap (1+) $ eval (collatz x)
                writeArray arr x d
                return d
              else return d
      go :: (Int, Int) -> Int -> IO (Int, Int)
      go !m x = do
        d <- eval x
        return $ max m (d, x)
  foldM go (0, 0) [2..n]

main :: IO ()
main = solve 1000000 >>= print

Performance of reading string to Int in Haskell ( Bytestring vs [Char])

Just doing some simple benchmarking of ByteString versus String. The code loads a file of 10,000,000 lines, each containing an integer, and then converts each string into an integer. It turns out that Prelude.read is much slower than ByteString's readInt.
I am wondering what the reason for the inefficiency is. Meanwhile, I am also not sure which part of the profiling report corresponds to the time spent loading the file (the data file is about 75 MB).
Here is the code for the test:
import System.Environment
import qualified Data.ByteString.Lazy.Char8 as LC

main :: IO ()
main = do
  xs <- getArgs
  let file = xs !! 0
  inputIo <- readFile file
  let iIo = map readInt . linesStr $ inputIo
  let sIo = sum iIo
  inputIoBs <- LC.readFile file
  let iIoBs = map readIntBs . linesBs $ inputIoBs
  let sIoBs = sum iIoBs
  print [sIo, sIoBs]

linesStr = lines
linesBs = LC.lines

readInt :: String -> Int
readInt x = read x :: Int

readIntBs :: LC.ByteString -> Int
readIntBs bs = case LC.readInt bs of
  Nothing -> error "Not an integer"
  Just (x, _) -> x
The code is compiled and executed as:
> ghc -o strO2 -O2 --make Str.hs -prof -auto-all -caf-all -rtsopts
> ./strO2 a.dat +RTS -K500M -p
Note that "a.dat" is in the aforementioned format and about 75 MB. The profiling result is:
strO2 +RTS -K500M -p -RTS a.dat
total time = 116.41 secs (116411 ticks @ 1000 us, 1 processor)
total alloc = 117,350,372,624 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
readInt Main 86.9 74.6
main.iIo Main 8.7 9.5
main Main 2.9 13.5
main.iIoBs Main 0.6 1.9
individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 54 0 0.0 0.0 100.0 100.0
main Main 109 0 2.9 13.5 100.0 100.0
main.iIoBs Main 116 1 0.6 1.9 1.3 2.4
readIntBs Main 118 10000000 0.7 0.5 0.7 0.5
main.sIoBs Main 115 1 0.0 0.0 0.0 0.0
main.sIo Main 113 1 0.2 0.0 0.2 0.0
main.iIo Main 111 1 8.7 9.5 95.6 84.1
readInt Main 114 10000000 86.9 74.6 86.9 74.6
main.file Main 110 1 0.0 0.0 0.0 0.0
CAF:main1 Main 106 0 0.0 0.0 0.0 0.0
main Main 108 1 0.0 0.0 0.0 0.0
CAF:linesBs Main 105 0 0.0 0.0 0.0 0.0
linesBs Main 117 1 0.0 0.0 0.0 0.0
CAF:linesStr Main 104 0 0.0 0.0 0.0 0.0
linesStr Main 112 1 0.0 0.0 0.0 0.0
CAF GHC.Conc.Signal 100 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding 93 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding.Iconv 91 0 0.0 0.0 0.0 0.0
CAF GHC.IO.FD 86 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Handle.FD 84 0 0.0 0.0 0.0 0.0
CAF Text.Read.Lex 70 0 0.0 0.0 0.0 0.0
Edit:
The input file "a.dat" are 10,000,000 lines of numbers:
1
2
3
...
10000000
Following the discussion, I replaced "a.dat" with 10,000,000 lines of 1s, which does not affect the above performance observation:
1
1
...
1
read is doing a much harder job than readInt. For example, compare:
> map read ["(100)", " 100", "- 100"] :: [Int]
[100,100,-100]
> map readInt ["(100)", " 100", "- 100"]
[Nothing,Nothing,Nothing]
read is essentially parsing Haskell. Combined with the fact that it's consuming linked lists, it's no surprise at all that it's really very slow indeed.
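To see how much of the gap comes from read's generality rather than from [Char] itself, one can try a specialized parser. A sketch of mine (hypothetical helper, handling only non-negative decimal digits) that skips the full Haskell-lexer machinery:

```haskell
import Data.Char (isDigit, ord)

-- Minimal String -> Int parser for non-negative decimals: a single
-- left fold over the digits, no lexing or backtracking.
readIntSimple :: String -> Int
readIntSimple = foldl step 0
  where
    step acc c
      | isDigit c = acc * 10 + (ord c - ord '0')
      | otherwise = error "readIntSimple: not a digit"
```

Swapping this in for read in the benchmark isolates the cost of generic parsing from the cost of the linked-list representation itself.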
