Haskell Profiling

I'm going through LYAH and have been looking at the use of list comprehensions vs. map/filter when processing lists. I've profiled the following two functions and included the prof outputs as well. If I'm reading the prof output correctly, I'd say FiltB is running a lot slower than FiltA (although only by thousandths of a second).
Would I be right in saying this is because FiltB has to evaluate x^2 twice?
FiltA.hs (filter odd)

-- FiltA.hs
module Main where

main = do
  let x = sum (takeWhile (<10000) (filter odd (map (^2) [1..])))
  print x
        Sat Jul 26 18:26 2014 Time and Allocation Profiling Report  (Final)

           Filta.exe +RTS -p -RTS

        total time  =        0.00 secs   (0 ticks @ 1000 us, 1 processor)
        total alloc =      92,752 bytes  (excludes profiling overheads)

COST CENTRE  MODULE            %time %alloc

main         Main                0.0   10.1
main.x       Main                0.0   53.0
CAF          GHC.IO.Handle.FD    0.0   36.3

                                                        individual     inherited
COST CENTRE  MODULE                    no.  entries  %time %alloc   %time %alloc

MAIN         MAIN                       37        0    0.0    0.2     0.0  100.0
 CAF         GHC.IO.Encoding.CodePage   61        0    0.0    0.1     0.0    0.1
 CAF         GHC.IO.Encoding            58        0    0.0    0.1     0.0    0.1
 CAF         GHC.IO.Handle.FD           52        0    0.0   36.3     0.0   36.3
 CAF         Main                       44        0    0.0    0.2     0.0   63.3
  main       Main                       74        1    0.0   10.1     0.0   63.1
   main.x    Main                       75        1    0.0   53.0     0.0   53.0
FiltB.hs (list comprehension)

-- FiltB.hs
module Main where

main = do
  let x = sum (takeWhile (<10000) [n^2 | n <- [1..], odd (n^2)])
  print x
        Sat Jul 26 18:30 2014 Time and Allocation Profiling Report  (Final)

           FiltB.exe +RTS -p -RTS

        total time  =        0.00 secs   (2 ticks @ 1000 us, 1 processor)
        total alloc =     107,236 bytes  (excludes profiling overheads)

COST CENTRE  MODULE            %time %alloc

main         Main               50.0    8.8
CAF          Main               50.0    0.1
main.x       Main                0.0   59.4
CAF          GHC.IO.Handle.FD    0.0   31.4

                                                        individual     inherited
COST CENTRE  MODULE                    no.  entries  %time %alloc   %time %alloc

MAIN         MAIN                       37        0    0.0    0.2   100.0  100.0
 CAF         GHC.IO.Encoding.CodePage   61        0    0.0    0.1     0.0    0.1
 CAF         GHC.IO.Encoding            58        0    0.0    0.0     0.0    0.0
 CAF         GHC.IO.Handle.FD           52        0    0.0   31.4     0.0   31.4
 CAF         Main                       44        0   50.0    0.1   100.0   68.3
  main       Main                       74        1   50.0    8.8    50.0   68.1
   main.x    Main                       75        1    0.0   59.4     0.0   59.4

Yes. In this particular case, since n^2 will be odd if and only if n is odd, you can speed up FiltB to the same speed as FiltA by replacing odd (n^2) with odd n.
Like you said, the problem is that for each element n, FiltB squares it, checks whether the square is odd, and if so squares n again to produce the list element.
More generally, the difference is that in a list comprehension the guard is checked before the output expression is computed, whereas with map and filter you can choose the order. So if what you actually want to filter on is the values in the list after mapping, map and filter is probably the better choice. You could still filter on whether the squared values are odd within a comprehension:
sum (takeWhile (<10000) [ x | x <- [ n^2 | n <- [1..] ], odd x ])
But that's getting to be pretty hard to read. Map and filter or filtering over the list comprehension explicitly (i.e. filter odd [ n^2 | n <- [1..] ]) are much better options.
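To make the fix concrete, here is FiltB with the cheaper test suggested above (a minimal sketch; the simplification relies on n^2 being odd exactly when n is odd):

-- FiltB', a rewrite of FiltB with the cheaper parity test
module Main where

main = print (sum (takeWhile (<10000) [n^2 | n <- [1..], odd n]))

Each n is now squared at most once, matching the work done by FiltA.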

Related

How can I determine if a function is being memoized in Haskell?

I've got a Haskell program whose performance is non-linear (worse than O(n)).
I'm trying to investigate whether memoization is taking place in a function. Can I verify this? I'm familiar with GHC profiling, but I'm not sure which values I should be looking at.
A workaround is to just plug in some values and observe the execution times, but that's not ideal.
As far as I know there is no automatic memoization in Haskell.
That said, there seems to be an optimization in GHC that caches the values of parameterless bindings like the following:

rightTriangles = [ (a, b, c) | c <- [1..]
                             , b <- [1..c]
                             , a <- [1..b]
                             , a^2 + b^2 == c^2 ]

If you try out the following in GHCi twice, you'll see that the second call is much faster:

ghci> take 500 rightTriangles
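That caching is a consequence of rightTriangles being a constant applicative form (CAF): a top-level binding with no arguments is a single heap object, so once it has been evaluated the result is retained. A minimal sketch of the distinction (the names here are illustrative, not from the original answer):

-- squares is a CAF: evaluated once, the result is retained and shared
squares :: [Integer]
squares = map (^2) [1..]

-- squaresUpTo is a function: its body is re-evaluated on every call
squaresUpTo :: Integer -> [Integer]
squaresUpTo n = map (^2) [1..n]

Forcing a prefix of squares twice in GHCi only pays the cost once; calling squaresUpTo with the same argument twice pays it twice.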
Not really an answer, but it should still be helpful: memoization does not seem to make a difference in the profiling output in terms of function "entries". Demonstrated with the following basic example:
module Main where

fib :: Int -> Int
fib 0 = 0
fib 1 = 1
fib n = fib (n-1) + fib (n-2)

fibmemo :: Int -> Int
fibmemo = (map fib [0 ..] !!)

main :: IO ()
main = do
  putStrLn "Begin.."
  print $ fib 10
  -- print $ fibmemo 10
With the above code the profiling output is:
                                                                       individual     inherited
COST CENTRE  MODULE                 SRC                     no.  entries  %time %alloc  %time %alloc

MAIN         MAIN                   <built-in>              119        0    0.0    1.3    0.0  100.0
 CAF         Main                   <entire-module>         237        0    0.0    1.0    0.0    1.2
  main       Main                   Main.hs:(12,1)-(14,16)  238        1    0.0    0.2    0.0    0.2
   fib       Main                   Main.hs:(5,1)-(7,29)    240      177    0.0    0.0    0.0    0.0
 CAF         GHC.Conc.Signal        <entire-module>         230        0    0.0    1.2    0.0    1.2
 CAF         GHC.IO.Encoding        <entire-module>         220        0    0.0    5.4    0.0    5.4
 CAF         GHC.IO.Encoding.Iconv  <entire-module>         218        0    0.0    0.4    0.0    0.4
 CAF         GHC.IO.Handle.FD       <entire-module>         210        0    0.0   67.7    0.0   67.7
 CAF         GHC.IO.Handle.Text     <entire-module>         208        0    0.0    0.2    0.0    0.2
 main        Main                   Main.hs:(12,1)-(14,16)  239        0    0.0   22.6    0.0   22.6
If we instead comment out fib 10 and uncomment fibmemo 10, we get:
                                                                       individual     inherited
COST CENTRE  MODULE                 SRC                     no.  entries  %time %alloc  %time %alloc

MAIN         MAIN                   <built-in>              119        0    0.0    1.2    0.0  100.0
 CAF         Main                   <entire-module>         237        0    0.0    1.0    0.0    2.9
  fibmemo    Main                   Main.hs:9:1-29          240        1    0.0    1.6    0.0    1.6
   fib       Main                   Main.hs:(5,1)-(7,29)    242      177    0.0    0.0    0.0    0.0
  main       Main                   Main.hs:(12,1)-(15,20)  238        1    0.0    0.2    0.0    0.2
   fibmemo   Main                   Main.hs:9:1-29          241        0    0.0    0.0    0.0    0.0
 CAF         GHC.Conc.Signal        <entire-module>         230        0    0.0    1.2    0.0    1.2
 CAF         GHC.IO.Encoding        <entire-module>         220        0    0.0    5.3    0.0    5.3
 CAF         GHC.IO.Encoding.Iconv  <entire-module>         218        0    0.0    0.4    0.0    0.4
 CAF         GHC.IO.Handle.FD       <entire-module>         210        0    0.0   66.6    0.0   66.6
 CAF         GHC.IO.Handle.Text     <entire-module>         208        0    0.0    0.2    0.0    0.2
 main        Main                   Main.hs:(12,1)-(15,20)  239        0    0.0   22.2    0.0   22.2
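Note that fib shows 177 entries in both runs because fibmemo = (map fib [0 ..] !!) only caches results by list index; the recursive calls inside fib itself do not go through the cache, so evaluating index 10 still triggers the full naive recursion. A sketch of a variant whose recursive calls do hit the cache (illustrative, not from the original answer), where you would expect the entry count to drop to roughly n:

-- sketch: fibs is a CAF, so each position is computed at most once;
-- the recursion reads earlier results back out of the shared list
fibs :: [Int]
fibs = map fib [0..]

fib :: Int -> Int
fib 0 = 0
fib 1 = 1
fib n = fibs !! (n-1) + fibs !! (n-2)

With a definition like this, the entry counts in the profile should actually reflect the memoization.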

Improving memory usage during serialization (Data.Binary)

I'm still kind of new to Haskell and learning new things every day. My problem is excessive memory usage during serialization with the Data.Binary library. Maybe I'm just using the library the wrong way, but I can't figure it out.
The idea is that I read binary data from disk, add new data, and write everything back to disk. Here's the code:
module Main where

import Data.Binary
import System.Environment
import Data.List (foldl')

data DualNo = DualNo Int Int deriving (Show)

instance Data.Binary.Binary DualNo where
  put (DualNo a b) = do
    put a
    put b
  get = do
    a <- get
    b <- get
    return (DualNo a b)

-- read DualNo values from disk
readData :: FilePath -> IO [DualNo]
readData filename = do
  no <- decodeFile filename :: IO [DualNo]
  return no

-- write DualNo values to disk
writeData :: [DualNo] -> String -> IO ()
writeData no filename = encodeFile filename (no :: [DualNo])

writeEmptyDataToDisk :: String -> IO ()
writeEmptyDataToDisk filename = writeData [] filename

-- extend the list with a new dataset
feedWithInputData :: [DualNo] -> [(Int, Int)] -> [DualNo]
feedWithInputData existData newData = foldl' func existData newData
  where
    func dataset (a, b) = DualNo a b : dataset

main :: IO ()
main = do
  [newInputData, toPutIntoExistingData] <- System.Environment.getArgs
  if toPutIntoExistingData == "empty"
    then writeEmptyDataToDisk "myData.dat"
    else return ()
  loadedData <- readData "myData.dat"
  newData <- return (case newInputData of
    "dataset1" -> feedWithInputData loadedData dataset1
    "dataset2" -> feedWithInputData loadedData dataset2
    _          -> feedWithInputData loadedData dataset3)
  writeData newData "myData.dat"

dataset1 = zip [1..100000] [2,4..200000]
dataset2 = zip [5,10..500000] [3,6..300000]
dataset3 = zip [4,8..400000] [6,12..600000]
I'm pretty sure there's a lot to improve in this code, but my biggest problem is the memory usage with big datasets.
I profiled my program with GHC:
$ ghc -O2 --make -prof -fprof-auto -auto-all -caf-all -rtsopts -fforce-recomp Main.hs
$ ./Main dataset1 empty +RTS -p -sstderr
     165,085,864 bytes allocated in the heap
      70,643,992 bytes copied during GC
      12,298,128 bytes maximum residency (7 sample(s))
         424,696 bytes maximum slop
              35 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0       306 colls,     0 par    0.035s   0.035s     0.0001s    0.0015s
  Gen  1         7 colls,     0 par    0.053s   0.053s     0.0076s    0.0180s

  INIT    time    0.001s  (  0.001s elapsed)
  MUT     time    0.059s  (  0.062s elapsed)
  GC      time    0.088s  (  0.088s elapsed)
  RP      time    0.000s  (  0.000s elapsed)
  PROF    time    0.000s  (  0.000s elapsed)
  EXIT    time    0.003s  (  0.003s elapsed)
  Total   time    0.154s  (  0.154s elapsed)

  %GC     time      57.0%  (57.3% elapsed)

  Alloc rate    2,781,155,968 bytes per MUT second

  Productivity  42.3% of total user, 42.5% of total elapsed
Looking at the prof-file:
        Tue Apr 12 18:11 2016 Time and Allocation Profiling Report  (Final)

           Main +RTS -p -sstderr -RTS dataset1 empty

        total time  =        0.06 secs   (60 ticks @ 1000 us, 1 processor)
        total alloc = 102,613,008 bytes  (excludes profiling overheads)

COST CENTRE             MODULE  %time %alloc

put                     Main     48.3   53.0
writeData               Main     30.0   18.8
dataset1                Main     13.3   23.4
feedWithInputData       Main      6.7    0.0
feedWithInputData.func  Main      1.7    4.7

                                                                 individual     inherited
COST CENTRE                MODULE                 no.  entries  %time %alloc  %time %alloc

MAIN                       MAIN                    68        0    0.0    0.0  100.0  100.0
 main                      Main                   137        0    0.0    0.0   86.7   76.6
  feedWithInputData        Main                   150        1    6.7    0.0    8.3    4.7
   feedWithInputData.func  Main                   154   100000    1.7    4.7    1.7    4.7
  writeData                Main                   148        1   30.0   18.8   78.3   71.8
   put                     Main                   155   100000   48.3   53.0   48.3   53.0
  readData                 Main                   147        0    0.0    0.1    0.0    0.1
  writeEmptyDataToDisk     Main                   142        0    0.0    0.0    0.0    0.1
   writeData               Main                   143        0    0.0    0.1    0.0    0.1
 CAF:main1                 Main                   133        0    0.0    0.0    0.0    0.0
  main                     Main                   136        1    0.0    0.0    0.0    0.0
 CAF:main2                 Main                   132        0    0.0    0.0    0.0    0.0
  main                     Main                   139        0    0.0    0.0    0.0    0.0
   writeEmptyDataToDisk    Main                   140        1    0.0    0.0    0.0    0.0
    writeData              Main                   141        1    0.0    0.0    0.0    0.0
 CAF:main7                 Main                   131        0    0.0    0.0    0.0    0.0
  main                     Main                   145        0    0.0    0.0    0.0    0.0
   readData                Main                   146        1    0.0    0.0    0.0    0.0
 CAF:dataset1              Main                   123        0    0.0    0.0    5.0    7.8
  dataset1                 Main                   151        1    5.0    7.8    5.0    7.8
 CAF:dataset4              Main                   122        0    0.0    0.0    5.0    7.8
  dataset1                 Main                   153        0    5.0    7.8    5.0    7.8
 CAF:dataset5              Main                   121        0    0.0    0.0    3.3    7.8
  dataset1                 Main                   152        0    3.3    7.8    3.3    7.8
 CAF:main4                 Main                   116        0    0.0    0.0    0.0    0.0
  main                     Main                   138        0    0.0    0.0    0.0    0.0
 CAF:main6                 Main                   115        0    0.0    0.0    0.0    0.0
  main                     Main                   149        0    0.0    0.0    0.0    0.0
 CAF:main3                 Main                   113        0    0.0    0.0    0.0    0.0
  main                     Main                   144        0    0.0    0.0    0.0    0.0
 CAF                       GHC.Conc.Signal        107        0    0.0    0.0    0.0    0.0
 CAF                       GHC.IO.Encoding        103        0    0.0    0.0    0.0    0.0
 CAF                       GHC.IO.Encoding.Iconv  101        0    0.0    0.0    0.0    0.0
 CAF                       GHC.IO.Handle.FD        94        0    0.0    0.0    0.0    0.0
 CAF                       GHC.IO.FD               86        0    0.0    0.0    0.0    0.0
Now I add further data:
$ ./Main dataset2 myData.dat +RTS -p -sstderr
     343,601,008 bytes allocated in the heap
     175,650,728 bytes copied during GC
      34,113,936 bytes maximum residency (8 sample(s))
         971,896 bytes maximum slop
              78 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0       640 colls,     0 par    0.082s   0.083s     0.0001s    0.0017s
  Gen  1         8 colls,     0 par    0.140s   0.141s     0.0176s    0.0484s

  INIT    time    0.001s  (  0.001s elapsed)
  MUT     time    0.138s  (  0.139s elapsed)
  GC      time    0.221s  (  0.224s elapsed)
  RP      time    0.000s  (  0.000s elapsed)
  PROF    time    0.000s  (  0.000s elapsed)
  EXIT    time    0.006s  (  0.006s elapsed)
  Total   time    0.370s  (  0.370s elapsed)

  %GC     time      59.8%  (60.5% elapsed)

  Alloc rate    2,485,518,518 bytes per MUT second

  Productivity  39.9% of total user, 39.8% of total elapsed
Looking at the new prof-file:
        Tue Apr 12 18:15 2016 Time and Allocation Profiling Report  (Final)

           Main +RTS -p -sstderr -RTS dataset2 myData.dat

        total time  =        0.14 secs   (139 ticks @ 1000 us, 1 processor)
        total alloc = 213,866,232 bytes  (excludes profiling overheads)

COST CENTRE             MODULE  %time %alloc

put                     Main     41.0   50.9
writeData               Main     25.9   18.0
get                     Main     25.2   16.8
dataset2                Main      4.3   11.2
readData                Main      1.4    0.8
feedWithInputData.func  Main      1.4    2.2

                                                                 individual     inherited
COST CENTRE                MODULE                 no.  entries  %time %alloc  %time %alloc

MAIN                       MAIN                    68        0    0.0    0.0  100.0  100.0
 main                      Main                   137        0    0.0    0.0   95.7   88.8
  feedWithInputData        Main                   148        1    0.7    0.0    2.2    2.2
   feedWithInputData.func  Main                   152   100000    1.4    2.2    1.4    2.2
  writeData                Main                   145        1   25.9   18.0   66.9   68.9
   put                     Main                   153   200000   41.0   50.9   41.0   50.9
  readData                 Main                   141        0    1.4    0.8   26.6   17.6
   get                     Main                   144        0   25.2   16.8   25.2   16.8
 CAF:main1                 Main                   133        0    0.0    0.0    0.0    0.0
  main                     Main                   136        1    0.0    0.0    0.0    0.0
 CAF:main7                 Main                   131        0    0.0    0.0    0.0    0.0
  main                     Main                   139        0    0.0    0.0    0.0    0.0
   readData                Main                   140        1    0.0    0.0    0.0    0.0
 CAF:dataset2              Main                   126        0    0.0    0.0    0.7    3.7
  dataset2                 Main                   149        1    0.7    3.7    0.7    3.7
 CAF:dataset6              Main                   125        0    0.0    0.0    2.2    3.7
  dataset2                 Main                   151        0    2.2    3.7    2.2    3.7
 CAF:dataset7              Main                   124        0    0.0    0.0    1.4    3.7
  dataset2                 Main                   150        0    1.4    3.7    1.4    3.7
 CAF:$fBinaryDualNo1       Main                   120        0    0.0    0.0    0.0    0.0
  get                      Main                   143        1    0.0    0.0    0.0    0.0
 CAF:main4                 Main                   116        0    0.0    0.0    0.0    0.0
  main                     Main                   138        0    0.0    0.0    0.0    0.0
 CAF:main6                 Main                   115        0    0.0    0.0    0.0    0.0
  main                     Main                   146        0    0.0    0.0    0.0    0.0
 CAF:main5                 Main                   114        0    0.0    0.0    0.0    0.0
  main                     Main                   147        0    0.0    0.0    0.0    0.0
 CAF:main3                 Main                   113        0    0.0    0.0    0.0    0.0
  main                     Main                   142        0    0.0    0.0    0.0    0.0
 CAF                       GHC.Conc.Signal        107        0    0.0    0.0    0.0    0.0
 CAF                       GHC.IO.Encoding        103        0    0.0    0.0    0.0    0.0
 CAF                       GHC.IO.Encoding.Iconv  101        0    0.0    0.0    0.0    0.0
 CAF                       GHC.IO.Handle.FD        94        0    0.0    0.0    0.0    0.0
 CAF                       GHC.IO.FD               86        0    0.0    0.0    0.0    0.0
The more often I add new data, the higher the memory usage becomes. Of course I need more memory for a bigger dataset, but isn't there a better solution for this problem (like gradually writing data back to disk)?
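For the "gradually writing back" idea, here is one hedged sketch: instead of encoding the whole list in one go, encode records one at a time and append them to the file as a raw stream. The framing is my own assumption (appendRecords is a hypothetical helper, and the resulting file is not readable with decodeFile for [DualNo], because the list encoding carries a length prefix):

-- sketch only, not from the post: append records without the
-- list's length prefix; reading back requires a decode loop
import qualified Data.ByteString.Lazy as BL
import Data.Binary (Binary, encode)

appendRecords :: Binary a => FilePath -> [a] -> IO ()
appendRecords path xs = BL.appendFile path (BL.concat (map encode xs))

Reading such a file back would need an incremental decoder (e.g. Data.Binary.Get's runGetIncremental) rather than decodeFile; the trade-off is streaming writes in exchange for managing the framing yourself.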
Edit:
Actually, the most important thing that bothers me is the following observation:
I run the program for the first time and add new data to an existing (empty) file on my disk.
The size of the saved file on my disk is: 1.53 MByte.
But (looking at the first prof-file) the program allocated more than 102 MByte. More than 50% was allocated by the put function from the Data.Binary package.
I run the program a second time and add new data to an existing (not empty) file on my disk.
The size of the saved file on my disk is 3.05 MByte.
But (looking at the second prof-file) the program allocated more than 213 MByte. More than 66% was allocated by the put and get functions together.
=> Conclusion: In the first example I needed 102/1.53 = 66 times more memory to run the program than the binary file takes on disk.
In the second example I needed 213/3.05 = 69 times more memory to run the program than the binary file takes on disk.
Question:
Is the Data.Binary package so efficient (and awesome) at serialization that it can shrink the data to such an extent?
Analogous question:
Do I really need so much more memory for the data loaded in my program than the same data takes in a binary file on disk?
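A rough back-of-the-envelope check (my own estimate, assuming a 64-bit GHC heap layout, not something from the post) suggests the in-memory representation alone accounts for a large factor: on disk each DualNo is two big-endian Ints, i.e. 16 bytes, while in the heap each list element costs roughly a cons cell (3 words) + a DualNo constructor (3 words) + two boxed Ints (2 words each) = 10 words = 80 bytes, a 5x blow-up before anything else. On top of that, "total alloc" in the profile is cumulative allocation over the whole run, not peak memory: every intermediate thunk, builder chunk, and GC-copied structure counts again. The 12 MB "maximum residency" from the +RTS -s output is the closer analogue to the 1.53 MB file.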

What is `MAIN`? (ghc profiling)

I built an old big project, Pugs, with GHC 7.10.1 using stack build (I wrote my own stack.yaml). Then I ran stack build --library-profiling --executable-profiling and .stack-work/install/x86_64-osx/nightly-2015-06-26/7.10.1/bin/pugs -e 'my $i=0; for (1..100_000) { $i++ }; say $i' +RTS -pa, which produced the following pugs.prof file.
        Fri Jul 10 00:10 2015 Time and Allocation Profiling Report  (Final)

           pugs +RTS -P -RTS -e my $i=0; for (1..10_000) { $i++ }; say $i

        total time  =        0.60 secs   (604 ticks @ 1000 us, 1 processor)
        total alloc = 426,495,472 bytes  (excludes profiling overheads)

COST CENTRE  MODULE    %time %alloc  ticks      bytes

MAIN         MAIN       92.2   90.6    557  386532168
CAF          Pugs.Run    2.8    5.2     17   22191000

                                                                      individual     inherited
COST CENTRE       MODULE                   no.  entries  %time %alloc  %time %alloc  ticks      bytes

MAIN              MAIN                     287        0   92.2   90.6  100.0  100.0    557  386532168
 listAssocOp      Pugs.Parser.Operator     841       24    0.0    0.0    0.0    0.0      0        768
 nassocOp         Pugs.Parser.Operator     840       24    0.0    0.0    0.0    0.0      0        768
 lassocOp         Pugs.Parser.Operator     839       24    0.0    0.0    0.0    0.0      0        768
 rassocOp         Pugs.Parser.Operator     838       24    0.0    0.0    0.0    0.0      0        768
 postfixOp        Pugs.Parser.Operator     837       24    0.0    0.0    0.0    0.0      0        768
 termOp           Pugs.Parser.Operator     824       24    0.0    0.5    0.7    1.2      0    2062768
  insert          Data.HashTable.ST.Basic  874        1    0.0    0.0    0.0    0.0      0        152
   checkOverflow  Data.HashTable.ST.Basic  890        1    0.0    0.0    0.0    0.0      0         80
    readDelLoad   Data.HashTable.ST.Basic  893        0    0.0    0.0    0.0    0.0      0        184
    writeLoad     Data.HashTable.ST.Basic  892        0    0.0    0.0    0.0    0.0      0        224
    readLoad      Data.HashTable.ST.Basic  891        0    0.0    0.0    0.0    0.0      0        184
   _values        Data.HashTable.ST.Basic  889        1    0.0    0.0    0.0    0.0      0          0
   _keys          Data.HashTable.ST.Basic  888        1    0.0    0.0    0.0    0.0      0          0
 .. snip ..
MAIN costs 92.2% of the time; however, I don't know what MAIN means. What does the MAIN label mean?
I was in the same spot a few days ago. What I deduced is the same thing: MAIN is the cost of expressions without annotations, i.e. everything not covered by any cost centre. Its counts shrink significantly if you add -fprof-auto and -caf-all; those options will also let you find a lot of interesting things happening in your code.
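A hedged example of what that looks like on the command line (flag spellings vary by GHC version; -caf-all and -auto-all are the older names, newer GHCs spell them -fprof-cafs and -fprof-auto):

$ ghc -prof -fprof-auto -fprof-cafs -rtsopts Main.hs
$ ./Main +RTS -p -RTS

With per-binding cost centres inserted automatically, time formerly lumped into MAIN gets attributed to the enclosing definitions instead.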

how to optimize this Haskell program?

I use the following code to memoize the total stopping time of the Collatz function, using a State monad to cache input-result pairs.
Additionally, the snd part of the state keeps track of the input value that maximizes the output; the goal is to find the input value under one million that maximizes the total stopping time. (The problem can be found on Project Euler.)
import Control.Applicative
import Control.Arrow
import Control.Monad.State
import qualified Data.Map.Strict as M

collatz :: Integer -> Integer
collatz n = if odd n
              then 3 * n + 1
              else n `div` 2

memoCollatz :: Integer
            -> State (M.Map Integer Int, (Integer, Int)) Int
memoCollatz 1 = return 1
memoCollatz n = do
  result <- gets (M.lookup n . fst)
  case result of
    Nothing -> do
      l <- succ <$> memoCollatz (collatz n)
      let update p@(_, curMaxV) =
            if l > curMaxV
              then (n, l)
              else p
      modify (M.insert n l *** update)
      return l
    Just v -> return v

main :: IO ()
main = print $ snd (execState (mapM_ memoCollatz [1..limit]) (M.empty, (1, 1)))
  where
    limit = 1000000
The program works fine but is really slow, so I want to spend some time figuring out how to make it faster.
I took a look at the profiling chapter of RWH, but I have no clue what the problem is:
I compiled with ghc -O2 -rtsopts -prof -auto-all -caf-all -fforce-recomp, ran it with +RTS -s -p, and here is the result:
   6,633,397,720 bytes allocated in the heap
   9,357,527,000 bytes copied during GC
   2,616,881,120 bytes maximum residency (15 sample(s))
      60,183,944 bytes maximum slop
            5274 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0     10570 colls,     0 par    3.36s    3.36s     0.0003s    0.0013s
  Gen  1        15 colls,     0 par    7.03s    7.03s     0.4683s    3.4337s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    4.02s  (  4.01s elapsed)
  GC      time   10.39s  ( 10.39s elapsed)
  RP      time    0.00s  (  0.00s elapsed)
  PROF    time    0.00s  (  0.00s elapsed)
  EXIT    time    0.16s  (  0.16s elapsed)
  Total   time   14.57s  ( 14.56s elapsed)

  %GC     time      71.3%  (71.3% elapsed)

  Alloc rate    1,651,363,842 bytes per MUT second

  Productivity  28.7% of total user, 28.7% of total elapsed
And the .prof file:
        total time  =        4.08 secs   (4080 ticks @ 1000 us, 1 processor)
        total alloc = 3,567,324,056 bytes  (excludes profiling overheads)

COST CENTRE         MODULE  %time %alloc

memoCollatz         Main     84.9   91.9
memoCollatz.update  Main     10.5    0.0
main                Main      2.4    5.8
collatz             Main      2.2    2.3

                                                          individual     inherited
COST CENTRE           MODULE           no.  entries  %time %alloc   %time %alloc

MAIN                  MAIN              52        0    0.0    0.0   100.0  100.0
 main                 Main             105        0    0.0    0.0     0.0    0.0
 CAF:main1            Main             102        0    0.0    0.0     0.0    0.0
  main                Main             104        1    0.0    0.0     0.0    0.0
 CAF:main2            Main             101        0    0.0    0.0     0.0    0.0
  main                Main             106        0    0.0    0.0     0.0    0.0
 CAF:main4            Main             100        0    0.0    0.0     0.0    0.0
  main                Main             107        0    0.0    0.0     0.0    0.0
 CAF:main5            Main              99        0    0.0    0.0    94.4   86.7
  main                Main             108        0    1.4    0.9    94.4   86.7
   memoCollatz        Main             113        0   82.4   85.8    92.9   85.8
    memoCollatz.update Main            115  2168610   10.5    0.0    10.5    0.0
 CAF:main10           Main              98        0    0.0    0.0     5.1   11.0
  main                Main             109        0    0.4    2.7     5.1   11.0
   memoCollatz        Main             112  3168610    2.5    6.0     4.7    8.3
    collatz           Main             114  2168610    2.2    2.3     2.2    2.3
 CAF:main11           Main              97        0    0.0    0.0     0.5    2.2
  main                Main             110        0    0.5    2.2     0.5    2.2
   main.limit         Main             111        1    0.0    0.0     0.0    0.0
 CAF                  GHC.Conc.Signal   94        0    0.0    0.0     0.0    0.0
 CAF                  GHC.IO.Encoding   89        0    0.0    0.0     0.0    0.0
 CAF                  GHC.IO.Encoding.Iconv 88    0    0.0    0.0     0.0    0.0
 CAF                  GHC.IO.Handle.FD  82        0    0.0    0.0     0.0    0.0
What I can see is that the garbage collector is taking too much time and that the program spends most of its time in memoCollatz.
And here are two screenshots from heap profiling:
I expect the memory usage to increase and then decrease rapidly, because the program memoizes using a Map, but I'm not sure what causes the rapid drop in the graph (maybe it's a bug in the visualization?).
I want to know how to analyze these tables/graphs and how they indicate the real problem.
The Haskell Wiki contains a couple of different solutions to this problem: (link)
The fastest solution there uses an Array to memoize the results. On my machine it runs in about 1 second, with a maximum residency of about 35 MB.
Below is a version which runs in about 0.3 seconds and uses a quarter of the memory of the Array version, but it runs in the IO monad.
There are trade-offs between the different versions, and you have to decide which one you consider acceptable.
{-# LANGUAGE BangPatterns #-}
import Data.Array.IO
import Control.Monad

collatz :: Int -> Int
collatz x
  | even x    = div x 2
  | otherwise = 3*x + 1

solve :: Int -> IO (Int, Int)
solve n = do
  arr <- newArray (1, n) 0 :: IO (IOUArray Int Int)
  writeArray arr 1 1
  let eval :: Int -> IO Int
      eval x =
        if x > n
          then fmap (1+) $ eval (collatz x)
          else do d <- readArray arr x
                  if d == 0
                    then do d <- fmap (1+) $ eval (collatz x)
                            writeArray arr x d
                            return d
                    else return d
      go :: (Int, Int) -> Int -> IO (Int, Int)
      go !m x = do d <- eval x
                   return $ max m (d, x)
  foldM go (0, 0) [2..n]

main :: IO ()
main = solve 1000000 >>= print
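A possible compile-and-run session, hedged (the file name Collatz.hs is my own choice, and the printed numbers assume Project Euler #14's well-known answer, where 837799 has the longest chain, of 525 terms):

$ ghc -O2 Collatz.hs
$ ./Collatz
(525,837799)

The (length, value) ordering of the tuple is what lets max pick the longest chain while still reporting which input produced it.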

Performance of reading String to Int in Haskell (ByteString vs [Char])

Just doing some simple benchmarking of ByteString vs. String. The code loads a file of 10,000,000 lines, each containing an integer, and then converts each string into an integer. It turns out that Prelude.read is much slower than ByteString.readInt.
I am wondering what the reason for this inefficiency is. Meanwhile, I am also not sure which part of the profiling report corresponds to the cost of loading the file (the data file is about 75 MB).
Here is the code for the test:
import System.Environment
import System.IO
import qualified Data.ByteString.Lazy.Char8 as LC

main :: IO ()
main = do
  xs <- getArgs
  let file = xs !! 0
  inputIo <- readFile file
  let iIo = map readInt . linesStr $ inputIo
  let sIo = sum iIo
  inputIoBs <- LC.readFile file
  let iIoBs = map readIntBs . linesBs $ inputIoBs
  let sIoBs = sum iIoBs
  print [sIo, sIoBs]

linesStr = lines
linesBs = LC.lines

readInt :: String -> Int
readInt x = read x :: Int

readIntBs :: LC.ByteString -> Int
readIntBs bs = case LC.readInt bs of
  Nothing     -> error "Not an integer"
  Just (x, _) -> x
The code is compiled and executed as:
> ghc -o strO2 -O2 --make Str.hs -prof -auto-all -caf-all -rtsopts
> ./strO2 a.dat +RTS -K500M -p
Note that "a.dat" is in the aforementioned format and is about 75 MB. The profiling result is:
           strO2 +RTS -K500M -p -RTS a.dat

        total time  =      116.41 secs   (116411 ticks @ 1000 us, 1 processor)
        total alloc = 117,350,372,624 bytes  (excludes profiling overheads)

COST CENTRE  MODULE  %time %alloc

readInt      Main     86.9   74.6
main.iIo     Main      8.7    9.5
main         Main      2.9   13.5
main.iIoBs   Main      0.6    1.9

                                                        individual     inherited
COST CENTRE    MODULE                 no.    entries  %time %alloc  %time %alloc

MAIN           MAIN                    54          0    0.0    0.0  100.0  100.0
 main          Main                   109          0    2.9   13.5  100.0  100.0
  main.iIoBs   Main                   116          1    0.6    1.9    1.3    2.4
   readIntBs   Main                   118   10000000    0.7    0.5    0.7    0.5
  main.sIoBs   Main                   115          1    0.0    0.0    0.0    0.0
  main.sIo     Main                   113          1    0.2    0.0    0.2    0.0
  main.iIo     Main                   111          1    8.7    9.5   95.6   84.1
   readInt     Main                   114   10000000   86.9   74.6   86.9   74.6
  main.file    Main                   110          1    0.0    0.0    0.0    0.0
 CAF:main1     Main                   106          0    0.0    0.0    0.0    0.0
  main         Main                   108          1    0.0    0.0    0.0    0.0
 CAF:linesBs   Main                   105          0    0.0    0.0    0.0    0.0
  linesBs      Main                   117          1    0.0    0.0    0.0    0.0
 CAF:linesStr  Main                   104          0    0.0    0.0    0.0    0.0
  linesStr     Main                   112          1    0.0    0.0    0.0    0.0
 CAF           GHC.Conc.Signal        100          0    0.0    0.0    0.0    0.0
 CAF           GHC.IO.Encoding         93          0    0.0    0.0    0.0    0.0
 CAF           GHC.IO.Encoding.Iconv   91          0    0.0    0.0    0.0    0.0
 CAF           GHC.IO.FD               86          0    0.0    0.0    0.0    0.0
 CAF           GHC.IO.Handle.FD        84          0    0.0    0.0    0.0    0.0
 CAF           Text.Read.Lex           70          0    0.0    0.0    0.0    0.0
Edit:
The input file "a.dat" consists of 10,000,000 lines of numbers:
1
2
3
...
10000000
Following the discussion, I replaced "a.dat" with 10,000,000 lines of 1s, which does not affect the performance observation above:
1
1
...
1
read is doing a much harder job than readInt. For example, compare:
> map read ["(100)", " 100", "- 100"] :: [Int]
[100,100,-100]
> map readInt ["(100)", " 100", "- 100"]
[Nothing,Nothing,Nothing]
read is essentially parsing Haskell source syntax. Combined with the fact that it consumes linked lists of characters, it's no surprise that it's very slow.
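To see how much of the cost is read's generality rather than String itself, here is a hedged sketch of a minimal digit-only parser over String (the name readIntSimple is my own, and it handles only non-negative decimal digits, with no sign, spaces, or parentheses):

import Data.List (foldl')

-- assumes the input is nothing but ASCII digits
readIntSimple :: String -> Int
readIntSimple = foldl' (\acc c -> acc * 10 + (fromEnum c - fromEnum '0')) 0

Swapping this in for read in the benchmark should close much of the gap even while still consuming [Char], which would suggest that the Text.Read machinery, not just the list representation, dominates the cost.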
