Here's an old question from 7 months ago, when stack overflowers agreed that Haskell's inefficiency in computing the Ackermann function was due to a compiler error.
Ackermann very inefficient with Haskell/GHC
7 months later, this appears to be fixed. It seems like ack runs with linear memory, but it runs pretty freakin' slow.
main = print (ack 4 1)
-- Ackermann function
ack 0 n = n + 1
ack m 0 = ack (m-1) 1
ack m n = ack (m-1) (ack m (n - 1))
$ time ./ack
65533
>real 8m53.274s
>user 8m47.313s
>sys 0m4.868s
Processor 2.8 GHz Intel Core i7
Memory 8 GB 1333 MHz DDR3
Software Mac OS X Lion 10.7.5 (11G63)
I am just asking for any insights into this. The more detailed ones will get upvoted. Keep in mind I am new to functional programming and even simple remarks about tail recursion vs regular recursion would be appreciated and upvoted.
I don't know how you're running it, but I suspect the complete list is:
Your program with no changes and compiling with no optimizations. Initial time: 7m29.755s
It appears you didn't use optimization. Be sure to use -O2 and try -fllvm when compiling. New time: 1m2.412s
Use explicit type signatures and use Int (vs the default of Integer) when you can. New time: 0m15.486s
So we received almost 8x speed-up by using optimizations (why does every other benchmark question not use optimization flags?!?!?) and an additional ~4x by using Int instead of Integer.
Add a type signature to ack:
ack :: Int -> Int -> Int
This should solve two problems with your code:
Overly general types
Without the signature, the compiler derives the following type:
ack :: (Eq a, Eq b, Num a, Num b) => a -> b -> b
ack ends up generalized to all number types, instead of just integers. This additional layer of indirection makes the code slow.
Giving ack a concrete type (like Int) removes this indirection.
Type defaulting
In addition, I'm guessing your main action is written like this:
main = print (ack 4 1)
Your ack works on any number type, but you don't specify exactly which one. This means GHC chooses one automatically, in a process called type defaulting.
In this case, it chooses Integer, a variable length type. Because Integer can handle numbers of arbitrary size, it is much slower than the machine sized Int.
Conclusion
To summarize:
Always write type signatures for top-level definitions.
Always compile with -Wall.
Related
I am writing a small snake game in Haskell as sort of a guided tutorial for beginners. The "rendering" just takes a Board and produces a Data.ByteString.Builder which is printed in the terminal. (the html profiles are pushed to the repo, you can inspect them without compiling the programm)
The problem
The problem I have is that the heap profiling looks weird: There are many spikes, and suddenly Builder, PAP and BuildStep take as same memory as the rest of the program. Considering that rendering is happenning 10 times in a second (i.e. every second we produce 10 builders), it seems inconsistent that every once in a while the builder just takes that much memory. I don't know if this is considered an space leak, since there is no thunks in the profile, but the PAP doesn't look right (I don't know...)
Implementation
The board is represented as an inmutable array of builders indexed by coordinaates (tuples) type Board = Array (Int, Int) Builder (essentialy, what should be printed in each coordinate). The function which converts the board into a builder is the expected strict fold which handle new lines using height and width of the board.
toBuilder :: RenderState -> Builder
-- |- The Array (Int, Int) Builder
toBuilder (RenderState b binf#(BoardInfo h w) gOver s) =
-- ^^^ height and width
if gOver
then ppScore s <> fst (boardToString $ emptyGrid binf) -- Not interesting. Case of game over print build an empty grid
else ppScore s <> fst (boardToString b) -- print the current board
where
boardToString = foldl' fprint (mempty, 0) -- concatenate builders and count the number, such that when #width builders have been concatenated, add a new line.
fprint (!s, !i) cell =
if ((i + 1) `mod` w) == 0
then (s <> cell <> B.charUtf8 '\n', i + 1 )
else (s <> cell , i + 1)
Up to the .prof file this function take most of the time and space (92%, which is expected). Moreover, this is the only part of the code that produces a big builder, so the problem should be here.
The buffering mode
The above profile happens when BufferMode is set to LineBuffering (default), but interestingly if I change it to NoBuffering then the profile looks the same but a thunk appears and the builder disappear...
The questions
I have reached a point which I don't know whats going on, hence my questions are a little bit vague:
Is my code with line buffering (first profile) actually leaking? No thunk appears but the PAP eating so much memory looks like a warning
The second profile clearly(?) leaks, is there an standard way to inspect which part of the code is producing the thunk?
Am I completely missing something, and actually the profile looks fine?
In case anyone is interested, I think I've found the problem. It is the terminal speed... If I run an smaller board size or a slower rendering time (the picture is for a 50x70 board with 10 renders a second), then the memory usage is completely normal.
What I think is happening, is that the board is printed into the console using B.hPutBuilder stdout, this action takes shorter than the console to actually print it, so the haskell thread continues and creates another board which should wait to be printed because the console is busy. I guess this leads to some how, two boards living in memory for a short time.
Other guesses are welcome!
I implemented a count function in Haskell and I am wondering will this behave badly on large lists :
count :: Eq a => a -> [a] -> Int
count x = length . filter (==x)
I believe the length function runs in linear time, is this correct?
Edit: Refactor suggested by #Higemaru
Length runs in linear time to the size of the list, yes.
Normally, you would be worried that your code had to take two passes through the list: first one to filter and then one to count the length of the resulting list. However, I believe this does not happen here because filter is not strict on the structure of the list. Instead, the length function forces the elements of the filtered list as it goes along, doing the actual count in one pass.
I think you can make it slightly shorter
count :: Eq a => a -> [a] -> Int
count x = length . filter (x==)
(I would have written a (lowly) comment if I could)
That really depends on the list. For a normal, lazily evaluated list of Ints on my computer, I see this function running in about 2 seconds for 10^9 elements, 0.2 seconds for 10^8, and 0.3 seconds for 10^7, so it appears to run in linear time. You can check this yourself by passing the flags +RTS -s -RTS to your executable when running it from the command line.
I also tried running it with more cores, but it doesn't seem to do anything but increase the memory usage a bit.
An added bonus of the lazy computation is that you only make a single pass over the list. filter and length get turned into a single loop by the compiler (with optimizations turned on), so you save memory and efficiency.
Another microbenchmark: Why is this "loop" (compiled with ghc -O2 -fllvm, 7.4.1, Linux 64bit 3.2 kernel, redirected to /dev/null)
mapM_ print [1..100000000]
about 5x slower than a simple for-cycle in plain C with write(2) non-buffered syscall? I am trying to gather Haskell gotchas.
Even this slow C solution is much faster than Haskell
int i;
char buf[16];
for (i=0; i<=100000000; i++) {
sprintf(buf, "%d\n", i);
write(1, buf, strlen(buf));
}
Okay, on my box the C code, compiled per gcc -O3 takes about 21.5 seconds to run, the original Haskell code about 56 seconds. So not a factor of 5, a bit above 2.5.
The first nontrivial difference is that
mapM_ print [1..100000000]
uses Integers, that's a bit slower because it involves a check upfront, and then works with boxed Ints, while the Show instance of Int does the conversion work on unboxed Int#s.
Adding a type signature, so that the Haskell code works on Ints,
mapM_ print [1 :: Int .. 100000000]
brings the time down to 47 seconds, a bit above twice the time the C code takes.
Now, another big difference is that show produces a linked list of Char and doesn't just fill a contiguous buffer of bytes. That is slower too.
Then that linked list of Chars is used to fill a byte buffer that then is written to the stdout handle.
So, the Haskell code does more, and more complicated things than the C code, thus it's not surprising that it takes longer.
Admittedly, it would be desirable to have an easy way to output such things more directly (and hence faster). However, the proper way to handle it is to use a more suitable algorithm (that applies to C too). A simple change to
putStr . unlines $ map show [0 :: Int .. 100000000]
almost halves the time taken, and if one wants it really fast, one uses the faster ByteString I/O and builds the output efficiently as exemplified in applicative's answer.
On my (rather slow and outdated) machine the results are:
$ time haskell-test > haskell-out.txt
real 1m57.497s
user 1m47.759s
sys 0m9.369s
$ time c-test > c-out.txt
real 7m28.792s
user 1m9.072s
sys 6m13.923s
$ diff haskell-out.txt c-out.txt
$
(I have fixed the list so that both C and Haskell start with 0).
Yes you read this right. Haskell is several times faster than C. Or rather, normally buffered Haskell is faster than C with write(2) non-buffered syscall.
(When measuring output to /dev/null instead of a real disk file, C is about 1.5 times faster, but who cares about /dev/null performance?)
Technical data: Intel E2140 CPU, 2 cores, 1.6 GHz, 1M cache, Gentoo Linux, gcc4.6.1, ghc7.6.1.
The standard Haskell way to hand giant bytestrings over to the operating system is to use a builder monoid.
import Data.ByteString.Lazy.Builder -- requires bytestring-0.10.x
import Data.ByteString.Lazy.Builder.ASCII -- omit for bytestring-0.10.2.x
import Data.Monoid
import System.IO
main = hPutBuilder stdout $ build [0..100000000::Int]
build = foldr add_line mempty
where add_line n b = intDec n <> charUtf8 '\n' <> b
which gives me:
$ time ./printbuilder >> /dev/null
real 0m7.032s
user 0m6.603s
sys 0m0.398s
in contrast to Haskell approach you used
$ time ./print >> /dev/null
real 1m0.143s
user 0m58.349s
sys 0m1.032s
That is, it's child's play to do nine times better than mapM_ print, contra Daniel Fischer's suprising defeatism. Everything you need to know is here: http://hackage.haskell.org/packages/archive/bytestring/0.10.2.0/doc/html/Data-ByteString-Builder.html I won't compare it with your C since my results were much slower than Daniel's and n.m. so I figure something was going wrong.
Edit: Made the imports consistent with all versions of bytestring-0.10.x It occurred to me the following might be clearer -- the Builder equivalent of unlines . map show:
main = hPutBuilder stdout $ unlines_ $ map intDec [0..100000000::Int]
where unlines_ = mconcat . map (<> charUtf8 '\n')
Just out of curiosity, I made a simple script to check speed and memory efficiency of constructing a list in Haskell:
wasteMem :: Int -> [Int]
wasteMem 0 = [199]
wasteMem x = (12432483483467856487256348746328761:wasteMem (x-1))
main = do
putStrLn("hello")
putStrLn(show (wasteMem 10000000000000000000000000000000000))
The strange thing is, when I tried this, it didn't run out of memory or stack space, it only prints [199], the same as running wasteMem 0. It doesn't even print an error message... why? Entering this large number in ghci just prints the number, so I don't think it's a rounding or reading error.
Your program is using a number greater than maxBound :: Int32. This means it will behave differently on different platforms. For GHC x86_64 Int is 64 bits (32 bits otherwise, but the Haskell report only promises 29 bits). This means your absurdly large value (1x10^34) is represented as 4003012203950112768 for me and zero for you 32-bit folks:
GHCI> 10000000000000000000000000000000000 :: Int
4003012203950112768
GHCI> 10000000000000000000000000000000000 :: Data.Int.Int32
0
This could be made platform independent by either using a fixed-size type (ex: from Data.Word or Data.Int) or using Integer.
All that said, this is a poorly conceived test to begin with. Haskell is lazy, so the amount of memory consumed by wastedMem n for any value n is minimal - it's just a thunk. Once you try to show this result it will grab elements off the list one at a time - first generating "[12432483483467856487256348746328761, and leaving the rest of the list as a thunk. The first value can be garbage collected before the second value is even considered (a constant-space program).
Adding to Thomas' answer, if you really want to waste space, you have to perform an operation on the list, which needs the whole list in memory at once. One such operation is sorting:
print . sort . wasteMem $ (2^16)
Also note that it's almost impossible to estimate the run-time memory usage of your list. If you want a more predictable memory benchmark, create an unboxed array instead of a list. This also doesn't require any complicated operation to ensure that everything stays in memory. Indexing a single element in an array already makes sure that the array is in memory at least once.
How can I find the actual amount of memory required to store a value of some data type in Haskell (mostly with GHC)? Is it possible to evaluate it at runtime (e.g. in GHCi) or is it possible to estimate memory requirements of a compound data type from its components?
In general, if memory requirements of types a and b are known, what is the memory overhead of algebraic data types such as:
data Uno = Uno a
data Due = Due a b
For example, how many bytes in memory do these values occupy?
1 :: Int8
1 :: Integer
2^100 :: Integer
\x -> x + 1
(1 :: Int8, 2 :: Int8)
[1] :: [Int8]
Just (1 :: Int8)
Nothing
I understand that actual memory allocation is higher due to delayed garbage collection. It may be significantly different due to lazy evaluation (and thunk size is not related to the size of the value). The question is, given a data type, how much memory does its value take when fully evaluated?
I found there is a :set +s option in GHCi to see memory stats, but it is not clear how to estimate the memory footprint of a single value.
(The following applies to GHC, other compilers may use different storage conventions)
Rule of thumb: a constructor costs one word for a header, and one word for each field. Exception: a constructor with no fields (like Nothing or True) takes no space, because GHC creates a single instance of these constructors and shares it amongst all uses.
A word is 4 bytes on a 32-bit machine, and 8 bytes on a 64-bit machine.
So e.g.
data Uno = Uno a
data Due = Due a b
an Uno takes 2 words, and a Due takes 3.
The Int type is defined as
data Int = I# Int#
now, Int# takes one word, so Int takes 2 in total. Most unboxed types take one word, the exceptions being Int64#, Word64#, and Double# (on a 32-bit machine) which take 2. GHC actually has a cache of small values of type Int and Char, so in many cases these take no heap space at all. A String only requires space for the list cells, unless you use Chars > 255.
An Int8 has identical representation to Int. Integer is defined like this:
data Integer
= S# Int# -- small integers
| J# Int# ByteArray# -- large integers
so a small Integer (S#) takes 2 words, but a large integer takes a variable amount of space depending on its value. A ByteArray# takes 2 words (header + size) plus space for the array itself.
Note that a constructor defined with newtype is free. newtype is purely a compile-time idea, and it takes up no space and costs no instructions at run time.
More details in The Layout of Heap Objects in the GHC Commentary.
The ghc-datasize package provides the recursiveSize function to calculate the size of a GHC object. However...
A garbage collection is performed before the size is calculated,
because the garbage collector would make heap walks difficult.
...so it wouldn't be practical to call this often!
Also see How to find out GHC's memory representations of data types? and How can I determine size of a type in Haskell?.