No error message in Haskell - haskell

Just out of curiosity, I made a simple script to check speed and memory efficiency of constructing a list in Haskell:
wasteMem :: Int -> [Int]
wasteMem 0 = [199]
wasteMem x = (12432483483467856487256348746328761:wasteMem (x-1))
main = do
putStrLn("hello")
putStrLn(show (wasteMem 10000000000000000000000000000000000))
The strange thing is, when I tried this, it didn't run out of memory or stack space, it only prints [199], the same as running wasteMem 0. It doesn't even print an error message... why? Entering this large number in ghci just prints the number, so I don't think it's a rounding or reading error.

Your program is using a number greater than maxBound :: Int32. This means it will behave differently on different platforms. For GHC x86_64 Int is 64 bits (32 bits otherwise, but the Haskell report only promises 29 bits). This means your absurdly large value (1x10^34) is represented as 4003012203950112768 for me and zero for you 32-bit folks:
GHCI> 10000000000000000000000000000000000 :: Int
4003012203950112768
GHCI> 10000000000000000000000000000000000 :: Data.Int.Int32
0
This could be made platform independent by either using a fixed-size type (ex: from Data.Word or Data.Int) or using Integer.
All that said, this is a poorly conceived test to begin with. Haskell is lazy, so the amount of memory consumed by wastedMem n for any value n is minimal - it's just a thunk. Once you try to show this result it will grab elements off the list one at a time - first generating "[12432483483467856487256348746328761, and leaving the rest of the list as a thunk. The first value can be garbage collected before the second value is even considered (a constant-space program).

Adding to Thomas' answer, if you really want to waste space, you have to perform an operation on the list, which needs the whole list in memory at once. One such operation is sorting:
print . sort . wasteMem $ (2^16)
Also note that it's almost impossible to estimate the run-time memory usage of your list. If you want a more predictable memory benchmark, create an unboxed array instead of a list. This also doesn't require any complicated operation to ensure that everything stays in memory. Indexing a single element in an array already makes sure that the array is in memory at least once.

Related

Heap profiling in Haskell. Am I having space leaks?

I am writing a small snake game in Haskell as sort of a guided tutorial for beginners. The "rendering" just takes a Board and produces a Data.ByteString.Builder which is printed in the terminal. (the html profiles are pushed to the repo, you can inspect them without compiling the programm)
The problem
The problem I have is that the heap profiling looks weird: There are many spikes, and suddenly Builder, PAP and BuildStep take as same memory as the rest of the program. Considering that rendering is happenning 10 times in a second (i.e. every second we produce 10 builders), it seems inconsistent that every once in a while the builder just takes that much memory. I don't know if this is considered an space leak, since there is no thunks in the profile, but the PAP doesn't look right (I don't know...)
Implementation
The board is represented as an inmutable array of builders indexed by coordinaates (tuples) type Board = Array (Int, Int) Builder (essentialy, what should be printed in each coordinate). The function which converts the board into a builder is the expected strict fold which handle new lines using height and width of the board.
toBuilder :: RenderState -> Builder
-- |- The Array (Int, Int) Builder
toBuilder (RenderState b binf#(BoardInfo h w) gOver s) =
-- ^^^ height and width
if gOver
then ppScore s <> fst (boardToString $ emptyGrid binf) -- Not interesting. Case of game over print build an empty grid
else ppScore s <> fst (boardToString b) -- print the current board
where
boardToString = foldl' fprint (mempty, 0) -- concatenate builders and count the number, such that when #width builders have been concatenated, add a new line.
fprint (!s, !i) cell =
if ((i + 1) `mod` w) == 0
then (s <> cell <> B.charUtf8 '\n', i + 1 )
else (s <> cell , i + 1)
Up to the .prof file this function take most of the time and space (92%, which is expected). Moreover, this is the only part of the code that produces a big builder, so the problem should be here.
The buffering mode
The above profile happens when BufferMode is set to LineBuffering (default), but interestingly if I change it to NoBuffering then the profile looks the same but a thunk appears and the builder disappear...
The questions
I have reached a point which I don't know whats going on, hence my questions are a little bit vague:
Is my code with line buffering (first profile) actually leaking? No thunk appears but the PAP eating so much memory looks like a warning
The second profile clearly(?) leaks, is there an standard way to inspect which part of the code is producing the thunk?
Am I completely missing something, and actually the profile looks fine?
In case anyone is interested, I think I've found the problem. It is the terminal speed... If I run an smaller board size or a slower rendering time (the picture is for a 50x70 board with 10 renders a second), then the memory usage is completely normal.
What I think is happening, is that the board is printed into the console using B.hPutBuilder stdout, this action takes shorter than the console to actually print it, so the haskell thread continues and creates another board which should wait to be printed because the console is busy. I guess this leads to some how, two boards living in memory for a short time.
Other guesses are welcome!

Is there inherent "cost of carry" of garbage thunks in Haskell?

I often see high number of cycles spent in GC when running GHC-compiled programs.
These numbers tend to be order of magnitude higher than my JVM experience suggests they should be. In particular, number of bytes "copied" by GC seems to be vastly larger than amounts of data I'm computing.
Is such difference between non- and strict languages fundamental?
tl;dr: Most of the stuff that the JVM does in stack frames, GHC does on the heap. If you wanted to compare GHC heap/GC stats with the JVM equivalent, you'd really need to account for some portion of the bytes/cycles the JVM spends pushing arguments on the stack or copying return values between stack frames.
Long version:
Languages targeting the JVM typically make use of its call stack. Each invoked method has an active stack frame that includes storage for the parameters passed to it, additional local variables, and temporary results, plus room for an "operand stack" used for passing arguments to and receiving results from other methods it calls.
As a simple example, if the Haskell code:
bar :: Int -> Int -> Int
bar a b = a * b
foo :: Int -> Int -> Int -> Int
foo x y z = let u = bar y z in x + u
were compiled to JVM, the byte code would probably look something like:
public static int bar(int, int);
Code:
stack=2, locals=2, args_size=2
0: iload_0 // push a
1: iload_1 // push b
2: imul // multiply and push result
3: ireturn // pop result and return it
public static int foo(int, int, int);
Code:
stack=2, locals=4, args_size=3
0: iload_1 // push y
1: iload_2 // push z
2: invokestatic bar // call bar, pushing result
5: istore_3 // pop and save to "u"
6: iload_0 // push x
7: iload_3 // push u
8: iadd // add and push result
9: ireturn // pop result and return it
Note that calls to built-in primitives like imul and user-defined methods like bar involve copying/pushing the parameter values from local storage to the operand stack (using iload instructions) and then invoking the primitive or method. Return values then need to be saved/popped to local storage (with istore) or returned to the caller with ireturn; occasionally, a return value can be left on the stack to serve as an operand for another method invocation. Also, while it's not explicit in the byte code, the ireturn instruction involves a copy, from the callee's operand stack to the caller's operand stack. Of course, in actual JVM implementations, various optimizations are presumably possible to reduce copying.
When something else eventually calls foo to produce a computation, for example:
some_caller t = foo (1+3) (2+4) t + 1
the (unoptimized) code might look like:
iconst_1
iconst_3
iadd // put 1+3 on the stack
iconst_2
iconst_4
iadd // put 2+4 on the stack
iload_0 // put t on the stack
invokestatic foo
iconst 1
iadd
ireturn
Again, subexpressions are evaluated with a lot of pushing and popping on the operand stack. Eventually, foo is invoked with its arguments pushed on the stack and its result popped off for further processing.
All of this allocation and copying takes place on this stack, so there's no heap allocation involved in this example.
Now, what happens if that same code is compiled with GHC 8.6.4 (without optimization and on an x86_64 architecture for the sake of concreteness)? Well, the pseudocode for the generated assembly is something like:
foo [x, y, z] =
u = new THUNK(sat_u) // thunk, 32 bytes on heap
jump: (+) x u
sat_u [] = // saturated closure for "bar y z"
push UPDATE(sat_u) // update frame, 16 bytes on stack
jump: bar y z
bar [a, b] =
jump: (*) a b
The calls/jumps to the (+) and (*) "primitives" are actually more complicated than I've made them out to be because of the typeclass that's involved. For example, the jump to (+) looks more like:
push CONTINUATION(\f -> f x u) // continuation, 24 bytes on stack
jump: (+) dNumInt // get the right (+) from typeclass instance
If you turn on -O2, GHC optimizes away this more complicated call, but it also optimizes away everything else that's interesting about this example, so for the sake of argument, let's pretend the pseudocode above is accurate.
Again, foo isn't of much use until someone calls it. For the some_caller example above, the portion of code that calls foo will look something like:
some_caller [t] =
...
foocall = new THUNK(sat_foocall) // thunk, 24 bytes on heap
...
sat_foocall [] = // saturated closure for "foo (1+3) (2+4) t"
...
v = new THUNK(sat_v) // thunk "1+3", 16 bytes on heap
w = new THUNK(sat_w) // thunk "2+4", 16 bytes on heap
push UPDATE(sat_foocall) // update frame, 16 bytes on stack
jump: foo sat_v sat_w t
sat_v [] = ...
sat_w [] = ...
Note that nearly all of this allocation and copying takes place on the heap, rather than the stack.
Now, let's compare these two approaches. At first blush, it looks like the culprit really is lazy evaluation. We're creating these thunks all over the place that wouldn't be necessary if evaluation was strict, right? But let's look at one of these thunks more carefully. Consider the thunk for sat_u in the definition of foo. It's 32 bytes / 4 words with the following contents:
// THUNK(sat_u)
word 0: ptr to sat_u info table/code
1: space for return value
// variables we closed over:
2: ptr to "y"
3: ptr to "z"
The creation of this thunk isn't fundamentally different than the JVM code:
0: iload_1 // push y
1: iload_2 // push z
2: invokestatic bar // call bar, pushing result
5: istore_3 // pop and save to "u"
Instead of pushing y and z onto the operand stack, we loaded them into a heap-allocated thunk. Instead of popping the result off the operand stack into our stack frame's local storage and managing stack frames and return addresses, we left space for the result in the thunk and pushed a 16-byte update frame onto the stack before transferring control to bar.
Similarly, in the call to foo in some_caller, instead of evaluating the argument subexpressions by pushing constants on the stack and invoking primitives to push results on the stack, we created thunks on the heap, each of which included a pointer to info table / code for invoking primitives on those arguments and space for the return value; an update frame replaced the stack bookkeeping and result copying implicit in the JVM version.
Ultimately, thunks and update frames are GHC's replacement for stack-based parameter and result passing, local variables, and temporary workspace. A lot of activity that takes place in JVM stack frames takes place in the GHC heap.
Now, most of the stuff in JVM stack frames and on the GHC heap quickly becomes garbage. The main difference is that in the JVM, stack frames are automatically tossed out when a function returns, after the runtime has copied the important stuff out (e.g., return values). In GHC, the heap needs to be garbage collected. As others have noted, the GHC runtime is built around the idea that the vast majority of heap objects will immediately become garbage: a fast bump allocator is used for initial heap object allocation, and instead of copying out the important stuff every time a function returns (as for the JVM), the garbage collector copies it out when the bump heap gets kind of full.
Obviously, the above toy example is ridiculous. In particular, things are going to get much more complicated when we start talking about code that operates on Java objects and Haskell ADTs, rather than Ints. However, it serves to illustrate the point that a direct comparison of heap usage and GC cycles between GHC and JVM doesn't make a whole lot of sense. Certainly, an exact accounting doesn't really seem possible as the JVM and GHC approaches are too fundamentally different, and the proof would be in real-world performance. At the very least, an apples-to-apples comparison of GHC heap usage and GC stats needs to account for some portion of the cycles the JVM spends pushing, popping, and copying values between operand stacks. In particular, at least some fraction of JVM return instructions should count towards GHC's "bytes copied".
As for the contribution of "laziness" to heap usage (and heap "garbage" in particular), it seems hard to isolate. Thunks really play a dual role as a replacement for stack-based operand passing and as a mechanism for deferred evaluation. Certainly a switch from laziness to strictness can reduce garbage -- instead of first creating a thunk and then eventually evaluating it to another closure (e.g., a constructor), you can just create the evaluated closure directly -- but that just means that instead of your simple program allocating a mind-blowing 172 gigabytes on the heap, maybe the strict version "only" allocates a modest 84 gigabytes.
As far as I can see, the specific contribution of lazy evaluation to "bytes copied" should be minimal -- if a closure is important at GC time, it will need to be copied. If it's still an unevaluated thunk, the thunk will be copied. If it's been evaluated, just the final closure will need to be copied. If anything, since thunks for complicated structures are much smaller than their evaluated versions, laziness should typically reduce bytes copied. Instead, the usual big win with strictness is that it allows certain heap objects (or stack objects) to become garbage faster so we don't end up with space leaks.
No, laziness does not inherently lead to a large amount of copying in GC. The programmer's failure to manage laziness properly, however, can certainly do so. For example, if a persistent data structure ends up full of chains of thunks due to lazy modification, then it will end up badly bloated.
Another major issue you may be encountering, as Daniel Wagner mentioned, is the cost of immutability. While it is certainly possible to program with mutable structures in Haskell, it is much more idiomatic to work with immutable ones when possible. Immutable structure designs have various trade-offs. For example, ones designed for high performance when used persistently tend to have low branching factors to increase sharing, which leads to some bloat when they're used ephemerally.

Haskell 'count occurrences' function

I implemented a count function in Haskell and I am wondering will this behave badly on large lists :
count :: Eq a => a -> [a] -> Int
count x = length . filter (==x)
I believe the length function runs in linear time, is this correct?
Edit: Refactor suggested by #Higemaru
Length runs in linear time to the size of the list, yes.
Normally, you would be worried that your code had to take two passes through the list: first one to filter and then one to count the length of the resulting list. However, I believe this does not happen here because filter is not strict on the structure of the list. Instead, the length function forces the elements of the filtered list as it goes along, doing the actual count in one pass.
I think you can make it slightly shorter
count :: Eq a => a -> [a] -> Int
count x = length . filter (x==)
(I would have written a (lowly) comment if I could)
That really depends on the list. For a normal, lazily evaluated list of Ints on my computer, I see this function running in about 2 seconds for 10^9 elements, 0.2 seconds for 10^8, and 0.3 seconds for 10^7, so it appears to run in linear time. You can check this yourself by passing the flags +RTS -s -RTS to your executable when running it from the command line.
I also tried running it with more cores, but it doesn't seem to do anything but increase the memory usage a bit.
An added bonus of the lazy computation is that you only make a single pass over the list. filter and length get turned into a single loop by the compiler (with optimizations turned on), so you save memory and efficiency.

Basic I/O performance in Haskell

Another microbenchmark: Why is this "loop" (compiled with ghc -O2 -fllvm, 7.4.1, Linux 64bit 3.2 kernel, redirected to /dev/null)
mapM_ print [1..100000000]
about 5x slower than a simple for-cycle in plain C with write(2) non-buffered syscall? I am trying to gather Haskell gotchas.
Even this slow C solution is much faster than Haskell
int i;
char buf[16];
for (i=0; i<=100000000; i++) {
sprintf(buf, "%d\n", i);
write(1, buf, strlen(buf));
}
Okay, on my box the C code, compiled per gcc -O3 takes about 21.5 seconds to run, the original Haskell code about 56 seconds. So not a factor of 5, a bit above 2.5.
The first nontrivial difference is that
mapM_ print [1..100000000]
uses Integers, that's a bit slower because it involves a check upfront, and then works with boxed Ints, while the Show instance of Int does the conversion work on unboxed Int#s.
Adding a type signature, so that the Haskell code works on Ints,
mapM_ print [1 :: Int .. 100000000]
brings the time down to 47 seconds, a bit above twice the time the C code takes.
Now, another big difference is that show produces a linked list of Char and doesn't just fill a contiguous buffer of bytes. That is slower too.
Then that linked list of Chars is used to fill a byte buffer that then is written to the stdout handle.
So, the Haskell code does more, and more complicated things than the C code, thus it's not surprising that it takes longer.
Admittedly, it would be desirable to have an easy way to output such things more directly (and hence faster). However, the proper way to handle it is to use a more suitable algorithm (that applies to C too). A simple change to
putStr . unlines $ map show [0 :: Int .. 100000000]
almost halves the time taken, and if one wants it really fast, one uses the faster ByteString I/O and builds the output efficiently as exemplified in applicative's answer.
On my (rather slow and outdated) machine the results are:
$ time haskell-test > haskell-out.txt
real 1m57.497s
user 1m47.759s
sys 0m9.369s
$ time c-test > c-out.txt
real 7m28.792s
user 1m9.072s
sys 6m13.923s
$ diff haskell-out.txt c-out.txt
$
(I have fixed the list so that both C and Haskell start with 0).
Yes you read this right. Haskell is several times faster than C. Or rather, normally buffered Haskell is faster than C with write(2) non-buffered syscall.
(When measuring output to /dev/null instead of a real disk file, C is about 1.5 times faster, but who cares about /dev/null performance?)
Technical data: Intel E2140 CPU, 2 cores, 1.6 GHz, 1M cache, Gentoo Linux, gcc4.6.1, ghc7.6.1.
The standard Haskell way to hand giant bytestrings over to the operating system is to use a builder monoid.
import Data.ByteString.Lazy.Builder -- requires bytestring-0.10.x
import Data.ByteString.Lazy.Builder.ASCII -- omit for bytestring-0.10.2.x
import Data.Monoid
import System.IO
main = hPutBuilder stdout $ build [0..100000000::Int]
build = foldr add_line mempty
where add_line n b = intDec n <> charUtf8 '\n' <> b
which gives me:
$ time ./printbuilder >> /dev/null
real 0m7.032s
user 0m6.603s
sys 0m0.398s
in contrast to Haskell approach you used
$ time ./print >> /dev/null
real 1m0.143s
user 0m58.349s
sys 0m1.032s
That is, it's child's play to do nine times better than mapM_ print, contra Daniel Fischer's suprising defeatism. Everything you need to know is here: http://hackage.haskell.org/packages/archive/bytestring/0.10.2.0/doc/html/Data-ByteString-Builder.html I won't compare it with your C since my results were much slower than Daniel's and n.m. so I figure something was going wrong.
Edit: Made the imports consistent with all versions of bytestring-0.10.x It occurred to me the following might be clearer -- the Builder equivalent of unlines . map show:
main = hPutBuilder stdout $ unlines_ $ map intDec [0..100000000::Int]
where unlines_ = mconcat . map (<> charUtf8 '\n')

Memory footprint of Haskell data types

How can I find the actual amount of memory required to store a value of some data type in Haskell (mostly with GHC)? Is it possible to evaluate it at runtime (e.g. in GHCi) or is it possible to estimate memory requirements of a compound data type from its components?
In general, if memory requirements of types a and b are known, what is the memory overhead of algebraic data types such as:
data Uno = Uno a
data Due = Due a b
For example, how many bytes in memory do these values occupy?
1 :: Int8
1 :: Integer
2^100 :: Integer
\x -> x + 1
(1 :: Int8, 2 :: Int8)
[1] :: [Int8]
Just (1 :: Int8)
Nothing
I understand that actual memory allocation is higher due to delayed garbage collection. It may be significantly different due to lazy evaluation (and thunk size is not related to the size of the value). The question is, given a data type, how much memory does its value take when fully evaluated?
I found there is a :set +s option in GHCi to see memory stats, but it is not clear how to estimate the memory footprint of a single value.
(The following applies to GHC, other compilers may use different storage conventions)
Rule of thumb: a constructor costs one word for a header, and one word for each field. Exception: a constructor with no fields (like Nothing or True) takes no space, because GHC creates a single instance of these constructors and shares it amongst all uses.
A word is 4 bytes on a 32-bit machine, and 8 bytes on a 64-bit machine.
So e.g.
data Uno = Uno a
data Due = Due a b
an Uno takes 2 words, and a Due takes 3.
The Int type is defined as
data Int = I# Int#
now, Int# takes one word, so Int takes 2 in total. Most unboxed types take one word, the exceptions being Int64#, Word64#, and Double# (on a 32-bit machine) which take 2. GHC actually has a cache of small values of type Int and Char, so in many cases these take no heap space at all. A String only requires space for the list cells, unless you use Chars > 255.
An Int8 has identical representation to Int. Integer is defined like this:
data Integer
= S# Int# -- small integers
| J# Int# ByteArray# -- large integers
so a small Integer (S#) takes 2 words, but a large integer takes a variable amount of space depending on its value. A ByteArray# takes 2 words (header + size) plus space for the array itself.
Note that a constructor defined with newtype is free. newtype is purely a compile-time idea, and it takes up no space and costs no instructions at run time.
More details in The Layout of Heap Objects in the GHC Commentary.
The ghc-datasize package provides the recursiveSize function to calculate the size of a GHC object. However...
A garbage collection is performed before the size is calculated,
because the garbage collector would make heap walks difficult.
...so it wouldn't be practical to call this often!
Also see How to find out GHC's memory representations of data types? and How can I determine size of a type in Haskell?.

Resources