tail recursion recognition - haskell

I'm trying to learn Haskell and I stumbled upon the following:
myAdd (x:xs) = x + myAdd xs
myAdd null = 0
f = let n = 10000000 in myAdd [1 .. n]
main = do
  putStrLn (show f)
When compiling with GHC, this yields a stack overflow. As a C/C++ programmer, I would have expected the compiler to do tail call optimization.
I don't like that I would have to "help" the compiler in simple cases like these, but what options are there? I think it is reasonable to require that the calculation given above be done without using O(n) memory, and without deferring to specialized functions.
If I cannot state my problem naturally (even on a toy problem such as this), and expect reasonable performance in terms of time & space, much of the appeal of Haskell would be lost.

Firstly, make sure you're compiling with -O2. It makes a lot of performance problems just go away :)
The first problem I can see is that null is just a variable name there. You want []. It's equivalent here because the only options are x:xs and [], but it won't always be.
The issue here is simple: when you call myAdd [1,2,3,4], it builds up an expression that looks like this:
1 + (2 + (3 + (4 + 0)))
without ever reducing any of these additions to a number, because of Haskell's non-strict semantics. The solution is simple:
myAdd = myAdd' 0
  where myAdd' !total [] = total
        myAdd' !total (x:xs) = myAdd' (total + x) xs
(You'll need {-# LANGUAGE BangPatterns #-} at the top of your source file to compile this.)
This accumulates the addition in another parameter, and is actually tail recursive (yours isn't; + is in tail position rather than myAdd). But in fact, it's not quite tail recursion we care about in Haskell; that distinction is mainly relevant in strict languages. The secret here is the bang pattern on total: it forces it to be evaluated every time myAdd' is called, so no unevaluated additions build up, and it runs in constant space. In this case, GHC can actually figure this out with -O2 thanks to its strictness analysis, but I think it's usually best to be explicit about what you want strict and what you don't.
Note that if addition were lazy, your myAdd definition would work fine; the problem is that you're doing a lazy traversal of the list with a strict operation, which ends up causing the stack overflow. This mostly comes up with arithmetic, which is strict for the standard numeric types (Int, Integer, Float, Double, etc.).
This is quite ugly, and it would be a pain to write something like this every time we want to write a strict fold. Thankfully, Haskell has an abstraction ready for this!
myAdd = foldl' (+) 0
(You'll need to add import Data.List to compile this.)
foldl' (+) 0 [a, b, c, d] is just like (((0 + a) + b) + c) + d, except that at each application of (+) (which is how we refer to the binary operator + as a function value), the value is forced to be evaluated. The resulting code is cleaner, faster, and easier to read (once you know how the list folds work, you can understand any definition written in terms of them more easily than a recursive definition).
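To make the contrast concrete, here is a minimal sketch that puts a lazy right fold next to the strict left fold. The names sumLazy and sumStrict are made up for this illustration and do not come from the question.

```haskell
import Data.List (foldl')

-- Lazy right fold: builds 1 + (2 + (3 + ...)) before any addition runs,
-- so the number of pending additions grows with the length of the list.
sumLazy :: [Integer] -> Integer
sumLazy = foldr (+) 0

-- Strict left fold: the accumulator is forced at every step,
-- so only one running total is alive at a time.
sumStrict :: [Integer] -> Integer
sumStrict = foldl' (+) 0
```

Both return the same number; they differ only in how much memory they need while getting there.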
Basically, the problem here is not that the compiler can't figure out how to make your program efficient — it's that making it as efficient as you like could change its semantics, which an optimisation should never do. Haskell's non-strict semantics certainly pose a learning curve to programmers in more "traditional" languages like C, but it gets easier over time, and once you see the power and abstraction that Haskell's non-strictness offers, you'll never want to go back :)

Expanding the example ehird hinted at in the comments:
data Peano = Z | S Peano
  deriving (Eq, Show)
instance Ord Peano where
  compare (S a) (S b) = compare a b
  compare Z Z = EQ
  compare Z _ = LT
  compare _ _ = GT
instance Num Peano where
  Z + n = n
  (S a) + n = S (a + n)
  -- omit others
  fromInteger 0 = Z
  fromInteger n
    | n < 0 = error "Peano: fromInteger requires non-negative argument"
    | otherwise = S (fromInteger (n-1))
instance Enum Peano where
  succ = S
  pred (S a) = a
  pred _ = error "Peano: no predecessor"
  toEnum n
    | n < 0 = error "toEnum: invalid argument"
    | otherwise = fromInteger (toInteger n)
  fromEnum Z = 0
  fromEnum (S a) = 1 + fromEnum a
  enumFrom = iterate S
  enumFromTo a b = takeWhile (<= b) $ enumFrom a
  -- omit others
infinity :: Peano
infinity = S infinity
result :: Bool
result = 3 < myAdd [1 .. infinity]
result is True by the definition of myAdd, but if the compiler transformed it into a tail-recursive loop, it wouldn't terminate. So that transformation is not only a change in efficiency but also in semantics, hence a compiler must not do it.

A little funny example regarding "The issue is why the compiler is unable to optimize something that appears to be rather trivial to optimize."
Let's say I'm coming from Haskell to C++. I used to write foldr because in Haskell foldr is usually more efficient than foldl because of laziness and list fusion.
So I'm trying to write a foldr for a (singly linked) list in C and complaining that it's grossly inefficient:
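/* Assumes: typedef struct node { int value; struct node *next; } node;
   The field name `value` is an assumption; the original does not show the struct. */
int foldr(int (*f)(int, int), int base, node *list)
{
    return list == NULL
        ? base
        : f(list->value, foldr(f, base, list->next));
}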
It is inefficient not because the C compiler in question is an unrealistic toy tool developed by ivory tower theorists for their own satisfaction, but because the code in question is grossly non-idiomatic for C.
It is not the case that you cannot write an efficient foldr in C: you just need a doubly-linked list. In Haskell, similarly, you can write an efficient foldl, but it needs strictness annotations to be efficient. The standard library provides both foldl (without annotations) and foldl' (with annotations).
The idea of left folding a list in Haskell is the same kind of perversion as a desire to iterate a singly-linked list backwards using recursion in C. Compiler is there to help normal people, not perverts lol.
As your C++ projects probably don't have code iterating singly-linked lists backwards, my HNC project contains only 1 foldl I incorrectly wrote before I mastered Haskell enough. You hardly ever need to foldl in Haskell.
You must unlearn that forward iteration is natural and fastest, and learn that backward iteration is. The forward iteration (left folding) does not do what you intend, until you annotate: it does three passes - list creation, thunk chain buildup and thunk evaluation, instead of two (list creation and list traversal). Note that in immutable world lists can be only efficiently created backwards: a : b is O(1), and a ++ [b] is O(N).
And the backward iteration doesn't do what you intend either. It does one pass instead of three as you might expect from your C background. It doesn't create a list, traverse it to the bottom and then traverse it backwards (2 passes) - it traverses the list as it creates it, that is, 1 pass. With optimizations on, it is just a loop - no actual list elements are created. With optimizations off, it is still O(1) space operation with bigger constant overhead, but explanation is a bit longer.
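A small sketch of the "one pass" behaviour described above, using two right folds over an infinite list. The helper names anyR and mapR are made up for this illustration.

```haskell
-- A right fold whose combining function is lazy in its second argument
-- can stop early: (||) never looks at `rest` once `p x` is True.
anyR :: (a -> Bool) -> [a] -> Bool
anyR p = foldr (\x rest -> p x || rest) False

-- map written as a right fold: each (:) cell is produced on demand,
-- so a consumer can start before the input has been fully traversed.
mapR :: (a -> b) -> [a] -> [b]
mapR f = foldr (\x rest -> f x : rest) []
```

In GHCi, anyR (> 10) [1 ..] and take 3 (mapR (* 2) [1 ..]) both return even though the input list never ends, which a fold that had to reach the end of the list first could not do.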

So there are two things I will address about your problem, firstly the performance problem, and then secondly the expressive problem, that of having to help the compiler with something that seems trivial.
The performance
The thing is that your program is in fact not tail recursive: the recursive call is not the last thing the function does, because its result still has to be added to x. Let's have a look at what happens when we expand myAdd [1..3]:
myAdd [1,2,3]
1 + myAdd [2,3]
1 + (2 + myAdd [3])
As you can see, at every step the recursive call sits inside a pending addition, so it cannot simply replace the current call. One could use associativity to reduce 1 + 2 to 3, but that is not what tail recursion is about.
So here is a version that is tail recursive:
myAdd2 = go 0
  where go a [] = a
        go a (x:xs) = go (a + x) xs
Let's have a look at how go 0 [1,2,3] is evaluated:
go 0 [1,2,3]
go (0 + 1) [2,3]
go ((0 + 1) + 2) [3]
As you see, at every step we only need to keep track of one function call, and as long as the first parameter is evaluated strictly we should not get a linear space blowup from unevaluated additions. In fact, if you compile with optimization (-O1 or -O2), ghc is smart enough to figure that out on its own.
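If you would rather not depend on the optimizer, the accumulator can be forced by hand. The following is a minimal sketch using seq; the name myAddStrict is made up here, and the Integer type is chosen only to keep the example concrete.

```haskell
-- Tail-recursive sum whose accumulator is forced with `seq` before each
-- recursive call, so no chain of unevaluated (+) thunks builds up.
myAddStrict :: [Integer] -> Integer
myAddStrict = go 0
  where
    go acc []     = acc
    go acc (x:xs) =
      let acc' = acc + x
      in acc' `seq` go acc' xs
```

This should behave the same with or without -O, because the strictness is stated in the code rather than inferred.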
Expressiveness
Alright, so it is a bit harder to reason about performance in Haskell, but most of the time you don't have to. The thing is that you can use combinators that ensure efficiency. This particular pattern above is captured by foldl (and its strict cousin foldl'), so myAdd can be written as:
myAdd = foldl (+) 0
and if you compile that with optimization it will not give you a linear space blowup!

Related

TCO optimized Tower of Hanoi in Clojure

I'm reading an Introduction to Haskell course, and they introduce the well-known Tower of Hanoi problem as homework for the first class. I was tempted and wrote a solution:
type Peg = String
type Move = (Peg, Peg)
hanoi :: Int -> Peg -> Peg -> Peg -> [Move]
hanoi n b a e
  | n == 1 = [(b, e)]
  | n > 1 = hanoi (n - 1) b e a ++ hanoi 1 b a e ++ hanoi (n - 1) a b e
  | otherwise = []
I've played with it a little and saw that it is obviously using Tail Call Optimization, since it works in constant memory.
Clojure is the language that I work with most of the time, and hence I was challenged to write a Clojure solution. Naive ones are discarded since I want to write it to use TCO:
(defn hanoi-non-optimized
  [n b a e]
  (cond
    (= n 1) [[b e]]
    (> n 1) (concat (hanoi-non-optimized (dec n) b e a)
                    (hanoi-non-optimized 1 b a e)
                    (hanoi-non-optimized (dec n) a b e))
    :else []))
Well, Clojure is JVM hosted and thus doesn't have TCO by default, and one should use recur to get it (I know the story...). On the other hand, recur imposes some syntactic constraints since it has to be the last expression, that is, the tail. I feel a bit bad because I still can't write a solution that is as short and expressive as the one in Haskell and uses TCO at the same time.
Is there a simple solution for this which I can't see at the moment?
I have a great respect for both languages and already know that this is rather problem with my approach than with Clojure itself.
No, the Haskell code isn't tail-recursive. It is guarded-recursive, with recursion guarded by a lazy data constructor, : (to which the ++ calls are ultimately transformed), where because of the laziness only one part of the recursion call tree (a ++ b ++ c) is explored in its turn, so the stack's depth never exceeds n, the number of disks. Which is very small, like 7 or 8.
So Haskell code explores a, putting the c part aside. Your Clojure code on the other hand calculates the two parts (a and c, as b doesn't count) before concatenating them, so is double recursive, i.e. computationally heavy.
What you're looking for is not TCO, but TRMCO -- tail recursion modulo cons optimization, -- i.e. building the list in a top-down manner from inside a loop with a simulated stack. Clojure is especially well-suited for this, with its tail-appending conj (right?) instead of Lisp's and Haskell's head-prepending cons.
Or just print the moves instead of building the list of all of them.
edit: actually, TRMCO means we're allowed to reuse the call frame if we maintain the "continuation stack" ourselves, so the stack depth becomes exactly 1. Haskell in all likelihood builds a left-deepening tree of nested ++ thunk nodes in this case, as explained here, but in Clojure we're allowed to rearrange it into the right-nested list ourselves, when we maintain our own stack of to-do-next invocation descriptions (for the b and c parts of the a ++ b ++ c expression).
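The "maintain our own stack of to-do-next invocation descriptions" idea can be sketched as follows. It is written in Haskell only to stay next to the original hanoi definition; the Task type and the name hanoiStack are made up for this illustration, and a Clojure version would follow the same shape with loop/recur.

```haskell
type Peg  = String
type Move = (Peg, Peg)

-- One pending job: move n disks from `from` to `to`, using `via` as spare.
data Task = Task Int Peg Peg Peg

-- Hanoi driven by an explicit to-do stack. Every recursive call to `go`
-- is either in tail position or sits behind a lazy (:), so the moves are
-- produced one at a time and the call depth does not grow with n.
hanoiStack :: Int -> Peg -> Peg -> Peg -> [Move]
hanoiStack n from via to = go [Task n from via to]
  where
    go [] = []
    go (Task k f v t : rest)
      | k <= 0    = go rest
      | k == 1    = (f, t) : go rest
      | otherwise =
          go (Task (k - 1) f t v : Task 1 f v t : Task (k - 1) v f t : rest)
```

The three tasks pushed in the last guard correspond to the a ++ b ++ c parts of the original definition, in the same order.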

Universal memoization in Haskell [duplicate]

I can't figure out why m1 is apparently memoized while m2 is not in the following:
m1 = ((filter odd [1..]) !!)
m2 n = ((filter odd [1..]) !! n)
m1 10000000 takes about 1.5 seconds on the first call, and a fraction of that on subsequent calls (presumably it caches the list), whereas m2 10000000 always takes the same amount of time (rebuilding the list with each call). Any idea what's going on? Are there any rules of thumb as to if and when GHC will memoize a function? Thanks.
GHC does not memoize functions.
It does, however, compute any given expression in the code at most once per time that its surrounding lambda-expression is entered, or at most once ever if it is at top level. Determining where the lambda-expressions are can be a little tricky when you use syntactic sugar like in your example, so let's convert these to equivalent desugared syntax:
m1' = (!!) (filter odd [1..]) -- NB: See below!
m2' = \n -> (!!) (filter odd [1..]) n
(Note: The Haskell 98 report actually describes a left operator section like (a %) as equivalent to \b -> (%) a b, but GHC desugars it to (%) a. These are technically different because they can be distinguished by seq. I think I might have submitted a GHC Trac ticket about this.)
Given this, you can see that in m1', the expression filter odd [1..] is not contained in any lambda-expression, so it will only be computed once per run of your program, while in m2', filter odd [1..] will be computed each time the lambda-expression is entered, i.e., on each call of m2'. That explains the difference in timing you are seeing.
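The same distinction can be written out so that it no longer depends on how operator sections are desugared. This is a minimal sketch; the names nthOddShared and nthOddRebuilt are made up for this illustration.

```haskell
-- `odds` is bound outside the lambda, so the list is built lazily once
-- and then reused by every later call.
nthOddShared :: Int -> Integer
nthOddShared = \n -> odds !! n
  where
    odds = filter odd [1 ..]

-- Here the list expression sits under the binding of `n`, so, unless an
-- optimization floats it out, it is rebuilt on each call.
nthOddRebuilt :: Int -> Integer
nthOddRebuilt n = filter odd [1 ..] !! n
```

Both functions return the same values (for example, index 3 gives 7); only the amount of repeated work differs.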
Actually, some versions of GHC, with certain optimization options, will share more values than the above description indicates. This can be problematic in some situations. For example, consider the function
f = \x -> let y = [1..30000000] in foldl' (+) 0 (y ++ [x])
GHC might notice that y does not depend on x and rewrite the function to
f = let y = [1..30000000] in \x -> foldl' (+) 0 (y ++ [x])
In this case, the new version is much less efficient because it will have to read about 1 GB from memory where y is stored, while the original version would run in constant space and fit in the processor's cache. In fact, under GHC 6.12.1, the function f is almost twice as fast when compiled without optimizations as it is when compiled with -O2.
m1 is computed only once because it is a Constant Applicative Form, while m2 is not a CAF, and so is computed for each evaluation.
See the GHC wiki on CAFs: http://www.haskell.org/haskellwiki/Constant_applicative_form
There is a crucial difference between the two forms: the monomorphism restriction applies to m1 but not m2, because m2 has explicitly given arguments. So m2's type is general but m1's is specific. The types they are assigned are:
m1 :: Int -> Integer
m2 :: (Integral a) => Int -> a
Most Haskell compilers and interpreters (all of them that I know of actually) do not memoize polymorphic structures, so m2's internal list is recreated every time it's called, where m1's is not.
I'm not sure, because I'm quite new to Haskell myself, but it appears that it's because the second function is parametrized and the first one is not. The nature of a function is that its result depends on the input value, and in the functional paradigm especially it depends ONLY on the input. The obvious implication is that a function with no parameters always returns the same value, no matter what.
Apparently there's an optimizing mechanism in the GHC compiler that exploits this fact to compute the value of such a function only once for the whole program runtime. It does it lazily, to be sure, but does it nonetheless. I noticed it myself when I wrote the following function:
primes = filter isPrime [2..]
  where isPrime n = null [factor | factor <- [2..n-1], factor `divides` n]
          where f `divides` n = (n `mod` f) == 0
Then to test it, I entered GHCi and wrote: primes !! 1000. It took a few seconds, but finally I got the answer: 7927. Then I called primes !! 1001 and got the answer instantly. Similarly, I got the result for take 1000 primes in an instant, because Haskell had already computed the first thousand elements in order to return the 1001st.
Thus if you can write your function such that it takes no parameters, you probably want it. ;)

Extent of GHC's optimization

I am not very familiar with the degree to which Haskell/GHC can optimize code. Below I have a pretty "brute-force" (in the declarative sense) implementation of the n queens problem. I know it can be written more efficiently, but that's not my question. It's that this got me thinking about GHC's optimization capabilities and limits.
I have expressed it in what I consider a pretty straightforward declarative sense: filter the permutations of [1..n] that fulfill the predicate "for all indices i, j such that i < j, abs(vi - vj) != j - i". I would hope this is the kind of thing that can be optimized, but it also kind of feels like asking a lot of the compiler.
validQueens x = and [abs (x!!i - x!!j) /= j-i | i<-[0..length x - 2], j<-[i+1..length x - 1]]
queens n = filter validQueens (permutations [1..n])
oneThru x = [1..x]
pointlessQueens = filter validQueens . permutations . oneThru
main = do
  n <- getLine
  print $ pointlessQueens $ (read :: String -> Int) n
This runs fairly slowly and grows quickly. n=10 takes about a second and n=12 takes forever. Without optimization I can tell the growth is factorial (# of permutations) multiplied by quadratic (# of differences in the predicate to check). Is there any way this code can perform better through intelligent compilation? I tried the basic ghc options such as -O2 and didn't notice a significant difference, but I don't know the finer points (I just added the flags).
My impression is that the function I call queens cannot be optimized and must generate all permutations before filtering. Does the point-free version have a better chance? On the one hand I feel like a smart function comprehension between a filter and a predicate might be able to knock off some obviously undesired elements before they are even fully generated, but on the other hand it kind of feels like a lot to ask.
Sorry if this seems rambling, i guess my question is
Is the pointfree version of above function more capable of being optimized?
What steps could I take at make/compile/link time to encourage optimization?
Can you briefly describe some possible (and contrast with the impossible!) means of optimization for the above code? At what point in the process do these occur?
Is there any particular part of ghc --make queensN -O2 -v output I should be paying attention to? Nothing stands out to me. Don't even see much difference in output due to optimization flags
I am not overly concerned with this code example, but I thought writing it got me thinking and it seems to me like a decent vehicle for discussing optimization.
PS - permutations is from Data.List and looks like this:
permutations :: [a] -> [[a]]
permutations xs0 = xs0 : perms xs0 []
  where
    perms [] _ = []
    perms (t:ts) is = foldr interleave (perms ts (t:is)) (permutations is)
      where interleave xs r = let (_,zs) = interleave' id xs r in zs
            interleave' _ [] r = (ts, r)
            interleave' f (y:ys) r = let (us,zs) = interleave' (f . (y:)) ys r
                                     in (y:us, f (t:y:us) : zs)
At a more general level regarding "what kind of optimizations can GHC do", it may help to break the idea of an "optimization" apart a little bit. There are conceptual distinctions that can be drawn between aspects of a program that can be optimized. For instance, consider:
The intrinsic logical structure of the algorithm: You can safely assume in almost every case that this will never be optimized. Outside of experimental research, you're not likely to find a compiler that will replace a bubble sort with a merge sort, or even an insertion sort, and extremely unlikely to find one that would replace a bogosort with something sensible.
Non-essential logical structure of the algorithm: For instance, in the expression g (f x) (f x), how many times will f x be computed? What about an expression like g (f x 2) (f x 5)? These aren't intrinsic to the algorithm, and different variations can be interchanged without impacting anything other than performance. The difficulties in performing optimization here are essentially recognizing when a substitution can in fact be done without changing the meaning, and predicting which version will have the best results. A lot of manual optimizations fall into this category, along with a great deal of GHC's cleverness.
This is also the part that trips a lot of people up, because they see how clever GHC is and expect it to do even more. And because of the reasonable expectation that GHC should never make things worse, it's not uncommon to have potential optimizations that seem obvious (and are, to the programmer) that GHC can't apply because it's nontrivial to distinguish cases where the same transformation would significantly degrade performance. This is, for instance, why memoization and common subexpression elimination aren't always automatic.
This is also the part where GHC has a huge advantage, because laziness and purity make a lot of things much easier, and is I suspect what leads to people making tongue-in-cheek remarks like "Optimizing compilers are a myth (except perhaps in Haskell).", but also to unrealistic optimism about what even GHC can do.
Low-level details: Things like memory layout and other aspects of the final code. These tend to be somewhat arcane and highly dependent on implementation details of the runtime, the OS, and the processor. Optimizations of this sort are essentially why we have compilers, and usually not something you need to worry about unless you're writing code that is very computationally demanding (or are writing a compiler yourself).
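For the g (f x) (f x) case in the second category, the programmer can take the decision away from the optimizer by naming the shared value. This is a minimal sketch; expensive is a hypothetical stand-in for any costly pure function, and twice and shared are made-up names.

```haskell
-- `expensive` stands in for any costly pure function (hypothetical).
expensive :: Int -> Integer
expensive n = sum [1 .. fromIntegral n]

-- Written twice: whether the two calls are merged is up to the optimizer.
twice :: Int -> Integer
twice x = expensive x + expensive x

-- Named once with `let`: the result is computed at most once and reused.
shared :: Int -> Integer
shared x = let y = expensive x in y + y
```

The two definitions always agree on their result; the let version only makes the sharing explicit instead of hoping that common subexpression elimination fires.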
As far as your specific example here goes: GHC isn't going to significantly alter the intrinsic time complexity of your algorithm. It might be able to remove some constant factors. What it can't do is apply constant-factor improvements that it can't be sure are correct, particularly ones that technically change the meaning of the program in ways that you don't care about. Case in point here is #sclv's answer, which explains how your use of print creates unnecessary overhead; there's nothing GHC could do about that, and in fact the current form would possibly inhibit other optimizations.
There's a conceptual problem here. Permutations is generating streaming permutations, and filter is streaming too. What's forcing everything prematurely is the "show" implicit in "print". Change your last line to:
mapM_ print $ pointlessQueens $ (read :: String -> Int) n
and you'll see that results are generated in a streaming fashion much more rapidly. That fixes, for large result sets, a potential space leak, and other than that just lets things be printed as computed rather than all at once at the end.
However, you shouldn't expect any order-of-magnitude improvements from ghc optimizations (there are a few obvious ones that you do get, mostly having to do with strictness and folds, but it's irritating to rely on them). What you'll get are constant factors, generally.
Edit: As luqui points out below, show is also streaming (or at least show of [Int] is), but the line buffering nonetheless makes it harder to see the genuine speed of computation...
It should be noted, although you do express that it is not part of your question, that the big problem with your code is that you do not do any pruning.
In the case of your question, it feels foolish to talk about possible/impossible optimization, compiler flags, and how to best formulate it etc. when an improvement of the algorithm is staring us so blatantly in the face.
One of the first things that will be tried is the permutations starting with the first queen in position 1 and the second queen in position 2 ([1,2...]). This is of course not a solution and we will have to move one of the queens. However, in your implementation, all permutations involving this combination of the two first queens will be tested! The search should stop there and instantly move to the permutations involving [1,3,...].
Here is a version that does this sort of pruning:
import Data.List
import Control.Monad
main = getLine >>= mapM_ print . queens . read
queens :: Int -> [[Int]]
queens n = queens' [] n
queens' xs n
  | length xs == n = return xs
  | otherwise = do
      x <- [1..n] \\ xs
      guard (validQueens (x:xs))
      queens' (x:xs) n
validQueens x =
  and [abs (x!!i - x!!j) /= j-i | i<-[0..length x - 2], j<-[i+1..length x - 1]]
I understand that your question was about compiler optimization but as the discussion has shown pruning is necessary.
The first paper that I know of about how to do this for the n queens problem in a lazy functional language is Turner's paper "Recursion Equations as a Programming Language" You can read it in Google Books here.
In terms of your comment about a pattern worth remembering, this problem introduces a very powerful pattern. A great paper on this idea is Philip Wadler's paper, "How to Replace Failure by a List of Successes", which can be read in Google Books here
Here is a pure, non-monadic, implementation based on Turner's Miranda implementation. In the case of n = 12 (queens 12 12) it returns the first solution in .01 secs and will compute all 14,200 solutions in under 6 seconds. Of course printing those takes much longer.
queens :: Int -> Int -> [[Int]]
queens n boardsize =
    queensi n
  where
    -- given a safe arrangement of queens in the first n - 1 rows,
    -- "queensi n" returns a list of all the safe arrangements of queens
    -- in the first n rows
    queensi :: Int -> [[Int]]
    queensi 0 = [[]]
    queensi n = [ x : y | y <- queensi (n-1) , x <- [1..boardsize], safe x y 1]
    -- "safe x y n" tests whether a queen at column x would be safe from previous
    -- queens in y where the first element of y is n rows away from x, the second
    -- element is (n+1) rows away from x, etc.
    safe :: Int -> [Int] -> Int -> Bool
    safe _ [] _ = True
    safe x (c:y) n = and [ x /= c , x /= c + n , x /= c - n , safe x y (n+1)]
    -- we only need to check for queens in the same column, and the same diagonals;
    -- queens in the same row are not possible by the fact that we only pick one
    -- queen per row

Why toInteger :: Int -> Integer is lazy?

I have the following code:
{-# NOINLINE i2i #-}
i2i :: Int -> Integer
i2i x = toInteger x
main = print $ i2i 2
Running GHC with -ddump-simpl flag gives:
[Arity 1
NoCafRefs
Str: DmdType U(L)]
Main.i2i = GHC.Real.toInteger1
Seems that conversion from Int to Integer is lazy. Why is it so - is there a case when I can have
(toInteger _|_ ::Int) /= _|_
?
Edit: the question has more to do with GHC strictness analyzer, than with laziness per se. This code was derived from exploring standard mean function:
--mean :: Integer -> Integer -> [Integer] -> Double
mean :: Integer -> Int -> [Integer] -> Double
mean acc n [] = fromIntegral acc / fromIntegral n
mean acc n (x:xs) = mean (acc + x) (n + 1) xs
main = print $ mean 0 0 [1..1000000]
This code runs in O(N) space. When I uncomment the first line, space consumption changes to O(1). It seems that it comes down to the fromIntegral call, which in turn comes down to toInteger. The strictness analyzer somehow cannot infer that the conversion is strict, which seems strange to me.
Response to your edit: the dangers of O(N) space leaks for accumulating parameters are well known, at least to Haskell programmers. What ought to be well known but isn't is that no matter what the language, you should never trust to the optimizer to provide asymptotic guarantees for the space and time behavior of your programs. I don't understand the implications of simple optimizers I've written myself, let alone something hairy like GHC's front end, what with a strictness analyzer, inliner, and all the rest.
As to your two questions,
Why doesn't GHC's strictness analyzer optimize this particular code, when it does optimize very similar code?
Who knows?? (Maybe Simon PJ knows, maybe not.) If you care about performance, you shouldn't be relying on the strictness analyzer. Which brings us to the second, implied question:
How can I avoid O(N) space costs on this function and on every other function that uses accumulating parameters?
By putting strictness annotations on the accumulating parameters that force them to be evaluated at each tail-recursive call.
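Applied to the mean function from the question, that advice might look like the sketch below. The name meanStrict and the wrapper around a local go are choices made for this illustration; the accumulator types are the ones from the question.

```haskell
{-# LANGUAGE BangPatterns #-}

-- Running mean with both accumulators forced on every call. The bang
-- patterns state the strictness explicitly instead of leaving it to
-- GHC's strictness analyzer.
meanStrict :: [Integer] -> Double
meanStrict = go 0 0
  where
    go :: Integer -> Int -> [Integer] -> Double
    go !acc !n []     = fromIntegral acc / fromIntegral n
    go !acc !n (x:xs) = go (acc + x) (n + 1) xs
```

With the bangs in place, the space behaviour no longer depends on whether the count is an Int or an Integer, which was the surprising part of the original observation.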
I think you're looking at this the wrong way. Consider the following, silly fragment of code
let x = [undefined]
let y = map toInteger x
If we evaluate
y == []
we get False, whereas if we evaluate
head y
we get an undefined exception. There's no reason that applying map or comparing y with [] should diverge just because the only element of x is undefined. That's the essence of non-strictness.
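The same point as a compilable sketch: operations that only look at the list's spine succeed even though the element itself would diverge. All names here are made up for this illustration.

```haskell
-- A one-element list whose only element diverges when evaluated.
xs :: [Int]
xs = [undefined]

-- Mapping toInteger over it does not touch the element.
ys :: [Integer]
ys = map toInteger xs

-- Both of these inspect only the spine of the list, never the element.
spineIsEmpty :: Bool
spineIsEmpty = null ys

spineLength :: Int
spineLength = length ys
```

Evaluating head ys, by contrast, would raise the undefined exception, because that is the first point at which the element is actually demanded.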
