TCO-optimized Tower of Hanoi in Clojure

I'm taking an Introduction to Haskell course, and it introduces the well-known Tower of Hanoi problem as homework for the first class. I was tempted and wrote a solution:
type Peg = String
type Move = (Peg, Peg)

hanoi :: Int -> Peg -> Peg -> Peg -> [Move]
hanoi n b a e
  | n == 1    = [(b, e)]
  | n > 1     = hanoi (n - 1) b e a ++ hanoi 1 b a e ++ hanoi (n - 1) a b e
  | otherwise = []
I played with it a little and saw that it is obviously using tail call optimization, since it works in constant memory.
Clojure is the language I work with most of the time, so I was challenged to write a Clojure solution. Naive ones are discarded, since I want to write it to use TCO:
(defn hanoi-non-optimized
  [n b a e]
  (cond
    (= n 1) [[b e]]
    (> n 1) (concat (hanoi-non-optimized (dec n) b e a)
                    (hanoi-non-optimized 1 b a e)
                    (hanoi-non-optimized (dec n) a b e))
    :else   []))
Well, Clojure is JVM-hosted and thus doesn't have TCO by default; one should use recur to get it (I know the story...). On the other hand, recur imposes a syntactic constraint, since it has to be the last expression - it has to be in tail position. I feel a bit bad because I still can't write a solution that is as short and expressive as the Haskell one and uses TCO at the same time.
Is there a simple solution for this which I can't see at the moment?
I have great respect for both languages and already know that this is rather a problem with my approach than with Clojure itself.

No, the Haskell code isn't tail-recursive. It is guarded-recursive: the recursion is guarded by a lazy data constructor, : (to which the ++ calls are ultimately transformed). Because of the laziness, only one part of the recursion call tree (a ++ b ++ c) is explored in its turn, so the stack's depth never exceeds n, the number of disks - which is very small, like 7 or 8.
So the Haskell code explores a while putting the c part aside. Your Clojure code, on the other hand, calculates both parts (a and c; b doesn't count) before concatenating them, so it is doubly recursive, i.e. computationally heavy.
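To see the guarded recursion at work (a small sketch reusing the definition from the question), force only a prefix of the result: it comes back instantly, because only the left spine of the call tree is explored.

type Peg = String
type Move = (Peg, Peg)

hanoi :: Int -> Peg -> Peg -> Peg -> [Move]
hanoi n b a e
  | n == 1    = [(b, e)]
  | n > 1     = hanoi (n - 1) b e a ++ hanoi 1 b a e ++ hanoi (n - 1) a b e
  | otherwise = []

-- Prints instantly even though the full list has 2^64 - 1 moves:
-- only the first few cells of the lazy result are ever built.
main :: IO ()
main = print (take 3 (hanoi 64 "begin" "aux" "end"))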
What you're looking for is not TCO but TRMCO - tail recursion modulo cons optimization - i.e. building the list top-down from inside a loop with a simulated stack. Clojure is especially well suited for this, with its tail-appending conj (right?) instead of Lisp's and Haskell's head-prepending cons.
Or just print the moves instead of building the list of all of them.
edit: actually, TRMCO means we're allowed to reuse the call frame if we maintain the "continuation stack" ourselves, so the stack depth becomes exactly 1. Haskell in all likelihood builds a left-deepening tree of nested ++ thunk nodes in this case, as explained here, but in Clojure we can rearrange it into the right-nested list ourselves by maintaining our own stack of to-do-next invocation descriptions (for the b and c parts of the a ++ b ++ c expression).
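A minimal Clojure sketch of that idea (names mine): keep an explicit stack of pending [n from via to] subproblems, pop one per loop/recur iteration, and conj finished moves onto the result vector. The work stack stays O(n) deep, and recur reuses a single frame.

(defn hanoi-trmco
  [n from via to]
  (loop [todo  (list [n from via to]) ; explicit stack of subproblems
         moves []]                    ; result built top-down with conj
    (if (empty? todo)
      moves
      (let [[m f v t] (peek todo)
            more      (pop todo)]
        (cond
          (zero? m) (recur more moves)
          (= m 1)   (recur more (conj moves [f t]))
          ;; push the three parts of a ++ b ++ c; the last one
          ;; conj'd onto the list ends up on top, so a runs first
          :else     (recur (conj more
                                 [(dec m) v f t]
                                 [1 f v t]
                                 [(dec m) f t v])
                           moves))))))

;; (hanoi-trmco 2 "b" "a" "e") ;=> [["b" "a"] ["b" "e"] ["a" "e"]]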

Related

Apply function to all pairs efficiently

I need a second order function pairApply that applies a binary function f to all unique pairs of a list-like structure and then combines them somehow. An example / sketch:
pairApply (+) f [a, b, c] = f a b + f a c + f b c
Some research leads me to believe that Data.Vector.Unboxed probably will have good performance (I will also need fast access to specific elements); it is also necessary for Statistics.Sample, which would come in handy further down the line.
With this in mind I have the following, which almost compiles:
import qualified Data.Vector.Unboxed as U

pairElement :: (U.Unbox a, U.Unbox b)
            => U.Vector a
            -> (a -> a -> b)
            -> Int
            -> a
            -> U.Vector b
pairElement v f idx el =
  U.map (f el) $ U.drop (idx + 1) v

pairUp :: (U.Unbox a, U.Unbox b)
       => (a -> a -> b)
       -> U.Vector a
       -> U.Vector (U.Vector b)
pairUp f v = U.imap (pairElement v f) v

pairApply :: (U.Unbox a, U.Unbox b)
          => (b -> b -> b)
          -> b
          -> (a -> a -> b)
          -> U.Vector a
          -> b
pairApply combine neutral f v =
  folder $ U.map folder (pairUp f v)
  where folder = U.foldl combine neutral
The reason this doesn't compile is that there is no Unbox instance for U.Vector (U.Vector a). I have been able to create new unboxed instances in other cases using Data.Vector.Unboxed.Deriving, but I'm not sure it would be so easy in this case (transform it to a pair where the first element is all the inner vectors concatenated and the second records the lengths of the vectors, so we know how to unpack?)
My question can be stated in two parts:
Does the above implementation make sense at all or is there some quick library function magic etc that could do it much easier?
If so, is there a better way to make an unboxed vector of vectors than the one sketched above?
Note that I'm aware that foldl is probably not the best choice; once I've got the implementation sorted I plan to benchmark with a few different folds.
There is no way to define a classical instance for Unbox (U.Vector b), because that would require preallocating a memory area in which each element (i.e. each subvector!) has the same fixed amount of space. But in general, each of them may be arbitrarily big, so that's not feasible at all.
It might in principle be possible to define that instance by storing only a flattened form of the nested vector plus an extra array of indices (where each subvector starts). I once briefly gave this a try; it actually seems somewhat promising as far as immutable vectors are concerned, but a G.Vector instance also requires a mutable implementation, and that's hopeless for such an approach (because any mutation that changes the number of elements in one subvector would require shifting everything behind it).
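To make the flattened idea concrete for the immutable half, here is a rough sketch (all names invented for illustration): one flat payload vector plus a vector of row start offsets, with a sentinel entry at the end.

import qualified Data.Vector.Unboxed as U

data Nested a = Nested
  { flat   :: U.Vector a   -- all rows concatenated
  , starts :: U.Vector Int -- numRows + 1 offsets into flat
  }

row :: U.Unbox a => Nested a -> Int -> U.Vector a
row (Nested f s) i = U.slice off (end - off) f
  where off = s U.! i
        end = s U.! (i + 1)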
Usually, it's just not worth it, because if the individual element vectors aren't very small the overhead of boxing them won't matter, i.e. often it makes sense to use B.Vector (U.Vector b).
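That boxed-outside, unboxed-inside layout needs no extra machinery at all (a minimal sketch, with module aliases as below):

import qualified Data.Vector as B
import qualified Data.Vector.Unboxed as U

-- The outer boxed vector stores pointers, so the unboxed rows
-- may have different lengths.
rows :: B.Vector (U.Vector Int)
rows = B.fromList [U.fromList [1, 2, 3], U.fromList [4, 5]]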
For your application, however, I would not do that at all - there's no need to ever wrap the upper element-choices in a single triangular array. (And it would be really bad for performance to do that, because it makes the algorithm take O(n²) memory rather than the O(n) that is all that's needed.)
I would just do the following:
pairApply combine neutral f v
  = U.ifoldl' (\acc i p -> U.foldl' (\acc' q -> combine acc' $ f p q)
                                    acc
                                    (U.drop (i+1) v))
              neutral v
This corresponds pretty much to the obvious nested-loops imperative implementation
pairApply(combine, b, f, v):
    for(i in 0..length(v)-1):
        for(j in i+1..length(v)-1):
            b = combine(b, f(v[i], v[j]));
    return b;
My answer is basically the same as leftaroundabout's nested-loops imperative implementation:
import Data.List (foldl')
import qualified Data.Vector as V

pairApply :: (Int -> Int -> Int) -> V.Vector Int -> Int
pairApply f v = foldl' (+) 0 [f (v V.! i) (v V.! j) | i <- [0 .. n-1], j <- [i+1 .. n-1]]
  where n = V.length v
As far as I can tell, there is no performance issue with this implementation.
Non-polymorphic for simplicity.
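A quick sanity check of the definition above (example values mine):

-- f a b + f a c + f b c with f = (*) on [1,2,3]: 1*2 + 1*3 + 2*3 = 11
main :: IO ()
main = print (pairApply (*) (V.fromList [1, 2, 3]))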

ghci compiler optimization: calling a function with same parameter twice

In the simple code below, part of the definition of a function that deletes an element from a binary search tree:
deleteB x (Node n l r) | x == n = Node (leastB r) l (deleteB (leastB r) r)
does the compiler optimize the code so that it calls (leastB r) only once, as if it were:
deleteB x (Node n l r) | x == n = Node k l (deleteB k r)
where k = leastB r
?
In other words, is the compiler able to understand that since parameter r isn't changed within the body of the function deleteB, the result of the call of the same function (leastB) on it can't give different results, hence it is useless to compute it twice?
More generally, how would I be able to tell whether the compiler does this optimization or not, in case amazing Stack Overflow did not exist? Thanks.
If you want to know what GHC "really did", you want to look at the "Core" output.
GHC takes your Haskell source code, which is extremely high-level, and transforms it into a sequence of lower and lower-level languages:
Haskell ⇒ Core ⇒ STG ⇒ C−− ⇒ assembly language ⇒ machine code
Almost all of the high-level optimisations happen in Core. The one you're asking about is basically "common subexpression elimination" (CSE). If you think about it, this is a time/space tradeoff: by saving the previous result, you're using less CPU time, but also more RAM. If the result you're trying to store is tiny (e.g., an integer), this can be worth it. If the result is huge (e.g., the entire contents of that 17 GB text file you just loaded), this is probably a Very Bad Idea.
As I understand it (⇒ not very well!), GHC tends not to do CSE. But if you want to know for sure, in your specific case, you want to look at the Core that your program has actually been compiled into. I believe the switch you want is -ddump-prep.
http://www.haskell.org/ghc/docs/7.0.2/html/users_guide/options-debugging.html
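For example (a sketch; the file name here is made up), from a shell:

ghc -O2 -fforce-recomp -ddump-prep Tree.hs

-fforce-recomp makes GHC recompile even if the module is unchanged, so the dump is always printed.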
GHC does not perform this optimization because it is not always an optimization space-wise.
For instance, consider
n = 1000000
x = (length $ map succ [1..n], length $ map pred [1..n])
On a lazy language such as Haskell, one would expect this to run in constant space. Indeed, the list generating expression [1..n] should lazily produce an element at a time, which would be affected by succ/pred because of the maps, and then counted by length. (Even better, succ and pred are not computed at all since length does not force the list elements). After this, the produced element can be garbage collected, and the list generator can produce the next element, and so on. In real implementations, one would not expect every single element to be garbage collected immediately, but if the garbage collector is good, only a constant amount of them should be in memory at any time.
By comparison, the "optimized" code
n = 1000000
l = [1..n]
x = (length $ map succ l, length $ map pred l)
does not allow the elements of l to be garbage collected until both components of x are evaluated. So, while it produces the list only once, it uses O(n) words of memory to store the full list. This is likely to lead to lower performance than the unoptimized code.

Universal memoization in Haskell [duplicate]

I can't figure out why m1 is apparently memoized while m2 is not in the following:
m1 = ((filter odd [1..]) !!)
m2 n = ((filter odd [1..]) !! n)
m1 10000000 takes about 1.5 seconds on the first call, and a fraction of that on subsequent calls (presumably it caches the list), whereas m2 10000000 always takes the same amount of time (rebuilding the list with each call). Any idea what's going on? Are there any rules of thumb as to if and when GHC will memoize a function? Thanks.
GHC does not memoize functions.
It does, however, compute any given expression in the code at most once per time that its surrounding lambda-expression is entered, or at most once ever if it is at top level. Determining where the lambda-expressions are can be a little tricky when you use syntactic sugar like in your example, so let's convert these to equivalent desugared syntax:
m1' = (!!) (filter odd [1..]) -- NB: See below!
m2' = \n -> (!!) (filter odd [1..]) n
(Note: The Haskell 98 report actually describes a left operator section like (a %) as equivalent to \b -> (%) a b, but GHC desugars it to (%) a. These are technically different because they can be distinguished by seq. I think I might have submitted a GHC Trac ticket about this.)
Given this, you can see that in m1', the expression filter odd [1..] is not contained in any lambda-expression, so it will only be computed once per run of your program, while in m2', filter odd [1..] will be computed each time the lambda-expression is entered, i.e., on each call of m2'. That explains the difference in timing you are seeing.
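One way to observe this directly (my sketch, for GHCi without optimisation; the trace message is invented):

import Debug.Trace (trace)

m1 :: Int -> Integer
m1 = (trace "list built" (filter odd [1 ..]) !!)

m2 :: Int -> Integer
m2 n = trace "list built" (filter odd [1 ..]) !! n

-- In GHCi, "list built" shows up once across all calls of m1,
-- but on every call of m2.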
Actually, some versions of GHC, with certain optimization options, will share more values than the above description indicates. This can be problematic in some situations. For example, consider the function
f = \x -> let y = [1..30000000] in foldl' (+) 0 (y ++ [x])
GHC might notice that y does not depend on x and rewrite the function to
f = let y = [1..30000000] in \x -> foldl' (+) 0 (y ++ [x])
In this case, the new version is much less efficient because it will have to read about 1 GB from the memory where y is stored, while the original version would run in constant space and fit in the processor's cache. In fact, under GHC 6.12.1, the function f is almost twice as fast when compiled without optimizations as it is when compiled with -O2.
m1 is computed only once because it is a Constant Applicative Form, while m2 is not a CAF, and so is computed for each evaluation.
See the GHC wiki on CAFs: http://www.haskell.org/haskellwiki/Constant_applicative_form
There is a crucial difference between the two forms: the monomorphism restriction applies to m1 but not m2, because m2 has explicitly given arguments. So m2's type is general but m1's is specific. The types they are assigned are:
m1 :: Int -> Integer
m2 :: (Integral a) => Int -> a
Most Haskell compilers and interpreters (all of them that I know of actually) do not memoize polymorphic structures, so m2's internal list is recreated every time it's called, where m1's is not.
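A sketch that makes this visible (example mine, not from the original post): give the section an explicitly polymorphic signature and the sharing disappears, because the definition now takes a hidden dictionary argument.

-- Monomorphic: a CAF; the list is built once and shared between calls.
m1 :: Int -> Integer
m1 = ((filter odd [1 ..]) !!)

-- Polymorphic: under the hood this is a function of the Integral
-- dictionary, so the list is rebuilt at each use.
m1poly :: Integral a => Int -> a
m1poly = ((filter odd [1 ..]) !!)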
I'm not sure, because I'm quite new to Haskell myself, but it appears that it's because the second function is parametrized and the first one is not. The nature of a function is that its result depends on the input value, and in the functional paradigm especially it depends ONLY on the input. The obvious implication is that a function with no parameters returns the same value over and over, no matter what.
Apparently there's an optimizing mechanism in the GHC compiler that exploits this fact to compute the value of such a function only once for the whole program runtime. It does it lazily, to be sure, but it does it nonetheless. I noticed it myself when I wrote the following function:
primes = filter isPrime [2..]
  where
    isPrime n = null [factor | factor <- [2..n-1], factor `divides` n]
      where f `divides` n = (n `mod` f) == 0
Then to test it, I entered GHCi and wrote: primes !! 1000. It took a few seconds, but finally I got the answer: 7927. Then I called primes !! 1001 and got the answer instantly. Similarly, I got the result of take 1000 primes in an instant, because Haskell had already computed the whole thousand-element list when it returned the 1001st element earlier.
Thus, if you can write your function so that it takes no parameters, you probably want to do that. ;)

tail recursion recognition

I'm trying to learn Haskell and I stumbled upon the following:
myAdd (x:xs) = x + myAdd xs
myAdd null = 0

f = let n = 10000000 in myAdd [1 .. n]

main = do
  putStrLn (show f)
When compiling with GHC, this yields a stack overflow. As a C/C++ programmer, I would have expected the compiler to do tail call optimization.
I don't like that I would have to "help" the compiler in simple cases like these, but what options are there? I think it is reasonable to require that the calculation given above be done without using O(n) memory, and without deferring to specialized functions.
If I cannot state my problem naturally (even on a toy problem such as this), and expect reasonable performance in terms of time & space, much of the appeal of Haskell would be lost.
Firstly, make sure you're compiling with -O2. It makes a lot of performance problems just go away :)
The first problem I can see is that null is just a variable name there. You want []. It's equivalent here because the only options are x:xs and [], but it won't always be.
The issue here is simple: when you call sum [1,2,3,4], it looks like this:
1 + (2 + (3 + (4 + 0)))
without ever reducing any of these additions to a number, because of Haskell's non-strict semantics. The solution is simple:
myAdd = myAdd' 0
  where myAdd' !total []     = total
        myAdd' !total (x:xs) = myAdd' (total + x) xs
(You'll need {-# LANGUAGE BangPatterns #-} at the top of your source file to compile this.)
This accumulates the addition in another parameter, and is actually tail recursive (yours isn't; + is in tail position rather than myAdd). But in fact, it's not quite tail recursion we care about in Haskell; that distinction is mainly relevant in strict languages. The secret here is the bang pattern on total: it forces it to be evaluated every time myAdd' is called, so no unevaluated additions build up, and it runs in constant space. In this case, GHC can actually figure this out with -O2 thanks to its strictness analysis, but I think it's usually best to be explicit about what you want strict and what you don't.
Note that if addition was lazy, your myAdd definition would work fine; the problem is that you're doing a lazy traversal of the list with a strict operation, which ends up causing the stack overflow. This mostly comes up with arithmetic, which is strict for the standard numeric types (Int, Integer, Float, Double, etc.).
This is quite ugly, and it would be a pain to write something like this every time we want to write a strict fold. Thankfully, Haskell has an abstraction ready for this!
myAdd = foldl' (+) 0
(You'll need to add import Data.List to compile this.)
foldl' (+) 0 [a, b, c, d] is just like (((0 + a) + b) + c) + d, except that at each application of (+) (which is how we refer to the binary operator + as a function value), the value is forced to be evaluated. The resulting code is cleaner, faster, and easier to read (once you know how the list folds work, you can understand any definition written in terms of them easier than a recursive definition).
Basically, the problem here is not that the compiler can't figure out how to make your program efficient — it's that making it as efficient as you like could change its semantics, which an optimisation should never do. Haskell's non-strict semantics certainly pose a learning curve to programmers in more "traditional" languages like C, but it gets easier over time, and once you see the power and abstraction that Haskell's non-strictness offers, you'll never want to go back :)
Expanding the example ehird hinted at in the comments:
data Peano = Z | S Peano
  deriving (Eq, Show)

instance Ord Peano where
  compare (S a) (S b) = compare a b
  compare Z Z = EQ
  compare Z _ = LT
  compare _ _ = GT
instance Num Peano where
  Z + n = n
  (S a) + n = S (a + n)
  -- omit others
  fromInteger 0 = Z
  fromInteger n
    | n < 0 = error "Peano: fromInteger requires non-negative argument"
    | otherwise = S (fromInteger (n-1))

instance Enum Peano where
  succ = S
  pred (S a) = a
  pred _ = error "Peano: no predecessor"
  toEnum n
    | n < 0 = error "toEnum: invalid argument"
    | otherwise = fromInteger (toInteger n)
  fromEnum Z = 0
  fromEnum (S a) = 1 + fromEnum a
  enumFrom = iterate S
  enumFromTo a b = takeWhile (<= b) $ enumFrom a
  -- omit others
infinity :: Peano
infinity = S infinity
result :: Bool
result = 3 < myAdd [1 .. infinity]
result is True by the definition of myAdd, but if the compiler transformed it into a tail-recursive loop, it wouldn't terminate. So that transformation is not only a change in efficiency but also in semantics, hence a compiler must not do it.
A little funny example regarding "The issue is why the compiler is unable to optimize something that appears to be rather trivial to optimize."
Let's say I'm coming from Haskell to C++. I used to write foldr because in Haskell foldr is usually more effective than foldl because of laziness and list fusion.
So I'm trying to write a foldr for a (single-linked) list in C and complaining why it's grossly inefficient:
typedef struct node {
    int value;
    struct node *next;
} node;

int foldr(int (*f)(int, int), int base, node *list)
{
    return list == NULL
        ? base
        : f(list->value, foldr(f, base, list->next));
}
It is inefficient not because the C compiler in question is an unrealistic toy tool developed by ivory tower theorists for their own satisfaction, but because the code in question is grossly non-idiomatic for C.
It is not the case that you cannot write an efficient foldr in C: you just need a doubly-linked list. Similarly, in Haskell you can write an efficient foldl: you just need strictness annotations for foldl to be efficient. The standard library provides both foldl (without annotations) and foldl' (with annotations).
The idea of left-folding a list in Haskell is the same kind of perversion as the desire to iterate a singly-linked list backwards using recursion in C. The compiler is there to help normal people, not perverts, lol.
Just as your C++ projects probably don't have code iterating singly-linked lists backwards, my HNC project contains only one foldl, which I incorrectly wrote before I had mastered Haskell well enough. You hardly ever need foldl in Haskell.
You must unlearn the idea that forward iteration is natural and fastest, and learn that backward iteration is. Forward iteration (left folding) does not do what you intend until you annotate: it does three passes - list creation, thunk-chain buildup and thunk evaluation - instead of two (list creation and list traversal). Note that in an immutable world lists can only be created efficiently backwards: a : b is O(1), and a ++ [b] is O(N).
And backward iteration doesn't do what you intend either: it does one pass instead of the three you might expect from your C background. It doesn't create a list, traverse it to the bottom and then traverse it backwards (2 passes) - it traverses the list as it creates it, that is, 1 pass. With optimizations on, it is just a loop - no actual list elements are created. With optimizations off, it is still an O(1) space operation with a bigger constant overhead, but the explanation is a bit longer.
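A small illustration of that "one pass" claim (the function name is made up for this sketch): because the combining function here is lazy in its second argument, foldr consumes elements as the producer creates them, can stop early, and runs in constant space.

firstOver :: Int -> [Int] -> Maybe Int
firstOver k = foldr (\x r -> if x > k then Just x else r) Nothing

main :: IO ()
main = print (firstOver 100 [1 ..])  -- prints Just 101 instantly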
So there are two things I will address about your problem: firstly the performance problem, and secondly the expressiveness problem of having to help the compiler with something that seems trivial.
The performance
The thing is that your program is in fact not tail-recursive; that is, there is no single function call that can replace the recursion. Let's have a look at what happens when we expand myAdd [1..3]:
myAdd [1,2,3]
1 + myAdd [2,3]
1 + 2 + myAdd [3]
As you can see, at any given step we cannot replace the recursion with a single function call. We could simplify the expression by reducing 1 + 2 to 3, but that is not what tail recursion is about.
So here is a version that is tail recursive:
myAdd2 = go 0
  where go a []     = a
        go a (x:xs) = go (a + x) xs
Let's have a look at how go 0 [1,2,3] is evaluated:
go 0 [1,2,3]
go (1 + 0) [2,3]
go ((1 + 0) + 2) [3]
As you can see, at every step we only need to keep track of one function call, and as long as the first parameter is evaluated strictly we should not get a space blowup from a growing chain of thunks. In fact, if you compile with optimization (-O1 or -O2), GHC is smart enough to figure that out on its own.
Expressiveness
Alright, so it is a bit harder to reason about performance in Haskell, but most of the time you don't have to. The thing is that you can use combinators that ensure efficiency. The particular pattern above is captured by foldl (and its strict cousin foldl'), so myAdd can be written as:
myAdd = foldl (+) 0
and if you compile that with optimization it will not give you a space blowup!

Haskell: why the convention to name a helper function "go"?

I see go a lot when reading Haskell material or source, but I've never been really comfortable with it (I guess it has the negative connotation of "goto" in my mind). I started learning Haskell with LYAH, and that's where I picked up the tendency to use acc and step when writing folds. Where does the convention of writing go come from?
Most importantly, what exactly is the name go supposed to imply?
Hmm! Some archaeology!
Since around 2004 I've used go as the generic name for tail-recursive worker loops, when doing a worker/wrapper transformation of a recursive function. I started using it widely in bytestring, e.g.
foldr :: (Word8 -> a -> a) -> a -> ByteString -> a
foldr k v (PS x s l) = inlinePerformIO $ withForeignPtr x $ \ptr ->
        go v (ptr `plusPtr` (s+l-1)) (ptr `plusPtr` (s-1))
    where
        STRICT3(go)
        go z p q | p == q    = return z
                 | otherwise = do c <- peek p
                                  go (c `k` z) (p `plusPtr` (-1)) q -- tail recursive
{-# INLINE foldr #-}
This was from bytestring in August 2005.
This got written up in RWH, and probably was popularized from there. Also, in the stream fusion library, Duncan Coutts and I started doing it a lot.
From the GHC sources
The idiom goes back further though. foldr in GHC.Base is given as:
foldr k z = go
  where
    go []     = z
    go (y:ys) = y `k` go ys
which is probably where I picked up the trick (I'd thought this was from Andy Gill's thesis, but can't find any use of go there). It isn't given in this form in Gofer, so I think this first appeared in the GHC code base.
By 2001, Simon Marlow was using go in some of the systems-level code, so we might place the blame somewhere in GHC, and this clue leads us to the GHC source, where go is widely used in worker functions:
myCollectBinders expr
  = go [] expr
  where
    go bs (Lam b e)          = go (b:bs) e
    go bs e@(Note (SCC _) _) = (reverse bs, e)
    go bs (Cast e _)         = go bs e
    go bs (Note _ e)         = go bs e
    go bs e                  = (reverse bs, e)
GHC 3.02 and Glasgow
Digging up old versions of GHC, we see that in GHC 0.29 this idiom does not appear, but by the GHC 3.02 series (1998) the go idiom appears everywhere. An example, in Numeric.lhs, in the definition of showInt, dated to 1996-1997:
showInt n r
  | n < 0     = error "Numeric.showInt: can't show negative numbers"
  | otherwise = go n r
  where
    go n r =
      case quotRem n 10 of { (n', d) ->
      case chr (ord_0 + fromIntegral d) of { C# c# -> -- stricter than necessary
        let
          r' = C# c# : r
        in
          if n' == 0 then r' else go n' r'
      }}
This is a different implementation from the one given in the H98 report. Digging into the implementation of Numeric.lhs, however, we find that it isn't the same as the version that was added to GHC 2.06 in 1997, and a very interesting patch from Sigbjorn Finne appears in April 1998, adding a go loop to Numeric.lhs.
This says that at least by 1998, Sigbjorn was adding go loops to the GHC "std" library, while simultaneously many modules in the GHC compiler core had go loops. Digging further, a very interesting commit from Will Partain in July 1996 adds a "go" loop into GHC -- though the code comes from Simon PJ!
So I'm going to call this a Glasgow idiom, invented by the people who worked on GHC at Glasgow in the mid-90s, such as Simon Marlow, Sigbjorn Finne, Will Partain and Simon Peyton Jones.
Obviously Don's answer is the correct one. Let me just add in a small detail (since it seems to be my writing that you're directly referring to): go is nice because it's only two letters.
Oh, and the reason the Yesod book devotes so much content to the enumerator package is because I'd already written up the three-part tutorial of enumerator as a blog post series, so decided I may as well include it in the book. The enumerator package is used in a number of places throughout Yesod, so it is relevant.
I'd expect this idiom to be applicable not just to linear structures (and hence "loops"), but also to branching (tree-like) structures.
I wonder how often the go pattern corresponds to accumulating parameters and, more generally, to the continuation-encoding strategies that Mitch Wand explored in the paper Continuation-Based Program Transformation Strategies (one of my all-time favorite papers).
In these cases, the go function has a particular meaning, which can then be used to derive efficient code from an elegant specification.
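For instance (a sketch of mine, not from the thread), the same worker/wrapper shape applies directly to a branching structure:

{-# LANGUAGE BangPatterns #-}

data Tree a = Leaf | Node (Tree a) a (Tree a)

-- The wrapper fixes the public interface; the local go worker
-- threads a strict accumulator through both branches.
sumTree :: Num a => Tree a -> a
sumTree = go 0
  where
    go !acc Leaf         = acc
    go !acc (Node l x r) = go (go (acc + x) l) r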
