Memoized Collatz sequence - haskell

I've posted the same question on CodeReview but failed to get an answer, so I am trying my luck here on SO.
Here is one of my programs, which uses memoization and an array to improve performance and memory usage. The performance seems satisfactory, but the memory usage is ridiculous and I can't figure out what's wrong:
{-# LANGUAGE BangPatterns #-}
import Data.Functor
import Data.Array (Array)
import qualified Data.Array as Arr
import Control.DeepSeq
genColtzArr n = collatzArr
  where
    collatzArr = Arr.array (1, n) $ take n $ map (\v -> (v, collatz v 0)) [1..]
    collatz 1 !acc = 1 + acc
    collatz !m !acc
      | even m    = go (m `div` 2) acc
      | otherwise = go (3 * m + 1) acc
      where
        go !l !acc
          | l <= n    = let !v = collatzArr Arr.! l in 1 + acc + v
          | otherwise = collatz l $ 1 + acc
collatz here refers to the usual Collatz-conjecture step function. This function is supposed to receive a number n and return an array indexed from 1 to n, in which each cell contains the length of the Collatz chain from its index down to 1.
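For example, in ghci (the chain length counts every number in the sequence, both endpoints included, so 3 -> 10 -> 5 -> 16 -> 8 -> 4 -> 2 -> 1 gives length 8):
> Arr.elems (genColtzArr 6)
[1,2,8,3,6,9]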
But the memory usage of this method is very high. Here is the profiler result (GHC options -prof -fprof-auto -rtsopts, runtime option +RTS -p, n == 500000):
total alloc = 730,636,136 bytes  (excludes profiling overheads)

COST CENTRE             MODULE  %time  %alloc
genColtzArr.collatz     Main     40.4    34.7
genColtzArr.collatz.go  Main     25.5    14.4

                                                     individual        inherited
COST CENTRE               MODULE  no.  entries   %time  %alloc    %time  %alloc
genColtzArr               Main    105        1     0.0     0.0     74.7    72.1
genColtzArr.collatzArr    Main    106        1     8.0    20.8     74.7    72.1
genColtzArr.collatzArr.\  Main    107   500000     0.9     2.2     66.8    51.3
genColtzArr.collatz       Main    109  1182582    40.4    34.7     65.9    49.1
genColtzArr.collatz.go    Main    110  1182581    25.5    14.4     25.5    14.4
Please note that -O2 is not a desired answer. I want to figure out what the problem in this program is, and in general, how to spot time and memory inefficiencies in Haskell code. Specifically, I have no idea why this code, with tail recursion and bang patterns, can consume so much memory.
Update 1:
The same code run with +RTS -s produces this:
1,347,869,264 bytes allocated in the heap
595,901,528 bytes copied during GC
172,105,056 bytes maximum residency (7 sample(s))
897,704 bytes maximum slop
315 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 2408 colls, 0 par 0.412s 0.427s 0.0002s 0.0075s
Gen 1 7 colls, 0 par 0.440s 0.531s 0.0759s 0.1835s
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.828s ( 0.816s elapsed)
GC time 0.852s ( 0.958s elapsed)
RP time 0.000s ( 0.000s elapsed)
PROF time 0.000s ( 0.000s elapsed)
EXIT time 0.004s ( 0.017s elapsed)
Total time 1.684s ( 1.791s elapsed)
%GC time 50.6% (53.5% elapsed)
Alloc rate 1,627,861,429 bytes per MUT second
Productivity 49.4% of total user, 46.4% of total elapsed
So it takes ~300 MB; that is still too large.
Update 2:
Full code:
{-# LANGUAGE BangPatterns #-}
import Data.Functor
import Data.Array (Array)
import qualified Data.Array as Arr
import Control.DeepSeq
genColtzArr n = collatzArr
  where
    collatzArr = Arr.array (1, n) $ take n $ map (\v -> (v, collatz v 0)) [1..]
    collatz 1 !acc = 1 + acc
    collatz !m !acc
      | even m    = go (m `div` 2) acc
      | otherwise = go (3 * m + 1) acc
      where
        go !l !acc
          | l <= n    = let !v = collatzArr Arr.! l in 1 + acc + v
          | otherwise = collatz l $ 1 + acc

genLongestArr n = Arr.array (1, n) llist
  where
    colatz = genColtzArr n
    llist  = (1, 1) : zipWith (\(n1, a1) l2 ->
                        let l1 = colatz Arr.! a1
                        in (n1 + 1, if l2 < l1 then a1 else n1 + 1))
                      llist (tail $ Arr.elems colatz)

main :: IO ()
main = getLine >> do
  ns <- map read <$> lines <$> getContents
  let m   = maximum ns
      lar = genLongestArr m
      iter []    = return ()
      iter (h:t) = (putStrLn $ show $ lar Arr.! h) >> iter t
  iter ns

As the other answer on CodeReview hints, it's alright for a 500000-element boxed array to consume ~20MB of memory; however, it's not only the array but a lot of things all together:
Although you put bang patterns everywhere, array initialization itself is a lazy foldr:
-- from GHC.Arr
array (l,u) ies =
  let n = safeRangeSize (l,u)
  in unsafeArray' (l,u) n [(safeIndex (l,u) n i, e) | (i, e) <- ies]

unsafeArray' :: Ix i => (i,i) -> Int -> [(Int, e)] -> Array i e
unsafeArray' (l,u) n@(I# n#) ies = runST (ST $ \s1# ->
  case newArray# n# arrEleBottom s1# of
    (# s2#, marr# #) ->
      foldr (fill marr#) (done l u n marr#) ies s2#)
So unless you have evaluated the last bit of the array, it holds a reference to the list used in initialization. Usually the list can be GC'd on the fly while you evaluate the array, but in your case the mutual references and self-references disturbed the common GC pattern:
- llist is self-referencing to produce every single element, so it will not be GC'd until you have evaluated its last element
- it also holds a reference to genColtzArr, so genColtzArr won't be GC'd until llist is fully evaluated
- you might think collatz is tail recursive, but it's not: it's mutually recursive with collatzArr, so again both of them won't be GC'd until fully evaluated
Everything combined, your program keeps three 500000-element list-like structures in memory, resulting in an ~80MB peak heap size.
Solution
The obvious solution is to force every array / list to normal form before it's used in another, so you won't keep multiple copies of the same data in memory.
genLongestArr :: Int -> Array Int Int
genLongestArr n =
  let collatz = genColtzArr n
      -- deepseq collatz before mapping over it;
      -- this is equivalent to your recursive definition
      -- (scanl' comes from Data.List)
  in collatz `deepseq`
       (Arr.listArray (1,n) $ fmap fst $ scanl' (maxWith snd) (0, 0) $ Arr.assocs collatz)
maxWith :: Ord a => (b -> a) -> b -> b -> b
maxWith f b b' = case compare (f b) (f b') of
  LT -> b'
  _  -> b
And in main:
-- deepseq lar before mapping over it
-- this is equivalent to your iter loop
lar `deepseq` mapM_ (print . (lar Arr.!)) ns
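Pieced together, the revised main might look like this (a sketch assuming the same input format as the original: first line skipped, remaining lines read as queries):
main :: IO ()
main = do
  _  <- getLine
  ns <- map read . lines <$> getContents
  let lar = genLongestArr (maximum ns)
  -- force the whole array once, then answer queries against it
  lar `deepseq` mapM_ (print . (lar Arr.!)) ns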
Nothing can be done with genColtzArr: it uses itself for memoization, so the mutual recursion is kind of necessary.
Now the heap graph peaks at ~20MB, as it should.
(Disclaimer: All programs in this answer were compiled with -O0)

Related

Finding the size of a list that's too big for memory?

Brand new Haskell programmer here. Just finished "Learn You a Haskell"... I'm interested in how large a set with some particular properties is. I have working code for some small parameter values, but I'd like to know how to deal with larger structures. I know Haskell can do "infinite data structures", but I'm just not seeing how to get there from where I'm at, and Learn You a Haskell / Google isn't getting me over this.
Here's the working code for my eSet given "small" arguments r and t:
import Control.Monad
import System.Environment
import System.Exit
myPred :: [Int] -> Bool
myPred a = myPred' [] a
  where
    myPred' [] []  = False
    myPred' [] [0] = True
    myPred' _  []  = True
    myPred' acc (0:aTail) = myPred' acc aTail
    myPred' acc (a:aTail)
      | a `elem` acc = False
      | otherwise    = myPred' (a:acc) aTail

superSet :: Int -> Int -> [[Int]]
superSet r t = replicateM r [0..t]

eSet :: Int -> Int -> [[Int]]
eSet r t = filter myPred $ superSet r t

main :: IO ()
main = do
  args <- getArgs
  case args of
    [rArg, tArg] ->
      print $ length $ eSet (read rArg) (read tArg)
    [rArg, tArg, "set"] ->
      print $ eSet (read rArg) (read tArg)
    _ ->
      die "Usage: eSet r t [set]  ('set' optionally prints the set itself, otherwise just its size)"
When compiled/run I get
$ ghc eSet.hs -rtsopts
[1 of 1] Compiling Main ( eSet.hs, eSet.o )
Linking eSet ...
$ # Here's a tiny eSet to illustrate. It is the set of lists of r integers from zero to t with no repeated nonzero list entries
$ ./eSet 4 2 set
[[0,0,0,0],[0,0,0,1],[0,0,0,2],[0,0,1,0],[0,0,1,2],[0,0,2,0],[0,0,2,1],[0,1,0,0],[0,1,0,2],[0,1,2,0],[0,2,0,0],[0,2,0,1],[0,2,1,0],[1,0,0,0],[1,0,0,2],[1,0,2,0],[1,2,0,0],[2,0,0,0],[2,0,0,1],[2,0,1,0],[2,1,0,0]]
$ ./eSet 8 4 +RTS -sstderr
3393
174,406,136 bytes allocated in the heap
29,061,152 bytes copied during GC
4,382,568 bytes maximum residency (7 sample(s))
148,664 bytes maximum slop
14 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 328 colls, 0 par 0.047s 0.047s 0.0001s 0.0009s
Gen 1 7 colls, 0 par 0.055s 0.055s 0.0079s 0.0147s
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.298s ( 0.301s elapsed)
GC time 0.102s ( 0.102s elapsed)
EXIT time 0.001s ( 0.001s elapsed)
Total time 0.406s ( 0.405s elapsed)
%GC time 25.1% (25.2% elapsed)
Alloc rate 585,308,888 bytes per MUT second
Productivity 74.8% of total user, 75.0% of total elapsed
$ ./eSet 10 5 +RTS -sstderr
63591
27,478,010,744 bytes allocated in the heap
4,638,903,384 bytes copied during GC
532,163,096 bytes maximum residency (15 sample(s))
16,500,072 bytes maximum slop
1556 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 52656 colls, 0 par 6.865s 6.864s 0.0001s 0.0055s
Gen 1 15 colls, 0 par 8.853s 8.997s 0.5998s 1.8617s
INIT time 0.000s ( 0.000s elapsed)
MUT time 52.652s ( 52.796s elapsed)
GC time 15.717s ( 15.861s elapsed)
EXIT time 0.193s ( 0.211s elapsed)
Total time 68.564s ( 68.868s elapsed)
%GC time 22.9% (23.0% elapsed)
Alloc rate 521,883,277 bytes per MUT second
Productivity 77.1% of total user, 76.7% of total elapsed
I see my memory usage is very high and there's a lot of garbage collecting. When running eSet 12 6 I get a Segmentation fault.
I feel like filter myPred $ superSet r t is keeping me from lazily making the subset one part at a time so I can deal with much larger (but finite) sets, but I don't know how to change to another approach that would do that. I think that's the root of my question.
Also, as this is my first Haskell program, points on style and how to achieve the Haskell analog of "pythonic" are much appreciated!
I suspect the culprit here is replicateM, which has this implementation:
replicateM cnt0 f =
    loop cnt0
  where
    loop cnt
      | cnt <= 0  = pure []
      | otherwise = liftA2 (:) f (loop (cnt - 1))
The problem line is liftA2 (:) f (loop (cnt - 1)); probably loop (cnt - 1) is getting shared among all the calls to (:) partially applied to elements of f, and so loop (cnt - 1) must be kept in memory. Unfortunately loop (cnt - 1) is quite a long list...
It can be a bit fiddly to convince GHC not to share something. The following redefinition of superSet gives me a nice flat memory usage; it will probably be a bit slower on examples that do fit in memory, of course. The key idea is to make it look to the untrained eye (i.e. GHC) like the recursive monadic action depends on the choices made earlier, even though it doesn't.
superSet :: Int -> Int -> [[Int]]
superSet r t = go r 0 where
  go 0 ignored = if ignored == 0 then [[]] else [[]]
  go r ignored = do
    x  <- [0..t]
    xs <- go (r-1) (ignored+x)
    return (x:xs)
If you don't mind avoiding optimizations, the more natural definition also works:
superSet 0 t = [[]]
superSet r t = do
  x  <- [0..t]
  xs <- superSet (r-1) t
  return (x:xs)
...but with -O2 GHC is too clever and notices that it can share the recursive call.
A completely alternate approach is to just do a little bit of combinatorics analysis. Here's the process that eSet r t does, as near as I can make out:
1. Choose at most r elements without replacement from a set of size t.
2. Pad the sequence to a length of r by interleaving a sentinel value.
So let's just count the ways of doing each of these steps, rather than actually performing them. We'll introduce a new parameter, s, which is the length of the sequence produced by step (1) (and which we therefore know has s <= r and s <= t). How many sequences of size s are there when drawing elements without replacement from a set of size t? Well, there are t choices for the first element, t-1 choices for the second element, t-2 choices for the third element, ...
-- sequencesWithoutReplacement is a very long name!
seqWORepSize :: Integer -> Integer -> Integer
seqWORepSize s t = product [t-s+1 .. t]
Then we want to pad the sequence out to a length of r. We're going to choose s positions in the r-long sequence to be drawn from our sequence, and the remainder will be sentinels. How many ways are there to do that? This one is a well-known combinatorics operator called choose.
choose :: Integer -> Integer -> Integer
choose r s = product [r-s+1 .. r] `div` product [2 .. s]
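As a quick ghci sanity check of both counting functions: there are 12 ordered draws of 2 elements from a 4-element set without replacement, and 6 ways to choose 2 of 4 positions:
> seqWORepSize 2 4
12
> 4 `choose` 2
6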
The number of ways to produce a padded sequence of a given length is just the product of these two numbers, since the choices of "what values to insert" and "where to insert values" can be made completely independently.
paddedSeqSize :: Integer -> Integer -> Integer -> Integer
paddedSeqSize r s t = seqWORepSize s t * (r `choose` s)
And now we're pretty much done. Just iterate over all possible sequence lengths and add up the paddedSeqSize.
eSetSize :: Integer -> Integer -> Integer
eSetSize r t = sum $ map (\s -> paddedSeqSize r s t) [0..r]
We can try it out in ghci:
> :set +s
> map length $ [eSet 1 1, eSet 4 4, eSet 6 4, eSet 4 6]
[2,209,1045,1045]
(0.13 secs, 26,924,264 bytes)
> [eSetSize 1 1, eSetSize 4 4, eSetSize 6 4, eSetSize 4 6]
[2,209,1045,1045]
(0.01 secs, 120,272 bytes)
This way is significantly faster and significantly more memory-friendly. Indeed, we can make queries and get answers about eSets that we would never be able to count the size of one-by-one, e.g.
> length . show $ eSetSize 1000 1000
2594
(0.26 secs, 909,746,448 bytes)
Good luck counting to 10^2594 one at a time. =P
Edit
This idea can also be adapted to produce the padded sequences themselves rather than just counting how many there are. First, a handy utility that I find myself defining over and over for picking out individual elements of a list and reporting on the leftovers:
zippers :: [a] -> [([a], a, [a])]
zippers = go [] where
  go ls []     = []
  go ls (h:rs) = (ls, h, rs) : go (h:ls) rs
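For example, each element is paired with everything to its left (in reversed order, which is harmless here) and everything to its right:
> zippers [1,2,3]
[([],1,[2,3]),([1],2,[3]),([2,1],3,[])]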
Now, the sequences without replacement can be done by repeatedly choosing a single element from the leftovers.
seqWORep :: Int -> [a] -> [[a]]
seqWORep 0 _ = [[]]
seqWORep n xs = do
  (ls, y, rs) <- zippers xs
  ys <- seqWORep (n-1) (ls++rs)
  return (y:ys)
Once we have a sequence, we can pad it to a desired size by producing all the interleavings of the sentinel value as follows:
interleavings :: Int -> a -> [a] -> [[a]]
interleavings 0 _ xs = [xs]
interleavings n z [] = [replicate n z]
interleavings n z xs@(x:xt) = map (z:) (interleavings (n-1) z xs)
                           ++ map (x:) (interleavings n z xt)
And finally, the top-level function just delegates to seqWORep and interleavings.
eSet :: Int -> Int -> [[Int]]
eSet r t = do
  s  <- [0..r]
  xs <- seqWORep s [1..t]
  interleavings (r-s) 0 xs
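As a consistency check in ghci, this version agrees in size with the filter-based eSet from the question (whose 4 2 example printed above has 21 elements) and with eSetSize, though the element order may differ:
> length (eSet 4 2)
21
> eSetSize 4 2
21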
In my tests this modified eSet has nice flat memory usage both with and without optimizations; it does not generate any spurious elements that need to be filtered out later, and so it is faster than your original proposal; and to me it looks like quite a natural definition compared to the answer that relies on tricking GHC. A nice collection of properties!
After re-reading parts of LYaH and thinking about @Daniel Wagner's answer, composing monadically sounded like it would be worthwhile to try. The new code's total memory stays flat and works both with and without the -O2 optimization.
Source:
import Control.Monad
import System.Environment
import System.Exit
allowed :: [Int] -> Bool
allowed a = allowed' [] a
  where
    allowed' [ ] [ ] = False
    allowed' [ ] [0] = True
    allowed' _   [ ] = True
    allowed' acc (0:aTail) = allowed' acc aTail
    allowed' acc (a:aTail)
      | a `elem` acc = False
      | otherwise    = allowed' (a:acc) aTail

branch :: Int -> [Int] -> [[Int]]
branch t x = filter allowed [n:x | n <- [0..t]]

eSet :: Int -> Int -> [[Int]]
eSet r t = return [] >>= foldr (<=<) return (replicate r (branch t))
main :: IO ()
main = do
  args <- getArgs
  case args of
    [rArg, tArg] ->
      print $ length $ eSet (read rArg) (read tArg)
    [rArg, tArg, "set"] ->
      print $ eSet (read rArg) (read tArg)
    _ -> die "Usage: eSet r t set <set optional>"
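For intuition about why this stays flat, here is a hand expansion for r = 2 (eSet2 is a hypothetical helper, not part of the program). Since f <=< return == f, the fold collapses to a chain of branch steps, and because branch filters with allowed at every step, illegal prefixes are pruned before they can fan out into longer lists:
-- relies on branch from the code above; <=< comes from Control.Monad
eSet2 :: Int -> [[Int]]
eSet2 t = return [] >>= (branch t <=< branch t)
-- == branch t [] >>= branch t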
The version with monadic function composition tests much faster and without the memory issues...
$ ./eSetMonad 10 5 +RTS -sstderr
63591
289,726,000 bytes allocated in the heap
997,968 bytes copied during GC
63,360 bytes maximum residency (2 sample(s))
24,704 bytes maximum slop
2 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 553 colls, 0 par 0.008s 0.008s 0.0000s 0.0002s
Gen 1 2 colls, 0 par 0.000s 0.000s 0.0002s 0.0003s
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.426s ( 0.429s elapsed)
GC time 0.009s ( 0.009s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 0.439s ( 0.438s elapsed)
%GC time 2.0% (2.0% elapsed)
Alloc rate 680,079,893 bytes per MUT second
Productivity 98.0% of total user, 98.3% of total elapsed

How to explain this profiling data of a Haskell random generator with huge memory usage and low speed?

I want to profile the speed of Haskell's random generator. My test case is to generate 1000000 double-precision random numbers ranging from zero to one and calculate their sum. Here is my code:
import System.Random
import System.Environment
import Control.Monad.State
import Data.Time.Clock
type Seed = Int
intDayTime :: IO Int
intDayTime = getCurrentTime >>= return.(floor.utctDayTime :: UTCTime->Int)
n = 1000000 :: Int
main :: IO ()
main = do
  calc <- getArgs >>= return . (read :: (String->Int)) . head
  seed <- intDayTime
  let calcit :: Int -> Double
      calcit 1 = calc1 n seed
      calcit 2 = calc2 n (mkStdGen seed)
      calcit _ = error "error calculating"
   in print $ calcit calc
calc1 :: Int -> Seed -> Double
calc1 n initSeed =
  let next :: Seed -> (Double, Seed)  -- my simple random number generator, just for test
      myRandGen :: State Seed Double
      calc :: Int -> Int -> Seed -> Double -> Double
      next seed = let x = (1103515245 * seed + 12345) `mod` 1073741824
                  in (fromIntegral x / 1073741824, x)
      myRandGen = state next
      calc _ 0 _ r = r
      calc n c s r = calc n (c-1) ns (r + nv)
        where (nv, ns) = runState myRandGen s
  in calc n n initSeed 0
calc2 :: Int -> StdGen -> Double
calc2 n initSeed =
  let myRandGen :: State StdGen Double
      calc :: Int -> Int -> StdGen -> Double -> Double
      next :: StdGen -> (Double, StdGen)
      next gen = randomR (0,1) gen
      myRandGen = state next
      calc _ 0 _ r = r
      calc n c s r = calc n (c-1) ns (r + nv)
        where (nv, ns) = runState myRandGen s
  in calc n n initSeed 0
and I compile the code with
ghc profRandGen.hs -O3 -prof -fprof-auto -rtsopts
run with
./profRandGen.exe 1 +RTS -p   # for calc1
./profRandGen.exe 2 +RTS -p   # for calc2
and the profiling data for calc1 is
total time = 0.10 secs (105 ticks @ 1000 us, 1 processor)
total alloc = 128,121,344 bytes (excludes profiling overheads)
and the profiling data for calc2 is
total time = 1.48 secs (1479 ticks @ 1000 us, 1 processor)
total alloc = 2,008,077,560 bytes (excludes profiling overheads)
I can understand that the random generator in System.Random would be slower, but why is it so much slower, and why does it allocate so much more memory?
I used tail recursion in my code and compiled with the -O2 -fforce-recomp options, so why didn't I get constant memory usage?
Is there anything wrong with my code? For example, is lazy evaluation preventing the tail recursion from being optimized, so that a lot of memory gets allocated? If so, how could this program be further optimized?
As Daniel Wagner says in the comments, your state accumulator is not strict. First off, try importing Control.Monad.State.Strict; that may itself be enough! Otherwise, you also need to alter myRandGen so that it more thoroughly forces the new generator before replacing it in the state.
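A minimal sketch of both suggestions applied to calc2's generator step, assuming the rest of the code stays the same (note that seq only forces the new StdGen to weak head normal form, which is why a "more thorough" force may be needed depending on the generator's representation):
import Control.Monad.State.Strict  -- strict state bind instead of Control.Monad.State
import System.Random

myRandGen :: State StdGen Double
myRandGen = state $ \gen ->
  let (v, gen') = randomR (0, 1) gen
      -- force the new generator before storing it back in the state
  in gen' `seq` (v, gen')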

What are possible Haskell optimization keys?

I found a benchmark that solves a really simple task in different languages: https://github.com/starius/lang-bench . Here's the code for Haskell:
cmpsum i j k =
  if i + j == k then 1 else 0

main = print (sum([cmpsum i j k |
                   i <- [1..1000], j <- [1..1000], k <- [1..1000]]))
This code runs very slowly, as you can see in the benchmark, and I found this very strange.
I tried to inline the function cmpsum and compile with the following flags:
ghc -c -O2 main.hs
but it really didn't help. I am not asking about optimizing the algorithm, since it's the same for all languages, but about possible compiler or code optimizations that can make this code run faster.
Not a complete answer, sorry. Compiling with GHC 7.10 on my machine I get ~12s for your version.
I'd suggest always compiling with -Wall which shows us that our numbers are being defaulted to the infinite precision Integer type. Fixing that:
module Main where
cmpsum :: Int -> Int -> Int -> Int
cmpsum i j k =
  if i + j == k then 1 else 0

main :: IO ()
main = print (sum([cmpsum i j k |
                   i <- [1..1000], j <- [1..1000], k <- [1..1000]]))
This runs in ~5s for me. Running with +RTS -s seems to show we have a loop in constant memory:
87,180 bytes allocated in the heap
1,704 bytes copied during GC
42,580 bytes maximum residency (1 sample(s))
18,860 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 0 colls, 0 par 0.000s 0.000s 0.0000s 0.0000s
Gen 1 1 colls, 0 par 0.000s 0.000s 0.0001s 0.0001s
INIT time 0.000s ( 0.001s elapsed)
MUT time 4.920s ( 4.919s elapsed)
GC time 0.000s ( 0.000s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 4.920s ( 4.921s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 17,719 bytes per MUT second
Productivity 100.0% of total user, 100.0% of total elapsed
-fllvm shaves off another second or so. Maybe someone else can look into it further.
Edit: Just digging into this a little further. It doesn't look like fusion is happening, even if I change sum to a foldr (+) 0, which is an explicit "good producer/good consumer" pair.
Rec {
$wgo [InlPrag=[0], Occ=LoopBreaker] :: Int# -> Int#
[GblId, Arity=1, Str=DmdType <S,U>]
$wgo =
\ (w :: Int#) ->
let {
$j :: Int# -> Int#
[LclId, Arity=1, Str=DmdType]
$j =
\ (ww [OS=OneShot] :: Int#) ->
letrec {
$wgo1 [InlPrag=[0], Occ=LoopBreaker] :: [Int] -> Int#
[LclId, Arity=1, Str=DmdType <S,1*U>]
$wgo1 =
\ (w1 :: [Int]) ->
case w1 of _ [Occ=Dead] {
[] -> ww;
: y ys ->
case $wgo1 ys of ww1 { __DEFAULT ->
case lvl of _ [Occ=Dead] {
[] -> ww1;
: y1 ys1 ->
case y of _ [Occ=Dead] { I# y2 ->
case y1 of _ [Occ=Dead] { I# y3 ->
case tagToEnum# @ Bool (==# (+# w y2) y3) of _ [Occ=Dead] {
False ->
letrec {
$wgo2 [InlPrag=[0], Occ=LoopBreaker] :: [Int] -> Int#
[LclId, Arity=1, Str=DmdType <S,1*U>]
$wgo2 =
\ (w2 :: [Int]) ->
case w2 of _ [Occ=Dead] {
[] -> ww1;
: y4 ys2 ->
case y4 of _ [Occ=Dead] { I# y5 ->
case tagToEnum# @ Bool (==# (+# w y2) y5) of _ [Occ=Dead] {
False -> $wgo2 ys2;
True -> case $wgo2 ys2 of ww2 { __DEFAULT -> +# 1 ww2 }
}
}
}; } in
$wgo2 ys1;
True ->
letrec {
$wgo2 [InlPrag=[0], Occ=LoopBreaker] :: [Int] -> Int#
[LclId, Arity=1, Str=DmdType <S,1*U>]
$wgo2 =
\ (w2 :: [Int]) ->
case w2 of _ [Occ=Dead] {
[] -> ww1;
: y4 ys2 ->
case y4 of _ [Occ=Dead] { I# y5 ->
case tagToEnum# @ Bool (==# (+# w y2) y5) of _ [Occ=Dead] {
False -> $wgo2 ys2;
True -> case $wgo2 ys2 of ww2 { __DEFAULT -> +# 1 ww2 }
}
}
}; } in
case $wgo2 ys1 of ww2 { __DEFAULT -> +# 1 ww2 }
}
}
}
}
}
}; } in
$wgo1 lvl } in
case w of wild {
__DEFAULT -> case $wgo (+# wild 1) of ww { __DEFAULT -> $j ww };
1000 -> $j 0
}
end Rec }
In fact, looking at the core for print $ foldr (+) (0:: Int) $ [ i+j | i <- [0..10000], j <- [0..10000]] it seems as though only the first layer of the list comprehension is fused. Is that a bug?
This code gets the job done in 1 second with no extra allocation in GHC 7.10 with -O2 (see the bottom for profiling output):
cmpsum :: Int -> Int -> Int -> Int
cmpsum i j k = fromEnum (i+j==k)
main = print $ sum [cmpsum i j k | i <- [1..1000],
                                   j <- [1..const 1000 i],
                                   k <- [1..const 1000 j]]
In GHC 7.8, you can get almost the same results in this case (1.4 seconds) if you add the following at the beginning:
import Prelude hiding (sum)
sum xs = foldr (\x r a -> a `seq` r (a+x)) id xs 0
There are three issues here:
1. Specializing the code to Int instead of letting it default to Integer is crucial.
2. GHC 7.10 offers list fusion for sum that GHC 7.8 does not. This is because the new definition of sum, based on a new definition of foldl, can be very bad in some cases without the "call arity" analysis Joachim Breitner created for GHC 7.10.
3. GHC performs a limited "full laziness" pass very early in compilation, before any inlining occurs. As a result, the constant [1..1000] terms for j and k, which are used multiple times in the loop, get hoisted out of the loop. This would be good if these were actually expensive to calculate, but in this context it's much cheaper to do the additions over and over and over instead of saving the results. What the code above does is trick GHC. Since const isn't inlined until a little bit later, this first full laziness pass doesn't see that the lists are constant, so it doesn't hoist them out. I wrote it this way because it's nice and short, but it is, admittedly, a little on the fragile side. To make it more robust, use phased inlining:
main = print $ sum [cmpsum i j k | i <- [1..1000],
                                   j <- [1..konst 1000 i],
                                   k <- [1..konst 1000 j]]

{-# INLINE [1] konst #-}
konst = const
This guarantees that konst will be inlined in simplifier phase 1, but no earlier. Phase 1 occurs after list fusion is complete, so it's perfectly safe to let GHC see everything then.
51,472 bytes allocated in the heap
3,408 bytes copied during GC
44,312 bytes maximum residency (1 sample(s))
17,128 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 0 colls, 0 par 0.000s 0.000s 0.0000s 0.0000s
Gen 1 1 colls, 0 par 0.000s 0.000s 0.0002s 0.0002s
INIT time 0.000s ( 0.000s elapsed)
MUT time 1.071s ( 1.076s elapsed)
GC time 0.000s ( 0.000s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 1.073s ( 1.077s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 48,059 bytes per MUT second
Productivity 99.9% of total user, 99.6% of total elapsed
You are comparing looping over a single statement to counting by generating an intermediate structure (a list) and folding over it. I don't know how great the performance in Java would be if you created a linked list with a billion elements and iterated over it.
Here is Haskell code which is (approximately) equivalent to your Java code.
{-# LANGUAGE BangPatterns #-}

main = print (loop3 1 1 1 0)

loop1 :: Int -> Int -> Int -> Int -> Int
loop1 !i !j !k !cc
  | k <= 1000 = loop1 i j (k+1) (cc + fromEnum (i + j == k))
  | otherwise = cc

loop2 :: Int -> Int -> Int -> Int -> Int
loop2 !i !j !k !cc
  | j <= 1000 = loop2 i (j+1) k (loop1 i j k cc)
  | otherwise = cc

loop3 :: Int -> Int -> Int -> Int -> Int
loop3 !i !j !k !cc
  | i <= 1000 = loop3 (i+1) j k (loop2 i j k cc)
  | otherwise = cc
And the execution on my machine (test2 is your Haskell code):
$ ghc --make -O2 test1.hs && ghc --make -O2 test2.hs && javac test3.java
$ time ./test1.exe && time ./test2.exe && time java test3
499500
real 0m1.614s
user 0m0.000s
sys 0m0.000s
499500
real 0m35.922s
user 0m0.000s
sys 0m0.000s
499500
real 0m1.589s
user 0m0.000s
sys 0m0.015s

Space leak when grouping key/value pairs in Haskell

I have a problem where my code creates too many thunks (over 270MB) and consequently spends way too much time (over 70%) in GC when grouping values by key. I was wondering what the best way is to group values by key.
The problem is that I have keys and values represented by vectors, and I want to group the values by key, preserving the order. For example:
Input:
keys = 1 2 4 3 1 3 4 2 1
vals = 1 2 3 4 5 6 7 8 9
Output:
1 = 1,5,9
2 = 2,8
3 = 4,6
4 = 3,7
Compile options:
ghc --make -O3 -fllvm histogram.hs
In imperative programming, I would just use a multimap, so I decided to use a hash table where the associated value is [Int] to store the grouped values. I am hoping there is a much better FP solution.
{-# LANGUAGE BangPatterns #-}
import qualified Data.HashMap.Strict as M
import qualified Data.Vector.Unboxed as V
n :: Int
n = 5000000

kv :: V.Vector (Int,Int)
kv = V.zip k v
  where
    k = V.generate n (\i -> i `mod` 1000)
    v = V.generate n (\i -> i)

ts :: V.Vector (Int,Int) -> M.HashMap Int Int
ts vec = V.foldl' (\ht (k, v) -> M.insertWith (+) k v ht) M.empty vec

ts2 :: V.Vector (Int,Int) -> M.HashMap Int [Int]
ts2 vec = V.foldl' (\ht (!k, !v) -> M.insertWith (++) k [v] ht) M.empty vec

main :: IO ()
main = ts2 kv `seq` putStrLn "done"
Here's what it spits out at runtime:
3,117,102,992 bytes allocated in the heap
1,847,205,880 bytes copied during GC
324,159,752 bytes maximum residency (12 sample(s))
6,502,224 bytes maximum slop
658 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 5991 colls, 0 par 0.58s 0.58s 0.0001s 0.0003s
Gen 1 12 colls, 0 par 0.69s 0.69s 0.0577s 0.3070s
INIT time 0.00s ( 0.00s elapsed)
MUT time 0.45s ( 0.45s elapsed)
GC time 1.27s ( 1.27s elapsed)
EXIT time 0.03s ( 0.03s elapsed)
Total time 1.75s ( 1.75s elapsed)
%GC time 72.7% (72.8% elapsed)
Alloc rate 6,933,912,935 bytes per MUT second
Productivity 27.3% of total user, 27.3% of total elapsed
You can see it spends a lot of time in GC, so I decided to use bangs to make the list concatenation strict. I guess the ++ is quite expensive too, but I don't know a workaround for this.
Those strictness annotations are useless. They're forcing only the first constructor of the lists.
Even worse, it appears you're attempting to left fold (++), which is never a good idea. It results in lots of useless copying of intermediate lists, even when it's made fully strict.
You should fold to a [Int] -> [Int] value instead. That will get rid of the multiple useless allocations. I'm on mobile, so I can't really provide full example code. The main idea is that you change the loop to M.insertWith (.) k (v:) and then map ($ []) over the values in the HashMap after the fold.
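A sketch of that idea, reconstructed from the fragments above (ts2' is a hypothetical name; note that (.) composes the newest prepend on the outside, so each group comes out newest-first, just like the original code - use flip (.) if you want insertion order):
import qualified Data.HashMap.Strict as M
import qualified Data.Vector.Unboxed as V

ts2' :: V.Vector (Int, Int) -> M.HashMap Int [Int]
ts2' vec = M.map ($ []) $ V.foldl' step M.empty vec
  where
    -- accumulate a difference list ([Int] -> [Int]) per key; composing
    -- the new prepend with (.) is O(1), unlike copying with (++)
    step ht (k, v) = M.insertWith (.) k (v:) ht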
The bulk of your problem is due to (++) leading to "lots of useless copying of intermediate lists", as Carl puts it in his answer. Having played with a few different approaches at replacing (++), I got the best results thus far by switching to Data.IntMap.Strict from containers (just to take advantage of the less stern API - I don't know which implementation is more efficient per se) and using its alter function to prepend the vector elements without creating singleton lists:
import qualified Data.IntMap.Strict as M
import qualified Data.Vector.Unboxed as V
n :: Int
n = 5000000

kv :: V.Vector (Int,Int)
kv = V.zip k v
  where
    k = V.generate n (\i -> i `mod` 1000)
    v = V.generate n (\i -> i)

ts2 :: V.Vector (Int,Int) -> M.IntMap [Int]
ts2 vec = V.foldl' (\ht (k, v) -> M.alter (prep v) k ht) M.empty vec
  where
    prep x = Just . maybe [x] (x:)

main :: IO ()
main = print $ M.foldl' (+) 0 $ M.map length $ ts2 kv
The second best solution was using
\ht (k, v) -> M.insertWith (\(x:_) -> (x :)) k [v] ht
as the fold operator. That works with both Data.IntMap.Strict and Data.HashMap.Strict, with similar results performance-wise.
N.B.: in all cases, your original implementation included, the vector elements are being prepended, rather than appended, to the lists. Your problems would be much more serious if you were appending the elements, as repeatedly appending to the end of a list with (++) is quadratic in the number of elements.
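If insertion order matters for the output (as in the question's example), one cheap fix is a single linear pass at the end that reverses each group, building on the ts2 defined above (tsOrdered is a hypothetical name):
-- reuses M (Data.IntMap.Strict), V, and ts2 from the snippet above
tsOrdered :: V.Vector (Int, Int) -> M.IntMap [Int]
tsOrdered = M.map reverse . ts2  -- one O(length) reverse per key after the fold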
I tried to run your code on my host and I am not able to reproduce your profile:
runhaskell test8.hs +RTS -sstderr
done
120,112 bytes allocated in the heap
3,520 bytes copied during GC
68,968 bytes maximum residency (1 sample(s))
12,952 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 0 colls, 0 par 0.00s 0.00s 0.0000s 0.0000s
Gen 1 1 colls, 0 par 0.00s 0.09s 0.0909s 0.0909s
INIT time 0.00s ( 0.01s elapsed)
MUT time 0.00s ( 29.21s elapsed)
GC time 0.00s ( 0.09s elapsed)
EXIT time 0.00s ( 0.09s elapsed)
Total time 0.01s ( 29.40s elapsed)
%GC time 5.7% (0.3% elapsed)
Alloc rate 381,307,936 bytes per MUT second
Productivity 91.1% of total user, 0.0% of total elapsed
Can you please outline in some more detail how you are testing the code? If you are using ghci, then
$ ghci -fobject-code
we probably need to use -fobject-code to eliminate any space leaks from ghci. If you have already tried that option (assuming that you are using ghci), I will edit my answer. At this point, I would like to reproduce the issue you are seeing.
Update:
@duplode: Thank you for the pointers. I am going to delete the previous output if no one objects to it, as it is misleading.
I have been able to reduce the GC overhead a bit using one of the following options. I am getting some benefit, but the overhead is still in the 49-50% range:
ts3 :: V.Vector (Int, Int) -> M.HashMap Int [Int]
ts3 vec =
  V.foldl (\ht (!k, !v) ->
    let element = M.lookup k ht
    in case element of
         Nothing    -> M.insert k [v] ht
         Just aList -> M.insert k (v:aList) ht) M.empty vec

ts4 :: V.Vector (Int,Int) -> M.HashMap Int [Int]
ts4 vec =
  let initMap = V.foldl (\ht (!k,_) -> M.insert k [] ht) M.empty vec
  in V.foldl (\ht (!k, !v) -> M.adjust (\x -> v:x) k ht) initMap vec
The adjust seemed a bit better, but the results are similar to a straight lookup. With ts4 using adjust:
calling ts4 done.
3,838,059,320 bytes allocated in the heap
2,041,603,344 bytes copied during GC
377,412,728 bytes maximum residency (6 sample(s))
7,725,944 bytes maximum slop
737 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 7260 colls, 0 par 1.32s 1.45s 0.0002s 0.0013s
Gen 1 6 colls, 0 par 0.88s 1.40s 0.2328s 0.9236s
INIT time 0.00s ( 0.00s elapsed)
MUT time 2.18s ( 2.21s elapsed)
GC time 2.19s ( 2.85s elapsed)
RP time 0.00s ( 0.00s elapsed)
PROF time 0.00s ( 0.00s elapsed)
EXIT time 0.01s ( 0.07s elapsed)
Total time 4.38s ( 5.13s elapsed)
%GC time 50.0% (55.5% elapsed)
Alloc rate 1,757,267,879 bytes per MUT second
Productivity 50.0% of total user, 42.7% of total elapsed
Using the simple lookup/update (imperative style of updating a map)
calling ts3 done.
3,677,137,816 bytes allocated in the heap
2,040,053,712 bytes copied during GC
395,867,512 bytes maximum residency (6 sample(s))
7,326,104 bytes maximum slop
769 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 6999 colls, 0 par 1.35s 1.51s 0.0002s 0.0037s
Gen 1 6 colls, 0 par 1.06s 2.16s 0.3601s 1.3175s
INIT time 0.00s ( 0.00s elapsed)
MUT time 1.89s ( 2.07s elapsed)
GC time 2.41s ( 3.67s elapsed)
RP time 0.00s ( 0.00s elapsed)
PROF time 0.00s ( 0.00s elapsed)
EXIT time 0.01s ( 0.08s elapsed)
Total time 4.31s ( 5.82s elapsed)
%GC time 55.9% (63.0% elapsed)
Alloc rate 1,942,816,558 bytes per MUT second
Productivity 44.1% of total user, 32.6% of total elapsed
I am interested in finding out how to reduce the time for the lookup, as shown in the profile output below:
COST CENTRE    MODULE  %time  %alloc
ts3.\          Main     54.1    91.4
ts3.\.element  Main     19.0     2.9
ts3            Main     11.0     2.9
kv.k           Main      6.5     1.4
kv.v           Main      5.2     1.4
kv.k.\         Main      4.0     0.0

                                            individual        inherited
COST CENTRE    MODULE  no.  entries   %time  %alloc    %time  %alloc
MAIN           MAIN     72        0     0.0     0.0    100.0   100.0
main           Main    158        0     0.0     0.0      0.0     0.0
CAF:main       Main    143        0     0.0     0.0     84.2    97.1
main           Main    144        1     0.0     0.0     84.2    97.1
ts3            Main    145        1    11.0     2.9     84.2    97.1
ts3.\          Main    156  5000000    54.1    91.4     73.2    94.3
ts3.\.element  Main    157  5000000    19.0     2.9     19.0     2.9
CAF:kv         Main    142        0     0.0     0.0      0.0     0.0
Code
-- ghc -O2 --make test8.hs -prof -auto-all -caf-all -fforce-recomp +RTS
-- ./test8 +RTS -p
{-# LANGUAGE BangPatterns #-}
import qualified Data.HashMap.Strict as M
import qualified Data.Vector.Unboxed as V
n :: Int
n = 5000000

kv :: V.Vector (Int,Int)
kv = V.zip k v
  where
    k = V.generate n (\i -> i `mod` 1000)
    v = V.generate n (\i -> i)

ts :: V.Vector (Int,Int) -> M.HashMap Int Int
ts vec = V.foldl' (\ht (k, v) -> M.insertWith (+) k v ht) M.empty vec

ts2 :: V.Vector (Int,Int) -> M.HashMap Int [Int]
ts2 vec = V.foldl (\ht (!k, !v) -> M.insertWith (++) k [v] ht) M.empty vec

ts3 :: V.Vector (Int, Int) -> M.HashMap Int [Int]
ts3 vec =
  V.foldl (\ht (!k, !v) ->
    let element = M.lookup k ht
    in case element of
         Nothing    -> M.insert k [v] ht
         Just aList -> M.insert k (v:aList) ht) M.empty vec

ts4 :: V.Vector (Int,Int) -> M.HashMap Int [Int]
ts4 vec =
  let initMap = V.foldl (\ht (!k,_) -> M.insert k [] ht) M.empty vec
  in V.foldl (\ht (!k, !v) -> M.adjust (\x -> v:x) k ht) initMap vec

main :: IO ()
main = ts3 kv `seq` putStrLn "calling ts3 done."

main1 = do
  if x == y
    then putStrLn "Algos Match"
    else putStrLn "Error"
  where
    x = ts2 kv
    y = ts4 kv

Why is this simple haskell algorithm so slow?

Spoiler alert: this is related to Problem 14 from Project Euler.
The following code takes around 15s to run. I have a non-recursive Java solution that runs in 1s. I think I should be able to get this code much closer to that.
import Data.List
collatz a 1 = a
collatz a x
  | even x    = collatz (a + 1) (x `div` 2)
  | otherwise = collatz (a + 1) (3 * x + 1)

main = do
  print ((foldl1' max) . map (collatz 1) $ [1..1000000])
I have profiled with +RTS -p and noticed that the allocated memory is large and grows as the input grows. For n = 100,000, 1 GB is allocated (!); for n = 1,000,000, 13 GB (!!).
Then again, -sstderr shows that although lots of bytes were allocated, total memory use was 1 MB and productivity was 95%+, so maybe the 13 GB is a red herring.
I can think of a few possibilities:
1. Something isn't as strict as it needs to be. I've already discovered foldl1', but maybe I need to do more? Is it possible to mark collatz as strict (does that even make sense)?
2. collatz isn't being tail-call optimized. I think it should be, but I don't know a way to confirm.
3. The compiler isn't doing some optimizations I think it should - for instance, only two results of collatz need to be in memory at any one time (max and current).
This is pretty much a duplicate of Why is this Haskell expression so slow?, though I will note that the fast Java solution does not have to perform any memoization. Are there any ways to speed this up without having to resort to it?
For reference, here is my profiling output:
Wed Dec 28 09:33 2011 Time and Allocation Profiling Report (Final)
scratch +RTS -p -hc -RTS
total time = 5.12 secs (256 ticks @ 20 ms)
total alloc = 13,229,705,716 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
collatz Main 99.6 99.4
individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 1 0 0.0 0.0 100.0 100.0
CAF Main 208 10 0.0 0.0 100.0 100.0
collatz Main 215 1 0.0 0.0 0.0 0.0
main Main 214 1 0.4 0.6 100.0 100.0
collatz Main 216 0 99.6 99.4 99.6 99.4
CAF GHC.IO.Handle.FD 145 2 0.0 0.0 0.0 0.0
CAF System.Posix.Internals 144 1 0.0 0.0 0.0 0.0
CAF GHC.Conc 128 1 0.0 0.0 0.0 0.0
CAF GHC.IO.Handle.Internals 119 1 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding.Iconv 113 5 0.0 0.0 0.0 0.0
And -sstderr:
./scratch +RTS -sstderr
525
21,085,474,908 bytes allocated in the heap
87,799,504 bytes copied during GC
9,420 bytes maximum residency (1 sample(s))
12,824 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Generation 0: 40219 collections, 0 parallel, 0.40s, 0.51s elapsed
Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed
INIT time 0.00s ( 0.00s elapsed)
MUT time 35.38s ( 36.37s elapsed)
GC time 0.40s ( 0.51s elapsed)
RP time 0.00s ( 0.00s elapsed)
PROF time 0.00s ( 0.00s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 35.79s ( 36.88s elapsed)
%GC time 1.1% (1.4% elapsed)
Alloc rate 595,897,095 bytes per MUT second
Productivity 98.9% of total user, 95.9% of total elapsed
And the Java solution (not mine, taken from the Project Euler forums with memoization removed):
public class Collatz {
    public int getChainLength(int n) {
        long num = n;
        int count = 1;
        while (num > 1) {
            num = (num % 2 == 0) ? num >> 1 : 3 * num + 1;
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        Collatz obj = new Collatz();
        long tic = System.currentTimeMillis();
        int max = 0, len = 0, index = 0;
        for (int i = 3; i < 1000000; i++) {
            len = obj.getChainLength(i);
            if (len > max) {
                max = len;
                index = i;
            }
        }
        long toc = System.currentTimeMillis();
        System.out.println(toc - tic);
        System.out.println("Index: " + index + ", length = " + max);
    }
}
At first, I thought you should try putting an exclamation mark before a in collatz:
collatz !a 1 = a
collatz !a x
  | even x    = collatz (a + 1) (x `div` 2)
  | otherwise = collatz (a + 1) (3 * x + 1)
(You'll need to put {-# LANGUAGE BangPatterns #-} at the top of your source file for this to work.)
My reasoning went as follows: The problem is that you're building up a massive thunk in the first argument to collatz: it starts off as 1, and then becomes 1 + 1, and then becomes (1 + 1) + 1, ... all without ever being forced. This bang pattern forces the first argument of collatz to be forced whenever a call is made, so it starts off as 1, and then becomes 2, and so on, without building up a large unevaluated thunk: it just stays as an integer.
Note that a bang pattern is just shorthand for using seq; in this case, we could rewrite collatz as follows:
collatz a _ | seq a False = undefined
collatz a 1 = a
collatz a x
  | even x    = collatz (a + 1) (x `div` 2)
  | otherwise = collatz (a + 1) (3 * x + 1)
The trick here is to force a in the guard, which then always evaluates to False (and so the body is irrelevant). Then evaluation continues with the next case, a having already been evaluated. However, a bang pattern is clearer.
Unfortunately, when compiled with -O2, this doesn't run any faster than the original! What else can we try? Well, one thing we can do is assume that the two numbers never overflow a machine-sized integer, and give collatz this type annotation:
collatz :: Int -> Int -> Int
We'll leave the bang pattern there, since we should still avoid building up thunks, even if they aren't the root of the performance problem. This brings the time down to 8.5 seconds on my (slow) computer.
The next step is to try bringing this closer to the Java solution. The first thing to realise is that in Haskell, div behaves in a more mathematically correct manner with respect to negative integers, but is slower than "normal" C division, which in Haskell is called quot. Replacing div with quot brought the runtime down to 5.2 seconds, and replacing x `quot` 2 with x `shiftR` 1 (importing Data.Bits) to match the Java solution brought it down to 4.9 seconds.
This is about as low as I can get it for now, but I think this is a pretty good result; since your computer is faster than mine, it should hopefully be even closer to the Java solution.
Here's the final code (I did a little bit of clean-up on the way):
{-# LANGUAGE BangPatterns #-}
import Data.Bits
import Data.List
collatz :: Int -> Int
collatz = collatz' 1
  where collatz' :: Int -> Int -> Int
        collatz' !a 1 = a
        collatz' !a x
          | even x    = collatz' (a + 1) (x `shiftR` 1)
          | otherwise = collatz' (a + 1) (3 * x + 1)

main :: IO ()
main = print . foldl1' max . map collatz $ [1..1000000]
Looking at the GHC Core for this program (with ghc-core), I think this is probably about as good as it gets; the collatz loop uses unboxed integers and the rest of the program looks OK. The only improvement I can think of would be eliminating the boxing from the map collatz [1..1000000] iteration.
By the way, don't worry about the "total alloc" figure; it's the total memory allocated over the lifetime of the program, and it never decreases even when the GC reclaims that memory. Figures of multiple terabytes are common.
You could lose the list and the bang patterns and still get the same performance by using the stack instead.
import Data.List
import Data.Bits
coll :: Int -> Int
coll 0 = 0
coll 1 = 1
coll 2 = 2
coll n =
  let a = coll (n - 1)
      collatz a 1 = a
      collatz a x
        | even x    = collatz (a + 1) (x `shiftR` 1)
        | otherwise = collatz (a + 1) (3 * x + 1)
  in max a (collatz 1 n)

main = do
  print $ coll 100000
One problem with this is that you will have to increase the size of the stack for large inputs, like 1_000_000.
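For example, compiling with -rtsopts lets you raise the limit at run time with the RTS's -K flag (the binary name here is hypothetical):
$ ghc -O2 -rtsopts coll.hs
$ ./coll +RTS -K512m -RTS   # -K sets the maximum stack size per thread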
Update:
Here is a tail recursive version that doesn't suffer from the stack overflow problem.
import Data.Word
collatz :: Word -> Word -> (Word, Word)
collatz a x
  | x == 1    = (a, x)
  | even x    = collatz (a + 1) (x `quot` 2)
  | otherwise = collatz (a + 1) (3 * x + 1)

coll :: Word -> Word
coll n = collTail 0 n
  where
    collTail m 1 = m
    collTail m n = collTail (max (fst $ collatz 1 n) m) (n-1)
Notice the use of Word instead of Int. It makes a difference in performance. You could still use the bang patterns if you want, and that would nearly double the performance.
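A sketch of the banged variant, assuming everything else stays the same (only the accumulator arguments change):
{-# LANGUAGE BangPatterns #-}
import Data.Word

collatz :: Word -> Word -> (Word, Word)
collatz !a x   -- force the chain-length accumulator at every step
  | x == 1    = (a, x)
  | even x    = collatz (a + 1) (x `quot` 2)
  | otherwise = collatz (a + 1) (3 * x + 1)

coll :: Word -> Word
coll n = collTail 0 n
  where
    collTail !m 1 = m   -- force the running maximum as well
    collTail !m n = collTail (max (fst $ collatz 1 n) m) (n - 1)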
One thing I found made a surprising difference in this problem. I stuck with the straight recurrence relation rather than folding the counting in with it, you should pardon the expression. Rewriting
collatz n = if even n then n `div` 2 else 3 * n + 1
as
collatz n = case n `divMod` 2 of
  (n', 0) -> n'
  _       -> 3 * n + 1
took 1.2 seconds off the runtime for my program on a system with a 2.8 GHz Athlon II X4 430 CPU. My initial faster version (2.3 seconds after the use of divMod):
{-# LANGUAGE BangPatterns #-}
import Data.List
import Data.Ord

collatzChainLen :: Int -> Int
collatzChainLen n = collatzChainLen' n 1
  where collatzChainLen' n !l
          | n == 1    = l
          | otherwise = collatzChainLen' (collatz n) (l + 1)

collatz :: Int -> Int
collatz n = case n `divMod` 2 of
  (n', 0) -> n'
  _       -> 3 * n + 1

pairMap :: (a -> b) -> [a] -> [(a, b)]
pairMap f xs = [(x, f x) | x <- xs]

main :: IO ()
main = print $ fst (maximumBy (comparing snd) (pairMap collatzChainLen [1..999999]))
A perhaps more idiomatic Haskell version runs in about 9.7 seconds (8.5 with divMod); it's identical save for
collatzChainLen :: Int -> Int
collatzChainLen n = 1 + (length . takeWhile (/= 1) . (iterate collatz)) n
Using Data.List.Stream is supposed to allow stream fusion that would make this version run more like the one with explicit accumulation, but I can't find an Ubuntu libghc* package that has Data.List.Stream, so I can't yet verify that.
