Why is this simple Haskell algorithm so slow? - haskell

Spoiler alert: this is related to Problem 14 from Project Euler.
The following code takes around 15s to run. I have a non-recursive Java solution that runs in 1s. I think I should be able to get this code much closer to that.
import Data.List

collatz a 1 = a
collatz a x
  | even x = collatz (a + 1) (x `div` 2)
  | otherwise = collatz (a + 1) (3 * x + 1)

main = do
  print ((foldl1' max) . map (collatz 1) $ [1..1000000])
I have profiled with +RTS -p and noticed that the allocated memory is large, and grows as the input grows. For n = 100,000, 1 GB is allocated (!); for n = 1,000,000, 13 GB (!!) is allocated.
Then again, -sstderr shows that although lots of bytes were allocated, total memory use was 1 MB and productivity was 95%+, so maybe the 13 GB is a red herring.
I can think of a few possibilities:
- Something isn't as strict as it needs to be. I've already discovered foldl1', but maybe I need to do more? Is it possible to mark collatz as strict (does that even make sense)?
- collatz isn't being tail-call optimized. I think it should be, but I don't know a way to confirm.
- The compiler isn't doing some optimization I think it should - for instance, only two results of collatz need to be in memory at any one time (max and current).
Any suggestions?
This is pretty much a duplicate of Why is this Haskell expression so slow?, though I will note that the fast Java solution does not have to perform any memoization. Are there any ways to speed this up without having to resort to it?
For reference, here is my profiling output:
Wed Dec 28 09:33 2011 Time and Allocation Profiling Report (Final)
scratch +RTS -p -hc -RTS
total time = 5.12 secs (256 ticks @ 20 ms)
total alloc = 13,229,705,716 bytes (excludes profiling overheads)
COST CENTRE   MODULE    %time  %alloc
collatz       Main       99.6    99.4

                                                            individual      inherited
COST CENTRE   MODULE                    no.    entries    %time %alloc    %time %alloc
MAIN          MAIN                        1          0      0.0    0.0    100.0  100.0
 CAF          Main                      208         10      0.0    0.0    100.0  100.0
  collatz     Main                      215          1      0.0    0.0      0.0    0.0
  main        Main                      214          1      0.4    0.6    100.0  100.0
   collatz    Main                      216          0     99.6   99.4     99.6   99.4
 CAF          GHC.IO.Handle.FD          145          2      0.0    0.0      0.0    0.0
 CAF          System.Posix.Internals    144          1      0.0    0.0      0.0    0.0
 CAF          GHC.Conc                  128          1      0.0    0.0      0.0    0.0
 CAF          GHC.IO.Handle.Internals   119          1      0.0    0.0      0.0    0.0
 CAF          GHC.IO.Encoding.Iconv     113          5      0.0    0.0      0.0    0.0
And -sstderr:
./scratch +RTS -sstderr
525
21,085,474,908 bytes allocated in the heap
87,799,504 bytes copied during GC
9,420 bytes maximum residency (1 sample(s))
12,824 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Generation 0: 40219 collections, 0 parallel, 0.40s, 0.51s elapsed
Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed
INIT    time    0.00s  (  0.00s elapsed)
MUT     time   35.38s  ( 36.37s elapsed)
GC      time    0.40s  (  0.51s elapsed)
RP      time    0.00s  (  0.00s elapsed)
PROF    time    0.00s  (  0.00s elapsed)
EXIT    time    0.00s  (  0.00s elapsed)
Total   time   35.79s  ( 36.88s elapsed)

%GC time        1.1%  (1.4% elapsed)

Alloc rate      595,897,095 bytes per MUT second
Productivity 98.9% of total user, 95.9% of total elapsed
And Java solution (not mine, taken from Project Euler forums with memoization removed):
public class Collatz {
    public int getChainLength( int n )
    {
        long num = n;
        int count = 1;
        while( num > 1 )
        {
            num = ( num%2 == 0 ) ? num >> 1 : 3*num+1;
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        Collatz obj = new Collatz();
        long tic = System.currentTimeMillis();
        int max = 0, len = 0, index = 0;
        for( int i = 3; i < 1000000; i++ )
        {
            len = obj.getChainLength(i);
            if( len > max )
            {
                max = len;
                index = i;
            }
        }
        long toc = System.currentTimeMillis();
        System.out.println(toc-tic);
        System.out.println( "Index: " + index + ", length = " + max );
    }
}

At first, I thought you should try putting an exclamation mark before a in collatz:
collatz !a 1 = a
collatz !a x
  | even x = collatz (a + 1) (x `div` 2)
  | otherwise = collatz (a + 1) (3 * x + 1)
(You'll need to put {-# LANGUAGE BangPatterns #-} at the top of your source file for this to work.)
My reasoning went as follows: the problem is that you're building up a massive thunk in the first argument to collatz: it starts off as 1, and then becomes 1 + 1, and then becomes (1 + 1) + 1, ... all without ever being forced. This bang pattern forces the first argument of collatz whenever a call is made, so it starts off as 1, and then becomes 2, and so on, without building up a large unevaluated thunk: it just stays as an integer.
Note that a bang pattern is just shorthand for using seq; in this case, we could rewrite collatz as follows:
collatz a _ | seq a False = undefined
collatz a 1 = a
collatz a x
  | even x = collatz (a + 1) (x `div` 2)
  | otherwise = collatz (a + 1) (3 * x + 1)
The trick here is to force a in the guard, which then always evaluates to False (and so the body is irrelevant). Then evaluation continues with the next case, a having already been evaluated. However, a bang pattern is clearer.
Unfortunately, when compiled with -O2, this doesn't run any faster than the original! What else can we try? Well, one thing we can do is assume that the two numbers never overflow a machine-sized integer, and give collatz this type annotation:
collatz :: Int -> Int -> Int
We'll leave the bang pattern there, since we should still avoid building up thunks, even if they aren't the root of the performance problem. This brings the time down to 8.5 seconds on my (slow) computer.
The next step is to try bringing this closer to the Java solution. The first thing to realise is that in Haskell, div behaves in a more mathematically correct manner with respect to negative integers, but is slower than "normal" C division, which in Haskell is called quot. Replacing div with quot brought the runtime down to 5.2 seconds, and replacing x `quot` 2 with x `shiftR` 1 (importing Data.Bits) to match the Java solution brought it down to 4.9 seconds.
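To make the three halving steps concrete, here is a small sketch (the names are just for illustration and are not part of the original program):

import Data.Bits (shiftR)

halveDiv, halveQuot, halveShift :: Int -> Int
halveDiv   x = x `div` 2     -- rounds toward negative infinity, needs extra checks
halveQuot  x = x `quot` 2    -- truncates toward zero, maps to the machine division
halveShift x = x `shiftR` 1  -- arithmetic shift, matching the Java num >> 1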
This is about as low as I can get it for now, but I think this is a pretty good result; since your computer is faster than mine, it should hopefully be even closer to the Java solution.
Here's the final code (I did a little bit of clean-up on the way):
{-# LANGUAGE BangPatterns #-}

import Data.Bits
import Data.List

collatz :: Int -> Int
collatz = collatz' 1
  where collatz' :: Int -> Int -> Int
        collatz' !a 1 = a
        collatz' !a x
          | even x = collatz' (a + 1) (x `shiftR` 1)
          | otherwise = collatz' (a + 1) (3 * x + 1)

main :: IO ()
main = print . foldl1' max . map collatz $ [1..1000000]
Looking at the GHC Core for this program (with ghc-core), I think this is probably about as good as it gets; the collatz loop uses unboxed integers and the rest of the program looks OK. The only improvement I can think of would be eliminating the boxing from the map collatz [1..1000000] iteration.
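For what it's worth, one way to try that would be to drop the intermediate list entirely and drive the search with an explicit strict loop. This is only a sketch reusing the collatz above; I haven't benchmarked whether it actually beats the list version:

{-# LANGUAGE BangPatterns #-}

main :: IO ()
main = print (go 1 0)
  where
    go :: Int -> Int -> Int
    go !n !best
      | n > 1000000 = best                              -- searched the whole range
      | otherwise   = go (n + 1) (max best (collatz n)) -- keep the running maximum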
By the way, don't worry about the "total alloc" figure; it's the total memory allocated over the lifetime of the program, and it never decreases even when the GC reclaims that memory. Figures of multiple terabytes are common.

You could lose the list and the bang patterns and still get the same performance by using the stack instead.
import Data.List
import Data.Bits

coll :: Int -> Int
coll 0 = 0
coll 1 = 1
coll 2 = 2
coll n =
  let a = coll (n - 1)
      collatz a 1 = a
      collatz a x
        | even x = collatz (a + 1) (x `shiftR` 1)
        | otherwise = collatz (a + 1) (3 * x + 1)
  in max a (collatz 1 n)

main = do
  print $ coll 100000
One problem with this is that you will have to increase the size of the stack for large inputs, like 1_000_000.
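If you do hit a stack overflow, the stack limit can be raised with the RTS's -K option; a sketch, assuming the file is called coll.hs and the program is built with -rtsopts:

$ ghc -O2 -rtsopts coll.hs
$ ./coll +RTS -K64m -RTS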
update:
Here is a tail recursive version that doesn't suffer from the stack overflow problem.
import Data.Word

collatz :: Word -> Word -> (Word, Word)
collatz a x
  | x == 1 = (a, x)
  | even x = collatz (a + 1) (x `quot` 2)
  | otherwise = collatz (a + 1) (3 * x + 1)

coll :: Word -> Word
coll n = collTail 0 n
  where
    collTail m 1 = m
    collTail m n = collTail (max (fst $ collatz 1 n) m) (n - 1)
Notice the use of Word instead of Int. It makes a difference in performance. You could still use the bang patterns if you want, and that would nearly double the performance.
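A sketch of that variant (only coll changes; it assumes the collatz from the snippet above is in scope):

{-# LANGUAGE BangPatterns #-}

coll :: Word -> Word
coll n = collTail 0 n
  where
    collTail !m 1 = m
    collTail !m n = collTail (max (fst $ collatz 1 n) m) (n - 1)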

One thing I found made a surprising difference in this problem. I stuck with the straight recurrence relation rather than folding (you should pardon the expression) the counting in with it. Rewriting
collatz n = if even n then n `div` 2 else 3 * n + 1
as
collatz n = case n `divMod` 2 of
              (n', 0) -> n'
              _       -> 3 * n + 1
took 1.2 seconds off the runtime for my program on a system with a 2.8 GHz Athlon II X4 430 CPU. My initial faster version (2.3 seconds after the use of divMod):
{-# LANGUAGE BangPatterns #-}

import Data.List
import Data.Ord

collatzChainLen :: Int -> Int
collatzChainLen n = collatzChainLen' n 1
  where collatzChainLen' n !l
          | n == 1 = l
          | otherwise = collatzChainLen' (collatz n) (l + 1)

collatz :: Int -> Int
collatz n = case n `divMod` 2 of
              (n', 0) -> n'
              _       -> 3 * n + 1

pairMap :: (a -> b) -> [a] -> [(a, b)]
pairMap f xs = [(x, f x) | x <- xs]

main :: IO ()
main = print $ fst (maximumBy (comparing snd) (pairMap collatzChainLen [1..999999]))
A perhaps more idiomatic Haskell version runs in about 9.7 seconds (8.5 with divMod); it's identical save for
collatzChainLen :: Int -> Int
collatzChainLen n = 1 + (length . takeWhile (/= 1) . (iterate collatz)) n
Using Data.List.Stream is supposed to allow stream fusion that would make this version run more like that with the explicit accumulation, but I can't find an Ubuntu libghc* package that has Data.List.Stream, so I can't yet verify that.

Finding the size of a list that's too big for memory?

Brand new Haskell programmer here. Just finished "Learn You a Haskell"... I'm interested in how large a set with some particular properties is. I have working code for some small parameter values, but I'd like to know how to deal with larger structures. I know Haskell can do "infinite data structures", but I'm just not seeing how to get there from where I'm at, and Learn You a Haskell / Google isn't getting me over this.
Here's the working code for my eSet given "small" arguments r and t
import Control.Monad
import System.Environment
import System.Exit

myPred :: [Int] -> Bool
myPred a = myPred' [] a
  where
    myPred' [] []  = False
    myPred' [] [0] = True
    myPred' _  []  = True
    myPred' acc (0:aTail) = myPred' acc aTail
    myPred' acc (a:aTail)
      | a `elem` acc = False
      | otherwise = myPred' (a:acc) aTail

superSet :: Int -> Int -> [[Int]]
superSet r t = replicateM r [0..t]

eSet :: Int -> Int -> [[Int]]
eSet r t = filter myPred $ superSet r t

main :: IO ()
main = do
  args <- getArgs
  case args of
    [rArg, tArg] ->
      print $ length $ eSet (read rArg) (read tArg)
    [rArg, tArg, "set"] ->
      print $ eSet (read rArg) (read tArg)
    _ ->
      die "Usage: eSet r t set <set optional for printing the set itself, otherwise just print the size>"
When compiled/run I get
$ ghc eSet.hs -rtsopts
[1 of 1] Compiling Main ( eSet.hs, eSet.o )
Linking eSet ...
$ # Here's a tiny eSet to illustrate. It is the set of lists of r integers from zero to t with no repeated nonzero list entries
$ ./eSet 4 2 set
[[0,0,0,0],[0,0,0,1],[0,0,0,2],[0,0,1,0],[0,0,1,2],[0,0,2,0],[0,0,2,1],[0,1,0,0],[0,1,0,2],[0,1,2,0],[0,2,0,0],[0,2,0,1],[0,2,1,0],[1,0,0,0],[1,0,0,2],[1,0,2,0],[1,2,0,0],[2,0,0,0],[2,0,0,1],[2,0,1,0],[2,1,0,0]]
$ ./eSet 8 4 +RTS -sstderr
3393
174,406,136 bytes allocated in the heap
29,061,152 bytes copied during GC
4,382,568 bytes maximum residency (7 sample(s))
148,664 bytes maximum slop
14 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 328 colls, 0 par 0.047s 0.047s 0.0001s 0.0009s
Gen 1 7 colls, 0 par 0.055s 0.055s 0.0079s 0.0147s
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.298s ( 0.301s elapsed)
GC time 0.102s ( 0.102s elapsed)
EXIT time 0.001s ( 0.001s elapsed)
Total time 0.406s ( 0.405s elapsed)
%GC time 25.1% (25.2% elapsed)
Alloc rate 585,308,888 bytes per MUT second
Productivity 74.8% of total user, 75.0% of total elapsed
$ ./eSet 10 5 +RTS -sstderr
63591
27,478,010,744 bytes allocated in the heap
4,638,903,384 bytes copied during GC
532,163,096 bytes maximum residency (15 sample(s))
16,500,072 bytes maximum slop
1556 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 52656 colls, 0 par 6.865s 6.864s 0.0001s 0.0055s
Gen 1 15 colls, 0 par 8.853s 8.997s 0.5998s 1.8617s
INIT time 0.000s ( 0.000s elapsed)
MUT time 52.652s ( 52.796s elapsed)
GC time 15.717s ( 15.861s elapsed)
EXIT time 0.193s ( 0.211s elapsed)
Total time 68.564s ( 68.868s elapsed)
%GC time 22.9% (23.0% elapsed)
Alloc rate 521,883,277 bytes per MUT second
Productivity 77.1% of total user, 76.7% of total elapsed
I see my memory usage is very high and there's a lot of garbage collecting. When running eSet 12 6 I get a Segmentation fault.
I feel like filter myPred $ superSet r t is keeping me from lazily making the subset one part at a time so I can deal with much larger (but finite) sets, but I don't know how to change to another approach that would do that. I think that's the root of my question.
Also, as this is my first Haskell program, points on style and how to achieve the Haskell analog of "pythonic" are much appreciated!
I suspect the culprit here is replicateM, which has this implementation:
replicateM cnt0 f =
    loop cnt0
  where
    loop cnt
      | cnt <= 0  = pure []
      | otherwise = liftA2 (:) f (loop (cnt - 1))
The problem line is liftA2 (:) f (loop (cnt - 1)); probably loop (cnt - 1) is getting shared among all the calls to (:) partially applied to elements of f, and so loop (cnt - 1) must be kept in memory. Unfortunately loop (cnt - 1) is quite a long list...
It can be a bit fiddly to convince GHC not to share something. The following redefinition of superSet gives me a nice flat memory usage; it will probably be a bit slower on examples that do fit in memory, of course. The key idea is to make it look to the untrained eye (i.e. GHC) like the recursive monadic action depends on the choices made earlier, even though it doesn't.
superSet :: Int -> Int -> [[Int]]
superSet r t = go r 0 where
  go 0 ignored = if ignored == 0 then [[]] else [[]]
  go r ignored = do
    x <- [0..t]
    xs <- go (r-1) (ignored+x)
    return (x:xs)
If you don't mind avoiding optimizations, the more natural definition also works:
superSet 0 t = [[]]
superSet r t = do
  x <- [0..t]
  xs <- superSet (r-1) t
  return (x:xs)
...but with -O2 GHC is too clever and notices that it can share the recursive call.
A completely alternate approach is to just do a little bit of combinatorics analysis. Here's the process that eSet r t does, as near as I can make out:
1. Choose at most r elements without replacement from a set of size t.
2. Pad the sequence to a length of r by interleaving a sentinel value.
So let's just count the ways of doing each of these steps, rather than actually performing them. We'll introduce a new parameter, s, which is the length of the sequence produced by step (1) (and which we therefore know has s <= r and s <= t). How many sequences of size s are there when drawing elements without replacement from a set of size t? Well, there are t choices for the first element, t-1 choices for the second element, t-2 choices for the third element, ...
-- sequencesWithoutReplacement is a very long name!
seqWORepSize :: Integer -> Integer -> Integer
seqWORepSize s t = product [t-s+1 .. t]
Then we want to pad the sequence out to a length of r. We're going to choose s positions in the r-long sequence to be drawn from our sequence, and the remainder will be sentinels. How many ways are there to do that? This one is a well-known combinatorics operator called choose.
choose :: Integer -> Integer -> Integer
choose r s = product [r-s+1 .. r] `div` product [2 .. s]
The number of ways to produce a padded sequence of a given length is just the product of these two numbers, since the choices of "what values to insert" and "where to insert values" can be made completely independently.
paddedSeqSize :: Integer -> Integer -> Integer -> Integer
paddedSeqSize r s t = seqWORepSize s t * (r `choose` s)
And now we're pretty much done. Just iterate over all possible sequence lengths and add up the paddedSeqSize.
eSetSize :: Integer -> Integer -> Integer
eSetSize r t = sum $ map (\s -> paddedSeqSize r s t) [0..r]
We can try it out in ghci:
> :set +s
> map length $ [eSet 1 1, eSet 4 4, eSet 6 4, eSet 4 6]
[2,209,1045,1045]
(0.13 secs, 26,924,264 bytes)
> [eSetSize 1 1, eSetSize 4 4, eSetSize 6 4, eSetSize 4 6]
[2,209,1045,1045]
(0.01 secs, 120,272 bytes)
This way is significantly faster and significantly more memory-friendly. Indeed, we can make queries and get answers about eSets that we would never be able to count the size of one-by-one, e.g.
> length . show $ eSetSize 1000 1000
2594
(0.26 secs, 909,746,448 bytes)
Good luck counting to 10^2594 one at a time. =P
Edit
This idea can also be adapted to produce the padded sequences themselves rather than just counting how many there are. First, a handy utility that I find myself defining over and over for picking out individual elements of a list and reporting on the leftovers:
zippers :: [a] -> [([a], a, [a])]
zippers = go [] where
  go ls [] = []
  go ls (h:rs) = (ls, h, rs) : go (h:ls) rs
Now, the sequences without replacement can be done by repeatedly choosing a single element from the leftovers.
seqWORep :: Int -> [a] -> [[a]]
seqWORep 0 _  = [[]]
seqWORep n xs = do
  (ls, y, rs) <- zippers xs
  ys <- seqWORep (n-1) (ls++rs)
  return (y:ys)
Once we have a sequence, we can pad it to a desired size by producing all the interleavings of the sentinel value as follows:
interleavings :: Int -> a -> [a] -> [[a]]
interleavings 0 _ xs = [xs]
interleavings n z [] = [replicate n z]
interleavings n z xs@(x:xt) = map (z:) (interleavings (n-1) z xs)
                           ++ map (x:) (interleavings n z xt)
And finally, the top-level function just delegates to seqWORep and interleavings.
eSet :: Int -> Int -> [[Int]]
eSet r t = do
  s <- [0..r]
  xs <- seqWORep s [1..t]
  interleavings (r-s) 0 xs
In my tests this modified eSet has nice flat memory usage both with and without optimizations; does not generate any spurious elements that need to be later filtered out, and so is faster than your original proposal; and to me looks like quite a natural definition compared to the answer that relies on tricking GHC. A nice collection of properties!
After re-reading parts of LYaH and thinking about @daniel-wagner's answer, composing monadically sounded like it would be worthwhile to try again. The new code's total memory use is flat, and it works with and without the -O2 optimization.
Source:
import Control.Monad
import System.Environment
import System.Exit

allowed :: [Int] -> Bool
allowed a = allowed' [] a
  where
    allowed' [ ] [ ] = False
    allowed' [ ] [0] = True
    allowed' _   [ ] = True
    allowed' acc (0:aTail) = allowed' acc aTail
    allowed' acc (a:aTail)
      | a `elem` acc = False
      | otherwise = allowed' (a:acc) aTail

branch :: Int -> [Int] -> [[Int]]
branch t x = filter allowed [n:x | n <- [0..t]]

eSet :: Int -> Int -> [[Int]]
eSet r t = return [] >>= foldr (<=<) return (replicate r (branch t))

main :: IO ()
main = do
  args <- getArgs
  case args of
    [rArg, tArg] ->
      print $ length $ eSet (read rArg) (read tArg)
    [rArg, tArg, "set"] ->
      print $ eSet (read rArg) (read tArg)
    _ -> die "Usage: eSet r t set <set optional>"
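As a toy illustration of what foldr (<=<) return builds in the eSet line above (hypothetical names, not part of the solution): composing r copies of a step function Kleisli-style threads every partial result through the next step.

import Control.Monad ((<=<))

-- Compose r copies of a nondeterministic step.
steps :: Int -> (Int -> [Int]) -> (Int -> [Int])
steps r step = foldr (<=<) return (replicate r step)

-- steps 2 (\n -> [n, n + 1]) 0  ==  [0,1,1,2]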
The version with monadic function composition tests much faster and without the memory issues...
$ ./eSetMonad 10 5 +RTS -sstderr
63591
289,726,000 bytes allocated in the heap
997,968 bytes copied during GC
63,360 bytes maximum residency (2 sample(s))
24,704 bytes maximum slop
2 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 553 colls, 0 par 0.008s 0.008s 0.0000s 0.0002s
Gen 1 2 colls, 0 par 0.000s 0.000s 0.0002s 0.0003s
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.426s ( 0.429s elapsed)
GC time 0.009s ( 0.009s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 0.439s ( 0.438s elapsed)
%GC time 2.0% (2.0% elapsed)
Alloc rate 680,079,893 bytes per MUT second
Productivity 98.0% of total user, 98.3% of total elapsed

Memoized Collatz sequence

I've posted the same question on Code Review but failed to get an answer, so I'm trying my luck here on SO.
Here is one of my programs; it uses memoization and an array to improve performance and memory usage. The performance seems satisfactory, but the memory usage is ridiculous and I can't figure out what's wrong:
{-# LANGUAGE BangPatterns #-}

import Data.Functor
import Data.Array (Array)
import qualified Data.Array as Arr
import Control.DeepSeq

genColtzArr n = collatzArr
  where collatzArr = Arr.array (1, n) $ take n $ map (\v -> (v, collatz v 0)) [1..]
        collatz 1 !acc = 1 + acc
        collatz !m !acc
          | even m = go (m `div` 2) acc
          | otherwise = go (3 * m + 1) acc
          where go !l !acc
                  | l <= n = let !v = collatzArr Arr.! l in 1 + acc + v
                  | otherwise = collatz l $ 1 + acc
collatz here refers to the Collatz sequence. This function is supposed to receive a number n and return an array indexed from 1 to n, in which each cell contains the length of the chain from its index down to 1 obtained by applying the Collatz formula.
But the memory usage of this method is so high. Here is the profiler result (ghc option -prof -fprof-auto -rtsopts, run time option +RTS -p, n == 500000):
total alloc = 730,636,136 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
genColtzArr.collatz Main 40.4 34.7
genColtzArr.collatz.go Main 25.5 14.4
COST CENTRE MODULE no. entries %time %alloc %time %alloc
genColtzArr Main 105 1 0.0 0.0 74.7 72.1
genColtzArr.collatzArr Main 106 1 8.0 20.8 74.7 72.1
genColtzArr.collatzArr.\ Main 107 500000 0.9 2.2 66.8 51.3
genColtzArr.collatz Main 109 1182582 40.4 34.7 65.9 49.1
genColtzArr.collatz.go Main 110 1182581 25.5 14.4 25.5 14.4
Please note that -O2 is not a desired answer. I want to figure out what the problem in this program is and, in general, how I should spot time and memory inefficiencies in Haskell code. Specifically, I have no idea why this code, with tail recursion and bang patterns, can consume so much memory.
UPDATE1:
the same code with -s produces this:
1,347,869,264 bytes allocated in the heap
595,901,528 bytes copied during GC
172,105,056 bytes maximum residency (7 sample(s))
897,704 bytes maximum slop
315 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 2408 colls, 0 par 0.412s 0.427s 0.0002s 0.0075s
Gen 1 7 colls, 0 par 0.440s 0.531s 0.0759s 0.1835s
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.828s ( 0.816s elapsed)
GC time 0.852s ( 0.958s elapsed)
RP time 0.000s ( 0.000s elapsed)
PROF time 0.000s ( 0.000s elapsed)
EXIT time 0.004s ( 0.017s elapsed)
Total time 1.684s ( 1.791s elapsed)
%GC time 50.6% (53.5% elapsed)
Alloc rate 1,627,861,429 bytes per MUT second
Productivity 49.4% of total user, 46.4% of total elapsed
So it takes 300 MB; that is still too large.
Update2
full code
{-# LANGUAGE BangPatterns #-}

import Data.Functor
import Data.Array (Array)
import qualified Data.Array as Arr
import Control.DeepSeq

genColtzArr n = collatzArr
  where collatzArr = Arr.array (1, n) $ take n $ map (\v -> (v, collatz v 0)) [1..]
        collatz 1 !acc = 1 + acc
        collatz !m !acc
          | even m = go (m `div` 2) acc
          | otherwise = go (3 * m + 1) acc
          where go !l !acc
                  | l <= n = let !v = collatzArr Arr.! l in 1 + acc + v
                  | otherwise = collatz l $ 1 + acc

genLongestArr n = Arr.array (1, n) llist
  where colatz = genColtzArr n
        llist = (1, 1) : zipWith (\(n1, a1) l2 ->
                            let l1 = colatz Arr.! a1
                            in (n1 + 1, if l2 < l1 then a1 else n1 + 1))
                         llist (tail $ Arr.elems colatz)

main :: IO ()
main = getLine >> do
  ns <- map read <$> lines <$> getContents
  let m = maximum ns
  let lar = genLongestArr m
  let iter [] = return ()
      iter (h:t) = (putStrLn $ show $ lar Arr.! h) >> iter t
  iter ns
As the other answer on Code Review hints, it's alright for a 500000-element boxed array to consume ~20 MB of memory; however, it's not only the array but a lot of things all together:
Although you put bang patterns everywhere, array initialization itself is a lazy foldr:
-- from GHC.Arr
array (l,u) ies
    = let n = safeRangeSize (l,u)
      in unsafeArray' (l,u) n
                      [(safeIndex (l,u) n i, e) | (i, e) <- ies]

unsafeArray' :: Ix i => (i,i) -> Int -> [(Int, e)] -> Array i e
unsafeArray' (l,u) n@(I# n#) ies = runST (ST $ \s1# ->
    case newArray# n# arrEleBottom s1# of
        (# s2#, marr# #) ->
            foldr (fill marr#) (done l u n marr#) ies s2#)
So unless you evaluate the last element of an array, it holds a reference to the list used in its initialization. Usually that list can be GC'd on the fly while you evaluate the array, but in your case the mutual references and self-references disturb the common GC pattern:
- llist is self-referencing in order to produce every single element, so it will not be GC'd until you have evaluated its last element
- it also holds a reference to genColtzArr, so genColtzArr won't be GC'd until llist is fully evaluated
- you might think collatz is tail recursive, but it's not: it's mutually recursive with collatzArr, so again neither of them will be GC'd until fully evaluated
Everything combined, your program keeps three 500000-element list-like structures in memory, which results in an ~80 MB peak heap size.
Solution
The obvious solution is to force every array / list to normal form before it's used in another, so you won't keep multiple copies of the same data in memory.
genLongestArr :: Int -> Array Int Int
genLongestArr n =
  let collatz = genColtzArr n
      -- deepseq genColtzArr before mapping over it
      -- this is equivalent to your recursive definition
  in collatz `deepseq` (Arr.listArray (1,n) $ fmap fst $ scanl' (maxWith snd) (0, 0) $ Arr.assocs collatz)

maxWith :: Ord a => (b -> a) -> b -> b -> b
maxWith f b b' = case compare (f b) (f b') of
  LT -> b'
  _  -> b
And in main:
-- deepseq lar before mapping over it
-- this is equivalent to your iter loop
lar `deepseq` mapM_ (print . (lar Arr.!)) ns
Nothing can be done about genColtzArr: it uses itself for memoization, so the mutual recursion is kind of necessary.
Now the heap graph peaks at ~20MB as it should:
(Disclaimer: All programs in this answer were compiled with -O0)

My attempt at Project Euler #92 is too slow

I'm trying to solve Project Euler problem #92 with Haskell. I started learning Haskell recently. It's the first Project Euler problem I've tried to solve with Haskell, but my piece of code doesn't terminate even in 10 minutes. I know you won't give me the answer directly, and to be clear I'm not asking for the Euler answer (I've already found it with C++) or for new logic to solve the problem. I'm just curious why this code doesn't run fast and what I should do to make it faster.
{--EULER 92--}
import Data.List

myFirstFunction 1  = 0
myFirstFunction 89 = 1
myFirstFunction x  = myFirstFunction (giveResult x)

giveResult 0 = 0
giveResult x = (square (mod x 10)) + (giveResult (div x 10))

square x = x*x

a = [1..10000000]

main = putStrLn (show (sum (map myFirstFunction a)))
The biggest speedup can of course be gained from using a better algorithm. I'm not going deep into that here, though.
Original algorithm tweakings
So let's focus on improving the used algorithm without really changing it.
You never give any type signature, therefore the type defaults to the arbitrary precision Integer. Everything here fits easily in an Int, there's no danger of overflow, so let's use that. Adding a type signature myFirstFunction :: Int -> Int helps: time drops from Total time 13.77s ( 13.79s elapsed) to Total time 6.24s ( 6.24s elapsed) and total allocation drops by a factor of about 15. Not bad for such a simple change.
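Concretely, that means adding signatures like these to the three functions in the question (a sketch; the function bodies stay exactly as they were):

myFirstFunction :: Int -> Int
giveResult :: Int -> Int
square :: Int -> Int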
You use div and mod. These always compute a non-negative remainder and the corresponding quotient, so they need some extra checks in case some negative numbers are involved. The functions quot and rem map to the machine division instructions, they don't involve such checks and therefore are somewhat faster. If you compile via the LLVM backend (-fllvm), that also takes advantage of the fact that you always divide by a single known number (10), and converts the division into multiplication and bit-shift. Time now: Total time 1.56s ( 1.56s elapsed).
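For example, the question's giveResult with quot and rem swapped in looks like this (a sketch; behaviour is identical for the non-negative inputs used here, and square is the question's own helper):

giveResult :: Int -> Int
giveResult 0 = 0
giveResult x = square (x `rem` 10) + giveResult (x `quot` 10)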
Instead of using quot and rem separately, let's use the quotRem function that computes both at once, so that we don't repeat the division (even with multiplication+shift that takes a little time):
giveResult x = case x `quotRem` 10 of
                 (q,r) -> r*r + giveResult q
That doesn't gain much, but a little: Total time 1.49s ( 1.49s elapsed).
You're using a list a = [1 .. 10000000], and map the function over that list and then sum the resulting list. That's idiomatic, neat and short, but not super fast, since allocating all those list cells and garbage collecting them takes time too - not very much, since GHC is very good at that, but transforming it into a loop
main = print $ go 0 1
  where
    go acc n
      | n > 10000000 = acc
      | otherwise = go (acc + myFirstFunction n) (n+1)
gains us a little still: Total time 1.34s ( 1.34s elapsed) and the allocation dropped from 880,051,856 bytes allocated in the heap for the last list version to 51,840 bytes allocated in the heap.
giveResult is recursive, and therefore cannot be inlined. The same holds for myFirstFunction, hence each computation needs two function calls (at least). We can avoid that by rewriting giveResult to a non-recursive wrapper and a recursive local loop,
giveResult x = go 0 x
  where
    go acc 0 = acc
    go acc n = case n `quotRem` 10 of
                 (q,r) -> go (acc + r*r) q
so that that can be inlined: Total time 1.04s ( 1.04s elapsed).
Those were the most obvious points, further improvements - apart from the memoisation mentioned by hammar in the comments - would need some thinking.
We are now at
module Main (main) where

myFirstFunction :: Int -> Int
myFirstFunction 1  = 0
myFirstFunction 89 = 1
myFirstFunction x  = myFirstFunction (giveResult x)

giveResult :: Int -> Int
giveResult x = go 0 x
  where
    go acc 0 = acc
    go acc n = case n `quotRem` 10 of
                 (q,r) -> go (acc + r*r) q

main :: IO ()
main = print $ go 0 1
  where
    go acc n
      | n > 10000000 = acc
      | otherwise = go (acc + myFirstFunction n) (n+1)
With -O2 -fllvm, that runs in 1.04 seconds here, but with the native code generator (only -O2), it takes 3.5 seconds. That difference is due to the fact that GHC itself doesn't convert the division into a multiplication and bit-shift. If we do it by hand, we get pretty much the same performance from the native code generator.
Because we know something that the compiler doesn't, namely that we never deal with negative numbers here, and the numbers don't become large, we can even generate a better multiply-and-shift (that would produce wrong results for negative or large dividends) than the compiler and take the time down to 0.9 seconds for the native code generator and 0.73 seconds for the LLVM backend:
import Data.Bits

qr10 :: Int -> (Int, Int)
qr10 n = (q, r)
  where
    q = (n * 0x66666667) `unsafeShiftR` 34
    r = n - 10 * q
Note: That requires that Int is a 64-bit type, it won't work with 32-bit Ints, it will produce wrong results for negative n, and the multiplication will overflow for large n. We're entering dirty-hack territory. We can alleviate the dirtyness by using Word instead of Int, that leaves only the overflow (which doesn't occur for n <= 10737418236 with Word resp n <= 5368709118 for Int, so here we are comfortably in the safe zone). The times aren't affected.
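For reference, the Word variant mentioned above is the same magic constant and shift with only the type changed (a sketch):

import Data.Bits
import Data.Word

qr10 :: Word -> (Word, Word)
qr10 n = (q, r)
  where
    q = (n * 0x66666667) `unsafeShiftR` 34
    r = n - 10 * q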
The corresponding C programme
#include <stdio.h>

unsigned int myFirstFunction(unsigned int i);
unsigned int giveResult(unsigned int i);

int main(void) {
    unsigned int sum = 0;
    for(unsigned int i = 1; i <= 10000000; ++i) {
        sum += myFirstFunction(i);
    }
    printf("%u\n",sum);
    return 0;
}

unsigned int myFirstFunction(unsigned int i) {
    if (i == 1) return 0;
    if (i == 89) return 1;
    return myFirstFunction(giveResult(i));
}

unsigned int giveResult(unsigned int i) {
    unsigned int acc = 0, r, q;
    while(i) {
        q = (i*0x66666667UL) >> 34;
        r = i - q*10;
        i = q;
        acc += r*r;
    }
    return acc;
}
performs similarly, compiled with gcc -O3, it runs in 0.78 seconds, and with clang -O3 in 0.71.
That's pretty much the end without changing the algorithm.
Memoisation
Now, a minor change of algorithm is memoisation. If we build a lookup table for the numbers <= 7*9², we need only one computation of the sum of the squares of the digits for each number rather than iterating that until we reach 1 or 89, so let's memoise,
module Main (main) where

import Data.Array.Unboxed
import Data.Array.IArray
import Data.Array.Base (unsafeAt)
import Data.Bits

qr10 :: Int -> (Int, Int)
qr10 n = (q, r)
  where
    q = (n * 0x66666667) `unsafeShiftR` 34
    r = n - 10 * q

digitSquareSum :: Int -> Int
digitSquareSum = go 0
  where
    go acc 0 = acc
    go acc n = case qr10 n of
                 (q,r) -> go (acc + r*r) q

table :: UArray Int Int
table = array (0,567) $ assocs helper
  where
    helper :: Array Int Int
    helper = array (0,567) [(i, f i) | i <- [0 .. 567]]
    f 0  = 0
    f 1  = 0
    f 89 = 1
    f n  = helper ! digitSquareSum n

endPoint :: Int -> Int
endPoint n = table `unsafeAt` digitSquareSum n

main :: IO ()
main = print $ go 0 1
  where
    go acc n
      | n > 10000000 = acc
      | otherwise = go (acc + endPoint n) (n+1)
Doing the memoisation by hand instead of using a library makes the code longer, but we can tailor it to our needs. We can use an unboxed array, and we can omit the bounds check on the array access. Both significantly speed the computation up. The time is now 0.18 seconds for the native code generator, and 0.13 seconds for the LLVM backend. The corresponding C programme runs in 0.16 seconds compiled with gcc -O3, and 0.145 seconds compiled with clang -O3 (Haskell beats C, w00t!).
Scaling and a hint for a better algorithm
The used algorithm however doesn't scale too well, a bit worse than linearly, and for an upper bound of 10^8 (with a suitably adapted memoisation limit), it runs in 1.5 seconds (ghc -O2 -fllvm), resp. 1.64 seconds (clang -O3) and 1.87 seconds (gcc -O3) [2.02 seconds for the native code generator].
Using a different algorithm that counts the numbers whose sequence ends in 1 by partitioning such numbers into a sum of squares of digits (The only numbers that directly produce 1 are powers of 10. We can write
10 = 1×3² + 1×1²
10 = 2×2² + 2×1²
10 = 1×2² + 6×1²
10 = 10×1²
From the first, we obtain 13, 31, 103, 130, 301, 310, 1003, 1030, 1300, 3001, 3010, 3100, ...
From the second, we obtain 1122, 1212, 1221, 2112, 2121, 2211, 11022, 11202, ...
From the third 1111112, 1111121, ...
Only 13, 31, 103, 130, 301, 310 are possible sums of squares of the digits of numbers <= 10^10, so only those need to be investigated further. We can write
100 = 1×9² + 1×4² + 3×1²
...
100 = 1×8² + 1×6²
...
The first of these partitions generates no children since it requires five nonzero digits, the other explicitly given generates the two children 68 and 86 (also 608 if the limit is 10^8, more for larger limits)), we can get better scaling and a faster algorithm.
The fairly unoptimised programme I wrote way back when to solve this problem runs (input is exponent of 10 of the limit)
$ time ./problem92 7
8581146
real 0m0.010s
user 0m0.008s
sys 0m0.002s
$ time ./problem92 8
85744333
real 0m0.022s
user 0m0.018s
sys 0m0.003s
$ time ./problem92 9
854325192
real 0m0.040s
user 0m0.033s
sys 0m0.006s
$ time ./problem92 10
8507390852
real 0m0.074s
user 0m0.069s
sys 0m0.004s
in a different league.
First off, I took the liberty of cleaning up your code a little:
endsAt89 1 = 0
endsAt89 89 = 1
endsAt89 n = endsAt89 (sumOfSquareDigits n)
sumOfSquareDigits 0 = 0
sumOfSquareDigits n = (n `mod` 10)^2 + sumOfSquareDigits (n `div` 10)
main = print . sum $ map endsAt89 [1..10^7]
On my crappy netbook this runs in 1 min 13 sec. Let's see if we can improve that.
Since the numbers are small, we can start by using machine-sized Int instead of arbitrary-size Integer. This is just a matter of adding type signatures, e.g.
sumOfSquareDigits :: Int -> Int
That improves the run time drastically to 20 seconds.
Since the numbers are all positive, we can replace div and mod with the slightly faster quot and rem, or even both in one go with quotRem:
sumOfSquareDigits :: Int -> Int
sumOfSquareDigits 0 = 0
sumOfSquareDigits n = r^2 + sumOfSquareDigits q
  where (q, r) = quotRem n 10
Run time is now 17 seconds. Making it tail recursive shaves off another second:
sumOfSquareDigits :: Int -> Int
sumOfSquareDigits n = loop n 0
  where
    loop 0 !s = s
    loop n !s = loop q (s + r^2)
      where (q, r) = quotRem n 10
For further improvements, we can notice that sumOfSquareDigits returns at most 567 = 7 * 9^2 for the given input numbers, so we can memoize for small numbers to reduce the number of iterations needed. Here's my final version (using the data-memocombinators package for the memoization):
{-# LANGUAGE BangPatterns #-}

import qualified Data.MemoCombinators as Memo

endsAt89 :: Int -> Int
endsAt89 = Memo.arrayRange (1, 7*9^2) endsAt89'
  where
    endsAt89' 1  = 0
    endsAt89' 89 = 1
    endsAt89' n  = endsAt89 (sumOfSquareDigits n)

sumOfSquareDigits :: Int -> Int
sumOfSquareDigits n = loop n 0
  where
    loop 0 !s = s
    loop n !s = loop q (s + r^2)
      where (q, r) = quotRem n 10

main = print . sum $ map endsAt89 [1..10^7]
This runs in just under 9 seconds on my machine.

Caching in Haskell and explicit parallelism

I'm currently trying to optimize my solution to problem 14 at Project Euler.
I really enjoy Haskell and I think it's a very good fit for these kind of problems, here's three different solutions I've tried:
import Data.List (unfoldr, maximumBy)
import Data.Maybe (fromJust, isNothing)
import Data.Ord (comparing)
import Control.Parallel

next :: Integer -> Maybe (Integer)
next 1 = Nothing
next n
  | even n = Just (div n 2)
  | odd n  = Just (3 * n + 1)

get_sequence :: Integer -> [Integer]
get_sequence n = n : unfoldr (pack . next) n
  where pack n = if isNothing n then Nothing else Just (fromJust n, fromJust n)

get_sequence_length :: Integer -> Integer
get_sequence_length n
  | isNothing (next n) = 1
  | otherwise = 1 + (get_sequence_length $ fromJust (next n))

-- 8 seconds
main1 = print $ maximumBy (comparing length) $ map get_sequence [1..1000000]

-- 5 seconds
main2 = print $ maximum $ map (\n -> (get_sequence_length n, n)) [1..1000000]

-- Never finishes
main3 = print solution
  where
    s1 = maximumBy (comparing length) $ map get_sequence [1..500000]
    s2 = maximumBy (comparing length) $ map get_sequence [500001..10000000]
    solution = (s1 `par` s2) `pseq` max s1 s2
Now if you look at the actual problem there's a lot of potential for caching, as most new sequences will contain subsequences that have already been calculated before.
For comparison, I wrote a version in C too:
Running time with caching: 0.03 seconds
Running time without caching: 0.3 seconds
That's just insane! Sure, caching reduced the time by a factor of 10, but even without caching it's still at least 17 times faster than my Haskell code.
What's wrong with my code?
Why doesn't Haskell cache the function calls for me? Since the functions are pure, shouldn't caching be trivial, only a matter of available memory?
What's the problem with my third parallel version? Why doesn't it finish?
Regarding Haskell as a language, does the compiler automatically parallelize some code (folds, maps, etc.), or does it always have to be done explicitly using Control.Parallel?
Edit: I stumbled upon this similar question. They mentioned that his function wasn't tail-recursive. Is my get_sequence_length tail recursive? If not how can I make it so?
Edit2: To Daniel:
Thanks a lot for the reply, really awesome.
I've been playing around with your improvements and I've found some really bad gotchas.
I'm running the tests on Windows 7 (64-bit), a 3.3 GHz quad core with 8 GB RAM.
The first thing I did was, as you say, replace all Integer with Int, but whenever I ran any of the mains I ran out of memory,
even with +RTS -K<size> -RTS set ridiculously high.
Eventually I found this (stackoverflow is awesome...), which means that since all Haskell programs on Windows are run as 32-bit, the Ints were overflowing causing infinite recursion, just wow...
I ran the tests in a Linux virtual machine (with the 64-bit ghc) instead and got similar results.
Alright, let's start from the top. First important thing is to give the exact command line you're using to compile and run; for my answer, I'll use this line for the timings of all programs:
ghc -O2 -threaded -rtsopts test && time ./test +RTS -N
Next up: since timings vary greatly from machine to machine, we'll give some baseline timings for my machine and your programs. Here's the output of uname -a for my computer:
Linux sorghum 3.4.4-2-ARCH #1 SMP PREEMPT Sun Jun 24 18:59:47 CEST 2012 x86_64 Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz GenuineIntel GNU/Linux
The highlights are: quad-core, 2.4GHz, 64-bit.
Using main1: 30.42s user 2.61s system 149% cpu 22.025 total
Using main2: 21.42s user 1.18s system 129% cpu 17.416 total
Using main3: 22.71s user 2.02s system 220% cpu 11.237 total
Actually, I modified main3 in two ways: first, by removing one of the zeros from the end of the range in s2, and second, by changing max s1 s2 to maximumBy (comparing length) [s1, s2], since the former only accidentally computes the right answer. =)
I'll now focus on serial speed. (To answer one of your direct questions: no, GHC does not automatically parallelize or memoize your programs. Both of those things have overheads that are very difficult to estimate, and consequently it's very difficult to decide when doing them will be beneficial. I have no idea why even the serial solutions in this answer are getting >100% CPU utilization; perhaps some garbage collection is happening in another thread or some such thing.) We'll start from main2, since it was the faster of the two serial implementations. The cheapest way to get a little boost is to change all the type signatures from Integer to Int:
Using Int: 11.17s user 0.50s system 129% cpu 8.986 total (about twice as fast)
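For reference, in the code from the question that change is just a matter of tightening the signatures (a sketch; everything else stays the same):

next :: Int -> Maybe Int
get_sequence :: Int -> [Int]
get_sequence_length :: Int -> Int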
The next boost comes from reducing allocation in the inner loop (eliminating the intermediate Maybe values).
import Data.List
import Data.Ord

get_sequence_length :: Int -> Int
get_sequence_length 1 = 1
get_sequence_length n
  | even n = 1 + get_sequence_length (n `div` 2)
  | odd n  = 1 + get_sequence_length (3 * n + 1)

lengths :: [(Int,Int)]
lengths = map (\n -> (get_sequence_length n, n)) [1..1000000]

main = print (maximumBy (comparing fst) lengths)
Using this: 4.84s user 0.03s system 101% cpu 4.777 total
The next boost comes from using faster operations than even and div:
import Data.Bits
import Data.List
import Data.Ord

even' n = n .&. 1 == 0

get_sequence_length :: Int -> Int
get_sequence_length 1 = 1
get_sequence_length n = 1 + get_sequence_length next where
  next = if even' n then n `quot` 2 else 3 * n + 1

lengths :: [(Int,Int)]
lengths = map (\n -> (get_sequence_length n, n)) [1..1000000]

main = print (maximumBy (comparing fst) lengths)
Using this: 1.27s user 0.03s system 105% cpu 1.232 total
For those following along at home, this is about 17 times faster than the main2 that we started with -- a competitive improvement with switching to C.
For memoization, there's a few choices. The simplest is to use a pre-existing package like data-memocombinators to create an immutable array and read from it. The timings are fairly sensitive to choosing a good size for this array; for this problem, I found 50000 to be a pretty good upper bound.
import Data.Bits
import Data.MemoCombinators
import Data.List
import Data.Ord

even' n = n .&. 1 == 0

pre_length :: (Int -> Int) -> (Int -> Int)
pre_length f 1 = 1
pre_length f n = 1 + f next where
  next = if even' n then n `quot` 2 else 3 * n + 1

get_sequence_length :: Int -> Int
get_sequence_length = arrayRange (1,50000) (pre_length get_sequence_length)

lengths :: [(Int,Int)]
lengths = map (\n -> (get_sequence_length n, n)) [1..1000000]

main = print (maximumBy (comparing fst) lengths)
With this: 0.53s user 0.10s system 149% cpu 0.421 total
The fastest of all is to use a mutable, unboxed array for the memoization bit. It's much less idiomatic, but it's bare-metal speed. The speed is much less sensitive on the size of this array, so long as the array is about as large as the biggest thing you want the answer for.
import Control.Monad
import Control.Monad.ST
import Data.Array.Base
import Data.Array.ST
import Data.Bits
import Data.List
import Data.Ord

even' n = n .&. 1 == 0
next n = if even' n then n `quot` 2 else 3 * n + 1

get_sequence_length :: STUArray s Int Int -> Int -> ST s Int
get_sequence_length arr n = do
  bounds@(lo,hi) <- getBounds arr
  if not (inRange bounds n) then (+1) `fmap` get_sequence_length arr (next n) else do
    let ix = n-lo
    v <- unsafeRead arr ix
    if v > 0 then return v else do
      v' <- get_sequence_length arr (next n)
      unsafeWrite arr ix (v'+1)
      return (v'+1)

maxLength :: (Int,Int)
maxLength = runST $ do
    arr <- newArray (1,1000000) 0
    writeArray arr 1 1
    loop arr 1 1 1000000
  where
    loop arr n len 1  = return (n,len)
    loop arr n len n' = do
      len' <- get_sequence_length arr n'
      if len' > len then loop arr n' len' (n'-1) else loop arr n len (n'-1)

main = print maxLength
With this: 0.16s user 0.02s system 138% cpu 0.130 total (which is competitive with the memoized C version)
GHC won't parallelize anything automatically for you. And as you guessed, get_sequence_length is not tail-recursive. See here. And consider how the compiler (unless it's doing some nice optimizations for you) can't evaluate all those recursive additions until you hit the end; you're "building up thunks", which isn't usually a good thing.
Try instead calling a recursive helper function and passing an accumulator, or try defining it in terms of foldr.
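For instance, the accumulator version might look like this (a sketch, reusing the question's next function; the bang keeps the count evaluated so the accumulator doesn't build up thunks of its own):

{-# LANGUAGE BangPatterns #-}

get_sequence_length :: Integer -> Integer
get_sequence_length n0 = go n0 1
  where
    go n !acc = case next n of
                  Nothing -> acc
                  Just n' -> go n' (acc + 1)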

Project Euler 14: performance compared to C and memoization

I'm currently working on project euler problem 14.
I solved it using a poorly coded program, without memoization, that took 386 seconds to run (now down to 5 seconds; see edit).
Here it is:
step :: (Integer, Int) -> Integer -> (Integer, Int)
step (i, m) n | nextValue > m = (n, nextValue)
              | otherwise     = (i, m)
  where nextValue = syr n 1

syr :: Integer -> Int -> Int
syr 1 acc = acc
syr x acc | even x    = syr (x `div` 2) (acc + 1)
          | otherwise = syr (3 * x + 1) (acc + 1)

p14 = foldl step (0, 0) [500000..999999]
My question is about several comments in the thread related to this problem, where execution times of <1 s were mentioned for programs like the following (C code, credits to the Project Euler forum user ix for the code -- note: I didn't check that the execution time is in fact as mentioned):
#include <stdio.h>

int main(int argc, char **argv) {
    int longest = 0;
    int terms = 0;
    int i;
    unsigned long j;
    for (i = 1; i <= 1000000; i++) {
        j = i;
        int this_terms = 1;
        while (j != 1) {
            this_terms++;
            if (this_terms > terms) {
                terms = this_terms;
                longest = i;
            }
            if (j % 2 == 0) {
                j = j / 2;
            } else {
                j = 3 * j + 1;
            }
        }
    }
    printf("longest: %d (%d)\n", longest, terms);
    return 0;
}
To me, those programs are kind of the same, when talking about the algorithm.
So I wonder why there is such a big difference? Or is there any fundamental difference between our two algorithms that can justify a x6 factor in performance?
BTW, I'm currently trying to implement this algorithm with memoization, but I'm kind of lost because, to me, it's way easier to implement in an imperative language (and I don't manipulate monads yet, so I can't use that paradigm). So if you have any good tutorial on memoization that fits a beginner, I'll be glad (the ones I encountered were not detailed enough or out of my league).
Note: I came to declarative paradigm through Prolog and am still in the very early process of discovering Haskell, so I might miss important things.
Note2: any general advice about my code is welcome.
EDIT: thanks to delnan's help, I compiled the program and it now runs in 5 seconds, so I mainly look for hints on memoization now (even if ideas about the existing x6 gap are still welcome).
After having compiled it with optimisations, there are still several differences to the C programme
you use div, while the C programme uses machine division (which truncates) [but any self-respecting C compiler transforms that into a shift, so that makes it yet faster], that would be quot in Haskell; that reduced the run time by some 15% here.
the C programme uses fixed-width 64-bit (or even 32-bit, but then it's just luck that it gets the correct answer, since some intermediate values exceed 32-bit range) integers, the Haskell programme uses arbitrary precision Integers. If you have 64-bit Ints in your GHC (64-bit OS other than Windows), replace Integer with Int. That reduced the run time by a factor of about 3 here. If you're on a 32-bit system, you're out of luck, GHC doesn't use native 64-bit instructions there, these operations are implemented as C calls, that's still rather slow.
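Applied to the syr from the question, those two changes would look roughly like this (a sketch; Int is fine here on a 64-bit GHC, as discussed above):

syr :: Int -> Int -> Int
syr 1 acc = acc
syr x acc | even x    = syr (x `quot` 2) (acc + 1)
          | otherwise = syr (3 * x + 1) (acc + 1)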
For the memoisation, you can outsource it to one of the memoisation packages on hackage, the only one that I remember is data-memocombinators, but there are others. Or you can do it yourself, for example keeping a map of previously calculated values - that would work best in the State monad,
import Control.Monad.State.Strict
import qualified Data.Map as Map
import Data.Map (Map, singleton)

type Memo = Map Integer Int

syr :: Integer -> State Memo Int
syr n = do
  mb <- gets (Map.lookup n)
  case mb of
    Just l -> return l
    Nothing -> do
      let m = if even n then n `quot` 2 else 3*n+1
      l <- syr m
      let l' = l+1
      modify (Map.insert n l')
      return l'

solve :: Integer -> Int -> Integer -> State Memo (Integer,Int)
solve maxi len start
  | start > 1000000 = return (maxi,len)
  | otherwise = do
      l <- syr start
      if len < l
        then solve start l (start+1)
        else solve maxi len (start+1)

p14 :: (Integer,Int)
p14 = evalState (solve 0 0 500000) (singleton 1 1)
but that will probably not gain too much (not even when you've added the necessary strictness). The trouble is that a lookup in a Map is not too cheap and an insertion is relatively expensive.
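For completeness, the "necessary strictness" there would amount to something like forcing the stored length and using the strict modify' (a sketch of a modified syr, not from the original answer; it assumes an mtl version that provides modify'):

{-# LANGUAGE BangPatterns #-}
import Control.Monad.State.Strict
import qualified Data.Map as Map
import Data.Map (Map)

type Memo = Map Integer Int

syr :: Integer -> State Memo Int
syr n = do
  mb <- gets (Map.lookup n)
  case mb of
    Just l  -> return l
    Nothing -> do
      let m = if even n then n `quot` 2 else 3*n+1
      l <- syr m
      let !l' = l + 1          -- force the length before it goes into the Map
      modify' (Map.insert n l')
      return l'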
Another method is to keep a mutable array for the lookup. The code becomes more complicated, since you have to choose a reasonable upper bound for the values to cache (should be not much larger than the bound for the starting values) and deal with the parts of the sequences falling outside the memoised range. But an array lookup and write are fast. If you have 64-bit Ints, the below code runs pretty fast, here it takes 0.03s for a limit of 1 million, and 0.33s for a limit of 10 million, the corresponding (as closely as I reasonably could) C code runs in 0.018 resp. 0.2s.
module Main (main) where

import System.Environment (getArgs)
import Data.Array.ST
import Data.Array.Base
import Control.Monad.ST
import Data.Bits
import Data.Int

main :: IO ()
main = do
  args <- getArgs
  let bd = case args of
             a:_ -> read a
             _   -> 100000
  print $ collMax bd

next :: Int -> Int
next n
  | n .&. 1 == 0 = n `unsafeShiftR` 1
  | otherwise    = 3*n + 1

collMax :: Int -> (Int,Int16)
collMax upper = runST $ do
  arr <- newArray (0,upper) 0 :: ST s (STUArray s Int Int16)
  let go l m
        | upper < m = go (l+1) $ next m
        | otherwise = do
            l' <- unsafeRead arr m
            case l' of
              0 -> do
                l'' <- go 1 $ next m
                unsafeWrite arr m (l'' + 1)
                return (l+l'')
              _ -> return (l+l'-1)
      collect mi ml i
        | upper < i = return (mi, ml)
        | otherwise = do
            l <- go 1 i
            if l > ml
              then collect i l (i+1)
              else collect mi ml (i+1)
  unsafeWrite arr 1 1
  collect 1 1 2
Well, the C program uses unsigned long, but Integer can store arbitrarily large integers (it's a bignum). If you import Data.Word, then you can use Word, which is a machine-word-sized unsigned integer.
After replacing Integer with Word, and using ghc -O2 and gcc -O3, the C program runs in 0.72 seconds, while the Haskell program runs in 1.92 seconds. 2.6x isn't bad. However, ghc -O2 doesn't always help, and this is one of the programs on which it doesn't! Using just -O, as you did, brings the runtime down to 1.90 seconds.
I tried replacing div with quot (which uses the same type of division as C; they only differ on negative inputs), but strangely it actually made the Haskell program run slightly slower for me.
You should be able to speed up the syr function with the help of this previous Stack Overflow question I answered about the same Project Euler problem.
On my current system (32-bit Core2Duo) your Haskell code, including all the optimizations given in the answers, takes 0.8s to compile and 1.2s to run.
You could transfer the run-time to compile-time, and reduce the run-time to nearly zero.
module Euler14 where

import Data.Word
import Language.Haskell.TH

terms :: Word -> Word
terms n = countTerms n 0
  where
    countTerms 1 acc = acc + 1
    countTerms n acc | even n    = countTerms (n `div` 2) (acc + 1)
                     | otherwise = countTerms (3 * n + 1) (acc + 1)

longestT :: Word -> Word -> (Word, Word)
longestT mi mx = find mi mx (0, 0)
  where
    find mi mx (ct,cn) | mi == mx  = if ct > terms mi then (ct,cn) else (terms mi, mi)
                       | otherwise = find (mi + 1) mx
                                          (if ct > terms mi then (ct,cn) else (terms mi, mi))

longest :: Word -> Word -> ExpQ
longest mi mx = return $ TupE [LitE (IntegerL (fromIntegral a)),
                               LitE (IntegerL (fromIntegral b))]
  where
    (a,b) = longestT mi mx
and
{-# LANGUAGE TemplateHaskell #-}
import Euler14

main = print $(longest 500000 999999)
On my system it takes 2.3s to compile this but the run-time goes down to 0.003s. Compile Time Function Execution (CTFE) is something you can't do in C/C++. The only other programming language that I know of that supports CTFE is the D programming language. And just to be complete, the C code takes 0.1s to compile and 0.7s to run.
