Project Euler 14: performance compared to C and memoization - haskell

I'm currently working on project euler problem 14.
I solved it using a poorly coded program, without memoization, that took 386 5 seconds to run (see edit).
Here it is:
step :: (Integer, Int) -> Integer -> (Integer, Int)
step (i, m) n | nextValue > m = (n, nextValue)
| otherwise = (i, m)
where nextValue = syr n 1
syr :: Integer -> Int -> Int
syr 1 acc = acc
syr x acc | even x = syr (x `div` 2) (acc + 1)
| otherwise = syr (3 * x + 1) (acc + 1)
p14 = foldl step (0, 0) [500000..999999]
My question is about several comments in the thread related to this problem, where were mentionned execution times of <1 s for programs as follow (C code, credits to the project euler forum user ix for the code -- note: I didn't check that the execution time is in fact as mentionned):
#include <stdio.h>
int main(int argc, char **argv) {
int longest = 0;
int terms = 0;
int i;
unsigned long j;
for (i = 1; i <= 1000000; i++) {
j = i;
int this_terms = 1;
while (j != 1) {
this_terms++;
if (this_terms > terms) {
terms = this_terms;
longest = i;
}
if (j % 2 == 0) {
j = j / 2;
} else {
j = 3 * j + 1;
}
}
}
printf("longest: %d (%d)\n", longest, terms);
return 0;
}
To me, those programs are kind of the same, when talking about the algorithm.
So I wonder why there is such a big difference? Or is there any fondamental difference between our two algorithms that can justify a x6 factor in performance?
BTW, I'm currently trying to implement this algorithm with memoization, but am kind of lost as to me, it's way easier to implement in an imperative language (and I don't manipulate monads yet so I can't use this paradigm). So if you have any good tutorial that fits a beginner to learn memoization, I'll be glad (the ones I encountered were not detailed enough or out of my league).
Note: I came to declarative paradigm through Prolog and am still in the very early process of discovering Haskell, so I might miss important things.
Note2: any general advice about my code is welcome.
EDIT: thanks to delnan's help, I compiled the program and it now runs in 5 seconds, so I mainly look for hints on memoization now (even if ideas about the existing x6 gap are still welcome).

After having compiled it with optimisations, there are still several differences to the C programme
you use div, while the C programme uses machine division (which truncates) [but any self-respecting C compiler transforms that into a shift, so that makes it yet faster], that would be quot in Haskell; that reduced the run time by some 15% here.
the C programme uses fixed-width 64-bit (or even 32-bit, but then it's just luck that it gets the correct answer, since some intermediate values exceed 32-bit range) integers, the Haskell programme uses arbitrary precision Integers. If you have 64-bit Ints in your GHC (64-bit OS other than Windows), replace Integer with Int. That reduced the run time by a factor of about 3 here. If you're on a 32-bit system, you're out of luck, GHC doesn't use native 64-bit instructions there, these operations are implemented as C calls, that's still rather slow.
For the memoisation, you can outsource it to one of the memoisation packages on hackage, the only one that I remember is data-memocombinators, but there are others. Or you can do it yourself, for example keeping a map of previously calculated values - that would work best in the State monad,
import Control.Monad.State.Strict
import qualified Data.Map as Map
import Data.Map (Map, singleton)
type Memo = Map Integer Int
syr :: Integer -> State Memo Int
syr n = do
mb <- gets (Map.lookup n)
case mb of
Just l -> return l
Nothing -> do
let m = if even n then n `quot` 2 else 3*n+1
l <- syr m
let l' = l+1
modify (Map.insert n l')
return l'
solve :: Integer -> Int -> Integer -> State Memo (Integer,Int)
solve maxi len start
| len > 1000000 = return (maxi,len)
| otherwise = do
l <- syr start
if len < l
then solve start l (start+1)
else solve maxi len (start+1)
p14 :: (Integer,Int)
p14 = evalState (solve 0 0 500000) (singleton 1 1)
but that will probably not gain too much (not even when you've added the necessary strictness). The trouble is that a lookup in a Map is not too cheap and an insertion is relatively expensive.
Another method is to keep a mutable array for the lookup. The code becomes more complicated, since you have to choose a reasonable upper bound for the values to cache (should be not much larger than the bound for the starting values) and deal with the parts of the sequences falling outside the memoised range. But an array lookup and write are fast. If you have 64-bit Ints, the below code runs pretty fast, here it takes 0.03s for a limit of 1 million, and 0.33s for a limit of 10 million, the corresponding (as closely as I reasonably could) C code runs in 0.018 resp. 0.2s.
module Main (main) where
import System.Environment (getArgs)
import Data.Array.ST
import Data.Array.Base
import Control.Monad.ST
import Data.Bits
import Data.Int
main :: IO ()
main = do
args <- getArgs
let bd = case args of
a:_ -> read a
_ -> 100000
print $ collMax bd
next :: Int -> Int
next n
| n .&. 1 == 0 = n `unsafeShiftR` 1
| otherwise = 3*n + 1
collMax :: Int -> (Int,Int16)
collMax upper = runST $ do
arr <- newArray (0,upper) 0 :: ST s (STUArray s Int Int16)
let go l m
| upper < m = go (l+1) $ next m
| otherwise = do
l' <- unsafeRead arr m
case l' of
0 -> do
l'' <- go 1 $ next m
unsafeWrite arr m (l'' + 1)
return (l+l'')
_ -> return (l+l'-1)
collect mi ml i
| upper < i = return (mi, ml)
| otherwise = do
l <- go 1 i
if l > ml
then collect i l (i+1)
else collect mi ml (i+1)
unsafeWrite arr 1 1
collect 1 1 2

Well, the C program uses unsigned long, but Integer can store arbitrarily large integers (it's a bignum). If you import Data.Word, then you can use Word, which is a machine-word-sized unsigned integer.
After replacing Integer with Word, and using ghc -O2 and gcc -O3, the C program runs in 0.72 seconds, while the Haskell programs runs in 1.92 seconds. 2.6x isn't bad. However, ghc -O2 doesn't always help, and this is one of the programs on which it doesn't! Using just -O, as you did, brings the runtime down to 1.90 seconds.
I tried replacing div with quot (which uses the same type of division as C; they only differ on negative inputs), but strangely it actually made the Haskell program run slightly slower for me.
You should be able to speed up the syr function with the help of this previous Stack Overflow question I answered about the same Project Euler problem.

On my current system (32-bit Core2Duo) your Haskell code, including all the optimizations given in the answers, takes 0.8s to compile and 1.2s to run.
You could transfer the run-time to compile-time, and reduce the run-time to nearly zero.
module Euler14 where
import Data.Word
import Language.Haskell.TH
terms :: Word -> Word
terms n = countTerms n 0
where
countTerms 1 acc = acc + 1
countTerms n acc | even n = countTerms (n `div` 2) (acc + 1)
| otherwise = countTerms (3 * n + 1) (acc + 1)
longestT :: Word -> Word -> (Word, Word)
longestT mi mx = find mi mx (0, 0)
where
find mi mx (ct,cn) | mi == mx = if ct > terms mi then (ct,cn) else (terms mi, mi)
| otherwise = find (mi + 1) mx
(if ct > terms mi then (ct,cn) else (terms mi, mi))
longest :: Word -> Word -> ExpQ
longest mi mx = return $ TupE [LitE (IntegerL (fromIntegral a)),
LitE (IntegerL (fromIntegral b))]
where
(a,b) = longestT mi mx
and
{-# LANGUAGE TemplateHaskell #-}
import Euler14
main = print $(longest 500000 999999)
On my system it takes 2.3s to compile this but the run-time goes down to 0.003s. Compile Time Function Execution (CTFE) is something you can't do in C/C++. The only other programming language that I know of that supports CTFE is the D programming language. And just to be complete, the C code takes 0.1s to compile and 0.7s to run.

Related

Haskell ways to the 3n+1 challenge

Here is a simple programming problem from SPOJ: http://www.spoj.com/problems/PROBTRES/.
Basically, you are asked to output the biggest Collatz cycle for numbers between i and j. (Collatz cycle of a number $n$ is the number of steps to eventually get from $n$ to 1.)
I have been looking for a Haskell way to solve the problem with comparative performance than that of Java or C++ (so as to fits in the allowed run-time limit). Although a simple Java solution that memoizes the cycle length of any already computed cycles will work. I haven't been successful at applying the idea to obtain a Haskell solution.
I have tried the Data.Function.Memoize, as well as home-brewed log time memoization technique using the idea from this post: Memoization in Haskell?. Unfortunately, memoization actually makes the computation of cycle(n) even slower. I believe the slow down comes from the overhead of haskell way. (I tried running with the compiled binary code, instead of interpreting.)
I also suspect that simply iterating numbers from i to j can be costly ($i,j\le10^6$). So I even tried precompute everything for the range query, using idea from http://blog.openendings.net/2013/10/range-trees-and-profiling-in-haskell.html. However, this still gives "Time Limit Exceeding" error.
Can you help to inform a neat competitive Haskell program for this?
Thanks!
>>> using the approach bellow, I could submit an accepted answer to SPOJ. You may check the entire code from here.
The problem has bounds 0 < n < 1,000,000. Pre-calculate all of them and store them inside an array; then freeze the array. The array can be used as its own cache / memoization space.
The problem would then reduce to a range query problem over an array, which can be done very efficiently using trees.
With the code bellow I can get Collatz of 1..1,000,000 in a fraction of a second:
$ time echo 1000000 | ./collatz
525
real 0m0.177s
user 0m0.173s
sys 0m0.003s
Note that collatz function below, uses mutable STUArray internally, but itself is a pure function:
import Control.Monad.ST (ST)
import Control.Monad (mapM_)
import Control.Applicative ((<$>))
import Data.Array.Unboxed (UArray, elems)
import Data.Array.ST (STUArray, readArray, writeArray, runSTUArray, newArray)
collatz :: Int -> UArray Int Int
collatz size = out
where
next i = if odd i then 3 * i + 1 else i `div` 2
loop :: STUArray s Int Int -> Int -> ST s Int
loop arr k
| size < k = succ <$> loop arr (next k)
| otherwise = do
out <- readArray arr k
if out /= 0 then return out
else do
out <- succ <$> loop arr (next k)
writeArray arr k out
return out
out = runSTUArray $ do
arr <- newArray (1, size) 0
writeArray arr 1 1
mapM_ (loop arr) [2..size]
return arr
main = do
size <- read <$> getLine
print . maximum . elems $ collatz size
In order to perform range queries on this array, you may build a balanced tree as simple as below:
type Range = (Int, Int)
data Tree = Leaf Int | Node Tree Tree Range Int
build_tree :: Int -> Tree
build_tree size = loop 1 cnt
where
ctz = collatz size
cnt = head . dropWhile (< size) $ iterate (*2) 1
(Leaf a) +: (Leaf b) = max a b
(Node _ _ _ a) +: (Node _ _ _ b) = max a b
loop lo hi
| lo == hi = Leaf $ if size < lo then minBound else ctz ! lo
| otherwise = Node left right (lo, hi) (left +: right)
where
i = (lo + hi) `div` 2
left = loop lo i
right = loop (i + 1) hi
query_tree :: Tree -> Int -> Int -> Int
query_tree (Leaf x) _ _ = x
query_tree (Node l r (lo, hi) x) i j
| i <= lo && hi <= j = x
| mid < i = query_tree r i j
| j < 1 + mid = query_tree l i j
| otherwise = max (query_tree l i j) (query_tree r i j)
where mid = (lo + hi) `div` 2
Here is the same as in the other answer, but with an immutable recursively defined array (and it also leaks slightly (can someone say why?) and so two times slower):
import Data.Array
upper = 10^6
step :: Integer -> Int
step i = 1 + colAt (if odd i then 3 * i + 1 else i `div` 2)
colAt :: Integer -> Int
colAt i | i > upper = step i
colAt i = col!i
col :: Array Integer Int
col = array (1, upper) $ (1, 1) : [(i, step i) | i <- [2..upper]]
main = print $ maximum $ elems col

My attempt at Project Euler #92 is too slow

I'm trying to solve Project Euler problem #92 with Haskell. I started learning Haskell recently. It's the first Project Euler problem I've tried to solve with Haskell, but my piece of code doesn't terminate even in 10 minutes. I know you don't give me the answer directly, but again I should warn I find answer with c++ doesn't give answer of Euler or new logic to solve Euler. I'm just curious why that guy doesn't work fast and what should I do to make it faster?
{--EULER 92--}
import Data.List
myFirstFunction 1 = 0
myFirstFunction 89 = 1
myFirstFunction x= myFirstFunction (giveResult x)
giveResult 0 = 0
giveResult x = (square (mod x 10)) + (giveResult (div x 10))
square x = x*x
a=[1..10000000]
main = putStrLn(show (sum (map myFirstFunction a)))
The biggest speedup can of course be gained from using a better algorithm. I'm not going deep into that here, though.
Original algorithm tweakings
So let's focus on improving the used algorithm without really changing it.
You never give any type signature, therefore the type defaults to the arbitrary precision Integer. Everything here fits easily in an Int, there's no danger of overflow, so let's use that. Adding a type signature myFirstFunction :: Int -> Int helps: time drops from Total time 13.77s ( 13.79s elapsed) to Total time 6.24s ( 6.24s elapsed) and total allocation drops by a factor of about 15. Not bad for such a simple change.
You use div and mod. These always compute a non-negative remainder and the corresponding quotient, so they need some extra checks in case some negative numbers are involved. The functions quot and rem map to the machine division instructions, they don't involve such checks and therefore are somewhat faster. If you compile via the LLVM backend (-fllvm), that also takes advantage of the fact that you always divide by a single known number (10), and converts the division into multiplication and bit-shift. Time now: Total time 1.56s ( 1.56s elapsed).
Instead of using quot and rem separately, let's use the quotRem function that computes both at once, so that we don't repeat the division (even with multiplication+shift that takes a little time):
giveResult x = case x `quotRem` 10 of
(q,r) -> r*r + giveResult q
That doesn't gain much, but a little: Total time 1.49s ( 1.49s elapsed).
You're using a list a = [1 .. 10000000], and map the function over that list and then sum the resulting list. That's idiomatic, neat and short, but not super fast, since allocating all those list cells and garbage collecting them takes time too - not very much, since GHC is very good at that, but transforming it into a loop
main = print $ go 0 1
where
go acc n
| n > 10000000 = acc
| otherwise = go (acc + myFirstFunction n) (n+1)
gains us a little still: Total time 1.34s ( 1.34s elapsed) and the allocation dropped from 880,051,856 bytes allocated in the heap for the last list version to 51,840 bytes allocated in the heap.
giveResult is recursive, and therefore cannot be inlined. The same holds for myFirstFunction, hence each computation needs two function calls (at least). We can avoid that by rewriting giveResult to a non-recursive wrapper and a recursive local loop,
giveResult x = go 0 x
where
go acc 0 = acc
go acc n = case n `quotRem` 10 of
(q,r) -> go (acc + r*r) q
so that that can be inlined: Total time 1.04s ( 1.04s elapsed).
Those were the most obvious points, further improvements - apart from the memoisation mentioned by hammar in the comments - would need some thinking.
We are now at
module Main (main) where
myFirstFunction :: Int -> Int
myFirstFunction 1 = 0
myFirstFunction 89 = 1
myFirstFunction x= myFirstFunction (giveResult x)
giveResult :: Int -> Int
giveResult x = go 0 x
where
go acc 0 = acc
go acc n = case n `quotRem` 10 of
(q,r) -> go (acc + r*r) q
main :: IO ()
main = print $ go 0 1
where
go acc n
| n > 10000000 = acc
| otherwise = go (acc + myFirstFunction n) (n+1)
With -O2 -fllvm, that runs in 1.04 seconds here, but with the native code generator (only -O2), it takes 3.5 seconds. That difference is due to the fact that GHC itself doesn't convert the division into a multiplication and bit-shift. If we do it by hand, we get pretty much the same performance from the native code generator.
Because we know something that the compiler doesn't, namely that we never deal with negative numbers here, and the numbers don't become large, we can even generate a better multiply-and-shift (that would produce wrong results for negative or large dividends) than the compiler and take the time down to 0.9 seconds for the native code generator and 0.73 seconds for the LLVM backend:
import Data.Bits
qr10 :: Int -> (Int, Int)
qr10 n = (q, r)
where
q = (n * 0x66666667) `unsafeShiftR` 34
r = n - 10 * q
Note: That requires that Int is a 64-bit type, it won't work with 32-bit Ints, it will produce wrong results for negative n, and the multiplication will overflow for large n. We're entering dirty-hack territory. We can alleviate the dirtyness by using Word instead of Int, that leaves only the overflow (which doesn't occur for n <= 10737418236 with Word resp n <= 5368709118 for Int, so here we are comfortably in the safe zone). The times aren't affected.
The corresponding C programme
#include <stdio.h>
unsigned int myFirstFunction(unsigned int i);
unsigned int giveResult(unsigned int i);
int main(void) {
unsigned int sum = 0;
for(unsigned int i = 1; i <= 10000000; ++i) {
sum += myFirstFunction(i);
}
printf("%u\n",sum);
return 0;
}
unsigned int myFirstFunction(unsigned int i) {
if (i == 1) return 0;
if (i == 89) return 1;
return myFirstFunction(giveResult(i));
}
unsigned int giveResult(unsigned int i) {
unsigned int acc = 0, r, q;
while(i) {
q = (i*0x66666667UL) >> 34;
r = i - q*10;
i = q;
acc += r*r;
}
return acc;
}
performs similarly, compiled with gcc -O3, it runs in 0.78 seconds, and with clang -O3 in 0.71.
That's pretty much the end without changing the algorithm.
Memoisation
Now, a minor change of algorithm is memoisation. If we build a lookup table for the numbers <= 7*9², we need only one computation of the sum of the squares of the digits for each number rather than iterating that until we reach 1 or 89, so let's memoise,
module Main (main) where
import Data.Array.Unboxed
import Data.Array.IArray
import Data.Array.Base (unsafeAt)
import Data.Bits
qr10 :: Int -> (Int, Int)
qr10 n = (q, r)
where
q = (n * 0x66666667) `unsafeShiftR` 34
r = n - 10 * q
digitSquareSum :: Int -> Int
digitSquareSum = go 0
where
go acc 0 = acc
go acc n = case qr10 n of
(q,r) -> go (acc + r*r) q
table :: UArray Int Int
table = array (0,567) $ assocs helper
where
helper :: Array Int Int
helper = array (0,567) [(i, f i) | i <- [0 .. 567]]
f 0 = 0
f 1 = 0
f 89 = 1
f n = helper ! digitSquareSum n
endPoint :: Int -> Int
endPoint n = table `unsafeAt` digitSquareSum n
main :: IO ()
main = print $ go 0 1
where
go acc n
| n > 10000000 = acc
| otherwise = go (acc + endPoint n) (n+1)
Doing the memoisation by hand instead of using a library makes the code longer, but we can tailor it to our needs. We can use an unboxed array, and we can omit the bounds check on the array access. Both significantly speed the computation up. The time is now 0.18 seconds for the native code generator, and 0.13 seconds for the LLVM backend. The corresponding C programme runs in 0.16 seconds compiled with gcc -O3, and 0.145 seconds compiled with clang -O3 (Haskell beats C, w00t!).
Scaling and a hint for a better algorithm
The used algorithm however doesn't scale too well, a bit worse than linear, and for an upper bound of 108 (with suitably adapted memoisation limit), it runs in 1.5 seconds (ghc -O2 -fllvm), resp. 1.64 seconds (clang -O3) and 1.87 seconds (gcc -O3) [2.02 seconds for the native code generator].
Using a different algorithm that counts the numbers whose sequence ends in 1 by partitioning such numbers into a sum of squares of digits (The only numbers that directly produce 1 are powers of 10. We can write
10 = 1×3² + 1×1²
10 = 2×2² + 2×1²
10 = 1×2² + 6×1²
10 = 10×1²
From the first, we obtain 13, 31, 103, 130, 301, 310, 1003, 1030, 1300, 3001, 3010, 3100, ...
From the second, we obtain 1122, 1212, 1221, 2112, 2121, 2211, 11022, 11202, ...
From the third 1111112, 1111121, ...
Only 13, 31, 103, 130, 301, 310 are possible sums of squares of the digits of numbers <= 10^10, so only those need to be investigated further. We can write
100 = 1×9² + 1×4² + 3×1²
...
100 = 1×8² + 1×6²
...
The first of these partitions generates no children since it requires five nonzero digits, the other explicitly given generates the two children 68 and 86 (also 608 if the limit is 108, more for larger limits)), we can get better scaling and a faster algorithm.
The fairly unoptimised programme I wrote way back when to solve this problem runs (input is exponent of 10 of the limit)
$ time ./problem92 7
8581146
real 0m0.010s
user 0m0.008s
sys 0m0.002s
$ time ./problem92 8
85744333
real 0m0.022s
user 0m0.018s
sys 0m0.003s
$ time ./problem92 9
854325192
real 0m0.040s
user 0m0.033s
sys 0m0.006s
$ time ./problem92 10
8507390852
real 0m0.074s
user 0m0.069s
sys 0m0.004s
in a different league.
First off, I took the liberty of cleaning up your code a little:
endsAt89 1 = 0
endsAt89 89 = 1
endsAt89 n = endsAt89 (sumOfSquareDigits n)
sumOfSquareDigits 0 = 0
sumOfSquareDigits n = (n `mod` 10)^2 + sumOfSquareDigits (n `div` 10)
main = print . sum $ map endsAt89 [1..10^7]
On my crappy netbook is 1 min 13 sec. Let's see if we can improve that.
Since the numbers are small, we can start by using machine-sized Int instead of arbitrary-size Integer. This is just a matter of adding type signatures, e.g.
sumOfSquareDigits :: Int -> Int
That improves the run time drastically to 20 seconds.
Since the numbers are all positive, we can replace div and mod with the slightly faster quot and rem, or even both in one go with quotRem:
sumOfSquareDigits :: Int -> Int
sumOfSquareDigits 0 = 0
sumOfSquareDigits n = r^2 + sumOfSquareDigits q
where (q, r) = quotRem x 10
Run time is now 17 seconds. Making it tail recursive shaves off another second:
sumOfSquareDigits :: Int -> Int
sumOfSquareDigits n = loop n 0
where
loop 0 !s = s
loop n !s = loop q (s + r^2)
where (q, r) = quotRem n 10
For further improvements, we can notice that sumOfSquareDigits returns at most 567 = 7 * 9^2 for the given input numbers, so we can memoize for small numbers to reduce the number of iterations needed. Here's my final version (using the data-memocombinators package for the memoization):
{-# LANGUAGE BangPatterns #-}
import qualified Data.MemoCombinators as Memo
endsAt89 :: Int -> Int
endsAt89 = Memo.arrayRange (1, 7*9^2) endsAt89'
where
endsAt89' 1 = 0
endsAt89' 89 = 1
endsAt89' n = endsAt89 (sumOfSquareDigits n)
sumOfSquareDigits :: Int -> Int
sumOfSquareDigits n = loop n 0
where
loop 0 !s = s
loop n !s = loop q (s + r^2)
where (q, r) = quotRem n 10
main = print . sum $ map endsAt89 [1..10^7]
This runs in just under 9 seconds on my machine.

Haskell solution for InterviewStreet String Similarity challenge

This is my best attempt to solve the String Similarity challenge for InterviewStreet.
import Control.Monad
import Data.Text as T
import qualified Data.Text.IO as TIO
sumSimilarities s = (T.length s) + (sum $ Prelude.map (similarity s) (Prelude.tail $ tails s))
similarity :: Text -> Text -> Int
similarity a b = case commonPrefixes a b of
Just (x,_,_) -> T.length x
Nothing -> 0
main = do
cases <- fmap read getLine
inputs <- replicateM cases TIO.getLine
forM_ inputs $ print . sumSimilarities
It only passes 7/10 of the test cases. Test cases 7, 8, and 9 fail because they exceed the allotted execution time.
I'm half trying to verify that this is indeed possible to solve in Haskell and half looking for what an optimized Haskell program looks like.
Thanks!
Tyler
Like user5402, I'd be curious whether an equivalent (for certain values of equivalent) C programme would finish within the time limit or also time out. If it would, it would be interesting to see whether an equivalent programme using ByteStrings could finish in time. - Not that ByteStrings are per se faster than Text, but since the input must be converted to the internal representation of Text while ByteString takes it as is, that might make a difference. Another possible reason that ByteStrings might be faster - if the testing machines have 32-bit GHCs - would be that text's fusion at least used to need more registers than generally available on 32 bit architectures to get full profit [a long time ago, in the days of text-0.5 to text-0.7, on my 32-bit box, bytestring used to be quite a bit faster, no idea whether that still holds for newer text versions].
Okay, since user5402 has verified that the naïve algorithm is fast enough in C, I've gone ahead and wrote an implementation of the naïve algorithm using ByteStrings
{-# LANGUAGE BangPatterns #-}
module Main (main) where
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as C
import qualified Data.ByteString.Unsafe as U
import Control.Monad
import Data.Word
main :: IO ()
main = do
cl <- C.getLine
case C.readInt cl of
Just (cases,_) -> replicateM_ cases (C.getLine >>= print . similarity)
Nothing -> return ()
-- Just to keep the condition readable.
(?) :: B.ByteString -> Int -> Word8
(?) = U.unsafeIndex
similarity :: B.ByteString -> Int
similarity bs
| len == 0 = 0
| otherwise = go len 1
where
!len = B.length bs
go !acc i
| i < len = go (acc + prf 0 i) (i+1)
| otherwise = acc
prf !k j
| j < len && bs ? k == bs ? j = prf (k+1) (j+1)
| otherwise = k
and compared it to the OP's Text version on some bad cases. On my box, that is more than four times faster than the Text version, so it'd be interesting whether that's fast enough (the C version is another 4.5 times faster, so it may well not be).
However, I consider it more likely that the time limit is exceeded due to using the naïve algorithm that has quadratic worst-case behaviour. Probably there are test cases that evoke the worst-case for the naïve algorithm.
So the solution would be to use an algorithm that scales better, optimally linear. One linear algorithm to compute the similarity of a string is the Z-algorithm.
The idea is simple (but, like most good ideas, not easy to have). Let us call a (non-empty) substring that is also a prefix of the string a prefix-substring. To avoid recomputation, the algorithm uses a window of the prefix-substring starting before the currently considered index that extends farthest to the right (initially, the window is empty).
Variables used and invariants of the algorithm:
i, the index under consideration, starts at 1 (for 0-based indexing; the entire string is not considered) and is incremented to length - 1
left and right, the first and last index of the prefix-substring window; invariants:
left < i, left <= right < length(S), either left > 0 or right < 1,
if left > 0, then S[left .. right] is the maximal common prefix of S and S[left .. ],
if 1 <= j < i and S[j .. k] is a prefix of S, then k <= right
An array Z, invariant: for 1 <= k < i, Z[k] contains the length of the longest common prefix of S[k .. ] and S.
The algorithm:
Set i = 1, left = right = 0 (any values with left <= right < 1 are allowed), and set Z[j] = 0 for all indices 1 <= j < length(S).
If i == length(S), stop.
If i > right, find the length l of the longest common prefix of S and S[i .. ], store it in Z[i]. If l > 0 we have found a window extending farther right than the previous, then set left = i and right = i+l-1, otherwise leave them unchanged. Increment i and go to 2.
Here left < i <= right, so the substring S[i .. right] is known - since S[left .. right] is a prefix of S, it is equal to S[i-left .. right-left].
Now consider the longest common prefix of S with the substring starting at index i - left.
Its length is Z[i-left], hence S[k] = S[i-left + k] for 0 <= k < Z[i-left] and
S[Z[i-left]] ≠ S[i-left+Z[i-left]]. Now, if Z[i-left] <= right-i, then i + Z[i-left] is inside the known window, therefore
S[i + Z[i-left]] = S[i-left + Z[i-left]] ≠ S[Z[i-left]]
S[i + k] = S[i-left + k] = S[k] for 0 <= k < Z[i-left]
and we see that the length of the longest common prefix of S and S[i .. ] has length Z[i-left].
Then set Z[i] = Z[i-left], increment i, and go to 2.
Otherwise, S[i .. right] is a prefix of S and we check how far it extends, starting the comparison of characters at the indices right+1 and right+1 - i. Let the length be l. Set Z[i] = l, left = i, right = i + l - 1, increment i, and go to 2.
Since the window never moves left, and the comparisons always start after the end of the window, each character in the string is compared at most once successfully to an earlier character in the string, and for each starting index, there is at most one unsuccessful comparison, therefore the algorithm is linear.
The code (using ByteString out of habit, ought to be trivially portable to Text):
{-# LANGUAGE BangPatterns #-}
module Main (main) where
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as C
import qualified Data.ByteString.Unsafe as U
import Data.Array.ST
import Data.Array.Base
import Control.Monad.ST
import Control.Monad
import Data.Word
main :: IO ()
main = do
cl <- C.getLine
case C.readInt cl of
Just (cases,_) -> replicateM_ cases (C.getLine >>= print . similarity)
Nothing -> return ()
-- Just to keep the condition readable.
(?) :: B.ByteString -> Int -> Word8
(?) = U.unsafeIndex
-- Calculate the similarity of a string using the Z-algorithm
similarity :: B.ByteString -> Int
similarity bs
| len == 0 = 0
| otherwise = runST getSim
where
!len = B.length bs
getSim = do
za <- newArray (0,len-1) 0 :: ST s (STUArray s Int Int)
-- The common prefix of the string with itself is entire string.
unsafeWrite za 0 len
let -- Find the length of the common prefix.
go !k j
| j < len && (bs ? j == bs ? k) = go (k+1) (j+1)
| otherwise = return k
-- The window with indices in [left .. right] is the prefix-substring
-- starting before i that extends farthest.
loop !left !right i
| i >= len = count 0 0 -- when done, sum
| i > right = do
-- We're outside the window, simply
-- find the length of the common prefix
-- and store it in the Z-array.
w <- go 0 i
unsafeWrite za i w
if w > 0
-- We got a non-empty common prefix and a new window.
then loop i (i+w-1) (i+1)
-- No new window, same procedure at next index.
else loop left right (i+1)
| otherwise = do
-- We're inside the window, so the substring starting at
-- (i - left) has a common prefix with the substring
-- starting at i of length at least (right - i + 1)
-- (since the [left .. right] window is a prefix of bs).
-- But we already know how long the common prefix
-- starting at (i - left) is.
z <- unsafeRead za (i-left)
let !s = right-i+1 -- length of known prefix starting at i
if z < s
-- If the common prefix of the substring starting at
-- (i - left) is shorter than the rest of the window,
-- the common prefix of the substring starting at i
-- is the same. Store it and move on with the same window.
then do
unsafeWrite za i z
loop left right (i+1)
else do
-- Otherwise, find out how far the common prefix
-- extends, starting at (right + 1) == s + i.
w <- go s (s+i)
unsafeWrite za i w
loop i (i+w-1) (i+1)
count !acc i
| i == len = return acc
| otherwise = do
n <- unsafeRead za i
count (acc+n) (i+1)
loop 0 0 1

Caching in Haskell and explicit parallelism

I'm currently trying to optimize my solution to problem 14 at Projet Euler.
I really enjoy Haskell and I think it's a very good fit for these kind of problems, here's three different solutions I've tried:
import Data.List (unfoldr, maximumBy)
import Data.Maybe (fromJust, isNothing)
import Data.Ord (comparing)
import Control.Parallel
next :: Integer -> Maybe (Integer)
next 1 = Nothing
next n
| even n = Just (div n 2)
| odd n = Just (3 * n + 1)
get_sequence :: Integer -> [Integer]
get_sequence n = n : unfoldr (pack . next) n
where pack n = if isNothing n then Nothing else Just (fromJust n, fromJust n)
get_sequence_length :: Integer -> Integer
get_sequence_length n
| isNothing (next n) = 1
| otherwise = 1 + (get_sequence_length $ fromJust (next n))
-- 8 seconds
main1 = print $ maximumBy (comparing length) $ map get_sequence [1..1000000]
-- 5 seconds
main2 = print $ maximum $ map (\n -> (get_sequence_length n, n)) [1..1000000]
-- Never finishes
main3 = print solution
where
s1 = maximumBy (comparing length) $ map get_sequence [1..500000]
s2 = maximumBy (comparing length) $ map get_sequence [500001..10000000]
solution = (s1 `par` s2) `pseq` max s1 s2
Now if you look at the actual problem there's a lot of potential for caching, as most new sequences will contain subsequences that have already been calculated before.
For comparison, I wrote a version in C too:
Running time with caching: 0.03 seconds
Running time without caching: 0.3 seconds
That's just insane! Sure, caching reduced the time by a factor of 10, but even without caching it's still at least 17 times faster than my Haskell code.
What's wrong with my code?
Why doesn't Haskell cache the function calls for me? As the functions are pure caching shouldn't caching be trivial, only a matter of available memory?
What's the problem with my third parallel version? Why doesn't it finish?
Regarding Haskell as a language, does the compiler automatically parallellize some code (folds, maps etc), or does it always have to be done explicitly using Control.Parallel?
Edit: I stumbled upon this similar question. They mentioned that his function wasn't tail-recursive. Is my get_sequence_length tail recursive? If not how can I make it so?
Edit2: To Daniel:
Thanks a lot for the reply, really awesome.
I've been playing around with your improvements and I've found some really bad gotchas.
I'm running the tests on Windws 7 (64-bit), 3.3 GHZ Quad core with 8GB RAM.
The first thing I did was as you say replace all Integer with Int, but whenever I ran any of the mains I ran out of memory,
even with +RTS kSize -RTS set ridiciously high.
Eventually I found this (stackoverflow is awesome...), which means that since all Haskell programs on Windows are run as 32-bit, the Ints were overflowing causing infinite recursion, just wow...
I ran the tests in a Linux virtual machine (with the 64-bit ghc) instead and got similar results.
Alright, let's start from the top. First important thing is to give the exact command line you're using to compile and run; for my answer, I'll use this line for the timings of all programs:
ghc -O2 -threaded -rtsopts test && time ./test +RTS -N
Next up: since timings vary greatly from machine to machine, we'll give some baseline timings for my machine and your programs. Here's the output of uname -a for my computer:
Linux sorghum 3.4.4-2-ARCH #1 SMP PREEMPT Sun Jun 24 18:59:47 CEST 2012 x86_64 Intel(R) Core(TM)2 Quad CPU Q6600 # 2.40GHz GenuineIntel GNU/Linux
The highlights are: quad-core, 2.4GHz, 64-bit.
Using main1: 30.42s user 2.61s system 149% cpu 22.025 total
Using main2: 21.42s user 1.18s system 129% cpu 17.416 total
Using main3: 22.71s user 2.02s system 220% cpu 11.237 total
Actually, I modified main3 in two ways: first, by removing one of the zeros from the end of the range in s2, and second, by changing max s1 s2 to maximumBy (comparing length) [s1, s2], since the former only accidentally computes the right answer. =)
I'll now focus on serial speed. (To answer one of your direct questions: no, GHC does not automatically parallelize or memoize your programs. Both of those things have overheads that are very difficult to estimate, and consequently it's very difficult to decide when doing them will be beneficial. I have no idea why even the serial solutions in this answer are getting >100% CPU utilization; perhaps some garbage collection is happening in another thread or some such thing.) We'll start from main2, since it was the faster of the two serial implementations. The cheapest way to get a little boost is to change all the type signatures from Integer to Int:
Using Int: 11.17s user 0.50s system 129% cpu 8.986 total (about twice as fast)
The next boost comes from reducing allocation in the inner loop (eliminating the intermediate Maybe values).
import Data.List
import Data.Ord
get_sequence_length :: Int -> Int
get_sequence_length 1 = 1
get_sequence_length n
| even n = 1 + get_sequence_length (n `div` 2)
| odd n = 1 + get_sequence_length (3 * n + 1)
lengths :: [(Int,Int)]
lengths = map (\n -> (get_sequence_length n, n)) [1..1000000]
main = print (maximumBy (comparing fst) lengths)
Using this: 4.84s user 0.03s system 101% cpu 4.777 total
The next boost comes from using faster operations than even and div:
import Data.Bits
import Data.List
import Data.Ord
even' n = n .&. 1 == 0
get_sequence_length :: Int -> Int
get_sequence_length 1 = 1
get_sequence_length n = 1 + get_sequence_length next where
next = if even' n then n `quot` 2 else 3 * n + 1
lengths :: [(Int,Int)]
lengths = map (\n -> (get_sequence_length n, n)) [1..1000000]
main = print (maximumBy (comparing fst) lengths)
Using this: 1.27s user 0.03s system 105% cpu 1.232 total
For those following along at home, this is about 17 times faster than the main2 that we started with -- a competitive improvement with switching to C.
For memoization, there's a few choices. The simplest is to use a pre-existing package like data-memocombinators to create an immutable array and read from it. The timings are fairly sensitive to choosing a good size for this array; for this problem, I found 50000 to be a pretty good upper bound.
import Data.Bits
import Data.MemoCombinators
import Data.List
import Data.Ord
even' n = n .&. 1 == 0
pre_length :: (Int -> Int) -> (Int -> Int)
pre_length f 1 = 1
pre_length f n = 1 + f next where
next = if even' n then n `quot` 2 else 3 * n + 1
get_sequence_length :: Int -> Int
get_sequence_length = arrayRange (1,50000) (pre_length get_sequence_length)
lengths :: [(Int,Int)]
lengths = map (\n -> (get_sequence_length n, n)) [1..1000000]
main = print (maximumBy (comparing fst) lengths)
With this: 0.53s user 0.10s system 149% cpu 0.421 total
The fastest of all is to use a mutable, unboxed array for the memoization bit. It's much less idiomatic, but it's bare-metal speed. The speed is much less sensitive on the size of this array, so long as the array is about as large as the biggest thing you want the answer for.
import Control.Monad
import Control.Monad.ST
import Data.Array.Base
import Data.Array.ST
import Data.Bits
import Data.List
import Data.Ord
even' n = n .&. 1 == 0
next n = if even' n then n `quot` 2 else 3 * n + 1
get_sequence_length :: STUArray s Int Int -> Int -> ST s Int
get_sequence_length arr n = do
bounds#(lo,hi) <- getBounds arr
if not (inRange bounds n) then (+1) `fmap` get_sequence_length arr (next n) else do
let ix = n-lo
v <- unsafeRead arr ix
if v > 0 then return v else do
v' <- get_sequence_length arr (next n)
unsafeWrite arr ix (v'+1)
return (v'+1)
maxLength :: (Int,Int)
maxLength = runST $ do
arr <- newArray (1,1000000) 0
writeArray arr 1 1
loop arr 1 1 1000000
where
loop arr n len 1 = return (n,len)
loop arr n len n' = do
len' <- get_sequence_length arr n'
if len' > len then loop arr n' len' (n'-1) else loop arr n len (n'-1)
main = print maxLength
With this: 0.16s user 0.02s system 138% cpu 0.130 total (which is competitive with the memoized C version)
GHC won't parallel-ize anything automatically for you. And as you guess get_sequence_length is not tail-recursive. See here. And consider how the compiler (unless it's doing some nice optimizations for you) can't evaluate all those recursive additions until you hit the end; you're "building up thunks" which isn't usually a good thing.
Try instead calling a recursive helper function and passing an accumulator, or try defining it in terms of foldr.

GHC Optimization: Collatz conjecture

I've written code for the Project Euler's Challenge 14, in both Haskell and C++ (ideone links). They both remember any calculations they have previously done in an array.
Using ghc -O2 and g++ -O3 respectively, the C++ runs 10-15 times faster than the Haskell version.
Whilst I understand the Haskell version may run slower, and that Haskell is a nicer language to write in, it would be nice to know some code changes I can make to the Haskell version to make it run faster (ideally within a factor of 2 or 3 of the C++ version)?
Haskell code is here:
import Data.Array
import Data.Word
import Data.List
collatz_array =
let
upperbound = 1000000
a = array (1, upperbound) [(i :: Word64, f i :: Int) | i <- [1..upperbound]]
f i = i `seq`
let
check_f i = i `seq` if i <= upperbound then a ! i else f i
in
if (i == 1) then 0 else (check_f ((if (even i) then i else 3 * i + 1) `div` 2)) + 1
in a
main =
putStrLn $ show $
foldl1' (\(x1,x2) (y1,y2) -> if (x2 >= y2) then (x1, x2) else (y1, y2)) $! (assocs collatz_array)
Edit:
I've now also done a version using unboxed mutable arrays. It is still 5 times slower than the C++ version, but a significant improvement. The code is on ideone here.
I'd like to know improvements to the mutable array version which bring it closer to the C++ version.
Some problems with your (mutable array) code:
You use a fold to find the maximal chain length, for that the array has to be converted to an association list, that takes time and allocation the C++ version doesn't need.
You use even and div for testing resp dividing by 2. These are slow. g++ optimises both operations to the faster bit operations (on platforms where that is supposedly faster, at least), but GHC doesn't do these low-level optimisations (yet), so for the time being, they have to be done by hand.
You use readArray and writeArray. The extra bounds-checking that isn't done in the C++ code also takes time, once the other problems are dealt with, that amounts to a significant portion of the running time (ca. 25% on my box), since there are done a lot of reads and writes in the algorithm.
Incorporating that into the implementation, I get
import Data.Array.ST
import Data.Array.Base
import Control.Monad.ST
import Data.Bits
collatz_array :: ST s (STUArray s Int Int)
collatz_array = do
let upper = 10000000
arr <- newArray (0,upper) 0
unsafeWrite arr 2 1
let check i
| upper < i = return arr
| i .&. 1 == 0 = do
l <- unsafeRead arr (i `shiftR` 1)
unsafeWrite arr i (l+1)
check (i+1)
| otherwise = do
let j = (3*i+1) `shiftR` 1
find k l
| upper < k = find (next k) $! l+1
| k < i = do
m <- unsafeRead arr k
return (m+l)
| otherwise = do
m <- unsafeRead arr k
if m == 0
then do
n <- find (next k) 1
unsafeWrite arr k n
return (n+l)
else return (m+l)
where
next h
| h .&. 1 == 0 = h `shiftR` 1
| otherwise = (3*h+1) `shiftR` 1
l <- find j 1
unsafeWrite arr i l
check (i+1)
check 3
collatz_max :: ST s (Int,Int)
collatz_max = do
car <- collatz_array
(_,upper) <- getBounds car
let find w m i
| upper < i = return (w,m)
| otherwise = do
l <- unsafeRead car i
if m < l
then find i l (i+1)
else find w m (i+1)
find 1 0 2
main :: IO ()
main = print (runST collatz_max)
And the timings (both for 10 million):
$ time ./cccoll
8400511 429
real 0m0.210s
user 0m0.200s
sys 0m0.009s
$ time ./stcoll
(8400511,429)
real 0m0.341s
user 0m0.307s
sys 0m0.033s
which doesn't look too bad.
Important note: That code only works on 64-bit GHC (so, in particular, on Windows, you need ghc-7.6.1 or later, previous GHCs were 32-bit even on 64-bit Windows) since intermediate chain elements exceed 32-bit range. On 32-bit systems, one would have to use Integer or a 64-bit integer type (Int64 or Word64) for following the chains, at a drastic performance cost, since the primitive 64-bit operations (arithmetic and shifts) are implemented as foreign calls to C functions in 32-bit GHCs (fast foreign calls, but still much slower than direct machine ops).
The ideone site is using a ghc 6.8.2, which is getting pretty old. On ghc version 7.4.1, the difference is much smaller.
With ghc:
$ ghc -O2 euler14.hs && time ./euler14
(837799,329)
./euler14 0.63s user 0.04s system 98% cpu 0.685 total
With g++ 4.7.0:
$ g++ --std=c++0x -O3 euler14.cpp && time ./a.out
8400511 429
./a.out 0.24s user 0.01s system 99% cpu 0.252 total
For me, the ghc version is only 2.7 times slower than the c++ version.
Also, the two programs aren't giving the same result... (not a good sign, especially for benchmarking)

Resources