Fibonacci's closed-form expression, the ST monad, and Haskell - haskell

Two recent questions about Fibonacci's closed-form expression (here and here) as well as the HaskellWiki's page about the ST monad motivated me to try and compare two ways of calculating Fibonacci numbers.
The first implementation uses the closed-form expression together with rationals as seen in hammar's answer here (where Fib is a datatype abstracting numbers of the form a+b*√5):
fibRational :: Integer -> Integer
fibRational n = divSq5 $ phi^n - (1-phi)^n
  where
    phi = Fib (1/2) (1/2)
    divSq5 (Fib 0 b) = numerator b
The second implementation is from the HaskellWiki's page about the ST monad, with some added strictness that was necessary in order to avoid a stack overflow:
fibST :: Integer -> Integer
fibST n | n < 2 = n
fibST n = runST $ do
    x <- newSTRef 0
    y <- newSTRef 1
    fibST' n x y
  where
    fibST' 0 x _ = readSTRef x
    fibST' !n x y = do
      x' <- readSTRef x
      y' <- readSTRef y
      y' `seq` writeSTRef x y'
      x' `seq` writeSTRef y (x'+y')
      fibST' (n-1) x y
For reference, here's also the full code that I used for testing:
{-# LANGUAGE BangPatterns #-}
import Data.Ratio
import Data.STRef.Strict
import Control.Monad.ST.Strict
import System.Environment
data Fib =
  Fib !Rational !Rational
  deriving (Eq, Show)
instance Num Fib where
  negate (Fib a b) = Fib (-a) (-b)
  (Fib a b) + (Fib c d) = Fib (a+c) (b+d)
  (Fib a b) * (Fib c d) = Fib (a*c+5*b*d) (a*d+b*c)
  fromInteger i = Fib (fromInteger i) 0
  abs = undefined
  signum = undefined
fibRational :: Integer -> Integer
fibRational n = divSq5 $ phi^n - (1-phi)^n
  where
    phi = Fib (1/2) (1/2)
    divSq5 (Fib 0 b) = numerator b
fibST :: Integer -> Integer
fibST n | n < 2 = n
fibST n = runST $ do
    x <- newSTRef 0
    y <- newSTRef 1
    fibST' n x y
  where
    fibST' 0 x _ = readSTRef x
    fibST' !n x y = do
      x' <- readSTRef x
      y' <- readSTRef y
      y' `seq` writeSTRef x y'
      x' `seq` writeSTRef y (x'+y')
      fibST' (n-1) x y
main = do
  (m:n:_) <- getArgs
  let n' = read n
      st = fibST n'
      rt = fibRational n'
  case m of
    "st" -> print st
    "rt" -> print rt
    "cm" -> print (st == rt)
Now it turns out that the ST version is significantly slower than the closed-form version, although I'm not a hundred percent sure why:
# time ./fib rt 1000000 >/dev/null
./fib rt 1000000 > /dev/null 0.23s user 0.00s system 99% cpu 0.235 total
# time ./fib st 1000000 >/dev/null
./fib st 1000000 > /dev/null 11.35s user 0.06s system 99% cpu 11.422 total
So my question is: Can someone help me understand why the first implementation is so much faster? Is it algorithmic complexity, overhead or something else entirely? (I checked that both functions yield the same result). Thanks!

You are comparing very different versions here. To make it fair, here is an implementation that is equivalent to the ST solution you give, but in pure Haskell:
fibIt :: Integer -> Integer
fibIt n | n < 2 = n
fibIt n = go 1 1 (n-2)
  where go !_x !y 0 = y
        go !x !y i = go y (x+y) (i-1)
This one seems to perform exactly as well (or as badly) as the ST version (both take about 10s here). The runtime is most likely dominated by the Integer additions, so the overhead of the references is too small to be measurable.

First, the two implementations use two very different algorithms with different asymptotic complexity (well, depending on what the complexity of the Integer operations is).
Second, the st implementation is using references. References are (comparatively) slow in ghc. (Because updating a reference needs a GC write barrier due to the generational garbage collector.)
So, you're comparing two functions that differ both in algorithm and in implementation technique.
You should rewrite the second one not to use references, that way you can compare just algorithms. Or rewrite the first one to use references. But why use references when it's the wrong thing? :)

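You can compare the algorithmic complexities, counted in arithmetic operations.
The first needs O(log n) multiplications, because (^) works by repeated squaring;
the second needs O(n) additions.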
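To make the difference in operation counts concrete, here is a minimal sketch of a logarithmic-step Fibonacci. It is not the code from the question: it swaps in the well-known "fast doubling" identities F(2m) = F(m)(2F(m+1) - F(m)) and F(2m+1) = F(m)^2 + F(m+1)^2, which halve the index at every step in the same way repeated squaring does. The name fibDoubling is made up for this illustration, and the argument is assumed to be non-negative.

```haskell
-- Fast-doubling Fibonacci: a sketch, assuming n >= 0.
-- `go k` returns the pair (F(k), F(k+1)) and recurses on k `div` 2,
-- so only O(log n) big-number multiplications are performed.
fibDoubling :: Integer -> Integer
fibDoubling = fst . go
  where
    go :: Integer -> (Integer, Integer)
    go 0 = (0, 1)
    go k =
      let (a, b) = go (k `div` 2)   -- (F(m), F(m+1)) with m = k `div` 2
          c = a * (2 * b - a)       -- F(2m)
          d = a * a + b * b         -- F(2m+1)
      in if even k then (c, d) else (d, c + d)
```

For example, fibDoubling 10 evaluates to 55 after only four halving steps, whereas the iterative versions perform on the order of n additions.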

Related

Haskell Space Leak

While trying to solve a programming quiz (https://www.hackerrank.com/challenges/missing-numbers), I ran into a space leak.
The main function is difference, which implements multiset difference.
Heap profiling with the -hT option shows that list cells (:) and triples (,,) are retained on the heap. However, the only big lists are difference's two arguments, and they shrink as difference proceeds through its tail recursion.
Still, the memory consumed by lists keeps increasing as the program runs.
The triple is an ephemeral array structure, used for bookkeeping the count of each element of the multiset. The memory consumed by triples also keeps increasing, and I cannot find out why.
Though I've browsed similar 'space leak' questions on Stack Overflow, I couldn't grasp the idea. Surely I have much to study.
I would appreciate any comments. Thank you.
P.S. The executable is compiled with the -O2 switch.
$ ./difference -hT < input04.txt
Stack space overflow: current size 8388608 bytes.
$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.6.3
import Data.List
import Data.Array
-- array (non-zero-count, start-offset, array_data)
array_size=101
myindex :: Int -> Int -> Int
myindex key offset
    | key >= offset = key - offset
    | otherwise     = key - offset + array_size
mylookup x (_,offset,arr) = arr ! idx
    where idx = myindex x offset
addOrReplace :: Int -> Int -> (Int, Int, Array Int (Int,Int)) -> (Int, Int, Array Int (Int,Int))
addOrReplace key value (count,offset,arr) = (count', offset, arr // [(idx,(key,value))])
    where idx = myindex key offset
          (_,prev_value) = arr ! idx
          count' = case (prev_value, value) of
                      (0,0) -> count
                      (0,_) -> count + 1
                      (_,0) -> count - 1
                      otherwise -> count
difference :: (Int,Int,Array Int (Int,Int)) -> [Int] -> [Int] -> [Int]
difference (count,offset,arr) [] []
    | count == 0 = []
    | otherwise  = [ k | x <- [0..array_size-1], let (k,v) = (arr ! x), v /= 0]
difference m (x:xs) y = difference new_m xs y
    where (_,v) = mylookup x m
          new_m = addOrReplace x (v + 1) m
difference m [] (y:ys) = difference new_m [] ys
    where (_,v) = mylookup y m
          new_m = if v == 0
                  then m
                  else addOrReplace y (v - 1) m
main = do
    n <- readLn :: IO Int
    pp <- getLine
    m <- readLn :: IO Int
    qq <- getLine
    let p = map (read :: String->Int) . words $ pp
        q = map (read :: String->Int) . words $ qq
        startArray = (0,head q, array (0,100) [(i,(0,0)) | i <- [0..100]] )
    putStrLn . unwords . map show . sort $ difference startArray q p
[EDIT]
I seq'ed value and Array thanks to Carl's advice.
I attach heap diagram.
[original heap profiling]
[after seq'ing value v]
difference m (x:xs) y = difference new_m xs y
    where (_,v) = mylookup x m
          new_m = v `seq` addOrReplace x (v + 1) m
[after seq'ing value v and Array]
difference m (x:xs) y = new_m `seq` difference new_m xs y
    where (_,v) = mylookup x m
          new_m = v `seq` addOrReplace x (v + 1) m
I see three main problems with this code.
First (and this is not the cause of the memory use, but definitely the cause of generally poor performance), Array is horrible for this use case. O(1) lookups are useless when updates are O(n).
Second, the values being stored in the Array aren't forced while difference is looping over its first input. They are thunks containing pointers to an unevaluated lookup in the previous version of the array. You can ensure that the value is evaluated at the same time the array is updated, in a variety of ways. When difference loops over its second input, it does this accidentally, in fact, by comparing the value against 0.
Third, difference doesn't even force the evaluation of the new arrays being created while traversing its first argument. Nothing requires the old array to be evaluated during that portion of the loop.
Both of those latter issues need to be resolved to fix the space leak. The first issue doesn't cause a space leak, just much higher overheads than needed.
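One possible way to sidestep all three issues at once is to replace the immutable Array with a strict map. The following is only a sketch under that assumption, not a drop-in fix of the code above: it uses Data.Map.Strict from the containers package, and the names counts and missing are invented for this example. Updates cost O(log n) instead of copying the whole array, and the strict API forces each count as it is stored, so no chain of thunks builds up.

```haskell
import qualified Data.Map.Strict as M
import Data.List (foldl')

-- Count the occurrences of each element.
-- foldl' forces the map at every step, and Data.Map.Strict forces
-- each stored count, so nothing is left as an unevaluated thunk.
counts :: [Int] -> M.Map Int Int
counts = foldl' (\m x -> M.insertWith (+) x 1 m) M.empty

-- Elements that occur more often in the first list than in the
-- second, in ascending order (multiset difference, keys only).
missing :: [Int] -> [Int] -> [Int]
missing big small =
  M.keys . M.filter (> 0) $
    M.unionWith (+) (counts big) (M.map negate (counts small))
```

With this sketch, missing [1,2,2,3,3,3] [1,2,3] yields [2,3]. Because M.keys already returns the keys in ascending order, a separate sort would not be needed.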

Solve the equation a * b = c, where a, b and c are natural numbers

I have some natural number c. I want to find all pairs of natural numbers a and b, where a < b, such that a * b = c.
I have a solution:
solve c = do solveHelper [1..c] c where
    solveHelper xs c = do
        x <- xs
        (division, modulo ) <- return (c `divMod` x)
        True <- return (modulo == 0)
        True <- return (x <= division)
        return (x, division)
Example:
*Main> solve 10
[(1,10),(2,5)]
Is there a way to accelerate my code, or a better algorithm I should use?
You can do much, much better. The basic idea is this: first, factorize the number; then enumerate the partitions of the factorization. The product of each partition is a solution. There are fast factorization algorithms out there, but even the naive one is quite an improvement on your code; so:
factorize :: Integer -> [Integer]
factorize n
    | n < 1 = error "no. =("
    | otherwise = go 2 n
    where
    go p n | p * p > n = [n]
    go p n = case quotRem n p of
        (q, 0) -> p:go p q
        _      -> go (p+1) n
I will use the very nice multiset-comb package to compute partitions of the set of factors. It doesn't support the usual Foldable/Traversable stuff out of the box, so we have to roll our own product operation -- but in fact this can be a bit more efficient than using the product that the standard interface would give us anyway.
import Math.Combinatorics.Multiset
productMS :: Multiset Integer -> Integer
productMS (MS cs) = product [n^p | (n, p) <- cs]
divisors :: Integer -> [(Integer, Integer)]
divisors n =
    [ (a, b)
    | (aMS, bMS) <- splits (fromList (factorize n))
    , let a = productMS aMS; b = productMS bMS
    , a <= b
    ]
For unfair timings, we can compare in ghci:
*Main> :set +s
*Main> length $ solve (product [1..10])
135
(3.55 secs, 2,884,836,952 bytes)
*Main> length $ divisors (product [1..10])
135
(0.00 secs, 4,612,104 bytes)
*Main> length $ solve (product [1..15])
^CInterrupted. [after several minutes, I gave up]
*Main> length $ divisors (product [1..15])
2016
(0.03 secs, 33,823,168 bytes)
Here solve is your solution, divisors is mine. For a fair comparison, we should compile; I used this program:
main = print . last . solve . product $ [1..11]
(And similar with divisors in place of solve.) I compiled with -O2; yours used 1.367s total, mine 0.002s total.
There's one optimization you don't use: you don't have to try every value from 1 to c.
a < b and a * b = c, so a * a < c, meaning you only have to try numbers from 1 up to sqrt c. Or, if you don't want to compute the square root of c, you can stop as soon as a * a >= c.
To do so, you can replace [1..c] by (takeWhile (\x -> x * x < c) [1..]).
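Put together, the bounded search could look like the sketch below. It is written as a list comprehension rather than in the do-notation of the question, and the name pairs is made up for this example. It keeps the strict condition a < b, so a perfect square c does not produce the pair (sqrt c, sqrt c).

```haskell
-- All (a, b) with a < b and a * b == c.
-- takeWhile stops the search as soon as a * a >= c, so only about
-- sqrt c candidates are tried instead of c.
pairs :: Integer -> [(Integer, Integer)]
pairs c =
  [ (a, b)
  | a <- takeWhile (\x -> x * x < c) [1 ..]
  , let (b, r) = c `divMod` a
  , r == 0
  ]
```

Here pairs 10 gives [(1,10),(2,5)], the same result as solve 10 in the question, and pairs 16 gives [(1,16),(2,8)].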

Short-circuiting a function over a lower triangular(ish) array in Haskell: speed leads to ugly code

I've got a function, in my minimum example called maybeProduceValue i j, which is only valid when i > j. Note that in my actual code, the js are not uniform and so the data only resembles a triangular matrix, I don't know what the mathematical name for this is.
I'd like my code, which loops over i and j and returns essentially (where js is sorted)
[maximum [f i j | j <- js, j < i] | i <- [0..iMax]]
to not check any more j's once one has failed. In C-like languages, this is as simple as
if (j >= i) {break;}
and I'm trying to recreate this behaviour in Haskell. I've got two implementations below:
one which tries to take advantage of laziness by using takeWhile to only inspect at most one value (per i) which fails the test and returns Nothing;
one which remembers the number of js which worked for the previous i and so, for i+1, it doesn't bother doing any safety checks until it exceeds this number.
This latter function is more than twice as fast by my benchmarks but it really is a mess - I'm trying to convince people that Haskell is more concise and safe while still reasonably performant and here is some fast code which is dense, cluttered and does a bunch of unsafe operations.
Is there a solution, perhaps using Cont, Error or Exception, that can achieve my desired behaviour?
n.b. I've tried using Traversable.mapAccumL and Vector.unfoldrN instead of State and they end up being about the same speed and clarity. It's still a very overcomplicated way of solving this problem.
import Criterion.Config
import Criterion.Main
import Control.DeepSeq
import Control.Monad.State
import Data.Maybe
import qualified Data.Traversable as T
import qualified Data.Vector as V
-- force the input vector before benchmarking
main = deepseq js $ defaultMainWith (defaultConfig{cfgSamples = ljust 10}) (return ()) [
    bcompare [
        bench "whileJust" $ nf whileJust js,
        bench "memoised" $ nf memoisedSection js
    ]]
iMax = 5000
jMax = 10000
-- any sorted vector
js :: V.Vector Int
js = V.enumFromN 0 jMax
maybeProduceValue :: Int -> Int -> Maybe Float
maybeProduceValue i j | j < i = Just (fromIntegral (i+j))
                      | otherwise = Nothing
unsafeProduceValue :: Int -> Int -> Float
-- unsafeProduceValue i j | j >= i = error "you fool!"
unsafeProduceValue i j = fromIntegral (i+j)
whileJust, memoisedSection
    :: V.Vector Int -> V.Vector Float
-- mean: 389ms
-- short circuits properly
whileJust inputs' = V.generate iMax $ \i ->
    safeMax . V.map fromJust . V.takeWhile isJust $ V.map (maybeProduceValue i) inputs'
  where safeMax v = if V.null v then 0 else V.maximum v
-- mean: 116ms
-- remembers the (monotonically increasing) length of the section of
-- the vector that is safe. I have tested that this doesn't violate the condition that j < i
memoisedSection inputs' = flip evalState 0 $ V.generateM iMax $ \i -> do
    validSection <- state $ \oldIx ->
        let newIx = oldIx + V.length (V.takeWhile (< i) (V.unsafeDrop oldIx inputs'))
        in (V.unsafeTake newIx inputs', newIx)
    return $ V.foldl' max 0 $ V.map (unsafeProduceValue i) validSection
Here's a simple way of solving the problem with Applicatives, provided that you don't need to keep the rest of the list once you run into an issue:
import Control.Applicative
memoizeSections :: [(Int, Int)] -> Maybe [Float]
memoizeSections [] = Just []
memoizeSections ((x, y):xs) = (:) <$> maybeProduceValue x y <*> memoizeSections xs
This is equivalent to:
import Data.Traversable
memoizeSections :: [(Int, Int)] -> Maybe [Float]
memoizeSections = traverse (uncurry maybeProduceValue)
and will return Nothing on the first occurrence of failure. Note that I don't know how fast this is, but it's certainly concise, and arguably pretty clear (particularly the first example).
Some minor comments:
-- any sorted vector
js :: V.Vector Int
js = V.enumFromN 0 jMax
If you have a vector of Ints (or Floats, etc), you want to use Data.Vector.Unboxed.
maybeProduceValue :: Int -> Int -> Maybe Float
maybeProduceValue i j | j < i = Just (fromIntegral (i+j))
                      | otherwise = Nothing
Since Just is lazy in its only field, this will create a thunk for the computation fromIntegral (i+j). You almost always want to apply Just like so
maybeProduceValue i j | j < i = Just $! fromIntegral (i+j)
There are some more thunks in:
memoisedSection inputs' = flip evalState 0 $ V.generateM iMax $ \i -> do
    validSection <- state $ \oldIx ->
        let newIx = oldIx + V.length (V.takeWhile (< i) (V.unsafeDrop oldIx inputs'))
        in (V.unsafeTake newIx inputs', newIx)
    return $ V.foldl' max 0 $ V.map (unsafeProduceValue i) validSection
Namely you want to:
        let !newIx = oldIx + V.length (V.takeWhile (< i) (V.unsafeDrop oldIx inputs'))
            !v = V.unsafeTake newIx inputs'
        in (v, newIx)
as the pair is lazy in its fields and
return $! V.foldl' max 0 $ V.map (unsafeProduceValue i) validSection
because return in the state monad is lazy in the value.
You can use a guard in a single list comprehension:
[f i j | j <- js, i <- is, j < i]
If you're trying to get the same results as
[foo i j | i <- is, j <- js, j < i]
when you know that js is increasing, just write
[foo i j | i <- is, j <- takeWhile (< i) js]
There's no need to mess around with Maybe for this. Note that making the input list global has a likely-unfortunate effect: instead of fusing the production of the input list with its transformation(s) and ultimate consumption, it's forced to actually construct the list and then keep it in memory. It's quite possible that it will take longer to pull the list into cache from memory than to generate it piece by piece on the fly!
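As a sketch of how the takeWhile idea applies to the per-row maximum asked about in the question, assuming plain lists rather than vectors: the function f below is only a stand-in for the question's value-producing function, and rowMaxima and maxOr0 are names invented here. For each i, takeWhile inspects at most one j that fails the test and never looks at the rest of js.

```haskell
-- Stand-in for the question's function; assumed valid only for j < i.
f :: Int -> Int -> Float
f i j = fromIntegral (i + j)

-- For every i, the maximum of f i j over the prefix of js with j < i.
-- js must be sorted ascending; rows with no valid j yield 0.
rowMaxima :: [Int] -> [Int] -> [Float]
rowMaxima is js = [ maxOr0 [ f i j | j <- takeWhile (< i) js ] | i <- is ]
  where
    maxOr0 [] = 0
    maxOr0 xs = maximum xs
```

Because only the prefix is consumed, this even works on an infinite sorted list: rowMaxima [3] [0..] terminates and returns [5.0].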

Memoization not function correctly

I have the following code:
pB :: [(Integer, Integer, Integer)] -> Integer -> Integer -> [(Integer, Integer, Integer)]
pB lst x y
    | screenList lst x y /= -1 = lst
    | abs x > y = lst++[(x, y, 0)]
    | y == 1 = lst++[(x, y, 1)]
    | otherwise = lst++newEls
    where
        newEls = (pB lst x (y-1))++(pB lst (x-1) (y-1))++(pB lst (x+1) (y-1))
getFirst :: (Integer, Integer, Integer) -> Integer
getFirst (x, _, _) = x
getSecond :: (Integer, Integer, Integer) -> Integer
getSecond (_, y, _) = y
getThird :: (Integer, Integer, Integer) -> Integer
getThird (_, _, z) = z
screenList :: [(Integer, Integer, Integer)] -> Integer -> Integer -> Integer
screenList [] _ _ = -1
screenList lst x y
    | getFirst leader == x && getSecond leader == y = getThird leader
    | otherwise = screenList (tail lst) x y
    where
        leader = head lst
Running an inefficient solution (i.e. one which doesn't keep track of values that have already been computed) returned the value 51 for input x = 0, y = 5. So, running the code above with input [] 0 5, I should be able to find (0,5,51) in the output, which unfortunately I don't.
I have been looking at it for a few hours, but can't seem to understand where I'm going wrong.
Does anybody have any suggestions?
EDIT: Inefficient version:
nPB :: Integer -> Integer -> Integer
nPB x y
    | abs x > y = 0
    | y == 1 = 1
    | otherwise = (nPB x (y-1)) + (nPB (x-1) (y-1)) + (nPB (x+1) (y-1))
Administrivia
It is rather hard to tell what you are asking, but I gather that you have a function that is terribly slow and you have tried to manually memoize this function. I don't think anyone is trying to understand your attempt, so if this question is primarily about manually memoizing a function and/or fixing your code then please submit another question that more clearly outlines its design.
In the remainder of this answer I will show you how to use monad-memo and memo-trie to memoize the function you've named nPB.
Memoizing nPB with monad-memo
The nPB function is a prime target for memoization. This is readily apparent by glancing at its three recursive calls. The small benchmark below takes 1 second to run; let's see if we can do better.
nPB :: Integer -> Integer -> Integer
nPB x y
    | abs x > y = 0
    | y == 1 = 1
    | otherwise = (nPB x (y-1)) + (nPB (x-1) (y-1)) + (nPB (x+1) (y-1))
main = print (nPB 10 20)
In a previous answer I used the monad-memo package. Using monad-memo involves making your function monadic, which is syntactically more invasive than the other packages I know of, but I've always had good performance with it.
To use the package you simply:
Make sure to call one of the memo functions with the target function as the first parameter.
Be sure to return your final result.
Adjust your type signatures to include a constraint of MonadMemo and adjust the result to be some monad m.
Run the function with startEvalMemo.
The code is:
{-# LANGUAGE FlexibleContexts #-}
import Control.Monad.Memo
nPB :: (MonadMemo (Integer,Integer) Integer m) => Integer -> Integer -> m Integer
nPB x y
    | abs x > y = return 0
    | y == 1 = return 1
    | otherwise = do
        t1 <- for2 memo nPB x (y-1)
        t2 <- for2 memo nPB (x-1) (y-1)
        t3 <- for2 memo nPB (x+1) (y-1)
        return (t1+t2+t3)
main = print (startEvalMemo $ nPB 10 20)
Memoizing nPB with MemoTrie
The most common Haskell memoization package in use is MemoTrie. It is also syntactically cleaner, as it does not require any sort of monad, but it currently suffers from a slight performance issue when using Integer, as we shall soon see (a bug has been reported; use of Int and other types seems fine).
There is much less to do to use MemoTrie: just replace your recursive calls with memoN, where N is the number of arguments:
import Data.MemoTrie
nPB :: Integer -> Integer -> Integer
nPB x y
    | abs x > y = 0
    | y == 1 = 1
    | otherwise = (memo2 nPB x (y-1)) + (memo2 nPB (x-1) (y-1)) + (memo2 nPB (x+1) (y-1))
main = print (nPB 10 20)
Performance
Using a type of Integer the performance is:
$ ghc original.hs -O2 && time ./original
8533660
real 0m1.047s
$ ghc monad-memo.hs -O2 && time ./monad-memo
8533660
real 0m0.002s
$ ghc memotrie.hs -O2 && time ./memotrie
8533660
real 0m0.331s
And using Int:
$ ghc original.hs -O2 && time ./original
8533660
real 0m0.190s
$ ghc monad-memo.hs -O2 && time ./monad-memo
8533660
real 0m0.002s
$ ghc memotrie.hs -O2 && time ./memotrie
8533660
real 0m0.002s
I guess this question is about memoization. I'm not sure how you are trying to implement this, but there are two "standard" ways of memoizing functions: use one of the libraries, or explicitly memoize the data yourself.
import Data.Function.Memoize (memoize)
import Data.MemoTrie (memo2)
import Data.Map (fromList, (!))
import System.Environment
test0 :: Integer -> Integer -> Integer
test0 x y
    | abs x > y = 0
    | y == 1 = 1
    | otherwise = (test0 x (y-1)) + (test0 (x-1) (y-1)) + (test0 (x+1) (y-1))
test1 :: Integer -> Integer -> Integer
test1 = memoize test0
test2 :: Integer -> Integer -> Integer
test2 = memo2 test0
But it doesn't look like the memo libraries I tried are able to handle this, or else I did something wrong, as I've never really used these libraries. (The test code is at the bottom; these results are for x,y = 0,18.)
test0 : Total time 9.06s
test1 : Total time 9.08s
test2 : Total time 32.78s
So let's try manual memoization. The principle is simple: construct your domain in such a way that later elements only require the value of earlier elements. This is very simple here since your function always recurses on y-1, so you just need to build the domain moving up the rows. Then write a function which looks up earlier values in a table (here I use Data.Map.Map), and map over the domain:
test3 :: Integer -> Integer -> Integer
test3 x' y' = m ! (x', y')
  where
    xs = concat [ map (flip (,) y) [-x + x' .. x + x'] | (x, y) <- zip [y', y' - 1 .. 1] [1..]]
    m = fromList [ ((x,y), go x y) | (x,y) <- xs]
    go x y
      | abs x > y = 0
      | y == 1 = 1
      | otherwise = m ! (x, y-1) + m ! (x-1, y-1) + m ! (x+1, y-1)
For simplicity I actually construct a domain that is much larger than needed, but the performance penalty is small since the extra domain is all 0 anyway. Taking a look at the performance, it is almost instant (Total time 0.02s). Even with x,y=0,1000 it still only takes 7 seconds, although with large inputs you end up wasting a lot of time on GC.
-- usage: ghc --make -O2 -rtsopts Main.hs && Main n x y +RTS -sstderr
main = do
    [n, x, y] <- getArgs
    print $ (d !! (read n)) x y
  where d = [test0, test1, test2, test3]
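The same "build the rows bottom-up" principle can also be sketched without a map, by keeping only the current row as a list. This is a different technique from the memoization libraries above (plain tabulation), and the names nextRow and nPBTable are made up for this example. Row y holds the values for x = -y .. y, and each new row is the sum of three shifted copies of the previous one.

```haskell
-- Row y+1 from row y: every entry is the sum of the three entries
-- above it (x-1, x, x+1), with zeros outside the previous row.
nextRow :: [Integer] -> [Integer]
nextRow r = zipWith3 (\a b c -> a + b + c) p (drop 1 p) (drop 2 p)
  where p = [0, 0] ++ r ++ [0, 0]

-- Tabulated version of the recursive function, assuming the same
-- base cases: 0 when abs x > y, and row 1 equal to [1, 1, 1].
nPBTable :: Int -> Int -> Integer
nPBTable x y
  | y < 1 || abs x > y = 0
  | otherwise = (iterate nextRow [1, 1, 1] !! (y - 1)) !! (x + y)
```

For the input from the question, nPBTable 0 5 gives 51, matching the value reported for the inefficient version.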
Here is the version written with memoFix2. Better performance than any other versions.
test4 :: Integer -> Integer -> Integer
test4 = memoFix2 go where
  go r x y
    | abs x > y = 0
    | y == 1 = 1
    | otherwise = (r x (y-1)) + (r (x-1) (y-1)) + (r (x+1) (y-1))

Euler #4 with bigger domain

Consider the modified Euler problem #4 -- "Find the maximum palindromic number which is a product of two numbers between 100 and 9999."
rev :: Int -> Int
rev x = rev' x 0
rev' :: Int -> Int -> Int
rev' n r
    | n == 0 = r
    | otherwise = rev' (n `div` 10) (r * 10 + n `mod` 10)
pali :: Int -> Bool
pali x = x == rev x
main :: IO ()
main = print . maximum $ [ x*y | x <- nums, y <- nums, pali (x*y)]
    where
        nums = [9999,9998..100]
This Haskell solution, compiled with -O2 on ghc 7.4.1, takes about 18 seconds.
The similar C solution takes 0.1 seconds.
So Haskell is 180 times slower. What's wrong with my solution? I assumed that Haskell handles this type of problem pretty well.
Appendix - analogue C solution:
#include <stdio.h>
#define A 100
#define B 9999
int ispali(int n)
{
    int n0=n, k=0;
    while (n>0) {
        k = 10*k + n%10;
        n /= 10;
    }
    return n0 == k;
}
int main(void)
{
    int max = 0;
    for (int i=B; i>=A; i--)
        for (int j=B; j>=A; j--) {
            if (i*j > max && ispali(i*j))
                max = i*j; }
    printf("%d\n", max);
}
Quoting the question: "The similar C solution"
That is a common misconception.
Lists are not loops!
And using lists to emulate loops has performance implications unless the compiler is able to eliminate the list from the code.
If you want to compare apples to apples, write the Haskell structure more or less equivalent to a loop, a tail recursive worker (with strict accumulator, though often the compiler is smart enough to figure out the strictness by itself).
Now let's take a more detailed look. For comparison, the C, compiled with gcc -O3, takes ~0.08 seconds here, the original Haskell, compiled with ghc -O2 takes ~20.3 seconds, with ghc -O2 -fllvm ~19.9 seconds. Pretty terrible.
One mistake in the original code is to use div and mod. The C code uses the equivalent of quot and rem, which map to the machine division instructions and are faster than div and mod. For positive arguments, the semantics are the same, so whenever you know that the arguments are always non-negative, never use div and mod.
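The semantic difference mentioned above can be checked with a small sketch. The helper name divModVsQuotRem is invented for this illustration: divMod rounds the quotient towards negative infinity, while quotRem truncates towards zero like C's / and %.

```haskell
-- Returns (divMod result, quotRem result) for the same operands,
-- so the two pairs can be compared side by side.
divModVsQuotRem :: Int -> Int -> ((Int, Int), (Int, Int))
divModVsQuotRem n d = (n `divMod` d, n `quotRem` d)
```

For non-negative operands both pairs agree: divModVsQuotRem 7 2 is ((3,1),(3,1)). With a negative dividend they differ: divModVsQuotRem (-7) 2 is ((-4,1),(-3,-1)). The extra sign handling behind that difference is work that div and mod have to do and quot and rem do not.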
Changing that, the running time becomes ~15.4 seconds when compiling with the native code generator, and ~2.9 seconds when compiling with the LLVM backend.
The difference is due to the fact that even the machine division operations are quite slow, and LLVM replaces the division/remainder with a multiply-and-shift operation. Doing the same by hand for the native backend (actually, a slightly better replacement taking advantage of the fact that I know the arguments will always be non-negative) brings its time down to ~2.2 seconds.
We're getting closer, but are still a far cry from the C.
That is due to the lists. The code still builds a list of palindromes (and traverses a list of Ints for the two factors).
Since lists cannot contain unboxed elements, that means there is a lot of boxing and unboxing going on in the code, that takes time.
So let us eliminate the lists, and take a look at the result of translating the C to Haskell:
module Main (main) where
a :: Int
a = 100
b :: Int
b = 9999
ispali :: Int -> Bool
ispali n = go n 0
  where
    go 0 acc = acc == n
    go m acc = go (m `quot` 10) (acc * 10 + (m `rem` 10))
maxpal :: Int
maxpal = go 0 b
  where
    go mx i
        | i < a = mx
        | otherwise = go (inner mx b) (i-1)
      where
        inner m j
            | j < a = m
            | p > m && ispali p = inner p (j-1)
            | otherwise = inner m (j-1)
          where
            p = i*j
main :: IO ()
main = print maxpal
The nested loop is translated to two nested worker functions, we use an accumulator to store the largest palindrome found so far. Compiled with ghc -O2, that runs in ~0.18 seconds, with ghc -O2 -fllvm it runs in ~0.14 seconds (yes, LLVM is better at optimising loops than the native code generator).
Still not quite there, but a factor of about 2 isn't too bad.
Some may find the following version, where the loop is abstracted out, more readable. The generated core is for all intents and purposes identical (modulo a switch of argument order), and the performance is of course the same:
module Main (main) where
a :: Int
a = 100
b :: Int
b = 9999
ispali :: Int -> Bool
ispali n = go n 0
  where
    go 0 acc = acc == n
    go m acc = go (m `quot` 10) (acc * 10 + (m `rem` 10))
downto :: Int -> Int -> a -> (a -> Int -> a) -> a
downto high low acc fun = go high acc
  where
    go i acc
        | i < low = acc
        | otherwise = go (i-1) (fun acc i)
maxpal :: Int
maxpal = downto b a 0 $ \m i ->
            downto b a m $ \mx j ->
                let p = i*j
                in if mx < p && ispali p then p else mx
main :: IO ()
main = print maxpal
@axblount is at least partly right; the following modification makes the program run almost three times as fast as the original:
maxPalindrome = foldl f 0
  where f a x | x > a && pali x = x
              | otherwise = a
main :: IO ()
main = print . maxPalindrome $ [x * y | x <- nums, y <- nums]
  where nums = [9999,9998..100]
That still leaves a factor 60 slowdown, though.
This is more true to what the C code is doing:
maxpali :: [Int] -> Int
maxpali xs = go xs 0
  where
    go [] m = m
    go (x:xs) m = if x > m && pali(x) then go xs x else go xs m
main :: IO()
main = print . maxpali $ [ x*y | x <- nums, y <- nums ]
  where nums = [9999,9998..100]
On my box this takes 2 seconds vs .5 for the C version.
Haskell may be storing the entire list [ x*y | x <- nums, y <- nums, pali (x*y)], whereas the C solution calculates the maximum on the fly. I'm not sure about this.
Also, the C solution only calls ispali if the product beats the previous maximum. I would bet that Haskell runs the palindrome check on all products, regardless of whether x*y is a possible maximum.
It seems to me that you are having a branch prediction problem. In the C code, you have two nested loops, and as soon as a palindrome is seen in the inner loop, the rest of the inner loop will be skipped very fast.
Given the way you feed in this list of products instead of using nested loops, I am not sure that ghc is doing any of this prediction.
Another way to write this is to use two folds, instead of one fold over the flattened list:
-- foldl g0 0 [x*y | x<-[b-1,b-2..a], y<-[b-1,b-2..a], pali(x*y)] (A)
-- foldl g1 0 [x*y | x<-[b-1,b-2..a], y<-[b-1,b-2..a]] (B)
-- foldl g2 0 [ [x*y | y<-[b-1,b-2..a]] | x<-[b-1,b-2..a]] (C)
maxpal b a = foldl f1 0 [b-1,b-2..a] -- (D)
  where
    f1 m x = foldl f2 m [b-1,b-2..a]
      where
        f2 m y | p>m && pali p = p
               | otherwise = m
          where p = x*y
main = print $ maxpal 10000 100
This also seems to run much faster than (B) (as in larsmans's answer): it is only 3x to 4x slower than the following loop-based code. Fusing the definitions of foldl and enumFromThenTo gets us the "functional loops" code (as in DanielFischer's answer):
maxpal_loops b a = f (b-1) 0 -- (E)
  where
    f x m | x < a = m
          | otherwise = g (b-1) m
      where
        g y m | y < a = f (x-1) m
              | p>m && pali p = g (y-1) p
              | otherwise = g (y-1) m
          where p = x*y
The (C) variant is very suggestive of further algorithmic improvements (that's outside the scope of the original Q of course) that exploit the hidden order in the lists, destroyed by the flattening:
{- foldl g2 0 [ [x*y | y<-[b-1,b-2..a]] | x<-[b-1,b-2..a]] (C)
foldl g2 0 [ [x*y | y<-[x, x-1..a]] | x<-[b-1,b-2..a]] (C1)
foldl g0 0 [ safehead 0 . filter pali $
[x*y | y<-[x, x-1..a]] | x<-[b-1,b-2..a]] (C2)
fst $ until ... (\(m,s)-> (max m .
safehead 0 . filter pali . takeWhile (> m) $
head s, tail s))
(0,[ [x*y | y<-[x, x-1..a]] | x<-[b-1,b-2..a]]) (C3)
safehead 0 $ filter pali $ mergeAllDescending
[ [x*y | y<-[x, x-1..a]] | x<-[b-1,b-2..a]] (C4)
-}
(C3) can stop as soon as the head x*y in a sub-list is smaller than the currently found maximum. It is what short-cutting functional loops code could achieve, but not (C4), which is guaranteed to find the maximal palindromic number first. Plus, for list-based code its algorithmic nature is more visually apparent, IMO.
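The early-exit idea can also be sketched in the "functional loops" style, as an illustration rather than as a translation of any of the list variants above. The names maxPalPruned and isPal are invented for this example, and the sketch assumes 1 <= lo <= hi with products that fit in an Int. Restricting y to y <= x makes the products in each row descend, so a row can stop at its first palindrome or as soon as x*y drops to the current maximum, and the outer loop can stop once x*x can no longer beat it.

```haskell
-- Palindrome test on decimal digits, using quot/rem (n >= 0 assumed).
isPal :: Int -> Bool
isPal n = go n 0
  where
    go 0 acc = acc == n
    go m acc = go (m `quot` 10) (acc * 10 + m `rem` 10)

-- Largest palindromic product x*y with lo <= y <= x <= hi.
maxPalPruned :: Int -> Int -> Int
maxPalPruned lo hi = outer hi 0
  where
    outer x best
      | x < lo || x * x <= best = best   -- no later row can beat best
      | otherwise = outer (x - 1) (inner x best)
      where
        inner y m
          | y < lo || x * y <= m = m     -- rest of the row is too small
          | isPal (x * y) = x * y        -- largest palindrome in this row
          | otherwise = inner (y - 1) m
```

For the original two- and three-digit versions of the problem, maxPalPruned 10 99 gives 9009 and maxPalPruned 100 999 gives 906609.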

Resources