So, I'm experimenting with parallelism in Haskell. I took the classic example of implementing a Fibonacci function both sequentially and in parallel. Here is my Main.hs file:
module Main where

import Control.Parallel

main = print (fib 47)

fib :: Int -> Int
fib n
  | n <= 1    = n
  | otherwise = fib (n-1) + fib (n-2)
I compile with ghc -O2 --make Main.hs -threaded -rtsopts
and execute with time ./Main +RTS -N4 which gives me:
2971215073
63.23user 13.03system 0:20.30elapsed 375%CPU (0avgtext+0avgdata 3824maxresident)k
0inputs+0outputs (0major+276minor)pagefaults 0swaps
So with normal Fibonacci it takes about 20 seconds.
Now if I change my fib method to
pfib :: Int -> Int
pfib n
  | n <= 1    = n
  | otherwise = n1 `par` (n2 `par` n1 + n2)
  where
    n1 = pfib (n - 1)
    n2 = pfib (n - 2)
Compiling and running as above, the program takes way longer, and time finishes with this output:
2971215073
179.50user 9.04system 0:53.08elapsed 355%CPU (0avgtext+0avgdata 6980maxresident)k
0inputs+0outputs (0major+1066minor)pagefaults 0swaps
Further modifying my pfib to use pseq instead of the second par, time gives:
2971215073
113.34user 3.42system 0:30.91elapsed 377%CPU (0avgtext+0avgdata 7312maxresident)k
0inputs+0outputs (0major+1119minor)pagefaults 0swaps
Is there an issue with my code? Why is there such a counterintuitive time difference between the various implementations?
From the documentation for par:
Also it is a good idea to ensure that a is not a trivial computation, otherwise the cost of spawning it in parallel overshadows the benefits obtained by running it in parallel.
An addition and a couple of subtractions are a trivial computation. If you only evaluate the first few levels of the recursion in parallel, you'll see a benefit:
module Main where

import Control.Parallel

main = print (pfib 16 47)

fib :: Int -> Int
fib n
  | n <= 1    = n
  | otherwise = fib (n-1) + fib (n-2)

pfib :: Int -> Int -> Int
pfib 1 n = fib n
pfib p n
  | n <= 1    = n
  | otherwise = n1 `par` (n2 `par` n1 + n2)
  where
    n1 = pfib (p - 1) (n - 1)
    n2 = pfib (p - 1) (n - 2)
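As a side note, the conventional idiom is to pair par with pseq, so the current thread is guaranteed to evaluate the other branch itself while the spark runs. Here is a sketch of my own with the same depth cutoff (pfib' is a hypothetical name, not from the answer):

-- Spark n1, evaluate n2 locally, then combine; relies on the imports above.
pfib' :: Int -> Int -> Int
pfib' 1 n = fib n
pfib' p n
  | n <= 1    = n
  | otherwise = n1 `par` (n2 `pseq` (n1 + n2))
  where
    n1 = pfib' (p - 1) (n - 1)
    n2 = pfib' (p - 1) (n - 2)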
Related
I am pretty new to Haskell threads (and parallel programming in general) and I am not sure why my parallel version of an algorithm runs slower than the corresponding sequential version.
The algorithm tries to find all k-combinations without using recursion. For this, I am using this helper function, which, given a number with k bits set, returns the next number with the same number of bits set:
import Data.Bits

nextKBitNumber :: Integer -> Integer
nextKBitNumber n
  | n == 0    = 0
  | otherwise = ripple .|. ones
  where
    smallest    = n .&. (-n)
    ripple      = n + smallest
    newSmallest = ripple .&. (-ripple)
    ones        = (newSmallest `div` smallest) `shiftR` 1 - 1
It is now easy to obtain all k-combinations sequentially in the range [2^k - 1, 2^(n-k) + ... + 2^(n-1)]:
import qualified Data.Stream as ST
import System.Environment (getArgs)

combs :: Int -> Int -> [Integer]
combs n k = ST.takeWhile (<= end) $ kBitNumbers start
  where
    start = 2^k - 1
    end   = sum $ fmap (2^) [n-k .. n-1]

kBitNumbers :: Integer -> ST.Stream Integer
kBitNumbers = ST.iterate nextKBitNumber

main :: IO ()
main = do
    params <- getArgs
    let n = read $ params !! 0
        k = read $ params !! 1
    print $ length (combs n k)
My idea is that this should be easily parallelizable by splitting this range into smaller parts. For example:
start :: Int -> Integer
start k = 2 ^ k - 1

end :: Int -> Int -> Integer
end n k = sum $ fmap (2 ^) [n-k .. n-1]

splits :: Int -> Int -> Int -> [(Integer, Integer, Int)]
splits n k numSplits = fixedRanges ranges []
  where
    s = start k
    e = end n k
    step = (e-s) `div` (min (e-s) (toInteger numSplits))
    initSplits = [s, s+step .. e]
    ranges = zip initSplits (tail initSplits)
    fixedRanges []     acc = acc
    fixedRanges [x]    acc = acc ++ [(fst x, e, k)]
    fixedRanges (x:xs) acc = fixedRanges xs (acc ++ [(fst x, snd x, k)])
At this point, I would like to run each split in parallel, something like:
runSplit :: (Integer, Integer, Int) -> [Integer]
runSplit (start, end, k) = ST.takeWhile (<= end) $ kBitNumbers (fixStart start)
  where
    fixStart s
      | popCount s == k = s
      | otherwise       = fixStart $ s + 1
For parallelization I am using the monad-par package:
import Control.Monad.Par
import System.Environment
import qualified Data.Set as S

main :: IO ()
main = do
    params <- getArgs
    let n = read $ params !! 0
        k = read $ params !! 1
        numTasks = read $ params !! 2
        batches = runPar $ parMap runSplit (splits n k numTasks)
        reducedNumbers = foldl S.union S.empty $ fmap S.fromList batches
    print $ S.size reducedNumbers
The result is that the sequential version is way faster and uses little memory, while the parallel version consumes a lot of memory and is noticeably slower.
What might be the reasons for this? Are threads a good approach for this problem? For example, every thread generates a (potentially large) list of integers and the main thread reduces the results; are threads expected to need much memory, or are they simply meant to produce simple results (i.e. only CPU-intensive computations)?
I compile my program with stack build --ghc-options -threaded --ghc-options -rtsopts --executable-profiling --library-profiling and run it with ./.stack-work/install/x86_64-osx/lts-6.1/7.10.3/bin/combinatorics 20 3 4 +RTS -pa -N4 -RTS for n=20, k=3 and numSplits=4. An example of the profiling report for the parallel version can be found here and for the sequential version here.
In your sequential version calling combs does not build up a list in memory since after length consumes an element it isn't needed anymore and is freed. Indeed, GHC may not even allocate storage for it.
For instance, this will take a while but won't consume a lot of memory:
main = print $ length [1..1000000000] -- 1 billion
In your parallel version you are generating sub-lists, concatenating them together, building Sets, etc., and therefore the results of each sub-task have to be kept in memory.
A fairer comparison would be to have each parallel task compute the length of the k-bit numbers in its assigned range, and then add up the results. That way the k-bit numbers found by each parallel task wouldn't have to be kept in memory, and the program would behave more like the sequential version.
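For concreteness, here is a minimal sketch of that idea (my own glue code, reusing the splits and runSplit definitions from the question together with monad-par; countCombs is a hypothetical name):

import Control.Monad.Par (runPar, parMap)

-- Each task returns only an Int, so the k-bit numbers it enumerates can be
-- garbage-collected as soon as they are counted. This assumes the split
-- ranges do not overlap at their boundaries; unlike the Set-union version,
-- a plain sum would double-count a boundary value claimed by two splits.
countCombs :: Int -> Int -> Int -> Int
countCombs n k numTasks =
    sum $ runPar $ parMap (length . runSplit) (splits n k numTasks)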
Update
Here is an example of how to use parMap. Note: under 7.10.2 I had mixed success getting the parallelism to fire; sometimes it did and sometimes it didn't. (Figured it out: I was using -RTS -N2 instead of +RTS -N2.)
{-
   compile with: ghc -O2 -threaded -rtsopts foo.hs

   compare:
     time ./foo 26 +RTS -N1
     time ./foo 26 +RTS -N2
-}

import Data.Bits
import Control.Parallel.Strategies
import System.Environment

nextKBitNumber :: Integer -> Integer
nextKBitNumber n
  | n == 0    = 0
  | otherwise = ripple .|. ones
  where
    smallest    = n .&. (-n)
    ripple      = n + smallest
    newSmallest = ripple .&. (-ripple)
    ones        = (newSmallest `div` smallest) `shiftR` 1 - 1

combs :: Int -> Int -> [Integer]
combs n k = takeWhile (<= end) $ iterate nextKBitNumber start
  where
    start = 2^k - 1
    end   = shift start (n-k)

main :: IO ()
main = do
    (arg1 : _) <- getArgs
    let n = read arg1
    print $ parMap rseq (length . combs n) [1..n]
"good approaches for this problem"
What do you mean by this problem? If it's how to write, analyze and tune a parallel Haskell program, then this is required background reading:
Simon Marlow: Parallel and Concurrent Programming in Haskell
http://community.haskell.org/~simonmar/pcph/
in particular, Section 15 (Debugging, Tuning, ..)
Use threadscope! (a graphical viewer for thread profile information generated by the Glasgow Haskell compiler) https://hackage.haskell.org/package/threadscope
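In practice that means compiling with the eventlog enabled and asking the RTS to write one; the file name below is hypothetical, the flags are standard GHC/RTS options:

ghc -O2 -threaded -rtsopts -eventlog combinatorics.hs
./combinatorics 20 3 4 +RTS -N4 -l     # writes combinatorics.eventlog
threadscope combinatorics.eventlog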
I'm working on Project Euler Problem 14. Here's my solution.
import Data.List

collatzLength :: Int -> Int
collatzLength 1 = 1
collatzLength n | odd n  = 1 + collatzLength (3 * n + 1)
                | even n = 1 + collatzLength (n `quot` 2)

maxTuple :: (Int, Int) -> (Int, Int) -> Ordering
maxTuple (x1, x2) (y1, y2) | x1 > y1   = GT
                           | x1 < y1   = LT
                           | otherwise = EQ
I'm running the following from GHCi:
maximumBy maxTuple [(collatzLength x, x) | x <- [1..1000000]]
I know that if Haskell evaluated strictly, the running time would be something like O(n^3). Since Haskell evaluates lazily, though, it seems like it should be some constant multiple of n. This has been running for nearly an hour now, which seems very unreasonable. Does anyone have any idea why?
You're assuming that the collatzLength function will be memoized. Haskell does not do automatic memoization. You'll need to do that yourself. Here's an example using the data-memocombinators package.
import Data.List
import Data.Ord
import qualified Data.MemoCombinators as Memo

collatzLength :: Integer -> Integer
collatzLength = Memo.arrayRange (1, 1000000) collatzLength'
  where
    collatzLength' 1 = 1
    collatzLength' n | odd n  = 1 + collatzLength (3 * n + 1)
                     | even n = 1 + collatzLength (n `quot` 2)

main = print $ foldl1' max $ [(collatzLength n, n) | n <- [1..1000000]]
This runs in about 1 second when compiled with -O2.
To find the maximum of a list, the whole list needs to be evaluated.
So it will calculate collatzLength for every number from 1 to 1000000, and collatzLength is recursive. The worst part is that your definition of collatzLength is not even tail-recursive.
Below, cL is short for collatzLength, and cL!!n stands for collatzLength n:
cL :: [Int]
cL = 1 : 1 : [ 1 + (if odd n then cL!!(3*n+1) else cL!!(n `div` 2)) | n <- [2..]]
Simple test:
ghci> cL !! 13
10
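For comparison with the cL list above (whose !! lookups are still O(n)), here is a sketch of my own, not from either answer, that memoizes into a lazily built immutable array for O(1) lookups and falls back to plain recursion for intermediate values above the table bound:

import Data.Array
import Data.List (foldl1')

limit :: Int
limit = 1000000

-- lengths ! n is the Collatz chain length of n, for 1 <= n <= limit.
-- The array elements are lazy, so each entry is computed at most once.
-- (Int is fine on 64-bit GHC; use Integer if intermediates could overflow.)
lengths :: Array Int Int
lengths = listArray (1, limit) (map go [1 .. limit])
  where
    go 1 = 1
    go n = 1 + step (if odd n then 3 * n + 1 else n `quot` 2)
    -- intermediate values can exceed the table bound; recurse directly there
    step m
      | m <= limit = lengths ! m
      | otherwise  = go m

main :: IO ()
main = print $ foldl1' max [(lengths ! n, n) | n <- [1 .. limit]]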
I'm trying to find a good way to memoize a function for only part of its domain (non-negative integers) in Haskell, using Data.MemoCombinators.
import Data.MemoCombinators

-- approach 1
partFib n | n < 0     = undefined
          | otherwise = integral fib n
  where
    fib 0 = 1
    fib 1 = 1
    fib k = partFib (k-1) + partFib (k-2)

-- approach 2
partFib2 n | n < 0     = undefined
           | otherwise = fib n

fib = integral fib'
  where
    fib' 0 = 1
    fib' 1 = 1
    fib' n = partFib2 (n-1) + partFib2 (n-2)
Approach 1 is how I would like to do it; however, it doesn't seem to work. I assume this is because the fib function is "recreated" every time partFib is called, throwing away the memoization. fib doesn't depend on the input of partFib, so you would think the compiler could hoist it, but apparently GHC doesn't work that way.
Approach 2 is how I end up doing it. Eerk, a lot of ugly wiring.
Does anybody know of a better way to do this?
Not quite sure what's "ugly" to your eye, but you can have proper memoization while using only a single top-level identifier by lifting the memoization operation out of the function of n.
partFib3 = \n -> if n < 0 then undefined else fib' n
  where
    fib 0 = 1
    fib 1 = 1
    fib k = partFib3 (k-1) + partFib3 (k-2)
    fib' = integral fib
Hmm what about separating things a bit:
fib 0 = 0
fib 1 = 1
fib x = doFib (x-1) + doFib (x-2)

memFib = Memo.integral fib

doFib n | n < 0     = fib n
        | otherwise = memFib n
Now you need to use doFib.
There is a combinator in the library for this purpose:
switch :: (a -> Bool) -> Memo a -> Memo a -> Memo a
switch p a b uses the memo table a whenever p gives true and the memo table b whenever p gives false.
Recall that id is technically a memoizer (which does not memoize :-), so you can do:
partFib = Memo.switch (< 0) id Memo.integral fib'
  where
    ...
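Spelled out, that might look like the following; the fib' body is filler of my own matching approach 1 from the question, not part of the original answer:

import qualified Data.MemoCombinators as Memo

-- negatives go through id (no memoization and, here, an error);
-- non-negative arguments go through the integral memo table
partFib :: Integer -> Integer
partFib = Memo.switch (< 0) id Memo.integral fib'
  where
    fib' 0 = 1
    fib' 1 = 1
    fib' n | n < 0     = error "partFib: negative argument"
           | otherwise = partFib (n - 1) + partFib (n - 2)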
Two recent questions about Fibonacci's closed-form expression (here and here) as well as the HaskellWiki's page about the ST monad motivated me to try and compare two ways of calculating Fibonacci numbers.
The first implementation uses the closed-form expression together with rationals as seen in hammar's answer here (where Fib is a datatype abstracting numbers of the form a+b*√5):
fibRational :: Integer -> Integer
fibRational n = divSq5 $ phi^n - (1-phi)^n
  where
    phi = Fib (1/2) (1/2)
    divSq5 (Fib 0 b) = numerator b
The second implementation is from the HaskellWiki's page about the ST monad, with some added strictness that was necessary in order to avoid a stack overflow:
fibST :: Integer -> Integer
fibST n | n < 2 = n
fibST n = runST $ do
    x <- newSTRef 0
    y <- newSTRef 1
    fibST' n x y
  where
    fibST' 0 x _ = readSTRef x
    fibST' !n x y = do
        x' <- readSTRef x
        y' <- readSTRef y
        y' `seq` writeSTRef x y'
        x' `seq` writeSTRef y (x'+y')
        fibST' (n-1) x y
For reference, here's also the full code that I used for testing:
{-# LANGUAGE BangPatterns #-}

import Data.Ratio
import Data.STRef.Strict
import Control.Monad.ST.Strict
import System.Environment

data Fib =
    Fib !Rational !Rational
    deriving (Eq, Show)

instance Num Fib where
    negate (Fib a b) = Fib (-a) (-b)
    (Fib a b) + (Fib c d) = Fib (a+c) (b+d)
    (Fib a b) * (Fib c d) = Fib (a*c+5*b*d) (a*d+b*c)
    fromInteger i = Fib (fromInteger i) 0
    abs = undefined
    signum = undefined

fibRational :: Integer -> Integer
fibRational n = divSq5 $ phi^n - (1-phi)^n
  where
    phi = Fib (1/2) (1/2)
    divSq5 (Fib 0 b) = numerator b

fibST :: Integer -> Integer
fibST n | n < 2 = n
fibST n = runST $ do
    x <- newSTRef 0
    y <- newSTRef 1
    fibST' n x y
  where
    fibST' 0 x _ = readSTRef x
    fibST' !n x y = do
        x' <- readSTRef x
        y' <- readSTRef y
        y' `seq` writeSTRef x y'
        x' `seq` writeSTRef y (x'+y')
        fibST' (n-1) x y

main = do
    (m:n:_) <- getArgs
    let n' = read n
        st = fibST n'
        rt = fibRational n'
    case m of
        "st" -> print st
        "rt" -> print rt
        "cm" -> print (st == rt)
Now it turns out that the ST version is significantly slower than the closed-form version, although I'm not a hundred percent sure why:
# time ./fib rt 1000000 >/dev/null
./fib rt 1000000 > /dev/null 0.23s user 0.00s system 99% cpu 0.235 total
# time ./fib st 1000000 >/dev/null
./fib st 1000000 > /dev/null 11.35s user 0.06s system 99% cpu 11.422 total
So my question is: Can someone help me understand why the first implementation is so much faster? Is it algorithmic complexity, overhead or something else entirely? (I checked that both functions yield the same result). Thanks!
You are comparing very different versions here. To make it fair, here is an implementation that is equivalent to the ST solution you give, but in pure Haskell:
fibIt :: Integer -> Integer
fibIt n | n < 2 = n
fibIt n = go 1 1 (n-2)
  where
    go !_x !y 0 = y
    go !x  !y i = go y (x+y) (i-1)
This one seems to perform exactly as well or as badly as the ST version (both take about 10 s here). The runtime is most likely dominated by all the Integer additions, so the overhead is too small to be measurable.
First, the two implementations use two very different algorithms with different asymptotic complexity (well, depending on what the complexity of the Integer operations is).
Second, the ST implementation uses references. References are (comparatively) slow in GHC, because updating a reference needs a GC write barrier due to the generational garbage collector.
So you're comparing two functions that differ both in algorithm and in implementation technique.
You should rewrite the second one not to use references; that way you can compare just the algorithms. Or rewrite the first one to use references. But why use references when they're the wrong tool? :)
You can also compare the algorithmic complexities: the first does a constant number of exponentiations, which ^ implements with O(log n) Fib multiplications (repeated squaring); the second does O(n) Integer additions.
longest'inc'subseq seq = maximum dp
  where
    dp = 1 : [val n | n <- [1 .. length seq - 1]]
    val n = (1 +) . filter'and'get'max ((<= top) . (seq!!)) $ [0 .. pred n]
      where top = seq!!n
    -----
    filter'and'get'max f [] = 0
    filter'and'get'max f [x] = if f x then dp!!x else 0
    filter'and'get'max f (x:xs) = if f x then (if vx > vxs then vx else vxs) else vxs
      where vx  = dp!!x
            vxs = filter'and'get'max f xs
This takes about 1-2 s when the length of seq is 1000, while in Python the result comes out almost immediately. The Python version:
def longest(s):
    dp = [0]*len(s)
    dp[0] = 1
    for i in range(1, len(s)):
        need = 0
        for j in range(0, i):
            if s[j] <= s[i] and dp[j] > need:
                need = dp[j]
        dp[i] = need + 1
    return max(dp)
And when the length of seq is 10000, the Haskell program runs for a very long time, while Python returns the answer after 10-15 s. Can the Haskell version be made faster?
Your core problem is that you're using the wrong data structure in Haskell for this algorithm. You've translated an algorithm that depends on O(1) lookups on a sequence (as in your Python code), into one that does O(n) lookups on a list in Haskell.
Use like-for-like data structures, and then your complexity problems will take care of themselves. In this case, it means using something like Data.Vector.Unboxed to represent the sequence, which has O(1) indexing, as well as low constant overheads in general.
With nothing more than a really mindless wrapping of your lists into Vectors I get 2.5 seconds when the input list is [1..10000].
import qualified Data.Vector as V
import Data.Vector (Vector, (!))

main = print $ liss [0..10000]

liss :: [Int] -> Int
liss seqL = V.maximum dp
  where
    dp = V.fromList $ 1 : [val n | n <- [1 .. length seqL - 1]]
    seq = V.fromList seqL
    val n = (1 +) . filter'and'get'max ((<= top) . (seq!)) $ [0 .. pred n]
      where top = seq!n
    -----
    filter'and'get'max :: (Int -> Bool) -> [Int] -> Int
    filter'and'get'max f [] = 0
    filter'and'get'max f [x] = if f x then dp!x else 0
    filter'and'get'max f (x:xs) = if f x then (if vx > vxs then vx else vxs) else vxs
      where vx  = dp!x
            vxs = filter'and'get'max f xs
The compilation and execution:
tommd@Mavlo:Test$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.0.3
tommd@Mavlo:Test$ ghc -O2 so.hs
[1 of 1] Compiling Main ( so.hs, so.o )
Linking so ...
tommd@Mavlo:Test$ time ./so
10001
real 0m2.536s
user 0m2.528s
A worker-wrapper transformation on filter'and'get'max seems to shave off another second.
Also, I don't understand why you need that middle case (filter'and'get'max f [x]); shouldn't it work fine without it? I guess it changes the result if dp!x < 0. Note that eliminating it saves another 0.3 seconds.
And the Python code you provided takes ~10.7 seconds (I added a call to longest(range(1,10000))).
tommd@Mavlo:Test$ time python so.py
real 0m10.745s
user 0m10.729s
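Following up on the Data.Vector.Unboxed suggestion above, here is a sketch of my own (untimed) that keeps both the sequence and the dp table unboxed; constructN lets each dp entry read the prefix that has already been built:

import qualified Data.Vector.Unboxed as U

main :: IO ()
main = print $ liss [0 .. 10000]

liss :: [Int] -> Int
liss seqL = U.maximum dp
  where
    sq = U.fromList seqL
    -- dp U.! i = length of the longest increasing subsequence ending at i
    dp = U.constructN (U.length sq) step
    step prefix
      | i == 0    = 1
      | otherwise = 1 + U.foldl' best 0 (U.indexed prefix)
      where
        i   = U.length prefix
        top = sq U.! i
        best acc (j, v)
          | sq U.! j <= top = max acc v
          | otherwise       = acc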