I am pretty new to Haskell threads (and parallel programming in general) and I am not sure why my parallel version of an algorithm runs slower than the corresponding sequential version.
The algorithm tries to find all k-combinations without using recursion. For this, I am using this helper function which, given a number with k bits set, returns the next number with the same number of bits set:
import Data.Bits
nextKBitNumber :: Integer -> Integer
nextKBitNumber n
| n == 0 = 0
| otherwise = ripple .|. ones
where smallest = n .&. (-n)
ripple = n + smallest
newSmallest = ripple .&. (-ripple)
ones = (newSmallest `div` smallest) `shiftR` 1 - 1
It is now easy to obtain all k-combinations sequentially, in the range [2^k - 1, 2^(n-k) + ... + 2^(n-1)]:
import qualified Data.Stream as ST
import System.Environment
combs :: Int -> Int -> [Integer]
combs n k = ST.takeWhile (<= end) $ kBitNumbers start
where start = 2^k - 1
end = sum $ fmap (2^) [n-k..n-1]
kBitNumbers :: Integer -> ST.Stream Integer
kBitNumbers = ST.iterate nextKBitNumber
main :: IO ()
main = do
params <- getArgs
let n = read $ params !! 0
k = read $ params !! 1
print $ length (combs n k)
My idea is that this should be easily parallelizable by splitting the range into smaller parts. For example:
start :: Int -> Integer
start k = 2 ^ k - 1
end :: Int -> Int -> Integer
end n k = sum $ fmap (2 ^) [n-k..n-1]
splits :: Int -> Int -> Int -> [(Integer, Integer, Int)]
splits n k numSplits = fixedRanges ranges []
where s = start k
e = end n k
step = (e-s) `div` (min (e-s) (toInteger numSplits))
initSplits = [s,s+step..e]
ranges = zip initSplits (tail initSplits)
fixedRanges [] acc = acc
fixedRanges [x] acc = acc ++ [(fst x, e, k)]
fixedRanges (x:xs) acc = fixedRanges xs (acc ++ [(fst x, snd x, k)])
At this point, I would like to run each split in parallel, something like:
runSplit :: (Integer, Integer, Int) -> [Integer]
runSplit (start, end, k) = ST.takeWhile (<= end) $ kBitNumbers (fixStart start)
where fixStart s
| popCount s == k = s
| otherwise = fixStart $ s + 1
For parallelization I am using the monad-par package:
import Control.Monad.Par
import System.Environment
import qualified Data.Set as S
main :: IO ()
main = do
params <- getArgs
let n = read $ params !! 0
k = read $ params !! 1
numTasks = read $ params !! 2
batches = runPar $ parMap runSplit (splits n k numTasks)
reducedNumbers = foldl S.union S.empty $ fmap S.fromList batches
print $ S.size reducedNumbers
The result is that the sequential version is way faster and uses little memory, while the parallel version consumes a lot of memory and is noticeably slower.
What might be causing this? Are threads a good approach for this problem? For example, every thread generates a (potentially large) list of integers and the main thread reduces the results; are threads expected to handle results that need much memory, or are they meant only for CPU-intensive computations that produce small results?
I compile my program with stack build --ghc-options -threaded --ghc-options -rtsopts --executable-profiling --library-profiling and run it with ./.stack-work/install/x86_64-osx/lts-6.1/7.10.3/bin/combinatorics 20 3 4 +RTS -pa -N4 -RTS for n=20, k=3 and numSplits=4. An example of the profiling report for the parallel version can be found here and for the sequential version here.
In your sequential version, calling combs does not build up a list in memory: after length consumes an element, it isn't needed anymore and is freed. Indeed, GHC may not even allocate storage for it.
For instance, this will take a while but won't consume a lot of memory:
main = print $ length [1..1000000000] -- 1 billion
In your parallel version you are generating sub-lists, concatenating them together, building Sets, etc. and therefore the results of each sub-task have to be kept in memory.
A fairer comparison would be to have each parallel task compute the length of the k-bit numbers in its assigned range, and then add up the results. That way the k-bit numbers found by each parallel task wouldn't have to be kept in memory, and each task would operate more like the sequential version.
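For instance, a minimal sketch of that count-then-sum idea (untested, reusing splits and runSplit from the question together with monad-par):

import Control.Monad.Par (parMap, runPar)
import System.Environment (getArgs)

main :: IO ()
main = do
  params <- getArgs
  let n        = read $ params !! 0
      k        = read $ params !! 1
      numTasks = read $ params !! 2
      -- each task reduces its range to a single Int, so the k-bit
      -- numbers it enumerates can be freed as soon as they are counted
      counts = runPar $ parMap (length . runSplit) (splits n k numTasks)
  print $ sum counts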
Update
Here is an example of how to use parMap. Note: under GHC 7.10.2 I've had mixed success getting the parallelism to fire - sometimes it does and sometimes it doesn't. (Figured it out - I was using -RTS -N2 instead of +RTS -N2.)
{-
compile with: ghc -O2 -threaded -rtsopts foo.hs
compare:
time ./foo 26 +RTS -N1
time ./foo 26 +RTS -N2
-}
import Data.Bits
import Control.Parallel.Strategies
import System.Environment
nextKBitNumber :: Integer -> Integer
nextKBitNumber n
| n == 0 = 0
| otherwise = ripple .|. ones
where smallest = n .&. (-n)
ripple = n + smallest
newSmallest = ripple .&. (-ripple)
ones = (newSmallest `div` smallest) `shiftR` 1 - 1
combs :: Int -> Int -> [Integer]
combs n k = takeWhile (<= end) $ iterate nextKBitNumber start
where start = 2^k - 1
end = shift start (n-k)
main :: IO ()
main = do
( arg1 : _) <- getArgs
let n = read arg1
print $ parMap rseq (length . combs n) [1..n]
"good approaches for this problem"
What do you mean by this problem? If it's how to write, analyze and tune a parallel Haskell program, then this is required background reading:
Simon Marlow: Parallel and Concurrent Programming in Haskell
http://community.haskell.org/~simonmar/pcph/
in particular, Section 15 (Debugging, Tuning, ..)
Use threadscope! (a graphical viewer for thread profile information generated by the Glasgow Haskell compiler) https://hackage.haskell.org/package/threadscope
Related
How can computations done in ST be made to run in parallel?
I have a vector which needs to be filled in by random access, hence the use of ST, and the computation runs correctly single-threaded, but I have been unable to figure out how to use more than one core.
Random access is needed because of the meaning of the indices into the vector. There are n things, and every possible way of choosing among n things has an entry in the vector, as in the choice function. Each of these choices corresponds to a binary number (conceptually, a packed [Bool]) and these Int values are the indices. If there are n things, then the size of the vector is 2^n. The natural way the algorithm runs is for every entry corresponding to "n choose 1" to be filled in, then every entry for "n choose 2," etc. The entries corresponding to "n choose k" depend on the entries corresponding to "n choose (k-1)." The integers for the different choices do not occur in numerical order, which is why random access is needed.
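For concreteness, here is a small illustration of that subset-to-index correspondence (my own sketch, not part of the original code):

import Data.Bits (popCount, setBit)

-- a subset of n things is conceptually a packed [Bool]; reading it as a
-- binary number gives its index into the vector of size 2^n
subsetIndex :: [Int] -> Int
subsetIndex = foldl setBit 0   -- e.g. subsetIndex [0,2] == 5 (binary 101)

-- the "n choose k" layer an index belongs to is its population count,
-- e.g. layer 5 == 2
layer :: Int -> Int
layer = popCount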
Here's a pointless (but slow) computation that follows the same pattern. The example function shows how I tried to break the computation up so that the bulk of the work is done in a pure world (no ST monad). In the code below, bogus is where most of the work is done, with the intent of calling that in parallel, but only one core is ever used.
import qualified Data.Vector as Vb
import qualified Data.Vector.Mutable as Vm
import qualified Data.Vector.Generic.Mutable as Vg
import qualified Data.Vector.Generic as Gg
import Control.Monad.ST as ST ( ST, runST )
import Data.Foldable(forM_)
import Data.Char(digitToInt)
main :: IO ()
main = do
putStrLn $ show (example 9)
example :: Int -> Vb.Vector Int
example n = runST $ do
m <- Vg.new (2^n) :: ST s (Vm.STVector s Int)
Vg.unsafeWrite m 0 (1)
forM_ [1..n] $ \i -> do
p <- prev m n (i-1)
let newEntries = (choiceList n i) :: [Int]
forM_ newEntries $ \e -> do
let v = bogus p e
Vg.unsafeWrite m e v
Gg.unsafeFreeze m
choiceList :: Int -> Int -> [Int]
choiceList _ 0 = [0]
choiceList n 1 = [ 2^k | k <- [0..(n-1) ] ]
choiceList n k
| n == k = [2^n - 1]
| otherwise = (choiceList (n-1) k) ++ (map ((2^(n-1)) +) $ choiceList (n-1) (k-1))
prev :: Vm.STVector s Int -> Int -> Int -> ST s Integer
prev m n 0 = return 1
prev m n i = do
let chs = choiceList n i
v <- mapM (\k -> Vg.unsafeRead m k ) chs
let e = map (\k -> toInteger k ) v
return (sum e)
bogus :: Integer -> Int -> Int
bogus prior index = do
let f = fac prior
let g = (f^index) :: Integer
let d = (map digitToInt (show g)) :: [Int]
let a = fromIntegral (head d)^2
a
fac :: Integer -> Integer
fac 0 = 1
fac n = n * fac (n - 1)
If anyone tests this, using more than 9 or 10 in show (example 9) will take much longer than you want to wait for such a pointless sequence of numbers.
Just do it in IO. If you need to use the result in pure code, then unsafePerformIO is available.
The following version runs about 3-4 times faster with +RTS -N16 than +RTS -N1. My changes involved converting the ST vectors to IO, changing the forM_ to forConcurrently_, and adding a bang annotation to let !v = bogus ....
Full code:
import qualified Data.Vector as Vb
import qualified Data.Vector.Mutable as Vm
import qualified Data.Vector.Generic.Mutable as Vg
import qualified Data.Vector.Generic as Gg
import Control.Monad.ST as ST ( ST, runST )
import Data.Foldable(forM_)
import Data.Char(digitToInt)
import Control.Concurrent.Async
import System.IO.Unsafe
main :: IO ()
main = do
let m = unsafePerformIO (example 9)
putStrLn $ show m
example :: Int -> IO (Vb.Vector Int)
example n = do
m <- Vg.new (2^n)
Vg.unsafeWrite m 0 (1)
forM_ [1..n] $ \i -> do
p <- prev m n (i-1)
let newEntries = (choiceList n i) :: [Int]
forConcurrently_ newEntries $ \e -> do
let !v = bogus p e
Vg.unsafeWrite m e v
Gg.unsafeFreeze m
choiceList :: Int -> Int -> [Int]
choiceList _ 0 = [0]
choiceList n 1 = [ 2^k | k <- [0..(n-1) ] ]
choiceList n k
| n == k = [2^n - 1]
| otherwise = (choiceList (n-1) k) ++ (map ((2^(n-1)) +) $ choiceList (n-1) (k-1))
prev :: Vm.IOVector Int -> Int -> Int -> IO Integer
prev m n 0 = return 1
prev m n i = do
let chs = choiceList n i
v <- mapM (\k -> Vg.unsafeRead m k ) chs
let e = map (\k -> toInteger k ) v
return (sum e)
bogus :: Integer -> Int -> Int
bogus prior index = do
let f = fac prior
let g = (f^index) :: Integer
let d = (map digitToInt (show g)) :: [Int]
let a = fromIntegral (head d)^2
a
fac :: Integer -> Integer
fac 0 = 1
fac n = n * fac (n - 1)
I think this cannot be done in a safe way. In the general case, it seems it would break Haskell's referential transparency.
If we could perform multi-threaded computations within ST s, then we could spawn two threads that race over the same STRef s Bool. Let's say one thread is writing False and the other one True.
After we use runST on the computation, we get an expression of type Bool which is sometimes False and sometimes True. That should not be possible.
If you are absolutely certain that your parallelization does not break referential transparency, you could try using unsafe primitives like unsafeIOToST to spawn new threads. Use with extreme care.
There might be safer ways to achieve something similar. Outside ST, we do have some parallelism available in Control.Parallel.Strategies.
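For illustration, a minimal sketch of that route (my own example, with expensive standing in for the real per-element work): compute the results purely and in parallel, then perform the vector writes sequentially.

import Control.Parallel.Strategies (parList, rseq, using)

-- evaluate every element of the list in parallel via the spark pool;
-- the caller can then write the finished values into a vector
parResults :: (Int -> Int) -> [Int] -> [Int]
parResults expensive xs = map expensive xs `using` parList rseq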
There are a number of ways to do parallelization in Haskell. Usually they give comparable performance improvements; however, some are better than others, and which is best mostly depends on the problem being parallelized. This particular use case looked very interesting to me, so I decided to investigate a few approaches.
Approaches
vector-strategies
We are using a boxed vector, therefore we can utilize laziness and the built-in spark pool for parallelization. One very simple approach is provided by the vector-strategies package, which can iterate over any immutable boxed vector and evaluate all of the thunks in parallel. It is also possible to split the vector into chunks, but as it turns out, a chunk size of 1 is optimal:
exampleParVector :: Int -> Vb.Vector Int
exampleParVector n = example n `using` parVector 1
parallel
parVector uses par underneath and requires one extra iteration over the vector. In this case we are already iterating over the vector, so it actually makes more sense to use par from the parallel package directly. This lets us perform the computation in parallel while continuing to use the ST monad:
import Control.Parallel (par)
...
forM_ [1..n] $ \i -> do
p <- prev m n (i-1)
let newEntries = choiceList n i :: [Int]
forM_ newEntries $ \e -> do
let v = bogus p e
v `par` Vg.unsafeWrite m e v
It is important to note that the computation of each element of the vector is expensive when compared to the total number of elements in the vector. That is why using par is a very good solution here. If it were the opposite, namely a very large vector whose elements aren't too expensive to compute, it would be better to use an unboxed vector and switch to a different parallelization method.
async
Another way was described by @K.A.Buhr. Switch from ST to IO and use async:
import Control.Concurrent.Async (forConcurrently_)
...
forM_ [1..n] $ \i -> do
p <- prev m n (i-1)
let newEntries = choiceList n i :: [Int]
forConcurrently_ newEntries $ \e -> do
let !v = bogus p e
Vg.unsafeWrite m e v
The concern that @chi has raised is a valid one; however, in this particular implementation it is safe to use unsafePerformIO instead of runST, because the parallelization does not violate the invariant of deterministic computation. Namely, we can promise that regardless of the input supplied to the example function, the output will always be exactly the same.
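Concretely, such a wrapper might look like this (a sketch reusing example and the imports from the full code above; the NOINLINE pragma is the usual safeguard when using unsafePerformIO):

-- safe only because example n is deterministic for every n
exampleVector :: Int -> Vb.Vector Int
exampleVector n = unsafePerformIO (example n)
{-# NOINLINE exampleVector #-}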
scheduler
Green threads are pretty cheap in Haskell, but they aren't free. The solution above with the async package has one slight drawback: it will spin up at least as many threads as there are elements in the newEntries list each time forConcurrently_ is called. It would be better to spin up only as many threads as there are capabilities (the -N RTS option) and let them do all the work. For this we can use the scheduler package, which is a work-stealing scheduler:
import Control.Scheduler (Comp(Par), runBatch_, scheduleWork_, withScheduler_)
...
withScheduler_ Par $ \scheduler ->
forM_ [1..n] $ \i -> runBatch_ scheduler $ \_ -> do
p <- prev m n (i-1)
let newEntries = choiceList n i :: [Int]
forM_ newEntries $ \e -> scheduleWork_ scheduler $ do
let !v = bogus p e
Vg.unsafeWrite m e v
The spark pool in GHC also uses a work-stealing scheduler, which is built into the RTS and unrelated to the package above in any shape or form, but the idea is very similar: a few threads with many units of computation.
Benchmarks
Here are some benchmarks on a 16-core machine for all of the approaches with example 7 (value 9 takes on the order of seconds, which introduces too much noise for criterion). We only get about x5 speedup, because a significant part of the algorithm is sequential in nature and can't be parallelized.
I am writing a function that generates a million random numbers, each 0 or 1, and then counts how many 0s were generated.
import System.Random
import Control.Monad
countZeros :: Int -> IO Int
countZeros n = (length . filter (==0)) <$> (replicateM n $ randomRIO (0,1 :: Int))
countZeros' :: Int -> IO Int
countZeros' n = go n 0
where
go :: Int -> Int -> IO Int
go x acc = do
r <- randomRIO (0,1 :: Int)
case x of
0 -> pure acc
_ -> let acc' = if r == 0 then succ acc else acc
in go (pred x) acc'
When I run the functions with an input of 1000000:
>λ= countZeros 1000000
499716
(0.93 secs, 789,015,080 bytes)
>λ= countZeros' 1000000
500442
(2.02 secs, 1,109,569,560 bytes)
I don't understand why the primed function is twice as slow as the other. I assumed that they were essentially doing the same thing behind the scenes.
I am using GHCi.
What am I missing?
With bang patterns, and proper compilation with -O2, the primed function is faster:
{-# LANGUAGE BangPatterns #-}
module Main where
import System.Random
import Control.Monad
import System.Environment
countZeros :: Int -> IO Int
countZeros n = (length . filter (==0)) <$> (replicateM n $ randomRIO (0,1 :: Int))
countZeros' :: Int -> IO Int
countZeros' n = go n 0
where
go :: Int -> Int -> IO Int
go !x !acc = do
r <- randomRIO (0,1 :: Int)
case x of
0 -> pure acc
_ -> let acc' = if r == 0 then succ acc else acc
in go (pred x) acc'
main :: IO ()
main = do
[what] <- getArgs
let n = 1000 * 1000 * 10
fun = case what of
"1" -> countZeros
"2" -> countZeros'
_ -> error "arg not a number"
putStrLn "----"
print =<< fun n
putStrLn "----"
Compiled with
$ stack ghc -- RandomPerf.hs -O2 -Wall
$ stack ghc -- --version
The Glorious Glasgow Haskell Compilation System, version 8.6.3
Tests:
$ time ./RandomPerf.exe 1
----
4999482
----
real 0m3.329s
user 0m0.000s
sys 0m0.031s
$ time ./RandomPerf.exe 2
----
5001089
----
real 0m2.338s
user 0m0.000s
sys 0m0.046s
Repeating the tests gives comparable results, so this is not a fluke.
Result: the countZeros' function is significantly faster.
Using Criterion and running a proper benchmark is left as an exercise.
You probably used GHCi to assess performance, which prevents the optimizer from doing its job. GHCi sacrifices proper optimization in order to load files faster and be more usable interactively.
These actually work in different ways from each other, at a level that matters. And both are slow.
The version using replicateM is bad because replicateM in IO can't stream its results. The entire list will be constructed at once, before filter and length get to start operating on it. The reason it's faster is that length is strict in its accumulator, so it doesn't generate a massive nested chain of thunks the way your other version does. And that's even worse for performance.
The recursive version doesn't use a strict accumulator. This means that the value it returns is a giant chain of nested thunks, holding on to all the generated entries along with a bunch of indirect calls via list indexing. This uses even more memory than the filter version, because it holds on to a bunch of closures as well as all the values. But even with that fixed, it would still be slow: using !! just wrecks performance, recursing where a simple if would do the same job much more efficiently.
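In the same spirit as the bang-patterns answer above, here is a sketch that combines both fixes, streaming the draws through a strict fold instead of materializing a list (countZerosFold is my own name; it assumes the same System.Random API as the question):

{-# LANGUAGE BangPatterns #-}
import Control.Monad (foldM)
import System.Random (randomRIO)

-- no million-element list is built, and the bang forces the running
-- count at every step so no thunk chain accumulates
countZerosFold :: Int -> IO Int
countZerosFold n = foldM step 0 [1 .. n]
  where
    step !acc _ = do
      r <- randomRIO (0, 1 :: Int)
      pure (if r == 0 then acc + 1 else acc)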
I've got a function, in my minimal example called maybeProduceValue i j, which is only valid when i > j. Note that in my actual code the js are not uniform, so the data only resembles a triangular matrix; I don't know the mathematical name for this.
I'd like my code, which loops over i and j and returns essentially (where js is sorted)
[maximum [f i j | j <- js, j < i] | i <- [0..iMax]]
to not check any more js once one has failed. In C-like languages, this is as simple as
if (j >= i) {break;}
and I'm trying to recreate this behaviour in Haskell. I've got two implementations below:
one which tries to take advantage of laziness by using takeWhile to only inspect at most one value (per i) which fails the test and returns Nothing;
one which remembers the number of js which worked for the previous i and so, for i+1, it doesn't bother doing any safety checks until it exceeds this number.
This latter function is more than twice as fast by my benchmarks, but it really is a mess - I'm trying to convince people that Haskell is concise and safe while still reasonably performant, and here is some fast code which is dense, cluttered and does a bunch of unsafe operations.
Is there a solution, perhaps using Cont, Error or Exception, that can achieve my desired behaviour?
n.b. I've tried using Traversable.mapAccumL and Vector.unfoldrN instead of State and they end up being about the same speed and clarity. It's still a very overcomplicated way of solving this problem.
import Criterion.Config
import Criterion.Main
import Control.DeepSeq
import Control.Monad.State
import Data.Maybe
import qualified Data.Traversable as T
import qualified Data.Vector as V
main = deepseq inputs $ defaultMainWith (defaultConfig{cfgSamples = ljust 10}) (return ()) [
bcompare [
bench "whileJust" $ nf whileJust js,
bench "memoised" $ nf memoisedSection js
]]
iMax = 5000
jMax = 10000
-- any sorted vector
js :: V.Vector Int
js = V.enumFromN 0 jMax
maybeProduceValue :: Int -> Int -> Maybe Float
maybeProduceValue i j | j < i = Just (fromIntegral (i+j))
| otherwise = Nothing
unsafeProduceValue :: Int -> Int -> Float
-- unsafeProduceValue i j | j >= i = error "you fool!"
unsafeProduceValue i j = fromIntegral (i+j)
whileJust, memoisedSection
:: V.Vector Int -> V.Vector Float
-- mean: 389ms
-- short circuits properly
whileJust inputs' = V.generate iMax $ \i ->
safeMax . V.map fromJust . V.takeWhile isJust $ V.map (maybeProduceValue i) inputs'
where safeMax v = if V.null v then 0 else V.maximum v
-- mean: 116ms
-- remembers the (monotonically increasing) length of the section of
-- the vector that is safe. I have tested that this doesn't violate the condition that j < i
memoisedSection inputs' = flip evalState 0 $ V.generateM iMax $ \i -> do
validSection <- state $ \oldIx ->
let newIx = oldIx + V.length (V.takeWhile (< i) (V.unsafeDrop oldIx inputs'))
in (V.unsafeTake newIx inputs', newIx)
return $ V.foldl' max 0 $ V.map (unsafeProduceValue i) validSection
Here's a simple way of solving the problem with Applicatives, provided that you don't need to keep the rest of the list once you run into an issue:
import Control.Applicative
memoizeSections :: [(Int, Int)] -> Maybe [Float]
memoizeSections [] = Just []
memoizeSections ((x, y):xs) = (:) <$> maybeProduceValue x y <*> memoizeSections xs
This is equivalent to:
import Data.Traversable
memoizeSections :: [(Int, Int)] -> Maybe [Float]
memoizeSections = traverse (uncurry maybeProduceValue)
and will return Nothing on the first occurrence of failure. Note that I don't know how fast this is, but it's certainly concise, and arguably pretty clear (particularly the first example).
Some minor comments:
-- any sorted vector
js :: V.Vector Int
js = V.enumFromN 0 jMax
If you have a vector of Ints (or Floats, etc), you want to use Data.Vector.Unboxed.
maybeProduceValue :: Int -> Int -> Maybe Float
maybeProduceValue i j | j < i = Just (fromIntegral (i+j))
| otherwise = Nothing
Since Just is lazy in its only field, this will create a thunk for the computation fromIntegral (i+j). You almost always want to apply Just like so
maybeProduceValue i j | j < i = Just $! fromIntegral (i+j)
There are some more thunks in:
memoisedSection inputs' = flip evalState 0 $ V.generateM iMax $ \i -> do
validSection <- state $ \oldIx ->
let newIx = oldIx + V.length (V.takeWhile (< i) (V.unsafeDrop oldIx inputs'))
in (V.unsafeTake newIx inputs', newIx)
return $ V.foldl' max 0 $ V.map (unsafeProduceValue i) validSection
Namely you want to:
let !newIx = oldIx + V.length (V.takeWhile (< i) (V.unsafeDrop oldIx inputs'))
!v = V.unsafeTake newIx inputs'
in (v, newIx)
as the pair is lazy in its fields and
return $! V.foldl' max 0 $ V.map (unsafeProduceValue i) validSection
because return in the state monad is lazy in the value.
You can use a guard in a single list comprehension:
[f i j | j <- js, i <- is, j < i]
If you're trying to get the same results as
[foo i j | i <- is, j <- js, j < i]
when you know that js is increasing, just write
[foo i j | i <- is, j <- takeWhile (< i) js]
There's no need to mess around with Maybe for this. Note that making the input list global has a likely-unfortunate effect: instead of fusing the production of the input list with its transformation(s) and ultimate consumption, it's forced to actually construct the list and then keep it in memory. It's quite possible that it will take longer to pull the list into cache from memory than to generate it piece by piece on the fly!
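To make the fusion point concrete, a sketch (foo and the bounds are hypothetical stand-ins) that keeps the producer local:

-- js stays local to result, so GHC can fuse its production with the
-- comprehension instead of materializing the whole list
result :: [Int]
result = [foo i j | i <- [0 .. iMax], j <- takeWhile (< i) js]
  where
    iMax    = 5000
    js      = [0 ..]      -- any increasing source works
    foo i j = i + j       -- stand-in for the real f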
I'm currently trying to optimize my solution to problem 14 of Project Euler.
I really enjoy Haskell and I think it's a very good fit for these kinds of problems; here are three different solutions I've tried:
import Data.List (unfoldr, maximumBy)
import Data.Maybe (fromJust, isNothing)
import Data.Ord (comparing)
import Control.Parallel
next :: Integer -> Maybe Integer
next 1 = Nothing
next n
| even n = Just (div n 2)
| odd n = Just (3 * n + 1)
get_sequence :: Integer -> [Integer]
get_sequence n = n : unfoldr (pack . next) n
where pack n = if isNothing n then Nothing else Just (fromJust n, fromJust n)
get_sequence_length :: Integer -> Integer
get_sequence_length n
| isNothing (next n) = 1
| otherwise = 1 + (get_sequence_length $ fromJust (next n))
-- 8 seconds
main1 = print $ maximumBy (comparing length) $ map get_sequence [1..1000000]
-- 5 seconds
main2 = print $ maximum $ map (\n -> (get_sequence_length n, n)) [1..1000000]
-- Never finishes
main3 = print solution
where
s1 = maximumBy (comparing length) $ map get_sequence [1..500000]
s2 = maximumBy (comparing length) $ map get_sequence [500001..10000000]
solution = (s1 `par` s2) `pseq` max s1 s2
Now if you look at the actual problem there's a lot of potential for caching, as most new sequences will contain subsequences that have already been calculated before.
For comparison, I wrote a version in C too:
Running time with caching: 0.03 seconds
Running time without caching: 0.3 seconds
That's just insane! Sure, caching reduced the time by a factor of 10, but even without caching it's still at least 17 times faster than my Haskell code.
What's wrong with my code?
Why doesn't Haskell cache the function calls for me? As the functions are pure, shouldn't caching be trivial, only a matter of available memory?
What's the problem with my third parallel version? Why doesn't it finish?
Regarding Haskell as a language, does the compiler automatically parallelize some code (folds, maps etc.), or does it always have to be done explicitly using Control.Parallel?
Edit: I stumbled upon this similar question. They mentioned that his function wasn't tail-recursive. Is my get_sequence_length tail recursive? If not how can I make it so?
Edit2: To Daniel:
Thanks a lot for the reply, really awesome.
I've been playing around with your improvements and I've found some really bad gotchas.
I'm running the tests on Windows 7 (64-bit), 3.3 GHz quad-core with 8 GB RAM.
The first thing I did was, as you say, replace all Integer with Int, but whenever I ran any of the mains I ran out of memory,
even with the stack size (+RTS -K<size> -RTS) set ridiculously high.
Eventually I found this (stackoverflow is awesome...), which means that since all Haskell programs on Windows are run as 32-bit, the Ints were overflowing, causing infinite recursion, just wow...
I ran the tests in a Linux virtual machine (with the 64-bit ghc) instead and got similar results.
Alright, let's start from the top. First important thing is to give the exact command line you're using to compile and run; for my answer, I'll use this line for the timings of all programs:
ghc -O2 -threaded -rtsopts test && time ./test +RTS -N
Next up: since timings vary greatly from machine to machine, we'll give some baseline timings for my machine and your programs. Here's the output of uname -a for my computer:
Linux sorghum 3.4.4-2-ARCH #1 SMP PREEMPT Sun Jun 24 18:59:47 CEST 2012 x86_64 Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz GenuineIntel GNU/Linux
The highlights are: quad-core, 2.4GHz, 64-bit.
Using main1: 30.42s user 2.61s system 149% cpu 22.025 total
Using main2: 21.42s user 1.18s system 129% cpu 17.416 total
Using main3: 22.71s user 2.02s system 220% cpu 11.237 total
Actually, I modified main3 in two ways: first, by removing one of the zeros from the end of the range in s2, and second, by changing max s1 s2 to maximumBy (comparing length) [s1, s2], since the former only accidentally computes the right answer. =)
I'll now focus on serial speed. (To answer one of your direct questions: no, GHC does not automatically parallelize or memoize your programs. Both of those things have overheads that are very difficult to estimate, and consequently it's very difficult to decide when doing them will be beneficial. I have no idea why even the serial solutions in this answer are getting >100% CPU utilization; perhaps some garbage collection is happening in another thread or some such thing.) We'll start from main2, since it was the faster of the two serial implementations. The cheapest way to get a little boost is to change all the type signatures from Integer to Int:
Using Int: 11.17s user 0.50s system 129% cpu 8.986 total (about twice as fast)
The next boost comes from reducing allocation in the inner loop (eliminating the intermediate Maybe values).
import Data.List
import Data.Ord
get_sequence_length :: Int -> Int
get_sequence_length 1 = 1
get_sequence_length n
| even n = 1 + get_sequence_length (n `div` 2)
| odd n = 1 + get_sequence_length (3 * n + 1)
lengths :: [(Int,Int)]
lengths = map (\n -> (get_sequence_length n, n)) [1..1000000]
main = print (maximumBy (comparing fst) lengths)
Using this: 4.84s user 0.03s system 101% cpu 4.777 total
The next boost comes from using faster operations than even and div:
import Data.Bits
import Data.List
import Data.Ord
even' n = n .&. 1 == 0
get_sequence_length :: Int -> Int
get_sequence_length 1 = 1
get_sequence_length n = 1 + get_sequence_length next where
next = if even' n then n `quot` 2 else 3 * n + 1
lengths :: [(Int,Int)]
lengths = map (\n -> (get_sequence_length n, n)) [1..1000000]
main = print (maximumBy (comparing fst) lengths)
Using this: 1.27s user 0.03s system 105% cpu 1.232 total
For those following along at home, this is about 17 times faster than the main2 that we started with -- a speedup competitive with switching to C.
For memoization, there are a few choices. The simplest is to use a pre-existing package like data-memocombinators to create an immutable array and read from it. The timings are fairly sensitive to choosing a good size for this array; for this problem, I found 50000 to be a pretty good upper bound.
import Data.Bits
import Data.MemoCombinators
import Data.List
import Data.Ord
even' n = n .&. 1 == 0
pre_length :: (Int -> Int) -> (Int -> Int)
pre_length f 1 = 1
pre_length f n = 1 + f next where
next = if even' n then n `quot` 2 else 3 * n + 1
get_sequence_length :: Int -> Int
get_sequence_length = arrayRange (1,50000) (pre_length get_sequence_length)
lengths :: [(Int,Int)]
lengths = map (\n -> (get_sequence_length n, n)) [1..1000000]
main = print (maximumBy (comparing fst) lengths)
With this: 0.53s user 0.10s system 149% cpu 0.421 total
The fastest of all is to use a mutable, unboxed array for the memoization bit. It's much less idiomatic, but it's bare-metal speed. The speed is much less sensitive to the size of this array, so long as the array is about as large as the biggest thing you want the answer for.
import Control.Monad
import Control.Monad.ST
import Data.Array.Base
import Data.Array.ST
import Data.Bits
import Data.List
import Data.Ord
even' n = n .&. 1 == 0
next n = if even' n then n `quot` 2 else 3 * n + 1
get_sequence_length :: STUArray s Int Int -> Int -> ST s Int
get_sequence_length arr n = do
bounds@(lo,hi) <- getBounds arr
if not (inRange bounds n) then (+1) `fmap` get_sequence_length arr (next n) else do
let ix = n-lo
v <- unsafeRead arr ix
if v > 0 then return v else do
v' <- get_sequence_length arr (next n)
unsafeWrite arr ix (v'+1)
return (v'+1)
maxLength :: (Int,Int)
maxLength = runST $ do
arr <- newArray (1,1000000) 0
writeArray arr 1 1
loop arr 1 1 1000000
where
loop arr n len 1 = return (n,len)
loop arr n len n' = do
len' <- get_sequence_length arr n'
if len' > len then loop arr n' len' (n'-1) else loop arr n len (n'-1)
main = print maxLength
With this: 0.16s user 0.02s system 138% cpu 0.130 total (which is competitive with the memoized C version)
GHC won't parallelize anything automatically for you. And as you guessed, get_sequence_length is not tail-recursive. See here. Consider how the compiler (unless it's doing some nice optimizations for you) can't evaluate all those recursive additions until you hit the end; you're "building up thunks", which isn't usually a good thing.
Try instead calling a recursive helper function and passing an accumulator, or try defining it in terms of foldr.
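For instance, a sketch of the accumulator version (with BangPatterns so the running count is forced at every step rather than deferred):

{-# LANGUAGE BangPatterns #-}

get_sequence_length :: Integer -> Integer
get_sequence_length = go 1
  where
    -- the partial count travels in acc, so each step is a plain tail call
    go !acc 1 = acc
    go !acc n
      | even n    = go (acc + 1) (n `div` 2)
      | otherwise = go (acc + 1) (3 * n + 1)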
longest'inc'subseq seq = maximum dp
where dp = 1 : [val n | n <- [1..length seq - 1]]
val n = (1 +) . filter'and'get'max ((<= top) . (seq!!)) $ [0..pred n]
where top = seq!!n
-----
filter'and'get'max f [] = 0
filter'and'get'max f [x] = if f x then dp!!x else 0
filter'and'get'max f (x:xs) = if f x then ( if vx > vxs then vx else vxs ) else vxs
where vx = dp!!x
vxs = filter'and'get'max f xs
That takes about 1-2 s when the length of seq is 1000,
while in Python the answer comes out immediately.
In Python:
def longest(s):
dp = [0]*len(s)
dp[0] = 1
for i in range(1,len(s)):
need = 0
for j in range (0, i):
if s[j] <= s[i] and dp[j] > need:
need = dp[j]
dp[i] = need + 1
return max(dp)
When the length of seq is 10000, the Haskell program runs for a very long time,
while Python returns the answer after 10-15 s.
Can we improve the Haskell speed?
Your core problem is that you're using the wrong data structure in Haskell for this algorithm. You've translated an algorithm that depends on O(1) lookups on a sequence (as in your Python code) into one that does O(n) lookups on a list in Haskell.
Use like-for-like data structures, and then your complexity problems will take care of themselves. In this case, it means using something like Data.Vector.Unboxed to represent the sequence, which has O(1) indexing, as well as low constant overheads in general.
With nothing more than a really mindless wrapping of your lists into Vectors I get 2.5 seconds when the input list is [1..10000].
import qualified Data.Vector as V
import Data.Vector (Vector, (!))
main = print $ liss [0..10000]
liss :: [Int] -> Int
liss seqL = V.maximum dp
where dp = V.fromList $ 1 : [val n | n <- [1..length seqL - 1]]
seq = V.fromList seqL
val n = (1 +) . filter'and'get'max ((<= top) . (seq!)) $ [0..pred n]
where top = seq!n
-----
filter'and'get'max :: (Int -> Bool) -> [Int] -> Int
filter'and'get'max f [] = 0
filter'and'get'max f [x] = if f x then dp!x else 0
filter'and'get'max f (x:xs) = if f x then ( if vx > vxs then vx else vxs ) else vxs
where vx = dp!x
vxs = filter'and'get'max f xs
The compilation and execution:
tommd@Mavlo:Test$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.0.3
tommd@Mavlo:Test$ ghc -O2 so.hs
[1 of 1] Compiling Main ( so.hs, so.o )
Linking so ...
tommd@Mavlo:Test$ time ./so
10001
real 0m2.536s
user 0m2.528s
A worker-wrapper transformation on filter'and'get'max seems to shave off another second.
Also, I don't understand why you need that middle case (filter'and'get'max f [x]); shouldn't it work fine without it? I guess this changes the result if dp!x < 0. Note that eliminating it saves 0.3 seconds right there.
And the Python code you provided takes ~10.7 seconds (I added a call of longest(range(1,10000))):
tommd@Mavlo:Test$ time python so.py
real 0m10.745s
user 0m10.729s
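Following up on the Data.Vector.Unboxed suggestion, here is a sketch (my own, untested) of the same dp table built left-to-right with constructN, so both the input and the table are unboxed with O(1) indexing; it is still O(n^2), but with much smaller constant factors:

import qualified Data.Vector.Unboxed as U

-- assumes a non-empty input vector
lissU :: U.Vector Int -> Int
lissU s = U.maximum dp
  where
    dp = U.constructN (U.length s) $ \d ->
      let i = U.length d        -- the index currently being filled
          -- max dp!j over j < i with s!j <= s!i, or 0 if there is none
          best = U.ifoldl' step 0 d
          step acc j v = if s U.! j <= s U.! i && v > acc then v else acc
      in best + 1

-- e.g. lissU (U.fromList [0 .. 10000]) == 10001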