Sorting in parallel performance - haskell

I tried to run some programs with multicore and kinda confused by the results.
By default sorting in program below takes 20 seconds, when I run it with +RTS -N2 it takes around 16 secs, but with +RTS -N4 it takes 21 second!
Why it is like that? And is there example of program that gets faster with each extra core? (had similar results with other programs in tutorials)
Here's example of program:
import Data.List
import Control.Parallel
import Data.Time.Clock.POSIX
qsort :: Ord a => [a] -> [a]
qsort (x:xs)
= let a = qsort $ filter (<=x) xs
b = qsort $ filter (>x) xs
in b `par` a ++ x:b
qsort [] = []
randomList :: Int -> [Int]
randomList n = take n $ tail (iterate lcg 1)
where lcg x = (a * x + c) `rem` m
a = 1664525
c = 1013904223
m = 2^32
main :: IO ()
main = do
let randints = randomList 5000000
t1 <- getPOSIXTime
print . sum $ qsort randints
t2 <- getPOSIXTime
putStrLn $ "SORT TIME: " ++ show (t2 - t1) ++ "\n"

I can't duplicate your results. (Which is a good thing, since I think I was the one claiming to see a performance improvement with -N2 and -N4 with the code you posted.)
On Linux with GHC 8.8.3, and compiling to a standalone executable with -O2 -threaded, I get the following timings on a 4-core desktop:
$ stack ghc -- --version
Stack has not been tested with GHC versions above 8.6, and using 8.8.3, this may fail
Stack has not been tested with Cabal versions above 2.4, but version 3.0.1.0 was found, this may fail
The Glorious Glasgow Haskell Compilation System, version 8.8.3
$ stack ghc -- -O2 -threaded QuickSort3.hs
Stack has not been tested with GHC versions above 8.6, and using 8.8.3, this may fail
Stack has not been tested with Cabal versions above 2.4, but version 3.0.1.0 was found, this may fail
[1 of 1] Compiling Main ( QuickSort3.hs, QuickSort3.o )
Linking QuickSort3 ...
$ ./QuickSort3 +RTS -N1
10741167410134688
SORT TIME: 7.671760902s
$ ./QuickSort3 +RTS -N2
10741167410134688
SORT TIME: 5.700858877s
$ ./QuickSort3 +RTS -N3
10741167410134688
SORT TIME: 4.88330669s
$ ./QuickSort3 +RTS -N4
10741167410134688
SORT TIME: 4.93364958s
I get similar results with a 16-core Linux laptop and also similar results with a 4-core Windows virtual machine (also using GHC 8.8.3) running on that laptop.
I can think of a few possible explanations for your results.
First, I don't have a tremendously fast desktop machine, so your timings of 20secs seem suspicious. Is it possible you're doing something like:
$ stack runghc QuickSort3.hs +RTS -N4
If so, this passes the +RTS flags to stack, and then runs the Haskell program in single-threaded mode using the slow byte-code interpreter. In my tests, the sort then takes about 30secs no matter what -Nx flag value I pass.
Second, is it possible you're running this on a virtual machine with a limited number of cores (or an extremely old piece of two-core hardware)? As noted, I tried testing under a Windows virtual machine and got similar results to the Linux version with a 4-core virtual machine but quite erratic results with a 2-core virtual machine (e.g., 11.4, 13.0, and 51.3secs for -N1, -N2, and -N4 respectively, so worse performance for more cores in general, and off-the-charts bad performance for 4 cores).
You could try the following simple parallel sums benchmark, which might scale better:
import Data.List
import Control.Parallel
import Data.Time.Clock.POSIX
randomList :: Int -> Int -> [Int]
randomList seed n = take n $ tail (iterate lcg seed)
where lcg x = (a * x + c) `rem` m
a = 1664525
c = 1013904223
m = 2^32
main :: IO ()
main = do
t1 <- getPOSIXTime
let n = 50000000
a = sum $ randomList 1 n
b = sum $ randomList 2 n
c = sum $ randomList 3 n
d = sum $ randomList 4 n
e = sum $ randomList 5 n
f = sum $ randomList 6 n
g = sum $ randomList 7 n
h = sum $ randomList 8 n
print $ a `par` b `par` c `par` d `par` e `par` f `par` g `par` h `par` (a+b+c+d+e+f+g+h)
t2 <- getPOSIXTime
putStrLn $ "SORT TIME: " ++ show (t2 - t1) ++ "\n"

Related

Why is recursion slower than filter with this Random number counter

I am writing a function that generates a million random numbers of 1 or 0 and then counts how many 0s were generated.
import System.Random
import Control.Monad
countZeros :: Int -> IO Int
countZeros n = (length . filter (==0)) <$> (replicateM n $ randomRIO (0,1 :: Int))
countZeros' :: Int -> IO Int
countZeros' n = go n 0
where
go :: Int -> Int -> IO Int
go x acc = do
r <- randomRIO (0,1 :: Int)
case x of
0 -> pure acc
_ -> let acc' = if r == 0 then succ acc else acc
in go (pred x) acc'
when I run the functions with an input of 1000000
>λ= countZeros 1000000
499716
(0.93 secs, 789,015,080 bytes)
>λ= countZeros' 1000000
500442
(2.02 secs, 1,109,569,560 bytes)
I don't understand why the prime function is twice as slow as the other. I assumed that they are essentially doing the same thing behind the scenes.
I am using GHCi.
What am I missing?
With bang patterns, and proper compilation with -O2, the "prime" function is faster:
{-# LANGUAGE BangPatterns #-}
module Main where
import System.Random
import Control.Monad
import System.Environment
countZeros :: Int -> IO Int
countZeros n = (length . filter (==0)) <$> (replicateM n $ randomRIO (0,1 :: Int))
countZeros' :: Int -> IO Int
countZeros' n = go n 0
where
go :: Int -> Int -> IO Int
go !x !acc = do
r <- randomRIO (0,1 :: Int)
case x of
0 -> pure acc
_ -> let acc' = if r == 0 then succ acc else acc
in go (pred x) acc'
main :: IO ()
main = do
[what] <- getArgs
let n = 1000 * 1000 * 10
fun = case what of
"1" -> countZeros
"2" -> countZeros'
_ -> error "arg not a number"
putStrLn "----"
print =<< fun n
putStrLn "----"
Compiled with
$ stack ghc -- RandomPerf.hs -O2 -Wall
$ stack ghc -- --version
The Glorious Glasgow Haskell Compilation System, version 8.6.3
Tests:
$ time ./RandomPerf.exe 1
----
4999482
----
real 0m3.329s
user 0m0.000s
sys 0m0.031s
$ time ./RandomPerf.exe 2
----
5001089
----
real 0m2.338s
user 0m0.000s
sys 0m0.046s
Repeating the tests gives comparable results, so this is not a fluke.
Result: the countZeros' function is significantly faster.
Using Criterion and running a proper benchmark is left as an exercise.
You probably used GHCi to assess performance, which prevents the optimizer to do its job. GHCi sacrifices proper optimization to load files faster, and be more usable in an interactive way.
These actually work in different ways from each other, at a level that matters. And both are slow.
The version using replicateM is bad because replicateM in IO can't stream its results. The entire list will be constructed at once, before filter and length get to start operating on it. The reason it's faster is that length is strict in its accumulator, so it doesn't generate a massive nested chain of thinks the way your other version does. And that's even worse for performance.
The recursive version doesn't use a strict accumulator. This means that the value it returns is a giant chain of nested thunks, holding on to all the generated entries and a bunch of indirect calls via list indexing. This is even more memory used than the filter version, because it's holding on to a bunch of closures as well as all the values. But even with that fixed, it would still be slow. Using !! just wrecks performance. It's recursive when a simple if would do the same job much more efficiently.

Haskell: parallel program not utilizing all cores

The following code has the same performance whether compiled with -threaded or without, or when I write the code in a single threaded manner. Both blocks (using par and the commented forkIO/forkOS/forkOn) result in the same performance. In fact, performance is slightly degraded in the parallel version (presumably due to the overhead of parallel GC). Viewing the CPU utilization from a program like htop shows only one CPU getting pegged, which is pretty confusing since my reading of the code is that it should use most of the cores.
The fact that forkOS doesn't use more cores is particularly confusing since the relevant section from ghc/rts/posix/OSThreads.c:forkOS_createThread seems to imply that it forces a call to pthread_create.
-- (Apologies if I have missed an import or two)
import Data.List
import GHC.Conc
import Control.Concurrent
import Control.DeepSeq
import qualified Data.HashMap.Lazy as HM
main :: IO ()
main = do
let [one::Int, two] = [15, 1000000]
{-
s <- numSparks
putStrLn $ "Num sparks " <> show s
n <- getNumCapabilities
putStrLn $ "Num capabilities " <> show n
m <- newEmptyMVar
forkIO $ void $ forM [(1::Int)..one] $ \cpu -> do
-- forkOn cpu $ void $ do
forkOS $ void $ do
-- forkIO $ void $ do
-- void $ do
putStrLn $ "core " <> show cpu
s <- return $ sort $ HM.keys $ HM.fromList $ zip [cpu..two + cpu] (repeat (0::Int))
putStrLn $ "core " <> show cpu <> " done " <> show (sum s)
putMVar m ()
forM [1..one] $ \i -> takeMVar m
let s :: String = "hey!"
putStrLn s
-}
print one
print two
let __pmap__ f xs = case xs of
[] -> []
x:xs -> let y = f x
ys = __pmap__ f xs
in (y `par` ys) `pseq` (y: ys)
n <- pure $ sum . concat $ flip __pmap__ [1..one] $ \i ->
force $ sort $ HM.keys $ HM.fromList $ zip [i..(two + i)] (repeat (0::Int))
putStrLn $ "sum " <> show n
s <- numSparks
putStrLn $ "Num sparks " <> show s
Relevant section from my .cabal file
ghc-options:
-threaded
-rtsopts
"-with-rtsopts=-N15 -qg1"
Platform information
$ stack --version
Version 1.2.0, Git revision 241cd07d576d9c0c0e712e83d947e3dd64541c42 (4054 commits) x86_64 hpack-0.14.0
$ stack exec ghc -- --version
The Glorious Glasgow Haskell Compilation System, version 7.10.3
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04.1 LTS
Release: 16.04
Codename: xenial
$ uname -r
4.4.0-36-generic
Why isn't my code getting parallelized?
EDIT: if it's helpful at all, adding the -s runtime flag produces the following report
21,829,377,776 bytes allocated in the heap
126,512,021,712 bytes copied during GC
86,659,312 bytes maximum residency (322 sample(s))
6,958,976 bytes maximum slop
218 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 41944 colls, 0 par 16.268s 17.272s 0.0004s 0.0011s
Gen 1 322 colls, 321 par 237.056s 23.822s 0.0740s 0.2514s
Parallel GC work balance: 13.01% (serial 0%, perfect 100%)
TASKS: 32 (1 bound, 31 peak workers (31 total), using -N15)
SPARKS: 15 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 15 fizzled)
INIT time 0.004s ( 0.003s elapsed)
MUT time 12.504s ( 13.301s elapsed)
GC time 253.324s ( 41.094s elapsed)
EXIT time 0.000s ( 0.017s elapsed)
Total time 265.920s ( 54.413s elapsed)
Alloc rate 1,745,791,568 bytes per MUT second
Productivity 4.7% of total user, 23.1% of total elapsed
gc_alloc_block_sync: 10725286
whitehole_spin: 0
gen[0].sync: 2171
gen[1].sync: 1057315
EDIT2: Messing with the arena size seems to have helped considerably. I added -H2G -A1G to the RTS options and the time came down from 43s to 5.2s. Is there anything else that can be improved about the situation to get a full 15x speedup?
EDIT3: Edited the code to reflect the par, pseq pattern suggested by two people giving feedback
The issue is caused by the definition of __pmap__. Specifically there is a problem in the following expression:
let y = f x
in y `par` (y: __pmap__ f xs)
You would expect that this would cause y and y: __pmap__ f xs to be evaluated in parallel, but this is not the case. What happens is that GHC tries to evaluate them in parallel, but the second subexpression contains y, which is the first subexpression. Because of that, the second subexpression depends on the first one and thus they cannot be evaluated in parallel. The correct way to write the above expression is
let y = f x
ys = __pmap__ f xs
in y `par` (ys `pseq` (y : ys))
because the pseq will force ys to be evaluated before y : ys and thus the evaluation of the second subexpression can be started while the evaluation of y is running. See also this thread for some discussion on this.
So putting it all together, we get the following:
main :: IO ()
main = do
let [one::Int, two] = [15, 1000000]
print one
print two
let __pmap__ f xs = case xs of
[] -> []
x:xs -> let y = f x
ys = __pmap__ f xs
in y `par` ys `pseq` (y : ys)
n <- pure $ sum . concat $ flip __pmap__ [1..one] $ \i ->
traceShow i $ force $ sort $ HM.keys $ HM.fromList $ zip [i..(two + i)] (repeat (0::Int))
putStrLn $ "sum " <> show n
s <- numSparks
putStrLn $ "Num sparks " <> show s
Notice that I've added a traceShow (from Debug.Trace). If you run this with -N1 in rtsopts you will see that the list will be evaluated one element at a time, whereas if you use -N3, it will be evaluated 3 elements at a time.
The moral of the story is that par and pseq are easy to misuse and you should therefore prefer higher level solutions such as parMap rdeepseq (which is equivalent to your __pmap__) from parallel.

Haskell: Parallel code is slower than sequential version

I am pretty new to Haskell threads (and parallel programming in general) and I am not sure why my parallel version of an algorithm runs slower than the corresponding sequential version.
The algorithm tries to find all k-combinations without using recursion. For this, I am using this helper function, which given a number with k bits set, returns the next number with the same number of bits set:
import Data.Bits
nextKBitNumber :: Integer -> Integer
nextKBitNumber n
| n == 0 = 0
| otherwise = ripple .|. ones
where smallest = n .&. (-n)
ripple = n + smallest
newSmallest = ripple .&. (-ripple)
ones = (newSmallest `div` smallest) `shiftR` 1 - 1
It is now easy to obtain sequentially all k-combinations in the range [(2^k - 1), (2^(n-k)+...+ 2^(n-1)):
import qualified Data.Stream as ST
combs :: Int -> Int -> [Integer]
combs n k = ST.takeWhile (<= end) $ kBitNumbers start
where start = 2^k - 1
end = sum $ fmap (2^) [n-k..n-1]
kBitNumbers :: Integer -> ST.Stream Integer
kBitNumbers = ST.iterate nextKBitNumber
main :: IO ()
main = do
params <- getArgs
let n = read $ params !! 0
k = read $ params !! 1
print $ length (combs n k)
My idea is that this should be easily parallelizable splitting this range into smaller parts. For example:
start :: Int -> Integer
start k = 2 ^ k - 1
end :: Int -> Int -> Integer
end n k = sum $ fmap (2 ^) [n-k..n-1]
splits :: Int -> Int -> Int -> [(Integer, Integer, Int)]
splits n k numSplits = fixedRanges ranges []
where s = start k
e = end n k
step = (e-s) `div` (min (e-s) (toInteger numSplits))
initSplits = [s,s+step..e]
ranges = zip initSplits (tail initSplits)
fixedRanges [] acc = acc
fixedRanges [x] acc = acc ++ [(fst x, e, k)]
fixedRanges (x:xs) acc = fixedRanges xs (acc ++ [(fst x, snd x, k)])
At this point, I would like to run each split in parallel, something like:
runSplit :: (Integer, Integer, Int) -> [Integer]
runSplit (start, end, k) = ST.takeWhile (<= end) $ kBitNumbers (fixStart start)
where fixStart s
| popCount s == k = s
| otherwise = fixStart $ s + 1
For pallalelization I am using the monad-par package:
import Control.Monad.Par
import System.Environment
import qualified Data.Set as S
main :: IO ()
main = do
params <- getArgs
let n = read $ params !! 0
k = read $ params !! 1
numTasks = read $ params !! 2
batches = runPar $ parMap runSplit (splits n k numTasks)
reducedNumbers = foldl S.union S.empty $ fmap S.fromList batches
print $ S.size reducedNumbers
The result is that the sequential version is way faster and it uses little memory, while the parallel version consumes a lot of memory and it is noticeable slower.
What might be the reasons causing this? Are threads a good approach for this problem? For example, every thread generates a (potentially large) list of integers and the main thread reduces the results; are threads expected to need much memory or are simply meant to produce simple results (i.e. only cpu-intensive computations)?
I compile my program with stack build --ghc-options -threaded --ghc-options -rtsopts --executable-profiling --library-profiling and run it with ./.stack-work/install/x86_64-osx/lts-6.1/7.10.3/bin/combinatorics 20 3 4 +RTS -pa -N4 -RTS for n=20, k=3 and numSplits=4. An example of the profiling report for the parallel version can be found here and for the sequential version here.
In your sequential version calling combs does not build up a list in memory since after length consumes an element it isn't needed anymore and is freed. Indeed, GHC may not even allocate storage for it.
For instance, this will take a while but won't consume a lot of memory:
main = print $ length [1..1000000000] -- 1 billion
In your parallel version you are generating sub-lists, concatenating them together, building Sets, etc. and therefore the results of each sub-task have to be kept in memory.
A fairer comparison would be to have each parallel task compute the length of the k-bit numbers in its assigned range, and then add up the results. That way the k-bit numbers found by each parallel task wouldn't have to be kept in memory and would operate more like the sequential version.
Update
Here is an example of how to use parMap. Note: under 7.10.2 I've had mixed success getting the parallelism to fire - sometimes it does and sometimes it doesn't. (Figured it out - I was using -RTS -N2 instead of +RTS -N2.)
{-
compile with: ghc -O2 -threaded -rtsopts foo.hs
compare:
time ./foo 26 +RTS -N1
time ./foo 26 +RTS -N2
-}
import Data.Bits
import Control.Parallel.Strategies
import System.Environment
nextKBitNumber :: Integer -> Integer
nextKBitNumber n
| n == 0 = 0
| otherwise = ripple .|. ones
where smallest = n .&. (-n)
ripple = n + smallest
newSmallest = ripple .&. (-ripple)
ones = (newSmallest `div` smallest) `shiftR` 1 - 1
combs :: Int -> Int -> [Integer]
combs n k = takeWhile (<= end) $ iterate nextKBitNumber start
where start = 2^k - 1
end = shift start (n-k)
main :: IO ()
main = do
( arg1 : _) <- getArgs
let n = read arg1
print $ parMap rseq (length . combs n) [1..n]
good approaches for this problem
What do you mean by this problem? If it's how to write, analyze and tune a parallel Haskell program, then this is required background reading:
Simon Marlow: Parallel and Concurrent Programming in Haskell
http://community.haskell.org/~simonmar/pcph/
in particular, Section 15 (Debugging, Tuning, ..)
Use threadscope! (a graphical viewer for thread profile information generated by the Glasgow Haskell compiler) https://hackage.haskell.org/package/threadscope

Caching in Haskell and explicit parallelism

I'm currently trying to optimize my solution to problem 14 at Projet Euler.
I really enjoy Haskell and I think it's a very good fit for these kind of problems, here's three different solutions I've tried:
import Data.List (unfoldr, maximumBy)
import Data.Maybe (fromJust, isNothing)
import Data.Ord (comparing)
import Control.Parallel
next :: Integer -> Maybe (Integer)
next 1 = Nothing
next n
| even n = Just (div n 2)
| odd n = Just (3 * n + 1)
get_sequence :: Integer -> [Integer]
get_sequence n = n : unfoldr (pack . next) n
where pack n = if isNothing n then Nothing else Just (fromJust n, fromJust n)
get_sequence_length :: Integer -> Integer
get_sequence_length n
| isNothing (next n) = 1
| otherwise = 1 + (get_sequence_length $ fromJust (next n))
-- 8 seconds
main1 = print $ maximumBy (comparing length) $ map get_sequence [1..1000000]
-- 5 seconds
main2 = print $ maximum $ map (\n -> (get_sequence_length n, n)) [1..1000000]
-- Never finishes
main3 = print solution
where
s1 = maximumBy (comparing length) $ map get_sequence [1..500000]
s2 = maximumBy (comparing length) $ map get_sequence [500001..10000000]
solution = (s1 `par` s2) `pseq` max s1 s2
Now if you look at the actual problem there's a lot of potential for caching, as most new sequences will contain subsequences that have already been calculated before.
For comparison, I wrote a version in C too:
Running time with caching: 0.03 seconds
Running time without caching: 0.3 seconds
That's just insane! Sure, caching reduced the time by a factor of 10, but even without caching it's still at least 17 times faster than my Haskell code.
What's wrong with my code?
Why doesn't Haskell cache the function calls for me? As the functions are pure caching shouldn't caching be trivial, only a matter of available memory?
What's the problem with my third parallel version? Why doesn't it finish?
Regarding Haskell as a language, does the compiler automatically parallellize some code (folds, maps etc), or does it always have to be done explicitly using Control.Parallel?
Edit: I stumbled upon this similar question. They mentioned that his function wasn't tail-recursive. Is my get_sequence_length tail recursive? If not how can I make it so?
Edit2: To Daniel:
Thanks a lot for the reply, really awesome.
I've been playing around with your improvements and I've found some really bad gotchas.
I'm running the tests on Windws 7 (64-bit), 3.3 GHZ Quad core with 8GB RAM.
The first thing I did was as you say replace all Integer with Int, but whenever I ran any of the mains I ran out of memory,
even with +RTS kSize -RTS set ridiciously high.
Eventually I found this (stackoverflow is awesome...), which means that since all Haskell programs on Windows are run as 32-bit, the Ints were overflowing causing infinite recursion, just wow...
I ran the tests in a Linux virtual machine (with the 64-bit ghc) instead and got similar results.
Alright, let's start from the top. First important thing is to give the exact command line you're using to compile and run; for my answer, I'll use this line for the timings of all programs:
ghc -O2 -threaded -rtsopts test && time ./test +RTS -N
Next up: since timings vary greatly from machine to machine, we'll give some baseline timings for my machine and your programs. Here's the output of uname -a for my computer:
Linux sorghum 3.4.4-2-ARCH #1 SMP PREEMPT Sun Jun 24 18:59:47 CEST 2012 x86_64 Intel(R) Core(TM)2 Quad CPU Q6600 # 2.40GHz GenuineIntel GNU/Linux
The highlights are: quad-core, 2.4GHz, 64-bit.
Using main1: 30.42s user 2.61s system 149% cpu 22.025 total
Using main2: 21.42s user 1.18s system 129% cpu 17.416 total
Using main3: 22.71s user 2.02s system 220% cpu 11.237 total
Actually, I modified main3 in two ways: first, by removing one of the zeros from the end of the range in s2, and second, by changing max s1 s2 to maximumBy (comparing length) [s1, s2], since the former only accidentally computes the right answer. =)
I'll now focus on serial speed. (To answer one of your direct questions: no, GHC does not automatically parallelize or memoize your programs. Both of those things have overheads that are very difficult to estimate, and consequently it's very difficult to decide when doing them will be beneficial. I have no idea why even the serial solutions in this answer are getting >100% CPU utilization; perhaps some garbage collection is happening in another thread or some such thing.) We'll start from main2, since it was the faster of the two serial implementations. The cheapest way to get a little boost is to change all the type signatures from Integer to Int:
Using Int: 11.17s user 0.50s system 129% cpu 8.986 total (about twice as fast)
The next boost comes from reducing allocation in the inner loop (eliminating the intermediate Maybe values).
import Data.List
import Data.Ord
get_sequence_length :: Int -> Int
get_sequence_length 1 = 1
get_sequence_length n
| even n = 1 + get_sequence_length (n `div` 2)
| odd n = 1 + get_sequence_length (3 * n + 1)
lengths :: [(Int,Int)]
lengths = map (\n -> (get_sequence_length n, n)) [1..1000000]
main = print (maximumBy (comparing fst) lengths)
Using this: 4.84s user 0.03s system 101% cpu 4.777 total
The next boost comes from using faster operations than even and div:
import Data.Bits
import Data.List
import Data.Ord
even' n = n .&. 1 == 0
get_sequence_length :: Int -> Int
get_sequence_length 1 = 1
get_sequence_length n = 1 + get_sequence_length next where
next = if even' n then n `quot` 2 else 3 * n + 1
lengths :: [(Int,Int)]
lengths = map (\n -> (get_sequence_length n, n)) [1..1000000]
main = print (maximumBy (comparing fst) lengths)
Using this: 1.27s user 0.03s system 105% cpu 1.232 total
For those following along at home, this is about 17 times faster than the main2 that we started with -- a competitive improvement with switching to C.
For memoization, there's a few choices. The simplest is to use a pre-existing package like data-memocombinators to create an immutable array and read from it. The timings are fairly sensitive to choosing a good size for this array; for this problem, I found 50000 to be a pretty good upper bound.
import Data.Bits
import Data.MemoCombinators
import Data.List
import Data.Ord
even' n = n .&. 1 == 0
pre_length :: (Int -> Int) -> (Int -> Int)
pre_length f 1 = 1
pre_length f n = 1 + f next where
next = if even' n then n `quot` 2 else 3 * n + 1
get_sequence_length :: Int -> Int
get_sequence_length = arrayRange (1,50000) (pre_length get_sequence_length)
lengths :: [(Int,Int)]
lengths = map (\n -> (get_sequence_length n, n)) [1..1000000]
main = print (maximumBy (comparing fst) lengths)
With this: 0.53s user 0.10s system 149% cpu 0.421 total
The fastest of all is to use a mutable, unboxed array for the memoization bit. It's much less idiomatic, but it's bare-metal speed. The speed is much less sensitive on the size of this array, so long as the array is about as large as the biggest thing you want the answer for.
import Control.Monad
import Control.Monad.ST
import Data.Array.Base
import Data.Array.ST
import Data.Bits
import Data.List
import Data.Ord
even' n = n .&. 1 == 0
next n = if even' n then n `quot` 2 else 3 * n + 1
get_sequence_length :: STUArray s Int Int -> Int -> ST s Int
get_sequence_length arr n = do
bounds#(lo,hi) <- getBounds arr
if not (inRange bounds n) then (+1) `fmap` get_sequence_length arr (next n) else do
let ix = n-lo
v <- unsafeRead arr ix
if v > 0 then return v else do
v' <- get_sequence_length arr (next n)
unsafeWrite arr ix (v'+1)
return (v'+1)
maxLength :: (Int,Int)
maxLength = runST $ do
arr <- newArray (1,1000000) 0
writeArray arr 1 1
loop arr 1 1 1000000
where
loop arr n len 1 = return (n,len)
loop arr n len n' = do
len' <- get_sequence_length arr n'
if len' > len then loop arr n' len' (n'-1) else loop arr n len (n'-1)
main = print maxLength
With this: 0.16s user 0.02s system 138% cpu 0.130 total (which is competitive with the memoized C version)
GHC won't parallel-ize anything automatically for you. And as you guess get_sequence_length is not tail-recursive. See here. And consider how the compiler (unless it's doing some nice optimizations for you) can't evaluate all those recursive additions until you hit the end; you're "building up thunks" which isn't usually a good thing.
Try instead calling a recursive helper function and passing an accumulator, or try defining it in terms of foldr.

Multi-Core Haskell on Windows

I've been reading a number of tutorials on Haskell. However, I have not been able to get the compiled application to run on a multicore (I have an Intel Quad Core) on windows (32 bit).
I have tried a number of things:
ghc -O2 --make A.hs -threaded
./real-par +RTS -N2
./real-par +RTS -N4
But no luck.
The compiled application runs 100% on one core only.
Any ideas?
Code:
import Control.Parallel
import Control.Monad
import Text.Printf
fib :: Int -> Int
fib 0 = 0
fib 1 = 1
fib n = l `pseq` r `pseq` l+r
where
l = fib (n-1)
r = fib (n-2)
main = forM_ [0..350] $ \i ->
printf "n=%d => %d\n" i (fib i)
Using par instead of pseq seems to fix it.
If vili is correct (I can't test as I don't own any MS boxes), it might be related to this bug

Resources