Understanding result of parBuffer-based strategy run - haskell

I want to understand Haskell Evaluation Strategies, and found the parBuffer concept very interesting. So I created a dumb slow function that calculates factorials, and tried to create a parallel version of it.
Consider the code below
module Main where
import System.Environment
import Control.Parallel.Strategies
fac :: Int -> Int
fac n
| n <= 1 = 1
| otherwise = n * fac (n-1)
-- sequential version
facsum :: Int -> Int
facsum n = sum $ map fac [1..n]
facsum_par :: Int -> Int
facsum_par n = sum lst where
lst = (map fac [1..n]) `using` parBuffer 50 rseq
main :: IO ()
main = do args <- getArgs
let (n:_) = args
print (facsum_par $ read n)
I would have expected, that the parBuffer would create bunches of 50 sparks, but the statistics for *n = 9000, and -N2 I get the output
SPARKS: 9000 (8277 converted, 688 overflowed, 0 dud, 1 GC'd, 34 fizzled)
Why are there overflowed sparks? As far as I understand this only happens if the amount of sparks would be larger than the size of the spark pool, but should this be limited to 50?
Why are there 34 fizzled sparks, is it only for the small, fast-calculating inputs?

Related

Why is recursion slower than filter with this Random number counter

I am writing a function that generates a million random numbers of 1 or 0 and then counts how many 0s were generated.
import System.Random
import Control.Monad
countZeros :: Int -> IO Int
countZeros n = (length . filter (==0)) <$> (replicateM n $ randomRIO (0,1 :: Int))
countZeros' :: Int -> IO Int
countZeros' n = go n 0
where
go :: Int -> Int -> IO Int
go x acc = do
r <- randomRIO (0,1 :: Int)
case x of
0 -> pure acc
_ -> let acc' = if r == 0 then succ acc else acc
in go (pred x) acc'
when I run the functions with an input of 1000000
>λ= countZeros 1000000
499716
(0.93 secs, 789,015,080 bytes)
>λ= countZeros' 1000000
500442
(2.02 secs, 1,109,569,560 bytes)
I don't understand why the prime function is twice as slow as the other. I assumed that they are essentially doing the same thing behind the scenes.
I am using GHCi.
What am I missing?
With bang patterns, and proper compilation with -O2, the "prime" function is faster:
{-# LANGUAGE BangPatterns #-}
module Main where
import System.Random
import Control.Monad
import System.Environment
countZeros :: Int -> IO Int
countZeros n = (length . filter (==0)) <$> (replicateM n $ randomRIO (0,1 :: Int))
countZeros' :: Int -> IO Int
countZeros' n = go n 0
where
go :: Int -> Int -> IO Int
go !x !acc = do
r <- randomRIO (0,1 :: Int)
case x of
0 -> pure acc
_ -> let acc' = if r == 0 then succ acc else acc
in go (pred x) acc'
main :: IO ()
main = do
[what] <- getArgs
let n = 1000 * 1000 * 10
fun = case what of
"1" -> countZeros
"2" -> countZeros'
_ -> error "arg not a number"
putStrLn "----"
print =<< fun n
putStrLn "----"
Compiled with
$ stack ghc -- RandomPerf.hs -O2 -Wall
$ stack ghc -- --version
The Glorious Glasgow Haskell Compilation System, version 8.6.3
Tests:
$ time ./RandomPerf.exe 1
----
4999482
----
real 0m3.329s
user 0m0.000s
sys 0m0.031s
$ time ./RandomPerf.exe 2
----
5001089
----
real 0m2.338s
user 0m0.000s
sys 0m0.046s
Repeating the tests gives comparable results, so this is not a fluke.
Result: the countZeros' function is significantly faster.
Using Criterion and running a proper benchmark is left as an exercise.
You probably used GHCi to assess performance, which prevents the optimizer to do its job. GHCi sacrifices proper optimization to load files faster, and be more usable in an interactive way.
These actually work in different ways from each other, at a level that matters. And both are slow.
The version using replicateM is bad because replicateM in IO can't stream its results. The entire list will be constructed at once, before filter and length get to start operating on it. The reason it's faster is that length is strict in its accumulator, so it doesn't generate a massive nested chain of thinks the way your other version does. And that's even worse for performance.
The recursive version doesn't use a strict accumulator. This means that the value it returns is a giant chain of nested thunks, holding on to all the generated entries and a bunch of indirect calls via list indexing. This is even more memory used than the filter version, because it's holding on to a bunch of closures as well as all the values. But even with that fixed, it would still be slow. Using !! just wrecks performance. It's recursive when a simple if would do the same job much more efficiently.

Haskell: Parallel code is slower than sequential version

I am pretty new to Haskell threads (and parallel programming in general) and I am not sure why my parallel version of an algorithm runs slower than the corresponding sequential version.
The algorithm tries to find all k-combinations without using recursion. For this, I am using this helper function, which given a number with k bits set, returns the next number with the same number of bits set:
import Data.Bits
nextKBitNumber :: Integer -> Integer
nextKBitNumber n
| n == 0 = 0
| otherwise = ripple .|. ones
where smallest = n .&. (-n)
ripple = n + smallest
newSmallest = ripple .&. (-ripple)
ones = (newSmallest `div` smallest) `shiftR` 1 - 1
It is now easy to obtain sequentially all k-combinations in the range [(2^k - 1), (2^(n-k)+...+ 2^(n-1)):
import qualified Data.Stream as ST
combs :: Int -> Int -> [Integer]
combs n k = ST.takeWhile (<= end) $ kBitNumbers start
where start = 2^k - 1
end = sum $ fmap (2^) [n-k..n-1]
kBitNumbers :: Integer -> ST.Stream Integer
kBitNumbers = ST.iterate nextKBitNumber
main :: IO ()
main = do
params <- getArgs
let n = read $ params !! 0
k = read $ params !! 1
print $ length (combs n k)
My idea is that this should be easily parallelizable splitting this range into smaller parts. For example:
start :: Int -> Integer
start k = 2 ^ k - 1
end :: Int -> Int -> Integer
end n k = sum $ fmap (2 ^) [n-k..n-1]
splits :: Int -> Int -> Int -> [(Integer, Integer, Int)]
splits n k numSplits = fixedRanges ranges []
where s = start k
e = end n k
step = (e-s) `div` (min (e-s) (toInteger numSplits))
initSplits = [s,s+step..e]
ranges = zip initSplits (tail initSplits)
fixedRanges [] acc = acc
fixedRanges [x] acc = acc ++ [(fst x, e, k)]
fixedRanges (x:xs) acc = fixedRanges xs (acc ++ [(fst x, snd x, k)])
At this point, I would like to run each split in parallel, something like:
runSplit :: (Integer, Integer, Int) -> [Integer]
runSplit (start, end, k) = ST.takeWhile (<= end) $ kBitNumbers (fixStart start)
where fixStart s
| popCount s == k = s
| otherwise = fixStart $ s + 1
For pallalelization I am using the monad-par package:
import Control.Monad.Par
import System.Environment
import qualified Data.Set as S
main :: IO ()
main = do
params <- getArgs
let n = read $ params !! 0
k = read $ params !! 1
numTasks = read $ params !! 2
batches = runPar $ parMap runSplit (splits n k numTasks)
reducedNumbers = foldl S.union S.empty $ fmap S.fromList batches
print $ S.size reducedNumbers
The result is that the sequential version is way faster and it uses little memory, while the parallel version consumes a lot of memory and it is noticeable slower.
What might be the reasons causing this? Are threads a good approach for this problem? For example, every thread generates a (potentially large) list of integers and the main thread reduces the results; are threads expected to need much memory or are simply meant to produce simple results (i.e. only cpu-intensive computations)?
I compile my program with stack build --ghc-options -threaded --ghc-options -rtsopts --executable-profiling --library-profiling and run it with ./.stack-work/install/x86_64-osx/lts-6.1/7.10.3/bin/combinatorics 20 3 4 +RTS -pa -N4 -RTS for n=20, k=3 and numSplits=4. An example of the profiling report for the parallel version can be found here and for the sequential version here.
In your sequential version calling combs does not build up a list in memory since after length consumes an element it isn't needed anymore and is freed. Indeed, GHC may not even allocate storage for it.
For instance, this will take a while but won't consume a lot of memory:
main = print $ length [1..1000000000] -- 1 billion
In your parallel version you are generating sub-lists, concatenating them together, building Sets, etc. and therefore the results of each sub-task have to be kept in memory.
A fairer comparison would be to have each parallel task compute the length of the k-bit numbers in its assigned range, and then add up the results. That way the k-bit numbers found by each parallel task wouldn't have to be kept in memory and would operate more like the sequential version.
Update
Here is an example of how to use parMap. Note: under 7.10.2 I've had mixed success getting the parallelism to fire - sometimes it does and sometimes it doesn't. (Figured it out - I was using -RTS -N2 instead of +RTS -N2.)
{-
compile with: ghc -O2 -threaded -rtsopts foo.hs
compare:
time ./foo 26 +RTS -N1
time ./foo 26 +RTS -N2
-}
import Data.Bits
import Control.Parallel.Strategies
import System.Environment
nextKBitNumber :: Integer -> Integer
nextKBitNumber n
| n == 0 = 0
| otherwise = ripple .|. ones
where smallest = n .&. (-n)
ripple = n + smallest
newSmallest = ripple .&. (-ripple)
ones = (newSmallest `div` smallest) `shiftR` 1 - 1
combs :: Int -> Int -> [Integer]
combs n k = takeWhile (<= end) $ iterate nextKBitNumber start
where start = 2^k - 1
end = shift start (n-k)
main :: IO ()
main = do
( arg1 : _) <- getArgs
let n = read arg1
print $ parMap rseq (length . combs n) [1..n]
good approaches for this problem
What do you mean by this problem? If it's how to write, analyze and tune a parallel Haskell program, then this is required background reading:
Simon Marlow: Parallel and Concurrent Programming in Haskell
http://community.haskell.org/~simonmar/pcph/
in particular, Section 15 (Debugging, Tuning, ..)
Use threadscope! (a graphical viewer for thread profile information generated by the Glasgow Haskell compiler) https://hackage.haskell.org/package/threadscope

My attempt at Project Euler #92 is too slow

I'm trying to solve Project Euler problem #92 with Haskell. I started learning Haskell recently. It's the first Project Euler problem I've tried to solve with Haskell, but my piece of code doesn't terminate even in 10 minutes. I know you don't give me the answer directly, but again I should warn I find answer with c++ doesn't give answer of Euler or new logic to solve Euler. I'm just curious why that guy doesn't work fast and what should I do to make it faster?
{--EULER 92--}
import Data.List
myFirstFunction 1 = 0
myFirstFunction 89 = 1
myFirstFunction x= myFirstFunction (giveResult x)
giveResult 0 = 0
giveResult x = (square (mod x 10)) + (giveResult (div x 10))
square x = x*x
a=[1..10000000]
main = putStrLn(show (sum (map myFirstFunction a)))
The biggest speedup can of course be gained from using a better algorithm. I'm not going deep into that here, though.
Original algorithm tweakings
So let's focus on improving the used algorithm without really changing it.
You never give any type signature, therefore the type defaults to the arbitrary precision Integer. Everything here fits easily in an Int, there's no danger of overflow, so let's use that. Adding a type signature myFirstFunction :: Int -> Int helps: time drops from Total time 13.77s ( 13.79s elapsed) to Total time 6.24s ( 6.24s elapsed) and total allocation drops by a factor of about 15. Not bad for such a simple change.
You use div and mod. These always compute a non-negative remainder and the corresponding quotient, so they need some extra checks in case some negative numbers are involved. The functions quot and rem map to the machine division instructions, they don't involve such checks and therefore are somewhat faster. If you compile via the LLVM backend (-fllvm), that also takes advantage of the fact that you always divide by a single known number (10), and converts the division into multiplication and bit-shift. Time now: Total time 1.56s ( 1.56s elapsed).
Instead of using quot and rem separately, let's use the quotRem function that computes both at once, so that we don't repeat the division (even with multiplication+shift that takes a little time):
giveResult x = case x `quotRem` 10 of
(q,r) -> r*r + giveResult q
That doesn't gain much, but a little: Total time 1.49s ( 1.49s elapsed).
You're using a list a = [1 .. 10000000], and map the function over that list and then sum the resulting list. That's idiomatic, neat and short, but not super fast, since allocating all those list cells and garbage collecting them takes time too - not very much, since GHC is very good at that, but transforming it into a loop
main = print $ go 0 1
where
go acc n
| n > 10000000 = acc
| otherwise = go (acc + myFirstFunction n) (n+1)
gains us a little still: Total time 1.34s ( 1.34s elapsed) and the allocation dropped from 880,051,856 bytes allocated in the heap for the last list version to 51,840 bytes allocated in the heap.
giveResult is recursive, and therefore cannot be inlined. The same holds for myFirstFunction, hence each computation needs two function calls (at least). We can avoid that by rewriting giveResult to a non-recursive wrapper and a recursive local loop,
giveResult x = go 0 x
where
go acc 0 = acc
go acc n = case n `quotRem` 10 of
(q,r) -> go (acc + r*r) q
so that that can be inlined: Total time 1.04s ( 1.04s elapsed).
Those were the most obvious points, further improvements - apart from the memoisation mentioned by hammar in the comments - would need some thinking.
We are now at
module Main (main) where
myFirstFunction :: Int -> Int
myFirstFunction 1 = 0
myFirstFunction 89 = 1
myFirstFunction x= myFirstFunction (giveResult x)
giveResult :: Int -> Int
giveResult x = go 0 x
where
go acc 0 = acc
go acc n = case n `quotRem` 10 of
(q,r) -> go (acc + r*r) q
main :: IO ()
main = print $ go 0 1
where
go acc n
| n > 10000000 = acc
| otherwise = go (acc + myFirstFunction n) (n+1)
With -O2 -fllvm, that runs in 1.04 seconds here, but with the native code generator (only -O2), it takes 3.5 seconds. That difference is due to the fact that GHC itself doesn't convert the division into a multiplication and bit-shift. If we do it by hand, we get pretty much the same performance from the native code generator.
Because we know something that the compiler doesn't, namely that we never deal with negative numbers here, and the numbers don't become large, we can even generate a better multiply-and-shift (that would produce wrong results for negative or large dividends) than the compiler and take the time down to 0.9 seconds for the native code generator and 0.73 seconds for the LLVM backend:
import Data.Bits
qr10 :: Int -> (Int, Int)
qr10 n = (q, r)
where
q = (n * 0x66666667) `unsafeShiftR` 34
r = n - 10 * q
Note: That requires that Int is a 64-bit type, it won't work with 32-bit Ints, it will produce wrong results for negative n, and the multiplication will overflow for large n. We're entering dirty-hack territory. We can alleviate the dirtyness by using Word instead of Int, that leaves only the overflow (which doesn't occur for n <= 10737418236 with Word resp n <= 5368709118 for Int, so here we are comfortably in the safe zone). The times aren't affected.
The corresponding C programme
#include <stdio.h>
unsigned int myFirstFunction(unsigned int i);
unsigned int giveResult(unsigned int i);
int main(void) {
unsigned int sum = 0;
for(unsigned int i = 1; i <= 10000000; ++i) {
sum += myFirstFunction(i);
}
printf("%u\n",sum);
return 0;
}
unsigned int myFirstFunction(unsigned int i) {
if (i == 1) return 0;
if (i == 89) return 1;
return myFirstFunction(giveResult(i));
}
unsigned int giveResult(unsigned int i) {
unsigned int acc = 0, r, q;
while(i) {
q = (i*0x66666667UL) >> 34;
r = i - q*10;
i = q;
acc += r*r;
}
return acc;
}
performs similarly, compiled with gcc -O3, it runs in 0.78 seconds, and with clang -O3 in 0.71.
That's pretty much the end without changing the algorithm.
Memoisation
Now, a minor change of algorithm is memoisation. If we build a lookup table for the numbers <= 7*9², we need only one computation of the sum of the squares of the digits for each number rather than iterating that until we reach 1 or 89, so let's memoise,
module Main (main) where
import Data.Array.Unboxed
import Data.Array.IArray
import Data.Array.Base (unsafeAt)
import Data.Bits
qr10 :: Int -> (Int, Int)
qr10 n = (q, r)
where
q = (n * 0x66666667) `unsafeShiftR` 34
r = n - 10 * q
digitSquareSum :: Int -> Int
digitSquareSum = go 0
where
go acc 0 = acc
go acc n = case qr10 n of
(q,r) -> go (acc + r*r) q
table :: UArray Int Int
table = array (0,567) $ assocs helper
where
helper :: Array Int Int
helper = array (0,567) [(i, f i) | i <- [0 .. 567]]
f 0 = 0
f 1 = 0
f 89 = 1
f n = helper ! digitSquareSum n
endPoint :: Int -> Int
endPoint n = table `unsafeAt` digitSquareSum n
main :: IO ()
main = print $ go 0 1
where
go acc n
| n > 10000000 = acc
| otherwise = go (acc + endPoint n) (n+1)
Doing the memoisation by hand instead of using a library makes the code longer, but we can tailor it to our needs. We can use an unboxed array, and we can omit the bounds check on the array access. Both significantly speed the computation up. The time is now 0.18 seconds for the native code generator, and 0.13 seconds for the LLVM backend. The corresponding C programme runs in 0.16 seconds compiled with gcc -O3, and 0.145 seconds compiled with clang -O3 (Haskell beats C, w00t!).
Scaling and a hint for a better algorithm
The used algorithm however doesn't scale too well, a bit worse than linear, and for an upper bound of 108 (with suitably adapted memoisation limit), it runs in 1.5 seconds (ghc -O2 -fllvm), resp. 1.64 seconds (clang -O3) and 1.87 seconds (gcc -O3) [2.02 seconds for the native code generator].
Using a different algorithm that counts the numbers whose sequence ends in 1 by partitioning such numbers into a sum of squares of digits (The only numbers that directly produce 1 are powers of 10. We can write
10 = 1×3² + 1×1²
10 = 2×2² + 2×1²
10 = 1×2² + 6×1²
10 = 10×1²
From the first, we obtain 13, 31, 103, 130, 301, 310, 1003, 1030, 1300, 3001, 3010, 3100, ...
From the second, we obtain 1122, 1212, 1221, 2112, 2121, 2211, 11022, 11202, ...
From the third 1111112, 1111121, ...
Only 13, 31, 103, 130, 301, 310 are possible sums of squares of the digits of numbers <= 10^10, so only those need to be investigated further. We can write
100 = 1×9² + 1×4² + 3×1²
...
100 = 1×8² + 1×6²
...
The first of these partitions generates no children since it requires five nonzero digits, the other explicitly given generates the two children 68 and 86 (also 608 if the limit is 108, more for larger limits)), we can get better scaling and a faster algorithm.
The fairly unoptimised programme I wrote way back when to solve this problem runs (input is exponent of 10 of the limit)
$ time ./problem92 7
8581146
real 0m0.010s
user 0m0.008s
sys 0m0.002s
$ time ./problem92 8
85744333
real 0m0.022s
user 0m0.018s
sys 0m0.003s
$ time ./problem92 9
854325192
real 0m0.040s
user 0m0.033s
sys 0m0.006s
$ time ./problem92 10
8507390852
real 0m0.074s
user 0m0.069s
sys 0m0.004s
in a different league.
First off, I took the liberty of cleaning up your code a little:
endsAt89 1 = 0
endsAt89 89 = 1
endsAt89 n = endsAt89 (sumOfSquareDigits n)
sumOfSquareDigits 0 = 0
sumOfSquareDigits n = (n `mod` 10)^2 + sumOfSquareDigits (n `div` 10)
main = print . sum $ map endsAt89 [1..10^7]
On my crappy netbook is 1 min 13 sec. Let's see if we can improve that.
Since the numbers are small, we can start by using machine-sized Int instead of arbitrary-size Integer. This is just a matter of adding type signatures, e.g.
sumOfSquareDigits :: Int -> Int
That improves the run time drastically to 20 seconds.
Since the numbers are all positive, we can replace div and mod with the slightly faster quot and rem, or even both in one go with quotRem:
sumOfSquareDigits :: Int -> Int
sumOfSquareDigits 0 = 0
sumOfSquareDigits n = r^2 + sumOfSquareDigits q
where (q, r) = quotRem x 10
Run time is now 17 seconds. Making it tail recursive shaves off another second:
sumOfSquareDigits :: Int -> Int
sumOfSquareDigits n = loop n 0
where
loop 0 !s = s
loop n !s = loop q (s + r^2)
where (q, r) = quotRem n 10
For further improvements, we can notice that sumOfSquareDigits returns at most 567 = 7 * 9^2 for the given input numbers, so we can memoize for small numbers to reduce the number of iterations needed. Here's my final version (using the data-memocombinators package for the memoization):
{-# LANGUAGE BangPatterns #-}
import qualified Data.MemoCombinators as Memo
endsAt89 :: Int -> Int
endsAt89 = Memo.arrayRange (1, 7*9^2) endsAt89'
where
endsAt89' 1 = 0
endsAt89' 89 = 1
endsAt89' n = endsAt89 (sumOfSquareDigits n)
sumOfSquareDigits :: Int -> Int
sumOfSquareDigits n = loop n 0
where
loop 0 !s = s
loop n !s = loop q (s + r^2)
where (q, r) = quotRem n 10
main = print . sum $ map endsAt89 [1..10^7]
This runs in just under 9 seconds on my machine.

Caching in Haskell and explicit parallelism

I'm currently trying to optimize my solution to problem 14 at Projet Euler.
I really enjoy Haskell and I think it's a very good fit for these kind of problems, here's three different solutions I've tried:
import Data.List (unfoldr, maximumBy)
import Data.Maybe (fromJust, isNothing)
import Data.Ord (comparing)
import Control.Parallel
next :: Integer -> Maybe (Integer)
next 1 = Nothing
next n
| even n = Just (div n 2)
| odd n = Just (3 * n + 1)
get_sequence :: Integer -> [Integer]
get_sequence n = n : unfoldr (pack . next) n
where pack n = if isNothing n then Nothing else Just (fromJust n, fromJust n)
get_sequence_length :: Integer -> Integer
get_sequence_length n
| isNothing (next n) = 1
| otherwise = 1 + (get_sequence_length $ fromJust (next n))
-- 8 seconds
main1 = print $ maximumBy (comparing length) $ map get_sequence [1..1000000]
-- 5 seconds
main2 = print $ maximum $ map (\n -> (get_sequence_length n, n)) [1..1000000]
-- Never finishes
main3 = print solution
where
s1 = maximumBy (comparing length) $ map get_sequence [1..500000]
s2 = maximumBy (comparing length) $ map get_sequence [500001..10000000]
solution = (s1 `par` s2) `pseq` max s1 s2
Now if you look at the actual problem there's a lot of potential for caching, as most new sequences will contain subsequences that have already been calculated before.
For comparison, I wrote a version in C too:
Running time with caching: 0.03 seconds
Running time without caching: 0.3 seconds
That's just insane! Sure, caching reduced the time by a factor of 10, but even without caching it's still at least 17 times faster than my Haskell code.
What's wrong with my code?
Why doesn't Haskell cache the function calls for me? As the functions are pure caching shouldn't caching be trivial, only a matter of available memory?
What's the problem with my third parallel version? Why doesn't it finish?
Regarding Haskell as a language, does the compiler automatically parallellize some code (folds, maps etc), or does it always have to be done explicitly using Control.Parallel?
Edit: I stumbled upon this similar question. They mentioned that his function wasn't tail-recursive. Is my get_sequence_length tail recursive? If not how can I make it so?
Edit2: To Daniel:
Thanks a lot for the reply, really awesome.
I've been playing around with your improvements and I've found some really bad gotchas.
I'm running the tests on Windws 7 (64-bit), 3.3 GHZ Quad core with 8GB RAM.
The first thing I did was as you say replace all Integer with Int, but whenever I ran any of the mains I ran out of memory,
even with +RTS kSize -RTS set ridiciously high.
Eventually I found this (stackoverflow is awesome...), which means that since all Haskell programs on Windows are run as 32-bit, the Ints were overflowing causing infinite recursion, just wow...
I ran the tests in a Linux virtual machine (with the 64-bit ghc) instead and got similar results.
Alright, let's start from the top. First important thing is to give the exact command line you're using to compile and run; for my answer, I'll use this line for the timings of all programs:
ghc -O2 -threaded -rtsopts test && time ./test +RTS -N
Next up: since timings vary greatly from machine to machine, we'll give some baseline timings for my machine and your programs. Here's the output of uname -a for my computer:
Linux sorghum 3.4.4-2-ARCH #1 SMP PREEMPT Sun Jun 24 18:59:47 CEST 2012 x86_64 Intel(R) Core(TM)2 Quad CPU Q6600 # 2.40GHz GenuineIntel GNU/Linux
The highlights are: quad-core, 2.4GHz, 64-bit.
Using main1: 30.42s user 2.61s system 149% cpu 22.025 total
Using main2: 21.42s user 1.18s system 129% cpu 17.416 total
Using main3: 22.71s user 2.02s system 220% cpu 11.237 total
Actually, I modified main3 in two ways: first, by removing one of the zeros from the end of the range in s2, and second, by changing max s1 s2 to maximumBy (comparing length) [s1, s2], since the former only accidentally computes the right answer. =)
I'll now focus on serial speed. (To answer one of your direct questions: no, GHC does not automatically parallelize or memoize your programs. Both of those things have overheads that are very difficult to estimate, and consequently it's very difficult to decide when doing them will be beneficial. I have no idea why even the serial solutions in this answer are getting >100% CPU utilization; perhaps some garbage collection is happening in another thread or some such thing.) We'll start from main2, since it was the faster of the two serial implementations. The cheapest way to get a little boost is to change all the type signatures from Integer to Int:
Using Int: 11.17s user 0.50s system 129% cpu 8.986 total (about twice as fast)
The next boost comes from reducing allocation in the inner loop (eliminating the intermediate Maybe values).
import Data.List
import Data.Ord
get_sequence_length :: Int -> Int
get_sequence_length 1 = 1
get_sequence_length n
| even n = 1 + get_sequence_length (n `div` 2)
| odd n = 1 + get_sequence_length (3 * n + 1)
lengths :: [(Int,Int)]
lengths = map (\n -> (get_sequence_length n, n)) [1..1000000]
main = print (maximumBy (comparing fst) lengths)
Using this: 4.84s user 0.03s system 101% cpu 4.777 total
The next boost comes from using faster operations than even and div:
import Data.Bits
import Data.List
import Data.Ord
even' n = n .&. 1 == 0
get_sequence_length :: Int -> Int
get_sequence_length 1 = 1
get_sequence_length n = 1 + get_sequence_length next where
next = if even' n then n `quot` 2 else 3 * n + 1
lengths :: [(Int,Int)]
lengths = map (\n -> (get_sequence_length n, n)) [1..1000000]
main = print (maximumBy (comparing fst) lengths)
Using this: 1.27s user 0.03s system 105% cpu 1.232 total
For those following along at home, this is about 17 times faster than the main2 that we started with -- a competitive improvement with switching to C.
For memoization, there's a few choices. The simplest is to use a pre-existing package like data-memocombinators to create an immutable array and read from it. The timings are fairly sensitive to choosing a good size for this array; for this problem, I found 50000 to be a pretty good upper bound.
import Data.Bits
import Data.MemoCombinators
import Data.List
import Data.Ord
even' n = n .&. 1 == 0
pre_length :: (Int -> Int) -> (Int -> Int)
pre_length f 1 = 1
pre_length f n = 1 + f next where
next = if even' n then n `quot` 2 else 3 * n + 1
get_sequence_length :: Int -> Int
get_sequence_length = arrayRange (1,50000) (pre_length get_sequence_length)
lengths :: [(Int,Int)]
lengths = map (\n -> (get_sequence_length n, n)) [1..1000000]
main = print (maximumBy (comparing fst) lengths)
With this: 0.53s user 0.10s system 149% cpu 0.421 total
The fastest of all is to use a mutable, unboxed array for the memoization bit. It's much less idiomatic, but it's bare-metal speed. The speed is much less sensitive on the size of this array, so long as the array is about as large as the biggest thing you want the answer for.
import Control.Monad
import Control.Monad.ST
import Data.Array.Base
import Data.Array.ST
import Data.Bits
import Data.List
import Data.Ord
even' n = n .&. 1 == 0
next n = if even' n then n `quot` 2 else 3 * n + 1
get_sequence_length :: STUArray s Int Int -> Int -> ST s Int
get_sequence_length arr n = do
bounds#(lo,hi) <- getBounds arr
if not (inRange bounds n) then (+1) `fmap` get_sequence_length arr (next n) else do
let ix = n-lo
v <- unsafeRead arr ix
if v > 0 then return v else do
v' <- get_sequence_length arr (next n)
unsafeWrite arr ix (v'+1)
return (v'+1)
maxLength :: (Int,Int)
maxLength = runST $ do
arr <- newArray (1,1000000) 0
writeArray arr 1 1
loop arr 1 1 1000000
where
loop arr n len 1 = return (n,len)
loop arr n len n' = do
len' <- get_sequence_length arr n'
if len' > len then loop arr n' len' (n'-1) else loop arr n len (n'-1)
main = print maxLength
With this: 0.16s user 0.02s system 138% cpu 0.130 total (which is competitive with the memoized C version)
GHC won't parallel-ize anything automatically for you. And as you guess get_sequence_length is not tail-recursive. See here. And consider how the compiler (unless it's doing some nice optimizations for you) can't evaluate all those recursive additions until you hit the end; you're "building up thunks" which isn't usually a good thing.
Try instead calling a recursive helper function and passing an accumulator, or try defining it in terms of foldr.

Haskell Nested Vector Parallel Strategy

Similar to this related question, I would like to perform a parallel map on a Vector, but in my case I have a nested Vector, and I can't seem to get the evaluation correct.
Here is what I have so far:
import qualified Data.Vector as V
import qualified Data.Vector.Unboxed as U
import Data.Vector.Strategies
import Control.DeepSeq
main = do
let res = genVVec 200 `using` parVector 2
print res
genUVec :: Int -> U.Vector Int
genUVec n = U.map (ack 2) $ U.enumFromN n 75
genVVec :: Int -> V.Vector (U.Vector Int)
genVVec n = V.map genUVec $ V.enumFromN 0 n
ack :: Int -> Int -> Int
ack 0 n = n+1
ack m 0 = ack (m-1) 1
ack m n = ack (m-1) (ack m (n-1))
instance (NFData a, U.Unbox a) => NFData (U.Vector a) where
rnf = rnf . U.toList
gives:
$ ./vectorPar +RTS -N8 -s >/dev/null
SPARKS: 200 (17 converted, 183 pruned)
Total time 1.37s ( 1.30s elapsed)
$ ./vectorPar +RTS -s >/dev/null
SPARKS: 200 (0 converted, 200 pruned)
Total time 1.25s ( 1.26s elapsed)
I have also tried modifying the parVector function in vector-strategies directly, but my attempts are clumsy and ineffective.
I realize REPA was designed for nested workloads, but this seems a simple enough problem, and I'd rather not have to rewrite a lot of code.
Note: Guilty author of vector-strategies here (which is a very small title, seeing as this was just a hacked up function I figured others would find useful).
Your observation that parVector is wrong in that it allows the sparks to be GCed prior to use seems to be correct. The advice by SimonM means I must do precisely what I was trying to avoid, construct a new vector, at some cost, in place of the old one. Knowing this is necessary, there is little reason not to change parVector to the much simpler definition of:
parVector2 :: NFData a => Int -> Strategy (V.Vector a)
parVector2 n = liftM V.fromList . parListChunk n rdeepseq . V.toList
Notice the fix given by John L only works because it "beats" the collector by forcing the computations before collection would occur.
I'll be changing the vector-strategies library so this is unnecessary - making your original code work fine. Unfortunately, this will incur the above-mentioned cost of constructing a new Vector (usually minimal).
The problem appears to be that parVector doesn't force evaluation of the elements of the vector. Each element remains a thunk and nothing is sparked until the vector is consumed (by being printed), which is too late for the sparks to do work. You can force evaluation of each element by composing the parVector strategy with rdeepseq.
import qualified Data.Vector as V
import qualified Data.Vector.Unboxed as U
import Data.Vector.Strategies
import Control.DeepSeq
import Control.Parallel.Strategies
main = do
let res = genVVec 200 `using` (rdeepseq `dot` parVector 20)
print res
genUVec :: Int -> U.Vector Int
genUVec n = U.map (ack 2) $ U.enumFromN n 75
genVVec :: Int -> V.Vector (U.Vector Int)
genVVec n = V.map genUVec $ V.enumFromN 0 n
ack :: Int -> Int -> Int
ack 0 n = n+1
ack m 0 = ack (m-1) 1
ack m n = ack (m-1) (ack m (n-1))
instance (NFData a, U.Unbox a) => NFData (U.Vector a) where
rnf vec = seq vec ()
instance (NFData a) => NFData (V.Vector a) where
rnf = rnf . V.toList
I also changed your NFData (U.Vector a) instance. Since a U.Vector is unboxed, evaluation to WHNF is sufficient, and forcing each element via the list conversion is wasteful. In fact the default definition for rnf works fine if you like.
With these two changes, I get the following
SPARKS: 200 (200 converted, 0 pruned)
and the runtime has been reduced by nearly 50%. I also adjusted the vector chunk size to 20, but the result is very similar to a chunk size of 2.

Resources