How to evaluate tuples in parallel using rpar Strategy in Haskell? - haskell

I stumbled upon a problem with Eval monad and rpar Strategy in Haskell. Consider following code:
module Main where
import Control.Parallel.Strategies
main :: IO ()
main = print . sum . inParallel2 $ [1..10000]
inParallel :: [Double] -> [Double]
inParallel xss = runEval . go $ xss
where
go [] = return []
go (x:xs) = do
x' <- rpar $ x + 1
xs' <- go xs
return (x':xs')
inParallel2 :: [Double] -> [Double]
inParallel2 xss = runEval . go $ xss
where
go [] = return []
go [x] = return $ [x + 1]
go (x:y:xs) = do
(x',y') <- rpar $ (x + 1, y + 1)
xs' <- go xs
return (x':y':xs'
I compile and run it like this:
ghc -O2 -Wall -threaded -rtsopts -fforce-recomp -eventlog eval.hs
./eval +RTS -N3 -ls -s
When I use inParallel function parallelism works as expected. In the output runtime statistics I see:
SPARKS: 100000 (100000 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
When I switch to inParallel2 function all parallelism is gone:
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
Why doesn't evaluation of tuples work in parallel? I tried forcing the tuple before passing it to rpar:
rpar $!! (x + 1, y + 1)
but still no result. What am I doing wrong?

The rpar strategy annotates a term for possible evaluation in parallel, but only up to weak head normal form, which essentially means, up to the outermost constructor. So for an integer or double, that means full evaluation, but for a pair, only the pair constructor, not its components, will get evaluated.
Forcing the pair before passing it to rpar is not going to help. Now you're evaluating the pair locally, before annotating the already evaluated tuple for possible parallel evaluation.
You probably want to combine the rpar with the rdeepseq strategy, thereby stating that the term should be completely evaluated, if possible in parallel. You can do this by saying
(rpar `dot` rdeepseq) (x + 1, y + 1)
The dot operator is for composing strategies.
There is, however, yet another problem with your code: pattern matching forces immediate evaluation, and therefore using pattern matching for rpar-annotated expressions is usually a bad idea. In particular, the line
(x',y') <- (rpar `dot` rdeepseq) (x + 1, y + 1)
will defeat all parallelism, because before the spark can be picked up for evaluation by another thread, the local thread will already start evaluating it in order to match the pattern. You can prevent this by using a lazy / irrefutable pattern:
~(x',y') <- (rpar `dot` rdeepseq) (x + 1, y + 1)
Or alternatively use fst and snd to access the components of the pair.
Finally, don't expect actual speedup if you create sparks that are as cheap as adding one to an integer. While sparks themselves are relatively cheap, they are not cost-free, so they work better if the computation you are annotating for parallel evaluation is somewhat costly.
You might want to read some tutorials on using strategies, such as Simon Marlow's
Parallel and Concurrent Programming using Haskell or my own Deterministic Parallel Programming in Haskell.

Related

Unable to force strictness in tail recursion

I am dealing with memory leaks in my Haskell program and I was able to isolate it to very basic laziness problem in dealing with arrays. I understand what's happening there. First element of the array is computed while the rest produce the delayed computations which consumes the heap. Unfortunately, I was unable to force strictness for the entire array computation.
I tried various combinations of seq, BangPatterns, ($!) without much success.
import Control.Monad
force x = x `seq` x
loop :: [Int] -> IO ()
loop x = do
when (head x `mod` 10000000 == 0) $ print x
let x' = force $ map (+1) x
loop x'
main = loop $ replicate 200 1
The profile with standard profiling options didn't give me any more information than I already know:
ghc -prof -fprof-auto-calls -rtsopts test.hs
./test +RTS -M300M -p -hc
This runs out of memory in the matter of a few seconds.
force x = x `seq` x
That's useless. seq doesn't mean "evaluate this thing now"; it means "evaluate the left thing before returning the result of evaluating the right thing". When they're the same, it does nothing, and your force is equivalent to just id. Try this instead:
import Control.DeepSeq
import Control.Monad
loop :: [Int] -> IO ()
loop x = do
when (head x `mod` 10000000 == 0) $ print x
let x' = map (+1) x
loop $!! x'
main = loop $ replicate 200 1
That evaluates x' and everything in it before loop x', which is useful.
Alternatively, Control.DeepSeq has a force function that is useful. Its semantics in this case are "evaluate all of the elements of your list before returning the result of evaluating any of it". If you used its force function in place of your own, your original code would otherwise work, since the first line of loop does evaluate the beginning of the list.

Understanding Eval Monad's `rpar`

Looking at the following example from Parallel and Concurrent Programming in Haskell:
main = do
[n] <- getArgs
let test = [test1,test2,test3,test4] !! (read n - 1)
t0 <- getCurrentTime
r <- evaluate (runEval test)
printTimeSince t0
print r
printTimeSince t0
test1 = do
x <- rpar (fib 36)
y <- rpar (fib 35)
return (x,y)
The book shows its compilation:
$ ghc -O2 rpar.hs -threaded
And then running the above test:
$ ./rpar 1 +RTS -N2
time: 0.00s
(24157817,14930352)
time: 0.83s
If I understand correctly, the Eval Monad (using rpar) results in both fib 36 and fib 35 being computed in parallel.
Does the actual work, i.e. computing the function fib ... occur when calling (runEval test)? Or perhaps evaluate ... is required? Or, finally, perhaps it gets computed when calling print r to evaluate it entirely?
It's not clear to me when the actual work gets performed for rpar.
Here's my guess, but I can't seem to replicate this on my laptop, too many imports I'd have to get from cabal.
test1 = do
x <- rpar (fib 36)
y <- rpar (fib 35)
return (x,y)
In this, you spark the evaluation of (fib 36) and (fib 35) in parallel, but you don't wait for them - you just return (x,y) immediately, while x and y are still evaluating. Then, we you get to print r, you are forced to wait until x and y finish evaluating.
In theory, the following code should force test1 to wait until x and y have finished evaluating before returning them.
test1 = do
x <- rpar (fib 36)
y <- rpar (fib 35)
rseq x
rseq y
return (x,y)
Then, running this should give you approximately
$ ./rpar 1 +RTS -N2
time: 0.83s
(24157817,14930352)
time: 0.83s
hopefully...
EDIT
Finally got back to my machine, replicated the condition, and my suggested code gives the expected result. However, the OP raises another good question: if evaluate only evaluates to the WHNF, why does it even end up doing work before print is called?
The answer is in the monad definition of Control.Parallel.Strategies - in other words, it isn't evaluate that pushes the evaluation of x and y, but runEval. The Eval monad is strict in the first argument: in x >>= f it will evaluate x (please check out this question before continuing). Then, de-sugaring test1 gives:
test1 = (rpar (fib 36)) >>= (\x ->
(rpar (fib 35)) >>= (\y ->
(rseq x) >>= (\_ ->
(rseq y) >>= (\_ ->
return (x,y)))))
Then, since rpar only "sparks" evaluation, it uses par (which begins the evaluation of the first argument but immediately returns the second) and immediately returns Done, however, rseq (like seq, but strict only in the first argument) does not return Done until its argument is actually evaluated (to WHNF). Hence, without the rseq calls, you have know that x and y have begun to be evaluated but no assurance that they have finished, but with those calls, you know both x and y are also evaluated before return is called on them.

Project euler 10 - [haskell] Why so inefficient?

Alright, so i've picked up project euler where i left off when using java, and i'm at problem 10. I use Haskell now and i figured it'd be good to learn some haskell since i'm still very much a beginner.
http://projecteuler.net/problem=10
My friend who still codes in java came up with a very straight forward way to implement the sieve of eratosthenes:
http://puu.sh/5zQoU.png
I tried implementing a better looking (and what i thought was gonna be a slightly more efficient) Haskell function to find all primes up to 2,000,000.
I came to this very elegant, yet apparently enormously inefficient function:
primeSieveV2 :: [Integer] -> [Integer]
primeSieveV2 [] = []
primeSieveV2 (x:xs) = x:primeSieveV2( (filter (\n -> ( mod n x ) /= 0) xs) )
Now i'm not sure why my function is so much slower than his (he claim his works in 5ms), if anything mine should be faster, since i only check composites once (they are removed from the list when they are found) whereas his checks them as many times as they can be formed.
Any help?
You don't actually have a sieve here. In Haskell you could write a sieve as
import Data.Vector.Unboxed hiding (forM_)
import Data.Vector.Unboxed.Mutable
import Control.Monad.ST (runST)
import Control.Monad (forM_, when)
import Prelude hiding (read)
sieve :: Int -> Vector Bool
sieve n = runST $ do
vec <- new (n + 1) -- Create the mutable vector
set vec True -- Set all the elements to True
forM_ [2..n] $ \ i -> do -- Loop for i from 2 to n
val <- read vec i -- read the value at i
when val $ -- if the value is true, set all it's multiples to false
forM_ [2*i, 3*i .. n] $ \j -> write vec j False
freeze vec -- return the immutable vector
main = print . ifoldl' summer 0 $ sieve 2000000
where summer s i b = if b then i + s else s
This "cheats" by using a mutable unboxed vector, but it's pretty darn fast
$ ghc -O2 primes.hs
$ time ./primes
142913828923
real: 0.238 s
This is about 5x faster than my benchmarking of augustss's solution.
To actually implement the sieve efficiently in Haskell you probably need to do it the Java way (i.e., allocate a mutable array an modify it).
For just generating primes I like this:
primes = 2 : filter (isPrime primes) [3,5 ..]
where isPrime (p:ps) x = p*p > x || x `rem` p /= 0 && isPrime ps x
And then you can print the sum of all primes primes < 2,000,000
main = print $ sum $ takeWhile (< 2000000) primes
You can speed it up by adding a type signature primes :: [Int].
But it works well with Integer as well and that also gives you the correct sum (which 32 bit Int will not).
See The Genuine Sieve of Eratosthenes for more information.
The time complexity of your code is n2 (in n primes produced). It is impractical to run for producing more than first 10...20 thousand primes.
The main problem with that code is not that it uses rem but that it starts its filters prematurely, so creates too many of them. Here's how you fix it, with a small tweak:
{-# LANGUAGE PatternGuards #-}
primes = 2 : sieve primes [3..]
sieve (p:ps) xs | (h,t) <- span (< p*p) xs = h ++ sieve ps [x | x <- t, rem x p /= 0]
-- sieve ps (filter (\x->rem x p/=0) t)
main = print $ sum $ takeWhile (< 100000) primes
This improves the time complexity by about n1/2 (in n primes produced) and gives it a drastic speedup: it gets to 100,000 75x faster. Your 28 seconds should become ~ 0.4 sec. But, you probably tested it in GHCi as interpreted code, not compiled. Marking it1) as :: [Int] and compiling with -O2 flag gives it another ~ 40x speedup, so it'll be ~ 0.01 sec. To reach 2,000,000 with this code takes ~ 90x longer, for a whopping ~ 1 sec of projected run time.
1) be sure to use sum $ map (fromIntegral :: Int -> Integer) $ takeWhile ... in main.
see also: http://en.wikipedia.org/wiki/Analysis_of_algorithms#Empirical_orders_of_growth

How to avoid stack space overflows?

I've been a bit surprised by GHC throwing stack overflows if I'd need to get value of large list containing memory intensive elements.
I did expected GHC has TCO so I'll never meet such situations.
To most simplify the case look at the following straightforward implementations of functions returning Fibonacci numbers (taken from HaskellWiki). The goal is to display millionth number.
import Data.List
# elegant recursive definition
fibs = 0 : 1 : zipWith (+) fibs (tail fibs)
# a bit tricky using unfoldr from Data.List
fibs' = unfoldr (\(a,b) -> Just (a,(b,a+b))) (0,1)
# version using iterate
fibs'' = map fst $ iterate (\(a,b) -> (b,a+b)) (0,1)
# calculate number by definition
fib_at 0 = 0
fib_at 1 = 1
fib_at n = fib_at (n-1) + fib_at (n-2)
main = do
{-- All following expressions abort with
Stack space overflow: current size 8388608 bytes.
Use `+RTS -Ksize -RTS' to increase it.
--}
print $ fibs !! (10^6)
print . last $ take (10^6) fibs
print $ fibs' !! (10^6)
print $ fibs'' !! (10^6)
-- following expression does not finish after several
-- minutes
print $ fib_at (10^6)
The source is compiled with ghc -O2.
What am I doing wrong ? I'd like to avoid recompiling with increased stack size or other specific compiler options.
These links here will give you a good introduction to your problem of too many thunks (space leaks).
If you know what to look out for (and have a decent model of lazy evaluation), then solving them is quite easy, for example:
{-# LANGUAGE BangPatterns #-}
import Data.List
fibs' = unfoldr (\(!a,!b) -> Just (a,(b,a+b))) (0,1)
main = do
print $ fibs' !! (10^6) -- no more stack overflow
All of the definitions (except the useless fib_at) will delay all the + operations, which means that when you have selected the millionth element it is a thunk with a million delayed additions. You should try something stricter.
As other have pointed out, Haskell being lazy you have to force evaluation of the thunks to avoid stack overflow.
It appears to me that this version of fibs' should work up to 10^6:
fibs' = unfoldr (\(a,b) -> Just (seq a (a, (b, a + b) ))) (0,1)
I recommend to study this wiki page on Folds and have a look at the seq function.

How to exploit any parallelism in my haskell parallel code?

I've just stated working in haskell semi-explicit parallelism with GHC 6.12. I've write the following haskell code to compute in parallel the map of the fibonnaci function upon 4 elements on a list, and in the same time the map of the function sumEuler upon two elements.
import Control.Parallel
import Control.Parallel.Strategies
fib :: Int -> Int
fib 0 = 0
fib 1 = 1
fib n = fib (n-1) + fib (n-2)
mkList :: Int -> [Int]
mkList n = [1..n-1]
relprime :: Int -> Int -> Bool
relprime x y = gcd x y == 1
euler :: Int -> Int
euler n = length (filter (relprime n) (mkList n))
sumEuler :: Int -> Int
sumEuler = sum . (map euler) . mkList
-- parallel initiation of list walk
mapFib :: [Int]
mapFib = map fib [37, 38, 39, 40]
mapEuler :: [Int]
mapEuler = map sumEuler [7600, 7600]
parMapFibEuler :: Int
parMapFibEuler = (forceList mapFib) `par` (forceList mapEuler `pseq` (sum mapFib + sum mapEuler))
-- how to evaluate in whnf form by forcing
forceList :: [a] -> ()
forceList [] = ()
forceList (x:xs) = x `pseq` (forceList xs)
main = do putStrLn (" sum : " ++ show parMapFibEuler)
to improve my program in parallel I rewrote it with par and pseq and a forcing function to force whnf evaluation. My problem is that by looking in the threadscope it appear that i didn't gain any parallelism. Things are worse because i didn't gain any speedup.
That why I have theses two questions
Question 1 How could I modify my code to exploit any parallelism ?
Question 2 How could I write my program in order to use Strategies (parMap, parList, rdeepseq and so on ...) ?
First improvement with Strategies
according to his contribution
parMapFibEuler = (mapFib, mapEuler) `using` s `seq` (sum mapFib + sum mapEuler) where
s = parTuple2 (seqList rseq) (seqList rseq)
the parallelism appears in the threadscope but not enough to have a significant speedup
The reason you aren't seeing any parallelism here is because your spark has been garbage collected. Run the program with +RTS -s and note this line:
SPARKS: 1 (0 converted, 1 pruned)
the spark has been "pruned", which means removed by the garbage collector. In GHC 7 we made a change to the semantics of sparks, such that a spark is now garbage collected (GC'd) if it is not referred to by the rest of the program; the details are in the "Seq no more" paper.
Why is the spark GC'd in your case? Look at the code:
parMapFibEuler :: Int
parMapFibEuler = (forceList mapFib) `par` (forceList mapEuler `pseq` (sum mapFib + sum mapEuler))
the spark here is the expression forkList mapFib. Note that the value of this expression is not required by the rest of the program; it only appears as an argument to par. GHC knows that it isn't required, so it gets garbage collected.
The whole point of the recent changes to the parallel package were to let you easily avoid this bear trap. A good Rule of Thumb is to use Control.Parallel.Strategies rather than par and pseq directly. My preferred way to write this would be
parMapFibEuler :: Int
parMapFibEuler = runEval $ do
a <- rpar $ sum mapFib
b <- rseq $ sum mapEuler
return (a+b)
but sadly this doesn't work with GHC 7.0.2, because the spark sum mapFib is floated out as a static expression (a CAF), and the runtime doesn't think sparks that point to static expressions are worth keeping (I'll fix this). This wouldn't happen in a real program, of course! So let's make the program a bit more realistic and defeat the CAF optimisation:
parMapFibEuler :: Int -> Int
parMapFibEuler n = runEval $ do
a <- rpar $ sum (take n mapFib)
b <- rseq $ sum (take n mapEuler)
return (a+b)
main = do [n] <- fmap (fmap read) getArgs
putStrLn (" sum : " ++ show (parMapFibEuler n))
Now I get good parallelism with GHC 7.0.2. However, note that #John's comments also apply: generally you want to look for more fine-grained parallelism so as to let GHC use all your processors.
Your parallelism is far too course-grained to have much beneficial effect. The largest chunks of work that can be done in parallel efficiently are in sumEuler, so that's where you should add your par annotations. Try changing sumEuler to:
sumEuler :: Int -> Int
sumEuler = sum . (parMap rseq euler) . mkList
parMap is from Control.Parallel.Strategies; it expresses a map that can be done in parallel. The first argument, rseq having type Strategy a, is used to force the computation to a specific point, otherwise no work would be done, due to laziness. rseq is fine for most numeric types.
It's not useful to add parallelism to fib here, below about fib 40 there isn't enough work to make it worthwhile.
In addition to threadscope, it's useful to run your program with the -s flag. Look for a line like:
SPARKS: 15202 (15195 converted, 0 pruned)
in the output. Each spark is an entry in a work queue to possibly be performed in parallel. Converted sparks are actually done in parallel, while pruned sparks mean that the main thread got to them before a worker thread had the chance to do so. If the pruned number is high, it means your parallel expressions are too fine-grained. If the total number of sparks is low, you aren't trying to do enough in parallel.
Finally, I think parMapFibEuler is better written as:
parMapFibEuler :: Int
parMapFibEuler = sum (mapFib `using` parList rseq) + sum mapEuler
mapEuler is simply too short to have any parallelism usefully expressed here, especially as euler is already performed in parallel. I'm doubtful that it makes a substantial difference for mapFib either. If the lists mapFib and mapEuler were longer, parallelism here would be more useful. Instead of parList you may be able to use parBuffer, which tends to work well for list consumers.
Making these two changes cuts the runtime from 12s to 8s for me, with GHC 7.0.2.
Hmmm... Maybe?
((forceList mapFib) `par` (forceList mapEuler)) `pseq` (sum mapFib + sum mapEuler)
I.e. spawn mapFib in background and calculate mapEuler and only after it (mapEuler) do (+) of their sums.
Actually I guess you can do something like:
parMapFibEuler = a `par` b `pseq` (a+b) where
a = sum mapFib
b = sum mapEuler
About Q2:
As I know strategies - is the "strategies" to combine data-structures with those par and seq.
You can write your forceList = withStrategy (seqList rseq)
As well you can write your code like:
parMapFibEuler = (mapFib, mapEuler) `using` s `seq` (sum mapFib + sum mapEuler) where
s = parTuple2 (seqList rseq) (seqList rseq)
I.e. strategy applied to tuple of two lists will force their evaulation in parallel, but each list will be forced to be evaluated sequentially.
First off, I assume you know your fib definition is awful and you're just doing this to play with the parallel package.
You seem to be going for parallelism at the wrong level. Parallelizing mapFib and mapEuler won't give a good speed-up because there is more work to compute mapFib. What you should do is compute each of these very expensive elements in parallel, which is slightly finer grain but not overly so:
mapFib :: [Int]
mapFib = parMap rdeepseq fib [37, 38, 39, 40]
mapEuler :: [Int]
mapEuler = parMap rdeepseq sumEuler [7600, 7600, 7600,7600]
parMapFibEuler :: Int
parMapFibEuler = sum a + sum b
where
a = mapFib
b = mapEuler
Also, I originally fought using Control.Parallel.Strategies over Control.Parallel but have come to like it as it is more readable and avoids issues like yours where one would expect parallelism and have to squint at it to figure out why you aren't getting any.
Finally, you should always post how you compile and how you run code you're expecting to be parallelized. For example:
$ ghc --make -rtsopts -O2 -threaded so.hs -eventlog -fforce-recomp
[1 of 1] Compiling Main ( so.hs, so.o )
Linking so ...
$ ./so +RTS -ls -N2
sum : 299045675
Yields:

Resources