Haskell : parallel list computation

Haskell : parallel list computation - haskell

I am trying to follow the implementation of the countdown problem shown in this video (https://www.youtube.com/watch?v=rlwSBNI9bXE&) and I thought it would be a good problem to try and run in parallel?
data Op =
Add
| Sub
| Mul
| Div deriving (Eq)
data Expr =
Val Int
| App Op Expr Expr
--{ helper functions }
solutions' :: [Int] -> Int -> [Expr]
solutions' ns n =
[e | ns' <- choices ns, (e, m) <- results ns', m == n]
I tried following some other posts on how to do this and came up with something like this
instance NFData Op where
rnf Add = Add `deepseq` ()
rnf Sub = Sub `deepseq` ()
rnf Mul = Mul `deepseq` ()
rnf Div = Div `deepseq` ()
instance NFData Expr where
rnf (Val a) = a `deepseq` ()
rnf (App o l r) = (rnf o) `deepseq` (rnf l) `deepseq` (rnf r)
solutions' :: [Int] -> Int -> [Expr]
solutions' ns n =
([e | ns' <- choices ns, (e, m) <- results ns', m == n]
`using` parList rdeepseq)
It compiles but the program crashes when i try and run it. To be honest I was really just guessing on what I wrote.
How do I get this to run in parallel?
when I run in GHCI
>λ= r = (solutions' [1,3,7,10,25,50] 765)
(0.00 secs, 0 bytes)
>λ= mapM_ print r
*** Exception: stack overflow
>λ=
if i compile with
ghc ./Countdown.hs +RTS -N8 -s
and then run the executable, it does not terminate.

Ok, so I just clicked at a random timestamp in the video, and by sheer luck I got a slide that describes what's wrong.
For our example, only about 5 million of the 33 million possible expressions are valid.
So, this means that you are evaluating
_fiveMillionList `using` parList rdeepseq
Now, the way (`using` parList _strat) works is that it immediately forces the entire spine of the list. When you begin evaluating your expression, parList forces all the cells of the list to exist. Further, as #DavidFletcher notes, your parallelism is actually useless. Because the filtration is underneath the using, forcing the entire spine of the list also forces all 33 million Exprs to exist, because you need to know how many elements passed the (==) test, so you need to create the Exprs to test them. They don't need to all exist simultaneously, but, in the end, 5 million of them (not counting the Exprs recursively contained in them), plus 5 million (:) constructors, will be held in memory. To add insult to injury, you proceed to create 5 million more objects in the form of dud sparks. And, all of this is being orchestrated by 5 million calls to the Eval monad's (>>=) function. I'm not sure which one of these exactly is sitting resident in memory for long enough to cause a stack overflow, but I'm fairly sure that parList is the culprit.
Perhaps try a more reasonable Strategy. I think you are pretty much forced into using parBuffer, because you need laziness. Using parBuffer n strat, if you evaluate a (:)-cell, then the strategy ensures that the next n - 1 elements have been sparked. So, essentially, it "runs ahead" of any consumer that starts at the head of the list, maintaining a buffer of parallely-evaluated elements. Something like parBuffer 1000 rdeepseq should be fine.
Your NFData instances could use some work. They aren't the problem, but they don't really demonstrate a sound understanding of how evaluation works. I'll just leave them here:
instance NFData Op where
-- (seq/deepseq) x y is defined by
-- "if you want to evaluate (seq/deepseq) x y to WHNF, then you must
-- evaluate x to WHNF/NF, then evaluate y to WHNF."
-- but e.g. Add is already in WHNF and NF, so seq Add and deeqseq Add are no-ops
-- the actual evaluation is already finished by the case in rnf's equations
-- you could even write rnf x = x `seq` (), but I think it's best to be explicit
rnf Add = ()
rnf Sub = ()
rnf Mul = ()
rnf Div = ()
instance NFData Expr where
rnf (Val a) = a `deepseq` ()
-- rnf o, rnf l :: ()
-- WHNF and NF are the same thing for the type (); all constructors are nullary
-- therefore (deepseq (rnf x) y) = seq (rnf x) y
-- but then seq (rnf x) y = deepseq x y {by definition}
rnf (App o l r) = o `deepseq` l `deepseq` rnf r

Related

Explanation of the findIndices function

I said in this question that I didn't understand the source code of findIndices.
In fact I didn't pay enough attention and I didn't see that there are two definitions of this function:
findIndices :: (a -> Bool) -> [a] -> [Int]
#if defined(USE_REPORT_PRELUDE)
findIndices p xs = [ i | (x,i) <- zip xs [0..], p x]
#else
-- Efficient definition, adapted from Data.Sequence
{-# INLINE findIndices #-}
findIndices p ls = build $ \c n ->
let go x r k | p x = I# k `c` r (k +# 1#)
| otherwise = r (k +# 1#)
in foldr go (\_ -> n) ls 0#
#endif /* USE_REPORT_PRELUDE */
I understand the first definition, the one I didn't see. I don't understand the second one. I have a couple of questions:
what is if defined(USE_REPORT_PRELUDE) ?
can one explain the second definition ? What are build, I#, +#, 1# ?
why the second definition is inlined, not the first one ?

The CPP extensions enables the C preprocessor, as for the C programming language. Here, it is used to test if the flag USE_REPORT_PRELUDE was set during compilation. According to that flag, the compiler uses the #if or the #else variant of code.
build is a function which could be defined as
build f = f (:) []
So, using build (\c n -> ... essentially lets c to the "cons" (:), and n to the "nil" [].
This is not used for convenience: it is not convenient at all! However, the compiler optimizer works great with build and foldr combined, so the code is written here in a weird way to take advantage of that.
Further, I# ... is the low-level constructor for integers. When we normally write
x :: Int
x = 4+2
GHC implements x (very roughly) with a pointer to some memory that reads as unevaluated: 4+2. After x is forced the first time, this memory gets overwritten with evaluated: I# 6#. This is needed to implement laziness.
The "boxing" here refers to the indirection through a pointer.
Instead, the type Int# is a plain machine integer, with no pointers, no indirection, no unevaluated expressions. It is strict (instead of lazy), but being more low-level it is more efficient. One creates a value as in
x' :: Int#
x' = 6#
x :: Int
x = I# x'
Indeed, Int is defined as newtype Int = I# Int#.
Keep in mind that this is not standard Haskell, but GHC-specific low-level details. In normal code, you should not need to use such unboxed types. In libraries, the authors do that to achieve a little more performance, but that's it.
Sometimes, even if in our code we only use Ints, GHC is smart enough to automatically convert our code to using Int# and achieve more efficiency, avoiding the boxing. This can be observed if we ask GHC to "dump Core" so that we can see the result of the optimization.
For instance, compiling
f :: Int -> Int
f 0 = 0
f n = n + f (n-1)
GHC produces a lower level version (this is GHC Core, not Haskell, but it is similar enough to be understood):
Main.$wf :: GHC.Prim.Int# -> GHC.Prim.Int#
Main.$wf = \ (ww_s4un :: GHC.Prim.Int#) ->
case ww_s4un of ds_X210 {
__DEFAULT ->
case Main.$wf (GHC.Prim.-# ds_X210 1#) of ww1_s4ur { __DEFAULT ->
GHC.Prim.+# ds_X210 ww1_s4ur
};
0# -> 0#
}

Notice the number of arguments to go. go x r k = ... === go x r = \k -> .... This is the standard trick to arrange for left-to-right information flow while folding the list (go is used as the reducer function, in foldr go (\_ -> n) ls 0#). Here, it's the counting of [0..], explicated as the initial k=0 and the (k + 1) on each step (k is an unfortunate naming choice, i seems better; k is overloaded with the irrelevant "constant" and "continuation", not just "counter" which was probably the intended meaning here).
The foldr/build (sic) fusion (linked to by luqui in the comments) turns foldr c n $ findIndices p [a1,a2,...,an] into a loop, exposing the inner foldr of the findIndices definition, avoiding building the actual list structure of the result of the findIndices call:
build g = g (:) []
foldr c n $ build g = g c n
foldr c n $ findIndices p [a1,a2,...,an]
==
foldr c n $ build g where {g c n = ...}
=
g c n where {g c n = ...}
=
foldr go (const n) [a1,a2,...,an] 0 where {go x r k = ...}
=
go a1 (foldr go (const n) [a2,...,an]) 0
=
let { x=a1, r=foldr go (const n) [a2,...,an], k=0 }
in
if | p x -> c (I# k) (r (k +# 1#)) -- no 'cons' (`:`), only 'c'
| otherwise -> r (k +# 1#)
=
....
So you see, it's a standard trick to have foldr define a function which expects one more input argument, to arrange the left-to-right information flow while processing the input list.
All the stuff with the hash sign are "primitive" or "closer-to-machine-level" entities. I# is a primitive Int constructor; 0# is a machine-level 0; etc.. This may or may not be exactly correct, but it should be close enough.
foldr/build fusion seems a particular case of transducers-based code transformation, which is based on the fact that nested folds are fused by composing their reducers' transformers (aka transducers), as in
foldr c n $
foldr (tr2 c2) n2 $
foldr (tr3 c3) n3 xs
=
foldr (tr2 c) n $ -- fold "replaces" the constructor nodes with its reducer
foldr (tr3 c3) n3 xs -- so just use the outer reducer in the first place!
=
foldr (tr3 (tr2 c)) n xs
=
foldr ((tr3 . tr2) c) n xs
and build g === foldr . tr for some appropriate choice of tr for a given g, so that
build g = g c n = (foldr . tr) c n = foldr (tr c) n
As for USE_REPORT_PRELUDE, again, I can't say this with any authority, but I always assumed that it is the compilation flag which is enabled when the mock definitions from the Haskell Report are used as actual code, even though they were intended as an executable specification.

Haskell: to fix or not to fix

I recently learned about Data.Function.fix, and now I want to apply it everywhere. For example, whenever I see a recursive function I want to "fix" it. So basically my question is where and when should I use it.
To make it more specific:
1) Suppose I have the following code for factorization of n:
f n = f' n primes
where
f' n (p:ps) = ...
-- if p^2<=n: returns (p,k):f' (n `div` p^k) ps for k = maximum power of p in n
-- if n<=1: returns []
-- otherwise: returns [(n,1)]
If I rewrite it in terms of fix, will I gain something? Lose something? Is it possible, that by rewriting an explicit recursion into fix-version I will resolve or vice versa create a stack overflow?
2) When dealing with lists, there are several solutions: recursion/fix, foldr/foldl/foldl', and probably something else. Is there any general guide/advice on when to use each? For example, would you rewrite the above code using foldr over the infinite list of primes?
There are, probably, other important questions not covered here. Any additional comments related to the usage of fix are welcome as well.

One thing that can be gained by writing in an explicitly fixed form is that the recursion is left "open".
factOpen :: (Integer -> Integer) -> Integer -> Integer
factOpen recur 0 = 1
factOpen recur n = n * recur (pred n)
We can use fix to get regular fact back
fact :: Integer -> Integer
fact = fix factOpen
This works because fix effectively passes a function itself as its first argument. By leaving the recursion open, however, we can modify which function gets "passed back". The best example of using this property is to use something like memoFix from the memoize package.
factM :: Integer -> Integer
factM = memoFix factOpen
And now factM has built-in memoization.
Effectively, we have that open-style recursion requires us impute the recursive bit as a first-order thing. Recursive bindings are one way that Haskell allows for recursion at the language level, but we can build other, more specialized forms.

I'd like to mention another usage of fix; suppose you have a simple language consisting of addition, negative, and integer literals. Perhaps you have written a parser which takes a String and outputs a Tree:
data Tree = Leaf String | Node String [Tree]
parse :: String -> Tree
-- parse "-(1+2)" == Node "Neg" [Node "Add" [Node "Lit" [Leaf "1"], Node "Lit" [Leaf "2"]]]
Now you would like to evaluate your tree to a single integer:
fromTree (Node "Lit" [Leaf n]) = case reads n of {[(x,"")] -> Just x; _ -> Nothing}
fromTree (Node "Neg" [e]) = liftM negate (fromTree e)
fromTree (Node "Add" [e1,e2]) = liftM2 (+) (fromTree e1) (fromTree e2)
Suppose someone else decides to extend the language; they want to add multiplication. They will have to have access to the original source code. They could try the following:
fromTree' (Node "Mul" [e1, e2]) = ...
fromTree' e = fromTree e
But then Mul can only appear once, at the top level of the expression, since the call to fromTree will not be aware of the Node "Mul" case. Tree "Neg" [Tree "Mul" a b] will not work, since the original fromTree has no pattern for "Mul". However, if the same function is written using fix:
fromTreeExt :: (Tree -> Maybe Int) -> (Tree -> Maybe Int)
fromTreeExt self (Node "Neg" [e]) = liftM negate (self e)
fromTreeExt .... -- other cases
fromTree = fix fromTreeExt
Then extending the language is possible:
fromTreeExt' self (Node "Mul" [e1, e2]) = ...
fromTreeExt' self e = fromTreeExt self e
fromTree' = fix fromTreeExt'
Now, the extended fromTree' will evaluate the tree properly, since self in fromTreeExt' refers to the entire function, including the "Mul" case.
This approach is used here (the above example is a closely adapted version of the usage in the paper).

Beware the difference between _Y f = f (_Y f) (recursion, value--copying) and fix f = x where x = f x (corecursion, reference--sharing).
Haskell's let and where bindings are recursive: same name on the LHS and RHS refer to the same entity. The reference is shared.
In the definition of _Y there's no sharing (unless a compiler performs an aggressive optimization of common subexpressions elimination). This means it describes recursion, where repetition is achieved by application of a copy of an original, like in a classic metaphor of a recursive function creating its own copies. Corecursion, on the other hand, relies on sharing, on referring to same entity.
An example, primes calculated by
2 : _Y ((3:) . gaps 5 . _U . map (\p-> [p*p, p*p+2*p..]))
-- gaps 5 == ([5,7..] \\)
-- _U == sort . concat
either reusing its own output (with fix, let g = ((3:)...) ; ps = g ps in 2 : ps) or creating separate primes supply for itself (with _Y, let g () = ((3:)...) (g ()) in 2 : g ()).
See also:
double stream feed to prevent unneeded memoization?
How to implement an efficient infinite generator of prime numbers in Python?
Or, with the usual example of factorial function,
gen rec n = n<2 -> 1 ; n * rec (n-1) -- "if" notation
facrec = _Y gen
facrec 4 = gen (_Y gen) 4
= let {rec=_Y gen} in (\n-> ...) 4
= let {rec=_Y gen} in (4<2 -> 1 ; 4*rec 3)
= 4*_Y gen 3
= 4*gen (_Y gen) 3
= 4*let {rec2=_Y gen} in (3<2 -> 1 ; 3*rec2 2)
= 4*3*_Y gen 2 -- (_Y gen) recalculated
.....
fac = fix gen
fac 4 = (let f = gen f in f) 4
= (let f = (let {rec=f} in (\n-> ...)) in f) 4
= let {rec=f} in (4<2 -> 1 ; 4*rec 3) -- f binding is created
= 4*f 3
= 4*let {rec=f} in (3<2 -> 1 ; 3*rec 2)
= 4*3*f 2 -- f binding is reused
.....

1) fix is just a function, it improves your code when you use some recursion. It makes your code prettier.For example usage visit: Haskell Wikibook - Fix and recursion.
2) You know what does foldr? Seems like foldr isn't useful in factorization (or i didn't understand what are you mean in that).
Here is a prime factorization without fix:
fact xs = map (\x->takeWhile (\y->y/=[]) x) . map (\x->factIt x) $ xs
where factIt n = map (\x->getFact x n []) [2..n]
getFact i n xs
| n `mod` i == 0 = getFact i (div n i) xs++[i]
| otherwise = xs
and with fix(this exactly works like the previous):
fact xs = map (\x->takeWhile (\y->y/=[]) x) . map (\x->getfact x) $ xs
where getfact n = map (\x->defact x n) [2..n]
defact i n =
fix (\rec j k xs->if(mod k j == 0)then (rec j (div k j) xs++[j]) else xs ) i n []
This isn't pretty because in this case fix isn't a good choice(but there is always somebody who can write it better).

Recursive state monad for accumulating a value while building a list?

I'm totally new to Haskell so apologies if the question is silly.
What I want to do is recursively build a list while at the same time building up an accumulated value based on the recursive calls. This is for a problem I'm doing for a Coursera course, so I won't post the exact problem but something analogous.
Say for example I wanted to take a list of ints and double each one (ignoring for the purpose of the example that I could just use map), but I also wanted to count up how many times the number '5' appears in the list.
So to do the doubling I could do this:
foo [] = []
foo (x:xs) = x * 2 : foo xs
So far so easy. But how can I also maintain a count of how many times x is a five? The best solution I've got is to use an explicit accumulator like this, which I don't like as it reverses the list, so you need to do a reverse at the end:
foo total acc [] = (total, reverse acc)
foo total acc (x:xs) = foo (if x == 5 then total + 1 else total) (x*2 : acc) xs
But I feel like this should be able to be handled nicer by the State monad, which I haven't used before, but when I try to construct a function that will fit the pattern I've seen I get stuck because of the recursive call to foo. Is there a nicer way to do this?
EDIT: I need this to work for very long lists, so any recursive calls need to be tail-recursive too. (The example I have here manages to be tail-recursive thanks to Haskell's 'tail recursion modulo cons').

Using State monad it can be something like:
foo :: [Int] -> State Int [Int]
foo [] = return []
foo (x:xs) = do
i <- get
put $ if x==5 then (i+1) else i
r <- foo xs
return $ (x*2):r
main = do
let (lst,count) = runState (foo [1,2,5,6,5,5]) 0 in
putStr $ show count

This is a simple fold
foo :: [Integer] -> ([Integer], Int)
foo [] = ([], 0)
foo (x : xs) = let (rs, n) = foo xs
in (2 * x : rs, if x == 5 then n + 1 else n)
or expressed using foldr
foo' :: [Integer] -> ([Integer], Int)
foo' = foldr f ([], 0)
where
f x (rs, n) = (2 * x : rs, if x == 5 then n + 1 else n)
The accumulated value is a pair of both the operations.
Notes:
Have a look at Beautiful folding. It shows a nice way how to make such computations composable.
You can use State for the same thing as well, by viewing each element as a stateful computation. This is a bit overkill, but certainly possible. In fact, any fold can be expressed as a sequence of State computations:
import Control.Monad
import Control.Monad.State
-- I used a slightly non-standard signature for a left fold
-- for simplicity.
foldl' :: (b -> a -> a) -> a -> [b] -> a
foldl' f z xs = execState (mapM_ (modify . f) xs) z
Function mapM_ first maps each element of xs to a stateful computation by modify . f :: b -> State a (). Then it combines a list of such computations into one of type State a () (it discards the results of the monadic computations, just keeps the effects). Finally we run this stateful computation on z.

STArray and stack overflow

I am struggling to understand why the following attempts to find a minimum element in STArray lead to stack space overflow when compiled with ghc (7.4.1, regardless of -O level), but work fine in ghci:
import Control.Monad
import Control.Monad.ST
import Control.Applicative
import Data.Array.ST
n = 1000 :: Int
minElem = runST $ do
arr <- newArray ((1,1),(n,n)) 0 :: ST s (STArray s (Int,Int) Int)
let ixs = [(i,j) | i <- [1..n], j <- [1..n]]
forM_ ixs $ \(i,j) -> writeArray arr (i,j) (i*j `mod` 7927)
-- readArray arr (34,56) -- this works OK
-- findMin1 arr -- stackoverflows when compiled
findMin2 arr -- stackoverflows when compiled
findMin1 arr = do
es <- getElems arr
return $ minimum es
findMin2 arr = do
e11 <- readArray arr (1,1)
foldM (\m ij -> min m <$> readArray arr ij) e11 ixs
where ixs = [(i,j) | i <- [1..n], j <- [1..n]]
main = print minElem
Note: switching to STUArray or ST.Lazy doesn't seem to have any effect.
The main question though: What would be the proper way to implement such "fold-like" operation over big STArray while inside ST?

That's probably a result of getElems being a bad idea. In this case an array is a bad idea altogether:
minimum (zipWith (\x y -> (x, y, mod (x*y) 7927)) [1..1000] [1..1000])
This one gives you the answer right away: (1, 1, 1).
If you want to use an array anyway I recommend converting the array to an Array or UArray first and then using elems or assocs on that one. This has no additional cost, if you do it using runSTArray or runSTUArray.

The big problem in findMin1 is getElems:
getElems :: (MArray a e m, Ix i) => a i e -> m [e]
getElems marr = do
(_l, _u) <- getBounds marr -- Hmm, why is that there?
n <- getNumElements marr
sequence [unsafeRead marr i | i <- [0 .. n - 1]]
Using sequence on a long list is a common cause for stack overflows in monads whose (>>=) isn't lazy enough, since
sequence ms = foldr k (return []) ms
where
k m m' = do { x <- m; xs <- m'; return (x:xs) }
then it has to build a thunk of size proportional to the length of the list. getElems would work with the Control.Monad.ST.Lazy, but then the filling of the array with
forM_ ixs $ \(i,j) -> writeArray arr (i,j) (i*j `mod` 7927)
creates a huge thunk that overflows the stack. For the strict ST variant, you need to replace getElems with something that builds the list without sequence or - much better - compute the minimum without creating a list of elements at all. For the lazy ST variant, you need to ensure that the filling of the array doesn't build a huge thunk e.g. by forcing the result of the writeArray calls
let fill i j
| i > n = return ()
| j > n = fill (i+1) 1
| otherwise = do
() <- writeArray arr (i,j) $ (i*j `mod` 7927)
fill i (j+1)
() <- fill 1 1
The problem in findMin2 is that
foldM (\m ij -> min m <$> readArray arr ij) e11 ixs
is lazy in m, so it builds a huge thunk to compute the minimum. You can easily fix that by using seq (or a bang-pattern) to make it strict in m.
The main question though: What would be the proper way to implement such "fold-like" operation over big STArray while inside ST?
Usually, you'll use the strict ST variant (and for types like Int, you should almost certainly use STUArrays instead of STArrays). Then the most important rule is that your functions be strict enough. The structure of findMin2 is okay, the implementation is just too lazy.
If performance matters, you will often have to avoid the generic higher order functions like foldM and write your own loops to avoid allocating lists and control strictness exactly as the problem at hand requires.

The problem is that minimum is a non-strict fold, so it is causing a thunk buildup. Use (foldl' min).
Now we add a bunch of verbiage to ignore because stackoverflow has turned worthless and no longer allows posting a straightforward answer.

No speedup with naive merge sort parallelization in Haskell

Note: This post was completely rewritten 2011-06-10; thanks to Peter for helping me out. Also, please don't be offended if I don't accept one answer, since this question seems to be rather open-ended. (But, if you solve it, you get the check mark, of course).
Another user had posted a question about parallelizing a merge sort. I thought I'd write a simple solution, but alas, it is not much faster than the sequential version.
Problem statement
Merge sort is a divide-and-conquer algorithm, where the leaves of computation can be parallelized.
The code works as follows: the list is converted into a tree, representing computation nodes. Then, the merging step returns a list for each node. Theoretically, we should see some significant performanc gains, since we're going from an O(n log n) algorithm to an O(n) algorithm with infinite processors.
The first steps of the computation are parallelized, when parameter l (level) is greater than zero below. This is done by [via variable strat] selecting the rpar strategy, which will make sub-computation mergeSort' x occur in parallel with mergeSort' y. Then, we merge the results, and force its evaluation with rdeepseq.
data Tree a = Leaf a | Node (Tree a) (Tree a) deriving (Show)
instance NFData a => NFData (Tree a) where
rnf (Leaf v) = deepseq v ()
rnf (Node x y) = deepseq (x, y) ()
listToTree [] = error "listToTree -- empty list"
listToTree [x] = Leaf x
listToTree xs = uncurry Node $ listToTree *** listToTree $
splitAt (length xs `div` 2) xs
-- mergeSort' :: Ord a => Tree a -> Eval [a]
mergeSort' l (Leaf v) = return [v]
mergeSort' l (Node x y) = do
xr <- strat $ runEval $ mergeSort' (l - 1) x
yr <- rseq $ runEval $ mergeSort' (l - 1) y
rdeepseq (merge xr yr)
where
merge [] y = y
merge x [] = x
merge (x:xs) (y:ys) | x < y = x : merge xs (y:ys)
| otherwise = y : merge (x:xs) ys
strat | l > 0 = rpar
| otherwise = rseq
mergeSort = runEval . mergeSort' 10
By only evaluating a few levels of the computation, we should have decent parallel communication complexity as well -- some constant factor order of n.
Results
Obtain the 4th version source code here [ http://pastebin.com/DxYneAaC ], and run it with the following to inspect thread usage, or subsequent command lines for benchmarking,
rm -f ParallelMergeSort; ghc -O2 -O3 -optc-O3 -optc-ffast-math -eventlog --make -rtsopts -threaded ParallelMergeSort.hs
./ParallelMergeSort +RTS -H512m -K512m -ls -N
threadscope ParallelMergeSort.eventlog
Results on a 24-core X5680 # 3.33GHz show little improvement
> ./ParallelMergeSort
initialization: 10.461204s sec.
sorting: 6.383197s sec.
> ./ParallelMergeSort +RTS -H512m -K512m -N
initialization: 27.94877s sec.
sorting: 5.228463s sec.
and on my own machine, a quad-core Phenom II,
> ./ParallelMergeSort
initialization: 18.943919s sec.
sorting: 10.465077s sec.
> ./ParallelMergeSort +RTS -H512m -K512m -ls -N
initialization: 22.92075s sec.
sorting: 7.431716s sec.
Inspecting the result in threadscope shows good utilization for small amounts of data. (though, sadly, no perceptible speedup). However, when I try to run it on larger lists, like the above, it uses about 2 cpus half the time. It seems like a lot of sparks are getting pruned. It's also sensitive to the memory parameters, where 256mb is the sweet spot, 128mb gives 9 seconds, 512 gives 8.4, and 1024 gives 12.3!
Solutions I'm looking for
Finally, if anyone knows some high-power tools to throw at this, I'd appreciate it. (Eden?). My primary interest in Haskell parallelism is to be able to write small supportive tools for research projects, which I can throw on a 24 or 80 core server in our lab's cluster. Since they're not the main point of our group's research, I don't want to spend much time on the parallelization efficiency. So, for me, simpler is better, even if I only end up getting 20% usage.
Further discussion
I notice that the second bar in threadscope is sometimes green (c.f. its homepage, where the second bar seems to always be garbage collection). What does this mean?
Is there any way to sidestep garbage collection? It seems to be taking a lot of time. For example, why can't a subcomputation be forked, return the result in shared memory, and then die?
Is there a better way (arrows, applicative) to express parallelism?

The answer is pretty easy: Because you have at no point introduced parallelism. Eval is just a monad to order computations, you have to ask for things to be executed in parallel manually. What you probably want is:
do xr <- rpar $ runEval $ mergeSort' x
yr <- rseq $ runEval $ mergeSort' y
rseq (merge xr yr)
This will make Haskell actually create a spark for the first computation, instead of trying to evaluate it on the spot.
Standard tips also kind-of apply:
The result should be evaluated deeply (e.g. using evalTraversable rseq). Otherwise you will only force the head of the tree, and the bulk of the data will just be returned unevaluated.
Just sparking everything will most likely eat up any gains. It would be a good idea to introduce a parameter that stops sparking at lower recursion levels.
Edit: The following actually doesn't apply anymore after the question edit
But the worst part last: Your algorithm as you state it is very flawed. Your top-level seq only forces the first cons-cell of the list, which allows GHC to use lazyness to great effect. It will never actually construct the result list, just plow through all of them in a search for the minimum element (that's not even strictly needed, but GHC only produces the cell after the minimum is known).
So don't be surprised when performance actually drops sharply when you start introducing parallelism under the assumptions that you need the whole list at some point in the program...
Edit 2: Some more answers to the edits
The biggest problem with your program is probably that it is using lists. If you want to make more than a toy example, consider at least using (unpacked) Arrays. If you want to go into serious number-crunching, maybe consider a specialised library like repa.
On "Further Discussion":
The colors stand for different GC states, I can't remember which. Try to look at the event log for the associated event.
The way to "sidestep" garbage collection is to not produce so much garbage in the first place, e.g. by using better data structures.
Well, if you are looking for an inspiration on robust parallelization it might be worthwhile to have a look at monad-par, which is relatively new but (I feel) less "surprising" in its parallel behaviour.
With monad-par, your example might become something like:
do xr <- spawn $ mergeSort' x
yr <- spawn $ mergeSort' y
merge <$> get xr <*> get yr
So here the get actually forces you to specify the join points - and the library does the required deepseq automatically behind the scenes.

I had similar luck to what you report in EDIT 3 on a dual core system with these variants. I used a smaller list length because I'm on a smaller computer, compiled with ghc -O2 -rtsopts -threaded MergePar.hs, and ran with ./MergePar +RTS -H256M -N. This might offer a more structured way to compare performance. Note that the RTS option -qa sometimes helps the simple par variants.
import Control.Applicative
import Control.Parallel
import Control.Parallel.Strategies
import Criterion.Main
import GHC.Conc (numCapabilities)
data Tree a = Leaf a | Node (Tree a) (Tree a) deriving Show
listToTree [] = error "listToTree -- empty list"
listToTree [x] = Leaf x
listToTree xs = Node (listToTree (take half xs)) (listToTree (drop half xs))
where half = length xs `div` 2
-- Merge two ordered lists
merge :: Ord a => [a] -> [a] -> [a]
merge [] y = y
merge x [] = x
merge (x:xs) (y:ys) | x < y = x : merge xs (y:ys)
| otherwise = y : merge (x:xs) ys
-- Simple merge sort
mergeSort' :: Ord a => Tree a -> [a]
mergeSort' (Leaf v) = [v]
mergeSort' (Node x y) = merge (mergeSort' x) (mergeSort' y)
mergeSort :: Ord a => [a] -> [a]
mergeSort = mergeSort' . listToTree
-- Merge sort with 'par' annotations on every recursive call
mergeSortP' :: Ord a => Tree a -> [a]
mergeSortP' (Leaf v) = [v]
mergeSortP' (Node x y) = let xr = mergeSortP' x
yr = mergeSortP' y
in xr `par` yr `pseq` merge xr yr
mergeSortP :: Ord a => [a] -> [a]
mergeSortP = mergeSortP' . listToTree
-- Merge sort with 'rpar' annotations on every recursive call
mergeSortR' :: Ord a => Tree a -> [a]
mergeSortR' (Leaf v) = [v]
mergeSortR' (Node x y) =
runEval $ merge <$> rpar (mergeSortR' x) <*> rpar (mergeSortR' y)
mergeSortR :: Ord a => [a] -> [a]
mergeSortR = mergeSortR' . listToTree
-- Parallel merge sort that stops looking for parallelism at a certain
-- depth
smartMerge' :: Ord a => Int -> Tree a -> [a]
smartMerge' _ (Leaf v) = [v]
smartMerge' n t#(Node x y)
| n <= 1 = mergeSort' t
| otherwise = let xr = smartMerge' (n-1) x
yr = smartMerge' (n-2) y
in xr `par` yr `pseq` merge xr yr
smartMerge :: Ord a => [a] -> [a]
smartMerge = smartMerge' numCapabilities . listToTree
main = defaultMain $ [ bench "original" $ nf mergeSort lst
, bench "par" $ nf mergeSortP lst
, bench "rpar" $ nf mergeSortR lst
, bench "smart" $ nf smartMerge lst ]
where lst = [100000,99999..0] :: [Int]

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string