Slowdown when using parallel strategies in Haskell

Slowdown when using parallel strategies in Haskell - haskell

I was working through the exercises of Andre Loh's deterministic parallel programming in haskell exercises. I was trying to convert the N-Queens sequential code into parallel by using strategies, but I noticed that the parallel code runs much slower than the sequential code and also errors out with insufficient stack space.
This is the code for the parallel N-Queens,
import Control.Monad
import System.Environment
import GHC.Conc
import Control.Parallel.Strategies
import Data.List
import Data.Function
type PartialSolution = [Int] -- per column, list the row the queen is in
type Solution = PartialSolution
type BoardSize = Int
chunk :: Int -> [a] -> [[a]]
chunk n [] = []
chunk n xs = case splitAt n xs of
(ys, zs) -> ys : chunk n zs
-- Generate all solutions for a given board size.
queens :: BoardSize -> [Solution]
--queens n = iterate (concatMap (addQueen n)) [[]] !! n
queens n = iterate (\l -> concat (map (addQueen n) l `using` parListChunk (n `div` numCapabilities) rdeepseq)) [[]] !! n
-- Given the size of the problem and a partial solution for the
-- first few columns, find all possible assignments for the next
-- column and extend the partial solution.
addQueen :: BoardSize -> PartialSolution -> [PartialSolution]
addQueen n s = [ x : s | x <- [1..n], safe x s 1 ]
-- Given a row number, a partial solution and an offset, check
-- that a queen placed at that row threatens no queen in the
-- partial solution.
safe :: Int -> PartialSolution -> Int -> Bool
safe x [] n = True
safe x (c:y) n = x /= c && x /= c + n && x /= c - n && safe x y (n + 1)
main = do
[n] <- getArgs
print $ length $ queens (read n)
The line (\l -> concat (map (addQueen n) l using parListChunk (n div numCapabilities) rdeepseq)) is what I changed from the original code. I have seen Simon Marlow's solution but I wanted to know the reason for the slowdown and error in my code.
Thanks in advance.

You are sparking way too much work. The parListChunk parameter of div n numCapabilities is probably, what, 7 on your system (2 cores and you're running with n ~ 14). The list is going to grow large very quickly so there is no point in sparking such small units of work (and I don't see why it makes sense tying it to the value of n).
If I add a factor of ten (making the sparking unit 70 in this case) then I get a clear performance win over single threading. Also, I don't have the stack issue you refer to - if it goes away with a change to your parListChunk value then I'd report that as a bug.
If I make the chunking every 800 then the times top off at 5.375s vs 7.9s. Over 800 and the performance starts to get worse again, ymmv.
EDIT:
[tommd#mavlo Test]$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.0.4
[tommd#mavlo Test]$ ghc -O2 so.hs -rtsopts -threaded -fforce-recomp ; time ./so 13 +RTS -N2
[1 of 1] Compiling Main ( so.hs, so.o )
Linking so ...
73712
real 0m5.404s
[tommd#mavlo Test]$ ghc -O2 so.hs -rtsopts -fforce-recomp ; time ./so 13
[1 of 1] Compiling Main ( so.hs, so.o )
Linking so ...
73712
real 0m8.134s

Related

Improving Haskell code performance (BangPatterns, LazyByteString)

I've used BangPatterns, Lazy ByteString. Don't know what else to do to improve performance of this code. Any ideas and suggestions? It's clearly not the fastest version as it exceeds time limit.
-- Find the sum of all the multiples of 3 or 5 below N
-- Input Format
-- First line contains T that denotes the number of test cases. This is followed by T lines, each containing an integer, N.
{-# LANGUAGE BangPatterns #-}
{-# OPTIONS_GHC -O2 -optc-O2 #-}
import qualified Data.ByteString.Lazy as L
import Control.Monad (mapM_)
readInt :: L.ByteString -> Int
readInt !s = L.foldl' (\x c -> 10 * x + fromIntegral c - 48) 0 s
main :: IO ()
main = do
-- don't need the number of inputs, since it is read lazily.
-- split input by lines
(_:ls) <- L.split 10 `fmap` L.getContents
-- length ls <= 10^5
mapM_ (print . f . readInt) ls
-- n <= 10^9
f :: Int -> Int
f n = go 0 0
where
go !i !a | i == n = a
go !i !a | i `mod` 3 == 0
|| i `mod` 5 == 0 = go (i+1) (a+i)
go !i !a = go (i+1) a

danidiaz has already discussed the input and output issue somewhat.
One fast way to produce multiples of 3 or 5 is to use a "wheel" of the sort commonly used for prime sieves.
multiples3or5 = go 0 $ cycle [3,2,1,3,1,2,3]
where
go n (x : xs) = n : go (n+x) xs
go n [] = error "impossible"
In fact, since the circular list never ends, it's cleaner to use a different type. And since you're using Int, it might as well be specialized and unpacked for performance. Note that the UNPACK pragma in this context is not needed for GHC version 7.8 or above.
data IntStream = {-# UNPACK #-} !Int :> IntStream
infixr 5 :>
wheel :: IntStream
wheel = 3 :> 2 :> 1 :> 3 :> 1 :> 2 :> 3 :> wheel
multiples3or5 = go 0 wheel
where
go !n (x :> xs) = n : go (n+x) xs
As fgv commented, this is in the nature of an anamorphism. You can see this by writing
multiples3or5 = unfoldr go (0, wheel) where
go (!n, (x :> xs)) = Just (n, (n+x, xs))
but note that unfoldr did not become efficient enough to be much use for anything until base 4.8, which has not officially been released.
When printing out the results, the system has to divide a lot of things by 10. I don't know if those routines are specially optimized, but I do know that GHC's native code generator does not currently optimize division by a known divisor unless that divisor is a power of 2. So you might find that you can improve performance by using -fllvm, and being careful to use a compatible version of LLVM.
Edit
See Chad Groft's answer for a better way.

Your use of print in the line
mapM_ (print . f . readInt) ls
may be introducing some overhead, because print depends on the Show instance for Int, meaning a conversion to inefficient Strings will take place.
Add the following imports
import qualified Data.ByteString.Builder as BB
import qualified Data.Foldable as F
import Data.List.Split (chunksOf) -- from the "split" package
import System.IO -- for stdout
and try to change that line with something like
let resultList = map (f . readInt) ls
F.mapM_ (BB.hPutBuilder stdout . F.foldMap BB.intDec) (chunksOf 1000 resultList)
that takes chunks of size 1000 from the list of Ints and uses the efficient Builder type and the specialized hPutBuilder function to write them to stdout.
(I added the chunking because otherwise I feared constructing the Builder would force the whole input list into memory. And we don't want that, because the list is being read lazily.)
I'm not sure if that's the main bottleneck, though.

If you're really concerned with efficiency, rethink the algorithm. Your main bottleneck is that you're manually summing a bunch of numbers between 1 and N, which will perform poorly on large N no matter what you do.
Instead, think mathematically. The sum of all multiples of 3 or 5 up to N is almost the sum of all multiples of 3 up to N (call this S_3), plus the sum of all multiples of 5 up to N (call this S_5). I say "almost" because some numbers get double-counted; call their sum T. Now the sum you want is exactly S_3 + S_5 – T, and each term has a nice closed formula (what is it?). Calculating these three numbers is much faster than what you're doing.

Here you the formula without those "think about" mentor answers
sumMultiplesOf::Integral n=>n->n->n
sumMultiplesOf k n = d * (1 + d) `div` 2 * k where d = (n - 1) `div` k
sumMultiplesOf3or5::Integral n=>n->n
sumMultiplesOf3or5 n = sumMultiplesOf 3 n + sumMultiplesOf 5 n - sumMultiplesOf 15 n

Haskell: how to operate the string type in a do block

I want to make a function that firstly divides a list l to two list m and n. Then create two thread to find out the longest palindrome in the two list. My code is :
import Control.Concurrent (forkIO)
import System.Environment (getArgs)
import Data.List
import Data.Ord
main = do
l <- getArgs
forkIO $ putStrLn $ show $ longestPalindr $ mList l
forkIO $ putStrLn $ show $ longestPalindr $ nList l
longestPalindr x =
snd $ last $ sort $
map (\l -> (length l, l)) $
map head $ group $ sort $
filter (\y -> y == reverse y) $
concatMap inits $ tails x
mList l = take (length l `div` 2) l
nList l = drop (length l `div` 2) l
Now I can compile it, but the result is a [ ]. When I just run the longestPalindr and mList , I get the right result. I thought the logic here is right. So what is the problem?

The question title may need to be changed, as this is no longer about type errors.
The functionality of the program can be fixed by simply mapping longestPalindr across the two halves of the list. In your code, you are finding the longest palindrome across [[Char]], so the result length is usually just 1.
I've given a simple example of par and pseq. This just suggests to the compiler that it may be smart to evaluate left and right independently. It doesn't guarantee parallel evaluation, but rather leaves it up to the compiler to decide.
Consult Parallel Haskell on the wiki to understand sparks, compile with the -threaded flag, then run it with +RTS -N2. Add -stderr for profiling, and see if there is any benefit to sparking here. I would expect negative returns until you start to feed it longer lists.
For further reading on functional parallelism, take a look at Control.Parallel.Strategies. Manually wrangling threads in Haskell is only really needed in nondeterministic scenarios.
import Control.Parallel (par, pseq)
import System.Environment (getArgs)
import Data.List
import Data.Ord
import Control.Function (on)
main = do
l <- getArgs
let left = map longestPalindr (mList l)
right = map longestPalindr (nList l)
left `par` right `pseq` print $ longest (left ++ right)
longestPalindr x = longest pals
where pals = nub $ filter (\y -> y == reverse y) substrings
substrings = concatMap inits $ tails x
longest = maximumBy (compare `on` length)
mList l = take (length l `div` 2) l
nList l = drop (length l `div` 2) l

For reference, please read the Parallelchapter from Simon Marlow's book.
http://chimera.labs.oreilly.com/books/1230000000929/ch02.html#sec_par-eval-whnf
As others have stated, using par from the Eval monad seems to be the correct approach here.
Here is a simplified view of your problem. You can test it out by compiling with +RTS -threaded -RTSand then you can use Thread Scope to profile your performance.
import Control.Parallel.Strategies
import Data.List (maximumBy, subsequences)
import Data.Ord
isPalindrome :: Eq a => [a] -> Bool
isPalindrome xs = xs == reverse xs
-- * note while subsequences is correct, it is asymptotically
-- inefficient due to nested foldr calls
getLongestPalindrome :: Ord a => [a] -> Int
getLongestPalindrome = length . maximum' . filter isPalindrome . subsequences
where maximum' :: Ord a => [[a]] -> [a]
maximum' = maximumBy $ comparing length
--- Do it in parallel, in a monad
-- rpar rpar seems to fit your case, according to Simon Marlow's book
-- http://chimera.labs.oreilly.com/books/1230000000929/ch02.html#sec_par-eval-whnf
main :: IO ()
main = do
let shorter = [2,3,4,5,4,3,2]
longer = [1,2,3,4,5,4,3,2,1]
result = runEval $ do
a <- rpar $ getLongestPalindrome shorter
b <- rpar $ getLongestPalindrome longer
if a > b -- 'a > b' will always be false in this case
then return (a,"shorter")
else return (b,"longer")
print result
-- This will print the length of the longest palindrome along w/ the list name
-- Don't forget to compile w/ -threaded and use ThreadScope to check
-- performance and evaluation

Project euler 10 - [haskell] Why so inefficient?

Alright, so i've picked up project euler where i left off when using java, and i'm at problem 10. I use Haskell now and i figured it'd be good to learn some haskell since i'm still very much a beginner.
http://projecteuler.net/problem=10
My friend who still codes in java came up with a very straight forward way to implement the sieve of eratosthenes:
http://puu.sh/5zQoU.png
I tried implementing a better looking (and what i thought was gonna be a slightly more efficient) Haskell function to find all primes up to 2,000,000.
I came to this very elegant, yet apparently enormously inefficient function:
primeSieveV2 :: [Integer] -> [Integer]
primeSieveV2 [] = []
primeSieveV2 (x:xs) = x:primeSieveV2( (filter (\n -> ( mod n x ) /= 0) xs) )
Now i'm not sure why my function is so much slower than his (he claim his works in 5ms), if anything mine should be faster, since i only check composites once (they are removed from the list when they are found) whereas his checks them as many times as they can be formed.
Any help?

You don't actually have a sieve here. In Haskell you could write a sieve as
import Data.Vector.Unboxed hiding (forM_)
import Data.Vector.Unboxed.Mutable
import Control.Monad.ST (runST)
import Control.Monad (forM_, when)
import Prelude hiding (read)
sieve :: Int -> Vector Bool
sieve n = runST $ do
vec <- new (n + 1) -- Create the mutable vector
set vec True -- Set all the elements to True
forM_ [2..n] $ \ i -> do -- Loop for i from 2 to n
val <- read vec i -- read the value at i
when val $ -- if the value is true, set all it's multiples to false
forM_ [2*i, 3*i .. n] $ \j -> write vec j False
freeze vec -- return the immutable vector
main = print . ifoldl' summer 0 $ sieve 2000000
where summer s i b = if b then i + s else s
This "cheats" by using a mutable unboxed vector, but it's pretty darn fast
$ ghc -O2 primes.hs
$ time ./primes
142913828923
real: 0.238 s
This is about 5x faster than my benchmarking of augustss's solution.

To actually implement the sieve efficiently in Haskell you probably need to do it the Java way (i.e., allocate a mutable array an modify it).
For just generating primes I like this:
primes = 2 : filter (isPrime primes) [3,5 ..]
where isPrime (p:ps) x = p*p > x || x `rem` p /= 0 && isPrime ps x
And then you can print the sum of all primes primes < 2,000,000
main = print $ sum $ takeWhile (< 2000000) primes
You can speed it up by adding a type signature primes :: [Int].
But it works well with Integer as well and that also gives you the correct sum (which 32 bit Int will not).
See The Genuine Sieve of Eratosthenes for more information.

The time complexity of your code is n2 (in n primes produced). It is impractical to run for producing more than first 10...20 thousand primes.
The main problem with that code is not that it uses rem but that it starts its filters prematurely, so creates too many of them. Here's how you fix it, with a small tweak:
{-# LANGUAGE PatternGuards #-}
primes = 2 : sieve primes [3..]
sieve (p:ps) xs | (h,t) <- span (< p*p) xs = h ++ sieve ps [x | x <- t, rem x p /= 0]
-- sieve ps (filter (\x->rem x p/=0) t)
main = print $ sum $ takeWhile (< 100000) primes
This improves the time complexity by about n1/2 (in n primes produced) and gives it a drastic speedup: it gets to 100,000 75x faster. Your 28 seconds should become ~ 0.4 sec. But, you probably tested it in GHCi as interpreted code, not compiled. Marking it1) as :: [Int] and compiling with -O2 flag gives it another ~ 40x speedup, so it'll be ~ 0.01 sec. To reach 2,000,000 with this code takes ~ 90x longer, for a whopping ~ 1 sec of projected run time.
1) be sure to use sum $ map (fromIntegral :: Int -> Integer) $ takeWhile ... in main.
see also: http://en.wikipedia.org/wiki/Analysis_of_algorithms#Empirical_orders_of_growth

Stack space overflow error while running recursive compiled code. Fine tune algorithm, or give more resources?

I am working on project Euler #14, and have a solution to get the answer, but am getting a stack space overflow error when I try to run the code. The algorithm works OK in the interactive GHCI (on low numbers), but wont work when I throw a really big number at it and try to compile it.
Here is a rough idea of what it does in the interactive GHCI. It takes about 10 seconds to calculate "answer 50000" on my computer.
After letting GHCI run the problem for a few minutes, it spits out the correct answer.
*Euler System.IO> answer 1000000
(525,837799)
But that doesn't solve the stack overflow error when compiling the program to run natively.
*Euler System.IO> answer 10
(20,9)
*Euler System.IO> answer 100
(119,97)
*Euler System.IO> answer 1000
(179,871)
*Euler System.IO> answer 10000
(262,6171)
*Euler System.IO> answer 50000
(324,35655)
What should I do to get the answer to for "answer 1000000"? I imagine my algorithm needs to be fine tuned a bit, but I have no idea how to go about doing that.
Code:
module Main
where
import System.IO
import Control.Monad
main = print (answer 1000000)
-- Count the length of the sequences
-- count' creates a tuple with the second value
-- being the starting number of the game
-- and the first value being the total
-- length of the chain
count' n = (cSeq n, n)
cSeq n = length $ game n
-- Find the maximum chain value of the game
answer n = maximum $ map count' [1..n]
-- Working game.
-- game 13 = [13,40,20,10,5,16,8,4,2,1]
game n = n : play n
play x
| x <= 0 = [] -- is negative or 0
| x == 1 = [] -- is 1
| even x = doEven x : play ((doEven x)) -- even
| otherwise = doOdd x : play ((doOdd x)) -- odd
where doOdd x = (3 * x) + 1
doEven x = (x `div` 2)

The problem here is that maximum is too lazy. Instead of keeping track of the largest element as it goes along, it builds up a huge tree of max thunks. This is because maximum is defined in terms of foldl, so the evaluation goes as follows:
maximum [1, 2, 3, 4, 5]
foldl max 1 [2, 3, 4, 5]
foldl max (max 1 2) [3, 4, 5]
foldl max (max (max 1 2) 3) [4, 5]
foldl max (max (max (max 1 2) 3) 4) [5]
foldl max (max (max (max (max 1 2) 3) 4) 5) []
max (max (max (max 1 2) 3) 4) 5 -- this expression will be huge for large lists
Trying to evaluate too many of these nested max calls causes a stack overflow.
The solution is to force it to evaluate these as it goes along by using the strict version foldl', (or, in this case, its cousin foldl1'). This prevents the max's from building up by reducing them at each step:
foldl1' max [1, 2, 3, 4, 5]
foldl' max 1 [2, 3, 4, 5]
foldl' max 2 [3, 4, 5]
foldl' max 3 [4, 5]
foldl' max 4 [5]
foldl' max 5 []
5
GHC can often solve these kinds of problems on its own if you compile with -O2 which (among other things) runs a strictness analysis of your program. However, I think it's good practice to write programs that don't need to rely on optimizations to work.
Note: After fixing this, the resulting program is still very slow. You might want to look into using memoization for this problem.

#hammar already pointed out the problem that maximum is too lazy, and how to resolve that (using foldl1', the strict version of foldl1).
But there are further inefficiencies in the code.
cSeq n = length $ game n
cSeq lets game construct a list, only to calculate its length. Unfortunately, length is not a "good consumer", so the construction of the intermediate list is not fused away. That's quite a bit of unnecessary allocation and costs time. Eliminating these lists
cSeq n = coll (1 :: Int) n
where
coll acc 1 = acc
coll acc m
| even m = coll (acc + 1) (m `div` 2)
| otherwise = coll (acc + 1) (3*m+1)
cuts down the allocation by something like 65% and the running time by about 20% (still slow). Next point, you're using div, which performs a sign check in addition to the normal division. Since all numbers involved are positive, using quot instead does speed it up a bit more (not much here, but it will become important later).
The next big point is that, since you haven't given type signatures, the type of the numbers (except where it was determined by the use of length or by the expression type signature (1 :: Int) in my rewrite) is Integer. The operations on Integer are considerably slower than the corresponding operations on Int, so if possible, you should use Int (or Word) rather than Integer when speed matters. If you have a 64-bit GHC, Int is sufficient for these computations, that reduces the running time by about half when using div, by about 70% when using quot, when using the native code generator, and when using the LLVM backend, the running time is reduced by about 70% when using div and by about 95% when using quot.
The difference between the native code generator and the LLVM backend is mostly due to some elementary low-level optimisations.
even and odd are defined
even, odd :: (Integral a) => a -> Bool
even n = n `rem` 2 == 0
odd = not . even
in GHC.Real. When the type is Int, LLVM knows to replace the division by 2 used to determine the modulus with a bitwise and (n .&. 1 == 0). The native code generator does not (yet) do many of these low-level optimisations. If you do that by hand, the code produced by the NCG and the LLVM backend performs nearly identically.
When using div, both, the NCG and LLVM, are not able to replace the division with a short shift-and-add sequence, so you get the relatively slow machine division instruction with the sign-test. With quot, both are able to do that for Int, so you get much faster code.
The knowledge that all occurring numbers are positive allows us to replace the division by 2 with a simple right shift, without any code to correct for negative arguments, that speeds up the code produced by the LLVM backend by another ~33%, oddly it doesn't make a difference for the NCG.
So from the original that took eight second plus/minus a bit (a little less with the NCG, a little more with the LLVM backend), we've gone to
module Main (main)
where
import Data.List
import Data.Bits
main = print (answer (1000000 :: Int))
-- Count the length of the sequences
-- count' creates a tuple with the second value
-- being the starting number of the game
-- and the first value being the total
-- length of the chain
count' n = (cSeq n, n)
cSeq n = go (1 :: Int) n
where
go !acc 1 = acc
go acc m
| even' m = go (acc+1) (m `shiftR` 1)
| otherwise = go (acc+1) (3*m+1)
even' :: Int -> Bool
even' m = m .&. 1 == 0
-- Find the maximum chain value of the game
answer n = foldl1' max $ map count' [1..n]
which takes 0.37 seconds with the NCG, and 0.27 seconds with the LLVM backend on my setup.
A minute improvement in running time, but a huge reduction of allocation can be obtained by replacing the foldl1' max with a manual recursion,
answer n = go 1 1 2
where
go ml mi i
| n < i = (ml,mi)
| l > ml = go l i (i+1)
| otherwise = go ml mi (i+1)
where
l = cSeq i
that makes it 0.35 resp. 0.25 seconds (and produces a tiny 52,936 bytes allocated in the heap).
Now if that is still too slow, you can worry about a good memoisation strategy. The best I know(1) is to use an unboxed array to store the chain lengths for the numbers not exceeding the limit,
{-# LANGUAGE BangPatterns #-}
module Main (main) where
import System.Environment (getArgs)
import Data.Array.ST
import Data.Array.Base
import Control.Monad.ST
import Data.Bits
main :: IO ()
main = do
args <- getArgs
let bd = case args of
a:_ -> read a
_ -> 100000
print $ mxColl bd
mxColl :: Int -> (Int,Int)
mxColl bd = runST $ do
arr <- newArray (0,bd) 0
unsafeWrite arr 1 1
goColl arr bd 1 1 2
goColl :: STUArray s Int Int -> Int -> Int -> Int -> Int -> ST s (Int,Int)
goColl arr bd ms ml i
| bd < i = return (ms,ml)
| otherwise = do
nln <- collatzLength arr bd i
if ml < nln
then goColl arr bd i nln (i+1)
else goColl arr bd ms ml (i+1)
collatzLength :: STUArray s Int Int -> Int -> Int -> ST s Int
collatzLength arr bd n = go 1 n
where
go !l 1 = return l
go l m
| bd < m = go (l+1) $ case m .&. 1 of
0 -> m `shiftR` 1
_ -> 3*m+1
| otherwise = do
l' <- unsafeRead arr m
case l' of
0 -> do
l'' <- go 1 $ case m .&. 1 of
0 -> m `shiftR` 1
_ -> 3*m+1
unsafeWrite arr m (l''+1)
return (l + l'')
_ -> return (l+l'-1)
which does the job for a limit of 1000000 in 0.04 seconds when compiled with the NCG, 0.05 with the LLVM backend (apparently, that is not as good at optimising STUArray code as the NCG is).
If you don't have a 64-bit GHC, you can't simply use Int, since that would overflow then for some inputs.
But the overwhelming part of the computation is still performed in Int range, so you should use that where possible and only move to Integer where required.
switch :: Int
switch = (maxBound - 1) `quot` 3
back :: Integer
back = 2 * fromIntegral (maxBound :: Int)
cSeq :: Int -> Int
cSeq n = goInt 1 n
where
goInt acc 1 = acc
goInt acc m
| m .&. 1 == 0 = goInt (acc+1) (m `shiftR` 1)
| m > switch = goInteger (acc+1) (3*toInteger m + 1)
| otherwise = goInt (acc+1) (3*m+1)
goInteger acc m
| fromInteger m .&. (1 :: Int) == 1 = goInteger (acc+1) (3*m+1)
| m > back = goInteger (acc+1) (m `quot` 2) -- yup, quot is faster than shift for Integer here
| otherwise = goInt (acc + 1) (fromInteger $ m `quot` 2)
makes it harder to optimise the loop(s), so it is slower than the single loop using Int, but still decent. Here (where the Integer loop is never run), it takes 0.42 seconds with the NCG and 0.37 with the LLVM backend (which is pretty much the same as using quot in the pure Int version).
Using a similar trick for the memoised version has similar consequences, it's considerably slower than the pure Int version, but still blazingly fast compared to unmemoised versions.
(1) For this special (type of) problem, where you need to memoise the results for a contiguous range of arguments. For other problems, a Map or some other data structure will be the better choice.

It seems that the maximum function is the culprit as already pointed out, but you shouldn't have to worry about it if you compile your program with the -O2 flag.
The program is still quite slow, this is because the problem is supposed to teach you about memoization. One good way of doing this is haskell is by using Data.Memocombinators:
import Data.MemoCombinators
import Control.Arrow
import Data.List
import Data.Ord
import System.Environment
play m = maximumBy (comparing snd) . map (second threeNPuzzle) $ zip [1..] [1..m]
where
threeNPuzzle = arrayRange (1,m) memoized
memoized n
| n == 1 = 1
| odd n = 1 + threeNPuzzle (3*n + 1)
| even n = 1 + threeNPuzzle (n `div` 2)
main = getArgs >>= print . play . read . head
The above program runs in under a second when compiled with -O2 on my machine.
Note that in this case it is not a good idea to memoize all values found by threeNPuzzle, the program above memoizes the ones up until the limit (1000000 in the problem).

No speedup with naive merge sort parallelization in Haskell

Note: This post was completely rewritten 2011-06-10; thanks to Peter for helping me out. Also, please don't be offended if I don't accept one answer, since this question seems to be rather open-ended. (But, if you solve it, you get the check mark, of course).
Another user had posted a question about parallelizing a merge sort. I thought I'd write a simple solution, but alas, it is not much faster than the sequential version.
Problem statement
Merge sort is a divide-and-conquer algorithm, where the leaves of computation can be parallelized.
The code works as follows: the list is converted into a tree, representing computation nodes. Then, the merging step returns a list for each node. Theoretically, we should see some significant performanc gains, since we're going from an O(n log n) algorithm to an O(n) algorithm with infinite processors.
The first steps of the computation are parallelized, when parameter l (level) is greater than zero below. This is done by [via variable strat] selecting the rpar strategy, which will make sub-computation mergeSort' x occur in parallel with mergeSort' y. Then, we merge the results, and force its evaluation with rdeepseq.
data Tree a = Leaf a | Node (Tree a) (Tree a) deriving (Show)
instance NFData a => NFData (Tree a) where
rnf (Leaf v) = deepseq v ()
rnf (Node x y) = deepseq (x, y) ()
listToTree [] = error "listToTree -- empty list"
listToTree [x] = Leaf x
listToTree xs = uncurry Node $ listToTree *** listToTree $
splitAt (length xs `div` 2) xs
-- mergeSort' :: Ord a => Tree a -> Eval [a]
mergeSort' l (Leaf v) = return [v]
mergeSort' l (Node x y) = do
xr <- strat $ runEval $ mergeSort' (l - 1) x
yr <- rseq $ runEval $ mergeSort' (l - 1) y
rdeepseq (merge xr yr)
where
merge [] y = y
merge x [] = x
merge (x:xs) (y:ys) | x < y = x : merge xs (y:ys)
| otherwise = y : merge (x:xs) ys
strat | l > 0 = rpar
| otherwise = rseq
mergeSort = runEval . mergeSort' 10
By only evaluating a few levels of the computation, we should have decent parallel communication complexity as well -- some constant factor order of n.
Results
Obtain the 4th version source code here [ http://pastebin.com/DxYneAaC ], and run it with the following to inspect thread usage, or subsequent command lines for benchmarking,
rm -f ParallelMergeSort; ghc -O2 -O3 -optc-O3 -optc-ffast-math -eventlog --make -rtsopts -threaded ParallelMergeSort.hs
./ParallelMergeSort +RTS -H512m -K512m -ls -N
threadscope ParallelMergeSort.eventlog
Results on a 24-core X5680 # 3.33GHz show little improvement
> ./ParallelMergeSort
initialization: 10.461204s sec.
sorting: 6.383197s sec.
> ./ParallelMergeSort +RTS -H512m -K512m -N
initialization: 27.94877s sec.
sorting: 5.228463s sec.
and on my own machine, a quad-core Phenom II,
> ./ParallelMergeSort
initialization: 18.943919s sec.
sorting: 10.465077s sec.
> ./ParallelMergeSort +RTS -H512m -K512m -ls -N
initialization: 22.92075s sec.
sorting: 7.431716s sec.
Inspecting the result in threadscope shows good utilization for small amounts of data. (though, sadly, no perceptible speedup). However, when I try to run it on larger lists, like the above, it uses about 2 cpus half the time. It seems like a lot of sparks are getting pruned. It's also sensitive to the memory parameters, where 256mb is the sweet spot, 128mb gives 9 seconds, 512 gives 8.4, and 1024 gives 12.3!
Solutions I'm looking for
Finally, if anyone knows some high-power tools to throw at this, I'd appreciate it. (Eden?). My primary interest in Haskell parallelism is to be able to write small supportive tools for research projects, which I can throw on a 24 or 80 core server in our lab's cluster. Since they're not the main point of our group's research, I don't want to spend much time on the parallelization efficiency. So, for me, simpler is better, even if I only end up getting 20% usage.
Further discussion
I notice that the second bar in threadscope is sometimes green (c.f. its homepage, where the second bar seems to always be garbage collection). What does this mean?
Is there any way to sidestep garbage collection? It seems to be taking a lot of time. For example, why can't a subcomputation be forked, return the result in shared memory, and then die?
Is there a better way (arrows, applicative) to express parallelism?

The answer is pretty easy: Because you have at no point introduced parallelism. Eval is just a monad to order computations, you have to ask for things to be executed in parallel manually. What you probably want is:
do xr <- rpar $ runEval $ mergeSort' x
yr <- rseq $ runEval $ mergeSort' y
rseq (merge xr yr)
This will make Haskell actually create a spark for the first computation, instead of trying to evaluate it on the spot.
Standard tips also kind-of apply:
The result should be evaluated deeply (e.g. using evalTraversable rseq). Otherwise you will only force the head of the tree, and the bulk of the data will just be returned unevaluated.
Just sparking everything will most likely eat up any gains. It would be a good idea to introduce a parameter that stops sparking at lower recursion levels.
Edit: The following actually doesn't apply anymore after the question edit
But the worst part last: Your algorithm as you state it is very flawed. Your top-level seq only forces the first cons-cell of the list, which allows GHC to use lazyness to great effect. It will never actually construct the result list, just plow through all of them in a search for the minimum element (that's not even strictly needed, but GHC only produces the cell after the minimum is known).
So don't be surprised when performance actually drops sharply when you start introducing parallelism under the assumptions that you need the whole list at some point in the program...
Edit 2: Some more answers to the edits
The biggest problem with your program is probably that it is using lists. If you want to make more than a toy example, consider at least using (unpacked) Arrays. If you want to go into serious number-crunching, maybe consider a specialised library like repa.
On "Further Discussion":
The colors stand for different GC states, I can't remember which. Try to look at the event log for the associated event.
The way to "sidestep" garbage collection is to not produce so much garbage in the first place, e.g. by using better data structures.
Well, if you are looking for an inspiration on robust parallelization it might be worthwhile to have a look at monad-par, which is relatively new but (I feel) less "surprising" in its parallel behaviour.
With monad-par, your example might become something like:
do xr <- spawn $ mergeSort' x
yr <- spawn $ mergeSort' y
merge <$> get xr <*> get yr
So here the get actually forces you to specify the join points - and the library does the required deepseq automatically behind the scenes.

I had similar luck to what you report in EDIT 3 on a dual core system with these variants. I used a smaller list length because I'm on a smaller computer, compiled with ghc -O2 -rtsopts -threaded MergePar.hs, and ran with ./MergePar +RTS -H256M -N. This might offer a more structured way to compare performance. Note that the RTS option -qa sometimes helps the simple par variants.
import Control.Applicative
import Control.Parallel
import Control.Parallel.Strategies
import Criterion.Main
import GHC.Conc (numCapabilities)
data Tree a = Leaf a | Node (Tree a) (Tree a) deriving Show
listToTree [] = error "listToTree -- empty list"
listToTree [x] = Leaf x
listToTree xs = Node (listToTree (take half xs)) (listToTree (drop half xs))
where half = length xs `div` 2
-- Merge two ordered lists
merge :: Ord a => [a] -> [a] -> [a]
merge [] y = y
merge x [] = x
merge (x:xs) (y:ys) | x < y = x : merge xs (y:ys)
| otherwise = y : merge (x:xs) ys
-- Simple merge sort
mergeSort' :: Ord a => Tree a -> [a]
mergeSort' (Leaf v) = [v]
mergeSort' (Node x y) = merge (mergeSort' x) (mergeSort' y)
mergeSort :: Ord a => [a] -> [a]
mergeSort = mergeSort' . listToTree
-- Merge sort with 'par' annotations on every recursive call
mergeSortP' :: Ord a => Tree a -> [a]
mergeSortP' (Leaf v) = [v]
mergeSortP' (Node x y) = let xr = mergeSortP' x
yr = mergeSortP' y
in xr `par` yr `pseq` merge xr yr
mergeSortP :: Ord a => [a] -> [a]
mergeSortP = mergeSortP' . listToTree
-- Merge sort with 'rpar' annotations on every recursive call
mergeSortR' :: Ord a => Tree a -> [a]
mergeSortR' (Leaf v) = [v]
mergeSortR' (Node x y) =
runEval $ merge <$> rpar (mergeSortR' x) <*> rpar (mergeSortR' y)
mergeSortR :: Ord a => [a] -> [a]
mergeSortR = mergeSortR' . listToTree
-- Parallel merge sort that stops looking for parallelism at a certain
-- depth
smartMerge' :: Ord a => Int -> Tree a -> [a]
smartMerge' _ (Leaf v) = [v]
smartMerge' n t#(Node x y)
| n <= 1 = mergeSort' t
| otherwise = let xr = smartMerge' (n-1) x
yr = smartMerge' (n-2) y
in xr `par` yr `pseq` merge xr yr
smartMerge :: Ord a => [a] -> [a]
smartMerge = smartMerge' numCapabilities . listToTree
main = defaultMain $ [ bench "original" $ nf mergeSort lst
, bench "par" $ nf mergeSortP lst
, bench "rpar" $ nf mergeSortR lst
, bench "smart" $ nf smartMerge lst ]
where lst = [100000,99999..0] :: [Int]

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string