I am wondering why the GC never frees memory for main = memo_main. If main = slow_main, the computation takes the same amount of time each time I enter 37. Due to this, the behavior I expected was to see decrease in memory allocation each time the top level memoFix returns. Yet, it is stuck at 370 MBs.
module Main where
import Data.Function.Memoize
memo_fib :: (Int -> Int) -> Int -> Int
memo_fib _ 0 = 1
memo_fib _ 1 = 1
memo_fib f n = f (n-1) + f (n-2)
memo_main = interact $ unlines . map (\x -> show $ memoFix memo_fib $ read x) . lines
slow_fib 0 = 1
slow_fib 1 = 1
slow_fib n = slow_fib (n-1) + slow_fib (n-2)
slow_main = interact $ unlines . map (\x -> show $ slow_fib $ read x) . lines
main :: IO ()
main = memo_main
UPD: Resolved. Memory was actually freed, but not returned to the OS. This is per design of GHC RTS.
Just a couple notes:
Are you compiling with optimization? Optimization might cause memoFix memo_fib to be lifted out of the lambda, since it doesn't depend on x, thus sharing the memo table across all calls.
Another thing to keep in mind is that the GHC runtime never shrinks the heap. As soon as it has allocated a certain amount of memory, the GC collects it but keeps it around for other parts of the program to use, rather than freeing it to the OS. So the GC may be collecting this after all, but you won't see it with whatever tool you're using to watch it.
Related
I'd like to search through a list, testing each element for property X and then return when an element with property X is found.
This list is very large and would benefit from parallelism, but the cost of the spark is rather high relative to the compute time. parListChunk would be great, but then it must search through the entire list.
Is there some way I can write something like parListChunk but with early abort?
This is the naive search code:
hasPropertyX :: Object -> Bool
anyObjectHasPropertyX :: [Object] -> Bool
anyObjectHasPropertyX [] = False
anyObjectHasPropertyX l
| hasPropertyX (head l) == True = True
| otherwise = anyObjectHasPropertyX (tail l)
and this is my first attempt at parallelism:
anyObjectHasPropertyXPar [] = False
anyObjectHasPropertyXPar [a] = hasPropertyX a
anyObjectHasPropertyXPar (a:b:rest) = runEval $ do c1 <- rpar (force (hasPropertyX a))
c2 <- rpar (force (hasPropertyX b))
rseq c1
rseq c2
if (c1 == True) || (c2 == True) then return True else return (anyObjectHasPropertyXPar rest)
This does run slightly faster than the naive code (even with -N1, oddly enough), but not by much (it helps a little by extending the number of parallel computations). I believe it's not benefitting much because it has to spark one thread for each element in the list.
Is there an approach similar to parListChunk that will only spark n threads and that allows for an early abort?
Edit: I'm having problems thinking about this because it seems that I would need to monitor the return value of all the threads. If I omit the rseq's and have something like
if (c1 == True) || (c2 == True) then ...
Is the runtime environment intelligent enough to monitor both threads and continue when either one of them returns?
I don't think you're going to have much luck using Control.Parallel.Strategies. A key feature of this module is that it expresses "deterministic parallelism" such that the result of the program is unaffected by the parallel evaluation. The problem you've described is fundamentally non-deterministic because threads are racing to find the first match.
Update: I see now that you're only returning True if the element is found, so the desired behavior is technically deterministic. So, perhaps there is a way to trick the Strategies module into working. Still, the implementation below seems to meet the requirements.
Here's an implementation of a parallel find parFind that runs in the IO monad using Control.Concurrent primitives and seems to do what you want. Two MVars are used: runningV keeps count of how many threads are still running to allow the last thread standing to detect search failure; and resultV is used to return Just the result or Nothing when search failure is detected by that last thread. Note that it is unlikely to perform better than a single-threaded implementation unless the test (your hasPropertyX above) is substantially more work than the list traversal, unlike this toy example.
import Control.Monad
import Control.Concurrent
import Data.List
import System.Environment
-- Thin a list to every `n`th element starting with index `i`
thin :: Int -> Int -> [a] -> [a]
thin i n = unfoldr step . drop i
where step [] = Nothing
step (y:ys) = Just (y, drop (n-1) ys)
-- Use `n` parallel threads to find first element of `xs` satisfying `f`
parFind :: Int -> (a -> Bool) -> [a] -> IO (Maybe a)
parFind n f xs = do
resultV <- newEmptyMVar
runningV <- newMVar n
comparisonsV <- newMVar 0
threads <- forM [0..n-1] $ \i -> forkIO $ do
case find f (thin i n xs) of
Just x -> void (tryPutMVar resultV (Just x))
Nothing -> do m <- takeMVar runningV
if m == 1
then void (tryPutMVar resultV Nothing)
else putMVar runningV (m-1)
result <- readMVar resultV
mapM_ killThread threads
return result
myList :: [Int]
myList = [1..1000000000]
-- Use `n` threads to find first element equal to `y` in `myList`
run :: Int -> Int -> IO ()
run n y = do x <- parFind n (== y) myList
print x
-- e.g., stack ghc -- -O2 -threaded SearchList.hs
-- time ./SearchList +RTS -N4 -RTS 4 12345 # find 12345 using 4 threads -> 0.018s
-- time ./SearchList +RTS -N4 -RTS 4 -1 # full search w/o match -> 6.7s
main :: IO ()
main = do [n,y] <- getArgs
run (read n) (read y)
Also, note that this version runs the threads on interleaved sublists rather than dividing the main list up into consecutive chunks. I did it this way because (1) it was easier to demonstrate that "early" elements were found quickly; and (2) my huge list means that memory usage can explode if the whole list needs to be kept in memory.
In fact, this example is a bit of a performance time bomb -- its memory usage is nondeterministic and can probably explode if one thread falls way behind so that a substantial portion of the whole list needs to be kept in memory.
In a real world example where the whole list is probably being kept in memory and the property test is expensive, you may find that breaking the list into chunks is faster.
I have a map operation (that is actually run in parallel using parMap from Control.Parallel.Strategies) that takes quite a while. Given that I know how many times the function is applied (n in this context), how can I easily display, every once in a while, how many of the n applications have been evaluated?
The obvious solution would be to make the map a mapM with some putStr inside the mapping function, but that would:
take an unnecessary amount of efficiency
not sample the status every once in a while but in every applications
basically remove all the good things about a deterministic algorithm in the context of parallelism
So, is there a way to keep track of this information, that I'm missing, that avoids these problems?
In production you probably shouldn't use trace and are forced to deal with the complications of needing IO, but for tests you could modify the definition of parMap to take another parameter telling when to emit a count:
import Control.Monad (sequence)
import Control.Parallel.Strategies (Strategy, using, rseq, rparWith, parMap)
import Debug.Trace (traceShow)
import System.IO (hFlush, hSetBuffering, BufferMode(NoBuffering), stdout)
evalList' :: Integer -> Strategy a -> Strategy [a]
evalList' t s as = sequence $ foldr f [] $ zip as [1..]
where f (a, n) ss | n `mod` t == 0 = s (traceShow n a):ss
| otherwise = s a:ss
parList' :: Integer -> Strategy a -> Strategy [a]
parList' t s = evalList' t (rparWith s)
parMap' :: Integer -> Strategy b -> (a -> b) -> [a] -> [b]
parMap' t s f xs = map f xs `using` parList' t s
-- some work to do
fib :: Integer -> Integer
fib 0 = 1
fib 1 = 1
fib n = fib (n-1) + fib(n-2)
main = do hSetBuffering stdout NoBuffering
print $ sum (parMap' 1000 rseq (fib.(+20).(`mod` 5)) ([1..10000]::[Integer]))
If the work packages given by each list element become to small, you could adapt parListChunk instead accordingly.
One could try to craft this behaviour using timeout.
seconds :: Int
seconds = 1000000
progress :: [a] -> IO ()
progress [] = return ()
progress l#(x:xs) =
do r <- timeout (5 * seconds) x -- 5s
threadDelay (2 * seconds) -- 2s more delay
case r of
Nothing -> progress l -- retry
Just y -> do putStrLn "one done!"
progress xs
Be careful since I fear that timeout is aborting the computation. If there's another thread that evaluates x that should be fine, but if this is the only thread doing that it could cause a livelock if 5 seconds are not enough.
I am working on project Euler #14, and have a solution to get the answer, but am getting a stack space overflow error when I try to run the code. The algorithm works OK in the interactive GHCI (on low numbers), but wont work when I throw a really big number at it and try to compile it.
Here is a rough idea of what it does in the interactive GHCI. It takes about 10 seconds to calculate "answer 50000" on my computer.
After letting GHCI run the problem for a few minutes, it spits out the correct answer.
*Euler System.IO> answer 1000000
(525,837799)
But that doesn't solve the stack overflow error when compiling the program to run natively.
*Euler System.IO> answer 10
(20,9)
*Euler System.IO> answer 100
(119,97)
*Euler System.IO> answer 1000
(179,871)
*Euler System.IO> answer 10000
(262,6171)
*Euler System.IO> answer 50000
(324,35655)
What should I do to get the answer to for "answer 1000000"? I imagine my algorithm needs to be fine tuned a bit, but I have no idea how to go about doing that.
Code:
module Main
where
import System.IO
import Control.Monad
main = print (answer 1000000)
-- Count the length of the sequences
-- count' creates a tuple with the second value
-- being the starting number of the game
-- and the first value being the total
-- length of the chain
count' n = (cSeq n, n)
cSeq n = length $ game n
-- Find the maximum chain value of the game
answer n = maximum $ map count' [1..n]
-- Working game.
-- game 13 = [13,40,20,10,5,16,8,4,2,1]
game n = n : play n
play x
| x <= 0 = [] -- is negative or 0
| x == 1 = [] -- is 1
| even x = doEven x : play ((doEven x)) -- even
| otherwise = doOdd x : play ((doOdd x)) -- odd
where doOdd x = (3 * x) + 1
doEven x = (x `div` 2)
The problem here is that maximum is too lazy. Instead of keeping track of the largest element as it goes along, it builds up a huge tree of max thunks. This is because maximum is defined in terms of foldl, so the evaluation goes as follows:
maximum [1, 2, 3, 4, 5]
foldl max 1 [2, 3, 4, 5]
foldl max (max 1 2) [3, 4, 5]
foldl max (max (max 1 2) 3) [4, 5]
foldl max (max (max (max 1 2) 3) 4) [5]
foldl max (max (max (max (max 1 2) 3) 4) 5) []
max (max (max (max 1 2) 3) 4) 5 -- this expression will be huge for large lists
Trying to evaluate too many of these nested max calls causes a stack overflow.
The solution is to force it to evaluate these as it goes along by using the strict version foldl', (or, in this case, its cousin foldl1'). This prevents the max's from building up by reducing them at each step:
foldl1' max [1, 2, 3, 4, 5]
foldl' max 1 [2, 3, 4, 5]
foldl' max 2 [3, 4, 5]
foldl' max 3 [4, 5]
foldl' max 4 [5]
foldl' max 5 []
5
GHC can often solve these kinds of problems on its own if you compile with -O2 which (among other things) runs a strictness analysis of your program. However, I think it's good practice to write programs that don't need to rely on optimizations to work.
Note: After fixing this, the resulting program is still very slow. You might want to look into using memoization for this problem.
#hammar already pointed out the problem that maximum is too lazy, and how to resolve that (using foldl1', the strict version of foldl1).
But there are further inefficiencies in the code.
cSeq n = length $ game n
cSeq lets game construct a list, only to calculate its length. Unfortunately, length is not a "good consumer", so the construction of the intermediate list is not fused away. That's quite a bit of unnecessary allocation and costs time. Eliminating these lists
cSeq n = coll (1 :: Int) n
where
coll acc 1 = acc
coll acc m
| even m = coll (acc + 1) (m `div` 2)
| otherwise = coll (acc + 1) (3*m+1)
cuts down the allocation by something like 65% and the running time by about 20% (still slow). Next point, you're using div, which performs a sign check in addition to the normal division. Since all numbers involved are positive, using quot instead does speed it up a bit more (not much here, but it will become important later).
The next big point is that, since you haven't given type signatures, the type of the numbers (except where it was determined by the use of length or by the expression type signature (1 :: Int) in my rewrite) is Integer. The operations on Integer are considerably slower than the corresponding operations on Int, so if possible, you should use Int (or Word) rather than Integer when speed matters. If you have a 64-bit GHC, Int is sufficient for these computations, that reduces the running time by about half when using div, by about 70% when using quot, when using the native code generator, and when using the LLVM backend, the running time is reduced by about 70% when using div and by about 95% when using quot.
The difference between the native code generator and the LLVM backend is mostly due to some elementary low-level optimisations.
even and odd are defined
even, odd :: (Integral a) => a -> Bool
even n = n `rem` 2 == 0
odd = not . even
in GHC.Real. When the type is Int, LLVM knows to replace the division by 2 used to determine the modulus with a bitwise and (n .&. 1 == 0). The native code generator does not (yet) do many of these low-level optimisations. If you do that by hand, the code produced by the NCG and the LLVM backend performs nearly identically.
When using div, both, the NCG and LLVM, are not able to replace the division with a short shift-and-add sequence, so you get the relatively slow machine division instruction with the sign-test. With quot, both are able to do that for Int, so you get much faster code.
The knowledge that all occurring numbers are positive allows us to replace the division by 2 with a simple right shift, without any code to correct for negative arguments, that speeds up the code produced by the LLVM backend by another ~33%, oddly it doesn't make a difference for the NCG.
So from the original that took eight second plus/minus a bit (a little less with the NCG, a little more with the LLVM backend), we've gone to
module Main (main)
where
import Data.List
import Data.Bits
main = print (answer (1000000 :: Int))
-- Count the length of the sequences
-- count' creates a tuple with the second value
-- being the starting number of the game
-- and the first value being the total
-- length of the chain
count' n = (cSeq n, n)
cSeq n = go (1 :: Int) n
where
go !acc 1 = acc
go acc m
| even' m = go (acc+1) (m `shiftR` 1)
| otherwise = go (acc+1) (3*m+1)
even' :: Int -> Bool
even' m = m .&. 1 == 0
-- Find the maximum chain value of the game
answer n = foldl1' max $ map count' [1..n]
which takes 0.37 seconds with the NCG, and 0.27 seconds with the LLVM backend on my setup.
A minute improvement in running time, but a huge reduction of allocation can be obtained by replacing the foldl1' max with a manual recursion,
answer n = go 1 1 2
where
go ml mi i
| n < i = (ml,mi)
| l > ml = go l i (i+1)
| otherwise = go ml mi (i+1)
where
l = cSeq i
that makes it 0.35 resp. 0.25 seconds (and produces a tiny 52,936 bytes allocated in the heap).
Now if that is still too slow, you can worry about a good memoisation strategy. The best I know(1) is to use an unboxed array to store the chain lengths for the numbers not exceeding the limit,
{-# LANGUAGE BangPatterns #-}
module Main (main) where
import System.Environment (getArgs)
import Data.Array.ST
import Data.Array.Base
import Control.Monad.ST
import Data.Bits
main :: IO ()
main = do
args <- getArgs
let bd = case args of
a:_ -> read a
_ -> 100000
print $ mxColl bd
mxColl :: Int -> (Int,Int)
mxColl bd = runST $ do
arr <- newArray (0,bd) 0
unsafeWrite arr 1 1
goColl arr bd 1 1 2
goColl :: STUArray s Int Int -> Int -> Int -> Int -> Int -> ST s (Int,Int)
goColl arr bd ms ml i
| bd < i = return (ms,ml)
| otherwise = do
nln <- collatzLength arr bd i
if ml < nln
then goColl arr bd i nln (i+1)
else goColl arr bd ms ml (i+1)
collatzLength :: STUArray s Int Int -> Int -> Int -> ST s Int
collatzLength arr bd n = go 1 n
where
go !l 1 = return l
go l m
| bd < m = go (l+1) $ case m .&. 1 of
0 -> m `shiftR` 1
_ -> 3*m+1
| otherwise = do
l' <- unsafeRead arr m
case l' of
0 -> do
l'' <- go 1 $ case m .&. 1 of
0 -> m `shiftR` 1
_ -> 3*m+1
unsafeWrite arr m (l''+1)
return (l + l'')
_ -> return (l+l'-1)
which does the job for a limit of 1000000 in 0.04 seconds when compiled with the NCG, 0.05 with the LLVM backend (apparently, that is not as good at optimising STUArray code as the NCG is).
If you don't have a 64-bit GHC, you can't simply use Int, since that would overflow then for some inputs.
But the overwhelming part of the computation is still performed in Int range, so you should use that where possible and only move to Integer where required.
switch :: Int
switch = (maxBound - 1) `quot` 3
back :: Integer
back = 2 * fromIntegral (maxBound :: Int)
cSeq :: Int -> Int
cSeq n = goInt 1 n
where
goInt acc 1 = acc
goInt acc m
| m .&. 1 == 0 = goInt (acc+1) (m `shiftR` 1)
| m > switch = goInteger (acc+1) (3*toInteger m + 1)
| otherwise = goInt (acc+1) (3*m+1)
goInteger acc m
| fromInteger m .&. (1 :: Int) == 1 = goInteger (acc+1) (3*m+1)
| m > back = goInteger (acc+1) (m `quot` 2) -- yup, quot is faster than shift for Integer here
| otherwise = goInt (acc + 1) (fromInteger $ m `quot` 2)
makes it harder to optimise the loop(s), so it is slower than the single loop using Int, but still decent. Here (where the Integer loop is never run), it takes 0.42 seconds with the NCG and 0.37 with the LLVM backend (which is pretty much the same as using quot in the pure Int version).
Using a similar trick for the memoised version has similar consequences, it's considerably slower than the pure Int version, but still blazingly fast compared to unmemoised versions.
(1) For this special (type of) problem, where you need to memoise the results for a contiguous range of arguments. For other problems, a Map or some other data structure will be the better choice.
It seems that the maximum function is the culprit as already pointed out, but you shouldn't have to worry about it if you compile your program with the -O2 flag.
The program is still quite slow, this is because the problem is supposed to teach you about memoization. One good way of doing this is haskell is by using Data.Memocombinators:
import Data.MemoCombinators
import Control.Arrow
import Data.List
import Data.Ord
import System.Environment
play m = maximumBy (comparing snd) . map (second threeNPuzzle) $ zip [1..] [1..m]
where
threeNPuzzle = arrayRange (1,m) memoized
memoized n
| n == 1 = 1
| odd n = 1 + threeNPuzzle (3*n + 1)
| even n = 1 + threeNPuzzle (n `div` 2)
main = getArgs >>= print . play . read . head
The above program runs in under a second when compiled with -O2 on my machine.
Note that in this case it is not a good idea to memoize all values found by threeNPuzzle, the program above memoizes the ones up until the limit (1000000 in the problem).
I'm trying to measure the performance of a simple Haar DWT program using the Criterion framework. (It is erroneously slow, but I'll leave that for another question). I can't find any good documentation on the web, unfortunately. My two primary problems are
How can one pass data from one benchmark to another? I want to time each stage of the program.
How does the sampling work, and avoid lazy evaluation reusing its previous computations?
This source is relatively pared down; the first function getRandList generates a list of random numbers; haarStep transforms an input signal into differences and sums, and haarDWT calls the former and recurses on the sums. I'm trying to pass the getRandList to the haarDWT via lazy evaluation, but perhaps my usage is incorrect / unsupported. The timings don't seem to make sense.
{-# LANGUAGE ViewPatterns #-}
import Control.Arrow
import qualified Data.Vector.Unboxed as V
import System.Random
import Criterion.Main
invSqrt2 = 0.70710678118654752440
getRandList :: RandomGen g => g -> Int -> [Float]
getRandList gen 0 = []
getRandList gen n = v:rest where
(v, gen') = random gen
rest = getRandList gen' (n - 1)
haarStep :: V.Vector Float -> (V.Vector Float, V.Vector Float)
haarStep = (alternatingOp (-) &&& alternatingOp (+)) where
alternatingOp op x = V.generate (V.length x `div` 2) (\i ->
((x V.! (2 * i)) `op` (x V.! (2 * i + 1))) * invSqrt2)
haarDWT :: V.Vector Float -> V.Vector Float
haarDWT xl#(V.length -> 1) = xl
haarDWT (haarStep -> (d, s)) = haarDWT s V.++ d
main = do
gen <- getStdGen
inData <- return $ getRandList gen 2097152
outData <- return $ haarDWT (V.fromList inData)
defaultMain [
bench "get input" $ nf id inData,
bench "transform" $ nf V.toList outData
]
writeFile "input.dat" (unlines $ map show inData)
writeFile "output.dat" (unlines $ map show $ V.toList outData)
Finally, I'm getting an error when I try to call it with -s 1; maybe this is just a Criterion bug.
Main: ./Data/Vector/Generic.hs:237 ((!)): index out of bounds (1,1)
Thanks in advance!
The posted benchmark is erroniously slow... or is it
Are you sure it's erroneous? You're touching (well, the "nf" call is touching) 2 million boxed elements - thats 4 million pointers. You can call this erroneous if you want, but the issue is just what you think you're measure compared to what you really are measuring.
Sharing Data Between Benchmarks
Data sharing can be accomplished through partial application. In my benchmarks I commonly have
let var = somethingCommon in
defaultMain [ bench "one" (nf (func1 somethingCommon) input1)
, bench "two" (nf (func2 somethingCommon) input2)]
Avoiding Reuse in the presences of lazy evaluation
Criterion avoids sharing by separating out your function and your input. You have signatures such as:
funcToBenchmark :: (NFData b) => a -> b
inputForFunc :: a
In Haskell every time you apply funcToBenchmark inputForFunc it will create a thunk that needs evaluated. There is no sharing unless you use the same variable name as a previous computation. There is no automatic memoization - this seems to be a common misunderstanding.
Notice the nuance in what isn't shared. We aren't sharing the final result, but the input is shared. If the generation of the input is what you want to benchmark (i.e. getRandList, in this case) then benchmark that and not just the identity + nf function:
main = do
gen <- getStdGen
let inData = getRandList gen size
inVec = V.fromList inData
size = 2097152
defaultMain
[ bench "get input for real" $ nf (getRandList gen) size
, bench "get input for real and run harrDWT and listify a vector" $ nf (V.toList . haarDWT . V.fromList . getRandList gen) size
, bench "screw generation, how fast is haarDWT" $ whnf haarDWT inVec] -- for unboxed vectors whnf is sufficient
Interpreting Data
The third benchmark is rather instructive. Lets look at what criterion prints out:
benchmarking screw generation, how fast is haarDWT
collecting 100 samples, 1 iterations each, in estimated 137.3525 s
bootstrapping with 100000 resamples
mean: 134.7204 ms, lb 134.5117 ms, ub 135.0135 ms, ci 0.950
Based on a single run, Criterion thinks it will take 137 seconds to perform it's 100 samples. About ten seconds later it was done - what happened? Well, the first run forced all the inputs (inVec), which was expensive. The subsequent runs found a value instead of a thunk, and thus we truely benchmarked haarDWT and not the StdGen RNG (which is known to be painfully slow).
I've just stated working in haskell semi-explicit parallelism with GHC 6.12. I've write the following haskell code to compute in parallel the map of the fibonnaci function upon 4 elements on a list, and in the same time the map of the function sumEuler upon two elements.
import Control.Parallel
import Control.Parallel.Strategies
fib :: Int -> Int
fib 0 = 0
fib 1 = 1
fib n = fib (n-1) + fib (n-2)
mkList :: Int -> [Int]
mkList n = [1..n-1]
relprime :: Int -> Int -> Bool
relprime x y = gcd x y == 1
euler :: Int -> Int
euler n = length (filter (relprime n) (mkList n))
sumEuler :: Int -> Int
sumEuler = sum . (map euler) . mkList
-- parallel initiation of list walk
mapFib :: [Int]
mapFib = map fib [37, 38, 39, 40]
mapEuler :: [Int]
mapEuler = map sumEuler [7600, 7600]
parMapFibEuler :: Int
parMapFibEuler = (forceList mapFib) `par` (forceList mapEuler `pseq` (sum mapFib + sum mapEuler))
-- how to evaluate in whnf form by forcing
forceList :: [a] -> ()
forceList [] = ()
forceList (x:xs) = x `pseq` (forceList xs)
main = do putStrLn (" sum : " ++ show parMapFibEuler)
to improve my program in parallel I rewrote it with par and pseq and a forcing function to force whnf evaluation. My problem is that by looking in the threadscope it appear that i didn't gain any parallelism. Things are worse because i didn't gain any speedup.
That why I have theses two questions
Question 1 How could I modify my code to exploit any parallelism ?
Question 2 How could I write my program in order to use Strategies (parMap, parList, rdeepseq and so on ...) ?
First improvement with Strategies
according to his contribution
parMapFibEuler = (mapFib, mapEuler) `using` s `seq` (sum mapFib + sum mapEuler) where
s = parTuple2 (seqList rseq) (seqList rseq)
the parallelism appears in the threadscope but not enough to have a significant speedup
The reason you aren't seeing any parallelism here is because your spark has been garbage collected. Run the program with +RTS -s and note this line:
SPARKS: 1 (0 converted, 1 pruned)
the spark has been "pruned", which means removed by the garbage collector. In GHC 7 we made a change to the semantics of sparks, such that a spark is now garbage collected (GC'd) if it is not referred to by the rest of the program; the details are in the "Seq no more" paper.
Why is the spark GC'd in your case? Look at the code:
parMapFibEuler :: Int
parMapFibEuler = (forceList mapFib) `par` (forceList mapEuler `pseq` (sum mapFib + sum mapEuler))
the spark here is the expression forkList mapFib. Note that the value of this expression is not required by the rest of the program; it only appears as an argument to par. GHC knows that it isn't required, so it gets garbage collected.
The whole point of the recent changes to the parallel package were to let you easily avoid this bear trap. A good Rule of Thumb is to use Control.Parallel.Strategies rather than par and pseq directly. My preferred way to write this would be
parMapFibEuler :: Int
parMapFibEuler = runEval $ do
a <- rpar $ sum mapFib
b <- rseq $ sum mapEuler
return (a+b)
but sadly this doesn't work with GHC 7.0.2, because the spark sum mapFib is floated out as a static expression (a CAF), and the runtime doesn't think sparks that point to static expressions are worth keeping (I'll fix this). This wouldn't happen in a real program, of course! So let's make the program a bit more realistic and defeat the CAF optimisation:
parMapFibEuler :: Int -> Int
parMapFibEuler n = runEval $ do
a <- rpar $ sum (take n mapFib)
b <- rseq $ sum (take n mapEuler)
return (a+b)
main = do [n] <- fmap (fmap read) getArgs
putStrLn (" sum : " ++ show (parMapFibEuler n))
Now I get good parallelism with GHC 7.0.2. However, note that #John's comments also apply: generally you want to look for more fine-grained parallelism so as to let GHC use all your processors.
Your parallelism is far too course-grained to have much beneficial effect. The largest chunks of work that can be done in parallel efficiently are in sumEuler, so that's where you should add your par annotations. Try changing sumEuler to:
sumEuler :: Int -> Int
sumEuler = sum . (parMap rseq euler) . mkList
parMap is from Control.Parallel.Strategies; it expresses a map that can be done in parallel. The first argument, rseq having type Strategy a, is used to force the computation to a specific point, otherwise no work would be done, due to laziness. rseq is fine for most numeric types.
It's not useful to add parallelism to fib here, below about fib 40 there isn't enough work to make it worthwhile.
In addition to threadscope, it's useful to run your program with the -s flag. Look for a line like:
SPARKS: 15202 (15195 converted, 0 pruned)
in the output. Each spark is an entry in a work queue to possibly be performed in parallel. Converted sparks are actually done in parallel, while pruned sparks mean that the main thread got to them before a worker thread had the chance to do so. If the pruned number is high, it means your parallel expressions are too fine-grained. If the total number of sparks is low, you aren't trying to do enough in parallel.
Finally, I think parMapFibEuler is better written as:
parMapFibEuler :: Int
parMapFibEuler = sum (mapFib `using` parList rseq) + sum mapEuler
mapEuler is simply too short to have any parallelism usefully expressed here, especially as euler is already performed in parallel. I'm doubtful that it makes a substantial difference for mapFib either. If the lists mapFib and mapEuler were longer, parallelism here would be more useful. Instead of parList you may be able to use parBuffer, which tends to work well for list consumers.
Making these two changes cuts the runtime from 12s to 8s for me, with GHC 7.0.2.
Hmmm... Maybe?
((forceList mapFib) `par` (forceList mapEuler)) `pseq` (sum mapFib + sum mapEuler)
I.e. spawn mapFib in background and calculate mapEuler and only after it (mapEuler) do (+) of their sums.
Actually I guess you can do something like:
parMapFibEuler = a `par` b `pseq` (a+b) where
a = sum mapFib
b = sum mapEuler
About Q2:
As I know strategies - is the "strategies" to combine data-structures with those par and seq.
You can write your forceList = withStrategy (seqList rseq)
As well you can write your code like:
parMapFibEuler = (mapFib, mapEuler) `using` s `seq` (sum mapFib + sum mapEuler) where
s = parTuple2 (seqList rseq) (seqList rseq)
I.e. strategy applied to tuple of two lists will force their evaulation in parallel, but each list will be forced to be evaluated sequentially.
First off, I assume you know your fib definition is awful and you're just doing this to play with the parallel package.
You seem to be going for parallelism at the wrong level. Parallelizing mapFib and mapEuler won't give a good speed-up because there is more work to compute mapFib. What you should do is compute each of these very expensive elements in parallel, which is slightly finer grain but not overly so:
mapFib :: [Int]
mapFib = parMap rdeepseq fib [37, 38, 39, 40]
mapEuler :: [Int]
mapEuler = parMap rdeepseq sumEuler [7600, 7600, 7600,7600]
parMapFibEuler :: Int
parMapFibEuler = sum a + sum b
where
a = mapFib
b = mapEuler
Also, I originally fought using Control.Parallel.Strategies over Control.Parallel but have come to like it as it is more readable and avoids issues like yours where one would expect parallelism and have to squint at it to figure out why you aren't getting any.
Finally, you should always post how you compile and how you run code you're expecting to be parallelized. For example:
$ ghc --make -rtsopts -O2 -threaded so.hs -eventlog -fforce-recomp
[1 of 1] Compiling Main ( so.hs, so.o )
Linking so ...
$ ./so +RTS -ls -N2
sum : 299045675
Yields: