Frequent GC preventing sparks from running in parallel - haskell

I tried running the first example here: http://chimera.labs.oreilly.com/books/1230000000929/ch03.html
Code: https://github.com/simonmar/parconc-examples/blob/master/strat.hs
import Control.Parallel
import Control.Parallel.Strategies (rpar, Strategy, using)
import Text.Printf
import System.Environment
-- <<fib
fib :: Integer -> Integer
fib 0 = 1
fib 1 = 1
fib n = fib (n-1) + fib (n-2)
-- >>
main = print pair
 where
  pair =
-- <<pair
   (fib 35, fib 36) `using` parPair
-- >>
-- <<parPair
parPair :: Strategy (a,b)
parPair (a,b) = do
  a' <- rpar a
  b' <- rpar b
  return (a',b')
-- >>
I've built using ghc 7.10.2 (on OSX, with a multicore machine) using the following command:
ghc -O2 strat.hs -threaded -rtsopts -eventlog
And run using:
./strat +RTS -N2 -l -s
I expected the two fib calculations to run in parallel (the examples from the previous chapter worked as expected, so this is not a setup issue), but I got no speedup at all, as seen here:
% ./strat +RTS -N2 -l -s
(14930352,24157817)
3,127,178,800 bytes allocated in the heap
6,323,360 bytes copied during GC
70,000 bytes maximum residency (2 sample(s))
31,576 bytes maximum slop
2 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 5963 colls, 5963 par 0.179s 0.074s 0.0000s 0.0001s
Gen 1 2 colls, 1 par 0.000s 0.000s 0.0001s 0.0001s
Parallel GC work balance: 2.34% (serial 0%, perfect 100%)
TASKS: 6 (1 bound, 5 peak workers (5 total), using -N2)
SPARKS: 2 (0 converted, 0 overflowed, 0 dud, 1 GC'd, 1 fizzled)
INIT time 0.000s ( 0.001s elapsed)
MUT time 1.809s ( 1.870s elapsed)
GC time 0.180s ( 0.074s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 1.991s ( 1.945s elapsed)
Alloc rate 1,728,514,772 bytes per MUT second
Productivity 91.0% of total user, 93.1% of total elapsed
gc_alloc_block_sync: 238
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0
-N1 gets similar results (omitted).
The # of GC collections seemed suspicious, as pointed out by others in #haskell-beginners, so I tried adding -A16M when running. The results looked much more in line with expectations:
% ./strat +RTS -N2 -l -s -A16M
(14930352,24157817)
3,127,179,920 bytes allocated in the heap
260,960 bytes copied during GC
69,984 bytes maximum residency (2 sample(s))
28,320 bytes maximum slop
33 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 115 colls, 115 par 0.105s 0.002s 0.0000s 0.0003s
Gen 1 2 colls, 1 par 0.000s 0.000s 0.0002s 0.0002s
Parallel GC work balance: 71.25% (serial 0%, perfect 100%)
TASKS: 6 (1 bound, 5 peak workers (5 total), using -N2)
SPARKS: 2 (1 converted, 0 overflowed, 0 dud, 0 GC'd, 1 fizzled)
INIT time 0.001s ( 0.001s elapsed)
MUT time 1.579s ( 1.087s elapsed)
GC time 0.106s ( 0.002s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 1.686s ( 1.091s elapsed)
Alloc rate 1,980,993,138 bytes per MUT second
Productivity 93.7% of total user, 144.8% of total elapsed
gc_alloc_block_sync: 27
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0
The question is: Why is this the behavior? Even with frequent GC, I still intuitively expect the 2 sparks to run in parallel in the other 90% of the running time.

Yes, this is actually a bug in GHC 8.0.1 and earlier (I'm working on fixing it for 8.0.2). The problem is that the fib 35 and fib 36 expressions are constant and so GHC lifts them to the top level as CAFs, and the RTS was wrongly assuming that the CAFs were unreachable and so garbage collecting the sparks.
You can work around it by making the expressions non-constant by passing in parameters on the command line:
main = do
  [a,b] <- map read <$> getArgs
  let pair = (fib a, fib b) `using` parPair
  print pair
and then run the program with ./strat 35 36.
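For reference, here is a minimal self-contained sketch of the workaround as a complete program. It assumes the parallel package is installed; the usage message and the case-based argument handling are additions for illustration and are not part of the original strat.hs.

```haskell
import Control.Parallel.Strategies (Strategy, rpar, using)
import System.Environment (getArgs)

-- Naive Fibonacci, same definition as in the question.
fib :: Integer -> Integer
fib 0 = 1
fib 1 = 1
fib n = fib (n - 1) + fib (n - 2)

-- Spark both components of the pair.
parPair :: Strategy (a, b)
parPair (a, b) = do
  a' <- rpar a
  b' <- rpar b
  return (a', b')

main :: IO ()
main = do
  args <- getArgs
  -- Reading the inputs at run time keeps fib a and fib b from being
  -- constant expressions that GHC can float to the top level as CAFs.
  case map read args of
    [a, b] -> print ((fib a, fib b) `using` parPair)
    _      -> putStrLn "usage: strat N M"  -- hypothetical usage message
```

Build it with the same flags as in the question and run it as ./strat 35 36 +RTS -N2 -s, then check the SPARKS line to see whether a spark was converted.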

Related

Why does `-threaded` make it slower?

A simple plan:
import qualified Data.ByteString.Lazy.Char8 as BS
main = do
  wc <- length . BS.words <$> BS.getContents
  print wc
Build for speed:
ghc -fllvm -O2 -threaded -rtsopts Words.hs
More CPUs means more slowly?
$ time ./Words +RTS -qa -N1 < big.txt
331041862
real 0m25.963s
user 0m21.747s
sys 0m1.528s
$ time ./Words +RTS -qa -N2 < big.txt
331041862
real 0m36.410s
user 0m34.910s
sys 0m6.892s
$ time ./Words +RTS -qa -N4 < big.txt
331041862
real 0m42.150s
user 0m55.393s
sys 0m16.227s
For good measure:
$ time wc -w big.txt
331041862 big.txt
real 0m8.277s
user 0m7.553s
sys 0m0.529s
Clearly, this is a single-threaded activity. Still, I wonder why it slows down so much.
Also, do you have any tips, how I can make it competitive with wc?
It's GC. I ran your program with +RTS -s and the results tell the whole story.
-N1
D:\>a +RTS -qa -N1 -s < lorem.txt
15470835
4,558,095,152 bytes allocated in the heap
1,746,720 bytes copied during GC
77,936 bytes maximum residency (118 sample(s))
131,856 bytes maximum slop
2 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 8519 colls, 0 par 0.016s 0.021s 0.0000s 0.0001s
Gen 1 118 colls, 0 par 0.000s 0.004s 0.0000s 0.0001s
TASKS: 3 (1 bound, 2 peak workers (2 total), using -N1)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.000s ( 0.001s elapsed)
MUT time 0.842s ( 0.855s elapsed)
GC time 0.016s ( 0.025s elapsed)
EXIT time 0.016s ( 0.000s elapsed)
Total time 0.874s ( 0.881s elapsed)
Alloc rate 5,410,809,512 bytes per MUT second
Productivity 98.2% of total user, 97.4% of total elapsed
gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0
-N4
D:\>a +RTS -qa -N4 -s < lorem.txt
15470835
4,558,093,352 bytes allocated in the heap
1,720,232 bytes copied during GC
77,936 bytes maximum residency (113 sample(s))
160,432 bytes maximum slop
4 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 8524 colls, 8524 par 4.742s 1.678s 0.0002s 0.0499s
Gen 1 113 colls, 112 par 0.031s 0.027s 0.0002s 0.0099s
Parallel GC work balance: 1.40% (serial 0%, perfect 100%)
TASKS: 6 (1 bound, 5 peak workers (5 total), using -N4)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.000s ( 0.001s elapsed)
MUT time 1.950s ( 1.415s elapsed)
GC time 4.774s ( 1.705s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 6.724s ( 3.121s elapsed)
Alloc rate 2,337,468,786 bytes per MUT second
Productivity 29.0% of total user, 62.5% of total elapsed
gc_alloc_block_sync: 21082
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0
The most significant parts are
Tot time (elapsed) Avg pause Max pause
Gen 0 8524 colls, 8524 par 4.742s 1.678s 0.0002s 0.0499s
and
Parallel GC work balance: 1.40% (serial 0%, perfect 100%)
When the program runs on the threaded runtime with more than one capability (-N2, -N4), the runtime tries to balance work across threads as far as possible. Your whole program is sequential, so the only work that can be moved to other threads is GC. Here the GC work cannot actually be done in parallel (note the 1.40% work balance), so the GC threads mostly wait for one another to finish, and a lot of time is wasted on synchronization.
If you tell the runtime not to balance among threads by +RTS -qm then sometimes -N4 is as fast as -N1.
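Since the cost here is dominated by GC, one way to reduce it is to allocate less in the first place. The following is only a sketch under that assumption: instead of building the list of words with BS.words, it folds over the characters and counts word starts. The names WordState and countWords are made up for this example, and no claim is made that this matches the speed of wc.

```haskell
import qualified Data.ByteString.Lazy.Char8 as BS
import Data.Char (isSpace)

-- Running word count plus a flag saying whether we are inside a word.
-- Strict fields keep the fold from building up thunks.
data WordState = WordState !Int !Bool

-- Count words by counting transitions from whitespace to non-whitespace.
countWords :: BS.ByteString -> Int
countWords = finish . BS.foldl' step (WordState 0 False)
  where
    step (WordState n inWord) c
      | isSpace c = WordState n False
      | inWord    = WordState n True
      | otherwise = WordState (n + 1) True
    finish (WordState n _) = n

main :: IO ()
main = BS.getContents >>= print . countWords
```

Comparing the +RTS -s output of this version against the original shows whether the allocation and the number of collections actually went down on your input.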

Profiling Two Functions That Sum Large List

I just started reading Parallel and Concurrent Programming in Haskell.
I wrote two programs that, I believe, sum up a list in two ways:
- running rpar (force (sum list))
- splitting the list in two, running the above on each half, and adding the two results
Here's the code:
import Control.Parallel.Strategies
import Control.DeepSeq
import System.Environment
main :: IO ()
main = do
  [n] <- getArgs
  [single, faster] !! (read n - 1)
single :: IO ()
single = print . runEval $ rpar (sum list)
faster :: IO ()
faster = print . runEval $ do
  let (as, bs) = splitAt ((length list) `div` 2) list
  res1 <- rpar (sum as)
  res2 <- rpar (sum bs)
  return (res1 + res2)
list :: [Integer]
list = [1..10000000]
Compile with parallelization enabled (-threaded)
C:\Users\k\Workspace\parallel_concurrent_haskell>ghc Sum.hs -O2 -threaded -rtsopts
[1 of 1] Compiling Main ( Sum.hs, Sum.o )
Linking Sum.exe ...
Results of single Program
C:\Users\k\Workspace\parallel_concurrent_haskell>Sum 1 +RTS -s -N2
50000005000000
960,065,896 bytes allocated in the heap
363,696 bytes copied during GC
43,832 bytes maximum residency (2 sample(s))
57,016 bytes maximum slop
2 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 1837 colls, 1837 par 0.00s 0.01s 0.0000s 0.0007s
Gen 1 2 colls, 1 par 0.00s 0.00s 0.0002s 0.0003s
Parallel GC work balance: 0.18% (serial 0%, perfect 100%)
TASKS: 4 (1 bound, 3 peak workers (3 total), using -N2)
SPARKS: 1 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 1 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 0.27s ( 0.27s elapsed)
GC time 0.00s ( 0.01s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 0.27s ( 0.28s elapsed)
Alloc rate 3,614,365,726 bytes per MUT second
Productivity 100.0% of total user, 95.1% of total elapsed
gc_alloc_block_sync: 573
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0
Run with faster
C:\Users\k\Workspace\parallel_concurrent_haskell>Sum 2 +RTS -s -N2
50000005000000
1,600,100,336 bytes allocated in the heap
1,477,564,464 bytes copied during GC
400,027,984 bytes maximum residency (14 sample(s))
70,377,336 bytes maximum slop
911 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 3067 colls, 3067 par 1.05s 0.68s 0.0002s 0.0021s
Gen 1 14 colls, 13 par 1.98s 1.53s 0.1093s 0.5271s
Parallel GC work balance: 0.00% (serial 0%, perfect 100%)
TASKS: 4 (1 bound, 3 peak workers (3 total), using -N2)
SPARKS: 2 (0 converted, 0 overflowed, 0 dud, 1 GC'd, 1 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 0.38s ( 1.74s elapsed)
GC time 3.03s ( 2.21s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 3.42s ( 3.95s elapsed)
Alloc rate 4,266,934,229 bytes per MUT second
Productivity 11.4% of total user, 9.9% of total elapsed
gc_alloc_block_sync: 335
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0
Why did single complete in 0.28 seconds, but faster (poorly named, evidently) took 3.95 seconds?
I am no expert in Haskell-specific profiling, but I can see several possible problems in faster. You are walking the input list at least three times: once to get its length, once for splitAt (maybe twice, I'm not totally sure how this is implemented), and then again to read and sum its elements. In single, the list is walked only once.
You also hold the entire list in memory at once with faster, whereas with single Haskell can process it lazily and GC as it goes. If you look at the profiling output, you can see that faster copies many more bytes during GC: over 3,000 times more! faster also needed 400 MB of memory all at once, where single needed only about 40 KB at a time. So the garbage collector had a much larger space to keep scanning.
Another big issue: you allocate a ton of new cons cells in faster to hold the two intermediate sub-lists. Even if they could all be GCed right away, that is a lot of time spent allocating. It's more expensive than just doing the addition to begin with! So even before you start adding, you are already "over budget" compared to single.
Following amalloy's answer... My machine is slower than yours, and running your single took
Total time 0.41s ( 0.35s elapsed)
I tried:
list = [ 1..10000000]
list1 = [ 1..5000000]
list2 = [ 5000001 .. 10000000 ]
fastest :: IO ()
fastest = print . runEval $ do
  res1 <- rpar (sum list1)
  res2 <- rpar (sum list2)
  return (res1 + res2)
With that I got
c:\Users\peter\Documents\Haskell\practice>parlist 4 +RTS -s -N2
parlist 4 +RTS -s -N2
50000005000000
960,068,544 bytes allocated in the heap
1,398,472 bytes copied during GC
43,832 bytes maximum residency (3 sample(s))
203,544 bytes maximum slop
3 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 1836 colls, 1836 par 0.00s 0.01s 0.0000s 0.0009s
Gen 1 3 colls, 2 par 0.00s 0.00s 0.0002s 0.0004s
Parallel GC work balance: 0.04% (serial 0%, perfect 100%)
TASKS: 4 (1 bound, 3 peak workers (3 total), using -N2)
SPARKS: 2 (0 converted, 0 overflowed, 0 dud, 1 GC'd, 1 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 0.31s ( 0.33s elapsed)
GC time 0.00s ( 0.01s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 0.31s ( 0.35s elapsed)
Alloc rate 3,072,219,340 bytes per MUT second
Productivity 100.0% of total user, 90.1% of total elapsed
which is faster...
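Note that the SPARKS line above still reports 0 converted, so the two halves were not actually summed in parallel. As a hedged sketch, the same idea can be written so that the upper bound comes from the command line and both sparks are waited on with rseq. The names sumRange and parSum and the default of 10000000 are illustrative choices, and whether the sparks are converted still depends on the GHC version and RTS options.

```haskell
import Control.Parallel.Strategies (runEval, rpar, rseq)
import Data.List (foldl')
import System.Environment (getArgs)

-- Sum an inclusive range without keeping the list in memory.
sumRange :: Integer -> Integer -> Integer
sumRange lo hi = foldl' (+) 0 [lo .. hi]

-- Spark both halves, then wait for both before adding.
parSum :: Integer -> Integer
parSum n = runEval $ do
  let mid = n `div` 2
  a <- rpar (sumRange 1 mid)
  b <- rpar (sumRange (mid + 1) n)
  _ <- rseq a
  _ <- rseq b
  return (a + b)

main :: IO ()
main = do
  args <- getArgs
  -- Take the upper bound from the command line so the ranges are not
  -- constant top-level lists; fall back to the size used in the question.
  let n = case args of
            [s] -> read s
            _   -> 10000000
  print (parSum n)
```

Each half is generated and consumed lazily inside its own spark, so neither length nor splitAt has to walk a shared list first.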

Excessive mysterious system time use in a GHC-compiled binary

I'm working on an exploration of automatic bounding of constraint-based searches. As such, my starting point is the SEND MORE MONEY problem, with a solution based on nondeterministic selection without replacement. I've modified the approach to count the number of samples performed, in order to better measure the impact of adding constraints to the search.
import Control.Monad.State
import Control.Monad.Trans.List
import Control.Monad.Morph
import Data.List (foldl')
type CS a b = StateT [a] (ListT (State Int)) b
select' :: [a] -> [(a, [a])]
select' [] = []
select' (x:xs) = (x, xs) : [(y, x:ys) | ~(y, ys) <- select' xs]
select :: CS a a
select = do
    i <- lift . lift $ get
    xs <- get
    lift . lift . put $! i + length xs
    hoist (ListT . return) (StateT select')
runCS :: CS a b -> [a] -> ([b], Int)
runCS a xs = flip runState 0 . runListT $ evalStateT a xs
fromDigits :: [Int] -> Int
fromDigits = foldl' (\x y -> 10 * x + y) 0
sendMoreMoney :: ([(Int, Int, Int)], Int)
sendMoreMoney = flip runCS [0..9] $ do
    [s,e,n,d,m,o,r,y] <- replicateM 8 select
    let send = fromDigits [s,e,n,d]
        more = fromDigits [m,o,r,e]
        money = fromDigits [m,o,n,e,y]
    guard $ s /= 0 && m /= 0 && send + more == money
    return (send, more, money)
main :: IO ()
main = print sendMoreMoney
It works, it gets correct results, and it maintains a flat heap profile during the search. But even so, it's slow. It's something like 20x slower than without counting the selections. Even that isn't terrible. I can live with paying a huge penalty in order to collect these performance numbers.
But I still don't want the performance to be needlessly terrible, so I decided to look for low-hanging fruit in terms of performance. And I came across some baffling results when I did.
$ ghc -O2 -Wall -fforce-recomp -rtsopts statefulbacktrack.hs
[1 of 1] Compiling Main ( statefulbacktrack.hs, statefulbacktrack.o )
Linking statefulbacktrack ...
$ time ./statefulbacktrack
([(9567,1085,10652)],2606500)
real 0m6.960s
user 0m3.880s
sys 0m2.968s
That system time is utterly ridiculous. The program performs output once. Where's it all going? My next step was checking strace.
$ strace -cf ./statefulbacktrack
([(9567,1085,10652)],2606500)
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
98.38 0.033798 1469 23 munmap
1.08 0.000370 0 21273 rt_sigprocmask
0.26 0.000090 0 10638 clock_gettime
0.21 0.000073 0 10638 getrusage
0.07 0.000023 4 6 mprotect
0.00 0.000000 0 8 read
0.00 0.000000 0 1 write
0.00 0.000000 0 144 134 open
0.00 0.000000 0 10 close
0.00 0.000000 0 1 execve
0.00 0.000000 0 9 9 access
0.00 0.000000 0 3 brk
0.00 0.000000 0 1 ioctl
0.00 0.000000 0 847 sigreturn
0.00 0.000000 0 1 uname
0.00 0.000000 0 1 select
0.00 0.000000 0 13 rt_sigaction
0.00 0.000000 0 1 getrlimit
0.00 0.000000 0 387 mmap2
0.00 0.000000 0 16 15 stat64
0.00 0.000000 0 10 fstat64
0.00 0.000000 0 1 1 futex
0.00 0.000000 0 1 set_thread_area
0.00 0.000000 0 1 set_tid_address
0.00 0.000000 0 1 timer_create
0.00 0.000000 0 2 timer_settime
0.00 0.000000 0 1 timer_delete
0.00 0.000000 0 1 set_robust_list
------ ----------- ----------- --------- --------- ----------------
100.00 0.034354 44039 159 total
So.. strace tells me only 0.034354s was spent in system calls.
Where's the rest of the sys time reported by time going?
One further data point: GC time is really high. Is there an easy way to bring that down?
$ ./statefulbacktrack +RTS -s
([(9567,1085,10652)],2606500)
5,541,572,660 bytes allocated in the heap
1,465,208,164 bytes copied during GC
27,317,868 bytes maximum residency (66 sample(s))
635,056 bytes maximum slop
65 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 10568 colls, 0 par 1.924s 2.658s 0.0003s 0.0081s
Gen 1 66 colls, 0 par 0.696s 2.226s 0.0337s 0.1059s
INIT time 0.000s ( 0.001s elapsed)
MUT time 1.656s ( 2.279s elapsed)
GC time 2.620s ( 4.884s elapsed)
EXIT time 0.000s ( 0.009s elapsed)
Total time 4.276s ( 7.172s elapsed)
%GC time 61.3% (68.1% elapsed)
Alloc rate 3,346,131,972 bytes per MUT second
Productivity 38.7% of total user, 23.1% of total elapsed
System Info:
$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.10.1
$ uname -a
Linux debian 3.2.0-4-686-pae #1 SMP Debian 3.2.68-1+deb7u1 i686 GNU/Linux
Running a Debian 7 virtual machine in VMWare Player 7.10 hosted on Windows 8.1.
Be sure to add -H128 to your command line after +RTS -s. (-H is a runtime option, so it goes on the run command rather than the build command.)
Your eval looks fine, so you are good to go there.
If you really wanted to go after sluggishness on this VM, increase the thread priority on the VM (and the VM console slightly if you want).
Another unexpected penalty will be due to sync confirmation for GC (since this is SMP Debian on a multicore system).
The GC will have even more VM manipulation to perform on any multicore system, which partially explains the 61 percent GC stat and the discrepancy between strace and time. The stats are not reliable for most situations anyway.
You are actually doing quite well, especially if this is on an i7 or later, for example.
I would be surprised if the -H128 option does not resolve this.
I am new here, please do let me know if I can help further or if there is anything you require prior to doling out the bounty.

Is there a parallel find in Haskell?

I have some kind of brute force problem I like to solve in Haskell. My machine has 16 cores so I want to speed up my current algorithm a bit.
I have a method "tryCombination" which returns either a Just (String) or a Nothing. My loop looks like this:
findSolution = find (isJust) [tryCombination a1 a2 a3 n z p |
                              a1 <- [600..700],
                              a2 <- [600..700],
                              a3 <- [600..700],
                              n <- [1..100],
                              ....
I know there is a special parMap to parallelize a map function. A mapFind could be tricky, since it is not predictable whether a given thread really finds the first occurrence. But is there something like a mapAny to speed up the search?
EDIT:
I rewrote the code using the "withStrategy (parList rseq)" snippet. The status report looks like this:
38,929,334,968 bytes allocated in the heap
2,215,280,048 bytes copied during GC
3,505,624 bytes maximum residency (795 sample(s))
202,696 bytes maximum slop
15 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 44922 colls, 44922 par 37.33s 8.34s 0.0002s 0.0470s
Gen 1 795 colls, 794 par 7.58s 1.43s 0.0018s 0.0466s
Parallel GC work balance: 4.36% (serial 0%, perfect 100%)
TASKS: 10 (1 bound, 9 peak workers (9 total), using -N8)
SPARKS: 17576 (8198 converted, 9378 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 81.79s ( 36.37s elapsed)
GC time 44.91s ( 9.77s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 126.72s ( 46.14s elapsed)
Alloc rate 475,959,220 bytes per MUT second
Productivity 64.6% of total user, 177.3% of total elapsed
gc_alloc_block_sync: 834851
whitehole_spin: 0
gen[0].sync: 10
gen[1].sync: 3724
As I already mentioned (see my comments), all the cores are working for only three seconds (while the sparks are being processed). For the following 30s all the work is done by a single core. How can I optimize this further?
Some more EDIT:
I now gave "withStrategy (parBuffer 10 rdeepseq)" a try and fiddled around with different buffer sizes:
Buffer size   GC work balance   MUT      GC
10            50%               11.69s   0.94s
100           47%               12.31s   1.67s
500           40%               11.5s    1.35s
5000          21%               11.47s   2.25s
First of all, this is a big improvement over the 59s it took without any multithreading. The second conclusion is that the buffer size should be as small as possible but larger than the number of cores.
But the best is, that I have neither overflowed nor fizzled sparks any more. All were converted successfully.
Depending on the laziness of tryCombination and the desired parallelization, one of these might do what you want:
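import Control.Parallel.Strategies
import Data.List (find)
import Data.Maybe (isJust)
-- z and p are bound by the further generators elided in the question
findSolution =
  find (isJust) $
  withStrategy (parList rseq) $
    [ tryCombination a1 a2 a3 n z p
    | a1 <- [600..700]
    , a2 <- [600..700]
    , a3 <- [600..700]
    , n <- [1..100]]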
This parallelizes the work performed by tryCombination to figure out whether it is a Just or a Nothing, but not the evaluation of the actual result inside the Just.
If there is no such laziness to be exploited and the result type is simple, it might work better to write
findSolution =
  find (isJust) $
  withStrategy (parList rdeepseq) $
    [ tryCombination a1 a2 a3 n z p
    | a1 <- [600..700]
    , a2 <- [600..700]
    , a3 <- [600..700]
    , n <- [1..100]]
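The parBuffer variant mentioned in the question's second edit can be sketched the same way. This is a minimal, self-contained illustration rather than the asker's actual code: tryCombination here is a hypothetical two-argument stand-in, and the buffer size of 32 is an arbitrary choice, picked only to be larger than a typical core count.

```haskell
import Control.Parallel.Strategies (withStrategy, parBuffer, rdeepseq)
import Data.List (find)
import Data.Maybe (isJust)

-- Hypothetical stand-in for the real tryCombination from the question.
tryCombination :: Int -> Int -> Maybe String
tryCombination a n
  | a * n == 65000 = Just (show (a, n))
  | otherwise      = Nothing

-- parBuffer keeps a rolling window of sparks ahead of the consumer,
-- so the candidate list is never sparked all at once.
findSolution :: Maybe (Maybe String)
findSolution =
  find isJust $
    withStrategy (parBuffer 32 rdeepseq)
      [ tryCombination a n | a <- [600 .. 700], n <- [1 .. 100] ]

main :: IO ()
main = print findSolution
```

Because the list is consumed lazily by find, evaluation stops shortly after the first Just is found, and only the elements in the buffer window are evaluated speculatively.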
