In order to learn about GHC's parallel strategies, I've written a simple particle simulator that, given a particle's position, velocity, and acceleration, projects that particle's path forward.
import Control.Parallel.Strategies

-- Use phantom a to store axis.
newtype Pos a = Pos Double deriving Show
newtype Vel a = Vel Double deriving Show
newtype Acc a = Acc Double deriving Show
newtype TimeStep = TimeStep Double deriving Show

-- Phantom axis
data X
data Y

-- Position, velocity, acceleration for a particle.
data Particle = Particle (Pos X) (Pos Y) (Vel X) (Vel Y) (Acc X) (Acc Y) deriving (Show)

stepParticle :: TimeStep -> Particle -> Particle
stepParticle ts (Particle x y xv yv xa ya) =
    Particle x' y' xv' yv' xa' ya'
  where
    (x', xv', xa') = step ts x xv xa
    (y', yv', ya') = step ts y yv ya

-- Given a position, velocity, and accel, calculate the pos, vel, acc after
-- a given TimeStep.
step :: TimeStep -> Pos a -> Vel a -> Acc a -> (Pos a, Vel a, Acc a)
step (TimeStep ts) (Pos p) (Vel v) (Acc a) = (Pos p', Vel v', Acc a)
  where
    v' = ts * a + v
    p' = ts * v + p

-- Build a list of lazy infinite lists of a particle's travel
-- with each update a TimeStep apart. Evaluate each inner list in
-- parallel.
simulateParticlesPar :: TimeStep -> [Particle] -> [[Particle]]
simulateParticlesPar ts = withStrategy (parList (parBuffer 250 particleStrategy))
                        . fmap (simulateParticle ts)

-- Build a lazy infinite list of the particle's travel with each
-- update being a TimeStep apart.
simulateParticle :: TimeStep -> Particle -> [Particle]
simulateParticle ts m = m' : simulateParticle ts m'
  where
    m' = stepParticle ts m

particleStrategy :: Strategy Particle
particleStrategy (Particle (Pos x) (Pos y) (Vel xv) (Vel yv) (Acc xa) (Acc ya)) = do
    x'  <- rseq x
    y'  <- rseq y
    xv' <- rseq xv
    yv' <- rseq yv
    xa' <- rseq xa
    ya' <- rseq ya
    return $ Particle (Pos x') (Pos y') (Vel xv') (Vel yv') (Acc xa') (Acc ya')

main :: IO ()
main = do
    let world = replicate 100 (Particle (Pos 0) (Pos 0) (Vel 1) (Vel 1) (Acc 0) (Acc 0))
        ts    = TimeStep 0.1
    print $ fmap (take 10000) (simulateParticlesPar ts world)
For each particle, I create a lazy infinite list projecting the particle's path into the future. I start out with 100 of these particles and project them all forward, my intention being to project each of them forward in parallel (roughly a spark per infinite list). If I project these lists forward long enough, I'd expect a significant speedup. Unfortunately, I see a slight slowdown.
Compilation: ghc phys.hs -rtsopts -threaded -eventlog -O2
With 1 thread:
$ ./phys +RTS -N1 -sstderr -ls > /dev/null
24,264,983,224 bytes allocated in the heap
441,881,088 bytes copied during GC
1,942,848 bytes maximum residency (104 sample(s))
75,880 bytes maximum slop
7 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 46820 colls, 0 par 0.82s 0.88s 0.0000s 0.0039s
Gen 1 104 colls, 0 par 0.23s 0.23s 0.0022s 0.0037s
TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)
SPARKS: 1025000 (25 converted, 0 overflowed, 0 dud, 28680 GC'd, 996295 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 9.90s ( 10.09s elapsed)
GC time 1.05s ( 1.11s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 10.95s ( 11.20s elapsed)
Alloc rate 2,451,939,648 bytes per MUT second
Productivity 90.4% of total user, 88.4% of total elapsed
gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0
With 2 threads:
$ ./phys +RTS -N2 -sstderr -ls > /dev/null
24,314,635,280 bytes allocated in the heap
457,603,240 bytes copied during GC
1,962,152 bytes maximum residency (104 sample(s))
119,824 bytes maximum slop
7 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 46555 colls, 46555 par 1.40s 0.85s 0.0000s 0.0048s
Gen 1 104 colls, 103 par 0.42s 0.25s 0.0024s 0.0043s
Parallel GC work balance: 16.85% (serial 0%, perfect 100%)
TASKS: 6 (1 bound, 5 peak workers (5 total), using -N2)
SPARKS: 1025000 (1023572 converted, 0 overflowed, 0 dud, 1367 GC'd, 61 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 11.07s ( 11.20s elapsed)
GC time 1.82s ( 1.10s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 12.89s ( 12.30s elapsed)
Alloc rate 2,196,259,905 bytes per MUT second
Productivity 85.9% of total user, 90.0% of total elapsed
gc_alloc_block_sync: 9222
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 2393
I have an Intel i5 with 2 cores and 4 threads, and with -N4 it's 2x slower than -N1 (total time ~20 sec).
I've spent quite a bit of time trying different strategies, such as chunking the outer list (so each spark gets more than one stream to project forward) and using rpar for each field in particleStrategy (sketched below), but I've yet to get any speedup at all.
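For reference, the rpar-per-field variant looked roughly like this (a sketch of what I tried, not necessarily verbatim):
-- Same shape as particleStrategy above, but sparking each field with rpar
-- instead of forcing it with rseq.
particleStrategyPar :: Strategy Particle
particleStrategyPar (Particle (Pos x) (Pos y) (Vel xv) (Vel yv) (Acc xa) (Acc ya)) = do
    x'  <- rpar x
    y'  <- rpar y
    xv' <- rpar xv
    yv' <- rpar yv
    xa' <- rpar xa
    ya' <- rpar ya
    return $ Particle (Pos x') (Pos y') (Vel xv') (Vel yv') (Acc xa') (Acc ya')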
Below is a zoomed-in section of the eventlog under ThreadScope. As you can see, I'm getting almost no concurrency. Most of the work is being done by HEC0, with some activity from HEC1 interleaved, but only one HEC is doing work at a time. This is pretty representative of all the strategies I've tried.
As a sanity check, I've run a few of the example programs from "Parallel and Concurrent Programming in Haskell" and I also see slowdowns on those programs, even though I'm using the same params that give them significant speedups in the book! I'm beginning to think there's something wrong with my GHC.
$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.8.3
Installed from: https://ghcformacosx.github.io/
OS X 10.10.2
Update:
I've found this ticket in the GHC tracker about an OS X threaded RTS performance regression: https://ghc.haskell.org/trac/ghc/ticket/7602. I'm hesitant to blame the compiler, but my -N4 output supports this hypothesis. The "parallel GC work balance" is terrible:
$ ./phys +RTS -N4 -sstderr -ls > /dev/null
24,392,146,832 bytes allocated in the heap
481,001,648 bytes copied during GC
1,989,272 bytes maximum residency (104 sample(s))
181,208 bytes maximum slop
8 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 46555 colls, 46555 par 4.80s 1.98s 0.0000s 0.0055s
Gen 1 104 colls, 103 par 0.99s 0.39s 0.0037s 0.0048s
Parallel GC work balance: 7.59% (serial 0%, perfect 100%)
TASKS: 10 (1 bound, 9 peak workers (9 total), using -N4)
SPARKS: 1025000 (1023640 converted, 0 overflowed, 0 dud, 1331 GC'd, 29 fizzled)
INIT time 0.00s ( 0.01s elapsed)
MUT time 14.85s ( 13.12s elapsed)
GC time 5.79s ( 2.36s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 20.65s ( 15.49s elapsed)
Alloc rate 1,642,170,155 bytes per MUT second
Productivity 71.9% of total user, 95.9% of total elapsed
gc_alloc_block_sync: 61429
whitehole_spin: 0
gen[0].sync: 1
gen[1].sync: 617
On the other hand, I don't know if this explains my threadscope output, which shows a lack of any concurrency at all.
I'm trying to understand how expensive a (green) thread in Haskell (GHC 7.10.1 on OS X 10.10.5) really is. I'm aware that it's super cheap compared to a real OS thread, both in memory usage and CPU.
Right, so I started by writing a super simple program that forks n (green) threads (using the excellent async library) and then just sleeps each thread for m seconds.
Well, that's easy enough:
$ cat PerTheadMem.hs
import Control.Concurrent (threadDelay)
import Control.Concurrent.Async (mapConcurrently)
import System.Environment (getArgs)

main = do
    args <- getArgs
    let (numThreads, sleep) = case args of
            numS:sleepS:[] -> (read numS :: Int, read sleepS :: Int)
            _ -> error "wrong args"
    mapConcurrently (\_ -> threadDelay (sleep*1000*1000)) [1..numThreads]
and first of all, let's compile and run it:
$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.10.1
$ ghc -rtsopts -O3 -prof -auto-all -caf-all PerTheadMem.hs
$ time ./PerTheadMem 100000 10 +RTS -sstderr
that should fork 100k threads and wait 10s in each and then print us some information:
$ time ./PerTheadMem 100000 10 +RTS -sstderr
340,942,368 bytes allocated in the heap
880,767,000 bytes copied during GC
164,702,328 bytes maximum residency (11 sample(s))
21,736,080 bytes maximum slop
350 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 648 colls, 0 par 0.373s 0.415s 0.0006s 0.0223s
Gen 1 11 colls, 0 par 0.298s 0.431s 0.0392s 0.1535s
INIT time 0.000s ( 0.000s elapsed)
MUT time 79.062s ( 92.803s elapsed)
GC time 0.670s ( 0.846s elapsed)
RP time 0.000s ( 0.000s elapsed)
PROF time 0.000s ( 0.000s elapsed)
EXIT time 0.065s ( 0.091s elapsed)
Total time 79.798s ( 93.740s elapsed)
%GC time 0.8% (0.9% elapsed)
Alloc rate 4,312,344 bytes per MUT second
Productivity 99.2% of total user, 84.4% of total elapsed
real 1m33.757s
user 1m19.799s
sys 0m2.260s
It took quite long (1m33.757s) given that each thread is only supposed to wait for 10s, but we built it non-threaded, so fair enough for now. All in all, we used 350 MB; that's not too bad, about 3.5 KB per thread, which seems plausible given that the initial stack size (-ki) is 1 KB.
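(If you wanted to experiment with a smaller initial stack, the RTS flag can be passed like this; just an illustration, not a run I benchmarked:)
$ time ./PerTheadMem 100000 10 +RTS -ki512 -sstderr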
Right, but now let's compile it in threaded mode and see if we can get any faster:
$ ghc -rtsopts -O3 -prof -auto-all -caf-all -threaded PerTheadMem.hs
$ time ./PerTheadMem 100000 10 +RTS -sstderr
3,996,165,664 bytes allocated in the heap
2,294,502,968 bytes copied during GC
3,443,038,400 bytes maximum residency (20 sample(s))
14,842,600 bytes maximum slop
3657 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 6435 colls, 0 par 0.860s 1.022s 0.0002s 0.0028s
Gen 1 20 colls, 0 par 2.206s 2.740s 0.1370s 0.3874s
TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.000s ( 0.001s elapsed)
MUT time 0.879s ( 8.534s elapsed)
GC time 3.066s ( 3.762s elapsed)
RP time 0.000s ( 0.000s elapsed)
PROF time 0.000s ( 0.000s elapsed)
EXIT time 0.074s ( 0.247s elapsed)
Total time 4.021s ( 12.545s elapsed)
Alloc rate 4,544,893,364 bytes per MUT second
Productivity 23.7% of total user, 7.6% of total elapsed
gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0
real 0m12.565s
user 0m4.021s
sys 0m1.154s
Wow, much faster, just 12s now, way better. From Activity Monitor I saw that it roughly used 4 OS threads for the 100k green threads, which makes sense.
However, 3657 MB total memory! That's 10x more than the non-threaded version used...
Up until now, I hadn't done any profiling with -prof or -hy or the like. To investigate a bit more, I then did some heap profiling (-hy) in separate runs. The memory usage didn't change in either case; the heap profiling graphs look interestingly different (left: non-threaded, right: threaded), but I can't find the reason for the 10x difference.
Diffing the profiling output (.prof files), I also can't find any real difference.
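For reference, the heap-profiling runs were along these lines (roughly; the exact invocations may have differed slightly):
$ ghc -rtsopts -O3 -prof -auto-all -caf-all -threaded PerTheadMem.hs
$ ./PerTheadMem 100000 10 +RTS -hy -sstderr
$ hp2ps -c PerTheadMem.hp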
Therefore my question: Where does the 10x difference in memory usage come from?
EDIT: Just to mention it: The same difference applies when the program isn't even compiled with profiling support. So running time ./PerTheadMem 100000 10 +RTS -sstderr with ghc -rtsopts -threaded -fforce-recomp PerTheadMem.hs is 3559 MB. And with ghc -rtsopts -fforce-recomp PerTheadMem.hs it's 395 MB.
EDIT 2: On Linux (GHC 7.10.2 on Linux 3.13.0-32-generic #57-Ubuntu SMP, x86_64) the same happens: non-threaded 460 MB in 1m28.538s and threaded 3483 MB in 12.604s. /usr/bin/time -v ... reports Maximum resident set size (kbytes): 413684 and Maximum resident set size (kbytes): 1645384 respectively.
EDIT 3: Also changed the program to use forkIO directly:
import Control.Concurrent (threadDelay, forkIO)
import Control.Concurrent.MVar
import Control.Monad (mapM_)
import System.Environment (getArgs)

main = do
    args <- getArgs
    let (numThreads, sleep) = case args of
            numS:sleepS:[] -> (read numS :: Int, read sleepS :: Int)
            _ -> error "wrong args"
    mvar <- newEmptyMVar
    mapM_ (\_ -> forkIO $ threadDelay (sleep*1000*1000) >> putMVar mvar ())
          [1..numThreads]
    mapM_ (\_ -> takeMVar mvar) [1..numThreads]
And it doesn't change anything: non-threaded: 152 MB, threaded: 3308 MB.
IMHO, the culprit is threadDelay. threadDelay uses a lot of memory. Here is a program equivalent to yours that behaves better with memory. It ensures that all the threads are running concurrently by having a long-running computation.
import Control.Concurrent (forkIO)
import Control.Concurrent.Chan
import System.Environment (getArgs)

uBound = 38
lBound = 34

doSomething :: Integer -> Integer
doSomething 0 = 1
doSomething 1 = 1
doSomething n | n < uBound && n > 0 =
                  let a = doSomething (n-1)
                      b = doSomething (n-2)
                  in a `seq` b `seq` (a + b)
              | otherwise = doSomething (n `mod` uBound)

e :: Chan Integer -> Int -> IO ()
e chan i = do
    let y = doSomething . fromIntegral $ lBound + (fromIntegral i `mod` (uBound - lBound))
    y `seq` writeChan chan y

main = do
    args <- getArgs
    let (numThreads, sleep) = case args of
            numS:sleepS:[] -> (read numS :: Int, read sleepS :: Int)
            _ -> error "wrong args"
        dld = sleep*1000*1000
    chan <- newChan
    mapM_ (\i -> forkIO $ e chan i) [1..numThreads]
    putStrLn "All threads created"
    mapM_ (\_ -> readChan chan >>= putStrLn . show) [1..numThreads]
    putStrLn "All read"
And here are the timing statistics:
$ ghc -rtsopts -O -threaded test.hs
$ ./test 200 10 +RTS -sstderr -N4
133,541,985,480 bytes allocated in the heap
176,531,576 bytes copied during GC
356,384 bytes maximum residency (16 sample(s))
94,256 bytes maximum slop
4 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 64246 colls, 64246 par 1.185s 0.901s 0.0000s 0.0274s
Gen 1 16 colls, 15 par 0.004s 0.002s 0.0001s 0.0002s
Parallel GC work balance: 65.96% (serial 0%, perfect 100%)
TASKS: 10 (1 bound, 9 peak workers (9 total), using -N4)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.000s ( 0.003s elapsed)
MUT time 63.747s ( 16.333s elapsed)
GC time 1.189s ( 0.903s elapsed)
EXIT time 0.001s ( 0.000s elapsed)
Total time 64.938s ( 17.239s elapsed)
Alloc rate 2,094,861,384 bytes per MUT second
Productivity 98.2% of total user, 369.8% of total elapsed
gc_alloc_block_sync: 98548
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 2
Maximum residency is at around 1.5 kb per thread. I played a bit with the number of threads and the running length of the computation. Since threads start doing stuff immediately after forkIO, creating 100000 threads actually takes a very long time. But the results held for 1000 threads.
Here is another program where threadDelay has been "factored out", this one doesn't use any CPU and can be executed easily with 100000 threads:
import Control.Concurrent (forkIO, threadDelay)
import Control.Concurrent.MVar
import System.Environment (getArgs)

e :: MVar () -> MVar () -> IO ()
e start end = do
    takeMVar start
    putMVar end ()

main = do
    args <- getArgs
    let (numThreads, sleep) = case args of
            numS:sleepS:[] -> (read numS :: Int, read sleepS :: Int)
            _ -> error "wrong args"
    starts <- mapM (const newEmptyMVar) [1..numThreads]
    ends   <- mapM (const newEmptyMVar) [1..numThreads]
    mapM_ (\(start, end) -> forkIO $ e start end) (zip starts ends)
    mapM_ (\start -> putMVar start ()) starts
    putStrLn "All threads created"
    threadDelay (sleep * 1000 * 1000)
    mapM_ (\end -> takeMVar end) ends
    putStrLn "All done"
And the results:
129,270,632 bytes allocated in the heap
404,154,872 bytes copied during GC
77,844,160 bytes maximum residency (10 sample(s))
10,929,688 bytes maximum slop
165 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 128 colls, 128 par 0.178s 0.079s 0.0006s 0.0152s
Gen 1 10 colls, 9 par 0.367s 0.137s 0.0137s 0.0325s
Parallel GC work balance: 50.09% (serial 0%, perfect 100%)
TASKS: 10 (1 bound, 9 peak workers (9 total), using -N4)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.000s ( 0.001s elapsed)
MUT time 0.189s ( 10.094s elapsed)
GC time 0.545s ( 0.217s elapsed)
EXIT time 0.001s ( 0.002s elapsed)
Total time 0.735s ( 10.313s elapsed)
Alloc rate 685,509,460 bytes per MUT second
Productivity 25.9% of total user, 1.8% of total elapsed
On my i5, it takes less than one second to create the 100000 threads and put the "start" mvar. The peak residency is at around 778 bytes per thread, not bad at all!
Checking threadDelay's implementation, we see that it is effectively different for the threaded and unthreaded case:
https://hackage.haskell.org/package/base-4.8.1.0/docs/src/GHC.Conc.IO.html#threadDelay
Then here: https://hackage.haskell.org/package/base-4.8.1.0/docs/src/GHC.Event.TimerManager.html
which looks innocent enough. But an older version of base has an arcane spelling of (memory) doom for those that invoke threadDelay:
https://hackage.haskell.org/package/base-4.4.0.0/docs/src/GHC-Event-Manager.html#line-121
Whether there is still an issue or not is hard to say. However, one can always hope that a "real life" concurrent program won't need to have too many threads waiting on threadDelay at the same time. I, for one, will keep an eye on my usage of threadDelay from now on.
I have been coding a histogram and I have had some great help on here. I have been coding my histogram using a hash table to store the keys and frequency values because the distribution of the keys is unknown, so they might not be sorted or consecutive.
The problem with my code is that it spends too much time in GC, so it looks like a space leak: the time spent in GC is 60.3%, which makes my productivity a poor 39.7%.
What is going wrong? I have tried to make things strict in the histogram function and I've also inlined it (GC time went from 69.1% to 59.4%).
Please note I have simplified this code by not updating the frequencies in the HT.
{-# LANGUAGE BangPatterns #-}

import qualified Data.HashTable.IO as H
import qualified Data.Vector as V

type HashTable k v = H.BasicHashTable k v

n :: Int
n = 5000000

kv :: V.Vector (Int,Int)
kv = V.zip k v
  where
    k = V.generate n (\i -> i `mod` 10)
    v = V.generate n (\i -> 1)

histogram :: V.Vector (Int,Int) -> Int -> IO (H.CuckooHashTable Int Int)
histogram vec !n = do
    ht <- H.newSized n
    go ht (n-1)
  where
    go ht = go'
      where
        go' (-1) = return ht
        go' !i = do
            let (k,v) = vec V.! i
            H.insert ht k v
            go' (i-1)
{-# INLINE histogram #-}

main :: IO ()
main = do
    ht <- histogram kv n
    putStrLn "done"
Here's how it is compiled:
ghc --make -O3 -fllvm -rtsopts histogram.hs
Diagnosis:
jap#devbox:~/dev$ ./histogram +RTS -sstderr
done
863,187,472 bytes allocated in the heap
708,960,048 bytes copied during GC
410,476,592 bytes maximum residency (5 sample(s))
4,791,736 bytes maximum slop
613 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 1284 colls, 0 par 0.46s 0.46s 0.0004s 0.0322s
Gen 1 5 colls, 0 par 0.36s 0.36s 0.0730s 0.2053s
INIT time 0.00s ( 0.00s elapsed)
MUT time 0.51s ( 0.50s elapsed)
GC time 0.82s ( 0.82s elapsed)
EXIT time 0.03s ( 0.04s elapsed)
Total time 1.36s ( 1.36s elapsed)
%GC time 60.3% (60.4% elapsed)
Alloc rate 1,708,131,822 bytes per MUT second
Productivity 39.7% of total user, 39.7% of total elapsed
For the sake of comparison, this is what I get running your code as posted:
863,187,472 bytes allocated in the heap
708,960,048 bytes copied during GC
410,476,592 bytes maximum residency (5 sample(s))
4,791,736 bytes maximum slop
613 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 1284 colls, 0 par 1.01s 1.01s 0.0008s 0.0766s
Gen 1 5 colls, 0 par 0.81s 0.81s 0.1626s 0.4783s
INIT time 0.00s ( 0.00s elapsed)
MUT time 1.04s ( 1.04s elapsed)
GC time 1.82s ( 1.82s elapsed)
EXIT time 0.04s ( 0.04s elapsed)
Total time 2.91s ( 2.91s elapsed)
%GC time 62.6% (62.6% elapsed)
Alloc rate 827,493,210 bytes per MUT second
Productivity 37.4% of total user, 37.4% of total elapsed
Given that your vector elements are just (Int, Int) tuples, we have no reason not to use Data.Vector.Unboxed instead of plain Data.Vector. That already leads to significant improvement:
743,148,592 bytes allocated in the heap
38,440 bytes copied during GC
231,096,768 bytes maximum residency (4 sample(s))
4,759,104 bytes maximum slop
226 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 977 colls, 0 par 0.23s 0.23s 0.0002s 0.0479s
Gen 1 4 colls, 0 par 0.22s 0.22s 0.0543s 0.1080s
INIT time 0.00s ( 0.00s elapsed)
MUT time 1.04s ( 1.04s elapsed)
GC time 0.45s ( 0.45s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 1.49s ( 1.49s elapsed)
%GC time 30.2% (30.2% elapsed)
Alloc rate 715,050,070 bytes per MUT second
Productivity 69.8% of total user, 69.9% of total elapsed
Next, instead of hand-rolling recursion over the vector, we might use the optimised functions the vector library provides for that purpose. Code...
import qualified Data.HashTable.IO as H
import qualified Data.Vector.Unboxed as V

n :: Int
n = 5000000

kv :: V.Vector (Int,Int)
kv = V.zip k v
  where
    k = V.generate n (\i -> i `mod` 10)
    v = V.generate n (\i -> 1)

histogram :: V.Vector (Int,Int) -> Int -> IO (H.CuckooHashTable Int Int)
histogram vec n = do
    ht <- H.newSized n
    V.mapM_ (\(k, v) -> H.insert ht k v) vec
    return ht
{-# INLINE histogram #-}

main :: IO ()
main = do
    ht <- histogram kv n
    putStrLn "done"
... and result:
583,151,048 bytes allocated in the heap
35,632 bytes copied during GC
151,096,672 bytes maximum residency (3 sample(s))
3,003,040 bytes maximum slop
148 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 826 colls, 0 par 0.20s 0.20s 0.0002s 0.0423s
Gen 1 3 colls, 0 par 0.12s 0.12s 0.0411s 0.1222s
INIT time 0.00s ( 0.00s elapsed)
MUT time 0.92s ( 0.92s elapsed)
GC time 0.32s ( 0.33s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 1.25s ( 1.25s elapsed)
%GC time 25.9% (26.0% elapsed)
Alloc rate 631,677,209 bytes per MUT second
Productivity 74.1% of total user, 74.0% of total elapsed
81MB saved, not bad at all. Can we do even better?
A heap profile (which should be the first thing you think of when having memory consumption woes - debugging them without one is shooting in the dark) will reveal that, even with the original code, peak memory consumption happens very early on. Strictly speaking we do not have a leak; we just spend a lot of memory from the beginning. Now, note that the hash table is created with ht <- H.newSized n, with n = 5000000. Unless you expect to have so many different keys (as opposed to elements), that is extremely wasteful. Changing the initial size to 10 (the number of keys you actually have in your test) improves things dramatically:
432,059,960 bytes allocated in the heap
50,200 bytes copied during GC
44,416 bytes maximum residency (2 sample(s))
25,216 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 825 colls, 0 par 0.01s 0.01s 0.0000s 0.0000s
Gen 1 2 colls, 0 par 0.00s 0.00s 0.0002s 0.0003s
INIT time 0.00s ( 0.00s elapsed)
MUT time 0.90s ( 0.90s elapsed)
GC time 0.01s ( 0.01s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 0.91s ( 0.90s elapsed)
%GC time 0.6% (0.6% elapsed)
Alloc rate 481,061,802 bytes per MUT second
Productivity 99.4% of total user, 99.4% of total elapsed
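(In code, that improvement is just the construction of the table; a sketch of the changed definition, assuming the same imports as before:)
-- Size the table for the number of distinct keys, not the number of elements.
histogram :: V.Vector (Int,Int) -> Int -> IO (H.CuckooHashTable Int Int)
histogram vec _ = do
    ht <- H.newSized 10
    V.mapM_ (\(k, v) -> H.insert ht k v) vec
    return ht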
Finally, we might as well make our life simpler and try using the pure, yet efficient, hash map from unordered-containers. Code...
import qualified Data.HashMap.Strict as M
import qualified Data.Vector.Unboxed as V

n :: Int
n = 5000000

kv :: V.Vector (Int,Int)
kv = V.zip k v
  where
    k = V.generate n (\i -> i `mod` 10)
    v = V.generate n (\i -> 1)

histogram :: V.Vector (Int,Int) -> M.HashMap Int Int
histogram vec = V.foldl' (\ht (k, v) -> M.insert k v ht) M.empty vec

main :: IO ()
main = do
    print $ M.size $ histogram kv
    putStrLn "done"
... and result.
55,760 bytes allocated in the heap
3,512 bytes copied during GC
44,416 bytes maximum residency (1 sample(s))
17,024 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 0 colls, 0 par 0.00s 0.00s 0.0000s 0.0000s
Gen 1 1 colls, 0 par 0.00s 0.00s 0.0002s 0.0002s
INIT time 0.00s ( 0.00s elapsed)
MUT time 0.34s ( 0.34s elapsed)
GC time 0.00s ( 0.00s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 0.34s ( 0.34s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 162,667 bytes per MUT second
Productivity 99.9% of total user, 100.0% of total elapsed
~60% faster. It remains to be seen how it would scale with a larger number of keys, but with your test data unordered-containers ends up being not only more convenient (pure functions; actually updating the histogram values only takes changing M.insert to M.insertWith, as sketched right below) but also faster.
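For illustration, the variant that actually accumulates counts (which the question simplified away) could look like this; histogramCounts is just my name for it:
import qualified Data.HashMap.Strict as M
import qualified Data.Vector.Unboxed as V

-- Frequency-counting variant: insertWith (+) adds v to any existing count
-- for k instead of overwriting it.
histogramCounts :: V.Vector (Int, Int) -> M.HashMap Int Int
histogramCounts = V.foldl' (\ht (k, v) -> M.insertWith (+) k v ht) M.empty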
I am quite new to Haskell and I want to create a histogram. I am using Data.Vector.Unboxed to fuse operations on the data, which is blazing fast (when compiled with -O -fllvm); the bottleneck is my fold application, which aggregates the bucket counts.
How can I make it faster? I read about reducing the number of thunks by keeping things strict, so I've made things strict using seq and foldr', but I'm not seeing much of a performance increase. Your ideas are strongly encouraged.
import qualified Data.Vector.Unboxed as V

histogram :: [(Int,Int)]
histogram = V.foldr' agg [] $ V.zip k v
  where
    n = 10000000
    c = 1000000
    k = V.generate n (\i -> i `div` c * c)
    v = V.generate n (\i -> 1)
    agg kv [] = [kv]
    agg kv@(k,v) acc@((ck,cv):as)
      | k == ck   = let a = (ck,cv+v):as in a `seq` a
      | otherwise = let a = kv:acc in a `seq` a

main :: IO ()
main = print histogram
Compiled with:
ghc --make -O -fllvm histogram.hs
First, compile the program with -O2 -rtsopts. Then, to get a first idea where you could optimize, run the program with the options +RTS -sstderr:
$ ./question +RTS -sstderr
[(0,1000000),(1000000,1000000),(2000000,1000000),(3000000,1000000),(4000000,1000000),(5000000,1000000),(6000000,1000000),(7000000,1000000),(8000000,1000000),(9000000,1000000)]
1,193,907,224 bytes allocated in the heap
1,078,027,784 bytes copied during GC
282,023,968 bytes maximum residency (7 sample(s))
86,755,184 bytes maximum slop
763 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 1964 colls, 0 par 3.99s 4.05s 0.0021s 0.0116s
Gen 1 7 colls, 0 par 1.60s 1.68s 0.2399s 0.6665s
INIT time 0.00s ( 0.00s elapsed)
MUT time 2.67s ( 2.68s elapsed)
GC time 5.59s ( 5.73s elapsed)
EXIT time 0.02s ( 0.03s elapsed)
Total time 8.29s ( 8.43s elapsed)
%GC time 67.4% (67.9% elapsed)
Alloc rate 446,869,876 bytes per MUT second
Productivity 32.6% of total user, 32.0% of total elapsed
Notice that 67% of your time is spent in GC! There is clearly something wrong. To find out what is wrong, we can run the program with heap profiling enabled (using +RTS -h), which produces the following figure:
So, you're leaking thunks. How does this happen? Looking at the code, the only place where a thunk is built up (recursively) in agg is in the addition. Making cv strict by adding a bang pattern thus fixes the issue:
{-# LANGUAGE BangPatterns #-}

import qualified Data.Vector.Unboxed as V

histogram :: [(Int,Int)]
histogram = V.foldr' agg [] $ V.zip k v
  where
    n = 10000000
    c = 1000000
    k = V.generate n (\i -> i `div` c * c)
    v = V.generate n id
    agg kv [] = [kv]
    agg kv@(k,v) acc@((ck,!cv):as) -- Note the !
      | k == ck   = (ck,cv+v):as
      | otherwise = kv:acc

main :: IO ()
main = print histogram
Output:
$ time ./improved +RTS -sstderr
[(0,499999500000),(1000000,1499999500000),(2000000,2499999500000),(3000000,3499999500000),(4000000,4499999500000),(5000000,5499999500000),(6000000,6499999500000),(7000000,7499999500000),(8000000,8499999500000),(9000000,9499999500000)]
672,063,056 bytes allocated in the heap
94,664 bytes copied during GC
160,028,816 bytes maximum residency (2 sample(s))
1,464,176 bytes maximum slop
155 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 992 colls, 0 par 0.03s 0.03s 0.0000s 0.0001s
Gen 1 2 colls, 0 par 0.03s 0.03s 0.0161s 0.0319s
INIT time 0.00s ( 0.00s elapsed)
MUT time 1.24s ( 1.25s elapsed)
GC time 0.06s ( 0.06s elapsed)
EXIT time 0.03s ( 0.03s elapsed)
Total time 1.34s ( 1.34s elapsed)
%GC time 4.4% (4.5% elapsed)
Alloc rate 540,674,868 bytes per MUT second
Productivity 95.5% of total user, 95.1% of total elapsed
./improved +RTS -sstderr 1,14s user 0,20s system 99% cpu 1,352 total
This is much better.
So now you could ask: why did the issue appear, even though you used seq? The reason is that seq only forces its first argument to WHNF, and for a pair, (_,_) (where the _ are unevaluated thunks) is already in WHNF! Also, seq a a is the same as a, because seq a b (informally) means: evaluate a before b is evaluated, so seq a a just means: evaluate a before a is evaluated, and that is the same as just evaluating a!
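A tiny example of that point: the pair constructor is already in WHNF even though its fields are bottoms, so this program prints "ok" instead of crashing.
-- seq forces the pair only to its constructor; the undefined fields are
-- never touched.
main :: IO ()
main = (undefined :: Int, undefined :: Int) `seq` putStrLn "ok"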
I am trying Euler challenge 14. I was wondering if I could have any tips for calculating it quickly in Haskell. I tried this naive approach:
import Data.List
import Data.Function

collatz n | even n    = n `quot` 2
          | otherwise = 3*n+1

colSeq = takeWhile (/= 1) . iterate collatz

main = print $ maximumBy (compare `on` (length . colSeq)) [1..999999]
But that took too long.
λ <*Main System.Timeout>: timeout (10^6*60) main
Nothing
I also tried using the reverse Collatz relation and keeping the lengths in a map to eliminate redundant calculations, but that didn't work either. I don't want the solution, but does anyone have some mathematical literature or programming technique that will make this quicker, or do I just have to leave it running overnight?
Your program is not as slow as you might think…
First of all, your program runs fine and finishes in under two minutes if you compile with -O2 and increase the stack size (I used +RTS -K100m, but your system might vary):
$ .\collatz.exe +RTS -K100m -s
65,565,993,768 bytes allocated in the heap
16,662,910,752 bytes copied during GC
77,042,796 bytes maximum residency (1129 sample(s))
5,199,140 bytes maximum slop
184 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 124724 colls, 0 par 18.41s 18.19s 0.0001s 0.0032s
Gen 1 1129 colls, 0 par 16.67s 16.34s 0.0145s 0.1158s
INIT time 0.00s ( 0.00s elapsed)
MUT time 39.98s ( 41.17s elapsed)
GC time 35.08s ( 34.52s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 75.06s ( 75.69s elapsed)
%GC time 46.7% (45.6% elapsed)
Alloc rate 1,639,790,387 bytes per MUT second
Productivity 53.3% of total user, 52.8% of total elapsed
…but that's still slow
Productivity of ~50% means that the GC is running for half the time we're staring at the screen, waiting for our result. In our case we create too much garbage by iterating the sequence for every value.
Improvements
The Collatz sequence is a recursive sequence. Therefore, we should define it as a recursive sequence instead of an iterative one and have a look at what happens.
colSeq 1 = [1]
colSeq n
  | even n    = n : colSeq (n `div` 2)
  | otherwise = n : colSeq (3 * n + 1)
The list is a fundamental type in Haskell, so GHC should have some nifty optimizations for it (-O2). So let's try this one:
Result
$ .\collatz_rec.exe +RTS -s
37,491,417,368 bytes allocated in the heap
4,288,084 bytes copied during GC
41,860 bytes maximum residency (2 sample(s))
19,580 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 72068 colls, 0 par 0.22s 0.22s 0.0000s 0.0001s
Gen 1 2 colls, 0 par 0.00s 0.00s 0.0001s 0.0001s
INIT time 0.00s ( 0.00s elapsed)
MUT time 32.89s ( 33.12s elapsed)
GC time 0.22s ( 0.22s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 33.11s ( 33.33s elapsed)
%GC time 0.7% (0.7% elapsed)
Alloc rate 1,139,881,573 bytes per MUT second
Productivity 99.3% of total user, 98.7% of total elapsed
Note that we're now up to 99% productivity in ~80% MUT time (compared to the original version). Just by this small change we decreased the runtime tremendously.
Wait, there's more!
There's one thing that's rather strange. Why are we calculating the sequence length for both 1024 and 512? After all, the latter cannot produce a longer Collatz sequence.
Improvements
However, in this case we must see the problem as one big task, and not as a map over individual numbers. We need to keep track of the values we have already calculated, and we want to remove those we have already visited.
We use Data.Set for this:
import qualified Data.Set as S

problem_14 :: S.Set Integer -> [(Integer, Integer)]
problem_14 s
  | S.null s  = []
  | otherwise = (c, fromIntegral $ length csq) : problem_14 rest
  where
    (c, rest') = S.deleteFindMin s
    csq  = colSeq c
    rest = rest' `S.difference` S.fromList csq
And we use problem_14 like that:
main = print $ maximumBy (compare `on` snd) $ problem_14 $ S.fromList [1..999999]
Result
$ .\collatz_set.exe +RTS -s
18,405,282,060 bytes allocated in the heap
1,645,842,328 bytes copied during GC
27,446,972 bytes maximum residency (40 sample(s))
373,056 bytes maximum slop
79 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 35193 colls, 0 par 2.17s 2.03s 0.0001s 0.0002s
Gen 1 40 colls, 0 par 0.84s 0.77s 0.0194s 0.0468s
INIT time 0.00s ( 0.00s elapsed)
MUT time 14.91s ( 15.17s elapsed)
GC time 3.02s ( 2.81s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 17.92s ( 17.98s elapsed)
%GC time 16.8% (15.6% elapsed)
Alloc rate 1,234,735,903 bytes per MUT second
Productivity 83.2% of total user, 82.9% of total elapsed
We lose some productivity, but that's reasonable. After all, we're now using a Set instead of a list, and use 79 MB instead of 1 MB. However, our program now runs in 17s instead of 34s; that's only 25% of the original time.
Using ST
Inspiration (C++)
#include <iostream>
#include <vector>

int main(){
    std::vector<bool> Q(1000000,true);
    unsigned long long max_l = 0, max_c = 1;
    for(unsigned long i = 1; i < Q.size(); ++i){
        if(!Q[i])
            continue;
        unsigned long long c = i, l = 0;
        while(c != 1){
            if(c < Q.size()) Q[c] = false;
            c = c % 2 == 0 ? c / 2 : 3 * c + 1;
            l++;
        }
        if(l > max_l){
            max_l = l;
            max_c = i;
        }
    }
    std::cout << max_c << std::endl;
}
This program runs in 130 ms. Our best version so far needs 100 times as long. We can fix that.
Haskell
import Control.Monad (forM_, when)
import Control.Monad.ST (runST)
import Data.STRef
import qualified Data.Vector.Unboxed.Mutable as V

problem_14_vector_st :: Int -> (Int, Int)
problem_14_vector_st limit =
    runST $ do
        q    <- V.replicate (limit+1) True
        best <- newSTRef (1,1)
        forM_ [1..limit] $ \i -> do
            b <- V.read q i
            when b $ do
                -- colSeq is the recursive Collatz sequence defined above
                let csq = colSeq $ fromIntegral i
                let l   = fromIntegral $ length csq
                forM_ (map fromIntegral csq) $ \j ->
                    when (j <= limit && j >= 0) $ V.write q j False
                m <- fmap snd $ readSTRef best
                when (l > m) $ writeSTRef best (i,l)
        readSTRef best
Result
$ collatz_vector_st.exe +RTS -s
2,762,282,216 bytes allocated in the heap
10,021,016 bytes copied during GC
1,026,580 bytes maximum residency (2 sample(s))
21,684 bytes maximum slop
2 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 5286 colls, 0 par 0.02s 0.02s 0.0000s 0.0000s
Gen 1 2 colls, 0 par 0.00s 0.00s 0.0001s 0.0001s
INIT time 0.00s ( 0.00s elapsed)
MUT time 3.09s ( 3.08s elapsed)
GC time 0.02s ( 0.02s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 3.11s ( 3.11s elapsed)
%GC time 0.5% (0.7% elapsed)
Alloc rate 892,858,898 bytes per MUT second
Productivity 99.5% of total user, 99.6% of total elapsed
~3 seconds. Someone else might know more tricks, but that's the most I could squeeze out of Haskell.
Caching the value of integers you've already hit will save you a lot of time. If you toss in the number 1234 and find that it takes 273 steps to get to 1, associate the values: 1234 -> 273.
Now if you ever hit 1234 while in a sequence, you don't have to take 273 more steps to find the answer, just add 273 to your current number and you know the length of the sequence.
Do this for every number you calculate, even the ones in the middle of a sequence. For example, if you are at 1234 and you don't have a value yet, do the step (divide by 2) and calculate and cache the value for 617. You cache almost all the important values really quick this way. There are some really long chains that you'll end up on again and again.
The easiest way to cache all the values as you go is to make a recursive function, like this (in pseudo-code):
function collatz(number) {
    if number is 1: return 1
    else if number is in cache: return cached value
    else perform step: newnumber = div 2 if even, times 3 + 1 if odd
    steps = collatz(newnumber) + 1  // +1 for the step we just took
    cache steps as the result for number
    return steps
}
Hopefully Haskell won't have problems with the recursion depth you'll end up with like this. However, if Haskell doesn't like it, you can implement the same thing with a stack; it is just less intuitive.
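To illustrate, here is a rough Haskell rendering of that pseudo-code, using a Map in the State monad as the cache (a sketch; the names and the mtl-based State are my own choices):
import qualified Data.Map.Strict as Map
import Control.Monad.State.Strict (State, evalState, gets, modify)

-- Length of the Collatz sequence for n, caching every length we compute.
collatzLen :: Integer -> State (Map.Map Integer Int) Int
collatzLen 1 = return 1
collatzLen n = do
    cached <- gets (Map.lookup n)
    case cached of
        Just len -> return len
        Nothing  -> do
            let next = if even n then n `quot` 2 else 3 * n + 1
            len <- (+ 1) <$> collatzLen next
            modify (Map.insert n len)
            return len

main :: IO ()
main = print $ evalState (maximum <$> mapM lenOf [1 .. 999999]) Map.empty
  where
    lenOf n = (\len -> (len, n)) <$> collatzLen n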
The main source of time and memory issues is that you build the whole Collatz sequences, whereas for the task you only need their lengths, and unfortunately the laziness doesn't save the day. The simple solution calculating only lengths runs in a few seconds:
simpleCol :: Integer -> Int
simpleCol 1 = 1
simpleCol x | even x    = 1 + simpleCol (x `quot` 2)
            | otherwise = 1 + simpleCol (3 * x + 1)

problem14 = maximum $ map simpleCol [1 .. 999999]
It also takes much less memory and doesn't need enlarged stack:
$> ./simpleCollatz +RTS -s
simpleCollatz +RTS -s
2,517,321,124 bytes allocated in the heap
217,468 bytes copied during GC
41,860 bytes maximum residency (2 sample(s))
19,580 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 4804 colls, 0 par 0.00s 0.02s 0.0000s 0.0046s
Gen 1 2 colls, 0 par 0.00s 0.00s 0.0001s 0.0001s
INIT time 0.00s ( 0.00s elapsed)
MUT time 4.47s ( 4.49s elapsed)
GC time 0.00s ( 0.02s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 4.47s ( 4.52s elapsed)
%GC time 0.0% (0.5% elapsed)
Alloc rate 563,316,615 bytes per MUT second
Productivity 100.0% of total user, 98.9% of total elapsed
To implement the proposed caching solution, there is a nifty technique called memoization. Arguably the easiest way to use it is to install the memoize package:
import Data.Function.Memoize

memoCol :: Integer -> Int
memoCol = memoFix mc
  where
    mc _ 1 = 1
    mc f x | even x    = 1 + f (x `quot` 2)
           | otherwise = 1 + f (3 * x + 1)
This cuts down both the runtime and the memory usage, but also heavily uses the GC in order to maintain the cached values:
$> ./memoCollatz +RTS -s
memoCollatz +RTS -s
1,577,954,668 bytes allocated in the heap
1,056,591,780 bytes copied during GC
303,942,300 bytes maximum residency (12 sample(s))
341,468 bytes maximum slop
616 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 3003 colls, 0 par 1.11s 1.19s 0.0004s 0.0010s
Gen 1 12 colls, 0 par 3.48s 3.65s 0.3043s 1.7065s
INIT time 0.00s ( 0.00s elapsed)
MUT time 7.55s ( 7.50s elapsed)
GC time 4.59s ( 4.84s elapsed)
EXIT time 0.00s ( 0.05s elapsed)
Total time 12.14s ( 12.39s elapsed)
%GC time 37.8% (39.1% elapsed)
Alloc rate 209,087,160 bytes per MUT second
Productivity 62.2% of total user, 60.9% of total elapsed
Make sure you use Integer instead of Int, because of Int32 overflow, which causes issues with the recursion.
collatz :: Integer -> Integer
I'm currently trying to understand how to program in parallel in Haskell. I'm following the paper "A Tutorial on Parallel and Concurrent Programming in Haskell" by Simon Peyton Jones and Satnam Singh. The source code are as followed:
module Main where

import Control.Parallel
import System.Time

main :: IO ()
main = do
    putStrLn "Starting computation....."
    t0 <- getClockTime
    pseq r1 (return ())
    t1 <- getClockTime
    putStrLn ("sum: " ++ show r1)
    putStrLn ("time: " ++ show (secDiff t0 t1) ++ " seconds")
    putStrLn "Finish."

r1 :: Int
r1 = parSumFibEuler 38 5300

-- This is the Fibonacci number generator
fib :: Int -> Int
fib 0 = 0
fib 1 = 1
fib n = fib (n-1) + fib (n-2)

-- Gets the euler sum
mkList :: Int -> [Int]
mkList n = [1..n-1]

relprime :: Int -> Int -> Bool
relprime x y = gcd x y == 1

euler :: Int -> Int
euler n = length $ filter (relprime n) (mkList n)

sumEuler :: Int -> Int
sumEuler = sum . map euler . mkList

-- Gets the sum of Euler and Fibonacci (NORMAL)
sumFibEuler :: Int -> Int -> Int
sumFibEuler a b = fib a + sumEuler b

-- Gets the sum of Euler and Fibonacci (PARALLEL)
parSumFibEuler :: Int -> Int -> Int
parSumFibEuler a b =
    f `par` (e `pseq` (f + e))
  where
    f = fib a
    e = sumEuler b

-- Measure time
secDiff :: ClockTime -> ClockTime -> Float
secDiff (TOD secs1 psecs1) (TOD secs2 psecs2)
    = fromInteger (psecs2 - psecs1) / 1e12 + fromInteger (secs2 - secs1)
I compiled it with the following command:
ghc --make -threaded Main.hs
a) Ran it using 1 core:
./Main +RTS -N1
b) Ran it using 2 core:
./Main +RTS -N2
However, the one-core run took 53.556 sec, whereas the two-core run took 73.401 sec. I don't understand how 2 cores can actually run slower than 1 core. Maybe the message passing overhead is too big for this small program? The paper has completely different outcomes compared to mine. Following are the output details.
For 1 core:
Starting computation.....
sum: 47625790
time: 53.556335 seconds
Finish.
17,961,210,216 bytes allocated in the heap
12,595,880 bytes copied during GC
176,536 bytes maximum residency (3 sample(s))
23,904 bytes maximum slop
2 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 34389 colls, 0 par 2.54s 2.57s 0.0001s 0.0123s
Gen 1 3 colls, 0 par 0.00s 0.00s 0.0007s 0.0010s
Parallel GC work balance: -nan (0 / 0, ideal 1)
MUT time (elapsed) GC time (elapsed)
Task 0 (worker) : 0.00s ( 0.00s) 0.00s ( 0.00s)
Task 1 (worker) : 0.00s ( 53.56s) 0.00s ( 0.00s)
Task 2 (bound) : 50.49s ( 50.99s) 2.52s ( 2.57s)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 50.47s ( 50.99s elapsed)
GC time 2.54s ( 2.57s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 53.02s ( 53.56s elapsed)
Alloc rate 355,810,305 bytes per MUT second
Productivity 95.2% of total user, 94.2% of total elapsed
gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0
For 2 core:
Starting computation.....
sum: 47625790
time: 73.401146 seconds
Finish.
17,961,210,256 bytes allocated in the heap
12,558,088 bytes copied during GC
176,536 bytes maximum residency (3 sample(s))
195,936 bytes maximum slop
3 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 34389 colls, 34388 par 7.42s 4.73s 0.0001s 0.0205s
Gen 1 3 colls, 3 par 0.01s 0.00s 0.0011s 0.0017s
Parallel GC work balance: 1.00 (1432193 / 1429197, ideal 2)
MUT time (elapsed) GC time (elapsed)
Task 0 (worker) : 1.19s ( 40.26s) 16.95s ( 33.15s)
Task 1 (worker) : 0.00s ( 73.40s) 0.00s ( 0.00s)
Task 2 (bound) : 54.50s ( 68.67s) 3.66s ( 4.73s)
Task 3 (worker) : 0.00s ( 73.41s) 0.00s ( 0.00s)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 68.87s ( 68.67s elapsed)
GC time 7.43s ( 4.73s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 76.31s ( 73.41s elapsed)
Alloc rate 260,751,318 bytes per MUT second
Productivity 90.3% of total user, 93.8% of total elapsed
gc_alloc_block_sync: 12254
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0
r1 = sumFibEuler 38 5300
I believe that you meant
r1 = parSumFibEuler 38 5300
On my configuration (with parSumFibEuler 45 8000 and with only one run):
When N1 = 126.83s
When N2 = 115.46s
I suspect the fib function is much more CPU-consuming than sumEuler. That would explain the low improvement with -N2: there isn't much work-stealing to be done in your situation.
With memoization, your Fibonacci function would perform much better, but I don't think that's what you wanted to try.
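(For reference, the usual trick for a memoized Fibonacci is to share a lazily built list; a sketch:)
-- fibs is built once and shared, so fibMemo n costs O(n) additions
-- instead of the exponential blow-up of the naive definition.
fibs :: [Int]
fibs = 0 : 1 : zipWith (+) fibs (tail fibs)

fibMemo :: Int -> Int
fibMemo n = fibs !! n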
EDIT: as mentioned in the comments, I think that with -N2 you have a lot of interruptions since you have two cores available.
Example on my configuration (4 cores) with sum $ parMap rdeepseq (fib) [1..40]
with -N1 it takes ~26s
with -N2 it takes ~16s
with -N3 it takes ~13s
with -N4 it takes ~30s (well, that Haskell program is not alone here)
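For completeness, the full program behind those rough timings looks more or less like this (a sketch; the exact code I ran may have differed slightly):
import Control.Parallel.Strategies (parMap, rdeepseq)

fib :: Int -> Int
fib 0 = 0
fib 1 = 1
fib n = fib (n - 1) + fib (n - 2)

-- Compile with: ghc -O2 -threaded -rtsopts, run with +RTS -N<cores>
main :: IO ()
main = print $ sum $ parMap rdeepseq fib [1 .. 40]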
From here:
Be careful when using all the processors in your machine: if some of
your processors are in use by other programs, this can actually harm
performance rather than improve it.