I'm getting a "Heap exhausted" message when running the following short Haskell program on a big enough dataset. For example, the program fails (with heap overflow) on a 20 MB input file with around 900k lines. The heap size was set (through -with-rtsopts) to 1 GB. It runs fine if longestCommonSubstrB is defined as something simpler, e.g. commonPrefix. I need to process files on the order of 100 MB.
I compiled the program with the following command line (GHC 7.8.3):
ghc -Wall -O2 -prof -fprof-auto "-with-rtsopts=-M512M -p -s -h -i0.1" SampleB.hs
I would appreciate any help in making this run in a reasonable amount of space (on the order of the input file size), but I would especially appreciate the thought process for finding where the bottleneck is and where and how to force strictness.
My guess is that somehow forcing the longestCommonSubstrB function to evaluate strictly would solve the problem, but I don't know how to do that.
{-# LANGUAGE BangPatterns #-}
module Main where

import System.Environment (getArgs)
import qualified Data.ByteString.Lazy.Char8 as B
import Data.List (maximumBy, sort)
import Data.Function (on)
import Data.Char (isSpace)

-- | Returns a list of lexicon items, i.e. [[w1,w2,w3]]
readLexicon :: FilePath -> IO [[B.ByteString]]
readLexicon filename = do
    text <- B.readFile filename
    return $ map (B.split '\t' . stripR) . B.lines $ text
  where
    stripR = B.reverse . B.dropWhile isSpace . B.reverse

transformOne :: [B.ByteString] -> B.ByteString
transformOne (w1:w2:w3:[]) =
    B.intercalate (B.pack "|") [w1, longestCommonSubstrB w2 w1, w3]
transformOne a = error $ "transformOne: unexpected tuple " ++ show a

longestCommonSubstrB :: B.ByteString -> B.ByteString -> B.ByteString
longestCommonSubstrB xs ys = maximumBy (compare `on` B.length) . concat $
    [f xs' ys | xs' <- B.tails xs] ++
    [f xs ys' | ys' <- tail $ B.tails ys]
  where
    f xs' ys' = scanl g B.empty $ B.zip xs' ys'
    g z (x, y) = if x == y
                   then z `B.snoc` x
                   else B.empty

main :: IO ()
main = do
    (input:output:_) <- getArgs
    lexicon <- readLexicon input
    let flattened = B.unlines . sort . map transformOne $ lexicon
    B.writeFile output flattened
This is the profile output for the test dataset (100k lines, heap size set to 1 GB, i.e. generateSample.exe 100000; the resulting file size is 2.38 MB):
Heap profile over time:
Execution statistics:
3,505,737,588 bytes allocated in the heap
785,283,180 bytes copied during GC
62,390,372 bytes maximum residency (44 sample(s))
216,592 bytes maximum slop
96 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 6697 colls, 0 par 1.05s 1.03s 0.0002s 0.0013s
Gen 1 44 colls, 0 par 4.14s 3.99s 0.0906s 0.1935s
INIT time 0.00s ( 0.00s elapsed)
MUT time 7.80s ( 9.17s elapsed)
GC time 3.75s ( 3.67s elapsed)
RP time 0.00s ( 0.00s elapsed)
PROF time 1.44s ( 1.35s elapsed)
EXIT time 0.02s ( 0.00s elapsed)
Total time 13.02s ( 12.85s elapsed)
%GC time 28.8% (28.6% elapsed)
Alloc rate 449,633,678 bytes per MUT second
Productivity 60.1% of total user, 60.9% of total elapsed
Time and Allocation Profiling Report:
SampleB.exe +RTS -M1G -p -s -h -i0.1 -RTS sample.txt sample_out.txt
total time = 3.97 secs (3967 ticks @ 1000 us, 1 processor)
total alloc = 2,321,595,564 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
longestCommonSubstrB Main 43.3 33.1
longestCommonSubstrB.f Main 21.5 43.6
main.flattened Main 17.5 5.1
main Main 6.6 5.8
longestCommonSubstrB.g Main 5.0 5.8
readLexicon Main 2.5 2.8
transformOne Main 1.8 1.7
readLexicon.stripR Main 1.8 1.9
individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 45 0 0.1 0.0 100.0 100.0
main Main 91 0 6.6 5.8 99.9 100.0
main.flattened Main 93 1 17.5 5.1 89.1 89.4
transformOne Main 95 100000 1.8 1.7 71.6 84.3
longestCommonSubstrB Main 100 100000 43.3 33.1 69.8 82.5
longestCommonSubstrB.f Main 101 1400000 21.5 43.6 26.5 49.5
longestCommonSubstrB.g Main 104 4200000 5.0 5.8 5.0 5.8
readLexicon Main 92 1 2.5 2.8 4.2 4.8
readLexicon.stripR Main 98 0 1.8 1.9 1.8 1.9
CAF GHC.IO.Encoding.CodePage 80 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding 74 0 0.0 0.0 0.0 0.0
CAF GHC.IO.FD 70 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Handle.FD 66 0 0.0 0.0 0.0 0.0
CAF System.Environment 65 0 0.0 0.0 0.0 0.0
CAF Data.ByteString.Lazy.Char8 54 0 0.0 0.0 0.0 0.0
CAF Main 52 0 0.0 0.0 0.0 0.0
transformOne Main 99 0 0.0 0.0 0.0 0.0
readLexicon Main 96 0 0.0 0.0 0.0 0.0
readLexicon.stripR Main 97 1 0.0 0.0 0.0 0.0
main Main 90 1 0.0 0.0 0.0 0.0
UPDATE: The following program can be used to generate sample data. It expects one argument, the number of lines in the generated dataset. The generated data will be saved to the sample.txt file. When I generate a 900k-line dataset with it (by running generateSample.exe 900000), the produced dataset makes the above program fail with heap overflow (the heap size was set to 1 GB). The resulting dataset is around 20 MB.
module Main where

import System.Environment (getArgs)
import Data.List (intercalate, permutations)

generate :: Int -> [(String, String, String)]
generate n = take n $ zip3 (f "banana") (f "ruanaba") (f "kikiriki")
  where
    f = cycle . permutations

main :: IO ()
main = do
    (n:_) <- getArgs
    let flattened = unlines . map f $ generate (read n :: Int)
    writeFile "sample.txt" flattened
  where
    f (w1, w2, w3) = intercalate "\t" [w1, w2, w3]
It seems to me you've implemented a naive longest common substring, with terrible space complexity (at least O(n^2)). Strictness has nothing to do with it.
You'll want to implement a dynamic programming algorithm. You may find inspiration in the string-similarity package, or in the lcs function in the guts of the Diff package.
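For concreteness, here is a minimal sketch of that idea (my own code, not taken from either package, and using strict ByteStrings rather than the lazy ones in the question): the classic dynamic-programming recurrence, keeping only one rolling row of suffix-match lengths, so each pair of words costs O(n*m) time and O(m) space.

{-# LANGUAGE BangPatterns #-}
import qualified Data.ByteString.Char8 as S
import qualified Data.Vector.Unboxed as VU
import Data.List (foldl')

-- dp[i][j] = length of the common suffix of (take i xs) and (take j ys);
-- dp[i][j] = dp[i-1][j-1] + 1 when the characters match, otherwise 0.
longestCommonSubstr :: S.ByteString -> S.ByteString -> S.ByteString
longestCommonSubstr xs ys = S.take bestLen (S.drop (bestEnd - bestLen) xs)
  where
    m    = S.length ys
    row0 = VU.replicate (m + 1) (0 :: Int)

    -- Consume one character of xs: build the next row, remember the best match.
    step (!row, best@(!bestL, _)) i = (row', best')
      where
        x      = S.index xs i
        row'   = VU.generate (m + 1) cell
        cell 0 = 0
        cell j | S.index ys (j - 1) == x = row VU.! (j - 1) + 1
               | otherwise               = 0
        len    = VU.maximum row'
        best'  = if len > bestL then (len, i + 1) else best

    (_, (bestLen, bestEnd)) = foldl' step (row0, (0, 0)) [0 .. S.length xs - 1]

To plug this into the program above, transformOne would have to convert the lazy ByteStrings first (e.g. with Data.ByteString.Lazy.toStrict); that conversion is an assumption on my part, not part of the original code.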
Related
My program uses GPG for signing files. I'm using GPGME in Haskell and the issue is that it is about 16 times slower than using GPG from the command line. Here is an example:
Haskell code:
module Main where

import qualified Data.ByteString as B
import qualified Crypto.Gpgme as Gpg

main :: IO ()
main = do
    fileContents <- B.readFile "randomfile"
    eitherSigned <- sign fileContents
    case eitherSigned of
      Left err -> print err
      Right signed ->
        B.writeFile "signedfile" signed

sign :: B.ByteString -> IO (Either [Gpg.InvalidKey] B.ByteString)
sign bs =
    let
      sign' :: Gpg.Ctx -> IO (Either [Gpg.InvalidKey] Gpg.Plain)
      sign' ctx = Gpg.sign ctx [] Gpg.Normal bs
    in
      Gpg.withCtx "/home/t/.gnupg" "C" Gpg.OpenPGP sign'
I generate a junk 10MB file with
$ dd if=/dev/zero of=randomfile bs=10000000 count=1
I sign it with GPG from the command line:
$ time gpg -s randomfile
and get
gpg -s randomfile 0.14s user 0.00s system 99% cpu 0.136 total
I sign it with my Haskell program with
$ time stack exec hasksign
and get
stack exec hasksign 0.27s user 0.07s system 14% cpu 2.239 total
I tried running the Haskell code again with profiling on and got this result:
hasksign +RTS -p -RTS randomfile
total time = 2.21 secs (2208 ticks @ 1000 us, 1 processor)
total alloc = 115,627,312 bytes (excludes profiling overheads)
COST CENTRE MODULE SRC %time %alloc
newCtx Crypto.Gpgme.Ctx src/Crypto/Gpgme/Ctx.hs:(21,1)-(51,21) 77.0 0.0
signIntern Crypto.Gpgme.Crypto src/Crypto/Gpgme/Crypto.hs:(310,1)-(354,14) 20.4 8.6
collectResult.go.\ Crypto.Gpgme.Internal src/Crypto/Gpgme/Internal.hs:(29,21)-(34,46) 2.1 81.9
main Main src/Main.hs:(7,1)-(13,43) 0.1 8.7
I have looked in the newCtx function, where the time is spent, but it is not clear to me what is costing so much.
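One way to narrow this down further (a sketch, not from the original post; only the timing wrapper around the existing Gpg.withCtx / Gpg.sign calls is new) is to time the context setup and the signing separately:

import Data.Time.Clock (getCurrentTime, diffUTCTime)
import qualified Data.ByteString as B
import qualified Crypto.Gpgme as Gpg

timedSign :: B.ByteString -> IO ()
timedSign bs = do
    t0  <- getCurrentTime
    res <- Gpg.withCtx "/home/t/.gnupg" "C" Gpg.OpenPGP $ \ctx -> do
        t1 <- getCurrentTime
        putStrLn $ "context setup: " ++ show (diffUTCTime t1 t0)
        r  <- Gpg.sign ctx [] Gpg.Normal bs
        t2 <- getCurrentTime
        putStrLn $ "signing:       " ++ show (diffUTCTime t2 t1)
        return r
    case res of
        Left err     -> print err
        Right signed -> B.writeFile "signedfile" signed

Called from main after B.readFile "randomfile" in place of sign, this separates the two costs: if the profile is right and newCtx dominates, the first number should account for most of the two seconds, pointing at context creation rather than the signing itself.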
In order to learn about GHC's parallel strategies, I've written a simple particle simulator, that, given a particle's position, velocity, and acceleration, will project that particle's path forward.
import Control.Parallel.Strategies

-- Use phantom a to store axis.
newtype Pos a = Pos Double deriving Show
newtype Vel a = Vel Double deriving Show
newtype Acc a = Acc Double deriving Show

newtype TimeStep = TimeStep Double deriving Show

-- Phantom axis
data X
data Y

-- Position, velocity, acceleration for a particle.
data Particle = Particle (Pos X) (Pos Y) (Vel X) (Vel Y) (Acc X) (Acc Y)
  deriving (Show)

stepParticle :: TimeStep -> Particle -> Particle
stepParticle ts (Particle x y xv yv xa ya) =
    Particle x' y' xv' yv' xa' ya'
  where
    (x', xv', xa') = step ts x xv xa
    (y', yv', ya') = step ts y yv ya

-- Given a position, velocity, and accel, calculate the pos, vel, acc after
-- a given TimeStep.
step :: TimeStep -> Pos a -> Vel a -> Acc a -> (Pos a, Vel a, Acc a)
step (TimeStep ts) (Pos p) (Vel v) (Acc a) = (Pos p', Vel v', Acc a)
  where
    v' = ts * a + v
    p' = ts * v + p

-- Build a list of lazy infinite lists of a particles' travel
-- with each update a TimeStep apart. Evaluate each inner list in
-- parallel.
simulateParticlesPar :: TimeStep -> [Particle] -> [[Particle]]
simulateParticlesPar ts =
    withStrategy (parList (parBuffer 250 particleStrategy))
  . fmap (simulateParticle ts)

-- Build a lazy infinite list of the particle's travel with each
-- update being a TimeStep apart.
simulateParticle :: TimeStep -> Particle -> [Particle]
simulateParticle ts m = m' : simulateParticle ts m'
  where
    m' = stepParticle ts m

particleStrategy :: Strategy Particle
particleStrategy (Particle (Pos x) (Pos y) (Vel xv) (Vel yv) (Acc xa) (Acc ya)) = do
    x'  <- rseq x
    y'  <- rseq y
    xv' <- rseq xv
    yv' <- rseq yv
    xa' <- rseq xa
    ya' <- rseq ya
    return $ Particle (Pos x') (Pos y') (Vel xv') (Vel yv') (Acc xa') (Acc ya')

main :: IO ()
main = do
    let world = replicate 100 (Particle (Pos 0) (Pos 0) (Vel 1) (Vel 1) (Acc 0) (Acc 0))
        ts    = TimeStep 0.1
    print $ fmap (take 10000) (simulateParticlesPar ts world)
For each particle, I create a lazy infinite list projecting the particle's path into the future. I start out with 100 of these particles and project them all forward, my intention being to project each of them forward in parallel (roughly a spark per infinite list). If I project these lists forward long enough, I'd expect a significant speedup. Unfortunately, I see a slight slowdown.
Compilation: ghc phys.hs -rtsopts -threaded -eventlog -O2
With 1 thread:
$ ./phys +RTS -N1 -sstderr -ls > /dev/null
24,264,983,224 bytes allocated in the heap
441,881,088 bytes copied during GC
1,942,848 bytes maximum residency (104 sample(s))
75,880 bytes maximum slop
7 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 46820 colls, 0 par 0.82s 0.88s 0.0000s 0.0039s
Gen 1 104 colls, 0 par 0.23s 0.23s 0.0022s 0.0037s
TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)
SPARKS: 1025000 (25 converted, 0 overflowed, 0 dud, 28680 GC'd, 996295 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 9.90s ( 10.09s elapsed)
GC time 1.05s ( 1.11s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 10.95s ( 11.20s elapsed)
Alloc rate 2,451,939,648 bytes per MUT second
Productivity 90.4% of total user, 88.4% of total elapsed
gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0
With 2 threads:
$ ./phys +RTS -N2 -sstderr -ls > /dev/null
24,314,635,280 bytes allocated in the heap
457,603,240 bytes copied during GC
1,962,152 bytes maximum residency (104 sample(s))
119,824 bytes maximum slop
7 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 46555 colls, 46555 par 1.40s 0.85s 0.0000s 0.0048s
Gen 1 104 colls, 103 par 0.42s 0.25s 0.0024s 0.0043s
Parallel GC work balance: 16.85% (serial 0%, perfect 100%)
TASKS: 6 (1 bound, 5 peak workers (5 total), using -N2)
SPARKS: 1025000 (1023572 converted, 0 overflowed, 0 dud, 1367 GC'd, 61 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 11.07s ( 11.20s elapsed)
GC time 1.82s ( 1.10s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 12.89s ( 12.30s elapsed)
Alloc rate 2,196,259,905 bytes per MUT second
Productivity 85.9% of total user, 90.0% of total elapsed
gc_alloc_block_sync: 9222
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 2393
I have an Intel i5 with 2 cores and 4 threads, and with -N4 it's 2x slower than -N1 (total time ~20 sec).
I've spent quite a bit of time trying different strategies, such as chunking the outer list (so each spark gets more than one stream to project forward) and using rpar for each field in particleStrategy, but I've yet to get any speedup at all.
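For reference, a sketch of what such a chunked variant might look like (my own guess, not the poster's code; it assumes the Particle types and simulateParticle from the listing above): take a finite window of each trajectory first, then evaluate chunks of the outer list in parallel, forcing each particle completely with rdeepseq.

import Control.DeepSeq (NFData (..))
import Control.Parallel.Strategies

instance NFData Particle where
    rnf (Particle (Pos x) (Pos y) (Vel xv) (Vel yv) (Acc xa) (Acc ya)) =
        x `seq` y `seq` xv `seq` yv `seq` xa `seq` ya `seq` ()

-- Project every particle n steps forward; chunks of 10 trajectories per spark.
simulateWindowPar :: Int -> TimeStep -> [Particle] -> [[Particle]]
simulateWindowPar n ts ps =
    map (take n . simulateParticle ts) ps `using` parListChunk 10 rdeepseq

With this variant, main would print simulateWindowPar 10000 ts world instead of applying take after the fact.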
Below is a zoomed-in section of the eventlog under ThreadScope. As you can see, I'm getting almost no concurrency. Most of the work is being done by HEC0, with some activity from HEC1 interleaved, but only one HEC is doing work at a time. This is pretty representative of all the strategies I've tried.
As a sanity check, I've run a few of the example programs from "Parallel and Concurrent Programming in Haskell" and also see slowdowns on these programs, even though I'm using the same parameters that give them significant speedups in the book! I'm beginning to think there's something wrong with my GHC.
$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.8.3
Installed from: https://ghcformacosx.github.io/
OS X 10.10.2
Update:
I've found this thread in the GHC tracker on an OS X threaded RTS performance regression: https://ghc.haskell.org/trac/ghc/ticket/7602. I'm hesitant to blame the compiler, but my -N4 output supports this hypothesis. The "Parallel GC work balance" is terrible:
$ ./phys +RTS -N4 -sstderr -ls > /dev/null
24,392,146,832 bytes allocated in the heap
481,001,648 bytes copied during GC
1,989,272 bytes maximum residency (104 sample(s))
181,208 bytes maximum slop
8 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 46555 colls, 46555 par 4.80s 1.98s 0.0000s 0.0055s
Gen 1 104 colls, 103 par 0.99s 0.39s 0.0037s 0.0048s
Parallel GC work balance: 7.59% (serial 0%, perfect 100%)
TASKS: 10 (1 bound, 9 peak workers (9 total), using -N4)
SPARKS: 1025000 (1023640 converted, 0 overflowed, 0 dud, 1331 GC'd, 29 fizzled)
INIT time 0.00s ( 0.01s elapsed)
MUT time 14.85s ( 13.12s elapsed)
GC time 5.79s ( 2.36s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 20.65s ( 15.49s elapsed)
Alloc rate 1,642,170,155 bytes per MUT second
Productivity 71.9% of total user, 95.9% of total elapsed
gc_alloc_block_sync: 61429
whitehole_spin: 0
gen[0].sync: 1
gen[1].sync: 617
On the other hand, I don't know if this explains my threadscope output, which shows a lack of any concurrency at all.
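One experiment worth mentioning here (my suggestion, not something from the linked ticket): the parallel garbage collector can be switched off with the -qg RTS flag, which makes it easy to check how much of the -N2/-N4 slowdown comes from the poor GC work balance rather than from the sparks themselves:

$ ./phys +RTS -N2 -qg -sstderr -ls > /dev/null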
I am quite new to Haskell and I want to create a histogram. I am using Data.Vector.Unboxed to fuse operations on the data, which is blazing fast (when compiled with -O -fllvm); the bottleneck is my fold application, which aggregates the bucket counts.
How can I make it faster? I read about reducing the number of thunks by keeping things strict, so I've made things strict using seq and foldr', but I'm not seeing much of a performance increase. Your ideas are strongly encouraged.
import qualified Data.Vector.Unboxed as V

histogram :: [(Int,Int)]
histogram = V.foldr' agg [] $ V.zip k v
  where
    n = 10000000
    c = 1000000
    k = V.generate n (\i -> i `div` c * c)
    v = V.generate n (\i -> 1)
    agg kv [] = [kv]
    agg kv@(k,v) acc@((ck,cv):as)
      | k == ck   = let a = (ck,cv+v):as in a `seq` a
      | otherwise = let a = kv:acc in a `seq` a

main :: IO ()
main = print histogram
Compiled with:
ghc --make -O -fllvm histogram.hs
First, compile the program with -O2 -rtsopts. Then, to get a first idea where you could optimize, run the program with the options +RTS -sstderr:
$ ./question +RTS -sstderr
[(0,1000000),(1000000,1000000),(2000000,1000000),(3000000,1000000),(4000000,1000000),(5000000,1000000),(6000000,1000000),(7000000,1000000),(8000000,1000000),(9000000,1000000)]
1,193,907,224 bytes allocated in the heap
1,078,027,784 bytes copied during GC
282,023,968 bytes maximum residency (7 sample(s))
86,755,184 bytes maximum slop
763 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 1964 colls, 0 par 3.99s 4.05s 0.0021s 0.0116s
Gen 1 7 colls, 0 par 1.60s 1.68s 0.2399s 0.6665s
INIT time 0.00s ( 0.00s elapsed)
MUT time 2.67s ( 2.68s elapsed)
GC time 5.59s ( 5.73s elapsed)
EXIT time 0.02s ( 0.03s elapsed)
Total time 8.29s ( 8.43s elapsed)
%GC time 67.4% (67.9% elapsed)
Alloc rate 446,869,876 bytes per MUT second
Productivity 32.6% of total user, 32.0% of total elapsed
Notice that 67% of your time is spent in GC! There is clearly something wrong. To find out what is wrong, we can run the program with heap profiling enabled (using +RTS -h), which produces the following figure:
So, you're leaking thunks. How does this happen? Looking at the code, the only place where a thunk is built up (recursively) in agg is the addition. Making cv strict by adding a bang pattern thus fixes the issue:
{-# LANGUAGE BangPatterns #-}
import qualified Data.Vector.Unboxed as V

histogram :: [(Int,Int)]
histogram = V.foldr' agg [] $ V.zip k v
  where
    n = 10000000
    c = 1000000
    k = V.generate n (\i -> i `div` c * c)
    v = V.generate n id
    agg kv [] = [kv]
    agg kv@(k,v) acc@((ck,!cv):as)  -- Note the !
      | k == ck   = (ck,cv+v):as
      | otherwise = kv:acc

main :: IO ()
main = print histogram
Output:
$ time ./improved +RTS -sstderr
[(0,499999500000),(1000000,1499999500000),(2000000,2499999500000),(3000000,3499999500000),(4000000,4499999500000),(5000000,5499999500000),(6000000,6499999500000),(7000000,7499999500000),(8000000,8499999500000),(9000000,9499999500000)]
672,063,056 bytes allocated in the heap
94,664 bytes copied during GC
160,028,816 bytes maximum residency (2 sample(s))
1,464,176 bytes maximum slop
155 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 992 colls, 0 par 0.03s 0.03s 0.0000s 0.0001s
Gen 1 2 colls, 0 par 0.03s 0.03s 0.0161s 0.0319s
INIT time 0.00s ( 0.00s elapsed)
MUT time 1.24s ( 1.25s elapsed)
GC time 0.06s ( 0.06s elapsed)
EXIT time 0.03s ( 0.03s elapsed)
Total time 1.34s ( 1.34s elapsed)
%GC time 4.4% (4.5% elapsed)
Alloc rate 540,674,868 bytes per MUT second
Productivity 95.5% of total user, 95.1% of total elapsed
./improved +RTS -sstderr 1,14s user 0,20s system 99% cpu 1,352 total
This is much better.
So now you could ask: why did the issue appear, even though you used seq? The reason is that seq only forces its first argument to WHNF, and a pair (_,_) (where the _ are unevaluated thunks) is already in WHNF! Also, seq a a is the same as a, because seq a b (informally) means: evaluate a before b is evaluated. So seq a a just means: evaluate a before a is evaluated, and that is the same as just evaluating a!
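A tiny self-contained illustration of that point (my own example, not part of the original answer): forcing a pair only exposes the (,) constructor; its components stay unevaluated.

import Control.Exception (evaluate)

main :: IO ()
main = do
    let p = (undefined :: Int, undefined :: Int)
    _ <- evaluate p          -- succeeds: the pair constructor is already WHNF
    putStrLn "pair forced to WHNF, components untouched"
    -- print (fst p)         -- uncommenting this line would hit the undefined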
I have a 2.5 MB file full of floats separated by spaces (the code below can generate it for you) and want to parse it into an array with attoparsec.
It is surprisingly slow, taking almost a second, and allocating a lot of memory:
time ./Attoparsec-problem +RTS -sstderr < floats.txt
299999.0
956,647,344 bytes allocated in the heap
752,875,520 bytes copied during GC
166,485,416 bytes maximum residency (7 sample(s))
8,874,168 bytes maximum slop
337 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 1604 colls, 0 par 0.21s 0.27s 0.0002s 0.0092s
Gen 1 7 colls, 0 par 0.24s 0.36s 0.0520s 0.1783s
...
%GC time 65.5% (75.1% elapsed)
Alloc rate 3,985,781,488 bytes per MUT second
Productivity 34.5% of total user, 28.6% of total elapsed
My parser is incredibly simple: It is essentially double <* skipSpace.
This is the code:
import Control.Applicative
import Data.Attoparsec.ByteString.Char8 as A
import qualified Data.ByteString as BS
import qualified Data.Vector.Unboxed as U

-- Compile with:
--   ghc --make -O2 -prof -auto-all -caf-all -rtsopts -fforce-recomp Attoparsec-problem.hs
-- Run:
--   time ./Attoparsec-problem +RTS -sstderr < floats.txt

main :: IO ()
main = do
    -- For creating the test file (2.5 MB)
    -- writeFile "floats.txt" (Prelude.unwords $ Prelude.map show [1.0,2.0..300000.0])
    r <- parse parser <$> BS.getContents
    case r of
      Done _ arr -> print $ U.last arr
      x          -> print x
  where
    parser = do
      U.replicateM (300000-1) (double <* skipSpace)
      -- This gives surprisingly bad productivity (70% GC time) and 180 MB max residency
      -- for a 2.5 MB file!
Can you explain to me what is going on?
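One possible experiment (a sketch added here, not from the original post): replace parser with a version that parses into a list and converts to a vector afterwards, to see whether U.replicateM running inside the Parser monad accounts for the allocation. Its behaviour on trailing garbage differs slightly from the fixed-count version.

import Control.Applicative
import Data.Attoparsec.ByteString.Char8 (Parser, double, skipSpace)
import qualified Data.Vector.Unboxed as U

-- Drop-in replacement for the parser above: collect a plain list, then freeze it.
parserViaList :: Parser (U.Vector Double)
parserViaList = fmap U.fromList (many (double <* skipSpace))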
I'm currently trying to understand how to program in parallel in Haskell. I'm following the paper "A Tutorial on Parallel and Concurrent Programming in Haskell" by Simon Peyton Jones and Satnam Singh. The source code is as follows:
module Main where

import Control.Parallel
import System.Time

main :: IO ()
main = do
    putStrLn "Starting computation....."
    t0 <- getClockTime
    pseq r1 (return ())
    t1 <- getClockTime
    putStrLn ("sum: " ++ show r1)
    putStrLn ("time: " ++ show (secDiff t0 t1) ++ " seconds")
    putStrLn "Finish."

r1 :: Int
r1 = parSumFibEuler 38 5300

-- This is the Fibonacci number generator
fib :: Int -> Int
fib 0 = 0
fib 1 = 1
fib n = fib (n-1) + fib (n-2)

-- Gets the euler sum
mkList :: Int -> [Int]
mkList n = [1..n-1]

relprime :: Int -> Int -> Bool
relprime x y = gcd x y == 1

euler :: Int -> Int
euler n = length $ filter (relprime n) (mkList n)

sumEuler :: Int -> Int
sumEuler = sum . map euler . mkList

-- Gets the sum of Euler and Fibonacci (NORMAL)
sumFibEuler :: Int -> Int -> Int
sumFibEuler a b = fib a + sumEuler b

-- Gets the sum of Euler and Fibonacci (PARALLEL)
parSumFibEuler :: Int -> Int -> Int
parSumFibEuler a b =
    f `par` (e `pseq` (f + e))
  where
    f = fib a
    e = sumEuler b

-- Measure time
secDiff :: ClockTime -> ClockTime -> Float
secDiff (TOD secs1 psecs1) (TOD secs2 psecs2)
  = fromInteger (psecs2 - psecs1) / 1e12 + fromInteger (secs2 - secs1)
I compiled it with the following command:
ghc --make -threaded Main.hs
a) Ran it using 1 core:
./Main +RTS -N1
b) Ran it using 2 core:
./Main +RTS -N2
However, with one core it ran in 53.556 seconds, whereas with two cores it ran in 73.401 seconds. I don't understand how 2 cores can actually run slower than 1 core. Maybe the message-passing overhead is too big for this small program? The paper has completely different outcomes compared to mine. Following are the output details.
For 1 core:
Starting computation.....
sum: 47625790
time: 53.556335 seconds
Finish.
17,961,210,216 bytes allocated in the heap
12,595,880 bytes copied during GC
176,536 bytes maximum residency (3 sample(s))
23,904 bytes maximum slop
2 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 34389 colls, 0 par 2.54s 2.57s 0.0001s 0.0123s
Gen 1 3 colls, 0 par 0.00s 0.00s 0.0007s 0.0010s
Parallel GC work balance: -nan (0 / 0, ideal 1)
MUT time (elapsed) GC time (elapsed)
Task 0 (worker) : 0.00s ( 0.00s) 0.00s ( 0.00s)
Task 1 (worker) : 0.00s ( 53.56s) 0.00s ( 0.00s)
Task 2 (bound) : 50.49s ( 50.99s) 2.52s ( 2.57s)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 50.47s ( 50.99s elapsed)
GC time 2.54s ( 2.57s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 53.02s ( 53.56s elapsed)
Alloc rate 355,810,305 bytes per MUT second
Productivity 95.2% of total user, 94.2% of total elapsed
gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0
For 2 core:
Starting computation.....
sum: 47625790
time: 73.401146 seconds
Finish.
17,961,210,256 bytes allocated in the heap
12,558,088 bytes copied during GC
176,536 bytes maximum residency (3 sample(s))
195,936 bytes maximum slop
3 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 34389 colls, 34388 par 7.42s 4.73s 0.0001s 0.0205s
Gen 1 3 colls, 3 par 0.01s 0.00s 0.0011s 0.0017s
Parallel GC work balance: 1.00 (1432193 / 1429197, ideal 2)
MUT time (elapsed) GC time (elapsed)
Task 0 (worker) : 1.19s ( 40.26s) 16.95s ( 33.15s)
Task 1 (worker) : 0.00s ( 73.40s) 0.00s ( 0.00s)
Task 2 (bound) : 54.50s ( 68.67s) 3.66s ( 4.73s)
Task 3 (worker) : 0.00s ( 73.41s) 0.00s ( 0.00s)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 68.87s ( 68.67s elapsed)
GC time 7.43s ( 4.73s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 76.31s ( 73.41s elapsed)
Alloc rate 260,751,318 bytes per MUT second
Productivity 90.3% of total user, 93.8% of total elapsed
gc_alloc_block_sync: 12254
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0
r1 = sumFibEuler 38 5300
I believe that you meant
r1 = parSumFibEuler 38 5300
On my configuration (with parSumFibEuler 45 8000 and with only one run):
with -N1: 126.83 s
with -N2: 115.46 s
I suspect the fib function is much more CPU-consuming than sumEuler. That would explain the low improvement with -N2: with only two coarse pieces of work, there is nothing for the idle capability to steal in your situation.
With memoization your Fibonacci function would be much better, but I don't think that's what you wanted to try.
EDIT: as mentioned in the comments, I think that with -N2 you get a lot of interruptions from other processes, since you only have two cores available.
Example on my configuration (4 cores) with sum $ parMap rdeepseq fib [1..40] (a runnable sketch follows at the end):
with -N1 it takes ~26s
with -N2 it takes ~16s
with -N3 it takes ~13s
with -N4 it takes ~30s (well, that Haskell program is not alone here)
From here:
Be careful when using all the processors in your machine: if some of
your processors are in use by other programs, this can actually harm
performance rather than improve it.
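For completeness, here is a runnable version of the parMap example above (a reconstruction, assuming the same fib as in the question); compile with ghc -O2 -threaded -rtsopts and run with +RTS -N2 to compare against -N1.

import Control.Parallel.Strategies (parMap, rdeepseq)

fib :: Int -> Int
fib 0 = 0
fib 1 = 1
fib n = fib (n - 1) + fib (n - 2)

main :: IO ()
main = print $ sum $ parMap rdeepseq fib [1 .. 40]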