Decipher garbage collection output

I was running a sample program using
rahul#g3ck0:~/programs/Remodel$ GOGCTRACE=1 go run main.go
gc1(1): 0+0+0 ms 0 -> 0 MB 422 -> 346 (422-76) objects 0 handoff
gc2(1): 0+0+0 ms 0 -> 0 MB 2791 -> 1664 (2867-1203) objects 0 handoff
gc3(1): 0+0+0 ms 1 -> 0 MB 4576 -> 2632 (5779-3147) objects 0 handoff
gc4(1): 0+0+0 ms 1 -> 0 MB 3380 -> 2771 (6527-3756) objects 0 handoff
gc5(1): 0+0+0 ms 1 -> 0 MB 3511 -> 2915 (7267-4352) objects 0 handoff
gc6(1): 0+0+0 ms 1 -> 0 MB 6573 -> 2792 (10925-8133) objects 0 handoff
gc7(1): 0+0+0 ms 1 -> 0 MB 4859 -> 3059 (12992-9933) objects 0 handoff
gc8(1): 0+0+0 ms 1 -> 0 MB 4554 -> 3358 (14487-11129) objects 0 handoff
gc9(1): 0+0+0 ms 1 -> 0 MB 8633 -> 4116 (19762-15646) objects 0 handoff
gc10(1): 0+0+0 ms 1 -> 0 MB 9415 -> 4769 (25061-20292) objects 0 handoff
gc11(1): 0+0+0 ms 1 -> 0 MB 6636 -> 4685 (26928-22243) objects 0 handoff
gc12(1): 0+0+0 ms 1 -> 0 MB 6741 -> 4802 (28984-24182) objects 0 handoff
gc13(1): 0+0+0 ms 1 -> 0 MB 9654 -> 5097 (33836-28739) objects 0 handoff
gc1(1): 0+0+0 ms 0 -> 0 MB 209 -> 171 (209-38) objects 0 handoff
Help me understand the first part, i.e.
0+0+0 => mark + sweep + clean times
Does 422 -> 346 mean that memory has been cleaned up from 422 MB to 346 MB?
If yes, then how come the memory was reduced when there was nothing to be cleaned up?

In Go 1.5, the format of this output has changed considerably. For the full documentation, head over to http://godoc.org/runtime and search for "gctrace:"
gctrace: setting gctrace=1 causes the garbage collector to emit a single line to standard
error at each collection, summarizing the amount of memory collected and the
length of the pause. Setting gctrace=2 emits the same summary but also
repeats each collection. The format of this line is subject to change.
Currently, it is:
gc # ##s #%: #+...+# ms clock, #+...+# ms cpu, #->#-># MB, # MB goal, # P
where the fields are as follows:
    gc #        the GC number, incremented at each GC
    ##s         time in seconds since program start
    #%          percentage of time spent in GC since program start
    #+...+#     wall-clock/CPU times for the phases of the GC
    #->#-># MB  heap size at GC start, at GC end, and live heap
    # MB goal   goal heap size
    # P         number of processors used
The phases are stop-the-world (STW) sweep termination, scan,
synchronize Ps, mark, and STW mark termination. The CPU times
for mark are broken down in to assist time (GC performed in
line with allocation), background GC time, and idle GC time.
If the line ends with "(forced)", this GC was forced by a
runtime.GC() call and all phases are STW.

The output is generated from this line: http://golang.org/src/pkg/runtime/mgc0.c?#L2147
So the different parts are:
0+0+0 ms         : mark, sweep and clean duration in ms
1 -> 0 MB        : heap size before and after, in MB
209 -> 171       : number of objects before and after
(209-38) objects : number of allocs and frees
So 422 -> 346 in your output is a count of objects, not megabytes.
handoff (and, in Go 1.2, steal and yields) are internals of the algorithm.
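To make the field layout concrete, here is a minimal sketch that splits one of the old-style trace lines from the question into named fields. The `gcLine` type, the `parseGCLine` helper, and the regular expression are illustrative assumptions written for this example; they are not part of the Go runtime, and the pattern only covers the pre-1.5 format shown above.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// gcLine holds the numeric fields of one pre-Go-1.5 GOGCTRACE line.
// The type and field names are made up for this example.
type gcLine struct {
	HeapBeforeMB, HeapAfterMB   int // "0 -> 0 MB"
	ObjectsBefore, ObjectsAfter int // "422 -> 346"
	Allocs, Frees               int // "(422-76) objects"
}

// gcLineRE matches the heap and object fields of the old trace format.
var gcLineRE = regexp.MustCompile(`(\d+) -> (\d+) MB (\d+) -> (\d+) \((\d+)-(\d+)\) objects`)

// parseGCLine extracts the heap and object fields from a trace line.
func parseGCLine(s string) (gcLine, error) {
	m := gcLineRE.FindStringSubmatch(s)
	if m == nil {
		return gcLine{}, fmt.Errorf("unrecognized gctrace line: %q", s)
	}
	n := make([]int, 6)
	for i := range n {
		v, err := strconv.Atoi(m[i+1])
		if err != nil {
			return gcLine{}, err
		}
		n[i] = v
	}
	return gcLine{
		HeapBeforeMB: n[0], HeapAfterMB: n[1],
		ObjectsBefore: n[2], ObjectsAfter: n[3],
		Allocs: n[4], Frees: n[5],
	}, nil
}

func main() {
	g, err := parseGCLine("gc1(1): 0+0+0 ms 0 -> 0 MB 422 -> 346 (422-76) objects 0 handoff")
	if err != nil {
		panic(err)
	}
	fmt.Printf("heap:    %d -> %d MB\n", g.HeapBeforeMB, g.HeapAfterMB)
	fmt.Printf("objects: %d -> %d\n", g.ObjectsBefore, g.ObjectsAfter)
	fmt.Printf("allocs:  %d, frees: %d\n", g.Allocs, g.Frees)
}
```

In the sample output from the question, allocs minus frees equals the object count after collection on every line (422 - 76 = 346, 2867 - 1203 = 1664), which suggests the parenthesised numbers are running totals.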

Related

Java eden space is not 8 times larger than s0 space

According to Oracle's documentation, the default value of SurvivorRatio is 8, which means each survivor space should be one-eighth the size of the eden space.
But in my application this does not hold:
$ jmap -heap 48865
Attaching to process ID 48865, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.45-b02
using thread-local object allocation.
Parallel GC with 8 thread(s)
Heap Configuration:
MinHeapFreeRatio = 0
MaxHeapFreeRatio = 100
MaxHeapSize = 4294967296 (4096.0MB)
NewSize = 89128960 (85.0MB)
MaxNewSize = 1431306240 (1365.0MB)
OldSize = 179306496 (171.0MB)
NewRatio = 2
SurvivorRatio = 8
MetaspaceSize = 21807104 (20.796875MB)
CompressedClassSpaceSize = 1073741824 (1024.0MB)
MaxMetaspaceSize = 17592186044415 MB
G1HeapRegionSize = 0 (0.0MB)
Heap Usage:
PS Young Generation
Eden Space:
capacity = 67108864 (64.0MB)
used = 64519920 (61.53099060058594MB)
free = 2588944 (2.4690093994140625MB)
96.14217281341553% used
From Space:
capacity = 11010048 (10.5MB)
used = 0 (0.0MB)
free = 11010048 (10.5MB)
0.0% used
To Space:
capacity = 11010048 (10.5MB)
used = 0 (0.0MB)
free = 11010048 (10.5MB)
0.0% used
PS Old Generation
capacity = 179306496 (171.0MB)
used = 0 (0.0MB)
free = 179306496 (171.0MB)
0.0% used
7552 interned Strings occupying 605288 bytes.
But in VisualVM the eden space is 1.332G and S0 is 455M, so eden is only about 3 times larger than S0, not 8 times.
You have neither disabled the adaptive size policy (-XX:-UseAdaptiveSizePolicy) nor set -Xms equal to -Xmx, so the JVM is free to resize the heap generations (and survivor spaces) at runtime. In this case the estimated maximum survivor size is
MaxSurvivor = NewGen / MinSurvivorRatio
where -XX:MinSurvivorRatio=3 by default. Note: this is an estimated maximum, not the actual size.
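As a quick check of that formula against the numbers in the question, the sketch below plugs in MaxNewSize from the jmap output and the default MinSurvivorRatio of 3. Using MaxNewSize as the NewGen value is an assumption made for this example, and `maxSurvivorBytes` is a hypothetical helper, not a JVM API.

```go
package main

import "fmt"

const mb = 1024 * 1024

// maxSurvivorBytes applies the estimate quoted above:
// MaxSurvivor = NewGen / MinSurvivorRatio.
func maxSurvivorBytes(newGenBytes, minSurvivorRatio int64) int64 {
	return newGenBytes / minSurvivorRatio
}

func main() {
	const maxNewSize = 1431306240 // MaxNewSize from the jmap output (1365.0MB)
	const minSurvivorRatio = 3    // default -XX:MinSurvivorRatio

	s := maxSurvivorBytes(maxNewSize, minSurvivorRatio)
	fmt.Printf("estimated max survivor: %d bytes (%.1f MB)\n", s, float64(s)/mb)
	// prints "estimated max survivor: 477102080 bytes (455.0 MB)"
}
```

The result, 455 MB, matches the 455M that VisualVM reports for S0, which is consistent with VisualVM showing the estimated maximum rather than the current capacity of 10.5MB.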
See also this answer.

write error: No space left on device in embedded linux

Hi all,
I have an embedded board running Linux, with yaffs2 as the rootfs.
I run a program on it, but after some time it fails with the error "No space left on device", even though I checked the flash and there is still a lot of free space.
The program only writes a few config files, which are rarely updated, and some logs to flash. The log size is limited to 2M.
I don't know why this happens or how to solve it.
Please help! (English is not my first language, sorry; I hope you understand what I mean.)
Some debug info:
# ./write_test
version 1.0
close file :: No space left on device
return errno 28
# cat /proc/yaffs
YAFFS built:Nov 23 2015 16:57:34
Device 0 "rootfs"
start_block........... 0
end_block............. 511
total_bytes_per_chunk. 2048
use_nand_ecc.......... 1
no_tags_ecc........... 1
is_yaffs2............. 1
inband_tags........... 0
empty_lost_n_found.... 0
disable_lazy_load..... 0
refresh_period........ 500
n_caches.............. 10
n_reserved_blocks..... 5
always_check_erased... 0
data_bytes_per_chunk.. 2048
chunk_grp_bits........ 0
chunk_grp_size........ 1
n_erased_blocks....... 366
blocks_in_checkpt..... 0
n_tnodes.............. 749
n_obj................. 477
n_free_chunks......... 23579
n_page_writes......... 6092
n_page_reads.......... 11524
n_erasures............ 96
n_gc_copies........... 5490
all_gcs............... 1136
passive_gc_count...... 1136
oldest_dirty_gc_count. 95
n_gc_blocks........... 96
bg_gcs................ 96
n_retired_writes...... 0
n_retired_blocks...... 0
n_ecc_fixed........... 0
n_ecc_unfixed......... 0
n_tags_ecc_fixed...... 0
n_tags_ecc_unfixed.... 0
cache_hits............ 0
n_deleted_files....... 0
n_unlinked_files...... 289
refresh_count......... 1
n_bg_deletions........ 0
Device 2 "data"
start_block........... 0
end_block............. 927
total_bytes_per_chunk. 2048
use_nand_ecc.......... 1
no_tags_ecc........... 1
is_yaffs2............. 1
inband_tags........... 0
empty_lost_n_found.... 0
disable_lazy_load..... 0
refresh_period........ 500
n_caches.............. 10
n_reserved_blocks..... 5
always_check_erased... 0
data_bytes_per_chunk.. 2048
chunk_grp_bits........ 0
chunk_grp_size........ 1
n_erased_blocks....... 10
blocks_in_checkpt..... 0
n_tnodes.............. 4211
n_obj................. 24
n_free_chunks......... 658
n_page_writes......... 430
n_page_reads.......... 467
n_erasures............ 7
n_gc_copies........... 421
all_gcs............... 20
passive_gc_count...... 13
oldest_dirty_gc_count. 3
n_gc_blocks........... 6
bg_gcs................ 4
n_retired_writes...... 0
n_retired_blocks...... 0
n_ecc_fixed........... 0
n_ecc_unfixed......... 0
n_tags_ecc_fixed...... 0
n_tags_ecc_unfixed.... 0
cache_hits............ 0
n_deleted_files....... 0
n_unlinked_files...... 2
refresh_count......... 1
n_bg_deletions........ 0
#
The log and config files are stored in "data".
Thanks!
In general this could be a disk space problem (here, flash). First of all, check your free flash space with df -h (or another command you have; df is present in BusyBox). If the flash space (especially on the partition your program writes to) is fine, you may have run out of inodes; you can see inode usage with df -i. (A good link for this: https://wiki.gentoo.org/wiki/Knowledge_Base:No_space_left_on_device_while_there_is_plenty_of_space_available)
If none of these is the cause, I think you need to take a deeper look at your code, especially where it deals with disk I/O.
It is also worth watching memory and heap usage, and freeing everything you allocate in your functions.
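As a rough cross-check of the /proc/yaffs dump in the question, the sketch below multiplies n_free_chunks by data_bytes_per_chunk for each device. Treating that product as the free space is an assumption made for this example; YAFFS also holds back n_reserved_blocks and needs erased blocks for garbage collection, so the usable amount may be lower. `freeBytes` is a hypothetical helper.

```go
package main

import "fmt"

// freeBytes estimates free space as free chunks times bytes per chunk.
// This is an approximation, not how the kernel reports free space.
func freeBytes(freeChunks, bytesPerChunk int64) int64 {
	return freeChunks * bytesPerChunk
}

func main() {
	const chunk = 2048 // data_bytes_per_chunk on both devices

	// n_free_chunks values copied from the /proc/yaffs output above.
	devices := []struct {
		name       string
		freeChunks int64
	}{
		{"rootfs", 23579},
		{"data", 658},
	}

	for _, d := range devices {
		b := freeBytes(d.freeChunks, chunk)
		fmt.Printf("%-6s %d bytes (%.2f MiB)\n", d.name, b, float64(b)/(1024*1024))
	}
}
```

Under this assumption "rootfs" has about 46 MiB free, but "data", where the logs and config files live, has only 1,347,584 bytes (about 1.3 MiB), which is less than the 2M log limit mentioned in the question. That would be worth comparing with what df reports for the "data" partition.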

What diagnostic tools are available for Node.js applications?

There are many tools out there. Which diagnostic tools are good for diagnosing memory leak issues in Node.js applications?
Yes, IDDE is a powerful tool not only for memory leak detection, but for a wide variety of problem determination of Node.js misbehaviors, including crashes and hangs.
Here is the link for overview, installation, and what is new information: https://www.ibm.com/developerworks/java/jdk/tools/idde
I would start with the nodeoverview command. Note that every command starts with a bang (!) and is submitted with Ctrl+Enter.
!nodeoverview {
Heap and Garbage Collection
Memory allocator, used: 981 MB, available: 482 MB
GC Count: 144
This shows the occupancy of the heap.
Then, use jsmeminfo to figure out the predominant resident objects in the heap.
!jsmeminfo {
Memory allocator, used: 981 MB, available: 482 MB
Total Heap Objects: 21559924
Largest 5 heap objects Type Size (bytes) More information
0x00000000de06d319 FIXED_ARRAY_TYPE 131112 !array 0x00000000de06d319
0x00000000de0ac6d9 FIXED_ARRAY_TYPE 98360 !array 0x00000000de0ac6d9
0x00000000e90e2f09 ASCII_STRING_TYPE 48152 !string 0x00000000e90e2f09
0x00000000e9035099 ASCII_STRING_TYPE 48088 !string 0x00000000e9035099
0x00000000e9004101 ASCII_STRING_TYPE 40936 !string 0x00000000e9004101
Most Frequent 5 object types Frequency
JS_OBJECT_TYPE 15371393
FIXED_ARRAY_TYPE 6175379
ASCII_INTERNALIZED_STRING_TYPE 3476
BYTE_ARRAY_TYPE 1572
JS_FUNCTION_TYPE 1434
}
Review the application based on this information and decide whether the objects holding the memory, as shown, are justified or not.
If you want to 'dissect' the objects further to see the content, use object expansion commands such as !jsobject or !array:
!array 0x00000000de06d319 {
Array type : FIXED_ARRAY_TYPE
Len : 16387
Showing first 100 elements only
0 : 0xd9400000000 (SMI)
1 : 0x3fe00000000 (SMI)
2 : 0x400000000000 (SMI)
3 : 0x9a1103d1 (ASCII_INTERNALIZED_STRING_TYPE : !print 0x000000009A1103D1 )
4 : 0x9a1042a9 (ASCII_INTERNALIZED_STRING_TYPE : !print 0x000000009A1042A9 )
...
}
If you want to 'segregate' the entire heap into sections based on the objects' internal types, use jsgroupobjects. This is more useful when you have multiple dumps taken at different time intervals and want to compare which objects grew over time.
!jsgroupobjects {
Representative Object Address Object Type Num Objects Constructor Num Properties Properties
!jsobject 0x00000000c8244fd1 JS_OBJECT_TYPE 6133503 Object 0
!jsobject 0x00000000c8004161 JS_OBJECT_TYPE 6133499 Database 0
!jsobject 0x00000000c8004101 JS_OBJECT_TYPE 3066750 MyRecord 0
!jsobject 0x00000000c869b111 JS_OBJECT_TYPE 37302 Object 0
!jsobject 0x00000000de05b959 JS_FUNCTION_TYPE 542 0
!jsobject 0x00000000de04bcc1 JS_FUNCTION_TYPE 267 0
!jsobject 0x00000000de04aa09 JS_FUNCTION_TYPE 251 0
!jsobject 0x00000000de04a911 JS_FUNCTION_TYPE 227 0
!jsobject 0x00000000de0a48c9 JS_ARRAY_TYPE 190 Array 0
!jsobject 0x00000000de04a7e9 JS_FUNCTION_TYPE 102 0
!jsobject 0x00000000de04e379 JS_ARRAY_TYPE 34 Array 0
!jsobject 0x00000000de050db1 JS_OBJECT_TYPE 30 Object 0
!jsobject 0x00000000c2938151 JS_REGEXP_TYPE 18 RegExp 0
!jsobject 0x00000000c2955a11 JS_OBJECT_TYPE 15 NativeModule 0
!jsobject 0x00000000c2944519 JS_OBJECT_TYPE 11 Object 0
!jsobject 0x00003abc617bee71 JS_OBJECT_TYPE 102 CallSite 3 receiver, fun, pos
If you want to examine a single object, do jsobject on the object address.
!jsobject 0x00003abc617bee71 {
Object has fast properties
Number of descriptors : 3
Name Value More Information
receiver 0x0000251abe506c91
fun 0x00003abc617bb241
pos 0x00001dfd00000000 SMI = 0x1dfd
}
There is also the appmetrics module (https://www.npmjs.com/package/appmetrics), but it is more for monitoring and profiling.
You can check it out, it is useful.

Slowdown when using ghc parallel strategies

In order to learn about GHC's parallel strategies, I've written a simple particle simulator, that, given a particle's position, velocity, and acceleration, will project that particle's path forward.
import Control.Parallel.Strategies
-- Use phantom a to store axis.
newtype Pos a = Pos Double deriving Show
newtype Vel a = Vel Double deriving Show
newtype Acc a = Acc Double deriving Show
newtype TimeStep = TimeStep Double deriving Show
-- Phantom axis
data X
data Y
-- Position, velocity, acceleration for a particle.
data Particle = Particle (Pos X) (Pos Y) (Vel X) (Vel Y) (Acc X) (Acc Y) deriving (Show)
stepParticle :: TimeStep -> Particle -> Particle
stepParticle ts (Particle x y xv yv xa ya) =
    Particle x' y' xv' yv' xa' ya'
  where
    (x', xv', xa') = step ts x xv xa
    (y', yv', ya') = step ts y yv ya
-- Given a position, velocity, and accel, calculate the pos, vel, acc after
-- a given TimeStep.
step :: TimeStep -> Pos a -> Vel a -> Acc a -> (Pos a, Vel a, Acc a)
step (TimeStep ts) (Pos p) (Vel v) (Acc a) = (Pos p', Vel v', Acc a)
  where
    v' = ts * a + v
    p' = ts * v + p
-- Build a list of lazy infinite lists of a particles' travel
-- with each update a TimeStep apart. Evaluate each inner list in
-- parallel.
simulateParticlesPar :: TimeStep -> [Particle] -> [[Particle]]
simulateParticlesPar ts = withStrategy (parList (parBuffer 250 particleStrategy))
                        . fmap (simulateParticle ts)
-- Build a lazy infinite list of the particle's travel with each
-- update being a TimeStep apart.
simulateParticle :: TimeStep -> Particle -> [Particle]
simulateParticle ts m = m' : simulateParticle ts m'
  where
    m' = stepParticle ts m
particleStrategy :: Strategy Particle
particleStrategy (Particle (Pos x) (Pos y) (Vel xv) (Vel yv) (Acc xa) (Acc ya)) = do
    x' <- rseq x
    y' <- rseq y
    xv' <- rseq xv
    yv' <- rseq yv
    xa' <- rseq xa
    ya' <- rseq ya
    return $ Particle (Pos x') (Pos y') (Vel xv') (Vel yv') (Acc xa') (Acc ya')
main :: IO ()
main = do
    let world = replicate 100 (Particle (Pos 0) (Pos 0) (Vel 1) (Vel 1) (Acc 0) (Acc 0))
        ts = TimeStep 0.1
    print $ fmap (take 10000) (simulateParticlesPar ts world)
For each particle, I create a lazy infinite list projecting the particle's path into the future. I start out with 100 of these particles and project these all forward, my intention being to project each of these forward in parallel (roughly a spark per infinite list). If I project these lists forward long enough, I'd expect a significant speedup. Unfortunately, I see a slight slow down.
Compilation: ghc phys.hs -rtsopts -threaded -eventlog -O2
With 1 thread:
$ ./phys +RTS -N1 -sstderr -ls > /dev/null
24,264,983,224 bytes allocated in the heap
441,881,088 bytes copied during GC
1,942,848 bytes maximum residency (104 sample(s))
75,880 bytes maximum slop
7 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 46820 colls, 0 par 0.82s 0.88s 0.0000s 0.0039s
Gen 1 104 colls, 0 par 0.23s 0.23s 0.0022s 0.0037s
TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)
SPARKS: 1025000 (25 converted, 0 overflowed, 0 dud, 28680 GC'd, 996295 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 9.90s ( 10.09s elapsed)
GC time 1.05s ( 1.11s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 10.95s ( 11.20s elapsed)
Alloc rate 2,451,939,648 bytes per MUT second
Productivity 90.4% of total user, 88.4% of total elapsed
gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0
With 2 threads:
$ ./phys +RTS -N2 -sstderr -ls > /dev/null
24,314,635,280 bytes allocated in the heap
457,603,240 bytes copied during GC
1,962,152 bytes maximum residency (104 sample(s))
119,824 bytes maximum slop
7 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 46555 colls, 46555 par 1.40s 0.85s 0.0000s 0.0048s
Gen 1 104 colls, 103 par 0.42s 0.25s 0.0024s 0.0043s
Parallel GC work balance: 16.85% (serial 0%, perfect 100%)
TASKS: 6 (1 bound, 5 peak workers (5 total), using -N2)
SPARKS: 1025000 (1023572 converted, 0 overflowed, 0 dud, 1367 GC'd, 61 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 11.07s ( 11.20s elapsed)
GC time 1.82s ( 1.10s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 12.89s ( 12.30s elapsed)
Alloc rate 2,196,259,905 bytes per MUT second
Productivity 85.9% of total user, 90.0% of total elapsed
gc_alloc_block_sync: 9222
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 2393
I have an Intel i5 with 2 cores and 4 threads, and with -N4 it's 2x slower than -N1 (total time ~20 sec).
I've spent quite a bit of time trying different strategies, such as chunking the outer list (so each spark gets more than one stream to project forward) and using rpar for each field in particleStrategy, but I've yet to get any speed up at all.
Below is a zoomed in section of the eventlog under threadscope. As you can see, I'm getting almost no concurrency. Most of the work is being done by HEC0, with some activity from HEC1 interleaved in, but only one HEC is doing work at a time. This is pretty representative of all the strategies I've tried.
As a sanity check, I've run a few of the example programs from "Parallel and Concurrent Programming in Haskell" and also see slowdowns on these programs, even though I'm using the same params that give them significant speedups in the book! I'm beginning to think there's something wrong with my ghc.
$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.8.3
Installed from: https://ghcformacosx.github.io/
OS X 10.10.2
Update:
I've found this thread in the ghc tracker on an OS X threaded RTS performance regression: https://ghc.haskell.org/trac/ghc/ticket/7602. I'm hesitant to blame the compiler, but my -N4 output supports this hypothesis. The "Parallel GC work balance" is terrible:
$ ./phys +RTS -N4 -sstderr -ls > /dev/null
24,392,146,832 bytes allocated in the heap
481,001,648 bytes copied during GC
1,989,272 bytes maximum residency (104 sample(s))
181,208 bytes maximum slop
8 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 46555 colls, 46555 par 4.80s 1.98s 0.0000s 0.0055s
Gen 1 104 colls, 103 par 0.99s 0.39s 0.0037s 0.0048s
Parallel GC work balance: 7.59% (serial 0%, perfect 100%)
TASKS: 10 (1 bound, 9 peak workers (9 total), using -N4)
SPARKS: 1025000 (1023640 converted, 0 overflowed, 0 dud, 1331 GC'd, 29 fizzled)
INIT time 0.00s ( 0.01s elapsed)
MUT time 14.85s ( 13.12s elapsed)
GC time 5.79s ( 2.36s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 20.65s ( 15.49s elapsed)
Alloc rate 1,642,170,155 bytes per MUT second
Productivity 71.9% of total user, 95.9% of total elapsed
gc_alloc_block_sync: 61429
whitehole_spin: 0
gen[0].sync: 1
gen[1].sync: 617
On the other hand, I don't know if this explains my threadscope output, which shows a lack of any concurrency at all.

Why is my program faster with one core than with two cores?

I'm currently trying to understand how to program in parallel in Haskell. I'm following the paper "A Tutorial on Parallel and Concurrent Programming in Haskell" by Simon Peyton Jones and Satnam Singh. The source code is as follows:
module Main where
import Control.Parallel
import System.Time
main :: IO ()
main = do
    putStrLn "Starting computation....."
    t0 <- getClockTime
    pseq r1 (return())
    t1 <- getClockTime
    putStrLn ("sum: " ++ show r1)
    putStrLn ("time: " ++ show (secDiff t0 t1) ++ " seconds")
    putStrLn "Finish."
r1 :: Int
r1 = parSumFibEuler 38 5300
-- This is the Fibonacci number generator
fib :: Int -> Int
fib 0 = 0
fib 1 = 1
fib n = fib (n-1) + fib (n-2)
-- Gets the euler sum
mkList :: Int -> [Int]
mkList n = [1..n-1]
relprime :: Int -> Int -> Bool
relprime x y = gcd x y == 1
euler :: Int -> Int
euler n = length $ filter (relprime n) (mkList n)
sumEuler :: Int -> Int
sumEuler = sum.(map euler).mkList
-- Gets the sum of Euler and Fibonacci (NORMAL)
sumFibEuler :: Int -> Int -> Int
sumFibEuler a b = fib a + sumEuler b
-- Gets the sum of Euler and Fibonacci (PARALLEL)
parSumFibEuler :: Int -> Int -> Int
parSumFibEuler a b =
    f `par` (e `pseq`(f+e))
  where
    f = fib a
    e = sumEuler b
-- Measure time
secDiff :: ClockTime -> ClockTime -> Float
secDiff (TOD secs1 psecs1) (TOD secs2 psecs2)
    = fromInteger (psecs2 -psecs1) / 1e12 + fromInteger (secs2- secs1)
I compiled it with the following command:
ghc --make -threaded Main.hs
a) Ran it using 1 core:
./Main +RTS -N1
b) Ran it using 2 cores:
./Main +RTS -N2
However, the one-core run took 53.556 sec, whereas the two-core run took 73.401 sec. I don't understand how 2 cores can actually run slower than 1 core. Maybe the message passing overhead is too big for this small program? The paper has completely different outcomes compared to mine. The output details follow.
For 1 core:
Starting computation.....
sum: 47625790
time: 53.556335 seconds
Finish.
17,961,210,216 bytes allocated in the heap
12,595,880 bytes copied during GC
176,536 bytes maximum residency (3 sample(s))
23,904 bytes maximum slop
2 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 34389 colls, 0 par 2.54s 2.57s 0.0001s 0.0123s
Gen 1 3 colls, 0 par 0.00s 0.00s 0.0007s 0.0010s
Parallel GC work balance: -nan (0 / 0, ideal 1)
MUT time (elapsed) GC time (elapsed)
Task 0 (worker) : 0.00s ( 0.00s) 0.00s ( 0.00s)
Task 1 (worker) : 0.00s ( 53.56s) 0.00s ( 0.00s)
Task 2 (bound) : 50.49s ( 50.99s) 2.52s ( 2.57s)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 50.47s ( 50.99s elapsed)
GC time 2.54s ( 2.57s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 53.02s ( 53.56s elapsed)
Alloc rate 355,810,305 bytes per MUT second
Productivity 95.2% of total user, 94.2% of total elapsed
gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0
For 2 cores:
Starting computation.....
sum: 47625790
time: 73.401146 seconds
Finish.
17,961,210,256 bytes allocated in the heap
12,558,088 bytes copied during GC
176,536 bytes maximum residency (3 sample(s))
195,936 bytes maximum slop
3 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 34389 colls, 34388 par 7.42s 4.73s 0.0001s 0.0205s
Gen 1 3 colls, 3 par 0.01s 0.00s 0.0011s 0.0017s
Parallel GC work balance: 1.00 (1432193 / 1429197, ideal 2)
MUT time (elapsed) GC time (elapsed)
Task 0 (worker) : 1.19s ( 40.26s) 16.95s ( 33.15s)
Task 1 (worker) : 0.00s ( 73.40s) 0.00s ( 0.00s)
Task 2 (bound) : 54.50s ( 68.67s) 3.66s ( 4.73s)
Task 3 (worker) : 0.00s ( 73.41s) 0.00s ( 0.00s)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 68.87s ( 68.67s elapsed)
GC time 7.43s ( 4.73s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 76.31s ( 73.41s elapsed)
Alloc rate 260,751,318 bytes per MUT second
Productivity 90.3% of total user, 93.8% of total elapsed
gc_alloc_block_sync: 12254
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0
r1 = sumFibEuler 38 5300
I believe that you meant
r1 = parSumFibEuler 38 5300
On my configuration (with parSumFibEuler 45 8000 and with only one run):
With -N1: 126.83s
With -N2: 115.46s
I suspect the fib function consumes much more CPU than sumEuler. That would explain the small improvement with -N2. There won't be any work-stealing in your situation.
With memoization, your Fibonacci function would be much faster, but I don't think that's what you wanted to try.
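The reasoning above can be put into numbers: with only two independent tasks (one spark for fib, the main thread for sumEuler), the elapsed time cannot drop below the longer of the two. The sketch below computes that upper bound. The 90/10 and 50/50 splits are hypothetical figures chosen for illustration, not measurements from the program above, and `bestCaseSpeedup` is a made-up helper.

```go
package main

import "fmt"

// bestCaseSpeedup returns the upper bound on speedup when two independent
// tasks run on two cores: sequential time is tA+tB, and parallel time
// cannot be shorter than the longer task.
func bestCaseSpeedup(tA, tB float64) float64 {
	longer := tA
	if tB > longer {
		longer = tB
	}
	return (tA + tB) / longer
}

func main() {
	// Hypothetical split: one task takes 90 s, the other 10 s.
	fmt.Printf("unbalanced: %.2fx\n", bestCaseSpeedup(90, 10)) // prints "unbalanced: 1.11x"

	// Hypothetical evenly balanced tasks.
	fmt.Printf("balanced:   %.2fx\n", bestCaseSpeedup(50, 50)) // prints "balanced:   2.00x"
}
```

So if one of the two computations dominates, even a perfect two-core run gains very little, and scheduling and GC synchronization overhead can easily eat the rest.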
EDIT: as mentioned in the comments, I think that with -N2 you have a lot of interruptions since you have two cores available.
Example on my configuration (4 cores) with sum $ parMap rdeepseq (fib) [1..40]
with -N1 it takes ~26s
with -N2 it takes ~16s
with -N3 it takes ~13s
with -N4 it takes ~30s (well, that Haskell program is not alone here)
From here:
Be careful when using all the processors in your machine: if some of
your processors are in use by other programs, this can actually harm
performance rather than improve it.
