Excessive mysterious system time use in a GHC-compiled binary - linux

I'm working on an exploration of automatic bounding of constraint-base searches. As such, my starting point is the SEND MORE MONEY problem, with a solution based on nondeterministic selection without replacement. I've modified the approach to count the number of samples performed, in order to better measure the impact of adding constraints to the search.
import Control.Monad.State
import Control.Monad.Trans.List
import Control.Monad.Morph
import Data.List (foldl')
type CS a b = StateT [a] (ListT (State Int)) b
select' :: [a] -> [(a, [a])]
select' [] = []
select' (x:xs) = (x, xs) : [(y, x:ys) | ~(y, ys) <- select' xs]
select :: CS a a
select = do
i <- lift . lift $ get
xs <- get
lift . lift . put $! i + length xs
hoist (ListT . return) (StateT select')
runCS :: CS a b -> [a] -> ([b], Int)
runCS a xs = flip runState 0 . runListT $ evalStateT a xs
fromDigits :: [Int] -> Int
fromDigits = foldl' (\x y -> 10 * x + y) 0
sendMoreMoney :: ([(Int, Int, Int)], Int)
sendMoreMoney = flip runCS [0..9] $ do
[s,e,n,d,m,o,r,y] <- replicateM 8 select
let send = fromDigits [s,e,n,d]
more = fromDigits [m,o,r,e]
money = fromDigits [m,o,n,e,y]
guard $ s /= 0 && m /= 0 && send + more == money
return (send, more, money)
main :: IO ()
main = print sendMoreMoney
It works, it gets correct results, and it maintains a flat heap profile during the search. But even so, it's slow. It's something like 20x slower than without counting the selections. Even that isn't terrible. I can live with paying a huge penalty in order to collect these performance numbers.
But I still don't want the performance to be needlessly terrible, so I decided to look for low-hanging fruit in terms of performance. And I came across some baffling results when I did.
$ ghc -O2 -Wall -fforce-recomp -rtsopts statefulbacktrack.hs
[1 of 1] Compiling Main ( statefulbacktrack.hs, statefulbacktrack.o )
Linking statefulbacktrack ...
$ time ./statefulbacktrack
([(9567,1085,10652)],2606500)
real 0m6.960s
user 0m3.880s
sys 0m2.968s
That system time is utterly ridiculous. The program performs output once. Where's it all going? My next step was checking strace.
$ strace -cf ./statefulbacktrack
([(9567,1085,10652)],2606500)
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
98.38 0.033798 1469 23 munmap
1.08 0.000370 0 21273 rt_sigprocmask
0.26 0.000090 0 10638 clock_gettime
0.21 0.000073 0 10638 getrusage
0.07 0.000023 4 6 mprotect
0.00 0.000000 0 8 read
0.00 0.000000 0 1 write
0.00 0.000000 0 144 134 open
0.00 0.000000 0 10 close
0.00 0.000000 0 1 execve
0.00 0.000000 0 9 9 access
0.00 0.000000 0 3 brk
0.00 0.000000 0 1 ioctl
0.00 0.000000 0 847 sigreturn
0.00 0.000000 0 1 uname
0.00 0.000000 0 1 select
0.00 0.000000 0 13 rt_sigaction
0.00 0.000000 0 1 getrlimit
0.00 0.000000 0 387 mmap2
0.00 0.000000 0 16 15 stat64
0.00 0.000000 0 10 fstat64
0.00 0.000000 0 1 1 futex
0.00 0.000000 0 1 set_thread_area
0.00 0.000000 0 1 set_tid_address
0.00 0.000000 0 1 timer_create
0.00 0.000000 0 2 timer_settime
0.00 0.000000 0 1 timer_delete
0.00 0.000000 0 1 set_robust_list
------ ----------- ----------- --------- --------- ----------------
100.00 0.034354 44039 159 total
So.. strace tells me only 0.034354s was spent in system calls.
Where's the rest of the sys time reported by time going?
One further data point: GC time is really high. Is there an easy way to bring that down?
$ ./statefulbacktrack +RTS -s
([(9567,1085,10652)],2606500)
5,541,572,660 bytes allocated in the heap
1,465,208,164 bytes copied during GC
27,317,868 bytes maximum residency (66 sample(s))
635,056 bytes maximum slop
65 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 10568 colls, 0 par 1.924s 2.658s 0.0003s 0.0081s
Gen 1 66 colls, 0 par 0.696s 2.226s 0.0337s 0.1059s
INIT time 0.000s ( 0.001s elapsed)
MUT time 1.656s ( 2.279s elapsed)
GC time 2.620s ( 4.884s elapsed)
EXIT time 0.000s ( 0.009s elapsed)
Total time 4.276s ( 7.172s elapsed)
%GC time 61.3% (68.1% elapsed)
Alloc rate 3,346,131,972 bytes per MUT second
Productivity 38.7% of total user, 23.1% of total elapsed
System Info:
$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.10.1
$ uname -a
Linux debian 3.2.0-4-686-pae #1 SMP Debian 3.2.68-1+deb7u1 i686 GNU/Linux
Running a Debian 7 virtual machine in VMWare Player 7.10 hosted on Windows 8.1.

Be sure to add -H128 to your build command line after
+RTS -s
Your eval looks fine, so you are good to go there.
If you really wanted to go after sluggishness on this VM, increase the thread priority on the VM (and the VM console slightly if you want).
Another unexpected penalty will be due to sync confirmation for GC (since this is SMP Debian on a multicore system).
The GC will have even more VM manipulation to perform on any multicore system, which partially explains the 61 percent GC stat and your strace and time discrepancy. The stats are not reliable for most situations anyway
You are actually doing quite well - - especially if this is on an i7 or later, for example.
I would be surprised if the -H128 option does not resolve this.
I am new here, please do let me know if I can help further or if there is anything you require prior to doling out the bounty.

Related

"vmstat" and "perf stat -a" show different numbers for context-switching

I'm trying to understand the context-switching rate on my system (running on AWS EC2), and where the switches are coming from. Just getting the number is already confusing, as two tools that I know can output such a metric give me different results. Here's the output from vmstat:
$ vmstat -w 2
procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
r b swpd free buff cache si so bi bo in cs us sy id wa st
8 0 0 443888 492304 8632452 0 0 0 1 0 0 14 2 84 0 0
37 0 0 444820 492304 8632456 0 0 0 20 131602 155911 43 5 52 0 0
8 0 0 445040 492304 8632460 0 0 0 42 131117 147812 46 4 50 0 0
13 0 0 446572 492304 8632464 0 0 0 34 129154 142260 49 4 46 0 0
The number is ~140k-160k/sec.
But perf tells something else:
$ sudo perf stat -a
Performance counter stats for 'system wide':
2980794.013800 cpu-clock (msec) # 35.997 CPUs utilized
12,335,935 context-switches # 0.004 M/sec
2,086,162 cpu-migrations # 0.700 K/sec
11,617 page-faults # 0.004 K/sec
...
0.004 M/sec is apparently 4k/sec.
Why is there a disparity between the two tools? Am I misinterpreting something in either of them, or are their CS metrics somehow different?
FWIW, I've tried doing the same on a machine running a different workload, and the difference there is even twice larger.
Environment:
AWS EC2 c5.9xlarge instance
Amazon Linux, kernel 4.14.94-73.73.amzn1.x86_64
The service runs on Docker 18.06.1-ce
Some recent versions of perf have a unit-scaling bug in the printing code. Manually do 12.3M / wall-time and see if that's sane. (spoiler alert: it is according to OP's comment.)
https://lore.kernel.org/patchwork/patch/1025968/
Commit 0aa802a79469 ("perf stat: Get rid of extra clock display
function") introduced the bug in mainline Linux 4.19-rc1 or so.
Thus, perf_stat__update_shadow_stats() now saves scaled values of clock events
in msecs, instead of original nsecs. But while calculating values of
shadow stats we still consider clock event values in nsecs. This results
in a wrong shadow stat values.
Commit 57ddf09173c1 on Mon, 17 Dec 2018 fixed it in 5.0-rc1, eventually being released with perf upstream version 5.0.
Vendor kernel trees that cherry-pick commits for their stable kernels might have the bug or have fixed the bug earlier.

Parallel Fibonacci example from "Parallel and Concurrent Programming" [duplicate]

I tried running the first example here: http://chimera.labs.oreilly.com/books/1230000000929/ch03.html
Code: https://github.com/simonmar/parconc-examples/blob/master/strat.hs
import Control.Parallel
import Control.Parallel.Strategies (rpar, Strategy, using)
import Text.Printf
import System.Environment
-- <<fib
fib :: Integer -> Integer
fib 0 = 1
fib 1 = 1
fib n = fib (n-1) + fib (n-2)
-- >>
main = print pair
where
pair =
-- <<pair
(fib 35, fib 36) `using` parPair
-- >>
-- <<parPair
parPair :: Strategy (a,b)
parPair (a,b) = do
a' <- rpar a
b' <- rpar b
return (a',b')
-- >>
I've built using ghc 7.10.2 (on OSX, with a multicore machine) using the following command:
ghc -O2 strat.hs -threaded -rtsopts -eventlog
And run using:
./strat +RTS -N2 -l -s
I expected the 2 fibs calculations to be run in parallel (previous chapter examples worked as expected, so no setup issues), and I wasn't getting any speedup at all, as seen here:
% ./strat +RTS -N2 -l -s
(14930352,24157817)
3,127,178,800 bytes allocated in the heap
6,323,360 bytes copied during GC
70,000 bytes maximum residency (2 sample(s))
31,576 bytes maximum slop
2 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 5963 colls, 5963 par 0.179s 0.074s 0.0000s 0.0001s
Gen 1 2 colls, 1 par 0.000s 0.000s 0.0001s 0.0001s
Parallel GC work balance: 2.34% (serial 0%, perfect 100%)
TASKS: 6 (1 bound, 5 peak workers (5 total), using -N2)
SPARKS: 2 (0 converted, 0 overflowed, 0 dud, 1 GC'd, 1 fizzled)
INIT time 0.000s ( 0.001s elapsed)
MUT time 1.809s ( 1.870s elapsed)
GC time 0.180s ( 0.074s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 1.991s ( 1.945s elapsed)
Alloc rate 1,728,514,772 bytes per MUT second
Productivity 91.0% of total user, 93.1% of total elapsed
gc_alloc_block_sync: 238
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0
-N1 gets similar results (omitted).
The # of GC collections seemed suspicious, as pointed out by others in #haskell-beginners, so I tried adding -A16M when running. The results looked much more in line with expectations:
% ./strat +RTS -N2 -l -s -A16M
(14930352,24157817)
3,127,179,920 bytes allocated in the heap
260,960 bytes copied during GC
69,984 bytes maximum residency (2 sample(s))
28,320 bytes maximum slop
33 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 115 colls, 115 par 0.105s 0.002s 0.0000s 0.0003s
Gen 1 2 colls, 1 par 0.000s 0.000s 0.0002s 0.0002s
Parallel GC work balance: 71.25% (serial 0%, perfect 100%)
TASKS: 6 (1 bound, 5 peak workers (5 total), using -N2)
SPARKS: 2 (1 converted, 0 overflowed, 0 dud, 0 GC'd, 1 fizzled)
INIT time 0.001s ( 0.001s elapsed)
MUT time 1.579s ( 1.087s elapsed)
GC time 0.106s ( 0.002s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 1.686s ( 1.091s elapsed)
Alloc rate 1,980,993,138 bytes per MUT second
Productivity 93.7% of total user, 144.8% of total elapsed
gc_alloc_block_sync: 27
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0
The question is: Why is this the behavior? Even with frequent GC, I still intuitively expect the 2 sparks to run in parallel in the other 90% of the running time.
Yes, this is actually a bug in GHC 8.0.1 and earlier (I'm working on fixing it for 8.0.2). The problem is that the fib 35 and fib 36 expressions are constant and so GHC lifts them to the top level as CAFs, and the RTS was wrongly assuming that the CAFs were unreachable and so garbage collecting the sparks.
You can work around it by making the expressions non-constant by passing in parameters on the command line:
main = do
[a,b] <- map read <$> getArgs
let pair = (fib a, fib b) `using` parPair
print pair
and then run the program with ./strat 35 36.

Is there a parallel find in Haskell?

I have some kind of brute force problem I like to solve in Haskell. My machine has 16 cores so I want to speed up my current algorithm a bit.
I have a method "tryCombination" which returns either a Just (String) or a Nothing. My loop looks like this:
findSolution = find (isJust) [tryCombination a1 a2 a3 n z p |
a1 <- [600..700],
a2 <- [600..700],
a3 <- [600..700],
n <- [1..100],
....
I know there is a special parMap to parallelize a map function. A mapFind could be tricky as it is not predictable, if a thread really finds the first occurence. But is there something like a mapAny to speed up the search?
EDIT:
I rewrote the code using the "withStrategy (parList rseq)" snippet. The status report looks like this:
38,929,334,968 bytes allocated in the heap
2,215,280,048 bytes copied during GC
3,505,624 bytes maximum residency (795 sample(s))
202,696 bytes maximum slop
15 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 44922 colls, 44922 par 37.33s 8.34s 0.0002s 0.0470s
Gen 1 795 colls, 794 par 7.58s 1.43s 0.0018s 0.0466s
Parallel GC work balance: 4.36% (serial 0%, perfect 100%)
TASKS: 10 (1 bound, 9 peak workers (9 total), using -N8)
SPARKS: 17576 (8198 converted, 9378 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 81.79s ( 36.37s elapsed)
GC time 44.91s ( 9.77s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 126.72s ( 46.14s elapsed)
Alloc rate 475,959,220 bytes per MUT second
Productivity 64.6% of total user, 177.3% of total elapsed
gc_alloc_block_sync: 834851
whitehole_spin: 0
gen[0].sync: 10
gen[1].sync: 3724
As I already mentioned (see my comments), all the cores are working for only three seconds (wenn all the sparks are processed). The following 30s all the work is done by a single core. How can I optimize still more?
Some more EDIT:
I now gave "withStrategy (parBuffer 10 rdeepseq)" a try and fiddled around with different buffer sizes:
Buffersize GC work Balance MUT GC
10 50% 11,69s 0,94s
100 47% 12,31s 1,67s
500 40% 11,5 s 1,35s
5000 21% 11,47s 2,25s
First of all I can say, that this is a big improvement against the 59s it took without any multithreading. The second conclusion is, that the buffer size should be as small as possible but bigger than the number of cores.
But the best is, that I have neither overflowed nor fizzled sparks any more. All were converted successfully.
Depending on the lazyness of tryCombination and the desired parallelization, one of these might do what you want:
import Control.Parallel.Strategies
findSolution =
find (isJust) $
withStrategy (parList rseq) $
[ tryCombination a1 a2 a3 n z p
| a1 <- [600..700]
, a2 <- [600..700]
, a3 <- [600..700]
, n <- [1..100]]
This paralleizes the work performed by tryCombination to figure out whether it is a Just or a Nothing, but not the actual result in the Just.
If there is no such lazyness to be exploited and the result type is simple, it might work better to write
findSolution =
find (isJust) $
withStrategy (parList rdeepseq) $
[ tryCombination a1 a2 a3 n z p
| a1 <- [600..700]
, a2 <- [600..700]
, a3 <- [600..700]
, n <- [1..100]]

Using GHC's profiling stats/charts to identify trouble-areas / improve performance of Haskell code

TL;DR: Based on the Haskell code and it's associated profiling data below, what conclusions can we draw that let us modify/improve it so we can narrow the performance gap vs. the same algorithm written in imperative languages (namely C++ / Python / C# but the specific language isn't important)?
Background
I wrote the following piece of code as an answer to a question on a popular site which contains many questions of a programming and/or mathematical nature. (You've probably heard of this site, whose name is pronounced "oiler" by some, "yoolurr" by others.) Since the code below is a solution to one of the problems, I'm intentionally avoiding any mention of the site's name or any specific terms in the problem. That said, I'm talking about problem one hundred and three.
(In fact, I've seen many solutions in the site's forums from resident Haskell wizards :P)
Why did I choose to profile this code?
This was the first problem (on said site) in which I encountered a difference in performance (as measured by execution time) between Haskell code vs. C++/Python/C# code (when both use a similar algorithm). In fact, it was the case for all of the problems (thus far; I've done ~100 problems but not sequentially) that an optimized Haskell code was pretty much neck-and-neck with the fastest C++ solutions, ceteris paribus for the algorithm, of course.
However, the posts in the forum for this particular problem would indicate that the same algorithm in these other languages typically require at most one or two seconds, with the longest taking 10-15 sec (assuming the same starting parameters; I'm ignoring the very naive algorithms that take 2-3 min+). In contrast, the Haskell code below required ~50 sec on my (decent) computer (with profiling disabled; with profiling enabled, it takes ~2 min, as you can see below; note: the exec time was identical when compiling with -fllvm). Specs: i5 2.4ghz laptop, 8gb RAM.
In an effort to learn Haskell in a way that it can become a viable substitute to the imperative languages, one of my aims in solving these problems is learning to write code that, to the extent possible, has performance that's on par with those imperative languages. In that context, I still consider the problem as yet unsolved by me (since there's nearly a ~25x difference in performance!)
What have I done so far?
In addition to the obvious step of streamlining the code itself (to the best of my ability), I've also performed the standard profiling exercises that are recommended in "Real World Haskell".
But I'm having a hard time drawing conclusions that that tell me which pieces need to be modified. That's where I'm hoping folks might be able to help provide some guidance.
Description of the problem:
I'd refer you to the website of problem one hundred and three on the aforementioned site but here's a brief summary: the goal is to find a group of seven numbers such that any two disjoint subgroups (of that group) satisfy the following two properties (I'm trying to avoid using the 's-e-t' word for reasons mentioned above...):
no two subgroups sum to the same amount
the subgroup with more elements has a larger sum (in other words, the sum of the smallest four elements must be greater than the sum of the largest three elements).
In particular, we are trying to find the group of seven numbers with the smallest sum.
My (admittedly weak) observations
A warning: some of these comments may well be totally wrong but I wanted to atleast take a stab at interpreting the profiling data based on what I read in Real World Haskell and other profiling-related posts on SO.
There does indeed seem to be an efficiency issue seeing as how one-third of the time is spent doing garbage collection (37.1%). The first table of figures shows that ~172gb is allocated in the heap, which seems horrible... (Maybe there's a better structure / function to use for implementing the dynamic programming solution?)
Not surprisingly, the vast majority (83.1%) of time is spent checking rule 1: (i) 41.6% in the value sub-function, which determines values to fill in the dynamic programming ("DP") table, (ii) 29.1% in the table function, which generates the DP table and (iii) 12.4% in the rule1 function, which checks the resulting DP table to make sure that a given sum can only be calculated in one way (i.e., from one subgroup).
However, I did find it surprising that more time was spent in the value function relative to the table and rule1 functions given that it's the only one of the three which doesn't construct an array or filter through a large number of elements (it's really only performing O(1) lookups and making comparisons between Int types, which you'd think would be relatively quick). So this is a potential problem area. That said, it's unlikely that the value function is driving the high heap-allocation
Frankly, I'm not sure what to make of the three charts.
Heap profile chart (i.e., the first char below):
I'm honestly not sure what is represented by the red area marked as Pinned. It makes sense that the dynamic function has a "spiky" memory allocation because it's called every time the construct function generates a tuple that meets the first three criteria and, each time it's called, it creates a decently large DP array. Also, I'd think that the allocation of memory to store the tuples (generated by construct) wouldn't be flat over the course of the program.
Pending clarification of the "Pinned" red area, I'm not sure this one tells us anything useful.
Allocation by type and allocation by constructor:
I suspect that the ARR_WORDS (which represents a ByteString or unboxed Array according to the GHC docs) represents the low-level execution of the construction of the DP array (in the table function). Nut I'm not 100% sure.
I'm not sure what's the FROZEN and STATIC pointer categories correspond to.
Like I said, I'm really not sure how to interpret the charts as nothing jumps out (to me) as unexpected.
The code and the profiling results
Without further ado, here's the code with comments explaining my algorithm. I've tried to make sure the code doesn't run off of the right-side of the code-box - but some of the comments do require scrolling (sorry).
{-# LANGUAGE NoImplicitPrelude #-}
{-# OPTIONS_GHC -Wall #-}
import CorePrelude
import Data.Array
import Data.List
import Data.Bool.HT ((?:))
import Control.Monad (guard)
main = print (minimum construct)
cap = 55 :: Int
flr = 20 :: Int
step = 1 :: Int
--we enumerate tuples that are potentially valid and then
--filter for valid ones; we perform the most computationally
--expensive step (i.e., rule 1) at the very end
construct :: [[Int]]
construct = {-# SCC "construct" #-} do
a <- [flr..cap] --1st: we construct potentially valid tuples while applying a
b <- [a+step..cap] --constraint on the upper bound of any element as implied by rule 2
c <- [b+step..a+b-1]
d <- [c+step..a+b-1]
e <- [d+step..a+b-1]
f <- [e+step..a+b-1]
g <- [f+step..a+b-1]
guard (a + b + c + d - e - f - g > 0) --2nd: we screen for tuples that completely conform to rule 2
let nn = [g,f,e,d,c,b,a]
guard (sum nn < 285) --3rd: we screen for tuples of a certain size (a guess to speed things up)
guard (rule1 nn) --4th: we screen for tuples that conform to rule 1
return nn
rule1 :: [Int] -> Bool
rule1 nn = {-# SCC "rule1" #-}
null . filter ((>1) . snd) --confirm that there's only one subgroup that sums to any given sum
. filter ((length nn==) . snd . fst) --the last column us how many subgroups sum to a given sum
. assocs --run the dynamic programming algorithm and generate a table
$ dynamic nn
dynamic :: [Int] -> Array (Int,Int) Int
dynamic ns = {-# SCC "dynamic" #-} table
where
(len, maxSum) = (length &&& sum) ns
table = array ((0,0),(maxSum,len))
[ ((s,i),x) | s <- [0..maxSum], i <- [0..len], let x = value (s,i) ]
elements = listArray (0,len) (0:ns)
value (s,i)
| i == 0 || s == 0 = 0
| s == m = table ! (s,i-1) + 1
| s > m = s <= sum (take i ns) ?:
(table ! (s,i-1) + table ! ((s-m),i-1), 0)
| otherwise = 0
where
m = elements ! i
Stats on heap allocation, garbage collection and time elapsed:
% ghc -O2 --make 103_specialsubset2.hs -rtsopts -prof -auto-all -caf-all -fforce-recomp
[1 of 1] Compiling Main ( 103_specialsubset2.hs, 103_specialsubset2.o )
Linking 103_specialsubset2 ...
% time ./103_specialsubset2.hs +RTS -p -sstderr
zsh: permission denied: ./103_specialsubset2.hs
./103_specialsubset2.hs +RTS -p -sstderr 0.00s user 0.00s system 86% cpu 0.002 total
% time ./103_specialsubset2 +RTS -p -sstderr
SOLUTION REDACTED
172,449,596,840 bytes allocated in the heap
21,738,677,624 bytes copied during GC
261,128 bytes maximum residency (74 sample(s))
55,464 bytes maximum slop
2 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 327548 colls, 0 par 27.34s 41.64s 0.0001s 0.0092s
Gen 1 74 colls, 0 par 0.02s 0.02s 0.0003s 0.0013s
INIT time 0.00s ( 0.01s elapsed)
MUT time 53.91s ( 70.60s elapsed)
GC time 27.35s ( 41.66s elapsed)
RP time 0.00s ( 0.00s elapsed)
PROF time 0.00s ( 0.00s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 81.26s (112.27s elapsed)
%GC time 33.7% (37.1% elapsed)
Alloc rate 3,199,123,974 bytes per MUT second
Productivity 66.3% of total user, 48.0% of total elapsed
./103_specialsubset2 +RTS -p -sstderr 81.26s user 30.90s system 99% cpu 1:52.29 total
Stats on time spent per cost-centre:
Wed Dec 17 23:21 2014 Time and Allocation Profiling Report (Final)
103_specialsubset2 +RTS -p -sstderr -RTS
total time = 15.56 secs (15565 ticks # 1000 us, 1 processor)
total alloc = 118,221,354,488 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
dynamic.value Main 41.6 17.7
dynamic.table Main 29.1 37.8
construct Main 12.9 37.4
rule1 Main 12.4 7.0
dynamic.table.x Main 1.9 0.0
individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 55 0 0.0 0.0 100.0 100.0
main Main 111 0 0.0 0.0 0.0 0.0
CAF:main1 Main 108 0 0.0 0.0 0.0 0.0
main Main 110 1 0.0 0.0 0.0 0.0
CAF:main2 Main 107 0 0.0 0.0 0.0 0.0
main Main 112 0 0.0 0.0 0.0 0.0
CAF:main3 Main 106 0 0.0 0.0 0.0 0.0
main Main 113 0 0.0 0.0 0.0 0.0
CAF:construct Main 105 0 0.0 0.0 100.0 100.0
construct Main 114 1 0.6 0.0 100.0 100.0
construct Main 115 1 12.9 37.4 99.4 100.0
rule1 Main 123 282235 0.6 0.0 86.5 62.6
rule1 Main 124 282235 12.4 7.0 85.9 62.6
dynamic Main 125 282235 0.2 0.0 73.5 55.6
dynamic.elements Main 133 282235 0.3 0.1 0.3 0.1
dynamic.len Main 129 282235 0.0 0.0 0.0 0.0
dynamic.table Main 128 282235 29.1 37.8 72.9 55.5
dynamic.table.x Main 130 133204473 1.9 0.0 43.8 17.7
dynamic.value Main 131 133204473 41.6 17.7 41.9 17.7
dynamic.value.m Main 132 132640003 0.3 0.0 0.3 0.0
dynamic.maxSum Main 127 282235 0.0 0.0 0.0 0.0
dynamic.(...) Main 126 282235 0.1 0.0 0.1 0.0
dynamic Main 122 282235 0.0 0.0 0.0 0.0
construct.nn Main 121 12683926 0.0 0.0 0.0 0.0
CAF:main4 Main 102 0 0.0 0.0 0.0 0.0
construct Main 116 0 0.0 0.0 0.0 0.0
construct Main 117 0 0.0 0.0 0.0 0.0
CAF:cap Main 101 0 0.0 0.0 0.0 0.0
cap Main 119 1 0.0 0.0 0.0 0.0
CAF:flr Main 100 0 0.0 0.0 0.0 0.0
flr Main 118 1 0.0 0.0 0.0 0.0
CAF:step_r1dD Main 99 0 0.0 0.0 0.0 0.0
step Main 120 1 0.0 0.0 0.0 0.0
CAF GHC.IO.Handle.FD 96 0 0.0 0.0 0.0 0.0
CAF GHC.Conc.Signal 93 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding 91 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding.Iconv 82 0 0.0 0.0 0.0 0.0
Heap profile:
Allocation by type:
Allocation by constructors:
There is a lot that can be said. In this answer I'll just comment on the nested list comprehensions in the construct function.
To get an idea on what's going on in construct we'll isolate it and compare it to a nested loop version that you would write in an imperative language. We've removed the rule1 guard to test only the generation of lists.
-- List.hs -- using list comprehensions
import Control.Monad
cap = 55 :: Int
flr = 20 :: Int
step = 1 :: Int
construct :: [[Int]]
construct = do
a <- [flr..cap]
b <- [a+step..cap]
c <- [b+step..a+b-1]
d <- [c+step..a+b-1]
e <- [d+step..a+b-1]
f <- [e+step..a+b-1]
g <- [f+step..a+b-1]
guard (a + b + c + d - e - f - g > 0)
guard (a + b + c + d + e + f + g < 285)
return [g,f,e,d,c,b,a]
-- guard (rule1 nn)
main = do
forM_ construct print
-- Loops.hs -- using imperative looping
import Control.Monad
loop a b f = go a
where go i | i > b = return ()
| otherwise = do f i; go (i+1)
cap = 55 :: Int
flr = 20 :: Int
step = 1 :: Int
main =
loop flr cap $ \a ->
loop (a+step) cap $ \b ->
loop (b+step) (a+b-1) $ \c ->
loop (c+step) (a+b-1) $ \d ->
loop (d+step) (a+b-1) $ \e ->
loop (e+step) (a+b-1) $ \f ->
loop (f+step) (a+b-1) $ \g ->
if (a+b+c+d-e-f-g > 0) && (a+b+c+d+e+f+g < 285)
then print [g,f,e,d,c,b,a]
else return ()
Both programs were compiled with ghc -O2 -rtsopts and run with prog +RTS -s > out.
Here is a summary of the results:
Lists.hs Loops.hs
Heap allocation 44,913 MB 2,740 MB
Max. Residency 44,312 44,312
%GC 5.8 % 1.7 %
Total Time 9.48 secs 1.43 secs
As you can see, the loop version, which is the way you would write this in a language like C,
wins in every category.
The list comprehension version is cleaner and more composable but also less performant than direct iteration.

Avoiding allocations in Haskell

I'm working on a more complicated program I would like to be very efficient, but I have boiled down my concerns for right now to the following simple program:
main :: IO ()
main = print $ foldl (+) 0 [(1::Int)..1000000]
Here I build and run it.
$ uname -s -r -v -m
Linux 3.12.9-x86_64-linode37 #1 SMP Mon Feb 3 10:01:02 EST 2014 x86_64
$ ghc -V
The Glorious Glasgow Haskell Compilation System, version 7.4.1
$ ghc -O -prof --make B.hs
$ ./B +RTS -P
500000500000
$ less B.prof
Sun Feb 16 16:37 2014 Time and Allocation Profiling Report (Final)
B +RTS -P -RTS
total time = 0.04 secs (38 ticks # 1000 us, 1 processor)
total alloc = 80,049,792 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc ticks bytes
CAF Main 100.0 99.9 38 80000528
individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc ticks bytes
MAIN MAIN 44 0 0.0 0.0 100.0 100.0 0 10872
CAF Main 87 0 100.0 99.9 100.0 99.9 38 80000528
CAF GHC.IO.Handle.FD 85 0 0.0 0.0 0.0 0.0 0 34672
CAF GHC.Conc.Signal 83 0 0.0 0.0 0.0 0.0 0 672
CAF GHC.IO.Encoding 76 0 0.0 0.0 0.0 0.0 0 2800
CAF GHC.IO.Encoding.Iconv 60 0 0.0 0.0 0.0 0.0 0 248
It looks like 80 bytes are being allocated per iteration. I think it's quite reasonable to expect the compiler to generate allocation-free code here.
Is my expectation unreasonable? Are the allocations a side effect of enabling profiling? How can I finangle things to get rid of the allocation?
In this case it looks like GHC was smart enough to optimize foldl into the stricter form, but GHC can't optimize away the intermediate list because foldl isn't a good consumer, so presumably those allocations are for the (:) constructors. (EDIT3: No, looks like that's not the case; see comments)
By using foldr fusion kicks in and you can get rid of the intermediate list:
main :: IO ()
main = print $ foldr (+) 0 [(1::Int)..1000000]
...as you can see:
k +RTS -P -RTS
total time = 0.01 secs (10 ticks # 1000 us, 1 processor)
total alloc = 45,144 bytes (excludes profiling overheads)
which has the same memory profile for me as
main = print $ (1784293664 :: Int)
EDIT: In this new version we're trading heap allocation for a bunch of (1 + (2 + (3 +...))) on the stack. To really get a good loop we have to write it by hand like:
main = print $ add 1000000
add :: Int -> Int
add nMax = go 0 1 where
go !acc !n
| n == nMax = acc + n
| otherwise = go (acc+n) (n+1)
showing:
total time = 0.00 secs (0 ticks # 1000 us, 1 processor)
total alloc = 45,144 bytes (excludes profiling overheads)
EDIT2 I haven't gotten to use Gabriel Gonzalez foldl library yet, but it also might be worth playing with for your application.

Resources