Space leak when grouping key/value pairs in Haskell - haskell

I have a problem where my code is creating too many thunks (over 270MB) and consequently spends way too much time (over 70%) in GC when grouping values by key. I was wondering what the best way to group values by key.
The problem is that I have keys and values represented by vectors and I want to group the values by keys preserving the order. For example:
Input:
keys = 1 2 4 3 1 3 4 2 1
vals = 1 2 3 4 5 6 7 8 9
Output:
1 = 1,5,9
2 = 2,8
3 = 4,6
4 = 3,7
Compile options:
ghc --make -03 -fllvm histogram.hs
In imperative programming, I would just use a multimap so I decided to use a hash table and where the associated value is [Int] to store the grouped values. I am hoping there is a much better FP solution.
{-# LANGUAGE BangPatterns #-}
import qualified Data.HashMap.Strict as M
import qualified Data.Vector.Unboxed as V
n :: Int
n = 5000000
kv :: V.Vector (Int,Int)
kv = V.zip k v
where
k = V.generate n (\i -> i `mod` 1000)
v = V.generate n (\i -> i)
ts :: V.Vector (Int,Int) -> M.HashMap Int Int
ts vec =
V.foldl' (\ht (k, v) -> M.insertWith (+) k v ht) M.empty vec
ts2 :: V.Vector (Int,Int) -> M.HashMap Int [Int]
ts2 vec =
V.foldl' (\ht (!k, !v) -> M.insertWith (++) k [v] ht) M.empty vec
main :: IO ()
main = ts2 kv `seq` putStrLn "done"
Here's what spits out at runtime:
3,117,102,992 bytes allocated in the heap
1,847,205,880 bytes copied during GC
324,159,752 bytes maximum residency (12 sample(s))
6,502,224 bytes maximum slop
658 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 5991 colls, 0 par 0.58s 0.58s 0.0001s 0.0003s
Gen 1 12 colls, 0 par 0.69s 0.69s 0.0577s 0.3070s
INIT time 0.00s ( 0.00s elapsed)
MUT time 0.45s ( 0.45s elapsed)
GC time 1.27s ( 1.27s elapsed)
EXIT time 0.03s ( 0.03s elapsed)
Total time 1.75s ( 1.75s elapsed)
%GC time 72.7% (72.8% elapsed)
Alloc rate 6,933,912,935 bytes per MUT second
Productivity 27.3% of total user, 27.3% of total elapsed
You can see it spends a lot of time in GC so I decided to use bangs to make the list concatenation strict. I guess the ++ is quite expensive too but don't know a workaround around this.

Those strictness annotations are useless. They're forcing only the first constructor of the lists.
Even worse, it appears you're attempting to left fold (++), which is never a good idea. It results in lots of useless copying of intermediate lists, even when it's made fully strict.
You should fold to a [Int] -> [Int] value, instead. That will get rid of the multiple useless allocations. I'm on mobile, so I can't really provide full example code. The main idea is that you change the loop to M.insertWith (.) k (v:) and then map ($ [] ) over the values in the HashMap after the fold.

The bulk of your problem is due to (++) leading to "lots of useless copying of intermediate lists", as Carl puts it in his answer. Having played with a few different approaches at replacing (++), I got the best results thus far by switching to Data.IntMap.Strict from containers (just to take advantage of the less stern API - I don't know which implementation is more efficient per se) and using its alter function to prepend the vector elements without creating singleton lists:
import qualified Data.IntMap.Strict as M
import qualified Data.Vector.Unboxed as V
n :: Int
n = 5000000
kv :: V.Vector (Int,Int)
kv = V.zip k v
where
k = V.generate n (\i -> i `mod` 1000)
v = V.generate n (\i -> i)
ts2 :: V.Vector (Int,Int) -> M.IntMap [Int]
ts2 vec =
V.foldl' (\ht (k, v) -> M.alter (prep v) k ht) M.empty vec
where
prep x = Just . maybe [x] (x:)
main :: IO ()
main = print $ M.foldl' (+) 0 $ M.map length $ ts2 kv
The second best solution was using
\ht (k, v) -> M.insertWith (\(x:_) -> (x :)) k [v] ht
as the fold operator. That works with both Data.IntMap.Strict and Data.HashMap.Strict, with similar results performance-wise.
N.B.: Note that in all cases, your original implementation included, the vector elements are being prepended, rather than appended, to the lists. Your problems would be much more serious if you were appending the elements, as repeatedly appending to an empty list with (++) is quadratic in the number of elements.

I tried to run your code on my host and I am not able to reproduce your profile:
runhaskell test8.hs +RTS -sstderr
done
120,112 bytes allocated in the heap
3,520 bytes copied during GC
68,968 bytes maximum residency (1 sample(s))
12,952 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 0 colls, 0 par 0.00s 0.00s 0.0000s 0.0000s
Gen 1 1 colls, 0 par 0.00s 0.09s 0.0909s 0.0909s
INIT time 0.00s ( 0.01s elapsed)
MUT time 0.00s ( 29.21s elapsed)
GC time 0.00s ( 0.09s elapsed)
EXIT time 0.00s ( 0.09s elapsed)
Total time 0.01s ( 29.40s elapsed)
%GC time 5.7% (0.3% elapsed)
Alloc rate 381,307,936 bytes per MUT second
Productivity 91.1% of total user, 0.0% of total elapsed
Can you pls outline some more detail about how you are testing the code? If you are using ghci then
$ ghci -fobject-code
we probably need to use -fobject-code to eliminate any space leaks from the ghci. If you have already tried the ghci option, assuming that you are using ghci, I will edit my answer. At this point, I would like to reproduce the issue you are seeing.
Update:
# duplode : Thank you for the pointers. I am going to delete the previous output no one objects to it as it is misleading.
I have been able to reduce the gc overhead by a bit using one of the following options. I am getting some benefits but the overhead is still in the 49 - 50 % range:
ts3 :: V.Vector (Int, Int) -> M.HashMap Int [Int]
ts3 vec =
V.foldl (\ht (!k, !v) ->
let
element = M.lookup k ht in
case element of
Nothing -> M.insert k [v] ht
Just aList -> M.insert k (v:aList) ht) M.empty vec
ts4 :: V.Vector (Int,Int) -> M.HashMap Int [Int]
ts4 vec =
let initMap = V.foldl (\ht (!k,_) -> M.insert k [] ht) M.empty vec
in
V.foldl (\ht (!k, !v) -> M.adjust(\x -> v:x) k ht) initMap vec
The adjust seemed a bit better, but they results seem similar to a straight lookup. With ts4 using adjust:
calling ts4 done.
3,838,059,320 bytes allocated in the heap
2,041,603,344 bytes copied during GC
377,412,728 bytes maximum residency (6 sample(s))
7,725,944 bytes maximum slop
737 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 7260 colls, 0 par 1.32s 1.45s 0.0002s 0.0013s
Gen 1 6 colls, 0 par 0.88s 1.40s 0.2328s 0.9236s
INIT time 0.00s ( 0.00s elapsed)
MUT time 2.18s ( 2.21s elapsed)
GC time 2.19s ( 2.85s elapsed)
RP time 0.00s ( 0.00s elapsed)
PROF time 0.00s ( 0.00s elapsed)
EXIT time 0.01s ( 0.07s elapsed)
Total time 4.38s ( 5.13s elapsed)
%GC time 50.0% (55.5% elapsed)
Alloc rate 1,757,267,879 bytes per MUT second
Productivity 50.0% of total user, 42.7% of total elapsed
Using the simple lookup/update (imperative style of updating a map)
calling ts3 done.
3,677,137,816 bytes allocated in the heap
2,040,053,712 bytes copied during GC
395,867,512 bytes maximum residency (6 sample(s))
7,326,104 bytes maximum slop
769 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 6999 colls, 0 par 1.35s 1.51s 0.0002s 0.0037s
Gen 1 6 colls, 0 par 1.06s 2.16s 0.3601s 1.3175s
INIT time 0.00s ( 0.00s elapsed)
MUT time 1.89s ( 2.07s elapsed)
GC time 2.41s ( 3.67s elapsed)
RP time 0.00s ( 0.00s elapsed)
PROF time 0.00s ( 0.00s elapsed)
EXIT time 0.01s ( 0.08s elapsed)
Total time 4.31s ( 5.82s elapsed)
%GC time 55.9% (63.0% elapsed)
Alloc rate 1,942,816,558 bytes per MUT second
Productivity 44.1% of total user, 32.6% of total elapsed
I am interested in finding out as to how to reduce the time for lookup as show in the profile output below:
COST CENTRE MODULE %time %alloc
ts3.\ Main 54.1 91.4
ts3.\.element Main 19.0 2.9
ts3 Main 11.0 2.9
kv.k Main 6.5 1.4
kv.v Main 5.2 1.4
kv.k.\ Main 4.0 0.0
individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 72 0 0.0 0.0 100.0 100.0
main Main 158 0 0.0 0.0 0.0 0.0
CAF:main Main 143 0 0.0 0.0 84.2 97.1
main Main 144 1 0.0 0.0 84.2 97.1
ts3 Main 145 1 11.0 2.9 84.2 97.1
ts3.\ Main 156 5000000 54.1 91.4 73.2 94.3
ts3.\.element Main 157 5000000 19.0 2.9 19.0 2.9
CAF:kv Main 142 0 0.0 0.0 0.0 0.0
Code
-- ghc -O2 --make test8.hs -prof -auto-all -caf-all -fforce-recomp +RTS
-- ./test8 +RTS -p
{-# LANGUAGE BangPatterns #-}
import qualified Data.HashMap.Strict as M
import qualified Data.Vector.Unboxed as V
n :: Int
n = 5000000
kv :: V.Vector (Int,Int)
kv = V.zip (k) (v)
where
k = V.generate n (\i -> i `mod` 1000)
v = V.generate n (\i -> i)
ts :: V.Vector (Int,Int) -> M.HashMap Int Int
ts vec =
V.foldl' (\ht (k, v) -> M.insertWith (+) k v ht) M.empty vec
ts2 :: V.Vector (Int,Int) -> M.HashMap Int [Int]
ts2 vec =
V.foldl (\ht (!k, !v) -> M.insertWith (++) k [v] ht) M.empty vec
ts3 :: V.Vector (Int, Int) -> M.HashMap Int [Int]
ts3 vec =
V.foldl (\ht (!k, !v) ->
let
element = M.lookup k ht in
case element of
Nothing -> M.insert k [v] ht
Just aList -> M.insert k (v:aList) ht) M.empty vec
ts4 :: V.Vector (Int,Int) -> M.HashMap Int [Int]
ts4 vec =
let initMap = V.foldl (\ht (!k,_) -> M.insert k [] ht) M.empty vec
in
V.foldl (\ht (!k, !v) -> M.adjust(\x -> v:x) k ht) initMap vec
main :: IO ()
main = ts3 kv `seq` putStrLn "calling ts3 done."
main1 = do
if x == y then
putStrLn "Algos Match"
else
putStrLn "Error"
where
x = ts2 kv
y = ts4 kv

Related

Finding the size of a list that's too big for memory?

Brand new Haskell programmer here. Just finished "Learn you a Haskell"... I'm interested in how large a set is that has some particular properties. I have working code for some small parameter values, but I'd like to know how to deal with larger structures. I know Haskell can do "infinite data structures" but I'm just not seeing how to get there from where I'm at and Learn You a Haskell / Google isn't getting me over this.
Here's the working code for my eSet given "small" arguments r and t
import Control.Monad
import System.Environment
import System.Exit
myPred :: [Int] -> Bool
myPred a = myPred' [] a
where
myPred' [] [] = False
myPred' [] [0] = True
myPred' _ [] = True
myPred' acc (0:aTail) = myPred' acc aTail
myPred' acc (a:aTail)
| a `elem` acc = False
| otherwise = myPred' (a:acc) aTail
superSet :: Int -> Int -> [[Int]]
superSet r t = replicateM r [0..t]
eSet :: Int -> Int -> [[Int]]
eSet r t = filter myPred $ superSet r t
main :: IO ()
main = do
args <- getArgs
case args of
[rArg, tArg] ->
print $ length $ eSet (read rArg) (read tArg)
[rArg, tArg, "set"] ->
print $ eSet (read rArg) (read tArg)
_ ->
die "Usage: eSet r r set <set optional for printing set itself otherwise just print the size
When compiled/run I get
$ ghc eSet.hs -rtsopts
[1 of 1] Compiling Main ( eSet.hs, eSet.o )
Linking eSet ...
$ # Here's is a tiny eSet to illustrate. It is the set of lists of r integers from zero to t with no repeated nonzero list entries
$ ./eSet 4 2 set
[[0,0,0,0],[0,0,0,1],[0,0,0,2],[0,0,1,0],[0,0,1,2],[0,0,2,0],[0,0,2,1],[0,1,0,0],[0,1,0,2],[0,1,2,0],[0,2,0,0],[0,2,0,1],[0,2,1,0],[1,0,0,0],[1,0,0,2],[1,0,2,0],[1,2,0,0],[2,0,0,0],[2,0,0,1],[2,0,1,0],[2,1,0,0]]
$ ./eSet 8 4 +RTS -sstderr
3393
174,406,136 bytes allocated in the heap
29,061,152 bytes copied during GC
4,382,568 bytes maximum residency (7 sample(s))
148,664 bytes maximum slop
14 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 328 colls, 0 par 0.047s 0.047s 0.0001s 0.0009s
Gen 1 7 colls, 0 par 0.055s 0.055s 0.0079s 0.0147s
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.298s ( 0.301s elapsed)
GC time 0.102s ( 0.102s elapsed)
EXIT time 0.001s ( 0.001s elapsed)
Total time 0.406s ( 0.405s elapsed)
%GC time 25.1% (25.2% elapsed)
Alloc rate 585,308,888 bytes per MUT second
Productivity 74.8% of total user, 75.0% of total elapsed
$ ./eSet 10 5 +RTS -sstderr
63591
27,478,010,744 bytes allocated in the heap
4,638,903,384 bytes copied during GC
532,163,096 bytes maximum residency (15 sample(s))
16,500,072 bytes maximum slop
1556 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 52656 colls, 0 par 6.865s 6.864s 0.0001s 0.0055s
Gen 1 15 colls, 0 par 8.853s 8.997s 0.5998s 1.8617s
INIT time 0.000s ( 0.000s elapsed)
MUT time 52.652s ( 52.796s elapsed)
GC time 15.717s ( 15.861s elapsed)
EXIT time 0.193s ( 0.211s elapsed)
Total time 68.564s ( 68.868s elapsed)
%GC time 22.9% (23.0% elapsed)
Alloc rate 521,883,277 bytes per MUT second
Productivity 77.1% of total user, 76.7% of total elapsed
I see my memory usage is very high and there's a lot of garbage collecting. When running eSet 12 6 I get a Segmentation fault.
I feel like filter myPred $ superSet r t is keeping me from lazily making the subset one part at a time so I can deal with much larger (but finite) sets, but I don't know how to change to another approach that would do that. I think that's the root of my question.
Also, as this is my first Haskell program, points on style and how to achieve the Haskell analog of "pythonic" are much appreciated!
I suspect the culprit here is replicateM, which has this implementation:
replicateM cnt0 f =
loop cnt0
where
loop cnt
| cnt <= 0 = pure []
| otherwise = liftA2 (:) f (loop (cnt - 1))
The problem line is liftA2 (:) f (loop (cnt - 1)); probably loop (cnt - 1) is getting shared among all the calls to (:) partially applied to elements of f, and so loop (cnt - 1) must be kept in memory. Unfortunately loop (cnt - 1) is quite a long list...
It can be a bit fiddly to convince GHC not to share something. The following redefinition of superSet gives me a nice flat memory usage; it will probably be a bit slower on examples that do fit in memory, of course. The key idea is to make it look to the untrained eye (i.e. GHC) like the recursive monadic action depends on the choices made earlier, even though it doesn't.
superSet :: Int -> Int -> [[Int]]
superSet r t = go r 0 where
go 0 ignored = if ignored == 0 then [[]] else [[]]
go r ignored = do
x <- [0..t]
xs <- go (r-1) (ignored+x)
return (x:xs)
If you don't mind avoiding optimizations, the more natural definition also works:
superSet 0 t = [[]]
superSet r t = do
x <- [0..t]
xs <- superSet (r-1) t
return (x:xs)
...but with -O2 GHC is too clever and notices that it can share the recursive call.
A completely alternate approach is to just do a little bit of combinatorics analysis. Here's the process that eSet r t does, as near as I can make out:
Choose at most r elements without replacement from a set of size t.
Pad the sequence to a length of r by interleaving a sentinel value.
So let's just count the ways of doing each of these steps, rather than actually performing them. We'll introduce a new parameter, s, which is the length of the sequence produced by step (1) (and which we therefore know has s <= r and s <= t). How many sequences of size s are there when drawing elements without replacement from a set of size t? Well, there are t choices for the first element, t-1 choices for the second element, t-2 choices for the third element, ...
-- sequencesWithoutReplacement is a very long name!
seqWORepSize :: Integer -> Integer -> Integer
seqWORepSize s t = product [t-s+1 .. t]
Then we want to pad the sequence out to a length of r. We're going to choose s positions in the r-long sequence to be drawn from our sequence, and the remainder will be sentinels. How many ways are there to do that? This one is a well-known combinatorics operator called choose.
choose :: Integer -> Integer -> Integer
choose r s = product [r-s+1 .. r] `div` product [2 .. s]
The number of ways to produce a padded sequence of a given length is just the product of these two numbers, since the choices of "what values to insert" and "where to insert values" can be made completely independently.
paddedSeqSize :: Integer -> Integer -> Integer -> Integer
paddedSeqSize r s t = seqWORepSize s t * (r `choose` s)
And now we're pretty much done. Just iterate over all possible sequence lengths and add up the paddedSeqSize.
eSetSize :: Integer -> Integer -> Integer
eSetSize r t = sum $ map (\s -> paddedSeqSize r s t) [0..r]
We can try it out in ghci:
> :set +s
> map length $ [eSet 1 1, eSet 4 4, eSet 6 4, eSet 4 6]
[2,209,1045,1045]
(0.13 secs, 26,924,264 bytes)
> [eSetSize 1 1, eSetSize 4 4, eSetSize 6 4, eSetSize 4 6]
[2,209,1045,1045]
(0.01 secs, 120,272 bytes)
This way is significantly faster and significantly more memory-friendly. Indeed, we can make queries and get answers about eSets that we would never be able to count the size of one-by-one, e.g.
> length . show $ eSetSize 1000 1000
2594
(0.26 secs, 909,746,448 bytes)
Good luck counting to 10^2594 one at a time. =P
Edit
This idea can also be adapted to produce the padded sequences themselves rather than just counting how many there are. First, a handy utility that I find myself defining over and over for picking out individual elements of a list and reporting on the leftovers:
zippers :: [a] -> [([a], a, [a])]
zippers = go [] where
go ls [] = []
go ls (h:rs) = (ls, h, rs) : go (h:ls) rs
Now, the sequences without replacement can be done by repeatedly choosing a single element from the leftovers.
seqWORep :: Int -> [a] -> [[a]]
seqWORep 0 _ = [[]]
seqWORep n xs = do
(ls, y, rs) <- zippers xs
ys <- seqWORep (n-1) (ls++rs)
return (y:ys)
Once we have a sequence, we can pad it to a desired size by producing all the interleavings of the sentinel value as follows:
interleavings :: Int -> a -> [a] -> [[a]]
interleavings 0 _ xs = [xs]
interleavings n z [] = [replicate n z]
interleavings n z xs#(x:xt) = map (z:) (interleavings (n-1) z xs)
++ map (x:) (interleavings n z xt)
And finally, the top-level function just delegates to seqWORep and interleavings.
eSet :: Int -> Int -> [[Int]]
eSet r t = do
s <- [0..r]
xs <- seqWORep s [1..t]
interleavings (r-s) 0 xs
In my tests this modified eSet has nice flat memory usage both with and without optimizations; does not generate any spurious elements that need to be later filtered out, and so is faster than your original proposal; and to me looks like quite a natural definition compared to the answer that relies on tricking GHC. A nice collection of properties!
After re-reading parts of LYaH and thinking about #daniel-wagners answer monadically composing sounded like it would be worthwhile to try again. The new code total memory is flat and works with and without the -O2 optimization.
Source:
import Control.Monad
import System.Environment
import System.Exit
allowed :: [Int] -> Bool
allowed a = allowed' [] a
where
allowed' [ ] [ ] = False
allowed' [ ] [0] = True
allowed' _ [ ] = True
allowed' acc (0:aTail) = allowed' acc aTail
allowed' acc (a:aTail)
| a `elem` acc = False
| otherwise = allowed' (a:acc) aTail
branch :: Int -> [Int] -> [[Int]]
branch t x = filter allowed [n:x | n <- [0..t]]
eSet :: Int -> Int -> [[Int]]
eSet r t = return [] >>= foldr (<=<) return (replicate r (branch t))
main :: IO ()
main = do
args <- getArgs
case args of
[rArg, tArg] ->
print $ length $ eSet (read rArg) (read tArg)
[rArg, tArg, "set"] ->
print $ eSet (read rArg) (read tArg)
_ -> die "Usage: eSet r r set <set optional>"
The version with monadic function composition tests much faster and without the memory issues...
$ ./eSetMonad 10 5 +RTS -sstderr
63591
289,726,000 bytes allocated in the heap
997,968 bytes copied during GC
63,360 bytes maximum residency (2 sample(s))
24,704 bytes maximum slop
2 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 553 colls, 0 par 0.008s 0.008s 0.0000s 0.0002s
Gen 1 2 colls, 0 par 0.000s 0.000s 0.0002s 0.0003s
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.426s ( 0.429s elapsed)
GC time 0.009s ( 0.009s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 0.439s ( 0.438s elapsed)
%GC time 2.0% (2.0% elapsed)
Alloc rate 680,079,893 bytes per MUT second
Productivity 98.0% of total user, 98.3% of total elapsed

Word foldl' isn't optimized as well as Int foldl'

import Data.List
test :: Int -> Int
test n = foldl' (+) 0 [1..n]
main :: IO ()
main = do
print $ test $ 10^8
GHC optimizes the above code to the point that the garbage collector doesn't even have to do anything:
$ ghc -rtsopts -O2 testInt && ./testInt +RTS -s
[1 of 1] Compiling Main ( testInt.hs, testInt.o )
Linking testInt ...
5000000050000000
51,752 bytes allocated in the heap
3,480 bytes copied during GC
44,384 bytes maximum residency (1 sample(s))
17,056 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 0 colls, 0 par 0.000s 0.000s 0.0000s 0.0000s
Gen 1 1 colls, 0 par 0.000s 0.000s 0.0001s 0.0001s
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.101s ( 0.101s elapsed)
GC time 0.000s ( 0.000s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 0.103s ( 0.102s elapsed)
%GC time 0.1% (0.1% elapsed)
Alloc rate 511,162 bytes per MUT second
Productivity 99.8% of total user, 100.9% of total elapsed
However, if I change the type of test to test :: Word -> Word, then a lot of garbage is produced and the code runs 40x slower.
ghc -rtsopts -O2 testWord && ./testWord +RTS -s
[1 of 1] Compiling Main ( testWord.hs, testWord.o )
Linking testWord ...
5000000050000000
11,200,051,784 bytes allocated in the heap
1,055,520 bytes copied during GC
44,384 bytes maximum residency (2 sample(s))
21,152 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 21700 colls, 0 par 0.077s 0.073s 0.0000s 0.0000s
Gen 1 2 colls, 0 par 0.000s 0.000s 0.0001s 0.0001s
INIT time 0.000s ( 0.000s elapsed)
MUT time 4.551s ( 4.556s elapsed)
GC time 0.077s ( 0.073s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 4.630s ( 4.630s elapsed)
%GC time 1.7% (1.6% elapsed)
Alloc rate 2,460,957,186 bytes per MUT second
Productivity 98.3% of total user, 98.3% of total elapsed
Why does this happen? I expected the performance to be nearly identical?
(I'm using GHC version 8.0.1 on x86_64 GNU/Linux)
edit: I submitted a bug: https://ghc.haskell.org/trac/ghc/ticket/12354#ticket
This is probably mostly, though not exclusively, due to rewrite rules that exist for Int and not Word. I say that because if we use -fno-enable-rewrite-rules on the Int case we get a time that is much closer to, but not quite as bad as, the Word case.
% ghc -O2 so.hs -fforce-recomp -fno-enable-rewrite-rules && time ./so
[1 of 1] Compiling Main ( so.hs, so.o )
Linking so ...
5000000050000000
./so 1.45s user 0.03s system 99% cpu 1.489 total
If we dump the rewrite rules with -ddump-rule-rewrites and diff these rules then we see a rule that fires in the Int case and not the Word case:
Rule: fold/build
Before: GHC.Base.foldr
...
That particular rule is in Base 4.9 GHC.Base line 823 (N.B. I'm actually using GHC 7.10 myself) and does not mention Int explicitly. I'm curious why it isn't firing for Word, but don't have the time right now to investigate further.
As pointed out by dfeuer in a comment here, the Enum instance for Int is better than the one for Word:
Int:
instance Enum Int where
{-# INLINE enumFromTo #-}
enumFromTo (I# x) (I# y) = eftInt x y
{-# RULES
"eftInt" [~1] forall x y. eftInt x y = build (\ c n -> eftIntFB c n x y)
"eftIntList" [1] eftIntFB (:) [] = eftInt
#-}
{- Note [How the Enum rules work]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
* Phase 2: eftInt ---> build . eftIntFB
* Phase 1: inline build; eftIntFB (:) --> eftInt
* Phase 0: optionally inline eftInt
-}
{-# NOINLINE [1] eftInt #-}
eftInt :: Int# -> Int# -> [Int]
-- [x1..x2]
eftInt x0 y | isTrue# (x0 ># y) = []
| otherwise = go x0
where
go x = I# x : if isTrue# (x ==# y)
then []
else go (x +# 1#)
{-# INLINE [0] eftIntFB #-}
eftIntFB :: (Int -> r -> r) -> r -> Int# -> Int# -> r
eftIntFB c n x0 y | isTrue# (x0 ># y) = n
| otherwise = go x0
where
go x = I# x `c` if isTrue# (x ==# y)
then n
else go (x +# 1#)
-- Watch out for y=maxBound; hence ==, not >
-- Be very careful not to have more than one "c"
-- so that when eftInfFB is inlined we can inline
-- whatever is bound to "c"
Now Word actually uses the implementation for Integer
enumFromTo n1 n2 = map integerToWordX [wordToIntegerX n1 .. wordToIntegerX n2]
which uses
instance Enum Integer where
enumFromTo x lim = enumDeltaToInteger x 1 lim
Now enumDeltaToInteger has rewrite rules set up, but it turns out that Word’s enumFromTo is never inlined, so this setup has no chance of fusing here.
Copying this function into my test code causes GHC to inline it, the fold/build rule to fire, and cuts down allocation severely, but the conversion from and to Integer (which does allocate) remains.

Generating a tree of distinct integers results in a space leak

I want to generate a tree that contains distinct integers and find their sum. Here is the code:
{-# LANGUAGE BangPatterns #-}
import Control.Applicative
import Control.Monad.Trans.State
data Tree a = Leaf a | Branch (Tree a) a (Tree a)
new = get <* modify' (+ 1)
tree :: Integer -> Tree Integer
tree n = evalState (go n) 0 where
go 0 = Leaf <$> new
go n = Branch <$> go (n - 1) <*> new <*> go (n - 1)
sumTree = go 0 where
go !a (Leaf n) = a + n
go !a (Branch l n r) = go (go (a + n) l) r
main = print $ sumTree (tree 20)
Compiled with -O2 it results in
348,785,728 bytes allocated in the heap
147,227,228 bytes copied during GC
34,656,860 bytes maximum residency (13 sample(s))
35,468 bytes maximum slop
72 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 565 colls, 0 par 0.764s 1.024s 0.0018s 0.0071s
Gen 1 13 colls, 0 par 0.936s 1.014s 0.0780s 0.3214s
INIT time 0.000s ( 0.001s elapsed)
MUT time 0.936s ( 0.764s elapsed)
GC time 1.700s ( 2.038s elapsed)
EXIT time 0.000s ( 0.002s elapsed)
Total time 2.636s ( 2.805s elapsed)
%GC time 64.5% (72.7% elapsed)
Alloc rate 372,631,936 bytes per MUT second
Productivity 35.5% of total user, 33.4% of total elapsed
Why am I getting this space leak? How to remove it?
Any time you build a tree, you should try to find a way to work exclusively from the top down. This is generally good for laziness, concurrency, cache utilization, GC effectiveness, etc. The tree you build is just a complete binary tree numbered in order. I suggest you consider using the following signature and doing some bit shifting:
tree :: Bits b => Int -> Tree b
You can break out a helper function that takes the starting point.

Parallel Haskell - GHC GC'ing sparks

I have a program I'm trying to parallelize (full paste with runnable code here).
I've profiled and found that the majority of time is spent in findNearest which is essentially a simple foldr over a large Data.Map.
findNearest :: RGB -> M.Map k RGB -> (k, Word32)
findNearest rgb m0 =
M.foldrWithKey' minDistance (k0, distance rgb r0) m0
where (k0, r0) = M.findMin m0
minDistance k r x#(_, d1) =
-- Euclidean distance in RGB-space
let d0 = distance rgb r
in if d0 < d1 then (k, d0) else x
parFindNearest is supposed to execute findNearest in parallel over subtrees of the larger Map.
parFindNearest :: NFData k => RGB -> M.Map k RGB -> (k, Word32)
parFindNearest rgb = minimumBy (comparing snd)
. parMap rdeepseq (findNearest rgb)
. M.splitRoot
Unfortunately GHC GC's most my sparks before they are converted into useful parallelism.
Here's the result of compiling with ghc -O2 -threaded and running with +RTS -s -N2
839,892,616 bytes allocated in the heap
123,999,464 bytes copied during GC
5,320,184 bytes maximum residency (19 sample(s))
3,214,200 bytes maximum slop
16 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 1550 colls, 1550 par 0.23s 0.11s 0.0001s 0.0004s
Gen 1 19 colls, 18 par 0.11s 0.06s 0.0030s 0.0052s
Parallel GC work balance: 16.48% (serial 0%, perfect 100%)
TASKS: 6 (1 bound, 5 peak workers (5 total), using -N2)
SPARKS: 215623 (1318 converted, 0 overflowed, 0 dud, 198111 GC'd, 16194 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 3.72s ( 3.66s elapsed)
GC time 0.34s ( 0.17s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 4.07s ( 3.84s elapsed)
Alloc rate 225,726,318 bytes per MUT second
Productivity 91.6% of total user, 97.1% of total elapsed
gc_alloc_block_sync: 9862
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 2103
As you can see, the majority of sparks are GC'd or fizzle before being converted. I've tried experimenting with different strictness, having findNearest return a custom strict pair data type instead of a tuple
, or using rdeepseq from Control.Parallel.Strategies, but my sparks are still GC'd.
I'd like to know
why are my sparks being GC'd before being converted?
how can I change my program to take advantage of parallelism?
I'm not at expert in parallel strategies, so I may be completely wrong. But:
If you disable GC by setting big enough allocation area (e.g. using -A20M runtime option), you'll see that most of sparks are fizzled, not GC'd. It means they are evaluated by ordinary program flow before the corresponding spark finished.
minimumBy forces parMap results immediately, starting evaluating them. At the same time, sparks are scheduled and executed, but it is too late. When spark finished, the value is already evaluated by the main thread. Without -A20M, sparks are GC'd because the value is evaluated and GC'd even before the spark is scheduled.
Here is a simplified test case:
import Control.Parallel.Strategies
f :: Integer -> Integer
f 0 = 1
f n = n * f (n - 1)
main :: IO ()
main = do
let l = [n..n+10]
n = 1
res = parMap rdeepseq f l
print res
In that case all the sparks are fizzled:
SPARKS: 11 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 11 fizzled)
(Some times they are GC'd)
But if I yield main thread before printing results,
import Control.Parallel.Strategies
import Control.Concurrent
f :: Integer -> Integer
f 0 = 1
f n = n * f (n - 1)
main :: IO ()
main = do
let l = [n..n+10]
n = 1
res = parMap rdeepseq f l
res `seq` threadDelay 1
print res
Then all the sparks are converted:
SPARKS: 11 (11 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
So, looks like you have not enough sparks (try to set l = [n..n+1000] in my example), and they are not heavy enough (try to set n = 1000 in my example).

Parallel Haskell in order to find the divisors of a huge number

I have written the following program using Parallel Haskell to find the divisors of 1 billion.
import Control.Parallel
parfindDivisors :: Integer->[Integer]
parfindDivisors n = f1 `par` (f2 `par` (f1 ++ f2))
where f1=filter g [1..(quot n 4)]
f2=filter g [(quot n 4)+1..(quot n 2)]
g z = n `rem` z == 0
main = print (parfindDivisors 1000000000)
I've compiled the program with ghc -rtsopts -threaded findDivisors.hs and I run it with:
findDivisors.exe +RTS -s -N2 -RTS
I have found a 50% speedup compared to the simple version which is this:
findDivisors :: Integer->[Integer]
findDivisors n = filter g [1..(quot n 2)]
where g z = n `rem` z == 0
My processor is a dual core 2 duo from Intel.
I was wondering if there can be any improvement in above code. Because in the statistics that program prints says:
Parallel GC work balance: 1.01 (16940708 / 16772868, ideal 2)
and SPARKS: 2 (1 converted, 0 overflowed, 0 dud, 0 GC'd, 1 fizzled)
What are these converted , overflowed , dud, GC'd, fizzled and how can help to improve the time.
IMO, the Par monad helps for reasoning about parallelism. It's a little higher-level than dealing with par and pseq.
Here's a rewrite of parfindDivisors using the Par monad. Note that this is essentially the same as your algorithm:
import Control.Monad.Par
findDivisors :: Integer -> [Integer]
findDivisors n = runPar $ do
[f0, f1] <- sequence [new, new]
fork $ put f0 (filter g [1..(quot n 4)])
fork $ put f1 (filter g [(quot n 4)+1..(quot n 2)])
[f0', f1'] <- sequence [get f0, get f1]
return $ f0' ++ f1'
where g z = n `rem` z == 0
Compiling that with -O2 -threaded -rtsopts -eventlog and running with +RTS -N2 -s yields the following relevant runtime stats:
36,000,130,784 bytes allocated in the heap
3,165,440 bytes copied during GC
48,464 bytes maximum residency (1 sample(s))
Tot time (elapsed) Avg pause Max pause
Gen 0 35162 colls, 35161 par 0.39s 0.32s 0.0000s 0.0006s
Gen 1 1 colls, 1 par 0.00s 0.00s 0.0002s 0.0002s
Parallel GC work balance: 1.32 (205296 / 155521, ideal 2)
MUT time 42.68s ( 21.48s elapsed)
GC time 0.39s ( 0.32s elapsed)
Total time 43.07s ( 21.80s elapsed)
Alloc rate 843,407,880 bytes per MUT second
Productivity 99.1% of total user, 195.8% of total elapsed
The productivity is very high. To improve the GC work balance slightly we can increase the GC allocation area size; run with +RTS -N2 -s -A128M, for example:
36,000,131,336 bytes allocated in the heap
47,088 bytes copied during GC
49,808 bytes maximum residency (1 sample(s))
Tot time (elapsed) Avg pause Max pause
Gen 0 135 colls, 134 par 0.19s 0.10s 0.0007s 0.0009s
Gen 1 1 colls, 1 par 0.00s 0.00s 0.0010s 0.0010s
Parallel GC work balance: 1.62 (2918 / 1801, ideal 2)
MUT time 42.65s ( 21.49s elapsed)
GC time 0.20s ( 0.10s elapsed)
Total time 42.85s ( 21.59s elapsed)
Alloc rate 843,925,806 bytes per MUT second
Productivity 99.5% of total user, 197.5% of total elapsed
But this is really just nitpicking. The real story comes from ThreadScope:
The utilisation is essentially maxed out for two cores, so additional significant parallelization (for two cores) is probably not going to happen.
Some good notes on the Par monad are here.
UPDATE
A rewrite of the alternative algorithm using Par looks something like this:
findDivisors :: Integer -> [Integer]
findDivisors n = let sqrtn = floor (sqrt (fromInteger n)) in runPar $ do
[a, b] <- sequence [new, new]
fork $ put a [a | (a, b) <- [quotRem n x | x <- [1..sqrtn]], b == 0]
firstDivs <- get a
fork $ put b [n `quot` x | x <- firstDivs, x /= sqrtn]
secondDivs <- get b
return $ firstDivs ++ secondDivs
But you're right in that this will not get any gains from parallelism due to the dependence on firstDivs.
You can still incorporate parallelism here, by getting Strategies involved to evaluate the elements of the list comprehensions in parallel. Something like:
import Control.Monad.Par
import Control.Parallel.Strategies
findDivisors :: Integer -> [Integer]
findDivisors n = let sqrtn = floor (sqrt (fromInteger n)) in runPar $ do
[a, b] <- sequence [new, new]
fork $ put a
([a | (a, b) <- [quotRem n x | x <- [1..sqrtn]], b == 0] `using` parListChunk 2 rdeepseq)
firstDivs <- get a
fork $ put b
([n `quot` x | x <- firstDivs, x /= sqrtn] `using` parListChunk 2 rdeepseq)
secondDivs <- get b
return $ firstDivs ++ secondDivs
and running this gives some stats like
3,388,800 bytes allocated in the heap
43,656 bytes copied during GC
68,032 bytes maximum residency (1 sample(s))
Tot time (elapsed) Avg pause Max pause
Gen 0 5 colls, 4 par 0.00s 0.00s 0.0000s 0.0001s
Gen 1 1 colls, 1 par 0.00s 0.00s 0.0002s 0.0002s
Parallel GC work balance: 1.22 (2800 / 2290, ideal 2)
MUT time (elapsed) GC time (elapsed)
Task 0 (worker) : 0.01s ( 0.01s) 0.00s ( 0.00s)
Task 1 (worker) : 0.01s ( 0.01s) 0.00s ( 0.00s)
Task 2 (bound) : 0.01s ( 0.01s) 0.00s ( 0.00s)
Task 3 (worker) : 0.01s ( 0.01s) 0.00s ( 0.00s)
SPARKS: 50 (49 converted, 0 overflowed, 0 dud, 0 GC'd, 1 fizzled)
MUT time 0.01s ( 0.00s elapsed)
GC time 0.00s ( 0.00s elapsed)
Total time 0.01s ( 0.01s elapsed)
Alloc rate 501,672,834 bytes per MUT second
Productivity 85.0% of total user, 95.2% of total elapsed
Here almost 50 sparks were converted - that is, meaningful parallel work was being done - but the computations were not large enough to observe any wall-clock gains from parallelism. Any gains were probably offset by the overhead of scheduling computations in the threaded runtime.
I think this page explains it better than I could:
http://www.haskell.org/haskellwiki/ThreadScope_Tour/SparkOverview
I also found these slides interesting:
http://haskellwiki.gitit.net/Upload/HIW2011-Talk-Coutts.pdf
My modifying the original code with the following I have better speedup but this code I think that cannot be parallelised
findDivisors2 :: Integer->[Integer]
findDivisors2 n= let firstDivs=[a|(a,b)<-[quotRem n x|x<-[1..sqrtn]],b==0]
secondDivs=[n `quot` x|x<-firstDivs,x/=sqrtn]
sqrtn = floor(sqrt (fromInteger n))
in firstDivs ++ secondDivs
I tried to parallelise the code with this:
parfindDivisors2 :: Integer->[Integer]
parfindDivisors2 n= let firstDivs=[a|(a,b)<-[quotRem n x|x<-[1..sqrtn]],b==0]
secondDivs=[n `quot` x|x<-firstDivs,x/=sqrtn]
sqrtn = floor(sqrt (fromInteger n))
in secondDivs `par` firstDivs++secondDivs
Instead of reducing the time I have doubled the time. I think that this happens because the findDivisors2 have strong data dependence while the first algorithm findDivisors does not.
Any comments are welcome.

Resources