I wrote a file indexing program that should read thousands of text file lines as records and finally group those records by fingerprint. It uses Data.List.Split.splitOn to split the lines at tabs and retrieve the record fields. The program consumes 10-20 GB of memory.
Probably there is not much I can do to reduce that huge memory footprint, but I cannot explain why a function like splitOn (breakDelim) can consume that much memory:
Mon Dec 9 21:07 2019 Time and Allocation Profiling Report (Final)
group +RTS -p -RTS file1 file2 -o 2 -h
total time = 7.40 secs (7399 ticks @ 1000 us, 1 processor)
total alloc = 14,324,828,696 bytes (excludes profiling overheads)
COST CENTRE MODULE SRC %time %alloc
fileToPairs.linesIncludingEmptyLines ImageFileRecordParser ImageFileRecordParser.hs:35:7-47 25.0 33.8
breakDelim Data.List.Split.Internals src/Data/List/Split/Internals.hs:(151,1)-(156,36) 24.9 39.3
sortAndGroup Aggregations Aggregations.hs:6:1-85 12.9 1.7
fileToPairs ImageFileRecordParser ImageFileRecordParser.hs:(33,1)-(42,14) 8.2 10.7
matchDelim Data.List.Split.Internals src/Data/List/Split/Internals.hs:(73,1)-(77,23) 7.4 0.4
onSublist Data.List.Split.Internals src/Data/List/Split/Internals.hs:278:1-72 3.6 0.0
toHashesView ImageFileRecordStatistics ImageFileRecordStatistics.hs:(48,1)-(51,24) 3.0 6.3
main Main group.hs:(47,1)-(89,54) 2.9 0.4
numberOfUnique ImageFileRecord ImageFileRecord.hs:37:1-40 1.6 0.1
toHashesView.sortedLines ImageFileRecordStatistics ImageFileRecordStatistics.hs:50:7-30 1.4 0.1
imageFileRecordFromFields ImageFileRecordParser ImageFileRecordParser.hs:(11,1)-(30,5) 1.1 0.3
toHashView ImageFileRecord ImageFileRecord.hs:(67,1)-(69,23) 0.7 1.7
Or is type [Char] too memory inefficient (compared to Text), causing splitOn to take that much memory?
UPDATE 1 (+RTS -s suggestion of user HTNW)
23,446,268,504 bytes allocated in the heap
10,753,363,408 bytes copied during GC
1,456,588,656 bytes maximum residency (22 sample(s))
29,282,936 bytes maximum slop
3620 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 45646 colls, 0 par 4.055s 4.059s 0.0001s 0.0013s
Gen 1 22 colls, 0 par 4.034s 4.035s 0.1834s 1.1491s
INIT time 0.000s ( 0.000s elapsed)
MUT time 7.477s ( 7.475s elapsed)
GC time 8.089s ( 8.094s elapsed)
RP time 0.000s ( 0.000s elapsed)
PROF time 0.000s ( 0.000s elapsed)
EXIT time 0.114s ( 0.114s elapsed)
Total time 15.687s ( 15.683s elapsed)
%GC time 51.6% (51.6% elapsed)
Alloc rate 3,135,625,407 bytes per MUT second
Productivity 48.4% of total user, 48.4% of total elapsed
The processed text files are smaller than usual (UTF-8 encoded, 37 MB). But still 3 GB of memory are used.
UPDATE 2 (critical part of the code)
Explanation: fileToPairs processes a text file. It returns a list of key-value pairs (key: fingerprint of record, value: record).
sortAndGroup associations = Map.fromListWith (++) [(k, [v]) | (k, v) <- associations]
main = do
  CommandLineArguments{..} <- cmdArgs $ CommandLineArguments {
      ignored_paths_file = def &= typFile,
      files = def &= typ "FILES" &= args,
      number_of_occurrences = def &= name "o",
      minimum_number_of_occurrences = def &= name "l",
      maximum_number_of_occurrences = def &= name "u",
      number_of_hashes = def &= name "n",
      having_record_errors = def &= name "e",
      hashes = def
    }
    &= summary "Group image/video files"
    &= program "group"
  let ignoredPathsFilenameMaybe = ignored_paths_file
  let filenames = files
  let hashesMaybe = hashes
  ignoredPaths <- case ignoredPathsFilenameMaybe of
    Just ignoredPathsFilename -> ioToLines (readFile ignoredPathsFilename)
    _ -> return []
  recordPairs <- mapM (fileToPairs ignoredPaths) filenames
  let allRecordPairs = concat recordPairs
  let groupMap = sortAndGroup allRecordPairs
  let statisticsPairs = map toPair (Map.toList groupMap) where toPair item = (fst item, imageFileRecordStatisticsFromRecords . snd $ item)
  let filterArguments = FilterArguments {
          numberOfOccurrencesMaybe = number_of_occurrences,
          minimumNumberOfOccurrencesMaybe = minimum_number_of_occurrences,
          maximumNumberOfOccurrencesMaybe = maximum_number_of_occurrences,
          numberOfHashesMaybe = number_of_hashes,
          havingRecordErrorsMaybe = having_record_errors
        }
  let filteredPairs = filterImageRecords filterArguments statisticsPairs
  let filteredMap = Map.fromList filteredPairs
  case hashesMaybe of
    Just True -> mapM_ putStrLn (map toHashesView (map snd filteredPairs))
    _ -> Char8.putStrLn (encodePretty filteredMap)
As I'm sure you're aware, there's not really enough information here for us to help you make your program more efficient. It might be worth posting some (complete, self-contained) code on the Code Review site for that.
However, I think I can answer your specific question about why splitOn allocates so much memory. In fact, there's nothing particularly special about splitOn or how it's been implemented. Many straightforward Haskell functions will allocate lots of memory, and this in itself doesn't indicate that they've been poorly written or are running inefficiently. In particular, splitOn's memory usage seems similar to other straightforward approaches to splitting a string based on delimiters.
The first thing to understand is that GHC compiled code works differently than other compiled code you're likely to have seen. If you know a lot of C and understand stack frames and heap allocation, or if you've studied some JVM implementations, you might reasonably expect that some of that understanding would translate to GHC executables, but you'd be mostly wrong.
A GHC program is more or less an engine for allocating heap objects, and -- with a few exceptions -- that's all it really does. Nearly every argument passed to a function or constructor (as well as the constructor application itself) allocates a heap object of at least 16 bytes, and often more. Take a simple function like:
fact :: Int -> Int
fact 0 = 1
fact n = n * fact (n-1)
With optimization turned off, it compiles to the following so-called "STG" form (simplified from the actual -O0 -ddump-stg output):
fact = \n -> case n of I# n' -> case n' of
    0# -> I# 1#
    _  -> let sat1 = let sat2 = let one = I#! 1# in n-one
                     in fact sat2;
          in n*sat1
Everywhere you see a let, that's a heap allocation (16+ bytes), and there are presumably more hidden in the (-) and (*) calls. Compiling and running this program with:
main = print $ fact 1000000
gives:
113,343,544 bytes allocated in the heap
44,309,000 bytes copied during GC
25,059,648 bytes maximum residency (5 sample(s))
29,152 bytes maximum slop
23 MB total memory in use (0 MB lost due to fragmentation)
meaning that each iteration allocates over a hundred bytes on the heap, though it's literally just performing a comparison, a subtraction, a multiplication, and a recursive call.
This is what @HTNW meant in saying that total allocation in a GHC program is a measure of "work". A GHC program that isn't allocating probably isn't doing anything (again, with some rare exceptions), and a typical GHC program that is doing something will usually allocate at a relatively constant rate of several gigabytes per second when it's not garbage collecting. So, total allocation has more to do with total runtime than anything else, and it isn't a particularly good metric for assessing code efficiency. Maximum residency is also a poor measure of overall efficiency, though it can be helpful for assessing whether or not you have a space leak, if you find that it tends to grow linearly (or worse) with the size of the input where you expect the program should run in constant memory regardless of input size.
For most programs, the most important true efficiency metric in the +RTS -s output is probably the "productivity" rate at the bottom -- it's the amount of time the program spends not garbage collecting. And, admittedly, your program's productivity of 48% is pretty bad, which probably means that it is, technically speaking, allocating too much memory, but it's probably only allocating two or three times the amount it should be, so, at a guess, maybe it should "only" be allocating around 7-8 Gigs instead of 23 Gigs for this workload (and, consequently, running for about 5 seconds instead of 15 seconds).
With that in mind, if you consider the following simple breakDelim implementation:
breakDelim :: String -> [String]
breakDelim str = case break (=='\t') str of
  (a,_:b) -> a : breakDelim b
  (a,[])  -> [a]
and use it like so in a simple tab-to-comma delimited file converter:
main = interact (unlines . map (intercalate "," . breakDelim) . lines)
Then, unoptimized and run on a file with 10000 lines of 1000 3-character fields each, it allocates a whopping 17 Gigs:
17,227,289,776 bytes allocated in the heap
2,807,297,584 bytes copied during GC
127,416 bytes maximum residency (2391 sample(s))
32,608 bytes maximum slop
0 MB total memory in use (0 MB lost due to fragmentation)
and profiling it places a lot of blame on breakDelim:
COST CENTRE MODULE SRC %time %alloc
main Main Delim.hs:8:1-71 57.7 72.6
breakDelim Main Delim.hs:(4,1)-(6,16) 42.3 27.4
In this case, compiling with -O2 doesn't make much difference. The key efficiency metric, productivity, is only 46%. All these results seem to be in line with what you're seeing in your program.
The split package has a lot going for it, but looking through the code, it's pretty clear that little effort has been made to make it particularly efficient or fast, so it's no surprise that splitOn performs no better than my quick-and-dirty custom breakDelim function. And, as I said before, there's nothing special about splitOn that makes it unusually memory hungry -- my simple breakDelim has similar behavior.
With respect to inefficiencies of the String type, it can often be problematic. But, it can also participate in optimizations like list fusion in ways that Text can't. The utility above could be rewritten in simpler form as:
main = interact $ map (\c -> if c == '\t' then ',' else c)
which uses String but runs pretty fast (about a quarter as fast as a naive C getchar/putchar implementation) at 84% productivity, while allocating about 5 Gigs on the heap.
It's quite likely that if you just take your program and "convert it to Text", you'll find it's slower and more memory hungry than the original! While Text has the potential to be much more efficient than String, it's a complicated package, and the way in which Text objects behave with respect to allocation when they're sliced and diced (as when you're chopping a big Text file up into little Text fields) makes it more difficult to get right.
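To make the slicing concern concrete, here is a minimal sketch of splitting tab-delimited lines with Data.Text, assuming the text package is available. The names parseFields and parseFieldsCopied are hypothetical and do not come from the question's program. T.split returns slices that share the buffer of the line they were cut from, so keeping one small field can keep a much larger buffer alive. T.copy gives a retained field its own storage, at the cost of one more allocation. Whether either variant beats String for a given workload is something to measure, not assume.

```haskell
import qualified Data.Text as T

-- Split one tab-delimited line into fields.
-- The resulting fields are slices that share storage with the input line.
parseFields :: T.Text -> [T.Text]
parseFields = T.split (== '\t')

-- Same split, but each field is copied into its own buffer so that
-- retaining a field does not retain the whole original line or file.
parseFieldsCopied :: T.Text -> [T.Text]
parseFieldsCopied = map T.copy . parseFields

main :: IO ()
main = mapM_ (print . parseFieldsCopied) (T.lines (T.pack "a\tb\tc\nd\t\tf"))
```

Copying only the fields that end up in a long-lived structure (for example, the keys of a Map) is usually the cheaper choice compared with copying everything.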
So, some take-home lessons:
Total allocation is a poor measure of efficiency. Most well written GHC programs can and should allocate several gigabytes per second of runtime.
Many innocuous Haskell functions will allocate lots of memory because of the way GHC compiled code works. This isn't necessarily a sign that there's something wrong with the function.
The split package provides a flexible framework for all manner of cool list splitting manipulations, but it was not designed with speed in mind, and it may not be the best method of processing a tab-delimited file.
The String data type has a potential for terrible inefficiency, but isn't always inefficient, and Text is a complicated package that won't be a plug-in replacement to fix your String performance woes.
Most importantly:
Unless your program is too slow for its intended purpose, its run-time statistics and the theoretical advantages of Text over String are largely irrelevant.
I am trying to practice Haskell by solving some of the tasks on Project Euler. In Problem 3, we have to find the biggest prime factor of the number 600851475143, which I had done before in Java a few years back.
I came up with the following:
primes :: [Int]
primes = sieve [2..]
  where sieve (p:xs) = p : sieve (filter (\x -> x `rem` p /= 0) xs)
biggestPrimeFactor :: Int -> Int
biggestPrimeFactor 1 = 0
biggestPrimeFactor x =
  if x `elem` takeWhile (< x + 1) primes
    then x
    else last (filter (\y -> x `rem` y == 0) (takeWhile (< x `div` 2) primes))
which works great for smaller numbers, but is terribly inefficient and as a result doesn't work well on the number I have been given.
This seems obvious, because the program iterates over all primes smaller than the number divided by 2 (if it isn't prime itself), but I am unsure what to do about it. Ideally I would be able to further restrict the possible checks, but I don't know how to accomplish this.
Note that I am not looking for an "optimal solution", but rather one that is at least moderately efficient for bigger numbers, and simple to understand and implement, as I am still a beginner in Haskell.
You have two main sources of slowness here. The easier one to address is the boundary condition in biggestPrimeFactor. Checking up to p > x `div` 2 is asymptotically worse than checking up to p^2 > x. But even that is very suboptimal when a number has a lot of factors. The largest factor may be far smaller than sqrt x. If you continually reduce the target number as you find factors, you can account for this and speed up the processing of random inputs by quite a lot.
Here's an example of that, including Daniel Wagner's note from the comments:
-- Naive trial division against a list of primes. Doesn't do anything
-- intelligent when asked to factor a number less than 2.
factorsNaive :: [Integer] -> Integer -> [Integer]
factorsNaive primes@(p : ps) x
  | p * p > x = [x]
  | otherwise = case x `quotRem` p of
      (q, 0) -> p : factorsNaive primes q
      _ -> factorsNaive ps x
A few notes:
I decided to have the primes list passed in. This is relevant in the next section, but it also allowed me to write this without a helper.
I specialized to Integer instead of Int because I wanted to throw big numbers at it without caring what maxBound :: Int is. This is slower, but I decided to default to correctness first.
I removed a traversal of the input list. Doing it in one pass is a bit more efficient, but mostly it's cleaner.
Strictly speaking, this is correct even if the input list contains non-primes, so long as the list starts at 2, is monotonically non-decreasing, and eventually contains every prime.
Note that when it recurses, it either discards a prime or produces one. It never will do both at the same time. This is an easy way to ensure it doesn't miss repeated factors.
I named this factorsNaive just to make it clear that it's not doing anything clever with number theory. There are very many things that could be done which are far more complex than this, but this is a good stopping point for understandable factoring of relatively small numbers...
Or at least it is okay at factoring as long as you have a convenient list of prime numbers. It turns out this is the second major cause of slowdown in your code. Your list of prime numbers is slow to generate as it gets longer.
Your definition of primes essentially stacks a bunch of filters on an input list. Every prime produced must go through a filter test for each previous prime. This might sound familiar - it's at least O(n^2) work to generate the first n primes. (It's actually more because division gets more costly as numbers get bigger, but let's ignore that for now.) It's a known (to mathematicians, I had to look it up to be sure) result that the number of primes less than or equal to n approaches n/ln n as n gets large. That approaches linear as n gets large, so generating the list of primes up to n approaches O(n^2) as n gets big.
(Yes, that argument is a mess. A formal version of it is presented in Melissa O'Neill's paper "The Genuine Sieve of Eratosthenes". Refer to it for much more rigorous argumentation of the result.)
It's possible to write much more efficient definitions of primes that have both better constant factors and better asymptotics. As that's the entire point of the paper mentioned in the parenthetical above, I won't go into the details too far. I'll just point out the very first possible optimization:
-- trial division. let's work in Integer for predictable correctness
-- on positive numbers
trialPrimes :: [Integer]
trialPrimes = 2 : sieve [3, 5 ..]
  where
    sieve (p : ps) = p : sieve (filter (\x -> x `rem` p /= 0) ps)
This does less than you might think. It doesn't double the speed, as the performance improvement is eventually outweighed by the filter stack mentioned before. This version only removes one filter from that stack, but at least it's the filter that rejects the most inputs in the initial version.
In ghci (no compilation or optimizations, and those can really make a difference), this was fast enough to factor the product of two five-digit primes in a few seconds.
ghci> :set +s
ghci> factorsNaive trialPrimes $ 84761 * 60821
[60821,84761]
(5.98 secs, 4,103,321,840 bytes)
Numbers with several small factors are handled much faster. Also notice that because the list of primes is a top-level binding, calculations are cached. Running the computation again has the list of primes pre-computed now.
ghci> factorsNaive trialPrimes $ 84761 * 60821
[60821,84761]
(0.01 secs, 6,934,688 bytes)
That also shows that the run time is absolutely dominated by generating the list of primes. The naive factorization is almost instant at that scale when the list of primes is already in memory.
But you shouldn't really trust performance of interpreted code.
main :: IO ()
main = print (factorsNaive trialPrimes $ 84761 * 60821)
gives
carl@DESKTOP:~/hask/2023$ ghc -O2 -rtsopts factor.hs
[1 of 2] Compiling Main ( factor.hs, factor.o )
[2 of 2] Linking factor
carl@DESKTOP:~/hask/2023$ ./factor +RTS -s
[60821,84761]
1,884,787,896 bytes allocated in the heap
32,303,080 bytes copied during GC
89,072 bytes maximum residency (2 sample(s))
29,400 bytes maximum slop
7 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 326 colls, 0 par 0.021s 0.021s 0.0001s 0.0002s
Gen 1 2 colls, 0 par 0.000s 0.000s 0.0002s 0.0004s
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.523s ( 0.522s elapsed)
GC time 0.021s ( 0.022s elapsed)
EXIT time 0.000s ( 0.007s elapsed)
Total time 0.545s ( 0.550s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 3,603,678,988 bytes per MUT second
Productivity 96.0% of total user, 94.8% of total elapsed
That dropped the run time from six seconds to a half-second. (Yeah, +RTS -s is pretty verbose for this, but it's quick and easy.) I think this is a reasonable place to stop with beginner-level code.
If you want to look into more efficient prime generation, the primes package on hackage contains an implementation of the algorithm in O'Neill's paper and an implementation of naive factoring that's equivalent to the one here.
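As one small step beyond the stacked filters, and only as a sketch, each odd candidate can be tested against the primes already found, stopping as soon as p * p exceeds the candidate. This is not the algorithm from O'Neill's paper or from the primes package. The name boundedPrimes is hypothetical, and factorsNaive is repeated from above so the block stands on its own.

```haskell
-- Primes by bounded trial division: a candidate n is only tested against
-- primes p with p * p <= n, instead of passing through one filter per
-- earlier prime.
boundedPrimes :: [Integer]
boundedPrimes = 2 : filter isPrime [3, 5 ..]
  where
    isPrime n =
      all (\p -> n `rem` p /= 0) (takeWhile (\p -> p * p <= n) boundedPrimes)

-- Naive trial division, as defined earlier in the answer.
factorsNaive :: [Integer] -> Integer -> [Integer]
factorsNaive primes@(p : ps) x
  | p * p > x = [x]
  | otherwise = case x `quotRem` p of
      (q, 0) -> p : factorsNaive primes q
      _ -> factorsNaive ps x

main :: IO ()
main = print (factorsNaive boundedPrimes 600851475143)
-- prints [71,839,1471,6857]
```

The largest prime factor asked for in the question is then the last element of that list.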
Some time ago I needed an algorithm to solve a knapsack problem (KP) in Haskell.
Here is what my code looks like:
stepKP :: [Int] -> (Int, Int) -> [Int]
stepKP l (p, v) = take p l ++ zipWith bestOption l (drop p l)
  where bestOption a = max (a+v)
kp :: [(Int, Int)] -> Int -> Int
kp l pMax = last $ foldl stepKP [0 | i <- [0..pMax]] l
main = print $ kp (zip weights values) 20000
  where weights = [0..2000]
        values = reverse [8000..10000]
But when I execute it (after compilation with ghc, no flags), the performance seems pretty bad.
Here is the result of the command ./kp +RTS -s:
1980100
9,461,474,416 bytes allocated in the heap
6,103,730,184 bytes copied during GC
1,190,494,880 bytes maximum residency (18 sample(s))
5,098,848 bytes maximum slop
2624 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 6473 colls, 0 par 2.173s 2.176s 0.0003s 0.0010s
Gen 1 18 colls, 0 par 4.185s 4.188s 0.2327s 1.4993s
INIT time 0.000s ( 0.000s elapsed)
MUT time 3.320s ( 3.322s elapsed)
GC time 6.358s ( 6.365s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 9.679s ( 9.687s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 2,849,443,762 bytes per MUT second
Productivity 34.3% of total user, 34.3% of total elapsed
I think that my program takes O(n*w) memory, while it could do it in O(w)
(where w is the total capacity).
Is that a problem of lazy evaluation taking too much space, or something else?
How could this code be made more memory and time efficient?
We can think of a left fold as performing iterations while keeping an accumulator that is returned at the end.
When there are lots of iterations, one concern is that the accumulator might grow too large in memory. And because Haskell is lazy, this can happen even when the accumulator is of a primitive type like Int: behind some seemingly innocent Int value a large number of pending operations might lurk, in the form of thunks.
Here the strict left fold function foldl' is useful because it ensures that, as the left fold is being evaluated, the accumulator will always be kept in weak head normal form (WHNF).
Alas, sometimes this isn't enough. WHNF only says that evaluation has progressed up to the "outermost constructor" of the value. This is enough for Int, but for recursive types like lists or trees, that isn't saying much: the thunks might simply lurk further down the list, or in branches below.
This is the case here, where the accumulator is a list that is recreated at each iteration. Each iteration, the foldl' only evaluates the list up to _ : _. Unevaluated max and zipWith operations start to pile up.
What we need is a way to trigger a full evaluation of the accumulator list at each iteration, one which cleans any max and zipWith thunks from memory. And this is what force accomplishes. When force $ something is evaluated to WHNF, something is fully evaluated to normal form, that is, not only up to the outermost constructor but "deeply".
Notice that we still need the foldl' in order to "trigger" the force at each iteration.
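A minimal sketch of the approach described above, applied to the question's code and assuming force from Control.DeepSeq (the deepseq package). The initial accumulator is written with replicate, which builds the same list of pMax + 1 zeros as the question's list comprehension.

```haskell
import Control.DeepSeq (force)
import Data.List (foldl')

stepKP :: [Int] -> (Int, Int) -> [Int]
stepKP l (p, v) = take p l ++ zipWith bestOption l (drop p l)
  where bestOption a = max (a + v)

kp :: [(Int, Int)] -> Int -> Int
kp l pMax = last $ foldl' step (replicate (pMax + 1) 0) l
  where
    -- foldl' evaluates each new accumulator to WHNF; because the accumulator
    -- is wrapped in force, reaching WHNF evaluates the whole list, so no
    -- max/zipWith thunks survive into the next iteration.
    step acc item = force (stepKP acc item)

main :: IO ()
main = print $ kp (zip weights values) 20000
  where weights = [0 .. 2000]
        values  = reverse [8000 .. 10000]
```

The computed result is the same as in the question's version (1980100 on its input); only the point at which each row is evaluated changes, so at most one fully evaluated row of w + 1 entries needs to be live besides the one being built.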
I am new to Haskell.
While studying foldr, I found that many people suggest using it and avoiding explicit recursion, which can lead to memory-inefficient code.
https://www.reddit.com/r/haskell/comments/1nb80j/proper_use_of_recursion_in_haskell/
When I ran the sample mentioned in the above link, I could see that explicit recursion was doing better in terms of memory. At first I thought that running it in GHCi might not be a reliable benchmark, so I tried compiling it using stack ghc. By the way, how can I pass compiler optimization flags via stack ghc? What am I missing about the advice to avoid explicit recursion?
find p = foldr go Nothing
  where go x rest = if p x then Just x else rest
findRec :: (a -> Bool) -> [a] -> Maybe a
findRec _ [] = Nothing
findRec p (x:xs) = if p x then Just x else (findRec p xs)
main :: IO ()
main = print $ find (\x -> x `mod` 2 == 0) [1, 3..1000000]
main = print $ findRec (\x -> x `mod` 2 == 0) [1, 3..1000000]
-- find
Nothing
92,081,224 bytes allocated in the heap
9,392 bytes copied during GC
58,848 bytes maximum residency (2 sample(s))
26,704 bytes maximum slop
0 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 87 colls, 0 par 0.000s 0.000s 0.0000s 0.0001s
Gen 1 2 colls, 0 par 0.000s 0.001s 0.0004s 0.0008s
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.031s ( 0.043s elapsed)
GC time 0.000s ( 0.001s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 0.031s ( 0.044s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 2,946,599,168 bytes per MUT second
Productivity 100.0% of total user, 96.8% of total elapsed
-- findRec
Nothing
76,048,432 bytes allocated in the heap
13,768 bytes copied during GC
42,928 bytes maximum residency (2 sample(s))
26,704 bytes maximum slop
0 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 71 colls, 0 par 0.000s 0.000s 0.0000s 0.0001s
Gen 1 2 colls, 0 par 0.000s 0.001s 0.0004s 0.0007s
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.031s ( 0.038s elapsed)
GC time 0.000s ( 0.001s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 0.031s ( 0.039s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 2,433,549,824 bytes per MUT second
Productivity 100.0% of total user, 96.6% of total elapsed
You are measuring how quickly GHC can do half a million modulus operations. As you might expect, "in the blink of an eye" is the answer regardless of how you iterate. There is no obvious difference in speed.
You claim that you can see that explicit recursion is using less memory, but the heap profiling data you provide shows the opposite: more allocation and higher max residency when using explicit recursion. I don't think the difference is significant, but if it were then your evidence would be contradicting your claim.
As to the question of why to avoid explicit recursion, it's not really clear what part of that thread you read that made you come to your conclusion. You linked to a giant thread which itself links to another giant thread, with many competing opinions. The comment that stands out the most to me is "it's not about efficiency, it's about levels of abstraction". You are looking at this the wrong way by trying to measure its performance.
First, don't try to understand the performance of GHC-compiled code using anything other than optimized compilation:
$ stack ghc -- -O2 Find.hs
$ ./Find +RTS -s
With the -O2 flag (and GHC version 8.6.4), your find performs as follows:
16,051,544 bytes allocated in the heap
14,184 bytes copied during GC
44,576 bytes maximum residency (2 sample(s))
29,152 bytes maximum slop
0 MB total memory in use (0 MB lost due to fragmentation)
However, this is very misleading. None of this memory usage is due to the looping performed by foldr. Rather it's all due to the use of boxed Integers. If you switch to using plain Ints which the compiler can unbox:
main = print $ find (\x -> x `mod` 2 == 0) [1::Int, 3..1000000]
                                             ^^^^^
the memory performance changes drastically and demonstrates the true memory cost of foldr:
51,544 bytes allocated in the heap
3,480 bytes copied during GC
44,576 bytes maximum residency (1 sample(s))
25,056 bytes maximum slop
0 MB total memory in use (0 MB lost due to fragmentation)
If you test findRec with Ints like so:
main = print $ findRec (\x -> x `mod` 2 == 0) [1::Int, 3..1000000]
you'll see much worse memory performance:
40,051,528 bytes allocated in the heap
14,992 bytes copied during GC
44,576 bytes maximum residency (2 sample(s))
29,152 bytes maximum slop
0 MB total memory in use (0 MB lost due to fragmentation)
which seems to make a compelling case that recursion should be avoided in preference to foldr, but this, too, is very misleading. What you are seeing here is not the memory cost of recursion, but rather the memory cost of "list building".
See, foldr and the expression [1::Int, 3..1000000] both include some magic called "list fusion". This means that when they are used together (i.e., when foldr is applied to [1::Int, 3..1000000]), an optimization can be performed to completely eliminate the creation of a Haskell list. Critically, the foldr code, even using list fusion, compiles to recursive code which looks like this:
main_go
  = \ x ->
      case gtInteger# x lim of {
        __DEFAULT ->
          case eqInteger# (modInteger x lvl) lvl1 of {
            __DEFAULT -> main_go (plusInteger x lvl);
                      -- ^^^^^^^ - SEE? IT'S JUST RECURSION
            1# -> Just x
          };
        1# -> Nothing
      }
end Rec }
So, it's list fusion, rather than "avoiding recursion" that makes find faster than findRec.
You can see this is true by considering the performance of:
find1 :: Int -> Maybe Int
find1 n | n >= 1000000 = Nothing
        | n `mod` 2 == 0 = Just n
        | otherwise = find1 (n+2)
main :: IO ()
main = print $ find1 1
Even though this uses recursion, it doesn't generate a list (or use boxed Integers), so it runs just like the foldr version:
51,544 bytes allocated in the heap
3,480 bytes copied during GC
44,576 bytes maximum residency (1 sample(s))
25,056 bytes maximum slop
0 MB total memory in use (0 MB lost due to fragmentation)
So, what are the take home lessons?
Always benchmark Haskell code using ghc -O2, never GHCi or ghc without optimization flags.
Less than 10% of people in any Reddit thread know what they're talking about.
foldr can sometimes perform better than explicit recursion when special optimizations like list fusion can apply.
But in the general case, explicit recursion performs just as well as foldr or other specialized constructs.
Also, optimizing Haskell code is hard.
Actually, here's a better (more serious) take-home lesson. Especially when you're getting started with Haskell, make every possible effort to avoid thinking about "optimizing" your code. Far more than any other language I know, there is an enormous gulf between the code you write and the code the compiler generates, so don't even try to figure it out right now. Instead, write code that is clear, straightforward, and idiomatic. If you try to learn the "rules" for high-performance code now, you'll get them all wrong and learn really bad programming style into the bargain.
I'm trying to improve performance of this binary-trees benchmark from The Computer Language Benchmarks Game. The idea is to build lots of binary trees to benchmark memory allocation. The Tree data definition looks like this:
data Tree = Nil | Node !Int !Tree !Tree
According to the problem statement, there's no need to store an Int in every node and other languages don't have it.
I use GHC 8.2.2 and get the following RTS report when I run the original code:
stack --resolver lts-10.3 --compiler ghc-8.2.2 ghc -- --make -O2 -threaded -rtsopts -funbox-strict-fields -XBangPatterns -fllvm -pgmlo opt-3.9 -pgmlc llc-3.9 binarytrees.hs -o binarytrees.ghc_run
./binarytrees.ghc_run +RTS -N4 -sstderr -K128M -H -RTS 21
...
19,551,302,672 bytes allocated in the heap
7,291,702,272 bytes copied during GC
255,946,744 bytes maximum residency (18 sample(s))
233,480 bytes maximum slop
635 MB total memory in use (0 MB lost due to fragmentation)
...
Total time 58.620s ( 39.281s elapsed)
So far so good. Let's remove this Int, which is actually never used. The definition becomes
data Tree = Nil | Node !Tree !Tree
In theory we are going to save about 25% of total memory (3 integers in every node instead of 4). Let's try it:
...
313,388,960 bytes allocated in the heap
640,488 bytes copied during GC
90,016 bytes maximum residency (2 sample(s))
57,872 bytes maximum slop
5 MB total memory in use (0 MB lost due to fragmentation)
...
Total time 9.596s ( 9.621s elapsed)
5MB total memory in use and almost zero GC? Why? Where did all the allocations go?
I believe the sudden drop in memory usage was caused by the Common Sub-expression Elimination (CSE) optimization. The original code was:
make i d = Node i (make d d2) (make d2 d2)
--                      ^           ^
--                      |           d2 != d
--                      d != d2
Since the expressions constructing the left and the right subtrees are different, the compiler is not able to eliminate any allocations.
If I remove the unused integer, the code looks like this
make d = Node (make (d - 1)) (make (d - 1))
--             ^              ^
--             |              |
--             `--------------`----- identical
If I add the -fno-cse flag to GHC, the memory allocation is as high as expected, but the code is rather slow. I couldn't find a way to suppress this optimization locally so I decided to "outsmart" the compiler by adding extra unused arguments:
make' :: Int -> Int -> Tree
make' _ 0 = Node Nil Nil
make' !n d = Node (make' (n - 1) (d - 1)) (make' (n + 1) (d - 1))
The trick worked: the memory usage dropped by the expected 30%. But I wish there was a nicer way to tell the compiler what I want.
Thanks to @Carl for mentioning the CSE optimization.
I am trying to add parallelism to a program that converts a .bmp to a grayscale .bmp. I am seeing usually 2-4x worse performance for the parallel code. I am tweaking parBuffer / chunking sizes and still cannot seem to reason about it. Looking for guidance.
The entire source file used here: http://lpaste.net/106832
We use Codec.BMP to read in a stream of pixels represented by type RGBA = (Word8, Word8, Word8, Word8). To convert to grayscale, simply map a 'luma' transform across all the pixels.
The serial implementation is literally:
toGray :: [RGBA] -> [RGBA]
toGray x = map luma x
The test input .bmp is 5184 x 3456 (71.7 MB).
The serial implementation runs in ~10s, ~550ns/pixel. Threadscope looks clean:
Why is this so fast? I suppose it has something to do with lazy ByteString (even though Codec.BMP uses strict ByteString; is there an implicit conversion occurring here?) and fusion.
Adding Parallelism
First attempt at adding parallelism was via parList. Oh boy. The program used ~4-5GB memory and system started swapping.
I then read "Parallelizing Lazy Streams with parBuffer" section of Simon Marlow's O'Reilly book and tried parBuffer with a large size. This still did not produce desirable performance. The spark sizes were incredibly small.
I then tried to increase the spark size by chunking the lazy list and then sticking with parBuffer for the parallelism:
toGrayPar :: [RGBA] -> [RGBA]
toGrayPar x = concat $ (withStrategy (parBuffer 500 rpar) . map (map luma))
                       (chunk 8000 x)
chunk :: Int -> [a] -> [[a]]
chunk n [] = []
chunk n xs = as : chunk n bs where
  (as,bs) = splitAt (fromIntegral n) xs
But this still does not yield desirable performance:
18,934,235,760 bytes allocated in the heap
15,274,565,976 bytes copied during GC
639,588,840 bytes maximum residency (27 sample(s))
238,163,792 bytes maximum slop
1910 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 35277 colls, 35277 par 19.62s 14.75s 0.0004s 0.0234s
Gen 1 27 colls, 26 par 13.47s 7.40s 0.2741s 0.5764s
Parallel GC work balance: 30.76% (serial 0%, perfect 100%)
TASKS: 6 (1 bound, 5 peak workers (5 total), using -N2)
SPARKS: 4480 (2240 converted, 0 overflowed, 0 dud, 2 GC'd, 2238 fizzled)
INIT time 0.00s ( 0.01s elapsed)
MUT time 14.31s ( 14.75s elapsed)
GC time 33.09s ( 22.15s elapsed)
EXIT time 0.01s ( 0.12s elapsed)
Total time 47.41s ( 37.02s elapsed)
Alloc rate 1,323,504,434 bytes per MUT second
Productivity 30.2% of total user, 38.7% of total elapsed
gc_alloc_block_sync: 7433188
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 1017408
How can I better reason about what is going on here?
You have a big list of RGBA pixels. Why don't you use parListChunk with a reasonable chunk size?
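A minimal sketch of that suggestion, assuming the parallel package (Control.Parallel.Strategies) and the deepseq instances it relies on. The luma definition below is a hypothetical stand-in, since the question's own transform is not shown here, and the chunk size of 8000 is simply carried over from the question, not a tuned value.

```haskell
import Control.Parallel.Strategies (parListChunk, rdeepseq, withStrategy)
import Data.Word (Word8)

type RGBA = (Word8, Word8, Word8, Word8)

-- Hypothetical luma transform using Rec. 601 weights; a stand-in for the
-- question's own definition.
luma :: RGBA -> RGBA
luma (r, g, b, a) = (y, y, y, a)
  where
    y = round (0.299 * fromIntegral r
             + 0.587 * fromIntegral g
             + 0.114 * fromIntegral b :: Double)

-- One spark per chunk of 8000 pixels; rdeepseq makes each spark fully
-- evaluate its pixels rather than stopping at the outermost constructor.
toGrayPar :: [RGBA] -> [RGBA]
toGrayPar = withStrategy (parListChunk 8000 rdeepseq) . map luma

main :: IO ()
main = print (take 3 (toGrayPar (replicate 100000 (255, 128, 0, 255))))
```

To actually run in parallel, the program has to be built with -threaded and started with +RTS -N. Note that parListChunk traverses the whole list, so it does not stream lazily the way parBuffer does, and the right chunk size is best found by measuring.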