GHC optimizes allocations away

I'm trying to improve performance of this binary-trees benchmark from The Computer Language Benchmarks Game. The idea is to build lots of binary trees to benchmark memory allocation. The Tree data definition looks like this:
data Tree = Nil | Node !Int !Tree !Tree
According to the problem statement, there's no need to store an Int in every node, and implementations in other languages don't.
I use GHC 8.2.2 and get the following RTS report when running the original code:
stack --resolver lts-10.3 --compiler ghc-8.2.2 ghc -- --make -O2 -threaded -rtsopts -funbox-strict-fields -XBangPatterns -fllvm -pgmlo opt-3.9 -pgmlc llc-3.9 binarytrees.hs -o binarytrees.ghc_run
./binarytrees.ghc_run +RTS -N4 -sstderr -K128M -H -RTS 21
...
19,551,302,672 bytes allocated in the heap
7,291,702,272 bytes copied during GC
255,946,744 bytes maximum residency (18 sample(s))
233,480 bytes maximum slop
635 MB total memory in use (0 MB lost due to fragmentation)
...
Total time 58.620s ( 39.281s elapsed)
So far so good. Let's remove this Int, which is actually never used. The definition becomes
data Tree = Nil | Node !Tree !Tree
In theory we are going to save about 25% of total memory (3 words in every node instead of 4). Let's try it:
...
313,388,960 bytes allocated in the heap
640,488 bytes copied during GC
90,016 bytes maximum residency (2 sample(s))
57,872 bytes maximum slop
5 MB total memory in use (0 MB lost due to fragmentation)
...
Total time 9.596s ( 9.621s elapsed)
5MB total memory in use and almost zero GC? Why? Where did all the allocations go?

I believe the sudden drop in memory usage is caused by the Common Sub-expression Elimination (CSE) optimization. The original code was:
make i d = Node i (make d d2) (make d2 d2)
--                 ^^^^^^^^^^  ^^^^^^^^^^^
--       the two subtree expressions differ (d /= d2)
Since the expressions constructing the left and the right subtrees are different, the compiler is not able to eliminate any allocations.
If I remove the unused integer, the code looks like this:
make d = Node (make (d - 1)) (make (d - 1))
--            ^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^
--                      identical
If I add the -fno-cse flag to GHC, the memory allocation is as high as expected, but the code is rather slow. I couldn't find a way to suppress this optimization locally, so I decided to "outsmart" the compiler by adding extra unused arguments:
make' :: Int -> Int -> Tree
make' _ 0 = Node Nil Nil
make' !n d = Node (make' (n - 1) (d - 1)) (make' (n + 1) (d - 1))
The trick worked: the memory usage dropped by the expected 30%. But I wish there was a nicer way to tell the compiler what I want.
Thanks to @Carl for mentioning the CSE optimization.
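A nicer option may be GHC's OPTIONS_GHC pragma: flags given this way apply per module, so -fno-cse can be confined to the tree-building code instead of slowing down the whole program. A minimal sketch (assuming the builder is split into its own module; other simplifier passes may still share work):
{-# OPTIONS_GHC -fno-cse #-}
-- CSE is disabled for this module only, so the two structurally
-- identical subtree expressions below stay separate allocations,
-- as the benchmark intends.
module Trees (Tree (..), make) where

data Tree = Nil | Node !Tree !Tree

make :: Int -> Tree
make 0 = Node Nil Nil
make d = Node (make (d - 1)) (make (d - 1))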

Related

Why does this program consume so much memory?

I needed an algorithm to solve a knapsack (KP) problem in Haskell some time ago. Here is what my code looks like:
stepKP :: [Int] -> (Int, Int) -> [Int]
stepKP l (p, v) = take p l ++ zipWith bestOption l (drop p l)
where bestOption a = max (a+v)
kp :: [(Int, Int)] -> Int -> Int
kp l pMax = last $ foldl stepKP [0 | i <- [0..pMax]] l
main = print $ kp (zip weights values) 20000
  where weights = [0..2000]
        values = reverse [8000..10000]
But when I try to execute it (after compilation with ghc, no flags), the results look pretty bad. Here is the output of the command ./kp +RTS -s:
1980100
9,461,474,416 bytes allocated in the heap
6,103,730,184 bytes copied during GC
1,190,494,880 bytes maximum residency (18 sample(s))
5,098,848 bytes maximum slop
2624 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 6473 colls, 0 par 2.173s 2.176s 0.0003s 0.0010s
Gen 1 18 colls, 0 par 4.185s 4.188s 0.2327s 1.4993s
INIT time 0.000s ( 0.000s elapsed)
MUT time 3.320s ( 3.322s elapsed)
GC time 6.358s ( 6.365s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 9.679s ( 9.687s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 2,849,443,762 bytes per MUT second
Productivity 34.3% of total user, 34.3% of total elapsed
I think that my program takes O(n*w) memory, while it could do it in O(w) (where w is the total capacity).
Is that a problem of lazy evaluation taking too much space, or something else ?
How could this code be more memory and time efficient ?
We can think of a left fold as performing iterations while keeping an accumulator that is returned at the end.
When there are lots of iterations, one concern is that the accumulator might grow too large in memory. And because Haskell is lazy, this can happen even when the accumulator is of a primitive type like Int: behind some seemingly innocent Int value a large number of pending operations might lurk, in the form of thunks.
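For instance, a minimal illustration of the difference with the classic sum example:
import Data.List (foldl')

-- foldl builds a chain of pending (+) thunks before anything runs:
--   foldl (+) 0 [1,2,3]  =  ((0 + 1) + 2) + 3
sumLazy :: [Int] -> Int
sumLazy = foldl (+) 0      -- accumulator is a growing thunk

-- foldl' forces the accumulator to WHNF at each step, so the
-- running total stays an evaluated Int:
sumStrict :: [Int] -> Int
sumStrict = foldl' (+) 0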
Here the strict left fold function foldl' is useful because it ensures that, as the left fold is being evaluated, the accumulator will always be kept in weak head normal form (WHNF).
Alas, sometimes this isn't enough. WHNF only says that evaluation has progressed up to the "outermost constructor" of the value. This is enough for Int, but for recursive types like lists or trees, that isn't saying much: the thunks might simply lurk further down the list, or in branches below.
This is the case here, where the accumulator is a list that is recreated at each iteration. Each iteration, the foldl' only evaluates the list up to _ : _. Unevaluated max and zipWith operations start to pile up.
What we need is a way to trigger a full evaluation of the accumulator list at each iteration, one which cleans any max and zipWith thunks from memory. And this is what force accomplishes. When force $ something is evaluated to WHNF, something is fully evaluated to normal form, that is, not only up to the outermost constructor but "deeply".
Notice that we still need the foldl' in order to "trigger" the force at each iteration.
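Putting both together, a sketch of how the fix might look applied to the code above (force comes from the deepseq package; untested):
import Control.DeepSeq (force)
import Data.List (foldl')

stepKP :: [Int] -> (Int, Int) -> [Int]
stepKP l (p, v) = take p l ++ zipWith bestOption l (drop p l)
  where bestOption a = max (a + v)

kp :: [(Int, Int)] -> Int -> Int
kp l pMax = last $ foldl' step [0 | _ <- [0 .. pMax]] l
  where
    -- force evaluates the whole new row to normal form, clearing the
    -- pending max/zipWith thunks at each iteration
    step acc pv = force (stepKP acc pv)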

Memory footprint of splitOn?

I wrote a file indexing program that should read thousands of text file lines as records and finally group those records by fingerprint. It uses Data.List.Split.splitOn to split the lines at tabs and retrieve the record fields. The program consumes 10-20 GB of memory.
Probably there is not much I can do to reduce that huge memory footprint, but I cannot explain why a function like splitOn (breakDelim) can consume that much memory:
Mon Dec 9 21:07 2019 Time and Allocation Profiling Report (Final)
group +RTS -p -RTS file1 file2 -o 2 -h
total time = 7.40 secs (7399 ticks @ 1000 us, 1 processor)
total alloc = 14,324,828,696 bytes (excludes profiling overheads)
COST CENTRE MODULE SRC %time %alloc
fileToPairs.linesIncludingEmptyLines ImageFileRecordParser ImageFileRecordParser.hs:35:7-47 25.0 33.8
breakDelim Data.List.Split.Internals src/Data/List/Split/Internals.hs:(151,1)-(156,36) 24.9 39.3
sortAndGroup Aggregations Aggregations.hs:6:1-85 12.9 1.7
fileToPairs ImageFileRecordParser ImageFileRecordParser.hs:(33,1)-(42,14) 8.2 10.7
matchDelim Data.List.Split.Internals src/Data/List/Split/Internals.hs:(73,1)-(77,23) 7.4 0.4
onSublist Data.List.Split.Internals src/Data/List/Split/Internals.hs:278:1-72 3.6 0.0
toHashesView ImageFileRecordStatistics ImageFileRecordStatistics.hs:(48,1)-(51,24) 3.0 6.3
main Main group.hs:(47,1)-(89,54) 2.9 0.4
numberOfUnique ImageFileRecord ImageFileRecord.hs:37:1-40 1.6 0.1
toHashesView.sortedLines ImageFileRecordStatistics ImageFileRecordStatistics.hs:50:7-30 1.4 0.1
imageFileRecordFromFields ImageFileRecordParser ImageFileRecordParser.hs:(11,1)-(30,5) 1.1 0.3
toHashView ImageFileRecord ImageFileRecord.hs:(67,1)-(69,23) 0.7 1.7
Or is the type [Char] too memory-inefficient (compared to Text), causing splitOn to take that much memory?
UPDATE 1 (+RTS -s suggestion of user HTNW)
23,446,268,504 bytes allocated in the heap
10,753,363,408 bytes copied during GC
1,456,588,656 bytes maximum residency (22 sample(s))
29,282,936 bytes maximum slop
3620 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 45646 colls, 0 par 4.055s 4.059s 0.0001s 0.0013s
Gen 1 22 colls, 0 par 4.034s 4.035s 0.1834s 1.1491s
INIT time 0.000s ( 0.000s elapsed)
MUT time 7.477s ( 7.475s elapsed)
GC time 8.089s ( 8.094s elapsed)
RP time 0.000s ( 0.000s elapsed)
PROF time 0.000s ( 0.000s elapsed)
EXIT time 0.114s ( 0.114s elapsed)
Total time 15.687s ( 15.683s elapsed)
%GC time 51.6% (51.6% elapsed)
Alloc rate 3,135,625,407 bytes per MUT second
Productivity 48.4% of total user, 48.4% of total elapsed
The processed text files are smaller than usual (UTF-8 encoded, 37 MB). But still 3 GB of memory are used.
UPDATE 2 (critical part of the code)
Explanation: fileToPairs processes a text file. It returns a list of key-value pairs (key: fingerprint of record, value: record).
sortAndGroup associations = Map.fromListWith (++) [(k, [v]) | (k, v) <- associations]
main = do
  CommandLineArguments{..} <- cmdArgs $ CommandLineArguments {
      ignored_paths_file = def &= typFile,
      files = def &= typ "FILES" &= args,
      number_of_occurrences = def &= name "o",
      minimum_number_of_occurrences = def &= name "l",
      maximum_number_of_occurrences = def &= name "u",
      number_of_hashes = def &= name "n",
      having_record_errors = def &= name "e",
      hashes = def
    }
    &= summary "Group image/video files"
    &= program "group"

  let ignoredPathsFilenameMaybe = ignored_paths_file
  let filenames = files
  let hashesMaybe = hashes

  ignoredPaths <- case ignoredPathsFilenameMaybe of
    Just ignoredPathsFilename -> ioToLines (readFile ignoredPathsFilename)
    _ -> return []

  recordPairs <- mapM (fileToPairs ignoredPaths) filenames
  let allRecordPairs = concat recordPairs
  let groupMap = sortAndGroup allRecordPairs
  let statisticsPairs = map toPair (Map.toList groupMap)
        where toPair item = (fst item, imageFileRecordStatisticsFromRecords . snd $ item)

  let filterArguments = FilterArguments {
        numberOfOccurrencesMaybe = number_of_occurrences,
        minimumNumberOfOccurrencesMaybe = minimum_number_of_occurrences,
        maximumNumberOfOccurrencesMaybe = maximum_number_of_occurrences,
        numberOfHashesMaybe = number_of_hashes,
        havingRecordErrorsMaybe = having_record_errors
      }
  let filteredPairs = filterImageRecords filterArguments statisticsPairs
  let filteredMap = Map.fromList filteredPairs

  case hashesMaybe of
    Just True -> mapM_ putStrLn (map toHashesView (map snd filteredPairs))
    _ -> Char8.putStrLn (encodePretty filteredMap)
As I'm sure you're aware, there's not really enough information here for us to help you make your program more efficient. It might be worth posting some (complete, self-contained) code on the Code Review site for that.
However, I think I can answer your specific question about why splitOn allocates so much memory. In fact, there's nothing particularly special about splitOn or how it's been implemented. Many straightforward Haskell functions will allocate lots of memory, and this in itself doesn't indicate that they've been poorly written or are running inefficiently. In particular, splitOn's memory usage seems similar to other straightforward approaches to splitting a string based on delimiters.
The first thing to understand is that GHC compiled code works differently than other compiled code you're likely to have seen. If you know a lot of C and understand stack frames and heap allocation, or if you've studied some JVM implementations, you might reasonably expect that some of that understanding would translate to GHC executables, but you'd be mostly wrong.
A GHC program is more or less an engine for allocating heap objects, and -- with a few exceptions -- that's all it really does. Nearly every argument passed to a function or constructor (as well as the constructor application itself) allocates a heap object of at least 16 bytes, and often more. Take a simple function like:
fact :: Int -> Int
fact 0 = 1
fact n = n * fact (n-1)
With optimization turned off, it compiles to the following so-called "STG" form (simplified from the actual -O0 -ddump-stg output):
fact = \n -> case n of
  I# n' -> case n' of
    0# -> I# 1#
    _  -> let sat1 = let sat2 = let one = I# 1# in n - one
                     in fact sat2
          in n * sat1
Everywhere you see a let, that's a heap allocation (16+ bytes), and there are presumably more hidden in the (-) and (*) calls. Compiling and running this program with:
main = print $ fact 1000000
gives:
113,343,544 bytes allocated in the heap
44,309,000 bytes copied during GC
25,059,648 bytes maximum residency (5 sample(s))
29,152 bytes maximum slop
23 MB total memory in use (0 MB lost due to fragmentation)
meaning that each iteration allocates over a hundred bytes on the heap, though it's literally just performing a comparison, a subtraction, a multiplication, and a recursive call.
This is what @HTNW meant in saying that total allocation in a GHC program is a measure of "work". A GHC program that isn't allocating probably isn't doing anything (again, with some rare exceptions), and a typical GHC program that is doing something will usually allocate at a relatively constant rate of several gigabytes per second when it's not garbage collecting. So, total allocation has more to do with total runtime than anything else, and it isn't a particularly good metric for assessing code efficiency. Maximum residency is also a poor measure of overall efficiency, though it can be helpful for assessing whether or not you have a space leak, if you find that it tends to grow linearly (or worse) with the size of the input where you expect the program should run in constant memory regardless of input size.
For most programs, the most important true efficiency metric in the +RTS -s output is probably the "productivity" rate at the bottom -- the fraction of time the program spends not garbage collecting. And, admittedly, your program's productivity of 48% is pretty bad, which probably means that it is, technically speaking, allocating too much memory. But it's probably only allocating two or three times the amount it should be, so, at a guess, maybe it should "only" be allocating around 7-8 Gigs instead of 23 Gigs for this workload (and, consequently, running for about 5 seconds instead of 15 seconds).
With that in mind, if you consider the following simple breakDelim implementation:
breakDelim :: String -> [String]
breakDelim str = case break (=='\t') str of
(a,_:b) -> a : breakDelim b
(a,[]) -> [a]
and use it like so in a simple tab-to-comma delimited file converter:
main = interact (unlines . map (intercalate "," . breakDelim) . lines)
Then, unoptimized and run on a file with 10000 lines of 1000 3-character fields each, it allocates a whopping 17 Gigs:
17,227,289,776 bytes allocated in the heap
2,807,297,584 bytes copied during GC
127,416 bytes maximum residency (2391 sample(s))
32,608 bytes maximum slop
0 MB total memory in use (0 MB lost due to fragmentation)
and profiling it places a lot of blame on breakDelim:
COST CENTRE MODULE SRC %time %alloc
main Main Delim.hs:8:1-71 57.7 72.6
breakDelim Main Delim.hs:(4,1)-(6,16) 42.3 27.4
In this case, compiling with -O2 doesn't make much difference. The key efficiency metric, productivity, is only 46%. All these results seem to be in line with what you're seeing in your program.
The split package has a lot going for it, but looking through the code, it's pretty clear that little effort has been made to make it particularly efficient or fast, so it's no surprise that splitOn performs no better than my quick-and-dirty custom breakDelim function. And, as I said before, there's nothing special about splitOn that makes it unusually memory hungry -- my simple breakDelim has similar behavior.
With respect to inefficiencies of the String type, it can often be problematic. But, it can also participate in optimizations like list fusion in ways that Text can't. The utility above could be rewritten in simpler form as:
main = interact $ map (\c -> if c == '\t' then ',' else c)
which uses String but runs pretty fast (about a quarter as fast as a naive C getchar/putchar implementation) at 84% productivity, while allocating about 5 Gigs on the heap.
It's quite likely that if you just take your program and "convert it to Text", you'll find it's slower and more memory hungry than the original! While Text has the potential to be much more efficient than String, it's a complicated package, and the way in which Text objects behave with respect to allocation when they're sliced and diced (as when you're chopping a big Text file up into little Text fields) makes it more difficult to get right.
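For reference, the Text analogue of the splitting step would look something like this (Data.Text.splitOn from the text package; whether it actually wins here is exactly the caveat above):
import qualified Data.Text as T

-- Split one line into tab-separated fields. Each field is a slice
-- sharing the original line's buffer -- one of the subtleties
-- mentioned above: holding on to one small field can keep the whole
-- line's buffer alive.
fields :: T.Text -> [T.Text]
fields = T.splitOn (T.pack "\t")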
So, some take-home lessons:
Total allocation is a poor measure of efficiency. Most well written GHC programs can and should allocate several gigabytes per second of runtime.
Many innocuous Haskell functions will allocate lots of memory because of the way GHC compiled code works. This isn't necessarily a sign that there's something wrong with the function.
The split package provides a flexible framework for all manner of cool list splitting manipulations, but it was not designed with speed in mind, and it may not be the best method of processing a tab-delimited file.
The String data type has a potential for terrible inefficiency, but isn't always inefficient, and Text is a complicated package that won't be a drop-in replacement to fix your String performance woes.
Most importantly:
Unless your program is too slow for its intended purpose, its run-time statistics and the theoretical advantages of Text over String are largely irrelevant.

Why to avoid Explicit recursion in Haskell?

I am new to Haskell.
While studying foldr, many people suggest using it and avoiding explicit recursion, which can lead to memory-inefficient code.
https://www.reddit.com/r/haskell/comments/1nb80j/proper_use_of_recursion_in_haskell/
Running the sample mentioned in the above link, I can see that explicit recursion does better in terms of memory. At first I thought that running in GHCi is not a proper benchmark, so I tried compiling it using stack ghc. (By the way, how can I pass compiler optimization flags via stack ghc?) What am I missing about the advice to avoid explicit recursion?
find :: (a -> Bool) -> [a] -> Maybe a
find p = foldr go Nothing
  where go x rest = if p x then Just x else rest

findRec :: (a -> Bool) -> [a] -> Maybe a
findRec _ [] = Nothing
findRec p (x:xs) = if p x then Just x else findRec p xs

main :: IO ()
main = print $ find (\x -> x `mod` 2 == 0) [1, 3..1000000]
-- main = print $ findRec (\x -> x `mod` 2 == 0) [1, 3..1000000]  -- (run separately)
-- find
Nothing
92,081,224 bytes allocated in the heap
9,392 bytes copied during GC
58,848 bytes maximum residency (2 sample(s))
26,704 bytes maximum slop
0 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 87 colls, 0 par 0.000s 0.000s 0.0000s 0.0001s
Gen 1 2 colls, 0 par 0.000s 0.001s 0.0004s 0.0008s
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.031s ( 0.043s elapsed)
GC time 0.000s ( 0.001s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 0.031s ( 0.044s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 2,946,599,168 bytes per MUT second
Productivity 100.0% of total user, 96.8% of total elapsed
-- findRec
Nothing
76,048,432 bytes allocated in the heap
13,768 bytes copied during GC
42,928 bytes maximum residency (2 sample(s))
26,704 bytes maximum slop
0 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 71 colls, 0 par 0.000s 0.000s 0.0000s 0.0001s
Gen 1 2 colls, 0 par 0.000s 0.001s 0.0004s 0.0007s
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.031s ( 0.038s elapsed)
GC time 0.000s ( 0.001s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 0.031s ( 0.039s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 2,433,549,824 bytes per MUT second
Productivity 100.0% of total user, 96.6% of total elapsed
You are measuring how quickly GHC can do half a million modulus operations. As you might expect, "in the blink of an eye" is the answer regardless of how you iterate. There is no obvious difference in speed.
You claim that you can see that explicit recursion is using less memory, but the heap profiling data you provide shows the opposite: more allocation and higher max residency when using explicit recursion. I don't think the difference is significant, but if it were then your evidence would be contradicting your claim.
As to the question of why to avoid explicit recursion, it's not really clear what part of that thread you read that made you come to your conclusion. You linked to a giant thread which itself links to another giant thread, with many competing opinions. The comment that stands out the most to me is that "it's not about efficiency, it's about levels of abstraction." You are looking at this the wrong way by trying to measure its performance.
First, don't try to understand the performance of GHC-compiled code using anything other than optimized compilation:
$ stack ghc -- -O2 Find.hs
$ ./Find +RTS -s
With the -O2 flag (and GHC version 8.6.4), your find performs as follows:
16,051,544 bytes allocated in the heap
14,184 bytes copied during GC
44,576 bytes maximum residency (2 sample(s))
29,152 bytes maximum slop
0 MB total memory in use (0 MB lost due to fragmentation)
However, this is very misleading. None of this memory usage is due to the looping performed by foldr. Rather it's all due to the use of boxed Integers. If you switch to using plain Ints which the compiler can unbox:
main = print $ find (\x -> x `mod` 2 == 0) [1::Int, 3..1000000]
--                                          ^^^^^^
the memory performance changes drastically and demonstrates the true memory cost of foldr:
51,544 bytes allocated in the heap
3,480 bytes copied during GC
44,576 bytes maximum residency (1 sample(s))
25,056 bytes maximum slop
0 MB total memory in use (0 MB lost due to fragmentation)
If you test findRec with Ints like so:
main = print $ findRec (\x -> x `mod` 2 == 0) [1::Int, 3..1000000]
you'll see much worse memory performance:
40,051,528 bytes allocated in the heap
14,992 bytes copied during GC
44,576 bytes maximum residency (2 sample(s))
29,152 bytes maximum slop
0 MB total memory in use (0 MB lost due to fragmentation)
which seems to make a compelling case that recursion should be avoided in preference to foldr, but this, too, is very misleading. What you are seeing here is not the memory cost of recursion, but rather the memory cost of "list building".
See, foldr and the expression [1::Int, 3..1000000] both include some magic called "list fusion". This means that when they are used together (i.e., when foldr is applied to [1::Int, 3..1000000]), an optimization can be performed to completely eliminate the creation of a Haskell list. Critically, the foldr code, even using list fusion, compiles to recursive code which looks like this:
main_go
  = \ x ->
      case gtInteger# x lim of {
        __DEFAULT ->
          case eqInteger# (modInteger x lvl) lvl1 of {
            __DEFAULT -> main_go (plusInteger x lvl);
            --           ^^^^^^^ SEE? IT'S JUST RECURSION
            1# -> Just x
          };
        1# -> Nothing
      }
end Rec }
So, it's list fusion, rather than "avoiding recursion" that makes find faster than findRec.
You can see this is true by considering the performance of:
find1 :: Int -> Maybe Int
find1 n | n >= 1000000 = Nothing
| n `mod` 2 == 0 = Just n
| otherwise = find1 (n+2)
main :: IO ()
main = print $ find1 1
Even though this uses recursion, it doesn't generate a list (or use boxed Integers), so it runs just like the foldr version:
51,544 bytes allocated in the heap
3,480 bytes copied during GC
44,576 bytes maximum residency (1 sample(s))
25,056 bytes maximum slop
0 MB total memory in use (0 MB lost due to fragmentation)
So, what are the take home lessons?
Always benchmark Haskell code using ghc -O2, never GHCi or ghc without optimization flags.
Less than 10% of people in any Reddit thread know what they're talking about.
foldr can sometimes perform better than explicit recursion when special optimizations like list fusion can apply.
But in the general case, explicit recursion performs just as well as foldr or other specialized constructs.
Also, optimizing Haskell code is hard.
Actually, here's a better (more serious) take-home lesson. Especially when you're getting started with Haskell, make every possible effort to avoid thinking about "optimizing" your code. Far more than any other language I know, there is an enormous gulf between the code you write and the code the compiler generates, so don't even try to figure it out right now. Instead, write code that is clear, straightforward, and idiomatic. If you try to learn the "rules" for high-performance code now, you'll get them all wrong and learn really bad programming style into the bargain.

Efficient bit-fiddling in a LFSR implementation

Although I have a good LFSR C implementation I thought I'd try the same in Haskell - just to see how it goes. What I came up with, so far, is two orders of magnitude slower than the C implementation, which raises the question: How can the performance be improved? Clearly, the bit-fiddling operations are the bottleneck, and the profiler confirms this.
Here's the baseline Haskell code using lists and Data.Bits:
import Control.Monad (when)
import Data.Bits (Bits, shift, testBit, xor, (.&.), (.|.))
import System.Environment (getArgs)
import System.Exit (exitFailure, exitSuccess)
tap :: [[Int]]
tap = [
[], [], [], [3, 2],
[4, 3], [5, 3], [6, 5], [7, 6],
[8, 6, 5, 4], [9, 5], [10, 7], [11, 9],
[12, 6, 4, 1], [13, 4, 3, 1], [14, 5, 3, 1], [15, 14],
[16,15,13,4], [17, 14], [18, 11], [19, 6, 2, 1],
[20, 17], [21, 19], [22, 21], [23, 18],
[24,23,22,17], [25, 22], [26, 6, 2, 1], [27, 5, 2, 1],
[28, 25], [29, 27], [30, 6, 4, 1], [31, 28],
[32,22,2,1], [33,20], [34,27,2,1], [35,33],
[36,25], [37,5,4,3,2,1],[38,6,5,1], [39,35],
[40,38,21,19], [41,38], [42,41,20,19], [43,42,38,37],
[44,43,18,17], [45,44,42,41], [46,45,26,25], [47,42],
[48,47,21,20], [49,40], [50,49,24,23], [51,50,36,35],
[52,49], [53,52,38,37], [54,53,18,17], [55,31],
[56,55,35,34], [57,50], [58,39], [59,58,38,37],
[60,59], [61,60,46,45], [62,61,6,5], [63,62] ]
xor' :: [Bool] -> Bool
xor' = foldr xor False
mask :: (Num a, Bits a) => Int -> a
mask len = shift 1 len - 1
advance :: Int -> [Int] -> Int -> Int
advance len tap lfsr
  | d0        = shifted
  | otherwise = shifted .|. 1
  where
    shifted = shift lfsr 1 .&. mask len
    d0 = xor' $ map (testBit lfsr) tap'
    tap' = map (subtract 1) tap
main :: IO ()
main = do
  args <- getArgs
  when (null args) $ fail "Usage: lsfr <number-of-bits>"
  let len = read $ head args
  when (len < 8) $ fail "No need for LFSR"
  let out = last $ take (shift 1 len) $ iterate (advance len (tap!!len)) 0
  if out == 0
    then do
      putStr "OK\n"
      exitSuccess
    else do
      putStr "FAIL\n"
      exitFailure
Basically it tests whether the LFSR defined in tap :: [[Int]] for any given bit-length is of maximum length. (More precisely, it just checks whether the LFSR reaches the initial state (zero) after 2^n iterations.)
According to the profiler the most costly line is the feedback bit d0 = xor' $ map (testBit lfsr) tap'.
What I've tried so far:
use Data.Array: Attempt abandoned because there's no foldl/r
use Data.Vector: Slightly faster than the baseline
The compiler options I use are: -O2, LTS Haskell 8.12 (GHC-8.0.2).
The reference C++ program can be found on gist.github.com.
The Haskell code can't be expected (?) to run as fast as the C code, but two orders of magnitude is too much; there must be a better way to do the bit-fiddling.
Update: Results of applying the optimisations suggested in the answers
The reference C++ program with input 28, compiled with LLVM 8.0.0, runs in 0.67s on my machine (the same with clang 3.7 is marginally slower, 0.68s)
The baseline Haskell code runs about 100x slower (because of the space inefficiency don't try it with inputs larger than 25)
With the rewrite of @Thomas M. DuBuisson, still using the default GHC backend, the execution time goes down to 5.2s
With the rewrite of @Thomas M. DuBuisson, now using the LLVM backend (GHC option -O2 -fllvm), the execution time goes down to 1.7s
Using GHC option -O2 -fllvm -optlc -mcpu=native brings this to 0.73s
Replacing iterate with iterate' of @Cirdec makes no difference when Thomas' code is used (both with the default 'native' backend and LLVM). However, it does make a difference when the baseline code is used.
So, we've come from 100x to 8x to 1.09x, i.e. only 9% slower than C!
Note
The LLVM backend to GHC 8.0.2 requires LLVM 3.7. On Mac OS X this means installing this version with brew and then symlinking opt and llc. See 7.10. GHC Backends.
Up Front Matters
For starters, I'm using GHC 8.0.1 on an Intel I5 ~2.5GHz, linux x86-64.
First Draft: Oh No! The slows!
Your starting code with parameter 25 runs:
% ghc -O2 orig.hs && time ./orig 25
[1 of 1] Compiling Main ( orig.hs, orig.o )
Linking orig ...
OK
./orig 25 7.25s user 0.50s system 99% cpu 7.748 total
So the time to beat is 77ms - two orders of magnitude better than this Haskell code. Let's dive in.
Issue 1: Shifty Code
I found a couple of oddities with the code. First was the use of shift in high-performance code. shift supports both left and right shifts, and to do so it requires a branch. Let's kill that with more readable powers of two and such (shift 1 x ~> 2^x and shift x 1 ~> 2*x):
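A hypothetical reconstruction of the substituted definitions (the answer doesn't print its exact noShift.hs):
mask :: (Num a, Bits a) => Int -> a
mask len = 2^len - 1                  -- was: shift 1 len - 1

advance :: Int -> [Int] -> Int -> Int
advance len tap lfsr
  | d0        = shifted
  | otherwise = shifted .|. 1
  where
    shifted = 2*lfsr .&. mask len     -- was: shift lfsr 1 .&. mask len
    d0      = xor' $ map (testBit lfsr) tap'
    tap'    = map (subtract 1) tap
with take (2^len) replacing take (shift 1 len) in main. Timing with that change: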
% ghc -O2 noShift.hs && time ./noShift 25
[1 of 1] Compiling Main ( noShift.hs, noShift.o )
Linking noShift ...
OK
./noShift 25 0.64s user 0.00s system 99% cpu 0.637 total
(As you noted in the comments: yes, this bears investigation. It might be that some oddity of the prior code was preventing a rewrite rule from firing, resulting in much worse code.)
Issue 2: Lists Of Bits? Int operations save the day!
One change, one order of magnitude. Yay. What else? Well, you have this awkward list of bit locations you're tapping that just seems like it's begging for inefficiency and/or leaning on fragile optimizations. At this point I'll note that hard-coding any one selection from that list results in really good performance (such as testBit lsfr 24 `xor` testBit lsfr 21) but we want a more general fast solution.
I propose we compute the mask of all the tap locations then do a one-instruction pop count. To do this we only need a single Int passed in to advance instead of a whole list. The popcount instruction requires good assembly generation which requires llvm and probably -optlc-mcpu=native or another instruction set selection that is non-pessimistic.
This step gives us pc below. I've folded in the guard-removal of advance that was mentioned in the comments:
let tp = sum $ map ((2^) . subtract 1) (tap !! len)
    pc lfsr = fromEnum (even (popCount (lfsr .&. tp)))
    mask = 2^len - 1

    advance' :: Int -> Int
    advance' lfsr = (2*lfsr .&. mask) .|. pc lfsr

    out :: Int
    out = last $ take (2^len) $ iterate advance' 0
Our resulting performance is:
% ghc -O2 so.hs -fforce-recomp -fllvm -optlc-mcpu=native && time ./so 25
[1 of 1] Compiling Main ( so.hs, so.o )
Linking so ...
OK
./so 25 0.06s user 0.00s system 96% cpu 0.067 total
That's over two orders of magnitude from start to finish, so hopefully it matches your C. Finally, in deployed code it is actually really common to have Haskell packages with C bindings but this is often an educational exercise so I hope you had fun.
Edit: The now-available C++ code takes my system 0.10 (g++ -O3) and 0.12 (clang++ -O3 -march=native) seconds, so it seems we've beat our mark by a fair bit.
I suspect that the following line is building a large list-like thunk in memory before evaluating it:
let out = last $ take (shift 1 len) $ iterate (advance len (tap!!len)) 0
Let's find out if I'm right, and if I am, we'll fix it. The first debugging step is to get an idea of the memory used by the program. To do this we're going to compile with the option -rtsopts in addition to -O2. This enables running the program with RTS options, including +RTS -s, which outputs a small memory summary.
Initial Performance
Running your program as lfsr 25 +RTS -s I get the following output
OK
5,420,148,768 bytes allocated in the heap
6,705,977,216 bytes copied during GC
1,567,511,384 bytes maximum residency (20 sample(s))
357,862,432 bytes maximum slop
3025 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 10343 colls, 0 par 2.453s 2.522s 0.0002s 0.0009s
Gen 1 20 colls, 0 par 2.281s 3.065s 0.1533s 0.7128s
INIT time 0.000s ( 0.000s elapsed)
MUT time 1.438s ( 1.162s elapsed)
GC time 4.734s ( 5.587s elapsed)
EXIT time 0.016s ( 0.218s elapsed)
Total time 6.188s ( 6.967s elapsed)
%GC time 76.5% (80.2% elapsed)
Alloc rate 3,770,538,273 bytes per MUT second
Productivity 23.5% of total user, 19.8% of total elapsed
That's a lot of memory used at once. It's very likely there's a big thunk building up in there somewhere.
Trying to reduce the thunk size
I hypothesized that the thunk is being built in iterate (advance ...). If this is the case, we can try to reduce the thunk size by making advance more strict in its lfsr argument. This won't remove the spine of the thunk (the successive iterations), but it might reduce the size of the state that's built up as the spine's evaluated.
BangPatterns is an easy way to make a function strict in an argument. f !x = .. is shorthand for f x = seq x $ ...
{-# LANGUAGE BangPatterns #-}

advance :: Int -> [Int] -> Int -> Int
advance len tap = go
  where
    go !lfsr
      | d0        = shifted
      | otherwise = shifted .|. 1
      where
        shifted = shift lfsr 1 .&. mask len
        d0 = xor' $ map (testBit lfsr) tap'
        tap' = map (subtract 1) tap
Let's see what difference this makes ...
>lfsr 25 +RTS -s
OK
5,420,149,072 bytes allocated in the heap
6,705,979,368 bytes copied during GC
1,567,511,448 bytes maximum residency (20 sample(s))
357,862,448 bytes maximum slop
3025 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 10343 colls, 0 par 2.688s 2.711s 0.0003s 0.0059s
Gen 1 20 colls, 0 par 2.438s 3.252s 0.1626s 0.8013s
INIT time 0.000s ( 0.000s elapsed)
MUT time 1.328s ( 1.146s elapsed)
GC time 5.125s ( 5.963s elapsed)
EXIT time 0.000s ( 0.226s elapsed)
Total time 6.484s ( 7.335s elapsed)
%GC time 79.0% (81.3% elapsed)
Alloc rate 4,081,053,418 bytes per MUT second
Productivity 21.0% of total user, 18.7% of total elapsed
None that's noticeable.
Eliminating the Spine
I guess it's the spine of that iterate (advance ...) that's being built. After all, for the command I'm running the list would be 2^25, or a little over 33 million items long. The list itself is probably being removed by list fusion, but the thunk for the last item of the list is over 33 million applications of advance ...
To solve this problem we need a strict version of iterate so that the value is forced to an Int before applying the advance function again. This should keep the memory down to only a single lfsr value at a time, along with the currently computed application of advance.
Unfortunately, there isn't a strict iterate in Data.List. Here's one that doesn't give up on the list fusion that's providing other important (I think) performance optimizations to this problem.
{-# LANGUAGE BangPatterns #-}
import GHC.Base (build)

{-# NOINLINE [1] iterate' #-}
iterate' :: (a -> a) -> a -> [a]
iterate' f = go
  where go !x = x : go (f x)

{-# NOINLINE [0] iterateFB' #-}
iterateFB' :: (a -> b -> b) -> (a -> a) -> a -> b
iterateFB' c f = go
  where go !x = x `c` go (f x)

{-# RULES
"iterate'"   [~1] forall f x. iterate' f x = build (\c _n -> iterateFB' c f x)
"iterateFB'" [1]  iterateFB' (:) = iterate'
  #-}
This is just iterate from GHC.List (along with all its rewrite rules), but made strict in the accumulated argument.
Equipped with a strict iterate, iterate', we can change the troublesome line to
let out = last $ take (shift 1 len) $ iterate' (advance len (tap!!len)) 0
I expect that this will perform much better. Let's see ...
>lfsr 25 +RTS -s
OK
3,758,156,184 bytes allocated in the heap
297,976 bytes copied during GC
43,800 bytes maximum residency (1 sample(s))
21,736 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 7281 colls, 0 par 0.047s 0.008s 0.0000s 0.0000s
Gen 1 1 colls, 0 par 0.000s 0.000s 0.0002s 0.0002s
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.750s ( 0.783s elapsed)
GC time 0.047s ( 0.008s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 0.797s ( 0.792s elapsed)
%GC time 5.9% (1.0% elapsed)
Alloc rate 5,010,874,912 bytes per MUT second
Productivity 94.1% of total user, 99.0% of total elapsed
This used 0.00002 times as much memory and ran 10 times as fast.
I don't know if this will improve on Thomas M. DuBuisson's answer, which improves advance but still leaves a lazy iterate advance' in place. It would be easy to check: add the iterate' code to that answer and use iterate' in place of iterate.
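The combined version would presumably be this one-line change (untested sketch; advance' is the popcount-based step from that answer, iterate' the strict iterate above):
let out = last $ take (2^len) $ iterate' advance' 0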
Does the compiler lift tap !! len out of the loop? I suspect it does, but moving it out to guarantee this can't hurt:
let tap1 = tap !! len
let out = last $ take (shift 1 len) $ iterate (advance len tap1) 0
In the comments you say "2^len is needed exactly once", but this is wrong: you compute it on every call to advance. So you could try
advance len tap mask lfsr
  | d0        = shifted
  | otherwise = shifted .|. 1
  where
    shifted = shift lfsr 1 .&. mask
    d0 = xor' $ map (testBit lfsr) tap'
    tap' = map (subtract 1) tap

-- in main
let tap1 = tap !! len
let numIterations = 2^len
let mask = numIterations - 1
let out = iterate (advance len tap1 mask) 0 !! (numIterations - 1)
(the compiler can't optimize last $ take ... to !! in general, because they are different for finite lists, but iterate always returns an infinite one.)
You compared foldr with foldl, but foldl is almost never what you need; since xor always needs both arguments and is associative, foldl' is very likely to be the right choice (the compiler can optimize it, but if there is any real difference between foldl and foldr and not just random variation, it might have failed in this case).
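That suggestion as code (a one-line change from the original xor'):
import Data.Bits (xor)
import Data.List (foldl')

xor' :: [Bool] -> Bool
xor' = foldl' xor False   -- strict left fold; xor needs both arguments anyway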

Running into memory issues with Data.Sequence on a manageably sized dataset

TL;DR: I'm working on a piece of code which generates a (long) array of numbers. I'm able to generate this array, convert it to a List and then calculate the maximum (using a strict left fold). BUT, I run into memory issues when I try to convert the list to a Sequence prior to calculating the maximum. This is quite counter-intuitive to me.
My question: Why is this happening and what is the correct approach for converting the data to a Sequence structure?
Background:
I'm working on a problem which I've chosen to tackle in using three steps (below).
*Note: I'm intentionally keeping the problem statement vague so this post doesn't serve as a hint.
Anyways, my proposed approach:
(i) First, generate a long list of integers, namely, the number of factors for each integer from 1 to 100 million (NOT the factors themselves, just the number of factors)
(ii) second, convert this list into a Sequence.
(iii) lastly, use an efficient sliding window maximum algorithm to calc my answer (this step requires dequeue operations, hence the need for a Sequence)
(Again, the specifics of the problem aren't that relevant as I'm just curious as to why I'm running into this particular issue in the first place.)
What I've done so far?
Step 1 was fairly straightforward - see output below (full code is included at the bottom). I just brute-force a sieve using an unboxed Array and the accumArray function, nothing fancy. Note: I've used this same algorithm to solve a number of other such problems, so I'm reasonably confident that it's giving the right answer.
For the purposes of showing execution time / memory-usage stats, I've (admittedly arbitrarily) chosen to calculate the maximum element in the resulting array - the idea is simply to use a function which forces construction of all elements of the Array, thereby ensuring that we see meaningful stats for exec time / memory-usage.
The following main function...
main = print $ maximum' $ elems (sieve (10^8))
...results in the following (i.e., it says that the number below 100 million with the most divisors has a total of 768 divisors):
Linking maxdivSO ...
768
33.73s user 70.80s system 99% cpu 1:44.62 total
344,214,504,640 bytes allocated in the heap
58,471,497,320 bytes copied during GC
200,062,352 bytes maximum residency (298 sample(s))
3,413,824 bytes maximum slop
386 MB total memory in use (0 MB lost due to fragmentation)
%GC time 24.7% (30.5% elapsed)
The problem
It seems like we can accomplish the first step without breaking a sweat since I've allocated a total of 5gb to my VirtualBox and the above code uses <400mb (as reference, I've seen programs execute successfully and report using 3gb+ of total memory). In other words, it seems like we've accomplished Step 1 with plenty of headroom.
So I'm a bit surprised as to why the following version of the main function fails. We attempt to perform the same calculation of the maximum but after converting the list of integers to a Sequence. The following code...
main = print $ maximum' $ fromList $ elems (sieve (10^8))
...results in the following:
Linking maxdivSO ...
maxdivSO: out of memory (requested 2097152 bytes)
39.48s user 76.35s system 99% cpu 1:56.03 total
My question: Why does the algorithm (as currently written) run out of memory if we try to convert the list to a Sequence? And how might I go about successfully converting this list into a Sequence?
(I'm not one to stubbornly stick to brute-force for these types of problems - but I have a strong suspicion that this particular issue is due to my not being able to reason well about evaluation.)
The code itself:
{-# LANGUAGE NoMonomorphismRestriction #-}

import Data.Word (Word32, Word16)
import Data.Foldable (Foldable, foldl')
import Data.Array.Unboxed (UArray, accumArray, elems)
import Data.Sequence (fromList)

main :: IO ()
main = print $ maximum' $ elems (sieve (10^8))                -- <-- works
-- main = print $ maximum' $ fromList $ elems (sieve (10^8))  -- <-- doesn't work

maximum' :: (Foldable t, Ord a, Num a) => t a -> a
maximum' = foldl' (\acc x -> if x > acc then x else acc) 0

sieve :: Int -> UArray Word32 Word16
sieve u' = accumArray (+) 2 (1,u) ((1,-1) : factors)
  where
    u = fromIntegral u'
    cap = floor $ sqrt (fromIntegral u) :: Word32
    factors = [ (i*d, j) | d <- [2..cap]
                         , i <- [2..(u `quot` d)]
                         , d <= i
                         , let j = if i == d then 1 else 2 ]
I think the reason for this is that to get the first element of a sequence requires the full sequence to be constructed in memory (since the internal representation of the sequence is a tree). In the list case, elems yields the elements lazily.
Rather than turning the full array into a sequence, why not make the sequence only as long as your sliding window?
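A sketch of that idea (hypothetical helper names; the window-maximum logic itself is left out):
import qualified Data.Sequence as Seq
import Data.Sequence (Seq, (|>))

-- Append the next element and drop the oldest once the window is
-- full, so the Seq never grows beyond w elements.
pushWindow :: Int -> Seq a -> a -> Seq a
pushWindow w s x = Seq.drop (Seq.length s + 1 - w) (s |> x)

-- Feed the lazily produced list through the window, one element at a
-- time, without ever materializing the whole 10^8-element sequence
-- (the first few windows are partial while the Seq fills up):
windows :: Int -> [a] -> [Seq a]
windows w = scanl (pushWindow w) Seq.empty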
