Lazy IO + Parallelism: converting an image to grayscale - haskell

I am trying to add parallelism to a program that converts a .bmp to a grayscale .bmp. I am seeing usually 2-4x worse performance for the parallel code. I am tweaking parBuffer / chunking sizes and still cannot seem to reason about it. Looking for guidance.
The entire source file used here: http://lpaste.net/106832
We use Codec.BMP to read in a stream of pixels represented by type RGBA = (Word8, Word8, Word8, Word8). To convert to grayscale, simply map a 'luma' transform across all the pixels.
The serial implementation is literally:
toGray :: [RGBA] -> [RGBA]
toGray x = map luma x
The test input .bmp is 5184 x 3456 (71.7 MB).
The serial implementation runs in ~10s, ~550ns/pixel. Threadscope looks clean:
Why is this so fast? I suppose it has something with lazy ByteString (even though Codec.BMP uses strict ByteString--is there implicit conversion occurring here?) and fusion.
Adding Parallelism
First attempt at adding parallelism was via parList. Oh boy. The program used ~4-5GB memory and system started swapping.
I then read "Parallelizing Lazy Streams with parBuffer" section of Simon Marlow's O'Reilly book and tried parBuffer with a large size. This still did not produce desirable performance. The spark sizes were incredibly small.
I then tried to increase the spark size by chunking the lazy list and then sticking with parBuffer for the parallelism:
toGrayPar :: [RGBA] -> [RGBA]
toGrayPar x = concat $ (withStrategy (parBuffer 500 rpar) . map (map luma))
(chunk 8000 x)
chunk :: Int -> [a] -> [[a]]
chunk n [] = []
chunk n xs = as : chunk n bs where
(as,bs) = splitAt (fromIntegral n) xs
But this still does not yield desirable performance:
18,934,235,760 bytes allocated in the heap
15,274,565,976 bytes copied during GC
639,588,840 bytes maximum residency (27 sample(s))
238,163,792 bytes maximum slop
1910 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 35277 colls, 35277 par 19.62s 14.75s 0.0004s 0.0234s
Gen 1 27 colls, 26 par 13.47s 7.40s 0.2741s 0.5764s
Parallel GC work balance: 30.76% (serial 0%, perfect 100%)
TASKS: 6 (1 bound, 5 peak workers (5 total), using -N2)
SPARKS: 4480 (2240 converted, 0 overflowed, 0 dud, 2 GC'd, 2238 fizzled)
INIT time 0.00s ( 0.01s elapsed)
MUT time 14.31s ( 14.75s elapsed)
GC time 33.09s ( 22.15s elapsed)
EXIT time 0.01s ( 0.12s elapsed)
Total time 47.41s ( 37.02s elapsed)
Alloc rate 1,323,504,434 bytes per MUT second
Productivity 30.2% of total user, 38.7% of total elapsed
gc_alloc_block_sync: 7433188
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 1017408
How can I better reason about what is going on here?

You have a big list of RGBA pixels. Why don't you use parListChunk with a reasonable chunk size?

Related

why does this programm consume so much memory?

I needed to use an algorithm to solve a KP problem some time ago, in haskell
Here is what my code look like:
stepKP :: [Int] -> (Int, Int) -> [Int]
stepKP l (p, v) = take p l ++ zipWith bestOption l (drop p l)
where bestOption a = max (a+v)
kp :: [(Int, Int)] -> Int -> Int
kp l pMax = last $ foldl stepKP [0 | i <- [0..pMax]] l
main = print $ kp (zip weights values) 20000
where weights = [0..2000]
values = reverse [8000..10000]
But when I try to execute it (after compilation with ghc, no flags), it seems pretty bad:
here is the result of the command ./kp -RTS -s
1980100
9,461,474,416 bytes allocated in the heap
6,103,730,184 bytes copied during GC
1,190,494,880 bytes maximum residency (18 sample(s))
5,098,848 bytes maximum slop
2624 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 6473 colls, 0 par 2.173s 2.176s 0.0003s 0.0010s
Gen 1 18 colls, 0 par 4.185s 4.188s 0.2327s 1.4993s
INIT time 0.000s ( 0.000s elapsed)
MUT time 3.320s ( 3.322s elapsed)
GC time 6.358s ( 6.365s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 9.679s ( 9.687s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 2,849,443,762 bytes per MUT second
Productivity 34.3% of total user, 34.3% of total elapsed
I thinks that my programm takes O(n*w) memory, while it could do it in O(w).
(w is the total capacity)
Is that a problem of lazy evaluation taking too much space, or something else ?
How could this code be more memory and time efficient ?
We can think of a left fold as performing iterations while keeping an accumulator that is returned at the end.
When there are lots of iterations, one concern is that the accumulator might grow too large in memory. And because Haskell is lazy, this can happen even when the accumulator is of a primitive type like Int: behind some seemingly innocent Int value a large number of pending operations might lurk, in the form of thunks.
Here the strict left fold function foldl' is useful because it ensures that, as the left fold is being evaluated, the accumulator will always be kept in weak head normal form (WHNF).
Alas, sometimes this isn't enough. WHNF only says that evaluation has progressed up to the "outermost constructor" of the value. This is enough for Int, but for recursive types like lists or trees, that isn't saying much: the thunks might simply lurk further down the list, or in branches below.
This is the case here, where the accumulator is a list that is recreated at each iteration. Each iteration, the foldl' only evaluates the list up to _ : _. Unevaluated max and zipWith operations start to pile up.
What we need is a way to trigger a full evaluation of the accumulator list at each iteration, one which cleans any max and zipWith thunks from memory. And this is what force accomplishes. When force $ something is evaluated to WHNF, something is fully evaluated to normal form, that is, not only up to the outermost constructor but "deeply".
Notice that we still need the foldl' in order to "trigger" the force at each iteration.

Memory footprint of splitOn?

I wrote a file indexing program that should read thousands of text file lines as records and finally group those records by fingerprint. It uses Data.List.Split.splitOn to split the lines at tabs and retrieve the record fields. The program consumes 10-20 GB of memory.
Probably there is not much I can do to reduce that huge memory footprint, but I cannot explain why a function like splitOn (breakDelim) can consume that much memory:
Mon Dec 9 21:07 2019 Time and Allocation Profiling Report (Final)
group +RTS -p -RTS file1 file2 -o 2 -h
total time = 7.40 secs (7399 ticks # 1000 us, 1 processor)
total alloc = 14,324,828,696 bytes (excludes profiling overheads)
COST CENTRE MODULE SRC %time %alloc
fileToPairs.linesIncludingEmptyLines ImageFileRecordParser ImageFileRecordParser.hs:35:7-47 25.0 33.8
breakDelim Data.List.Split.Internals src/Data/List/Split/Internals.hs:(151,1)-(156,36) 24.9 39.3
sortAndGroup Aggregations Aggregations.hs:6:1-85 12.9 1.7
fileToPairs ImageFileRecordParser ImageFileRecordParser.hs:(33,1)-(42,14) 8.2 10.7
matchDelim Data.List.Split.Internals src/Data/List/Split/Internals.hs:(73,1)-(77,23) 7.4 0.4
onSublist Data.List.Split.Internals src/Data/List/Split/Internals.hs:278:1-72 3.6 0.0
toHashesView ImageFileRecordStatistics ImageFileRecordStatistics.hs:(48,1)-(51,24) 3.0 6.3
main Main group.hs:(47,1)-(89,54) 2.9 0.4
numberOfUnique ImageFileRecord ImageFileRecord.hs:37:1-40 1.6 0.1
toHashesView.sortedLines ImageFileRecordStatistics ImageFileRecordStatistics.hs:50:7-30 1.4 0.1
imageFileRecordFromFields ImageFileRecordParser ImageFileRecordParser.hs:(11,1)-(30,5) 1.1 0.3
toHashView ImageFileRecord ImageFileRecord.hs:(67,1)-(69,23) 0.7 1.7
Or is type [Char] too memory inefficient (compared to Text), causing splitOn to take that much memory?
UPDATE 1 (+RTS -s suggestion of user HTNW)
23,446,268,504 bytes allocated in the heap
10,753,363,408 bytes copied during GC
1,456,588,656 bytes maximum residency (22 sample(s))
29,282,936 bytes maximum slop
3620 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 45646 colls, 0 par 4.055s 4.059s 0.0001s 0.0013s
Gen 1 22 colls, 0 par 4.034s 4.035s 0.1834s 1.1491s
INIT time 0.000s ( 0.000s elapsed)
MUT time 7.477s ( 7.475s elapsed)
GC time 8.089s ( 8.094s elapsed)
RP time 0.000s ( 0.000s elapsed)
PROF time 0.000s ( 0.000s elapsed)
EXIT time 0.114s ( 0.114s elapsed)
Total time 15.687s ( 15.683s elapsed)
%GC time 51.6% (51.6% elapsed)
Alloc rate 3,135,625,407 bytes per MUT second
Productivity 48.4% of total user, 48.4% of total elapsed
The processed text files are smaller than usual (UTF-8 encoded, 37 MB). But still 3 GB of memory are used.
UPDATE 2 (critical part of the code)
Explanation: fileToPairs processes a text file. It returns a list of key-value pairs (key: fingerprint of record, value: record).
sortAndGroup associations = Map.fromListWith (++) [(k, [v]) | (k, v) <- associations]
main = do
CommandLineArguments{..} <- cmdArgs $ CommandLineArguments {
ignored_paths_file = def &= typFile,
files = def &= typ "FILES" &= args,
number_of_occurrences = def &= name "o",
minimum_number_of_occurrences = def &= name "l",
maximum_number_of_occurrences = def &= name "u",
number_of_hashes = def &= name "n",
having_record_errors = def &= name "e",
hashes = def
}
&= summary "Group image/video files"
&= program "group"
let ignoredPathsFilenameMaybe = ignored_paths_file
let filenames = files
let hashesMaybe = hashes
ignoredPaths <- case ignoredPathsFilenameMaybe of
Just ignoredPathsFilename -> ioToLines (readFile ignoredPathsFilename)
_ -> return []
recordPairs <- mapM (fileToPairs ignoredPaths) filenames
let allRecordPairs = concat recordPairs
let groupMap = sortAndGroup allRecordPairs
let statisticsPairs = map toPair (Map.toList groupMap) where toPair item = (fst item, imageFileRecordStatisticsFromRecords . snd $ item)
let filterArguments = FilterArguments {
numberOfOccurrencesMaybe = number_of_occurrences,
minimumNumberOfOccurrencesMaybe = minimum_number_of_occurrences,
maximumNumberOfOccurrencesMaybe = maximum_number_of_occurrences,
numberOfHashesMaybe = number_of_hashes,
havingRecordErrorsMaybe = having_record_errors
}
let filteredPairs = filterImageRecords filterArguments statisticsPairs
let filteredMap = Map.fromList filteredPairs
case hashesMaybe of
Just True -> mapM_ putStrLn (map toHashesView (map snd filteredPairs))
_ -> Char8.putStrLn (encodePretty filteredMap)
As I'm sure you're aware, there's not really enough information here for us to help you make your program more efficient. It might be worth posting some (complete, self-contained) code on the Code Review site for that.
However, I think I can answer your specific question about why splitOn allocates so much memory. In fact, there's nothing particularly special about splitOn or how it's been implemented. Many straightforward Haskell functions will allocate lots of memory, and this in itself doesn't indicate that they've been poorly written or are running inefficiently. In particular, splitOn's memory usage seems similar to other straightforward approaches to splitting a string based on delimiters.
The first thing to understand is that GHC compiled code works differently than other compiled code you're likely to have seen. If you know a lot of C and understand stack frames and heap allocation, or if you've studied some JVM implementations, you might reasonably expect that some of that understanding would translate to GHC executables, but you'd be mostly wrong.
A GHC program is more or less an engine for allocating heap objects, and -- with a few exceptions -- that's all it really does. Nearly every argument passed to a function or constructor (as well as the constructor application itself) allocates a heap object of at least 16 bytes, and often more. Take a simple function like:
fact :: Int -> Int
fact 0 = 1
fact n = n * fact (n-1)
With optimization turned off, it compiles to the following so-called "STG" form (simplified from the actual -O0 -ddump-stg output):
fact = \n -> case n of I# n' -> case n' of
0# -> I# 1#
_ -> let sat1 = let sat2 = let one = I#! 1# in n-one
in fact sat2;
in n*sat1
Everywhere you see a let, that's a heap allocation (16+ bytes), and there are presumably more hidden in the (-) and (*) calls. Compiling and running this program with:
main = print $ fact 1000000
gives:
113,343,544 bytes allocated in the heap
44,309,000 bytes copied during GC
25,059,648 bytes maximum residency (5 sample(s))
29,152 bytes maximum slop
23 MB total memory in use (0 MB lost due to fragmentation)
meaning that each iteration allocates over a hundred bytes on the heap, though it's literally just performing a comparison, a subtraction, a multiplication, and a recursive call,
This is what #HTNW meant in saying that total allocation in a GHC program is a measure of "work". A GHC program that isn't allocating probably isn't doing anything (again, with some rare exceptions), and a typical GHC program that is doing something will usually allocate at a relatively constant rate of several gigabytes per second when it's not garbage collecting. So, total allocation has more to do with total runtime than anything else, and it isn't a particularly good metric for assessing code efficiency. Maximum residency is also a poor measure of overall efficiency, though it can be helpful for assessing whether or not you have a space leak, if you find that it tends to grow linearly (or worse) with the size of the input where you expect the program should run in constant memory regardless of input size.
For most programs, the most important true efficiency metric in the +RTS -s output is probably the "productivity" rate at the bottom -- it's the amount of time the program spends not garbage collecting. And, admittedly, your program's productivity of 48% is pretty bad, which probably means that it is, technically speaking, allocating too much memory, but it's probably only allocating two or three times the amount it should be, so, at a guess, maybe it should "only" be allocating around 7-8 Gigs instead of 23 Gigs for this workload (and, consequently, running for about 5 seconds instead of 15 seconds).
With that in mind, if you consider the following simple breakDelim implementation:
breakDelim :: String -> [String]
breakDelim str = case break (=='\t') str of
(a,_:b) -> a : breakDelim b
(a,[]) -> [a]
and use it like so in a simple tab-to-comma delimited file converter:
main = interact (unlines . map (intercalate "," . breakDelim) . lines)
Then, unoptimized and run on a file with 10000 lines of 1000 3-character fields each, it allocates a whopping 17 Gigs:
17,227,289,776 bytes allocated in the heap
2,807,297,584 bytes copied during GC
127,416 bytes maximum residency (2391 sample(s))
32,608 bytes maximum slop
0 MB total memory in use (0 MB lost due to fragmentation)
and profiling it places a lot of blame on breakDelim:
COST CENTRE MODULE SRC %time %alloc
main Main Delim.hs:8:1-71 57.7 72.6
breakDelim Main Delim.hs:(4,1)-(6,16) 42.3 27.4
In this case, compiling with -O2 doesn't make much difference. The key efficiency metric, productivity, is only 46%. All these results seem to be in line with what you're seeing in your program.
The split package has a lot going for it, but looking through the code, it's pretty clear that little effort has been made to make it particularly efficient or fast, so it's no surprise that splitOn performs no better than my quick-and-dirty custom breakDelim function. And, as I said before, there's nothing special about splitOn that makes it unusually memory hungry -- my simple breakDelim has similar behavior.
With respect to inefficiencies of the String type, it can often be problematic. But, it can also participate in optimizations like list fusion in ways that Text can't. The utility above could be rewritten in simpler form as:
main = interact $ map (\c -> if c == '\t' then ',' else c)
which uses String but runs pretty fast (about a quarter as fast as a naive C getchar/putchar implementation) at 84% productivity, while allocating about 5 Gigs on the heap.
It's quite likely that if you just take your program and "convert it to Text", you'll find it's slower and more memory hungry than the original! While Text has the potential to be much more efficient than String, it's a complicated package, and the way in which Text objects behave with respect to allocation when they're sliced and diced (as when you're chopping a big Text file up into little Text fields) makes it more difficult to get right.
So, some take-home lessons:
Total allocation is a poor measure of efficiency. Most well written GHC programs can and should allocate several gigabytes per second of runtime.
Many innocuous Haskell functions will allocate lots of memory because of the way GHC compiled code works. This isn't necessarily a sign that there's something wrong with the function.
The split package provides a flexible framework for all manner of cool list splitting manipulations, but it was not designed with speed in mind, and it may not be the best method of processing a tab-delimited file.
The String data type has a potential for terrible inefficiency, but isn't always inefficient, and Text is a complicated package that won't be a plug-in replacement to fix your String performance woes.
Most importantly:
Unless your program is too slow for its intended purpose, its run-time statistics and the theoretical advantages of Text over String are largely irrelevant.

Why to avoid Explicit recursion in Haskell?

I am new to Haskell.
While studying about foldr many are suggesting to use it and avoid explicit recursion which can lead to Memory Inefficient code.
https://www.reddit.com/r/haskell/comments/1nb80j/proper_use_of_recursion_in_haskell/
As I was running the sample mentioned in the above link. I can see the explicit recursion is doing better in terms of memory. First I thought May be running on GHCi is not near to perfect benchmark and I tried compiling it using stack ghc. And btw How can I pass Compiler Optimization flags via stack ghc. What am I missing from the Expression Avoid Explicit Recursion.
find p = foldr go Nothing
where go x rest = if p x then Just x else rest
findRec :: (a -> Bool) -> [a] -> Maybe a
findRec _ [] = Nothing
findRec p (x:xs) = if p x then Just x else (findRec p xs)
main :: IO ()
main = print $ find (\x -> x `mod` 2 == 0) [1, 3..1000000]
main = print $ findRec (\x -> x `mod` 2 == 0) [1, 3..1000000]
-- find
Nothing
92,081,224 bytes allocated in the heap
9,392 bytes copied during GC
58,848 bytes maximum residency (2 sample(s))
26,704 bytes maximum slop
0 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 87 colls, 0 par 0.000s 0.000s 0.0000s 0.0001s
Gen 1 2 colls, 0 par 0.000s 0.001s 0.0004s 0.0008s
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.031s ( 0.043s elapsed)
GC time 0.000s ( 0.001s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 0.031s ( 0.044s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 2,946,599,168 bytes per MUT second
Productivity 100.0% of total user, 96.8% of total elapsed
-- findRec
Nothing
76,048,432 bytes allocated in the heap
13,768 bytes copied during GC
42,928 bytes maximum residency (2 sample(s))
26,704 bytes maximum slop
0 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 71 colls, 0 par 0.000s 0.000s 0.0000s 0.0001s
Gen 1 2 colls, 0 par 0.000s 0.001s 0.0004s 0.0007s
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.031s ( 0.038s elapsed)
GC time 0.000s ( 0.001s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 0.031s ( 0.039s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 2,433,549,824 bytes per MUT second
Productivity 100.0% of total user, 96.6% of total elapsed
You are measuring how quickly GHC can do half a million modulus operations. As you might expect, "in the blink of an eye" is the answer regardless of how you iterate. There is no obvious difference in speed.
You claim that you can see that explicit recursion is using less memory, but the heap profiling data you provide shows the opposite: more allocation and higher max residency when using explicit recursion. I don't think the difference is significant, but if it were then your evidence would be contradicting your claim.
As to the question of why to avoid explicit recursion, it's not really clear what part of that thread you read that made you come to your conclusion. You linked to a giant thread which itself links to another giant thread, with many competing opinions. The comment that stands out the most to me is it's not about efficiency, it's about levels of abstraction. You are looking at this the wrong way by trying to measure its performance.
First, don't try to understand the performance of GHC-compiled code using anything other than optimized compilation:
$ stack ghc -- -O2 Find.hs
$ ./Find +RTS -s
With the -O2 flag (and GHC version 8.6.4), your find performs as follows:
16,051,544 bytes allocated in the heap
14,184 bytes copied during GC
44,576 bytes maximum residency (2 sample(s))
29,152 bytes maximum slop
0 MB total memory in use (0 MB lost due to fragmentation)
However, this is very misleading. None of this memory usage is due to the looping performed by foldr. Rather it's all due to the use of boxed Integers. If you switch to using plain Ints which the compiler can unbox:
main = print $ find (\x -> x `mod` 2 == 0) [1::Int, 3..1000000]
^^^^^
the memory performance changes drastically and demonstrates the true memory cost of foldr:
51,544 bytes allocated in the heap
3,480 bytes copied during GC
44,576 bytes maximum residency (1 sample(s))
25,056 bytes maximum slop
0 MB total memory in use (0 MB lost due to fragmentation)
If you test findRec with Ints like so:
main = print $ findRec (\x -> x `mod` 2 == 0) [1::Int, 3..1000000]
you'll see much worse memory performance:
40,051,528 bytes allocated in the heap
14,992 bytes copied during GC
44,576 bytes maximum residency (2 sample(s))
29,152 bytes maximum slop
0 MB total memory in use (0 MB lost due to fragmentation)
which seems to make a compelling case that recursion should be avoided in preference to foldr, but this, too, is very misleading. What you are seeing here is not the memory cost of recursion, but rather the memory cost of "list building".
See, foldr and the expression [1::Int, 3..1000000] both include some magic called "list fusion". This means that when they are used together (i.e., when foldr is applied to [1::Int 3..1000000]), an optimization can be performed to completely eliminate the creation of a Haskell list. Critically, the foldr code, even using list fusion, compiles to recursive code which looks like this:
main_go
= \ x ->
case gtInteger# x lim of {
__DEFAULT ->
case eqInteger# (modInteger x lvl) lvl1 of {
__DEFAULT -> main_go (plusInteger x lvl);
-- ^^^^^^^ - SEE? IT'S JUST RECURSION
1# -> Just x
};
1# -> Nothing
}
end Rec }
So, it's list fusion, rather than "avoiding recursion" that makes find faster than findRec.
You can see this is true by considering the performance of:
find1 :: Int -> Maybe Int
find1 n | n >= 1000000 = Nothing
| n `mod` 2 == 0 = Just n
| otherwise = find1 (n+2)
main :: IO ()
main = print $ find1 1
Even though this uses recursion, it doesn't generate a list (or use boxed Integers), so it runs just like the foldr version:
51,544 bytes allocated in the heap
3,480 bytes copied during GC
44,576 bytes maximum residency (1 sample(s))
25,056 bytes maximum slop
0 MB total memory in use (0 MB lost due to fragmentation)
So, what are the take home lessons?
Always benchmark Haskell code using ghc -O2, never GHCi or ghc without optimization flags.
Less than 10% of people in any Reddit thread know what they're talking about.
foldr can sometimes perform better than explicit recursion when special optimizations like list fusion can apply.
But in the general case, explicit recursion performs just as well as foldr or other specialized constructs.
Also, optimizing Haskell code is hard.
Actually, here's a better (more serious) take-home lesson. Especially when you're getting started with Haskell, make every possible effort to avoid thinking about "optimizing" your code. Far more than any other language I know, there is an enormous gulf between the code you write and the code the compiler generates, so don't even try to figure it out right now. Instead, write code that is clear, straightforward, and idiomatic. If you try to learn the "rules" for high-performance code now, you'll get them all wrong and learn really bad programming style into the bargain.

Performance of Integer type with large numbers

I'm trying to solve AdventOfCode 2018 day 14. The task is roughly to create a number with a lot of digits by iteratively appending one or two digits based on two already existing digits. With Haskell I thought Integer might be a good fit for representing the huge number. I think my program is correct, at least it seems to work for the samples AoC provides. However I noticed that the performance of the program drops drastically when the number contains more than 10^4 digits (recipeCount in the appended program). I observed the following execution times when increasing the number up to the following number of digits:
10000 digits: 0.314s
20000 digits: 1.596s
30000 digits: 4.306s
40000 digits: 8.954s
Looks like O(n^2) or worse, doesn't it?
Why is that? The program only does basic calculations as far as I can tell.
import Data.Bool (bool)
main :: IO ()
main = print solve
recipeCount :: Int
recipeCount = 10000
solve :: Integer
solve = loop 0 1 37 2
where
loop recipeA recipeB recipes recipesLength
| recipesLength >= recipeCount + 10 = recipes `rem` (10 ^ 10)
| otherwise =
let recipeAScore = digitAt (recipesLength - 1 - recipeA) recipes
recipeBScore = digitAt (recipesLength - 1 - recipeB) recipes
recipeSum = fromIntegral $ recipeAScore + recipeBScore
recipeSumDigitCount = bool 2 1 $ recipeSum < 10
recipes' = recipes * (10 ^ recipeSumDigitCount) + recipeSum
recipesLength' = recipesLength + recipeSumDigitCount
recipeA' = (recipeA + recipeAScore + 1) `rem` recipesLength'
recipeB' = (recipeB + recipeBScore + 1) `rem` recipesLength'
in loop recipeA' recipeB' recipes' recipesLength'
digitAt :: Int -> Integer -> Int
digitAt i number = fromIntegral $ number `quot` (10 ^ i) `rem` 10
P.S.: Because I'm very new to Haskell I also kindly appreciate feedback on the program itself (style, algorithm, etc.).
EDIT:
I found options to profile both versions of my program.
Both versions are compiled with ghc -O2 -rtsopts ./Program.hs and run with ./Program +RTS -sstderr.
The first version with integers produces the following output when generating 50,000 recipes:
2,435,108,280 bytes allocated in the heap
886,656 bytes copied during GC
44,672 bytes maximum residency (2 sample(s))
29,056 bytes maximum slop
0 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 1925 colls, 0 par 0.018s 0.017s 0.0000s 0.0001s
Gen 1 2 colls, 0 par 0.000s 0.000s 0.0001s 0.0001s
INIT time 0.000s ( 0.000s elapsed)
MUT time 15.208s ( 15.225s elapsed)
GC time 0.018s ( 0.017s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 15.227s ( 15.242s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 160,115,875 bytes per MUT second
Productivity 99.9% of total user, 99.9% of total elapsed
The second version with mutable arrays produces the following output when generating ~500,000 recipes:
93,437,744 bytes allocated in the heap
16,120 bytes copied during GC
538,408 bytes maximum residency (2 sample(s))
29,056 bytes maximum slop
0 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 88 colls, 0 par 0.000s 0.000s 0.0000s 0.0000s
Gen 1 2 colls, 0 par 0.000s 0.000s 0.0001s 0.0001s
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.021s ( 0.020s elapsed)
GC time 0.000s ( 0.000s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 0.021s ( 0.021s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 4,552,375,284 bytes per MUT second
Productivity 97.0% of total user, 97.2% of total elapsed
I think using Integer for your recipes list in the first place is a big red flag. Integers store numbers, but your problem does not call for a number. It calls for a list of digits. An Integer, whose first priority is to be a number, is basically "compressed": it's in binary, not decimal, and trying to extract a decimal digit from it means you have to do funky, nontrivial math, as others have said. Also, purity works against you, because each time you add new digits to your list, you end up copying the whole list. With problem sizes on the order of 100,000-1,000,000 digits (I was given a problem input of about 800,000), that's copying Integers on the order of log_(2^8)(10^(10^5)) = ~41000 bytes in size each time. This part also seems quadratic.
I would recommend "decompressing" your list of digits. You can represent a single digit by 1 byte (which does waste a lot of space!)
import Data.Word
type Digit = Word8
addDigit :: Digit -> Digit -> (Digit, Digit)
addDigit = _yourJob
You can implement the meat of the logic as a function using arrays. Yes, Haskell does have arrays, in the sense of contiguous hunks of memory with practically O(1) indexing. It's just that we like to find "more functional" ways to phrase a problem than with arrays. But, they're always there if you need them.
import Data.Array.Unboxed -- from the array package, which is a core library
makeRecipes ::
-- | Elf 1's starting score
Digit ->
-- | Elf 2's starting score
Digit ->
-- | Number of recipes to make
Int ->
-- | Scores of the recipes made, indices running from 0 upwards
UArray Int Digit
The cool thing about arrays is that you can mutate them inside the ST monad, while getting a pure result. Thus, this array does not suffer any copying, and the math involved for indexing it is minimal.
import Control.Monad.ST
import Data.Array.ST
makeRecipes elf1 elf2 need = runSTUArray $ do
arr <- newArray_ (0, need)
writeArray arr 0 elf1
writeArray arr 1 elf2
loop arr 0 1 2
return arr
where
loop :: STUArray s Int Digit -> Int -> Int -> Int -> ST s ()
loop arr loc1 loc2 done = _yourJob
loop is given the array, which is partially filled with done recipe scores, and the locations of the two elves, loc1, loc2 < done. It should calculate the new recipes' scores with addDigit and readArray and add them to the array at the correct location with writeArray. If the array is full, it should terminate (it doesn't return anything useful). Otherwise, it should go on to figure out the new locations of the elves, and then recurse.
You can then write a little adapter on top of makeRecipes to actually extract the last ten recipes, supply the correct inputs, etc. When I filled in all the blanks in the program, I got a runtime of .07s on my input (about 800,000) with -O2, and about 0.8s with -O0. It seems to take O(n) time in the input.

Efficient bit-fiddling in a LFSR implementation

Although I have a good LSFR C implementation I thought I'd try the same in Haskell - just to see how it goes. What I came up with, so far, is two orders of magnitude slower than the C implementation, which begs the question: How can the performance be improved? Clearly, the bit-fiddling operations are the bottleneck, and the profiler confirms this.
Here's the baseline Haskell code using lists and Data.Bits:
import Control.Monad (when)
import Data.Bits (Bits, shift, testBit, xor, (.&.), (.|.))
import System.Environment (getArgs)
import System.Exit (exitFailure, exitSuccess)
tap :: [[Int]]
tap = [
[], [], [], [3, 2],
[4, 3], [5, 3], [6, 5], [7, 6],
[8, 6, 5, 4], [9, 5], [10, 7], [11, 9],
[12, 6, 4, 1], [13, 4, 3, 1], [14, 5, 3, 1], [15, 14],
[16,15,13,4], [17, 14], [18, 11], [19, 6, 2, 1],
[20, 17], [21, 19], [22, 21], [23, 18],
[24,23,22,17], [25, 22], [26, 6, 2, 1], [27, 5, 2, 1],
[28, 25], [29, 27], [30, 6, 4, 1], [31, 28],
[32,22,2,1], [33,20], [34,27,2,1], [35,33],
[36,25], [37,5,4,3,2,1],[38,6,5,1], [39,35],
[40,38,21,19], [41,38], [42,41,20,19], [43,42,38,37],
[44,43,18,17], [45,44,42,41], [46,45,26,25], [47,42],
[48,47,21,20], [49,40], [50,49,24,23], [51,50,36,35],
[52,49], [53,52,38,37], [54,53,18,17], [55,31],
[56,55,35,34], [57,50], [58,39], [59,58,38,37],
[60,59], [61,60,46,45], [62,61,6,5], [63,62] ]
xor' :: [Bool] -> Bool
xor' = foldr xor False
mask :: (Num a, Bits a) => Int -> a
mask len = shift 1 len - 1
advance :: Int -> [Int] -> Int -> Int
advance len tap lfsr
| d0 = shifted
| otherwise = shifted .|. 1
where
shifted = shift lfsr 1 .&. mask len
d0 = xor' $ map (testBit lfsr) tap'
tap' = map (subtract 1) tap
main :: IO ()
main = do
args <- getArgs
when (null args) $ fail "Usage: lsfr <number-of-bits>"
let len = read $ head args
when (len < 8) $ fail "No need for LFSR"
let out = last $ take (shift 1 len) $ iterate (advance len (tap!!len)) 0
if out == 0 then do
putStr "OK\n"
exitSuccess
else do
putStr "FAIL\n"
exitFailure
Basically it tests whether the LSFR defined in tap :: [[Int]] for any given bit-length is of maximum-length. (More precisely, it just checks whether the LSFR reaches the initial state (zero) after 2n iterations.)
According to the profiler the most costly line is the feedback bit d0 = xor' $ map (testBit lfsr) tap'.
What I've tried so far:
use Data.Array: Attempt abandoned because there's no foldl/r
use Data.Vector: Slightly faster than the baseline
The compiler options I use are: -O2, LTS Haskell 8.12 (GHC-8.0.2).
The reference C++ program can be found on gist.github.com.
The Haskell code can't be expected (?) to run as fast as the C code, but two orders of magnitude is too much, there must be a better way to do the bit-fiddling.
Update: Results of applying the optimisations suggested in the answers
The reference C++ program with input 28, compiled with LLVM 8.0.0, runs in 0.67s on my machine (the same with clang 3.7 is marginally slower, 0.68s)
The baseline Haskell code runs about 100x slower (because of the space inefficiency don't try it with inputs larger than 25)
With the rewrite of #Thomas M. DuBuisson, still using the default GHC backend, the execution time goes down to 5.2s
With the rewrite of #Thomas M. DuBuisson, now using the LLVM backend (GHC option -O2 -fllvm), the execution time goes down to 1.7s
Using GHC option -O2 -fllvm -optlc -mcpu=native brings this to 0.73s
Replacing iterate with iterate' of #cirdec makes no difference when Thomas' code is used (both with the default 'native' backend and LLVM). However, it does make a difference when the baseline code is used.
So, we've come from 100x to 8x to 1.09x, i.e. only 9% slower than C!
Note
The LLVM backend to GHC 8.0.2 requires LLVM 3.7. On Mac OS X this means installing this version with brew and then symlinking opt and llc. See 7.10. GHC Backends.
Up Front Matters
For starters, I'm using GHC 8.0.1 on an Intel I5 ~2.5GHz, linux x86-64.
First Draft: Oh No! The slows!
Your starting code with parameter 25 runs:
% ghc -O2 orig.hs && time ./orig 25
[1 of 1] Compiling Main ( orig.hs, orig.o )
Linking orig ...
OK
./orig 25 7.25s user 0.50s system 99% cpu 7.748 total
So the time to beat is 77ms - two orders of magnitude better than this Haskell code. Lets dive in.
Issue 1: Shifty Code
I found a couple of oddities with the code. First was the use of shift in high performance code. Shift supports both left and right shift and to do so it requires a branch. Lets kill that with more readable powers of two and such (shift 1 x ~> 2^x and shift x 1 ~> 2*x):
% ghc -O2 noShift.hs && time ./noShift 25
[1 of 1] Compiling Main ( noShift.hs, noShift.o )
Linking noShift ...
OK
./noShift 25 0.64s user 0.00s system 99% cpu 0.637 total
(As you noted in the comments: Yes, this bears investigation. It might be that some oddity of the prior code was preventing a rewrite rule from firing and, as a result, much worse code resulted)
Issue 2: Lists Of Bits? Int operations save the day!
One change, one order of magnitude. Yay. What else? Well you have this awkward list of bit locations you're tapping that just seems like its begging for inefficiency and/or leans on fragile optimizations. At this point I'll note that hard-coding any one selection from that list results in really good performance (such as testBit lsfr 24 `xor` testBit lsfr 21) but we want a more general fast solution.
I propose we compute the mask of all the tap locations then do a one-instruction pop count. To do this we only need a single Int passed in to advance instead of a whole list. The popcount instruction requires good assembly generation which requires llvm and probably -optlc-mcpu=native or another instruction set selection that is non-pessimistic.
This step gives us pc below. I've folded in the guard-removal of advance that was mentioned in the comments:
let tp = sum $ map ((2^) . subtract 1) (tap !! len)
pc lfsr = fromEnum (even (popCount (lfsr .&. tp)))
mask = 2^len - 1
advance' :: Int -> Int
advance' lfsr = (2*lfsr .&. mask) .|. pc lfsr
out :: Int
out = last $ take (2^len) $ iterate advance' 0
Our resulting performance is:
% ghc -O2 so.hs -fforce-recomp -fllvm -optlc-mcpu=native && time ./so 25
[1 of 1] Compiling Main ( so.hs, so.o )
Linking so ...
OK
./so 25 0.06s user 0.00s system 96% cpu 0.067 total
That's over two orders of magnitude from start to finish, so hopefully it matches your C. Finally, in deployed code it is actually really common to have Haskell packages with C bindings but this is often an educational exercise so I hope you had fun.
Edit: The now-available C++ code takes my system 0.10 (g++ -O3) and 0.12 (clang++ -O3 -march=native) seconds, so it seems we've beat our mark by a fair bit.
I suspect that the following line is building a large list-like thunk in memory before evaluating it.
let out = last $ take (shift 1 len) $ iterate (advance len (tap!!len)) 0` is
Let's find out if I'm right, and if I am, we'll fix it. The first debugging step is to get an idea of the memory used by the program. To do this we're going to compile with the options -rtsopts in addition to -O2. This enables running the program with RTS options, inclusing +RTS -s which outputs a small memory summary.
Initial Performance
Running your program as lfsr 25 +RTS -s I get the following output
OK
5,420,148,768 bytes allocated in the heap
6,705,977,216 bytes copied during GC
1,567,511,384 bytes maximum residency (20 sample(s))
357,862,432 bytes maximum slop
3025 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 10343 colls, 0 par 2.453s 2.522s 0.0002s 0.0009s
Gen 1 20 colls, 0 par 2.281s 3.065s 0.1533s 0.7128s
INIT time 0.000s ( 0.000s elapsed)
MUT time 1.438s ( 1.162s elapsed)
GC time 4.734s ( 5.587s elapsed)
EXIT time 0.016s ( 0.218s elapsed)
Total time 6.188s ( 6.967s elapsed)
%GC time 76.5% (80.2% elapsed)
Alloc rate 3,770,538,273 bytes per MUT second
Productivity 23.5% of total user, 19.8% of total elapsed
That's a lot of memory used at once. It's very likely there's a big thunk building up in there somewhere.
Trying to reduce the thunk size
I hypothesized that the thunk is being built in iterate (advance ...). If this is the case, we can try to reduce the thunk size by making advance more strict in its lsfr argument. This won't remove the spine of the thunk (the successive iterations), but it might reduce the size of the state that's built up as the spine's evaluated.
BangPatterns is an easy way to make a function strict in an argument. f !x = .. is shorthand for f x = seq x $ ...
{-# LANGUAGE BangPatterns #-}
advance :: Int -> [Int] -> Int -> Int
advance len tap = go
where
go !lfsr
| d0 = shifted
| otherwise = shifted .|. 1
where
shifted = shift lfsr 1 .&. mask len
d0 = xor' $ map (testBit lfsr) tap'
tap' = map (subtract 1) tap
Let's see what difference this makes ...
>lfsr 25 +RTS -s
OK
5,420,149,072 bytes allocated in the heap
6,705,979,368 bytes copied during GC
1,567,511,448 bytes maximum residency (20 sample(s))
357,862,448 bytes maximum slop
3025 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 10343 colls, 0 par 2.688s 2.711s 0.0003s 0.0059s
Gen 1 20 colls, 0 par 2.438s 3.252s 0.1626s 0.8013s
INIT time 0.000s ( 0.000s elapsed)
MUT time 1.328s ( 1.146s elapsed)
GC time 5.125s ( 5.963s elapsed)
EXIT time 0.000s ( 0.226s elapsed)
Total time 6.484s ( 7.335s elapsed)
%GC time 79.0% (81.3% elapsed)
Alloc rate 4,081,053,418 bytes per MUT second
Productivity 21.0% of total user, 18.7% of total elapsed
None that's noticeable.
Eliminating the Spine
I guess it's the spine of that iterate (advance ...) that's being built. After all, for the command I'm running the list would be 2^25, or a little over 33 million items long. The list itself is probably being removed by list fusion, but the thunk for the last item of the list is over 33 million applications of advance ...
To solve this problem we need a strict version of iterate so that the value is forced to an Int before applying the advance function again. This should keep the memory down to only a single lfsr value at a time, along with the currently computed application of advance.
Unfortunately, there isn't a strict iterate in Data.List. Here's one that doesn't give up on the list fusion that's providing other important (I think) performance optimizations to this problem.
{-# LANGUAGE BangPatterns #-}
import GHC.Base (build)
{-# NOINLINE [1] iterate' #-}
iterate' :: (a -> a) -> a -> [a]
iterate' f = go
where go !x = x : go (f x)
{-# NOINLINE [0] iterateFB' #-}
iterateFB' :: (a -> b -> b) -> (a -> a) -> a -> b
iterateFB' c f = go
where go !x = x `c` go (f x)
{-# RULES
"iterate'" [~1] forall f x. iterate' f x = build (\c _n -> iterateFB' c f x)
"iterateFB'" [1] iterateFB' (:) = iterate'
#-}
This is just iterate from GHC.List (along with all its rewrite rules), but made strict in the accumulated argument.
Equipped with a strict iterate, iterate', we can change the troublesome line to
let out = last $ take (shift 1 len) $ iterate' (advance len (tap!!len)) 0
I expect that this will perform much better. Let's see ...
>lfsr 25 +RTS -s
OK
3,758,156,184 bytes allocated in the heap
297,976 bytes copied during GC
43,800 bytes maximum residency (1 sample(s))
21,736 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 7281 colls, 0 par 0.047s 0.008s 0.0000s 0.0000s
Gen 1 1 colls, 0 par 0.000s 0.000s 0.0002s 0.0002s
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.750s ( 0.783s elapsed)
GC time 0.047s ( 0.008s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 0.797s ( 0.792s elapsed)
%GC time 5.9% (1.0% elapsed)
Alloc rate 5,010,874,912 bytes per MUT second
Productivity 94.1% of total user, 99.0% of total elapsed
This used 0.00002 times as much memory and ran 10 times as fast.
I don't know if this will improve on Thomas DeBuisson's answer that improves advance but still leaves a lazy iterate advance' in place. It would be easy to check; add the iterate' code to that answer and use iterate' in place of iterate in that answer.
Does the compiler lift tap !! len out of the loop? I suspect it does, but moving it out to guarantee this can't hurt:
let tap1 = tap !! len
let out = last $ take (shift 1 len) $ iterate (advance len tap1) 0
In the comments you say "2^len is needed exactly once", but this is wrong. You do it each time in advance. So you could try
advance len tap mask lfsr
| d0 = shifted
| otherwise = shifted .|. 1
where
shifted = shift lfsr 1 .&. mask
d0 = xor' $ map (testBit lfsr) tap'
tap' = map (subtract 1) tap
-- in main
let tap1 = tap !! len
let numIterations = 2^len
let mask = numIterations - 1
let out = iterate (advance len tap1 mask) 0 !! (numIterations - 1)
(the compiler can't optimize last $ take ... to !! in general, because they are different for finite lists, but iterate always returns an infinite one.)
You compared foldr with foldl, but foldl is almost never what you need; since xor always needs both arguments and is associative, foldl' is very likely to be the right choice (the compiler can optimize it, but if there is any real difference between foldl and foldr and not just random variation, it might have failed in this case).

Resources