Optimizing Conduit pipelines in Haskell

I'm benchmarking my program to see whether I can improve its performance. The program takes an input file and runs an algorithm to split it into multiple files.
It takes roughly 14 s to split a file into 3 parts, compiled with the -O2 flag for both library and executable:
ghc-options: -Wall -fno-warn-orphans -O2 -auto-all
It looks like it is spending approximately 60% of its time in sinkFile, and I'm wondering whether there is anything I can do to improve the following code.
-- | Given the list of output FilePaths and the share number, sink that
-- share's byte from each input list into the corresponding file.
idxSinkFile :: MonadResource m
            => [FilePath]
            -> Int
            -> Consumer [Word8] m ()
idxSinkFile outFileNames shareNumber =
  let ccm = CC.concatMap $ flip atMay shareNumber
      cbs = CC.map BS.singleton
      sf  = sinkFile (outFileNames !! shareNumber)
  in  ccm =$= cbs =$= sf
-- | Generate a sink which will take a list of bytes and write each byte
-- to its corresponding file share.
sinkMultiFiles :: MonadResource m
               => [FilePath]
               -> [Int]
               -> Sink [Word8] m ()
sinkMultiFiles outFileNames xs =
  let len = [0..length xs - 1]
  in  getZipSink $ otraverse_ (ZipSink . idxSinkFile outFileNames) len
Here is the output of GHC's profiling:
                                                individual      inherited
COST CENTRE        MODULE                  no. entries %time %alloc %time %alloc
splitFile.sink     HaskSplit.Conduit.Split 289       1   0.0    0.0  66.8   74.2
sinkMultiFiles     HaskSplit.Conduit.Split 290       1  27.4   33.2  66.8   74.2
idxSinkFile        HaskSplit.Conduit.Split 303       3   7.9   11.3  39.4   41.0
idxSinkFile.ccm    HaskSplit.Conduit.Split 319       3   3.1    3.6   3.1    3.6
idxSinkFile.cbs    HaskSplit.Conduit.Split 317       3   3.5    4.2   3.5    4.2
idxSinkFile.sf     HaskSplit.Conduit.Split 307       3  24.9   21.9  24.9   21.9
sinkMultiFiles.len HaskSplit.Conduit.Split 291       1   0.0    0.0   0.0    0.0
This shows sinkFile taking a lot of time. (In case you're wondering, I've also benchmarked the list accesses etc., and they account for ~0% of the processing time.)
While I understand that for a small program like this IO is often the bottleneck, I'd like to see whether I can improve the runtime performance of my program.
Cheers!

Following nh2's advice, I decided to pack the ByteStrings into 256-byte chunks instead of calling BS.singleton on each Word8:
cbs = CL.sequence (CL.take 256) =$= CC.map BS.pack
instead of
cbs = CC.map BS.singleton
and I'm able to reduce the running time as well as the memory usage quite significantly, as demonstrated below:
Original Run
  total time  =  194.37 secs  (194367 ticks @ 1000 us, 1 processor)
  total alloc = 102,021,859,892 bytes  (excludes profiling overheads)

New Run, with CL.take
  total time  =  35.88 secs  (35879 ticks @ 1000 us, 1 processor)
  total alloc = 21,970,152,800 bytes  (excludes profiling overheads)
That's some serious improvement! I'd like to optimize it more but that's for another question :)
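For reference, here is the revised idxSinkFile with the chunked packing in place, assembled from the pieces above (same imports as before, with CL standing for Data.Conduit.List):

idxSinkFile :: MonadResource m
            => [FilePath]
            -> Int
            -> Consumer [Word8] m ()
idxSinkFile outFileNames shareNumber =
  let ccm = CC.concatMap $ flip atMay shareNumber
      -- pack 256 bytes per ByteString instead of one BS.singleton per byte
      cbs = CL.sequence (CL.take 256) =$= CC.map BS.pack
      sf  = sinkFile (outFileNames !! shareNumber)
  in  ccm =$= cbs =$= sf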

Related

Memory footprint of splitOn?

I wrote a file indexing program that should read thousands of text file lines as records and finally group those records by fingerprint. It uses Data.List.Split.splitOn to split the lines at tabs and retrieve the record fields. The program consumes 10-20 GB of memory.
Probably there is not much I can do to reduce that huge memory footprint, but I cannot explain why a function like splitOn (breakDelim) can consume that much memory:
Mon Dec 9 21:07 2019 Time and Allocation Profiling Report (Final)

  group +RTS -p -RTS file1 file2 -o 2 -h

total time  =  7.40 secs  (7399 ticks @ 1000 us, 1 processor)
total alloc = 14,324,828,696 bytes  (excludes profiling overheads)
COST CENTRE                           MODULE                     SRC                                                %time %alloc
fileToPairs.linesIncludingEmptyLines  ImageFileRecordParser      ImageFileRecordParser.hs:35:7-47                    25.0   33.8
breakDelim                            Data.List.Split.Internals  src/Data/List/Split/Internals.hs:(151,1)-(156,36)   24.9   39.3
sortAndGroup                          Aggregations               Aggregations.hs:6:1-85                              12.9    1.7
fileToPairs                           ImageFileRecordParser      ImageFileRecordParser.hs:(33,1)-(42,14)              8.2   10.7
matchDelim                            Data.List.Split.Internals  src/Data/List/Split/Internals.hs:(73,1)-(77,23)      7.4    0.4
onSublist                             Data.List.Split.Internals  src/Data/List/Split/Internals.hs:278:1-72            3.6    0.0
toHashesView                          ImageFileRecordStatistics  ImageFileRecordStatistics.hs:(48,1)-(51,24)          3.0    6.3
main                                  Main                       group.hs:(47,1)-(89,54)                              2.9    0.4
numberOfUnique                        ImageFileRecord            ImageFileRecord.hs:37:1-40                           1.6    0.1
toHashesView.sortedLines              ImageFileRecordStatistics  ImageFileRecordStatistics.hs:50:7-30                 1.4    0.1
imageFileRecordFromFields             ImageFileRecordParser      ImageFileRecordParser.hs:(11,1)-(30,5)               1.1    0.3
toHashView                            ImageFileRecord            ImageFileRecord.hs:(67,1)-(69,23)                    0.7    1.7
Or is the type [Char] so memory-inefficient (compared to Text) that it causes splitOn to consume that much memory?
UPDATE 1 (+RTS -s suggestion of user HTNW)
  23,446,268,504 bytes allocated in the heap
  10,753,363,408 bytes copied during GC
   1,456,588,656 bytes maximum residency (22 sample(s))
      29,282,936 bytes maximum slop
            3620 MB total memory in use (0 MB lost due to fragmentation)

                                   Tot time (elapsed)  Avg pause  Max pause
  Gen  0     45646 colls,     0 par    4.055s   4.059s     0.0001s    0.0013s
  Gen  1        22 colls,     0 par    4.034s   4.035s     0.1834s    1.1491s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    7.477s  (  7.475s elapsed)
  GC      time    8.089s  (  8.094s elapsed)
  RP      time    0.000s  (  0.000s elapsed)
  PROF    time    0.000s  (  0.000s elapsed)
  EXIT    time    0.114s  (  0.114s elapsed)
  Total   time   15.687s  ( 15.683s elapsed)

  %GC     time      51.6%  (51.6% elapsed)

  Alloc rate    3,135,625,407 bytes per MUT second

  Productivity  48.4% of total user, 48.4% of total elapsed
The processed text files are smaller than usual (37 MB, UTF-8 encoded), yet 3 GB of memory are still used.
UPDATE 2 (critical part of the code)
Explanation: fileToPairs processes a text file. It returns a list of key-value pairs (key: fingerprint of record, value: record).
sortAndGroup associations = Map.fromListWith (++) [(k, [v]) | (k, v) <- associations]

main = do
  CommandLineArguments{..} <- cmdArgs $ CommandLineArguments {
      ignored_paths_file = def &= typFile,
      files = def &= typ "FILES" &= args,
      number_of_occurrences = def &= name "o",
      minimum_number_of_occurrences = def &= name "l",
      maximum_number_of_occurrences = def &= name "u",
      number_of_hashes = def &= name "n",
      having_record_errors = def &= name "e",
      hashes = def
    }
    &= summary "Group image/video files"
    &= program "group"

  let ignoredPathsFilenameMaybe = ignored_paths_file
  let filenames = files
  let hashesMaybe = hashes

  ignoredPaths <- case ignoredPathsFilenameMaybe of
    Just ignoredPathsFilename -> ioToLines (readFile ignoredPathsFilename)
    _ -> return []

  recordPairs <- mapM (fileToPairs ignoredPaths) filenames

  let allRecordPairs = concat recordPairs
  let groupMap = sortAndGroup allRecordPairs
  let statisticsPairs = map toPair (Map.toList groupMap)
        where toPair item = (fst item, imageFileRecordStatisticsFromRecords . snd $ item)

  let filterArguments = FilterArguments {
      numberOfOccurrencesMaybe = number_of_occurrences,
      minimumNumberOfOccurrencesMaybe = minimum_number_of_occurrences,
      maximumNumberOfOccurrencesMaybe = maximum_number_of_occurrences,
      numberOfHashesMaybe = number_of_hashes,
      havingRecordErrorsMaybe = having_record_errors
    }
  let filteredPairs = filterImageRecords filterArguments statisticsPairs
  let filteredMap = Map.fromList filteredPairs

  case hashesMaybe of
    Just True -> mapM_ putStrLn (map toHashesView (map snd filteredPairs))
    _ -> Char8.putStrLn (encodePretty filteredMap)
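As an aside, sortAndGroup above is a plain multimap builder. A minimal example of its behaviour (assuming Map is the qualified import of Data.Map; values accumulate in reverse encounter order because fromListWith applies (++) as new ++ old):

-- e.g., grouping values that share a key:
example :: Map.Map Int String
example = sortAndGroup [(1, 'a'), (2, 'b'), (1, 'c')]
-- == Map.fromList [(1, "ca"), (2, "b")]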
As I'm sure you're aware, there's not really enough information here for us to help you make your program more efficient. It might be worth posting some (complete, self-contained) code on the Code Review site for that.
However, I think I can answer your specific question about why splitOn allocates so much memory. In fact, there's nothing particularly special about splitOn or how it's been implemented. Many straightforward Haskell functions will allocate lots of memory, and this in itself doesn't indicate that they've been poorly written or are running inefficiently. In particular, splitOn's memory usage seems similar to other straightforward approaches to splitting a string based on delimiters.
The first thing to understand is that GHC compiled code works differently than other compiled code you're likely to have seen. If you know a lot of C and understand stack frames and heap allocation, or if you've studied some JVM implementations, you might reasonably expect that some of that understanding would translate to GHC executables, but you'd be mostly wrong.
A GHC program is more or less an engine for allocating heap objects, and -- with a few exceptions -- that's all it really does. Nearly every argument passed to a function or constructor (as well as the constructor application itself) allocates a heap object of at least 16 bytes, and often more. Take a simple function like:
fact :: Int -> Int
fact 0 = 1
fact n = n * fact (n-1)
With optimization turned off, it compiles to the following so-called "STG" form (simplified from the actual -O0 -ddump-stg output):
fact = \n -> case n of
    I# n' -> case n' of
        0# -> I# 1#
        _  -> let sat1 = let sat2 = let one = I#! 1# in n - one
                         in fact sat2
              in n * sat1
Everywhere you see a let, that's a heap allocation (16+ bytes), and there are presumably more hidden in the (-) and (*) calls. Compiling and running this program with:
main = print $ fact 1000000
gives:
113,343,544 bytes allocated in the heap
44,309,000 bytes copied during GC
25,059,648 bytes maximum residency (5 sample(s))
29,152 bytes maximum slop
23 MB total memory in use (0 MB lost due to fragmentation)
meaning that each iteration allocates over a hundred bytes on the heap, though it's literally just performing a comparison, a subtraction, a multiplication, and a recursive call.
This is what @HTNW meant in saying that total allocation in a GHC program is a measure of "work". A GHC program that isn't allocating probably isn't doing anything (again, with some rare exceptions), and a typical GHC program that is doing something will usually allocate at a relatively constant rate of several gigabytes per second when it's not garbage collecting. So, total allocation has more to do with total runtime than anything else, and it isn't a particularly good metric for assessing code efficiency. Maximum residency is also a poor measure of overall efficiency, though it can be helpful for assessing whether or not you have a space leak, if you find that it tends to grow linearly (or worse) with the size of the input where you expect the program should run in constant memory regardless of input size.
For most programs, the most important true efficiency metric in the +RTS -s output is probably the "productivity" rate at the bottom -- it's the amount of time the program spends not garbage collecting. And, admittedly, your program's productivity of 48% is pretty bad, which probably means that it is, technically speaking, allocating too much memory, but it's probably only allocating two or three times the amount it should be, so, at a guess, maybe it should "only" be allocating around 7-8 Gigs instead of 23 Gigs for this workload (and, consequently, running for about 5 seconds instead of 15 seconds).
With that in mind, if you consider the following simple breakDelim implementation:
breakDelim :: String -> [String]
breakDelim str = case break (== '\t') str of
  (a, _:b) -> a : breakDelim b
  (a, [])  -> [a]
and use it like so in a simple tab-to-comma delimited file converter:
main = interact (unlines . map (intercalate "," . breakDelim) . lines)
Then, unoptimized and run on a file with 10000 lines of 1000 3-character fields each, it allocates a whopping 17 Gigs:
17,227,289,776 bytes allocated in the heap
2,807,297,584 bytes copied during GC
127,416 bytes maximum residency (2391 sample(s))
32,608 bytes maximum slop
0 MB total memory in use (0 MB lost due to fragmentation)
and profiling it places a lot of blame on breakDelim:
COST CENTRE  MODULE  SRC                    %time %alloc
main         Main    Delim.hs:8:1-71         57.7   72.6
breakDelim   Main    Delim.hs:(4,1)-(6,16)   42.3   27.4
In this case, compiling with -O2 doesn't make much difference. The key efficiency metric, productivity, is only 46%. All these results seem to be in line with what you're seeing in your program.
The split package has a lot going for it, but looking through the code, it's pretty clear that little effort has been made to make it particularly efficient or fast, so it's no surprise that splitOn performs no better than my quick-and-dirty custom breakDelim function. And, as I said before, there's nothing special about splitOn that makes it unusually memory hungry -- my simple breakDelim has similar behavior.
With respect to the inefficiency of the String type, it can often be problematic. But it can also participate in optimizations like list fusion in ways that Text can't. The utility above could be rewritten in simpler form as:
main = interact $ map (\c -> if c == '\t' then ',' else c)
which uses String but runs pretty fast (about a quarter as fast as a naive C getchar/putchar implementation) at 84% productivity, while allocating about 5 Gigs on the heap.
It's quite likely that if you just take your program and "convert it to Text", you'll find it's slower and more memory hungry than the original! While Text has the potential to be much more efficient than String, it's a complicated package, and the way in which Text objects behave with respect to allocation when they're sliced and diced (as when you're chopping a big Text file up into little Text fields) makes it more difficult to get right.
So, some take-home lessons:

- Total allocation is a poor measure of efficiency. Most well-written GHC programs can and should allocate several gigabytes per second of runtime.
- Many innocuous Haskell functions will allocate lots of memory because of the way GHC-compiled code works. This isn't necessarily a sign that there's something wrong with the function.
- The split package provides a flexible framework for all manner of cool list-splitting manipulations, but it was not designed with speed in mind, and it may not be the best method of processing a tab-delimited file.
- The String data type has a potential for terrible inefficiency, but isn't always inefficient, and Text is a complicated package that won't be a plug-in replacement to fix your String performance woes.

Most importantly: unless your program is too slow for its intended purpose, its run-time statistics and the theoretical advantages of Text over String are largely irrelevant.

Why does this solution to the "queens" problem run so much slower than the other in Haskell?

In my computer science class we were using Haskell to solve the "queens" problem, in which you must find all possible placements of n queens on an n×n board. This was the code we were given:
queens n = solve n
  where
    solve 0 = [ [] ]
    solve k = [ h:partial | partial <- solve (k-1), h <- [0..(n-1)], safe h partial ]
    safe h partial = and [ not (checks h partial i) | i <- [0..(length partial)-1] ]
    checks h partial i = h == partial!!i || abs (h - partial!!i) == i+1
However, the first time I entered it I accidentally swapped the order in solve k and found that it still gave a correct solution but took much longer:
queens n = solve n
  where
    solve 0 = [ [] ]
    solve k = [ h:partial | h <- [0..(n-1)], partial <- solve (k-1), safe h partial ]
    safe h partial = and [ not (checks h partial i) | i <- [0..(length partial)-1] ]
    checks h partial i = h == partial!!i || abs (h - partial!!i) == i+1
Why does this second version take so much longer? My thought process is that the second version does recursion at every step while the first version does recursion only once and then backtracks. This is not for a homework problem, I'm just curious and feel like it will help me better understand the language.
Simply put,
[ ... | x <- f 42, n <- [1..100] ]
will evaluate f 42 once to a list, and for each element x of that list it will generate all values of n from 1 to 100. Instead,
[ ... | n <- [1..100], x <- f 42 ]
will first generate an n from 1 to 100, and for each of them call f 42. So f is now called 100 times instead of once.
This is no different from what happens in imperative programming when using nested loops:

for x in f(42):             # calls f once
    for n in range(1, 101):
        ...

for n in range(1, 101):
    for x in f(42):         # calls f 100 times
        ...
The fact that your algorithm is recursive makes this swap particularly expensive, since the additional cost factor (100, above) accumulates at each recursive call.
You can also try to bind the result of f 42 to some variable so that it does not need to be recomputed, even if you nest it the other way around:
[ ... | let xs = f 42, n <- [1..100], x <- xs ]
Note that this will keep the whole xs list in memory for the whole loop, preventing it from being garbage collected. Indeed, xs will be fully evaluated for n=1, and then reused for higher values of n.
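Applied to the queens code, the same trick looks like this sketch (the semantics match the fast version, with the same caveat that partials stays alive for the whole comprehension):

-- the slow comprehension, repaired by binding the recursive call once
solve k = [ h:partial | let partials = solve (k-1)
                      , h <- [0..(n-1)]
                      , partial <- partials
                      , safe h partial ]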
My guess is that your first version does a depth-first traversal while your second version does a breadth-first traversal of the tree (see Tree Traversal on Wikipedia).
As the complexity of the problem grows with the size of the board, the second version uses more and more memory to keep track of each level of the tree while the first version quickly forgets the previous branch it visited.
Managing the memory takes a lot of time!
By enabling profiling, you can see how the Haskell runtime behaves with your functions.
If you compare the number of calls, they are strictly the same, but still the second version takes more time:
COST CENTRE          MODULE  no.   entries  %time %alloc  %time %alloc
MAIN                 MAIN     44         0    0.0    0.0  100.0  100.0
 main                Main     89         0    0.3    0.0    0.3    0.0
 CAF                 Main     87         0    0.0    0.0   99.7  100.0
  main               Main     88         1    0.2    0.6   99.7  100.0
   queens2           Main     94         1    0.0    0.0   55.6   48.2
    queens2.solve    Main     95        13    3.2    0.8   55.6   48.2
     queens2.safe    Main     96  10103868   42.1   47.5   52.3   47.5
      queens2.checks Main    100  37512342   10.2    0.0   10.2    0.0
   queens1           Main     90         1    0.0    0.0   43.9   51.1
    queens1.solve    Main     91        13    2.0    1.6   43.9   51.1
     queens1.safe    Main     92  10103868   29.3   49.5   41.9   49.5
      queens1.checks Main     93  37512342   12.7    0.0   12.7    0.0
Looking at the heap profile tells you what really happens: the first version has a small and constant heap use, while the second version has a huge heap use which must also face garbage collection, visible as peaks in the heap profile.
Looking at the Core, the first version generates a single function, which is tail recursive (constant stack space: a very fast and very nice function, thanks GHC!). However, the second generates two functions: one to do a single step of the inner loop, and a second function which looks like
loop x = case x of { 0 -> someDefault; _ -> do1 (loop (x-1)) }
This function likely isn't performant because do1 must traverse the entire input list, and each iteration appends new elements to the list (meaning the input list to do1 grows monotonically in length). The Core function for the fast version, by contrast, generates the output list directly, without having to process some other list. It is quite difficult to reason about the performance of list comprehensions, I believe, so first translate the function to not use them:
guard b = if b then [()] else []

solve_good k =
  concatMap (\partial ->
    concatMap (\h ->
      guard (safe h partial) >> return (h:partial)
    ) [0..n-1]
  ) (solve $ k-1)

solve_bad k =
  concatMap (\h ->
    concatMap (\partial ->
      guard (safe h partial) >> return (h:partial)
    ) (solve $ k-1)
  ) [0..n-1]
The transformation is fairly mechanical and is detailed somewhere in the Haskell report, but essentially <- becomes concatMap and conditions become guards. It is much easier to see what is happening now - solve_good makes a recursive call a single time, then concatMaps over that recursively created list. However, solve_bad makes the recursive call inside the outer concatMap, meaning it will potentially (likely) be recomputed for every element in [0..n-1]. Note that there is no semantic reason for solve $ k-1 to be in the inner concatMap - it does not depend on the value that that concatMap binds (the h variable) so it can be safely lifted out above the concatMap which binds h (as is done in solve_good).
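In other words, the repair for solve_bad is a one-line hoist of the recursive call (solve_fixed is a hypothetical name for this sketch):

solve_fixed k =
  let partials = solve (k-1)   -- computed once, shared by every h
  in  concatMap (\h ->
        concatMap (\partial ->
          guard (safe h partial) >> return (h:partial)
        ) partials
      ) [0..n-1]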

Haskell: profiler output not complete?

I'm building a project that stores words in a dictionary (using the dawg library), and when it is compiled with -fprof-auto, the profiler doesn't show that most of the time is spent in functions and CAFs from the dawg modules.
The code (also using conduit, but it's pretty straightforward) is:
import           Data.DAWG.Static as D
import qualified Data.DAWG.Dynamic as DD
import           Conduit
import qualified Data.Conduit.Combinators as C
import qualified Data.Text as T
import           Data.List (isSuffixOf)
import           Control.Monad

insertEntry dawg word =
  DD.insertWith (+) (T.unpack word) 1 dawg

isWhitespace x = x `elem` [' ', '.', '\n', '\'']

appendFileToDDAWG dawg fp =
  C.sourceFile fp $= C.decodeUtf8
                  $= C.splitOnUnboundedE isWhitespace
                  $$ C.foldl insertEntry dawg

loadDirToDAWG :: FilePath -> IO (DAWG Char () Int)
loadDirToDAWG dir = runResourceT $ do
  d <- C.sourceDirectoryDeep True dir
         $= C.filter (".txt" `isSuffixOf`)
         $$ C.foldM appendFileToDDAWG DD.empty
  return $ D.freeze d

main = do
  d <- loadDirToDAWG some_directory
  mapM_ print $ D.assocs d
Running with +RTS -p shows me most of the time is spent in insertEntry (which is normal):
COST CENTRE           MODULE              no.  entries  %time %alloc  %time %alloc
MAIN                  MAIN                211        0    0.0    0.0  100.0  100.0
 main                 Main                423        0    2.3    1.7   99.9  100.0
  loadDirToDAWG       Mymodule.BuildDAWG  425        0    1.9    1.0   97.6   98.2
   appendFileToDDAWG  Mymodule.BuildDAWG  427       74    8.0    5.1   95.7   97.2
    insertEntry       Mymodule.BuildDAWG  430    71366   86.5   92.1   86.5   92.1
[...]
 CAF                  Data.DAWG.Static    418        0    0.1    0.0    0.1    0.0
 CAF                  Data.DAWG.Trans.Map 412        0    0.0    0.0    0.0    0.0
But it doesn't tell me that the time is actually spent inside the Data.DAWG.Dynamic module, which is weird because it shows the Data.DAWG.Static module, so it's able to "detect" some modules from dawg but not all of them, and especially not the one where most of the work is done.
After downloading dawg, modifying its .cabal file so it's compiled with -fprof-auto-top, and rebuilding everything I get a larger profiler output that shows all inner functions of Data.DAWG.Dynamic and seems ok. But I don't want the full detail (I'm not profiling dawg, just my code) and I just want to be sure that the time is spent in dawg and not in my code (or else it means my code has a problem).
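For reference, that tweak amounts to adding the flag to the library stanza of dawg.cabal, roughly like this (a sketch; the real file has more fields):

library
  ghc-options: -fprof-auto-top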
So why in the first case was the Data.DAWG.Dynamic module not shown?
Is there something I'm missing regarding how GHC handles profiling?

What does .(...) mean in a .prof report?

I'm looking for optimization opportunities in my Haskell program by compiling with -prof, but I don't know how to interpret the cost centres that contain ellipses. What are filter.(...) and jankRoulette.select.(...)?
COST CENTRE                MODULE       %time %alloc
filter.(...)               Forest        46.5   22.3
set-union                  Forest        22.5    4.1
cache-lookup               Forest        16.0    0.1
removeMany                 MultiMapSet    3.7    1.9
insertMany                 MultiMapSet    3.3    1.8
jankRoulette.select.(...)  Forest         1.4   15.2
I generated that with: $ ghc --make -rtsopts -prof -auto-all main.hs && ./main +RTS -p && cat main.prof
The function filter has a few definitions in a where clause, like this:
filter a b = blahblah
  where
    foo = bar
    bar = baz
    baz = bing
But those all show up as filter.foo, filter.bar, etc.
I thought they might be nested let expressions, but jankRoulette.select doesn't have any. And I've added SCC directives in front of most of them without any of those cost centres rising to the top.
Since most of the time is spent in filter.(...), I'd like to know what that is. :)
TL;DR: GHC generates this when you do a pattern match in a let binding, like let (x, y) = c. The cost of evaluating c is tracked by the (...) cost centre, since there is no unique name for it.
So how did I find this out?
A grep for (...) in the GHC source code finds the following (from compiler/deSugar/Coverage.hs):
-- TODO: Revisit this
addTickLHsBind (L pos (pat@(PatBind { pat_lhs = lhs, pat_rhs = rhs }))) = do
  let name = "(...)"
  (fvs, rhs') <- getFreeVars $ addPathEntry name $ addTickGRHSs False False rhs
  {- ... more code following, but not relevant to this purpose -}
That code tells us that it has something to do with pattern bindings.
So we can make a small test program to check the behavior:
x :: Int
(x:_) = reverse [1..1000000000]
main :: IO ()
main = print x
Then we can run this program with profiling enabled. And indeed, GHC generates the following output:
COST CENTRE MODULE no. entries %time  %alloc %time  %alloc
MAIN        MAIN    42       0   0.0    0.0  100.0  100.0
CAF         Main    83       0   0.0    0.0  100.0  100.0
 (...)      Main    86       1 100.0  100.0  100.0  100.0
 x          Main    85       1   0.0    0.0    0.0    0.0
main        Main    84       1   0.0    0.0    0.0    0.0
So it turns out the assumption made from the code was correct. All of the time of the program is spent evaluating the reverse [1..1000000000] expression, and it's assigned to the (...) cost centre.

Lazy IO + Parallelism: converting an image to grayscale

I am trying to add parallelism to a program that converts a .bmp to a grayscale .bmp. I usually see 2-4x worse performance for the parallel code. I am tweaking the parBuffer and chunk sizes and still cannot seem to reason about it. Looking for guidance.
The entire source file used here: http://lpaste.net/106832
We use Codec.BMP to read in a stream of pixels represented by type RGBA = (Word8, Word8, Word8, Word8). To convert to grayscale, simply map a 'luma' transform across all the pixels.
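The luma function itself lives in the linked source; a plausible sketch using the standard Rec. 601 weights (an assumption; the actual code may differ):

import Data.Word (Word8)

type RGBA = (Word8, Word8, Word8, Word8)

-- hypothetical luma transform (Rec. 601 weights); the real 'luma'
-- in the linked paste may differ
luma :: RGBA -> RGBA
luma (r, g, b, a) = (y, y, y, a)
  where
    y = round (0.299 * fromIntegral r
             + 0.587 * fromIntegral g
             + 0.114 * fromIntegral b :: Double)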
The serial implementation is literally:
toGray :: [RGBA] -> [RGBA]
toGray x = map luma x
The test input .bmp is 5184 x 3456 (71.7 MB).
The serial implementation runs in ~10 s at ~550 ns/pixel, and ThreadScope looks clean.
Why is this so fast? I suppose it has something to do with lazy ByteString (even though Codec.BMP uses strict ByteString; is there an implicit conversion occurring here?) and fusion.
Adding Parallelism
My first attempt at adding parallelism was via parList. Oh boy. The program used ~4-5 GB of memory and the system started swapping.
I then read "Parallelizing Lazy Streams with parBuffer" section of Simon Marlow's O'Reilly book and tried parBuffer with a large size. This still did not produce desirable performance. The spark sizes were incredibly small.
I then tried to increase the spark size by chunking the lazy list and then sticking with parBuffer for the parallelism:
toGrayPar :: [RGBA] -> [RGBA]
toGrayPar x = concat $ (withStrategy (parBuffer 500 rpar) . map (map luma))
                       (chunk 8000 x)

chunk :: Int -> [a] -> [[a]]
chunk n [] = []
chunk n xs = as : chunk n bs
  where (as, bs) = splitAt (fromIntegral n) xs
But this still does not yield desirable performance:
  18,934,235,760 bytes allocated in the heap
  15,274,565,976 bytes copied during GC
     639,588,840 bytes maximum residency (27 sample(s))
     238,163,792 bytes maximum slop
            1910 MB total memory in use (0 MB lost due to fragmentation)

                                   Tot time (elapsed)  Avg pause  Max pause
  Gen  0     35277 colls, 35277 par   19.62s   14.75s     0.0004s    0.0234s
  Gen  1        27 colls,    26 par   13.47s    7.40s     0.2741s    0.5764s

  Parallel GC work balance: 30.76% (serial 0%, perfect 100%)

  TASKS: 6 (1 bound, 5 peak workers (5 total), using -N2)

  SPARKS: 4480 (2240 converted, 0 overflowed, 0 dud, 2 GC'd, 2238 fizzled)

  INIT    time    0.00s  (  0.01s elapsed)
  MUT     time   14.31s  ( 14.75s elapsed)
  GC      time   33.09s  ( 22.15s elapsed)
  EXIT    time    0.01s  (  0.12s elapsed)
  Total   time   47.41s  ( 37.02s elapsed)

  Alloc rate    1,323,504,434 bytes per MUT second

  Productivity  30.2% of total user, 38.7% of total elapsed

gc_alloc_block_sync: 7433188
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 1017408
How can I better reason about what is going on here?
You have a big list of RGBA pixels. Why don't you use parListChunk with a reasonable chunk size?
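A sketch of that suggestion (the 8000-element chunk size is just a starting point to tune; rdeepseq forces each pixel fully inside its spark so the work isn't deferred to the consumer):

import Control.Parallel.Strategies (parListChunk, rdeepseq, using)

toGrayPar :: [RGBA] -> [RGBA]
toGrayPar xs = map luma xs `using` parListChunk 8000 rdeepseq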
