What does .(...) mean in a .prof report? - haskell

I'm looking for optimization opportunities in my Haskell program by compiling with -prof, but I don't know how to interpret the cost centres that contain ellipses. What are filter.(...) and jankRoulette.select.(...)?
COST CENTRE                MODULE       %time  %alloc
filter.(...)               Forest        46.5    22.3
set-union                  Forest        22.5     4.1
cache-lookup               Forest        16.0     0.1
removeMany                 MultiMapSet    3.7     1.9
insertMany                 MultiMapSet    3.3     1.8
jankRoulette.select.(...)  Forest         1.4    15.2
I generated that with: $ ghc --make -rtsopts -prof -auto-all main.hs && ./main +RTS -p && cat main.prof
The function filter has a few definitions in a where clause, like this:
filter a b = blahblah
  where
    foo = bar
    bar = baz
    baz = bing
But those all show up as filter.foo, filter.bar, etc.
I thought they might be nested let expressions, but jankRoulette.select doesn't have any. And I've added SCC directives in front of most of them without any of those cost centres rising to the top.
Since most of the time is spent in filter.(...), I'd like to know what that is. :)

TL;DR: GHC generates this cost centre when you pattern match in a let binding, like let (x,y) = c. The cost of evaluating c is tracked by the (...) cost centre, since there is no unique name for it.
So how did I find this out?
A grep for (...) in the GHC source code finds the following (from compiler/deSugar/Coverage.hs):
-- TODO: Revisit this
addTickLHsBind (L pos (pat@(PatBind { pat_lhs = lhs, pat_rhs = rhs }))) = do
let name = "(...)"
(fvs, rhs') <- getFreeVars $ addPathEntry name $ addTickGRHSs False False rhs
{- ... more code following, but not relevant to this purpose
-}
That code tells us that it has something to do with pattern bindings.
So we can make a small test program to check the behavior:
x :: Int
(x:_) = reverse [1..1000000000]
main :: IO ()
main = print x
Then we can run this program with profiling enabled. And indeed, GHC generates the following output:
                                      individual       inherited
COST CENTRE MODULE  no.  entries   %time %alloc    %time %alloc
MAIN        MAIN     42        0     0.0    0.0    100.0  100.0
 CAF        Main     83        0     0.0    0.0    100.0  100.0
  (...)     Main     86        1   100.0  100.0    100.0  100.0
  x         Main     85        1     0.0    0.0      0.0    0.0
 main       Main     84        1     0.0    0.0      0.0    0.0
So it turns out the assumption made from the code was correct. All of the time of the program is spent evaluating the reverse [1..1000000000] expression, and it's assigned to the (...) cost centre.
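As a side note: if you want that cost attributed to a name of your own choosing, one workaround (a sketch in the spirit of the test program above, not something the question's flags require) is to hang an SCC pragma on the right-hand side of the pattern binding, or to avoid the pattern binding entirely and give the expensive expression an ordinary name, which automatic cost-centre insertion will then report under its own cost centre:
x :: Int
-- An explicit cost centre on the right-hand side of the pattern binding:
(x:_) = {-# SCC "bigReverse" #-} reverse [1..1000000000]

-- Or route the expensive expression through a named binding; the where-bound
-- name should then show up as its own cost centre (here y.ys), just as
-- filter.foo and filter.bar did in the question's report.
y :: Int
y = head ys
  where ys = reverse [1..1000000000]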

Related

JAGS: node inconsistent with parents

I am trying to apply Benedict Escoto's method from the paper "Bayesian Claim Severity with Mixed Distributions," published in Variance. I seem to be running into a JAGS simulation problem. When I run the code, JAGS gives me the following error:
Error in node ones[1] Node inconsistent with parents
The data is two-fold. First, there is a list of ground-up insurance claims. Information on their age in years, deductibles (truncation) and whether they have been capped (censoring, True or False) is provided. Second, there is a list of prior means and corresponding probability weights for a mixed exponential distribution.
One thing I noticed is that it works for one set of priors for the mixed exponential distribution, but fails for another.
It works with:
Mean    Weight
50      0.3
100     0.25
500     0.25
1500    0.1
5000    0.07
20000   0.03
But it fails with:
Mean    Weight
3       0.72
14      0.19
42      0.05
138     0.02
503     0.01
1501    0.01
So it may have to do with parameter requirements for mixed exponential.
The data packet for JAGS is assembled as below:
jags.data <- list(claims=(claim.df$x[!claim.df$capped]
                          - claim.df$truncation[!claim.df$capped]),
                  capped.claims=(claim.df$x[claim.df$capped]
                                 - claim.df$truncation[claim.df$capped]),
                  alpha=alpha,
                  means=actual.means,
                  ones=rep(1, length(claim.df$x[claim.df$capped])),
                  ages=claim.df$age[!claim.df$capped],
                  capped.ages=claim.df$age[claim.df$capped],
                  trend.shape=trend.shape,
                  trend.rate=1/trend.scale)
Notice that the object "ones" is given a value of 1 for each capped claim.
The initial values are supplied as below:
jags.init <- list(means=list(weights=prior.weights),
                  equal=list(weights=rep(1/m,m)))
Some miscellaneous values are provided as follows:
m <- length(actual.means)
alpha0 <- 20
alpha <- prior.weights * alpha0
trend.prior.mu <- .05
trend.prior.sigma <- .01
trend.scale <- trend.prior.sigma^2 / (1+trend.prior.mu)
trend.shape <- (1+trend.prior.mu)/trend.scale
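As a quick sanity check on that parameterization (standard gamma facts, not something specific to this model): a gamma distribution with shape k and scale s has mean k*s and variance k*s^2. Here k = trend.shape = (1+trend.prior.mu)/trend.scale and s = trend.scale = trend.prior.sigma^2/(1+trend.prior.mu), so the mean is k*s = 1 + trend.prior.mu = 1.05 and the variance is k*s^2 = trend.prior.sigma^2 = 0.0001. In other words, the prior on trend.factor is centred at 1.05 with standard deviation 0.01, and passing trend.rate = 1/trend.scale matches JAGS's dgamma(shape, rate) convention.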
The JAGS model is coded as below:
model <- "model {
weights ~ ddirch(alpha)
trend.factor ~ dgamma(trend.shape, trend.rate)
for (i in 1:length(claims)) {
buckets[i] ~ dcat(weights)
mu[i] <- means[buckets[i]] / trend.factor^ages[i]
claims[i] ~ dexp(1/mu[i])
}
for (i in 1:length(capped.claims)) {
capped.buckets[i] ~ dcat(weights)
capped.mu[i] <- means[capped.buckets[i]]/trend.factor^capped.ages[i]
prob.capped[i] <- exp(-capped.claims[i]/capped.mu[i])
ones[i] ~ dbern(prob.capped[i])
}
}"
Dirichlet, categorical, and gamma distributions are used for the priors. Each element of ones is Bernoulli distributed to characterize claims as capped or uncapped (the "ones trick" for expressing a censoring likelihood).
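One thing worth checking along those lines (an assumption on my part, not something I have verified against this data): with the second set of priors the bucket means are tiny, so prob.capped[i] = exp(-capped.claims[i]/capped.mu[i]) can underflow to exactly 0 for a large capped claim. For example, a capped claim of 5000 drawn against a bucket with mean 3 gives exp(-5000/3) ≈ 10^-724, far below the smallest positive double, so it rounds to 0; then ones[i] ~ dbern(0) with an observed value of 1 has zero likelihood, which is exactly the situation the "Node inconsistent with parents" error reports.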
Finally, the model is run in JAGS with the following:
model.out <- autorun.jags(model, data=jags.data, inits=jags.init,
                          monitor=c("weights","trend.factor"),
                          startburnin=1000, startsample=5000,
                          n.chains=n.chains, interactive=FALSE, thin=thin.factor)
Would anyone have an idea of what is going wrong? Thanks!

Why does this solution to the "queens" dilemma run so much slower than the other in Haskell?

In my computer science class we were using Haskell to solve the "queens" problem, in which you must find all possible placements of n queens on an n×n board. This was the code we were given:
queens n = solve n
  where
    solve 0 = [ [] ]
    solve k = [ h:partial | partial <- solve (k-1), h <- [0..(n-1)], safe h partial ]
    safe h partial = and [ not (checks h partial i) | i <- [0..(length partial)-1] ]
    checks h partial i = h == partial!!i || abs (h - partial!!i) == i+1
However, the first time I entered it I accidentally swapped the order of the generators in solve k, and found that it still gave a correct solution but took much longer:
queens n = solve n
  where
    solve 0 = [ [] ]
    solve k = [ h:partial | h <- [0..(n-1)], partial <- solve (k-1), safe h partial ]
    safe h partial = and [ not (checks h partial i) | i <- [0..(length partial)-1] ]
    checks h partial i = h == partial!!i || abs (h - partial!!i) == i+1
Why does this second version take so much longer? My thought process is that the second version does recursion at every step while the first version does recursion only once and then backtracks. This is not for a homework problem; I'm just curious and feel like it will help me better understand the language.
Simply put,
[ ... | x <- f 42, n <- [1..100] ]
will evaluate f 42 once to a list, and for each element x of that list it will generate every n from 1 to 100. Instead,
[ ... | n <- [1..100], x <- f 42 ]
will first generate an n from 1 to 100, and for each of them call f 42. So f is now called 100 times instead of once.
This is no different from what happens in imperative programming when using nested loops:
for x in f(42):            # calls f once
    for n in range(1, 101):
        ...

for n in range(1, 101):
    for x in f(42):        # calls f 100 times
        ...
The fact that your algorithm is recursive makes this swap particularly expensive, since the additional cost factor (100, above) accumulates at each recursive call.
You can also try to bind the result of f 42 to some variable so that it does not need to be recomputed, even if you nest it the other way around:
[ ... | let xs = f 42, n <- [1..100], x <- xs ]
Note that this will keep the whole xs list in memory for the whole loop, preventing it from being garbage collected. Indeed, xs will be fully evaluated for n=1, and then reused for higher values of n.
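If you want to watch this happen, here is a small experiment of my own (hypothetical f; a sketch, not from the original answer). Compile it without -O, since GHC's full-laziness transformation can float f 42 out of the loop and change the counts:
import Debug.Trace (trace)

-- f announces every time its result is actually evaluated.
f :: Int -> [Int]
f k = trace "f evaluated" [k, k + 1]

once, hundredTimes :: [Int]
once         = [ x + n | x <- f 42, n <- [1..100] ]  -- prints "f evaluated" once
hundredTimes = [ x + n | n <- [1..100], x <- f 42 ]  -- prints it 100 times

main :: IO ()
main = do
  print (length once)          -- forces the first comprehension
  print (length hundredTimes)  -- forces the second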
My guess is that your first version does a depth-first traversal while your second version does a breadth-first traversal of the tree (see Tree Traversal on Wikipedia).
As the complexity of the problem grows with the size of the board, the second version uses more and more memory to keep track of each level of the tree while the first version quickly forgets the previous branch it visited.
Managing the memory takes a lot of time!
By enabling profiling, you can see how the Haskell runtime behaves with your functions.
If you compare the number of calls, they are strictly the same, but still the second version takes more time:
                                                  individual       inherited
COST CENTRE          MODULE  no.   entries     %time %alloc    %time %alloc
MAIN                 MAIN     44         0       0.0    0.0    100.0  100.0
 main                Main     89         0       0.3    0.0      0.3    0.0
 CAF                 Main     87         0       0.0    0.0     99.7  100.0
  main               Main     88         1       0.2    0.6     99.7  100.0
   queens2           Main     94         1       0.0    0.0     55.6   48.2
    queens2.solve    Main     95        13       3.2    0.8     55.6   48.2
     queens2.safe    Main     96  10103868      42.1   47.5     52.3   47.5
      queens2.checks Main    100  37512342      10.2    0.0     10.2    0.0
   queens1           Main     90         1       0.0    0.0     43.9   51.1
    queens1.solve    Main     91        13       2.0    1.6     43.9   51.1
     queens1.safe    Main     92  10103868      29.3   49.5     41.9   49.5
      queens1.checks Main     93  37512342      12.7    0.0     12.7    0.0
Looking at the heap profile tells you what really happens. The first version has a small and constant heap use, while the second version has a huge heap use which must also face garbage collection (visible as peaks in its heap profile).
Looking at the core, the first version compiles to a single core function, which is tail recursive (constant stack space, a very fast and very nice function. Thanks GHC!). However, the second generates two functions: one to do a single step of the inner loop, and a second function which looks like
loop x = case x of { 0 -> someDefault; _ -> do1 (loop (x-1)) }
This function likely isn't performant because do1 must traverse the entire input list, and each iteration appends new elements to that list (so the input to do1 grows monotonically in length). The core function for the fast version, by contrast, generates the output list directly, without having to process some other list. It is quite difficult to reason about the performance of list comprehensions, I believe, so first translate the function to not use them:
guard b = if b then [()] else []

solve_good k =
  concatMap (\partial ->
    concatMap (\h ->
      guard (safe h partial) >> return (h:partial)
    ) [0..n-1]
  ) (solve $ k-1)

solve_bad k =
  concatMap (\h ->
    concatMap (\partial ->
      guard (safe h partial) >> return (h:partial)
    ) (solve $ k-1)
  ) [0..n-1]
The transformation is fairly mechanical and is detailed somewhere in the Haskell report: essentially, each <- becomes a concatMap and conditions become guards. It is much easier to see what is happening now. solve_good makes the recursive call a single time, then concatMaps over that recursively created list. solve_bad, however, makes the recursive call inside the outer concatMap, meaning it will potentially (likely) be recomputed for every element in [0..n-1]. Note that there is no semantic reason for solve $ k-1 to be in the inner concatMap: it does not depend on the value that that concatMap binds (the h variable), so it can be safely lifted out above the concatMap which binds h, as is done in solve_good. A sketch of that lifting follows.
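Applied to solve_bad (same names as above, so this preserves its enumeration order while restoring the single recursive call per level):
solve_lifted k =
  let partials = solve (k-1)          -- computed once per level, shared by every h
  in concatMap (\h ->
       concatMap (\partial ->
         guard (safe h partial) >> return (h:partial)
       ) partials
     ) [0..n-1]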

Haskell: profiler output not complete?

I'm building a project that stores words in a dictionary (using the dawg library), and when it's compiled with -fprof-auto, the profiler doesn't show that most of the time is spent in functions and CAFs from the dawg modules.
The code (which also uses conduit, but is pretty straightforward) is:
import Data.DAWG.Static as D
import qualified Data.DAWG.Dynamic as DD
import Conduit
import qualified Data.Conduit.Combinators as C
import qualified Data.Text as T
import Data.List (isSuffixOf)
import Control.Monad
insertEntry dawg word =
  DD.insertWith (+) (T.unpack word) 1 dawg

isWhitespace x = x `elem` [' ', '.', '\n', '\'']

appendFileToDDAWG dawg fp =
  C.sourceFile fp $= C.decodeUtf8
                  $= C.splitOnUnboundedE isWhitespace
                  $$ C.foldl insertEntry dawg

loadDirToDAWG :: FilePath -> IO (DAWG Char () Int)
loadDirToDAWG dir = runResourceT $ do
  d <- C.sourceDirectoryDeep True dir
         $= C.filter (".txt" `isSuffixOf`)
         $$ C.foldM appendFileToDDAWG DD.empty
  return $ D.freeze d

main = do
  d <- loadDirToDAWG some_directory
  mapM_ print $ D.assocs d
Running with +RTS -p shows me that most of the time is spent in insertEntry (which is normal):
                                                         individual       inherited
COST CENTRE          MODULE               no.  entries    %time %alloc    %time %alloc
MAIN                 MAIN                 211        0      0.0    0.0    100.0  100.0
 main                Main                 423        0      2.3    1.7     99.9  100.0
  loadDirToDAWG      Mymodule.BuildDAWG   425        0      1.9    1.0     97.6   98.2
   appendFileToDDAWG Mymodule.BuildDAWG   427       74      8.0    5.1     95.7   97.2
    insertEntry      Mymodule.BuildDAWG   430    71366     86.5   92.1     86.5   92.1
[...]
CAF                  Data.DAWG.Static     418        0      0.1    0.0      0.1    0.0
CAF                  Data.DAWG.Trans.Map  412        0      0.0    0.0      0.0    0.0
But it doesn't tell me that the time is actually spent inside the Data.DAWG.Dynamic module. This is odd, because it does show the Data.DAWG.Static module, so it can "detect" some modules from dawg, just not all of them, and in particular not the one where most of the work is done.
After downloading dawg, modifying its .cabal file so it's compiled with -fprof-auto-top, and rebuilding everything, I get a larger profiler output that shows all the inner functions of Data.DAWG.Dynamic and looks fine. But I don't want that much detail (I'm not profiling dawg, just my code); I just want to be sure that the time is spent in dawg and not in my own code (or else it would mean my code has a problem).
So why in the first case was the Data.DAWG.Dynamic module not shown?
Is there something I'm missing regarding how GHC handles profiling?
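For reference, the modification described above amounts to adding a line to the library stanza of dawg's .cabal file, roughly like this (the stanza's other fields are elided):
library
  ghc-options: -fprof-auto-top
After that edit, the library has to be rebuilt and reinstalled with profiling enabled for its internal cost centres to show up.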

Function to Populate Tree in O(depth)

Purely Functional Data Structures has the following exercise:
-- 2.5 Sharing can be useful within a single object, not just between objects.
-- For example, if the two subtrees of a given node are identical, then they can
-- be represented by the same tree.
-- Part a: make a `complete a Int` function that creates a tree of
-- depth Int, putting a in every leaf of the tree.
complete :: a -> Integer -> Maybe (Tree a)
complete x depth
  | depth < 0 = Nothing
  | otherwise = Just $ complete' depth
  where complete' d
          | d == 0    = Empty
          | otherwise = let copiedTree = complete' (d-1)
                        in Node x copiedTree copiedTree
Does this implementation run in O(d) time? Could you please say why or why not?
The interesting part of the code is the complete' function:
complete' d
  | d == 0    = Empty
  | otherwise = let copiedTree = complete' (d-1)
                in Node x copiedTree copiedTree
As Cirdec's answer suggests, we should be careful to analyze each part of the implementation to make sure our assumptions are valid. As a general rule, we can assume that the following take 1 unit of time each*:
Using a data constructor to construct a value (e.g., using Empty to make an empty tree or using Node to turn a value and two trees into a tree).
Pattern matching on a value to see what data constructor it was built from and what values the data constructor was applied to.
Guards and if/then/else expressions (which are implemented internally using pattern matching).
Comparing an Integer to 0.
Cirdec mentions that the operation of subtracting 1 from an Integer is logarithmic in the size of the integer. As they say, this is essentially an artifact of the way Integer is implemented. It is possible to implement integers so that it takes only one step to compare them to 0 and also takes only one step to decrement them by 1. To keep things very general, it's safe to assume that there is some function c such that the cost of decrementing an Integer is c(depth).
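For example (a toy sketch of my own, not the representation GHC actually uses): with unary Peano naturals, both the zero test and decrement are a single pattern match, i.e. one step each, although other operations become correspondingly slower.
data Nat = Z | S Nat

isZero :: Nat -> Bool      -- one step: a single pattern match
isZero Z = True
isZero _ = False

dec :: Nat -> Nat          -- one step, however large the number
dec (S n) = n
dec Z     = error "dec: underflow"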
Now that we've taken care of those preliminaries, let's get down to work! As is generally the case, we need to set up a system of equations and solve it. Let f(d) be the number of steps needed to calculate complete' d. Then the first equation is very simple:
f(0) = 2
That's because it costs 1 step to compare d to 0, and another step to check that the result is True.
The other equation is the interesting part. Think about what happens when d > 0:
1. We calculate d == 0.
2. We check whether that is True (it's not).
3. We calculate d-1 (let's call the result dm1).
4. We calculate complete' dm1, saving the result as copiedTree.
5. We apply a Node constructor to x, copiedTree, and copiedTree.
Steps 1, 2, and 5 take 1 step each, and step 3 takes c(depth) steps. What about step 4? Well, that takes f(d-1) steps, so this will be a recursive definition.
f(0) = 2
f(d) = (3+c(depth)) + f(d-1) when d > 0
OK, now we're cooking with gas! Let's calculate the first few values of f:
f(0) = 2
f(1) = (3+c(depth)) + f(0) = (3+c(depth)) + 2
f(2) = (3+c(depth)) + f(1)
= (3+c(depth)) + ((3+c(depth)) + 2)
= 2*(3+c(depth)) + 2
f(3) = (3+c(depth)) + f(2)
= (3+c(depth)) + (2*(3+c(depth)) + 2)
= 3*(3+c(depth)) + 2
You should be starting to see a pattern by now:
f(d) = d*(3+c(depth)) + 2
We generally prove things about recursive functions using mathematical induction.
Base case:
The claim holds for d=0 because 0*(3+c(depth))+2=0+2=2=f(0).
Inductive step: suppose that the claim holds for d=D. Then
f(D+1) = (3+c(depth)) + f(D)
= (3+c(depth)) + (D*(3+c(depth))+2)
= (D+1)*(3+c(depth))+2
So the claim holds for D+1 as well. Thus by induction, it holds for all natural numbers d. As a reminder, this gives the conclusion that complete' d takes
f(d) = d*(3+c(depth))+2
time. Now how do we express that in big O terms? Well, big O doesn't care about the constant coefficients of any of the terms, and only cares about the highest-order terms. We can safely assume that c(depth)>=1, so we get
f(d) ∈ O(d*c(depth))
Zooming out to complete, this looks like O(depth*c(depth)).
If you use the real cost of Integer decrement, this gives you O(depth*log(depth)). If you pretend that Integer decrement is O(1), this gives you O(depth).
Side note: As you continue to work through Okasaki, you will eventually reach section 10.2.1, where you will see a way to implement natural numbers supporting O(1) decrement and O(1) addition (but not efficient subtraction).
* Haskell's lazy evaluation keeps this from being precisely true, but if you pretend that everything is evaluated strictly, you will get an upper bound for the true value, which will be good enough in this case. If you want to learn how to analyze data structures that use laziness to get good asymptotic bounds, you should keep reading Okasaki.
Theoretical Answer
No, it does not run in O(d) time. Its asymptotic performance is dominated by the Integer subtraction d-1, which takes O(log d) time. This is repeated O(d) times, giving an asymptotic upper bound on time of O(d log d).
This upper bound can improve if you use an Integer representation with an asymptotically optimal O(1) decrement. In practice we don't, since the asymptotically optimal Integer implementations are slower even for unimaginably large values.
Practically the Integer arithmetic will be a small part of the running time of the program. For practical "large" depths (smaller than a machine word) the program's running time will be dominated by allocating and populating memory. For larger depths you will exhaust the resources of the computer.
Practical Answer
Ask the run time system's profiler.
In order to profile your code, we first need to make sure it is run. Haskell is lazily evaluated, so, unless we do something to cause the tree to be completely evaluated, it might not be. Unfortunately, completely exploring the tree will take O(2^d) steps. We could avoid forcing nodes we had already visited if we kept track of their StableNames. Fortunately, traversing a structure and keeping track of visited nodes by their memory locations is already provided by the data-reify package. Since we will be using it for profiling, we need to install it with profiling enabled (-p).
cabal install -p data-reify
Using Data.Reify requires the TypeFamilies extension and Control.Applicative.
{-# LANGUAGE TypeFamilies #-}
import Data.Reify
import Control.Applicative
We reproduce your Tree code.
data Tree a = Empty | Node a (Tree a) (Tree a)

complete :: a -> Integer -> Maybe (Tree a)
complete x depth
  | depth < 0 = Nothing
  | otherwise = Just $ complete' depth
  where complete' d
          | d == 0    = Empty
          | otherwise = let copiedTree = complete' (d-1)
                        in Node x copiedTree copiedTree
Converting data to a graph with data-reify requires that we have a base functor for the data type. The base functor is a representation of the type with explicit recursion removed. The base functor for Tree is TreeF. An additional type parameter is added to represent recursive occurrences of the type, and each recursive occurrence is replaced by the new parameter.
data TreeF a x = EmptyF | NodeF a x x
  deriving (Show)
The MuRef instance required by reifyGraph asks us to provide mapDeRef, which traverses the structure with an Applicative and converts it to the base functor. The first argument provided to mapDeRef, which I have named deRef, is how we convert the recursive occurrences of the structure.
instance MuRef (Tree a) where
  type DeRef (Tree a) = TreeF a
  mapDeRef deRef Empty        = pure EmptyF
  mapDeRef deRef (Node a l r) = NodeF a <$> deRef l <*> deRef r
We can make a little program to run to test the complete function. When the graph is small, we'll print it out to see what's going on. When the graph gets big, we'll only print out how many nodes it has.
main = do
  d <- getLine
  let (Just tree) = complete 0 (read d)
  graph@(Graph nodes _) <- reifyGraph tree
  if length nodes < 30
    then print graph
    else print (length nodes)
I put this code in a file named profileSymmetricTree.hs. To compile it, we need to enable profiling with -prof and enable run-time system options with -rtsopts.
ghc -fforce-recomp -O2 -prof -fprof-auto -rtsopts profileSymmetricTree.hs
When we run it, we'll enable the time profile with the +RTS option -p. We'll give it the depth input 3 for the first run.
profileSymmetricTree +RTS -p
3
let [(1,NodeF 0 2 2),(2,NodeF 0 3 3),(3,NodeF 0 4 4),(4,EmptyF)] in 1
We can already see from the graph that the nodes are being shared between the left and right sides of the tree.
The profiler makes a file, profileSymmetricTree.prof.
                                                            individual       inherited
COST CENTRE                        MODULE  no.  entries   %time %alloc    %time %alloc
MAIN                               MAIN     43        0     0.0    0.7    100.0  100.0
 main                              Main     87        0   100.0   21.6    100.0   32.5
 ...
  main.(...)                       Main     88        1     0.0    4.8      0.0    5.1
   complete                        Main     90        1     0.0    0.0      0.0    0.3
    complete.complete'             Main     92        4     0.0    0.2      0.0    0.3
     complete.complete'.copiedTree Main     94        3     0.0    0.1      0.0    0.1
It shows in the entries column that complete.complete' was executed 4 times, and the complete.complete'.copiedTree was evaluated 3 times.
If you repeat this experiment with different depths, and plot the results, you should get a good idea what the practical asymptotic performance of complete is.
Here are the profiling results for a much greater depth, 300000.
                                                            individual       inherited
COST CENTRE                        MODULE  no.  entries   %time %alloc    %time %alloc
MAIN                               MAIN     43        0     0.0    0.0    100.0  100.0
 main                              Main     87        0     2.0    0.0     99.9  100.0
 ...
  main.(...)                       Main     88        1     0.0    0.0      2.1    5.6
   complete                        Main     90        1     0.0    0.0      2.1    5.6
    complete.complete'             Main     92   300001     1.3    4.4      2.1    5.6
     complete.complete'.copiedTree Main     94   300000     0.8    1.3      0.8    1.3

Optimizing Conduit pipelines

I'm currently benchmarking my program to see whether I can improve its performance. My program takes an input file and runs an algorithm to split it into multiple files.
It takes roughly 14s to split a file into 3 parts, with the -O2 compilation flag for both the library and the executable.
ghc-options: -Wall -fno-warn-orphans -O2 -auto-all
It looks like it is spending approximately 60% of its time in sinkFile, and I'm wondering whether there is anything I can do to improve the following code.
-- | Get the sink file, a list of FilePaths and the share number of the file to output to.
idxSinkFile :: MonadResource m
            => [FilePath]
            -> Int
            -> Consumer [Word8] m ()
idxSinkFile outFileNames shareNumber =
  let ccm = CC.concatMap $ flip atMay shareNumber
      cbs = CC.map BS.singleton
      sf  = sinkFile (outFileNames !! shareNumber)
  in ccm =$= cbs =$= sf

-- | Generate a sink which will take a list of bytes and write each byte to its corresponding file share.
sinkMultiFiles :: MonadResource m
               => [FilePath]
               -> [Int]
               -> Sink [Word8] m ()
sinkMultiFiles outFileNames xs =
  let len = [0 .. length xs - 1]
  in getZipSink $ otraverse_ (ZipSink . idxSinkFile outFileNames) len
Here is the output of GHC's profiling:
                                                            individual       inherited
COST CENTRE          MODULE                  no.  entries   %time %alloc    %time %alloc
splitFile.sink       HaskSplit.Conduit.Split 289        1     0.0    0.0     66.8   74.2
 sinkMultiFiles      HaskSplit.Conduit.Split 290        1    27.4   33.2     66.8   74.2
  idxSinkFile        HaskSplit.Conduit.Split 303        3     7.9   11.3     39.4   41.0
   idxSinkFile.ccm   HaskSplit.Conduit.Split 319        3     3.1    3.6      3.1    3.6
   idxSinkFile.cbs   HaskSplit.Conduit.Split 317        3     3.5    4.2      3.5    4.2
   idxSinkFile.sf    HaskSplit.Conduit.Split 307        3    24.9   21.9     24.9   21.9
  sinkMultiFiles.len HaskSplit.Conduit.Split 291        1     0.0    0.0      0.0    0.0
This shows sinkFile taking a lot of time. (In case you're wondering, I've benchmarked the list accesses and the like; they account for 0% of the processing time.)
While I understand for a small program like this IO is often the bottleneck, I'd like to see if I can improve the runtime performance of my program.
Cheers!
Following nh2's advice, I decided to pack the ByteStrings into 256-byte chunks instead of calling BS.singleton on each Word8:
cbs = CL.sequence (CL.take 256) =$= CC.map BS.pack
instead of
cbs = CC.map BS.singleton
and I'm able to reduce the running time as well as the memory usage quite significantly, as demonstrated below:
Original Run
total time = 194.37 secs (194367 ticks @ 1000 us, 1 processor)
total alloc = 102,021,859,892 bytes (excludes profiling overheads)
New Run, with CL.take
total time = 35.88 secs (35879 ticks @ 1000 us, 1 processor)
total alloc = 21,970,152,800 bytes (excludes profiling overheads)
That's some serious improvement! I'd like to optimize it more but that's for another question :)
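For context, here is how that replacement slots into the question's idxSinkFile (my own assembly of the two snippets above, assuming CL is Data.Conduit.List imported qualified; otherwise untested):
idxSinkFile outFileNames shareNumber =
  let ccm = CC.concatMap $ flip atMay shareNumber
      -- pack 256 bytes per ByteString instead of one ByteString per byte
      cbs = CL.sequence (CL.take 256) =$= CC.map BS.pack
      sf  = sinkFile (outFileNames !! shareNumber)
  in ccm =$= cbs =$= sf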
