Why do I need so much memory for WriterT State? - haskell

Trying to get to grips with the concepts I am trying to solve an exercise in Haskell using WriterT and State (it's advent of code day 15). For some reason I do not understand I end up using loads of memory and my notebook (just 4G Ram) comes to a halt.
My first idea was to use strictness and sprinkle bangs around - but the issue persists.
Could someone explain me where I did go wrong?
Here's cleaned up code:
{-# LANGUAGE BangPatterns #-}
module Main where
import Control.Monad.State.Strict
import Control.Monad.Writer.Strict
main = do
let generators = (Generator 65 16807, Generator 8921 48271)
res1 = compute generators (4*10^7)
putStrLn "Answer 1"
print res1
data Generator = Generator { _value :: Int
, _factor :: Int
}
deriving Show
newtype Value = Value Int
deriving (Show, Eq)
newtype Counter = Counter Int
deriving (Show, Eq)
instance Monoid Counter where
mempty = Counter 0
mappend (Counter !a) (Counter !b) = Counter (a+b)
generate :: Generator -> (Value, Generator)
generate (Generator v f) = (Value newval, Generator newval f)
where newval = (v * f) `mod` 2147483647
agree (Value a) (Value b) = (a `mod` mf) == (b `mod` mf)
where mf = 2^16
oneComp :: State (Generator, Generator) Bool
oneComp = do
(!ga, !gb) <- get
let (va, gan) = generate ga
(vb, gbn) = generate gb
!ag = agree va vb
put (gan, gbn)
pure ag
counterStep :: WriterT Counter (State (Generator, Generator)) ()
counterStep = do
!ag <- lift oneComp
when ag $ tell (Counter 1)
afterN :: Int -> WriterT Counter (State (Generator, Generator)) ()
afterN n = replicateM_ n counterStep
compute s0 n = evalState (execWriterT (afterN n)) s0
I compile it with stack. The entry in the cabal file is:
executable day15
hs-source-dirs: app
main-is: day15.hs
ghc-options: -threaded -rtsopts -with-rtsopts=-N
build-depends: base
, advent
, hspec
, mtl
default-language: Haskell2010
update
I had a little more time and followed the suggestion to make Generator strict. However still something is using too much memory.
Here's the part of the prof file that I think may be relevant.
Fri Dec 15 16:28 2017 Time and Allocation Profiling Report (Final)
day15 +RTS -N -p -RTS
total time = 71.66 secs (71662 ticks # 1000 us, 1 processor)
total alloc = 17,600,423,088 bytes (excludes profiling overheads)
COST CENTRE MODULE SRC %time %alloc
afterN Main app/day15.hs:79:1-36 41.1 20.0
mappend Main app/day15.hs:51:3-51 31.0 3.6
oneComp Main app/day15.hs:(64,1)-(71,9) 9.2 49.1
generate.(...) Main app/day15.hs:55:9-42 8.5 14.5

The cause is likely to be the WriterT layer.
Even the "strict" WriterT is completely lazy in the accumulator —it is strict in another sense unrelated to the accumulator.
For example, this program runs without errors:
import Data.Monoid
import Control.Monad.Trans.Writer
import Control.Exception
main :: IO ()
main = do
let (x,_) = runWriter $ do
tell $ Sum (1::Float)
tell (throw $ AssertionFailed "oops")
tell (error "oops")
tell undefined
tell (let z = z in z)
return True
print x
Furthermore, it is impossible to "strictify" the accumulator from within WriterT, because there's no way to get to it.
For long computations, thunks will accumulate and consume a lot of memory.
One solution is to store the counter in a StateT layer instead. The strict modify' function is helpful here.
Using StateT for an append-only accumulator is a bit unsatisfactory though. Another option is to use Accum with judiciously positioned BangPatterns. This program throws an error:
import Control.Monad.Trans.Accum
main :: IO ()
main = do
let (x,_) = flip runAccum mempty $ do
add $ Sum (1::Float)
add $ error "oops"
!_ <- look
return True
print x
Accum is like a Writer that lets you access the accumulator. It doesn't let you change it at will, only add to it. But getting hold of it is enough to introduce strictness.

Related

Heap is full of PINNED

I have a small program that have reasonable maximum residency but allocates linearly. At first, I thought that should be cons cells or I#, but running the program with -p -hc shows heap overwhelmed by PINNED. Does anyone understand the reason and/or can suggest an improvement?
The program
-- task27.hs
{-# LANGUAGE FlexibleContexts #-}
import Control.Monad
import Control.Monad.ST
import Control.Exception
import System.Random
import Data.Functor
import qualified Data.Vector.Generic.Mutable as V
import qualified Data.Vector.Unboxed as U
m = 120
task27 :: [Int] -> (Int, Int)
task27 l = runST $ do
r <- V.replicate m 0 :: ST s (U.MVector s Int)
let go [] = return (1,2)
go (a:as) = do
let p = a `mod` m
cur_lead <- r `V.read` p
when (a > cur_lead) (V.write r p a)
go as
go l
randTest ::
Int -> -- Length of random testing sequence
IO ()
randTest n =
newStdGen <&>
randoms <&>
take n <&>
task27 >>=
print
main = randTest 1000000
My package.yaml:
name: task27
dependencies:
- base == 4.*
- vector
- random
executables:
task27:
main: task27.hs
ghc-options: -O2
My cabal.project.local:
profiling: True
I do cabal -v0 run task27 -- +RTS -p -hc && hp2ps -e8in -c task27.hp and get this:
I tried to add bangs here and there but that did not seem to help.
As #WillemVanOnsem says, in GHC terms, 35kB resident is miniscule. Whatever performance issue you have, it's got nothing to do with this tiny bit of pinned memory. Originally, I said that this was probably the Vectors, but that's wrong. Data.Text uses pinned memory, but Data.Vector doesn't. This bit of PINNED memory looks like it's actually from the runtime system itself, so you can ignore it (see below).
In GHC code, "total allocation" is a measure of processing. A GHC program is an allocation engine. If it's not allocating, it's probably not doing anything (with rare exceptions). So, if you expect your algorithm to run in O(n) time, then it will also be O(n) in total allocation, usually gigabytes worth.
With respect to the "rare exceptions", a GHC program can run in constant "total allocation" but non-constant time if aggressive optimization allows computations using fully unboxed values. So, for example:
main = print (sum [1..10000000] :: Int)
runs in constant total allocation (e.g., 50kB allocated on the heap), because the Ints can be unboxed. For comparison,
main = print (sum [1..10000000] :: Integer)
runs with O(n) total allocation (e.g., 320MB allocated on the heap). By the way, try profiling this last program (and bump the count up until it runs long enough to generate a few seconds of profile data). You'll see that it uses the same amount of PINNED memory as your program, and the amount doesn't really change with the upper limit. So, this is just runtime system overhead.
Back to your example... If you are concerned about performance, the culprit is probably System.Random. This is an EXTREMELY slow random number generator. If I run your program with n = 10000000, it takes 4secs. If I replace the random number generator with a simple LCG:
randoms :: Word32 -> [Word32]
randoms seed = tail $ iterate lcg seed
where lcg x = (a * x + c)
a = 1664525
c = 1013904223
it runs in 0.2secs, so 20 times faster.

In GHC- Why is the lazy version of this small program so much faster than the loop based variant?

These two programs do the same thing, but one runs 10x faster.
This takes approx. 10 seconds on my machine:
import Control.Monad
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL
theValueOne=B.singleton 1
main = replicateM_ 100000000 $ B.putStr theValueOne
The second version uses output-lazy IO. It is done in about 1 second (as fast as c):
import qualified Data.ByteString.Lazy as BL
main = BL.putStr $ BL.pack $ replicate 100000000 1
Question: Why is the non-lazy version so slow? More importantly, how can I make it fast? (I've tried recursion, forM, modifying the output buffer using hSetBuffering... Nothing has made a difference)
Note- This is more than just an academic question. The non-lazy version is an extremely simplified version of an executable my company uses in production, which is also slow in the same way. It would be nearly impossible to re-architect the larger program around the analogous lazy solution.
Updated: Added possible source of problem and a solution.
I don't think it has anything to do with lazy I/O. If you rewrite the strict I/O version to write two bytes at once:
theValueOne = B.singleton 1
main = replicateM_ 50000000 $ B.putStr (theValueOne <> theValueOne)
that halves the time. Write ten bytes at once:
theValueOne = B.singleton 1
main = replicateM_ 10000000 $ B.putStr (foldMap id (replicate 10 theValueOne))
and it's already faster than the lazy I/O version.
The issue is that there's a bit of overhead in a B.hPutStr call, much more than the overhead of a C fwrite call, and it's just not a particularly efficient way to write a single byte.
A good chunk of the overhead comes from the fact that Haskell I/O buffers have immutable metadata. Even though the buffer content itself is mutable, the pointers to valid data within the buffer are immutable, and so writing a single byte requires a heap allocation of a new GHC.IO.Buffer.Buffer structure which GHC can't optimize away
One solution is to use a hand-crafted buffering structure with a mutable pointer. The following works, and it's about twice as fast as the lazy I/O version in the original question.
{-# LANGUAGE RecordWildCards #-}
{-# OPTIONS_GHC -Wall #-}
import Control.Monad
import Data.IORef
import Data.Word
import Foreign.ForeignPtr
import Foreign.Ptr
import Foreign.Storable
import System.IO
data WriteBuffer = WriteBuffer
{ handle :: !Handle
, capacity :: !Int
, used :: !(IORef Int)
, content :: !(ForeignPtr Word8)
}
newBuffer :: Handle -> IO WriteBuffer
newBuffer h = do
hSetBinaryMode h True
hSetBuffering h NoBuffering
WriteBuffer h cap <$> newIORef 0 <*> mallocForeignPtrBytes cap
where cap = 4096
flushBuffer :: WriteBuffer -> IO ()
flushBuffer WriteBuffer{..} = do
n <- readIORef used
withForeignPtr content $ \p -> hPutBuf handle p n
writeIORef used 0
writeByte :: Word8 -> WriteBuffer -> IO ()
writeByte w buf#(WriteBuffer{..}) = do
n <- readIORef used
withForeignPtr content $ \p -> poke (plusPtr p n) w
let n' = n + 1
writeIORef used n'
when (n' == capacity) $
flushBuffer buf
main :: IO ()
main = do
b <- newBuffer stdout
replicateM_ 100000000 (writeByte 1 b)
flushBuffer b
Someone ironically, converting this to a version using an immutable counter and passing the WriteBuffer as state through foldM doubles the speed again, so it's about 4 times as fast as the lazy I/O version in the original question:
{-# LANGUAGE RecordWildCards #-}
{-# OPTIONS_GHC -Wall #-}
import Control.Monad
import Data.Word
import Foreign.ForeignPtr
import Foreign.Ptr
import Foreign.Storable
import System.IO
data WriteBuffer = WriteBuffer
{ handle :: !Handle
, capacity :: !Int
, used :: !Int
, content :: !(ForeignPtr Word8)
}
newBuffer :: Handle -> IO WriteBuffer
newBuffer h = do
hSetBinaryMode h True
hSetBuffering h NoBuffering
WriteBuffer h cap 0 <$> mallocForeignPtrBytes cap
where cap = 4096
flushBuffer :: WriteBuffer -> IO WriteBuffer
flushBuffer buf#WriteBuffer{..} = do
withForeignPtr content $ \p -> hPutBuf handle p used
return $ buf { used = 0 }
writeByte :: Word8 -> WriteBuffer -> IO WriteBuffer
writeByte w buf#(WriteBuffer{..}) = do
withForeignPtr content $ \p -> poke (plusPtr p used) w
let used' = used + 1
buf' = buf { used = used' }
if (used' == capacity)
then flushBuffer buf'
else return buf'
main :: IO ()
main = do
b <- newBuffer stdout
b' <- foldM (\s _ -> writeByte 1 s) b [(1::Int)..100000000]
void (flushBuffer b')
The reason this one is so fast seems to be that GHC is able to optimize away the WriteBuffer constructor entirely from the fold and just pass around unboxed pointers and integers in the loop. My guess is that if I modified the mutable version above to avoid boxing and unboxing the integer in the used IORef, it would be similarly fast.

Haskell - Monad Transformers - Limit number of evaluations in an interpreter

I'm learning Monad Transformers and decided to write an interpreter for a simple language(with loop constructs) similar to Brainfuck using Monad Transformers. I would like to terminate the interpreter after certain number of statements.
This simple language is made of single memory cell capable of holding an Int and 5 instructions Input, Output, Increment, Decrement and Loop. A loop terminates when value in the memory is zero. Input is read from a list and similarly output is written to another list. Increment and Decrement does +1 and -1 to memory correspondingly.
I'm using World type to keep track of input, output (streams) and memory, Sum Int to count number of instructions evaluated. Except World to terminate evaluation after certain statements.
module Transformers where
import qualified Data.Map as Map
import Data.Maybe
import Control.Monad.State.Lazy
import Control.Monad.Writer.Lazy
import Control.Monad.Except
data Term = Input
| Output
| Increment
| Decrement
| Loop [Term]
deriving (Show)
data World = World {
inp :: [Int],
out :: [Int],
mem :: Int
} deriving Show
op_limit = 5
loop
:: [Term]
-> StateT World (WriterT (Sum Int) (Except World)) ()
-> StateT World (WriterT (Sum Int) (Except World)) ()
loop terms sp = sp >> do
s <- get
if mem s == 0 then put s else loop terms (foldM (\_ t -> eval t) () terms)
limit :: StateT World (WriterT (Sum Int) (Except World)) ()
limit = do
(s, count) <- listen get
when (count >= op_limit) $ throwError s
tick :: StateT World (WriterT (Sum Int) (Except World)) ()
tick = tell 1
eval :: Term -> StateT World (WriterT (Sum Int) (Except World)) ()
eval Input =
limit >> tick >> modify (\s -> s { inp = tail (inp s), mem = head (inp s) })
eval Output = limit >> tick >> modify (\s -> s { out = mem s : out s })
eval Increment = limit >> tick >> modify (\s -> s { mem = mem s + 1 })
eval Decrement = limit >> tick >> modify (\s -> s { mem = mem s - 1 })
eval (Loop terms) = loop terms (void get)
type Instructions = [Term]
interp :: Instructions -> World -> Either World (World, Sum Int)
interp insts w =
let sp = foldM (\_ inst -> eval inst) () insts
in runExcept (runWriterT (execStateT sp w))
Example run in ghci:
*Transformers> interp [Loop [Output, Decrement]] $ World [] [] 5
Right (World {inp = [], out = [1,2,3,4,5], mem = 0},Sum {getSum = 10})
The monad limit based on count and should decide to either Fail with current state or do nothing. But I noticed that count in (s, count) <- listen get is always zero. I don't understand why is this happening. Please help me understand where I went wrong.
Is my ordering of transformers in the stack correct? Are there any rules (informal) to decide the layering?
Computations inside the Writer monad can't have access to their own accumulator. What's more: the accumulator is never forced while the computation runs, not even to WHNF. This applies to both the strict and lazy variants of Writer—the strict variant is strict in a sense unrelated to the accumulator. This unavoidable laziness in the accumulator can be a source of space leaks if the computation runs for too long.
Your limit function is not branching on the value of the "mainline" WriterT accumulator. The get action (you are using mtl) simply reads the state from the StateT layer, and performs no effects in the other layers: it adds mempty to its WriterT accumulator an throws no error.
Then, the listen extracts the Writer accumulator of the get action (only of the get, not of the whole computation) and adds it to the "mainline" accumulator. But this extracted value (the one returned in the tuple) will always be mempty, that is, Sum 0!
Instead of WriterT, you could put the counter in the StateT state, as #chi has mentioned. You could also use AccumT, which is very similar to WriterT but lets you inspect the accumulator (it also lets you force it to WHNF using bang patterns).
AccumT doesn't seem to have a corresponding mtl typeclass though, so you'll need to sprinkle a few lifts in order to use it.

Efficient bitstreams in Haskell

In an ongoing endeavour to efficiently fiddle with bits (e.g. see this SO question) the newest challenge is the efficient streaming and consumption of bits.
As a first simple task I choose to find the longest sequence of identical bits in a bitstream generated by /dev/urandom. A typical incantation would be head -c 1000000 </dev/urandom | my-exe. The actual goal is to stream bits and decode an Elias gamma code, for example, i.e. codes that are not chunks of bytes or multiples thereof.
For such codes of variable length it is nice to have the take, takeWhile, group, etc. language for list manipulation. Since a BitStream.take would actually consume part of the bistream some monad would probably come into play.
The obvious starting point is the lazy bytestring from Data.ByteString.Lazy.
A. Counting bytes
This very simple Haskell program performs on par with a C program, as is to be expected.
import qualified Data.ByteString.Lazy as BSL
main :: IO ()
main = do
bs <- BSL.getContents
print $ BSL.length bs
B. Adding bytes
Once I start using unpack things should get worse.
main = do
bs <- BSL.getContents
print $ sum $ BSL.unpack bs
Suprisingly, Haskell and C show the almost same performance.
C. Longest sequence of identical bits
As a first nontrivial task the longest sequence of identical bits can be found like this:
module Main where
import Data.Bits (shiftR, (.&.))
import qualified Data.ByteString.Lazy as BSL
import Data.List (group)
import Data.Word8 (Word8)
splitByte :: Word8 -> [Bool]
splitByte w = Prelude.map (\i-> (w `shiftR` i) .&. 1 == 1) [0..7]
bitStream :: BSL.ByteString -> [Bool]
bitStream bs = concat $ map splitByte (BSL.unpack bs)
main :: IO ()
main = do
bs <- BSL.getContents
print $ maximum $ length <$> (group $ bitStream bs)
The lazy bytestring is converted to a list [Word8] and then, using shifts, each Word is split into the bits, resulting in a list [Bool]. This list of lists is then flattened with concat. Having obtained a (lazy) list of Bool, use group to split the list into sequences of identical bits and then map length over it. Finally maximum gives the desired result. Quite simple, but not very fast:
# C
real 0m0.606s
# Haskell
real 0m6.062s
This naive implementation is exactly one order of magnitude slower.
Profiling shows that quite a lot of memory gets allocated (about 3GB for parsing 1MB of input). There is no massive space leak to be observed, though.
From here I start poking around:
There is a bitstream package that promises "Fast, packed, strict bit streams (i.e. list of Bools) with semi-automatic stream fusion.". Unfortunately it is not up-to-date with the current vector package, see here for details.
Next, I investigate streaming. I don't quite see why I should need 'effectful' streaming that brings some monad into play - at least until I start with the reverse of the posed task, i.e. encoding and writing bitstreams to file.
How about just fold-ing over the ByteString? I'd have to introduce state to keep track of consumed bits. That's not quite the nice take, takeWhile, group, etc. language that is desirable.
And now I'm not quite sure where to go.
Update:
I figured out how to do this with streaming and streaming-bytestring. I'm probably not doing this right because the result is catastrophically bad.
import Data.Bits (shiftR, (.&.))
import qualified Data.ByteString.Streaming as BSS
import Data.Word8 (Word8)
import qualified Streaming as S
import Streaming.Prelude (Of, Stream)
import qualified Streaming.Prelude as S
splitByte :: Word8 -> [Bool]
splitByte w = (\i-> (w `shiftR` i) .&. 1 == 1) <$> [0..7]
bitStream :: Monad m => Stream (Of Word8) m () -> Stream (Of Bool) m ()
bitStream s = S.concat $ S.map splitByte s
main :: IO ()
main = do
let bs = BSS.unpack BSS.getContents :: Stream (Of Word8) IO ()
gs = S.group $ bitStream bs :: Stream (Stream (Of Bool) IO) IO ()
maxLen <- S.maximum $ S.mapped S.length gs
print $ S.fst' maxLen
This will test your patience with anything beyond a few thousand bytes of input from stdin. The profiler says it spends an insane amount of time (quadratic in the input size) in Streaming.Internal.>>=.loop and Data.Functor.Of.fmap. I'm not quite sure what the first one is, but the fmap indicates (?) that the juggling of these Of a b isn't doing us any good and because we're in the IO monad it can't be optimised away.
I also have the streaming equivalent of the byte adder here: SumBytesStream.hs, which is slightly slower than the simple lazy ByteString implementation, but still decent. Since streaming-bytestring is proclaimed to be "bytestring io done right" I expected better. I'm probably not doing it right, then.
In any case, all these bit-computations shouldn't be happening in the IO monad. But BSS.getContents forces me into the IO monad because getContents :: MonadIO m => ByteString m () and there's no way out.
Update 2
Following the advice of #dfeuer I used the streaming package at master#HEAD. Here's the result.
longest-seq-c 0m0.747s (C)
longest-seq 0m8.190s (Haskell ByteString)
longest-seq-stream 0m13.946s (Haskell streaming-bytestring)
The O(n^2) problem of Streaming.concat is solved, but we're still not getting closer to the C benchmark.
Update 3
Cirdec's solution produces a performance on par with C. The construct that is used is called "Church encoded lists", see this SO answer or the Haskell Wiki on rank-N types.
Source files:
All the source files can be found on github. The Makefile has all the various targets to run the experiments and the profiling. The default make will just build everything (create a bin/ directory first!) and then make time will do the timing on the longest-seq executables. The C executables get a -c appended to distinguish them.
Intermediate allocations and their corresponding overhead can be removed when operations on streams fuse together. The GHC prelude provides foldr/build fusion for lazy streams in the form of rewrite rules. The general idea is if one function produces a result that looks like a foldr (it has the type (a -> b -> b) -> b -> b applied to (:) and []) and another function consumes a list that looks like a foldr, constructing the intermediate list can be removed.
For your problem I'm going to build something similar, but using strict left folds (foldl') instead of foldr. Instead of using rewrite rules that try to detect when something looks like a foldl, I'll use a data type that forces lists to look like left folds.
-- A list encoded as a strict left fold.
newtype ListS a = ListS {build :: forall b. (b -> a -> b) -> b -> b}
Since I've started by abandoning lists we'll be re-implementing part of the prelude for lists.
Strict left folds can be created from the foldl' functions of both lists and bytestrings.
{-# INLINE fromList #-}
fromList :: [a] -> ListS a
fromList l = ListS (\c z -> foldl' c z l)
{-# INLINE fromBS #-}
fromBS :: BSL.ByteString -> ListS Word8
fromBS l = ListS (\c z -> BSL.foldl' c z l)
The simplest example of using one is to find the length of a list.
{-# INLINE length' #-}
length' :: ListS a -> Int
length' l = build l (\z a -> z+1) 0
We can also map and concatenate left folds.
{-# INLINE map' #-}
-- fmap renamed so it can be inlined
map' f l = ListS (\c z -> build l (\z a -> c z (f a)) z)
{-# INLINE concat' #-}
concat' :: ListS (ListS a) -> ListS a
concat' ll = ListS (\c z -> build ll (\z l -> build l c z) z)
For your problem we need to be able to split a word into bits.
{-# INLINE splitByte #-}
splitByte :: Word8 -> [Bool]
splitByte w = Prelude.map (\i-> (w `shiftR` i) .&. 1 == 1) [0..7]
{-# INLINE splitByte' #-}
splitByte' :: Word8 -> ListS Bool
splitByte' = fromList . splitByte
And a ByteString into bits
{-# INLINE bitStream' #-}
bitStream' :: BSL.ByteString -> ListS Bool
bitStream' = concat' . map' splitByte' . fromBS
To find the longest run we'll keep track of the previous value, the length of the current run, and the length of the longest run. We make the fields strict so that the strictness of the fold prevents chains of thunks from being accumulated in memory. Making a strict data type for a state is an easy way to get control over both its memory representation and when its fields are evaluated.
data LongestRun = LongestRun !Bool !Int !Int
{-# INLINE extendRun #-}
extendRun (LongestRun previous run longest) x = LongestRun x current (max current longest)
where
current = if x == previous then run + 1 else 1
{-# INLINE longestRun #-}
longestRun :: ListS Bool -> Int
longestRun l = longest
where
(LongestRun _ _ longest) = build l extendRun (LongestRun False 0 0)
And we're done
main :: IO ()
main = do
bs <- BSL.getContents
print $ longestRun $ bitStream' bs
This is much faster, but not quite the performance of c.
longest-seq-c 0m00.12s (C)
longest-seq 0m08.65s (Haskell ByteString)
longest-seq-fuse 0m00.81s (Haskell ByteString fused)
The program allocates about 1 Mb to read 1000000 bytes from input.
total alloc = 1,173,104 bytes (excludes profiling overheads)
Updated github code
I found another solution that is on par with C. The Data.Vector.Fusion.Stream.Monadic has a stream implementation based on this Coutts, Leshchinskiy, Stewart 2007 paper. The idea behind it is to use a destroy/unfoldr stream fusion.
Recall that list's unfoldr :: (b -> Maybe (a, b)) -> b -> [a] creates a list by repeatedly applying (unfolding) a step-forward function, starting with an initial value. A Stream is just an unfoldr function with starting state. (The Data.Vector.Fusion.Stream.Monadic library uses GADTs to create constructors for Step that can be pattern-matched conveniently. It could just as well be done without GADTs, I think.)
The central piece of the solution is the mkBitstream :: BSL.ByteString -> Stream Bool function that turns a BytesString into a stream of Bool. Basically, we keep track of the current ByteString, the current byte, and how much of the current byte is still unconsumed. Whenever a byte is used up another byte is chopped off ByteString. When Nothing is left, the stream is Done.
The longestRun function is taken straight from #Cirdec's solution.
Here's the etude:
{-# LANGUAGE CPP #-}
#define PHASE_FUSED [1]
#define PHASE_INNER [0]
#define INLINE_FUSED INLINE PHASE_FUSED
#define INLINE_INNER INLINE PHASE_INNER
module Main where
import Control.Monad.Identity (Identity)
import Data.Bits (shiftR, (.&.))
import qualified Data.ByteString.Lazy as BSL
import Data.Functor.Identity (runIdentity)
import qualified Data.Vector.Fusion.Stream.Monadic as S
import Data.Word8 (Word8)
type Stream a = S.Stream Identity a -- no need for any monad, really
data Step = Step BSL.ByteString !Word8 !Word8 -- could use tuples, but this is faster
mkBitstream :: BSL.ByteString -> Stream Bool
mkBitstream bs' = S.Stream step (Step bs' 0 0) where
{-# INLINE_INNER step #-}
step (Step bs w n) | n==0 = case (BSL.uncons bs) of
Nothing -> return S.Done
Just (w', bs') -> return $
S.Yield (w' .&. 1 == 1) (Step bs' (w' `shiftR` 1) 7)
| otherwise = return $
S.Yield (w .&. 1 == 1) (Step bs (w `shiftR` 1) (n-1))
data LongestRun = LongestRun !Bool !Int !Int
{-# INLINE extendRun #-}
extendRun :: LongestRun -> Bool -> LongestRun
extendRun (LongestRun previous run longest) x = LongestRun x current (max current longest)
where current = if x == previous then run + 1 else 1
{-# INLINE longestRun #-}
longestRun :: Stream Bool -> Int
longestRun s = runIdentity $ do
(LongestRun _ _ longest) <- S.foldl' extendRun (LongestRun False 0 0) s
return longest
main :: IO ()
main = do
bs <- BSL.getContents
print $ longestRun (mkBitstream bs)

How do I avoid memory problems when writing to file using the Writer monad?

I am building some moderately large DIMACS files, however with the method used below the memory usage is rather large compared to the size of the files generated, and on some of the larger files I need to generate I run in to out of memory problems.
import Control.Monad.State.Strict
import Control.Monad.Writer.Strict
import qualified Data.ByteString.Lazy.Char8 as B
import Control.Monad
import qualified Text.Show.ByteString as BS
import Data.List
main = printDIMACS "test.cnf" test
test = do
xs <- freshs 100000
forM_ (zip xs (tail xs))
(\(x,y) -> addAll [[negate x, negate y],[x,y]])
type Var = Int
type Clause = [Var]
data DIMACSS = DS{
nextFresh :: Int,
numClauses :: Int
} deriving (Show)
type DIMACSM a = StateT DIMACSS (Writer B.ByteString) a
freshs :: Int -> DIMACSM [Var]
freshs i = do
next <- gets nextFresh
let toRet = [next..next+i-1]
modify (\s -> s{nextFresh = next+i})
return toRet
fresh :: DIMACSM Int
fresh = do
i <- gets nextFresh
modify (\s -> s{nextFresh = i+1})
return i
addAll :: [Clause] -> DIMACSM ()
addAll c = do
tell
(B.concat .
intersperse (B.pack " 0\n") .
map (B.unwords . map BS.show) $ c)
tell (B.pack " 0\n")
modify (\s -> s{numClauses = numClauses s + length c})
add h = addAll [h]
printDIMACS :: FilePath -> DIMACSM a -> IO ()
printDIMACS file f = do
writeFile file ""
appendFile file (concat ["p cnf ", show i, " ", show j, "\n"])
B.appendFile file b
where
(s,b) = runWriter (execStateT f (DS 1 0))
i = nextFresh s - 1
j = numClauses s
I would like to keep the monadic building of clauses since it is very handy, but I need to overcome the memory problem. How do I optimize the above program so that it doesn't use too much memory?
If you want good memory behavior, you need to make sure that you write out the clauses as you generate them, instead of collecting them in memory and dumping them as such, either using lazyness or a more explicit approach such as conduits, enumerators, pipes or the like.
The main obstacle to that approach is that the DIMACS format expects the number of clauses and variables in the header. This prevents the naive implementation from being sufficiently lazy. There are two possibilities:
The pragmatic one is to write the clauses first to a temporary location. After that the numbers are known, so you write them to the real file and append the contents of the temporary file.
The prettier approach is possible if the generation of clauses has no side effects (besides the effects offered by your DIMACSM monad) and is sufficiently fast: Run it twice, first throwing away the clauses and just calculating the numbers, print the header line, run the generator again; now printing the clauses.
(This is from my experience with implementing SAT-Britney, where I took the second approach, because it fitted better with other requirements in that context.)
Also, in your code, addAll is not lazy enough: The list c needs to be retained even after writing (in the MonadWriter sense) the clauses. This is another space leak. I suggest you implement add as the primitive operation and then addAll = mapM_ add.
As explained in Joachim Breitner's answer the problem was that DIMACSM was not lazy enough, both because the strict versions of the monads was used and because the number of variables and clauses are needed before the ByteString can be written to the file. The solution is to use the lazy versions of the Monads and execute them twice. It turns out that it is also necessary to have WriterT be the outer monad:
import Control.Monad.State
import Control.Monad.Writer
...
type DIMACSM a = WriterT B.ByteString (State DIMACSS) a
...
printDIMACS :: FilePath -> DIMACSM a -> IO ()
printDIMACS file f = do
writeFile file ""
appendFile file (concat ["p cnf ", show i, " ", show j, "\n"])
B.appendFile file b
where
s = execState (execWriterT f) (DS 1 0)
b = evalState (execWriterT f) (DS 1 0)
i = nextFresh s - 1
j = numClauses s

Resources