Sequential Binary Data Decoding Using Conduits - haskell

The goal is to have a conduit with the following type signature
protobufConduit :: MonadResource m => (ByteString -> a) -> Conduit ByteString m a
The conduit should repeatedly parse protocol buffers (using the ByteString -> a function) received via TCP/IP (using the network-conduit package).
The wire message format is
{length (32 bits big endian)}{protobuf 1}{length}{protobuf 2}...
(The curly braces are not part of the protocol; they are only used here to separate the entities.)
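For concreteness, here is a minimal sketch of the sender-side framing using Data.Binary.Put (this is not part of the original setup, just an illustration of the wire format; frame is a made-up name):

import qualified Data.Binary.Put as P
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as BSL

-- Prefix one serialized message with its length as a 32-bit big-endian word.
frame :: BS.ByteString -> BSL.ByteString
frame msg = P.runPut $ do
    P.putWord32be (fromIntegral (BS.length msg))
    P.putByteString msg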
The first idea was to use sequenceSink to repeatedly apply a Sink that is able to parse one ProtoBuf:
[...]
import qualified Data.Binary as B
import qualified Data.Conduit.Binary as CB
import qualified Data.Conduit.Util as CU
protobufConduit :: MonadResource m => (ByteString -> a) -> Conduit ByteString m a
protobufConduit protobufDecode =
    CU.sequenceSink () $ \() ->
        do lengthBytes <- CB.take 4                              -- read protobuf length
           let len :: Word32
               len = B.decode lengthBytes                        -- decode ProtoBuf length
               intLen = fromIntegral len
           protobufBytes <- CB.take intLen                       -- read the ProtoBuf bytes
           return $ CU.Emit () [ protobufDecode protobufBytes ]  -- emit decoded ProtoBuf
It doesn't work (it only parses the first protocol buffer) because there seem to be a number of "leftover" bytes, already read from the source but not consumed via CB.take, that get discarded.
And I found no way of pushing "the rest back into the source".
Did I get the concept entirely wrong?
PS: Even if I use protocol buffers here, the problem is not related to protocol buffers. To debug the problem I always use {length}{UTF8 encoded string}{length}{UTF8 encoded string}... and a conduit similar to the above one (utf8StringConduit :: MonadResource m => Conduit ByteString m Text).
Update:
I just tried to replace the state (the unit state () in the sample above) with the remaining bytes, and replaced the CB.take calls with calls to a function that first consumes the already-read bytes (from the state) and calls await only as needed (when the state does not hold enough bytes). Unfortunately, that doesn't work either: as soon as the Source has no bytes left, sequenceSink no longer executes the code, yet the state still contains the remaining bytes :-(.
In case you're interested in the code (it isn't optimized or particularly good, but should be enough to test):
utf8StringConduit :: forall m. MonadResource m => Conduit ByteString m Text
utf8StringConduit =
    CU.sequenceSink [] $ \st ->
        do (lengthBytes, st') <- takeWithState BS.empty st 4
           let len :: Word32
               len = B.decode $ BSL.fromChunks [lengthBytes]
               intLength = fromIntegral len
           (textBytes, st'') <- takeWithState BS.empty st' intLength
           return $ CU.Emit st'' [ TE.decodeUtf8 textBytes ]
takeWithState :: Monad m
              => ByteString
              -> [ByteString]
              -> Int
              -> Pipe l ByteString o u m (ByteString, [ByteString])
takeWithState acc state 0 = return (acc, state)
takeWithState acc state neededLen =
    let stateLenSum = foldl' (+) 0 $ map BS.length state
    in if stateLenSum >= neededLen
       then do let (firstChunk:state') = state
                   (neededChunk, pushBack) = BS.splitAt neededLen firstChunk
                   acc' = acc `BS.append` neededChunk
                   neededLen' = neededLen - BS.length neededChunk
                   state'' = if BS.null pushBack
                                then state'
                                else pushBack:state'
               takeWithState acc' state'' neededLen'
       else do aM <- await
               case aM of
                 Just a  -> takeWithState acc (state ++ [a]) neededLen
                 Nothing -> error "to be fixed later"

For protocol buffer parsing and serializing we use messageWithLengthPutM and messageWithLengthGetM (see below), but I assume they use a varint encoding for the length, which is not what you need. I'd probably try to adapt our implementation below by replacing the messageWithLength Get/Put with something like
myMessageWithLengthGetM =
    do size <- getWord32be
       getMessageWithSize size
but I have no idea how to implement the getMessageWithSize using the available functions from the protocol buffer package. On the other hand you could just getByteString and then "reparse" the bytestring.
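A sketch of that last idea: read the big-endian length, grab that many bytes, and reparse them with the bare (unprefixed) message parser. I'm assuming here that the package's Get monad offers binary-style getWord32be/getByteString and that messageGetM parses a message without a length prefix; check the actual exports before relying on this:

myMessageWithLengthGetM :: (ReflectDescriptor w, Wire w) => Get w
myMessageWithLengthGetM =
    do size  <- getWord32be                        -- 32-bit big-endian length prefix
       bytes <- getByteString (fromIntegral size)  -- raw bytes of one message
       case runGet messageGetM (BSL.fromChunks [bytes]) of  -- "reparse"
         Finished _ _ msg -> return msg
         Failed _ errmsg  -> fail errmsg
         Partial _        -> fail "truncated message"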
Regarding conduits: Have you tried implementing the conduit without Data.Conduit.Util? Something like
protobufConduit protobufDecode = loop
  where
    loop =
        do len <- liftM convertLen (CB.take 4)
           bs  <- CB.take len
           yield (protobufDecode bs)
           loop
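(convertLen is not defined above; a minimal version, assuming the length is the 32-bit big-endian word from the question and that CB.take returns a lazy ByteString:)

convertLen :: BSL.ByteString -> Int
convertLen = fromIntegral . (B.decode :: BSL.ByteString -> Word32)

Note that this loop never checks for end of input; real code would also want to handle CB.take returning fewer bytes than requested.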
Here's the code we use:
pbufSerialize :: (ReflectDescriptor w, Wire w) => Conduit w IO ByteString
pbufSerialize = awaitForever f
  where f pb = M.mapM_ yield $ BSL.toChunks $ runPut (messageWithLengthPutM pb)
pbufParse :: (ReflectDescriptor w, Wire w, Show w) => Conduit ByteString IO w
pbufParse = new
  where
    new = read (runGet messageWithLengthGetM . BSL.fromChunks . (:[]))
    read parse =
        do mbs <- await
           case mbs of
             Just bs -> checkResult (parse bs)
             Nothing -> return ()
    checkResult result =
        case result of
          Failed _ errmsg -> fail errmsg
          Partial cont    -> read (cont . Just . BSL.fromChunks . (:[]))
          Finished rest _ msg ->
              do yield msg
                 checkResult (runGet messageWithLengthGetM rest)

Related

Haskell based file streaming causes memory leak

I'm fairly new to Haskell and I'm working on an existing code base that collects files from file shares. To parallelize the processing of the file shares, Conduit is used. The scaffold is based on this tutorial. To continuously read the file share, I added the delay and a recursive call to the streamFile function. I am not sure if this is the problem, but the memory allocation increases constantly, up to several gigabytes.
What could be the problem that causes the memory leak?
module FileScraper(runFileScraperFinal, FileScraper, watch, watchDirectories) where
import Actions (PostProcAction)
import Colog (LogAction, Msg, Severity)
import Conduit (ConduitM, ConduitT, MonadIO (..), MonadResource, MonadTrans (lift), MonadUnliftIO (withRunInIO), ResourceT, await, bracketP, mapMC, mapM_C, runConduit, runResourceT, yield, (.|), takeWhileC)
import Control.Concurrent (threadDelay)
import qualified Control.Concurrent.Async as Async
import qualified Control.Concurrent.STM as STM
import qualified Control.Concurrent.STM.TBMQueue as STM
import Data.ByteString (ByteString, readFile)
import Data.Conduit.Combinators (filterM, yieldMany)
import Data.Functor ((<&>))
import Data.Text (Text, unpack)
import Filters (FileFilter, DirectoryFilter)
import Polysemy (Final, Inspector (inspect), Member, Sem, makeSem)
import Polysemy.Final (bindS, getInitialStateS, getInspectorS, interpretFinal, liftS)
import Prelude hiding (filter, init, readFile)
import System.FilePath.Find (find, RecursionPredicate, (/~?), filePath, (&&?), (==?), fileType, FileType (RegularFile), always)
import System.Posix (raiseSignal, sigTERM)
data FileScraper m a where
  Watch :: [(Text, Text, FileFilter, DirectoryFilter, PostProcAction)] -> (FilePath -> ByteString -> Text -> PostProcAction -> m Bool) -> FileScraper m ()

makeSem ''FileScraper
runFileScraperFinal :: forall m. (MonadUnliftIO m => forall r a. (Member (Final m) r) => LogAction m (Msg Severity) -> Sem (FileScraper ': r) a -> Sem r a)
runFileScraperFinal _ = do
  interpretFinal @m (\case
    Watch sources callback -> do
      is <- getInitialStateS
      ins <- getInspectorS
      cb' <- bindS $ uncurry4 callback
      liftS $ withRunInIO $ \runInIO -> liftIO $ do
        runResourceT . runConduit $ watchDirectories sources .| mapMC (\(fp,fc,dest,ppa) -> lift $ do
          eff <- runInIO $ cb' ((fp,fc,dest,ppa) <$ is)
          case inspect ins eff of
            Nothing -> do
              raiseSignal sigTERM
              pure False
            Just v -> do
              pure v
          ) .| takeWhileC id .| mapM_C (const $ pure ())
    )

uncurry4 :: (a -> b -> c -> d -> e) -> ((a, b, c, d) -> e)
uncurry4 f ~(a,b,c,d) = f a b c d
watchDirectories :: MonadResource m => [(Text, Text, FileFilter, DirectoryFilter, PostProcAction)] -> ConduitM a (FilePath, ByteString, Text, PostProcAction) m ()
watchDirectories sourceToFilterMap = parSources (fmap (\(src, dest, filter, dirFilter, postProcActions) -> streamFile (unpack src) dest filter dirFilter postProcActions) sourceToFilterMap)

streamFile :: MonadResource m => FilePath -> Text -> FileFilter -> DirectoryFilter -> PostProcAction -> ConduitM a (FilePath, ByteString, Text, PostProcAction) m ()
streamFile baseDir destination filter dirFilter postProcActions = do
  newFiles <- liftIO $ find (recursionPredicate dirFilter) (fileType ==? RegularFile) baseDir
  yieldMany newFiles .| filterM (liftIO . filter) .| mapMC (\entry -> do
    liftIO $ readFile entry <&> (entry,,destination,postProcActions))
  let minutes :: Int = 60_000_000
  liftIO $ threadDelay (5 * minutes)
  streamFile baseDir destination filter dirFilter postProcActions
  where
    recursionPredicate :: DirectoryFilter -> RecursionPredicate
    recursionPredicate df = case df of
      [] -> always
      excludes -> foldl1 (&&?) $ map ((/~?) filePath . unpack) excludes
parSources :: (MonadResource m, Foldable f) => f (ConduitM () o (ResourceT IO) ()) -> ConduitT i o m ()
parSources sources = bracketP init cleanup finalSource
  where
    init = do
      -- create the queue where all sources will put their items
      queue <- STM.newTBMQueueIO 100
      -- in a separate thread, run all conduits concurrently
      a <- Async.async $ do
        Async.mapConcurrently_ (\source -> runResourceT $ runConduit (source .| sinkQueue queue)) sources
        -- once all conduits are done, close the queue
        STM.atomically (STM.closeTBMQueue queue)
      pure (a, queue)
    cleanup (async, queue) = do
      -- upon exception or cancellation, close the queue and cancel the threads
      STM.atomically (STM.closeTBMQueue queue)
      Async.cancel async
    finalSource (_, queue) = sourceQueue queue

sourceQueue :: MonadIO m => STM.TBMQueue o -> ConduitT i o m ()
sourceQueue queue = do
  mbItem <- liftIO $ STM.atomically (STM.readTBMQueue queue)
  case mbItem of
    Nothing   -> pure ()                         -- queue closed
    Just item -> yield item *> sourceQueue queue

sinkQueue :: MonadIO m => STM.TBMQueue a -> ConduitT a o m ()
sinkQueue queue = do
  mbItem <- await
  case mbItem of
    Nothing   -> pure ()                         -- no more items to come
    Just item -> do
      liftIO $ STM.atomically (STM.writeTBMQueue queue item)
      sinkQueue queue
Update (Added function that uses the callback):
...
void $ async $ watch normalisedPrefixedSources (\fp content dest ppa -> do
  log Info $ "Sending file " <> pack fp
  result <- await =<< send (unpack dest) content
  case result of
    Just True -> do
      log Info $ "File sent " <> pack fp
      res <- embed @m $ liftIO $ ppa fp
      if res then pure True else do
        log Error "Raise signal for graceful shutdown."
        embed @m $ liftIO $ raiseSignal sigTERM
        pure False
    _ -> do
      log Error $ "Error sending file " <> pack fp <> ". Raise signal for graceful shutdown."
      embed @m $ liftIO $ raiseSignal sigTERM
      pure False
  )
...
Update 2:
After removing the idempotent filter from the configuration (the changes from @K. A. Buhr are still in place), the memory consumption is constant.
type FileFilter = FilePath -> IO Bool

createIdempotentFilter :: LogAction IO Message -> M.Idempotent -> IO FileFilter
createIdempotentFilter la filterConfig = do
  cache <- newIORef []
  let configuredCacheSize :: Int = fromIntegral $ M.lruCacheSize filterConfig
  pure $ \path -> do
    fileModificationEpoch <- getModificationTime path
    cache' <- readIORef cache
    if (path, fileModificationEpoch) `elem` cache' then do
      la <& logText Debug ("File already in cache " <> pack path <> " | " <> pack (show fileModificationEpoch))
      pure False
    else do
      la <& logText Debug ("File not in cache " <> pack path <> " | " <> pack (show fileModificationEpoch))
      let alreadyScanned' = cache' <> [(path, fileModificationEpoch)]
      writeIORef cache $ drop (length alreadyScanned' - configuredCacheSize) alreadyScanned'
      pure True
Is there any problematic code - that causes a memory leak - in the function createIdempotentFilter?
First, make sure you rule out the ByteStrings of file contents as a source of the leak. You will have a maximum number of files in flight equal to the length of the bounded queue, and so your high watermark will be the contents of some arbitrary collection of 100 files from the input filesystems. If you're processing filesystems with large video/image files, you could see erratic, transient spikes from that. Also, if your callback is holding references to the pathnames and/or contents of (some or all of) those files, you'll see a very severe space leak as a result. Rule all this out by replacing readFile entry with return mempty and using a null callback (\_ _ _ _ -> return True).
After making a similar change myself, I was able to duplicate your space leak and tracked it down to two technical issues.
The first was:
.| takeWhileC id .| mapM_C (const $ pure ())
Replacing this with:
.| Control.Monad.void andC
reduced the maximum residency for a single pass through a test filesystem from 130MB to 15MB, but still with a characteristic linear increase in heap usage on a heap profile.
The second was:
yield item *> sourceQueue queue
Replacing this with:
yield item >> sourceQueue queue
removed the leak entirely. Maximum residency was only 2MB, and there was no discernible leak on a heap profile for multiple passes through the test filesystem.
I'm not exactly sure what's going on here, for either issue. The *> versus >> issue is a problem I've seen before. While these are semantically equivalent, they don't necessarily have the same implementation, and sometimes *> leaks space where >> doesn't. However, the takeWhileC problem is a mystery to me.
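For what it's worth, the two operators do have different stock definitions in base, which at least makes an implementation-level difference plausible:

-- default definitions in base:
-- a1 *> a2 = (id <$ a1) <*> a2   -- routed through the Applicative instance
-- m >> k   = m >>= \_ -> k       -- routed through the Monad instance

Whether conduit's Applicative instance really retains extra structure here is an assumption on my part; confirming it would take heap profiles of the two variants.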

Haskell: Replace mapM in a monad transformer stack to achieve lazy evaluation (no space leaks)

It has already been discussed that mapM is inherently not lazy, e.g. here and here. Now I'm struggling with a variation of this problem where the mapM in question is deep inside a monad transformer stack.
Here's a function taken from a concrete, working (but space-leaking) example using LevelDB that I put on gist.github.com:
-- read keys [1..n] from db at dirName and check that the values are correct
doRead :: FilePath -> Int -> IO ()
doRead dirName n = do
    success <- runResourceT $ do
        db <- open dirName defaultOptions{ cacheSize = 2048 }
        let check' = check db def in   -- check' :: Int -> ResourceT IO Bool
            and <$> mapM check' [1..n] -- space leak !!!
    putStrLn $ if success then "OK" else "Fail"
This function reads the values corresponding to keys [1..n] and checks that they are all correct. The troublesome line inside the ResourceT IO a monad is
and <$> mapM check' [1..n]
One solution would be to use streaming libraries such as pipes, conduit, etc. But these seem rather heavy and I'm not at all sure how to use them in this situation.
Another path I looked into is ListT as suggested here. But the type signatures of ListT.fromFoldable :: [Bool] -> ListT Bool and ListT.fold :: (r -> a -> m r) -> r -> t m a -> m r (where m = IO and a, r = Bool) do not match the problem at hand.
What is a 'nice' way to get rid of the space leak?
Update: Note that this problem has nothing to do with monad transformer stacks! Here's a summary of the proposed solutions:
1) Using Streaming:
import Streaming
import qualified Streaming.Prelude as S
S.all_ id (S.mapM check' (S.each [1..n]))
2) Using Control.Monad.foldM:
foldM (\a i -> do { b <- check' i; return $! a && b }) True [1..n]
3) Using Control.Monad.Loops.allM
allM check' [1..n]
I know you mention you don't want to use streaming libraries, but your problem seems pretty easy to solve with streaming without changing the code too much.
import Streaming
import qualified Streaming.Prelude as S
We use each [1..n] instead of [1..n] to get a stream of elements:
each :: (Monad m, Foldable f) => f a -> Stream (Of a) m ()
Stream the elements of a pure, foldable container.
(We could also write something like S.take n $ S.enumFrom 1).
We use S.mapM check' instead of mapM check':
mapM :: Monad m => (a -> m b) -> Stream (Of a) m r -> Stream (Of b) m r
Replace each element of a stream with the result of a monadic action
And then we fold the stream of booleans with S.all_ id:
all_ :: Monad m => (a -> Bool) -> Stream (Of a) m r -> m Bool
Putting it all together:
S.all_ id (S.mapM check' (S.each [1..n]))
Not too different from the code you started with, and without the need for any new operator.
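In case it helps, here is the same pipeline as a self-contained program, with a stand-in check' since the LevelDB setup isn't shown (everything except the Streaming functions is made up):

import Streaming
import qualified Streaming.Prelude as S

-- stand-in for the real check' :: Int -> ResourceT IO Bool
check' :: Int -> IO Bool
check' i = return (i > 0)

main :: IO ()
main = do
    ok <- S.all_ id (S.mapM check' (S.each [1 .. 1000000 :: Int]))
    print ok -- True, computed in constant space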
I think what you need is allM from the monad-loops package.
Then it would be just allM check' [1..n]
(Or if you don't want the import, it's a pretty small function to copy.)
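For reference, allM short-circuits on the first False; a copy would look roughly like this (matching the behavior documented in monad-loops):

allM :: Monad m => (a -> m Bool) -> [a] -> m Bool
allM _ []     = return True
allM p (x:xs) = do
    b <- p x
    if b then allM p xs else return False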

Constructing RequestBodyStream from Lazy ByteString when length is known

I am trying to adapt this AWS S3 upload code to handle a lazy ByteString whose length is already known, so that it is not forced to be read in its entirety in memory (it comes over the network, where the length is sent beforehand). It seems I have to define a GivesPopper function over the lazy ByteString to convert it to a RequestBodyStream. Because of the convoluted way GivesPopper is defined, I am not sure how to write it for a lazy ByteString. I will appreciate pointers on how to write it. Here is how it is written for reading from a file:
let file = "test"
-- streams large file content, without buffering more than 10k in memory
let streamer sink = withFile file ReadMode $ \h -> sink $ S.hGet h 10240
streamer in the code above is of type GivesPopper () if I understand it correctly.
Given a Lazy ByteString with known length len, what would be a good way to write GivesPopper function over it? We can read one chunk at a time.
Is this what you're looking for?
import qualified Data.ByteString as S
import qualified Data.ByteString.Lazy as L
import System.IO
file = "test"
-- original streamer for feeding a sink from a file
streamer :: (IO S.ByteString -> IO r) -> IO r
streamer sink = withFile file ReadMode $ \h -> sink $ S.hGet h 10240
-- feed a lazy ByteString to sink
lstreamer :: L.ByteString -> (IO S.ByteString -> IO r) -> IO r
lstreamer lbs sink = sink (return (L.toStrict lbs))
lstreamer type-checks but probably doesn't do exactly what you want it to do. It simply returns the same data every time the sink calls it. On the other hand, S.hGet h ... will eventually return the empty string.
Here is a solution which uses an IORef to keep track of whether we should start returning the empty string:
import Data.IORef
mklstream :: L.ByteString -> (IO S.ByteString -> IO r) -> IO r
mklstream lbs sink = do
  ref <- newIORef False
  let fetch :: IO S.ByteString
      fetch = do sent <- readIORef ref
                 writeIORef ref True
                 if sent
                   then return S.empty
                   else return (L.toStrict lbs)
  sink fetch
Here fetch is the action which gets the next chunk. The first time you call it you will get the original lazy Bytestring (strict-ified). Subsequent calls will always return the empty string.
Update
Here's how to give out a small amount at a time:
mklstream :: L.ByteString -> (IO S.ByteString -> IO r) -> IO r
mklstream lbs sink = do
  ref <- newIORef (L.toChunks lbs)
  let fetch :: IO S.ByteString
      fetch = do chunks <- readIORef ref
                 case chunks of
                   []     -> return S.empty
                   (c:cs) -> do writeIORef ref cs
                                return c
  sink fetch
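Since the length is known up front, this slots straight into http-client's RequestBodyStream constructor, which takes the byte count and a GivesPopper (). A sketch (mkBody is a made-up name):

import Data.Int (Int64)
import Network.HTTP.Client (RequestBody (RequestBodyStream))

-- 'len' is the length received over the network; using it avoids forcing
-- the lazy ByteString just to measure it.
mkBody :: Int64 -> L.ByteString -> RequestBody
mkBody len lbs = RequestBodyStream len (mklstream lbs)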

Parsec: error message at specific location

Using Parsec, how does one indicate an error at a specific position when a semantic rule is violated? I know we typically don't want to do such things, but consider the example grammar.
<foo> ::= <bar> | ...
<bar> ::= a positive integer power of two
The <bar> rule denotes a finite set (my example is arbitrary), and a pure approach to the above could be a careful application of the choice combinator, but this might be impractical in space and time. In recursive descent or toolkit-generated parsers, the standard trick is to parse an integer (a more relaxed grammar) and then semantically check the harder constraints. For Parsec, I could use a natural parser and check the result, calling fail when it doesn't match, or unexpected, or whatever. But if we do that, the default error location is the wrong one. Somehow I need to raise the error at the earlier state.
I tried a brute-force solution and wrote a combinator that uses getPosition and setPosition, as illustrated by this very similar question. Of course, I was also unsuccessful (the error location is, of course, wrong). I've run into this pattern many times. I am kind of looking for this type of combinator:
withPredicate :: (a -> Bool) -> String -> P a -> P a
withPredicate pred lbl p = do
  ok <- lookAhead $ fmap pred (try p) <|> return False -- peek ahead
  if ok then p        -- consume the input if the value passed the predicate
        else fail lbl -- otherwise raise the error at the *start* of this token

pPowerOfTwo = withPredicate isPowerOfTwo "power of two" natural
  where isPowerOfTwo = (`elem` [2^i | i <- [1..20]])
The above does not work. (I tried variants on this as well.) Somehow the parser backtracks and says it's expecting a digit. I assume it's returning the error that made it the furthest. Even {get,set}ParserState fails to erase that memory.
Am I handling this syntactic pattern wrong? How would all you Parsec users approach this type of problem?
Thanks!
I think both your ideas are OK. The other two answers deal with Parsec, but I'd like to note that in both
cases Megaparsec just does the right thing:
{-# LANGUAGE TypeApplications #-}

module Main (main) where

import Control.Monad
import Data.Void
import Text.Megaparsec
import qualified Text.Megaparsec.Char.Lexer as L

type Parser = Parsec Void String

withPredicate1 :: (a -> Bool) -> String -> Parser a -> Parser a
withPredicate1 f msg p = do
  r <- lookAhead p
  if f r
    then p
    else fail msg

withPredicate2 :: (a -> Bool) -> String -> Parser a -> Parser a
withPredicate2 f msg p = do
  mpos <- getNextTokenPosition -- †
  r <- p
  if f r
    then return r
    else do
      forM_ mpos setPosition
      fail msg

main :: IO ()
main = do
  let msg = "I only like numbers greater than 42!"
  parseTest' (withPredicate1 @Integer (> 42) msg L.decimal) "11"
  parseTest' (withPredicate2 @Integer (> 42) msg L.decimal) "22"
If I run it:
The next big Haskell project is about to start!
λ> :main
1:1:
|
1 | 11
| ^
I only like numbers greater than 42!
1:1:
|
1 | 22
| ^
I only like numbers greater than 42!
λ>
Try it for yourself! Works as expected.
† getNextTokenPosition is more correct than getPosition for streams where tokens contain position of their beginning and end in themselves. This may or may not be important in your case.
It's not a solution I like, but you can hypnotize Parsec into believing it's had a single failure with consumption:
failAt pos msg = mkPT (\_ -> return (Consumed (return $ Error $ newErrorMessage (Expect msg) pos)))
Here's a complete example:
import Control.Monad
import Text.Parsec
import Text.Parsec.Char
import Text.Parsec.Error
import Text.Parsec.Prim
import Debug.Trace

failAt pos msg = mkPT (\_ -> return (Consumed (return $ Error $ newErrorMessage (Expect msg) pos)))

type P a = Parsec String () a

withPredicate :: (a -> Bool) -> String -> P a -> P a
withPredicate pred msg p = do
  pos <- getPosition
  x <- p
  unless (pred x) $ failAt pos msg
  return x

natural = read <$> many1 digit

pPowerOfTwo = withPredicate isPowerOfTwo "power of two" natural
  where isPowerOfTwo = (`elem` [2^i | i <- [1..20]])

main = print $ runParser pPowerOfTwo () "myinput" "4095"
When run, it results in:
Left "myinput" (line 1, column 1):
expecting power of two
I think the problem stems from how Parsec picks the "best error" in the non-deterministic setting. See Text.Parsec.Error.mergeError. Specifically, it selects the longest match when choosing which error to report. I think we would need some way to make Parsec order errors differently, which may be too obscure a change for solving this problem.
In my case, here's how I worked around the problem: I stacked an ExceptT monad within my ParsecT type.
type P m = P.ParsecT String ParSt (ExceptT Diagnostic m)
Then I introduced a pair of combinators:
(Note: Loc is my internal location type)
-- stops hard on an error (no backtracking)
-- which is why I say "semantic" instead of "syntax" error
throwSemanticError :: (MonadTrans t, Monad m) => Loc -> String -> t (ExceptT Diagnostic m) a
throwSemanticError loc msg = throwSemanticErrorDiag $! Diagnostic loc msg
withLoc :: Monad m => (Loc -> P m a) -> P m a
withLoc pa = getLoc >>= pa
Now in parsing I can write:
parsePrimeNumber = withLoc $ \loc -> do
  i <- parseInt
  unless (isPrime i) $ throwSemanticError loc "number is not prime!"
  return i
The top level interface to run one of these monads is really nasty.
runP :: Monad m
     => ParseOpts
     -> P m a
     -> String
     -> m (ParseResult a)
runP pos pma inp =
  case runExceptT (P.runParserT pma (initPSt pos) "" inp) of
    mea -> do
      ea <- mea
      case ea of
        -- semantic error (throwSemanticError)
        Left err -> return $! PError err
        -- regular parse error
        Right (Left err) -> return $ PError (errToDiag err)
        -- success
        Right (Right a) -> return (PSuccess a [])
I'm not terribly happy with this solution and desire something better.
I wish Parsec had a:

semanticCheck :: (a -> Parsec Bool) -> Parsec a -> Parsec a
semanticCheck pred p = do
  a <- p
  z <- pred a
  unless z $
    ... somehow raise the error from the beginning of this token/parse
    rather than the end ... and when propagating the error up,
    use the end parse position, so this parse error beats out other
    failed parsers that make it past the beginning of this token
    (but not to the end)
  return a
Using lookAhead, we can run a parser without consuming any input or registering any new errors, but record the state that we end up in. We can then apply a guard to the result of the parser. The guard can fail in whatever manner it desires if the value does not pass the semantic check. If the guard fails, then the error is located at the initial position. If the guard succeeds, we reset the parser to the recorded state, avoiding the need to re-execute p.
guardP :: Stream s m t => (a -> ParsecT s u m ()) -> ParsecT s u m a -> ParsecT s u m a
guardP guard p = do
  (a, s) <- try . lookAhead $ do
    a <- p
    s <- getParserState
    return (a, s)
  guard a
  setParserState s
  return a
We can now implement pPowerOfTwo:
pPowerOfTwo :: Stream s m Char => ParsecT s u m Integer
pPowerOfTwo = guardP guardPowerOfTwo natural <?> "power of two"
  where guardPowerOfTwo s = unless (s `elem` [2^i | i <- [1..20]]) . unexpected $ show s
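A quick usage sketch (a hypothetical session, reusing the natural parser from the earlier answer):

main :: IO ()
main = do
  print $ runParser pPowerOfTwo () "myinput" "4096" -- Right 4096
  print $ runParser pPowerOfTwo () "myinput" "4095" -- error reported at line 1, column 1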

What is the preferred way to combine two sinks?

I've used zipSinks :: Monad m => Sink i m r -> Sink i m r' -> Sink i m (r, r') for this but it is considered deprecated.
Edit
After considering this, I don't think it is possible with the current version of Data.Conduit. Pipes aren't Categories, so &&& is out of the question. And there's no way that I can think of to pull results from upstream, feed them incrementally to both sinks, and short-circuit when the first sink finishes. (Although I don't think that Data.Conduit.Util.zipSinks short-circuits this way, it seems like it would be very desirable.) Except of course, to pattern match on both Sinks (like zipSinks in the package does), but that's what we're trying to avoid here.
That said, I would love to be proven wrong here.
It's not pretty, but you can do this in a kind-of obvious way.
First imports:
module Main where
import Control.Monad.Trans
import Data.Conduit
import qualified Data.Conduit.Binary as CB
import qualified Data.Conduit.List as CL
import qualified Data.Conduit.Text as CT
import qualified Data.Conduit.Util as CU
import Data.Maybe
import Data.Text (unpack)
Now for zipSinks. Basically, you want to create a sink that pulls the input from upstream and sends it to each child sink separately. In this case, I've used CL.sourceList to do this. If await returns Nothing, maybeToList returns an empty list, so the child sinks are also run with no input. Finally, the output of each child sink is then fed into the tuple.
zipSinks :: Monad m => Sink i m r -> Sink i m r' -> Sink i m (r, r')
zipSinks s1 s2 = do
  l <- fmap maybeToList await
  o1 <- lift $ CL.sourceList l $$ s1
  o2 <- lift $ CL.sourceList l $$ s2
  return (o1, o2)
Here are some examples of using zipSinks. It appears to work fine both inside of IO and outside of it, and in the few tests I did, the output matches the output of zipped', created using the old zipSinks.
doubleHead :: Monad m => Sink Int m (Maybe Int)
doubleHead = await >>= return . fmap (2*)

-- old version
zipped' :: Monad m => Sink Int m (Maybe Int, Maybe Int)
zipped' = CU.zipSinks CL.head doubleHead

-- new version
zipped :: Monad m => Sink Int m (Maybe Int, Maybe Int)
zipped = zipSinks CL.head doubleHead

fromList = CL.sourceList [7, 8, 9] $$ zipped
-- (Just 7, Just 14)

fromFile :: String -> IO (Maybe Int, Maybe Int)
fromFile filename = runResourceT $
     CB.sourceFile filename
  $= CB.lines
  $= CT.decode CT.utf8
  $= CL.map (read . unpack)
  $$ zipped
-- for a file with the lines:
--
-- 1
-- 2
-- 3
--
-- returns (Just 1, Just 2)
((The package is conduit-0.5.2.3. The whole module is just for backwards compatibility.))
[edit]
So, my straightforward monadic guess (see below) seems to be wrong, even though the types are correct.
Now, I can only guess that the answer is:
The replacing features are still in development, pretty much like all Pipe/Conduit and similar concepts and libraries.
I'd wait for the next API to solve this question and still use zipSinks until then.
(Maybe it was just misplaced.)
[/edit]
I'm not that familiar with this package, but wouldn't it do just the same as this?
zipSinks :: Monad m => Sink i m r -> Sink i m r' -> Sink i m (r, r')
zipSinks s1 s2 = (,) <$> s1 <*> s2
It is a Monad after all. (Functor, Applicative)
zipSinks :: Monad sink => sink r -> sink r' -> sink (r, r')
zipSinks s1 s2 = liftM2 (,) s1 s2
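(As the edit above concedes, though, this sequences the sinks rather than zipping them: the second sink only sees whatever input the first one left unconsumed. A sketch of the difference:)

-- Sequencing: the second CL.head consumes the *next* element.
-- CL.sourceList [7, 8, 9] $$ ((,) <$> CL.head <*> CL.head)
--   ==> (Just 7, Just 8)
-- Zipping feeds the same input to both sinks:
-- CL.sourceList [7, 8, 9] $$ CU.zipSinks CL.head CL.head
--   ==> (Just 7, Just 7)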