Conduit - Combining multiple Sources/Producers into one

I'm reading from a file using sourceFile, but I also need to introduce randomness into the processing operation. The best approach, I believe, is to have a producer of type
Producer m (StdGen, ByteString)
where StdGen is used to generate the random number.
I intend for the producer to perform the task of sourceFile, as well as producing a new seed to yield every time it sends data downstream.
My problem is that there doesn't seem to be a source combinator analogous to zipSink for sinks. Reading through the Conduit Overview, it seems to suggest that you can embed a Source inside a Conduit, but I fail to see how it is done in the example.
Can anyone provide an example in which you fuse two or more IO sources into a single Producer/Source?
EDIT:
An example:
{-# LANGUAGE NoImplicitPrelude #-}
{-# LANGUAGE RankNTypes #-}
{-# LANGUAGE OverloadedStrings #-}
import System.Random (StdGen(..), split, newStdGen, randomR)
import ClassyPrelude.Conduit as Prelude
import Control.Monad.Trans.Resource (runResourceT, ResourceT(..))
import qualified Data.ByteString as BS
-- generate an infinite source of random number seeds
sourceStdGen :: MonadIO m => Source m StdGen
sourceStdGen = do
    g <- liftIO newStdGen
    loop g
  where
    loop gin = do
        let g' = fst (split gin)
        yield gin
        loop g'
-- combine the sources into one
sourceInput :: (MonadResource m, MonadIO m) => FilePath -> Source m (StdGen, ByteString)
sourceInput fp = getZipSource $ (,)
    <$> ZipSource sourceStdGen
    <*> ZipSource (sourceFile fp)
-- a simple conduit, which generates a random number from the provided StdGen
-- and appends the byte value to the provided ByteString
simpleConduit :: Conduit (StdGen, ByteString) (ResourceT IO) ByteString
simpleConduit = mapC process
process :: (StdGen, ByteString) -> ByteString
process (g, bs) =
    let rnd = fst $ randomR (40, 50) g
    in bs ++ pack [rnd]
main :: IO ()
main = do
    runResourceT $ sourceInput "test.txt" $$ simpleConduit =$ sinkFile "output.txt"
So this example takes what's in the input file and writes it to the output file, appending a random byte value between 40 and 50 (decimal) to each chunk along the way. (Don't ask me why.)

You can use ZipSource for this. In your case, it might look something like:
sourceStdGens :: Source m StdGen
sourceBytes :: Source m ByteString
sourceBoth :: Source m (StdGen, ByteString)
sourceBoth = getZipSource $ (,)
    <$> ZipSource sourceStdGens
    <*> ZipSource sourceBytes
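If you want something runnable to play with, here is a minimal sketch of the same idea with two pure stand-in sources (pairUp and its inputs are invented for illustration):
import Data.Conduit
import qualified Data.Conduit.List as CL

-- zip elements pairwise; the combined source stops when either side does
pairUp :: Monad m => Source m (Int, Char)
pairUp = getZipSource $ (,)
    <$> ZipSource (CL.sourceList [1, 2, 3])
    <*> ZipSource (CL.sourceList "abc")

-- runIdentity (pairUp $$ CL.consume) == [(1,'a'),(2,'b'),(3,'c')]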

You can do it in the IO monad and then lift the result into a Producer.
do (i, newSeed) <- next currentSeed
   b <- generateByteStringFromRandomNumber i
   return (b, newSeed)
That IO action can be lifted into the appropriate conduit with a simple lift:
-- assuming the above action is named x and takes the current seed as an argument
-- the corresponding producer/source is:
lift $ x currentSeed
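Putting the pieces together, a complete producer might look like the sketch below. Here next, generateByteStringFromRandomNumber, and the Seed type are the hypothetical pieces from above, not real library functions; the loop lifts the IO step, yields the bytes, and recurses with the new seed.
import Control.Monad.Trans.Class (lift)
import Data.ByteString (ByteString)
import Data.Conduit

-- step the seed in IO, lift the action into the conduit, yield, and loop
randomBytes :: Seed -> Producer IO ByteString
randomBytes seed = do
    (b, newSeed) <- lift $ do
        (i, newSeed') <- next seed
        b <- generateByteStringFromRandomNumber i
        return (b, newSeed')
    yield b
    randomBytes newSeed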

Related

Efficient streaming and manipulation of a byte stream in Haskell

While writing a deserialiser for a large (<bloblength><blob>)* encoded binary file I got stuck with the various Haskell produce-transform-consume libraries. So far I'm aware of four streaming libraries:
Data.Conduit: Widely used, has very careful resource management
Pipes: Similar to conduit (Haskell Cast #6 nicely reveals the differences between conduit and pipes)
Data.Binary.Get: Offers useful functions such as getWord32be, but the streaming example is awkward
System.IO.Streams: Seems to be the easiest one to use
Here's a stripped down example of where things go wrong when I try to do Word32 streaming with conduit. A slightly more realistic example would first read a Word32 that determines the blob length and then yield a lazy ByteString of that length (which is then deserialised further).
But here I just try to extract Word32's in streaming fashion from a binary file:
module Main where
-- build-depends: bytestring, conduit, conduit-extra, resourcet, binary
import Control.Monad.Trans.Resource (MonadResource, runResourceT)
import qualified Data.Binary.Get as G
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as C
import qualified Data.ByteString.Lazy as BL
import Data.Conduit
import qualified Data.Conduit.Binary as CB
import qualified Data.Conduit.List as CL
import Data.Word (Word32)
import System.Environment (getArgs)
-- gets a Word32 from a ByteString.
getWord32 :: C.ByteString -> Word32
getWord32 bs = G.runGet G.getWord32be $ BL.fromStrict bs
-- should read ByteStrings and yield Word32s
transform :: (Monad m, MonadResource m) => Conduit BS.ByteString m Word32
transform = do
    mbs <- await
    case mbs of
        Just bs -> case C.null bs of
            False -> do
                yield $ getWord32 bs
                leftover $ BS.drop 4 bs
                transform
            True -> return ()
        Nothing -> return ()
main :: IO ()
main = do
    filename <- fmap (!! 0) getArgs -- should check length getArgs
    result <- runResourceT $ CB.sourceFile filename $$ transform =$ CL.consume
    print $ length result -- is always 8188 for files larger than 32752 bytes
The output of the program is just the number of Word32's that were read. It turns out the stream terminates after reading the first chunk (about 32 KiB). For some reason mbs is never Nothing, so I must check null bs, which stops the stream when the chunk is consumed. Clearly, my conduit transform is faulty. I see two routes to a solution:
The await doesn't seem to pull the second chunk of the byte stream, so is there another function that pulls the next chunk? In the examples I've seen (e.g. Conduit 101) this is not how it's done.
This is just the wrong way to set up transform.
How is this done properly? Is this the right way to go? (Performance does matter.)
Update: Here's a BAD way to do it using System.IO.Streams:
module Main where
import Data.Word (Word32)
import System.Environment (getArgs)
import System.IO (IOMode (ReadMode), openFile)
import qualified System.IO.Streams as S
import System.IO.Streams.Binary (binaryInputStream)
import System.IO.Streams.List (outputToList)
main :: IO ()
main = do
    filename : _ <- getArgs
    h <- openFile filename ReadMode
    s <- S.handleToInputStream h
    i <- binaryInputStream s :: IO (S.InputStream Word32)
    r <- outputToList $ S.connect i
    print $ last r
'Bad' means: very demanding in time and space, and it does not handle decode exceptions.
Your immediate problem is caused by how you are using leftover. That function is used to "Provide a single piece of leftover input to be consumed by the next component in the current monadic binding", and so when you give it bs before looping with transform you are effectively throwing away the rest of the bytestring (i.e. what is after bs).
A correct solution based on your code would use the incremental input interface of Data.Binary.Get to replace your yield/leftover combination with something that consumes each chunk fully. A more pragmatic approach, though, is using the binary-conduit package, which provides that in the shape of conduitGet (its source gives a good idea of what a "manual" implementation would look like):
import Data.Conduit.Serialization.Binary
-- etc.
transform :: (Monad m, MonadResource m) => Conduit BS.ByteString m Word32
transform = conduitGet G.getWord32be
One caveat is that this will throw a parse error if the total number of bytes is not a multiple of 4 (i.e. the last Word32 is incomplete). In the unlikely case of that not being what you want, a lazy way out would be simply applying \bs -> C.take (4 * (C.length bs `div` 4)) bs to the input bytestring.
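For reference, a hand-rolled version along the lines suggested above, using the incremental Decoder interface from Data.Binary.Get so that chunk boundaries no longer matter (an untested sketch, reusing the question's imports):
transform' :: Monad m => Conduit BS.ByteString m Word32
transform' = loop (G.runGetIncremental G.getWord32be)
  where
    loop (G.Done rest _ w) = do
        yield w
        loop (G.pushChunk (G.runGetIncremental G.getWord32be) rest)
    loop (G.Fail {}) = return () -- can't happen for getWord32be without pushEndOfInput
    loop partial = do
        mbs <- await
        case mbs of
            Just bs -> loop (G.pushChunk partial bs)
            Nothing -> return () -- drops an incomplete trailing Word32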
With pipes (and pipes-group and pipes-bytestring) the demo problem reduces to combinators. First we resolve the incoming undifferentiated byte stream into little 4 byte chunks:
chunksOfStrict :: (Monad m) => Int -> Producer ByteString m r -> Producer ByteString m r
chunksOfStrict n = folds mappend mempty id . view (Bytes.chunksOf n)
then we map these to Word32s and (here) count them.
main :: IO ()
main = do
    filename : _ <- getArgs
    IO.withFile filename IO.ReadMode $ \h -> do
        n <- P.length $ chunksOfStrict 4 (Bytes.fromHandle h) >-> P.map getWord32
        print n
This will fail if we have fewer than 4 bytes or otherwise fail to parse, but we can just as well map with
getMaybeWord32 :: ByteString -> Maybe Word32
getMaybeWord32 bs = case G.runGetOrFail G.getWord32be $ BL.fromStrict bs of
    Left _ -> Nothing
    Right (_, _, w32) -> Just w32
The following program will then print the parses for the valid 4-byte sequences:
main :: IO ()
main = do
    filename : _ <- getArgs
    IO.withFile filename IO.ReadMode $ \h ->
        runEffect $ chunksOfStrict 4 (Bytes.fromHandle h)
            >-> P.map getMaybeWord32
            >-> P.concat -- here `concat` eliminates the Nothings
            >-> P.print
There are other ways of dealing with failed parses, of course.
Here, though, is something closer to the program you asked for. It takes a four byte segment from a byte stream (Producer ByteString m r) and reads it as a Word32 if it is long enough; it then takes that many of the incoming bytes and accumulates them into a lazy bytestring, yielding it. It just repeats this until it runs out of bytes. In main below, I print each yielded lazy bytestring that is produced:
module Main (main) where

import Pipes
import qualified Pipes.Prelude as P
import Pipes.Group (folds)
import qualified Pipes.ByteString as Bytes (splitAt, fromHandle, chunksOf)
import Control.Lens (view) -- or Lens.Simple (view), or Lens.Micro.Extras (view)
import qualified System.IO as IO (IOMode (ReadMode), withFile)
import qualified Data.Binary.Get as G (runGet, getWord32be)
import Data.ByteString (ByteString)
import qualified Data.ByteString.Lazy.Char8 as BL
import System.Environment (getArgs)

splitLazy :: (Monad m, Integral n)
          => n -> Producer ByteString m r -> m (BL.ByteString, Producer ByteString m r)
splitLazy n bs = do
    (bss, rest) <- P.toListM' $ view (Bytes.splitAt n) bs
    return (BL.fromChunks bss, rest)
measureChunks :: Monad m => Producer ByteString m r -> Producer BL.ByteString m r
measureChunks bs = do
    (lbs, rest) <- lift $ splitLazy 4 bs
    if BL.length lbs /= 4
        then rest >-> P.drain -- in fact it will be empty
        else do
            let w32 = G.runGet G.getWord32be lbs
            (lbs', rest') <- lift $ splitLazy w32 rest
            yield lbs'
            measureChunks rest'
main :: IO ()
main = do
    filename : _ <- getArgs
    IO.withFile filename IO.ReadMode $ \h ->
        runEffect $ measureChunks (Bytes.fromHandle h) >-> P.print
This is again crude in that it uses runGet rather than runGetOrFail, but this is easily repaired. The standard pipes procedure would be to stop the stream transformation on a failed parse and return the unparsed bytestream.
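A sketch of that repair, assuming runGetOrFail is added to the Data.Binary.Get import list above: on a failed parse we stop transforming and hand back the unconsumed bytes together with the rest of the stream.
measureChunks' :: Monad m
               => Producer ByteString m r
               -> Producer BL.ByteString m (Producer ByteString m r)
measureChunks' bs = do
    (lbs, rest) <- lift $ splitLazy 4 bs
    case G.runGetOrFail G.getWord32be lbs of
        -- return the unparsed length bytes followed by the remaining stream
        Left _ -> return (mapM_ yield (BL.toChunks lbs) >> rest)
        Right (_, _, w32) -> do
            (lbs', rest') <- lift $ splitLazy w32 rest
            yield lbs'
            measureChunks' rest'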
If you were anticipating that the Word32s were for large numbers, so that you did not want to accumulate the corresponding stream of bytes as a lazy bytestring, but say write them to different files without accumulating, we could change the program pretty easily to do that. This would require a sophisticated use of conduit but is the preferred approach with pipes and streaming.
Here's a relatively straightforward solution that I want to throw into the ring: repeated use of splitAt, wrapped in a State monad that gives an interface identical to (a subset of) Data.Binary.Get. The resulting [ByteString] is obtained in main with a whileJust over getBlob.
module Main (main) where

import Control.Monad.Loops
import Control.Monad.State
import qualified Data.Binary.Get as G (getWord32be, runGet)
import qualified Data.ByteString.Lazy as BL
import Data.Int (Int64)
import Data.Word (Word32)
import System.Environment (getArgs)

-- this is going to mimic the Data.Binary.Get.Get monad
type Get = State BL.ByteString

getWord32be :: Get (Maybe Word32)
getWord32be = state $ \bs ->
    let (w, rest) = BL.splitAt 4 bs
    in case BL.length w of
        4 -> (Just w', rest) where w' = G.runGet G.getWord32be w
        _ -> (Nothing, BL.empty)

getLazyByteString :: Int64 -> Get BL.ByteString
getLazyByteString n = state $ \bs -> BL.splitAt n bs

getBlob :: Get (Maybe BL.ByteString)
getBlob = do
    ml <- getWord32be
    case ml of
        Nothing -> return Nothing
        Just l -> do
            blob <- getLazyByteString (fromIntegral l :: Int64)
            return $ Just blob

runGet :: Get a -> BL.ByteString -> a
runGet g bs = fst $ runState g bs
main :: IO ()
main = do
    fname <- head <$> getArgs
    bs <- BL.readFile fname
    let ls = runGet loop bs where loop = whileJust getBlob return
    print $ length ls
There's no error handling in getBlob, but it's easy to extend. Time and space complexity are quite good, as long as the resulting list is used carefully. (The Python script that creates some random data for consumption by the above is here.)
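For instance, a variant that distinguishes a clean end of input from a truncated blob might look like this (getBlobChecked is an invented name, built on the definitions above):
getBlobChecked :: Get (Either String (Maybe BL.ByteString))
getBlobChecked = do
    ml <- getWord32be
    case ml of
        Nothing -> return (Right Nothing) -- clean end of input
        Just l -> do
            blob <- getLazyByteString (fromIntegral l :: Int64)
            return $ if BL.length blob == fromIntegral l
                then Right (Just blob)
                else Left "truncated blob" -- fewer bytes left than the header promised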

How to parse a large XML file in Haskell with limited amount of resources?

I want to extract information from a large XML file (around 20G) in Haskell. Since it is a large file, I used SAX parsing functions from hexpat.
Here is a simple code I tested:
import qualified Data.ByteString.Lazy as L
import Text.XML.Expat.SAX as Sax
parse :: FilePath -> IO ()
parse path = do
    inputText <- L.readFile path
    let saxEvents = Sax.parse defaultParseOptions inputText :: [SAXEvent Text Text]
    let txt = foldl' processEvent "" saxEvents
    putStrLn txt
After activating profiling in Cabal, it says that parse.saxEvents took 85% of allocated memory. I also used foldr and the result is the same.
If processEvent becomes complex enough, the program crashes with a stack space overflow error.
What am I doing wrong?
You don't say what processEvent is like. In principle, it ought to be unproblematic to use a lazy ByteString for a strict left fold over lazily generated input, so I'm not sure what is going wrong in your case. But one ought to use streaming-appropriate types when dealing with gigantic files!
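Since processEvent isn't shown, here is one guessed shape for it, only to illustrate the strictness point (CharacterData is hexpat's constructor for text events; the bang patterns force the accumulator at each step, and for a 20G file you would of course stream rather than accumulate at all):
{-# LANGUAGE BangPatterns #-}
-- hypothetical processEvent: strictly concatenate all character data
processEvent :: Text -> SAXEvent Text Text -> Text
processEvent !acc (CharacterData t) = acc <> t
processEvent !acc _                 = acc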
In fact, hexpat does have a 'streaming' interface (just like xml-conduit). It uses the not-too-well-known List library and the rather ugly List class it defines. In principle the ListT type from the List package should work well. I gave up quickly because of a lack of combinators, and wrote an appropriate instance of the ugly List class for a wrapped version of Pipes.ListT, which I then used to export ordinary Pipes.Producer functions like parseProducer. The trivial manipulations needed for this are appended below as PipesSax.hs.
Once we have parseProducer we can convert a ByteString or Text Producer into a Producer of SAXEvents with Text or ByteString components. Here are some simple operations. I was using a 238M "input.xml"; judging from top, the programs never need more than 6 MB of memory.
-- Sax.hs Most of the IO actions use a registerIds pipe defined at the bottom which is tailored to a giant bit of xml of which this is a valid 1000 fragment http://sprunge.us/WaQK
{-#LANGUAGE OverloadedStrings #-}
import PipesSax ( parseProducer )
import Data.ByteString ( ByteString )
import Text.XML.Expat.SAX
import Pipes -- cabal install pipes pipes-bytestring
import Pipes.ByteString (toHandle, fromHandle, stdin, stdout )
import qualified Pipes.Prelude as P
import qualified System.IO as IO
import qualified Data.ByteString.Char8 as Char8
sax :: MonadIO m
    => Producer ByteString m ()
    -> Producer (SAXEvent ByteString ByteString) m ()
sax = parseProducer defaultParseOptions
-- stream xml from stdin, yielding hexpat tagstream to stdout;
main0 :: IO ()
main0 = runEffect $ sax stdin >-> P.print
-- stream the extracted 'IDs' from stdin to stdout
main1 :: IO ()
main1 = runEffect $ sax stdin >-> registryIds >-> stdout
-- write all IDs to a file
main2 =
    IO.withFile "input.xml" IO.ReadMode $ \inp ->
        IO.withFile "output.txt" IO.WriteMode $ \out ->
            runEffect $ sax (fromHandle inp) >-> registryIds >-> toHandle out
-- folds:
-- print number of IDs
main3 = IO.withFile "input.xml" IO.ReadMode $ \inp -> do
    n <- P.length $ sax (fromHandle inp) >-> registryIds
    print n
-- sum the meaningful part of the IDs - a dumb fold for illustration
main4 = IO.withFile "input.xml" IO.ReadMode $ \inp -> do
    let pipeline = sax (fromHandle inp) >-> registryIds >-> P.map readIntId
    n <- P.fold (+) 0 id pipeline
    print n
  where
    readIntId :: ByteString -> Integer
    readIntId = maybe 0 (fromIntegral . fst) . Char8.readInt . Char8.drop 2
-- my xml has tags with attributes that appear via hexpat thus:
-- StartElement "FacilitySite" [("registryId","110007915364")]
-- and the like. This is just an arbitrary demo stream manipulation.
registryIds :: Monad m => Pipe (SAXEvent ByteString ByteString) ByteString m ()
registryIds = do
    e <- await -- we look for a 'SAXEvent'
    case e of  -- if it matches, we yield, else we go to the next event
        StartElement "FacilitySite" [("registryId", a)] -> do
            yield a
            yield "\n"
            registryIds
        _ -> registryIds
-- 'library': PipesSax.hs
This just newtypes Pipes.ListT to get the appropriate instances. We don't export anything to do with List or ListT but just use the standard Pipes.Producer concept.
{-#LANGUAGE TypeFamilies, GeneralizedNewtypeDeriving #-}
module PipesSax (parseProducerLocations, parseProducer) where
import Data.ByteString (ByteString)
import Text.XML.Expat.SAX
import Data.List.Class
import Control.Monad
import Control.Applicative
import Pipes
import qualified Pipes.Internal as I
parseProducer
    :: (Monad m, GenericXMLString tag, GenericXMLString text)
    => ParseOptions tag text
    -> Producer ByteString m ()
    -> Producer (SAXEvent tag text) m ()
parseProducer opt =
    enumerate . enumerate_ . parseG opt . Select_ . Select

parseProducerLocations
    :: (Monad m, GenericXMLString tag, GenericXMLString text)
    => ParseOptions tag text
    -> Producer ByteString m ()
    -> Producer (SAXEvent tag text, XMLParseLocation) m ()
parseProducerLocations opt =
    enumerate . enumerate_ . parseLocationsG opt . Select_ . Select
newtype ListT_ m a = Select_ { enumerate_ :: ListT m a }
    deriving (Functor, Monad, MonadPlus, MonadIO,
              Applicative, Alternative, Monoid, MonadTrans)

instance Monad m => List (ListT_ m) where
    type ItemM (ListT_ m) = m
    joinL = Select_ . Select . I.M . liftM (enumerate . enumerate_)
    runList = liftM emend . next . enumerate . enumerate_
      where
        emend (Right (a, q)) = Cons a (Select_ (Select q))
        emend _ = Nil

Join two consumers into a single consumer that returns multiple values?

I have been experimenting with the new pipes-http package and I had a thought. I have two parsers for a web page, one that returns line items and another a number from elsewhere in the page. When I grab the page, it'd be nice to string these parsers together and get their results at the same time from the same bytestring producer, rather than fetching the page twice or fetching all the html into memory and parsing it twice.
In other words, say you have two Consumers:
c1 :: Consumer a m r1
c2 :: Consumer a m r2
Is it possible to make a function like this:
combineConsumers :: Consumer a m r1 -> Consumer a m r2 -> Consumer a m (r1, r2)
combineConsumers = undefined
I have tried a few things, but I can't figure it out. I understand if it isn't possible, but it would be convenient.
Edit:
I'm sorry, it turns out I was making an assumption about pipes-attoparsec, due to my experience with conduit-attoparsec, that caused me to ask the wrong question. pipes-attoparsec turns an attoparsec parser into a pipes Parser, whereas I had just assumed it would return a pipes Consumer. That means I can't actually turn two attoparsec parsers into consumers that take text and return a result, then use them with the plain old pipes ecosystem. I'm sorry, but I just don't understand pipes-parse.
Even though it doesn't help me, Arthur's answer is pretty much what I envisioned when I asked the question, and I'll probably end up using his solution in the future. In the meantime I'm just going to use conduit.
If the results are "monoidal", you can use the tee function from the Pipes prelude, in combination with a WriterT.
{-# LANGUAGE OverloadedStrings #-}
import Data.Monoid
import Control.Monad
import Control.Monad.Writer
import Control.Monad.Writer.Class
import Pipes
import qualified Pipes.Prelude as P
import qualified Data.Text as T
textSource :: Producer T.Text IO ()
textSource = yield "foo" >> yield "bar" >> yield "foo" >> yield "nah"

counter :: Monoid w
        => T.Text
        -> (T.Text -> w)
        -> Consumer T.Text (WriterT w IO) ()
counter word inject = P.filter (== word) >-> P.mapM (tell . inject) >-> P.drain
main :: IO ()
main = do
    result <- runWriterT $ runEffect $
        hoist lift textSource >->
            P.tee (counter "foo" inject1) >-> counter "bar" inject2
    putStrLn . show $ result
  where
    inject1 _ = (,) (Sum 1) mempty
    inject2 _ = (,) mempty (Sum 1)
Update: As mentioned in a comment, the real problem I see is that, in pipes, parsers aren't Consumers. And how can you run two parsers concurrently if they have different behaviours regarding leftovers? What happens if one of the parsers wants to "un-draw" some text and the other parser doesn't?
One possible solution is to run the parsers in a truly concurrent manner, in different threads. The primitives in the pipes-concurrency package let you "duplicate" a Producer by writing the same data to two different mailboxes. And then each parser can do whatever it wants with its own copy of the producer. Here's an example which also uses the pipes-parse, pipes-attoparsec and async packages:
{-# LANGUAGE OverloadedStrings #-}
import Data.Monoid
import qualified Data.Text as T
import Data.Attoparsec.Text hiding (takeWhile)
import Data.Attoparsec.Combinator
import Control.Applicative
import Control.Monad
import Control.Monad.State.Strict
import Pipes
import qualified Pipes.Prelude as P
import qualified Pipes.Attoparsec as P
import qualified Pipes.Concurrent as P
import qualified Control.Concurrent.Async as A
parseChars :: Char -> Parser [Char]
parseChars c = fmap mconcat $
    many (notChar c) *> many1 (some (char c) <* many (notChar c))
textSource :: Producer T.Text IO ()
textSource = yield "foo" >> yield "bar" >> yield "foo" >> yield "nah"
parseConc :: Producer T.Text IO ()
          -> Parser a
          -> Parser b
          -> IO (Either P.ParsingError a, Either P.ParsingError b)
parseConc producer parser1 parser2 = do
    (outbox1, inbox1, seal1) <- P.spawn' P.Unbounded
    (outbox2, inbox2, seal2) <- P.spawn' P.Unbounded
    feeding <- A.async $ runEffect $ producer >-> P.tee (P.toOutput outbox1)
                                              >-> P.toOutput outbox2
    sealing <- A.async $ A.wait feeding >> P.atomically seal1 >> P.atomically seal2
    r <- A.runConcurrently $
        (,) <$> (A.Concurrently $ parseInbox parser1 inbox1)
            <*> (A.Concurrently $ parseInbox parser2 inbox2)
    A.wait sealing
    return r
  where
    parseInbox parser inbox = evalStateT (P.parse parser) (P.fromInput inbox)
main :: IO ()
main = do
    (Right a, Right b) <- parseConc textSource (parseChars 'o') (parseChars 'a')
    putStrLn . show $ (a, b)
The result is:
("oooo","aa")
I'm not sure how much overhead this approach introduces.
I think something is wrong with the way you are going about this, for the reasons Davorak mentions in his remark. But if you really need such a function, you can define it.
import Pipes.Internal
import Pipes.Core
zipConsumers :: Monad m => Consumer a m r -> Consumer a m s -> Consumer a m (r, s)
zipConsumers p q = go (p, q)
  where
    go (p, q) = case (p, q) of
        (Pure r     , Pure s     ) -> Pure (r, s)
        (M mpr      , ps         ) -> M (do pr <- mpr
                                            return (go (pr, ps)))
        (pr         , M mps      ) -> M (do ps <- mps
                                            return (go (pr, ps)))
        (Request _ f, Request _ g) -> Request () (\a -> go (f a, g a))
        (Request _ f, Pure s     ) -> Request () (\a -> do r <- f a
                                                           return (r, s))
        (Pure r     , Request _ g) -> Request () (\a -> do s <- g a
                                                           return (r, s))
        (Respond x _, _          ) -> closed x
        (_          , Respond y _) -> closed y
If you are 'zipping' consumers without using their return values, only their effects, you can just use tee consumer1 >-> consumer2.
The idiomatic solution is to rewrite your Consumers as a Fold or FoldM from the foldl library and then combine them using Applicative style. You can then convert this combined fold to one that works on pipes.
Let's assume that you either have two Folds:
fold1 :: Fold a r1
fold2 :: Fold a r2
... or two FoldMs:
foldM1 :: Monad m => FoldM m a r1
foldM2 :: Monad m => FoldM m a r2
Then you combine these into a single Fold/FoldM using Applicative style:
import Control.Applicative
foldBoth :: Fold a (r1, r2)
foldBoth = (,) <$> fold1 <*> fold2
-- or: foldBoth = liftA2 (,) fold1 fold2

foldBothM :: Monad m => FoldM m a (r1, r2)
foldBothM = (,) <$> foldM1 <*> foldM2
-- or: foldBothM = liftA2 (,) foldM1 foldM2
You can turn either fold into a Pipes.Prelude-style fold or a Parser. Here are the necessary conversion functions:
import Control.Foldl (purely, impurely)
import qualified Pipes.Prelude as Pipes
import qualified Pipes.Parse as Parse
purely Pipes.fold
    :: Monad m => Fold a b -> Producer a m () -> m b

impurely Pipes.foldM
    :: Monad m => FoldM m a b -> Producer a m () -> m b

purely Parse.foldAll
    :: Monad m => Fold a b -> Parser a m b

impurely Parse.foldAllM
    :: Monad m => FoldM m a b -> Parser a m b
The reason for the purely and impurely functions is so that foldl and pipes can interoperate without either one incurring a dependency on the other. Also, they allow libraries other than pipes (like conduit) to reuse foldl without a dependency, too (hint hint, @MichaelSnoyman).
I apologize that this feature is not documented, mainly because it took me a while to figure out how to get pipes and foldl to interoperate in a dependency-free manner, and that was after I wrote the pipes tutorial. I will update the tutorial to point out this trick.
To learn how to use foldl, just read the documentation in the main module. It's a very small and easy-to-learn library.
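As a small illustration of the whole pattern (a sketch assuming the foldl and pipes packages), here are two folds combined applicatively and run over a Producer in a single pass:
import qualified Control.Foldl as L
import Control.Foldl (purely)
import Pipes
import qualified Pipes.Prelude as P

main :: IO ()
main = do
    -- average = sum / length, computed in one traversal of the stream
    let avg = (/) <$> L.sum <*> L.genericLength :: L.Fold Double Double
    r <- purely P.fold avg (each [1, 2, 3, 4])
    print r -- 2.5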
For what it's worth, in the conduit world, the relevant function is zipSinks. There might be some way to adapt this function to work for pipes, but automatic termination may get in the way.
Consumer forms a Monad, so
combineConsumers = liftM2 (,)
will typecheck. Unfortunately, the semantics might be unlike what you're expecting: the first consumer will run to completion and then the second.
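A quick sketch of that behaviour, using pipes' await as the two consumers (the infinite producer is only there to make the return types line up):
import Control.Monad (forever, liftM2)
import Pipes

main :: IO ()
main = do
    r <- runEffect $ forever (each [(1 :: Int) ..]) >-> liftM2 (,) await await
    print r -- (1,2): the second consumer picks up where the first stopped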

Nested ResourceT scopes within a conduit Sink

Is there a way to scope runResourceT to the lifetime of a single Sink?
I'm trying to build a Sink that wraps a potentially infinite number of Sinks. This works fine with threads but I'm trying to do it without threads. It seems like it should be possible. I've hit a road block due to the scoping of runResourceT: I get either too coarsely grained (but functional) or much too finely grained (totally broken) resource management.
{-# LANGUAGE FlexibleContexts #-}
import Control.Monad.Trans (lift)
import Control.Monad.Trans.Resource
import Data.ByteString (ByteString)
import qualified Data.ByteString.Char8 as BC8 (pack)
import Data.Conduit
import qualified Data.Conduit.Binary as Cb
import qualified Data.Conduit.List as Cl
import System.FilePath ((<.>))
test :: IO ()
test =
    runResourceT
        $ Cl.sourceList (fmap (BC8.pack . show) [(1 :: Int)..1000])
        $$ rotateResourceHog "/tmp/foo"
-- |
-- files are allocated on demand but handles are released at the same time
rotateResourceHog
    :: MonadResource m
    => FilePath -> Sink ByteString m ()
rotateResourceHog filePath = step 0
  where
    step i = do
        x <- Cl.peek
        case x of
            Just _ -> do
                chunkWriter $ filePath <.> show (i :: Integer)
                -- loop
                step $ i + 1
            Nothing -> return ()
-- |
-- files are allocated on demand but handles are released immediately
rotateUsingClosedHandles
    :: (MonadBaseControl IO m, MonadResource m)
    => FilePath -> Sink ByteString m ()
rotateUsingClosedHandles filePath = step 0
  where
    step i = do
        x <- Cl.peek
        case x of
            Just _ -> do
                transPipe runResourceT . chunkWriter $ filePath <.> show (i :: Integer)
                -- loop
                step $ i + 1
            Nothing -> return ()
chunkWriter
    :: MonadResource m
    => FilePath -> Sink ByteString m ()
chunkWriter filePath = do
    _ <- lift $ allocate (putStrLn "alloc") (\_ -> putStrLn "free")
    -- the actual conduit chain is more complicated
    Cl.isolate 100 =$= Cb.sinkFile filePath
ResourceT is only intended to clean up resources in exceptional cases. It is not intended to provide prompt finalization, only guaranteed finalization. For promptness, conduit provides its own facilities for handling cleanup. In your case, you're looking for both: you want cleanup to happen as early as possible, and to occur even in the case of an exception being thrown. For this, you should use bracketP. For example:
chunkWriter
    :: MonadResource m
    => FilePath -> Sink ByteString m ()
chunkWriter filePath = bracketP
    (putStrLn "alloc")
    (\() -> putStrLn "free")
    (\() -> Cl.isolate 100 =$= Cb.sinkFile filePath)
This results in the desired interleaving of alloc and free outputs.

What is the preferred way to combine two sinks?

I've used zipSinks :: Monad m => Sink i m r -> Sink i m r' -> Sink i m (r, r') for this but it is considered deprecated.
Edit
After considering this, I don't think it is possible with the current version of Data.Conduit. Pipes aren't Categories, so &&& is out of the question. And there's no way that I can think of to pull results from upstream, feed them incrementally to both sinks, and short-circuit when the first sink finishes. (Although I don't think that Data.Conduit.Util.zipSinks short-circuits this way, it seems like it would be very desirable.) Except of course, to pattern match on both Sinks (like zipSinks in the package does), but that's what we're trying to avoid here.
That said, I would love to be proven wrong here.
It's not pretty, but you can do this in a kind-of obvious way.
First imports:
module Main where
import Control.Monad.Trans
import Data.Conduit
import qualified Data.Conduit.Binary as CB
import qualified Data.Conduit.List as CL
import qualified Data.Conduit.Text as CT
import qualified Data.Conduit.Util as CU
import Data.Maybe
import Data.Text (unpack)
Now for zipSinks. Basically, you want to create a sink that pulls the input from upstream and sends it to each child sink separately. In this case, I've used CL.sourceList to do this. If await returns Nothing, maybeToList returns an empty list, so the child sinks are also run with no input. Finally, the output of each child sink is then fed into the tuple.
zipSinks :: Monad m => Sink i m r -> Sink i m r' -> Sink i m (r, r')
zipSinks s1 s2 = do
    l <- fmap maybeToList await
    o1 <- lift $ CL.sourceList l $$ s1
    o2 <- lift $ CL.sourceList l $$ s2
    return (o1, o2)
Here are some examples of using zipSinks. It appears to work fine both inside of IO and outside of it, and in the few tests I did, the output matches the output of zipped', created using the old zipSinks.
doubleHead :: Monad m => Sink Int m (Maybe Int)
doubleHead = await >>= return . fmap (2*)
-- old version
zipped' :: Monad m => Sink Int m (Maybe Int, Maybe Int)
zipped' = CU.zipSinks CL.head doubleHead
-- new version
zipped :: Monad m => Sink Int m (Maybe Int, Maybe Int)
zipped = zipSinks CL.head doubleHead
fromList = CL.sourceList [7, 8, 9] $$ zipped
-- (Just 7, Just 14)
fromFile :: String -> IO (Maybe Int, Maybe Int)
fromFile filename = runResourceT $
    CB.sourceFile filename
        $= CB.lines
        $= CT.decode CT.utf8
        $= CL.map (read . unpack)
        $$ zipped
-- for a file with the lines:
--
-- 1
-- 2
-- 3
--
-- returns (Just 1, Just 2)
(The package is conduit-0.5.2.3. The whole module is just for backwards compatibility.)
[edit]
So, my straightforward monadic guess (see below) seems to be wrong, even though the types are correct.
Now, I can only guess that the answer is:
The replacement features are still in development, much like the Pipe/Conduit concepts and libraries themselves.
I'd wait for the next API to solve this question, and still use zipSinks until then.
(Maybe it was just misplaced.)
[/edit]
I'm not that familiar with this package, but wouldn't it do just the same as this?
zipSinks :: Monad m => Sink i m r -> Sink i m r' -> Sink i m (r, r')
zipSinks s1 s2 = (,) <$> s1 <*> s2
It is a Monad after all. (Functor, Applicative)
zipSinks :: Monad sink => sink r -> sink r' -> sink (r, r')
zipSinks s1 s2 = liftM2 (,) s1 s2
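A quick way to see the difference (a sketch assuming the imports from the earlier answer, plus Control.Monad for liftM2): the monadic combination runs the sinks one after another over the stream, while the deprecated zipSinks fed both sinks the same input.
demo :: IO ()
demo = do
    r <- CL.sourceList [7, 8, 9] $$ liftM2 (,) CL.head CL.head
    print r -- (Just 7, Just 8): the second sink resumes where the first stopped
    r' <- CL.sourceList [7, 8, 9] $$ CU.zipSinks CL.head CL.head
    print r' -- (Just 7, Just 7): the old zipSinks gave both sinks the same input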
