Limiting pipes based on time? - haskell
Is it possible to create pipes that get all values that have been sent downstream in a certain time period? I'm implementing a server where the protocol allows me to concatenate outgoing packets and compress them together, so I'd like to effectively "empty out" the queue of downstream ByteStrings every 100ms and mappend them together to then yield on to the next pipe which does the compression.
Here's a solution using pipes-concurrency. You give it any Input and it will periodically drain the input of all values:
import Control.Applicative ((<|>))
import Control.Concurrent (threadDelay)
import Data.Foldable (forM_)
import Pipes
import Pipes.Concurrent
drainAll :: Input a -> STM (Maybe [a])
drainAll i = do
    ma <- recv i
    case ma of
        Nothing -> return Nothing
        Just a  -> loop (a:)
  where
    loop diffAs = do
        ma <- recv i <|> return Nothing
        case ma of
            Nothing -> return (Just (diffAs []))
            Just a  -> loop (diffAs . (a:))

bucketsEvery :: Int -> Input a -> Producer [a] IO ()
bucketsEvery microseconds i = loop
  where
    loop = do
        lift $ threadDelay microseconds
        ma <- lift $ atomically $ drainAll i
        forM_ ma $ \a -> do
            yield a
            loop
This gives you much greater control over how you consume elements from upstream, by selecting the type of Buffer you use to build the Input.
If you're new to pipes-concurrency, you can read the tutorial which explains how to use spawn, Buffer and Input.
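To connect this to the original question, here is a hedged sketch (my wiring, not part of the answer above) of how bucketsEvery might be combined with spawn: packets go into a mailbox from one thread, and every 100ms the accumulated bucket is mappended together and handed to a placeholder compression step. It assumes a pipes-concurrency version whose spawn takes the unbounded buffer:

{-# LANGUAGE OverloadedStrings #-}

import Control.Concurrent (forkIO)
import qualified Data.ByteString as B
import Pipes
import Pipes.Concurrent
import qualified Pipes.Prelude as P

-- stand-in for the real compression step
compress :: B.ByteString -> B.ByteString
compress = id

main :: IO ()
main = do
    (output, input) <- spawn unbounded
    -- a fake packet source feeding the mailbox from another thread;
    -- when it finishes and the Output is garbage collected, the
    -- mailbox closes and bucketsEvery terminates
    _ <- forkIO $ runEffect $
        each (replicate 10 ("packet" :: B.ByteString)) >-> toOutput output
    -- drain every 100ms, mappend each bucket, then 'compress' it
    runEffect $ bucketsEvery 100000 input
            >-> P.map (compress . B.concat)
            >-> P.print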
Here is a possible solution. It is based on a Pipe that tags ByteStrings going downstream with a Bool, in order to identify ByteStrings belonging to the same "time bucket".
First, some imports:
import Data.AdditiveGroup
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Lazy.Builder as BB
import Data.Thyme.Clock
import Data.Thyme.Clock.POSIX
import Control.Monad.State.Strict
import Control.Lens (view)
import Control.Concurrent (threadDelay)
import Pipes
import Pipes.Lift
import qualified Pipes.Prelude as P
import qualified Pipes.Group as PG
import Data.Monoid ((<>)) -- needed by 'batch' below on older GHCs
Here is the tagging Pipe. It uses StateT internally:
tagger :: Pipe B.ByteString (B.ByteString,Bool) IO ()
tagger = do
    startTime <- liftIO getPOSIXTime
    evalStateP (startTime,False) $ forever $ do
        b <- await
        currentTime <- liftIO getPOSIXTime
        -- the inner state is (POSIXTime,Bool)
        (baseTime,tag) <- get
        if currentTime ^-^ baseTime > timeLimit
            then let tag' = not tag
                 in  yield (b,tag') >> put (currentTime,tag')
            else yield (b,tag)
  where
    timeLimit = fromSeconds 0.1
Then we can use functions from the pipes-group package to group ByteStrings belonging to the same "time bucket" into lazy ByteStrings:
batch :: Producer B.ByteString IO () -> Producer BL.ByteString IO ()
batch producer = PG.folds (<>) mempty BB.toLazyByteString
               . PG.maps (flip for $ yield . BB.byteString . fst)
               . view (PG.groupsBy $ \t1 t2 -> snd t1 == snd t2)
               $ producer >-> tagger
It seems to batch correctly. This program:
main :: IO ()
main = do
    count <- P.length $ batch (yield "boo" >> yield "baa")
    print count
    count' <- P.length $ batch (yield "boo" >> yield "baa"
                                >> liftIO (threadDelay 200000) >> yield "ddd")
    print count'
Has the output:
1
2
Notice that the contents of a "time bucket" are only yielded when the first element of the next bucket arrives. They are not yielded automatically every 100ms. This may or may not be a problem for you. If you want to yield automatically every 100ms, you would need a different solution, possibly based on pipes-concurrency.
Also, you could consider working directly with the FreeT-based "effectful lists" provided by pipes-group. That way you could start compressing the data in a "time bucket" before the bucket is full, as in the sketch below.
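For illustration, here is a minimal sketch of that FreeT-based idea (my sketch, reusing the imports and the tagger above): each time bucket stays a Producer, and PG.maps applies a stand-in compressChunk to every ByteString as it arrives, so work can start before the bucket is closed:

-- 'compressChunk' is a placeholder for an incremental compression step
compressChunk :: B.ByteString -> B.ByteString
compressChunk = id

batchStreaming :: Producer B.ByteString IO () -> Producer B.ByteString IO ()
batchStreaming producer = PG.concats
    . PG.maps (\g -> for g (yield . compressChunk . fst))
    . view (PG.groupsBy $ \t1 t2 -> snd t1 == snd t2)
    $ producer >-> tagger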
Unlike Daniel's answer, mine does not tag the data as it is produced. It just takes at least one element from upstream and then continues to aggregate more values into the monoid until the time interval has passed.
This code uses a list to aggregate, but there are better monoids to aggregate with:
import Pipes
import qualified Pipes.Prelude as P
import Data.Time.Clock
import Data.Time.Calendar
import Data.Time.Format
import Data.Monoid
import Control.Monad
-- taken from pipes-rt
doubleToNomDiffTime :: Double -> NominalDiffTime
doubleToNomDiffTime x =
    let d0 = ModifiedJulianDay 0
        t0 = UTCTime d0 (picosecondsToDiffTime 0)
        t1 = UTCTime d0 (picosecondsToDiffTime $ floor (x/1e-12))
    in  diffUTCTime t1 t0
-- adapted from pipes-parse-1.0
wrap :: Monad m => Producer a m r -> Producer (Maybe a) m r
wrap p = do
    p >-> P.map Just
    forever $ yield Nothing
yieldAggregateOverTime
    :: (Monoid y,  -- Monoid dependence so we can do aggregation
        MonadIO m  -- to be able to get the current time, the base
                   -- monad must have access to IO
       )
    => (t -> y) -- change an element from upstream into the monoid
    -> Double   -- time in seconds to aggregate over
    -> Pipe (Maybe t) y m ()
yieldAggregateOverTime wrap period = do
    t0 <- liftIO getCurrentTime
    loop mempty (dtUTC `addUTCTime` t0)
  where
    dtUTC = doubleToNomDiffTime period
    loop m ts = do
        t  <- liftIO getCurrentTime
        v0 <- await -- await at least one element
        case v0 of
            Nothing -> yield m
            Just v ->
                if t > ts
                    then do
                        yield (m <> wrap v)
                        loop mempty (dtUTC `addUTCTime` ts)
                    else loop (m <> wrap v) ts
main :: IO ()
main = runEffect $ wrap (each [1..])
               >-> yieldAggregateOverTime (\x -> [x]) 0.0001
               >-> P.take 10
               >-> P.print
Depending on CPU load, the output data will be aggregated differently, with at least one element in each chunk.
$ ghc Main.hs -O2
$ ./Main
[1,2]
[3]
[4]
[5]
[6]
[7]
[8]
[9]
[10]
[11]
$ ./Main
[1,2]
[3]
[4]
[5]
[6,7,8,9,10]
[11,12,13,14,15,16,17,18]
[19,20,21,22,23,24,25,26]
[27,28,29,30,31,32,33,34]
[35,36,37,38,39,40,41,42]
[43,44,45,46,47,48,49,50]
$ ./Main
[1,2,3,4,5,6]
[7]
[8]
[9,10,11,12,13,14,15,16,17,18,19,20]
[21,22,23,24,25,26,27,28,29,30,31,32,33]
[34,35,36,37,38,39,40,41,42,43,44]
[45,46,47,48,49,50,51,52,53,54,55]
[56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72]
[73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88]
[89,90,91,92,93,94,95,96,97,98,99,100,101,102,103]
$ ./Main
[1,2,3,4,5,6,7]
[8]
[9]
[10,11,12,13,14,15,16,17,18]
[19,20,21,22,23,24,25,26,27]
[28,29,30,31,32,33,34,35,36,37]
[38,39,40,41,42,43,44,45,46]
[47,48,49,50]
[51,52,53,54,55,56,57]
[58,59,60,61,62,63,64,65,66]
You might want to look at the source code of pipes-rt; it shows one approach to dealing with time in pipes.
Edit: thanks to Daniel Díaz Carrete, the pipes-parse-1.0 technique was adapted to handle upstream termination. A pipes-group solution should be possible using the same technique as well.
Related
Using IO inside the ST monad and runSTUArray
I'm just starting to experiment with monad transformers, so I may have missed something trivial. Anyway: how can I print from inside the ST monad? I want to create code that is almost as fast as C, by reading and writing as quickly as possible to memory, with C-like for loops. I have an ST monad action that safely mutates an unboxed array, which I run with runSTUArray. My question is: how can I use IO actions inside the ST action? If it were a State monad action, I could use the StateT transformer to lift the IO actions, but how is that done for the ST monad? Example code below.

Example: ST.

-- how can I print from inside the ST monad?
import Control.Monad.ST (ST)
import Data.Array.Base (STUArray(STUArray), newArray, readArray, writeArray)
import Data.Array.ST (runSTUArray)
import Data.Array.Unboxed (elems)
import Control.Monad (forM_)

test :: IO ()
test = print . elems $ runSTUArray action
  where
    action :: ST s (STUArray s Int Int)
    action = do
        arr <- newArray (1,10) 1
        forM_ [3..10] $ \i -> do
            -- liftIO . print $ "i is " ++ show i -- <--- How should I do this?
            x1 <- readArray arr (i-1)
            x2 <- readArray arr (i-2)
            writeArray arr i (x1+x2)
        return arr

Example: StateT.

-- here it is possible to lift the print to use it inside the monad
import Data.Array.IArray (listArray, (!), (//), elems)
import Data.Array (Array)
import Control.Monad.Trans.State.Strict (StateT(runStateT), get, put)
import Control.Monad.IO.Class (MonadIO(liftIO))
import Control.Monad (forM_)

test :: IO ()
test = do
    let n = listArray (1,10) [1..] :: Array Int Int
    (_,x) <- runStateT action n
    print $ elems x
    return ()
  where
    action = do
        forM_ [3..10] $ \i -> do
            x <- get
            liftIO . print $ "i is " ++ show i -- <--- here printing works fine
            put (x // [(i, x!(i-1) + x!(i-2))])
You don't print from an ST action, you print from an IO action. Luckily for you, there are IOUArrays -- and they are even STUArrays under the hood, so there can be no fear of performance loss.
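To make that concrete, here is a minimal sketch of the suggestion (my code, using the standard Data.Array.IO API), rewriting the ST example so that printing needs no lifting at all:

import Data.Array.IO (IOUArray, getElems, newArray, readArray, writeArray)
import Control.Monad (forM_)

test :: IO ()
test = do
    arr <- newArray (1,10) 1 :: IO (IOUArray Int Int)
    forM_ [3..10] $ \i -> do
        print $ "i is " ++ show i -- plain IO: no lifting required
        x1 <- readArray arr (i-1)
        x2 <- readArray arr (i-2)
        writeArray arr i (x1+x2)
    getElems arr >>= print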
Efficient streaming and manipulation of a byte stream in Haskell
While writing a deserialiser for a large (<bloblength><blob>)* encoded binary file, I got stuck with the various Haskell produce-transform-consume libraries. So far I'm aware of four streaming libraries:

Data.Conduit: widely used, has very careful resource management
Pipes: similar to conduit (Haskell Cast #6 nicely reveals the differences between conduit and pipes)
Data.Binary.Get: offers useful functions such as getWord32be, but the streaming example is awkward
System.IO.Streams: seems to be the easiest one to use

Here's a stripped-down example of where things go wrong when I try to do Word32 streaming with conduit. A slightly more realistic example would first read a Word32 that determines the blob length, and then yield a lazy ByteString of that length (which is then deserialised further). But here I just try to extract Word32s in streaming fashion from a binary file:

module Main where

-- build-depends: bytestring, conduit, conduit-extra, resourcet, binary

import Control.Monad.Trans.Resource (MonadResource, runResourceT)
import qualified Data.Binary.Get as G
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as C
import qualified Data.ByteString.Lazy as BL
import Data.Conduit
import qualified Data.Conduit.Binary as CB
import qualified Data.Conduit.List as CL
import Data.Word (Word32)
import System.Environment (getArgs)

-- gets a Word32 from a ByteString
getWord32 :: C.ByteString -> Word32
getWord32 bs = G.runGet G.getWord32be $ BL.fromStrict bs

-- should read ByteStrings and return Word32s
transform :: (Monad m, MonadResource m) => Conduit BS.ByteString m Word32
transform = do
    mbs <- await
    case mbs of
        Just bs -> case C.null bs of
            False -> do
                yield $ getWord32 bs
                leftover $ BS.drop 4 bs
                transform
            True -> return ()
        Nothing -> return ()

main :: IO ()
main = do
    filename <- fmap (!!0) getArgs -- should check length getArgs
    result <- runResourceT $ CB.sourceFile filename $$ transform =$ CL.consume
    print $ length result -- is always 8188 for files larger than 32752 bytes

The output of the program is just the number of Word32s that were read. It turns out the stream terminates after reading the first chunk (about 32KiB). For some reason mbs is never Nothing, so I must check null bs, which stops the stream when the chunk is consumed. Clearly, my conduit transform is faulty. I see two routes to a solution:

The await doesn't want to go to the second chunk of the ByteStream, so is there another function that pulls the next chunk? In the examples I've seen (e.g. Conduit 101) this is not how it's done.
This is just the wrong way to set up transform. How is this done properly?

Is this the right way to go? (Performance does matter.)

Update: here's a BAD way to do it using System.IO.Streams:

module Main where

import Data.Word (Word32)
import System.Environment (getArgs)
import System.IO (IOMode (ReadMode), openFile)
import qualified System.IO.Streams as S
import System.IO.Streams.Binary (binaryInputStream)
import System.IO.Streams.List (outputToList)

main :: IO ()
main = do
    filename : _ <- getArgs
    h <- openFile filename ReadMode
    s <- S.handleToInputStream h
    i <- binaryInputStream s :: IO (S.InputStream Word32)
    r <- outputToList $ S.connect i
    print $ last r

'Bad' means: very demanding in time and space, and it does not handle the Decode exception.
Your immediate problem is caused by how you are using leftover. That function is used to "Provide a single piece of leftover input to be consumed by the next component in the current monadic binding", and so when you give it bs before looping with transform you are effectively throwing away the rest of the bytestring (i.e. what is after bs).

A correct solution based on your code would use the incremental input interface of Data.Binary.Get to replace your yield/leftover combination with something that consumes each chunk fully. A more pragmatic approach, though, is using the binary-conduit package, which provides that in the shape of conduitGet (its source gives a good idea of what a "manual" implementation would look like):

import Data.Conduit.Serialization.Binary
-- etc.

transform :: (Monad m, MonadResource m) => Conduit BS.ByteString m Word32
transform = conduitGet G.getWord32be

One caveat is that this will throw a parse error if the total number of bytes is not a multiple of 4 (i.e. the last Word32 is incomplete). In the unlikely case of that not being what you want, a lazy way out would be simply using \bs -> C.take (4 * (C.length bs `div` 4)) bs on the input bytestring.
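For the record, here is a rough sketch of what a "manual" incremental implementation could look like, driving Data.Binary.Get's Decoder by hand (this is an illustration of the idea, not the binary-conduit source):

-- feed each chunk into the incremental Decoder, yielding every complete
-- Word32 and pushing leftovers into a fresh Decoder
transformIncremental :: Monad m => Conduit BS.ByteString m Word32
transformIncremental = loop (G.runGetIncremental G.getWord32be)
  where
    loop (G.Done rest _ w) = do
        yield w
        loop (G.runGetIncremental G.getWord32be `G.pushChunk` rest)
    loop (G.Fail {}) = return () -- or report the parse error
    loop p@(G.Partial _) = do
        mbs <- await
        case mbs of
            Just bs -> loop (p `G.pushChunk` bs)
            Nothing -> return () -- a trailing incomplete Word32 is dropped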
With pipes (and pipes-group and pipes-bytestring) the demo problem reduces to combinators. First we resolve the incoming undifferentiated byte stream into little 4-byte chunks:

chunksOfStrict :: (Monad m) => Int -> Producer ByteString m r -> Producer ByteString m r
chunksOfStrict n = folds mappend mempty id . view (Bytes.chunksOf n)

Then we map these to Word32s and (here) count them:

main :: IO ()
main = do
    filename:_ <- getArgs
    IO.withFile filename IO.ReadMode $ \h -> do
        n <- P.length $ chunksOfStrict 4 (Bytes.fromHandle h) >-> P.map getWord32
        print n

This will fail if we have fewer than 4 bytes or otherwise fail to parse, but we can just as well map with

getMaybeWord32 :: ByteString -> Maybe Word32
getMaybeWord32 bs = case G.runGetOrFail G.getWord32be $ BL.fromStrict bs of
    Left r -> Nothing
    Right (_, off, w32) -> Just w32

The following program will then print the parses for the valid 4-byte sequences:

main :: IO ()
main = do
    filename:_ <- getArgs
    IO.withFile filename IO.ReadMode $ \h ->
        runEffect $ chunksOfStrict 4 (Bytes.fromHandle h)
                >-> P.map getMaybeWord32
                >-> P.concat -- here `concat` eliminates the Maybes
                >-> P.print

There are other ways of dealing with failed parses, of course.

Here, though, is something closer to the program you asked for. It takes a four-byte segment from a byte stream (Producer ByteString m r) and reads it as a Word32 if it is long enough; it then takes that many of the incoming bytes and accumulates them into a lazy bytestring, yielding it. It just repeats this until it runs out of bytes. In main below, I print each yielded lazy bytestring that is produced:

module Main (main) where

import Pipes
import qualified Pipes.Prelude as P
import Pipes.Group (folds)
import qualified Pipes.ByteString as Bytes (splitAt, fromHandle, chunksOf)
import Control.Lens (view) -- or Lens.Simple (view), or Lens.Micro ((^.))
import qualified System.IO as IO (IOMode(ReadMode), withFile)
import qualified Data.Binary.Get as G (runGet, getWord32be)
import Data.ByteString (ByteString)
import qualified Data.ByteString.Lazy.Char8 as BL
import System.Environment (getArgs)

splitLazy :: (Monad m, Integral n)
          => n -> Producer ByteString m r -> m (BL.ByteString, Producer ByteString m r)
splitLazy n bs = do
    (bss, rest) <- P.toListM' $ view (Bytes.splitAt n) bs
    return (BL.fromChunks bss, rest)

measureChunks :: Monad m => Producer ByteString m r -> Producer BL.ByteString m r
measureChunks bs = do
    (lbs, rest) <- lift $ splitLazy 4 bs
    if BL.length lbs /= 4
        then rest >-> P.drain -- in fact it will be empty
        else do
            let w32 = G.runGet G.getWord32be lbs
            (lbs', rest') <- lift $ splitLazy w32 rest
            yield lbs'
            measureChunks rest'

main :: IO ()
main = do
    filename:_ <- getArgs
    IO.withFile filename IO.ReadMode $ \h ->
        runEffect $ measureChunks (Bytes.fromHandle h) >-> P.print

This is again crude in that it uses runGet, not runGetOrFail, but that is easily repaired. The pipes standard procedure would be to stop the stream transformation on a failed parse and return the unparsed bytestream. If you were anticipating that the Word32s were for large numbers, so that you did not want to accumulate the corresponding stream of bytes as a lazy bytestring but, say, write them to different files without accumulating, we could change the program pretty easily to do that. This would require a sophisticated use of conduit, but it is the preferred approach with pipes and streaming.
Here's a relatively straightforward solution that I want to throw into the ring. It's a repeated use of splitAt wrapped into a State monad that gives an interface identical to (a subset of) Data.Binary.Get. The resulting [ByteString] is obtained in main with a whileJust over getBlob.

module Main (main) where

import Control.Monad.Loops
import Control.Monad.State
import qualified Data.Binary.Get as G (getWord32be, runGet)
import qualified Data.ByteString.Lazy as BL
import Data.Int (Int64)
import Data.Word (Word32)
import System.Environment (getArgs)

-- this is going to mimic the Data.Binary.Get.Get monad
type Get = State BL.ByteString

getWord32be :: Get (Maybe Word32)
getWord32be = state $ \bs ->
    let (w, rest) = BL.splitAt 4 bs
    in case BL.length w of
         4 -> (Just w', rest) where w' = G.runGet G.getWord32be w
         _ -> (Nothing, BL.empty)

getLazyByteString :: Int64 -> Get BL.ByteString
getLazyByteString n = state $ \bs -> BL.splitAt n bs

getBlob :: Get (Maybe BL.ByteString)
getBlob = do
    ml <- getWord32be
    case ml of
        Nothing -> return Nothing
        Just l -> do
            blob <- getLazyByteString (fromIntegral l :: Int64)
            return $ Just blob

runGet :: Get a -> BL.ByteString -> a
runGet g bs = fst $ runState g bs

main :: IO ()
main = do
    fname <- head <$> getArgs
    bs <- BL.readFile fname
    let ls = runGet loop bs
          where loop = whileJust getBlob return
    print $ length ls

There's no error handling in getBlob, but it's easy to extend. Time and space complexity is quite good, as long as the resulting list is used carefully. (The Python script that creates some random data for consumption by the above is here.)
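As one illustration of such an extension (my sketch, not the answer's code), getBlob could be made to reject a blob whose remaining input is shorter than its announced length:

getBlobChecked :: Get (Maybe BL.ByteString)
getBlobChecked = do
    ml <- getWord32be
    case ml of
        Nothing -> return Nothing
        Just l -> do
            blob <- getLazyByteString (fromIntegral l :: Int64)
            return $ if BL.length blob == fromIntegral l
                then Just blob -- the full blob was present
                else Nothing   -- truncated input: stop the parse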
Collecting the Async results as they become available
How would you collect the results of a list of Async a in Haskell as they become available? The idea is to start processing the results of asynchronous tasks as soon as they are available. The best I could come up with is the following function:

collect :: [Async a] -> IO [a]
collect [] = return []
collect asyncs = do
    (a, r) <- waitAny asyncs
    rs <- collect (filter (/= a) asyncs)
    return (r:rs)

However, this function does not exhibit the desired behaviour since, as pointed out in the comment below, it doesn't return till all the asynchronous tasks are completed. Furthermore, collect runs in O(n^2), since I'm filtering the list at each recursive step. This could be improved by using a more efficient structure (and maybe indexing the position of the Async values in the list). Maybe there are library functions that take care of this, but I could not find them in the Control.Concurrent.Async module, and I wonder why.

EDIT: after thinking about the problem a bit more carefully, I'm wondering whether such a function is a good idea. I could just use fmap on the asynchronous tasks. Maybe it is better practice to wait for the results when there is no other choice.
As I mentioned in my other answer, streaming results out of a list of Asyncs as they become available is best achieved using a stream processing library. Here's an example using pipes.

import Control.Concurrent (forkIO, threadDelay)
import Control.Concurrent.Async
import Control.Concurrent.STM
import Data.Functor (($>))
import Pipes
import Pipes.Concurrent -- from the pipes-concurrency package
import qualified Pipes.Prelude as P

asCompleted :: MonadIO m => [Async a] -> Producer a m ()
asCompleted asyncs = do
    (o, i, seal) <- liftIO $ spawn' unbounded
    _ <- liftIO $ forkIO $ do
        forConcurrently asyncs (\async -> atomically $ waitSTM async >>= send o)
        atomically seal
    fromInput i

main = do
    actions <- traverse async [threadDelay 2000000 $> "bar", threadDelay 1000000 $> "foo"]
    runEffect $ asCompleted actions >-> P.print
    -- after one second, prints "foo", then "bar" a second later

Using pipes-concurrency, we spawn' an Output-Input pair and immediately convert the Input to a Producer using fromInput. Asynchronously, we send items as they become available. When all the Asyncs have completed we seal the inbox to close down the Producer.
Implemented via TChan. I additionally implemented a version which can react immediately, but it is more complex and might also have problems with exceptions (if you want to receive exceptions, use SlaveThread.fork instead of forkIO), so I commented that code out in case you're not interested in it:

import Control.Concurrent (threadDelay)
import Control.Concurrent (forkIO)
import Control.Concurrent.Async
import Control.Concurrent.STM
import Control.Monad

collect :: [Async a] -> IO [a]
collect = atomically . collectSTM

collectSTM :: [Async a] -> STM [a]
collectSTM as = do
    c <- newTChan
    collectSTMChan c as

collectSTMChan :: TChan a -> [Async a] -> STM [a]
collectSTMChan chan as = do
    mapM_ (waitSTM >=> writeTChan chan) as
    replicateM (length as) (readTChan chan)

main :: IO ()
main = do
    a1 <- async (threadDelay 2000000 >> putStrLn "slept 2 secs" >> return 2)
    a2 <- async (threadDelay 3000000 >> putStrLn "slept 3 secs" >> return 3)
    a3 <- async (threadDelay 1000000 >> putStrLn "slept 1 sec" >> return 1)
    res <- collect [a1,a2,a3]
    putStrLn (show res)

    -- reacting immediately:
    -- a1 <- async (threadDelay 2000000 >> putStrLn "slept 2 secs" >> return 2)
    -- a2 <- async (threadDelay 3000000 >> putStrLn "slept 3 secs" >> return 3)
    -- a3 <- async (threadDelay 1000000 >> putStrLn "slept 1 sec" >> return 1)
    -- c <- collectChan [a1,a2,a3]
    -- replicateM_ 3 (atomically (readTChan c) >>= \v -> putStrLn ("Received: " ++ show v))

-- collectChan :: [Async a] -> IO (TChan a)
-- collectChan as = do
--     c <- newTChanIO
--     forM_ as $ \a -> forkIO ((atomically . (waitSTM >=> writeTChan c)) a)
--     return c
I'm reading your question as "is it possible to sort a list of Asyncs by their completion time?". If that's what you meant, the answer is yes.

import Control.Applicative (liftA2)
import Control.Concurrent (threadDelay)
import Control.Concurrent.Async
import Data.Functor (($>))
import Data.List (sortBy)
import Data.Ord (comparing)
import Data.Time (getCurrentTime)

sortByCompletion :: [Async a] -> IO [a]
sortByCompletion = fmap (fmap fst . sortBy (comparing snd)) . mapConcurrently withCompletionTime
  where withCompletionTime async = liftA2 (,) (wait async) getCurrentTime

main = do
    asyncs <- traverse async [threadDelay 2000000 $> "bar", threadDelay 1000000 $> "foo"]
    sortByCompletion asyncs
    -- ["foo", "bar"], after two seconds

Using mapConcurrently we wait for each Async on a separate thread. Upon completion we get the current time - the time at which the Async completed - and use it to sort the results. This is O(n log n) complexity, because we are sorting the list. (Your original algorithm was effectively a selection sort.)

Like your collect, sortByCompletion doesn't return until all the Asyncs in the list have completed. If you want to stream results onto the main thread as they become available, well, lists aren't a very good tool for that. I'd use a streaming abstraction like conduit or pipes, or, working at a lower level, a TQueue. See my other answer for an example.
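To sketch the lower-level TQueue alternative mentioned at the end (illustrative code; exception propagation is deliberately ignored here):

import Control.Concurrent (forkIO)
import Control.Concurrent.Async (Async, waitSTM)
import Control.Concurrent.STM
import Control.Monad (forM_, replicateM_, void)

-- run 'handle' on each result as its Async completes
streamResults :: [Async a] -> (a -> IO ()) -> IO ()
streamResults asyncs handle = do
    q <- newTQueueIO
    forM_ asyncs $ \a ->
        void $ forkIO $ atomically (waitSTM a >>= writeTQueue q)
    replicateM_ (length asyncs) (atomically (readTQueue q) >>= handle)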
How to parse a large XML file in Haskell with limited amount of resources?
I want to extract information from a large XML file (around 20G) in Haskell. Since it is a large file, I used SAX parsing functions from Hexpat. Here is a simple piece of code I tested:

import qualified Data.ByteString.Lazy as L
import Text.XML.Expat.SAX as Sax

parse :: FilePath -> IO ()
parse path = do
    inputText <- L.readFile path
    let saxEvents = Sax.parse defaultParseOptions inputText :: [SAXEvent Text Text]
    let txt = foldl' processEvent "" saxEvents
    putStrLn txt

After activating profiling in Cabal, it says that parse.saxEvents took 85% of allocated memory. I also used foldr, and the result is the same. If processEvent becomes complex enough, the program crashes with a stack space overflow error. What am I doing wrong?
You don't say what processEvent is like. In principle, it ought to be unproblematic to use lazy ByteString for a strict left fold over lazily generated input, so I'm not sure what is going wrong in your case. But one ought to use streaming-appropriate types when dealing with gigantic files!

In fact, hexpat does have a 'streaming' interface (just like xml-conduit). It uses the not-too-well-known List library and the rather ugly List class it defines. In principle the ListT type from the List package should work well. I gave up quickly because of a lack of combinators, and wrote an appropriate instance of the ugly List class for a wrapped version of Pipes.ListT, which I then used to export ordinary Pipes.Producer functions like parseProducer. The trivial manipulations needed for this are appended below as PipesSax.hs.

Once we have parseProducer, we can convert a ByteString or Text Producer into a Producer of SAXEvents with Text or ByteString components. Here are some simple operations. I was using a 238M "input.xml"; the programs never need more than 6 MB of memory, to judge from looking at top.

-- Sax.hs  Most of the IO actions use the registryIds pipe defined at the
-- bottom, which is tailored to a giant bit of xml of which this is a valid
-- 1000-line fragment: http://sprunge.us/WaQK

{-#LANGUAGE OverloadedStrings #-}

import PipesSax (parseProducer)
import Data.ByteString (ByteString)
import Text.XML.Expat.SAX
import Pipes -- cabal install pipes pipes-bytestring
import Pipes.ByteString (toHandle, fromHandle, stdin, stdout)
import qualified Pipes.Prelude as P
import qualified System.IO as IO
import qualified Data.ByteString.Char8 as Char8

sax :: MonadIO m => Producer ByteString m () -> Producer (SAXEvent ByteString ByteString) m ()
sax = parseProducer defaultParseOptions

-- stream xml from stdin, yielding the hexpat tag stream to stdout
main0 :: IO ()
main0 = runEffect $ sax stdin >-> P.print

-- stream the extracted 'IDs' from stdin to stdout
main1 :: IO ()
main1 = runEffect $ sax stdin >-> registryIds >-> stdout

-- write all IDs to a file
main2 = IO.withFile "input.xml" IO.ReadMode $ \inp ->
        IO.withFile "output.txt" IO.WriteMode $ \out ->
        runEffect $ sax (fromHandle inp) >-> registryIds >-> toHandle out

-- folds: print the number of IDs
main3 = IO.withFile "input.xml" IO.ReadMode $ \inp -> do
    n <- P.length $ sax (fromHandle inp) >-> registryIds
    print n

-- sum the meaningful part of the IDs - a dumb fold for illustration
main4 = IO.withFile "input.xml" IO.ReadMode $ \inp -> do
    let pipeline = sax (fromHandle inp) >-> registryIds >-> P.map readIntId
    n <- P.fold (+) 0 id pipeline
    print n
  where
    readIntId :: ByteString -> Integer
    readIntId = maybe 0 (fromIntegral . fst) . Char8.readInt . Char8.drop 2

-- my xml has tags with attributes that appear via hexpat thus:
--   StartElement "FacilitySite" [("registryId","110007915364")]
-- and the like. This is just an arbitrary demo stream manipulation.
registryIds :: Monad m => Pipe (SAXEvent ByteString ByteString) ByteString m ()
registryIds = do
    e <- await -- we look for a 'SAXEvent'
    case e of  -- if it matches, we yield, else we go to the next event
        StartElement "FacilitySite" [("registryId",a)] -> do
            yield a
            yield "\n"
            registryIds
        _ -> registryIds

Finally, the little 'library', PipesSax.hs: this just newtypes Pipes.ListT to get the appropriate instances. We don't export anything to do with List or ListT, but just use the standard Pipes.Producer concept.
{-#LANGUAGE TypeFamilies, GeneralizedNewtypeDeriving #-}
module PipesSax (parseProducerLocations, parseProducer) where

import Data.ByteString (ByteString)
import Text.XML.Expat.SAX
import Data.List.Class
import Control.Monad
import Control.Applicative
import Pipes
import qualified Pipes.Internal as I

parseProducer
    :: (Monad m, GenericXMLString tag, GenericXMLString text)
    => ParseOptions tag text
    -> Producer ByteString m ()
    -> Producer (SAXEvent tag text) m ()
parseProducer opt = enumerate . enumerate_ . parseG opt . Select_ . Select

parseProducerLocations
    :: (Monad m, GenericXMLString tag, GenericXMLString text)
    => ParseOptions tag text
    -> Producer ByteString m ()
    -> Producer (SAXEvent tag text, XMLParseLocation) m ()
parseProducerLocations opt =
    enumerate . enumerate_ . parseLocationsG opt . Select_ . Select

newtype ListT_ m a = Select_ { enumerate_ :: ListT m a }
    deriving (Functor, Monad, MonadPlus, MonadIO,
              Applicative, Alternative, Monoid, MonadTrans)

instance Monad m => List (ListT_ m) where
    type ItemM (ListT_ m) = m
    joinL = Select_ . Select . I.M . liftM (enumerate . enumerate_)
    runList = liftM emend . next . enumerate . enumerate_
      where emend (Right (a,q)) = Cons a (Select_ (Select q))
            emend _ = Nil
Pipes and callbacks in Haskell
I'm processing some audio using portaudio. The Haskell FFI bindings call a user-defined callback whenever there's audio data to be processed. This callback should be handled very quickly, and ideally with no I/O. I want to save the audio input and return quickly, since my application doesn't need to react to the audio in realtime (right now I'm just saving the audio data to a file; later I'll construct a simple speech recognition system).

I like the idea of pipes and thought I could use that library. The problem is that I don't know how to create a Producer that returns data that came in through a callback. How do I handle my use case?

Here's what I'm working with right now, in case that helps (the datum MVar isn't working right now, but I don't like storing all the data in a Seq... I'd rather process it as it comes instead of just at the end):

{-# LANGUAGE FlexibleInstances, MultiParamTypeClasses #-}
module Main where

import Codec.Wav
import Sound.PortAudio
import Sound.PortAudio.Base
import Sound.PortAudio.Buffer
import Foreign.Ptr
import Foreign.ForeignPtr
import Foreign.C.Types
import Foreign.Storable
import qualified Data.StorableVector as SV
import qualified Data.StorableVector.Base as SVB
import Control.Exception.Base (evaluate)
import Data.Int
import Data.Sequence as Seq
import Control.Concurrent

instance Buffer SV.Vector a where
    fromForeignPtr fp = return . SVB.fromForeignPtr fp
    toForeignPtr = return . (\(a, b, c) -> (a, c)) . SVB.toForeignPtr

-- | Wrap a buffer callback into the generic stream callback type.
buffCBtoRawCB' :: (StreamFormat input, StreamFormat output, Buffer a input, Buffer b output)
               => BuffStreamCallback input output a b -> StreamCallback input output
buffCBtoRawCB' func = \a b c d e -> do
    fpA <- newForeignPtr_ d -- we will not free, as the callback system does that for us
    fpB <- newForeignPtr_ e -- we will not free, as the callback system does that for us
    storeInp <- fromForeignPtr fpA (fromIntegral $ 1 * c)
    storeOut <- fromForeignPtr fpB (fromIntegral $ 0 * c)
    func a b c storeInp storeOut

callback :: MVar (Seq.Seq [Int32]) -> PaStreamCallbackTimeInfo -> [StreamCallbackFlag]
         -> CULong -> SV.Vector Int32 -> SV.Vector Int32 -> IO StreamResult
callback seqmvar = \timeinfo flags numsamples input output -> do
    putStrLn $ "timeinfo: " ++ show timeinfo ++ "; flags are " ++ show flags
             ++ " in callback with " ++ show numsamples ++ " samples."
    print input
    -- write data to output
    --mapM_ (uncurry $ pokeElemOff output) $ zip (map fromIntegral [0..(numsamples-1)]) datum
    --print "wrote data"
    input' <- evaluate $ SV.unpack input
    modifyMVar_ seqmvar (\s -> return $ s Seq.|> input')
    case flags of
        [] -> return $ if unPaTime (outputBufferDacTime timeinfo) > 0.2
                           then Complete
                           else Continue
        _ -> return Complete

done doneMVar = do
    putStrLn "total done dood!"
    putMVar doneMVar True
    return ()

main = do
    let samplerate = 16000
    Nothing <- initialize
    print "initialized"
    m <- newEmptyMVar
    datum <- newMVar Seq.empty
    Right s <- openDefaultStream 1 0 samplerate Nothing
                   (Just $ buffCBtoRawCB' (callback datum)) (Just $ done m)
    startStream s
    _ <- takeMVar m -- wait until our callbacks decide they are done!
    Nothing <- terminate
    print "let's see what we've recorded..."
    stuff <- takeMVar datum
    print stuff
    -- write out wav file
    -- let datum =
    --     audio = Audio { sampleRate = samplerate
    --                   , channelNumber = 1
    --                   , sampleData = datum
    --                   }
    -- exportFile "foo.wav" audio
    print "main done"
The simplest solution is to use MVars to communicate between the callback and the Producer. Here's how:

import Control.Proxy
import Control.Concurrent.MVar

fromMVar :: (Proxy p) => MVar (Maybe a) -> () -> Producer p a IO ()
fromMVar mvar () = runIdentityP loop
  where
    loop = do
        ma <- lift $ takeMVar mvar
        case ma of
            Nothing -> return ()
            Just a -> do
                respond a
                loop

Your stream callback will write Just input to the MVar, and your finalization callback will write Nothing to terminate the Producer. Here's a ghci example demonstrating how it works:

>>> mvar <- newEmptyMVar :: IO (MVar (Maybe Int))
>>> forkIO $ runProxy $ fromMVar mvar >-> printD
>>> putMVar mvar (Just 1)
1
>>> putMVar mvar (Just 2)
2
>>> putMVar mvar Nothing
>>> putMVar mvar (Just 3)
>>>

Edit: the pipes-concurrency library now provides this feature, and it even has a section in the tutorial explaining specifically how to use it to get data out of callbacks.
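For completeness, a minimal sketch of the pipes-concurrency route mentioned in the edit (the callback is simulated here, and the unbounded buffer is an assumption; spawn, send and fromInput are the library's mailbox API):

import Control.Concurrent (forkIO, threadDelay)
import Control.Monad (forM_, void)
import Pipes
import Pipes.Concurrent
import qualified Pipes.Prelude as P

main :: IO ()
main = do
    (output, input) <- spawn unbounded
    -- stand-in for the portaudio callback: each invocation just
    -- sends its block of samples into the mailbox and returns
    let callback x = void $ atomically $ send output x
    _ <- forkIO $ forM_ [1 :: Int .. 5] $ \i ->
        callback i >> threadDelay 100000
    -- the Producer end: reads values as the callback delivers them
    runEffect $ fromInput input >-> P.take 5 >-> P.print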