Limiting pipes based on time? - haskell

Is it possible to create pipes that get all values that have been sent downstream in a certain time period? I'm implementing a server where the protocol allows me to concatenate outgoing packets and compress them together, so I'd like to effectively "empty out" the queue of downstream ByteStrings every 100ms and mappend them together to then yield on to the next pipe which does the compression.

Here's a solution using pipes-concurrency. You give it any Input and it will periodically drain the input of all values:
import Control.Applicative ((<|>))
import Control.Concurrent (threadDelay)
import Data.Foldable (forM_)
import Pipes
import Pipes.Concurrent
drainAll :: Input a -> STM (Maybe [a])
drainAll i = do
ma <- recv i
case ma of
Nothing -> return Nothing
Just a -> loop (a:)
where
loop diffAs = do
ma <- recv i <|> return Nothing
case ma of
Nothing -> return (Just (diffAs []))
Just a -> loop (diffAs . (a:))
bucketsEvery :: Int -> Input a -> Producer [a] IO ()
bucketsEvery microseconds i = loop
where
loop = do
lift $ threadDelay microseconds
ma <- lift $ atomically $ drainAll i
forM_ ma $ \a -> do
yield a
loop
This gives you much greater control over how you consume elements from upstream, by selecting the type of Buffer you use to build the Input.
If you're new to pipes-concurrency, you can read the tutorial which explains how to use spawn, Buffer and Input.

Here is a possible solution. It is based on a Pipe that tags ByteStrings going downstream with a Bool, in order to identify ByteStrings belonging to the same "time bucket".
First, some imports:
import Data.AdditiveGroup
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Lazy.Builder as BB
import Data.Thyme.Clock
import Data.Thyme.Clock.POSIX
import Control.Monad.State.Strict
import Control.Lens (view)
import Control.Concurrent (threadDelay)
import Pipes
import Pipes.Lift
import qualified Pipes.Prelude as P
import qualified Pipes.Group as PG
Here is the tagging Pipe. It uses StateT internally:
tagger :: Pipe B.ByteString (B.ByteString,Bool) IO ()
tagger = do
startTime <- liftIO getPOSIXTime
evalStateP (startTime,False) $ forever $ do
b <- await
currentTime <- liftIO getPOSIXTime
-- (POSIXTime,Bool) inner state
(baseTime,tag) <- get
if (currentTime ^-^ baseTime > timeLimit)
then let tag' = not tag in
yield (b,tag') >> put (currentTime, tag')
else yield $ (b,tag)
where
timeLimit = fromSeconds 0.1
Then we can use functions from the pipes-group package to group ByteStrings belonging to the same "time bucket" into lazy ByteStrings:
batch :: Producer B.ByteString IO () -> Producer BL.ByteString IO ()
batch producer = PG.folds (<>) mempty BB.toLazyByteString
. PG.maps (flip for $ yield . BB.byteString . fst)
. view (PG.groupsBy $ \t1 t2-> snd t1 == snd t2)
$ producer >-> tagger
It seems to batch correctly. This program:
main :: IO ()
main = do
count <- P.length $ batch (yield "boo" >> yield "baa")
putStrLn $ show count
count <- P.length $ batch (yield "boo" >> yield "baa"
>> liftIO (threadDelay 200000) >> yield "ddd")
putStrLn $ show count
Has the output:
1
2
Notice that the contents of a "time bucket" are only yielded when the first element of the next bucket arrives. They are not yielded automatically each 100ms. This may or may not be a problem for you. It you want to yield automatically each 100ms, you would need a different solution, possibly based on pipes-concurrency.
Also, you could consider working directly with the FreeT-based "effectul lists" provided by pipes-group. That way you could start compressing the data in a "time bucket" before the bucket is full.

So unlike Daniel's answer my does not tag the data as it is produced. It just takes at least element from upstream and then continues to aggregate more values in the monoid until the time interval has passed.
This codes uses a list to aggregate, but there are better monoids to aggregate with
import Pipes
import qualified Pipes.Prelude as P
import Data.Time.Clock
import Data.Time.Calendar
import Data.Time.Format
import Data.Monoid
import Control.Monad
-- taken from pipes-rt
doubleToNomDiffTime :: Double -> NominalDiffTime
doubleToNomDiffTime x =
let d0 = ModifiedJulianDay 0
t0 = UTCTime d0 (picosecondsToDiffTime 0)
t1 = UTCTime d0 (picosecondsToDiffTime $ floor (x/1e-12))
in diffUTCTime t1 t0
-- Adapted from from pipes-parse-1.0
wrap
:: Monad m =>
Producer a m r -> Producer (Maybe a) m r
wrap p = do
p >-> P.map Just
forever $ yield Nothing
yieldAggregateOverTime
:: (Monoid y, -- monoid dependance so we can do aggregation
MonadIO m -- to beable to get the current time the
-- base monad must have access to IO
) =>
(t -> y) -- Change element from upstream to monoid
-> Double -- Time in seconds to aggregate over
-> Pipe (Maybe t) y m ()
yieldAggregateOverTime wrap period = do
t0 <- liftIO getCurrentTime
loop mempty (dtUTC `addUTCTime` t0)
where
dtUTC = doubleToNomDiffTime period
loop m ts = do
t <- liftIO getCurrentTime
v0 <- await -- await at least one element
case v0 of
Nothing -> yield m
Just v -> do
if t > ts
then do
yield (m <> wrap v)
loop mempty (dtUTC `addUTCTime` ts)
else do
loop (m <> wrap v) ts
main = do
runEffect $ wrap (each [1..]) >-> yieldAggregateOverTime (\x -> [x]) (0.0001)
>-> P.take 10 >-> P.print
Depending on cpu load you the output data will be aggregated differently. With at least on element in each chunk.
$ ghc Main.hs -O2
$ ./Main
[1,2]
[3]
[4]
[5]
[6]
[7]
[8]
[9]
[10]
[11]
$ ./Main
[1,2]
[3]
[4]
[5]
[6,7,8,9,10]
[11,12,13,14,15,16,17,18]
[19,20,21,22,23,24,25,26]
[27,28,29,30,31,32,33,34]
[35,36,37,38,39,40,41,42]
[43,44,45,46,47,48,49,50]
$ ./Main
[1,2,3,4,5,6]
[7]
[8]
[9,10,11,12,13,14,15,16,17,18,19,20]
[21,22,23,24,25,26,27,28,29,30,31,32,33]
[34,35,36,37,38,39,40,41,42,43,44]
[45,46,47,48,49,50,51,52,53,54,55]
[56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72]
[73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88]
[89,90,91,92,93,94,95,96,97,98,99,100,101,102,103]
$ ./Main
[1,2,3,4,5,6,7]
[8]
[9]
[10,11,12,13,14,15,16,17,18]
[19,20,21,22,23,24,25,26,27]
[28,29,30,31,32,33,34,35,36,37]
[38,39,40,41,42,43,44,45,46]
[47,48,49,50]
[51,52,53,54,55,56,57]
[58,59,60,61,62,63,64,65,66]
You might want to look at the source code of
pipes-rt it shows one approach to deal with time in pipes.
edit: Thanks to Daniel Díaz Carrete, adapted pipes-parse-1.0 technique to handle upstream termination. A pipes-group solution should be possible using the same technique as well.

Related

Using IO inside the ST monad and runSTUArray

I'm just starting to experiment with monad transformers so I may have missed something trivial. Anyway: how can I print from inside the ST monad?
Example code: I want to create code that is almost as fast as C, by reading and writing as quickly as possible to memory, with C-like for loops. I have an ST monad action that safely mutates an unboxed array that I run with runSTUArray.
My question is, how can I use IO actions inside the ST action? If it was a State monad action, I could use the StateT transformer to lift the IO actions to, but how is that done for the ST monad?
Example: ST. -- how can I print form inside the ST monad?
import Control.Monad.ST (ST)
import Data.Array.Base ( STUArray(STUArray), newArray, readArray, writeArray )
import Data.Array.ST (runSTUArray)
import Data.Array.Unboxed ( elems )
import Control.Monad (forM_)
test :: IO ()
test = print . elems $ runSTUArray action
where
action :: ST s (STUArray s Int Int)
action = do
arr <- newArray (1,10) 1
forM_ [3..10] $ \i -> do
-- liftIO . print $ "i is " ++ show i. --- <--- How should I do this?
x1 <- readArray arr (i-1)
x2 <- readArray arr (i-2)
writeArray arr i (x1+x2)
return arr
Example: StateT, -- here it is possible to lift the print to use it inside the monad/
import Data.Array.IArray (listArray, (!), (//), elems)
import Data.Array (Array)
import Control.Monad.Trans.State.Strict (StateT (runStateT), get, put)
import Control.Monad.IO.Class (MonadIO(liftIO))
import Control.Monad (forM_)
test :: IO ()
test = do
let n = listArray (1,10) [1..] :: Array Int Int
(_,x) <- runStateT action n
print $ elems x
return ()
where action = do
forM_ [3..10] $ \i -> do
x <- get
liftIO . print $ "i is " ++ show i. -- <--- here printing works fine
put (x // [(i, x!(i-1) + x!(i-2))])
You don't print from an ST action, you print from an IO action. Luckily for you, there are IOUArrays -- and they are even STUArrays under the hood, so there can be no fear of performance lost.

Efficient streaming and manipulation of a byte stream in Haskell

While writing a deserialiser for a large (<bloblength><blob>)* encoded binary file I got stuck with the various Haskell produce-transform-consume libraries. So far I'm aware of four streaming libraries:
Data.Conduit: Widely used, has very careful resource management
Pipes: Similar to conduit (Haskell Cast #6 nicely reveals the differences between conduit and pipes)
Data.Binary.Get: Offers useful functions such as getWord32be, but the streaming example is awkward
System.IO.Streams: Seems to be the easiest one to use
Here's a stripped down example of where things go wrong when I try to do Word32 streaming with conduit. A slightly more realistic example would first read a Word32 that determines the blob length and then yield a lazy ByteString of that length (which is then deserialised further).
But here I just try to extract Word32's in streaming fashion from a binary file:
module Main where
-- build-depends: bytestring, conduit, conduit-extra, resourcet, binary
import Control.Monad.Trans.Resource (MonadResource, runResourceT)
import qualified Data.Binary.Get as G
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as C
import qualified Data.ByteString.Lazy as BL
import Data.Conduit
import qualified Data.Conduit.Binary as CB
import qualified Data.Conduit.List as CL
import Data.Word (Word32)
import System.Environment (getArgs)
-- gets a Word32 from a ByteString.
getWord32 :: C.ByteString -> Word32
getWord32 bs = do
G.runGet G.getWord32be $ BL.fromStrict bs
-- should read BytesString and return Word32
transform :: (Monad m, MonadResource m) => Conduit BS.ByteString m Word32
transform = do
mbs <- await
case mbs of
Just bs -> do
case C.null bs of
False -> do
yield $ getWord32 bs
leftover $ BS.drop 4 bs
transform
True -> return ()
Nothing -> return ()
main :: IO ()
main = do
filename <- fmap (!!0) getArgs -- should check length getArgs
result <- runResourceT $ (CB.sourceFile filename) $$ transform =$ CL.consume
print $ length result -- is always 8188 for files larger than 32752 bytes
The output of the program is just the number of Word32's that were read. It turns out the stream terminates after reading the first chunk (about 32KiB). For some reason mbs is never Nothing, so I must check null bs which stops the stream when the chunk is consumed. Clearly, my conduit transform is faulty. I see two routes to a solution:
The await doesn't want to go to the second chunk of the ByteStream, so is there another function that pulls the next chunk? In examples I've seen (e.g. Conduit 101) this is not how it's done
This is just the wrong way to set up transform.
How is this done properly? Is this the right way to go? (Performance does matter.)
Update: Here's a BAD way to do it using Systems.IO.Streams:
module Main where
import Data.Word (Word32)
import System.Environment (getArgs)
import System.IO (IOMode (ReadMode), openFile)
import qualified System.IO.Streams as S
import System.IO.Streams.Binary (binaryInputStream)
import System.IO.Streams.List (outputToList)
main :: IO ()
main = do
filename : _ <- getArgs
h <- openFile filename ReadMode
s <- S.handleToInputStream h
i <- binaryInputStream s :: IO (S.InputStream Word32)
r <- outputToList $ S.connect i
print $ last r
'Bad' means: Very demanding in time and space, does not handle Decode exception.
Your immediate problem is caused by how you are using leftover. That function is used to "Provide a single piece of leftover input to be consumed by the next component in the current monadic binding", and so when you give it bs before looping with transform you are effectively throwing away the rest of the bytestring (i.e. what is after bs).
A correct solution based on your code would use the incremental input interface of Data.Binary.Get to replace your yield/leftover combination with something that consumes each chunk fully. A more pragmatic approach, though, is using the binary-conduit package, which provides that in the shape of conduitGet (its source gives a good idea of what a "manual" implementation would look like):
import Data.Conduit.Serialization.Binary
-- etc.
transform :: (Monad m, MonadResource m) => Conduit BS.ByteString m Word32
transform = conduitGet G.getWord32be
One caveat is that this will throw a parse error if the total number of bytes is not a multiple of 4 (i.e. the last Word32 is incomplete). In the unlikely case of that not being what you want, a lazy way out would be simply using \bs -> C.take (4 * truncate (C.length bs / 4)) bs on the input bytestring.
With pipes (and pipes-group and pipes-bytestring) the demo problem reduces to combinators. First we resolve the incoming undifferentiated byte stream into little 4 byte chunks:
chunksOfStrict :: (Monad m) => Int -> Producer ByteString m r -> Producer ByteString m r
chunksOfStrict n = folds mappend mempty id . view (Bytes.chunksOf n)
then we map these to Word32s and (here) count them.
main :: IO ()
main = do
filename:_ <- getArgs
IO.withFile filename IO.ReadMode $ \h -> do
n <- P.length $ chunksOfStrict 4 (Bytes.fromHandle h) >-> P.map getWord32
print n
This will fail if we have less than 4 bytes or otherwise fail to parse but we can as well map with
getMaybeWord32 :: ByteString -> Maybe Word32
getMaybeWord32 bs = case G.runGetOrFail G.getWord32be $ BL.fromStrict bs of
Left r -> Nothing
Right (_, off, w32) -> Just w32
The following program will then print the parses for the valid 4 byte sequences
main :: IO ()
main = do
filename:_ <- getArgs
IO.withFile filename IO.ReadMode $ \h -> do
runEffect $ chunksOfStrict 4 (Bytes.fromHandle h)
>-> P.map getMaybeWord32
>-> P.concat -- here `concat` eliminates maybes
>-> P.print
There are other ways of dealing with failed parses, of course.
Here, though, is something closer to the program you asked for. It takes a four byte segment from a byte stream (Producer ByteString m r) and reads it as a Word32 if it is long enough; it then takes that many of the incoming bytes and accumulates them into a lazy bytestring, yielding it. It just repeats this until it runs out of bytes. In main below, I print each yielded lazy bytestring that is produced:
module Main (main) where
import Pipes
import qualified Pipes.Prelude as P
import Pipes.Group (folds)
import qualified Pipes.ByteString as Bytes ( splitAt, fromHandle, chunksOf )
import Control.Lens ( view ) -- or Lens.Simple (view) -- or Lens.Micro ((.^))
import qualified System.IO as IO ( IOMode(ReadMode), withFile )
import qualified Data.Binary.Get as G ( runGet, getWord32be )
import Data.ByteString ( ByteString )
import qualified Data.ByteString.Lazy.Char8 as BL
import System.Environment ( getArgs )
splitLazy :: (Monad m, Integral n) =>
n -> Producer ByteString m r -> m (BL.ByteString, Producer ByteString m r)
splitLazy n bs = do
(bss, rest) <- P.toListM' $ view (Bytes.splitAt n) bs
return (BL.fromChunks bss, rest)
measureChunks :: Monad m => Producer ByteString m r -> Producer BL.ByteString m r
measureChunks bs = do
(lbs, rest) <- lift $ splitLazy 4 bs
if BL.length lbs /= 4
then rest >-> P.drain -- in fact it will be empty
else do
let w32 = G.runGet G.getWord32be lbs
(lbs', rest') <- lift $ splitLazy w32 bs
yield lbs
measureChunks rest
main :: IO ()
main = do
filename:_ <- getArgs
IO.withFile filename IO.ReadMode $ \h -> do
runEffect $ measureChunks (Bytes.fromHandle h) >-> P.print
This is again crude in that it uses runGet not runGetOrFail, but this is easily repaired. The pipes standard procedure would be to stop the stream transformation on a failed parse and return the unparsed bytestream.
If you were anticipating that the Word32s were for large numbers, so that you did not want to accumulate the corresponding stream of bytes as a lazy bytestring, but say write them to different files without accumulating, we could change the program pretty easily to do that. This would require a sophisticated use of conduit but is the preferred approach with pipes and streaming.
Here's a relatively straightforward solution that I want to throw into the ring. It's a repeated use of splitAt wrapped into a State monad that gives an interface identical to (a subset of) Data.Binary.Get. The resulting [ByteString] is obtained in main with a whileJust over getBlob.
module Main (main) where
import Control.Monad.Loops
import Control.Monad.State
import qualified Data.Binary.Get as G (getWord32be, runGet)
import qualified Data.ByteString.Lazy as BL
import Data.Int (Int64)
import Data.Word (Word32)
import System.Environment (getArgs)
-- this is going to mimic the Data.Binary.Get.Get Monad
type Get = State BL.ByteString
getWord32be :: Get (Maybe Word32)
getWord32be = state $ \bs -> do
let (w, rest) = BL.splitAt 4 bs
case BL.length w of
4 -> (Just w', rest) where
w' = G.runGet G.getWord32be w
_ -> (Nothing, BL.empty)
getLazyByteString :: Int64 -> Get BL.ByteString
getLazyByteString n = state $ \bs -> BL.splitAt n bs
getBlob :: Get (Maybe BL.ByteString)
getBlob = do
ml <- getWord32be
case ml of
Nothing -> return Nothing
Just l -> do
blob <- getLazyByteString (fromIntegral l :: Int64)
return $ Just blob
runGet :: Get a -> BL.ByteString -> a
runGet g bs = fst $ runState g bs
main :: IO ()
main = do
fname <- head <$> getArgs
bs <- BL.readFile fname
let ls = runGet loop bs where
loop = whileJust getBlob return
print $ length ls
There's no error handling in getBlob, but it's easy to extend. Time and space complexity is quite good, as long as the resulting list is used carefully. (The python script that creates some random data for consumption by the above is here).

Collecting the Async results as they become available

How would you collect the results of a list of Async a in Haskell as they become available? The idea is to start processing the results of asynchronous tasks as soon as they are available.
The best I could come up with is the following function:
collect :: [Async a] -> IO [a]
collect [] = return []
collect asyncs = do
(a, r) <- waitAny asyncs
rs <- collect (filter (/= a) asyncs)
return (r:rs)
However, this function does not exhibits the desired behavior since, as pointed out in the comment below, it doesn't return till all the asynchronous tasks are completed. Furthermore, collect runs in O(n^2) since I'm filtering the list at each recursive step. This could be improved by using a more efficient structure (and maybe indexing the position of the Async values in the list).
Maybe there are library functions that take care of this, but I could not find them in the Control.Concurrent.Async module and I wonder why.
EDIT: after thinking the problem a bit more carefully, I'm wondering whether such function is a good idea. I could just use fmap on the asynchronous tasks. Maybe it is a better practice to wait for the results when there is no other choice.
As I mentioned in my other answer, streaming results out of a list of Asyncs as they become available is best achieved using a stream processing library. Here's an example using pipes.
import Control.Concurrent (threadDelay)
import Control.Concurrent.Async
import Control.Concurrent.STM
import Data.Functor (($>))
import Pipes
import Pipes.Concurrent -- from the pipes-concurrency package
import qualified Pipes.Prelude as P
asCompleted :: MonadIO m => [Async a] -> Producer a m ()
asCompleted asyncs = do
(o, i, seal) <- liftIO $ spawn' unbounded
liftIO $ forkIO $ do
forConcurrently asyncs (\async -> atomically $ waitSTM async >>= send o)
atomically seal
fromInput i
main = do
actions <- traverse async [threadDelay 2000000 $> "bar", threadDelay 1000000 $> "foo"]
runEffect $ asCompleted actions >-> P.print
-- after one second, prints "foo", then "bar" a second later
Using pipes-concurrency, we spawn' an Output-Input pair and immediately convert the Input to a Producer using fromInput. Asynchronously, we send items as they become available. When all the Asyncs have completed we seal the inbox to close down the Producer.
Implemented via TChan, additionally implemented a version which can react immediately, but it is more complex and also might have problems with exceptions (if you want to receive exceptions, use SlaveThread.fork instead of forkIO), so I commented that code in case you're not interested in it:
import Control.Concurrent (threadDelay)
import Control.Concurrent (forkIO)
import Control.Concurrent.Async
import Control.Concurrent.STM
import Control.Monad
collect :: [Async a] -> IO [a]
collect = atomically . collectSTM
collectSTM :: [Async a] -> STM [a]
collectSTM as = do
c <- newTChan
collectSTMChan c as
collectSTMChan :: TChan a -> [Async a] -> STM [a]
collectSTMChan chan as = do
mapM_ (waitSTM >=> writeTChan chan) as
replicateM (length as) (readTChan chan)
main :: IO ()
main = do
a1 <- async (threadDelay 2000000 >> putStrLn "slept 2 secs" >> return 2)
a2 <- async (threadDelay 3000000 >> putStrLn "slept 3 secs" >> return 3)
a3 <- async (threadDelay 1000000 >> putStrLn "slept 1 sec" >> return 1)
res <- collect [a1,a2,a3]
putStrLn (show res)
-- -- reacting immediately
-- a1 <- async (threadDelay 2000000 >> putStrLn "slept 2 secs" >> return 2)
-- a2 <- async (threadDelay 3000000 >> putStrLn "slept 3 secs" >> return 3)
-- a3 <- async (threadDelay 1000000 >> putStrLn "slept 1 sec" >> return 1)
-- c <- collectChan [a1,a2,a3]
-- replicateM_ 3 (atomically (readTChan c) >>= \v -> putStrLn ("Received: " ++ show v))
-- collectChan :: [Async a] -> IO (TChan a)
-- collectChan as = do
-- c <- newTChanIO
-- forM_ as $ \a -> forkIO ((atomically . (waitSTM >=> writeTChan c)) a)
-- return c
I'm reading your question as "is it possible to sort a list of Asyncs by their completion time?". If that's what you meant, the answer is yes.
import Control.Applicative (liftA2)
import Control.Concurrent (threadDelay)
import Control.Concurrent.Async
import Data.Functor (($>))
import Data.List (sortBy)
import Data.Ord (comparing)
import Data.Time (getCurrentTime)
sortByCompletion :: [Async a] -> IO [a]
sortByCompletion = fmap (fmap fst . sortBy (comparing snd)) . mapConcurrently withCompletionTime
where withCompletionTime async = liftA2 (,) (wait async) getCurrentTime
main = do
asyncs <- traverse async [threadDelay 2000000 $> "bar", threadDelay 1000000 $> "foo"]
sortByCompletion asyncs
-- ["foo", "bar"], after two seconds
Using mapConcurrently we wait for each Async on a separate thread. Upon completion we get the current time - the time at which the Async completed - and use it to sort the results. This is O(n log n) complexity because we are sorting the list. (Your original algorithm was effectively a selection sort.)
Like your collect, sortByCompletion doesn't return until all the Asyncs in the list have completed. If you wanted to stream results onto the main thread as they become available, well, lists aren't a very good tool for that. I'd use a streaming abstraction like conduit or pipes, or, working at a lower level, a TQueue. See my other answer for an example.

How to parse a large XML file in Haskell with limited amount of resources?

I want to extract information from a large XML file (around 20G) in Haskell. Since it is a large file, I used SAX parsing functions from Hexpath.
Here is a simple code I tested:
import qualified Data.ByteString.Lazy as L
import Text.XML.Expat.SAX as Sax
parse :: FilePath -> IO ()
parse path = do
inputText <- L.readFile path
let saxEvents = Sax.parse defaultParseOptions inputText :: [SAXEvent Text Text]
let txt = foldl' processEvent "" saxEvents
putStrLn txt
After activating profiling in Cabal, it says that parse.saxEvents took 85% of allocated memory. I also used foldr and the result is the same.
If processEvent becomes complex enough, the program crashes with a stack space overflow error.
What am I doing wrong?
You don't say what processEvent is like. In principle, it ought to be unproblematic to use lazy ByteString for a strict left fold over lazily generated input, so I'm not sure what is going wrong in your case. But one ought to use streaming-appropriate types when dealing with gigantic files!
In fact, hexpat does have 'streaming' interface (just like xml-conduit). It uses the not-too-well known List library and the rather ugly List class it defines. In principle the ListT type from the List package should work well. I gave up quickly because of a lack of combinators, and wrote an appropriate instance of the ugly List class for a wrapped version of Pipes.ListT which I then used to export ordinary Pipes.Producer functions like parseProduce. The trivial manipulations needed for this are appended below as PipesSax.hs
Once we have parseProducer we can convert a ByteString or Text Producer into a Producer of SaxEvents with Text or ByteString components. Here are some simple operations. I was using a 238M "input.xml"; the programs never need more than 6 mb of memory, to judge from looking at top.
-- Sax.hs Most of the IO actions use a registerIds pipe defined at the bottom which is tailored to a giant bit of xml of which this is a valid 1000 fragment http://sprunge.us/WaQK
{-#LANGUAGE OverloadedStrings #-}
import PipesSax ( parseProducer )
import Data.ByteString ( ByteString )
import Text.XML.Expat.SAX
import Pipes -- cabal install pipes pipes-bytestring
import Pipes.ByteString (toHandle, fromHandle, stdin, stdout )
import qualified Pipes.Prelude as P
import qualified System.IO as IO
import qualified Data.ByteString.Char8 as Char8
sax :: MonadIO m => Producer ByteString m ()
-> Producer (SAXEvent ByteString ByteString) m ()
sax = parseProducer defaultParseOptions
-- stream xml from stdin, yielding hexpat tagstream to stdout;
main0 :: IO ()
main0 = runEffect $ sax stdin >-> P.print
-- stream the extracted 'IDs' from stdin to stdout
main1 :: IO ()
main1 = runEffect $ sax stdin >-> registryIds >-> stdout
-- write all IDs to a file
main2 =
IO.withFile "input.xml" IO.ReadMode $ \inp ->
IO.withFile "output.txt" IO.WriteMode $ \out ->
runEffect $ sax (fromHandle inp) >-> registryIds >-> toHandle out
-- folds:
-- print number of IDs
main3 = IO.withFile "input.xml" IO.ReadMode $ \inp ->
do n <- P.length $ sax (fromHandle inp) >-> registryIds
print n
-- sum the meaningful part of the IDs - a dumb fold for illustration
main4 = IO.withFile "input.xml" IO.ReadMode $ \inp ->
do let pipeline = sax (fromHandle inp) >-> registryIds >-> P.map readIntId
n <- P.fold (+) 0 id pipeline
print n
where
readIntId :: ByteString -> Integer
readIntId = maybe 0 (fromIntegral.fst) . Char8.readInt . Char8.drop 2
-- my xml has tags with attributes that appear via hexpat thus:
-- StartElement "FacilitySite" [("registryId","110007915364")]
-- and the like. This is just an arbitrary demo stream manipulation.
registryIds :: Monad m => Pipe (SAXEvent ByteString ByteString) ByteString m ()
registryIds = do
e <- await -- we look for a 'SAXEvent'
case e of -- if it matches, we yield, else we go to the next event
StartElement "FacilitySite" [("registryId",a)] -> do yield a
yield "\n"
registryIds
_ -> registryIds
-- 'library': PipesSax.hs
This just newtypes Pipes.ListT to get the appropriate instances. We don't export anything to do with List or ListT but just use the standard Pipes.Producer concept.
{-#LANGUAGE TypeFamilies, GeneralizedNewtypeDeriving #-}
module PipesSax (parseProducerLocations, parseProducer) where
import Data.ByteString (ByteString)
import Text.XML.Expat.SAX
import Data.List.Class
import Control.Monad
import Control.Applicative
import Pipes
import qualified Pipes.Internal as I
parseProducer
:: (Monad m, GenericXMLString tag, GenericXMLString text)
=> ParseOptions tag text
-> Producer ByteString m ()
-> Producer (SAXEvent tag text) m ()
parseProducer opt = enumerate . enumerate_
. parseG opt
. Select_ . Select
parseProducerLocations
:: (Monad m, GenericXMLString tag, GenericXMLString text)
=> ParseOptions tag text
-> Producer ByteString m ()
-> Producer (SAXEvent tag text, XMLParseLocation) m ()
parseProducerLocations opt =
enumerate . enumerate_ . parseLocationsG opt . Select_ . Select
newtype ListT_ m a = Select_ { enumerate_ :: ListT m a }
deriving (Functor, Monad, MonadPlus, MonadIO
, Applicative, Alternative, Monoid, MonadTrans)
instance Monad m => List (ListT_ m) where
type ItemM (ListT_ m) = m
joinL = Select_ . Select . I.M . liftM (enumerate . enumerate_)
runList = liftM emend . next . enumerate . enumerate_
where
emend (Right (a,q)) = Cons a (Select_ (Select q))
emend _ = Nil

Pipes and callbacks in Haskell

I'm processing some audio using portaudio. The haskell FFI bindings call a user defined callback whenever there's audio data to be processed. This callback should be handled very quickly and ideally with no I/O. I wanted to save the audio input and return quickly since my application doesn't need to react to the audio in realtime (right now I'm just saving the audio data to a file; later I'll construct a simple speech recognition system).
I like the idea of pipes and thought I could use that library. The problem is that I don't know how to create a Producer that returns data that came in through a callback.
How do I handle my use case?
Here's what I'm working with right now, in case that helps (the datum mvar isn't working right now but I don't like storing all the data in a seq... I'd rather process it as it came instead of just at the end):
{-# LANGUAGE FlexibleInstances, MultiParamTypeClasses #-}
module Main where
import Codec.Wav
import Sound.PortAudio
import Sound.PortAudio.Base
import Sound.PortAudio.Buffer
import Foreign.Ptr
import Foreign.ForeignPtr
import Foreign.C.Types
import Foreign.Storable
import qualified Data.StorableVector as SV
import qualified Data.StorableVector.Base as SVB
import Control.Exception.Base (evaluate)
import Data.Int
import Data.Sequence as Seq
import Control.Concurrent
instance Buffer SV.Vector a where
fromForeignPtr fp = return . SVB.fromForeignPtr fp
toForeignPtr = return . (\(a, b, c) -> (a, c)) . SVB.toForeignPtr
-- | Wrap a buffer callback into the generic stream callback type.
buffCBtoRawCB' :: (StreamFormat input, StreamFormat output, Buffer a input, Buffer b output) =>
BuffStreamCallback input output a b -> StreamCallback input output
buffCBtoRawCB' func = \a b c d e -> do
fpA <- newForeignPtr_ d -- We will not free, as callback system will do that for us
fpB <- newForeignPtr_ e -- We will not free, as callback system will do that for us
storeInp <- fromForeignPtr fpA (fromIntegral $ 1 * c)
storeOut <- fromForeignPtr fpB (fromIntegral $ 0 * c)
func a b c storeInp storeOut
callback :: MVar (Seq.Seq [Int32]) -> PaStreamCallbackTimeInfo -> [StreamCallbackFlag] -> CULong
-> SV.Vector Int32 -> SV.Vector Int32 -> IO StreamResult
callback seqmvar = \timeinfo flags numsamples input output -> do
putStrLn $ "timeinfo: " ++ show timeinfo ++ "; flags are " ++ show flags ++ " in callback with " ++ show numsamples ++ " samples."
print input
-- write data to output
--mapM_ (uncurry $ pokeElemOff output) $ zip (map fromIntegral [0..(numsamples-1)]) datum
--print "wrote data"
input' <- evaluate $ SV.unpack input
modifyMVar_ seqmvar (\s -> return $ s Seq.|> input')
case flags of
[] -> return $ if unPaTime (outputBufferDacTime timeinfo) > 0.2 then Complete else Continue
_ -> return Complete
done doneMVar = do
putStrLn "total done dood!"
putMVar doneMVar True
return ()
main = do
let samplerate = 16000
Nothing <- initialize
print "initialized"
m <- newEmptyMVar
datum <- newMVar Seq.empty
Right s <- openDefaultStream 1 0 samplerate Nothing (Just $ buffCBtoRawCB' (callback datum)) (Just $ done m)
startStream s
_ <- takeMVar m -- wait until our callbacks decide they are done!
Nothing <- terminate
print "let's see what we've recorded..."
stuff <- takeMVar datum
print stuff
-- write out wav file
-- let datum =
-- audio = Audio { sampleRate = samplerate
-- , channelNumber = 1
-- , sampleData = datum
-- }
-- exportFile "foo.wav" audio
print "main done"
The simplest solution is to use MVars to communicate between the callback and Producer. Here's how:
import Control.Proxy
import Control.Concurrent.MVar
fromMVar :: (Proxy p) => MVar (Maybe a) -> () -> Producer p a IO ()
fromMVar mvar () = runIdentityP loop where
loop = do
ma <- lift $ takeMVar mvar
case ma of
Nothing -> return ()
Just a -> do
respond a
loop
Your stream callback will write Just input to the MVar and your finalization callback will write Nothing to terminate the Producer.
Here's a ghci example demonstrating how it works:
>>> mvar <- newEmptyMVar :: IO (MVar (Maybe Int))
>>> forkIO $ runProxy $ fromMVar mvar >-> printD
>>> putMVar mvar (Just 1)
1
>>> putMVar mvar (Just 2)
2
>>> putMVar mvar Nothing
>>> putMVar mvar (Just 3)
>>>
Edit: The pipes-concurrency library now provides this feature, and it even has a section in the tutorial explaining specifically how to use it to get data out of callbacks.

Resources