I have a computation that, along with other things, generates some data (a lot of it) that I want to write to a file.
The way the code is structured now is (simplified):
writeRecord :: Handle -> Record -> IO ()
writeRecord h r = hPutStrLn h (toByteString r)
This function is then called periodically during a bigger computation. It is almost like a log, and in fact, multiple files are being written simultaneously.
Now I want the output file to be compressed with Gzip.
In languages like Java I would do something like:
outStream = new GzipOutputStream(new FileOutputStream(path))
and then would just write into that wrapped output stream.
What is the way of doing it in Haskell?
I think writing something like
writeRecord h r = hPut h ((compressed . toByteString) r)
is not correct, because compressing each small bit individually isn't efficient (I even tried it, and the compressed file came out bigger than the uncompressed one this way).
I also don't think I can just produce a lazy ByteString (or even a list of chunks) and write it with compressed . fromChunks, because this would require my "generator" to build the full thing in memory. And the fact that more than one file is produced at the same time makes it even more complicated.
So what would be a way to solve this in Haskell? Writing to file(s) and having them gzipped?
All the streaming libraries support compression. If I understand the particular problem and the way you are thinking about it, io-streams might be the simplest for your purposes. Here I alternate between writing to trump and clinton output streams, which are written as compressed files. Afterwards I show the pipes equivalent of Michael S's conduit program.
#!/usr/bin/env stack
-- stack --resolver lts-6.21 --install-ghc runghc --package io-streams
{-# LANGUAGE OverloadedStrings #-}
import qualified System.IO.Streams as IOS
import qualified System.IO as IO
import Data.ByteString (ByteString)
analyzer :: IOS.OutputStream ByteString -> IOS.OutputStream ByteString -> IO ()
analyzer clinton trump = do
  IOS.write (Just "This is a string\n") clinton
  IOS.write (Just "This is a string\n") trump
  IOS.write (Just "Clinton string\n") clinton
  IOS.write (Just "Trump string\n") trump
  IOS.write (Just "Another Clinton string\n") clinton
  IOS.write (Just "Another Trump string\n") trump
  IOS.write Nothing clinton
  IOS.write Nothing trump
main :: IO ()
main =
  IOS.withFileAsOutput "some-file-clinton.txt.gz" $ \clinton_compressed ->
  IOS.withFileAsOutput "some-file-trump.txt.gz" $ \trump_compressed -> do
    clinton <- IOS.gzip IOS.defaultCompressionLevel clinton_compressed
    trump <- IOS.gzip IOS.defaultCompressionLevel trump_compressed
    analyzer clinton trump
Obviously you can mix all kinds of IO in analyzer between acts of writing to the two output streams - I'm just showing the writes, so to speak. In particular, if analyzer is understood as depending on an input stream, the writes can depend on reads from the input stream; a sketch of that follows the transcript below. Here's a (slightly!) more complicated program that does that. If I run the program above I see
$ stack gzip_so.hs
$ gunzip some-file-clinton.txt.gz
$ gunzip some-file-trump.txt.gz
$ cat some-file-clinton.txt
This is a string
Clinton string
Another Clinton string
$ cat some-file-trump.txt
This is a string
Trump string
Another Trump string
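To make the input-dependence concrete, here is a hypothetical variant of analyzer (my sketch, not the program linked above) that reads from an InputStream and routes each chunk by content; it additionally needs isInfixOf from Data.ByteString:
analyzer' :: IOS.InputStream ByteString
          -> IOS.OutputStream ByteString
          -> IOS.OutputStream ByteString
          -> IO ()
analyzer' input clinton trump = loop
  where
    loop = do
      mbs <- IOS.read input
      case mbs of
        Nothing -> do
          -- end both gzip streams when the input is exhausted
          IOS.write Nothing clinton
          IOS.write Nothing trump
        Just bs -> do
          -- route each chunk by its content
          if "Clinton" `isInfixOf` bs
            then IOS.write (Just bs) clinton
            else IOS.write (Just bs) trump
          loop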
With pipes and conduit there are various ways of achieving the above effect, with a higher level of decomposition of parts. Writing to separate files will however be a little more subtle. Here in any case is the pipes equivalent of Michael S's conduit program:
#!/usr/bin/env stack
-- stack --resolver lts-6.21 --install-ghc runghc --package pipes-zlib
{-# LANGUAGE OverloadedStrings #-}
import Control.Monad.IO.Class (MonadIO, liftIO)
import Data.ByteString (ByteString, hPutStr)
import System.IO (IOMode(..), withFile, Handle)
import Pipes
import qualified Pipes.ByteString as PB
import qualified Pipes.GZip as P
-- Some helper function you may have
someAction :: IO ByteString
someAction = return "This is a string\n"
-- Original version
producerHandle :: Handle -> IO ()
producerHandle h = do
  str <- someAction
  hPutStr h str

producerPipe :: MonadIO m => Producer ByteString m ()
producerPipe = do
  str <- liftIO someAction
  yield str
main :: IO ()
main = withFile "some-file-pipes.txt.gz" WriteMode $ \h ->
  runEffect $ P.compress P.defaultCompression producerPipe >-> PB.toHandle h
-- Edit
Here for what it's worth is yet another way of superimposing several producers on a single thread with pipes or conduit, to add to the different approaches Michael S and danidiaz mentioned:
#!/usr/bin/env stack
-- stack --resolver lts-6.21 --install-ghc runghc --package pipes-zlib
{-# LANGUAGE OverloadedStrings #-}
import Pipes
import Pipes.GZip
import qualified Pipes.Prelude as P
import qualified Pipes.ByteString as Bytes
import System.IO
import Control.Monad (replicateM_)
producer = replicateM_ 50000 $ do
    marie "This is going to Marie\n"   -- arbitrary IO can be interspersed here
    arthur "This is going to Arthur\n" -- with liftIO
    sylvia "This is going to Sylvia\n"
  where
    marie = yield; arthur = lift . yield; sylvia = lift . lift . yield
sinkHelper h p = runEffect (compress bestSpeed p >-> Bytes.toHandle h)
main :: IO ()
main =
  withFile "marie.txt.gz" WriteMode $ \marie ->
  withFile "arthur.txt.gz" WriteMode $ \arthur ->
  withFile "sylvia.txt.gz" WriteMode $ \sylvia ->
    sinkHelper sylvia
      $ sinkHelper arthur
      $ sinkHelper marie
      $ producer
It is quite simple and fast, and can be written in conduit with the obvious alterations - but finding it natural involves a higher level of buy-in with the 'monad transformer stack' point of view. It would be the most natural way of writing such a program from the point of view of something like the streaming library.
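For comparison, here is a minimal sketch of that layered style using the streaming and streaming-bytestring packages (my own sketch, not from the answers above; gzip is omitted for brevity):
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString as B
import qualified Data.ByteString.Streaming as Q
import Streaming
import qualified Streaming.Prelude as S
import System.IO

-- Each monad layer is one destination: yield writes to the innermost
-- layer, lift . yield to the next one out.
layered :: Monad m => Stream (Of B.ByteString) (Stream (Of B.ByteString) m) ()
layered = do
  S.yield "This is going to Marie\n"
  lift (S.yield "This is going to Arthur\n")

-- Drain one layer of chunks into a handle, leaving the rest of the stack.
sinkTo :: MonadIO m => Handle -> Stream (Of B.ByteString) m r -> m r
sinkTo h = Q.hPut h . Q.fromChunks

main :: IO ()
main =
  withFile "marie.txt" WriteMode $ \marie ->
  withFile "arthur.txt" WriteMode $ \arthur ->
    sinkTo arthur (sinkTo marie layered)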
Doing this with conduit is fairly straightforward, though you'd need to adjust your code a bit. I've put together an example of before and after code to demonstrate it. The basic idea is:
- Replace hPutStr h with yield
- Add some liftIO wrappers
- Instead of using withBinaryFile or the like, use runConduitRes, gzip, and sinkFile
Here's the example:
#!/usr/bin/env stack
-- stack --resolver lts-6.21 --install-ghc runghc --package conduit-extra
{-# LANGUAGE OverloadedStrings #-}
import Control.Monad.IO.Class (MonadIO, liftIO)
import Data.ByteString (ByteString, hPutStr)
import Data.Conduit (ConduitM, (.|), yield, runConduitRes)
import Data.Conduit.Binary (sinkFile)
import Data.Conduit.Zlib (gzip)
import System.IO (Handle)
-- Some helper function you may have
someAction :: IO ByteString
someAction = return "This is a string\n"
-- Original version
producerHandle :: Handle -> IO ()
producerHandle h = do
  str <- someAction
  hPutStr h str

-- Conduit version
producerConduit :: MonadIO m => ConduitM i ByteString m ()
producerConduit = do
  str <- liftIO someAction
  yield str
main :: IO ()
main = runConduitRes
  $ producerConduit
  .| gzip
  .| sinkFile "some-file.txt.gz"
You can learn more about conduit in the conduit tutorial.
Your Java idea is interesting; give me a few more minutes and I'll add an answer that looks more like that.
EDIT
Here's a version that's closer to your Java style approach. It relies on a SinkFunc.hs module which is available as a Gist at: https://gist.github.com/snoyberg/283154123d30ff9e201ea4436a5dd22d
#!/usr/bin/env stack
-- stack --resolver lts-6.21 --install-ghc runghc --package conduit-extra
{-# LANGUAGE OverloadedStrings #-}
{-# OPTIONS_GHC -Wall -Werror #-}
import Data.ByteString (ByteString)
import Data.Conduit ((.|))
import Data.Conduit.Binary (sinkHandle)
import Data.Conduit.Zlib (gzip)
import System.IO (withBinaryFile, IOMode (WriteMode))
import SinkFunc (withSinkFunc)
-- Some helper function you may have
someAction :: IO ByteString
someAction = return "This is a string\n"
producerFunc :: (ByteString -> IO ()) -> IO ()
producerFunc write = do
  str <- someAction
  write str
main :: IO ()
main = withBinaryFile "some-file.txt.gz" WriteMode $ \h -> do
  let sink = gzip .| sinkHandle h
  withSinkFunc sink $ \write -> producerFunc write
EDIT 2
One more for good measure, actually using ZipSink to stream the data to multiple different files. There are lots of different ways of slicing this, but this is one way that works:
#!/usr/bin/env stack
-- stack --resolver lts-6.21 --install-ghc runghc --package conduit-extra
{-# LANGUAGE OverloadedStrings #-}
import Control.Monad.Trans.Resource (MonadResource)
import Data.ByteString (ByteString)
import Data.Conduit (ConduitM, (.|), yield, runConduitRes, ZipSink (..))
import Data.Conduit.Binary (sinkFile)
import qualified Data.Conduit.List as CL
import Data.Conduit.Zlib (gzip)
data Output = Foo ByteString | Bar ByteString
fromFoo :: Output -> Maybe ByteString
fromFoo (Foo bs) = Just bs
fromFoo _ = Nothing
fromBar :: Output -> Maybe ByteString
fromBar (Bar bs) = Just bs
fromBar _ = Nothing
producer :: Monad m => ConduitM i Output m ()
producer = do
  yield $ Foo "This is going to Foo"
  yield $ Bar "This is going to Bar"
sinkHelper :: MonadResource m
           => FilePath
           -> (Output -> Maybe ByteString)
           -> ConduitM Output o m ()
sinkHelper fp f
    = CL.mapMaybe f
   .| gzip
   .| sinkFile fp
main :: IO ()
main = runConduitRes
  $ producer
  .| getZipSink
      (ZipSink (sinkHelper "foo.txt.gz" fromFoo) *>
       ZipSink (sinkHelper "bar.txt.gz" fromBar))
For incremental compression, I think you could make use of compressIO/foldCompressStream in Codec.Compression.Zlib.Internal.
If you're able to represent your producer action as an IO (Maybe a) (such as an MVar take or InputStream/Chan read) where Nothing signifies end of input, something like this should work:
import System.IO (Handle)
import qualified Data.ByteString as BS
import qualified Codec.Compression.Zlib.Internal as ZLib
compressedWriter :: Handle -> IO (Maybe BS.ByteString) -> IO ()
compressedWriter handle source =
  ZLib.foldCompressStream
    (\next -> source >>= maybe (next BS.empty) next)
    (\chunk next -> BS.hPut handle chunk >> next)
    (return ())
    (ZLib.compressIO ZLib.rawFormat ZLib.defaultCompressParams)
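A hypothetical usage example (my own, not part of the answer): feeding compressedWriter from a Chan, where writing Nothing ends the stream. Note that rawFormat produces a raw deflate stream rather than a gzip file, hence the made-up extension:
{-# LANGUAGE OverloadedStrings #-}
import Control.Concurrent.Chan (newChan, readChan, writeChan)
import System.IO (IOMode (WriteMode), withBinaryFile)

main :: IO ()
main = do
  chan <- newChan
  -- in a real program, another thread would feed the channel
  mapM_ (writeChan chan) [Just "hello\n", Just "world\n", Nothing]
  withBinaryFile "out.deflate" WriteMode $ \h ->
    compressedWriter h (readChan chan)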
This solution is similar to Michael Snoyman's EDIT 2, but uses the foldl, pipes, pipes-zlib and streaming-eversion packages.
{-# language OverloadedStrings #-}
module Main where
-- cabal install bytestring foldl pipes pipes-zlib streaming-eversion
import Data.Foldable
import Data.ByteString
import qualified Control.Foldl as L
import Pipes
import qualified Pipes.Prelude
import Pipes.Zlib (compress,defaultCompression,defaultWindowBits)
import Streaming.Eversion.Pipes (transvertMIO)
import System.IO
type Tag = String
producer :: Monad m => Producer (Tag,ByteString) m ()
producer = do
  yield ("foo","This is going to Foo")
  yield ("bar","This is going to Bar")
foldForTag :: Handle -> Tag -> L.FoldM IO (Tag,ByteString) ()
foldForTag handle tag =
    L.premapM (\(tag',bytes) -> if tag' == tag then Just bytes else Nothing)
  . L.handlesM L.folded
  . transvertMIO (compress defaultCompression defaultWindowBits)
  $ L.mapM_ (Data.ByteString.hPut handle)
main :: IO ()
main = do
  withFile "foo.txt" WriteMode $ \h1 ->
    withFile "bar.txt" WriteMode $ \h2 ->
      let multifold = traverse_ (uncurry foldForTag) [(h1,"foo"),(h2,"bar")]
      in L.impurely Pipes.Prelude.foldM multifold producer
This solution is similar to Michael Snoyman's EDIT 2, but uses the streaming, streaming-bytestring, pipes and pipes-zlib packages.
{-# language OverloadedStrings #-}
module Main where
-- cabal install bytestring streaming streaming-bytestring pipes pipes-zlib
import Data.ByteString
import qualified Data.ByteString.Streaming as B
import Streaming
import qualified Streaming.Prelude as S
import Pipes (next)
import qualified Pipes.Prelude
import Pipes.Zlib (compress,defaultCompression,defaultWindowBits)
import System.IO
import Control.Monad.Trans.Resource (runResourceT) -- needed for runResourceT in main
type Tag = String
producer :: Monad m => Stream (Of (Tag,ByteString)) m ()
producer = do
  S.yield ("foo","This is going to Foo")
  S.yield ("bar","This is going to Bar")
-- I couldn't find a streaming-zlib on Hackage, took a pipes detour
compress' :: MonadIO m
          => Stream (Of ByteString) m r -> Stream (Of ByteString) m r
compress' = S.unfoldr Pipes.next
          . compress defaultCompression defaultWindowBits
          . Pipes.Prelude.unfoldr S.next
keepTag :: Monad m
        => Tag -> Stream (Of (Tag,ByteString)) m r -> Stream (Of ByteString) m r
keepTag tag = S.map snd . S.filter ((tag==) . fst)
main :: IO ()
main = runResourceT
     . B.writeFile "foo.txt" . B.fromChunks . compress' . keepTag "foo"
     . B.writeFile "bar.txt" . B.fromChunks . compress' . keepTag "bar"
     $ S.copy producer
I make use of the copy function from Streaming.Prelude, which allows you to "duplicate the content of stream, so that it can be acted on twice in different ways, but without breaking streaming."
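A minimal illustration of S.copy on its own (my example, not from the answer): sum a stream on one monad layer while printing it on the other, in a single pass:
import Streaming
import qualified Streaming.Prelude as S

main :: IO ()
main = do
  -- the copied stream is consumed twice: printed by the inner layer,
  -- summed by the outer one, without materializing the whole stream
  total :> () <- S.sum (S.print (S.copy (S.each [1..5 :: Int])))
  print total -- 15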
I want to extract information from a large XML file (around 20 GB) in Haskell. Since it is a large file, I used SAX parsing functions from Hexpat.
Here is a simple piece of code I tested:
import qualified Data.ByteString.Lazy as L
import Data.List (foldl')
import Data.Text (Text)
import Text.XML.Expat.SAX as Sax

-- processEvent (not shown) folds each event into the accumulator
parse :: FilePath -> IO ()
parse path = do
  inputText <- L.readFile path
  let saxEvents = Sax.parse defaultParseOptions inputText :: [SAXEvent Text Text]
  let txt = foldl' processEvent "" saxEvents
  putStrLn txt
After activating profiling in Cabal, it says that parse.saxEvents took 85% of allocated memory. I also used foldr and the result is the same.
If processEvent becomes complex enough, the program crashes with a stack space overflow error.
What am I doing wrong?
You don't say what processEvent is like. In principle, it ought to be unproblematic to use lazy ByteString for a strict left fold over lazily generated input, so I'm not sure what is going wrong in your case. But one ought to use streaming-appropriate types when dealing with gigantic files!
In fact, hexpat does have a 'streaming' interface (just like xml-conduit). It uses the not-too-well-known List library and the rather ugly List class it defines. In principle the ListT type from the List package should work well. I gave up quickly because of a lack of combinators, and wrote an appropriate instance of the ugly List class for a wrapped version of Pipes.ListT, which I then used to export ordinary Pipes.Producer functions like parseProducer. The trivial manipulations needed for this are appended below as PipesSax.hs
Once we have parseProducer we can convert a ByteString or Text Producer into a Producer of SAXEvents with Text or ByteString components. Here are some simple operations. I was using a 238M "input.xml"; the programs never need more than 6 MB of memory, to judge from looking at top.
-- Sax.hs Most of the IO actions use the registryIds pipe defined at the bottom, which is tailored to a giant bit of xml of which this is a valid 1000 fragment http://sprunge.us/WaQK
{-#LANGUAGE OverloadedStrings #-}
import PipesSax ( parseProducer )
import Data.ByteString ( ByteString )
import Text.XML.Expat.SAX
import Pipes -- cabal install pipes pipes-bytestring
import Pipes.ByteString (toHandle, fromHandle, stdin, stdout )
import qualified Pipes.Prelude as P
import qualified System.IO as IO
import qualified Data.ByteString.Char8 as Char8
sax :: MonadIO m => Producer ByteString m ()
                 -> Producer (SAXEvent ByteString ByteString) m ()
sax = parseProducer defaultParseOptions
-- stream xml from stdin, yielding hexpat tagstream to stdout;
main0 :: IO ()
main0 = runEffect $ sax stdin >-> P.print
-- stream the extracted 'IDs' from stdin to stdout
main1 :: IO ()
main1 = runEffect $ sax stdin >-> registryIds >-> stdout
-- write all IDs to a file
main2 =
  IO.withFile "input.xml" IO.ReadMode $ \inp ->
  IO.withFile "output.txt" IO.WriteMode $ \out ->
    runEffect $ sax (fromHandle inp) >-> registryIds >-> toHandle out
-- folds:
-- print number of IDs
main3 = IO.withFile "input.xml" IO.ReadMode $ \inp -> do
  n <- P.length $ sax (fromHandle inp) >-> registryIds
  print n
-- sum the meaningful part of the IDs - a dumb fold for illustration
main4 = IO.withFile "input.xml" IO.ReadMode $ \inp -> do
    let pipeline = sax (fromHandle inp) >-> registryIds >-> P.map readIntId
    n <- P.fold (+) 0 id pipeline
    print n
  where
    readIntId :: ByteString -> Integer
    readIntId = maybe 0 (fromIntegral.fst) . Char8.readInt . Char8.drop 2
-- my xml has tags with attributes that appear via hexpat thus:
-- StartElement "FacilitySite" [("registryId","110007915364")]
-- and the like. This is just an arbitrary demo stream manipulation.
registryIds :: Monad m => Pipe (SAXEvent ByteString ByteString) ByteString m ()
registryIds = do
  e <- await -- we look for a 'SAXEvent'
  case e of  -- if it matches, we yield, else we go to the next event
    StartElement "FacilitySite" [("registryId",a)] -> do
      yield a
      yield "\n"
      registryIds
    _ -> registryIds
-- 'library': PipesSax.hs
This just newtypes Pipes.ListT to get the appropriate instances. We don't export anything to do with List or ListT but just use the standard Pipes.Producer concept.
{-#LANGUAGE TypeFamilies, GeneralizedNewtypeDeriving #-}
module PipesSax (parseProducerLocations, parseProducer) where
import Data.ByteString (ByteString)
import Text.XML.Expat.SAX
import Data.List.Class
import Control.Monad
import Control.Applicative
import Pipes
import qualified Pipes.Internal as I
parseProducer
  :: (Monad m, GenericXMLString tag, GenericXMLString text)
  => ParseOptions tag text
  -> Producer ByteString m ()
  -> Producer (SAXEvent tag text) m ()
parseProducer opt = enumerate . enumerate_
                  . parseG opt
                  . Select_ . Select
parseProducerLocations
  :: (Monad m, GenericXMLString tag, GenericXMLString text)
  => ParseOptions tag text
  -> Producer ByteString m ()
  -> Producer (SAXEvent tag text, XMLParseLocation) m ()
parseProducerLocations opt =
  enumerate . enumerate_ . parseLocationsG opt . Select_ . Select
newtype ListT_ m a = Select_ { enumerate_ :: ListT m a }
  deriving (Functor, Monad, MonadPlus, MonadIO,
            Applicative, Alternative, Monoid, MonadTrans)

instance Monad m => List (ListT_ m) where
  type ItemM (ListT_ m) = m
  joinL = Select_ . Select . I.M . liftM (enumerate . enumerate_)
  runList = liftM emend . next . enumerate . enumerate_
    where
      emend (Right (a,q)) = Cons a (Select_ (Select q))
      emend _ = Nil
I'm reading from a file using sourceFile, but I also need to introduce randomness into the processing operation. The best approach, I believe, is to have a producer of the type
Producer m (StdGen, ByteString)
where StdGen is used to generate the random number.
I'm intending for the producer to perform the task of sourceFile, as well as producing a new seed to yield every time it sends data downstream.
My problem is, there doesn't seem to be a source combiner like the ZipSink there is for sinks. Reading through the Conduit Overview, it seems to suggest that you can embed a Source inside a Conduit, but I'm failing to see how it is done in the example.
Can anyone provide an example in which you fuse two or more IO sources into one single Producer/Source?
EDIT :
An example:
{-# LANGUAGE NoImplicitPrelude #-}
{-# LANGUAGE RankNTypes #-}
{-# LANGUAGE OverloadedStrings #-}
import System.Random (StdGen(..), split, newStdGen, randomR)
import ClassyPrelude.Conduit as Prelude
import Control.Monad.Trans.Resource (runResourceT, ResourceT(..))
import qualified Data.ByteString as BS
-- generate an infinite source of random number seeds
sourceStdGen :: MonadIO m => Source m StdGen
sourceStdGen = do
  g <- liftIO newStdGen
  loop g
  where
    loop gin = do
      let g' = fst (split gin)
      yield gin
      loop g'
-- combine the sources into one
sourceInput :: (MonadResource m, MonadIO m) => FilePath -> Source m (StdGen, ByteString)
sourceInput fp = getZipSource $ (,)
  <$> ZipSource sourceStdGen
  <*> ZipSource (sourceFile fp)
-- a simple conduit that generates a random number from the provided StdGen
-- and appends the byte value to the provided ByteString
simpleConduit :: Conduit (StdGen, ByteString) (ResourceT IO) ByteString
simpleConduit = mapC process
process :: (StdGen, ByteString) -> ByteString
process (g, bs) =
  let rnd = fst $ randomR (40,50) g
  in bs ++ pack [rnd]
main :: IO ()
main = do
  runResourceT $ sourceInput "test.txt" $$ simpleConduit =$ sinkFile "output.txt"
So this example takes what's in the input file and writes it to the output file, as well as appending a random ASCII value between 40 and 50 to the end of the file. (Don't ask me why.)
You can use ZipSource for this. In your case, it might look something like:
sourceStdGens :: Source m StdGen
sourceBytes :: Source m ByteString

sourceBoth :: Source m (StdGen, ByteString)
sourceBoth = getZipSource $ (,)
  <$> ZipSource sourceStdGens
  <*> ZipSource sourceBytes
You can do it in the IO monad and then lift the result to a Producer. (Note that next from System.Random is pure, so it goes in a let binding.)
do let (i, newSeed) = next currentSeed
   b <- generateByteStringFromRandomNumber i
   return (b, newSeed)
That IO action can be lifted into the appropriate conduit with a simple lift:
-- assuming the above action is named x and takes the current seed as an argument
-- the corresponding producer/source is:
lift $ x currentSeed
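For what it's worth, here is a self-contained sketch of that idea (mine, not from the answer; randomBytes and the show-based encoding are just placeholders):
import Data.ByteString (ByteString)
import qualified Data.ByteString.Char8 as BC
import Data.Conduit (ConduitM, yield)
import System.Random (StdGen, next)

-- a hypothetical source that threads the seed itself, so downstream
-- only ever sees the generated bytes
randomBytes :: Monad m => Int -> StdGen -> ConduitM i ByteString m ()
randomBytes 0 _ = return ()
randomBytes n g =
  let (i, g') = next g -- pure step of the generator
  in yield (BC.pack (show i)) >> randomBytes (n - 1) g'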
Is it possible to create pipes that get all values that have been sent downstream in a certain time period? I'm implementing a server where the protocol allows me to concatenate outgoing packets and compress them together, so I'd like to effectively "empty out" the queue of downstream ByteStrings every 100ms and mappend them together to then yield on to the next pipe which does the compression.
Here's a solution using pipes-concurrency. You give it any Input and it will periodically drain the input of all values:
import Control.Applicative ((<|>))
import Control.Concurrent (threadDelay)
import Data.Foldable (forM_)
import Pipes
import Pipes.Concurrent
drainAll :: Input a -> STM (Maybe [a])
drainAll i = do
  ma <- recv i
  case ma of
    Nothing -> return Nothing
    Just a  -> loop (a:)
  where
    loop diffAs = do
      ma <- recv i <|> return Nothing
      case ma of
        Nothing -> return (Just (diffAs []))
        Just a  -> loop (diffAs . (a:))
bucketsEvery :: Int -> Input a -> Producer [a] IO ()
bucketsEvery microseconds i = loop
  where
    loop = do
      lift $ threadDelay microseconds
      ma <- lift $ atomically $ drainAll i
      forM_ ma $ \a -> do
        yield a
        loop
This gives you much greater control over how you consume elements from upstream, by selecting the type of Buffer you use to build the Input.
If you're new to pipes-concurrency, you can read the tutorial which explains how to use spawn, Buffer and Input.
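As a rough end-to-end sketch (my wiring, not part of the answer; it assumes pipes-concurrency's spawn and unbounded buffer, and relies on the Output being garbage collected to end the stream):
import Control.Concurrent (forkIO)
import Pipes
import Pipes.Concurrent
import qualified Pipes.Prelude as P

main :: IO ()
main = do
  (output, input) <- spawn unbounded
  -- a fast producer feeding the mailbox from another thread
  _ <- forkIO $ runEffect $ each [1..1000 :: Int] >-> toOutput output
  -- drain whatever has arrived every 100 ms, one list per bucket
  runEffect $ bucketsEvery 100000 input >-> P.print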
Here is a possible solution. It is based on a Pipe that tags ByteStrings going downstream with a Bool, in order to identify ByteStrings belonging to the same "time bucket".
First, some imports (plus OverloadedStrings for the ByteString literals used below):
{-# LANGUAGE OverloadedStrings #-}
import Data.AdditiveGroup
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Lazy.Builder as BB
import Data.Thyme.Clock
import Data.Thyme.Clock.POSIX
import Control.Monad.State.Strict
import Control.Lens (view)
import Control.Concurrent (threadDelay)
import Pipes
import Pipes.Lift
import qualified Pipes.Prelude as P
import qualified Pipes.Group as PG
Here is the tagging Pipe. It uses StateT internally:
tagger :: Pipe B.ByteString (B.ByteString,Bool) IO ()
tagger = do
  startTime <- liftIO getPOSIXTime
  evalStateP (startTime,False) $ forever $ do
    b <- await
    currentTime <- liftIO getPOSIXTime
    (baseTime,tag) <- get -- (POSIXTime,Bool) inner state
    if currentTime ^-^ baseTime > timeLimit
      then let tag' = not tag
           in yield (b,tag') >> put (currentTime, tag')
      else yield (b,tag)
  where
    timeLimit = fromSeconds 0.1
Then we can use functions from the pipes-group package to group ByteStrings belonging to the same "time bucket" into lazy ByteStrings:
batch :: Producer B.ByteString IO () -> Producer BL.ByteString IO ()
batch producer = PG.folds (<>) mempty BB.toLazyByteString
               . PG.maps (flip for $ yield . BB.byteString . fst)
               . view (PG.groupsBy $ \t1 t2 -> snd t1 == snd t2)
               $ producer >-> tagger
It seems to batch correctly. This program:
main :: IO ()
main = do
  count <- P.length $ batch (yield "boo" >> yield "baa")
  putStrLn $ show count
  count <- P.length $ batch (yield "boo" >> yield "baa"
                             >> liftIO (threadDelay 200000) >> yield "ddd")
  putStrLn $ show count
Has the output:
1
2
Notice that the contents of a "time bucket" are only yielded when the first element of the next bucket arrives. They are not yielded automatically every 100ms. This may or may not be a problem for you. If you want to yield automatically every 100ms, you would need a different solution, possibly based on pipes-concurrency.
Also, you could consider working directly with the FreeT-based "effectful lists" provided by pipes-group. That way you could start compressing the data in a "time bucket" before the bucket is full.
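For example, here is an untested sketch of mine of that idea, reusing tagger from above and Pipes.GZip's compress from earlier in this thread; each time bucket becomes its own gzip member, streamed as it arrives instead of being accumulated into a lazy ByteString first:
import Pipes.GZip (compress, defaultCompression)

-- hypothetical: a FreeT of compressed group-producers; each inner
-- Producer streams one time bucket through gzip incrementally
compressedBuckets
  :: Producer B.ByteString IO ()
  -> PG.FreeT (Producer B.ByteString IO) IO ()
compressedBuckets producer =
    PG.maps (compress defaultCompression . (>-> P.map fst))
  . view (PG.groupsBy $ \t1 t2 -> snd t1 == snd t2)
  $ producer >-> tagger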
So unlike Daniel's answer, mine does not tag the data as it is produced. It just takes at least one element from upstream and then continues to aggregate more values in the monoid until the time interval has passed.
This code uses a list to aggregate, but there are better monoids to aggregate with:
import Pipes
import qualified Pipes.Prelude as P
import Data.Time.Clock
import Data.Time.Calendar
import Data.Time.Format
import Data.Monoid
import Control.Monad
-- taken from pipes-rt
doubleToNomDiffTime :: Double -> NominalDiffTime
doubleToNomDiffTime x =
  let d0 = ModifiedJulianDay 0
      t0 = UTCTime d0 (picosecondsToDiffTime 0)
      t1 = UTCTime d0 (picosecondsToDiffTime $ floor (x/1e-12))
  in diffUTCTime t1 t0
-- Adapted from pipes-parse-1.0
wrap :: Monad m => Producer a m r -> Producer (Maybe a) m r
wrap p = do
  p >-> P.map Just
  forever $ yield Nothing
yieldAggregateOverTime
  :: (Monoid y,  -- monoid dependence so we can do aggregation
      MonadIO m) -- the base monad must have access to IO
                 -- to be able to get the current time
  => (t -> y)    -- change an element from upstream into the monoid
  -> Double      -- time in seconds to aggregate over
  -> Pipe (Maybe t) y m ()
yieldAggregateOverTime wrap period = do
  t0 <- liftIO getCurrentTime
  loop mempty (dtUTC `addUTCTime` t0)
  where
    dtUTC = doubleToNomDiffTime period
    loop m ts = do
      t <- liftIO getCurrentTime
      v0 <- await -- await at least one element
      case v0 of
        Nothing -> yield m
        Just v ->
          if t > ts
            then do
              yield (m <> wrap v)
              loop mempty (dtUTC `addUTCTime` ts)
            else loop (m <> wrap v) ts
main = do
  runEffect $ wrap (each [1..]) >-> yieldAggregateOverTime (\x -> [x]) 0.0001
                                >-> P.take 10 >-> P.print
Depending on CPU load, the output data will be aggregated differently, with at least one element in each chunk:
$ ghc Main.hs -O2
$ ./Main
[1,2]
[3]
[4]
[5]
[6]
[7]
[8]
[9]
[10]
[11]
$ ./Main
[1,2]
[3]
[4]
[5]
[6,7,8,9,10]
[11,12,13,14,15,16,17,18]
[19,20,21,22,23,24,25,26]
[27,28,29,30,31,32,33,34]
[35,36,37,38,39,40,41,42]
[43,44,45,46,47,48,49,50]
$ ./Main
[1,2,3,4,5,6]
[7]
[8]
[9,10,11,12,13,14,15,16,17,18,19,20]
[21,22,23,24,25,26,27,28,29,30,31,32,33]
[34,35,36,37,38,39,40,41,42,43,44]
[45,46,47,48,49,50,51,52,53,54,55]
[56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72]
[73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88]
[89,90,91,92,93,94,95,96,97,98,99,100,101,102,103]
$ ./Main
[1,2,3,4,5,6,7]
[8]
[9]
[10,11,12,13,14,15,16,17,18]
[19,20,21,22,23,24,25,26,27]
[28,29,30,31,32,33,34,35,36,37]
[38,39,40,41,42,43,44,45,46]
[47,48,49,50]
[51,52,53,54,55,56,57]
[58,59,60,61,62,63,64,65,66]
You might want to look at the source code of pipes-rt; it shows one approach to dealing with time in pipes.
edit: Thanks to Daniel Díaz Carrete; the pipes-parse-1.0 technique was adapted to handle upstream termination. A pipes-group solution should be possible using the same technique as well.