I use Haskell stream processing library pipes to write a command line tool. Each command line actions may output result to stdout and logs to stderr with pipes API.
I need Consumer which has type as Consumer (Either String String) m r to print chunk of data (Left to stderr, Right to stdout) with single Consumer.
Code I wrote (should be improved)
This function consumeEither doesn't have flexibility so I want to improve it.
consumeEither :: (MonadIO m) => Consumer (Either String String) m ()
consumeEither = do
eitherS <- await
case eitherS of
(Left l) -> for (yield l) (liftIO . (IO.hPutStrLn IO.stderr))
(Right r) -> for (yiled r) (liftIO . putStrLn)
Furthermore it would be useful to provide a function which takes two Consumers and merge them into one Consumer.
Question
Does anybody know good example or implementation of the following interface?
merge :: (Monad m) => Consumer a m r -> Consumer b m r -> Consumer (Either a b) m r
1st argument as stderr
2nd argument as stdout
Usage of the function
import Pipes
import qualified Pipes.Prelude as P
import qualified System.IO as IO
stdoutOrErr :: Consumer (Either String String) IO ()
stdoutOrErr = merge (P.toHandle IO.stderr) P.stdoutLn
Thanks
(This is #Michael's answer, but I'd like to write it up here so we can move the question out of the unanswered queue for the Haskell tag.)
See (+++) in pipes-extras. Keep in mind a Consumer is a Pipe (to nowhere), so P.toHandle IO.stderr +++ P.stdoutLn :: MonadIO m => Pipe (Either String String) (Either b d) m ().
To get a Consumer, you would have to get rid of the Lefts e.g with >-> P.concat or >-> P.drain. There are more robust and handsome ways of doing this with Folds.
Related
I want to process stream of events received via the MQTT. Library which I'm using uses a callback to provide the results. Processing I'm doing depends on the previous state not only the latest event. Also in the future events might be gathered from the other sources.
At the first I decided to compose it into the list which sounds as a good idea. I had the minor issue cause IO prevents lazy evaluation and waiting for infinite stream might be long, but I solved it with interleaving IO.
stream :: IO [Event] allows me to do the nice stuff like foldl, foldM map, mapM, etc... Unfortunately with this approach I rather wont be able to combine two streams, cause there is no more locking feature there.
I was diging through many libs, and found STM with TQueue for example. Unfortunately it is not what I exactly want.
I decide to create custom type and make it Foldable so I will be able to fold it. I failed due to IO.
import Control.Concurrent.STM
newtype Stream a = Stream (STM a)
runStream
:: ((a -> IO ()) -> IO i)
-> IO (Stream a)
runStream block = do
queue <- newTQueueIO
block (atomically . writeTQueue queue)
return $ Stream (readTQueue queue)
foldStream :: (a -> b -> IO b) -> b -> Stream a -> IO b
foldStream f s (Stream read) = do
n <- atomically read
m <- f n s
foldStream f m (Stream read)
mapStream :: (a -> b) -> Stream a -> Stream b
mapStream f (Stream read) = Stream $ f <$> read
zipStream :: [Stream a] -> Stream a
zipStream = undefined
Whih can be used like main = foldStream (\x _ -> print x) () =<< events
Is it possible to implement base some of base classes to work with this stream as with regular List?
The usual trick in these cases is to make the callback write to a queue, and then read from the other end of the queue.
Using a bounded, closeable queue from the stm-chans package, we can define this function:
import Control.Concurrent.STM
import Control.Concurrent.STM.TBMQueue
foldQueue :: TBMQueue a -> (x -> a -> IO x) -> IO x -> (x -> IO b) -> IO b
foldQueue queue step start done =
let go state =
do m <- atomically (readTBMQueue queue)
case m of
Nothing -> done state
Just a -> step state a >>= go
in start >>= go
It takes the channel, a step function (similar to the one required by foldM), an action to obtain the initial state, and a "done" action that returns the final result, and then feeds data from the channel until it is closed. Notice that the fold state x is chosen by the caller of foldQueue.
If later we want to upgrade to the monadic folds from the foldl package—which have a very useful Applicative instance—we can do it like this:
import qualified Control.Foldl as L
foldQueue' :: TBMQueue a -> L.FoldM IO a b -> IO b
foldQueue' queue = L.impurely (foldQueue queue)
Using impurely from the "foldl" package.
Sometimes (like when parsing, grouping, or decoding) it's easier to use a pull-based consumer. We can do that with the streaming package:
import Streaming
import qualified Streaming.Prelude as S
foldQueue' :: TBMQueue a -> (Stream (Of a) IO () -> IO r) -> IO r
foldQueue' queue consume = consume (S.untilRight (do
m <- atomically (readTBMQueue queue)
return (case m of
Nothing -> Right ()
Just a -> Left a)))
Given a function that consumes a stream, we feed to it a stream of values read from the queue.
Often, reading from the channel and writing to it must happen in different threads. We can use functions like concurrently from async to handle it cleanly.
This is a question about the Haskell streaming library.
Stream (Of a) m r is a "stream of individual Haskell values derived from actions in some monad m and returning a value of type r". Streaming.Prelude defines many useful functions that allow nice streaming applications:
import qualified Streaming.Prelude as S
S.print $ do
S.yield "a"
S.yield "b"
S.yield "c"
The tutorial is good for getting started.
Now, the particular issue at hand is how to use this framework with a monad that requires careful instantiation and release of resources. The streaming-with package seems to be the right candidate, it has a function
bracket :: MonadMask m => m a -> (a -> m c) -> (a -> m b) -> m b
that acquires (m a), releases (a->m c) and uses (a->m b) a resource. All three actions are encapsulated in the returned m b. withFile is a good example for how to use this:
withFile :: FilePath -> IOMode -> (Handle -> m r) -> m r
withFile fp md = bracket (liftIO (openFile fp md)) (liftIO . hClose)
Acquisition and release of the handle are nicely sandwiching the usage Handle->m r.
But: I absolutely do not see how this should be used with Stream (Of a) m r. I have to provide an a->m b and I get a m b. How is this supposed to be connected so that I obtain a Stream?
To figure this out, let's play with withFile:
import System.IO
use :: Handle -> IO (S.Stream (Of String) IO ())
use = return . S.repeatM . hGetLine
main :: IO ()
main = do
str <- S.withFile "input.dat" ReadMode use
S.print str
but that results in hGetLine: illegal operation (handle is closed). That actually makes sense, by the time S.print str is called withFile has already acquired and released the handle.
So let's move the stream consumption inside the use function:
use :: Handle -> IO ()
use h = do
S.print $ S.repeatM (hGetLine h)
and that gives a hGetLine: invalid argument (invalid byte sequence). I'm not quite sure what this error means. An isEOFError would be acceptable, but 'invalid byte sequence'? In any case, this doesn't work either.
I'm running out of ideas... How is this done?
The withFile is just a toy example, the question is really about how to correctly create and consume a stream inside a bracket.
let's move the stream consumption inside the use function
This is indeed the right approach.
I'm actually getting a proper hGetLine: end of file when running the example code. The problem is that S.repeatM (hGetLine h) never bothers to check if it has reached then end of the file, and throws an exception when it bumps into it.
The following definition of use doesn't have that problem:
use :: Handle -> IO ()
use h = do
S.print $ S.untilRight $ do eof <- System.IO.hIsEOF h
if eof then Right <$> pure ()
else Left <$> hGetLine h
It uses the untilRight function.
I have the following types built from Data.Conduit:
type Footers = [(ByteString, ByteString)]
type DataAndConclusion = ConduitM () ByteString IO Footers
The idea of the second type being "produce a lot of ByteStrings, and if you can produce all of them, return a Footers". The condition is because conduits are governed by downstream functions, so the consumer of DataAndConclusion may have no need to consume all its items, and in that case the return wouldn't be reached. Which is precisely the behavior that I need. But when the end of the source is reached, I would like to have the produced Footers. This would be useful for example if the DataAndConclusions were incrementally computing an MD5 and such MD5 was only needed if the entire message was processed by the downstream (for example, downstream could be simply sending it through the network, but it doesn't make sense to finish computing and send the MD5 if the socket was closed before the last piece was sent by downstream).
So, basically I want to have something with this signature to consume a DataAndConclusions:
type MySink = Sink ByteString IO ()
mySink :: MySink
mySink = ...
difficultFunction :: ConduitM () a2 m r1 -> ConduitM a2 Void m r2 -> m (Maybe r1)
Question is, is there any way to implement "difficultFunction"? How?
There should be definitely a nice solution, but I wasn't able to construct it using ConduitM primitives. Something with signature
ConduitM i a m r1 -> ConduitM a o m r2 -> ConduitM i o m (Maybe r1, r2)
Looks like a primitive function with this signature would be a good addition for the conduit library.
Nevertheless, #danidiaz's suggestion about StateT lead me to the following generic solution that lifts the whole computation to WriterT internally in order to remember the output of the first conduit, if it's reached:
import Control.Monad
import Control.Monad.Trans
import Control.Monad.Trans.Writer
import Data.Conduit
import Data.Monoid
import Data.Void
difficultFunction :: (Monad m)
=> ConduitM () a2 m r1 -> ConduitM a2 Void m r2
-> m (r2, Maybe r1)
difficultFunction l r = liftM (fmap getLast) $ runWriterT (l' $$ r')
where
l' = transPipe lift l >>= lift . tell . Last . Just
r' = transPipe lift r
(untested!)
This would be useful for example if the DataAndConclusions were
incrementally computing an MD5 and such MD5 was only needed if the
entire message was processed by the downstream
Instead of relying on the return value of the upstream conduit, in this case perhaps you could accumulate the ongoing MD5 computation in a StateT layer beneath ConduitM, and access it after running the conduit.
The other part of the puzzle is detecting that the producer has finished first. Sinks can detect upstream end-of-input in await calls. You could write a Sink that notifies you of upstream termination in its own result type, perhaps with a Maybe.
But what if you are given a Sink that doesn't already do that? We would need a function like Sink i m r -> Sink i m (Maybe r). "Given a Sink that may halt early, return a new Sink that returns Nothing if upstream finishes first". But I don't know how to write that function.
Edit: This conduit sets an IORef to True when it detects upstream termination:
detectUpstreamClose :: IORef Bool -> Conduit i IO i
detectUpstreamClose ref = conduit
where
conduit = do
m <- await
case m of
Nothing -> liftIO (writeIORef ref True)
Just i -> do
yield i
conduit
detectUpstreamClose could be inserted in a pipeline, and the IORef could be checked afterwards.
I'm trying to dispatch the items from a conduit into many output files, the problem is very similar to Conduit - Multiple output file within the pipeline, with a few differences:
In the previous solution, every sink has a filter that decides if the element belongs to that sink or not. In my case every element coming from the upstream goes exactly to one file, and in the case where there is a big number of files it would be better to make only one operation to decide to which file is it going.
The files are created on demand. A "selector" function decides which sink the next element is going to, and if it doesn't exist yet it creates it using a "create new sink" function.
For example if the Source yields: 8 4 7 1 5
And the sink selector is a module 3, then the sequence of actions would be:
Create file 2
Add 8 to file 2
Create file 1
Add 4 to file 1
Add 7 to file 1
Add 1 to file 1
Add 5 to file 2
I'm thinking of a type for this dispatcher like this:
dispatcherSink_ :: (Monad m) =>
(a -> k) -> -- sink selector
(k -> Sink a m ()) -> -- new sink
Sink a m ()
I've tried to write the function using evalStateC with an internal StateT holding a Map of Sinks, but I'm not able to tie up the types. I'm not sure if you can even use the same sink twice.
Is what I'm trying to do even possible?
I'm still a newbie in Haskell, so any help will be appreciated.
Edited
I though I could create a map of ResumableSinks, there is a library in Hackage for that, but it depends on an old and very specific version of Conduit, so cabal couldn't install it.
In the end I didn't find a way to write the function with the previous type, able to work with any sink, so I came up with a function that works with files directly:
import System.IO (hClose,openFile,IOMode(WriteMode))
import Conduit
import Data.IOData
import qualified Data.Foldable as F
import qualified Data.Map.Strict as M
import Control.Monad.State.Strict
import Data.ByteString.Char8 (pack)
fileDispatcherSink ::
(MonadIO m, IOData c,Ord k) =>
(a -> k) ->
(a -> c) ->
(k -> FilePath) ->
Sink a m ()
fileDispatcherSink selector toChunked path =
evalStateC M.empty $ dispatcher
where
dispatcher = do
next <- await
m <- get
case next of
Nothing -> liftIO $ F.traverse_ hClose m
Just a -> do
let k = selector a
h <- case M.lookup k m of
Nothing -> do
nh <- liftIO $ openFile (path k) WriteMode
put $ M.insert k nh m
return nh
Just h -> return h
yield (toChunked a) $$ sinkHandle h
dispatcher
testSource :: (Monad m) => Source m Int
testSource = yieldMany [8, 4, 7, 1, 5]
main :: IO ()
main = testSource
$$ fileDispatcherSink (`mod` 3) (pack . show) ((++ ".txt") . show)
Is there a way to write the _dispatcherSink__ function?
There is a conceptual problem with implementing
dispatcherSink_ :: (Monad m) =>
(a -> k) -> -- sink selector
(k -> Sink a m ()) -> -- new sink
Sink a m ()
. In conduit, data is pulled from upstream to downstream, instead of being pushed. So a Sink decides if it requests a next input value from its upstream conduit or not. So you can't really have a map of Sinks, read an input value and then feed it to one of the Sinks. The Sink you select might not decide to read the input value, it might decide to finish, and then what will you do with the input value? You can create a new sink for that key, but it can also decide not to accept the input.
So instead of a Sink you'll most likely need some different concept, something to which you can push a value and also what you can finalize. An idea (untested):
data PushSink m i = PushSink { psPush :: i -> m (PushSink m i)
, psFinalize :: m () }
An implementation for writing files would open a file, keep the handle, and psPush would just write a chunk into the file, returning the same object, while psFinalize would close the file.
And then you can implement a variant like this
dispatcherSink_ :: (Monad m) =>
(a -> k) -> -- sink selector
(k -> m (PushSink a m)) -> -- new sink
Sink a m ()
which pushes values to PushSinks and finalizes them all when there is no input.
I'm having a hard time to write a pipe with this signature:
toOneBigList :: (Monad m, Proxy p) => () -> Pipe p a [a] m r
It should simply take all as from upstream and send them in a list downstream.
All my attempts look fundamentally broken.
Can anybody point me in the right direction?
There are two pipes-based solutions and I'll let you pick which one you prefer.
Note: It's not clear why you output the list on the downstream interface instead of just returning it directly as a result.
Conduit-style
The first one, which is very close to the conduit-based solution uses the upcoming pipes-pase, which is basically complete and just needs documentation. You can find the latest draft on Github.
Using pipes-parse, the solution is identical to the conduit solution that Petr gave:
import Control.Proxy
import Control.Proxy.Parse
combine
:: (Monad m, Proxy p)
=> () -> Pipe (StateP [Maybe a] p) (Maybe a) [a] m ()
combine () = loop []
where
loop as = do
ma <- draw
case ma of
Nothing -> respond (reverse as)
Just a -> loop (a:as)
draw is like conduit's await: it requests a value from either the leftovers buffer (that's the StateP part) or from upstream if the buffer is empty. Nothing indicates end of file.
You can wrap a pipe that does not have an end of file signal using the wrap function from pipes-parse, which has type:
wrap :: (Monad m, Proxy p) => p a' a b' b m r -> p a' a b' (Maybe b) m s
Classic Pipes Style
The second alternative is a bit simpler. If you want to fold a given pipe you can do so directly using WriterP:
import Control.Proxy
import Control.Proxy.Trans.Writer
foldIt
:: (Monad m, Proxy p) =>
(() -> Pipe p a b m ()) -> () -> Pipe p a [b] m ()
foldIt p () = runIdentityP $ do
r <- execWriterK (liftP . p >-> toListD >-> unitU) ()
respond r
That's a higher-level description of what is going on, but it requires passing in the pipe as an explicit argument. It's up to you which one you prefer.
By the way, this is why I was asking why you want to send a single value downstream. The above is much simpler if you return the folded list:
foldIt p = execWriterK (liftP . p >-> toListD)
The liftP might not even be necessary if p is completely polymorphic in its proxy type. I only include it as a precaution.
Bonus Solution
The reason pipes-parse does not provide the toOneBigList is that it's always a pipes anti-pattern to group the results into a list. pipes has several nice features that make it possible to never have to group the input into a list, even if you are trying to yield multiple lists. For example, using respond composition you can have a proxy yield the subset of the stream it would have traversed and then inject a handler that uses that subset:
example :: (Monad m, Proxy p) => () -> Pipe p a (() -> Pipe p a a m ()) m r
example () = runIdentityP $ forever $ do
respond $ \() -> runIdentityP $ replicateM_ 3 $ request () >>= respond
printIt :: (Proxy p, Show a) => () -> Pipe p a a IO r
printIt () = runIdentityP $ do
lift $ putStrLn "Here we go!"
printD ()
useIt :: (Proxy p, Show a) => () -> Pipe p a a IO r
useIt = example />/ (\p -> (p >-> printIt) ())
Here's an example of how to use it:
>>> runProxy $ enumFromToS 1 10 >-> useIt
Here we go!
1
2
3
Here we go!
4
5
6
Here we go!
7
8
9
Here we go!
10
This means you never need to bring a single element into memory even when you need to group elements.
I'll give only a partial answer, perhaps somebody else will have a better one.
As far as I know, standard pipes have no mechanism of detecting when the other part of the pipeline terminates. The first pipe that terminates produces the final result of the pipe-line and all the others are just dropped. So if you have a pipe that consumes input forever (to eventually produce a list), it will have no chance acting and producing output when its upstream finishes. (This is intentional so that both up- and down-stream parts are dual to each other.) Perhaps this is solved in some library building on top of pipes.
The situation is different with conduit. It has consume function that combines all inputs into a list and returns (not outputs) it. Writing a function like the one you need, that outputs the list at the end, is not difficult:
import Data.Conduit
combine :: (Monad m) => Conduit a m [a]
combine = loop []
where
loop xs = await >>= maybe (yield $ reverse xs) (loop . (: xs))