Conduit - Dispatch into multiple output files

Conduit - Dispatch into multiple output files - haskell

I'm trying to dispatch the items from a conduit into many output files, the problem is very similar to Conduit - Multiple output file within the pipeline, with a few differences:
In the previous solution, every sink has a filter that decides if the element belongs to that sink or not. In my case every element coming from the upstream goes exactly to one file, and in the case where there is a big number of files it would be better to make only one operation to decide to which file is it going.
The files are created on demand. A "selector" function decides which sink the next element is going to, and if it doesn't exist yet it creates it using a "create new sink" function.
For example if the Source yields: 8 4 7 1 5
And the sink selector is a module 3, then the sequence of actions would be:
Create file 2
Add 8 to file 2
Create file 1
Add 4 to file 1
Add 7 to file 1
Add 1 to file 1
Add 5 to file 2
I'm thinking of a type for this dispatcher like this:
dispatcherSink_ :: (Monad m) =>
(a -> k) -> -- sink selector
(k -> Sink a m ()) -> -- new sink
Sink a m ()
I've tried to write the function using evalStateC with an internal StateT holding a Map of Sinks, but I'm not able to tie up the types. I'm not sure if you can even use the same sink twice.
Is what I'm trying to do even possible?
I'm still a newbie in Haskell, so any help will be appreciated.
Edited
I though I could create a map of ResumableSinks, there is a library in Hackage for that, but it depends on an old and very specific version of Conduit, so cabal couldn't install it.
In the end I didn't find a way to write the function with the previous type, able to work with any sink, so I came up with a function that works with files directly:
import System.IO (hClose,openFile,IOMode(WriteMode))
import Conduit
import Data.IOData
import qualified Data.Foldable as F
import qualified Data.Map.Strict as M
import Control.Monad.State.Strict
import Data.ByteString.Char8 (pack)
fileDispatcherSink ::
(MonadIO m, IOData c,Ord k) =>
(a -> k) ->
(a -> c) ->
(k -> FilePath) ->
Sink a m ()
fileDispatcherSink selector toChunked path =
evalStateC M.empty $ dispatcher
where
dispatcher = do
next <- await
m <- get
case next of
Nothing -> liftIO $ F.traverse_ hClose m
Just a -> do
let k = selector a
h <- case M.lookup k m of
Nothing -> do
nh <- liftIO $ openFile (path k) WriteMode
put $ M.insert k nh m
return nh
Just h -> return h
yield (toChunked a) $$ sinkHandle h
dispatcher
testSource :: (Monad m) => Source m Int
testSource = yieldMany [8, 4, 7, 1, 5]
main :: IO ()
main = testSource
$$ fileDispatcherSink (`mod` 3) (pack . show) ((++ ".txt") . show)
Is there a way to write the _dispatcherSink__ function?

There is a conceptual problem with implementing
dispatcherSink_ :: (Monad m) =>
(a -> k) -> -- sink selector
(k -> Sink a m ()) -> -- new sink
Sink a m ()
. In conduit, data is pulled from upstream to downstream, instead of being pushed. So a Sink decides if it requests a next input value from its upstream conduit or not. So you can't really have a map of Sinks, read an input value and then feed it to one of the Sinks. The Sink you select might not decide to read the input value, it might decide to finish, and then what will you do with the input value? You can create a new sink for that key, but it can also decide not to accept the input.
So instead of a Sink you'll most likely need some different concept, something to which you can push a value and also what you can finalize. An idea (untested):
data PushSink m i = PushSink { psPush :: i -> m (PushSink m i)
, psFinalize :: m () }
An implementation for writing files would open a file, keep the handle, and psPush would just write a chunk into the file, returning the same object, while psFinalize would close the file.
And then you can implement a variant like this
dispatcherSink_ :: (Monad m) =>
(a -> k) -> -- sink selector
(k -> m (PushSink a m)) -> -- new sink
Sink a m ()
which pushes values to PushSinks and finalizes them all when there is no input.

Related

Handling event streams in haskell

I want to process stream of events received via the MQTT. Library which I'm using uses a callback to provide the results. Processing I'm doing depends on the previous state not only the latest event. Also in the future events might be gathered from the other sources.
At the first I decided to compose it into the list which sounds as a good idea. I had the minor issue cause IO prevents lazy evaluation and waiting for infinite stream might be long, but I solved it with interleaving IO.
stream :: IO [Event] allows me to do the nice stuff like foldl, foldM map, mapM, etc... Unfortunately with this approach I rather wont be able to combine two streams, cause there is no more locking feature there.
I was diging through many libs, and found STM with TQueue for example. Unfortunately it is not what I exactly want.
I decide to create custom type and make it Foldable so I will be able to fold it. I failed due to IO.
import Control.Concurrent.STM
newtype Stream a = Stream (STM a)
runStream
:: ((a -> IO ()) -> IO i)
-> IO (Stream a)
runStream block = do
queue <- newTQueueIO
block (atomically . writeTQueue queue)
return $ Stream (readTQueue queue)
foldStream :: (a -> b -> IO b) -> b -> Stream a -> IO b
foldStream f s (Stream read) = do
n <- atomically read
m <- f n s
foldStream f m (Stream read)
mapStream :: (a -> b) -> Stream a -> Stream b
mapStream f (Stream read) = Stream $ f <$> read
zipStream :: [Stream a] -> Stream a
zipStream = undefined
Whih can be used like main = foldStream (\x _ -> print x) () =<< events
Is it possible to implement base some of base classes to work with this stream as with regular List?

The usual trick in these cases is to make the callback write to a queue, and then read from the other end of the queue.
Using a bounded, closeable queue from the stm-chans package, we can define this function:
import Control.Concurrent.STM
import Control.Concurrent.STM.TBMQueue
foldQueue :: TBMQueue a -> (x -> a -> IO x) -> IO x -> (x -> IO b) -> IO b
foldQueue queue step start done =
let go state =
do m <- atomically (readTBMQueue queue)
case m of
Nothing -> done state
Just a -> step state a >>= go
in start >>= go
It takes the channel, a step function (similar to the one required by foldM), an action to obtain the initial state, and a "done" action that returns the final result, and then feeds data from the channel until it is closed. Notice that the fold state x is chosen by the caller of foldQueue.
If later we want to upgrade to the monadic folds from the foldl package—which have a very useful Applicative instance—we can do it like this:
import qualified Control.Foldl as L
foldQueue' :: TBMQueue a -> L.FoldM IO a b -> IO b
foldQueue' queue = L.impurely (foldQueue queue)
Using impurely from the "foldl" package.
Sometimes (like when parsing, grouping, or decoding) it's easier to use a pull-based consumer. We can do that with the streaming package:
import Streaming
import qualified Streaming.Prelude as S
foldQueue' :: TBMQueue a -> (Stream (Of a) IO () -> IO r) -> IO r
foldQueue' queue consume = consume (S.untilRight (do
m <- atomically (readTBMQueue queue)
return (case m of
Nothing -> Right ()
Just a -> Left a)))
Given a function that consumes a stream, we feed to it a stream of values read from the queue.
Often, reading from the channel and writing to it must happen in different threads. We can use functions like concurrently from async to handle it cleanly.

How to merge two Consumer into one in Haskell Pipes?

I use Haskell stream processing library pipes to write a command line tool. Each command line actions may output result to stdout and logs to stderr with pipes API.
I need Consumer which has type as Consumer (Either String String) m r to print chunk of data (Left to stderr, Right to stdout) with single Consumer.
Code I wrote (should be improved)
This function consumeEither doesn't have flexibility so I want to improve it.
consumeEither :: (MonadIO m) => Consumer (Either String String) m ()
consumeEither = do
eitherS <- await
case eitherS of
(Left l) -> for (yield l) (liftIO . (IO.hPutStrLn IO.stderr))
(Right r) -> for (yiled r) (liftIO . putStrLn)
Furthermore it would be useful to provide a function which takes two Consumers and merge them into one Consumer.
Question
Does anybody know good example or implementation of the following interface?
merge :: (Monad m) => Consumer a m r -> Consumer b m r -> Consumer (Either a b) m r
1st argument as stderr
2nd argument as stdout
Usage of the function
import Pipes
import qualified Pipes.Prelude as P
import qualified System.IO as IO
stdoutOrErr :: Consumer (Either String String) IO ()
stdoutOrErr = merge (P.toHandle IO.stderr) P.stdoutLn
Thanks

(This is #Michael's answer, but I'd like to write it up here so we can move the question out of the unanswered queue for the Haskell tag.)
See (+++) in pipes-extras. Keep in mind a Consumer is a Pipe (to nowhere), so P.toHandle IO.stderr +++ P.stdoutLn :: MonadIO m => Pipe (Either String String) (Either b d) m ().
To get a Consumer, you would have to get rid of the Lefts e.g with >-> P.concat or >-> P.drain. There are more robust and handsome ways of doing this with Folds.

Haskell Conduit: is it possible to optionally have the result of a source?

I have the following types built from Data.Conduit:
type Footers = [(ByteString, ByteString)]
type DataAndConclusion = ConduitM () ByteString IO Footers
The idea of the second type being "produce a lot of ByteStrings, and if you can produce all of them, return a Footers". The condition is because conduits are governed by downstream functions, so the consumer of DataAndConclusion may have no need to consume all its items, and in that case the return wouldn't be reached. Which is precisely the behavior that I need. But when the end of the source is reached, I would like to have the produced Footers. This would be useful for example if the DataAndConclusions were incrementally computing an MD5 and such MD5 was only needed if the entire message was processed by the downstream (for example, downstream could be simply sending it through the network, but it doesn't make sense to finish computing and send the MD5 if the socket was closed before the last piece was sent by downstream).
So, basically I want to have something with this signature to consume a DataAndConclusions:
type MySink = Sink ByteString IO ()
mySink :: MySink
mySink = ...
difficultFunction :: ConduitM () a2 m r1 -> ConduitM a2 Void m r2 -> m (Maybe r1)
Question is, is there any way to implement "difficultFunction"? How?

There should be definitely a nice solution, but I wasn't able to construct it using ConduitM primitives. Something with signature
ConduitM i a m r1 -> ConduitM a o m r2 -> ConduitM i o m (Maybe r1, r2)
Looks like a primitive function with this signature would be a good addition for the conduit library.
Nevertheless, #danidiaz's suggestion about StateT lead me to the following generic solution that lifts the whole computation to WriterT internally in order to remember the output of the first conduit, if it's reached:
import Control.Monad
import Control.Monad.Trans
import Control.Monad.Trans.Writer
import Data.Conduit
import Data.Monoid
import Data.Void
difficultFunction :: (Monad m)
=> ConduitM () a2 m r1 -> ConduitM a2 Void m r2
-> m (r2, Maybe r1)
difficultFunction l r = liftM (fmap getLast) $ runWriterT (l' $$ r')
where
l' = transPipe lift l >>= lift . tell . Last . Just
r' = transPipe lift r
(untested!)

This would be useful for example if the DataAndConclusions were
incrementally computing an MD5 and such MD5 was only needed if the
entire message was processed by the downstream
Instead of relying on the return value of the upstream conduit, in this case perhaps you could accumulate the ongoing MD5 computation in a StateT layer beneath ConduitM, and access it after running the conduit.
The other part of the puzzle is detecting that the producer has finished first. Sinks can detect upstream end-of-input in await calls. You could write a Sink that notifies you of upstream termination in its own result type, perhaps with a Maybe.
But what if you are given a Sink that doesn't already do that? We would need a function like Sink i m r -> Sink i m (Maybe r). "Given a Sink that may halt early, return a new Sink that returns Nothing if upstream finishes first". But I don't know how to write that function.
Edit: This conduit sets an IORef to True when it detects upstream termination:
detectUpstreamClose :: IORef Bool -> Conduit i IO i
detectUpstreamClose ref = conduit
where
conduit = do
m <- await
case m of
Nothing -> liftIO (writeIORef ref True)
Just i -> do
yield i
conduit
detectUpstreamClose could be inserted in a pipeline, and the IORef could be checked afterwards.

How can I write a pipe that sends downstream a list of what it receives from upstream?

I'm having a hard time to write a pipe with this signature:
toOneBigList :: (Monad m, Proxy p) => () -> Pipe p a [a] m r
It should simply take all as from upstream and send them in a list downstream.
All my attempts look fundamentally broken.
Can anybody point me in the right direction?

There are two pipes-based solutions and I'll let you pick which one you prefer.
Note: It's not clear why you output the list on the downstream interface instead of just returning it directly as a result.
Conduit-style
The first one, which is very close to the conduit-based solution uses the upcoming pipes-pase, which is basically complete and just needs documentation. You can find the latest draft on Github.
Using pipes-parse, the solution is identical to the conduit solution that Petr gave:
import Control.Proxy
import Control.Proxy.Parse
combine
:: (Monad m, Proxy p)
=> () -> Pipe (StateP [Maybe a] p) (Maybe a) [a] m ()
combine () = loop []
where
loop as = do
ma <- draw
case ma of
Nothing -> respond (reverse as)
Just a -> loop (a:as)
draw is like conduit's await: it requests a value from either the leftovers buffer (that's the StateP part) or from upstream if the buffer is empty. Nothing indicates end of file.
You can wrap a pipe that does not have an end of file signal using the wrap function from pipes-parse, which has type:
wrap :: (Monad m, Proxy p) => p a' a b' b m r -> p a' a b' (Maybe b) m s
Classic Pipes Style
The second alternative is a bit simpler. If you want to fold a given pipe you can do so directly using WriterP:
import Control.Proxy
import Control.Proxy.Trans.Writer
foldIt
:: (Monad m, Proxy p) =>
(() -> Pipe p a b m ()) -> () -> Pipe p a [b] m ()
foldIt p () = runIdentityP $ do
r <- execWriterK (liftP . p >-> toListD >-> unitU) ()
respond r
That's a higher-level description of what is going on, but it requires passing in the pipe as an explicit argument. It's up to you which one you prefer.
By the way, this is why I was asking why you want to send a single value downstream. The above is much simpler if you return the folded list:
foldIt p = execWriterK (liftP . p >-> toListD)
The liftP might not even be necessary if p is completely polymorphic in its proxy type. I only include it as a precaution.
Bonus Solution
The reason pipes-parse does not provide the toOneBigList is that it's always a pipes anti-pattern to group the results into a list. pipes has several nice features that make it possible to never have to group the input into a list, even if you are trying to yield multiple lists. For example, using respond composition you can have a proxy yield the subset of the stream it would have traversed and then inject a handler that uses that subset:
example :: (Monad m, Proxy p) => () -> Pipe p a (() -> Pipe p a a m ()) m r
example () = runIdentityP $ forever $ do
respond $ \() -> runIdentityP $ replicateM_ 3 $ request () >>= respond
printIt :: (Proxy p, Show a) => () -> Pipe p a a IO r
printIt () = runIdentityP $ do
lift $ putStrLn "Here we go!"
printD ()
useIt :: (Proxy p, Show a) => () -> Pipe p a a IO r
useIt = example />/ (\p -> (p >-> printIt) ())
Here's an example of how to use it:
>>> runProxy $ enumFromToS 1 10 >-> useIt
Here we go!
1
2
3
Here we go!
4
5
6
Here we go!
7
8
9
Here we go!
10
This means you never need to bring a single element into memory even when you need to group elements.

I'll give only a partial answer, perhaps somebody else will have a better one.
As far as I know, standard pipes have no mechanism of detecting when the other part of the pipeline terminates. The first pipe that terminates produces the final result of the pipe-line and all the others are just dropped. So if you have a pipe that consumes input forever (to eventually produce a list), it will have no chance acting and producing output when its upstream finishes. (This is intentional so that both up- and down-stream parts are dual to each other.) Perhaps this is solved in some library building on top of pipes.
The situation is different with conduit. It has consume function that combines all inputs into a list and returns (not outputs) it. Writing a function like the one you need, that outputs the list at the end, is not difficult:
import Data.Conduit
combine :: (Monad m) => Conduit a m [a]
combine = loop []
where
loop xs = await >>= maybe (yield $ reverse xs) (loop . (: xs))

What is the preferred way to combine two sinks?

I've used zipSinks :: Monad m => Sink i m r -> Sink i m r' -> Sink i m (r, r') for this but it is considered deprecated.

Edit
After considering this, I don't think it is possible with the current version of Data.Conduit. Pipes aren't Categories, so &&& is out of the question. And there's no way that I can think of to pull results from upstream, feed them incrementally to both sinks, and short-circuit when the first sink finishes. (Although I don't think that Data.Conduit.Util.zipSinks short-circuits this way, it seems like it would be very desirable.) Except of course, to pattern match on both Sinks (like zipSinks in the package does), but that's what we're trying to avoid here.
That said, I would love to be proven wrong here.
It's not pretty, but you can do this in a kind-of obvious way.
First imports:
module Main where
import Control.Monad.Trans
import Data.Conduit
import qualified Data.Conduit.Binary as CB
import qualified Data.Conduit.List as CL
import qualified Data.Conduit.Text as CT
import qualified Data.Conduit.Util as CU
import Data.Maybe
import Data.Text (unpack)
Now for zipSinks. Basically, you want to create a sink that pulls the input from upstream and sends it to each child sink separately. In this case, I've used CL.sourceList to do this. If await returns Nothing, maybeToList returns an empty list, so the child sinks are also run with no input. Finally, the output of each child sink is then fed into the tuple.
zipSinks :: Monad m => Sink i m r -> Sink i m r' -> Sink i m (r, r')
zipSinks s1 s2 = do
l <- fmap maybeToList await
o1 <- lift $ CL.sourceList l $$ s1
o2 <- lift $ CL.sourceList l $$ s2
return (o1, o2)
Here are some examples of using zipSinks. It appears to work fine both inside of IO and outside of it, and in the few tests I did, the output matches the output of zipped', created using the old zipSinks.
doubleHead :: Monad m => Sink Int m (Maybe Int)
doubleHead = await >>= return . fmap (2*)
-- old version
zipped' :: Monad m => Sink Int m (Maybe Int, Maybe Int)
zipped' = CU.zipSinks CL.head doubleHead
-- new version
zipped :: Monad m => Sink Int m (Maybe Int, Maybe Int)
zipped = zipSinks CL.head doubleHead
fromList = CL.sourceList [7, 8, 9] $$ zipped
-- (Just 7, Just 14)
fromFile :: String -> IO (Maybe Int, Maybe Int)
fromFile filename = runResourceT $
CB.sourceFile filename
$= CB.lines
$= CT.decode CT.utf8
$= CL.map (read . unpack)
$$ zipped
-- for a file with the lines:
--
-- 1
-- 2
-- 3
--
-- returns (Just 1, Just 2)

((The package is conduit-0.5.2.3. The whole module is just for backwards compatibility.))
[edit]
So, my straightforward monadic guess (see below) seems to be wrong, even though the types are correct.
Now, I can only guess that the answer is:
The replacing features are still in development, pretty much like all Pipe/Conduit and similar concepts and libraries.
I'd wait for the next API to solve this question and still use zipSink until then.
(Maybe it was just misplaced.)
[/edit]
I'm not that familar with this package, but wouldn't it do just the same as this?
zipSinks :: Monad m => Sink i m r -> Sink i m r' -> Sink i m (r, r')
zipSinks s1 s2 = (,) <$> s1 <*> s2
It is a Monad after all. (Functor, Applicative)
zipSinks :: Monad sink => sink r -> sink r' -> sink (r, r')
zipSinks s1 s2 = liftM2 (,) s1 s2

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Conduit - Dispatch into multiple output files - haskell

Related

Handling event streams in haskell

How to merge two Consumer into one in Haskell Pipes?

Haskell Conduit: is it possible to optionally have the result of a source?

How can I write a pipe that sends downstream a list of what it receives from upstream?

What is the preferred way to combine two sinks?

Categories

Resources