I'm trying to write a webscraper using Pipes and I've come to the part of following scraped links. I have a process function that downloads a url, finds links, and yields them.
process :: Pipe Item Item (StateT CState IO) ()
....
for (each links) yield
....
Now I want to some how recursively follow these links, threading the StateT through. I realize there is probably doing something more idiomatic then using a single pipe for the bulk of the scraper (especially as I start adding more features), I'm open for suggestions. I'm probably going to have to rethink the design when I consider multithreading w/ shared state anyways.
You can connect a Pipe a b m r to a side-effect through the m parameter, which swaps out which Monad the pipe is operating over. You can use this to requeue links by connecting the downstream end of your pipe to another pipe that sticks the links in a queue and connecting the upstream end of your pipe to a pipe that reads links from the queue.
Our goal is to write
import Pipes
loopLeft :: Monad m => Pipe (Either l a) (Either l b) m r -> Pipe a b m r
We'll take a pipe whose downstream output, Either l b, is either an Left l to send back upstream or a Right b to send downstream, and send the ls back in the upstream input Either l a, which is either a queued up Left l or a Right a coming from upstream. We'll connect the Left ls together to make a pipe that only sees as coming from upstream and only yields bs headed downstream.
At the downstream end we'll push the ls from Left l onto a stack. We yield the rs from Right r downstream.
import Control.Monad
import Control.Monad.Trans.State
pushLeft :: Monad m => Pipe (Either l a) a (StateT [l] m) r
pushLeft = forever $ do
o <- await
case o of
Right a -> yield a
Left l -> do
stack <- lift get
lift $ put (l : stack)
At the upstream end we'll look for something on top of the stack to yield. If there isn't one, we'll await for a value from upstream and yield it.
popLeft :: Monad m => Pipe a (Either l a) (StateT [l] m) r
popLeft = forever $ do
stack <- lift get
case stack of
[] -> await >>= yield . Right
(x : xs) -> do
lift $ put xs
yield (Left x)
Now we can write loopLeft. We compose the upstream and downstream pipes together with pipe composition popLeft >-> hoist lift p >-> pushLeft. The hoist lift turns a Pipe a b m r into a Pipe a b (t m) r. The distribute turns a Pipe a b (t m) r into a t (Pipe a b m) r. To get back to a Pipe a b m r we run the whole StateT computation starting with an empty stack []. In Pipes.Lift there's a nice name evalStateP for the combination of evalStateT and distribute.
import Pipes.Lift
loopLeft :: Monad m => Pipe (Either l a) (Either l b) m r -> Pipe a b m r
loopLeft p = flip evalStateT [] . distribute $ popLeft >-> hoist lift p >-> pushLeft
I would do it like this:
import Pipes
type Url = String
getLinks :: Url -> IO [Url]
getLinks = undefined
crawl :: MonadIO m => Pipe Url Url m a
crawl = loop []
where
loop [] = do url <- await; loop [url]
loop (url:urls) = do
yield url
urls' <- liftIO $ getLinks url
loop (urls ++ urls')
You can achieve DFS or BFS depending on how you combine url' with urls.
Related
I use Haskell stream processing library pipes to write a command line tool. Each command line actions may output result to stdout and logs to stderr with pipes API.
I need Consumer which has type as Consumer (Either String String) m r to print chunk of data (Left to stderr, Right to stdout) with single Consumer.
Code I wrote (should be improved)
This function consumeEither doesn't have flexibility so I want to improve it.
consumeEither :: (MonadIO m) => Consumer (Either String String) m ()
consumeEither = do
eitherS <- await
case eitherS of
(Left l) -> for (yield l) (liftIO . (IO.hPutStrLn IO.stderr))
(Right r) -> for (yiled r) (liftIO . putStrLn)
Furthermore it would be useful to provide a function which takes two Consumers and merge them into one Consumer.
Question
Does anybody know good example or implementation of the following interface?
merge :: (Monad m) => Consumer a m r -> Consumer b m r -> Consumer (Either a b) m r
1st argument as stderr
2nd argument as stdout
Usage of the function
import Pipes
import qualified Pipes.Prelude as P
import qualified System.IO as IO
stdoutOrErr :: Consumer (Either String String) IO ()
stdoutOrErr = merge (P.toHandle IO.stderr) P.stdoutLn
Thanks
(This is #Michael's answer, but I'd like to write it up here so we can move the question out of the unanswered queue for the Haskell tag.)
See (+++) in pipes-extras. Keep in mind a Consumer is a Pipe (to nowhere), so P.toHandle IO.stderr +++ P.stdoutLn :: MonadIO m => Pipe (Either String String) (Either b d) m ().
To get a Consumer, you would have to get rid of the Lefts e.g with >-> P.concat or >-> P.drain. There are more robust and handsome ways of doing this with Folds.
I'm trying to dispatch the items from a conduit into many output files, the problem is very similar to Conduit - Multiple output file within the pipeline, with a few differences:
In the previous solution, every sink has a filter that decides if the element belongs to that sink or not. In my case every element coming from the upstream goes exactly to one file, and in the case where there is a big number of files it would be better to make only one operation to decide to which file is it going.
The files are created on demand. A "selector" function decides which sink the next element is going to, and if it doesn't exist yet it creates it using a "create new sink" function.
For example if the Source yields: 8 4 7 1 5
And the sink selector is a module 3, then the sequence of actions would be:
Create file 2
Add 8 to file 2
Create file 1
Add 4 to file 1
Add 7 to file 1
Add 1 to file 1
Add 5 to file 2
I'm thinking of a type for this dispatcher like this:
dispatcherSink_ :: (Monad m) =>
(a -> k) -> -- sink selector
(k -> Sink a m ()) -> -- new sink
Sink a m ()
I've tried to write the function using evalStateC with an internal StateT holding a Map of Sinks, but I'm not able to tie up the types. I'm not sure if you can even use the same sink twice.
Is what I'm trying to do even possible?
I'm still a newbie in Haskell, so any help will be appreciated.
Edited
I though I could create a map of ResumableSinks, there is a library in Hackage for that, but it depends on an old and very specific version of Conduit, so cabal couldn't install it.
In the end I didn't find a way to write the function with the previous type, able to work with any sink, so I came up with a function that works with files directly:
import System.IO (hClose,openFile,IOMode(WriteMode))
import Conduit
import Data.IOData
import qualified Data.Foldable as F
import qualified Data.Map.Strict as M
import Control.Monad.State.Strict
import Data.ByteString.Char8 (pack)
fileDispatcherSink ::
(MonadIO m, IOData c,Ord k) =>
(a -> k) ->
(a -> c) ->
(k -> FilePath) ->
Sink a m ()
fileDispatcherSink selector toChunked path =
evalStateC M.empty $ dispatcher
where
dispatcher = do
next <- await
m <- get
case next of
Nothing -> liftIO $ F.traverse_ hClose m
Just a -> do
let k = selector a
h <- case M.lookup k m of
Nothing -> do
nh <- liftIO $ openFile (path k) WriteMode
put $ M.insert k nh m
return nh
Just h -> return h
yield (toChunked a) $$ sinkHandle h
dispatcher
testSource :: (Monad m) => Source m Int
testSource = yieldMany [8, 4, 7, 1, 5]
main :: IO ()
main = testSource
$$ fileDispatcherSink (`mod` 3) (pack . show) ((++ ".txt") . show)
Is there a way to write the _dispatcherSink__ function?
There is a conceptual problem with implementing
dispatcherSink_ :: (Monad m) =>
(a -> k) -> -- sink selector
(k -> Sink a m ()) -> -- new sink
Sink a m ()
. In conduit, data is pulled from upstream to downstream, instead of being pushed. So a Sink decides if it requests a next input value from its upstream conduit or not. So you can't really have a map of Sinks, read an input value and then feed it to one of the Sinks. The Sink you select might not decide to read the input value, it might decide to finish, and then what will you do with the input value? You can create a new sink for that key, but it can also decide not to accept the input.
So instead of a Sink you'll most likely need some different concept, something to which you can push a value and also what you can finalize. An idea (untested):
data PushSink m i = PushSink { psPush :: i -> m (PushSink m i)
, psFinalize :: m () }
An implementation for writing files would open a file, keep the handle, and psPush would just write a chunk into the file, returning the same object, while psFinalize would close the file.
And then you can implement a variant like this
dispatcherSink_ :: (Monad m) =>
(a -> k) -> -- sink selector
(k -> m (PushSink a m)) -> -- new sink
Sink a m ()
which pushes values to PushSinks and finalizes them all when there is no input.
I'm getting started with Netwire version 5.
I have no problem writing all the wires I want to transform my inputs into my outputs.
Now the time has come to write the IO wrapper to tie in my real-world inputs, and I am a bit confused.
Am I supposed to create a custom session type for the s parameter of Wire s e m a b and embed my sensor values in there?
If so, I have these questions:
What's up with the Monoid s context of class (Monoid s, Real t) => HasTime t s | s -> t? What is it used for?
I was thinking of tacking on a Map String Double with my sensor readings, but how should my monoid crunch the dictionaries? Should it be left-biased? Right-biased? None of the above?
If not, what am I supposed to do? I want to end up with wires of the form Wire s InhibitionReason Identity () Double for some s, representing my input.
It's my understanding that I don't want or need to use the monadic m parameter of Wire for this purpose, allowing the wires themselves to be pure and confining the IO to the code that steps through the top-level wire(s). Is this incorrect?
The simplest way to put data into a Wire s e m a b is via the input a. It's possible, through the use of WPure or WGen to get data out of the state delta s or the underlying Monad m, but these take us further away from the main abstractions. The main abstractions are Arrow and Category, which only know about a b, and not about s e m.
Here's an example of a very simple program, providing input as the input a. double is the outermost wire of the program. repl is a small read-eval-print loop that calls stepWire to run the wire.
import FRP.Netwire
import Control.Wire.Core
import Prelude hiding (id, (.))
double :: Arrow a => a [x] [x]
double = arr (\xs -> xs ++ xs)
repl :: Wire (Timed Int ()) e IO String String -> IO ()
repl w = do
a <- getLine
(eb, w') <- stepWire w (Timed 1 ()) (Right a)
putStrLn . either (const "Inhibited") id $ eb
repl w'
main = repl double
Notice that we pass in the time difference to stepWire, not the total elapsed time. We can check that this is the correct thing to do by running a different top-level wire.
timeString :: (HasTime t s, Show t, Monad m) => Wire s e m a String
timeString = arr show . time
main = repl timeString
Which has the desired output:
a
1
b
2
c
3
I just solved this in an Arrow way, so this might be more composible. You can read my posts if you like. Kleisli Arrow in Netwire 5? and Console interactivity in Netwire?. The second post has a complete interactive program
First, you need this to lift Kleisli functions (That is, anything a -> m b):
mkKleisli :: (Monad m, Monoid e) => (a -> m b) -> Wire s e m a b
mkKleisli f = mkGen_ $ \a -> liftM Right $ f a
Then, assuming you want to get characters from terminal, you can lift hGetChar by doing this:
inputWire :: Wire s () IO () Char
inputWire = mkKleisli $ \_ -> hGetChar stdin
I haven't tested this runWire function (I just stripped code off from my previous posts), but it should run your wires:
runWire :: (Monad m) => Session m s -> Wire s e m () () -> m ()
runWire s w = do
(ds, s') <- stepSession s
-- | You don't really care about the () returned
(_, w') <- stepWire w ds (Right ())
runWire s' w'
You can compose the input wire wherever you like like any other Wires or Arrows. In my example, I did this (don't just copy, other parts of the program are different):
mainWire = proc _ -> do
c <- inputWire -< ()
q <- quitWire -< c
outputWire -< c
returnA -< q
Or, one-liner:
mainWire = inputWire >>> (quitWire &&& outputWire) >>> arr (\(q,_) -> q)
I have been experimenting with the new pipes-http package and I had a thought. I have two parsers for a web page, one that returns line items and another a number from elsewhere in the page. When I grab the page, it'd be nice to string these parsers together and get their results at the same time from the same bytestring producer, rather than fetching the page twice or fetching all the html into memory and parsing it twice.
In other words, say you have two Consumers:
c1 :: Consumer a m r1
c2 :: Consumer a m r2
Is it possible to make a function like this:
combineConsumers :: Consumer a m r1 -> Consumer a m r2 -> Consumer a m (r1, r2)
combineConsumers = undefined
I have tried a few things, but I can't figure it out. I understand if it isn't possible, but it would be convenient.
Edit:
I'm sorry it turns out I was making an assumption about pipes-attoparsec, due to my experience with conduit-attoparsec that caused me to ask the wrong question. Pipes-attoparsec turns an attoparsec into a pipes Parser when I just assumed that it would return a pipes Consumer. That means that I can't actually turn two attoparsec parsers into consumers that take text and return a result, then use them with the plain old pipes ecosystem. I'm sorry but I just don't understand pipes-parse.
Even though it doesn't help me, Arthur's answer is pretty much what I envisioned when I asked the question, and I'll probably end up using his solution in the future. In the meantime I'm just going to use conduit.
It the results are "monoidal", you can use the tee function from the Pipes prelude, in combination with a WriterT.
{-# LANGUAGE OverloadedStrings #-}
import Data.Monoid
import Control.Monad
import Control.Monad.Writer
import Control.Monad.Writer.Class
import Pipes
import qualified Pipes.Prelude as P
import qualified Data.Text as T
textSource :: Producer T.Text IO ()
textSource = yield "foo" >> yield "bar" >> yield "foo" >> yield "nah"
counter :: Monoid w => T.Text
-> (T.Text -> w)
-> Consumer T.Text (WriterT w IO) ()
counter word inject = P.filter (==word) >-> P.mapM (tell . inject) >-> P.drain
main :: IO ()
main = do
result <-runWriterT $ runEffect $
hoist lift textSource >->
P.tee (counter "foo" inject1) >-> (counter "bar" inject2)
putStrLn . show $ result
where
inject1 _ = (,) (Sum 1) mempty
inject2 _ = (,) mempty (Sum 1)
Update: As mentioned in a comment, the real problem I see is that in pipes parsers aren't Consumers. And how can you run two parsers concurrently if they have different behaviours regarding leftovers? What happens if one of the parsers wants to "un-draw" some text and the other parser doesn't?
One possible solution is to run the parsers in a truly concurrent manner, in different threads. The primitives in the pipes-concurrency package let you "duplicate" a Producer by writing the same data to two different mailboxes. And then each parser can do whatever it wants with its own copy of the producer. Here's an example which also uses the pipes-parse, pipes-attoparsec and async packages:
{-# LANGUAGE OverloadedStrings #-}
import Data.Monoid
import qualified Data.Text as T
import Data.Attoparsec.Text hiding (takeWhile)
import Data.Attoparsec.Combinator
import Control.Applicative
import Control.Monad
import Control.Monad.State.Strict
import Pipes
import qualified Pipes.Prelude as P
import qualified Pipes.Attoparsec as P
import qualified Pipes.Concurrent as P
import qualified Control.Concurrent.Async as A
parseChars :: Char -> Parser [Char]
parseChars c = fmap mconcat $
many (notChar c) *> many1 (some (char c) <* many (notChar c))
textSource :: Producer T.Text IO ()
textSource = yield "foo" >> yield "bar" >> yield "foo" >> yield "nah"
parseConc :: Producer T.Text IO ()
-> Parser a
-> Parser b
-> IO (Either P.ParsingError a,Either P.ParsingError b)
parseConc producer parser1 parser2 = do
(outbox1,inbox1,seal1) <- P.spawn' P.Unbounded
(outbox2,inbox2,seal2) <- P.spawn' P.Unbounded
feeding <- A.async $ runEffect $ producer >-> P.tee (P.toOutput outbox1)
>-> P.toOutput outbox2
sealing <- A.async $ A.wait feeding >> P.atomically seal1 >> P.atomically seal2
r <- A.runConcurrently $
(,) <$> (A.Concurrently $ parseInbox parser1 inbox1)
<*> (A.Concurrently $ parseInbox parser2 inbox2)
A.wait sealing
return r
where
parseInbox parser inbox = evalStateT (P.parse parser) (P.fromInput inbox)
main :: IO ()
main = do
(Right a, Right b) <- parseConc textSource (parseChars 'o') (parseChars 'a')
putStrLn . show $ (a,b)
The result is:
("oooo","aa")
I'm not sure how much overhead this approach introduces.
I think something is wrong with the way you are going about this, for the reasons Davorak mentions in his remark. But if you really need such a function, you can define it.
import Pipes.Internal
import Pipes.Core
zipConsumers :: Monad m => Consumer a m r -> Consumer a m s -> Consumer a m (r,s)
zipConsumers p q = go (p,q) where
go (p,q) = case (p,q) of
(Pure r , Pure s) -> Pure (r,s)
(M mpr , ps) -> M (do pr <- mpr
return (go (pr, ps)))
(pr , M mps) -> M (do ps <- mps
return (go (pr, ps)))
(Request _ f, Request _ g) -> Request () (\a -> go (f a, g a))
(Request _ f, Pure s) -> Request () (\a -> do r <- f a
return (r, s))
(Pure r , Request _ g) -> Request () (\a -> do s <- g a
return (r,s))
(Respond x _, _ ) -> closed x
(_ , Respond y _) -> closed y
If you are 'zipping' consumers without using their return value, only their 'effects' you can just use tee consumer1 >-> consumer2
The idiomatic solution is to rewrite your Consumers as a Fold or FoldM from the foldl library and then combine them using Applicative style. You can then convert this combined fold to one that works on pipes.
Let's assume that you either have two Folds:
fold1 :: Fold a r1
fold2 :: Fold a r2
... or two FoldMs:
foldM1 :: Monad m => FoldM a m r1
foldM2 :: Monad m => FoldM a m r2
Then you combine these into a single Fold/FoldM using Applicative style:
import Control.Applicative
foldBoth :: Fold a (r1, r2)
foldBoth = (,) <$> fold1 <*> fold2
foldBothM :: Monad m => FoldM a m (r1, r2)
foldBothM = (,) <$> foldM1 <*> foldM2
-- or: foldBoth = liftA2 (,) fold1 fold2
-- foldMBoth = liftA2 (,) foldM1 foldM2
You can turn either fold into a Pipes.Prelude-style fold or a Parser. Here are the necessary conversion functions:
import Control.Foldl (purely, impurely)
import qualified Pipes.Prelude as Pipes
import qualified Pipes.Parse as Parse
purely Pipes.fold
:: Monad m => Fold a b -> Producer a m () -> m b
impurely Pipes.foldM
:: Monad m => FoldM m a b -> Producer a m () -> m b
purely Parse.foldAll
:: Monad m => Fold a b -> Parser a m r
impurely Parse.foldMAll
:: Monad m => FoldM a m b -> Parser a m r
The reason for the purely and impurely functions is so that foldl and pipes can interoperate without either one incurring a dependency on the other. Also, they allow libraries other than pipes (like conduit) to reuse foldl without a dependency, too (Hint hint, #MichaelSnoyman).
I apologize that this feature is not documented, mainly because it took me a while to figure out how to get pipes and foldl to interoperate in a dependency-free manner, and that was after I wrote the pipes tutorial. I will update the tutorial to point out this trick.
To learn how to use foldl, just read the documentation in the main module. It's a very small and easy-to-learn library.
For what it's worth, in the conduit world, the relevant function is zipSinks. There might be some way to adapt this function to work for pipes, but automatic termination may get in the way.
Consumer forms a Monad so
combineConsumers = liftM2 (,)
will type check. Unfortunately, the semantics might be unlike what you're expecting: the first consumer will run to completion and then the second.
I'm having a hard time to write a pipe with this signature:
toOneBigList :: (Monad m, Proxy p) => () -> Pipe p a [a] m r
It should simply take all as from upstream and send them in a list downstream.
All my attempts look fundamentally broken.
Can anybody point me in the right direction?
There are two pipes-based solutions and I'll let you pick which one you prefer.
Note: It's not clear why you output the list on the downstream interface instead of just returning it directly as a result.
Conduit-style
The first one, which is very close to the conduit-based solution uses the upcoming pipes-pase, which is basically complete and just needs documentation. You can find the latest draft on Github.
Using pipes-parse, the solution is identical to the conduit solution that Petr gave:
import Control.Proxy
import Control.Proxy.Parse
combine
:: (Monad m, Proxy p)
=> () -> Pipe (StateP [Maybe a] p) (Maybe a) [a] m ()
combine () = loop []
where
loop as = do
ma <- draw
case ma of
Nothing -> respond (reverse as)
Just a -> loop (a:as)
draw is like conduit's await: it requests a value from either the leftovers buffer (that's the StateP part) or from upstream if the buffer is empty. Nothing indicates end of file.
You can wrap a pipe that does not have an end of file signal using the wrap function from pipes-parse, which has type:
wrap :: (Monad m, Proxy p) => p a' a b' b m r -> p a' a b' (Maybe b) m s
Classic Pipes Style
The second alternative is a bit simpler. If you want to fold a given pipe you can do so directly using WriterP:
import Control.Proxy
import Control.Proxy.Trans.Writer
foldIt
:: (Monad m, Proxy p) =>
(() -> Pipe p a b m ()) -> () -> Pipe p a [b] m ()
foldIt p () = runIdentityP $ do
r <- execWriterK (liftP . p >-> toListD >-> unitU) ()
respond r
That's a higher-level description of what is going on, but it requires passing in the pipe as an explicit argument. It's up to you which one you prefer.
By the way, this is why I was asking why you want to send a single value downstream. The above is much simpler if you return the folded list:
foldIt p = execWriterK (liftP . p >-> toListD)
The liftP might not even be necessary if p is completely polymorphic in its proxy type. I only include it as a precaution.
Bonus Solution
The reason pipes-parse does not provide the toOneBigList is that it's always a pipes anti-pattern to group the results into a list. pipes has several nice features that make it possible to never have to group the input into a list, even if you are trying to yield multiple lists. For example, using respond composition you can have a proxy yield the subset of the stream it would have traversed and then inject a handler that uses that subset:
example :: (Monad m, Proxy p) => () -> Pipe p a (() -> Pipe p a a m ()) m r
example () = runIdentityP $ forever $ do
respond $ \() -> runIdentityP $ replicateM_ 3 $ request () >>= respond
printIt :: (Proxy p, Show a) => () -> Pipe p a a IO r
printIt () = runIdentityP $ do
lift $ putStrLn "Here we go!"
printD ()
useIt :: (Proxy p, Show a) => () -> Pipe p a a IO r
useIt = example />/ (\p -> (p >-> printIt) ())
Here's an example of how to use it:
>>> runProxy $ enumFromToS 1 10 >-> useIt
Here we go!
1
2
3
Here we go!
4
5
6
Here we go!
7
8
9
Here we go!
10
This means you never need to bring a single element into memory even when you need to group elements.
I'll give only a partial answer, perhaps somebody else will have a better one.
As far as I know, standard pipes have no mechanism of detecting when the other part of the pipeline terminates. The first pipe that terminates produces the final result of the pipe-line and all the others are just dropped. So if you have a pipe that consumes input forever (to eventually produce a list), it will have no chance acting and producing output when its upstream finishes. (This is intentional so that both up- and down-stream parts are dual to each other.) Perhaps this is solved in some library building on top of pipes.
The situation is different with conduit. It has consume function that combines all inputs into a list and returns (not outputs) it. Writing a function like the one you need, that outputs the list at the end, is not difficult:
import Data.Conduit
combine :: (Monad m) => Conduit a m [a]
combine = loop []
where
loop xs = await >>= maybe (yield $ reverse xs) (loop . (: xs))