Why doesn't runConduit send all the data? - haskell

here's some xml i'm parsing:
<?xml version="1.0" encoding="utf-8"?>
<data>
<row ows_Document='Weekly Report 10.21.2020'
ows_Category='Weekly Report'/>
<row ows_Document='Daily Update 10.20.2020'
ows_Category='Daily Update'/>
<row ows_Document='Weekly Report 10.14.2020'
ows_Category='Weekly Report'/>
<row ows_Document='Weekly Report 10.07.2020'
ows_Category='Weekly Report'/>
<row ows_Document='Spanish: Reporte Semanal 07.10.2020'
ows_Category='Weekly Report'/>
</data>
i've been trying to figure out how to get the conduit parser to reject records unless ows_Category is Weekly Report and ows_Document doesn't contain Spanish. at first, i used a dummy value (in parseDoc' below) to filter them out after parsing, but then i realized i should be able to use Maybe (in the otherwise identical parseDoc below), together with join to collapse out my Maybe layer with the one used by tag' event parser that fails based on name or attribute matches. it compiles, but behaves bizarrely, apparently not even trying to send certain elements to the parser! how could this be?
{-# LANGUAGE OverloadedStrings #-}
import Conduit
import Control.Monad
import qualified Data.ByteString.Lazy.Char8 as L8
import Data.Foldable
import Data.String
import qualified Data.Text as T
import Data.XML.Types
import Text.XML.Stream.Parse
newtype Doc = Doc
{ name :: String
} deriving (Show)
main :: IO ()
main = do
r <- L8.readFile "oha.xml"
let doc = Doc . T.unpack
check (x,y) a b = if y == "Weekly Report" && not (T.isInfixOf "Spanish" x) then a else b
t :: (MonadThrow m, MonadIO m) => ((T.Text, T.Text) -> ConduitT Event o m c)
-> ConduitT Event o m (Maybe c)
t f = tag' "row" ((,) <$> requireAttr "ows_Document" <*> requireAttr "ows_Category") $ \x -> do
liftIO $ print x
f x
parseDoc, parseDoc' :: (MonadThrow m, MonadIO m) => ConduitT Event o m (Maybe Doc)
parseDoc = (join <$>) . t $ \z#(x,_) -> return $ check z (Just $ doc x) Nothing -- this version doesn't get sent all of the data! why!?!?
parseDoc' = t $ \z#(x,_) -> return $ doc $ check z x $ T.pack bad -- dummy value
parseDocs :: (MonadThrow m, MonadIO m) => ConduitT Event o m (Maybe Doc)
-> ConduitT Event o m [Doc]
parseDocs = f tagNoAttr "data" . many'
f g n = force (n <> " required") . g (fromString n)
go p = runConduit $ parseLBS def r .| parseDocs p
bad = "no good"
traverse_ print =<< go parseDoc
putStrLn ""
traverse_ print =<< filter ((/= bad) . name) <$> go parseDoc'
output -- notice how parseDoc isn't even sent one of the records (one that should succeed, from 10.14), while parseDoc' behaves as expected:
("Weekly Report 10.21.2020","Weekly Report")
("Daily Update 10.20.2020","Daily Update")
("Weekly Report 10.07.2020","Weekly Report")
("Spanish: Reporte Semanal 07.10.2020","Weekly Report")
Doc {name = "Weekly Report 10.21.2020"}
Doc {name = "Weekly Report 10.07.2020"}
("Weekly Report 10.21.2020","Weekly Report")
("Daily Update 10.20.2020","Daily Update")
("Weekly Report 10.14.2020","Weekly Report")
("Weekly Report 10.07.2020","Weekly Report")
("Spanish: Reporte Semanal 07.10.2020","Weekly Report")
Doc {name = "Weekly Report 10.21.2020"}
Doc {name = "Weekly Report 10.14.2020"}
Doc {name = "Weekly Report 10.07.2020"}
when i tried further simplifying by removing everything to do with ows_Category, suddenly parseDoc worked fine, establishing the soundness of the idea? when i instead removed everything to do with ows_Document, the problem remained.
i suspect i'm supposed to be doing this with requireAttrRaw, but i haven't been able to make sense of it and can't find doc/examples.
does this have to do with Applicative -- now that i think about it, it shouldn't be able to fail based on examining values, right?
UPDATES
i found this answer from the author for a previous version of the library, which includes the intriguing force "fail msg" $ return Nothing in a similar situation, but that abandons all parsing instead of just failing the current parse.
this comment suggests i need to throw an exception, and in the source, they use something like lift $ throwM $ XmlException "failed check" $ Just event, but like force ... return Nothing, this kills all parsing, instead of just the current parser. also i don't know how to get my hands on the event.
here's a merged pull request claiming to have addressed this issue, but it doesn't discuss how to use it, only that it is "trivial" :)
ANSWER
to be explicit about the answer:
parseAttributes :: AttrParser (T.Text, T.Text)
parseAttributes = do
d <- requireAttr "ows_Document"
c <- requireAttr "ows_Category"
ignoreAttrs
guard $ not (T.isInfixOf "Spanish" d) && c == "Weekly Report"
return d
parseDoc :: (MonadThrow m, MonadIO m) => ConduitT Event o m (Maybe Doc)
parseDoc = tag' "row" parseAttributes $ return . doc
or, since in this case the attribute values can be checked independently:
parseAttributes = requireAttrRaw' "ows_Document" (not . T.isInfixOf "Spanish")
<* requireAttrRaw' "ows_Category" ("Weekly Report" ==)
<* ignoreAttrs
where requireAttrRaw' n f = requireAttrRaw ("required attr value failed condition: " <> n) $ \(n',as) ->
asum $ (\(ContentText a) -> guard (n' == fromString n && f a) *> pure a) <$> as
but the latter leaves open these questions regarding requireAttrRaw:
shouldn't we need to know the namespace if we're in charge of verifying Name?
why does requireAttrRaw send us [Content] instead of two Maybe Content, one each for ContentText and ContentEntity?
what are we supposed to do with ContentEntity "For pass-through parsing"?

tl;dr In tag' "row" parseAttributes parseContent, the check function belongs to parseAttributes, not to parseContent.
Why it does not behave as expected
xml-conduit is (notably) designed around the following invariants:
when parsers are of type ConduitT Event o m (Maybe a), the Maybe layer encodes whether Events have been consumed
tag' parseName parseAttributes parseContent consumes Events if and only if both parseName and parseAttributes succeed
tag' parseName parseAttributes parseContent runs parseContent if and only if both parseName and parseAttributes succeed
In parseDoc:
the check function is called in the parseContent part; at this stage, tag' is already committed to consume Events, as per invariant 2
a stack of 2 Maybe layers are joined together:
the output of the check function, which encodes whether the current <row/> element is relevant
the "standard" Maybe layer from tag' signature, which encodes whether Events have been consumed, as per invariant 1
This essentially breaks invariant 1: when check returns Nothing, parseDoc returns Nothing despite consuming Events of the whole <row/> element.
This results in undefined behavior of all combinators of xml-conduit, notably many' (analyzed below.)
Why it behaves the way it does
The many' combinator relies on invariant 1 to do its job.
It is defined as many' consumer = manyIgnore consumer ignoreAnyTreeContent, that is:
try consumer
if consumer returns Nothing, then skip element or content using ignoreAnyTreeContent, assuming it hasn't been consumed yet by consumer, and recurse back to step (1)
In your case, consumer returns Nothing for the Daily Update 10.20.2020 item, even though the complete <row/> element has been consumed. Therefore, ignoreAnyTreeContent is run as a means to skip that particular <row/>, but actually ends up skipping the next one instead (Weekly Report 10.14.2020).
How to achieve the expected behavior
Move the check logic to the parseAttributes part, so that Event consumption becomes coupled to whether check passes.

Related

Haskell: Replace mapM in a monad transformer stack to achieve lazy evaluation (no space leaks)

It has already been discussed that mapM is inherently not lazy, e.g. here and here. Now I'm struggling with a variation of this problem where the mapM in question is deep inside a monad transformer stack.
Here's a function taken from a concrete, working (but space-leaking) example using LevelDB that I put on gist.github.com:
-- read keys [1..n] from db at DirName and check that the values are correct
doRead :: FilePath -> Int -> IO ()
doRead dirName n = do
success <- runResourceT $ do
db <- open dirName defaultOptions{ cacheSize= 2048 }
let check' = check db def in -- is an Int -> ResourceT IO Bool
and <$> mapM check' [1..n] -- space leak !!!
putStrLn $ if success then "OK" else "Fail"
This function reads the values corresponding to keys [1..n] and checks that they are all correct. The troublesome line inside the ResourceT IO a monad is
and <$> mapM check' [1..n]
One solution would be to use streaming libraries such as pipes, conduit, etc. But these seem rather heavy and I'm not at all sure how to use them in this situation.
Another path I looked into is ListT as suggested here. But the type signatures of ListT.fromFoldable :: [Bool]->ListT Bool and ListT.fold :: (r -> a -> m r) -> r -> t m a -> mr (where m=IO and a,r=Bool) do not match the problem at hand.
What is a 'nice' way to get rid of the space leak?
Update: Note that this problem has nothing to do with monad transformer stacks! Here's a summary of the proposed solutions:
1) Using Streaming:
import Streaming
import qualified Streaming.Prelude as S
S.all_ id (S.mapM check' (S.each [1..n]))
2) Using Control.Monad.foldM:
foldM (\a i-> do {b<-check' i; return $! a && b}) True [1..n]
3) Using Control.Monad.Loops.allM
allM check' [1..n]
I know you mention you don't want to use streaming libraries, but your problem seems pretty easy to solve with streaming without changing the code too much.
import Streaming
import qualified Streaming.Prelude as S
We use each [1..n] instead of [1..n] to get a stream of elements:
each :: (Monad m, Foldable f) => f a -> Stream (Of a) m ()
Stream the elements of a pure, foldable container.
(We could also write something like S.take n $ S.enumFrom 1).
We use S.mapM check' instead of mapM check':
mapM :: Monad m => (a -> m b) -> Stream (Of a) m r -> Stream (Of b) m r
Replace each element of a stream with the result of a monadic action
And then we fold the stream of booleans with S.all_ id:
all_ :: Monad m => (a -> Bool) -> Stream (Of a) m r -> m Bool
Putting it all together:
S.all_ id (S.mapM check' (S.each [1..n]))
Not too different from the code you started with, and without the need for any new operator.
I think what you need is allM from the monad-loops package.
Then it would be just allM check' [1..n]
(Or if you don't want the import it's a pretty small function to copy.)

Is there a pre-defined function for conduit analogy of `takeWhile`?

I find the following function missing from the Data.Conduit.List module, and I couldn't find an easy way to compose this using functions in that module.
takeWhile :: Monad m => (a -> Bool) -> Consumer a m [a]
takeWhile p = await >>= \case
Nothing -> return []
Just b -> if p b
then (b :) <$> takeWhile p
else (leftover b) >> return []
This function is very useful in my application where I sometimes need to group the next few items together, and I am not sure how many are there.
The missing of this function is kind of strange to me as there are take :: Monad m => Int -> Consumer a m [a], and groupBy :: Monad m => (a -> a -> Bool) -> Conduit a m [a], but no takeWhile.
Am I missing something?
Edit: Per #ErikR's request, here is two simple examples that can perhaps clarify why I think this function could be useful.
Case 1: the protocol specifies there be a header section in the stream. For simplicity let's assume it's a String stream and the header items are marked by a leading #.
Stream content:
#language=English
#encoding=Unicode
Apple
Orange
Blue
Red
Sheep
Dog
...
Code using takeWhile:
myConduit :: Conduit String IO String ()
myConduit = do
headers <- takeWhile ((== '#') . head)
awaitForever $ \ item -> do
case getLanguage headers of
English -> ...
French -> ...
Case 2: the protocol specifies that items with prefix # has several continuations prefixed by +.
Stream content:
Apple
Orange
Blue
#Has
+kell
#A
+Really
+Long
+Word
Dog
...
Code using takeWhile:
myConduit :: Conduit String IO String ()
myConduit = runMaybeC . forever $ do
a <- maybe (lift mzero) return =<< await
aConts <- if head item == '#' then takeWhile ((== '+') . head)
else return []
liftIO . putStrLn . concat $ a : aConts
However, aside from being useful, it is also for completeness. I see that Data.Conduit.List's goal is to provide a set of "list-like" operations in the Conduit context. I think bread-and-butter functions like takeWhile should be provided, along with its siblings like dropWhile, so that people don't have to change their style of coding when thinking about conduits as lists.

Signalling to downstream that upstream is exhausted

Question
Using the Haskell pipes library, I'm trying to define a Pipe with the following type:
signalExhausted :: Monad m => Pipe a (Value a) m r
where the Value data type is defined by:
data Value a = Value a | Exhausted
The pipe should obey the following laws:
toList (each [] >-> signalExhausted) == [Exhausted]
toList (each xs >-> signalExhausted) == map Value xs ++ [Exhausted]
In other words, the pipe should be equivalent to Pipes.Prelude.map Value, except that it should yield an additional Exhausted after all upstream values have been processed, giving downstream a chance to perform some final action.
Can such a Pipe be defined?
Example
> let xs = words "hubble bubble toil and trouble"
> toList $ each xs >-> signalExhausted
[Value "hubble", Value "bubble", Value "toil", Value "and", Value "trouble", Exhausted]
Notes
I'm aware that the pipes-parse library provides the functions draw and parseForever. These look useful, but I can't quite see how to combine them into a Pipe that matches the specification above.
A pipe like signalExhausted can't be defined, but a function equivalent to (>-> signalExhausted) can.
>-> is a specialized version of the pull category. Execution is driven by the downstream proxies pulling data from upstream proxies. The downstream proxy sends an empty request () upstream and blocks until a response holding a value comes back from the upstream proxy. When the upstream proxy is exhausted and doesn't have any more values to send back, it returns. You can see the return that matters for these examples in the definition of each.
each = F.foldr (\a p -> yield a >> p) (return ())
-- what to do when the data's exhausted ^
The downstream proxy needs a value to continue running, but there's no value the pipes library can possibly provide it, so the downstream proxy never runs again. Since it never runs again, there's no way it can modify or react to the data.
There are two solutions to this problem. The simplest is to map Value over the upstream pipe and add a yield Exhausted after it's done.
import Pipes
import qualified Pipes.Prelude as P
data Value a = Value a | Exhausted
deriving (Show)
signalExhausted p = p >-> P.map Value >> yield Exhausted
This does exactly what you're looking for except the function signalExhausted takes the place of (>-> signalExhausted).
let xs = words "hubble bubble toil and trouble"
print . P.toList . signalExhausted $ each xs
[Value "hubble",Value "bubble",Value "toil",Value "and",Value "trouble",Exhausted]
The more general solution to this problem is to stop the upstream proxy from returning and instead signal downstream when it is exhausted. I demonstrated how to do so in an answer to a related question.
import Control.Monad
import Pipes.Core
returnDownstream :: Monad m => Proxy a' a b' b m r -> Proxy a' a b' (Either r b) m r'
returnDownstream = (forever . respond . Left =<<) . (respond . Right <\\)
This replaces each respond with respond . Right and replaces return with forever . respond . left, sending returns downstream along with responses.
returnDownstream is more general than what you are looking for. We can demonstrate how to use it to recreate signalExhausted. returnDownstream transforms a pipe that returns into one that never returns, and instead forwards its return value downstream as the Left value of an Either.
signalExhausted p = returnDownstream p >-> respondLeftOnce
respondLeftOnce is an example downstream proxy. The downstream proxy can discern between regular values held in Right and the return value held in Left.
respondLeftOnce :: Monad m => Pipe (Either e a) (Value a) m ()
respondLeftOnce = go
where
go = do
ea <- await
case ea of
Right a -> yield (Value a) >> go
Left _ -> yield Exhausted -- The upstream proxy is exhausted; do something else

How do I abstract this pattern in haskell?

Scenario: I have an interpreter that builds up values bottom-up from an AST. Certain nodes come with permissions -- additional boolean expressions. Permission failures should propagate, but if a node above in the AST comes with a permission, a success can recover the computation and stop the propagation of the error.
At first I thought the Error MyError MyValue monad would be enough: one of the members of MyError could be PermError, and I could use catchError to recover from PermError if the second check succeeds. However, MyValue is gone by the time I get to the handler. I guess there could ultimately be a way of having PermError carry a MyValue field so that the handler could restore it, but it would probably be ugly and checking for an exception at each step would defeat the concept of an exceptional occurrence.
I'm trying to think of an alternative abstraction. Basically I have to return a datatype Either AllErrorsExceptPermError (Maybe PermError, MyValue) or more simply (Maybe AllErrors, MyValue) (the other errors are unrecoverable and fit the error monad pretty well) and I'm looking for something that would save me from juggling the tuple around, since there seems to be a common pattern in how the operations are chained. My haskell knowledge only goes so far. How would you use haskell to your advantage in this situation?
While I write this I came up with an idea (SO is a fancy rubber duck): a Monad that that handles internally a type (a, b) (and ultimately returns it when the monadic computation terminates, there has to be some kind of runMyMonad), but lets me work with the type b directly as much as possible. Something like
data T = Pass | Fail | Nothing
instance Monad (T , b) where
return v = (Nothing, v)
(Pass, v) >>= g = let (r', v') = g v in (if r' == Fail then Fail else Pass, v')
(Fail, v) >>= g = let (r', v') = g v in (if r' == Pass then Pass else Fail, v')
(Nothing, _) >>= g = error "This should not have been propagated, all chains should start with Pass or Fail"
errors have been simplified into T, and the instance line probably has a syntax error, but you should get the idea. Does this make sense?
I think you can use State monad for permissions and value calculation and wrap that inside ErrorT monad transformer to handle the errors. Below is such an example which shows the idea , here the calculation is summing up a list, permissions are number of even numbers in the list and error condition is when we see 0 in the list.
import Control.Monad.Error
import Control.Monad.State
data ZeroError = ZeroError String
deriving (Show)
instance Error ZeroError where
fun :: [Int] -> ErrorT ZeroError (State Int) Int
fun [] = return 0
fun (0:xs) = throwError $ ZeroError "Zero found"
fun (x:xs) = do
i <- get
put $ (if even(x) then i+1 else i)
z <- fun xs
return $ x+z
main = f $ runState (runErrorT $ fun [1,2,4,5,10]) 0
where
f (Left e,evens) = putStr $ show e
f (Right r,evens) = putStr $ show (r,evens)

Pipe with dynamic request/response type?

This seems like a reasonable thing to want, but I'm having type troubles. I'd like to have a Client that can send a list of options to a Server, which will choose one and return the chosen element. So something like this:
module Toy where
import Pipes
asker :: Monad m => () -> Client ([a], a -> String) a m ()
asker () = do
_ <- request ([0.0, 2.0], show)
_ <- request (["3", "4"], show)
return ()
The idea is that the server can call the a -> String function on each element of the list to display them to a user. I'd like to be able to vary a, as long as the list and function match.
Is something like this possible? Maybe the constraints I want can be encoded into a GADT somehow?
You can't do it quite the way you asked, but you can cheat a little bit and get something that's almost as good:
{-# LANGUAGE ExistentialQuantification #-}
module Toy where
import Control.Monad
import Pipes
import Pipes.Prelude (foreverK)
data Request = forall a . Request [a] (a -> String)
asker :: Monad m => () -> Client Request Int m ()
asker () = do
_ <- request (Request [0.0, 2.0] show)
_ <- request (Request ["3", "4"] show)
return ()
server :: Request -> Server Request Int IO r
server = foreverK $ \req -> case req of
Request as f -> do
choice <- lift $ do
let select = do
putStrLn "Select an option"
forM_ (zip [0..] as) $ \(n, a) ->
putStrLn $ show n ++ ": " ++ f a
n <- readLn
if (n >= length as)
then do
putStrLn "Invalid selection"
select
else return n
select
respond choice
Instead of returning back the value selected, you return back an Int corresponding to the index of the selected element. The rest is just using ExistentialQuantification.
Like others recommended, I suggest that you actually just send a list of Strings instead of using the existential quantification trick, but I included it just to show how that would be done just in case you were curious.

Resources