Efficiently search for a single element in a large Pandoc

Unless I'm missing something, it seems that there are only two ways to "traverse" a Pandoc data structure:
1) Manually pattern-matching on Block and Inline constructors
2) Via the Walkable type class and related utility functions
Using the Walkable type class, is there an efficient way to search for the first matching element (preferably in a breadth-first manner), and stop the traversal as soon as it's found? It seems to me that all functions around the Walkable type class are going to traverse the entire data structure no matter what.
If not, I guess the only way is to pattern-match the Block and Inline constructors and build this on my own.

The other answer points out the useful query function. I'd add that there's a package of pandoc lenses. You asked about breadth-first traversal too, so here's both.
import Data.Semigroup (First (..))
dfsFirstLink :: Walkable Inline b => b -> Maybe Text
dfsFirstLink = fmap getFirst . query (preview $ _Link . _2 . _1 . to First)
bfsFirstLink :: Walkable Inline b => b -> Maybe Text
bfsFirstLink = fmap getFirst . getConst . traverseOf (levels query . folded) (Const . preview (_Link . _2 . _1 . to First))
-- Construct a walkable value where dfs != bfs
p :: Pandoc
p = Pandoc mempty [Plain [Note [Plain [Link mempty [] ("a","b")]]],Plain [Link mempty [] ("c","d")]]
>> dfsFirstLink p
Just "a"
>> bfsFirstLink p
Just "c"
Though unfortunately some ad-hoc experiments suggest it may not be as lazy as one might hope.
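To make the depth-first versus breadth-first distinction concrete without pandoc or lens, here is a minimal sketch on a hand-rolled rose tree. The Rose type, the sample value and the function names are illustrative assumptions, not part of pandoc-types. The only point is that both orders can be produced as lazy lists, so find stops at the first hit.

```haskell
import Data.List (find)

-- A hypothetical rose tree standing in for a nested document.
data Rose a = Rose a [Rose a]

-- Depth-first (pre-order) listing of all labels, produced lazily.
dfs :: Rose a -> [a]
dfs (Rose a kids) = a : concatMap dfs kids

-- Breadth-first listing: emit one whole level, then descend to the next.
bfs :: Rose a -> [a]
bfs t = go [t]
  where
    go [] = []
    go level = [a | Rose a _ <- level] ++ go (concat [kids | Rose _ kids <- level])

-- First label satisfying the predicate, in the respective order.
firstDfs, firstBfs :: (a -> Bool) -> Rose a -> Maybe a
firstDfs p = find p . dfs
firstBfs p = find p . bfs

-- Same shape as p above: the deeply nested link comes first in document order.
sample :: Rose String
sample =
  Rose "doc"
    [ Rose "plain1" [Rose "note" [Rose "link:a" []]]
    , Rose "plain2" [Rose "link:c" []]
    ]
```

With a predicate such as (== "link:") . take 5, firstDfs finds the nested link and firstBfs finds the shallow one, mirroring the dfsFirstLink / bfsFirstLink results above.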

The Walkable typeclass contains a function called query with the following type signature:
query :: Monoid c => (a -> c) -> b -> c
In Data.Semigroup, there's a type called First, with a semigroup instance where the accumulating behavior is to return the "leftmost value".
This can be combined with the Monoid on Maybe, which turns any Semigroup into a Monoid with mempty of Nothing, to give the behavior you want.
For example, adapting a function from Inline -> Maybe String, to Pandoc -> Maybe String, can be done like so:
import Text.Pandoc
import Text.Pandoc.Walk (query)
import Data.Semigroup
findUrl :: Inline -> Maybe String
findUrl (Link _ _ target) = Just $ fst target
findUrl _ = Nothing
findFirstUrl :: Pandoc -> Maybe String
findFirstUrl = (fmap getFirst) . (query findUrl')
  where
    findUrl' :: Inline -> Maybe (First String)
    findUrl' = (fmap First) . findUrl
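With regards to your concern that this will traverse the entire data structure: Haskell is lazy, so the traversal need not go any further than the monoid's <> demands. Note, however, that <> on Maybe pattern-matches on both of its arguments, so this version may still force more of the structure than the list-based version below.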
As pointed out in the comments, it's also possible to write this by specializing query to the List Monoid:
import Text.Pandoc
import Text.Pandoc.Walk (query)
import Data.Maybe (listToMaybe)
findUrl :: Inline -> [String]
findUrl (Link _ _ target) = [fst target]
findUrl _ = []
findFirstUrl :: Pandoc -> Maybe String
findFirstUrl = listToMaybe . (query findUrl)
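Whether such a search can stop early depends on the monoid as much as on laziness in general. The base-only sketch below uses foldMap over a list as a stand-in for query (an assumption: it only models a lazy, left-to-right monoidal traversal, not pandoc's actual Walkable instances). The list monoid and Data.Monoid.First never inspect their right operand once a result is found. The Semigroup instance for Maybe, by contrast, matches on both operands, so Maybe (Data.Semigroup.First a) has to evaluate the rest of the traversal before it can return.

```haskell
import Data.Maybe (listToMaybe)
import qualified Data.Monoid as M

-- Stand-in for `query`: foldMap over a list is a lazy, left-to-right
-- monoidal fold. (Assumption: this models the traversal, not pandoc itself.)

-- List monoid: (++) yields its head without touching the right operand.
firstViaList :: (a -> Bool) -> [a] -> Maybe a
firstViaList p = listToMaybe . foldMap (\x -> [x | p x])

-- Data.Monoid.First: `First (Just x) <> _` ignores the right operand.
firstViaMonoidFirst :: (a -> Bool) -> [a] -> Maybe a
firstViaMonoidFirst p =
  M.getFirst . foldMap (\x -> M.First (if p x then Just x else Nothing))
```

Both functions terminate on an infinite input such as [1 ..]; the same search written with Maybe (Data.Semigroup.First a) would not.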

Related

How to convert a haskell List into a monadic function that uses list values for operations?

I am having trouble wrapping my head around how to convert a list into a monadic action that uses the values of the list.
For example, I have a list [("dir1/content1", "1"), ("dir1/content11", "11"), ("dir2/content2", "2"), ("dir2/content21", "21")] that I want to convert into a monadic action equivalent to the following do block:
do
  mkBlob ("dir1/content1", "1")
  mkBlob ("dir1/content11", "11")
  mkBlob ("dir2/content2", "2")
  mkBlob ("dir2/content21", "21")
I imagine it to be a function similar to this:
contentToTree [] = return
contentToTree (x:xs) = (mkBlob x) =<< (contentToTree xs)
But this does not work, failing with an error:
• Couldn't match expected type ‘() -> TreeT LgRepo m ()’
              with actual type ‘TreeT LgRepo m ()’
• Possible cause: ‘(>>=)’ is applied to too many arguments
  In the expression: (mkBlob x) >>= (contentToTree xs)
  In an equation for ‘contentToTree’:
      contentToTree (x : xs) = (mkBlob x) >>= (contentToTree xs)
• Relevant bindings include
    contentToTree :: [(TreeFilePath, String)] -> () -> TreeT LgRepo m ()
I do not quite understand how to make it work.
Here is my relevant code:
import Data.Either
import Git
import Data.Map
import Conduit
import qualified Data.List as L
import qualified Data.ByteString.Char8 as BS
import qualified Data.ByteString.Lazy as BL
import Control.Monad (join)
type FileName = String
data Content = Content {
    content :: Either (Map FileName Content) String
  } deriving (Eq, Show)
contentToPaths :: String -> Content -> [(TreeFilePath, String)]
contentToPaths path (Content content) = case content of
  Left m -> join $ L.map (\(k, v) -> (contentToPaths (if L.null path then k else path ++ "/" ++ k) v)) $ Data.Map.toList m
  Right c -> [(BS.pack path, c)]
mkBlob :: MonadGit r m => (TreeFilePath, String) -> TreeT r m ()
mkBlob (path, content) = putBlob path
  =<< lift (createBlob $ BlobStream $
        sourceLazy $ BL.fromChunks [BS.pack content])
sampleContent = Content $ Left $ fromList [
    ("dir1", Content $ Left $ fromList [
      ("content1", Content $ Right "1"),
      ("content11", Content $ Right "11")
      ]),
    ("dir2", Content $ Left $ fromList [
      ("content2", Content $ Right "2"),
      ("content21", Content $ Right "21")
      ])
  ]
Would be grateful for any tips or help.

You have:
A list of values of some type a (in this case a ~ (String, String)). So, xs :: [a]
A function f from a to some type b in a monadic context, m b. Since you're ignoring the return value, we can imagine b ~ (). So, f :: Monad m => a -> m ().
You want to perform the operation, yielding some monadic context and an unimportant value, m (). So overall, we want some function doStuffWithList :: Monad m => [a] -> (a -> m ()) -> m (). We can search Hoogle for this type, and it yields some results. Unfortunately, as we've chosen to order the arguments, the first several results are little-used functions from other packages. If you scroll further, you start to find stuff in base - very promising. As it turns out, the function you are looking for is traverse_ :: (Foldable t, Applicative f) => (a -> f b) -> t a -> f (). With that, we can replace your do-block with just:
traverse_ mkBlob [ ("dir1/content1", "1")
                 , ("dir1/content11", "11")
                 , ("dir2/content2", "2")
                 , ("dir2/content21", "21")
                 ]
As it happens there are many names for this function, some for historical reasons and some for stylistic reasons. mapM_, forM_, and for_ are all the same and all in base, so you could use any of these. But the M_ versions are out of favor these days because really you only need Applicative, not Monad; and the for versions take their arguments in an order that's convenient for lambdas but inconvenient for named functions. So, traverse_ is the one I'd suggest.
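For completeness, the hand-written recursion from the question can also be repaired directly: the base case has to be an action (return ()), not the function return, and the steps are chained with >> because each result is discarded. The sketch below uses a hypothetical stand-in logBlob in the writer-like (,) [String] monad from base instead of the real mkBlob, so it runs without the Git libraries; contentToTree here also takes the action as a parameter, which is a simplification made for this example.

```haskell
import Data.Foldable (for_, traverse_)

-- Hypothetical stand-in for mkBlob: records the path instead of writing a blob.
-- ((,) [String]) is the writer-like monad that ships with base.
logBlob :: (String, String) -> ([String], ())
logBlob (path, _content) = ([path], ())

-- The question's recursion, repaired: `return ()` is an action, and (>>)
-- sequences two actions while discarding the result of the first.
contentToTree :: Monad m => (a -> m ()) -> [a] -> m ()
contentToTree _ [] = return ()
contentToTree f (x : xs) = f x >> contentToTree f xs

blobs :: [(String, String)]
blobs = [("dir1/content1", "1"), ("dir1/content11", "11"), ("dir2/content2", "2")]

-- All four spellings run the same actions in the same order.
viaRecursion, viaTraverse, viaMapM, viaFor :: ([String], ())
viaRecursion = contentToTree logBlob blobs
viaTraverse = traverse_ logBlob blobs
viaMapM = mapM_ logBlob blobs
viaFor = for_ blobs logBlob
```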

Assuming mkBlob is a function that looks like
mkBlob :: (String, String) -> M ()
where M is some specific monad, then you have the list
xs = [("dir1/content1", "1"), ("dir1/content11", "11"), ("dir2/content2", "2"), ("dir2/content21", "21")]
whose type is xs :: [(String, String)]. The first thing we need is to run the mkBlob function on each element, i.e. via map.
map mkBlob xs :: [M ()]
Now, we have a list of monadic actions, so we can use sequence to run them in sequence.
sequence (map mkBlob xs) :: M [()]
The resulting [()] value is all but useless, so we can use void to get rid of it:
void . sequence . map mkBlob $ xs :: M ()
Now, void . sequence is called sequence_ in Haskell (since this pattern is fairly common), and sequence . map is called mapM. Putting the two together, the function you want is called mapM_.
mapM_ mkBlob xs :: M ()
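The intermediate types in this derivation can be inspected with a small base-only sketch. The action step is a hypothetical stand-in for mkBlob, living in the writer-like (,) [String] monad so that the effects are visible as a log.

```haskell
import Data.Functor (void)

-- Hypothetical stand-in action: logs a line and returns ().
step :: Int -> ([String], ())
step n = (["step " ++ show n], ())

-- A list of actions; nothing has been run yet.
stage1 :: [([String], ())]
stage1 = map step [1, 2, 3]

-- The actions run in order and their results are collected.
stage2 :: ([String], [()])
stage2 = sequence stage1

-- The useless [()] is discarded.
stage3 :: ([String], ())
stage3 = void stage2

-- The library shortcuts give the same result as stage3.
viaSequence_, viaMapM_ :: ([String], ())
viaSequence_ = sequence_ stage1
viaMapM_ = mapM_ step [1, 2, 3]
```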

Haskell : convert Either ParseError a to a

I'm using Parsec to write an expression parser and I have a small question (I'm French, so sorry for my English).
I have this code:
data Expression ... -- Recursive type Expression
type Store = [(String, Float)] -- variable storage
type Parser a = Parsec String () a
parseExpr :: [Char] -> Either ParseError Expression
parseExpr string = parse expr "" stream
  where
    stream = filter (not . isSpace) string
-- Parser's rules ...
raiseError a = Nothing
evalParser :: [Char] -> Store -> Float
evalParser expr store = fromMaybe 0 (either raiseError (eval store)(parseExpr expr))
This code works really well, but I need this function:
parseExpression :: String -> Maybe Expression
And I have no idea what the right syntax is.
Can someone help me, please?

I'll start it for you, and you can finish it:
eitherToMaybe :: Either a b -> Maybe b
eitherToMaybe (Left a) = ???
eitherToMaybe (Right b) = ???
A severe over-generalization looks like this (you can find this function with a less direct implementation in the monadplus package):
import Control.Applicative (Alternative (..))
import Data.Profunctor.Unsafe ((#.))
import Data.Monoid (Alt (..))
afold :: (Foldable f, Alternative g)
=> f a -> g a
afold = getAlt . foldMap (Alt #. pure)
But you don't really need to get into that business just yet.
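The generalized function can be tried out with base alone if #. from the profunctors package is replaced by ordinary composition (an assumption made here for portability; for plain functions the two should behave the same). The names afold' and rightToMaybe are hypothetical and chosen only for this sketch.

```haskell
import Control.Applicative (Alternative)
import Data.Monoid (Alt (..))

-- Same idea as afold above, with plain (.) instead of (#.).
afold' :: (Foldable f, Alternative g) => f a -> g a
afold' = getAlt . foldMap (Alt . pure)

-- Specialised to the shape the question asks about: Either has a Foldable
-- instance that visits only the Right value, and Maybe is an Alternative.
rightToMaybe :: Either e a -> Maybe a
rightToMaybe = afold'
```

Because it works for any Foldable, the same function also picks the first element of a list when the target is Maybe.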

Haskell: Replace mapM in a monad transformer stack to achieve lazy evaluation (no space leaks)

It has already been discussed that mapM is inherently not lazy, e.g. here and here. Now I'm struggling with a variation of this problem where the mapM in question is deep inside a monad transformer stack.
Here's a function taken from a concrete, working (but space-leaking) example using LevelDB that I put on gist.github.com:
-- read keys [1..n] from db at DirName and check that the values are correct
doRead :: FilePath -> Int -> IO ()
doRead dirName n = do
    success <- runResourceT $ do
        db <- open dirName defaultOptions{ cacheSize= 2048 }
        let check' = check db def in        -- is an Int -> ResourceT IO Bool
            and <$> mapM check' [1..n]      -- space leak !!!
    putStrLn $ if success then "OK" else "Fail"
This function reads the values corresponding to keys [1..n] and checks that they are all correct. The troublesome line inside the ResourceT IO a monad is
and <$> mapM check' [1..n]
One solution would be to use streaming libraries such as pipes, conduit, etc. But these seem rather heavy and I'm not at all sure how to use them in this situation.
Another path I looked into is ListT as suggested here. But the type signatures of ListT.fromFoldable :: [Bool]->ListT Bool and ListT.fold :: (r -> a -> m r) -> r -> t m a -> m r (where m=IO and a,r=Bool) do not match the problem at hand.
What is a 'nice' way to get rid of the space leak?
Update: Note that this problem has nothing to do with monad transformer stacks! Here's a summary of the proposed solutions:
1) Using Streaming:
import Streaming
import qualified Streaming.Prelude as S
S.all_ id (S.mapM check' (S.each [1..n]))
2) Using Control.Monad.foldM:
foldM (\a i-> do {b<-check' i; return $! a && b}) True [1..n]
3) Using Control.Monad.Loops.allM
allM check' [1..n]

I know you mention you don't want to use streaming libraries, but your problem seems pretty easy to solve with streaming without changing the code too much.
import Streaming
import qualified Streaming.Prelude as S
We use each [1..n] instead of [1..n] to get a stream of elements:
each :: (Monad m, Foldable f) => f a -> Stream (Of a) m ()
Stream the elements of a pure, foldable container.
(We could also write something like S.take n $ S.enumFrom 1).
We use S.mapM check' instead of mapM check':
mapM :: Monad m => (a -> m b) -> Stream (Of a) m r -> Stream (Of b) m r
Replace each element of a stream with the result of a monadic action
And then we fold the stream of booleans with S.all_ id:
all_ :: Monad m => (a -> Bool) -> Stream (Of a) m r -> m Bool
Putting it all together:
S.all_ id (S.mapM check' (S.each [1..n]))
Not too different from the code you started with, and without the need for any new operator.

I think what you need is allM from the monad-loops package.
Then it would be just allM check' [1..n]
(Or if you don't want the import it's a pretty small function to copy.)
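A hand-rolled version might look like the sketch below. It is written from the description above rather than copied from monad-loops, so the name allM' and the exact definition are assumptions. The check' used here is a hypothetical stand-in for the database check: it lives in the writer-like (,) [Int] monad from base purely to record which keys were visited.

```haskell
import Control.Monad (foldM)

-- Short-circuiting monadic `all`: stops at the first False.
allM' :: Monad m => (a -> m Bool) -> [a] -> m Bool
allM' _ [] = return True
allM' p (x : xs) = do
  ok <- p x
  if ok then allM' p xs else return False

-- Hypothetical check that logs each key it visits; keys below 3 count as correct.
check' :: Int -> ([Int], Bool)
check' i = ([i], i < 3)

-- Visits keys 1, 2 and 3, then stops.
shortCircuit :: ([Int], Bool)
shortCircuit = allM' check' [1 .. 5]

-- The strict foldM variant from the question runs in constant space but
-- keeps checking after the first failure.
visitsAll :: ([Int], Bool)
visitsAll = foldM (\a i -> do { b <- check' i; return $! a && b }) True [1 .. 5]
```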

Is there a pre-defined function for conduit analogy of `takeWhile`?

I find the following function missing from the Data.Conduit.List module, and I couldn't find an easy way to compose this using functions in that module.
{-# LANGUAGE LambdaCase #-}
takeWhile :: Monad m => (a -> Bool) -> Consumer a m [a]
takeWhile p = await >>= \case
  Nothing -> return []
  Just b -> if p b
    then (b :) <$> takeWhile p
    else (leftover b) >> return []
This function is very useful in my application where I sometimes need to group the next few items together, and I am not sure how many are there.
The absence of this function seems strange to me, since there are take :: Monad m => Int -> Consumer a m [a] and groupBy :: Monad m => (a -> a -> Bool) -> Conduit a m [a], but no takeWhile.
Am I missing something?
Edit: Per @ErikR's request, here are two simple examples that can perhaps clarify why I think this function could be useful.
Case 1: the protocol specifies there be a header section in the stream. For simplicity let's assume it's a String stream and the header items are marked by a leading #.
Stream content:
#language=English
#encoding=Unicode
Apple
Orange
Blue
Red
Sheep
Dog
...
Code using takeWhile:
myConduit :: Conduit String IO String
myConduit = do
  headers <- takeWhile ((== '#') . head)
  awaitForever $ \ item -> do
    case getLanguage headers of
      English -> ...
      French -> ...
Case 2: the protocol specifies that items with prefix # have several continuations prefixed by +.
Stream content:
Apple
Orange
Blue
#Has
+kell
#A
+Really
+Long
+Word
Dog
...
Code using takeWhile:
myConduit :: ConduitM String String IO (Maybe ())
myConduit = runMaybeC . forever $ do
  a <- maybe (lift mzero) return =<< await
  aConts <- if head a == '#' then takeWhile ((== '+') . head)
            else return []
  liftIO . putStrLn . concat $ a : aConts
However, aside from being useful, it is also for completeness. I see that Data.Conduit.List's goal is to provide a set of "list-like" operations in the Conduit context. I think bread-and-butter functions like takeWhile should be provided, along with its siblings like dropWhile, so that people don't have to change their style of coding when thinking about conduits as lists.
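The grouping in Case 2 can be modelled on plain lists, which may help when reasoning about what the conduit version should produce. This is only a pure sketch under the assumption that the stream fits in a list; span plays the role of takeWhile followed by leftover, and the names groupContinuations and startsWith are hypothetical.

```haskell
-- Pure-list model of Case 2: `span` returns the matching prefix together
-- with the untouched remainder, like takeWhile followed by leftover.
groupContinuations :: [String] -> [String]
groupContinuations [] = []
groupContinuations (item : rest)
  | startsWith '#' item =
      let (conts, rest') = span (startsWith '+') rest
       in concat (item : conts) : groupContinuations rest'
  | otherwise = item : groupContinuations rest

-- Total replacement for `(== c) . head`; safe on empty strings.
startsWith :: Char -> String -> Bool
startsWith c (x : _) = x == c
startsWith _ [] = False
```

Applied to the stream content of Case 2, the "#Has" item is merged with "+kell" and the "#A" item with its three continuations, while every other item passes through unchanged.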

Join two consumers into a single consumer that returns multiple values?

I have been experimenting with the new pipes-http package and I had a thought. I have two parsers for a web page, one that returns line items and another a number from elsewhere in the page. When I grab the page, it'd be nice to string these parsers together and get their results at the same time from the same bytestring producer, rather than fetching the page twice or fetching all the html into memory and parsing it twice.
In other words, say you have two Consumers:
c1 :: Consumer a m r1
c2 :: Consumer a m r2
Is it possible to make a function like this:
combineConsumers :: Consumer a m r1 -> Consumer a m r2 -> Consumer a m (r1, r2)
combineConsumers = undefined
I have tried a few things, but I can't figure it out. I understand if it isn't possible, but it would be convenient.
Edit:
I'm sorry, it turns out I was making an assumption about pipes-attoparsec, based on my experience with conduit-attoparsec, that caused me to ask the wrong question. Pipes-attoparsec turns an attoparsec parser into a pipes Parser, whereas I had assumed that it would return a pipes Consumer. That means that I can't actually turn two attoparsec parsers into consumers that take text and return a result, then use them with the plain old pipes ecosystem. I'm sorry but I just don't understand pipes-parse.
Even though it doesn't help me, Arthur's answer is pretty much what I envisioned when I asked the question, and I'll probably end up using his solution in the future. In the meantime I'm just going to use conduit.

If the results are "monoidal", you can use the tee function from the Pipes prelude, in combination with a WriterT.
{-# LANGUAGE OverloadedStrings #-}
import Data.Monoid
import Control.Monad
import Control.Monad.Writer
import Control.Monad.Writer.Class
import Pipes
import qualified Pipes.Prelude as P
import qualified Data.Text as T
textSource :: Producer T.Text IO ()
textSource = yield "foo" >> yield "bar" >> yield "foo" >> yield "nah"
counter :: Monoid w => T.Text
                    -> (T.Text -> w)
                    -> Consumer T.Text (WriterT w IO) ()
counter word inject = P.filter (==word) >-> P.mapM (tell . inject) >-> P.drain
main :: IO ()
main = do
  result <- runWriterT $ runEffect $
    hoist lift textSource >->
    P.tee (counter "foo" inject1) >-> (counter "bar" inject2)
  putStrLn . show $ result
  where
    inject1 _ = (,) (Sum 1) mempty
    inject2 _ = (,) mempty (Sum 1)
Update: As mentioned in a comment, the real problem I see is that in pipes parsers aren't Consumers. And how can you run two parsers concurrently if they have different behaviours regarding leftovers? What happens if one of the parsers wants to "un-draw" some text and the other parser doesn't?
One possible solution is to run the parsers in a truly concurrent manner, in different threads. The primitives in the pipes-concurrency package let you "duplicate" a Producer by writing the same data to two different mailboxes. And then each parser can do whatever it wants with its own copy of the producer. Here's an example which also uses the pipes-parse, pipes-attoparsec and async packages:
{-# LANGUAGE OverloadedStrings #-}
import Data.Monoid
import qualified Data.Text as T
import Data.Attoparsec.Text hiding (takeWhile)
import Data.Attoparsec.Combinator
import Control.Applicative
import Control.Monad
import Control.Monad.State.Strict
import Pipes
import qualified Pipes.Prelude as P
import qualified Pipes.Attoparsec as P
import qualified Pipes.Concurrent as P
import qualified Control.Concurrent.Async as A
parseChars :: Char -> Parser [Char]
parseChars c = fmap mconcat $
  many (notChar c) *> many1 (some (char c) <* many (notChar c))
textSource :: Producer T.Text IO ()
textSource = yield "foo" >> yield "bar" >> yield "foo" >> yield "nah"
parseConc :: Producer T.Text IO ()
          -> Parser a
          -> Parser b
          -> IO (Either P.ParsingError a,Either P.ParsingError b)
parseConc producer parser1 parser2 = do
  (outbox1,inbox1,seal1) <- P.spawn' P.Unbounded
  (outbox2,inbox2,seal2) <- P.spawn' P.Unbounded
  feeding <- A.async $ runEffect $ producer >-> P.tee (P.toOutput outbox1)
                                            >-> P.toOutput outbox2
  sealing <- A.async $ A.wait feeding >> P.atomically seal1 >> P.atomically seal2
  r <- A.runConcurrently $
    (,) <$> (A.Concurrently $ parseInbox parser1 inbox1)
        <*> (A.Concurrently $ parseInbox parser2 inbox2)
  A.wait sealing
  return r
  where
    parseInbox parser inbox = evalStateT (P.parse parser) (P.fromInput inbox)
main :: IO ()
main = do
  (Right a, Right b) <- parseConc textSource (parseChars 'o') (parseChars 'a')
  putStrLn . show $ (a,b)
The result is:
("oooo","aa")
I'm not sure how much overhead this approach introduces.

I think something is wrong with the way you are going about this, for the reasons Davorak mentions in his remark. But if you really need such a function, you can define it.
import Pipes.Internal
import Pipes.Core
zipConsumers :: Monad m => Consumer a m r -> Consumer a m s -> Consumer a m (r,s)
zipConsumers p q = go (p,q) where
  go (p,q) = case (p,q) of
    (Pure r     , Pure s)      -> Pure (r,s)
    (M mpr      , ps)          -> M (do
      pr <- mpr
      return (go (pr, ps)))
    (pr         , M mps)       -> M (do
      ps <- mps
      return (go (pr, ps)))
    (Request _ f, Request _ g) -> Request () (\a -> go (f a, g a))
    (Request _ f, Pure s)      -> Request () (\a -> do
      r <- f a
      return (r, s))
    (Pure r     , Request _ g) -> Request () (\a -> do
      s <- g a
      return (r,s))
    (Respond x _, _          ) -> closed x
    (_          , Respond y _) -> closed y
If you are 'zipping' consumers without using their return values, only their 'effects', you can just use tee consumer1 >-> consumer2.

The idiomatic solution is to rewrite your Consumers as a Fold or FoldM from the foldl library and then combine them using Applicative style. You can then convert this combined fold to one that works on pipes.
Let's assume that you have either two Folds:
fold1 :: Fold a r1
fold2 :: Fold a r2
... or two FoldMs:
foldM1 :: Monad m => FoldM m a r1
foldM2 :: Monad m => FoldM m a r2
Then you combine these into a single Fold/FoldM using Applicative style:
import Control.Applicative
foldBoth :: Fold a (r1, r2)
foldBoth = (,) <$> fold1 <*> fold2
foldBothM :: Monad m => FoldM m a (r1, r2)
foldBothM = (,) <$> foldM1 <*> foldM2
-- or: foldBoth = liftA2 (,) fold1 fold2
--     foldBothM = liftA2 (,) foldM1 foldM2
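To see why the Applicative combination consumes the input only once, here is a simplified re-implementation of the idea using base alone. It is a sketch for illustration, not the foldl library's source: the real Fold type lives in Control.Foldl, and runFold, sumF, lengthF and sumAndLength are names made up for this example.

```haskell
{-# LANGUAGE ExistentialQuantification #-}

import Data.List (foldl')

-- A fold is a step function, an initial state and a finaliser.
-- (Simplified stand-in for the type exported by Control.Foldl.)
data Fold a b = forall x. Fold (x -> a -> x) x (x -> b)

instance Functor (Fold a) where
  fmap f (Fold step begin done) = Fold step begin (f . done)

instance Applicative (Fold a) where
  pure b = Fold (\() _ -> ()) () (const b)
  Fold stepL beginL doneL <*> Fold stepR beginR doneR =
    Fold
      (\(l, r) a -> (stepL l a, stepR r a)) -- advance both states per element
      (beginL, beginR)
      (\(l, r) -> doneL l (doneR r))

-- Run a fold over a list in a single left-to-right pass.
runFold :: Fold a b -> [a] -> b
runFold (Fold step begin done) = done . foldl' step begin

sumF :: Num a => Fold a a
sumF = Fold (+) 0 id

lengthF :: Fold a Int
lengthF = Fold (\n _ -> n + 1) 0 id

-- Two folds combined into one that still traverses the input once.
sumAndLength :: Fold Int (Int, Int)
sumAndLength = (,) <$> sumF <*> lengthF
```

The combined fold carries a pair of states, so each input element is fed to both step functions before the next element is requested.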
You can turn either fold into a Pipes.Prelude-style fold or a Parser. Here are the necessary conversion functions:
import Control.Foldl (purely, impurely)
import qualified Pipes.Prelude as Pipes
import qualified Pipes.Parse as Parse
purely Pipes.fold
    :: Monad m => Fold a b -> Producer a m () -> m b
impurely Pipes.foldM
    :: Monad m => FoldM m a b -> Producer a m () -> m b
purely Parse.foldAll
    :: Monad m => Fold a b -> Parser a m b
impurely Parse.foldMAll
    :: Monad m => FoldM m a b -> Parser a m b
The reason for the purely and impurely functions is so that foldl and pipes can interoperate without either one incurring a dependency on the other. Also, they allow libraries other than pipes (like conduit) to reuse foldl without a dependency, too (Hint hint, @MichaelSnoyman).
I apologize that this feature is not documented, mainly because it took me a while to figure out how to get pipes and foldl to interoperate in a dependency-free manner, and that was after I wrote the pipes tutorial. I will update the tutorial to point out this trick.
To learn how to use foldl, just read the documentation in the main module. It's a very small and easy-to-learn library.

For what it's worth, in the conduit world, the relevant function is zipSinks. There might be some way to adapt this function to work for pipes, but automatic termination may get in the way.

Consumer forms a Monad so
combineConsumers = liftM2 (,)
will type check. Unfortunately, the semantics might be unlike what you're expecting: the first consumer will run to completion and then the second.
