I'm trying to use Parsec to read a C/C++/Java source file and do a series of transformations on the entire file. The first phase removes strings and the second phase removes comments. (Strings have to go first because you might get a /* inside a string.)
So each phase transforms a string into an Either ParseError String, and I want to bind (in the sense of Either) the phases together to make a pipeline of transformations of the whole file. This seems like a fairly general requirement.
import Text.ParserCombinators.Parsec
commentless, stringless :: Parser String
stringless = fmap concat ( (many (noneOf "\"")) `sepBy` quotedString )
quotedString = (char '"') >> (many quotedChar) >> (char '"')
quotedChar = try (string "\\\"" >> return '"' ) <|> (noneOf "\"")
commentless = fmap concat $ notComment `sepBy` comment
notComment = manyTill anyChar (lookAhead (comment <|> eof))
comment = (string "//" >> manyTill anyChar newline >> spaces >> return ())
      <|> (string "/*" >> manyTill anyChar (string "*/") >> spaces >> return ())
main =
  do c <- getContents
     case parse commentless "(stdin)" c of    -- THIS WORKS
     -- case parse stringless "(stdin)" c of  -- THIS WORKS TOO
     -- case parse (stringless `THISISWHATIWANT` commentless) "(stdin)" c of
       Left e -> do putStrLn "Error parsing input:"
                    print e
       Right r -> print r
So how can I do this? I tried parserBind but it didn't work.
(In case anybody cares why: I'm trying to do a kind of light parse where I just extract what I want but avoid parsing the entire grammar, or even knowing whether it's C++ or Java. All I need to extract is the starting and ending line numbers of all classes and functions. So I envisage a bunch of preprocessing phases that just scrub out comments, #defines/#ifdefs, template preambles and the contents of parentheses (because of the semicolons in for clauses), then I'll parse for the snippets preceding {s (or following }s, because of typedefs), stuff those snippets through yet another phase to get the type and name of whatever it is, then recurse to just the second level to get Java member functions.)
You need to bind in Either ParseError, not in Parser. Move the bind outside the parse, and use multiple parses:
parse stringless "(stdin)" input >>= parse commentless "(stdin)"
There is probably a better approach than what you are using, but this will do what you want.
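For instance, the main from the question could become (a sketch reusing the question's stringless and commentless; the bind in the case scrutinee happens in the Either ParseError monad):

main :: IO ()
main =
  do c <- getContents
     case parse stringless "(stdin)" c >>= parse commentless "(stdin)" of
       Left e -> do putStrLn "Error parsing input:"
                    print e
       Right r -> print r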
Consider the string "(=x250) toto e", and the function:
charToText :: Char -> Text
charToText c = pack [c]
The string is successfully parsed by:
mconcat <$> manyTill (charToText <$> anyChar) (char 'e')
with the expected result "(=x250) toto ".
However, the parser:
mconcat <$> manyTill (charToText <$> anyChar) endOfInput
returns Partial _.
Why is this so? I thought endOfInput would succeed at the end of the string and stop manyTill (as in the first example).
To give a complete answer I'd need a fully self-contained example that generates the "Result: incomplete input" error message, but the short answer is that your Attoparsec parser is working correctly. You can see similar behavior with a much simpler example:
λ> parse (char 'e') "e"
Done "" 'e'
λ> parse endOfInput ""
Partial _
Attoparsec parsers by design allow incremental supply of additional input. When they are run on some (potentially partial) input, they return a Done result if the parser unconditionally succeeds on the supplied input. They return a Partial result if more input is needed to decide if the parser succeeds.
For my example above, char 'e' always successfully parses the partial input "e", no matter what additional input you might decide to supply, hence the result is a Done.
However, endOfInput might succeed on the partial input "", but only if no additional input is going to be supplied. If there is additional input, endOfInput will fail. Because of this, a Partial result is returned.
It's the same for your example. The success of your second parser depends on whether or not additional input is supplied. If there's no additional input, the parser is Done, but if there is additional input, the parser has more to do.
You will either need to run your parser with parseOnly:
λ> parseOnly (manyTill anyChar endOfInput) "foo"
Right "foo"
or feed your parse result an empty string, which indicates that no further input is available:
λ> parse (manyTill anyChar endOfInput) "foo"
Partial _
λ> feed (parse (manyTill anyChar endOfInput) "foo") ""
Done "" "foo"
I am trying to parse MediaWiki text using Parsec. Some of the constructs in MediaWiki markup can only occur at the start of lines (such as the header markup ==header level 2==). In a regexp I would use an anchor (such as ^) to find the start of a line.
One attempt in GHCi is
Prelude Text.Parsec> parse (char '\n' *> string "==" *> many1 letter <* string "==") "" "\n==hej=="
Right "hej"
but this is not too good since it will fail on the first line of a file. I feel like this should be a solved problem...
What is the most idiomatic "Start of line" parsing in Parsec?
You can use getPosition and sourceColumn in order to find out the column number that the parser is currently looking at. The column number will be 1 if the current position is at the start of a line (such as at the start of input or after a \n or \r character).
There isn't a built-in combinator for this, but you can easily make it:
import Text.Parsec
import Control.Monad (guard)
startOfLine :: Monad m => ParsecT s u m ()
startOfLine = do
  pos <- getPosition
  guard (sourceColumn pos == 1)
Now you can write your header parser as:
header = startOfLine *> string "==" *> many1 letter <* string "=="
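A quick GHCi check (assuming the startOfLine and header definitions above are in scope) shows that it accepts headers both at the start of input and right after a newline:

Prelude Text.Parsec> parse (many (header <* spaces)) "" "==hej==\n==hopp=="
Right ["hej","hopp"]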
Probably you can use many (char '\n') instead of just char '\n'. In parser combinators there's no notion of "start of line", because a parser always runs at the start of its remaining input. The only thing you can do is check manually which symbols your input can start with. Using many (char '\n') ensures that there are only zero or more empty lines before the header == my header ==.
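A minimal sketch of that suggestion, reusing the header shape from the question:

header :: Parsec String () String
header = many (char '\n') *> string "==" *> many1 letter <* string "=="

Note the difference from the startOfLine approach: this version only skips newlines, so if a preceding parser stops mid-line it will still accept ==...== at that position rather than rejecting it for not being at column 1.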
Well, in fact I'm pretty sure I'm using the wrong terminology. Here is the problem I want to solve: a parser for the Markdown format, or rather a subset of it.
My problem is with the blockquote feature. Each line in a blockquote starts with >; otherwise everything is the normal structure in a markdown file.
You can't look at individual lines separately, because you need to separate paragraphs from normal lines, e.g.
> a
> b
is not the same as
> a
>
> b
and things like that (similarly, if a list is blockquoted you don't want x lists but one list with x elements). A natural and trivial way is to "take off" the > signs, parse the blockquote on its own, ignoring anything around it, wrap it with a BlockQuote type constructor, put that in the outer AST, and resume parsing of the original input. It's what pandoc does, if I'm not wrong:
https://hackage.haskell.org/package/pandoc-1.14.0.4/docs/src/Text-Pandoc-Readers-Markdown.html#blockQuote
blockQuote :: MarkdownParser (F Blocks)
blockQuote = do
  raw <- emailBlockQuote
  -- parse the extracted block, which may contain various block elements:
  contents <- parseFromString parseBlocks $ (intercalate "\n" raw) ++ "\n\n"
  return $ B.blockQuote <$> contents
And then:
http://hackage.haskell.org/package/pandoc-1.5.1/docs/src/Text-Pandoc-Shared.html#parseFromString
-- | Parse contents of 'str' using 'parser' and return result.
parseFromString :: GenParser tok st a -> [tok] -> GenParser tok st a
parseFromString parser str = do
  oldPos <- getPosition
  oldInput <- getInput
  setInput str
  result <- parser
  setInput oldInput
  setPosition oldPos
  return result
Now parseFromString looks quite hacky to me, and besides, it's Parsec, not Attoparsec, so I can't use it in my project. I'm not sure how I could take the Text from the blockquote, parse it, and return the parsing result so that it "fits" within the current parse. Seems impossible?
I've been googling the issue and I think that pipes-parse and conduit can help in that area, although I struggle to find examples, and what I see appears considerably less nice to look at than "pure" parsec/attoparsec parsers.
Other options for parsing blockquotes would be: rewriting the usual parsers to also handle the leading >, which means complicating and duplicating a lot; parsing blockquotes by counting each line separately and writing some messy "merge" function; or parsing to a first AST that contains each blockquote as Text inside a BlockquoteText type constructor, waiting for a later pass in which they are parsed separately. The last is not very elegant, but it has the benefit of simplicity, which does count for something.
I would probably go for the latter, but surely there's a better way?
I have asked myself the same question. Why is there no standard combinator for nested parsers like how you describe? My default mode is to trust the package author, especially when that author also co-wrote "Real World Haskell". If such an obvious capability is missing, perhaps it is by design and I should look for a better way. However, I've managed to convince myself that such a convenient combinator is mostly harmless. Useful whenever an all-or-nothing type parser is appropriate for the inner parse.
Implementation
import Data.Attoparsec.Text
import qualified Data.Text as T
import Data.Text(Text)
import Control.Applicative
I've divided the required functionality into two parsers. The first, constP, performs an "in place" parse of some given text. It substitutes the constant parser's fail with empty (from Alternative), but otherwise has no side effects.
constP :: Parser a -> Text -> Parser a
constP p t = case parseOnly p t of
  Left _  -> empty
  Right a -> return a
The second part comes from parseOf, which performs the constant, inner parse based on the result of the outer parse. The empty alternative here allows a failed parse to return without consuming any input.
parseOf :: Parser Text -> Parser a -> Parser a
parseOf ptxt pa = bothParse <|> empty
  where
    bothParse = ptxt >>= constP pa
The blockquote markdown parser can then be written in the desired fashion. This implementation requires the resulting block to be totally parsed.
blockQuoteMarkdown :: Parser [[Double]]
blockQuoteMarkdown = parseOf blockQuote (markdownSurrogate <* endOfInput)
Instead of the actual markdown parser, I just implemented a quick parser for space-separated doubles. The complication in this parser comes from allowing the last non-empty line to end either with or without a newline.
markdownSurrogate :: Parser [[Double]]
markdownSurrogate = do
  lns <- many (mdLine <* endOfLine)
  option lns ((lns ++) . pure <$> mdLine1)
  where
    mdLine  = sepBy double (satisfy (== ' '))
    mdLine1 = sepBy1 double (satisfy (== ' '))
These two parsers are responsible for returning the text internal to block quotes.
blockQuote :: Parser Text
blockQuote = T.unlines <$> many blockLine
blockLine :: Parser Text
blockLine = char '>' *> takeTill isEndOfLine <* endOfLine
Finally, a test of the parser.
parseMain :: IO ()
parseMain = do
    putStrLn ""
    doParse "a" markdownSurrogate a
    doParse "_" markdownSurrogate ""
    doParse "b" markdownSurrogate b
    doParse "ab" markdownSurrogate ab
    doParse "a_b" markdownSurrogate a_b
    doParse "badMarkdown x" markdownSurrogate x
    doParse "badMarkdown axb" markdownSurrogate axb
    putStrLn ""
    doParse "BlockQuote ab" blockQuoteMarkdown $ toBlockQuote ab
    doParse "BlockQuote a_b" blockQuoteMarkdown $ toBlockQuote a_b
    doParse "BlockQuote axb" blockQuoteMarkdown $ toBlockQuote axb
  where
    a   = "7 3 1"
    b   = "4 4 4"
    x   = "a b c"
    ab  = T.unlines [a,b]
    a_b = T.unlines [a,"",b]
    axb = T.unlines [a,x,b]

    doParse desc p str = do
      print $ T.concat ["Parsing ", desc, ": \"", str, "\""]
      let i = parse (p <* endOfInput) str
      print $ feed i ""

    toBlockQuote = T.unlines
                 . map (T.cons '>')
                 . T.lines
*Main> parseMain
"Parsing a: \"7 3 1\""
Done "" [[7.0,3.0,1.0]]
"Parsing _: \"\""
Done "" []
"Parsing b: \"4 4 4\""
Done "" [[4.0,4.0,4.0]]
"Parsing ab: \"7 3 1\n4 4 4\n\""
Done "" [[7.0,3.0,1.0],[4.0,4.0,4.0]]
"Parsing a_b: \"7 3 1\n\n4 4 4\n\""
Done "" [[7.0,3.0,1.0],[],[4.0,4.0,4.0]]
"Parsing badMarkdown x: \"a b c\""
Fail "a b c" [] "endOfInput"
"Parsing badMarkdown axb: \"7 3 1\na b c\n4 4 4\n\""
Fail "a b c\n4 4 4\n" [] "endOfInput"
"Parsing BlockQuote ab: \">7 3 1\n>4 4 4\n\""
Done "" [[7.0,3.0,1.0],[4.0,4.0,4.0]]
"Parsing BlockQuote a_b: \">7 3 1\n>\n>4 4 4\n\""
Done "" [[7.0,3.0,1.0],[],[4.0,4.0,4.0]]
"Parsing BlockQuote axb: \">7 3 1\n>a b c\n>4 4 4\n\""
Fail ">7 3 1\n>a b c\n>4 4 4\n" [] "Failed reading: empty"
Discussion
The notable difference comes in the semantics of failure. For instance, consider parsing axb and the blockquoted axb, which are the following two strings, respectively:
7 3 1
a b c
4 4 4
and
> 7 3 1
> a b c
> 4 4 4
the markdown parse results in
Fail "a b c\n4 4 4\n" [] "endOfInput"
whereas the quoted results in
Fail ">7 3 1\n>a b c\n>4 4 4\n" [] "Failed reading: empty"
The markdown parse consumes "7 3 1\n", but this is nowhere reported in the quoted failure. Instead, failure becomes all or nothing.
Likewise, there is no allowance for handling unparsed text in the case of partial success. But I don't see a need for this, given the use case. For example, if a parse looked something like the following
"{ <tok> unhandled }more to parse"
where {} denotes the recognized block quote context, and <tok> is parsed within that inner context. A partial success then would have to lift "unhandled" out of that block quote context and somehow combine it with "more to parse".
I see no general way to do this, but it is allowed through choice of inner parser return type. For instance, by some parser parseOf blockP innP :: Parser (<tok>,Maybe Text). However, if this need arises I would expect that there is a better way to handle the situation than with nested parsers.
There may also be concerns about the loss of attoparsec's Partial parsing. That is, the implementation of constP uses parseOnly, which collapses the Fail and Partial results into a single Left failure state. In other words, we lose the ability to feed our inner parser more text as it becomes available. However, note that the text to parse is itself the result of an outer parse; it will only be available after enough text has been fed to the outer parse. So this shouldn't be an issue either.
Say I have a document with text delimited by Jade-style brackets, like {{foo}}. I've written an Attoparsec parser that seems to extract foo properly:
findFoos :: Parser [T.Text]
findFoos = many $ do
  manyTill anyChar (string "{{")
  T.pack <$> manyTill letter (string "}}")
Testing it shows that it works:
> parseOnly findFoos "{{foo}}"
Right ["foo"]
> parseOnly findFoos "{{foo}} "
Right ["foo"]
Now, with the Data.Conduit.Attoparsec module in conduit-extra, I seem to be running into strange behavior:
> yield "{{foo}}" $= (mapOutput snd $ CA.conduitParser findFoos) $$ CL.mapM_ print
["foo"]
> yield "{{foo}} " $= (mapOutput snd $ CA.conduitParser findFoos) $$ CL.mapM_ print
-- floods stdout with empty lists
Is this the desired behavior? Is there a conduit utility I should be using here? Any help with this would be tremendous!
Because it uses many, findFoos will return [] without consuming input when it doesn't find any delimited text.
On the other hand, conduitParser applies a parser repeatedly on a stream, returning each parsed value until it exhausts the stream.
The problem with "{{foo}} " is that the parser will consume {{foo}}, but the blank space remains unconsumed in the stream, so further invocations of the parser always return [].
If you redefine findFoos to consume one quoted element at a time, including the trailing blanks, it should work:
findFoos' :: Parser String
findFoos' = do
  manyTill anyChar (string "{{")
  manyTill letter (string "}}") <* skipSpace
Real-world examples will have other characters between bracketed texts, so skipping the "extra stuff" after each parse (without consuming any of the {{ opening braces for the next parse) will be a bit more involved.
Perhaps something like the following will work:
findFoos'' :: Parser String
findFoos'' = do
  manyTill anyChar (string "{{")
  manyTill letter (string "}}") <* skipMany everythingExceptOpeningBraces
  where
    -- is there a simpler / more efficient way of doing this?
    everythingExceptOpeningBraces =
          -- skip one or more non-braces
          (skip (/= '{') *> skipWhile (/= '{'))
      <|>
          -- skip a single brace followed by a non-brace character
          (skip (== '{') *> skip (/= '{'))
      <|>
          -- skip a brace at the very end
          (skip (== '{') *> endOfInput)
(This parser will fail, however, if there aren't any bracketed texts in the stream. Perhaps you could build a Parser (Maybe Text) that returns Nothing in that case.)
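A minimal sketch of that idea (findFooMaybe is a hypothetical name; on failing to find bracketed text it drains the remaining input and returns Nothing, so the stream gets exhausted and conduitParser terminates):

findFooMaybe :: Parser (Maybe String)
findFooMaybe =
      (Just <$> findFoos')
  <|> (skipWhile (const True) *> pure Nothing)  -- no "{{" left: consume the rest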
I'm trying to write a parser using Parsec that will parse literate Haskell files, such as the following:
The classic 'Hello, world' program.
\begin{code}
main = putStrLn "Hello, world"
\end{code}
More text.
I've written the following, sort-of-inspired by the examples in RWH:
import Text.ParserCombinators.Parsec
main
  = do contents <- readFile "hello.lhs"
       let results = parseLiterate contents
       print results

data Element
  = Text String
  | Haskell String
  deriving (Show)

parseLiterate :: String -> Either ParseError [Element]
parseLiterate input
  = parse literateFile "(unknown)" input

literateFile
  = many codeOrProse

codeOrProse
  = code <|> prose

code
  = do eol
       string "\\begin{code}"
       eol
       content <- many anyChar
       eol
       string "\\end{code}"
       eol
       return $ Haskell content

prose
  = do content <- many anyChar
       return $ Text content

eol
  = try (string "\n\r")
  <|> try (string "\r\n")
  <|> string "\n"
  <|> string "\r"
  <?> "end of line"
Which I hoped would result in something along the lines of:
[Text "The classic 'Hello, world' program.", Haskell "main = putStrLn \"Hello, world\"", Text "More text."]
(allowing for whitespace etc).
This compiles fine, but when run, I get the error:
*** Exception: Text.ParserCombinators.Parsec.Prim.many: combinator 'many' is applied to a parser that accepts an empty string
Can anyone shed any light on this, and possibly help with a solution please?
As sth pointed out, many anyChar is the problem, and not just in prose but also in code. The problem with code is that content <- many anyChar will consume everything: the newlines and the \end{code} tag.
So you need some way to tell prose and code apart. An easy (but maybe too naive) way to do so is to look for backslashes:
literateFile = many codeOrProse <* eof

code = do string "\\begin{code}"
          content <- many $ noneOf "\\"
          string "\\end{code}"
          return $ Haskell content

prose = do content <- many1 $ noneOf "\\"
           return $ Text content
Now, you don't completely have the desired result, because the Haskell part will also contain newlines, but you can filter these out quite easily (given a function filterNewlines, you could say content <- filterNewlines <$> many (noneOf "\\")).
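(filterNewlines is not defined in the answer; an obvious definition would be the following.)

-- hypothetical helper: drop the newline characters from the parsed code
filterNewlines :: String -> String
filterNewlines = filter (/= '\n')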
Edit
Okay, I think I found a solution (requires the newest Parsec version, because of lookAhead):
import Text.ParserCombinators.Parsec
import Control.Applicative hiding (many, (<|>))

main
  = do contents <- readFile "hello.lhs"
       let results = parseLiterate contents
       print results

data Element
  = Text String
  | Haskell String
  deriving (Show)

parseLiterate :: String -> Either ParseError [Element]
parseLiterate input
  = parse literateFile "" input

literateFile
  = many codeOrProse

codeOrProse = code <|> prose

code = do string "\\begin{code}\n"
          c <- untilP (string "\\end{code}\n")
          string "\\end{code}\n"
          return $ Haskell c

prose = do t <- untilP $ (string "\\begin{code}\n") <|> (eof >> return "")
           return $ Text t

untilP p = do s <- many $ noneOf "\n"
              newline
              s' <- try (lookAhead p >> return "") <|> untilP p
              return $ s ++ s'
untilP p parses a line, then checks whether the beginning of the next line can be successfully parsed by p. If so, it returns the empty string; otherwise it goes on. The lookAhead is needed because otherwise the begin/end tags would be consumed, and code couldn't recognize them.
I guess it could still be made more concise (i.e. not having to repeat string "\\end{code}\n" inside code).
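One way to get that conciseness is to let manyTill consume the terminator, so the end tag appears only once (a sketch; the last answer below uses the same idea):

code = do string "\\begin{code}\n"
          c <- anyChar `manyTill` try (string "\\end{code}\n")
          return $ Haskell c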
I haven't tested it, but:
many anyChar can match an empty string
Therefore prose can match an empty string
Therefore codeOrProse can match an empty string
Therefore literateFile can loop forever, matching infinitely many empty strings
Changing prose to match many1 characters might fix this problem.
(I'm not very familiar with Parsec, but how will prose know how many characters it should match? It might consume the whole input, never giving the code parser a second chance to look for the start of a new code segment. Alternatively it might only match one character in each call, making the many/many1 in it useless.)
For reference, here's another version I came up with (slightly expanded to handle other cases):
import Text.ParserCombinators.Parsec
main
  = do contents <- readFile "test.tex"
       let results = parseLiterate contents
       print results

data Element
  = Text String
  | Haskell String
  | Section String
  deriving (Show)

parseLiterate :: String -> Either ParseError [Element]
parseLiterate input
  = parse literateFile "(unknown)" input

literateFile
  = do es <- many elements
       eof
       return es

elements
  = try section
  <|> try quotedBackslash
  <|> try code
  <|> prose

code
  = do string "\\begin{code}"
       c <- anyChar `manyTill` try (string "\\end{code}")
       return $ Haskell c

quotedBackslash
  = do string "\\\\"
       return $ Text "\\\\"

prose
  = do t <- many1 (noneOf "\\")
       return $ Text t

section
  = do string "\\section{"
       content <- many1 (noneOf "}")
       char '}'
       return $ Section content