Conduit and Attoparsec - extracting delimited text

Conduit and Attoparsec - extracting delimited text - haskell

Say I have a document with text delimited by Jade-style brackets, like {{foo}}. I've written an Attoparsec parser that seems to extract foo properly:
findFoos :: Parser [T.Text]
findFoos = many $ do
manyTill anyChar (string "{{")
manyTill letter (string "}}")
Testing it shows that it works:
> parseOnly findFoos "{{foo}}"
Right ["foo"]
> parseOnly findFoos "{{foo}} "
Right ["foo"]
Now, with the Data.Conduit.Attoparsec module in conduit-extra, I seem to be running into strange behavior:
> yield "{{foo}}" $= (mapOutput snd $ CA.conduitParser findFoos) $$ CL.mapM_ print
["foo"]
> yield "{{foo}} " $= (mapOutput snd $ CA.conduitParser findFoos) $$ CL.mapM_ print
-- floods stdout with empty lists
Is this the desired behavior? Is there a conduit utility I should be using here? Any help with this would be tremendous!

Because it uses many, findFoos will return [] without consuming input when it doesn't find any delimited text.
On the other hand, conduitParser applies a parser repeatedly on a stream, returning each parsed value until it exhausts the stream.
The problem with "{{foo}} " is that the parser will consume {{foo}}, but the blank space remains unconsumed in the stream, so further invocations of the parser always return [].
If you redefine findFoos to consume one quoted element at a time, including the trailing blanks, it should work:
findFoos' :: Parser String
findFoos' = do
manyTill anyChar (string "{{")
manyTill letter (string "}}") <* skipSpace
Real-world examples will have other characters between bracketed texts, so skipping the "extra stuff" after each parse (without consuming any of the {{ opening braces for the next parse) will be a bit more involved.
Perhaps something like the following will work:
findFoos'' :: Parser String
findFoos'' = do
manyTill anyChar (string "{{")
manyTill letter (string "}}") <* skipMany everythingExceptOpeningBraces
where
-- is there a simpler / more efficient way of doing this?
everythingExceptOpeningBraces =
-- skip one or more non-braces
(skip (/='{') *> skipWhile (/='{'))
<|>
-- skip single brace followed by non-brace character
(skip (=='{') *> skip (/='{'))
<|>
-- skip a brace at the very end
(skip (=='{') *> endOfInput)
(This parser will fail, however, if there aren't any bracketed texts in the stream. Perhaps you could build a Parser (Maybe Text) that returns Nothing in that case.)

Related

Incomplete input using endOfInput parser

Consider the string "(=x250) toto e", and the function:
charToText :: Char -> Text
charToText c = pack [c]
The string is successfully parsed by:
mconcat <$> manyTill (charToText <$> anyChar) (char 'e')
with the expected result "(=x250) toto ".
However, the parser:
mconcat <$> manyTill (charToText <$> anyChar) endOfInput
returns Partial _.
Why is it so ? I thought endOfInput would succeed at the end of the string and stop manyTill (as in the first example).

To get a complete answer, you'll need to provide a fully self-contained example that generates the Result: incomplete input error message, but your Attoparsec parser is working correctly. You can see similar behavior with a much simpler example:
λ> parse (char 'e') "e"
Done "" 'e'
λ> parse endOfInput ""
Partial _
Attoparsec parsers by design allow incremental supply of additional input. When they are run on some (potentially partial) input, they return a Done result if the parser unconditionally succeeds on the supplied input. They return a Partial result if more input is needed to decide if the parser succeeds.
For my example above, char 'e' always successfully parses the partial input "e", no matter what additional input you might decide to supply, hence the result is a Done.
However, endOfInput might succeed on the partial input "", but only if no additional input is going to be supplied. If there is additional input, endOfInput will fail. Because of this, a Partial result is returned.
It's the same for your example. The success of your second parser depends on whether or not additional input is supplied. If there's no additional input, the parser is Done, but if there is additional input, the parser has more to do.
You will either need to arrange to run your parser with parseOnly:
λ> parseOnly (manyTill anyChar endOfInput) "foo"
Right "foo"
or you will feed your parse result an empty bytestring which will indicate that no further input is available:
λ> parse (manyTill anyChar endOfInput) "foo"
Partial _
λ> feed (parse (manyTill anyChar endOfInput) "foo") ""
Done "" "foo"

Parsec start-of-row pattern?

I am trying to parse mediawiki text using Parsec. Some of the constructs in mediawiki markup can only occur at the start of rows (such as the header markup ==header level 2==). In regexp I would use an anchor (such as ^) to find the start of a line.
One attempt in GHCi is
Prelude Text.Parsec> parse (char '\n' *> string "==" *> many1 letter <* string "==") "" "\n==hej=="
Right "hej"
but this is not too good since it will fail on the first line of a file. I feel like this should be a solved problem...
What is the most idiomatic "Start of line" parsing in Parsec?

You can use getPosition and sourceColumn in order to find out the column number that the parser is currently looking at. The column number will be 1 if the current position is at the start of a line (such as at the start of input or after a \n or \r character).
There isn't a built-in combinator for this, but you can easily make it:
import Text.Parsec
import Control.Monad (guard)
startOfLine :: Monad m => ParsecT s u m ()
startOfLine = do
pos <- getPosition
guard (sourceColumn pos == 1)
Now you can write your header parser as:
header = startOfLine *> string "==" *> many1 letter <* string "=="

Probably you can use many (char '\n') instead of just char '\n'. In parser combinators there's no sense of start of the line because they always run at the start of input. The only thing you can do is to check manually which symbols your input can start from. Using many (char '\n') ensures that there only zero or more empty lines before header == my header ==.

Haskell: Parsec: Pipeline of transformers of the whole file

I'm trying to use parsec to read a C/C++/java source file and do a series of transformations on the entire file. The first phase removes strings and the second phase removes comments. (That's because you might get a /* inside a string.)
So each phase transforms a string onto Either String Error, and I want to bind (in the sense of Either) them together to make a pipeline of transformations of the whole file. This seems like a fairly general requirement.
import Text.ParserCombinators.Parsec
commentless, stringless :: Parser String
stringless = fmap concat ( (many (noneOf "\"")) `sepBy` quotedString )
quotedString = (char '"') >> (many quotedChar) >> (char '"')
quotedChar = try (string "\\\"" >> return '"' ) <|> (noneOf "\"")
commentless = fmap concat $ notComment `sepBy` comment
notComment = manyTill anyChar (lookAhead (comment <|> eof))
comment = (string "//" >> manyTill anyChar newline >> spaces >> return ())
<|> (string "/*" >> manyTill anyChar (string "*/") >> spaces >> return ())
main =
do c <- getContents
case parse commentless "(stdin)" c of -- THIS WORKS
-- case parse stringless "(stdin)" c of -- THIS WORKS TOO
-- case parse (stringless `THISISWHATIWANT` commentless) "(stdin)" c of
Left e -> do putStrLn "Error parsing input:"
print e
Right r -> print r
So how can I do this? I tried parserBind but it didn't work.
(In case anybody cares why, I'm trying to do a kind of light parse where I just extract what I want but avoid parsing the entire grammar or even knowing whether it's C++ or Java. All I need to extract is the starting and ending line numbers of all classes and functions. So I envisage a bunch of preprocessing phases that just scrub out comments, #defines/ifdefs, template preambles and contents of parentheses (because of the semicolons in for clauses), then I'll parse for snippets preceding {s (or following }s because of typedefs) and stuff those snippets through yet another phase to get the type and name of whatever it is, then recurse to just the second level to get java member functions.)

You need to bind Either Error, not Parser. You need to move the bind outside the parse, and use multiple parses:
parse stringless "(stdin)" input >>= parse commentless "(stdin)"
There is probably a better approach than what you are using, but this will do what you want.

What's the cleanest way to do case-insensitive parsing with Text.Combinators.Parsec?

I'm writing my first program with Parsec. I want to parse MySQL schema dumps and would like to come up with a nice way to parse strings representing certain keywords in case-insensitive fashion. Here is some code showing the approach I'm using to parse "CREATE" or "create". Is there a better way to do this? An answer that doesn't resort to buildExpressionParser would be best. I'm taking baby steps here.
p_create_t :: GenParser Char st Statement
p_create_t = do
x <- (string "CREATE" <|> string "create")
xs <- manyTill anyChar (char ';')
return $ CreateTable (x ++ xs) [] -- refine later

You can build the case-insensitive parser out of character parsers.
-- Match the lowercase or uppercase form of 'c'
caseInsensitiveChar c = char (toLower c) <|> char (toUpper c)
-- Match the string 's', accepting either lowercase or uppercase form of each character
caseInsensitiveString s = try (mapM caseInsensitiveChar s) <?> "\"" ++ s ++ "\""

Repeating what I said in a comment, as it was apparently helpful:
The simple sledgehammer solution here is to simply map toLower over the entire input before running the parser, then do all your keyword matching in lowercase.
This presents obvious difficulties if you're parsing something that needs to be case-insensitive in some places and case-sensitive in others, or if you care about preserving case for cosmetic reasons. For example, although HTML tags are case-insensitive, converting an entire webpage to lowercase while parsing it would probably be undesirable. Even when compiling a case-insensitive programming language, converting identifiers could be annoying, as any resulting error messages would not match what the programmer wrote.

No, Parsec cannot do that in clean way. string is implemented on top of
primitive tokens combinator that is hard-coded to use equality test
(==). It's a bit simpler to parse case-insensitive character, but you
probably want more.
There is however a modern fork of Parsec, called
Megaparsec which has
built-in solutions for everything you may want:
λ> parseTest (char' 'a') "b"
parse error at line 1, column 1:
unexpected 'b'
expecting 'A' or 'a'
λ> parseTest (string' "foo") "Foo"
"Foo"
λ> parseTest (string' "foo") "FOO"
"FOO"
λ> parseTest (string' "foo") "fo!"
parse error at line 1, column 1:
unexpected "fo!"
expecting "foo"
Note the last error message, it's better than what you can get parsing
characters one by one (especially useful in your particular case). string'
is implemented just like Parsec's string but uses case-insensitive
comparison to compare characters. There are also oneOf' and noneOf' that
may be helpful in some cases.
Disclosure: I'm one of the authors of Megaparsec.

Instead of mapping the entire input with toLower, consider using caseString from Text.ParserCombinators.Parsec.Rfc2234 (from the hsemail package)
Text.ParsecCombinators.Parsec.Rfc2234
p_create_t :: GenParser Char st Statement
p_create_t = do
x <- (caseString "create")
xs <- manyTill anyChar (char ';')
return $ CreateTable (x ++ xs) [] -- refine later
So now x will be whatever case-variant is present in the input without changing your input.
ps: I know that this is an ancient question, I just thought that I would add this as this question came up while I was searching for a similar problem

There is a package name parsec-extra for this purpuse. You need install this package then use 'caseInsensitiveString' parser.
:m Text.Parsec
:m +Text.Parsec.Extra
*> parseTest (caseInsensitiveString "values") "vaLUES"
"values"
*> parseTest (caseInsensitiveString "values") "VAlues"
"values"
Link to package is here:
https://hackage.haskell.org/package/parsec-extra

Parsec - error "combinator 'many' is applied to a parser that accepts an empty string"

I'm trying to write a parser using Parsec that will parse literate Haskell files, such as the following:
The classic 'Hello, world' program.
\begin{code}
main = putStrLn "Hello, world"
\end{code}
More text.
I've written the following, sort-of-inspired by the examples in RWH:
import Text.ParserCombinators.Parsec
main
= do contents <- readFile "hello.lhs"
let results = parseLiterate contents
print results
data Element
= Text String
| Haskell String
deriving (Show)
parseLiterate :: String -> Either ParseError [Element]
parseLiterate input
= parse literateFile "(unknown)" input
literateFile
= many codeOrProse
codeOrProse
= code <|> prose
code
= do eol
string "\\begin{code}"
eol
content <- many anyChar
eol
string "\\end{code}"
eol
return $ Haskell content
prose
= do content <- many anyChar
return $ Text content
eol
= try (string "\n\r")
<|> try (string "\r\n")
<|> string "\n"
<|> string "\r"
<?> "end of line"
Which I hoped would result in something along the lines of:
[Text "The classic 'Hello, world' program.", Haskell "main = putStrLn \"Hello, world\"", Text "More text."]
(allowing for whitespace etc).
This compiles fine, but when run, I get the error:
*** Exception: Text.ParserCombinators.Parsec.Prim.many: combinator 'many' is applied to a parser that accepts an empty string
Can anyone shed any light on this, and possibly help with a solution please?

As sth pointed out many anyChar is the problem. But not just in prose but also in code. The problem with code is, that content <- many anyChar will consume everything: The newlines and the \end{code} tag.
So, you need to have some way to tell the prose and the code apart. An easy (but maybe too naive) way to do so, is to look for backslashes:
literateFile = many codeOrProse <* eof
code = do string "\\begin{code}"
content <- many $ noneOf "\\"
string "\\end{code}"
return $ Haskell content
prose = do content <- many1 $ noneOf "\\"
return $ Text content
Now, you don't completely have the desired result, because the Haskell part will also contain newlines, but you can filter these out quite easily (given a function filterNewlines you could say `content <- filterNewlines <$> (many $ noneOf "\\")).
Edit
Okay, I think I found a solution (requires the newest Parsec version, because of lookAhead):
import Text.ParserCombinators.Parsec
import Control.Applicative hiding (many, (<|>))
main
= do contents <- readFile "hello.lhs"
let results = parseLiterate contents
print results
data Element
= Text String
| Haskell String
deriving (Show)
parseLiterate :: String -> Either ParseError [Element]
parseLiterate input
= parse literateFile "" input
literateFile
= many codeOrProse
codeOrProse = code <|> prose
code = do string "\\begin{code}\n"
c <- untilP (string "\\end{code}\n")
string "\\end{code}\n"
return $ Haskell c
prose = do t <- untilP $ (string "\\begin{code}\n") <|> (eof >> return "")
return $ Text t
untilP p = do s <- many $ noneOf "\n"
newline
s' <- try (lookAhead p >> return "") <|> untilP p
return $ s ++ s'
untilP p parses a line, then checks if the beginning of the next line can be successfully parsed by p. If so, it returns the empty string, otherwise it goes on. The lookAhead is needed, because otherwise the begin\end-tags would be consumed and code couldn't recognize them.
I guess it could still be made more concise (i.e. not having to repeat string "\\end{code}\n" inside code).

I haven't tested it, but:
many anyChar can match an empty string
Therefore prose can match an empty string
Therefore codeOrProse can match an empty string
Therefore literateFile can loop forever, matching infinitely many empty strings
Changing prose to match many1 characters might fix this problem.
(I'm not very familiar with Parsec, but how will prose know how many characters it should match? It might consume the whole input, never giving the code parser a second chance to look for the start of a new code segment. Alternatively it might only match one character in each call, making the many/many1 in it useless.)

For reference, here's another version I came up with (slightly expanded to handle other cases):
import Text.ParserCombinators.Parsec
main
= do contents <- readFile "test.tex"
let results = parseLiterate contents
print results
data Element
= Text String
| Haskell String
| Section String
deriving (Show)
parseLiterate :: String -> Either ParseError [Element]
parseLiterate input
= parse literateFile "(unknown)" input
literateFile
= do es <- many elements
eof
return es
elements
= try section
<|> try quotedBackslash
<|> try code
<|> prose
code
= do string "\\begin{code}"
c <- anyChar `manyTill` try (string "\\end{code}")
return $ Haskell c
quotedBackslash
= do string "\\\\"
return $ Text "\\\\"
prose
= do t <- many1 (noneOf "\\")
return $ Text t
section
= do string "\\section{"
content <- many1 (noneOf "}")
char '}'
return $ Section content

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Conduit and Attoparsec - extracting delimited text - haskell

Related

Incomplete input using endOfInput parser

Parsec start-of-row pattern?

Haskell: Parsec: Pipeline of transformers of the whole file

What's the cleanest way to do case-insensitive parsing with Text.Combinators.Parsec?

Parsec - error "combinator 'many' is applied to a parser that accepts an empty string"

Categories

Resources