I have an attoparsec parser, and tests for it, what annoys me is that if I comment part of the parser and run the tests, the parser doesn't return Left "parse error at line ..." but instead I get Right [].
Note that I'm using parseOnly to make it clear that there'll be no more input.
Otherwise it's nice to get the partially parsed input, it can definitely be useful and I'm glad to have it. However I'd like to be informed that the whole input was not consumed. Maybe to get a character offset of the last consumed letter, or if that's what it takes, at least an option to be returned Left.
If it's relevant, the parser can be found there.
If I comment for instance the line:
<|> PlainText <$> choice (string <$> ["[", "]", "*", "`"])
And run the tests, I get for instance:
1) notes parsing tests parses notes properly
simple test
expected: Right [NormalLine [PlainText "one line* # hello world"]]
but got: Right []
This is from that test.
Depending on if consuming the whole input should be the property of parseNoteDocument or just the tests, I'd extend one or the other with endOfInput or atEnd.
I'd suggest to define a proper Parser for your documents, like
parseNoteDocument' :: Text -> Parsec NoteDocument
parseNoteDocument' = many parseLine
and then define parseNoteDocument in terms of it. Then you can use parseNoteDocument' in the tests by defining a helper that parses a given piece of text using
parseNoteDocument' <* endOfInput
to ensure that the whole input is consumed.
Related
When using the <|> combinator in a Parsec or Megaparsec parser, 'try' maybe needed to force backtracking in case the first parser fails having consumed input. In Parsec I need to use 'try' when parsing strings:
λ: parse (try (string "abc") <|> string "abd") "" "abd"
Right "abd"
Without the 'try', the parse fails as the first parser consumes the 'a', leaving just 'bd' for the second parser which then naturally fails as well.
In Megaparsec, the 'try' is not needed:
λ: parse (string "abc" <|> string "abd") "" "abd"
Right "abd"
and so somehow in Megaparsec the string parser helpfully does not consume input when it fails.
My questions are:
How might I have found out, other than experimenting, that the string parser behaviors are different between Parsec and Megaparsec - I have no seen it documented?
How can I easily (that is, without experimentation) tell in general whether a parser consumes input if it fails?
Thank you.
I would like to write a Parsec parser that would parse a sequence of numbers, returning a sorted list, but failing on encountering a duplicate number (the error should point to the position of the duplicate). For the sorted part I can simply use:
manySorted = manyAccum insert
But how can I trigger a Parsec error if the number is already on the list. It doesn't seem like manyAccum allows that and I couldn't figure out how to make my own clone of manyAccum that would (implementation uses unParser which doesn't seem to be exposed outside of Parsec library).
You could try to obtain the current parser's position by
sourcePos :: Monad m => ParsecT s u m SourcePos
sourcePos = statePos `liftM` getParserState
Then accumulate the positions together with parsed values so that you can report the position of the original.
If you're sure you want the parser to be doing this, you can use the fail method with an error message to fail out of the parser. (Complete with an error location, and trying other <|> possibilities, and so on.)
I am trying to write a Forth interpreter in Haskell. There are many sub problems and categories to accomplish this, however, I am trying to accomplish the most basic of steps, and I have been at it for some time in different approaches. The simple input case I am trying to get to is "25 12 +" -> [37]. I am not worried about the lists in Forth are backwards from Haskell, but I do want to try and accommodate the extensibility of the input string down the road, so I am using Maybe, as if there is an error, I will just do Nothing.
I first tried to break the input string into a list of "words" using Prelude's words function. From there I used Prelude's reads function to turn it into a list of tuples (Int,String). So this works great, up until I get to a command "word", such as the char + in the sample problem.
So how do I parse/interpret the string's command to something I can use?
Do I create a new data structure that has all the Forth commands or special characters? (assuming this, how do I convert it from the string format to that data type?)
Need anything else, just ask. I appreciate the help thinking this through.
read is essentially a very simple string parser. Rather than adapting it, you might want to consider learning to use a parser combinator library such as Parsec.
There are a bunch of different tutorials about parser combinators so you'll probably need to do a bit of reading before they 'click.' However, the first example in this tutorial is quite closely related to your problem.
import Text.Parsec
import Text.Parsec.String
play :: String -> Either ParseError Integer
play s = parse pmain "parameter" s
pmain :: Parser Integer
pmain = do
x <- pnum `chainl1` pplus
eof
return x
pnum = read `fmap` many1 digit
pplus = char '+' >> return (+)
It's a simple parser that evaluates arbitrarily long lists:
*Main> play "1+2+3+4+5"
Right 15
It also produces useful parse errors:
*Main> play "1+2+3+4+5~"
Left "parameter" (line 1, column 10):
unexpected '~'
expecting digit, "+" or end of input
If you can understand this simple parser, you should be able to work out how to adapt it to your particular problem (referring to the list of generic combinators in the documentation for Text.Parsec.Combinator). It will take a little longer at first than using read, but using a proper parsing library will make it much easier to achieve the ultimate goal of parsing Forth's whole grammar.
One common problem I have with Parsec is that it tends to ignore invalid input if it occurs in the "right" place.
As a concrete example, suppose we have integer :: Parser Int, and I write
expression = sepBy integer (char '+')
(Ignore whitespace issues for a moment.)
This correctly parses something like "123+456+789". However, if I feed it "123+456-789", it merrily ignores the illegal "-" character and the trailing part of the expression; I actually wanted an error message telling me about the invalid input, not just having it silently ignore that part.
I understand why this happens; what I'm not sure about is how to fix it. What is the general method for designing parsers that consume all supplied input and succeed only if all of it is a valid expression?
It's actually pretty simple--just ensure it's followed by eof:
parse (expression <* eof) "<interactive>" "123+456-789"
eof matches the end of the input, even if the input is just a string and not a file.
Obviously, this only makes sense at the top level of your parser.
I want to write a simple parser for a subset of Jade, generating some XmlHtml for further processing.
The parser is quite simple, but as often with Parsec, a bit long. Since I don't know if I am allowed to make such long code posts, I have the full working example here.
I've dabbled with Parsec before, but rarely successfully. Right now, I don't quite understand why it seems to swallow following lines. For example, the jade input of
.foo.bar
| Foo
| Bar
| Baz
tested with parseTest tag txt, returns this:
Element {elementTag = "div", elementAttrs = [("class","foo bar")], elementChildren = [TextNode "Foo"]}
My parser seems to be able to deal with any kind of nesting, but never more than one line. What did I miss?
If Parsec cannot match the remaining input, it will stop parsing at that point and simply ignore that input. Here, the problem is that after having parsed a tag, you don't consume the whitespace in the beginning of the line before the next tag, so Parsec cannot parse the remaining input and bails. (There might also be other issues, I can't test the code right now)
There are many ways of adding something that consumes the spaces, but I am not familiar with Jade so I cannot tell you which way is the "correct" way (I don't know how the indentation syntax works) but just adding whiteSpace somewhere at the end of tag should do it.
By the way, you should consider splitting up your parser into a Lexer and Parser. The Lexer produces a token stream like [Ident "bind", OpenParen, Ident "tag", Equals, StringLiteral "longname", ..., Indentation 1, ...] and the parser parses that token stream (Yes, Parsec can parse lists of anything). I think that it would make your job easier/less confusing.