Parser skips lines - haskell

I want to write a simple parser for a subset of Jade, generating some XmlHtml for further processing.
The parser is quite simple, but as often with Parsec, a bit long. Since I don't know if I am allowed to make such long code posts, I have the full working example here.
I've dabbled with Parsec before, but rarely successfully. Right now, I don't quite understand why it seems to swallow following lines. For example, the jade input of
.foo.bar
| Foo
| Bar
| Baz
tested with parseTest tag txt, returns this:
Element {elementTag = "div", elementAttrs = [("class","foo bar")], elementChildren = [TextNode "Foo"]}
My parser seems to be able to deal with any kind of nesting, but never more than one line. What did I miss?

If Parsec cannot match the remaining input, it stops parsing at that point and silently ignores the rest unless the parser demands end of input. Here, the problem is that after parsing a tag you don't consume the whitespace at the beginning of the next line, so Parsec cannot parse the remaining input and stops. (There might also be other issues; I can't test the code right now.)
There are many ways to consume those spaces. I am not familiar with Jade and don't know how its indentation syntax works, so I cannot tell you which way is the "correct" one, but adding whiteSpace somewhere at the end of tag should do it.
By the way, you should consider splitting up your parser into a Lexer and Parser. The Lexer produces a token stream like [Ident "bind", OpenParen, Ident "tag", Equals, StringLiteral "longname", ..., Indentation 1, ...] and the parser parses that token stream (Yes, Parsec can parse lists of anything). I think that it would make your job easier/less confusing.
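To make the failure mode concrete, here is a minimal sketch that reproduces it outside your code. The names textLine, linesBroken and linesFixed are made up for illustration and are not taken from the linked parser; the sketch also ignores Jade's indentation rules and only handles piped text lines.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- One piped text line such as "| Foo".
-- Note that the trailing newline is NOT consumed here.
textLine :: Parser String
textLine = char '|' *> char ' ' *> many (noneOf "\n")

-- Stops after the first line: on the second iteration the input starts
-- with '\n', which textLine cannot match, so many ends successfully and
-- the rest of the input is silently left unparsed.
linesBroken :: Parser [String]
linesBroken = many textLine

-- Consumes the line break (and any following whitespace) after each line,
-- so the next iteration starts at the next '|'.
linesFixed :: Parser [String]
linesFixed = many (textLine <* spaces)
```

In GHCi, parse linesBroken "" "| Foo\n| Bar\n| Baz" gives Right ["Foo"], while the same call with linesFixed gives Right ["Foo","Bar","Baz"].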

Related

Finding the start of an expression when the end of the previous one is difficult to express

I've got a file format that looks a little like this:
blockA {
    uniqueName42 -> uniqueName aWord1 anotherWord "Some text"
    anotherUniqueName -> uniqueName23 aWord2
    blockB {
        thing -> anotherThing
    }
}
Lots more blocks with arbitrary nesting levels.
The lines with the arrow in them define relationships between two things. Each relationship has some optional metadata (multi-word quoted or single word unquoted).
The challenge I'm having is that, because there can be an arbitrary number of metadata items in a relationship, my parser is treating anotherUniqueName as a metadata item from the first relationship rather than the start of the second relationship.
You can see this in the image below. The parser is only recognising one relationshipDeclaration when a second should start with StringLiteral: anotherUniqueName
The parser looks a bit like this:
block
: BLOCK LBRACE relationshipDeclaration* RBRACE
;
relationshipDeclaration
: StringLiteral? ARROW StringLiteral StringLiteral*
;
I'm hoping to avoid lexical modes because the fact that these relationships can appear almost anywhere in the file will leave me up to my eyes in NL+ :-(
Would appreciate any ideas on what options I have. Is there a way to look ahead, spot the '->', for example?
Thanks a million.
Your example certainly looks like the NL is what signals the end of a relationshipDeclaration.
If that's the case, then you'll need NLs to be tokens available to your parse rules so the parser can recognize the end.
As you've alluded to, you could potentially use -> to trigger a different Lexer Mode and generate different tokens for content between the -> and the NL and then use those tokens in your parse rule for relationshipDeclaration.
If it's as simple as your snippet indicates, then just capturing RD_StringLiteral tokens in that lexical mode would probably be easier than handling all the places where you might need to allow for NL. This would be pretty simple as Lexer modes go.
(BTW you can use x+ to get the same effect as x x*)
relationshipDeclaration
: StringLiteral? ARROW RD_StringLiteral+
;
I don't think there's a third option for dealing with this.

ANTLR4 - How to parse content between same string values

I'm trying to write an antlr4 parser rule that can match the content between some arbitrary string values that are same. So far I couldn't find a method to do it.
For example, in the below input, I need a rule to extract Hello and Bye. I'm not interested in extracting xyz though.
TEXT Hello TEXT
TEXT1 Bye TEXT1
TEXT5 xyz TEXT8
As it is very much similar to an XML element grammar, I tried an example for XML Parser given in ANTLR4 XML Grammar, but it parses an input like <ABC> ... </XYZ> without error which is not what I wanted.
I also tried using semantic predicates without much success.
Could anyone please help with a hint on how to match content that is embedded between same strings?
Thank you!
Satheesh
Not sure how this works out performance wise, because of many many checks the parser has to do, but you could try something like:
token:
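// assumes the Java target; labels renamed because $start is a predefined rule attribute
open=IDENTIFIER WORD* close=IDENTIFIER { $open.text.equals($close.text) }?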
;
The part between the curly braces is a validating semantic predicate. The lexer tokens are self-explanatory, I believe.
The more I think about it, it might be better to just tokenize the input and write your own parser that processes the tokens and acts accordingly. That depends, of course, on the complexity of the syntax.

attoparsec: succeeding on part of the input instead of failing

I have an attoparsec parser and tests for it. What annoys me is that if I comment out part of the parser and run the tests, the parser doesn't return Left "parse error at line ..."; instead I get Right [].
Note that I'm using parseOnly to make it clear that there'll be no more input.
Otherwise it's nice to get the partially parsed input, it can definitely be useful and I'm glad to have it. However I'd like to be informed that the whole input was not consumed. Maybe to get a character offset of the last consumed letter, or if that's what it takes, at least an option to be returned Left.
If it's relevant, the parser can be found there.
If I comment for instance the line:
<|> PlainText <$> choice (string <$> ["[", "]", "*", "`"])
And run the tests, I get for instance:
1) notes parsing tests parses notes properly
simple test
expected: Right [NormalLine [PlainText "one line* # hello world"]]
but got: Right []
This is from that test.
Depending on whether consuming the whole input should be a property of parseNoteDocument or just of the tests, I'd extend one or the other with endOfInput or atEnd.
I'd suggest to define a proper Parser for your documents, like
parseNoteDocument' :: Parser NoteDocument
parseNoteDocument' = many parseLine
and then define parseNoteDocument in terms of it. Then you can use parseNoteDocument' in the tests by defining a helper that parses a given piece of text using
parseNoteDocument' <* endOfInput
to ensure that the whole input is consumed.

Convert one full String to ints and words as an interpreter in Haskell

I am trying to write a Forth interpreter in Haskell. There are many subproblems involved, but I am trying to accomplish the most basic of steps, and I have been at it for some time with different approaches. The simple input case I am trying to get to is "25 12 +" -> [37]. I am not worried that lists in Forth are backwards from Haskell, but I do want to accommodate extending the input string down the road, so I am using Maybe: if there is an error, I will just return Nothing.
I first tried to break the input string into a list of "words" using Prelude's words function. From there I used Prelude's reads function to turn it into a list of tuples (Int,String). So this works great, up until I get to a command "word", such as the char + in the sample problem.
So how do I parse/interpret the string's command to something I can use?
Do I create a new data structure that has all the Forth commands or special characters? (assuming this, how do I convert it from the string format to that data type?)
Need anything else, just ask. I appreciate the help thinking this through.
read is essentially a very simple string parser. Rather than adapting it, you might want to consider learning to use a parser combinator library such as Parsec.
There are a bunch of different tutorials about parser combinators so you'll probably need to do a bit of reading before they 'click.' However, the first example in this tutorial is quite closely related to your problem.
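import Text.Parsec
import Text.Parsec.String
play :: String -> Either ParseError Integer
play s = parse pmain "parameter" s
pmain :: Parser Integer
pmain = do
  x <- pnum `chainl1` pplus
  eof
  return x
-- One or more digits, read as an Integer.
pnum :: Parser Integer
pnum = read `fmap` many1 digit
-- The '+' operator, returning the function that chainl1 folds with.
pplus :: Parser (Integer -> Integer -> Integer)
pplus = char '+' >> return (+)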
It's a simple parser that evaluates arbitrarily long lists:
*Main> play "1+2+3+4+5"
Right 15
It also produces useful parse errors:
*Main> play "1+2+3+4+5~"
Left "parameter" (line 1, column 10):
unexpected '~'
expecting digit, "+" or end of input
If you can understand this simple parser, you should be able to work out how to adapt it to your particular problem (referring to the list of generic combinators in the documentation for Text.Parsec.Combinator). It will take a little longer at first than using read, but using a proper parsing library will make it much easier to achieve the ultimate goal of parsing Forth's whole grammar.
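As a rough sketch of how that adaptation might look for the "25 12 +" case: the string is first parsed into a small token type, and the tokens are then folded over a stack. The names Tok, forthToken, program, step and run are made up for this illustration, only + and * are supported, and negative numbers are not handled.

```haskell
import Control.Monad (foldM)
import Text.Parsec
import Text.Parsec.String (Parser)

-- Hypothetical token type: a number literal or a built-in word.
data Tok = Lit Integer | Add | Mul
  deriving (Show, Eq)

-- A single token: digits become a literal, '+' and '*' become words.
forthToken :: Parser Tok
forthToken =  Lit . read <$> many1 digit
          <|> Add <$ char '+'
          <|> Mul <$ char '*'

-- Whitespace-separated tokens; eof rejects anything left unparsed.
program :: Parser [Tok]
program = spaces *> many (forthToken <* spaces) <* eof

-- Apply one token to the stack (the head of the list is the top).
step :: [Integer] -> Tok -> Maybe [Integer]
step st       (Lit n) = Just (n : st)
step (y:x:st) Add     = Just (x + y : st)
step (y:x:st) Mul     = Just (x * y : st)
step _        _       = Nothing   -- stack underflow

-- Parse and evaluate; any parse error or underflow yields Nothing.
run :: String -> Maybe [Integer]
run src = case parse program "forth" src of
  Left _     -> Nothing
  Right toks -> foldM step [] toks
```

With these definitions, run "25 12 +" evaluates to Just [37], and both run "25 +" and run "25 12 ?" evaluate to Nothing. Adding a new Forth word means adding a constructor, one alternative in forthToken and one clause in step.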

Parsec: Consume all input

One common problem I have with Parsec is that it tends to ignore invalid input if it occurs in the "right" place.
As a concrete example, suppose we have integer :: Parser Int, and I write
expression = sepBy integer (char '+')
(Ignore whitespace issues for a moment.)
This correctly parses something like "123+456+789". However, if I feed it "123+456-789", it merrily ignores the illegal "-" character and the trailing part of the expression; I actually wanted an error message telling me about the invalid input, not just having it silently ignore that part.
I understand why this happens; what I'm not sure about is how to fix it. What is the general method for designing parsers that consume all supplied input and succeed only if all of it is a valid expression?
It's actually pretty simple--just ensure it's followed by eof:
parse (expression <* eof) "<interactive>" "123+456-789"
eof matches the end of the input, even if the input is just a string and not a file.
Obviously, this only makes sense at the top level of your parser.
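For comparison, here is a small self-contained sketch of both variants. The definition of integer is an assumption (the question only gives its type), and parseAll and parsePrefix are names made up for this example.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Assumed definition; the question only states integer :: Parser Int.
integer :: Parser Int
integer = read <$> many1 digit

expression :: Parser [Int]
expression = sepBy integer (char '+')

-- Top-level entry point: succeeds only if the whole input is consumed.
parseAll :: String -> Either ParseError [Int]
parseAll = parse (expression <* eof) "<interactive>"

-- The same parser without eof: stops at the first character it cannot use.
parsePrefix :: String -> Either ParseError [Int]
parsePrefix = parse expression "<interactive>"
```

Here parsePrefix "123+456-789" returns Right [123,456], silently dropping "-789", whereas parseAll "123+456-789" returns a Left with a parse error pointing at the '-'. On valid input such as "123+456+789" both return Right [123,456,789].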
