Is it an ANTLR4 EOF Bug or My Error

This is my trimmed-down ANTLR4 grammar (note that I'm using a constant false in place of my method that returns false):
grammar AnnotProcessor;
cppCompilationUnit: content+? EOF;
content: anything
| {false}? .;
anything: ANY_CHAR;
ANY_CHAR: [_a-zA-Z0-9];
My test file contains only 1 word "hello" and the test results are like these:
D:\work\antlr4\work>java org.antlr.v4.runtime.misc.TestRig AnnotProcessor cppCompilationUnit -tree in.cpp
line 1:5 no viable alternative at input '<EOF>'
(cppCompilationUnit (content (anything h)) (content (anything e)) (content (anything l)) (content (anything l)) (content (anything o)))
D:\work\antlr4\work>java org.antlr.v4.runtime.misc.TestRig AnnotProcessor cppCompilationUnit -tokens in.cpp
[#0,0:0='h',<1>,1:0]
[#1,1:1='e',<1>,1:1]
[#2,2:2='l',<1>,1:2]
[#3,3:3='l',<1>,1:3]
[#4,4:4='o',<1>,1:4]
[#5,5:4='<EOF>',<-1>,1:5]
line 1:5 no viable alternative at input '<EOF>'
Why does it keep saying "line 1:5 no viable alternative at input '<EOF>'" when I add a semantic predicate (albeit a dummy one here) as an alternative? If I remove the alternative with the false semantic predicate, the error disappears as expected.
PS: I'm using antlr-4.0-complete.jar

Yes, this is a bug. In ANTLR 4, the introduction of an alternative starting with {false}? [almost] anywhere in the grammar should not affect the parse result for any [valid] input.
Can you report the issue here:
https://github.com/antlr/antlr4/issues
Edit: This is issue #218, now fixed
https://github.com/antlr/antlr4/issues/218
PS: Your use of the non-greedy operator +? is either unnecessary or incorrect in the cppCompilationUnit rule. If you meant to require at least one content element, you can simply use +. However, what I think you meant to write is zero-or-more content elements: (content+)?, which can be simplified to just content*.

Related

Python ANTLR4 example - Parser doesn't seem to parse correctly

To demonstrate the problem, I'm going to create a simple grammar to merely detect Python-like variables.
I create a virtual environment and install antlr4-python3-runtime in it, as mentioned in "Where can I get the runtime?":
Then, I create a PyVar.g4 file with the following content:
grammar PyVar;
program: IDENTIFIER+;
IDENTIFIER: [a-zA-Z_][a-zA-Z0-9_]*;
NEWLINE: '\n' | '\r\n';
WHITESPACE: [ ]+ -> skip;
Now if I test the grammar with grun, I can see that the grammar detects the variables just fine:
Now I'm trying to write a parser in Python to do just that. I generate the Lexer and Parser, using this command:
antlr4 -Dlanguage=Python3 PyVar.g4
And they're generated with no errors:
But when I use the example provided in "How do I run the generated lexer and/or parser?", I get no output:
What am I not doing right?
There are two problems here.
1. The grammar:
In the line where I had,
program: IDENTIFIER+;
the parser only matches one or more identifiers and never matches a newline. The output you see when running grun is the token stream produced by the lexer, which is why newlines appear among the tokens. So I had to replace the rule with something like this for the parser to accept newlines:
program: (IDENTIFIER | NEWLINE)+;
2. Printing the output of parser
In PyVar.py file, I created a tree with this line:
tree = parser.program()
But I never printed the result, nor did I know how to. The OP's comment on this accepted answer suggests using tree.toStringTree(), for example print(tree.toStringTree(recog=parser)) so that rule names appear in the output.
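Point 1 can be illustrated without ANTLR at all. The following is a plain-Python stand-in, not the ANTLR runtime or its generated code: lex mimics the three lexer rules from PyVar.g4, and count_matched is a hypothetical helper that mimics a rule of the form (A | B)+ by consuming tokens from the front for as long as their type is allowed. The sample input is made up.

```python
import re

# Stand-in for the lexer rules in PyVar.g4 (not generated by ANTLR).
TOKEN_RE = re.compile(
    r"(?P<IDENTIFIER>[a-zA-Z_][a-zA-Z0-9_]*)"
    r"|(?P<NEWLINE>\r\n|\n)"
    r"|(?P<WHITESPACE>[ ]+)"
)


def lex(text):
    """Return (type, text) pairs; WHITESPACE is skipped like '-> skip'."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            pos += 1  # assumption: silently drop characters no rule matches
            continue
        if m.lastgroup != "WHITESPACE":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


def count_matched(tokens, allowed):
    """Hypothetical helper: how many leading tokens a (A | B)+ rule consumes."""
    n = 0
    while n < len(tokens) and tokens[n][0] in allowed:
        n += 1
    return n


tokens = lex("foo bar\nbaz\n")
only_identifiers = count_matched(tokens, {"IDENTIFIER"})
with_newlines = count_matched(tokens, {"IDENTIFIER", "NEWLINE"})
print(len(tokens), only_identifiers, with_newlines)
```

The lexer produces NEWLINE tokens either way; only the second rule lets the "parser" move past them.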
Now if we fix those, we can see that it works:

ANTLR4 not reporting ambiguity

Given the following grammar:
grammar ReportAmbiguity;
unit : statements+;
statements :
callStatement+
// '.' // <- uncomment this line
;
callStatement : 'CALL' ID (argsByRef | argsByVal)*;
argsByRef : ('BY' 'REF')? ID+;
argsByVal : 'BY' 'VAL' ID+;
ID : ('A'..'Z')+;
WS : (' '|'\n')+ -> channel(HIDDEN);
When parsing the string "CALL FUNCTION BY VAL A B" through the non-root rule callStatement everything works and the parser correctly reports an ambiguity:
line 1:24 reportAttemptingFullContext d=6 (argsByVal), input='B'
line 1:24 reportAmbiguity d=6 (argsByVal): ambigAlts={1, 2}, input='B'
The parser correctly outputs the tree: (callStatement CALL FUNCTION (argsByVal BY VAL A B)).
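To make the ambiguity concrete, here is a small brute-force sketch in plain Python (it does not use ANTLR and says nothing about how ANTLR's prediction works). The function derivations is a hypothetical helper that enumerates every way the tokens after 'CALL' ID can be grouped into argsByRef and argsByVal according to the rules above.

```python
import re

KEYWORDS = {"CALL", "BY", "REF", "VAL"}


def is_id(tok):
    """ID : ('A'..'Z')+ ; literal keywords take precedence over ID."""
    return re.fullmatch(r"[A-Z]+", tok) is not None and tok not in KEYWORDS


def derivations(tokens):
    """Enumerate all groupings of callStatement's argument tokens."""
    if len(tokens) < 2 or tokens[0] != "CALL" or not is_id(tokens[1]):
        return []

    # (rule name, required prefix); argsByRef has an optional 'BY' 'REF'.
    shapes = [
        ("argsByRef", []),
        ("argsByRef", ["BY", "REF"]),
        ("argsByVal", ["BY", "VAL"]),
    ]

    def groups(i):
        if i == len(tokens):
            return [[]]
        results = []
        for name, prefix in shapes:
            j = i + len(prefix)
            if tokens[i:j] != prefix:
                continue
            k = j
            # ID+ : try every non-empty run of IDs after the prefix.
            while k < len(tokens) and is_id(tokens[k]):
                k += 1
                for rest in groups(k):
                    results.append([(name, tokens[i:k])] + rest)
        return results

    return groups(2)


found = derivations("CALL FUNCTION BY VAL A B".split())
for d in found:
    print(d)
```

For this input the sketch finds two groupings: one where argsByVal takes both A and B, and one where argsByVal takes only A and B becomes a separate argsByRef.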
Now uncomment the line shown above (the 7th) and test everything again.
The parser still outputs the same tree, but the ambiguity reports are gone. Why is the ambiguity no longer reported for this obviously ambiguous grammar and input?
(This is part of a bigger problem. I'm trying to understand this so I can pin down another possible problem with my grammar.)
EDIT 1
Using antlr4 version 4.6.
I've prepared a pet project in github: https://github.com/rslemos/pet-grammars (in module g, type mvn clean test -Dtest=br.eti.rslemos.petgrammars.ReportAmbiguityUnitTest to have the commented version tested; uncomment the 7th line and run it again to see it failing).
EDIT 2
Changed unit: statements*; to unit: statements+;. This change by itself does not affect the original problem. It only allows another experiment (further edit pending).
EDIT 3
Another way to trigger this bug is to change unit: statements+; to unit: statements+ unit;.
Like when adding '.' to statements, this change also makes antlr4 forgo ambiguity detection.
I think this has something to do with an EOF that possibly follows argsByVal.
The first alternative (append '.' to statements) precludes EOF from appearing just after argsByVal.
The second one (append unit to itself) makes it a non-root rule (and it seems that antlr implicitly appends EOF to every root rule).
I always thought antlr4 rules were meant to be invoked any way we liked, with no rule given special treatment, the root rule being so called only because we (the grammar authors) know which rule is the root.
EDIT 4
Could be related to https://github.com/antlr/antlr4/issues/1545.

Can JAPE match paragraph Annotation in LHS?

I'm working on a math word problem solver, and would like to pass whole problems to my GATE Embedded application using JAPE. I'm using GATE IDE to display the output, as well as run the pipeline of GATE components. Each problem will be in its own paragraph, and each document will have several problems on it.
Is there a way to match any paragraph using the JAPE left-hand side regex?
I see three options here (there may be more elegant solutions):
1) Use simple rule like:
Phase: find
Input: Token
Options: control = once
Rule:OneToken
(
{Token}
)
On the RHS you can then get the document text and use a standard Java approach for extracting paragraphs from plain text.
2) Use LHS (if you really want only LHS)
Rule: NewLine
(
({SpaceToken.string=="\n"}) |
({SpaceToken.string=="\r"}) |
({SpaceToken.string=="\n"}{SpaceToken.string=="\r"}) |
({SpaceToken.string=="\r"}{SpaceToken.string=="\n"})
):left
Build a NewLine annotation from this, then write a JAPE rule similar to 1) but with NewLine instead of Token. Take all the NewLine annotations from outputAS and build your Paragraph annotations.
3) Sometimes the correct paragraphs are already present in the Original markups set. In that case you can use the Annotation Set Transfer PR to copy them into the default annotation set.
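Option 1 leaves the paragraph splitting to ordinary code on the RHS. The answer mentions Java; the sketch below shows the same idea in Python for brevity. It assumes paragraphs are separated by at least one blank line (the question does not say), and paragraph_spans is a hypothetical helper name. It returns character offsets alongside the text, since annotations are normally created from start and end offsets.

```python
import re


def paragraph_spans(text):
    """Return (start, end, paragraph) for each blank-line-separated paragraph."""
    spans = []
    offset = 0
    # The capturing group keeps the separators in the result,
    # so the running offset stays in sync with the original text.
    pieces = re.split(r"(\n[ \t\r]*\n)", text)
    for i, piece in enumerate(pieces):
        is_paragraph = i % 2 == 0  # odd entries are the separators
        if is_paragraph and piece.strip():
            start = offset + (len(piece) - len(piece.lstrip()))
            end = offset + len(piece.rstrip())
            spans.append((start, end, text[start:end]))
        offset += len(piece)
    return spans


sample = "A has 3 apples.\nHow many are left?\n\nB runs 5 km."
spans = paragraph_spans(sample)
for start, end, paragraph in spans:
    print(start, end, repr(paragraph))
```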
Why not just use the RegEx Sentence Splitter PR and use Split as the Input in your JAPE rules?

Antlr4 match whole input string or bust

I am new to Antlr4 and have been wracking my brain for some days now about a behaviour that I simply don't understand. I have the following combined grammar and expect it to fail and report an error, but it doesn't:
grammar MWE;
parse: cell EOF;
cell: WORD;
WORD: ('a'..'z')+;
If I feed it the input
a4
I expect the parse to fail, because I want the grammar to match the whole input string and not just part of it, as signified by the EOF. Instead it reports no error (I listen for errors with an error listener implementing the IAntlrErrorListener interface) and gives me the following parse tree:
(parse (cell a) <EOF>)
Why is this?
When the lexer reaches input that no lexer rule matches, its error recovery mechanism is to drop a character and continue with the next one. In your case, the lexer drops the 4 character, so your parser sees the equivalent of this input:
a
The solution is to instruct the lexer to create a token for the dropped character rather than ignore it, and pass that token on to the parser where an error will be reported. In the grammar, this rule takes the following form and is always added as the last rule in the grammar. If you have multiple lexer modes, a rule with this form should appear as the last rule in the default mode as well as the last rule in each extra mode.
ErrChar
: .
;
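The effect of the catch-all rule can be imitated in a few lines of plain Python. This is only a sketch of the behaviour described above, not ANTLR's lexer: tokenize and matches_parse_rule are hypothetical names, and the catch_all flag stands in for the presence of the ErrChar rule.

```python
import re

WORD = re.compile(r"[a-z]+")  # WORD: ('a'..'z')+;


def tokenize(text, catch_all=False):
    """Toy lexer: without catch_all, unmatched characters are dropped."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = WORD.match(text, pos)
        if m:
            tokens.append(("WORD", m.group()))
            pos = m.end()
        else:
            if catch_all:
                # Mimics 'ErrChar : . ;' by turning the character into a token.
                tokens.append(("ErrChar", text[pos]))
            pos += 1
    tokens.append(("EOF", ""))
    return tokens


def matches_parse_rule(tokens):
    """parse: cell EOF; cell: WORD;"""
    return [kind for kind, _ in tokens] == ["WORD", "EOF"]


without_rule = matches_parse_rule(tokenize("a4"))
with_rule = matches_parse_rule(tokenize("a4", catch_all=True))
print(without_rule, with_rule)
```

Without the catch-all, the "parser" never sees the 4 and accepts the input; with it, the extra ErrChar token makes the match fail.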

Parser skips lines

I want to write a simple parser for a subset of Jade, generating some XmlHtml for further processing.
The parser is quite simple, but as often with Parsec, a bit long. Since I don't know if I am allowed to make such long code posts, I have the full working example here.
I've dabbled with Parsec before, but rarely successfully. Right now, I don't quite understand why it seems to swallow following lines. For example, the jade input of
.foo.bar
| Foo
| Bar
| Baz
tested with parseTest tag txt, returns this:
Element {elementTag = "div", elementAttrs = [("class","foo bar")], elementChildren = [TextNode "Foo"]}
My parser seems to be able to deal with any kind of nesting, but never more than one line. What did I miss?
If Parsec cannot match the remaining input, it will stop parsing at that point and simply ignore that input. Here, the problem is that after having parsed a tag, you don't consume the whitespace at the beginning of the line before the next tag, so Parsec cannot parse the remaining input and bails. (There might also be other issues; I can't test the code right now.)
There are many ways of adding something that consumes the spaces. I am not familiar with Jade and don't know how its indentation syntax works, so I cannot tell you which way is the "correct" one, but just adding whiteSpace somewhere at the end of tag should do it.
By the way, you should consider splitting up your parser into a Lexer and Parser. The Lexer produces a token stream like [Ident "bind", OpenParen, Ident "tag", Equals, StringLiteral "longname", ..., Indentation 1, ...] and the parser parses that token stream (Yes, Parsec can parse lists of anything). I think that it would make your job easier/less confusing.
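As a rough illustration of the lexer half of that split, here is a sketch in Python rather than Haskell. The token names Ident, OpenParen, Equals, StringLiteral and Indentation come from the example list above; CloseParen, the sample input, and the choice to measure indentation in leading spaces are assumptions made for the sketch, and real Jade syntax has more cases than this.

```python
import re

TOKEN_RE = re.compile(r'''
    (?P<Ident>[A-Za-z_][A-Za-z0-9_-]*)
  | (?P<StringLiteral>"[^"]*")
  | (?P<OpenParen>\()
  | (?P<CloseParen>\))
  | (?P<Equals>=)
  | (?P<Space>[ ]+)
''', re.VERBOSE)


def lex(source):
    """Turn each non-blank line into an Indentation token plus its tokens."""
    tokens = []
    for line in source.splitlines():
        if not line.strip():
            continue
        stripped = line.lstrip(" ")
        tokens.append(("Indentation", len(line) - len(stripped)))
        pos = 0
        while pos < len(stripped):
            m = TOKEN_RE.match(stripped, pos)
            if m is None:
                raise ValueError(f"unexpected character {stripped[pos]!r}")
            kind = m.lastgroup
            if kind == "StringLiteral":
                tokens.append((kind, m.group()[1:-1]))  # drop the quotes
            elif kind == "Ident":
                tokens.append((kind, m.group()))
            elif kind != "Space":
                tokens.append((kind, None))  # punctuation carries no value
            pos = m.end()
    return tokens


stream = lex('bind(tag="longname")\n  child')
for token in stream:
    print(token)
```

A parser working on this stream no longer has to think about spaces inside a line; indentation arrives as an explicit token at the start of each line.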
