Why is this left-recursive and how do I fix it? - antlr4

I'm learning ANTLR4 and I'm confused about one point. For a Java-like language, I'm trying to add rules for constructs like member chaining, something like this:
expr1.MethodCall(expr2).MethodCall(expr3);
I'm getting an error, saying that two of my rules are mutually left-recursive:
expression
: literal
| variableReference
| LPAREN expression RPAREN
| statementExpression
| memberAccess
;
memberAccess: expression DOT (methodCall | fieldReference);
I thought I understood why the above rule combination is considered left-recursive: because memberAccess is a candidate of expression and memberAccess starts with an expression.
However, my understanding broke down when I saw (by looking at the Java example) that if I just move the contents of memberAccess into expression, I get no errors from ANTLR4 (even though it still doesn't parse what I want and seems to fall into a loop):
expression
: literal
| variableReference
| LPAREN expression RPAREN
| statementExpression
| expression DOT (methodCall | fieldReference)
;
Why is the first example left-recursive but the second isn't?
And what do I have to do to actually parse the initial line?

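The second is left-recursive, but only directly: the rule refers to itself as the first element of one of its own alternatives. It is not mutually left-recursive. ANTLR4 can eliminate direct left recursion with a built-in rewriting algorithm. It cannot eliminate mutual (indirect) left recursion, where the recursion passes through another rule, as it does between expression and memberAccess in your first version. An algorithm for that probably exists, but it would hardly preserve actions and semantic predicates.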
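For illustration, here is a minimal, untested sketch of a directly left-recursive rule that should accept the chained call from the question. The question does not show literal, methodCall and the other sub-rules, so argumentList, the alternative labels, and the ID, INT and WS tokens below are placeholders of mine, not the asker's actual grammar.

```antlr
grammar Chain;   // hypothetical grammar name

statement : expression ';' ;

// Every recursive alternative starts with `expression` itself, so the
// recursion is direct and ANTLR4 can rewrite it internally.
expression
    : expression '.' ID '(' argumentList? ')'   # MethodCall
    | expression '.' ID                         # FieldAccess
    | '(' expression ')'                        # Paren
    | ID                                        # Variable
    | INT                                       # Number
    ;

argumentList : expression (',' expression)* ;

ID  : [a-zA-Z_] [a-zA-Z_0-9]* ;
INT : [0-9]+ ;
WS  : [ \t\r\n]+ -> skip ;
```

With this layout, expr1.MethodCall(expr2).MethodCall(expr3); should come out as a MethodCall whose receiver is another MethodCall whose receiver is the Variable expr1. Moving the first two alternatives into a separate memberAccess rule would bring back the mutual left recursion error.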

For some reason, ANTLRWorks 2 was not responding when my grammar had left-recursion, causing me to (erroneously) believe that my grammar was wrong.
Compiling and testing from the command line revealed that the version with immediate left recursion did, in fact, compile and parse correctly.
(I'm leaving this here in case anyone else is confused by the behavior of the IDE.)

Related

How do I disambiguate OSC addresses from regular division by a value in ANTLR4?

I have a grammar where I recently added syntax for a constant OSC address --- it looks like this
OSCAddressConstant: ('/' ('A' .. 'Z' | 'a' .. 'z' | '0' .. '9' | '_')+)+;
Typical examples might be
/a/b/c
/Handle/SetValue
/1/Volume/Page3
Unfortunately, I discovered rather quickly that simple expressions with division: e.g.
foo = 20/10
now fail with type errors because the parser thinks that the /10 is an OSC address and so we get "integer" "Divide" "OSCAddressConstant"
What is the recommended (and hopefully simplest) way to disambiguate these, other than changing the actual syntax of the OSC address, which would be a pity?
Thanks in advance
(NB - I saw a similar question about ambiguity between division and regular expression syntax but I did not understand the solution - there was a reference to the use of @members but it was unclear what to do with it - I've not seen that before and other questions about @members seem to have gone unanswered)
That OSCAddressConstant rule is rather a higher level rule, like a complex identifier, possibly qualified. Such higher level constructs should go into the parser, not the lexer.
Just like you would define a qualified identifier as:
ID: [a-zA-Z][a-zA-Z0-9]*;
DOT: '.';
qualified: ID (DOT ID)?;
you can define your OSC address as:
EID: [a-zA-Z0-9_]+;
DIV: '/';
oscAddressConstant: (DIV EID)+;
The only drawback with this approach is that, if you ignore whitespace as usual, this syntax will also allow constructs like / abc / 12. If that is something you do not want, check for whitespace in the semantic phase and report an error there.
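One thing to watch when putting this into a real grammar: EID overlaps with whatever integer and identifier tokens the grammar already has, and the lexer picks a token type without looking at parser context. Below is an untested sketch of one way the pieces might fit together. The statement and expr rules, the oscSegment helper, and all token names except DIV and EID are assumptions of mine, not taken from the asker's grammar.

```antlr
grammar OscSketch;   // hypothetical grammar name

statement : ID '=' expr ;

expr
    : expr (MUL | DIV) expr
    | expr (ADD | SUB) expr
    | oscAddress
    | INT
    | ID
    ;

// The address is assembled in the parser from DIV plus a segment.
oscAddress : (DIV oscSegment)+ ;

// "10" lexes as INT and "Volume" as ID because those rules come first;
// EID only wins for text such as "1abc" that neither of them matches fully.
oscSegment : ID | INT | EID ;

MUL : '*' ;
DIV : '/' ;
ADD : '+' ;
SUB : '-' ;
INT : [0-9]+ ;
ID  : [a-zA-Z_] [a-zA-Z0-9_]* ;
EID : [a-zA-Z0-9_]+ ;
WS  : [ \t\r\n]+ -> skip ;
```

In this sketch foo = 20/10 can only be read as a division, because an address cannot directly follow the operand 20. An input like foo = /a/b is still ambiguous on paper (the address /a/b, or the address /a divided by b). Since the + loop is greedy I would expect the single-address reading, but that is worth checking against your own grammar.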

Why does the Java8 grammar example build such a strange tree for addition? [duplicate]

I'm using the C grammar here: https://github.com/antlr/grammars-v4/tree/master/c to parse the expression int a2 = 5;. ANTLR version is 4.3.
The "5" here matches a very large chain of rules: initializer->assignmentExpression->conditionalExpression->logicalOrExpression->logicalAndExpression->... around 10 more -> primaryExpression->5.
While the parsing is correct eventually, this seems like a bug in the grammar. Can someone suggest fixes or clarifications?
No, it is not a bug. The deeper a rule sits in the tree, the higher the precedence of its operator(s).
EDIT
The fact that the rules are chained like that is probably because Terence wrote the grammar from the C11 spec (it says so in the comments of the grammar). And in the official specs, the rules are probably written like that. You could rewrite the grammar in a more compact way, however. ANTLR4 allows directly left-recursive rules, which makes the rules:
expr
: add
;
add
: mult (('+'|'-') mult)*
;
mult
: unary (('*'|'/') unary)*
;
unary
: '-' atom
| atom
;
atom
: '(' expr ')'
| NUMBER
;
equivalent to the following single (ANTLR4) rule:
expr
: '-' expr
| expr ('*'|'/') expr // higher precedence than the alternatives starting with `expr` listed below
| expr ('+'|'-') expr
| '(' expr ')'
| NUMBER
;
The grammar could possibly be designed differently though, resulting in a shallower tree.
See https://github.com/antlr/grammars-v4/blob/master/java/Java.g4#L497 for an example. This combines many levels of precedence in one rule. I'm not sure if a similar rule could be created (and would be readable) for C, but it might be possible.
This kind of rule (including direct left recursion) was not available before ANTLR 4, so the C grammar might have been created at a time when this kind of rule was not available.

ANTLR sub rule does not match when the parent rule is missing following token

Consider the following simple grammar using ANTLR 4.7.1.
grammar Grammar;
ID: [a-z];
DOT: '.';
LPAREN: '(';
RPAREN: ')';
SEMICOLON: ';';
LT: '<';
GT: '>';
term
: ID LT ID GT LPAREN expr RPAREN # CallExpr
| ID # Id
| LPAREN expr RPAREN # ParenExpr
;
expr
: term DOT? # PrimaryExpr
| expr bop=(GT | LT) expr # BinaryExpr
;
update : expr SEMICOLON ;
When matching the snippet a<b>(c) against the rule expr, ANTLR reports that there is an ambiguity, as the expression can be either a PrimaryExpr, or a BinaryExpr with one of its operands also being a BinaryExpr. That's expected and it is a feature of the language that we develop. Thanks to the priority, the parser prefers the former, which is what we want.
When a<b>(c); is matched against update, everything also works as expected - there is an ambiguity, but the PrimaryExpr has precedence.
However, when I try to match a<b>(c) against update, I expect that, apart from reporting the missing semicolon, there is the same ambiguity. What happens instead is that only the BinaryExpr rule is matched. This is an issue, because such a snippet can appear in code that is still being written, and it causes the editor plugin's autocomplete (and other features) to work incorrectly. Any pointers to why that happens and how to fix it?
Things I've experimented with, that merely increased the confusion further:
When the BinaryExpr rule is removed, then a<b>(c) matches update through PrimaryExpr (with missing semicolon). How can ANTLR fall back to a different derivation when I remove the one it uses, yet not report an ambiguity?
When DOT? is removed from PrimaryExpr, the issue is gone.
When using custom reporting with update : expr (SEMICOLON | {notifyErrorListeners("Missing ';'");}), the issue is gone.
We can use the third option to fix the issue, it doesn't seem to break anything in our test suite, but I strongly feel I am not fixing the root cause and missing something fundamental which will bite us later.
I've found a somewhat similar issue, https://github.com/antlr/antlr4/issues/1545, but that one is already fixed.

How does the sample grammar on the antlr4 home page work?

Calculator math operator precedence is often remembered by the mnemonic PMDAS.
The grammar on the ANTLR home page (using the same abbreviations) has order MDASP. This isn't PMDAS or reverse PMDAS like I would expect. E.g. this Stack Overflow answer contains a grammar that looks like PMDAS.
But no matter what expressions I put into the command line, the parse tree looks correct!
grammar Expr;
prog: (expr NEWLINE)* ;
expr: expr ('*'|'/') expr
| expr ('+'|'-') expr
| INT
| '(' expr ')'
;
NEWLINE : [\r\n]+ ;
INT : [0-9]+ ;
How does this work?
The question is a little tricky to answer, as I'm not entirely sure what you were trying to parse, but pseudocode for what this grammar expects may help you understand it:
An INT is one or more digits from 0 to 9
A NEWLINE is one or more \r or \n characters
an (expr)ession is made up of any of these:
an expression, then '*' or '/', then another expression
an expression, then '+' or '-', then another expression
an int
an opening parenthesis, then an expression, then a closing parenthesis
the (prog)ram is made up of zero or more expressions, each followed by a new line.
Also remember that the ANTLR lexer goes for the longest match first. If two or more lexer rules match that longest possible sequence, it chooses the lexer rule specified first.
This link may be very useful to you. Anyway if you post the tree you are struggling to understand we could try and help you further. Good luck with your project.
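On the precedence question itself: as far as I know, ANTLR 4 resolves the ambiguity between the alternatives of a directly left-recursive rule in favor of the one listed first. So only the relative order of the operator alternatives matters, and the primary alternatives (INT and the parenthesized form) can sit anywhere. Here is the same grammar from the question again, with comments added by me:

```antlr
grammar Expr;
prog: (expr NEWLINE)* ;
expr: expr ('*'|'/') expr   // M, D: listed first, so they bind tightest
    | expr ('+'|'-') expr   // A, S: listed later, so they bind more loosely
    | INT                   // primary: its position does not affect operator precedence
    | '(' expr ')'          // P: also a primary, so it can safely come last
    ;
NEWLINE : [\r\n]+ ;
INT     : [0-9]+ ;

// For the input 1+2*3 the expr subtree should have the shape
//   (expr (expr 1) + (expr (expr 2) * (expr 3)))
// i.e. the multiplication sits deeper in the tree than the addition.
```

That is why MDASP behaves like PMDAS here: parentheses never compete with the binary operators for an operand, so they do not need to be listed first.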

Solving ambiguous input: mismatched input

I have this grammar:
grammar MkSh;
script
: (statement
| targetRule
)*
;
statement
: assignment
;
assignment
: ID '=' STRING
;
targetRule
: TARGET ':' TARGET*
;
ID
: ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
;
WS
: ( ' '
| '\t'
| '\r'
| '\n'
) -> channel(HIDDEN)
;
STRING
: '\"' CHR* '\"'
;
fragment
CHR
: ('a'..'z'|'A'..'Z'|' ')
;
TARGET
: ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'-'|'/'|'.')+
;
and this input file:
hello="world"
target: CLASSES
When running my parser I'm getting this error:
line 3:6 mismatched input ':' expecting '='
line 3:15 mismatched input ';' expecting '='
This is because the parser is taking "target" as an ID instead of a TARGET. I want the parser to choose the rule based on the separator character (':' vs '=').
How can I get that to happen?
(This is my first Antlr project so I'm open to anything.)
First, you need to know that the word target is matched as an ID token and not as a TARGET token: since you wrote the ID rule before TARGET, the lexer will always recognize it as ID. Notice that the word target fully matches both the ID and the TARGET lexer rules (I'm going to suppose that you are writing a language), meaning that target, which is a keyword, can also be used as an id. The book "The Definitive ANTLR Reference" has a section titled "Treating Keywords As Identifiers" that deals with exactly these kinds of issues. I suggest you take a look at that. Or, if you prefer the quick answer, the solution is to use lexer modes. It would also be better to split the grammar into a parser grammar and a lexer grammar.
As @cantSleepNow alludes to, you've defined a token (TARGET) that is a lexical superset of another token (ID), and then told the lexer to only tokenize a string as TARGET if it cannot be tokenized as ID. All made more obscure by the fact that ANTLR lexing rules look like ANTLR parsing rules, though they are really quite different beasts.
(Warning: writing off the top of my head without testing :-)
Your real project might be more complex, but in the possibly simplified example you posted, you could defer distinguishing the two to the parsing phase, instead of distinguishing them in the lexer:
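id : TARGET
{ /* Java target assumed; the pattern is only an example of a legality check */
  if (!$TARGET.text.matches("[A-Za-z_][A-Za-z0-9_]*")) notifyErrorListeners("not a legal identifier: " + $TARGET.text); }
;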
assignment
: id '=' STRING
;
Seems like that would solve the lexing issue, and allow you to give a more intelligent error message than "syntax error" when a user gets the syntax for ID wrong. The grammar remains ambiguous, but maybe ANTLR roulette will happen to make the choice you prefer in the ambiguous case. Of course, unambiguous grammars tend to make for languages that humans find more readable, and now you can see why the classic makefile syntax requires a newline after an assignment or target rule.
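A variation on the same idea (not what is proposed above, and equally untested): keep both lexer tokens, and let the parser accept an ID anywhere a TARGET is allowed. The target helper rule is a name I made up. The lexer rules are the question's, rewritten with character sets.

```antlr
grammar MkSh;

script     : (assignment | targetRule)* ;
assignment : ID '=' STRING ;
targetRule : target ':' target* ;

// A word shaped like an identifier is lexed as ID (that rule comes first),
// so target positions have to accept both token types.
target     : ID | TARGET ;

ID     : [a-zA-Z_] [a-zA-Z0-9_]* ;
TARGET : [a-zA-Z0-9_\-/.]+ ;
STRING : '"' [a-zA-Z ]* '"' ;
WS     : [ \t\r\n]+ -> channel(HIDDEN) ;
```

With this, target: CLASSES lexes as ID ':' ID and should match targetRule, while hello="world" still matches assignment. The parser then picks the rule from the separator that follows the first word, which is what the question asked for.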
