Why does Java8 grammar example build so strange tree for addition? [duplicate] - antlr4

I'm using the C grammar here: https://github.com/antlr/grammars-v4/tree/master/c to parse the expression int a2 = 5;. ANTLR version is 4.3.
The "5" here matches a very large chain of rules: initializer->assignmentExpression->conditionalExpression->logicalOrExpression->logicalAndExpression->... around 10 more -> primaryExpression->5.
While the parsing is correct eventually, this seems like a bug in the grammar. Can someone suggest fixes or clarifications?

No, it is not a bug. The deeper a rule sits in the tree, the higher the precedence of its operator(s).
EDIT
The rules are probably chained like that because Terence wrote the grammar from the C11 spec (it says so in the comments of the grammar), and the official spec most likely defines them that way. You could, however, rewrite the grammar more compactly. ANTLR4 allows directly left-recursive rules, making the rules:
expr
: add
;
add
: mult (('+'|'-') mult)*
;
mult
: unary (('*'|'/') unary)*
;
unary
: '-' atom
| atom
;
atom
: '(' expr ')'
| NUMBER
;
equivalent to the following single (ANTLR4) rule:
expr
: '-' expr
| expr ('*'|'/') expr // higher precedence than the `expr` alternatives listed below
| expr ('+'|'-') expr
| '(' expr ')'
| NUMBER
;
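To see why a lone number ends up under a chain of rule nodes, here is a minimal Python sketch of a hand-written recursive-descent parser for the layered rules above. It is not ANTLR output: the names `Parser`, `parse` and `depth` are made up for illustration, and a tuple per rule invocation stands in for a parse-tree node.

```python
import re

def tokenize(text):
    # NUMBER tokens plus the single-character operators used by the grammar
    return re.findall(r"\d+|[-+*/()]", text)

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        tok = self.tokens[self.pos]
        self.pos += 1
        return tok

    # expr : add ;
    def expr(self):
        return ("expr", self.add())

    # add : mult (('+'|'-') mult)* ;
    def add(self):
        children = [self.mult()]
        while self.peek() in ("+", "-"):
            children.append(self.take())
            children.append(self.mult())
        return ("add", *children)

    # mult : unary (('*'|'/') unary)* ;
    def mult(self):
        children = [self.unary()]
        while self.peek() in ("*", "/"):
            children.append(self.take())
            children.append(self.unary())
        return ("mult", *children)

    # unary : '-' atom | atom ;
    def unary(self):
        if self.peek() == "-":
            return ("unary", self.take(), self.atom())
        return ("unary", self.atom())

    # atom : '(' expr ')' | NUMBER ;
    def atom(self):
        if self.peek() == "(":
            self.take()
            inner = self.expr()
            self.take()  # closing ')', no error handling in this sketch
            return ("atom", "(", inner, ")")
        return ("atom", self.take())

def parse(text):
    return Parser(tokenize(text)).expr()

def depth(tree):
    # a token (string) is a leaf; every rule node adds one level
    if isinstance(tree, str):
        return 0
    return 1 + max(depth(child) for child in tree[1:])
```

Under these assumptions, `parse("5")` passes through every precedence level and yields five nested nodes (expr, add, mult, unary, atom) for a single token, which is the same effect as the long chain seen with the C grammar.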

The grammar could be designed differently, though, resulting in a shallower tree.
See https://github.com/antlr/grammars-v4/blob/master/java/Java.g4#L497 for an example. It combines many levels of precedence in one rule. I'm not sure whether a similar rule could be written (and would stay readable) for C, but it might be possible.
This kind of rule (using direct left recursion) was not available in earlier versions of ANTLR, so the C grammar might date from a time when such rules could not be written.

Related

How do I disambiguate an OSC address from regular division by a value in ANTLR4?

I have a grammar where I recently added syntax for a constant OSC address --- it looks like this
OSCAddressConstant: ('/' ('A' .. 'Z' | 'a' .. 'z' | '0' .. '9' | '_')+)+;
Typical examples might be
/a/b/c
/Handle/SetValue
/1/Volume/Page3
Unfortunately, I discovered rather quickly that simple expressions with division: e.g.
foo = 20/10
now fail with type errors because the parser thinks that the /10 is an OSC address and so we get "integer" "Divide" "OSCAddressConstant"
What is the recommended (and hopefully simplest) way to disambiguate these, other than changing the actual syntax of the OSC address, which would be a pity?
Thanks in advance
(NB - I saw a similar question about ambiguity between division and regular expression syntax but I did not understand the solution - there was a reference to the use of #member but it was unclear what to do with it - I've not seen that before and other questions about #member seem to have gone unanswered)
That OSCAddressConstant rule is rather a higher level rule, like a complex identifier, possibly qualified. Such higher level constructs should go into the parser, not the lexer.
Just like you would define a qualified identifier as:
ID: [a-zA-Z][a-zA-Z0-9]*;
DOT: '.';
qualified: ID (DOT ID)?;
you can define your OSC address as:
EID: [a-zA-Z0-9_]+;
DIV: '/';
oscAddressConstant: (DIV EID)+;
The only drawback of this approach is that, if you normally skip whitespace, this syntax also allows constructs like: / abc / 12. If you do not want that, check for whitespace in the semantic phase and raise an error there.
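The lexing behaviour behind this question can be reproduced with a small longest-match tokenizer. The Python sketch below is only an approximation of how an ANTLR lexer picks tokens (longest match wins, ties go to the rule defined first). The `INT` rule and the two rule tables `LEXER_LEVEL` and `PARSER_LEVEL` are assumptions made for illustration, since the full grammar is not shown.

```python
import re

def tokenize(text, rules):
    """Longest match wins; on a tie the rule listed earlier wins."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():  # whitespace is skipped, as in many grammars
            pos += 1
            continue
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = pattern.match(text, pos)
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

# Design from the question: the whole OSC address is one lexer token.
LEXER_LEVEL = [
    ("OSC", re.compile(r"(?:/[A-Za-z0-9_]+)+")),
    ("INT", re.compile(r"[0-9]+")),  # hypothetical integer rule
    ("DIV", re.compile(r"/")),
]

# Design from the answer: only small tokens; the address is a parser rule.
PARSER_LEVEL = [
    ("INT", re.compile(r"[0-9]+")),  # hypothetical integer rule
    ("EID", re.compile(r"[A-Za-z0-9_]+")),
    ("DIV", re.compile(r"/")),
]
```

With the first table, `tokenize("20/10", LEXER_LEVEL)` gives an `INT` followed by an `OSC` token, matching the error in the question. With the second table the same input becomes `INT`, `DIV`, `INT`. Note that under this assumed rule order a purely numeric segment such as the `1` in /1/Volume/Page3 also lexes as `INT`, so a parser rule for the address would have to accept both token types after `DIV`.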

ANTLR sub rule does not match when the parent rule is missing following token

Consider the following simple grammar using ANTLR 4.7.1.
grammar Grammar;
ID: [a-z];
DOT: '.';
LPAREN: '(';
RPAREN: ')';
SEMICOLON: ';';
LT: '<';
GT: '>';
term
: ID LT ID GT LPAREN expr RPAREN # CallExpr
| ID # Id
| LPAREN expr RPAREN # ParenExpr
;
expr
: term DOT? # PrimaryExpr
| expr bop=(GT | LT) expr # BinaryExpr
;
update : expr SEMICOLON ;
When matching the snippet a<b>(c) against the rule expr, ANTLR reports an ambiguity, as the expression can be either a PrimaryExpr or a BinaryExpr one of whose operands is also a BinaryExpr. That's expected, and it is a feature of the language we are developing. Thanks to the priority, the parser prefers the former, which is what we want.
When a<b>(c); is matched against update, everything also works as expected - there is an ambiguity, but the PrimaryExpr has precedence.
However, when I try to match a<b>(c) against update, I expect the same ambiguity to be reported, in addition to the missing semicolon. What happens instead is that only the BinaryExpr rule is matched. This is an issue because such a snippet can appear in code that is still being written, and it causes the editor plugin's autocomplete (and other features) to work incorrectly. Any pointers as to why that happens and how to fix it?
Things I've experimented with, that merely increased the confusion further:
When the BinaryExpr rule is removed, a<b>(c) matches update through PrimaryExpr (with a missing semicolon). How can ANTLR fall back to a different derivation when I remove the one it uses, yet not report an ambiguity?
When DOT? is removed from PrimaryExpr, the issue is gone.
When using custom reporting with update : expr (SEMICOLON | {notifyErrorListeners("Missing ';'");}), the issue is gone.
We can use the third option to fix the issue, and it doesn't seem to break anything in our test suite, but I strongly feel that I am not fixing the root cause and am missing something fundamental that will bite us later.
I've found a somewhat similar issue, https://github.com/antlr/antlr4/issues/1545, but that one is already fixed.

Converting the Natty grammar from ANTLR3 to ANTLR4

As I'm new to ANTLR, I have plenty of problems with syntactic predicates.
I've been trying to convert this grammar, which is part of the Natty grammar, so that I can parse it with ANTLR4, and I am really confused about how to change it in a meaningful way.
date_time
: (
(date)=>date (date_time_separator explicit_time)?
| explicit_time (time_date_separator date)?
) -> ^(DATE_TIME date? explicit_time?)
| relative_time -> ^(DATE_TIME relative_time?)
;
Syntactic predicates and rewrite rules are no longer supported in ANTLR4. ANTLR4's parsing algorithm should be powerful enough not to need syntactic predicates, and if you want to traverse the parse tree, have a look at these links:
ANTLR4 visitor pattern on simple arithmetic example
https://github.com/antlr/antlr4/blob/master/doc/tree-matching.md
So, the rule you posted would look like this in ANTLR4:
date_time
: date ( date_time_separator explicit_time )?
| explicit_time ( time_date_separator date )?
| relative_time
;

How does the sample grammar on the antlr4 home page work?

Calculator math operator precedence is often remembered with the mnemonic PMDAS.
The grammar on the ANTLR home page (using the same abbreviations) has the order MDASP. This isn't PMDAS or reverse PMDAS, as I would expect. E.g. this stackoverflow answer contains a grammar that looks like PMDAS.
But no matter what expressions I put into the command line, the parse tree looks correct!
grammar Expr;
prog: (expr NEWLINE)* ;
expr: expr ('*'|'/') expr
| expr ('+'|'-') expr
| INT
| '(' expr ')'
;
NEWLINE : [\r\n]+ ;
INT : [0-9]+ ;
How does this work?
The question is a little tricky to answer, as I'm not entirely sure what you were trying to parse, but pseudocode for what this grammar expects may help you understand it:
An INT is one or more digits from 0 to 9
A NEWLINE is one or more \r or \n characters
an (expr)ession is made up of any of these:
an expression, a '*' or '/', and another expression
an expression, a '+' or '-', and another expression
an INT
an opening parenthesis, an expression, and a closing parenthesis
the (prog)ram is made up of zero or more expressions, each followed by a NEWLINE.
Also remember that ANTLR:
goes for the longest sequence first. If two or more rules match the longest possible sequence, then it chooses the lexical rule specified first
This link may be very useful to you. Anyway if you post the tree you are struggling to understand we could try and help you further. Good luck with your project.
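As for why the order MDASP still yields correct trees: in a left-recursive rule like expr, ANTLR 4 resolves the operator ambiguity in favour of the alternative listed first, so '*' and '/' bind tighter than '+' and '-'. INT and the parenthesised form are not operators, so their position in the list does not affect precedence. The Python sketch below shows the general idea with precedence climbing. It is an illustration under that assumption, not the parser ANTLR generates, and `ALTERNATIVES`, `binding_power` and `parse` are hypothetical names.

```python
import re

# Operator groups in the order the alternatives appear in the grammar:
# the earlier group binds tighter.
ALTERNATIVES = [("*", "/"), ("+", "-")]

def binding_power(op):
    for index, group in enumerate(ALTERNATIVES):
        if op in group:
            return len(ALTERNATIVES) - index
    return 0  # not a binary operator (for example ')')

def tokenize(text):
    return re.findall(r"\d+|[-+*/()]", text)

def parse(tokens, min_power=1):
    left = parse_primary(tokens)
    while tokens and binding_power(tokens[0]) >= min_power:
        op = tokens.pop(0)
        # left-associative: the right operand must bind strictly tighter
        right = parse(tokens, binding_power(op) + 1)
        left = (left, op, right)
    return left

def parse_primary(tokens):
    tok = tokens.pop(0)
    if tok == "(":
        inner = parse(tokens)
        tokens.pop(0)  # discard ')', no error handling in this sketch
        return inner
    return int(tok)
```

For example, `parse(tokenize("1+2*3"))` groups the multiplication first, while `parse(tokenize("(1+2)*3"))` keeps the parenthesised sum together, even though the parenthesis alternative comes last in the grammar.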

Why is this left-recursive and how do I fix it?

I'm learning ANTLR4 and I'm confused at one point. For a Java-like language, I'm trying to add rules for constructs like member chaining, something like that:
expr1.MethodCall(expr2).MethodCall(expr3);
I'm getting an error, saying that two of my rules are mutually left-recursive:
expression
: literal
| variableReference
| LPAREN expression RPAREN
| statementExpression
| memberAccess
;
memberAccess: expression DOT (methodCall | fieldReference);
I thought I understood why the above rule combination is considered left-recursive: because memberAccess is a candidate of expression and memberAccess starts with an expression.
However, my understanding broke down when I saw (by looking at the Java example) that if I just move the contents of memberAccess to expression, I got no errors from ANTLR4 (even though it still doesn't parse what I want, seems to fall into a loop):
expression
: literal
| variableReference
| LPAREN expression RPAREN
| statementExpression
| expression DOT (methodCall | fieldReference)
;
Why is the first example left-recursive but the second isn't?
And what do I have to do to actually parse the initial line?
The second is left-recursive but not mutually left-recursive. ANTLR4 can eliminate directly left-recursive rules with a built-in algorithm. It cannot eliminate mutually left-recursive rules. An algorithm for that probably exists, but it would hardly preserve actions and semantic predicates.
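The difference between the two grammars can be checked mechanically. The Python sketch below is a simplified illustration, not ANTLR's own analysis: it only looks at the first symbol of each alternative, ignores rules that can match the empty string, and collapses `(methodCall | fieldReference)` into a single hypothetical symbol `member`. Symbols that are not keys of the dictionary are treated as tokens.

```python
def leftmost_edges(grammar):
    """Map each rule to the rules that can start one of its alternatives."""
    return {
        rule: {alt[0] for alt in alternatives if alt and alt[0] in grammar}
        for rule, alternatives in grammar.items()
    }

def classify_left_recursion(grammar):
    edges = leftmost_edges(grammar)
    result = {}
    for start in grammar:
        direct = start in edges[start]
        # look for a cycle back to `start` that passes through another rule
        mutual = False
        stack = [r for r in edges[start] if r != start]
        seen = set()
        while stack:
            rule = stack.pop()
            if rule in seen:
                continue
            seen.add(rule)
            if start in edges[rule]:
                mutual = True
                break
            stack.extend(r for r in edges[rule] if r != start)
        result[start] = "mutual" if mutual else "direct" if direct else "none"
    return result

# First version from the question: expression -> memberAccess -> expression
FIRST = {
    "expression": [
        ["literal"], ["variableReference"],
        ["LPAREN", "expression", "RPAREN"],
        ["statementExpression"], ["memberAccess"],
    ],
    "memberAccess": [["expression", "DOT", "member"]],
}

# Second version: the recursion is inlined into expression itself
SECOND = {
    "expression": [
        ["literal"], ["variableReference"],
        ["LPAREN", "expression", "RPAREN"],
        ["statementExpression"], ["expression", "DOT", "member"],
    ],
}
```

Here `classify_left_recursion(FIRST)` reports both rules as "mutual", while `classify_left_recursion(SECOND)` reports expression as "direct", which is the case ANTLR4 is able to rewrite.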
For some reason, ANTLRWorks 2 was not responding when my grammar had left-recursion, causing me to (erroneously) believe that my grammar was wrong.
Compiling and testing from the command line revealed that the version with immediate left-recursion did, in fact, compile and parse correctly.
(I'm leaving this here in case anyone else is confused by the behavior of the IDE.)
