Natty: converting from ANTLR3 to ANTLR4

As I'm new to ANTLR, I have plenty of problems with syntactic predicates.
I've been trying to convert this rule, which is part of the Natty grammar, so that I can parse it with ANTLR4, but I'm confused about how to change it in a meaningful way.
date_time
: (
(date)=>date (date_time_separator explicit_time)?
| explicit_time (time_date_separator date)?
) -> ^(DATE_TIME date? explicit_time?)
| relative_time -> ^(DATE_TIME relative_time?)
;

Syntactic predicates and rewrite rules are no longer supported in ANTLR4. ANTLR4's parsing algorithm should be powerful enough that syntactic predicates are not needed, and if you want to traverse the parse tree, have a look at these links:
ANTLR4 visitor pattern on simple arithmetic example
https://github.com/antlr/antlr4/blob/master/doc/tree-matching.md
So, the rule you posted would look like this in ANTLR4:
date_time
: date ( date_time_separator explicit_time )?
| explicit_time ( time_date_separator date )?
| relative_time
;
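The ANTLR3 rewrite `-> ^(DATE_TIME date? explicit_time?)` has no direct ANTLR4 counterpart, so the reshaping moves into code that walks the parse tree. Below is a minimal sketch of that idea in Python. It does not use the ANTLR runtime: parse nodes are plain `(rule_name, children)` tuples, and `to_ast` is a hypothetical helper invented for this illustration.

```python
def to_ast(parse_node):
    """Mimic `^(DATE_TIME date? explicit_time?)` on a date_time parse node.

    Assumption of this sketch: a parse node is (rule_name, children), where
    each child is itself a (rule_name, text) tuple.
    """
    rule, children = parse_node
    if rule != "date_time":
        raise ValueError(f"expected a date_time node, got {rule!r}")
    by_rule = {child[0]: child for child in children}
    if "relative_time" in by_rule:
        kept = [by_rule["relative_time"]]
    else:
        # Separators are dropped, and date always precedes explicit_time,
        # whichever order they appeared in the input.
        kept = [by_rule[name] for name in ("date", "explicit_time") if name in by_rule]
    return ("DATE_TIME", kept)


time_first = ("date_time", [
    ("explicit_time", "5pm"),
    ("time_date_separator", "on"),
    ("date", "monday"),
])
print(to_ast(time_first))
```

With a real ANTLR4 parser, the same logic would typically live in a visitor or listener method for the `date_time` rule.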

Related

How do I disambiguate OSC addresses from regular division by a value in ANTLR4?

I have a grammar where I recently added syntax for a constant OSC address --- it looks like this
OSCAddressConstant: ('/' ('A' .. 'Z' | 'a' .. 'z' | '0' .. '9' | '_')+)+;
Typical examples might be
/a/b/c
/Handle/SetValue
/1/Volume/Page3
Unfortunately, I discovered rather quickly that simple expressions with division: e.g.
foo = 20/10
now fail with type errors because the parser thinks that the /10 is an OSC address and so we get "integer" "Divide" "OSCAddressConstant"
What is the recommended (and hopefully simplest) way to disambiguate these without changing the actual syntax of the OSC address, which would be a pity?
Thanks in advance
(NB - I saw a similar question about ambiguity between division and regular expression syntax but I did not understand the solution - there was a reference to the use of #member but it was unclear what to do with it - I've not seen that before and other questions about #member seem to have gone unanswered)
That OSCAddressConstant rule is rather a higher level rule, like a complex identifier, possibly qualified. Such higher level constructs should go into the parser, not the lexer.
Just like you would define a qualified identifier as:
ID: [a-zA-Z][a-zA-Z0-9]*;
DOT: '.';
qualified: ID (DOT ID)?;
you can define your OSC address as:
EID: [a-zA-Z0-9_]+;
DIV: '/';
oscAddressConstant: (DIV EID)+;
The only drawback of this approach is that, if you normally skip whitespace, this syntax will also allow constructs like: / abc / 12. If you do not want that, check for whitespace in the semantic phase and raise an error there.
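To see why the lexer-level rule breaks division, it can help to imitate the lexer's longest-match behaviour outside ANTLR. The following Python sketch is only an illustration: the token tables and the `tokenize` function are assumptions made up for this example, loosely mirroring the question's single OSC token and the answer's DIV plus EID tokens.

```python
import re

# Hypothetical token tables for this sketch only. The first mirrors the
# question's lexer (one OSC token); the second mirrors the answer (DIV + EID).
OSC_AS_TOKEN = [
    ("OSC", r"(?:/[A-Za-z0-9_]+)+"),
    ("INT", r"[0-9]+"),
    ("ID", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("ASSIGN", r"="),
    ("DIV", r"/"),
]
OSC_IN_PARSER = [
    ("EID", r"[A-Za-z0-9_]+"),
    ("ASSIGN", r"="),
    ("DIV", r"/"),
]


def tokenize(text, specs):
    """Longest match wins; on a tie the rule listed first wins."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1  # whitespace is skipped, as in many grammars
            continue
        best = None
        for name, pattern in specs:
            match = re.compile(pattern).match(text, pos)
            if match and (best is None or len(match.group()) > len(best[1])):
                best = (name, match.group())
        if best is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        tokens.append(best)
        pos += len(best[1])
    return tokens


print(tokenize("foo = 20/10", OSC_AS_TOKEN))   # "/10" is swallowed as one OSC token
print(tokenize("foo = 20/10", OSC_IN_PARSER))  # "/" and "10" stay separate tokens
```

With the second table the lexer no longer has to guess. Whether `/ 10` is a division or an address segment is left to the parser, which knows the surrounding context.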

Why does Java8 grammar example build so strange tree for addition? [duplicate]

I'm using the C grammar here: https://github.com/antlr/grammars-v4/tree/master/c to parse the expression int a2 = 5;. ANTLR version is 4.3.
The "5" here matches a very large chain of rules: initializer->assignmentExpression->conditionalExpression->logicalOrExpression->logicalAndExpression->... around 10 more -> primaryExpression->5.
While the parsing is correct eventually, this seems like a bug in the grammar. Can someone suggest fixes or clarifications?
No, it is not a bug. The lower a rule sits in the tree, the higher the precedence of its operator(s).
EDIT
The rules are probably chained like that because Terence wrote the grammar from the C11 spec (it says so in the comments of the grammar), and the official spec probably writes the rules that way. You could rewrite the grammar in a more compact way, however. ANTLR4 allows directly left-recursive rules, making the rules:
expr
: add
;
add
: mult (('+'|'-') mult)*
;
mult
: unary (('*'|'/') unary)*
;
unary
: '-' atom
| atom
;
atom
: '(' expr ')'
| NUMBER
;
equivalent to the following single (ANTLR4) rule:
expr
: '-' expr
| expr ('*'|'/') expr // higher precedence than the `expr` alternatives listed below
| expr ('+'|'-') expr
| '(' expr ')'
| NUMBER
;
The grammar could possibly be designed differently, though, resulting in a shallower tree.
See https://github.com/antlr/grammars-v4/blob/master/java/Java.g4#L497 for an example. It combines many levels of precedence in one rule. I'm not sure whether a similar rule could be created (and would be readable) for C, but it might be possible.
This kind of rule (including direct left recursion) was not available in ANTLR versions before ANTLR4, so the C grammar might have been written in a style that dates from when this kind of rule was not available.
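The deep chain from the question can be reproduced with a small recursive-descent sketch of the five chained rules shown earlier (expr, add, mult, unary, atom). This is plain Python, not ANTLR output; the `ChainParser` class and the tuple-based tree shape are assumptions made for illustration.

```python
import re


def tokenize(text):
    # Numbers, operators and parentheses; anything else is ignored.
    return re.findall(r"\d+|[-+*/()]", text)


class ChainParser:
    """Recursive-descent parser with one method per chained rule."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def expr(self):
        return ("expr", self.add())

    def add(self):
        children = [self.mult()]
        while self.peek() in ("+", "-"):
            children += [self.take(), self.mult()]
        return ("add", *children)

    def mult(self):
        children = [self.unary()]
        while self.peek() in ("*", "/"):
            children += [self.take(), self.unary()]
        return ("mult", *children)

    def unary(self):
        if self.peek() == "-":
            return ("unary", self.take(), self.atom())
        return ("unary", self.atom())

    def atom(self):
        if self.peek() == "(":
            self.take()
            inner = self.expr()
            if self.take() != ")":
                raise SyntaxError("expected ')'")
            return ("atom", "(", inner, ")")
        tok = self.take()
        if tok is None or not tok.isdigit():
            raise SyntaxError(f"expected a number, got {tok!r}")
        return ("atom", tok)


def depth(node):
    """Number of rule nodes on the longest path down to a token."""
    if isinstance(node, str):
        return 0
    return 1 + max(depth(child) for child in node[1:])


tree = ChainParser(tokenize("5")).expr()
print(tree)         # ('expr', ('add', ('mult', ('unary', ('atom', '5')))))
print(depth(tree))  # 5
```

Even a lone `5` passes through every precedence level on its way down, which is the same effect the C grammar shows with its much longer chain.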

Localize token for different languages

I'm developing a new grammar with ANTLR. My grammar supports basic math and boolean expressions like "4 equals (2 minuses 2)" or "true", "false". All operators are in natural language. I want to support other natural languages as well. For example, "4 equals 4" is "4 ist 4" in German.
What is the best practice to localize tokens and/or expressions?
In our project we follow this structure. There are files FooLexerBase.g and FooLexerLang1.g, FooLexerLang2.g and so on. The base grammar defines common token rules. Tokens that depend on language are not defined in the base, but can be referred to. These tokens are defined in the language-specific grammars, that all also include the base.
So, basically it looks something like this:
FooLexerBase.g:
lexer grammar FooLexerBase;
...
FLOATING_POINT
: DIGIT+ EXPONENT
| DIGIT+ DECIMAL_SEP DIGIT* EXPONENT?
| DECIMAL_SEP DIGIT+ EXPONENT?;
...
DIGIT and EXPONENT are defined in the base, since they are common, while DECIMAL_SEP is language-specific.
For example, FooLexerGerman.g looks like this:
lexer grammar FooLexerGerman;
import base = FooLexerBase;
...
fragment
DECIMAL_SEP: ',';
...
Finally, parser grammar is common for all languages. It is defined this way:
parser grammar FooParser;
options {
tokenVocab = FooLexerBase;
}
...
It is important not to process FooLexerBase with ANTLR directly, but to run all the other grammars through it.
At runtime you build a parser and pass an appropriate lexer as argument to the constructor. I guess it looks more or less the same in any programming language (we use Java).
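The split between shared and language-specific token definitions can be imitated without ANTLR to show the idea. In this Python sketch the dictionaries stand in for the base and per-language lexer grammars; all names (`BASE_FRAGMENTS`, `LANGUAGE_FRAGMENTS`, `floating_point_pattern`) and the English entry are assumptions added for illustration.

```python
import re

# Hypothetical stand-ins for the grammar files described above.
BASE_FRAGMENTS = {            # plays the role of FooLexerBase
    "DIGIT": r"[0-9]",
    "EXPONENT": r"[eE][+-]?[0-9]+",
}
LANGUAGE_FRAGMENTS = {        # plays the role of FooLexerLang1, FooLexerLang2, ...
    "english": {"DECIMAL_SEP": r"\."},
    "german": {"DECIMAL_SEP": r","},
}

# The three alternatives of FLOATING_POINT, written once for all languages.
FLOATING_POINT_ALTERNATIVES = [
    "{DIGIT}+{EXPONENT}",
    "{DIGIT}+{DECIMAL_SEP}{DIGIT}*(?:{EXPONENT})?",
    "{DECIMAL_SEP}{DIGIT}+(?:{EXPONENT})?",
]


def floating_point_pattern(language):
    """Combine the shared rule with one language's DECIMAL_SEP fragment."""
    fragments = {**BASE_FRAGMENTS, **LANGUAGE_FRAGMENTS[language]}
    alternatives = [alt.format(**fragments) for alt in FLOATING_POINT_ALTERNATIVES]
    return re.compile("|".join(f"(?:{alt})" for alt in alternatives))


german = floating_point_pattern("german")
english = floating_point_pattern("english")
print(bool(german.fullmatch("3,14")))   # True
print(bool(english.fullmatch("3,14")))  # False
```

Choosing the pattern by language at runtime corresponds to passing the appropriate generated lexer to the common parser.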

How does the sample grammar on the antlr4 home page work?

Calculator math operator precedence is often remembered by the mnemonic PMDAS.
The grammar on the ANTLR home page (using the same abbreviations) has order MDASP. This isn't PMDAS or reverse PMDAS like I would expect. E.g. this stackoverflow answer contains a grammar that looks like PMDAS.
But no matter what expressions I put into the command line, the parse tree looks correct!
grammar Expr;
prog: (expr NEWLINE)* ;
expr: expr ('*'|'/') expr
| expr ('+'|'-') expr
| INT
| '(' expr ')'
;
NEWLINE : [\r\n]+ ;
INT : [0-9]+ ;
How does this work?
The question is a little tricky to answer, as I'm not entirely sure what you were trying to parse, but pseudocode for what this grammar expects may help you understand it:
An INT is one or more digits from 0 to 9
A NEWLINE is one or more \r or \n characters
An (expr)ession is made up of any of these:
an expression, a '*' or '/', and another expression
an expression, a '+' or '-', and another expression
an INT
an opening parenthesis, an expression, and a closing parenthesis
The (prog)ram is made up of zero or more expressions, each followed by a NEWLINE.
Also remember that the ANTLR lexer:
goes for the longest sequence first. If two or more rules match the longest possible sequence, it chooses the lexer rule specified first.
This link may be very useful to you. Anyway if you post the tree you are struggling to understand we could try and help you further. Good luck with your project.
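On the precedence question itself: as far as I know, ANTLR4 resolves the ambiguity between left-recursive alternatives in favor of the alternative listed first, so `*` and `/` bind tighter than `+` and `-` simply because their alternative comes first. INT and the parenthesized alternative are not binary operators, so their position in the list does not affect precedence, which is why MDASP works. The Python sketch below imitates that behaviour with precedence climbing; the `LEVEL` table and the function names are assumptions of this sketch, not anything ANTLR exposes.

```python
import re

# Operator groups in the same order as the alternatives of `expr` above:
# an earlier group binds tighter. The numeric levels are an assumption of
# this sketch.
ALTERNATIVES = [("*", "/"), ("+", "-")]
LEVEL = {op: len(ALTERNATIVES) - i
         for i, group in enumerate(ALTERNATIVES) for op in group}


def tokenize(text):
    return re.findall(r"\d+|[-+*/()]", text)


def parse(tokens, min_level=1):
    """Precedence climbing; consumes tokens from the front of the list."""
    left = parse_primary(tokens)
    while tokens and tokens[0] in LEVEL and LEVEL[tokens[0]] >= min_level:
        op = tokens.pop(0)
        # Left associativity: the right operand may only contain
        # operators that bind tighter than `op`.
        right = parse(tokens, LEVEL[op] + 1)
        left = (op, left, right)
    return left


def parse_primary(tokens):
    if not tokens:
        raise SyntaxError("unexpected end of input")
    tok = tokens.pop(0)
    if tok == "(":
        inner = parse(tokens)
        if not tokens or tokens.pop(0) != ")":
            raise SyntaxError("expected ')'")
        return inner
    if not tok.isdigit():
        raise SyntaxError(f"expected a number, got {tok!r}")
    return int(tok)


print(parse(tokenize("1+2*3")))    # ('+', 1, ('*', 2, 3))
print(parse(tokenize("(1+2)*3")))  # ('*', ('+', 1, 2), 3)
```

Swapping the two groups in `ALTERNATIVES` would make `+` bind tighter than `*`, which corresponds to reordering the alternatives in the grammar.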

Why is this left-recursive and how do I fix it?

I'm learning ANTLR4 and I'm confused at one point. For a Java-like language, I'm trying to add rules for constructs like member chaining, something like this:
expr1.MethodCall(expr2).MethodCall(expr3);
I'm getting an error, saying that two of my rules are mutually left-recursive:
expression
: literal
| variableReference
| LPAREN expression RPAREN
| statementExpression
| memberAccess
;
memberAccess: expression DOT (methodCall | fieldReference);
I thought I understood why the above rule combination is considered left-recursive: because memberAccess is a candidate of expression and memberAccess starts with an expression.
However, my understanding broke down when I saw (by looking at the Java example) that if I just move the contents of memberAccess to expression, I got no errors from ANTLR4 (even though it still doesn't parse what I want, seems to fall into a loop):
expression
: literal
| variableReference
| LPAREN expression RPAREN
| statementExpression
| expression DOT (methodCall | fieldReference)
;
Why is the first example left-recursive but the second isn't?
And what do I have to do to actually parse the initial line?
The second is directly left-recursive, but not mutually left-recursive. ANTLR4 can eliminate direct left recursion with a built-in algorithm. It cannot eliminate mutual left recursion. An algorithm for that probably exists, but it could hardly preserve actions and semantic predicates.
For some reason, ANTLRWorks 2 was not responding when my grammar had left-recursion, causing me to (erroneously) believe that my grammar was wrong.
Compiling and testing from commandline revealed that the version with immediate left-recursion did, in fact, compile and parse correctly.
(I'm leaving this here in case anyone else is confused by the behavior of the IDE.)
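Conceptually, a directly left-recursive alternative such as `expression DOT (methodCall | fieldReference)` accepts the same input as a loop of the form `primary (DOT member)*`. The Python sketch below shows that loop form building a left-nested tree for the chained call from the question. It is a hand-written illustration, not ANTLR's generated code, and it assumes every method call takes exactly one argument; the function names and tuple shapes are made up for this example.

```python
import re


def tokenize(text):
    # Identifiers, dots and parentheses; the trailing ';' is ignored.
    return re.findall(r"[A-Za-z_]\w*|[().]", text)


def expect(tokens, expected):
    if not tokens or tokens.pop(0) != expected:
        raise SyntaxError(f"expected {expected!r}")


def parse_expression(tokens):
    """expression : primary (DOT member)* ; builds a left-nested tree."""
    node = parse_primary(tokens)
    while tokens and tokens[0] == ".":
        tokens.pop(0)
        node = parse_member(tokens, node)
    return node


def parse_primary(tokens):
    if not tokens:
        raise SyntaxError("unexpected end of input")
    tok = tokens.pop(0)
    if tok == "(":
        inner = parse_expression(tokens)
        expect(tokens, ")")
        return inner
    if tok in (")", "."):
        raise SyntaxError(f"unexpected {tok!r}")
    return ("var", tok)


def parse_member(tokens, target):
    if not tokens or tokens[0] in ("(", ")", "."):
        raise SyntaxError("expected a member name")
    name = tokens.pop(0)
    if tokens and tokens[0] == "(":
        tokens.pop(0)
        argument = parse_expression(tokens)  # single-argument calls only
        expect(tokens, ")")
        return ("call", target, name, argument)
    return ("field", target, name)


print(parse_expression(tokenize("expr1.MethodCall(expr2).MethodCall(expr3);")))
```

Each pass through the loop wraps the tree built so far as the target of the next member access, which is the left-associative shape the left-recursive rule describes.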

Resources