ANTLR4: what design pattern to follow?

I have an ANTLR4 rule "expression" that can be either "maths" or "comparison", but "comparison" can contain "maths". Here is the concrete grammar:
expression
: ID
| maths
| comparison
;
maths
: maths_atom ((PLUS | MINUS) maths_atom) ? // "?" because in fact there is first multiplication then pow and I don't want to force a multiplication to make an addition
;
maths_atom
: NUMBER
| ID
| OPEN_PAR expression CLOSE_PAR
;
comparison
: comp_atom ((EQUALS | NOT_EQUALS) comp_atom) ?
;
comp_atom
: ID
| maths // here is the expression of interest
| OPEN_PAR expression CLOSE_PAR
;
If I give, for instance, 6 as input, the parse tree is fine, because it detects maths. But the ANTLR4 plugin for IntelliJ IDEA marks my expression rule in red: ambiguity. Should I say goodbye to a short parse tree and allow maths only through comparison in expression, so that it is no longer ambiguous?

The problem is that when the parser sees 6, which is a NUMBER, there are two paths through your grammar that reach it:
expression - maths - maths_atom - NUMBER
or
expression - comparison - comp_atom - maths - maths_atom - NUMBER
This ambiguity triggers the error that you see.
You can fix this by flattening your parser grammar as shown in this tutorial:
start
: expr EOF
;
expr
: expr (PLUS | MINUS) expr # ADDGRP
| expr (EQUALS | NOT_EQUALS) expr # COMPGRP
| OPEN_PAR expr CLOSE_PAR # PARENGRP
| NUMBER # NUM
| ID # IDENT
;

Related

ANTLR parser to throw exception for "true and or false" statement

I'm using ANTLR 4 and have a fairly complex grammar. I'm trying to simplify here...
Given an expression like true and or false, I want a parsing error, since the operators expect an expression on either side and this input has the form expr operator operator expr.
My reduced grammar is:
grammar MappingExpression;
/* The start rule; begin parsing here.
operator precedence is implied by the ordering in this list */
// =======================
// = PARSER RULES
// =======================
expr:
| op=(TRUE|FALSE) # boolean
| expr op=AND expr # logand
| expr op=OR expr # logor
;
TRUE : 'true';
FALSE : 'false';
WS : [ \t\r\n]+ -> skip; // ignore whitespace
AND : 'and';
OR : 'or';
however, it seems that the parser stops after evaluating true even though it has all four tokens identified (e.g., alt state returned becomes 2 in the parser).
If I can't get a parsing exception (because the parser treats what I consider operators as expressions), then with the entire parse tree I could at least throw a runtime exception for two operators in a row (e.g., 'and' followed by 'or').
Originally, I'd just had:
expr 'and' expr #logand
expr 'or' expr #logor
and this suffered the same parsing problem (stopping early).
You should get a parsing error if you force the parser to consume all tokens by "anchoring" a rule with the built-in EOF token:
parse
: expr EOF
;
This is what I get when parsing the input true and or false:
line 1:9 extraneous input 'or' expecting {'true', 'false'}
line 1:17 missing {'true', 'false'} at '<EOF>'
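To illustrate why the EOF anchor matters, here is a minimal hand-written Java sketch. It is not ANTLR-generated code, and the names (EofAnchorDemo, consumedByExpr, acceptedToEof) are made up for this example. It recognizes a simplified form of the reduced grammar (booleans joined by and/or) and, like the unanchored expr rule, quietly stops once the following tokens no longer continue an expression. Only the anchored variant notices the leftover input.

```java
import java.util.Arrays;
import java.util.List;

final class EofAnchorDemo {
    private final List<String> tokens;
    private int pos = 0;

    private EofAnchorDemo(String input) {
        // Whitespace-separated words stand in for the lexer's tokens.
        this.tokens = Arrays.asList(input.trim().split("\\s+"));
    }

    private boolean isBool(int i) {
        return i < tokens.size()
                && (tokens.get(i).equals("true") || tokens.get(i).equals("false"));
    }

    private boolean isOperator(int i) {
        return i < tokens.size()
                && (tokens.get(i).equals("and") || tokens.get(i).equals("or"));
    }

    // expr : BOOL ((AND | OR) BOOL)* ;
    // Leaves the loop quietly when the next two tokens do not fit.
    private void expr() {
        if (!isBool(pos)) {
            throw new IllegalStateException("expected 'true' or 'false' at token " + pos);
        }
        pos++;
        while (isOperator(pos) && isBool(pos + 1)) {
            pos += 2;
        }
    }

    // parse : expr EOF ;
    private void exprToEof() {
        expr();
        if (pos != tokens.size()) {
            throw new IllegalStateException("extraneous input '" + tokens.get(pos) + "'");
        }
    }

    // Number of tokens the unanchored rule consumes before it stops.
    static int consumedByExpr(String input) {
        EofAnchorDemo p = new EofAnchorDemo(input);
        p.expr();
        return p.pos;
    }

    // True only if the whole input is a single expression.
    static boolean acceptedToEof(String input) {
        try {
            new EofAnchorDemo(input).exprToEof();
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }
}
```

For true and or false, the unanchored rule consumes just the first token and reports nothing, while the anchored one rejects the input. With ANTLR itself, the equivalent step is to call the generated method for the EOF-anchored rule instead of the one for expr.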
Bart Kiers' answer above is correct. I just wanted to provide more details for people working with Java who have experienced incomplete parsing issues.
I had a fairly complex g4 file that defined expr as a series of alternatives, each tagged with a label (the name following a #, which becomes the method name in the ExpressionsVisitor). While this seemed to work, there were situations where I expected parsing errors but received none. There were also situations where only part of the input was interpreted, making it impossible to process the entire input statement.
I repaired the g4 file as follows (the full version is here):
// =======================
// = PARSER RULES
// =======================
expr_to_eof : expr EOF ;
expr:
ID # id
| '*' # field_values
| DESCEND # descendant
| DOLLAR # context_ref
| ROOT # root_path
| ARR_OPEN exprOrSeqList? ARR_CLOSE # array_constructor
| OBJ_OPEN fieldList? OBJ_CLOSE # object_constructor
| expr '.' expr # path
| expr ARR_OPEN ARR_CLOSE # to_array
| expr ARR_OPEN expr ARR_CLOSE # array
| expr OBJ_OPEN fieldList? OBJ_CLOSE # object
| VAR_ID (emptyValues | exprValues) # function_call
| FUNCTIONID varList '{' exprList? '}' # function_decl
| VAR_ID ASSIGN (expr | (FUNCTIONID varList '{' exprList? '}')) # var_assign
| (FUNCTIONID varList '{' exprList? '}') exprValues # function_exec
| op=(TRUE|FALSE) # boolean
| op='-' expr # unary_op
| expr op=('*'|'/'|'%') expr # muldiv_op
| expr op=('+'|'-') expr # addsub_op
| expr op='&' expr # concat_op
| expr op=('<'|'<='|'>'|'>='|'!='|'=') expr # comp_op
| expr 'in' expr # membership
| expr 'and' expr #logand
| expr 'or' expr # logor
| expr '?' expr (':' expr)? # conditional
| expr CHAIN expr # fct_chain
| '(' (expr (';' (expr)?)*)? ')' # parens
| VAR_ID # var_recall
| NUMBER # number
| STRING # string
| 'null' # null
;
Based on Bart's suggestion I added the top rule expr_to_eof, which resulted in a method of that name being added to the MappingExpressionParser. So, in my Expressions class, where before I had called tree = parser.expr(); I now needed to call tree = parser.expr_to_eof(); which returns a ParseTree whose last child is the Token.EOF.
Because my code needed to check some conditions for the first and last steps performed, it was easiest to strip out the <EOF> and get back the ParseTree I had been using (an ExprContext rather than an Expr_to_eofContext) with this statement:
newTree = ((Expr_to_eofContext)tree).expr();
So, overall, it was quite easy to fix a long-standing bug (and others I'd postponed addressing) just by adding the new rule to the .g4 file and changing the calling code so that it parses to end of file (EOF) and then extracts the entire expression that was parsed.
I expect this will allow me to add considerably more functions to JSONata4Java to match the JavaScript version, jsonata.js.
Thanks again Bart!

How to do Priority of Operations (+ * - /) in my grammars?

I defined my own grammar using ANTLR 4 and I want the resulting tree to respect operator precedence (+ * - /).
I found a sample that handles the precedence of * and +, and it works fine.
I tried to edit it to add - and /, but I failed.
The grammar that handles + and * is:
println:PRINTLN expression SEMICOLON {System.out.println($expression.value);};
expression returns [Object value]:
t1=factor {$value=(int)$t1.value;}
(PLUS t2=factor{$value=(int)$value+(int)$t2.value;})*;
factor returns [Object value]: t1=term {$value=(int)$t1.value;}
(MULT t2=term{$value=(int)$value*(int)$t2.value;})*;
term returns [Object value]:
NUMBER {$value=Integer.parseInt($NUMBER.text);}
| ID {$value=symbolTable.get($value=$ID.text);}
| PAR_OPEN expression {$value=$expression.value;} PAR_CLOSE
;
MULT :'*';
PLUS :'+';
MINUS:'-';
DIV:'/' ;
How can I add - and / with the correct precedence?
In ANTLR3 (and ANTLR4) * and / can be given a higher precedence than + and - like this:
println
: PRINTLN expression SEMICOLON
;
expression
: factor ( PLUS factor
| MINUS factor
)*
;
factor
: term ( MULT term
| DIV term
)*
;
term
: NUMBER
| ID
| PAR_OPEN expression PAR_CLOSE
;
But in ANTLR4, this will also work:
println
: PRINTLN expression SEMICOLON
;
expression
: NUMBER
| ID
| PAR_OPEN expression PAR_CLOSE
| expression ( MULT | DIV ) expression // higher precedence
| expression ( PLUS | MINUS ) expression // lower precedence
;
You normally solve this by defining expression, term, and factor production rules. Here's a grammar (specified in EBNF) that implements unary + and unary -, along with the 4 binary arithmetic operators, plus parentheses:
start ::= expression
expression ::= term (('+' term) | ('-' term))*
term ::= factor (('*' factor) | ('/' factor))*
factor ::= (number | group | '-' factor | '+' factor)
group ::= '(' expression ')'
where number is a numeric literal.
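As a rough illustration of how that EBNF turns into code, here is a small hand-written recursive-descent evaluator in Java, with one method per production. This is only a sketch under some assumptions: the class name ArithmeticEvaluator is made up, numbers are non-negative integer literals, division is integer division, and whitespace is simply stripped before parsing.

```java
final class ArithmeticEvaluator {
    private final String src;
    private int pos = 0;

    private ArithmeticEvaluator(String src) {
        // Simplification: drop all whitespace instead of running a real lexer.
        this.src = src.replaceAll("\\s+", "");
    }

    // start ::= expression (and nothing may be left over afterwards)
    static int evaluate(String input) {
        ArithmeticEvaluator p = new ArithmeticEvaluator(input);
        int value = p.expression();
        if (p.pos != p.src.length()) {
            throw new IllegalArgumentException("unexpected input at " + p.pos);
        }
        return value;
    }

    // Consumes the character c if it is next in the input.
    private boolean accept(char c) {
        if (pos < src.length() && src.charAt(pos) == c) {
            pos++;
            return true;
        }
        return false;
    }

    // expression ::= term (('+' term) | ('-' term))*
    private int expression() {
        int value = term();
        while (true) {
            if (accept('+')) value += term();
            else if (accept('-')) value -= term();
            else return value;
        }
    }

    // term ::= factor (('*' factor) | ('/' factor))*
    private int term() {
        int value = factor();
        while (true) {
            if (accept('*')) value *= factor();
            else if (accept('/')) value /= factor();
            else return value;
        }
    }

    // factor ::= number | group | '-' factor | '+' factor
    private int factor() {
        if (accept('-')) return -factor();
        if (accept('+')) return factor();
        if (accept('(')) {
            // group ::= '(' expression ')'
            int value = expression();
            if (!accept(')')) {
                throw new IllegalArgumentException("expected ')' at " + pos);
            }
            return value;
        }
        int start = pos;
        while (pos < src.length() && Character.isDigit(src.charAt(pos))) pos++;
        if (start == pos) {
            throw new IllegalArgumentException("expected number at " + pos);
        }
        return Integer.parseInt(src.substring(start, pos));
    }
}
```

Because term is called from inside expression, * and / are always grouped before + and -, and the loops make all four binary operators left-associative.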

ANTLR4 parentheses and arithmetic

I am parsing an SQL-like language in which I need to handle arithmetic with precedence.
Things could be like this:
(a + b) - c
(a + b) / 1000
a + (b - c)
a + (SELECT...)
(SELECT... ) + (SELECT ...)
etc..
I am using the ANTLR4 listener pattern, and I can't find a way to build a tree representation of these arithmetic clauses.
Grammar parts:
arithmetic_select_clause:
result_column arithmeticExpression result_column # ArithmeticSelect
| result_column arithmeticExpression arithmetic_select_clause # ArithmeticSelect
| arithmetic_select_clause arithmeticExpression result_column # ArithmeticSelect
| '(' arithmetic_select_clause ')' # ArithmeticSelectParentheses
;
arithmeticExpression : '+' # arithmeticsAdd
| '-' # arithmeticsSubtract
| '*' # arithmeticsMultiply
| '/' # arithmeticsDivide
| '%' # arithmeticsModulus
;
I can create a tree using the ANTLR listeners, but I can't handle precedence.
Help, please.
ANTLR can help you there but you need to follow a few rules for it to do so. The arithmeticExpression rule needs to contain both operands and be directly recursive so that ANTLR can figure out how to rewrite it.
Here's an example of what you could do:
expression : '(' expression ')'
| expression op=('*'|'/'|'%') expression
| expression op=('+'|'-') expression
| result_column
| arithmetic_select_clause
;
This rule is left-recursive but ANTLR will rewrite it to eliminate the left-recursion. Relevant docs.
Notice how the levels of precedence are ordered: alternatives listed earlier bind more tightly. Each level gets its own alternative, and operators of the same precedence share one level.
Also, for processing math expressions it's much easier to use a visitor than a listener. ANTLR can generate the base classes for you. It'll be much easier to traverse/process the parse tree in the precedence order this way.
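To show what the visitor style buys you, here is a minimal self-contained Java sketch. It does not use the ANTLR runtime: Expr, Num, Bin, Visitor and Eval are hypothetical stand-ins for the context classes and base visitor that ANTLR would generate from the grammar. The point is that each visit method returns the value of its subtree, so the result of an operator is computed from the results of its operands without any manual stack.

```java
final class VisitorSketch {
    // Stand-in for a generated visitor interface: one method per node kind.
    interface Visitor<T> {
        T visitNum(Num n);
        T visitBin(Bin b);
    }

    // Stand-in for a parse-tree node.
    interface Expr {
        <T> T accept(Visitor<T> v);
    }

    static final class Num implements Expr {
        final int value;
        Num(int value) { this.value = value; }
        public <T> T accept(Visitor<T> v) { return v.visitNum(this); }
    }

    static final class Bin implements Expr {
        final Expr left;
        final char op;
        final Expr right;
        Bin(Expr left, char op, Expr right) {
            this.left = left;
            this.op = op;
            this.right = right;
        }
        public <T> T accept(Visitor<T> v) { return v.visitBin(this); }
    }

    // Evaluates bottom-up: operands first, then the operator that joins them.
    static final class Eval implements Visitor<Integer> {
        public Integer visitNum(Num n) { return n.value; }

        public Integer visitBin(Bin b) {
            int l = b.left.accept(this);
            int r = b.right.accept(this);
            switch (b.op) {
                case '+': return l + r;
                case '-': return l - r;
                case '*': return l * r;
                case '/': return l / r;
                case '%': return l % r;
                default: throw new IllegalArgumentException("unknown operator " + b.op);
            }
        }
    }

    static int eval(Expr e) {
        return e.accept(new Eval());
    }
}
```

A tree shaped like 1 + (2 * 3), which is what the precedence-ordered alternatives would produce for 1 + 2 * 3, evaluates to 7 here. With a listener you would typically have to keep intermediate results on a stack of your own.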

How can non-associative operators like "<" be specified in ANTLR4 grammars?

In a rule expr : expr '<' expr | ...;
the ANTLR parser will accept expressions like 1 < 2 < 3 (and construct left-associative trees corresponding to the bracketing (1 < 2) < 3).
You can tell ANTLR to treat operators as right associative, e.g.
expr : expr '<'<assoc=right> expr | ...;
to yield parse trees 1 < (2 < 3).
However, in many languages, relational operators are non-associative, i.e., an expression 1 < 2 < 3 is forbidden.
This can be specified in YACC and its derivatives.
Can it also be specified in ANTLR?
E.g., as expr : expr '<'<assoc=no> expr | ...;
I was unable to find something in the ANTLR4-book so far.
How about the following approach? Basically, the "result" of a < b has a type that is not compatible with another application of the operator < or >:
expression
: boolExpression
| nonBoolExpression
;
boolExpression
: nonBoolExpression '<' nonBoolExpression
| nonBoolExpression '>' nonBoolExpression
| ...
;
nonBoolExpression
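: nonBoolExpression '*' nonBoolExpression // direct left recursion; ANTLR4 rejects left recursion that goes through expression
| nonBoolExpression '+' nonBoolExpression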
| ...
;
Although personally I'd go with Darien and rather detect the error after parsing.
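For the "detect it after parsing" route, a check over the finished tree could look roughly like the Java sketch below. It is self-contained and does not use ANTLR: Node is a hypothetical stand-in for a parse-tree node, and the operator strings are assumptions made for this example. The check flags any comparison that has another comparison as a direct operand.

```java
import java.util.Arrays;
import java.util.List;

final class ChainedComparisonCheck {
    // Hypothetical stand-in for a parse-tree node: an operator ("" for a leaf) plus its operands.
    static final class Node {
        final String op;
        final List<Node> children;

        Node(String op, Node... children) {
            this.op = op;
            this.children = Arrays.asList(children);
        }

        boolean isComparison() {
            return op.equals("<") || op.equals(">");
        }
    }

    static Node leaf() {
        return new Node("");
    }

    // True if some comparison node has a comparison as a direct operand, e.g. (1 < 2) < 3
    // as built by a left-associative rule for the input 1 < 2 < 3.
    static boolean hasChainedComparison(Node n) {
        for (Node child : n.children) {
            if (n.isComparison() && child.isComparison()) {
                return true;
            }
            if (hasChainedComparison(child)) {
                return true;
            }
        }
        return false;
    }
}
```

In a real ANTLR project the same test would run in a visitor or listener over the generated context classes. Whether an explicitly parenthesized operand should also be rejected is a language design decision; in this sketch a parenthesized group would be its own node kind and therefore pass.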

Struggling to parse array notation

I have a very simple grammar to parse statements.
Here are examples of the type of statements that need be parsed:
a.b.c
a.b.c == "88"
The issue I am having is that array notation is not matching. For example, things that are not working:
a.b[0].c
a[3][4]
I hope someone can point out what I am doing wrong here. (I am testing in ANTLRWorks)
Here is the grammar (generationUnit is my entry point):
grammar RatBinding;
generationUnit: testStatement | statement;
arrayAccesor : identifier arrayNotation+;
arrayNotation: '[' Number ']';
testStatement:
(statement | string | Number | Bool )
(greaterThanAndEqual
| lessThanOrEqual
| greaterThan
| lessThan | notEquals | equals)
(statement | string | Number | Bool )
;
part: identifier | arrayAccesor;
statement: part ('.' part )*;
string: ('"' identifier '"') | ('\'' identifier '\'');
greaterThanAndEqual: '>=';
lessThanOrEqual: '<=';
greaterThan: '>';
lessThan: '<';
notEquals : '!=';
equals: '==';
identifier: Letter (Letter|Digit)*;
Bool : 'true' | 'false';
ArrayLeft: '\u005B';
ArrayRight: '\u005D';
Letter
: '\u0024' |
'\u0041'..'\u005a' |
'\u005f' |
'\u0061'..'\u007a' |
'\u00c0'..'\u00d6' |
'\u00d8'..'\u00f6' |
'\u00f8'..'\u00ff' |
'\u0100'..'\u1fff' |
'\u3040'..'\u318f' |
'\u3300'..'\u337f' |
'\u3400'..'\u3d2d' |
'\u4e00'..'\u9fff' |
'\uf900'..'\ufaff'
;
Digit
: '\u0030'..'\u0039' |
'\u0660'..'\u0669' |
'\u06f0'..'\u06f9' |
'\u0966'..'\u096f' |
'\u09e6'..'\u09ef' |
'\u0a66'..'\u0a6f' |
'\u0ae6'..'\u0aef' |
'\u0b66'..'\u0b6f' |
'\u0be7'..'\u0bef' |
'\u0c66'..'\u0c6f' |
'\u0ce6'..'\u0cef' |
'\u0d66'..'\u0d6f' |
'\u0e50'..'\u0e59' |
'\u0ed0'..'\u0ed9' |
'\u1040'..'\u1049'
;
WS : [ \r\t\u000C\n]+ -> channel(HIDDEN)
;
You referenced the non-existent rule Number in the arrayNotation parser rule.
A Digit rule does exist in the lexer, but it will only match a single-digit number. For example, 1 is a Digit, but 10 is two separate Digit tokens so a[10] won't match the arrayAccesor rule. You probably want to resolve this in two parts:
1. Create a Number token consisting of one or more digits.
Number
: Digit+
;
2. Mark Digit as a fragment rule to indicate that it doesn't form tokens on its own, but is merely intended to be referenced from other lexer rules.
fragment // prevents a Digit token from being created on its own
Digit
: ...
You will not need to change arrayNotation because it already references the Number rule you created here.
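The difference between a single-digit rule and Digit+ can be seen without ANTLR at all. The following Java sketch uses java.util.regex as a rough stand-in for the lexer (NumberTokenDemo and numberTokens are made-up names, and only ASCII digits are considered): the pattern [0-9] plays the role of the original Digit rule, and [0-9]+ plays the role of the new Number rule.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NumberTokenDemo {
    // Returns every match of numberRegex in input, in order, as if each were one token.
    static List<String> numberTokens(String input, String numberRegex) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile(numberRegex).matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }
}
```

For the input a[10], the single-digit pattern yields two tokens, 1 and 0, so a rule expecting one number between the brackets cannot match. The one-or-more pattern yields the single token 10.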
Bah, waste of space. I used Number instead of Digit in my array declaration.

Resources