Disambiguating unary and binary minus in antlr4 grammar - antlr4

I am creating an antlr4 grammar for a moderately simple language. I am struggling to get the grammar to differentiate between unary and binary minus. I have read all the other posts that I can find on this topic here on Stackoverflow, but have found that the answers either apply to antlr3 in ways I cannot figure out how to express in antlr4, or that I seem not to be adept in translating the advice of these answers to my own situation. I often end with the problem that antlr cannot unambiguously resolve the rules if I play around with other alternatives.
Below is the antlr file in its entirety. The ambiguity in this version occurs around the production:
binop_expr
: SUMOP product
| product ( SUMOP product )*
;
(I had originally used UNARY_ABELIAN_OP instead of the second SUMOP, but that led to a different kind of ambiguity — the tool apparently couldn't recognise that it needed to differentiate between the same token in two different contexts. I mention this because one of the posts here recommends using a different name for the unary operator.)
grammar Kant;
program
: type_declaration_list main
;
type_declaration_list
: type_declaration
| type_declaration_list type_declaration
| /* null */
;
type_declaration
: 'context' JAVA_ID '{' context_body '}'
| 'class' JAVA_ID '{' class_body '}'
| 'class' JAVA_ID 'extends' JAVA_ID '{' class_body '}'
;
context_body
: context_body context_body_element
| context_body_element
| /* null */
;
context_body_element
: method_decl
| object_decl
| role_decl
| stageprop_decl
;
role_decl
: 'role' JAVA_ID '{' role_body '}'
| 'role' JAVA_ID '{' role_body '}' REQUIRES '{' self_methods '}'
| access_qualifier 'role' JAVA_ID '{' role_body '}'
| access_qualifier 'role' JAVA_ID '{' role_body '}' REQUIRES '{' self_methods '}'
;
role_body
: method_decl
| role_body method_decl
| object_decl // illegal
| role_body object_decl // illegal — for better error messages only
;
self_methods
: self_methods ';' method_signature
| method_signature
| self_methods /* null */ ';'
;
stageprop_decl
: 'stageprop' JAVA_ID '{' stageprop_body '}'
| 'stageprop' JAVA_ID '{' stageprop_body '}' REQUIRES '{' self_methods '}'
| access_qualifier 'stageprop' JAVA_ID '{' stageprop_body '}'
| access_qualifier 'stageprop' JAVA_ID '{' stageprop_body '}' REQUIRES '{' self_methods '}'
;
stageprop_body
: method_decl
| stageprop_body method_decl
;
class_body
: class_body class_body_element
| class_body_element
| /* null */
;
class_body_element
: method_decl
| object_decl
;
method_decl
: method_decl_hook '{' expr_and_decl_list '}'
;
method_decl_hook
: method_signature
| method_signature CONST
;
method_signature
: access_qualifier return_type method_name '(' param_list ')'
| access_qualifier return_type method_name
| access_qualifier method_name '(' param_list ')'
;
expr_and_decl_list
: object_decl
| expr ';' object_decl
| expr_and_decl_list object_decl
| expr_and_decl_list expr
| expr_and_decl_list /*null-expr */ ';'
| /* null */
;
return_type
: type_name
| /* null */
;
method_name
: JAVA_ID
;
access_qualifier
: 'public' | 'private' | /* null */
;
object_decl
: access_qualifier compound_type_name identifier_list ';'
| access_qualifier compound_type_name identifier_list
| compound_type_name identifier_list /* null expr */ ';'
| compound_type_name identifier_list
;
compound_type_name
: type_name '[' ']'
| type_name
;
type_name
: JAVA_ID
| 'int'
| 'double'
| 'char'
| 'String'
;
identifier_list
: JAVA_ID
| identifier_list ',' JAVA_ID
| JAVA_ID ASSIGN expr
| identifier_list ',' JAVA_ID ASSIGN expr
;
param_list
: param_decl
| param_list ',' param_decl
| /* null */
;
param_decl
: type_name JAVA_ID
;
main
: expr
;
expr
: block
| expr '.' message
| expr '.' CLONE
| expr '.' JAVA_ID
| ABELIAN_INCREMENT_OP expr '.' JAVA_ID
| expr '.' JAVA_ID ABELIAN_INCREMENT_OP
| /* this. */ message
| JAVA_ID
| constant
| if_expr
| for_expr
| while_expr
| do_while_expr
| switch_expr
| BREAK
| CONTINUE
| boolean_expr
| binop_expr
| '(' expr ')'
| <assoc=right> expr ASSIGN expr
| NEW message
| NEW type_name '[' expr ']'
| RETURN expr
| RETURN
;
relop_expr
: sexpr RELATIONAL_OPERATOR sexpr
;
// This is just a duplication of expr. We separate it out
// because a top-down antlr4 parser can't handle the
// left associative ambiguity. It is used only
// for abelian types.
sexpr
: block
| sexpr '.' message
| sexpr '.' CLONE
| sexpr '.' JAVA_ID
| ABELIAN_INCREMENT_OP sexpr '.' JAVA_ID
| sexpr '.' JAVA_ID ABELIAN_INCREMENT_OP
| /* this. */ message
| JAVA_ID
| constant
| if_expr
| for_expr
| while_expr
| do_while_expr
| switch_expr
| BREAK
| CONTINUE
| '(' sexpr ')'
| <assoc=right> sexpr ASSIGN sexpr
| NEW message
| NEW type_name '[' expr ']'
| RETURN expr
| RETURN
;
block
: '{' expr_and_decl_list '}'
| '{' '}'
;
expr_or_null
: expr
| /* null */
;
if_expr
: 'if' '(' boolean_expr ')' expr
| 'if' '(' boolean_expr ')' expr 'else' expr
;
for_expr
: 'for' '(' object_decl boolean_expr ';' expr ')' expr // O.K. — expr can be a block
| 'for' '(' JAVA_ID ':' expr ')' expr
;
while_expr
: 'while' '(' boolean_expr ')' expr
;
do_while_expr
: 'do' expr 'while' '(' boolean_expr ')'
;
switch_expr
: SWITCH '(' expr ')' '{' ( switch_body )* '}'
;
switch_body
: ( CASE constant | DEFAULT ) ':' expr_and_decl_list
;
binop_expr
: SUMOP product
| product ( SUMOP product )*
;
product
: atom ( MULOP atom )*
;
atom
: null_expr
| JAVA_ID
| JAVA_ID ABELIAN_INCREMENT_OP
| ABELIAN_INCREMENT_OP JAVA_ID
| constant
| '(' expr ')'
| array_expr '[' sexpr ']'
| array_expr '[' sexpr ']' ABELIAN_INCREMENT_OP
| ABELIAN_INCREMENT_OP array_expr '[' sexpr ']'
;
null_expr
: NULL
;
array_expr
: sexpr
;
boolean_expr
: boolean_product ( BOOLEAN_SUMOP boolean_product )*
;
boolean_product
: boolean_atom ( BOOLEAN_MULOP boolean_atom )*
;
boolean_atom
: BOOLEAN
| JAVA_ID
| '(' boolean_expr ')'
| LOGICAL_NOT boolean_expr
| relop_expr
;
constant
: STRING
| INTEGER
| FLOAT
| BOOLEAN
;
message
: <assoc=right> JAVA_ID '(' argument_list ')'
;
argument_list
: expr
| argument_list ',' expr
| /* null */
;
// Lexer rules
STRING : '"' ( ~'"' | '\\' '"' )* '"' ;
INTEGER : ('1' .. '9')+ ('0' .. '9')* | '0';
FLOAT : (('1' .. '9')* | '0') '.' ('0' .. '9')* ;
BOOLEAN : 'true' | 'false' ;
SWITCH : 'switch' ;
CASE : 'case' ;
DEFAULT : 'default' ;
BREAK : 'break' ;
CONTINUE : 'continue' ;
RETURN : 'return' ;
REQUIRES : 'requires' ;
NEW : 'new' ;
CLONE : 'clone' ;
NULL : 'null' ;
CONST : 'const' ;
RELATIONAL_OPERATOR : '!=' | '==' | '>' | '<' | '>=' | '<=';
LOGICAL_NOT : '!' ;
BOOLEAN_MULOP : '&&' ;
BOOLEAN_SUMOP : '||' | '^' ;
SUMOP : '+' | '-' ;
MULOP : '*' | '/' ;
ABELIAN_INCREMENT_OP : '++' | '--' ;
JAVA_ID: (('a' .. 'z') | ('A' .. 'Z')) (('a' .. 'z') | ('A' .. 'Z') | ('0' .. '9') | '_')* ;
INLINE_COMMENT: '//' ~[\r\n]* -> channel(HIDDEN) ;
C_COMMENT: '/*' .*? '*/' -> channel(HIDDEN) ;
WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ -> channel(HIDDEN) ;
ASSIGN : '=' ;
Typical of the problem is that the parser can't recognise the unary minus in this expression (it simply does not accept the construct):
Base b1 = new Base(-threetwoone);

Try removing the unary expression from binop_expr, and add it to the expr rule:
expr
: ...
| unary_expr
| binop_expr
| ...
;
unary_expr
: SUMOP binop_expr
;
binop_expr
: product ( SUMOP product )*
;

The problem seems to arise from having two kinds of expressions in the grammar that need better to be grouped with each other instead of being intermixed as much as they are. The overall intent is to create a language where everything is an expression, including, for example, loops and conditionals. (You can use your imagination to deduce the values they generate; however, that is a semantic rather than syntactic issue).
One set of these expressions have nice abelian properties (they behave like ordinary arithmetic types with well-behaved operations). The language specifies how lower-level productions are tied together by operations that give them context: so two identifiers, "a" and "b", can be associated by the operator "+" as "a + b". It's a highly contexutalized grammar.
The other of these sets of expressions comprises productions with a much lower degree of syntactic cohesion. One can combine "if" expressions, "for" expressions, and declarations in any order. The only linguistic property associating them is catenation, as we find in productions for object_body, class_body, and expr_and_decl_list.
The basic problem is that these two grammars became intertwined in the antlr spec. Combined with the fact that operators like "+" are context-sensitive, and that the context can arise either from catenation or the more contextualised abelian alternative, the parser was left with two alternative ways of parsing some expressions. I managed to wiggle around most of the ambiguity, but it got pushed into the area of unary prefix operators. So if "a" is an expression (it is), and "b" is an expression, then the string "a + b" is ambiguous. It can mean either product ( SUMOP product )*, or it can mean expr_and_decl_list expr where expr_and_decl_list was earlier reduced from the expr "a". The expr in expr_and_decl_list expr can reduce from "+b".
I re-wrote the grammar, inspired by other examples I've found on the web. (One of the more influential was one that Bart Kiers pointed me to: If/else statements in ANTLR using listeners) You can find the new grammar below. Stuff that belonged together with the grammar for the abelian expressions but which used to reduce to the expr target, like NEW expr and expr '.' JAVA_ID, has now been separated out to reduce to an abelian_expr target. All the productions that reduce down to an abelian_expr are now more or less direct descendants of the abelian_expr node in the grammar; that helps antlr deal intelligently with what otherwise would be ambiguities. A unary "+" expression no longer can reduce directly to expr but can find its way to expr only through abelian_expr, so it will never be treated as production that can simply be catenated as a largely uncontextualized expression.
I can't prove that that's what was going on, but signs point in that direction. If someone else has a more formal or deductive analysis (particularly if it points to another conclusion) I'd love to hear your reasoning.
The lesson here seems to be that what amounted to a high-level, fundamental flaw in the design of the language managed to show up as only an isolated and in some sense minor bug: the inability to disambiguate between the unary and binary uses of an operator.
grammar Kant;
program
: type_declaration_list main
| type_declaration_list // missing main
;
main
: expr
;
type_declaration_list
: type_declaration
| type_declaration_list type_declaration
| /* null */
;
type_declaration
: 'context' JAVA_ID '{' context_body '}'
| 'class' JAVA_ID '{' class_body '}'
| 'class' JAVA_ID 'extends' JAVA_ID '{' class_body '}'
;
context_body
: context_body context_body_element
| context_body_element
| /* null */
;
context_body_element
: method_decl
| object_decl
| role_decl
| stageprop_decl
;
role_decl
: 'role' JAVA_ID '{' role_body '}'
| 'role' JAVA_ID '{' role_body '}' REQUIRES '{' self_methods '}'
| access_qualifier 'role' JAVA_ID '{' role_body '}'
| access_qualifier 'role' JAVA_ID '{' role_body '}' REQUIRES '{' self_methods '}'
;
role_body
: method_decl
| role_body method_decl
| object_decl // illegal
| role_body object_decl // illegal — for better error messages only
;
self_methods
: self_methods ';' method_signature
| method_signature
| self_methods /* null */ ';'
;
stageprop_decl
: 'stageprop' JAVA_ID '{' stageprop_body '}'
| 'stageprop' JAVA_ID '{' stageprop_body '}' REQUIRES '{' self_methods '}'
| access_qualifier 'stageprop' JAVA_ID '{' stageprop_body '}'
| access_qualifier 'stageprop' JAVA_ID '{' stageprop_body '}' REQUIRES '{' self_methods '}'
;
stageprop_body
: method_decl
| stageprop_body method_decl
| object_decl // illegal
| stageprop_body object_decl // illegal — for better error messages only
;
class_body
: class_body class_body_element
| class_body_element
| /* null */
;
class_body_element
: method_decl
| object_decl
;
method_decl
: method_decl_hook '{' expr_and_decl_list '}'
;
method_decl_hook
: method_signature
;
method_signature
: access_qualifier return_type method_name '(' param_list ')' CONST*
| access_qualifier return_type method_name CONST*
| access_qualifier method_name '(' param_list ')' CONST*
;
expr_and_decl_list
: object_decl
| expr ';' object_decl
| expr_and_decl_list object_decl
| expr_and_decl_list expr
| expr_and_decl_list /*null-expr */ ';'
| /* null */
;
return_type
: type_name
| /* null */
;
method_name
: JAVA_ID
;
access_qualifier
: 'public' | 'private' | /* null */
;
object_decl
: access_qualifier compound_type_name identifier_list ';'
| access_qualifier compound_type_name identifier_list
| compound_type_name identifier_list /* null expr */ ';'
| compound_type_name identifier_list
;
compound_type_name
: type_name '[' ']'
| type_name
;
type_name
: JAVA_ID
| 'int'
| 'double'
| 'char'
| 'String'
;
identifier_list
: JAVA_ID
| identifier_list ',' JAVA_ID
| JAVA_ID ASSIGN expr
| identifier_list ',' JAVA_ID ASSIGN expr
;
param_list
: param_decl
| param_list ',' param_decl
| /* null */
;
param_decl
: type_name JAVA_ID
;
expr
: abelian_expr
| boolean_expr
| block
| if_expr
| for_expr
| while_expr
| do_while_expr
| switch_expr
| BREAK
| CONTINUE
| RETURN expr
| RETURN
;
abelian_expr
: <assoc=right>abelian_expr POW abelian_expr
| ABELIAN_SUMOP expr
| LOGICAL_NEGATION expr
| NEW message
| NEW type_name '[' expr ']'
| abelian_expr ABELIAN_MULOP abelian_expr
| abelian_expr ABELIAN_SUMOP abelian_expr
| abelian_expr ABELIAN_RELOP abelian_expr
| null_expr
| /* this. */ message
| JAVA_ID
| JAVA_ID ABELIAN_INCREMENT_OP
| ABELIAN_INCREMENT_OP JAVA_ID
| constant
| '(' abelian_expr ')'
| abelian_expr '[' expr ']'
| abelian_expr '[' expr ']' ABELIAN_INCREMENT_OP
| ABELIAN_INCREMENT_OP expr '[' expr ']'
| ABELIAN_INCREMENT_OP expr '.' JAVA_ID
| abelian_expr '.' JAVA_ID ABELIAN_INCREMENT_OP
| abelian_expr '.' message
| abelian_expr '.' CLONE
| abelian_expr '.' CLONE '(' ')'
| abelian_expr '.' JAVA_ID
| <assoc=right> abelian_expr ASSIGN expr
;
message
: <assoc=right> JAVA_ID '(' argument_list ')'
;
boolean_expr
: boolean_expr BOOLEAN_MULOP expr
| boolean_expr BOOLEAN_SUMOP expr
| constant // 'true' / 'false'
| abelian_expr
;
block
: '{' expr_and_decl_list '}'
| '{' '}'
;
expr_or_null
: expr
| /* null */
;
if_expr
: 'if' '(' expr ')' expr
| 'if' '(' expr ')' expr 'else' expr
;
for_expr
: 'for' '(' object_decl expr ';' expr ')' expr // O.K. — expr can be a block
| 'for' '(' JAVA_ID ':' expr ')' expr
;
while_expr
: 'while' '(' expr ')' expr
;
do_while_expr
: 'do' expr 'while' '(' expr ')'
;
switch_expr
: SWITCH '(' expr ')' '{' ( switch_body )* '}'
;
switch_body
: ( CASE constant | DEFAULT ) ':' expr_and_decl_list
;
null_expr
: NULL
;
constant
: STRING
| INTEGER
| FLOAT
| BOOLEAN
;
argument_list
: expr
| argument_list ',' expr
| /* null */
;
// Lexer rules
STRING : '"' ( ~'"' | '\\' '"' )* '"' ;
INTEGER : ('1' .. '9')+ ('0' .. '9')* | '0';
FLOAT : (('1' .. '9')* | '0') '.' ('0' .. '9')* ;
BOOLEAN : 'true' | 'false' ;
SWITCH : 'switch' ;
CASE : 'case' ;
DEFAULT : 'default' ;
BREAK : 'break' ;
CONTINUE : 'continue' ;
RETURN : 'return' ;
REQUIRES : 'requires' ;
NEW : 'new' ;
CLONE : 'clone' ;
NULL : 'null' ;
CONST : 'const' ;
ABELIAN_RELOP : '!=' | '==' | '>' | '<' | '>=' | '<=';
LOGICAL_NOT : '!' ;
POW : '**' ;
BOOLEAN_SUMOP : '||' | '^' ;
BOOLEAN_MULOP : '&&' ;
ABELIAN_SUMOP : '+' | '-' ;
ABELIAN_MULOP : '*' | '/' | '%' ;
MINUS : '-' ;
PLUS : '+' ;
LOGICAL_NEGATION : '!' ;
ABELIAN_INCREMENT_OP : '++' | '--' ;
JAVA_ID: (('a' .. 'z') | ('A' .. 'Z')) (('a' .. 'z') | ('A' .. 'Z') | ('0' .. '9') | '_')* ;
INLINE_COMMENT: '//' ~[\r\n]* -> channel(HIDDEN) ;
C_COMMENT: '/*' .*? '*/' -> channel(HIDDEN) ;
WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ -> channel(HIDDEN) ;
ASSIGN : '=' ;

Related

Antlr4 Grammar fails to parse

I am new to ANTLR, and here is a grammar that I am working on and its failing for a given input string - A.B() && ((C.D() || E.F())).
I have tried a number of combinations but its failing on the same place.
grammar Expressions;
expression
: logicBlock (logicalConnector logicBlock)*
| NOT? '('? logicBlock ')'? (logicalConnector NOT? '('? logicBlock ')'?)*
;
logicBlock
: logicUnit comparator THINGS
| logicUnit comparator logicUnit
| logicUnit
;
logicUnit
: NOT? '(' method ')'
| NOT? method
;
method
: object '.' function ('.' function)*
;
object
: THINGS
|'(' THINGS ')'
;
function
: THINGS '(' arguments? ')'
;
arguments
: (object | function | method | logicUnit | logicBlock)
(
','
(object | function | method | logicUnit | logicBlock)
)*
;
logicalConnector
: AND | OR | PLUS | MINUS
;
comparator
: GT | LT | GTE | LTE | EQUALS | NOTEQUALS
;
AND : '&&' ;
OR : '||' ;
EQUALS : '==' ;
ASSIGN : '=' ;
GT : '>' ;
LT : '<' ;
GTE : '>=' ;
LTE : '<=' ;
NOTEQUALS : '!=' ;
NOT : '!' ;
PLUS : '+' ;
MINUS : '-' ;
IF : 'if' ;
THINGS
: [a-zA-Z] [a-zA-Z0-9]*
| '"' .*? '"'
| [0-9]+
| ([0-9]+)? '.' ([0-9])+
;
WS : [ \t\r\n]+ -> skip
;
The error that I getting for this input - A.B() && ((C.D() || E.F())) is below. any help and/or suggestion to improve would be highly appreciated.
This rule looks odd to me:
expression
: logicBlock (logicalConnector logicBlock)*
| NOT? '('? logicBlock ')'? (logicalConnector NOT? '('? logicBlock ')'?)*
// ^ ^ ^ ^
// | | | |
;
All the optional parenthesis can't be right: this means the the first can be present, and the other 3 would be omitted (and any other combination), leaving your expression with unbalanced parenthesis.
The get your grammar working in the input A.B() && ((C.D() || E.F())) , you'll want to do something like this:
expression
: logicBlock (logicalConnector logicBlock)*
| NOT? logicBlock (logicalConnector NOT? logicBlock)*
;
logicBlock
: '(' expression ')'
| logicUnit comparator THINGS
| logicUnit comparator logicUnit
| logicUnit
;
logicUnit
: '(' expression ')'
| NOT? '(' method ')'
| NOT? method
;
But ANTLR4 supports left recursive rules, allowing you to define the expression rule like this:
expression
: '(' expression ')'
| NOT expression
| expression comparator expression
| expression logicalConnector expression
| method
| object
| THINGS
;
method
: object '.' function ( '.' function )*
;
object
: THINGS
|'(' THINGS ')'
;
function
: THINGS '(' arguments? ')'
;
arguments
: expression ( ',' expression )*
;
logicalConnector
: AND | OR | PLUS | MINUS
;
comparator
: GT | LT | GTE | LTE | EQUALS | NOTEQUALS
;
making it much more readable. Sure, it doesn't produce the exact parse tree your original grammar was producing, but you might be flexible in that regard. This proposed grammar also matches more than yours: like NOT A, which your grammar does not allow, where my proposed grammar does accept it.

Why is ending parenthesis not recognizes as valid in my antlr4 grammar?

I am writing a DSL with ANTLR4, now I have a problem with ending right parenthesis. Why is this command not valid
This is the command:
set(buffer,variableX|"foo");
the parse tree with the error
Here is my grammar
grammar Expr;
prog: expr+ EOF;
expr:
statement #StatementExpr
|NOT expr #NotExpr
| expr AND expr #AndExpr
| expr (OR | XOR) expr #OrExpr
| function #FunctionExpr
| LPAREN expr RPAREN #ParenExpr
| writeCommand #WriteExpr
;
writeCommand: setCommand | setIfCommand;
statement: ID '=' getCommand NEWLINE #Assign;
setCommand: 'set' LPAREN variableType '|' parameter RPAREN SEMI;
setIfCommand: 'setIf' LPAREN variableType '|' expr '?' parameter ':' parameter RPAREN SEMI;
getCommand: getFieldValue | getInstanceAttribValue|getInstanceAttribValue|getFormAttribValue|getMandatorAttribValue;
getFieldValue: 'getFieldValue' LPAREN instanceID=ID COMMA fieldname=ID RPAREN;
getInstanceAttribValue: 'getInstanceAttrib' LPAREN instanceId=ID COMMA moduleId=ID COMMA attribname=ID RPAREN;
getFormAttribValue: 'getFormAttrib' LPAREN formId=ID COMMA moduleId=ID COMMA attribname=ID RPAREN;
getMandatorAttribValue: 'getMandatorAttrib' LPAREN mandator=ID COMMA moduleId=ID COMMA attribname=ID RPAREN;
twoParameterList: parameter '|' parameter;
parameter:variableType | constType;
pdixFuncton:ID;
constType:
ID
| '"' ID '"';
variableType:
valueType
|instanceType
|formType
|bufferType
|instanceAttribType
|formAttribType
|mandatorAttribType
;
valueType:'value' COMMA parameter (COMMA functionParameter)?;
instanceType: 'instance' COMMA instanceParameter;
formType: 'form' COMMA formParameter;
bufferType: 'buffer' COMMA ID '|' parameter;
instanceParameter: 'instanceId'
| 'instanceKey'
| 'firstpenId'
| 'lastpenId'
| 'lastUpdate'
| 'started'
;
formParameter: 'formId'
|'formKey'
|'lastUpdate'
;
functionParameter: 'lastPen'
| 'fieldGroup'
| ' fieldType'
| 'fieldSource'
| 'updateId'
| 'sessionId'
| 'icrConfidence'
| 'icrRecognition'
| 'lastUpdate';
instanceAttribType:('instattrib' | 'instanceattrib') COMMA attributeType;
formAttribType:'formattrib' COMMA attributeType;
mandatorAttribType: 'mandatorattrib' COMMA attributeType;
attributeType:ID '#' ID;
function:
commandIsSet #IsSet
| commandIsEmpty #IsEmpty
| commandIsEqual #IsEqual
;
commandIsSet: IS_SET LPAREN parameter RPAREN;
commandIsEmpty: IS_EMPTY LPAREN parameter RPAREN;
commandIsEqual: IS_EQUAL LPAREN twoParameterList RPAREN;
LPAREN : '(';
RPAREN : ')';
LBRACE : '{';
RBRACE : '}';
LBRACK : '[';
RBRACK : ']';
SEMI : ';';
COMMA : ',';
DOT : '.';
ASSIGN : '=';
GT : '>';
LT : '<';
BANG : '!';
TILDE : '~';
QUESTION : '?';
COLON : ':';
EQUAL : '==';
LE : '<=';
GE : '>=';
NOTEQUAL : '!=';
AND : 'and';
OR : 'or';
XOR :'xor';
NOT :'not' ;
INC : '++';
DEC : '--';
ADD : '+';
SUB : '-';
MUL : '*';
DIV : '/';
INT: [0-9]+;
NEWLINE: '\r'? '\n';
IS_SET:'isSet';
IS_EMPTY:'isEmpty';
IS_EQUAL:'isEqual';
WS: (' '|'\t' | '\n' | '\r' )+ -> skip;
ID
: JavaLetter JavaLetterOrDigit*
;
fragment
JavaLetter
: [a-zA-Z$_] // these are the "java letters" below 0xFF
| // covers all characters above 0xFF which are not a surrogate
~[\u0000-\u00FF\uD800-\uDBFF]
{Character.isJavaIdentifierStart(_input.LA(-1))}?
| // covers UTF-16 surrogate pairs encodings for U+10000 to U+10FFFF
[\uD800-\uDBFF] [\uDC00-\uDFFF]
{Character.isJavaIdentifierStart(Character.toCodePoint((char)_input.LA(-2), (char)_input.LA(-1)))}?
;
fragment
JavaLetterOrDigit
: [a-zA-Z0-9$_] // these are the "java letters or digits" below 0xFF
| // covers all characters above 0xFF which are not a surrogate
~[\u0000-\u00FF\uD800-\uDBFF]
{Character.isJavaIdentifierPart(_input.LA(-1))}?
| // covers UTF-16 surrogate pairs encodings for U+10000 to U+10FFFF
[\uD800-\uDBFF] [\uDC00-\uDFFF]
{Character.isJavaIdentifierPart(Character.toCodePoint((char)_input.LA(-2), (char)_input.LA(-1)))}?
;
fragment DoubleQuote: '"' ; // Hard to read otherwise.
The input set(buffer,variableX|"foo"); cannot be parsed by:
setCommand: 'set' LPAREN variableType '|' parameter RPAREN SEMI;
because buffer,variableX|"foo" is being matched by:
bufferType: 'buffer' COMMA ID '|' parameter;
which causes '|' parameter from the setCommand to not be able to match anything. If the input was something like set(buffer,variableX|"foo"|"bar");, it would parse successfully. Or remove '|' parameter from either the setCommand rule or from the bufferType rule.

How to parse statement in order of desired precedence using antlr?

I have an RSQL grammar defined:
grammar Rsql;
statement
: L_PAREN wrapped=statement R_PAREN
| left=statement op=( AND_OPERATOR | OR_OPERATOR ) right=statement
| node=comparison
;
comparison
: single_comparison
| multi_comparison
| bool_comparison
;
single_comparison
: key=IDENTIFIER op=( EQ | NE | GT | GTE | LT | LTE ) value=single_value
;
multi_comparison
: key=IDENTIFIER op=( IN | NIN ) value=multi_value
;
bool_comparison
: key=IDENTIFIER op=EX value=boolean_value
;
boolean_value
: BOOLEAN
;
single_value
: boolean_value
| ( STRING_LITERAL | IDENTIFIER )
| NUMERIC_LITERAL
;
multi_value
: L_PAREN single_value ( COMMA single_value )* R_PAREN
| single_value
;
TRUE: 'true';
FALSE: 'false';
AND_OPERATOR: ';';
OR_OPERATOR: ',';
L_PAREN: '(';
R_PAREN: ')';
COMMA: ',';
EQ: '==';
NE: '!=';
IN: '=in=';
NIN: '=out=';
GT: '=gt=';
LT: '=lt=';
GTE: '=ge=';
LTE: '=le=';
EX: '=ex=';
IDENTIFIER
: [a-zA-Z_] [a-zA-Z_0-9]*
;
BOOLEAN
: TRUE
| FALSE
;
NUMERIC_LITERAL
: DIGIT+ ( '.' DIGIT* )? ( [-+]? DIGIT+ )?
| '.' DIGIT+ ( [-+]? DIGIT+ )?
;
STRING_LITERAL
: '\'' ( STRING_ESCAPE_SEQ | ~[\\\r\n'] )* '\''
| '"' ( STRING_ESCAPE_SEQ | ~[\\\r\n"] )* '"'
;
STRING_ESCAPE_SEQ
: '\\' .
;
fragment DIGIT : [0-9];
No matter how I attempt to parse this (listener/visitor), the statements with parenthesis always get evaluated in order. It is my understanding that the order in the rule would be the precedence. However, the parse tree for a statement like "name==foo,(name==bar;age=gt=35)" is always
no matter where the parenthesis appear. Please help me discover what I'm missing. Thanks!

how can I refactor this ANTLR4 grammar so that it isn't mutually left recursive?

I can't seem to figure out why this grammar won't compile. It compiled fine until I modified line 145 from
(Identifier '.')* functionCall
to
(primary '.')? functionCall
I've been trying to figure out how to solve this issue for a while but I can't seem to be able to. Here's the error:
The following sets of rules are mutually left-recursive [primary]
grammar Tadpole;
#header
{package net.tadpole.compiler.parser;}
file
: fileContents*
;
fileContents
: structDec
| functionDec
| statement
| importDec
;
importDec
: 'import' Identifier ';'
;
literal
: IntegerLiteral
| FloatingPointLiteral
| BooleanLiteral
| CharacterLiteral
| StringLiteral
| NoneLiteral
| arrayLiteral
;
arrayLiteral
: '[' expressionList? ']'
;
expressionList
: expression (',' expression)*
;
expression
: primary
| unaryExpression
| <assoc=right> expression binaryOpPrec0 expression
| <assoc=left> expression binaryOpPrec1 expression
| <assoc=left> expression binaryOpPrec2 expression
| <assoc=left> expression binaryOpPrec3 expression
| <assoc=left> expression binaryOpPrec4 expression
| <assoc=left> expression binaryOpPrec5 expression
| <assoc=left> expression binaryOpPrec6 expression
| <assoc=left> expression binaryOpPrec7 expression
| <assoc=left> expression binaryOpPrec8 expression
| <assoc=left> expression binaryOpPrec9 expression
| <assoc=left> expression binaryOpPrec10 expression
| <assoc=right> expression binaryOpPrec11 expression
;
unaryExpression
: unaryOp expression
| prefixPostfixOp primary
| primary prefixPostfixOp
;
unaryOp
: '+'
| '-'
| '!'
| '~'
;
prefixPostfixOp
: '++'
| '--'
;
binaryOpPrec0
: '**'
;
binaryOpPrec1
: '*'
| '/'
| '%'
;
binaryOpPrec2
: '+'
| '-'
;
binaryOpPrec3
: '>>'
| '>>>'
| '<<'
;
binaryOpPrec4
: '<'
| '>'
| '<='
| '>='
| 'is'
;
binaryOpPrec5
: '=='
| '!='
;
binaryOpPrec6
: '&'
;
binaryOpPrec7
: '^'
;
binaryOpPrec8
: '|'
;
binaryOpPrec9
: '&&'
;
binaryOpPrec10
: '||'
;
binaryOpPrec11
: '='
| '**='
| '*='
| '/='
| '%='
| '+='
| '-='
| '&='
| '|='
| '^='
| '>>='
| '>>>='
| '<<='
| '<-'
;
primary
: literal
| fieldName
| '(' expression ')'
| '(' type ')' (primary | unaryExpression)
| 'new' objType '(' expressionList? ')'
| primary '.' fieldName
| primary dimension
| (primary '.')? functionCall
;
functionCall
: functionName '(' expressionList? ')'
;
functionName
: Identifier
;
dimension
: '[' expression ']'
;
statement
: '{' statement* '}'
| expression ';'
| 'recall' ';'
| 'return' expression? ';'
| variableDec
| 'if' '(' expression ')' statement ('else' statement)?
| 'while' '(' expression ')' statement
| 'do' expression 'while' '(' expression ')' ';'
| 'do' '{' statement* '}' 'while' '(' expression ')' ';'
;
structDec
: 'struct' structName ('(' parameterList ')')? '{' variableDec* functionDec* '}'
;
structName
: Identifier
;
fieldName
: Identifier
;
variableDec
: type fieldName ('=' expression)? ';'
;
type
: primitiveType ('[' ']')*
| objType ('[' ']')*
;
primitiveType
: 'byte'
| 'short'
| 'int'
| 'long'
| 'char'
| 'boolean'
| 'float'
| 'double'
;
objType
: (Identifier '.')? structName
;
functionDec
: 'def' functionName '(' parameterList? ')' ':' type '->' functionBody
;
functionBody
: statement
;
parameterList
: parameter (',' parameter)*
;
parameter
: type fieldName
;
IntegerLiteral
: DecimalIntegerLiteral
| HexIntegerLiteral
| OctalIntegerLiteral
| BinaryIntegerLiteral
;
fragment
DecimalIntegerLiteral
: DecimalNumeral IntegerSuffix?
;
fragment
HexIntegerLiteral
: HexNumeral IntegerSuffix?
;
fragment
OctalIntegerLiteral
: OctalNumeral IntegerSuffix?
;
fragment
BinaryIntegerLiteral
: BinaryNumeral IntegerSuffix?
;
fragment
IntegerSuffix
: [lL]
;
fragment
DecimalNumeral
: Digit (Digits? | Underscores Digits)
;
fragment
Digits
: Digit (DigitsAndUnderscores? Digit)?
;
fragment
Digit
: [0-9]
;
fragment
DigitsAndUnderscores
: DigitOrUnderscore+
;
fragment
DigitOrUnderscore
: Digit
| '_'
;
fragment
Underscores
: '_'+
;
fragment
HexNumeral
: '0' [xX] HexDigits
;
fragment
HexDigits
: HexDigit (HexDigitsAndUnderscores? HexDigit)?
;
fragment
HexDigit
: [0-9a-fA-F]
;
fragment
HexDigitsAndUnderscores
: HexDigitOrUnderscore+
;
fragment
HexDigitOrUnderscore
: HexDigit
| '_'
;
fragment
OctalNumeral
: '0' [oO] Underscores? OctalDigits
;
fragment
OctalDigits
: OctalDigit (OctalDigitsAndUnderscores? OctalDigit)?
;
fragment
OctalDigit
: [0-7]
;
fragment
OctalDigitsAndUnderscores
: OctalDigitOrUnderscore+
;
fragment
OctalDigitOrUnderscore
: OctalDigit
| '_'
;
fragment
BinaryNumeral
: '0' [bB] BinaryDigits
;
fragment
BinaryDigits
: BinaryDigit (BinaryDigitsAndUnderscores? BinaryDigit)?
;
fragment
BinaryDigit
: [01]
;
fragment
BinaryDigitsAndUnderscores
: BinaryDigitOrUnderscore+
;
fragment
BinaryDigitOrUnderscore
: BinaryDigit
| '_'
;
// §3.10.2 Floating-Point Literals
FloatingPointLiteral
: DecimalFloatingPointLiteral FloatingPointSuffix?
| HexadecimalFloatingPointLiteral FloatingPointSuffix?
;
fragment
FloatingPointSuffix
: [fFdD]
;
fragment
DecimalFloatingPointLiteral
: Digits '.' Digits? ExponentPart?
| '.' Digits ExponentPart?
| Digits ExponentPart
| Digits
;
fragment
ExponentPart
: ExponentIndicator SignedInteger
;
fragment
ExponentIndicator
: [eE]
;
fragment
SignedInteger
: Sign? Digits
;
fragment
Sign
: [+-]
;
fragment
HexadecimalFloatingPointLiteral
: HexSignificand BinaryExponent
;
fragment
HexSignificand
: HexNumeral '.'?
| '0' [xX] HexDigits? '.' HexDigits
;
fragment
BinaryExponent
: BinaryExponentIndicator SignedInteger
;
fragment
BinaryExponentIndicator
: [pP]
;
BooleanLiteral
: 'true'
| 'false'
;
CharacterLiteral
: '\'' SingleCharacter '\''
| '\'' EscapeSequence '\''
;
fragment
SingleCharacter
: ~['\\]
;
StringLiteral
: '"' StringCharacters? '"'
;
fragment
StringCharacters
: StringCharacter+
;
fragment
StringCharacter
: ~["\\]
| EscapeSequence
;
fragment
EscapeSequence
: '\\' [btnfr"'\\]
| OctalEscape
| UnicodeEscape
;
fragment
OctalEscape
: '\\' OctalDigit
| '\\' OctalDigit OctalDigit
| '\\' ZeroToThree OctalDigit OctalDigit
;
fragment
ZeroToThree
: [0-3]
;
fragment
UnicodeEscape
: '\\' 'u' HexDigit HexDigit HexDigit HexDigit
;
NoneLiteral
: 'nil'
;
Identifier
: IdentifierStartChar IdentifierChar*
;
fragment
IdentifierStartChar
: [a-zA-Z$_] // these are the "java letters" below 0xFF
| // covers all characters above 0xFF which are not a surrogate
~[\u0000-\u00FF\uD800-\uDBFF]
{Character.isJavaIdentifierStart(_input.LA(-1))}?
| // covers UTF-16 surrogate pairs encodings for U+10000 to U+10FFFF
[\uD800-\uDBFF] [\uDC00-\uDFFF]
{Character.isJavaIdentifierStart(Character.toCodePoint((char)_input.LA(-2), (char)_input.LA(-1)))}?
;
fragment
IdentifierChar
: [a-zA-Z0-9$_] // these are the "java letters or digits" below 0xFF
| // covers all characters above 0xFF which are not a surrogate
~[\u0000-\u00FF\uD800-\uDBFF]
{Character.isJavaIdentifierPart(_input.LA(-1))}?
| // covers UTF-16 surrogate pairs encodings for U+10000 to U+10FFFF
[\uD800-\uDBFF] [\uDC00-\uDFFF]
{Character.isJavaIdentifierPart(Character.toCodePoint((char)_input.LA(-2), (char)_input.LA(-1)))}?
;
WS : [ \t\r\n\u000C]+ -> skip
;
LINE_COMMENT
: '#' ~[\r\n]* -> skip
;
The left recursive invocation needs to be the first, so no parenthesis can be placed before it.
You can rewrite it like this:
primary
: literal
| fieldName
| '(' expression ')'
| '(' type ')' (primary | unaryExpression)
| 'new' objType '(' expressionList? ')'
| primary '.' fieldName
| primary dimension
| primary '.' functionCall
| functionCall
;
which is equivalent.

ANTLR 4 precedence not as expected

I've defined a flavor of SQL that we use in my company as the following:
/** Grammars always start with a grammar header. This grammar is called
* GigyaSQL and must match the filename: GigyaSQL.g4
*/
grammar GigyaSQL;
parse
: selectClause
fromClause
( whereClause )?
( filterClause )?
( groupByClause )?
( limitClause )?
;
selectClause
: K_SELECT result_column ( ',' result_column )*
;
result_column
: '*' # selectAll
| table_name '.' '*' # selectAllFromTable
| select_expr ( K_AS? column_alias )? # selectExpr
| with_table # selectWithTable
;
fromClause
: K_FROM table_name
;
table_name
: any_name # simpleTable
| any_name K_WITH with_table # tableWithTable
;
any_name
: IDENTIFIER
| STRING_LITERAL
| '(' any_name ')'
;
with_table
: COUNTERS
;
select_expr
: literal_value
| range_function_in_select
| interval_function_in_select
| ( table_name '.' )? column_name
| function_name '(' argument_list ')'
;
whereClause
: K_WHERE condition_expr
;
condition_expr
: literal_value # literal
| ( table_name '.' )? column_name # column_name_expr
| unary_operator condition_expr # unary_expr
| condition_expr binary_operator condition_expr # binary_expr
| K_IFELEMENT '(' with_table ',' condition_expr ')' # if_element
| function_name '(' argument_list ')' # function_expr
| '(' condition_expr ')' # brackets_expr
| condition_expr K_NOT? K_LIKE condition_expr # like_expr
| condition_expr K_NOT? K_CONTAINS condition_expr # contains_expr
| condition_expr K_IS K_NOT? condition_expr # is_expr
//| condition_expr K_NOT? K_BETWEEN condition_expr K_AND condition_expr
| condition_expr K_NOT? K_IN '(' ( literal_value ( ',' literal_value )*) ')' # in_expr
;
filterClause
: K_FILTER with_table K_BY condition_expr
;
groupByClause
: K_GROUP K_BY group_expr ( ',' group_expr )*
;
group_expr
: literal_value
| ( table_name '.' )? column_name
| function_name '(' argument_list ')'
| range_function_in_group
| interval_function_in_group
;
limitClause
: K_LIMIT NUMERIC_LITERAL
;
argument_list
: ( select_expr ( ',' select_expr )* | '*' )
;
unary_operator
: MINUS
| PLUS
| '~'
| K_NOT
;
binary_operator
: ( '*' | DIVIDE | MODULAR )
| ( PLUS | MINUS )
//| ( '<<' | '>>' | '&' | '|' )
| ( LTH | LEQ | GTH | GEQ )
| ( EQUAL | NOT_EQUAL | K_IN | K_LIKE )
//| ( '=' | '==' | '!=' | '<>' | K_IS | K_IS K_NOT | K_IN | K_LIKE | K_GLOB | K_MATCH | K_REGEXP )
| K_AND
| K_OR
;
range_function_in_select
: K_RANGE '(' select_expr ')'
;
range_function_in_group
: K_RANGE '(' select_expr ',' range_pair (',' range_pair)* ')'
;
range_pair // Tried to use INT instead (for decimal numbers) but that didn't work fine (didn't parse a = 1 correctly)
: '"' NUMERIC_LITERAL ',' NUMERIC_LITERAL '"'
| '"' ',' NUMERIC_LITERAL '"'
| '"' NUMERIC_LITERAL ',' '"'
;
interval_function_in_select
: K_INTERVAL '(' select_expr ')'
;
interval_function_in_group
: K_INTERVAL '(' select_expr ',' NUMERIC_LITERAL ')'
;
function_name
: any_name
;
literal_value
: NUMERIC_LITERAL
| STRING_LITERAL
// | BLOB_LITERAL
| K_NULL
// | K_CURRENT_TIME
// | K_CURRENT_DATE
// | K_CURRENT_TIMESTAMP
;
column_name
: any_name
;
column_alias
: IDENTIFIER
| STRING_LITERAL
;
SPACES
: [ \u000B\t\r\n] -> skip
;
COUNTERS : 'counters' | 'COUNTERS';
//INT : '0' | DIGIT+ ;
EQUAL : '=';
NOT_EQUAL : '<>' | '!=';
LTH : '<' ;
LEQ : '<=';
GTH : '>';
GEQ : '>=';
//MULTIPLY: '*';
DIVIDE : '/';
MODULAR : '%';
PLUS : '+';
MINUS : '-';
K_AND : A N D;
K_AS : A S;
K_BY : B Y;
K_CONTAINS: C O N T A I N S;
K_DISTINCT : D I S T I N C T;
K_FILTER : F I L T E R;
K_FROM : F R O M;
K_GROUP : G R O U P;
K_IFELEMENT : I F E L E M E N T;
K_IN : I N;
K_INTERVAL : I N T E R V A L;
K_IS : I S;
K_LIKE : L I K E;
K_LIMIT : L I M I T;
K_NOT : N O T;
K_NULL : N U L L;
K_OR : O R;
K_RANGE : R A N G E;
K_REGEXP : R E G E X P;
K_SELECT : S E L E C T;
K_WHERE : W H E R E;
K_WITH : W I T H;
IDENTIFIER
: '"' (~'"' | '""')* '"'
| '`' (~'`' | '``')* '`'
| '[' ~']'* ']'
| [a-zA-Z_] [.a-zA-Z_0-9]* // TODO - need to check if the period is correcly handled
| [a-zA-Z_] [a-zA-Z_0-9]* // TODO check: needs more chars in set
;
STRING_LITERAL
: '\'' ( ~'\'' | '\'\'' )* '\''
;
NUMERIC_LITERAL
:// INT
DIGIT+ ('.' DIGIT*)? ( E [-+]? DIGIT+ )?
| '.' DIGIT+ ( E [-+]? DIGIT+ )?
;
fragment DIGIT : [0-9];
fragment A : [aA];
fragment B : [bB];
fragment C : [cC];
fragment D : [dD];
fragment E : [eE];
fragment F : [fF];
fragment G : [gG];
fragment H : [hH];
fragment I : [iI];
fragment J : [jJ];
fragment K : [kK];
fragment L : [lL];
fragment M : [mM];
fragment N : [nN];
fragment O : [oO];
fragment P : [pP];
fragment Q : [qQ];
fragment R : [rR];
fragment S : [sS];
fragment T : [tT];
fragment U : [uU];
fragment V : [vV];
fragment W : [wW];
fragment X : [xX];
fragment Y : [yY];
fragment Z : [zZ];
and I try to parse the following query:
SELECT * from accounts where not data.zzz > 124
I get the following tree:
But I wanted to get the tree similar to when I use parenthesis:
SELECT * from accounts where not (data.zzz > 124)
I don't understand why it's working that way sinze the unary rule is before others.
Any suggestion?
That is the correct result for the given grammar. As you've already mentioned, the unary_operator is before the binary_operator meaning any operand for the NOT keyword is binded to it first before other operators. And since it is unary, it takes the data.zzz as its operand and after that the whole NOT expression becomes an operand of the binary_operator.
To get what you want, just shift down the unary_operator according to the precedence level of it (as I recall, in SQL, NOT's precedence is lower than that of binary operators, and the NOT operator should not have the same precedence as the MINUS PLUS and ~ just like what your grammar does) e.g.
condition_expr
: literal_value # literal
| ( table_name '.' )? column_name # column_name_expr
| condition_expr binary_operator condition_expr # binary_expr
| unary_operator condition_expr # unary_expr
| K_IFELEMENT '(' with_table ',' condition_expr ')' # if_element
| function_name '(' argument_list ')' # function_expr
| '(' condition_expr ')' # brackets_expr
| condition_expr K_NOT? K_LIKE condition_expr # like_expr
| condition_expr K_NOT? K_CONTAINS condition_expr # contains_expr
| condition_expr K_IS K_NOT? condition_expr # is_expr
//| condition_expr K_NOT? K_BETWEEN condition_expr K_AND condition_expr
| condition_expr K_NOT? K_IN '(' ( literal_value ( ',' literal_value )*) ')' # in_expr
;
And this gives what you want:

Resources