I have a grammar in ANTLR and I have a file for testing my grammar.
But I don't know what is wrong with my output.
This is my grammar:
grammar proj;
start
: (assign|define|read|write|condition|while|module|callingmodule)+
;
assign
: T_ID T_ENTESAB expentesab T_SEPARATOR
;
expentesab
: T_ID
| T_NUMBER
| T_SABETMANTEGHI
| expentesab operator expentesab
| expentesab operator
| operator expentesab
| T_PARANTEZBAZ expentesab T_PARANTEZBASTE
| expentesab T_COMMA expentesab
| T_PARANTEZBAZ expentesab T_COMMA expentesab T_PARANTEZBASTE
| expentesab T_COMMA T_PARANTEZBAZ expentesab T_PARANTEZBASTE
;
operator
: T_ADD
| T_SUB
| T_MUL
| T_DIV
| T_POW
| T_FACT
| T_AND
| T_OR
| T_XOR
;
define
: T_ID T_2POINT T_TYPE T_SEPARATOR
;
read
: T_READ expread T_SEPARATOR
;
expread
: T_ID
| T_NUMBER
| operator
| T_PARANTEZBAZ expread T_PARANTEZBASTE
| expread operator expread
;
write
: T_WRITE expwrite T_SEPARATOR
;
expwrite
: T_ID
| T_NUMBER
| operator
| T_PARANTEZBAZ expwrite T_PARANTEZBASTE
| expwrite operator expwrite
| expwrite T_COMPARE expwrite
;
condition
: T_IF expcon T_THEN code_if T_ELSE code_if T_SEPARATOR
| T_IF expcon T_THEN code_if T_SEPARATOR
;
expcon
: assign
| define
| expcon T_COMPARE expcon
| expcon operator expcon
| operator expcon
;
code_if
: condition
| block
| define
| assign
| callingmodule
| code_if operator code_if
| T_PARANTEZBAZ code_if T_PARANTEZBASTE T_SEPARATOR
| T_PARANTEZBAZ code_if operator code_if T_PARANTEZBASTE T_SEPARATOR
;
callingmodule
: T_ID T_PARANTEZBAZ params T_PARANTEZBASTE T_SEPARATOR
| T_ID T_PARANTEZBAZ T_PARANTEZBASTE T_SEPARATOR
;
params
: expparam(T_COMMA expparam)*
;
expparam
: T_ID
| shart
| T_ID operator T_NUMBER
| T_PARANTEZBAZ expparam T_PARANTEZBASTE
;
while
: T_WHILE expwhile code_while
;
expwhile
: T_SABETMANTEGHI
| T_NUMBER
| T_PARANTEZBAZ expwhile T_PARANTEZBASTE
| expwhile operator expwhile
| T_ID T_COMPARE T_ID
| expwhile T_AND expwhile
| expwhile T_OR expwhile
| expwhile T_XOR expwhile
;
code_while
: block
| module
| callingmodule
| define
| assign
| code_while operator code_while T_SEPARATOR
| T_PARANTEZBAZ code_while T_PARANTEZBASTE T_SEPARATOR
| T_PARANTEZBAZ code_while operator code_while T_PARANTEZBASTE T_SEPARATOR
;
block
: T_BEGIN inner_block T_END
;
inner_block
: define
| assign
| condition
| callingmodule
| block
| T_ID operator T_ID
| T_PARANTEZBAZ inner_block T_PARANTEZBASTE T_SEPARATOR
| T_PARANTEZBAZ T_ID operator T_ID T_PARANTEZBASTE T_SEPARATOR
;
module
: T_MODULE T_ID T_INPUT T_2POINT (define)+ T_OUTPUT T_2POINT T_TYPE block
| T_MODULE T_ID block
;
shart
: expcon T_CONDITION code_if T_2POINT code_if
| expcon T_CONDITION code_if T_2POINT code_if T_SEPARATOR
;
T_TYPE: ('s'|'S')('t'|'T')('r'|'R')('i'|'I')('n'|'N')('g'|'G')|('r'|'R')('e'|'E')('a'|'A')('l'|'L')|
('b'|'B')('o'|'O')('o'|'O')('l'|'L');
T_END: ('e'|'E')('n'|'N')('d'|'D');
T_BEGIN:('b'|'B')('e'|'E')('g'|'G')('i'|'I')('n'|'N');
T_WHILE:('w'|'W')('h'|'H')('i'|'I')('l'|'L')('e'|'E');
T_IF:('i'|'I')('f'|'F');
T_THEN:('t'|'T')('h'|'H')('e'|'E')('n'|'N');
T_ELSE:('e'|'E')('l'|'L')('s'|'S')('e'|'E');
T_READ:('r'|'R')('e'|'E')('a'|'A')('d'|'D');
T_WRITE:('w'|'W')('r'|'R')('i'|'I')('t'|'T')('e'|'E');
T_MODULE:('M'|'m')('O'|'o')('D'|'d')('U'|'u')('L'|'l')('E'|'e');
T_INPUT:('I'|'i')('N'|'n')('P'|'p')('U'|'u')('T'|'t');
T_OUTPUT:('O'|'o')('U'|'u')('T'|'t')('P'|'p')('U'|'u')('T'|'t');
T_RETURN:('R'|'r')('E'|'e')('T'|'t')('U'|'u')('R'|'r')('N'|'n');
T_SEPARATOR : ';';
T_SABETMANTEGHI: ('t'|'T')('r'|'R')('u'|'U')('e'|'E')|('f'|'F')('a'|'A')('l'|'L')('s'|'S')('e'|'E');
T_NUMBER:T_HEXNUMBER|T_INTEGERNUMBER;
T_HEXNUMBER: '0' ('x'|'X') ('0'..'9'|'a'..'f'|'A'..'F')+|'0' ('x'|'X') ('0'..'9'|'a'..'f'|'A'..'F')+ '.' ('0'..'9'|'a'..'f'|'A'..'F')+;
T_INTEGERNUMBER:(('0'..'9')+|('0'..'9')+ '.'('0'..'9')+);
T_FUNC:('F'|'f')('U'|'u')|('N'|'n')('C'|'c');
T_ADD: '+';
T_SUB: '-';
T_MUL: '*';
T_DIV: '/';
T_POW: '^';
T_FACT: '!';
T_ENTESAB:'=';
T_X:'x'|'X';
T_AND: ('a'|'A')('n'|'N')('d'|'D');
T_OR: ('o'|'O')('r'|'R');
T_NOT: ('n'|'N')('o'|'O')('t'|'T');
T_XOR: ('x'|'X')('o'|'O')('r'|'R');
T_COMPARE: '>'| '<'| '>='|'<='| '<>';
T_REMAIN: '%';
T_CONDITION:'?';
T_2POINT:':';
T_PARANTEZBAZ:'(';
T_PARANTEZBASTE:')';
T_COMMA:',';
T_COMMENT:T_COM1LINE|T_COMMULLINE;
T_COM1LINE: '%%' ~( '\t'|'\r')+ -> skip ;
T_COMMULLINE:'%%%' (.|('\t'|'\r'|' '|'\n'))*? '%%%' ->skip;
T_ID : [a-zA-Z] ([a-zA-Z]|('0'..'9'))*;
T_WS : (('\t'|'\r'|' ')+) ->skip;
T_NEWLINE:('\n')->skip;
T_LEXICALERROR:.;
And this is my input file:
%%%This is a sample Written in QUPLA $
#The program compute fibonacci serie%%%
module func
input:
X:real;
output:
i:real;
begin
if x> 0 then
begin
return Func(x-1)+func(x-2);
end
begin
return 1;
end
end
%% This is the main module &%*&()
module main
begin
i:real;
read i;
write (func(i)?1:2);
end
For this input, I have these errors:
In line 5 expecting T_ID but i have T_ID!
In line 8 expecting T_IF,T_WHILE T_READ.... But I have T_IF
Let's start with your errors.
Answers
In line 5 expecting T_ID but i have T_ID!
This error occurs because the lexer rule T_X:'x'|'X'; matches the X on line 5 of your sample code. When two lexer rules match the same text, ANTLR picks the rule that is defined first, and T_X is defined before T_ID, so X is tokenized as T_X. The answer is: the token is not a T_ID but a T_X.
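The tie-break behind this error can be illustrated with a small Python sketch. It is not ANTLR's implementation: the RULES table and the tokenize helper are hypothetical and cover only a handful of the grammar's tokens. The sketch assumes the usual lexer behaviour, where the longest match wins and, when two rules match the same length, the rule listed first wins.

```python
import re

# Hypothetical, simplified rule table in the same relative order as the
# grammar in the question: T_X is listed before T_ID.
RULES = [
    ("T_TYPE", r"(?i:string|real|bool)"),
    ("T_SEPARATOR", r";"),
    ("T_X", r"[xX]"),
    ("T_2POINT", r":"),
    ("T_ID", r"[a-zA-Z][a-zA-Z0-9]*"),
]

def tokenize(text, rules):
    """Pick the longest match at each position; on a tie, the earlier rule wins."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Strict '>' keeps the earlier rule when two rules match the same length.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

# Same table without T_X, to show what the parser expected to see.
RULES_WITHOUT_T_X = [rule for rule in RULES if rule[0] != "T_X"]
```

With this table, tokenize("X:real;", RULES) labels the leading X as T_X, while tokenize("X:real;", RULES_WITHOUT_T_X) labels it as T_ID. A longer name such as Xs is T_ID in both cases, because the longer match wins before the ordering is consulted.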
In line 8 expecting T_IF,T_WHILE T_READ.... But I have T_IF
On line 7 of the sample code you define an output variable, i:real. But the output section of your module rule has no define+; it only allows a bare T_TYPE. Assuming named output parameters are allowed, the module rule should look like this:
module
: T_MODULE T_ID T_INPUT T_2POINT define+ T_OUTPUT T_2POINT define+ block
| T_MODULE T_ID block
;
Because define+ is missing, parsing of the module rule stops after output: on line 6, and everything that follows is matched against the alternatives of the main start rule instead, beginning with define.
If that is not the case and your sample code is wrong, remove the i: characters from the output section of the module.
Either way, the answer is: the sample code is inconsistent with your grammar.
Modifications
Rearrange your tokens
You should define your tokens in this order:
Skipped tokens (e.g. whitespaces, comments)
Specialized tokens (e.g. keywords, operators, literals)
General tokens (e.g. identifier)
Pay attention to rule naming
You can't use rule names that are reserved words in the target language you use ANTLRv4 with. You defined a grammar rule named while, which conflicts with the while keyword in Java.
Readability and simplicity are key
Use simpler ANTLRv4 constructs that are easier on the eye:
Change T_WS : (('\t'|'\r'|' ')+) ->skip; to T_WS : [ \t\r]+ -> skip;
Change T_ID : [a-zA-Z] ([a-zA-Z]|('0'..'9'))*; to T_ID : [a-zA-Z] [a-zA-Z0-9]*;
Change T_COMMULLINE:'%%%' (.|('\t'|'\r'|' '|'\n'))*? '%%%' ->skip; to T_COMMULLINE:'%%%' .*? '%%%' -> skip; (the dot . matches any character anyway, including whitespace)
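As a rough analogy for the last simplification, the Python sketch below compares a non-greedy and a greedy version of the comment pattern. It uses Python's re module rather than ANTLR, so the names LAZY, GREEDY, and strip_comments are illustrative only, and re.DOTALL is needed because Python's dot does not match newlines by default.

```python
import re

# Python analogue of '%%%' .*? '%%%'. DOTALL lets '.' match newlines as well,
# which mirrors the multi-line comment in the sample input.
LAZY = re.compile(r"%%%.*?%%%", re.DOTALL)
# Greedy variant, shown only to illustrate why the '?' matters.
GREEDY = re.compile(r"%%%.*%%%", re.DOTALL)

source = "%%%first\ncomment%%% module main %%%second%%%"

def strip_comments(pattern, text):
    """Remove every match of the comment pattern from the text."""
    return pattern.sub("", text)
```

Here strip_comments(LAZY, source) leaves " module main ", because each comment ends at the nearest closing %%%. The greedy variant runs from the first %%% to the last one and removes the code in between as well.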
Related
I am new to ANTLR. Here is a grammar that I am working on; it fails for the input string A.B() && ((C.D() || E.F())).
I have tried a number of combinations, but it always fails in the same place.
grammar Expressions;
expression
: logicBlock (logicalConnector logicBlock)*
| NOT? '('? logicBlock ')'? (logicalConnector NOT? '('? logicBlock ')'?)*
;
logicBlock
: logicUnit comparator THINGS
| logicUnit comparator logicUnit
| logicUnit
;
logicUnit
: NOT? '(' method ')'
| NOT? method
;
method
: object '.' function ('.' function)*
;
object
: THINGS
|'(' THINGS ')'
;
function
: THINGS '(' arguments? ')'
;
arguments
: (object | function | method | logicUnit | logicBlock)
(
','
(object | function | method | logicUnit | logicBlock)
)*
;
logicalConnector
: AND | OR | PLUS | MINUS
;
comparator
: GT | LT | GTE | LTE | EQUALS | NOTEQUALS
;
AND : '&&' ;
OR : '||' ;
EQUALS : '==' ;
ASSIGN : '=' ;
GT : '>' ;
LT : '<' ;
GTE : '>=' ;
LTE : '<=' ;
NOTEQUALS : '!=' ;
NOT : '!' ;
PLUS : '+' ;
MINUS : '-' ;
IF : 'if' ;
THINGS
: [a-zA-Z] [a-zA-Z0-9]*
| '"' .*? '"'
| [0-9]+
| ([0-9]+)? '.' ([0-9])+
;
WS : [ \t\r\n]+ -> skip
;
The error that I am getting for this input, A.B() && ((C.D() || E.F())), is below. Any help or suggestions for improvement would be highly appreciated.
This rule looks odd to me:
expression
: logicBlock (logicalConnector logicBlock)*
| NOT? '('? logicBlock ')'? (logicalConnector NOT? '('? logicBlock ')'?)*
// ^ ^ ^ ^
// | | | |
;
All those optional parentheses can't be right: the first one could be present while the other three are omitted (or any other combination), leaving your expression with unbalanced parentheses.
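To make the problem concrete, here is a small Python sketch that enumerates every combination of the four optional parentheses for a single connector. It is a simplification: NOT is left out, and A and B are placeholders for the two logicBlock operands. The helper names are made up for this illustration.

```python
from itertools import product

def is_balanced(s):
    """Return True when every ')' closes an earlier '(' and none is left open."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0

def expansions():
    """Yield each string that '('? A ')'? && '('? B ')'? can produce."""
    for o1, c1, o2, c2 in product([False, True], repeat=4):
        yield (("(" if o1 else "") + "A" + (")" if c1 else "")
               + " && "
               + ("(" if o2 else "") + "B" + (")" if c2 else ""))

all_forms = list(expansions())
unbalanced = [s for s in all_forms if not is_balanced(s)]
```

Of the 16 strings the alternative accepts, 11 end up in unbalanced, including "(A && B" and "A) && (B".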
To get your grammar working with the input A.B() && ((C.D() || E.F())), you'll want to do something like this:
expression
: logicBlock (logicalConnector logicBlock)*
| NOT? logicBlock (logicalConnector NOT? logicBlock)*
;
logicBlock
: '(' expression ')'
| logicUnit comparator THINGS
| logicUnit comparator logicUnit
| logicUnit
;
logicUnit
: '(' expression ')'
| NOT? '(' method ')'
| NOT? method
;
But ANTLR4 supports left recursive rules, allowing you to define the expression rule like this:
expression
: '(' expression ')'
| NOT expression
| expression comparator expression
| expression logicalConnector expression
| method
| object
| THINGS
;
method
: object '.' function ( '.' function )*
;
object
: THINGS
|'(' THINGS ')'
;
function
: THINGS '(' arguments? ')'
;
arguments
: expression ( ',' expression )*
;
logicalConnector
: AND | OR | PLUS | MINUS
;
comparator
: GT | LT | GTE | LTE | EQUALS | NOTEQUALS
;
making it much more readable. Sure, it doesn't produce the exact parse tree your original grammar produced, but you might be flexible in that regard. The proposed grammar also matches more than yours: for example, it accepts NOT A, which your grammar does not allow.
I have an RSQL grammar defined:
grammar Rsql;
statement
: L_PAREN wrapped=statement R_PAREN
| left=statement op=( AND_OPERATOR | OR_OPERATOR ) right=statement
| node=comparison
;
comparison
: single_comparison
| multi_comparison
| bool_comparison
;
single_comparison
: key=IDENTIFIER op=( EQ | NE | GT | GTE | LT | LTE ) value=single_value
;
multi_comparison
: key=IDENTIFIER op=( IN | NIN ) value=multi_value
;
bool_comparison
: key=IDENTIFIER op=EX value=boolean_value
;
boolean_value
: BOOLEAN
;
single_value
: boolean_value
| ( STRING_LITERAL | IDENTIFIER )
| NUMERIC_LITERAL
;
multi_value
: L_PAREN single_value ( COMMA single_value )* R_PAREN
| single_value
;
TRUE: 'true';
FALSE: 'false';
AND_OPERATOR: ';';
OR_OPERATOR: ',';
L_PAREN: '(';
R_PAREN: ')';
COMMA: ',';
EQ: '==';
NE: '!=';
IN: '=in=';
NIN: '=out=';
GT: '=gt=';
LT: '=lt=';
GTE: '=ge=';
LTE: '=le=';
EX: '=ex=';
IDENTIFIER
: [a-zA-Z_] [a-zA-Z_0-9]*
;
BOOLEAN
: TRUE
| FALSE
;
NUMERIC_LITERAL
: DIGIT+ ( '.' DIGIT* )? ( [-+]? DIGIT+ )?
| '.' DIGIT+ ( [-+]? DIGIT+ )?
;
STRING_LITERAL
: '\'' ( STRING_ESCAPE_SEQ | ~[\\\r\n'] )* '\''
| '"' ( STRING_ESCAPE_SEQ | ~[\\\r\n"] )* '"'
;
STRING_ESCAPE_SEQ
: '\\' .
;
fragment DIGIT : [0-9];
No matter how I attempt to parse this (listener/visitor), the statements with parentheses always get evaluated in order. It is my understanding that the order of the alternatives in the rule determines the precedence. However, the parse tree for a statement like "name==foo,(name==bar;age=gt=35)" is always
no matter where the parentheses appear. Please help me discover what I'm missing. Thanks!
We know the precedence of the logical operators, from highest to lowest:
Not
And
Or
I want to add logical operators to my grammar in a way that respects this precedence. ...
My grammar is:
expression : factor ( PLUS factor | MINUS factor )* ;
factor : term ( MULT term | DIV term )* ;
term : NUMBER | ID | PAR_OPEN expression PAR_CLOSE ;
With ANTLR 3 and ANTLR 4, you can do something like this:
expression
: or_expression
;
// lowest precedence
or_expression
: and_expression ( '||' and_expression )*
;
and_expression
: rel_expression ( '&&' rel_expression )*
;
rel_expression
: add_expression ( ( '<' | '<=' | '>' | '>=' ) add_expression )*
;
add_expression
: mult_expression ( ( '+' | '-' ) mult_expression )*
;
mult_expression
: unary_expression ( ( '*' | '/' ) unary_expression )*
;
unary_expression
: '-' atom
| atom
;
// highest precedence
atom
: NUMBER
| ID
| '(' expression ')'
;
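And with ANTLR4, you can also write it like this (which gives the binary operators the same precedence as the grammar above; note that it uses '!' as its unary operator, where the grammar above uses unary '-'):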
expression
: '!' expression
| expression ( '*' | '/' ) expression // higher than '+' | '-'
| expression ( '+' | '-' ) expression // higher than '<' | '<=' | '>' | '>='
| expression ( '<' | '<=' | '>' | '>=' ) expression // higher than '&&'
| expression '&&' expression // higher than '||'
| expression '||' expression
| NUMBER
| ID
| '(' expression ')'
;
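The way alternative order turns into operator precedence can be sketched with a small precedence-climbing parser in Python. This is a hand-written illustration, not what ANTLR generates: the PRECEDENCE table, the tokenizer, and the fully parenthesized string output are all assumptions made for the example. The numbers mirror the alternative order of the ANTLR4 rule, with a higher number binding tighter.

```python
import re

# Binding power per binary operator; a higher number binds tighter.
PRECEDENCE = {
    "*": 5, "/": 5,
    "+": 4, "-": 4,
    "<": 3, "<=": 3, ">": 3, ">=": 3,
    "&&": 2,
    "||": 1,
}

TOKEN = re.compile(r"\s*(\d+|[A-Za-z_]\w*|&&|\|\||<=|>=|[-+*/<>()!])")

def tokenize(text):
    """Split the input into numbers, identifiers, operators and parentheses."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise ValueError(f"unexpected input at position {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

def parse(tokens, min_prec=1):
    """Parse binary operators whose precedence is at least min_prec."""
    left = parse_atom(tokens)
    while tokens and tokens[0] in PRECEDENCE and PRECEDENCE[tokens[0]] >= min_prec:
        op = tokens.pop(0)
        # Left-associative: the right operand may only contain tighter operators.
        right = parse(tokens, PRECEDENCE[op] + 1)
        left = f"({left} {op} {right})"
    return left

def parse_atom(tokens):
    """Parse a number, an identifier, a parenthesized expression, or '!' atom."""
    tok = tokens.pop(0)
    if tok == "(":
        inner = parse(tokens)
        tokens.pop(0)  # consume ')'
        return inner
    if tok == "!":
        return f"(!{parse_atom(tokens)})"
    return tok
```

For example, parse(tokenize("a || b && c")) returns "(a || (b && c))", so && groups before ||, just as the alternative order in the grammar dictates.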
I can't seem to figure out why this grammar won't compile. It compiled fine until I modified line 145 from
(Identifier '.')* functionCall
to
(primary '.')? functionCall
I've been trying to figure out how to solve this issue for a while but I can't seem to be able to. Here's the error:
The following sets of rules are mutually left-recursive [primary]
grammar Tadpole;
@header
{package net.tadpole.compiler.parser;}
file
: fileContents*
;
fileContents
: structDec
| functionDec
| statement
| importDec
;
importDec
: 'import' Identifier ';'
;
literal
: IntegerLiteral
| FloatingPointLiteral
| BooleanLiteral
| CharacterLiteral
| StringLiteral
| NoneLiteral
| arrayLiteral
;
arrayLiteral
: '[' expressionList? ']'
;
expressionList
: expression (',' expression)*
;
expression
: primary
| unaryExpression
| <assoc=right> expression binaryOpPrec0 expression
| <assoc=left> expression binaryOpPrec1 expression
| <assoc=left> expression binaryOpPrec2 expression
| <assoc=left> expression binaryOpPrec3 expression
| <assoc=left> expression binaryOpPrec4 expression
| <assoc=left> expression binaryOpPrec5 expression
| <assoc=left> expression binaryOpPrec6 expression
| <assoc=left> expression binaryOpPrec7 expression
| <assoc=left> expression binaryOpPrec8 expression
| <assoc=left> expression binaryOpPrec9 expression
| <assoc=left> expression binaryOpPrec10 expression
| <assoc=right> expression binaryOpPrec11 expression
;
unaryExpression
: unaryOp expression
| prefixPostfixOp primary
| primary prefixPostfixOp
;
unaryOp
: '+'
| '-'
| '!'
| '~'
;
prefixPostfixOp
: '++'
| '--'
;
binaryOpPrec0
: '**'
;
binaryOpPrec1
: '*'
| '/'
| '%'
;
binaryOpPrec2
: '+'
| '-'
;
binaryOpPrec3
: '>>'
| '>>>'
| '<<'
;
binaryOpPrec4
: '<'
| '>'
| '<='
| '>='
| 'is'
;
binaryOpPrec5
: '=='
| '!='
;
binaryOpPrec6
: '&'
;
binaryOpPrec7
: '^'
;
binaryOpPrec8
: '|'
;
binaryOpPrec9
: '&&'
;
binaryOpPrec10
: '||'
;
binaryOpPrec11
: '='
| '**='
| '*='
| '/='
| '%='
| '+='
| '-='
| '&='
| '|='
| '^='
| '>>='
| '>>>='
| '<<='
| '<-'
;
primary
: literal
| fieldName
| '(' expression ')'
| '(' type ')' (primary | unaryExpression)
| 'new' objType '(' expressionList? ')'
| primary '.' fieldName
| primary dimension
| (primary '.')? functionCall
;
functionCall
: functionName '(' expressionList? ')'
;
functionName
: Identifier
;
dimension
: '[' expression ']'
;
statement
: '{' statement* '}'
| expression ';'
| 'recall' ';'
| 'return' expression? ';'
| variableDec
| 'if' '(' expression ')' statement ('else' statement)?
| 'while' '(' expression ')' statement
| 'do' expression 'while' '(' expression ')' ';'
| 'do' '{' statement* '}' 'while' '(' expression ')' ';'
;
structDec
: 'struct' structName ('(' parameterList ')')? '{' variableDec* functionDec* '}'
;
structName
: Identifier
;
fieldName
: Identifier
;
variableDec
: type fieldName ('=' expression)? ';'
;
type
: primitiveType ('[' ']')*
| objType ('[' ']')*
;
primitiveType
: 'byte'
| 'short'
| 'int'
| 'long'
| 'char'
| 'boolean'
| 'float'
| 'double'
;
objType
: (Identifier '.')? structName
;
functionDec
: 'def' functionName '(' parameterList? ')' ':' type '->' functionBody
;
functionBody
: statement
;
parameterList
: parameter (',' parameter)*
;
parameter
: type fieldName
;
IntegerLiteral
: DecimalIntegerLiteral
| HexIntegerLiteral
| OctalIntegerLiteral
| BinaryIntegerLiteral
;
fragment
DecimalIntegerLiteral
: DecimalNumeral IntegerSuffix?
;
fragment
HexIntegerLiteral
: HexNumeral IntegerSuffix?
;
fragment
OctalIntegerLiteral
: OctalNumeral IntegerSuffix?
;
fragment
BinaryIntegerLiteral
: BinaryNumeral IntegerSuffix?
;
fragment
IntegerSuffix
: [lL]
;
fragment
DecimalNumeral
: Digit (Digits? | Underscores Digits)
;
fragment
Digits
: Digit (DigitsAndUnderscores? Digit)?
;
fragment
Digit
: [0-9]
;
fragment
DigitsAndUnderscores
: DigitOrUnderscore+
;
fragment
DigitOrUnderscore
: Digit
| '_'
;
fragment
Underscores
: '_'+
;
fragment
HexNumeral
: '0' [xX] HexDigits
;
fragment
HexDigits
: HexDigit (HexDigitsAndUnderscores? HexDigit)?
;
fragment
HexDigit
: [0-9a-fA-F]
;
fragment
HexDigitsAndUnderscores
: HexDigitOrUnderscore+
;
fragment
HexDigitOrUnderscore
: HexDigit
| '_'
;
fragment
OctalNumeral
: '0' [oO] Underscores? OctalDigits
;
fragment
OctalDigits
: OctalDigit (OctalDigitsAndUnderscores? OctalDigit)?
;
fragment
OctalDigit
: [0-7]
;
fragment
OctalDigitsAndUnderscores
: OctalDigitOrUnderscore+
;
fragment
OctalDigitOrUnderscore
: OctalDigit
| '_'
;
fragment
BinaryNumeral
: '0' [bB] BinaryDigits
;
fragment
BinaryDigits
: BinaryDigit (BinaryDigitsAndUnderscores? BinaryDigit)?
;
fragment
BinaryDigit
: [01]
;
fragment
BinaryDigitsAndUnderscores
: BinaryDigitOrUnderscore+
;
fragment
BinaryDigitOrUnderscore
: BinaryDigit
| '_'
;
// §3.10.2 Floating-Point Literals
FloatingPointLiteral
: DecimalFloatingPointLiteral FloatingPointSuffix?
| HexadecimalFloatingPointLiteral FloatingPointSuffix?
;
fragment
FloatingPointSuffix
: [fFdD]
;
fragment
DecimalFloatingPointLiteral
: Digits '.' Digits? ExponentPart?
| '.' Digits ExponentPart?
| Digits ExponentPart
| Digits
;
fragment
ExponentPart
: ExponentIndicator SignedInteger
;
fragment
ExponentIndicator
: [eE]
;
fragment
SignedInteger
: Sign? Digits
;
fragment
Sign
: [+-]
;
fragment
HexadecimalFloatingPointLiteral
: HexSignificand BinaryExponent
;
fragment
HexSignificand
: HexNumeral '.'?
| '0' [xX] HexDigits? '.' HexDigits
;
fragment
BinaryExponent
: BinaryExponentIndicator SignedInteger
;
fragment
BinaryExponentIndicator
: [pP]
;
BooleanLiteral
: 'true'
| 'false'
;
CharacterLiteral
: '\'' SingleCharacter '\''
| '\'' EscapeSequence '\''
;
fragment
SingleCharacter
: ~['\\]
;
StringLiteral
: '"' StringCharacters? '"'
;
fragment
StringCharacters
: StringCharacter+
;
fragment
StringCharacter
: ~["\\]
| EscapeSequence
;
fragment
EscapeSequence
: '\\' [btnfr"'\\]
| OctalEscape
| UnicodeEscape
;
fragment
OctalEscape
: '\\' OctalDigit
| '\\' OctalDigit OctalDigit
| '\\' ZeroToThree OctalDigit OctalDigit
;
fragment
ZeroToThree
: [0-3]
;
fragment
UnicodeEscape
: '\\' 'u' HexDigit HexDigit HexDigit HexDigit
;
NoneLiteral
: 'nil'
;
Identifier
: IdentifierStartChar IdentifierChar*
;
fragment
IdentifierStartChar
: [a-zA-Z$_] // these are the "java letters" below 0xFF
| // covers all characters above 0xFF which are not a surrogate
~[\u0000-\u00FF\uD800-\uDBFF]
{Character.isJavaIdentifierStart(_input.LA(-1))}?
| // covers UTF-16 surrogate pairs encodings for U+10000 to U+10FFFF
[\uD800-\uDBFF] [\uDC00-\uDFFF]
{Character.isJavaIdentifierStart(Character.toCodePoint((char)_input.LA(-2), (char)_input.LA(-1)))}?
;
fragment
IdentifierChar
: [a-zA-Z0-9$_] // these are the "java letters or digits" below 0xFF
| // covers all characters above 0xFF which are not a surrogate
~[\u0000-\u00FF\uD800-\uDBFF]
{Character.isJavaIdentifierPart(_input.LA(-1))}?
| // covers UTF-16 surrogate pairs encodings for U+10000 to U+10FFFF
[\uD800-\uDBFF] [\uDC00-\uDFFF]
{Character.isJavaIdentifierPart(Character.toCodePoint((char)_input.LA(-2), (char)_input.LA(-1)))}?
;
WS : [ \t\r\n\u000C]+ -> skip
;
LINE_COMMENT
: '#' ~[\r\n]* -> skip
;
The left-recursive invocation needs to come first in its alternative, so it cannot be wrapped in parentheses such as (primary '.')?.
You can rewrite it like this:
primary
: literal
| fieldName
| '(' expression ')'
| '(' type ')' (primary | unaryExpression)
| 'new' objType '(' expressionList? ')'
| primary '.' fieldName
| primary dimension
| primary '.' functionCall
| functionCall
;
which is equivalent.
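One way to see why the recursive reference has to be the very first element is to parse the rule by hand: the non-recursive alternatives form a base case, and each alternative that starts with primary becomes a suffix applied in a loop. The Python sketch below is only an analogy under that assumption, not ANTLR's generated code. It handles a reduced subset (names, integers, field access, calls, indexing), and the tuple-shaped nodes are invented for the example.

```python
import re

TOKEN = re.compile(r"\s*([A-Za-z_]\w*|\d+|[.()\[\],])")

def tokenize(text):
    """Split the input into identifiers, integers and punctuation."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise ValueError(f"unexpected input at position {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

def parse_primary(tokens):
    # Base case: the alternatives that do not start with 'primary'.
    node = parse_base(tokens)
    # Each left-recursive alternative becomes one suffix in this loop.
    while tokens and tokens[0] in (".", "["):
        suffix = tokens.pop(0)
        if suffix == ".":
            name = tokens.pop(0)
            if tokens and tokens[0] == "(":
                node = ("call", node, name, parse_args(tokens))  # primary '.' functionCall
            else:
                node = ("field", node, name)                     # primary '.' fieldName
        else:
            index = parse_primary(tokens)                        # primary dimension
            tokens.pop(0)  # consume ']'
            node = ("index", node, index)
    return node

def parse_base(tokens):
    tok = tokens.pop(0)
    if tokens and tokens[0] == "(":
        return ("call", None, tok, parse_args(tokens))           # bare functionCall
    return ("atom", tok)

def parse_args(tokens):
    tokens.pop(0)  # consume '('
    args = []
    while tokens[0] != ")":
        args.append(parse_primary(tokens))
        if tokens[0] == ",":
            tokens.pop(0)
    tokens.pop(0)  # consume ')'
    return args
```

In this sketch a bare call such as f(x) is handled in the base case, while a.b(c) is handled as a suffix, which corresponds to the two separate alternatives functionCall and primary '.' functionCall.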
I am creating an antlr4 grammar for a moderately simple language. I am struggling to get the grammar to differentiate between unary and binary minus. I have read all the other posts that I can find on this topic here on Stack Overflow, but the answers either apply to antlr3 in ways I cannot figure out how to express in antlr4, or I am not adept enough at translating their advice to my own situation. When I play around with other alternatives, I often end up with rules that antlr cannot resolve unambiguously.
Below is the antlr file in its entirety. The ambiguity in this version occurs around the production:
binop_expr
: SUMOP product
| product ( SUMOP product )*
;
(I had originally used UNARY_ABELIAN_OP instead of the second SUMOP, but that led to a different kind of ambiguity — the tool apparently couldn't recognise that it needed to differentiate between the same token in two different contexts. I mention this because one of the posts here recommends using a different name for the unary operator.)
grammar Kant;
program
: type_declaration_list main
;
type_declaration_list
: type_declaration
| type_declaration_list type_declaration
| /* null */
;
type_declaration
: 'context' JAVA_ID '{' context_body '}'
| 'class' JAVA_ID '{' class_body '}'
| 'class' JAVA_ID 'extends' JAVA_ID '{' class_body '}'
;
context_body
: context_body context_body_element
| context_body_element
| /* null */
;
context_body_element
: method_decl
| object_decl
| role_decl
| stageprop_decl
;
role_decl
: 'role' JAVA_ID '{' role_body '}'
| 'role' JAVA_ID '{' role_body '}' REQUIRES '{' self_methods '}'
| access_qualifier 'role' JAVA_ID '{' role_body '}'
| access_qualifier 'role' JAVA_ID '{' role_body '}' REQUIRES '{' self_methods '}'
;
role_body
: method_decl
| role_body method_decl
| object_decl // illegal
| role_body object_decl // illegal — for better error messages only
;
self_methods
: self_methods ';' method_signature
| method_signature
| self_methods /* null */ ';'
;
stageprop_decl
: 'stageprop' JAVA_ID '{' stageprop_body '}'
| 'stageprop' JAVA_ID '{' stageprop_body '}' REQUIRES '{' self_methods '}'
| access_qualifier 'stageprop' JAVA_ID '{' stageprop_body '}'
| access_qualifier 'stageprop' JAVA_ID '{' stageprop_body '}' REQUIRES '{' self_methods '}'
;
stageprop_body
: method_decl
| stageprop_body method_decl
;
class_body
: class_body class_body_element
| class_body_element
| /* null */
;
class_body_element
: method_decl
| object_decl
;
method_decl
: method_decl_hook '{' expr_and_decl_list '}'
;
method_decl_hook
: method_signature
| method_signature CONST
;
method_signature
: access_qualifier return_type method_name '(' param_list ')'
| access_qualifier return_type method_name
| access_qualifier method_name '(' param_list ')'
;
expr_and_decl_list
: object_decl
| expr ';' object_decl
| expr_and_decl_list object_decl
| expr_and_decl_list expr
| expr_and_decl_list /*null-expr */ ';'
| /* null */
;
return_type
: type_name
| /* null */
;
method_name
: JAVA_ID
;
access_qualifier
: 'public' | 'private' | /* null */
;
object_decl
: access_qualifier compound_type_name identifier_list ';'
| access_qualifier compound_type_name identifier_list
| compound_type_name identifier_list /* null expr */ ';'
| compound_type_name identifier_list
;
compound_type_name
: type_name '[' ']'
| type_name
;
type_name
: JAVA_ID
| 'int'
| 'double'
| 'char'
| 'String'
;
identifier_list
: JAVA_ID
| identifier_list ',' JAVA_ID
| JAVA_ID ASSIGN expr
| identifier_list ',' JAVA_ID ASSIGN expr
;
param_list
: param_decl
| param_list ',' param_decl
| /* null */
;
param_decl
: type_name JAVA_ID
;
main
: expr
;
expr
: block
| expr '.' message
| expr '.' CLONE
| expr '.' JAVA_ID
| ABELIAN_INCREMENT_OP expr '.' JAVA_ID
| expr '.' JAVA_ID ABELIAN_INCREMENT_OP
| /* this. */ message
| JAVA_ID
| constant
| if_expr
| for_expr
| while_expr
| do_while_expr
| switch_expr
| BREAK
| CONTINUE
| boolean_expr
| binop_expr
| '(' expr ')'
| <assoc=right> expr ASSIGN expr
| NEW message
| NEW type_name '[' expr ']'
| RETURN expr
| RETURN
;
relop_expr
: sexpr RELATIONAL_OPERATOR sexpr
;
// This is just a duplication of expr. We separate it out
// because a top-down antlr4 parser can't handle the
// left associative ambiguity. It is used only
// for abelian types.
sexpr
: block
| sexpr '.' message
| sexpr '.' CLONE
| sexpr '.' JAVA_ID
| ABELIAN_INCREMENT_OP sexpr '.' JAVA_ID
| sexpr '.' JAVA_ID ABELIAN_INCREMENT_OP
| /* this. */ message
| JAVA_ID
| constant
| if_expr
| for_expr
| while_expr
| do_while_expr
| switch_expr
| BREAK
| CONTINUE
| '(' sexpr ')'
| <assoc=right> sexpr ASSIGN sexpr
| NEW message
| NEW type_name '[' expr ']'
| RETURN expr
| RETURN
;
block
: '{' expr_and_decl_list '}'
| '{' '}'
;
expr_or_null
: expr
| /* null */
;
if_expr
: 'if' '(' boolean_expr ')' expr
| 'if' '(' boolean_expr ')' expr 'else' expr
;
for_expr
: 'for' '(' object_decl boolean_expr ';' expr ')' expr // O.K. — expr can be a block
| 'for' '(' JAVA_ID ':' expr ')' expr
;
while_expr
: 'while' '(' boolean_expr ')' expr
;
do_while_expr
: 'do' expr 'while' '(' boolean_expr ')'
;
switch_expr
: SWITCH '(' expr ')' '{' ( switch_body )* '}'
;
switch_body
: ( CASE constant | DEFAULT ) ':' expr_and_decl_list
;
binop_expr
: SUMOP product
| product ( SUMOP product )*
;
product
: atom ( MULOP atom )*
;
atom
: null_expr
| JAVA_ID
| JAVA_ID ABELIAN_INCREMENT_OP
| ABELIAN_INCREMENT_OP JAVA_ID
| constant
| '(' expr ')'
| array_expr '[' sexpr ']'
| array_expr '[' sexpr ']' ABELIAN_INCREMENT_OP
| ABELIAN_INCREMENT_OP array_expr '[' sexpr ']'
;
null_expr
: NULL
;
array_expr
: sexpr
;
boolean_expr
: boolean_product ( BOOLEAN_SUMOP boolean_product )*
;
boolean_product
: boolean_atom ( BOOLEAN_MULOP boolean_atom )*
;
boolean_atom
: BOOLEAN
| JAVA_ID
| '(' boolean_expr ')'
| LOGICAL_NOT boolean_expr
| relop_expr
;
constant
: STRING
| INTEGER
| FLOAT
| BOOLEAN
;
message
: <assoc=right> JAVA_ID '(' argument_list ')'
;
argument_list
: expr
| argument_list ',' expr
| /* null */
;
// Lexer rules
STRING : '"' ( ~'"' | '\\' '"' )* '"' ;
INTEGER : ('1' .. '9')+ ('0' .. '9')* | '0';
FLOAT : (('1' .. '9')* | '0') '.' ('0' .. '9')* ;
BOOLEAN : 'true' | 'false' ;
SWITCH : 'switch' ;
CASE : 'case' ;
DEFAULT : 'default' ;
BREAK : 'break' ;
CONTINUE : 'continue' ;
RETURN : 'return' ;
REQUIRES : 'requires' ;
NEW : 'new' ;
CLONE : 'clone' ;
NULL : 'null' ;
CONST : 'const' ;
RELATIONAL_OPERATOR : '!=' | '==' | '>' | '<' | '>=' | '<=';
LOGICAL_NOT : '!' ;
BOOLEAN_MULOP : '&&' ;
BOOLEAN_SUMOP : '||' | '^' ;
SUMOP : '+' | '-' ;
MULOP : '*' | '/' ;
ABELIAN_INCREMENT_OP : '++' | '--' ;
JAVA_ID: (('a' .. 'z') | ('A' .. 'Z')) (('a' .. 'z') | ('A' .. 'Z') | ('0' .. '9') | '_')* ;
INLINE_COMMENT: '//' ~[\r\n]* -> channel(HIDDEN) ;
C_COMMENT: '/*' .*? '*/' -> channel(HIDDEN) ;
WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ -> channel(HIDDEN) ;
ASSIGN : '=' ;
Typical of the problem is that the parser can't recognise the unary minus in this expression (it simply does not accept the construct):
Base b1 = new Base(-threetwoone);
Try removing the unary expression from binop_expr, and add it to the expr rule:
expr
: ...
| unary_expr
| binop_expr
| ...
;
unary_expr
: SUMOP binop_expr
;
binop_expr
: product ( SUMOP product )*
;
The problem seems to arise from having two kinds of expressions in the grammar that would be better grouped with their own kind instead of being intermixed as much as they are. The overall intent is to create a language where everything is an expression, including, for example, loops and conditionals. (You can use your imagination to deduce the values they generate; however, that is a semantic rather than syntactic issue.)
One set of these expressions has nice abelian properties (they behave like ordinary arithmetic types with well-behaved operations). The language specifies how lower-level productions are tied together by operations that give them context: so two identifiers, "a" and "b", can be associated by the operator "+" as "a + b". It's a highly contextualized grammar.
The other of these sets of expressions comprises productions with a much lower degree of syntactic cohesion. One can combine "if" expressions, "for" expressions, and declarations in any order. The only linguistic property associating them is catenation, as we find in productions for object_body, class_body, and expr_and_decl_list.
The basic problem is that these two grammars became intertwined in the antlr spec. Combined with the fact that operators like "+" are context-sensitive, and that the context can arise either from catenation or the more contextualised abelian alternative, the parser was left with two alternative ways of parsing some expressions. I managed to wiggle around most of the ambiguity, but it got pushed into the area of unary prefix operators. So if "a" is an expression (it is), and "b" is an expression, then the string "a + b" is ambiguous. It can mean either product ( SUMOP product )*, or it can mean expr_and_decl_list expr where expr_and_decl_list was earlier reduced from the expr "a". The expr in expr_and_decl_list expr can reduce from "+b".
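The ambiguity described here can be reproduced with a toy model in Python. It is a deliberate simplification and not the Kant grammar itself: every product is reduced to a single identifier, only SUMOP is modelled, and count_parses simply counts the ways a token list can be split into a catenation of expressions. All names in the sketch are made up for illustration.

```python
from functools import lru_cache

SUMOPS = {"+", "-"}

def is_id(tok):
    return tok.isidentifier()

def is_binop(tokens):
    """product ( SUMOP product )*, with each product reduced to one identifier."""
    if len(tokens) % 2 == 0:
        return False
    return (all(is_id(t) for t in tokens[0::2])
            and all(t in SUMOPS for t in tokens[1::2]))

def is_unary(tokens):
    """SUMOP product, with the product reduced to one identifier."""
    return len(tokens) == 2 and tokens[0] in SUMOPS and is_id(tokens[1])

def count_parses(tokens):
    """Count the ways to split tokens into one or more catenated expressions."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def ways(start):
        if start == len(tokens):
            return 1
        total = 0
        for end in range(start + 1, len(tokens) + 1):
            piece = tokens[start:end]
            if is_binop(piece) or is_unary(piece):
                total += ways(end)
        return total

    return ways(0)
```

In this model count_parses(["a", "+", "b"]) is 2: one reading is the single binary expression a + b, the other is the expression a followed by the unary expression +b.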
I rewrote the grammar, inspired by other examples I've found on the web. (One of the more influential was one that Bart Kiers pointed me to: If/else statements in ANTLR using listeners.) You can find the new grammar below. Constructs that belong with the grammar for the abelian expressions but used to reduce to the expr target, like NEW expr and expr '.' JAVA_ID, have now been separated out to reduce to an abelian_expr target. All the productions that reduce to an abelian_expr are now more or less direct descendants of the abelian_expr node in the grammar; that helps antlr deal intelligently with what would otherwise be ambiguities. A unary "+" expression can no longer reduce directly to expr but can reach expr only through abelian_expr, so it will never be treated as a production that can simply be catenated as a largely uncontextualized expression.
I can't prove that that's what was going on, but signs point in that direction. If someone else has a more formal or deductive analysis (particularly if it points to another conclusion) I'd love to hear your reasoning.
The lesson here seems to be that what amounted to a high-level, fundamental flaw in the design of the language managed to show up as only an isolated and in some sense minor bug: the inability to disambiguate between the unary and binary uses of an operator.
grammar Kant;
program
: type_declaration_list main
| type_declaration_list // missing main
;
main
: expr
;
type_declaration_list
: type_declaration
| type_declaration_list type_declaration
| /* null */
;
type_declaration
: 'context' JAVA_ID '{' context_body '}'
| 'class' JAVA_ID '{' class_body '}'
| 'class' JAVA_ID 'extends' JAVA_ID '{' class_body '}'
;
context_body
: context_body context_body_element
| context_body_element
| /* null */
;
context_body_element
: method_decl
| object_decl
| role_decl
| stageprop_decl
;
role_decl
: 'role' JAVA_ID '{' role_body '}'
| 'role' JAVA_ID '{' role_body '}' REQUIRES '{' self_methods '}'
| access_qualifier 'role' JAVA_ID '{' role_body '}'
| access_qualifier 'role' JAVA_ID '{' role_body '}' REQUIRES '{' self_methods '}'
;
role_body
: method_decl
| role_body method_decl
| object_decl // illegal
| role_body object_decl // illegal — for better error messages only
;
self_methods
: self_methods ';' method_signature
| method_signature
| self_methods /* null */ ';'
;
stageprop_decl
: 'stageprop' JAVA_ID '{' stageprop_body '}'
| 'stageprop' JAVA_ID '{' stageprop_body '}' REQUIRES '{' self_methods '}'
| access_qualifier 'stageprop' JAVA_ID '{' stageprop_body '}'
| access_qualifier 'stageprop' JAVA_ID '{' stageprop_body '}' REQUIRES '{' self_methods '}'
;
stageprop_body
: method_decl
| stageprop_body method_decl
| object_decl // illegal
| stageprop_body object_decl // illegal — for better error messages only
;
class_body
: class_body class_body_element
| class_body_element
| /* null */
;
class_body_element
: method_decl
| object_decl
;
method_decl
: method_decl_hook '{' expr_and_decl_list '}'
;
method_decl_hook
: method_signature
;
method_signature
: access_qualifier return_type method_name '(' param_list ')' CONST*
| access_qualifier return_type method_name CONST*
| access_qualifier method_name '(' param_list ')' CONST*
;
expr_and_decl_list
: object_decl
| expr ';' object_decl
| expr_and_decl_list object_decl
| expr_and_decl_list expr
| expr_and_decl_list /*null-expr */ ';'
| /* null */
;
return_type
: type_name
| /* null */
;
method_name
: JAVA_ID
;
access_qualifier
: 'public' | 'private' | /* null */
;
object_decl
: access_qualifier compound_type_name identifier_list ';'
| access_qualifier compound_type_name identifier_list
| compound_type_name identifier_list /* null expr */ ';'
| compound_type_name identifier_list
;
compound_type_name
: type_name '[' ']'
| type_name
;
type_name
: JAVA_ID
| 'int'
| 'double'
| 'char'
| 'String'
;
identifier_list
: JAVA_ID
| identifier_list ',' JAVA_ID
| JAVA_ID ASSIGN expr
| identifier_list ',' JAVA_ID ASSIGN expr
;
param_list
: param_decl
| param_list ',' param_decl
| /* null */
;
param_decl
: type_name JAVA_ID
;
expr
: abelian_expr
| boolean_expr
| block
| if_expr
| for_expr
| while_expr
| do_while_expr
| switch_expr
| BREAK
| CONTINUE
| RETURN expr
| RETURN
;
abelian_expr
: <assoc=right>abelian_expr POW abelian_expr
| ABELIAN_SUMOP expr
| LOGICAL_NEGATION expr
| NEW message
| NEW type_name '[' expr ']'
| abelian_expr ABELIAN_MULOP abelian_expr
| abelian_expr ABELIAN_SUMOP abelian_expr
| abelian_expr ABELIAN_RELOP abelian_expr
| null_expr
| /* this. */ message
| JAVA_ID
| JAVA_ID ABELIAN_INCREMENT_OP
| ABELIAN_INCREMENT_OP JAVA_ID
| constant
| '(' abelian_expr ')'
| abelian_expr '[' expr ']'
| abelian_expr '[' expr ']' ABELIAN_INCREMENT_OP
| ABELIAN_INCREMENT_OP expr '[' expr ']'
| ABELIAN_INCREMENT_OP expr '.' JAVA_ID
| abelian_expr '.' JAVA_ID ABELIAN_INCREMENT_OP
| abelian_expr '.' message
| abelian_expr '.' CLONE
| abelian_expr '.' CLONE '(' ')'
| abelian_expr '.' JAVA_ID
| <assoc=right> abelian_expr ASSIGN expr
;
message
: <assoc=right> JAVA_ID '(' argument_list ')'
;
boolean_expr
: boolean_expr BOOLEAN_MULOP expr
| boolean_expr BOOLEAN_SUMOP expr
| constant // 'true' / 'false'
| abelian_expr
;
block
: '{' expr_and_decl_list '}'
| '{' '}'
;
expr_or_null
: expr
| /* null */
;
if_expr
: 'if' '(' expr ')' expr
| 'if' '(' expr ')' expr 'else' expr
;
for_expr
: 'for' '(' object_decl expr ';' expr ')' expr // O.K. — expr can be a block
| 'for' '(' JAVA_ID ':' expr ')' expr
;
while_expr
: 'while' '(' expr ')' expr
;
do_while_expr
: 'do' expr 'while' '(' expr ')'
;
switch_expr
: SWITCH '(' expr ')' '{' ( switch_body )* '}'
;
switch_body
: ( CASE constant | DEFAULT ) ':' expr_and_decl_list
;
null_expr
: NULL
;
constant
: STRING
| INTEGER
| FLOAT
| BOOLEAN
;
argument_list
: expr
| argument_list ',' expr
| /* null */
;
// Lexer rules
STRING : '"' ( ~'"' | '\\' '"' )* '"' ;
INTEGER : ('1' .. '9')+ ('0' .. '9')* | '0';
FLOAT : (('1' .. '9')* | '0') '.' ('0' .. '9')* ;
BOOLEAN : 'true' | 'false' ;
SWITCH : 'switch' ;
CASE : 'case' ;
DEFAULT : 'default' ;
BREAK : 'break' ;
CONTINUE : 'continue' ;
RETURN : 'return' ;
REQUIRES : 'requires' ;
NEW : 'new' ;
CLONE : 'clone' ;
NULL : 'null' ;
CONST : 'const' ;
ABELIAN_RELOP : '!=' | '==' | '>' | '<' | '>=' | '<=';
LOGICAL_NOT : '!' ;
POW : '**' ;
BOOLEAN_SUMOP : '||' | '^' ;
BOOLEAN_MULOP : '&&' ;
ABELIAN_SUMOP : '+' | '-' ;
ABELIAN_MULOP : '*' | '/' | '%' ;
MINUS : '-' ;
PLUS : '+' ;
LOGICAL_NEGATION : '!' ;
ABELIAN_INCREMENT_OP : '++' | '--' ;
JAVA_ID: (('a' .. 'z') | ('A' .. 'Z')) (('a' .. 'z') | ('A' .. 'Z') | ('0' .. '9') | '_')* ;
INLINE_COMMENT: '//' ~[\r\n]* -> channel(HIDDEN) ;
C_COMMENT: '/*' .*? '*/' -> channel(HIDDEN) ;
WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ -> channel(HIDDEN) ;
ASSIGN : '=' ;