I am using antlr 'org.antlr:antlr4:4.9.2' and come across the "dangling else" ambiguity problem; see the following grammar IfStat.g4.
// file: IfStat.g4
grammar IfStat;
stat : 'if' expr 'then' stat
| 'if' expr 'then' stat 'else' stat
| expr
;
expr : ID ;
ID : LETTER (LETTER | [0-9])* ;
fragment LETTER : [a-zA-Z] ;
WS : [ \t\n\r]+ -> skip ;
I tested this grammar against the input "if a then if b then c else d". It is parsed as `"if a then (if b then c else d)" as expected. How does ANTLR4 resolve this ambiguity?
ANTLR will choose the first possible (successful) path it is able to make.
You can enable ANTLR to report such ambiguities in your grammar. Check this Q&A for that: Ambiguity in grammar not reported by ANTLR
I'm having problem trying to get a grammar working. Here is the simplified version. The language I try to parse has expressions like these:
testing1(2342);
testing2(idfor2);
testing3(4654);
testing4[1..n];
testing5[0..1];
testing6(7);
testing7(1);
testing8(o);
testing9(n);
The problem arises when I introduce the rules for the [1..n] or [0..1] expressions. The grammar file (one of the many variations I've tried):
grammar test;
tests
: test* ;
test
: call
| declaration ;
call
: callName '(' callParameter ')' ';' ;
callName : Identifier ;
callParameter : Identifier | Integer ;
declaration
: declarationName '[' declarationParams ']' ';' ;
declarationName : Identifier ;
declarationParams
: decMin '..' decMax ;
decMin : '0' | '1' ;
decMax : '1' | 'n' ;
Integer : [0-9]+ ;
Identifier : [a-zA-Z_][a-zA-Z0-9_]* ;
WS : [ \t\r\n]+ -> skip ;
When I parse the sample with this grammar, it fails on testing7(1); and testint(9);. It matches as decMin or decMax instead of Integer or Identifier:
line 8:9 mismatched input '1' expecting {Integer, Identifier}
line 10:9 mismatched input 'n' expecting {Integer, Identifier}
I've tried many variations but I can't make it work fine.
I think your problem comes from not using lexer rules clearly defining what you want.
When you added this rule :
decMin : '0' | '1' ;
You in fact created an unnamed lexer rule that matches '0' and another one matching '1' :
UNNAMED_0_RULE : '0';
UNNAMED_1_RULE : '1';
And your parser rule became :
decMin : UNNAMED_0_RULE | UNNAMED_1_RULE ;
Problem : now, when your lexer see
testing7(1);
**it doesn't see **
callName '(' callParameter ')' ';'
anymore, it sees
callName '(' UNNAMED_1_RULE ')' ';'
and it doesn't understand that.
And that is because lexer rules are effective before the parser rules.
To solve your problem, define your lexer rules efficiently, It would probably look like that :
grammar test;
/*---------------- PARSER ----------------*/
tests
: test*
;
test
: call
| declaration
;
call
: callName '(' callParameter ')' ';'
;
callName
: identifier
;
callParameter
: identifier
| integer
;
declaration
: declarationName '[' declarationParams ']' ';'
;
declarationName
: identifier
;
declarationParams
: decMin '..' decMax
;
decMin
: INTEGER_ZERO
| INTEGER_ONE
;
decMax
: INTEGER_ONE
| LETTER_N
;
integer
: (INTEGER_ZERO | INTEGER_ONE | INTEGER_OTHERS)+
;
identifier
: LETTER_N
| IDENTIFIER
;
/*---------------- LEXER ----------------*/
LETTER_N: N;
IDENTIFIER
: [a-zA-Z_][a-zA-Z0-9_]*
;
WS
: [ \t\r\n]+ -> skip
;
INTEGER_ZERO: '0';
INTEGER_ONE: '1';
INTEGER_OTHERS: '2'..'9';
fragment N: [nN];
I just tested this grammar and it works.
The drawback is that it will cut your integers at the lexer step (cutting 1245 into 1 2 4 5 in lexer rules, and the considering the parser rule as uniting 1 2 4 and 5).
I think it would be better to be less precise and simply write :
decMin: integer | identifier;
But then it depends on what you do with your grammar...
in a little test-parser I just wrote, I encountered a weird problem, which I don't quite understand.
Stripping it down to the smallest example showing the problem, let's start with the following grammar:
Testing.g4:
grammar Testing;
cscript // This is the construct I shortened
: (statement_list)* ;
statement_list
: statement ';' statement_list?
| block
;
statement
: assignment_statement
;
block : '{' statement_list? '}' ;
expression
: left=expression op=('*'|'/') right=expression # arithmeticExpression
| left=expression op=('+'|'-') right=expression # arithmeticExpression
| left=expression op=Comparison_operator right=expression # comparisonExpression
| ID # variableValueExpression
| constant # ignore // will be executed with the rule name
;
assignment_statement
: ID op=Assignment_operator expression
;
constant
: INT
| REAL;
Assignment_operator : ('=' | '+=' | '-=') ;
Comparison_operator : ('<' | '>' | '==' | '!=') ;
Comment : '//' .*? '\n' -> skip;
fragment NUM : [0-9];
INT : NUM+;
REAL
: NUM* '.' NUM+
| '.' NUM+
| INT
;
ID : [a-zA-Z_] [a-zA-Z_0-9]*;
WS : [ \t\r\n]+ -> skip;
Using the input
z = x + y;
everything is fine, we get a parse tree which goes from cscript to statement_list, statement, assignment_statement, id and expression. Great!
Now, if I add the possibility to declare variables, all goes down the drain:
This is the change to the grammar:
cscript
: (statement_list | variable_declaration ';')* ;
variable_declaration
: type ID ('=' expression)?
;
type
: 'int'
| 'real'
;
statement_list
: statement ';' statement_list?
| block
;
statement
: assignment_statement
;
// (continue as before)
All of a sudden, the same test-input gets wrongly dissected into two statement_lists, each continued to a statement with a "missing ';'" warning, the first going to an incomplete assignment_statement of "z =" and the second to an incomplete assignment_statement "x +".
My attempt to show the parse tree in text-form:
cscript
statement_list
statement
assignment_statement
'z'
'=' [marked as error]
[warning: missing ';']
statement_list
statement
assignment_statement
'x'
'+' [marked as error]
'y' [marked as error]
';'
Can anyone tell me what the problem is? (And how to fix it? ;-))
Edit on 2016-12-26, after Mike's comment:
After replacing all implicit lexer rules with explicit declarations, all of a sudden, the input "z = x + y" worked. (thumbs up)
The next thing I did was restoring more of the original example I had in mind, and adding a new input line
int x = 22;
to the input (which worked previously, but did not make it into the minimal example). Now, that line fails. This is the -token output of the test rig:
[#0,0:2='int',<4>,1:0]
[#1,4:4='x',<22>,1:4]
[#2,6:6='=',<1>,1:6]
[#3,8:9='22',<20>,1:8]
[#4,10:10=';',<12>,1:10]
[#5,13:13='z',<22>,2:0]
[#6,15:15='=',<1>,2:2]
[#7,17:17='x',<22>,2:4]
[#8,19:19='+',<18>,2:6]
[#9,21:21='y',<22>,2:8]
[#10,22:22=';',<12>,2:9]
[#11,25:24='<EOF>',<-1>,3:0]
line 1:6 mismatched input '=' expecting '='
As the problem seemed to be in the variable_declaration part, I even tried to split this into two parsing rules like this:
cscript
: (statement_list | variable_declaration_and_assignment SEMICOLON | variable_declaration SEMICOLON)* ;
variable_declaration_and_assignment
: type ID EQUAL expression
;
variable_declaration
: type ID
;
With the result:
line 1:6 no viable alternative at input 'intx='
Still stuck :-(
BTW: Splitting the "int x = 22;" into "int x;" and "x = 22;" works. sigh
Edit on 2016-12-26, after Mike's next comment:
Double-checked, and everything is lexer rules. Still, the mismatch between '=' and '=' (which I unfortunately cannot reconstruct anymore) gave me the idea to check the token types. The current status is:
(Shortened grammar)
cscript
: (statement_list | variable_declaration)* ;
...
variable_declaration
: type ID (EQUAL expression)? SEMICOLON
;
...
Assignment_operator : (EQUAL | PLUS_EQ | MINUS_EQ) ;
// among others
PLUS_EQ : '+=';
MINUS_EQ : '-=';
EQUAL: '=';
...
Shortened output:
[#0,0:2='int',<4>,1:0]
[#1,4:4='x',<22>,1:4]
[#2,6:6='=',<1>,1:6]
...
line 1:6 mismatched input '=' expecting ';'
Here, if I understand this correctly, the '=' is parsed to token type 1, which - according to the lexer.tokens output - is Assignment_Operator, while the expected EQUAL would be 13.
Might this be the problem?
Ok, seems the main take away here is: think about your definitions and how you define them. Create explicit lexer rules for your literals instead of defining them implicitly in the parser rules. Check the token values you get from the lexer if the parser gives you weird errors, because they must be correct in the first place or your parse has no chance to do its job.
So I defined a grammar to parse an C style syntax language:
grammar mygrammar;
program
: (declaration)*
(statement)*
EOF
;
declaration
: INT ID '=' expression ';'
;
assignment
: ID '=' expression ';'
;
expression
: expression (op=('*'|'/') expression)*
| expression (op=('+'|'-') expression)*
| relation
| INT
| ID
| '(' expression ')'
;
relation
: expression (op=('<'|'>') expression)*
;
statement
: expression ';'
| ifstatement
| loopstatement
| printstatement
| assignment
;
ifstatement
: IF '(' expression ')' (statement)* FI ';'
;
loopstatement
: LOOP '(' expression ')' (statement)* POOL ';'
;
printstatement
: PRINT '(' expression ')' ';'
;
IF : 'if';
FI : 'fi';
LOOP : 'loop';
POOL : 'pool';
INT : 'int';
PRINT : 'print';
ID : [a-zA-Z][a-zA-Z0-9]*;
INTEGER : [0-9]+;
WS : [ \r\n\t] -> skip;
And I can parse a simple test as this:
int i = (2+3)*3/2*(3+36);
int j = i;
int k = 2*1+i*3;
if (k > 2)
k = k + 1;
i = i / 3;
j = j / 3;
fi;
loop (i < 10)
i = i + 1 * (i+k);
j = (j + 1) * (j-k);
k = i + j;
print(k);
pool;
However, when I want to generate ANTLR Recogonizers in intelliJ, I got this error:
sCalc.g4:19:0: left recursive rule expression contains a left recursive alternative which can be followed by the empty string
I wonder if this is caused by my ID could be an empty string?
There are a couple of issues with your grammar:
you have INT as an alternative inside expression while you probably want INTEGER instead
there is no need to do expression (op=('+'|'-') expression)*: this will do: expression op=('+'|'-') expression
ANTLR4 does not support indirect left recursive rules: you must include relation inside expression
Something like this ought to do it:
grammar mygrammar;
program
: (declaration)*
(statement)*
EOF
;
declaration
: INT ID '=' expression ';'
;
assignment
: ID '=' expression ';'
;
expression
: expression op=('*'|'/') expression
| expression op=('+'|'-') expression
| expression op=('<'|'>') expression
| INTEGER
| ID
| '(' expression ')'
;
statement
: expression ';'
| ifstatement
| loopstatement
| printstatement
| assignment
;
ifstatement
: IF '(' expression ')' (statement)* FI ';'
;
loopstatement
: LOOP '(' expression ')' (statement)* POOL ';'
;
printstatement
: PRINT '(' expression ')' ';'
;
IF : 'if';
FI : 'fi';
LOOP : 'loop';
POOL : 'pool';
INT : 'int';
PRINT : 'print';
ID : [a-zA-Z][a-zA-Z0-9]*;
INTEGER : [0-9]+;
WS : [ \r\n\t] -> skip;
Also not that this (statement)* can simply be written as statement*
It's about your expression and relation rules. The expression rule can match relation in one alt, which in turn recurses back to expression. Rule relation additionally can potentially match nothing because of (op=('<'|'>') expression)*
A better approach is probably to have relation call expression and remove the relation alt from expression. Then use relation everywhere you used expression now. That's a typical scenario in expressions, starting out with low precedence operations as top level rules and drilling down to higher precedence rules, ultimately ending at a simple expression rule (or similar).
I am new to xtext (antlr), and don't understand why there is ambiguity between IconType '}' and WindowType '}'.
I get the
warning(200): ../org.simsulla.tools.ui.gui/src-gen/org/simsulla/tools/ui/gui/parser/antlr/internal/InternalGui.g:137:1: Decision can match input such as "'windowType'" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
warning(200): ../org.simsulla.tools.ui.gui.ui/src-gen/org/simsulla/tools/ui/gui/ui/contentassist/antlr/internal/InternalGui.g:117:38: Decision can match input such as "'windowType' '=' '{' 'name' '=' RULE_ID 'iconType' '=' '{' 'name' '=' RULE_ID 'spriteType' '=' RULE_STRING 'Orientation' '=' RULE_STRING '}' '}'" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
This is my grammar
Ui:
guiTypes+=GuiTypes+;
GuiTypes:
windowType+=WindowType+
;
WindowType:
'windowType' '=' '{'
'name' '=' name=STRING
iconType+=IconType*
'}'
;
IconType:
'iconType' '=' '{'
'name' '=' name=STRING
'spriteType' '=' spriteType=STRING
'Orientation' '=' Orientation=STRING
'}'
;
The grammar fragment has multiple ways of repeating WindowType:
deriving from Ui to GuiTypes to multiple WindowType instances, or
deriving from Ui to multiple GuiTypes instances where each derives to a WindowType, or
any combination of the above.
So a given input corresponds to multiple different parse trees. The grammar is thus ambiguous.