Problem matching single digits when integers are defined as tokens

I'm having trouble getting a grammar to work. Here is a simplified version. The language I'm trying to parse has expressions like these:
testing1(2342);
testing2(idfor2);
testing3(4654);
testing4[1..n];
testing5[0..1];
testing6(7);
testing7(1);
testing8(o);
testing9(n);
The problem arises when I introduce the rules for the [1..n] or [0..1] expressions. The grammar file (one of the many variations I've tried):
grammar test;
tests
: test* ;
test
: call
| declaration ;
call
: callName '(' callParameter ')' ';' ;
callName : Identifier ;
callParameter : Identifier | Integer ;
declaration
: declarationName '[' declarationParams ']' ';' ;
declarationName : Identifier ;
declarationParams
: decMin '..' decMax ;
decMin : '0' | '1' ;
decMax : '1' | 'n' ;
Integer : [0-9]+ ;
Identifier : [a-zA-Z_][a-zA-Z0-9_]* ;
WS : [ \t\r\n]+ -> skip ;
When I parse the sample with this grammar, it fails on testing7(1); and testing9(n);. The arguments are lexed as the literals from decMin and decMax instead of as Integer or Identifier:
line 8:9 mismatched input '1' expecting {Integer, Identifier}
line 10:9 mismatched input 'n' expecting {Integer, Identifier}
I've tried many variations, but I can't make it work.

I think your problem comes from not defining explicit lexer rules for the tokens you want.
When you added this rule:
decMin : '0' | '1' ;
you in fact created two implicit (unnamed) lexer rules, one matching '0' and another matching '1':
UNNAMED_0_RULE : '0';
UNNAMED_1_RULE : '1';
and your parser rule became:
decMin : UNNAMED_0_RULE | UNNAMED_1_RULE ;
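The problem is that now, when the lexer reads
testing7(1);
the parser can no longer match it as
callName '(' callParameter ')' ';'
because the token stream it receives is
callName '(' UNNAMED_1_RULE ')' ';'
and no alternative of callParameter accepts that token.
This happens because the lexer runs before the parser and knows nothing about parser rules. The input 1 matches both UNNAMED_1_RULE and Integer with the same length, and implicit literal tokens are defined ahead of the named lexer rules, so the implicit token wins the tie. The same happens to n, which becomes the implicit 'n' token from decMax instead of an Identifier.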
To solve your problem, define your lexer rules explicitly. It would probably look like this:
grammar test;
/*---------------- PARSER ----------------*/
tests
: test*
;
test
: call
| declaration
;
call
: callName '(' callParameter ')' ';'
;
callName
: identifier
;
callParameter
: identifier
| integer
;
declaration
: declarationName '[' declarationParams ']' ';'
;
declarationName
: identifier
;
declarationParams
: decMin '..' decMax
;
decMin
: INTEGER_ZERO
| INTEGER_ONE
;
decMax
: INTEGER_ONE
| LETTER_N
;
integer
: (INTEGER_ZERO | INTEGER_ONE | INTEGER_OTHERS)+
;
identifier
: LETTER_N
| IDENTIFIER
;
/*---------------- LEXER ----------------*/
LETTER_N: N;
IDENTIFIER
: [a-zA-Z_][a-zA-Z0-9_]*
;
WS
: [ \t\r\n]+ -> skip
;
INTEGER_ZERO: '0';
INTEGER_ONE: '1';
INTEGER_OTHERS: '2'..'9';
fragment N: [nN];
I just tested this grammar and it works.
The drawback is that it splits your integers at the lexer step (1245 becomes the four tokens 1 2 4 5) and then relies on the integer parser rule to put 1, 2, 4 and 5 back together.
I think it would be better to be less precise and simply write:
decMin: integer | identifier;
But then it depends on what you do with your grammar...

Related

Why isn't the program token recognized? ANTLR4

I have this grammar:
grammar BajaPower;
// Gramaticas
programa:PROGRAM ID ';' vars* bloque ;
vars:VAR ((ID|ID',')+ ':' tipo ';')+;
tipo:(INT|FLOAT);
bloque:'{' estatuto+ '}';
estatuto: (asignacion|condicion|escritura);
asignacion: ID '=' expresion ';';
condicion: 'if' '(' expresion ')' bloque (';'|'else' bloque ';');
escritura: 'print' '(' (expresion|STRING ',')* (expresion|STRING) ')' ';';
expresion: exp ('>'|'<'|'<>') exp;
exp: (termino ('+'|'-')*|termino);
termino: (factor ('*'|'/')*|factor);
factor: ('(' expresion ')')|('+'|'-') varcte| varcte;
varcte: (ID|CteI|CteF);
// Tokens
WS: [\t\r\n]+ -> skip;
PROGRAM:'program';
ID:([a-zA-Z]['_'(a-zA-Z0-9)+]*);
VAR:'var';
INT:'int';
FLOAT:'float';
CteI: ([1-9][0-9]*|'0');
CteF: [+-]?([0-9]*[.])?[0-9]+;
STRING:'"' [a-zA-Z0-9]+ '"';
And I'm trying to test it with the following code:
program TestCorrect;
var
x,y:int;
z:float;
{
x = 1;
y = 2;
z = (x+y*3)/4;
if (z > x) {
print("hola mundo",(x+y));
}
}
When I run it, it only detects program as an ID and not as the PROGRAM token.

There are quite a few things going wrong. In future, I suggest you build your grammar incrementally instead of trying to write the entire thing in one go and then concluding that it doesn't do what you meant it to.
Let's start with the lexer:
WS: [\t\r\n]+ -> skip does not include spaces
ID: ['_'(a-zA-Z0-9)+]* should be ('_'[a-zA-Z0-9]+)*
ID: the first part, [a-zA-Z], should probably be [a-zA-Z]+
VAR, INT, FLOAT are placed after ID, so when ID is properly defined, it will match var, int and float before these tokens
CteF: don't include [+-]?, leave that for the parser to deal with
STRING: [a-zA-Z0-9]+ does not include spaces, so "hola mundo" will not be matched
Now the parser:
vars: (ID|ID',')+ is wrong because it does not enforce a comma between IDs (ID ID is accepted) and it also allows a trailing comma before the ':'. Do ID (',' ID)* instead
condicion: (';'|'else' bloque ';') mandates a semi-colon should always be present after an if or else block, but in your input, you do not have a semi-colon. Do ('else' bloque)? instead
expresion: exp ('>'|'<'|'<>') exp means an expresion always contains one of the operators >, < or <>, which is not correct (an expression can also just be 1*2). Do exp (('>'|'<'|'<>') exp)? instead
exp: termino ('+'|'-')* is odd: that will match 1++++++++++++. Do termino (('+'|'-') termino)* instead
termino: factor ('*'|'/')* should be factor (('*'|'/') factor)* (same as exp)
varcte: should probably include STRING so that you do not have to write (expresion|STRING) in multiple places but can just write expresion
All in all, this should do the trick:
grammar BajaPower;
programa
: PROGRAM ID ';' vars* bloque
;
vars
: VAR (ID (',' ID)* ':' tipo ';')+
;
tipo
: INT
| FLOAT
;
bloque
:'{' estatuto+ '}'
;
estatuto
: asignacion
| condicion
| escritura
;
asignacion
: ID '=' expresion ';'
;
condicion
: 'if' '(' expresion ')' bloque ('else' bloque)?
;
escritura
: 'print' '(' (expresion ',')* expresion ')' ';'
;
expresion
: exp (('>'|'<'|'<>') exp)?
;
exp
: termino (('+'|'-') termino)*
;
termino
: factor (('*'|'/') factor)*
;
factor
: '(' expresion ')'
| ('+'|'-')? varcte
| STRING
;
varcte
: ID
| CteI
| CteF
;
WS : [ \t\r\n]+ -> skip;
PROGRAM : 'program';
VAR : 'var';
INT : 'int';
FLOAT : 'float';
CteI : [1-9][0-9]* | '0';
CteF : [0-9]* '.' [0-9]+;
ID : [a-zA-Z]+ ('_' [a-zA-Z0-9]+)*;
STRING : '"' .*? '"';

Antlr4 Mismatch input

First of all, I have read the solutions for the following similar questions: q1 q2 q3
Still I don't understand why I get the following message:
line 1:0 missing 'PROGRAM' at 'PROGRAM'
when I try to match the following:
PROGRAM test
BEGIN
END
My grammar:
grammar Wengo;
program : PROGRAM id BEGIN pgm_body END ;
id : IDENTIFIER ;
pgm_body : decl func_declarations ;
decl : string_decl decl | var_decl decl | empty ;
string_decl : STRING id ASSIGN str SEMICOLON ;
str : STRINGLITERAL ;
var_decl : var_type id_list SEMICOLON ;
var_type : FLOAT | INT ;
any_type : var_type | VOID ;
id_list : id id_tail ;
id_tail : COMA id id_tail | empty ;
param_decl_list : param_decl param_decl_tail | empty ;
param_decl : var_type id ;
param_decl_tail : COMA param_decl param_decl_tail | empty ;
func_declarations : func_decl func_declarations | empty ;
func_decl : FUNCTION any_type id (param_decl_list) BEGIN func_body END ;
func_body : decl stmt_list ;
stmt_list : stmt stmt_list | empty ;
stmt : base_stmt | if_stmt | loop_stmt ;
base_stmt : assign_stmt | read_stmt | write_stmt | control_stmt ;
assign_stmt : assign_expr SEMICOLON ;
assign_expr : id ASSIGN expr ;
read_stmt : READ ( id_list )SEMICOLON ;
write_stmt : WRITE ( id_list )SEMICOLON ;
return_stmt : RETURN expr SEMICOLON ;
expr : expr_prefix factor ;
expr_prefix : expr_prefix factor addop | empty ;
factor : factor_prefix postfix_expr ;
factor_prefix : factor_prefix postfix_expr mulop | empty ;
postfix_expr : primary | call_expr ;
call_expr : id ( expr_list ) ;
expr_list : expr expr_list_tail | empty ;
expr_list_tail : COMA expr expr_list_tail | empty ;
primary : ( expr ) | id | INTLITERAL | FLOATLITERAL ;
addop : ADD | MIN ;
mulop : MUL | DIV ;
if_stmt : IF ( cond ) decl stmt_list else_part ENDIF ;
else_part : ELSE decl stmt_list | empty ;
cond : expr compop expr | TRUE | FALSE ;
compop : LESS | GREAT | EQUAL | NOTEQUAL | LESSEQ | GREATEQ ;
while_stmt : WHILE ( cond ) decl stmt_list ENDWHILE ;
control_stmt : return_stmt | CONTINUE SEMICOLON | BREAK SEMICOLON ;
loop_stmt : while_stmt | for_stmt ;
init_stmt : assign_expr | empty ;
incr_stmt : assign_expr | empty ;
for_stmt : FOR ( init_stmt SEMICOLON cond SEMICOLON incr_stmt ) decl stmt_list ENDFOR ;
COMMENT : '--' ~[\r\n]* -> skip ;
WS : [ \t\r\n]+ -> skip ;
NEWLINE : [ \n] ;
EMPTY : $ ;
KEYWORD : PROGRAM|BEGIN|END|FUNCTION|READ|WRITE|IF|ELSE|ENDIF|WHILE|ENDWHILE|RETURN|INT|VOID|STRING|FLOAT|TRUE|FALSE|FOR|ENDFOR|CONTINUE|BREAK ;
OPERATOR : ASSIGN|ADD|MIN|MUL|DIV|EQUAL|NOTEQUAL|LESS|GREAT|LBRACKET|RBRACKET|SEMICOLON|COMA|LESSEQ|GREATEQ ;
IDENTIFIER : [a-zA-Z][a-zA-Z0-9]* ;
INTLITERAL : [0-9]+ ;
FLOATLITERAL : [0-9]*'.'[0-9]+ ;
STRINGLITERAL : '"' (~[\r\n"] | '""')* '"' ;
PROGRAM : 'PROGRAM';
BEGIN : 'BEGIN';
END : 'END';
FUNCTION : 'FUNCTION';
READ : 'READ';
WRITE : 'WRITE';
IF : 'IF';
ELSE : 'ELSE';
ENDIF : 'ENDIF';
WHILE : 'WHILE';
ENDWHILE : 'ENDWHILE';
RETURN : 'RETURN';
INT : 'INT';
VOID : 'VOID';
STRING : 'STRING';
FLOAT : 'FLOAT' ;
TRUE : 'TRUE';
FALSE : 'FALSE';
FOR : 'FOR';
ENDFOR : 'ENDFOR';
CONTINUE : 'CONTINUE';
BREAK : 'BREAK';
ASSIGN : ':=';
ADD : '+';
MIN : '-';
MUL : '*';
DIV : '/';
EQUAL : '=';
NOTEQUAL : '!=';
LESS : '<';
GREAT : '>';
LBRACKET : '(';
RBRACKET : ')';
SEMICOLON : ';';
COMA : ',';
LESSEQ : '<=';
GREATEQ : '>=';
From what I've read, I think there's a mismatch between KEYWORD and PROGRAM, but removing KEYWORD altogether does not solve the problem.
EDIT:
Removing KEYWORD gives the following message:
line 3:0 mismatched input 'END' expecting {'INT', 'STRING', 'FLOAT', '+'}
This is my grun output when KEYWORD is available:
[#0,0:6='PROGRAM',<KEYWORD>,1:0]
[#1,8:11='test',<IDENTIFIER>,1:8]
[#2,13:17='BEGIN',<KEYWORD>,2:0]
[#3,19:21='END',<KEYWORD>,3:0]
[#4,23:22='<EOF>',<EOF>,4:0]
line 1:0 mismatched input 'PROGRAM' expecting 'PROGRAM'
(program PROGRAM test BEGIN END)
This is the output when KEYWORD is removed:
[#0,0:6='PROGRAM',<'PROGRAM'>,1:0]
[#1,8:11='test',<IDENTIFIER>,1:8]
[#2,13:17='BEGIN',<'BEGIN'>,2:0]
[#3,19:21='END',<'END'>,3:0]
[#4,23:22='<EOF>',<EOF>,4:0]
line 3:0 mismatched input 'END' expecting {'INT', 'STRING', 'FLOAT', '+'}
(program PROGRAM (id test) BEGIN (pgm_body decl func_declarations) END)

The error about "missing 'PROGRAM'" has been solved when you removed the KEYWORD rule (note that you should also remove the OPERATOR rule for the same reasons).
The error you're encountering now is completely unrelated.
Your current problem concerns the definition of empty, which you didn't show. You've said that you tried both EMPTY : $ ; and EMPTY : ^$ ; (and then presumably empty: EMPTY;), but none of those even compile, so they wouldn't cause the parse error you posted. Either way, the concept of an EMPTY token can't work. When would such a token be generated? Once between every other token? In that case, you'd get a lot of "unexpected EMPTY" errors. No, the whole point of an empty rule is that it should succeed without consuming any tokens.
To achieve that, you can just define empty : ; and remove EMPTY altogether. Alternatively you could remove empty as well and just use an empty alternative (i.e. | ;) wherever you're currently using empty. Either approach will make your code work, but there's a better way:
You're using empty as the base case for rules that basically amount to lists. ANTLR offers the repetition operators * (0 or more) and + (1 or more), as well as the ? operator to make things optional. These allow you to define lists non-recursively and without an empty rule. For example, stmt_list could be defined like this:
stmt_list : stmt* ;
And id_list like this:
id_list : (id (',' id)*)? ;
On an unrelated note, your grammar can be simplified greatly by making use of the fact that ANTLR 4 supports direct left recursion, so you can get rid of all the different expression rules and just have one that's left-recursive.
That would give you:
expr : primary
| id '(' expr_list ')'
| expr mulop expr
| expr addop expr
;
The rules expr_prefix, factor, factor_prefix, postfix_expr and call_expr could then all be removed.
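To illustrate how a single left-recursive rule with alternatives ordered by precedence can be evaluated, here is a minimal Python sketch of precedence climbing. It is not the code ANTLR generates; the tokenizer, the PRECEDENCE table and the tuple-based tree are assumptions made for this example only.

```python
import re

# Higher number binds tighter, mirroring the order of the alternatives
# in the rule (mulop listed before addop). Values are illustrative.
PRECEDENCE = {"*": 2, "/": 2, "+": 1, "-": 1}

def tokenize(text):
    # Hypothetical mini-lexer: integers, identifiers, operators, parentheses.
    # Any other character (such as whitespace) is silently skipped.
    return re.findall(r"\d+|[A-Za-z]\w*|[-+*/()]", text)

def parse_primary(tokens, pos):
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    tok = tokens[pos]
    if tok == "(":
        inner, pos = parse_expr(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise SyntaxError("missing ')'")
        return inner, pos + 1
    if tok in PRECEDENCE or tok == ")":
        raise SyntaxError(f"unexpected {tok!r}")
    return tok, pos + 1

def parse_expr(tokens, pos=0, min_prec=1):
    left, pos = parse_primary(tokens, pos)
    # Keep folding operators whose precedence is at least min_prec.
    while pos < len(tokens) and PRECEDENCE.get(tokens[pos], 0) >= min_prec:
        op = tokens[pos]
        # Requiring a strictly higher precedence on the right-hand side
        # makes operators of equal precedence left-associative.
        right, pos = parse_expr(tokens, pos + 1, PRECEDENCE[op] + 1)
        left = (op, left, right)
    return left, pos

def parse(text):
    tokens = tokenize(text)
    tree, pos = parse_expr(tokens)
    if pos != len(tokens):
        raise SyntaxError(f"unexpected {tokens[pos]!r}")
    return tree
```

With this sketch, parse("1+2*3") groups the multiplication first, and parse("8-3-2") groups to the left, which is the behaviour one expects from the single expr rule above.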

ANTLR4 tells me: mismatched input 'little' expecting {'big', 'little'}

I have the following simple grammar:
grammar TestG;
p : pDecl+ ;
pDecl : endianDecl
| dTDecl
;
endianType : E_BIG
| E_LITTLE
;
endianDecl : 'endian' '=' endianType ';' ;
dTDecl : 'dT' '[' STRING ']' '=' ID ';' ;
STRING: '"'.*?'"' ; //Embedded quotes?
COMMENT: '#' .*? [\n\r] -> skip ; // Discard comments for now
ID : [a-zA-Z][a-zA-Z0-9_]* ;
WS : [ \t\n\r]+ -> skip ;
INT : ('0x')?[0-9]+ ; // How to handle 0xDD and ensure non zero?
E_BIG : 'big' ;
E_LITTLE : 'little' ;
When I run grun TestG p and input the following:
endian = little;
I get the following:
line 1:9 mismatched input 'little' expecting {'big', 'little'}
What have I done wrong?

Because your lexer rule for ID precedes that for E_LITTLE, your 'little' input is being lexed as an ID.
[#0,0:5='endian',<'endian'>,1:0]
[#1,7:7='=',<'='>,1:7]
[#2,9:14='little',<ID>,1:9] <== see here it's being lexed as an ID
[#3,15:15=';',<';'>,1:15]
[#4,18:17='<EOF>',<EOF>,2:0]
line 1:9 mismatched input 'little' expecting {'big', 'little'}
Moving these lexer rules above ID, like so:
STRING: '"'.*?'"' ; //Embedded quotes?
COMMENT: '#' .*? [\n\r] -> skip ; // Discard comments for now
E_BIG : 'big' ;
E_LITTLE : 'little' ;
ID : [a-zA-Z][a-zA-Z0-9_]* ;
WS : [ \t\n\r]+ -> skip ;
INT : ('0x')?[0-9]+ ; // How to handle 0xDD and ensure non zero?
yields the correct output from your test input.
[#0,0:5='endian',<'endian'>,1:0]
[#1,7:7='=',<'='>,1:7]
[#2,9:14='little',<'little'>,1:9] <== see here being lexed correctly
[#3,15:15=';',<';'>,1:15]
[#4,18:17='<EOF>',<EOF>,2:0]
Remember, for lexer tokens, the longest match wins, but in the case of a tie, the one that appears FIRST wins. This is why you want your more specific lexer tokens at the top of the lexer token list, and the more general ones (like identifiers, strings, etc.) farther down.
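The rule above (longest match wins, first rule wins a tie) can be modelled with a small Python sketch. This is a toy tokenizer, not ANTLR's lexer; the rule names ENDIAN, EQ and SEMI stand in for the implicit literal tokens and are assumptions made for the example.

```python
import re

def tokenize(text, rules):
    """Tokenize text with an ordered list of (name, regex) rules.

    The longest match wins; on a tie the rule listed first wins.
    Tokens produced by a rule named WS are dropped.
    """
    compiled = [(name, re.compile(pattern)) for name, pattern in rules]
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, regex in compiled:
            match = regex.match(text, pos)
            # Strict '>' keeps the earlier rule when two lengths are equal.
            if match and match.end() - pos > best_len:
                best_name, best_len = name, match.end() - pos
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        if best_name != "WS":
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

# Literals used directly in parser rules come first, as implicit tokens do.
LITERALS = [("ENDIAN", r"endian"), ("EQ", r"="), ("SEMI", r";")]
ID = ("ID", r"[a-zA-Z][a-zA-Z0-9_]*")
WS = ("WS", r"[ \t\r\n]+")
KEYWORDS = [("E_BIG", r"big"), ("E_LITTLE", r"little")]

ID_FIRST = LITERALS + [ID, WS] + KEYWORDS        # order from the question
KEYWORDS_FIRST = LITERALS + KEYWORDS + [ID, WS]  # order from the fix

before = tokenize("endian = little;", ID_FIRST)
after = tokenize("endian = little;", KEYWORDS_FIRST)
```

In this model, before contains ("ID", "little") while after contains ("E_LITTLE", "little"). An input such as littler is still an ID in both orderings, because the longer match takes priority over rule order.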

Antlr4 parsing inconsistency

In a little test parser I just wrote, I encountered a weird problem that I don't quite understand.
Stripping it down to the smallest example showing the problem, let's start with the following grammar:
Testing.g4:
grammar Testing;
cscript // This is the construct I shortened
: (statement_list)* ;
statement_list
: statement ';' statement_list?
| block
;
statement
: assignment_statement
;
block : '{' statement_list? '}' ;
expression
: left=expression op=('*'|'/') right=expression # arithmeticExpression
| left=expression op=('+'|'-') right=expression # arithmeticExpression
| left=expression op=Comparison_operator right=expression # comparisonExpression
| ID # variableValueExpression
| constant # ignore // will be executed with the rule name
;
assignment_statement
: ID op=Assignment_operator expression
;
constant
: INT
| REAL;
Assignment_operator : ('=' | '+=' | '-=') ;
Comparison_operator : ('<' | '>' | '==' | '!=') ;
Comment : '//' .*? '\n' -> skip;
fragment NUM : [0-9];
INT : NUM+;
REAL
: NUM* '.' NUM+
| '.' NUM+
| INT
;
ID : [a-zA-Z_] [a-zA-Z_0-9]*;
WS : [ \t\r\n]+ -> skip;
Using the input
z = x + y;
everything is fine, we get a parse tree which goes from cscript to statement_list, statement, assignment_statement, id and expression. Great!
Now, if I add the possibility to declare variables, all goes down the drain:
This is the change to the grammar:
cscript
: (statement_list | variable_declaration ';')* ;
variable_declaration
: type ID ('=' expression)?
;
type
: 'int'
| 'real'
;
statement_list
: statement ';' statement_list?
| block
;
statement
: assignment_statement
;
// (continue as before)
All of a sudden, the same test-input gets wrongly dissected into two statement_lists, each continued to a statement with a "missing ';'" warning, the first going to an incomplete assignment_statement of "z =" and the second to an incomplete assignment_statement "x +".
My attempt to show the parse tree in text-form:
cscript
statement_list
statement
assignment_statement
'z'
'=' [marked as error]
[warning: missing ';']
statement_list
statement
assignment_statement
'x'
'+' [marked as error]
'y' [marked as error]
';'
Can anyone tell me what the problem is? (And how to fix it? ;-))
Edit on 2016-12-26, after Mike's comment:
After replacing all implicit lexer rules with explicit declarations, all of a sudden, the input "z = x + y" worked. (thumbs up)
The next thing I did was restoring more of the original example I had in mind, and adding a new input line
int x = 22;
to the input (which worked previously, but did not make it into the minimal example). Now, that line fails. This is the -token output of the test rig:
[#0,0:2='int',<4>,1:0]
[#1,4:4='x',<22>,1:4]
[#2,6:6='=',<1>,1:6]
[#3,8:9='22',<20>,1:8]
[#4,10:10=';',<12>,1:10]
[#5,13:13='z',<22>,2:0]
[#6,15:15='=',<1>,2:2]
[#7,17:17='x',<22>,2:4]
[#8,19:19='+',<18>,2:6]
[#9,21:21='y',<22>,2:8]
[#10,22:22=';',<12>,2:9]
[#11,25:24='<EOF>',<-1>,3:0]
line 1:6 mismatched input '=' expecting '='
As the problem seemed to be in the variable_declaration part, I even tried to split this into two parsing rules like this:
cscript
: (statement_list | variable_declaration_and_assignment SEMICOLON | variable_declaration SEMICOLON)* ;
variable_declaration_and_assignment
: type ID EQUAL expression
;
variable_declaration
: type ID
;
With the result:
line 1:6 no viable alternative at input 'intx='
Still stuck :-(
BTW: Splitting the "int x = 22;" into "int x;" and "x = 22;" works. sigh
Edit on 2016-12-26, after Mike's next comment:
Double-checked, and everything is lexer rules. Still, the mismatch between '=' and '=' (which I unfortunately cannot reconstruct anymore) gave me the idea to check the token types. The current status is:
(Shortened grammar)
cscript
: (statement_list | variable_declaration)* ;
...
variable_declaration
: type ID (EQUAL expression)? SEMICOLON
;
...
Assignment_operator : (EQUAL | PLUS_EQ | MINUS_EQ) ;
// among others
PLUS_EQ : '+=';
MINUS_EQ : '-=';
EQUAL: '=';
...
Shortened output:
[#0,0:2='int',<4>,1:0]
[#1,4:4='x',<22>,1:4]
[#2,6:6='=',<1>,1:6]
...
line 1:6 mismatched input '=' expecting ';'
Here, if I understand this correctly, the '=' is parsed to token type 1, which - according to the lexer.tokens output - is Assignment_Operator, while the expected EQUAL would be 13.
Might this be the problem?

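OK, it seems the main takeaway here is: think about your token definitions and the order in which you define them. Create explicit lexer rules for your literals instead of defining them implicitly in the parser rules, and make sure no two lexer rules match the same text. Here both Assignment_operator and EQUAL match '=', and since Assignment_operator is defined first (type 1 versus 13), the lexer always emits Assignment_operator and the parser never sees EQUAL. Check the token types you get from the lexer if the parser gives you weird errors, because they must be correct in the first place or your parser has no chance to do its job.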
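Checking token types by hand gets tedious, so here is a small Python sketch that maps the numeric types in a -tokens dump back to rule names. It assumes the generated .tokens file uses NAME=number lines (plus quoted literal aliases); the sample content below contains only the two types mentioned in the question, and the helper names are made up for this example.

```python
import re

def load_token_names(tokens_text):
    """Build {type number: rule name} from the text of a .tokens file."""
    names = {}
    for line in tokens_text.splitlines():
        # Split on the last '=' so that a literal alias like '='=13 parses.
        name, sep, number = line.rpartition("=")
        if not sep or not number.strip().isdigit():
            continue
        name = name.strip()
        if name.startswith("'"):
            continue  # skip literal aliases, keep symbolic rule names
        names[int(number)] = name
    return names

# Matches dump lines such as [#2,6:6='=',<1>,1:6]
TOKEN_LINE = re.compile(r"\[#\d+,\d+:\d+='(.*)',<(-?\d+)>,\d+:\d+\]")

def annotate(dump, names):
    """Return (text, rule name) pairs; unknown types stay as <number>."""
    result = []
    for line in dump.splitlines():
        match = TOKEN_LINE.match(line.strip())
        if match:
            text, token_type = match.group(1), int(match.group(2))
            result.append((text, names.get(token_type, f"<{token_type}>")))
    return result

# Only the two types named in the question; everything else is unknown here.
SAMPLE_TOKENS_FILE = "Assignment_operator=1\nEQUAL=13\n'='=13\n"
SAMPLE_DUMP = """
[#0,0:2='int',<4>,1:0]
[#1,4:4='x',<22>,1:4]
[#2,6:6='=',<1>,1:6]
"""

names = load_token_names(SAMPLE_TOKENS_FILE)
annotated = annotate(SAMPLE_DUMP, names)
```

Here annotated shows '=' arriving as Assignment_operator rather than EQUAL, which is exactly the mismatch the parser complains about.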

Antlr4 left recursive rule contains a left recursive alternative which can be followed by the empty string

So I defined a grammar to parse a language with C-style syntax:
grammar mygrammar;
program
: (declaration)*
(statement)*
EOF
;
declaration
: INT ID '=' expression ';'
;
assignment
: ID '=' expression ';'
;
expression
: expression (op=('*'|'/') expression)*
| expression (op=('+'|'-') expression)*
| relation
| INT
| ID
| '(' expression ')'
;
relation
: expression (op=('<'|'>') expression)*
;
statement
: expression ';'
| ifstatement
| loopstatement
| printstatement
| assignment
;
ifstatement
: IF '(' expression ')' (statement)* FI ';'
;
loopstatement
: LOOP '(' expression ')' (statement)* POOL ';'
;
printstatement
: PRINT '(' expression ')' ';'
;
IF : 'if';
FI : 'fi';
LOOP : 'loop';
POOL : 'pool';
INT : 'int';
PRINT : 'print';
ID : [a-zA-Z][a-zA-Z0-9]*;
INTEGER : [0-9]+;
WS : [ \r\n\t] -> skip;
And I can parse a simple test as this:
int i = (2+3)*3/2*(3+36);
int j = i;
int k = 2*1+i*3;
if (k > 2)
k = k + 1;
i = i / 3;
j = j / 3;
fi;
loop (i < 10)
i = i + 1 * (i+k);
j = (j + 1) * (j-k);
k = i + j;
print(k);
pool;
However, when I try to generate the ANTLR recognizer in IntelliJ, I get this error:
sCalc.g4:19:0: left recursive rule expression contains a left recursive alternative which can be followed by the empty string
I wonder if this is because my ID could match an empty string?

There are a couple of issues with your grammar:
you have INT as an alternative inside expression while you probably want INTEGER instead
there is no need to do expression (op=('+'|'-') expression)*: this will do: expression op=('+'|'-') expression
ANTLR4 does not support indirect left recursive rules: you must include relation inside expression
Something like this ought to do it:
grammar mygrammar;
program
: (declaration)*
(statement)*
EOF
;
declaration
: INT ID '=' expression ';'
;
assignment
: ID '=' expression ';'
;
expression
: expression op=('*'|'/') expression
| expression op=('+'|'-') expression
| expression op=('<'|'>') expression
| INTEGER
| ID
| '(' expression ')'
;
statement
: expression ';'
| ifstatement
| loopstatement
| printstatement
| assignment
;
ifstatement
: IF '(' expression ')' (statement)* FI ';'
;
loopstatement
: LOOP '(' expression ')' (statement)* POOL ';'
;
printstatement
: PRINT '(' expression ')' ';'
;
IF : 'if';
FI : 'fi';
LOOP : 'loop';
POOL : 'pool';
INT : 'int';
PRINT : 'print';
ID : [a-zA-Z][a-zA-Z0-9]*;
INTEGER : [0-9]+;
WS : [ \r\n\t] -> skip;
Also note that (statement)* can simply be written as statement*

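The problem is in your expression and relation rules. The expression rule can match relation in one alternative, which in turn recurses back to expression, so the left recursion is indirect. In addition, in an alternative such as expression (op=('<'|'>') expression)* the starred tail can match nothing, so the leading left-recursive expression can be followed by the empty string, which is what the error message reports.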
A better approach is probably to have relation call expression and remove the relation alt from expression. Then use relation everywhere you used expression now. That's a typical scenario in expressions, starting out with low precedence operations as top level rules and drilling down to higher precedence rules, ultimately ending at a simple expression rule (or similar).
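The layering described above, with the lowest-precedence rule at the top and a simple rule at the bottom, can be sketched as a hand-written recursive descent parser in Python. The rule names additive, multiplicative and primary, the tokenizer and the tuple-based tree are assumptions for illustration, not part of the original grammar.

```python
import re

def tokenize(text):
    # Hypothetical mini-lexer; characters it does not know are skipped.
    return re.findall(r"\d+|[A-Za-z]\w*|[-+*/<>()]", text)

class Parser:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def binary(self, operand, operators):
        # operand (op operand)* : no rule ever calls itself first,
        # so there is no left recursion at all.
        left = operand()
        while self.peek() in operators:
            op = self.take()
            left = (op, left, operand())
        return left

    # Lowest precedence at the top, highest at the bottom.
    def relation(self):
        return self.binary(self.additive, ("<", ">"))

    def additive(self):
        return self.binary(self.multiplicative, ("+", "-"))

    def multiplicative(self):
        return self.binary(self.primary, ("*", "/"))

    def primary(self):
        tok = self.take()
        if tok == "(":
            inner = self.relation()
            if self.take() != ")":
                raise SyntaxError("missing ')'")
            return inner
        if tok is None or not re.fullmatch(r"\d+|[A-Za-z]\w*", tok):
            raise SyntaxError(f"unexpected {tok!r}")
        return tok

def parse(text):
    parser = Parser(text)
    tree = parser.relation()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected {parser.peek()!r}")
    return tree
```

Each level only calls the next tighter level, so statements such as k > 2 or i + 1 * (i+k) from the test input parse with the usual precedence, and every place that used expression would call relation instead.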
