Why isn't the program token recognized? ANTLR4 - antlr4

I have this grammar:
grammar BajaPower;
// Gramaticas
programa:PROGRAM ID ';' vars* bloque ;
vars:VAR ((ID|ID',')+ ':' tipo ';')+;
tipo:(INT|FLOAT);
bloque:'{' estatuto+ '}';
estatuto: (asignacion|condicion|escritura);
asignacion: ID '=' expresion ';';
condicion: 'if' '(' expresion ')' bloque (';'|'else' bloque ';');
escritura: 'print' '(' (expresion|STRING ',')* (expresion|STRING) ')' ';';
expresion: exp ('>'|'<'|'<>') exp;
exp: (termino ('+'|'-')*|termino);
termino: (factor ('*'|'/')*|factor);
factor: ('(' expresion ')')|('+'|'-') varcte| varcte;
varcte: (ID|CteI|CteF);
// Tokens
WS: [\t\r\n]+ -> skip;
PROGRAM:'program';
ID:([a-zA-Z]['_'(a-zA-Z0-9)+]*);
VAR:'var';
INT:'int';
FLOAT:'float';
CteI: ([1-9][0-9]*|'0');
CteF: [+-]?([0-9]*[.])?[0-9]+;
STRING:'"' [a-zA-Z0-9]+ '"';
And I'm trying to test it with the following code:
program TestCorrect;
var
x,y:int;
z:float;
{
x = 1;
y = 2;
z = (x+y*3)/4;
if (z > x) {
print("hola mundo",(x+y));
}
}
When I run it it only detects program as an ID and not the PROGRAM token.

There are quite a few things going wrong. In future, I suggest you incrementally create your grammar instead of (trying) to write the entire thing in one go and then coming to the conclusion it doesn't do what you meant it to.
Let's start with the lexer:
WS: [\t\r\n]+ -> skip does not include spaces
ID: ['_'(a-zA-Z0-9)+]* should be ('_'[a-zA-Z0-9]+)*
ID: the first part, [a-zA-Z], should probably be [a-zA-Z]+
VAR, INT, FLOAT are placed after ID, so when ID is properly defined, it will match var, int and float before these tokens
CteF: don't include [+-]?, leave that for the parser to deal with
STRING: [a-zA-Z0-9]+ doe not include spaces, so "hola mundo" will not be matched
Now the parser:
vars: (ID|ID',')+ is wrong because it now always has to end with a comma if you want to match multiple ID's. Do ID (',' ID)* instead
condicion: (';'|'else' bloque ';') mandates a semi-colon should always be present after an if or else block, but in your input, you do not have a semi-colon. Do ('else' bloque)? instead
expresion: exp ('>'|'<'|'<>') exp means an expresion always contains one of the operators >, < or <>, which is not correct (an expression can also just be 1*2). Do exp (('>'|'<'|'<>') exp)? instead
exp: termino ('+'|'-')* is odd: that will match 1++++++++++++. Do termino (('+'|'-') termino)* instead
termino: factor ('*'|'/')* should be factor (('*'|'/') factor)* (same as exp)
varcte: should probably include STRING so that you do not have to do this on multiple places: (expresion|STRING) but can then just do expresion
All in all, this should do the trick:
grammar BajaPower;
programa
: PROGRAM ID ';' vars* bloque
;
vars
: VAR (ID (',' ID)* ':' tipo ';')+
;
tipo
: INT
| FLOAT
;
bloque
:'{' estatuto+ '}'
;
estatuto
: asignacion
| condicion
| escritura
;
asignacion
: ID '=' expresion ';'
;
condicion
: 'if' '(' expresion ')' bloque ('else' bloque)?
;
escritura
: 'print' '(' (expresion ',')* expresion ')' ';'
;
expresion
: exp (('>'|'<'|'<>') exp)?
;
exp
: termino (('+'|'-') termino)*
;
termino
: factor (('*'|'/') factor)*
;
factor
: '(' expresion ')'
| ('+'|'-')? varcte
| STRING
;
varcte
: ID
| CteI
| CteF
;
WS : [ \t\r\n]+ -> skip;
PROGRAM : 'program';
VAR : 'var';
INT : 'int';
FLOAT : 'float';
CteI : [1-9][0-9]* | '0';
CteF : [0-9]* '.' [0-9]+;
ID : [a-zA-Z]+ ('_' [a-zA-Z0-9]+)*;
STRING : '"' .*? '"';

Related

Parsing key / value subparameters

I'm a bit clueless as to how I can parse (more or less) "free form" parameter lists, suppose the syntax allows for
PARM=(VAL1, 'VAL2', VAL3, KEY4=VAL4, KEY5=VAL5(XYZ), PARM=ABC, SOMETHING=ELSE)
I have managed to basically parse combos of positional and key/value parameters, but as soon as I hit a lexer token like PARM= the parser bails out with a "mismatched input", and I can't specifically allow for or expect anything because these parameters passed to a function are completely arbitrary.
So I'd think I'll need to switch to a specific lexer mode but right now I can't see how I would properly switch back to "normal" mode, the delimiters are PARM=( on the left and the closing ) on the right, but as the "data" itself can contain (pairs of) brackets how would I identify the correct closing paren so I don't prematurely end the lexer mode?
TIA - Alex
Edit 1:
Minimal grammar showing the issue with keywords being used where they shouldn't, as this is part of a complex grammar I can't change the order of tokens to put ID in front of everything else, for example, as it would catch too much. So I don't see how this can work short of breaking out into a different lexer mode.
lexer grammar ParmLexer;
SPACE : [ \t\r\n]+ -> channel(HIDDEN) ;
COMMA : ',' ;
EQUALS : '=' ;
LPAREN : '(' ;
RPAREN : ')' ;
PARM : 'PARM=' ;
ID : ID_LITERAL ;
fragment ID_LITERAL : [A-Za-z]+ ;
.
parser grammar ParmParser;
options { tokenVocab=ParmLexer; }
parms : PARM LPAREN parm+ RPAREN ;
parm : (pkey=ID EQUALS)? pval=ID COMMA? ;
Input:
PARM=( TEST, KEY=VAL, PARM=X)
Results in
line 1:22 extraneous input 'PARM=' expecting {')', ID}
So I'd think I'll need to switch to a specific lexer mode but right now I can't see how I would properly switch back to "normal" mode
Instead of switching to modes (with -> mode(...)), you can push your "special" mode on a stack (with -> pushMode(...)) and then when encountering a ) you pop a mode from the stack. That way, you can have multiple nested lists (..(..(..).)..). A quick demo:
lexer grammar ParmLexer;
SPACE : [ \t\r\n]+ -> channel(HIDDEN);
EQUALS : '=' ;
LPAREN : '(' -> pushMode(InList);
PARM : 'PARM';
ID : [A-Za-z] [A-Za-z0-9]*;
mode InList;
LST_LPAREN : '(' -> type(LPAREN), pushMode(InList);
RPAREN : ')' -> popMode;
COMMA : ',';
LST_EQUALS : '=' -> type(EQUALS);
STRING : '\'' ~['\r\n]* '\'';
LST_ID : [A-Za-z] [A-Za-z0-9]* -> type(ID);
LST_SPACE : [ \t\r\n]+ -> channel(HIDDEN);
and:
parser grammar ParmParser;
options { tokenVocab=ParmLexer; }
parse
: PARM EQUALS list EOF
;
list
: LPAREN ( value ( COMMA value )* )? RPAREN
;
value
: ID
| STRING
| key_value
| ID list
;
key_value
: ID EQUALS value
;
which will parse your example input PARM=(VAL1, 'VAL2', VAL3, KEY4=VAL4, KEY5=VAL5(XYZ), PARM=ABC, SOMETHING=ELSE) like this:
You don't have a rule (alternative) that recognizes a PARM token in your parm rule.
Bart has provided an answer using Lexer modes (and assuming that LPAREN and RPAREN always control those modes), but you can also just set up a parser rule that matches all of your keywords:
lexer grammar ParmLexer
;
SPACE: [ \t\r\n]+ -> channel(HIDDEN);
COMMA: ',';
EQUALS: '=';
LPAREN: '(';
RPAREN: ')';
PARM: 'PARM';
KW1: 'KW1';
KW2: 'KW2';
ID: ID_LITERAL;
fragment ID_LITERAL: [A-Za-z]+;
parser grammar ParmParser
;
options {
tokenVocab = ParmLexer;
}
parms: PARM EQUALS LPAREN parm (COMMA parm)* RPAREN;
parm: ((pkey = ID | kwid = kw) EQUALS)? pval = ID;
kw: PARM | KW1 | KW2;
input
"PARM=( TEST, KEY=VAL, KW2=v2, PARM=X)"
yields:
(parms PARM = ( (parm TEST) , (parm KEY = VAL) , (parm (kw KW2) = v) , (parm (kw PARM) = X) ))

How to express a required 'RETURN' statement in the grammar

I am still a newbie to ANTLR, so sorry if I am posting an obvious question.
I have a relatively simple grammar. What I need is for the user to be able to enter something like the following:
if (condition)
{
return true
}
else if (condition)
{
return false
}
else
{
if (condition)
{
return true
}
return false
}
In my grammar below, is there a way to make sure that an error will be flagged if the input string does not contain a 'return' statement? If not, can I do it via the Listener, and if so, how?
grammar Evaluator;
parse
: block EOF
;
block
: statement
;
statement
: return_statement
| if_statement
;
return_statement
: RETURN (TRUE | FALSE)
;
if_statement
: IF condition_block (ELSE IF condition_block)* (ELSE statement_block)?
;
condition_block
: expression statement_block
;
statement_block
: OBRACE block CBRACE
;
expression
: MINUS expression #unaryMinusExpression
| NOT expression #notExpression
| expression op=(MULT | DIV) expression #multiplicationExpression
| expression op=(PLUS | MINUS) expression #additiveExpression
| expression op=(LTEQ | GTEQ | LT | GT) expression #relationalExpression
| expression op=(EQ | NEQ) expression #equalityExpression
| expression AND expression #andExpression
| expression OR expression #orExpression
| atom #atomExpression
;
atom
: function #functionAtom
| OPAR expression CPAR #parenExpression
| (INT | FLOAT) #numberAtom
| (TRUE | FALSE) #booleanAtom
| ID #idAtom
;
function
: ID OPAR (parameter (',' parameter)*)? CPAR
;
parameter
: expression #expressionParameter
;
OR : '||';
AND : '&&';
EQ : '==';
NEQ : '!=';
GT : '>';
LT : '<';
GTEQ : '>=';
LTEQ : '<=';
PLUS : '+';
MINUS : '-';
MULT : '*';
DIV : '/';
NOT : '!';
OPAR : '(';
CPAR : ')';
OBRACE : '{';
CBRACE : '}';
ASSIGN : '=';
RETURN : 'return';
TRUE : 'true';
FALSE : 'false';
IF : 'if';
ELSE : 'else';
// ID either starts with a letter then followed by any number of a-zA-Z_0-9
// or starts with one or more numbers, then followed by at least one a-zA-Z_ then followed
// by any number of a-zA-Z_0-9
ID
: [a-zA-Z] [a-zA-Z_0-9]*
| [0-9]+ [a-zA-Z_]+ [a-zA-Z_0-9]*
;
INT
: [0-9]+
;
FLOAT
: [0-9]+ '.' [0-9]*
| '.' [0-9]+
;
SPACE
: [ \t\r\n] -> skip
;
// Anything not recognized above will be an error
ErrChar
: .
;
Ross' answer is perfectly correct. You design your grammar to accept a certain input. If the input stream does not correspond, the parser will complain.
Allow me to rewrite your grammar like this :
grammar Question;
/* enforce each block to end with a return statement */
a_grammar
: if_statement EOF
;
if_statement
: 'if' expression statement+ ( 'else' statement+ )?
;
statement
: if_statement
// other statements
| statement_block
;
statement_block
: '{' statement* return_statement '}'
;
return_statement
: 'return' ( 'true' | 'false' )
;
expression // reduced to a strict minimum to answer the OP question
: atom
| atom '<=' atom
| '(' expression ')'
;
atom
: ID
| INT
;
ID
: [a-zA-Z] [a-zA-Z_0-9]*
| [0-9]+ [a-zA-Z_]+ [a-zA-Z_0-9]*
;
INT : [0-9]+ ;
WS : [ \t\r\n] -> skip ;
// Anything not recognized above will be an error
ErrChar
: .
;
With the following input
if (a <= 7)
{
return true
}
else
if (xyz <= 99)
{
return false
}
else incor##!$rect
{
if (b <= a)
{
return true
}
return false
}
you get these tokens
[#0,0:1='if',<'if'>,1:0]
[#1,3:3='(',<'('>,1:3]
[#2,4:4='a',<ID>,1:4]
[#3,6:7='<=',<'<='>,1:6]
...
[#21,82:85='else',<'else'>,10:1]
[#22,87:91='incor',<ID>,10:6]
[#23,92:92='#',<ErrChar>,10:11]
[#24,93:93='#',<ErrChar>,10:12]
[#25,94:94='!',<ErrChar>,10:13]
[#26,95:95='$',<ErrChar>,10:14]
[#27,96:99='rect',<ID>,10:15]
[#28,102:102='{',<'{'>,11:1]
...
line 10:6 mismatched input 'incor' expecting {'if', '{'}
If you run the test rig with the -gui option, it displays the parse tree with erroneous tokens nicely displayed in pink !
grun Question a_grammar -gui data.txt
I've never played with the Listener before.
Via the Visitor, in the VisitStatement(StatementContext context) method, check if the context.return_statement() (ReturnStatementContext) is null. If it is null, throw an exception.
I'm a newbie as well. I was thinking of forcing the lexer to barf by
requiring a return statement, so instead of:
statement
: return_statement
| if_statement
;
Which says a statement is EITHER a if_statement OR a return_statement I would try something like:
statement
: (if_statement)? return_statement
;
Which (I believe), says the if_statement is optional but the return_statement MUST always occur. But you might want to try something like:
block_data : statements+ return_statement;
Where statements could be if_statements etc, and one or more of those are allowed.
I would take everything above with a grain of salt, as I have only been working with ANTLR4 a week or so. I have 4 .g4 files working, and am happy with ANTLR, but you may actually have more ANTLR stick time than I.
-Regards

Antlr4 left recursive rule contains a left recursive alternative which can be followed by the empty string

So I defined a grammar to parse an C style syntax language:
grammar mygrammar;
program
: (declaration)*
(statement)*
EOF
;
declaration
: INT ID '=' expression ';'
;
assignment
: ID '=' expression ';'
;
expression
: expression (op=('*'|'/') expression)*
| expression (op=('+'|'-') expression)*
| relation
| INT
| ID
| '(' expression ')'
;
relation
: expression (op=('<'|'>') expression)*
;
statement
: expression ';'
| ifstatement
| loopstatement
| printstatement
| assignment
;
ifstatement
: IF '(' expression ')' (statement)* FI ';'
;
loopstatement
: LOOP '(' expression ')' (statement)* POOL ';'
;
printstatement
: PRINT '(' expression ')' ';'
;
IF : 'if';
FI : 'fi';
LOOP : 'loop';
POOL : 'pool';
INT : 'int';
PRINT : 'print';
ID : [a-zA-Z][a-zA-Z0-9]*;
INTEGER : [0-9]+;
WS : [ \r\n\t] -> skip;
And I can parse a simple test as this:
int i = (2+3)*3/2*(3+36);
int j = i;
int k = 2*1+i*3;
if (k > 2)
k = k + 1;
i = i / 3;
j = j / 3;
fi;
loop (i < 10)
i = i + 1 * (i+k);
j = (j + 1) * (j-k);
k = i + j;
print(k);
pool;
However, when I want to generate ANTLR Recogonizers in intelliJ, I got this error:
sCalc.g4:19:0: left recursive rule expression contains a left recursive alternative which can be followed by the empty string
I wonder if this is caused by my ID could be an empty string?
There are a couple of issues with your grammar:
you have INT as an alternative inside expression while you probably want INTEGER instead
there is no need to do expression (op=('+'|'-') expression)*: this will do: expression op=('+'|'-') expression
ANTLR4 does not support indirect left recursive rules: you must include relation inside expression
Something like this ought to do it:
grammar mygrammar;
program
: (declaration)*
(statement)*
EOF
;
declaration
: INT ID '=' expression ';'
;
assignment
: ID '=' expression ';'
;
expression
: expression op=('*'|'/') expression
| expression op=('+'|'-') expression
| expression op=('<'|'>') expression
| INTEGER
| ID
| '(' expression ')'
;
statement
: expression ';'
| ifstatement
| loopstatement
| printstatement
| assignment
;
ifstatement
: IF '(' expression ')' (statement)* FI ';'
;
loopstatement
: LOOP '(' expression ')' (statement)* POOL ';'
;
printstatement
: PRINT '(' expression ')' ';'
;
IF : 'if';
FI : 'fi';
LOOP : 'loop';
POOL : 'pool';
INT : 'int';
PRINT : 'print';
ID : [a-zA-Z][a-zA-Z0-9]*;
INTEGER : [0-9]+;
WS : [ \r\n\t] -> skip;
Also not that this (statement)* can simply be written as statement*
It's about your expression and relation rules. The expression rule can match relation in one alt, which in turn recurses back to expression. Rule relation additionally can potentially match nothing because of (op=('<'|'>') expression)*
A better approach is probably to have relation call expression and remove the relation alt from expression. Then use relation everywhere you used expression now. That's a typical scenario in expressions, starting out with low precedence operations as top level rules and drilling down to higher precedence rules, ultimately ending at a simple expression rule (or similar).

antlr4 literal string handling

I have the following antlr4 grammar:
grammar squirrel;
program: globalstatement+;
globalstatement: globalvardef | classdef | functiondef;
globalvardef: IDENT '=' constantexpr ';';
classdef: CLASS IDENT '{' classstatement+ '}';
functiondef: FUNCTION IDENT '(' parameterlist ')' functionbody;
constructordef: CONSTRUCTOR '(' parameterlist ')' functionbody;
parameterlist: IDENT (',' IDENT)* | ;
functionbody: '{' statement* '}';
classstatement: globalvardef | functiondef | constructordef;
statement: expression ';';
expression:
IDENT # ident |
IDENT '=' expression # assignment |
IDENT ('.' IDENT)+ # lookupchain |
constantexpr # constant |
IDENT '(' expressionlist ')' # functioncall |
expression '+' expression # addition;
constantexpr: INTEGER | STRING;
expressionlist: expression (',' expression)* | ;
CONSTRUCTOR: 'constructor';
CLASS: 'class';
FUNCTION: 'function';
COMMENT: '//'.*[\n];
STRING: '"' CHAR* '"';
CHAR: [ a-zA-Z0-9];
INTEGER: [0-9]+;
IDENT: [a-zA-Z]+;
WS: [ \t\r\n]+ -> skip;
Now if I parse this file:
z = "global variable";
class Base
{
z = 10;
}
everything is fine:
#0,0:0='z',<16>,1:0
#1,2:2='=',<1>,1:2
#2,4:20='"global variable"',<14>,1:4
#3,21:21=';',<2>,1:21
#4,26:30='class',<11>,3:0
#5,32:35='Base',<16>,3:6
#6,38:38='{',<3>,4:0
#7,42:42='z',<16>,5:1
#8,44:44='=',<1>,5:3
#9,46:47='10',<15>,5:5
#10,48:48=';',<2>,5:7
#11,51:51='}',<4>,6:0
#12,56:55='<EOF>',<-1>,8:0
But with this file:
z = "global variable";
class Base
{
z = "10";
}
I get this:
#0,0:0='z',<16>,1:0
#1,2:2='=',<1>,1:2
#2,4:49='"global variable";\r\n\r\nclass Base\r\n{\r\n\tz = "10"',<14>,1:4
#3,50:50=';',<2>,5:9
#4,53:53='}',<4>,6:0
#5,58:57='<EOF>',<-1>,8:0
So it seems like everything between the first " and last " in a file gets matched to one string literal.
How do I prevent this ?
Note the string is matching from the first quote to the last possible quote.
By default, a Kleene operator (*) in ANTLR is greedy. So, change
STRING: '"' CHAR* '"';
to
STRING: '"' CHAR*? '"';
to make it non-greedy.

ANTLR4 Grammar picks up 'and' and 'or' in variable names

Please help me with my ANTLR4 Grammar.
Sample "formel":
(Arbejde.ArbejderIKommuneNr=860) and (Arbejde.ErIArbejde = 'J') &
(Arbejde.ArbejdsTimerPrUge = 40)
(Ansogeren.BorIKommunen = 'J') and (BeregnDato(Ansogeren.Fodselsdato;
'+62Å') < DagsDato)
(Arb.BorI=860)
My problem is that Arb.BorI=860 is not handled correct. I get this error:
Error: no viable alternative at input '(Arb.Bor' at linenr/position: 1/6 \r\nException: Der blev udløst en undtagelse af typen 'Antlr4.Runtime.NoViableAltException
Please notis that Arb.BorI contains the word 'or'.
I think my problem is that my 'booleanOps' in the grammar override 'datakildefelt'
So... My problem is how do I get my grammar correct - I am stuck, so any help will be appreciated.
My Grammar:
grammar UnikFormel;
formel : boolExpression # BooleanExpr
| expression # Expr
| '(' formel ')' # Parentes;
boolExpression : ( '(' expression ')' ) ( booleanOps '(' expression ')' )+;
expression : element compareOps element # Compare;
element : datakildefelt # DatakildeId
| function # Funktion
| int # Integer
| decimal # Real
| string # Text;
datakildefelt : datakilde '.' felt;
datakilde : identifyer;
felt : identifyer;
function : funktionsnavn ('(' funcParameters? ')')?;
funktionsnavn : identifyer;
funcParameters : funcParameter (';' funcParameter)*;
funcParameter : element;
identifyer : LETTER+;
int : DIGIT+;
decimal : DIGIT+ '.' DIGIT+ | '.' DIGIT+;
string : QUOTE .*? QUOTE;
booleanOps : (AND | OR);
compareOps : (LT | GT | EQ | GTEQ | LTEQ);
QUOTE : '\'';
OPERATOR: '+';
DIGIT: [0-9];
LETTER: [a-åA-Å];
MUL : '*';
DIV : '/';
ADD : '+';
SUB : '-';
GT : '>';
LT : '<';
EQ : '=';
GTEQ : '>=';
LTEQ : '<=';
AND : '&' | 'and';
OR : '?' | 'or';
WS : ' '+ -> skip;
Rules that come first always have precedence. In your case you need to move AND and OR before LETTER. Also there is the same problem with GTEQ and LTEQ, maybe somewhere else too.
EDIT
Additionally, you should make identifyer a lexer rule, i.e. start with capital letter (IDENTIFIER or Identifier). The same goes for int, decimal and string. Input is initially a stream of characters and is first processed into a stream of tokens, using only lexer rules. At this point parser rules (those starting with lowercase letter) do not come to play yet. So, to make "BorI" parse as single entity (token), you need to create a lexer rule that matches identifiers. Currently it would be parsed as 3 tokens: LETTER (B) OR (or) LETTER (I).
Thanks for your help. There were multiple problems. Reading the ANTLR4 book and using "TestRig -gui" got me on the right track. The working grammar is:
grammar UnikFormel;
formel : '(' formel ')' # Parentes
| expression # Expr
| boolExpression # BooleanExpr
;
boolExpression : '(' expression ')' ( booleanOps '(' expression ')' )+
| '(' formel ')' ( booleanOps '(' formel ')' )+;
expression : element compareOps element # Compare;
datakildefelt : ID '.' ID;
function : ID ('(' funcParameters? ')')?;
funcParameters : funcParameter (';' funcParameter)*;
funcParameter : element;
element : datakildefelt # DatakildeId
| function # Funktion
| INT # Integer
| DECIMAL # Real
| STRING # Text;
booleanOps : (AND | OR);
compareOps : ( GTEQ | LTEQ | LT | GT | EQ |);
AND : '&' | 'and';
OR : '?' | 'or';
GTEQ : '>=';
LTEQ : '<=';
GT : '>';
LT : '<';
EQ : '=';
ID : LETTER ( LETTER | DIGIT)*;
INT : DIGIT+;
DECIMAL : DIGIT+ '.' DIGIT+ | '.' DIGIT+;
STRING : QUOTE .*? QUOTE;
fragment QUOTE : '\'';
fragment DIGIT: [0-9];
fragment LETTER: [a-åA-Å];
WS : [ \t\r\n]+ -> skip;

Resources