Antlr4 Mismatch input - antlr4

First of all, I have read the solutions for the following similar questions: q1 q2 q3
Still I don't understand why I get the following message:
line 1:0 missing 'PROGRAM' at 'PROGRAM'
when I try to match the following:
PROGRAM test
BEGIN
END
My grammar:
grammar Wengo;
program : PROGRAM id BEGIN pgm_body END ;
id : IDENTIFIER ;
pgm_body : decl func_declarations ;
decl : string_decl decl | var_decl decl | empty ;
string_decl : STRING id ASSIGN str SEMICOLON ;
str : STRINGLITERAL ;
var_decl : var_type id_list SEMICOLON ;
var_type : FLOAT | INT ;
any_type : var_type | VOID ;
id_list : id id_tail ;
id_tail : COMA id id_tail | empty ;
param_decl_list : param_decl param_decl_tail | empty ;
param_decl : var_type id ;
param_decl_tail : COMA param_decl param_decl_tail | empty ;
func_declarations : func_decl func_declarations | empty ;
func_decl : FUNCTION any_type id (param_decl_list) BEGIN func_body END ;
func_body : decl stmt_list ;
stmt_list : stmt stmt_list | empty ;
stmt : base_stmt | if_stmt | loop_stmt ;
base_stmt : assign_stmt | read_stmt | write_stmt | control_stmt ;
assign_stmt : assign_expr SEMICOLON ;
assign_expr : id ASSIGN expr ;
read_stmt : READ ( id_list )SEMICOLON ;
write_stmt : WRITE ( id_list )SEMICOLON ;
return_stmt : RETURN expr SEMICOLON ;
expr : expr_prefix factor ;
expr_prefix : expr_prefix factor addop | empty ;
factor : factor_prefix postfix_expr ;
factor_prefix : factor_prefix postfix_expr mulop | empty ;
postfix_expr : primary | call_expr ;
call_expr : id ( expr_list ) ;
expr_list : expr expr_list_tail | empty ;
expr_list_tail : COMA expr expr_list_tail | empty ;
primary : ( expr ) | id | INTLITERAL | FLOATLITERAL ;
addop : ADD | MIN ;
mulop : MUL | DIV ;
if_stmt : IF ( cond ) decl stmt_list else_part ENDIF ;
else_part : ELSE decl stmt_list | empty ;
cond : expr compop expr | TRUE | FALSE ;
compop : LESS | GREAT | EQUAL | NOTEQUAL | LESSEQ | GREATEQ ;
while_stmt : WHILE ( cond ) decl stmt_list ENDWHILE ;
control_stmt : return_stmt | CONTINUE SEMICOLON | BREAK SEMICOLON ;
loop_stmt : while_stmt | for_stmt ;
init_stmt : assign_expr | empty ;
incr_stmt : assign_expr | empty ;
for_stmt : FOR ( init_stmt SEMICOLON cond SEMICOLON incr_stmt ) decl stmt_list ENDFOR ;
COMMENT : '--' ~[\r\n]* -> skip ;
WS : [ \t\r\n]+ -> skip ;
NEWLINE : [ \n] ;
EMPTY : $ ;
KEYWORD : PROGRAM|BEGIN|END|FUNCTION|READ|WRITE|IF|ELSE|ENDIF|WHILE|ENDWHILE|RETURN|INT|VOID|STRING|FLOAT|TRUE|FALSE|FOR|ENDFOR|CONTINUE|BREAK ;
OPERATOR : ASSIGN|ADD|MIN|MUL|DIV|EQUAL|NOTEQUAL|LESS|GREAT|LBRACKET|RBRACKET|SEMICOLON|COMA|LESSEQ|GREATEQ ;
IDENTIFIER : [a-zA-Z][a-zA-Z0-9]* ;
INTLITERAL : [0-9]+ ;
FLOATLITERAL : [0-9]*'.'[0-9]+ ;
STRINGLITERAL : '"' (~[\r\n"] | '""')* '"' ;
PROGRAM : 'PROGRAM';
BEGIN : 'BEGIN';
END : 'END';
FUNCTION : 'FUNCTION';
READ : 'READ';
WRITE : 'WRITE';
IF : 'IF';
ELSE : 'ELSE';
ENDIF : 'ENDIF';
WHILE : 'WHILE';
ENDWHILE : 'ENDWHILE';
RETURN : 'RETURN';
INT : 'INT';
VOID : 'VOID';
STRING : 'STRING';
FLOAT : 'FLOAT' ;
TRUE : 'TRUE';
FALSE : 'FALSE';
FOR : 'FOR';
ENDFOR : 'ENDFOR';
CONTINUE : 'CONTINUE';
BREAK : 'BREAK';
ASSIGN : ':=';
ADD : '+';
MIN : '-';
MUL : '*';
DIV : '/';
EQUAL : '=';
NOTEQUAL : '!=';
LESS : '<';
GREAT : '>';
LBRACKET : '(';
RBRACKET : ')';
SEMICOLON : ';';
COMA : ',';
LESSEQ : '<=';
GREATEQ : '>=';
From what I've read, I think there's a mismatch between KEYWORD and PROGRAM, but removing KEYWORD altogether does not solve the problem.
EDIT:
Removing KEYWORD gives the following message:
line 3:0 mismatched input 'END' expecting {'INT', 'STRING', 'FLOAT', '+'}
This my grun output when KEYWORD is available:
[#0,0:6='PROGRAM',<KEYWORD>,1:0]
[#1,8:11='test',<IDENTIFIER>,1:8]
[#2,13:17='BEGIN',<KEYWORD>,2:0]
[#3,19:21='END',<KEYWORD>,3:0]
[#4,23:22='<EOF>',<EOF>,4:0]
line 1:0 mismatched input 'PROGRAM' expecting 'PROGRAM'
(program PROGRAM test BEGIN END)
This is the output when KEYWORD is removed:
[#0,0:6='PROGRAM',<'PROGRAM'>,1:0]
[#1,8:11='test',<IDENTIFIER>,1:8]
[#2,13:17='BEGIN',<'BEGIN'>,2:0]
[#3,19:21='END',<'END'>,3:0]
[#4,23:22='<EOF>',<EOF>,4:0]
line 3:0 mismatched input 'END' expecting {'INT', 'STRING', 'FLOAT', '+'}
(program PROGRAM (id test) BEGIN (pgm_body decl func_declarations) END)

The error about "missing 'PROGRAM'" has been solved when you removed the KEYWORD rule (note that you should also remove the OPERATOR rule for the same reasons).
The error you're encountering now is completely unrelated.
Your current problem concerns the definition of empty, which you didn't show. You've said that you tried both EMPTY : $ ; and EMPTY : ^$ ; (and then presumably empty: EMPTY;), but none of those even compile, so they wouldn't cause the parse error you posted. Either way, the concept of an EMPTY token can't work. When would such a token be generated? Once between every other token? In that case, you'd get a lot of "unexpected EMPTY" errors. No, the whole point of an empty rule is that it should succeed without consuming any tokens.
To achieve that, you can just define empty : ; and remove EMPTY altogether. Alternatively you could remove empty as well and just use an empty alternative (i.e. | ;) wherever you're currently using empty. Either approach will make your code work, but there's a better way:
You're using empty as the base case for rules that basically amount to lists. ANTLR offers the repetition operators * (0 or more) , + (1 or more) as well as the ? operator to make things optional. These allow you to define lists non-recursively and without an empty rule. For example stmt_list could be defined like this:
stmt_list : stmt* ;
And id_list like this:
id_list : (id (',' id)*)? ;
On an unrelated note, your grammar can simplified greatly by making use of the fact that ANTLR 4 supports direct left recursion, so you can get rid of all the different expression rules and just have one that's left-recursive.
That'd give you:
expr : primary
| id '(' expr_list ')'
| expr mulop expr
| expr addop expr
;
And the rules expr_prefix, factor, factor_prefix and postfix_expr and call_expr could all be removed.

Related

Why isn't the program token recognized? ANTLR4

I have this grammar:
grammar BajaPower;
// Gramaticas
programa:PROGRAM ID ';' vars* bloque ;
vars:VAR ((ID|ID',')+ ':' tipo ';')+;
tipo:(INT|FLOAT);
bloque:'{' estatuto+ '}';
estatuto: (asignacion|condicion|escritura);
asignacion: ID '=' expresion ';';
condicion: 'if' '(' expresion ')' bloque (';'|'else' bloque ';');
escritura: 'print' '(' (expresion|STRING ',')* (expresion|STRING) ')' ';';
expresion: exp ('>'|'<'|'<>') exp;
exp: (termino ('+'|'-')*|termino);
termino: (factor ('*'|'/')*|factor);
factor: ('(' expresion ')')|('+'|'-') varcte| varcte;
varcte: (ID|CteI|CteF);
// Tokens
WS: [\t\r\n]+ -> skip;
PROGRAM:'program';
ID:([a-zA-Z]['_'(a-zA-Z0-9)+]*);
VAR:'var';
INT:'int';
FLOAT:'float';
CteI: ([1-9][0-9]*|'0');
CteF: [+-]?([0-9]*[.])?[0-9]+;
STRING:'"' [a-zA-Z0-9]+ '"';
And I'm trying to test it with the following code:
program TestCorrect;
var
x,y:int;
z:float;
{
x = 1;
y = 2;
z = (x+y*3)/4;
if (z > x) {
print("hola mundo",(x+y));
}
}
When I run it it only detects program as an ID and not the PROGRAM token.
There are quite a few things going wrong. In future, I suggest you incrementally create your grammar instead of (trying) to write the entire thing in one go and then coming to the conclusion it doesn't do what you meant it to.
Let's start with the lexer:
WS: [\t\r\n]+ -> skip does not include spaces
ID: ['_'(a-zA-Z0-9)+]* should be ('_'[a-zA-Z0-9]+)*
ID: the first part, [a-zA-Z], should probably be [a-zA-Z]+
VAR, INT, FLOAT are placed after ID, so when ID is properly defined, it will match var, int and float before these tokens
CteF: don't include [+-]?, leave that for the parser to deal with
STRING: [a-zA-Z0-9]+ doe not include spaces, so "hola mundo" will not be matched
Now the parser:
vars: (ID|ID',')+ is wrong because it now always has to end with a comma if you want to match multiple ID's. Do ID (',' ID)* instead
condicion: (';'|'else' bloque ';') mandates a semi-colon should always be present after an if or else block, but in your input, you do not have a semi-colon. Do ('else' bloque)? instead
expresion: exp ('>'|'<'|'<>') exp means an expresion always contains one of the operators >, < or <>, which is not correct (an expression can also just be 1*2). Do exp (('>'|'<'|'<>') exp)? instead
exp: termino ('+'|'-')* is odd: that will match 1++++++++++++. Do termino (('+'|'-') termino)* instead
termino: factor ('*'|'/')* should be factor (('*'|'/') factor)* (same as exp)
varcte: should probably include STRING so that you do not have to do this on multiple places: (expresion|STRING) but can then just do expresion
All in all, this should do the trick:
grammar BajaPower;
programa
: PROGRAM ID ';' vars* bloque
;
vars
: VAR (ID (',' ID)* ':' tipo ';')+
;
tipo
: INT
| FLOAT
;
bloque
:'{' estatuto+ '}'
;
estatuto
: asignacion
| condicion
| escritura
;
asignacion
: ID '=' expresion ';'
;
condicion
: 'if' '(' expresion ')' bloque ('else' bloque)?
;
escritura
: 'print' '(' (expresion ',')* expresion ')' ';'
;
expresion
: exp (('>'|'<'|'<>') exp)?
;
exp
: termino (('+'|'-') termino)*
;
termino
: factor (('*'|'/') factor)*
;
factor
: '(' expresion ')'
| ('+'|'-')? varcte
| STRING
;
varcte
: ID
| CteI
| CteF
;
WS : [ \t\r\n]+ -> skip;
PROGRAM : 'program';
VAR : 'var';
INT : 'int';
FLOAT : 'float';
CteI : [1-9][0-9]* | '0';
CteF : [0-9]* '.' [0-9]+;
ID : [a-zA-Z]+ ('_' [a-zA-Z0-9]+)*;
STRING : '"' .*? '"';

Problem matching single digits when integers are defined as tokens

I'm having problem trying to get a grammar working. Here is the simplified version. The language I try to parse has expressions like these:
testing1(2342);
testing2(idfor2);
testing3(4654);
testing4[1..n];
testing5[0..1];
testing6(7);
testing7(1);
testing8(o);
testing9(n);
The problem arises when I introduce the rules for the [1..n] or [0..1] expressions. The grammar file (one of the many variations I've tried):
grammar test;
tests
: test* ;
test
: call
| declaration ;
call
: callName '(' callParameter ')' ';' ;
callName : Identifier ;
callParameter : Identifier | Integer ;
declaration
: declarationName '[' declarationParams ']' ';' ;
declarationName : Identifier ;
declarationParams
: decMin '..' decMax ;
decMin : '0' | '1' ;
decMax : '1' | 'n' ;
Integer : [0-9]+ ;
Identifier : [a-zA-Z_][a-zA-Z0-9_]* ;
WS : [ \t\r\n]+ -> skip ;
When I parse the sample with this grammar, it fails on testing7(1); and testint(9);. It matches as decMin or decMax instead of Integer or Identifier:
line 8:9 mismatched input '1' expecting {Integer, Identifier}
line 10:9 mismatched input 'n' expecting {Integer, Identifier}
I've tried many variations but I can't make it work fine.
I think your problem comes from not using lexer rules clearly defining what you want.
When you added this rule :
decMin : '0' | '1' ;
You in fact created an unnamed lexer rule that matches '0' and another one matching '1' :
UNNAMED_0_RULE : '0';
UNNAMED_1_RULE : '1';
And your parser rule became :
decMin : UNNAMED_0_RULE | UNNAMED_1_RULE ;
Problem : now, when your lexer see
testing7(1);
**it doesn't see **
callName '(' callParameter ')' ';'
anymore, it sees
callName '(' UNNAMED_1_RULE ')' ';'
and it doesn't understand that.
And that is because lexer rules are effective before the parser rules.
To solve your problem, define your lexer rules efficiently, It would probably look like that :
grammar test;
/*---------------- PARSER ----------------*/
tests
: test*
;
test
: call
| declaration
;
call
: callName '(' callParameter ')' ';'
;
callName
: identifier
;
callParameter
: identifier
| integer
;
declaration
: declarationName '[' declarationParams ']' ';'
;
declarationName
: identifier
;
declarationParams
: decMin '..' decMax
;
decMin
: INTEGER_ZERO
| INTEGER_ONE
;
decMax
: INTEGER_ONE
| LETTER_N
;
integer
: (INTEGER_ZERO | INTEGER_ONE | INTEGER_OTHERS)+
;
identifier
: LETTER_N
| IDENTIFIER
;
/*---------------- LEXER ----------------*/
LETTER_N: N;
IDENTIFIER
: [a-zA-Z_][a-zA-Z0-9_]*
;
WS
: [ \t\r\n]+ -> skip
;
INTEGER_ZERO: '0';
INTEGER_ONE: '1';
INTEGER_OTHERS: '2'..'9';
fragment N: [nN];
I just tested this grammar and it works.
The drawback is that it will cut your integers at the lexer step (cutting 1245 into 1 2 4 5 in lexer rules, and the considering the parser rule as uniting 1 2 4 and 5).
I think it would be better to be less precise and simply write :
decMin: integer | identifier;
But then it depends on what you do with your grammar...

How to express a required 'RETURN' statement in the grammar

I am still a newbie to ANTLR, so sorry if I am posting an obvious question.
I have a relatively simple grammar. What I need is for the user to be able to enter something like the following:
if (condition)
{
return true
}
else if (condition)
{
return false
}
else
{
if (condition)
{
return true
}
return false
}
In my grammar below, is there a way to make sure that an error will be flagged if the input string does not contain a 'return' statement? If not, can I do it via the Listener, and if so, how?
grammar Evaluator;
parse
: block EOF
;
block
: statement
;
statement
: return_statement
| if_statement
;
return_statement
: RETURN (TRUE | FALSE)
;
if_statement
: IF condition_block (ELSE IF condition_block)* (ELSE statement_block)?
;
condition_block
: expression statement_block
;
statement_block
: OBRACE block CBRACE
;
expression
: MINUS expression #unaryMinusExpression
| NOT expression #notExpression
| expression op=(MULT | DIV) expression #multiplicationExpression
| expression op=(PLUS | MINUS) expression #additiveExpression
| expression op=(LTEQ | GTEQ | LT | GT) expression #relationalExpression
| expression op=(EQ | NEQ) expression #equalityExpression
| expression AND expression #andExpression
| expression OR expression #orExpression
| atom #atomExpression
;
atom
: function #functionAtom
| OPAR expression CPAR #parenExpression
| (INT | FLOAT) #numberAtom
| (TRUE | FALSE) #booleanAtom
| ID #idAtom
;
function
: ID OPAR (parameter (',' parameter)*)? CPAR
;
parameter
: expression #expressionParameter
;
OR : '||';
AND : '&&';
EQ : '==';
NEQ : '!=';
GT : '>';
LT : '<';
GTEQ : '>=';
LTEQ : '<=';
PLUS : '+';
MINUS : '-';
MULT : '*';
DIV : '/';
NOT : '!';
OPAR : '(';
CPAR : ')';
OBRACE : '{';
CBRACE : '}';
ASSIGN : '=';
RETURN : 'return';
TRUE : 'true';
FALSE : 'false';
IF : 'if';
ELSE : 'else';
// ID either starts with a letter then followed by any number of a-zA-Z_0-9
// or starts with one or more numbers, then followed by at least one a-zA-Z_ then followed
// by any number of a-zA-Z_0-9
ID
: [a-zA-Z] [a-zA-Z_0-9]*
| [0-9]+ [a-zA-Z_]+ [a-zA-Z_0-9]*
;
INT
: [0-9]+
;
FLOAT
: [0-9]+ '.' [0-9]*
| '.' [0-9]+
;
SPACE
: [ \t\r\n] -> skip
;
// Anything not recognized above will be an error
ErrChar
: .
;
Ross' answer is perfectly correct. You design your grammar to accept a certain input. If the input stream does not correspond, the parser will complain.
Allow me to rewrite your grammar like this :
grammar Question;
/* enforce each block to end with a return statement */
a_grammar
: if_statement EOF
;
if_statement
: 'if' expression statement+ ( 'else' statement+ )?
;
statement
: if_statement
// other statements
| statement_block
;
statement_block
: '{' statement* return_statement '}'
;
return_statement
: 'return' ( 'true' | 'false' )
;
expression // reduced to a strict minimum to answer the OP question
: atom
| atom '<=' atom
| '(' expression ')'
;
atom
: ID
| INT
;
ID
: [a-zA-Z] [a-zA-Z_0-9]*
| [0-9]+ [a-zA-Z_]+ [a-zA-Z_0-9]*
;
INT : [0-9]+ ;
WS : [ \t\r\n] -> skip ;
// Anything not recognized above will be an error
ErrChar
: .
;
With the following input
if (a <= 7)
{
return true
}
else
if (xyz <= 99)
{
return false
}
else incor##!$rect
{
if (b <= a)
{
return true
}
return false
}
you get these tokens
[#0,0:1='if',<'if'>,1:0]
[#1,3:3='(',<'('>,1:3]
[#2,4:4='a',<ID>,1:4]
[#3,6:7='<=',<'<='>,1:6]
...
[#21,82:85='else',<'else'>,10:1]
[#22,87:91='incor',<ID>,10:6]
[#23,92:92='#',<ErrChar>,10:11]
[#24,93:93='#',<ErrChar>,10:12]
[#25,94:94='!',<ErrChar>,10:13]
[#26,95:95='$',<ErrChar>,10:14]
[#27,96:99='rect',<ID>,10:15]
[#28,102:102='{',<'{'>,11:1]
...
line 10:6 mismatched input 'incor' expecting {'if', '{'}
If you run the test rig with the -gui option, it displays the parse tree with erroneous tokens nicely displayed in pink !
grun Question a_grammar -gui data.txt
I've never played with the Listener before.
Via the Visitor, in the VisitStatement(StatementContext context) method, check if the context.return_statement() (ReturnStatementContext) is null. If it is null, throw an exception.
I'm a newbie as well. I was thinking of forcing the lexer to barf by
requiring a return statement, so instead of:
statement
: return_statement
| if_statement
;
Which says a statement is EITHER a if_statement OR a return_statement I would try something like:
statement
: (if_statement)? return_statement
;
Which (I believe), says the if_statement is optional but the return_statement MUST always occur. But you might want to try something like:
block_data : statements+ return_statement;
Where statements could be if_statements etc, and one or more of those are allowed.
I would take everything above with a grain of salt, as I have only been working with ANTLR4 a week or so. I have 4 .g4 files working, and am happy with ANTLR, but you may actually have more ANTLR stick time than I.
-Regards

How to share values for different tokens - mismatched input 'COMMAND1' expecting FUNCTIONNAME2" -

My grammar looks like this (simplified to show the issue):
parse
: block EOF
;
block
: TYPE1 OPAR STRING CPAR type1_statement_block
| TYPE2 OPAR STRING CPAR type2_statement_block
;
type1_statement_block
: OBRACE function1+ CBRACE
;
function1
: FUNCTIONNAME1 OPAR (parameter (',' parameter)*)? CPAR
;
FUNCTIONNAME1 : 'COMMAND1';
type2_statement_block
: OBRACE function2+ CBRACE
;
function2
: FUNCTIONNAME2 OPAR (parameter (',' parameter)*)? CPAR
;
FUNCTIONNAME2 : 'COMMAND1' | 'COMMAND2'
parameter
: INT
;
OPAR : '(';
CPAR : ')';
OBRACE : '{';
CBRACE : '}';
TYPE1 : 'TYPE1';
TYPE2 : 'TYPE2';
INT
: [0-9]+
;
STRING
: '"' (~["\r\n] | '""')* '"'
;
SPACE
: [ \t\r\n] -> skip
;
ErrChar
: .
;
Parsing the following strings works fine:
TYPE1 ("abc") { COMMAND1(0) }
TYPE2 ("abc") { COMMAND2(0) }
However parsing the following string results in an error
TYPE2 ("abc") { COMMAND1(0) }
I get "mismatched input 'COMMAND1' expecting FUNCTIONNAME2"
How can I get this scenario to work? I.e. that both code block can contain the same function name?
The basic issue is the lexer will always assign 'COMMAND1' to the token FUNCTIONNAME1, since that rule appears first and all other lexical requirements relative to the rule FUNCTIONNAME2 are equal. You could merge the FUNCTIONNAME1/2 rules, but that leads into the second issue with the grammar.
As written, the grammar is attempting to impose a semantic distinction between 'Type1' and 'Type2' statements that are otherwise syntactically identical. The better practice is to use the grammar to do syntactic analysis and defer semantic analysis for tree walks. The separation of concerns will make both easier.
block
: ( cmd OPAR STRING CPAR statement_block )+
;
statement_block
: OBRACE function+ CBRACE
;
function
: FUNCTIONNAME OPAR (parameter (',' parameter)*)? CPAR
;
cmd : 'TYPE1' | 'TYPE2' ;
FUNCTIONNAME : 'COMMAND1' | 'COMMAND2' ;

Advice on handling an ambiguous operator in an ANTLR 4 grammar

I am writing an antlr grammar file for a dialect of basic. Most of it is either working or I have a good idea of what I need to do next. However, I am not at all sure what I should do with the '=' character which is used for both equality tests as well as assignment.
For example, this is a valid statement
t = (x = 5) And (y = 3)
This evaluates if x is EQUAL to 5, if y is EQUAL to 3 then performs a logical AND on those results and ASSIGNS the result to t.
My grammar will parse this; albeit incorrectly, but I think that will resolve itself once the ambiguity is resolved .
How do I differentiate between the two uses of the '=' character?
1) Should I remove the assignment rule from expression and handle these cases (assignment vs equality test) in my visitor and\or listener implementation during code generation
2) Is there a better way to define the grammar such that it is already sorted out
Would someone be able to simply point me in the right direction as to how best implement this language "feature"?
Also, I have been reading through the Definitive guide to ANTLR4 as well as Language Implementation Patterns looking for a solution to this. It may be there but I have not yet found it.
Below is the full parser grammar. The ASSIGN token is currently set to '='. EQUAL is set to '=='.
parser grammar wlParser;
options { tokenVocab=wlLexer; }
program
: multistatement (NEWLINE multistatement)* NEWLINE?
;
multistatement
: statement (COLON statement)*
;
statement
: declarationStat
| defTypeStat
| assignment
| expression
;
assignment
: lvalue op=ASSIGN expression
;
expression
: <assoc=right> left=expression op=CARAT right=expression #exponentiationExprStat
| (PLUS|MINUS) expression #signExprStat
| IDENTIFIER DATATYPESUFFIX? LPAREN expression RPAREN #arrayIndexExprStat
| left=expression op=(ASTERISK|FSLASH) right=expression #multDivExprStat
| left=expression op=BSLASH right=expression #integerDivExprStat
| left=expression op=KW_MOD right=expression #modulusDivExprStat
| left=expression op=(PLUS|MINUS) right=expression #addSubExprStat
| left=string op=AMPERSAND right=string #stringConcatenation
| left=expression op=(RELATIONALOPERATORS | KW_IS | KW_ISA) right=expression #relationalComparisonExprStat
| left=expression (op=LOGICALOPERATORS right=expression)+ #logicalOrAndExprStat
| op=KW_LIKE patternString #likeExprStat
| LPAREN expression RPAREN #groupingExprStat
| NUMBER #atom
| string #atom
| IDENTIFIER DATATYPESUFFIX? #atom
;
lvalue
: (IDENTIFIER DATATYPESUFFIX?) | (IDENTIFIER DATATYPESUFFIX? LPAREN expression RPAREN)
;
string
: STRING
;
patternString
: DQUOT (QUESTIONMARK | POUND | ASTERISK | LBRACKET BANG? .*? RBRACKET)+ DQUOT
;
referenceType
: DATATYPE
;
declarationStat
: constDecl
| varDecl
;
constDecl
: CONSTDECL? KW_CONST IDENTIFIER EQUAL expression
;
varDecl
: VARDECL (varDeclPart (COMMA varDeclPart)*)? | listDeclPart
;
varDeclPart
: IDENTIFIER DATATYPESUFFIX? ((arrayBounds)? KW_AS DATATYPE (COMMA DATATYPE)*)?
;
listDeclPart
: IDENTIFIER DATATYPESUFFIX? KW_LIST KW_AS DATATYPE
;
arrayBounds
: LPAREN (arrayDimension (COMMA arrayDimension)*)? RPAREN
;
arrayDimension
: INTEGER (KW_TO INTEGER)?
;
defTypeStat
: DEFTYPES DEFTYPERANGE (COMMA DEFTYPERANGE)*
;
This is the lexer grammar.
lexer grammar wlLexer;
NUMBER
: INTEGER
| REAL
| BINARY
| OCTAL
| HEXIDECIMAL
;
RELATIONALOPERATORS
: EQUAL
| NEQUAL
| LT
| LTE
| GT
| GTE
;
LOGICALOPERATORS
: KW_OR
| KW_XOR
| KW_AND
| KW_NOT
| KW_IMP
| KW_EQV
;
INSTANCEOF
: KW_IS
| KW_ISA
;
CONSTDECL
: KW_PUBLIC
| KW_PRIVATE
;
DATATYPE
: KW_BOOLEAN
| KW_BYTE
| KW_INTEGER
| KW_LONG
| KW_SINGLE
| KW_DOUBLE
| KW_CURRENCY
| KW_STRING
;
VARDECL
: KW_DIM
| KW_STATIC
| KW_PUBLIC
| KW_PRIVATE
;
LABEL
: IDENTIFIER COLON
;
DEFTYPERANGE
: [a-zA-Z] MINUS [a-zA-Z]
;
DEFTYPES
: KW_DEFBOOL
| KW_DEFBYTE
| KW_DEFCUR
| KW_DEFDBL
| KW_DEFINT
| KW_DEFLNG
| KW_DEFSNG
| KW_DEFSTR
| KW_DEFVAR
;
DATATYPESUFFIX
: PERCENT
| AMPERSAND
| BANG
| POUND
| AT
| DOLLARSIGN
;
STRING
: (DQUOT (DQUOTESC|.)*? DQUOT)
| (LBRACE (RBRACEESC|.)*? RBRACE)
| (PIPE (PIPESC|.|NEWLINE)*? PIPE)
;
fragment DQUOTESC: '\"\"' ;
fragment RBRACEESC: '}}' ;
fragment PIPESC: '||' ;
INTEGER
: DIGIT+ (E (PLUS|MINUS)? DIGIT+)?
;
REAL
: DIGIT+ PERIOD DIGIT+ (E (PLUS|MINUS)? DIGIT+)?
;
BINARY
: AMPERSAND B BINARYDIGIT+
;
OCTAL
: AMPERSAND O OCTALDIGIT+
;
HEXIDECIMAL
: AMPERSAND H HEXDIGIT+
;
QUESTIONMARK: '?' ;
COLON: ':' ;
ASSIGN: '=';
SEMICOLON: ';' ;
AT: '#' ;
LPAREN: '(' ;
RPAREN: ')' ;
DQUOT: '"' ;
LBRACE: '{' ;
RBRACE: '}' ;
LBRACKET: '[' ;
RBRACKET: ']' ;
CARAT: '^' ;
PLUS: '+' ;
MINUS: '-' ;
ASTERISK: '*' ;
FSLASH: '/' ;
BSLASH: '\\' ;
AMPERSAND: '&' ;
BANG: '!' ;
POUND: '#' ;
DOLLARSIGN: '$' ;
PERCENT: '%' ;
COMMA: ',' ;
APOSTROPHE: '\'' ;
TWOPERIODS: '..' ;
PERIOD: '.' ;
UNDERSCORE: '_' ;
PIPE: '|' ;
NEWLINE: '\r\n' | '\r' | '\n';
EQUAL: '==' ;
NEQUAL: '<>' | '><' ;
LT: '<' ;
LTE: '<=' | '=<';
GT: '>' ;
GTE: '=<'|'<=' ;
KW_AND: A N D ;
KW_BINARY: B I N A R Y ;
KW_BOOLEAN: B O O L E A N ;
KW_BYTE: B Y T E ;
KW_DATATYPE: D A T A T Y P E ;
KW_DATE: D A T E ;
KW_INTEGER: I N T E G E R ;
KW_IS: I S ;
KW_ISA: I S A ;
KW_LIKE: L I K E ;
KW_LONG: L O N G ;
KW_MOD: M O D ;
KW_NOT: N O T ;
KW_TO: T O ;
KW_FALSE: F A L S E ;
KW_TRUE: T R U E ;
KW_SINGLE: S I N G L E ;
KW_DOUBLE: D O U B L E ;
KW_CURRENCY: C U R R E N C Y ;
KW_STRING: S T R I N G ;
fragment BINARYDIGIT: ('0'|'1') ;
fragment OCTALDIGIT: ('0'|'1'|'2'|'3'|'4'|'5'|'6'|'7') ;
fragment DIGIT: '0'..'9' ;
fragment HEXDIGIT: ('0'|'1'|'2'|'3'|'4'|'5'|'6'|'7'|'8'|'9' | A | B | C | D | E | F) ;
fragment A: ('a'|'A');
fragment B: ('b'|'B');
fragment C: ('c'|'C');
fragment D: ('d'|'D');
fragment E: ('e'|'E');
fragment F: ('f'|'F');
fragment G: ('g'|'G');
fragment H: ('h'|'H');
fragment I: ('i'|'I');
fragment J: ('j'|'J');
fragment K: ('k'|'K');
fragment L: ('l'|'L');
fragment M: ('m'|'M');
fragment N: ('n'|'N');
fragment O: ('o'|'O');
fragment P: ('p'|'P');
fragment Q: ('q'|'Q');
fragment R: ('r'|'R');
fragment S: ('s'|'S');
fragment T: ('t'|'T');
fragment U: ('u'|'U');
fragment V: ('v'|'V');
fragment W: ('w'|'W');
fragment X: ('x'|'X');
fragment Y: ('y'|'Y');
fragment Z: ('z'|'Z');
IDENTIFIER
: [a-zA-Z_][a-zA-Z0-9_~]*
;
LINE_ESCAPE
: (' ' | '\t') UNDERSCORE ('\r'? | '\n')
;
WS
: [ \t] -> skip
;
Take a look at this grammar (Note that this grammar is not supposed to be a grammar for BASIC, it's just an example to show how to disambiguate using "=" for both assignment and equality):
grammar Foo;
program:
(statement | exprOtherThanEquality)*
;
statement:
assignment
;
expr:
equality | exprOtherThanEquality
;
exprOtherThanEquality:
boolAndOr
;
boolAndOr:
atom (BOOL_OP expr)*
;
equality:
atom EQUAL expr
;
assignment:
VAR EQUAL expr ENDL
;
atom:
BOOL |
VAR |
INT |
group
;
group:
LEFT_PARENTH expr RGHT_PARENTH
;
ENDL : ';' ;
LEFT_PARENTH : '(' ;
RGHT_PARENTH : ')' ;
EQUAL : '=' ;
BOOL:
'true' | 'false'
;
BOOL_OP:
'and' | 'or'
;
VAR:
[A-Za-z_]+ [A-Za-z_0-9]*
;
INT:
'-'? [0-9]+
;
WS:
[ \t\r\n] -> skip
;
Here is the parse tree for the input: t = (x = 5) and (y = 2);
In one of the comments above, I asked you if we can assume that the first equal sign on a line always corresponds to an assignment. I retract that assumption slightly... The first equal sign on a line always corresponds to an assignment unless it is contained within parentheses. With the above grammar, this is a valid line: (x = 2). Here is the parse tree:

Resources