parsing SQL CREATE statement using ANTLR4: no viable alternative at input 'conflict' - antlr4

I'm newbie ANTLR user and trying to parse the following sql create statement. (I dropped some unimportant part of both SQL and grammar)
CREATE TABLE Account (_id integer primary key, conflict integer default 1);
and the grammar is like this: (You can compile this grammar with copy&paste)
grammar CreateTable;
tableList : (createTableStmt)* ;
createTableStmt : CREATE TABLE tableName LP columnDefs (COMMA tableConstraints)? RP SEMICOLON ;
columnDefs : columnDef (COMMA columnDef)* ;
columnDef : columnName typeName? columnConstraint* ;
typeName : sqliteType (LP SIGNED_NUMBER (COMMA SIGNED_NUMBER)? RP)? ;
sqliteType : intType | textType | ID ;
intType : 'INTEGER'|'LONG';
textType : TEXT ;
columnConstraint
: (CONSTRAINT name)? PRIMARY KEY conflictClause?
| (CONSTRAINT name)? UNIQUE conflictClause?
| (CONSTRAINT name)? DEFAULT SIGNED_NUMBER
;
tableConstraints
: tableConstraint (COMMA tableConstraint)* ;
tableConstraint
: (CONSTRAINT name)? (PRIMARY KEY|UNIQUE) LP indexedColumns RP conflictClause? ;
conflictClause : ON CONFLICT REPLACE ;
indexedColumns : indexedColumn (COMMA indexedColumn)* ;
indexedColumn : columnName;
columnName : name ;
tableName : name ;
name : ID | '\"' ID '\"' | STRING_LITERAL ;
SIGNED_NUMBER : (PLUS|MINUS)? NUMERIC_LITERAL ;
NUMERIC_LITERAL : DIGIT+ ;
STRING_LITERAL : '\'' (~'\'')* '\'' ;
LP : '(' ;
RP : ')' ;
COMMA : ',' ;
SEMICOLON : ';' ;
PLUS : '+' ;
MINUS : '-' ;
CONFLICT : C O N F L I C T ;
CONSTRAINT : C O N S T R A I N T ;
CREATE : C R E A T E ;
DEFAULT : D E F A U L T;
KEY : K E Y ;
ON : O N;
PRIMARY : P R I M A R Y ;
REPLACE : R E P L A C E;
TABLE : T A B L E ;
TEXT : T E X T;
UNIQUE : U N I Q U E ;
WS : [ \t\r\n\f]+ -> channel(HIDDEN);
ID : LETTER (LETTER|DIGIT)*;
fragment LETTER : [a-zA-Z_];
fragment DIGIT : [0-9] ;
NL : '\r'? '\n' ;
fragment A:('a'|'A'); fragment B:('b'|'B'); fragment C:('c'|'C');
fragment D:('d'|'D'); fragment E:('e'|'E'); fragment F:('f'|'F');
fragment G:('g'|'G'); fragment I:('i'|'I'); fragment K:('k'|'K');
fragment L:('l'|'L'); fragment M:('m'|'M'); fragment N:('n'|'N');
fragment O:('o'|'O'); fragment P:('p'|'P'); fragment Q:('q'|'Q');
fragment R:('r'|'R'); fragment S:('s'|'S'); fragment T:('t'|'T');
fragment U:('u'|'U'); fragment X:('x'|'X');
By the way, the above SQL statement I should parse uses reserved word 'conflict' as column name. If I change column name 'conflict' with other name like 'conflict1' everything is okay.
Where should I change to parse above SQL statement?
The parse trees look like this.
Thanks

You are defining the input "conflict" as a separate token CONFLICT. So if it is also a valid table name and column name, this should work:
name : ID | '\"' ID '\"' | STRING_LITERAL | CONFLICT

Related

ANTLR4 Grammar Issue with Decimal Numbers

I'm new to ANTLR and using ANTLR4 (4.7.2 Jar file). I'm currently working on Oracle Parser.
I'm having issues with Decimal numbers. I have kept only the relevant parts.
My grammar file is as below.
Now when I parse the below statement it is fine. ".1" is a valid number in my case.
BEGIN a NUMBER:=.1; END;
I haven't shown the grammar but the below are valid cases for me in Oracle.
a NUMBER:= .1; // with Space after operator
a NUMBER:=1.1; // without Space after operator
a NUMBER:=1; // without Space after operator
a NUMER:= 3; // with Space after operator
Now I need to create a tablespace as below.
CREATE TABLESPACE tbs_01 DATAFILE +DATA/BR/CONTROLFILE/Current.260.750;
Here the Digits 260 & 750 are tokenized along with the DOT (as per the definition of NUMERIC_LITERAL). I would want this to be 2 separate digits separated by DOT (and assigned to filenumber and incarnation_number resp as shown in the grammar).
How do I do this?
I have tried using _input.LA(-1)!='.'}? etc but was not working correctly for me.
I tried many other steps mentioned (most solutions were for ANTLR3 and not working in ANTLR4). Is there a simple way to do this in LEXER? I do not want to write a Parser rule to split the decimal digits.
grammar Oracle;
parse
: ( sql_statements | error )* EOF
;
error
: UNEXPECTED_CHAR
{
throw new RuntimeException("UNEXPECTED_CHAR=" + $UNEXPECTED_CHAR.text);
}
;
sql_statements
: 'CREATE' 'TABLESPACE' tablespace_name 'DATAFILE' fully_qualified_file_name ';'
| 'BEGIN' var1 'NUMBER' ':=' num1 ';' 'END' ';'
;
tablespace_name : IDENTIFIER;
fully_qualified_file_name : K_PLUS_SIGN diskgroup_name K_SOLIDUS db_name K_SOLIDUS file_type K_SOLIDUS file_type_tag '.' filenumber '.' incarnation_number;
diskgroup_name : IDENTIFIER;
db_name : IDENTIFIER;
file_type : IDENTIFIER;
file_type_tag : IDENTIFIER;
filenumber : NUMERIC_LITERAL;
incarnation_number : NUMERIC_LITERAL;
var1 : IDENTIFIER;
num1 : NUMERIC_LITERAL;
IDENTIFIER : [a-zA-Z_] ([a-zA-Z] | '$' | '_' | '#' | DIGIT)* ;
K_PLUS_SIGN : '+';
K_SOLIDUS : '/';
NUMERIC_LITERAL
: DIGIT+ ( '.' DIGIT+ )? ( E ('+'|'-')? DIGIT+ )? ('D' | 'F')?
| '.' DIGIT+ ( E ('+'|'-')? DIGIT+ )? ('D' | 'F')?
;
SPACES : [ \u000B\t\r\n] -> skip;
WS : [ \t\r\n]+ -> skip;
UNEXPECTED_CHAR : . ;
fragment DIGIT : [0-9];
fragment A : [aA];
fragment B : [bB];
fragment C : [cC];
fragment D : [dD];
fragment E : [eE];
fragment F : [fF];
fragment G : [gG];
fragment H : [hH];
fragment I : [iI];
fragment J : [jJ];
fragment K : [kK];
fragment L : [lL];
fragment M : [mM];
fragment N : [nN];
fragment O : [oO];
fragment P : [pP];
fragment Q : [qQ];
fragment R : [rR];
fragment S : [sS];
fragment T : [tT];
fragment U : [uU];
fragment V : [vV];
fragment W : [wW];
fragment X : [xX];
fragment Y : [yY];
fragment Z : [zZ];
Your Dsl has a natural ambiguity: in some instances, numbers are integers and in others, decimals.
If the Dsl provides sufficient guard conditions, Antlr modes can be used to isolate the instances. For example, in the given Dsl, decimal numbers appear to always occur between := and ; guards.
...
K_ASSIGN : ':=' -> pushMode(Decimals);
K_SEMI : ';' ;
NUMERIC_LITERAL : DIGIT+ ;
...
mode Decimals;
D_SEMI : ';' -> type(K_SEMI), popMode ;
NUMERIC:
DIGIT+ ( '.' DIGIT+ )? ( E ('+'|'-')? DIGIT+ )? 'D'
| 'F')?
| '.' DIGIT+ ( E ('+'|'-')? DIGIT+ )? ('D' | 'F')?
-> type(NUMERIC_LITERAL);

Antlr4 Mismatch input

First of all, I have read the solutions for the following similar questions: q1 q2 q3
Still I don't understand why I get the following message:
line 1:0 missing 'PROGRAM' at 'PROGRAM'
when I try to match the following:
PROGRAM test
BEGIN
END
My grammar:
grammar Wengo;
program : PROGRAM id BEGIN pgm_body END ;
id : IDENTIFIER ;
pgm_body : decl func_declarations ;
decl : string_decl decl | var_decl decl | empty ;
string_decl : STRING id ASSIGN str SEMICOLON ;
str : STRINGLITERAL ;
var_decl : var_type id_list SEMICOLON ;
var_type : FLOAT | INT ;
any_type : var_type | VOID ;
id_list : id id_tail ;
id_tail : COMA id id_tail | empty ;
param_decl_list : param_decl param_decl_tail | empty ;
param_decl : var_type id ;
param_decl_tail : COMA param_decl param_decl_tail | empty ;
func_declarations : func_decl func_declarations | empty ;
func_decl : FUNCTION any_type id (param_decl_list) BEGIN func_body END ;
func_body : decl stmt_list ;
stmt_list : stmt stmt_list | empty ;
stmt : base_stmt | if_stmt | loop_stmt ;
base_stmt : assign_stmt | read_stmt | write_stmt | control_stmt ;
assign_stmt : assign_expr SEMICOLON ;
assign_expr : id ASSIGN expr ;
read_stmt : READ ( id_list )SEMICOLON ;
write_stmt : WRITE ( id_list )SEMICOLON ;
return_stmt : RETURN expr SEMICOLON ;
expr : expr_prefix factor ;
expr_prefix : expr_prefix factor addop | empty ;
factor : factor_prefix postfix_expr ;
factor_prefix : factor_prefix postfix_expr mulop | empty ;
postfix_expr : primary | call_expr ;
call_expr : id ( expr_list ) ;
expr_list : expr expr_list_tail | empty ;
expr_list_tail : COMA expr expr_list_tail | empty ;
primary : ( expr ) | id | INTLITERAL | FLOATLITERAL ;
addop : ADD | MIN ;
mulop : MUL | DIV ;
if_stmt : IF ( cond ) decl stmt_list else_part ENDIF ;
else_part : ELSE decl stmt_list | empty ;
cond : expr compop expr | TRUE | FALSE ;
compop : LESS | GREAT | EQUAL | NOTEQUAL | LESSEQ | GREATEQ ;
while_stmt : WHILE ( cond ) decl stmt_list ENDWHILE ;
control_stmt : return_stmt | CONTINUE SEMICOLON | BREAK SEMICOLON ;
loop_stmt : while_stmt | for_stmt ;
init_stmt : assign_expr | empty ;
incr_stmt : assign_expr | empty ;
for_stmt : FOR ( init_stmt SEMICOLON cond SEMICOLON incr_stmt ) decl stmt_list ENDFOR ;
COMMENT : '--' ~[\r\n]* -> skip ;
WS : [ \t\r\n]+ -> skip ;
NEWLINE : [ \n] ;
EMPTY : $ ;
KEYWORD : PROGRAM|BEGIN|END|FUNCTION|READ|WRITE|IF|ELSE|ENDIF|WHILE|ENDWHILE|RETURN|INT|VOID|STRING|FLOAT|TRUE|FALSE|FOR|ENDFOR|CONTINUE|BREAK ;
OPERATOR : ASSIGN|ADD|MIN|MUL|DIV|EQUAL|NOTEQUAL|LESS|GREAT|LBRACKET|RBRACKET|SEMICOLON|COMA|LESSEQ|GREATEQ ;
IDENTIFIER : [a-zA-Z][a-zA-Z0-9]* ;
INTLITERAL : [0-9]+ ;
FLOATLITERAL : [0-9]*'.'[0-9]+ ;
STRINGLITERAL : '"' (~[\r\n"] | '""')* '"' ;
PROGRAM : 'PROGRAM';
BEGIN : 'BEGIN';
END : 'END';
FUNCTION : 'FUNCTION';
READ : 'READ';
WRITE : 'WRITE';
IF : 'IF';
ELSE : 'ELSE';
ENDIF : 'ENDIF';
WHILE : 'WHILE';
ENDWHILE : 'ENDWHILE';
RETURN : 'RETURN';
INT : 'INT';
VOID : 'VOID';
STRING : 'STRING';
FLOAT : 'FLOAT' ;
TRUE : 'TRUE';
FALSE : 'FALSE';
FOR : 'FOR';
ENDFOR : 'ENDFOR';
CONTINUE : 'CONTINUE';
BREAK : 'BREAK';
ASSIGN : ':=';
ADD : '+';
MIN : '-';
MUL : '*';
DIV : '/';
EQUAL : '=';
NOTEQUAL : '!=';
LESS : '<';
GREAT : '>';
LBRACKET : '(';
RBRACKET : ')';
SEMICOLON : ';';
COMA : ',';
LESSEQ : '<=';
GREATEQ : '>=';
From what I've read, I think there's a mismatch between KEYWORD and PROGRAM, but removing KEYWORD altogether does not solve the problem.
EDIT:
Removing KEYWORD gives the following message:
line 3:0 mismatched input 'END' expecting {'INT', 'STRING', 'FLOAT', '+'}
This my grun output when KEYWORD is available:
[#0,0:6='PROGRAM',<KEYWORD>,1:0]
[#1,8:11='test',<IDENTIFIER>,1:8]
[#2,13:17='BEGIN',<KEYWORD>,2:0]
[#3,19:21='END',<KEYWORD>,3:0]
[#4,23:22='<EOF>',<EOF>,4:0]
line 1:0 mismatched input 'PROGRAM' expecting 'PROGRAM'
(program PROGRAM test BEGIN END)
This is the output when KEYWORD is removed:
[#0,0:6='PROGRAM',<'PROGRAM'>,1:0]
[#1,8:11='test',<IDENTIFIER>,1:8]
[#2,13:17='BEGIN',<'BEGIN'>,2:0]
[#3,19:21='END',<'END'>,3:0]
[#4,23:22='<EOF>',<EOF>,4:0]
line 3:0 mismatched input 'END' expecting {'INT', 'STRING', 'FLOAT', '+'}
(program PROGRAM (id test) BEGIN (pgm_body decl func_declarations) END)
The error about "missing 'PROGRAM'" has been solved when you removed the KEYWORD rule (note that you should also remove the OPERATOR rule for the same reasons).
The error you're encountering now is completely unrelated.
Your current problem concerns the definition of empty, which you didn't show. You've said that you tried both EMPTY : $ ; and EMPTY : ^$ ; (and then presumably empty: EMPTY;), but none of those even compile, so they wouldn't cause the parse error you posted. Either way, the concept of an EMPTY token can't work. When would such a token be generated? Once between every other token? In that case, you'd get a lot of "unexpected EMPTY" errors. No, the whole point of an empty rule is that it should succeed without consuming any tokens.
To achieve that, you can just define empty : ; and remove EMPTY altogether. Alternatively you could remove empty as well and just use an empty alternative (i.e. | ;) wherever you're currently using empty. Either approach will make your code work, but there's a better way:
You're using empty as the base case for rules that basically amount to lists. ANTLR offers the repetition operators * (0 or more) , + (1 or more) as well as the ? operator to make things optional. These allow you to define lists non-recursively and without an empty rule. For example stmt_list could be defined like this:
stmt_list : stmt* ;
And id_list like this:
id_list : (id (',' id)*)? ;
On an unrelated note, your grammar can simplified greatly by making use of the fact that ANTLR 4 supports direct left recursion, so you can get rid of all the different expression rules and just have one that's left-recursive.
That'd give you:
expr : primary
| id '(' expr_list ')'
| expr mulop expr
| expr addop expr
;
And the rules expr_prefix, factor, factor_prefix and postfix_expr and call_expr could all be removed.

ANTLR single grammar input mismatch

So far I've been testing with ANTLR4, I've tested with this single grammar:
grammar LivingDSLParser;
options{
language = Java;
//tokenVocab = LivingDSLLexer;
}
living
: query #QUERY
;
query
: K_QUERY entity K_WITH expr
;
entity
: STAR #ALL
| D_FUAS #FUAS
| D_RESOURCES #RESOURCES
;
field
: ((D_FIELD | D_PROPERTY | D_METAINFO) DOT)? IDENTIFIER
| STAR
;
expr
: field
| expr ( '*' | '/' | '%' ) expr
| expr ( '+' | '-' ) expr
| expr ( '<<' | '>>' | '&' | '|' ) expr
| expr ( '<' | '<=' | '>' | '>=' ) expr
| expr ( '=' | '==' | '!=' | '<>' ) expr
| expr K_AND expr
| expr K_OR expr
;
IDENTIFIER
: [a-zA-Z_] [a-zA-Z_0-9]* // TODO check: needs more chars in set
;
NUMERIC_LITERAL
: DIGIT+ ( '.' DIGIT* )? ( E [-+]? DIGIT+ )?
| '.' DIGIT+ ( E [-+]? DIGIT+ )?
;
STRING_LITERAL
: '\'' ( ~'\'' | '\'\'' )* '\''
;
K_QUERY : Q U E R Y;
K_WITH: W I T H;
K_OR: O R;
K_AND: A N D;
D_FUAS : F U A S;
D_RESOURCES : R E S O U R C E S;
D_METAINFO: M E T A I N F O;
D_PROPERTY: P R O P E R T Y;
D_FIELD: F I E L D;
STAR : '*';
PLUS : '+';
MINUS : '-';
PIPE2 : '||';
DIV : '/';
MOD : '%';
LT2 : '<<';
GT2 : '>>';
AMP : '&';
PIPE : '|';
LT : '<';
LT_EQ : '<=';
GT : '>';
GT_EQ : '>=';
EQ : '==';
NOT_EQ1 : '!=';
NOT_EQ2 : '<>';
OPEN_PAR : '(';
CLOSE_PAR : ')';
SCOL : ';';
DOT : '.';
SPACES
: [ \u000B\t\r\n] -> channel(HIDDEN)
;
fragment DIGIT : [0-9];
fragment A : [aA];
fragment B : [bB];
fragment C : [cC];
fragment D : [dD];
//so on...
As far I've been able to figure out, when I write some input like this:
query fuas with field.xxx == property.yyy
, it should match.
However I recive this message:
LivingDSLParser::living:1:0: mismatched input 'query' expecting K_QUERY
I have no idea where's the problem and neither what this message means.
Whenever ANTLR can match 2 or more rules to some input, it chooses the first rule. Since both IDENTIFIER and K_QUERY match the input "query"
, and IDENTIFIER is defined before K_QUERY, IDENTIFIER is matched.
Solution: move your IDENTIFIER rule below your keyword definitions.

ANTLR4 Grammar picks up 'and' and 'or' in variable names

Please help me with my ANTLR4 Grammar.
Sample "formel":
(Arbejde.ArbejderIKommuneNr=860) and (Arbejde.ErIArbejde = 'J') &
(Arbejde.ArbejdsTimerPrUge = 40)
(Ansogeren.BorIKommunen = 'J') and (BeregnDato(Ansogeren.Fodselsdato;
'+62Å') < DagsDato)
(Arb.BorI=860)
My problem is that Arb.BorI=860 is not handled correct. I get this error:
Error: no viable alternative at input '(Arb.Bor' at linenr/position: 1/6 \r\nException: Der blev udløst en undtagelse af typen 'Antlr4.Runtime.NoViableAltException
Please notis that Arb.BorI contains the word 'or'.
I think my problem is that my 'booleanOps' in the grammar override 'datakildefelt'
So... My problem is how do I get my grammar correct - I am stuck, so any help will be appreciated.
My Grammar:
grammar UnikFormel;
formel : boolExpression # BooleanExpr
| expression # Expr
| '(' formel ')' # Parentes;
boolExpression : ( '(' expression ')' ) ( booleanOps '(' expression ')' )+;
expression : element compareOps element # Compare;
element : datakildefelt # DatakildeId
| function # Funktion
| int # Integer
| decimal # Real
| string # Text;
datakildefelt : datakilde '.' felt;
datakilde : identifyer;
felt : identifyer;
function : funktionsnavn ('(' funcParameters? ')')?;
funktionsnavn : identifyer;
funcParameters : funcParameter (';' funcParameter)*;
funcParameter : element;
identifyer : LETTER+;
int : DIGIT+;
decimal : DIGIT+ '.' DIGIT+ | '.' DIGIT+;
string : QUOTE .*? QUOTE;
booleanOps : (AND | OR);
compareOps : (LT | GT | EQ | GTEQ | LTEQ);
QUOTE : '\'';
OPERATOR: '+';
DIGIT: [0-9];
LETTER: [a-åA-Å];
MUL : '*';
DIV : '/';
ADD : '+';
SUB : '-';
GT : '>';
LT : '<';
EQ : '=';
GTEQ : '>=';
LTEQ : '<=';
AND : '&' | 'and';
OR : '?' | 'or';
WS : ' '+ -> skip;
Rules that come first always have precedence. In your case you need to move AND and OR before LETTER. Also there is the same problem with GTEQ and LTEQ, maybe somewhere else too.
EDIT
Additionally, you should make identifyer a lexer rule, i.e. start with capital letter (IDENTIFIER or Identifier). The same goes for int, decimal and string. Input is initially a stream of characters and is first processed into a stream of tokens, using only lexer rules. At this point parser rules (those starting with lowercase letter) do not come to play yet. So, to make "BorI" parse as single entity (token), you need to create a lexer rule that matches identifiers. Currently it would be parsed as 3 tokens: LETTER (B) OR (or) LETTER (I).
Thanks for your help. There were multiple problems. Reading the ANTLR4 book and using "TestRig -gui" got me on the right track. The working grammar is:
grammar UnikFormel;
formel : '(' formel ')' # Parentes
| expression # Expr
| boolExpression # BooleanExpr
;
boolExpression : '(' expression ')' ( booleanOps '(' expression ')' )+
| '(' formel ')' ( booleanOps '(' formel ')' )+;
expression : element compareOps element # Compare;
datakildefelt : ID '.' ID;
function : ID ('(' funcParameters? ')')?;
funcParameters : funcParameter (';' funcParameter)*;
funcParameter : element;
element : datakildefelt # DatakildeId
| function # Funktion
| INT # Integer
| DECIMAL # Real
| STRING # Text;
booleanOps : (AND | OR);
compareOps : ( GTEQ | LTEQ | LT | GT | EQ |);
AND : '&' | 'and';
OR : '?' | 'or';
GTEQ : '>=';
LTEQ : '<=';
GT : '>';
LT : '<';
EQ : '=';
ID : LETTER ( LETTER | DIGIT)*;
INT : DIGIT+;
DECIMAL : DIGIT+ '.' DIGIT+ | '.' DIGIT+;
STRING : QUOTE .*? QUOTE;
fragment QUOTE : '\'';
fragment DIGIT: [0-9];
fragment LETTER: [a-åA-Å];
WS : [ \t\r\n]+ -> skip;

Advice on handling an ambiguous operator in an ANTLR 4 grammar

I am writing an antlr grammar file for a dialect of basic. Most of it is either working or I have a good idea of what I need to do next. However, I am not at all sure what I should do with the '=' character which is used for both equality tests as well as assignment.
For example, this is a valid statement
t = (x = 5) And (y = 3)
This evaluates if x is EQUAL to 5, if y is EQUAL to 3 then performs a logical AND on those results and ASSIGNS the result to t.
My grammar will parse this; albeit incorrectly, but I think that will resolve itself once the ambiguity is resolved .
How do I differentiate between the two uses of the '=' character?
1) Should I remove the assignment rule from expression and handle these cases (assignment vs equality test) in my visitor and\or listener implementation during code generation
2) Is there a better way to define the grammar such that it is already sorted out
Would someone be able to simply point me in the right direction as to how best implement this language "feature"?
Also, I have been reading through the Definitive guide to ANTLR4 as well as Language Implementation Patterns looking for a solution to this. It may be there but I have not yet found it.
Below is the full parser grammar. The ASSIGN token is currently set to '='. EQUAL is set to '=='.
parser grammar wlParser;
options { tokenVocab=wlLexer; }
program
: multistatement (NEWLINE multistatement)* NEWLINE?
;
multistatement
: statement (COLON statement)*
;
statement
: declarationStat
| defTypeStat
| assignment
| expression
;
assignment
: lvalue op=ASSIGN expression
;
expression
: <assoc=right> left=expression op=CARAT right=expression #exponentiationExprStat
| (PLUS|MINUS) expression #signExprStat
| IDENTIFIER DATATYPESUFFIX? LPAREN expression RPAREN #arrayIndexExprStat
| left=expression op=(ASTERISK|FSLASH) right=expression #multDivExprStat
| left=expression op=BSLASH right=expression #integerDivExprStat
| left=expression op=KW_MOD right=expression #modulusDivExprStat
| left=expression op=(PLUS|MINUS) right=expression #addSubExprStat
| left=string op=AMPERSAND right=string #stringConcatenation
| left=expression op=(RELATIONALOPERATORS | KW_IS | KW_ISA) right=expression #relationalComparisonExprStat
| left=expression (op=LOGICALOPERATORS right=expression)+ #logicalOrAndExprStat
| op=KW_LIKE patternString #likeExprStat
| LPAREN expression RPAREN #groupingExprStat
| NUMBER #atom
| string #atom
| IDENTIFIER DATATYPESUFFIX? #atom
;
lvalue
: (IDENTIFIER DATATYPESUFFIX?) | (IDENTIFIER DATATYPESUFFIX? LPAREN expression RPAREN)
;
string
: STRING
;
patternString
: DQUOT (QUESTIONMARK | POUND | ASTERISK | LBRACKET BANG? .*? RBRACKET)+ DQUOT
;
referenceType
: DATATYPE
;
declarationStat
: constDecl
| varDecl
;
constDecl
: CONSTDECL? KW_CONST IDENTIFIER EQUAL expression
;
varDecl
: VARDECL (varDeclPart (COMMA varDeclPart)*)? | listDeclPart
;
varDeclPart
: IDENTIFIER DATATYPESUFFIX? ((arrayBounds)? KW_AS DATATYPE (COMMA DATATYPE)*)?
;
listDeclPart
: IDENTIFIER DATATYPESUFFIX? KW_LIST KW_AS DATATYPE
;
arrayBounds
: LPAREN (arrayDimension (COMMA arrayDimension)*)? RPAREN
;
arrayDimension
: INTEGER (KW_TO INTEGER)?
;
defTypeStat
: DEFTYPES DEFTYPERANGE (COMMA DEFTYPERANGE)*
;
This is the lexer grammar.
lexer grammar wlLexer;
NUMBER
: INTEGER
| REAL
| BINARY
| OCTAL
| HEXIDECIMAL
;
RELATIONALOPERATORS
: EQUAL
| NEQUAL
| LT
| LTE
| GT
| GTE
;
LOGICALOPERATORS
: KW_OR
| KW_XOR
| KW_AND
| KW_NOT
| KW_IMP
| KW_EQV
;
INSTANCEOF
: KW_IS
| KW_ISA
;
CONSTDECL
: KW_PUBLIC
| KW_PRIVATE
;
DATATYPE
: KW_BOOLEAN
| KW_BYTE
| KW_INTEGER
| KW_LONG
| KW_SINGLE
| KW_DOUBLE
| KW_CURRENCY
| KW_STRING
;
VARDECL
: KW_DIM
| KW_STATIC
| KW_PUBLIC
| KW_PRIVATE
;
LABEL
: IDENTIFIER COLON
;
DEFTYPERANGE
: [a-zA-Z] MINUS [a-zA-Z]
;
DEFTYPES
: KW_DEFBOOL
| KW_DEFBYTE
| KW_DEFCUR
| KW_DEFDBL
| KW_DEFINT
| KW_DEFLNG
| KW_DEFSNG
| KW_DEFSTR
| KW_DEFVAR
;
DATATYPESUFFIX
: PERCENT
| AMPERSAND
| BANG
| POUND
| AT
| DOLLARSIGN
;
STRING
: (DQUOT (DQUOTESC|.)*? DQUOT)
| (LBRACE (RBRACEESC|.)*? RBRACE)
| (PIPE (PIPESC|.|NEWLINE)*? PIPE)
;
fragment DQUOTESC: '\"\"' ;
fragment RBRACEESC: '}}' ;
fragment PIPESC: '||' ;
INTEGER
: DIGIT+ (E (PLUS|MINUS)? DIGIT+)?
;
REAL
: DIGIT+ PERIOD DIGIT+ (E (PLUS|MINUS)? DIGIT+)?
;
BINARY
: AMPERSAND B BINARYDIGIT+
;
OCTAL
: AMPERSAND O OCTALDIGIT+
;
HEXIDECIMAL
: AMPERSAND H HEXDIGIT+
;
QUESTIONMARK: '?' ;
COLON: ':' ;
ASSIGN: '=';
SEMICOLON: ';' ;
AT: '#' ;
LPAREN: '(' ;
RPAREN: ')' ;
DQUOT: '"' ;
LBRACE: '{' ;
RBRACE: '}' ;
LBRACKET: '[' ;
RBRACKET: ']' ;
CARAT: '^' ;
PLUS: '+' ;
MINUS: '-' ;
ASTERISK: '*' ;
FSLASH: '/' ;
BSLASH: '\\' ;
AMPERSAND: '&' ;
BANG: '!' ;
POUND: '#' ;
DOLLARSIGN: '$' ;
PERCENT: '%' ;
COMMA: ',' ;
APOSTROPHE: '\'' ;
TWOPERIODS: '..' ;
PERIOD: '.' ;
UNDERSCORE: '_' ;
PIPE: '|' ;
NEWLINE: '\r\n' | '\r' | '\n';
EQUAL: '==' ;
NEQUAL: '<>' | '><' ;
LT: '<' ;
LTE: '<=' | '=<';
GT: '>' ;
GTE: '=<'|'<=' ;
KW_AND: A N D ;
KW_BINARY: B I N A R Y ;
KW_BOOLEAN: B O O L E A N ;
KW_BYTE: B Y T E ;
KW_DATATYPE: D A T A T Y P E ;
KW_DATE: D A T E ;
KW_INTEGER: I N T E G E R ;
KW_IS: I S ;
KW_ISA: I S A ;
KW_LIKE: L I K E ;
KW_LONG: L O N G ;
KW_MOD: M O D ;
KW_NOT: N O T ;
KW_TO: T O ;
KW_FALSE: F A L S E ;
KW_TRUE: T R U E ;
KW_SINGLE: S I N G L E ;
KW_DOUBLE: D O U B L E ;
KW_CURRENCY: C U R R E N C Y ;
KW_STRING: S T R I N G ;
fragment BINARYDIGIT: ('0'|'1') ;
fragment OCTALDIGIT: ('0'|'1'|'2'|'3'|'4'|'5'|'6'|'7') ;
fragment DIGIT: '0'..'9' ;
fragment HEXDIGIT: ('0'|'1'|'2'|'3'|'4'|'5'|'6'|'7'|'8'|'9' | A | B | C | D | E | F) ;
fragment A: ('a'|'A');
fragment B: ('b'|'B');
fragment C: ('c'|'C');
fragment D: ('d'|'D');
fragment E: ('e'|'E');
fragment F: ('f'|'F');
fragment G: ('g'|'G');
fragment H: ('h'|'H');
fragment I: ('i'|'I');
fragment J: ('j'|'J');
fragment K: ('k'|'K');
fragment L: ('l'|'L');
fragment M: ('m'|'M');
fragment N: ('n'|'N');
fragment O: ('o'|'O');
fragment P: ('p'|'P');
fragment Q: ('q'|'Q');
fragment R: ('r'|'R');
fragment S: ('s'|'S');
fragment T: ('t'|'T');
fragment U: ('u'|'U');
fragment V: ('v'|'V');
fragment W: ('w'|'W');
fragment X: ('x'|'X');
fragment Y: ('y'|'Y');
fragment Z: ('z'|'Z');
IDENTIFIER
: [a-zA-Z_][a-zA-Z0-9_~]*
;
LINE_ESCAPE
: (' ' | '\t') UNDERSCORE ('\r'? | '\n')
;
WS
: [ \t] -> skip
;
Take a look at this grammar (Note that this grammar is not supposed to be a grammar for BASIC, it's just an example to show how to disambiguate using "=" for both assignment and equality):
grammar Foo;
program:
(statement | exprOtherThanEquality)*
;
statement:
assignment
;
expr:
equality | exprOtherThanEquality
;
exprOtherThanEquality:
boolAndOr
;
boolAndOr:
atom (BOOL_OP expr)*
;
equality:
atom EQUAL expr
;
assignment:
VAR EQUAL expr ENDL
;
atom:
BOOL |
VAR |
INT |
group
;
group:
LEFT_PARENTH expr RGHT_PARENTH
;
ENDL : ';' ;
LEFT_PARENTH : '(' ;
RGHT_PARENTH : ')' ;
EQUAL : '=' ;
BOOL:
'true' | 'false'
;
BOOL_OP:
'and' | 'or'
;
VAR:
[A-Za-z_]+ [A-Za-z_0-9]*
;
INT:
'-'? [0-9]+
;
WS:
[ \t\r\n] -> skip
;
Here is the parse tree for the input: t = (x = 5) and (y = 2);
In one of the comments above, I asked you if we can assume that the first equal sign on a line always corresponds to an assignment. I retract that assumption slightly... The first equal sign on a line always corresponds to an assignment unless it is contained within parentheses. With the above grammar, this is a valid line: (x = 2). Here is the parse tree:

Resources