Removing ambiguity in antlr4 lexing - antlr4

I've been trying to convert a grammar that I have found online into an antlr4 format. The original grammar is here: https://github.com/cv/asp-parser/blob/master/vbscript.bnf.
Short Version:
The current problem that I am running across I think is due to ambiguity at the lexing stage.
For example, I copied the rule for a floating point literal as below:
float_literal : DIGIT* '.' DIGIT+ ( 'e' PLUS_OR_MINUS? DIGIT+ )?
| DIGIT+ 'e' PLUS_OR_MINUS? DIGIT+;
Further up in the file I have a definition for letters:
LETTER: 'a'..'z';
It seems that because I am using 'e' in the float literal, that character can't be recognised as a letter? In my research I have come across the idea of having a token for each letter, so letter would become:
letter: A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P | Q | R | S | T | U | V | W | X | Y | Z;
And I would replace any instances of 'e' with E. However there are much longer strings in this file, such as '.and'. So this approach would require replacing things like that with DOT A N D? Which doesn't seem right at all.
Am I doing something fundamentally wrong or is there something I can do to avoid this ambiguity?
Thanks,
Craig
The full grammar is below.
grammar vbscript;
/*===== Character Sets =====*/
SPACES: ' ' -> skip;
DIGIT: '0'..'9';
SEMI_COLON: ':';
NEW_LINE_CHARACTER: [\r\n]+;
WHITESPACE_CHARACTER: [ \t];
LETTER: 'a'..'z';
QUOTE: '"';
HASH: '#';
SQUARE_BRACE: '[' | ']';
PLUS_OR_MINUS: [+-];
ANYTHING_ELSE: ~('"' | '#');
ws: WHITESPACE_CHARACTER;
id_tail: (DIGIT | LETTER | '_');
string_character: ANYTHING_ELSE | DIGIT | WHITESPACE_CHARACTER | SEMI_COLON | LETTER | PLUS_OR_MINUS | SQUARE_BRACE;
id_name_char: ANYTHING_ELSE | DIGIT | WHITESPACE_CHARACTER | SEMI_COLON | LETTER | PLUS_OR_MINUS;
/*===== terminals =====*/
whitespace: ws+ | '_' ws* new_line?;
comment_line : '' | 'rem';
string_literal : '"' ( string_character | '""' )* '"';
float_literal : DIGIT* '.' DIGIT+ ( 'e' PLUS_OR_MINUS? DIGIT+ )?
| DIGIT+ 'e' PLUS_OR_MINUS? DIGIT+;
id : LETTER id_tail*
| '[' id_name_char* ']';
iddot : LETTER id_tail* '.'
| '[' id_name_char* ']' '.'
| 'and.'
| 'byref.'
| 'byval.'
| 'call.'
| 'case.'
| 'class.'
| 'const.'
| 'default.'
| 'dim.'
| 'do.'
| 'each.'
| 'else.'
| 'elseif.'
| 'empty.'
| 'end.'
| 'eqv.'
| 'erase.'
| 'error.'
| 'exit.'
| 'explicit.'
| 'false.'
| 'for.'
| 'function.'
| 'get.'
| 'goto.'
| 'if.'
| 'imp.'
| 'in.'
| 'is.'
| 'let.'
| 'loop.'
| 'mod.'
| 'new.'
| 'next.'
| 'not.'
| 'nothing.'
| 'null.'
| 'on.'
| 'option.'
| 'or.'
| 'preserve.'
| 'private.'
| 'property.'
| 'public.'
| 'redim.'
| 'rem.'
| 'resume.'
| 'select.'
| 'set.'
| 'step.'
| 'sub.'
| 'then.'
| 'to.'
| 'true.'
| 'until.'
| 'wend.'
| 'while.'
| 'with.'
| 'xor.';
dot_id : '.' LETTER id_tail*
| '.' '[' id_name_char* ']'
| '.and'
| '.byref'
| '.byval'
| '.call'
| '.case'
| '.class'
| '.const'
| '.default'
| '.dim'
| '.do'
| '.each'
| '.else'
| '.elseif'
| '.empty'
| '.end'
| '.eqv'
| '.erase'
| '.error'
| '.exit'
| '.explicit'
| '.false'
| '.for'
| '.function'
| '.get'
| '.goto'
| '.if'
| '.imp'
| '.in'
| '.is'
| '.let'
| '.loop'
| '.mod'
| '.new'
| '.next'
| '.not'
| '.nothing'
| '.null'
| '.on'
| '.option'
| '.or'
| '.preserve'
| '.private'
| '.property'
| '.public'
| '.redim'
| '.rem'
| '.resume'
| '.select'
| '.set'
| '.step'
| '.sub'
| '.then'
| '.to'
| '.true'
| '.until'
| '.wend'
| '.while'
| '.with'
| '.xor';
dot_iddot : '.' LETTER id_tail* '.'
| '.' '[' id_name_char* ']' '.'
| '.and.'
| '.byref.'
| '.byval.'
| '.call.'
| '.case.'
| '.class.'
| '.const.'
| '.default.'
| '.dim.'
| '.do.'
| '.each.'
| '.else.'
| '.elseif.'
| '.empty.'
| '.end.'
| '.eqv.'
| '.erase.'
| '.error.'
| '.exit.'
| '.explicit.'
| '.false.'
| '.for.'
| '.function.'
| '.get.'
| '.goto.'
| '.if.'
| '.imp.'
| '.in.'
| '.is.'
| '.let.'
| '.loop.'
| '.mod.'
| '.new.'
| '.next.'
| '.not.'
| '.nothing.'
| '.null.'
| '.on.'
| '.option.'
| '.or.'
| '.preserve.'
| '.private.'
| '.property.'
| '.public.'
| '.redim.'
| '.rem.'
| '.resume.'
| '.select.'
| '.set.'
| '.step.'
| '.sub.'
| '.then.'
| '.to.'
| '.true.'
| '.until.'
| '.wend.'
| '.while.'
| '.with.'
| '.xor.';
/*===== rules =====*/
new_line: (SEMI_COLON | NEW_LINE_CHARACTER)+;
program: new_line? global_stmt_list;
/*===== rules: declarations =====*/
class_decl: 'class' extended_id new_line member_decl_list 'end' 'class' new_line;
member_decl_list: member_decl*;
member_decl: field_decl | var_decl | const_decl | sub_decl | function_decl | property_decl;
field_decl:
'private' field_name other_vars_opt new_line
| 'public' field_name other_vars_opt new_line;
field_name: field_id '(' array_rank_list ')' | field_id;
field_id: id | 'default' | 'erase' | 'error' | 'explicit' | 'step';
var_decl: 'dim' var_name other_vars_opt new_line;
var_name: extended_id '(' array_rank_list ')' | extended_id;
other_vars_opt: (',' var_name other_vars_opt)?;
array_rank_list: (int_literal ',' array_rank_list | int_literal)?;
const_decl: access_modifier_opt 'const' const_list new_line;
const_list: extended_id '=' const_expr_def ',' const_list | extended_id '=' const_expr_def;
const_expr_def: '(' const_expr_def ')'
| '-' const_expr_def
| '+' const_expr_def
| const_expr;
sub_decl:
method_access_opt 'sub' extended_id method_arg_list new_line method_stmt_list 'end' 'sub' new_line
| method_access_opt 'sub' extended_id method_arg_list inline_stmt 'end' 'sub' new_line;
function_decl:
method_access_opt 'function' extended_id method_arg_list new_line method_stmt_list 'end' 'function' new_line
| method_access_opt 'function' extended_id method_arg_list inline_stmt 'end' 'function' new_line;
method_access_opt: 'public' 'default' | access_modifier_opt;
access_modifier_opt: ('public' | 'private')?;
method_arg_list: ('(' arg_list? ')')?;
arg_list: arg (',' arg_list)?;
arg: arg_modifier_opt extended_id ('(' ')')?;
arg_modifier_opt: ('byval' | 'byref')?;
property_decl: method_access_opt 'property' property_access_type extended_id method_arg_list new_line method_stmt_list 'end' 'property' new_line;
property_access_type: 'get' | 'let' | 'set';
/*===== rules: statements =====*/
global_stmt: option_explicit | class_decl | field_decl | const_decl | sub_decl | function_decl | block_stmt;
method_stmt: const_decl | block_stmt;
block_stmt:
var_decl
| redim_stmt
| if_stmt
| with_stmt
| select_stmt
| loop_stmt
| for_stmt
| inline_stmt new_line;
inline_stmt:
assign_stmt
| call_stmt
| sub_call_stmt
| error_stmt
| exit_stmt
| 'erase' extended_id;
global_stmt_list: global_stmt_list global_stmt | global_stmt;
method_stmt_list: method_stmt*;
block_stmt_list: block_stmt*;
option_explicit: 'option' 'explicit' new_line;
error_stmt: 'on' 'error' 'resume' 'next' | 'on' 'error' 'goto' int_literal;
exit_stmt: 'exit' 'do' | 'exit' 'for' | 'exit' 'function' | 'exit' 'property' | 'exit' 'sub';
assign_stmt:
left_expr '=' expr
| 'set' left_expr '=' expr
| 'set' left_expr '=' 'new' left_expr;
sub_call_stmt: qualified_id sub_safe_expr? comma_expr_list
| qualified_id sub_safe_expr?
| qualified_id '(' expr ')' comma_expr_list
| qualified_id '(' expr ')'
| qualified_id '(' ')'
| qualified_id index_or_params_list '.' left_expr_tail sub_safe_expr? comma_expr_list
| qualified_id index_or_params_list_dot left_expr_tail sub_safe_expr? comma_expr_list
| qualified_id index_or_params_list '.' left_expr_tail sub_safe_expr?
| qualified_id index_or_params_list_dot left_expr_tail sub_safe_expr?;
call_stmt: 'call' left_expr;
left_expr: qualified_id index_or_params_list '.' left_expr_tail
| qualified_id index_or_params_list_dot left_expr_tail
| qualified_id index_or_params_list
| qualified_id
| safe_keyword_id;
left_expr_tail: qualified_id_tail index_or_params_list '.' left_expr_tail
| qualified_id_tail index_or_params_list_dot left_expr_tail
| qualified_id_tail index_or_params_list
| qualified_id_tail;
qualified_id: iddot qualified_id_tail
| dot_iddot qualified_id_tail
| id
| dot_id;
qualified_id_tail: iddot qualified_id_tail
| id
| keyword_id;
keyword_id: safe_keyword_id
| 'and'
| 'byref'
| 'byval'
| 'call'
| 'case'
| 'class'
| 'const'
| 'dim'
| 'do'
| 'each'
| 'else'
| 'elseif'
| 'empty'
| 'end'
| 'eqv'
| 'exit'
| 'false'
| 'for'
| 'function'
| 'get'
| 'goto'
| 'if'
| 'imp'
| 'in'
| 'is'
| 'let'
| 'loop'
| 'mod'
| 'new'
| 'next'
| 'not'
| 'nothing'
| 'null'
| 'on'
| 'option'
| 'or'
| 'preserve'
| 'private'
| 'public'
| 'redim'
| 'resume'
| 'select'
| 'set'
| 'sub'
| 'then'
| 'to'
| 'true'
| 'until'
| 'wend'
| 'while'
| 'with'
| 'xor';
safe_keyword_id: 'default'
| 'erase'
| 'error'
| 'explicit'
| 'property'
| 'step';
extended_id: safe_keyword_id
| id;
index_or_params_list: index_or_params index_or_params_list
| index_or_params;
index_or_params: '(' expr comma_expr_list ')'
| '(' comma_expr_list ')'
| '(' expr ')'
| '(' ')';
index_or_params_list_dot: index_or_params index_or_params_list_dot
| index_or_params_dot;
index_or_params_dot: '(' expr comma_expr_list ').'
| '(' comma_expr_list ').'
| '(' expr ').'
| '(' ').';
comma_expr_list: ',' expr comma_expr_list
| ',' comma_expr_list
| ',' expr
| ',';
/* redim statement */
redim_stmt: 'redim' redim_decl_list new_line
| 'redim' 'preserve' redim_decl_list new_line;
redim_decl_list: redim_decl ',' redim_decl_list
| redim_decl;
redim_decl: extended_id '(' expr_list ')';
/* if statement */
if_stmt: 'if' expr 'then' new_line block_stmt_list else_stmt_list 'end' 'if' new_line
| 'if' expr 'then' inline_stmt else_opt end_if_opt new_line;
else_stmt_list: ('elseif' expr 'then' new_line block_stmt_list else_stmt_list
| 'elseif' expr 'then' inline_stmt new_line else_stmt_list
| 'else' inline_stmt new_line
| 'else' new_line block_stmt_list)?;
else_opt: ('else' inline_stmt)?;
end_if_opt : ('end' 'if')?;
/* with statement */
with_stmt: 'with' expr new_line block_stmt_list 'end' 'with' new_line;
/* loop statement */
loop_stmt: 'do' loop_type expr new_line block_stmt_list 'loop' new_line
| 'do' new_line block_stmt_list 'loop' loop_type expr new_line
| 'do' new_line block_stmt_list 'loop' new_line
| 'while' expr new_line block_stmt_list 'wend' new_line;
loop_type: 'while' | 'until';
/* for statement */
for_stmt: 'for' extended_id '=' expr 'to' expr step_opt new_line block_stmt_list 'next' new_line
| 'for' 'each' extended_id 'in' expr new_line block_stmt_list 'next' new_line;
step_opt: ('step' expr)?;
/* select statement */
select_stmt: 'select' 'case' expr new_line cast_stmt_list 'end' 'select' new_line;
cast_stmt_list: ('case' expr_list nl_opt block_stmt_list cast_stmt_list
| 'case' 'else' nl_opt block_stmt_list)?;
nl_opt: new_line?;
expr_list: expr ',' expr_list | expr;
/*===== rules: expressions =====*/
sub_safe_expr: sub_safe_imp_expr;
sub_safe_imp_expr: sub_safe_imp_expr 'imp' eqv_expr | sub_safe_eqv_expr;
sub_safe_eqv_expr: sub_safe_eqv_expr 'eqv' xor_expr
| sub_safe_xor_expr;
sub_safe_xor_expr: sub_safe_xor_expr 'xor' or_expr
| sub_safe_or_expr;
sub_safe_or_expr: sub_safe_or_expr 'or' and_expr
| sub_safe_and_expr;
sub_safe_and_expr : sub_safe_and_expr 'and' not_expr
| sub_safe_not_expr;
sub_safe_not_expr : 'not' not_expr
| sub_safe_compare_expr;
sub_safe_compare_expr : sub_safe_compare_expr 'is' concat_expr
| sub_safe_compare_expr 'is' 'not' concat_expr
| sub_safe_compare_expr '>=' concat_expr
| sub_safe_compare_expr '=>' concat_expr
| sub_safe_compare_expr '<=' concat_expr
| sub_safe_compare_expr '=<' concat_expr
| sub_safe_compare_expr '>' concat_expr
| sub_safe_compare_expr '<' concat_expr
| sub_safe_compare_expr '<>' concat_expr
| sub_safe_compare_expr '=' concat_expr
| sub_safe_concat_expr;
sub_safe_concat_expr : sub_safe_concat_expr '&' add_expr
| sub_safe_add_expr;
sub_safe_add_expr : sub_safe_add_expr '+' mod_expr
| sub_safe_add_expr '-' mod_expr
| sub_safe_mod_expr;
sub_safe_mod_expr : sub_safe_mod_expr 'mod' int_div_expr
| sub_safe_int_div_expr;
sub_safe_int_div_expr : sub_safe_int_div_expr '\\' mult_expr
| sub_safe_mult_expr;
sub_safe_mult_expr : sub_safe_mult_expr '*' unary_expr
| sub_safe_mult_expr '/' unary_expr
| sub_safe_unary_expr;
sub_safe_unary_expr : '-' unary_expr
| '+' unary_expr
| sub_safe_exp_expr;
sub_safe_exp_expr : sub_safe_value '^' exp_expr
| sub_safe_value;
sub_safe_value : const_expr
| left_expr
| '(' expr ')';
expr : imp_expr;
imp_expr : imp_expr 'imp' eqv_expr
| eqv_expr;
eqv_expr : eqv_expr 'eqv' xor_expr
| xor_expr;
xor_expr : xor_expr 'xor' or_expr
| or_expr;
or_expr : or_expr 'or' and_expr
| and_expr;
and_expr : and_expr 'and' not_expr
| not_expr;
not_expr : 'not' not_expr
| compare_expr;
compare_expr : compare_expr 'is' concat_expr
| compare_expr 'is' 'not' concat_expr
| compare_expr '>=' concat_expr
| compare_expr '=>' concat_expr
| compare_expr '<=' concat_expr
| compare_expr '=<' concat_expr
| compare_expr '>' concat_expr
| compare_expr '<' concat_expr
| compare_expr '<>' concat_expr
| compare_expr '=' concat_expr
| concat_expr;
concat_expr : concat_expr '&' add_expr
| add_expr;
add_expr : add_expr '+' mod_expr
| add_expr '-' mod_expr
| mod_expr;
mod_expr : mod_expr 'mod' int_div_expr
| int_div_expr;
int_div_expr : int_div_expr '\\' mult_expr
| mult_expr;
mult_expr : mult_expr '*' unary_expr
| mult_expr '/' unary_expr
| unary_expr;
unary_expr : '-' unary_expr
| '+' unary_expr
| exp_expr;
exp_expr : value '^' exp_expr
| value;
value : const_expr
| left_expr
| '(' expr ')';
const_expr : bool_literal
| int_literal
| float_literal
| string_literal
| nothing;
bool_literal : 'true'
| 'false';
int_literal : DIGIT+;
nothing : 'nothing'
| 'null'
| 'empty';

Your grammar defines "Literals" in the parser section. Note that ANTLR treats every lower case rule as parser rule (upper case rules are lexer rules).
Your small problem part may be solved like this:
FLOAT_LITERAL
: DIGIT* '.' DIGIT+ ( 'e' PLUS_OR_MINUS? DIGIT+ )?
| DIGIT+ 'e' PLUS_OR_MINUS? DIGIT+;
LETTER
: [a-z];
The ANTLR lexer prefers the longest matching rule (if two rules are in conflict, it prefers the first defined one). Both rules are completely disjunct so the sequence of definition is not relevant (its just more readable to define the more complex rules above the basic ones).
You can extend the second definition by upper case Characters:
LETTER
: [a-zA-Z];
To solve the overall problem of your grammar, you will need a complete rewriting of your grammar. Most rules of the terminals section should be lexer rules. However the terminals sections seems overly populated so that it could also be that some rules are workarounds for non existent parser rules.

Related

ANTLR 4 building parse tree incorrectly

I'm making a grammar for a subset of SQL, which I've pasted below:
grammar Sql;
sel_stmt : SEL QUANT? col_list FROM tab_list;
as_stmt : 'as' ID;
col_list : '*' | col_spec (',' col_spec | ',' col_group)*;
col_spec : (ID | ID '.' ID) as_stmt?;
col_group : ID '.' '*';
tab_list : (tab_spec | join_tab) (',' tab_spec | ',' join_tab)*;
tab_spec : (ID | sub_query) as_stmt?;
sub_query : '(' sel_stmt ')';
join_tab : tab_spec (join_type? 'join' tab_spec 'on' cond_list)+;
join_type : 'inner' | 'full' | 'left' | 'right';
cond_list : cond_term (BOOL_OP cond_term)*;
cond_term : col_spec (COMP_OP val | between | like);
val : INT | FLT | SQ_STR | DQ_STR;
between : ('not')? 'between' val 'and' val;
like : ('not')? 'like' (SQ_STR | DQ_STR);
WS : (' ' | '\t' | '\n')+ -> skip;
INT : ('0'..'9')+;
FLT : INT '.' INT;
SQ_STR : '\'' ~('\\'|'\'')* '\'';
DQ_STR : '"' ~('\\'|'"')* '"';
BOOL_OP : ',' | 'or' | 'and';
COMP_OP : '=' | '<>' | '<' | '>' | '<=' | '>=';
SEL : 'select' | 'SELECT';
QUANT: 'distinct' | 'DISTINCT' | 'all' | 'ALL';
FROM: 'from' | 'FROM';
ID : ('A'..'Z' | 'a'..'z')('A'..'Z' | 'a'..'z' | '0'..'9')*;
The input I'm testing is select distinct test.col1 as col, test2.* from test join test2 on col='something', test2.col1=1.4. The output parse tree matches the last appearance of test2 as a and thus doesn't know what to do with the rest of the input. The comma before the last 'test2' token is made a child of the node, when it should be a child of .
My question is what is going on behind the scenes to cause this?
Apparently, ANTLR does not like it when you use a literal symbol as a token and have it be part of another token. In my case, the comma token was both used as a literal token and as part of the BOOL_OP token.
A future question may be how ANTLR disambiguates when the same symbol is used as parts of different tokens that apply only under specific scopes. However, this question is answered for now.

how can I refactor this ANTLR4 grammar so that it isn't mutually left recursive?

I can't seem to figure out why this grammar won't compile. It compiled fine until I modified line 145 from
(Identifier '.')* functionCall
to
(primary '.')? functionCall
I've been trying to figure out how to solve this issue for a while but I can't seem to be able to. Here's the error:
The following sets of rules are mutually left-recursive [primary]
grammar Tadpole;
#header
{package net.tadpole.compiler.parser;}
file
: fileContents*
;
fileContents
: structDec
| functionDec
| statement
| importDec
;
importDec
: 'import' Identifier ';'
;
literal
: IntegerLiteral
| FloatingPointLiteral
| BooleanLiteral
| CharacterLiteral
| StringLiteral
| NoneLiteral
| arrayLiteral
;
arrayLiteral
: '[' expressionList? ']'
;
expressionList
: expression (',' expression)*
;
expression
: primary
| unaryExpression
| <assoc=right> expression binaryOpPrec0 expression
| <assoc=left> expression binaryOpPrec1 expression
| <assoc=left> expression binaryOpPrec2 expression
| <assoc=left> expression binaryOpPrec3 expression
| <assoc=left> expression binaryOpPrec4 expression
| <assoc=left> expression binaryOpPrec5 expression
| <assoc=left> expression binaryOpPrec6 expression
| <assoc=left> expression binaryOpPrec7 expression
| <assoc=left> expression binaryOpPrec8 expression
| <assoc=left> expression binaryOpPrec9 expression
| <assoc=left> expression binaryOpPrec10 expression
| <assoc=right> expression binaryOpPrec11 expression
;
unaryExpression
: unaryOp expression
| prefixPostfixOp primary
| primary prefixPostfixOp
;
unaryOp
: '+'
| '-'
| '!'
| '~'
;
prefixPostfixOp
: '++'
| '--'
;
binaryOpPrec0
: '**'
;
binaryOpPrec1
: '*'
| '/'
| '%'
;
binaryOpPrec2
: '+'
| '-'
;
binaryOpPrec3
: '>>'
| '>>>'
| '<<'
;
binaryOpPrec4
: '<'
| '>'
| '<='
| '>='
| 'is'
;
binaryOpPrec5
: '=='
| '!='
;
binaryOpPrec6
: '&'
;
binaryOpPrec7
: '^'
;
binaryOpPrec8
: '|'
;
binaryOpPrec9
: '&&'
;
binaryOpPrec10
: '||'
;
binaryOpPrec11
: '='
| '**='
| '*='
| '/='
| '%='
| '+='
| '-='
| '&='
| '|='
| '^='
| '>>='
| '>>>='
| '<<='
| '<-'
;
primary
: literal
| fieldName
| '(' expression ')'
| '(' type ')' (primary | unaryExpression)
| 'new' objType '(' expressionList? ')'
| primary '.' fieldName
| primary dimension
| (primary '.')? functionCall
;
functionCall
: functionName '(' expressionList? ')'
;
functionName
: Identifier
;
dimension
: '[' expression ']'
;
statement
: '{' statement* '}'
| expression ';'
| 'recall' ';'
| 'return' expression? ';'
| variableDec
| 'if' '(' expression ')' statement ('else' statement)?
| 'while' '(' expression ')' statement
| 'do' expression 'while' '(' expression ')' ';'
| 'do' '{' statement* '}' 'while' '(' expression ')' ';'
;
structDec
: 'struct' structName ('(' parameterList ')')? '{' variableDec* functionDec* '}'
;
structName
: Identifier
;
fieldName
: Identifier
;
variableDec
: type fieldName ('=' expression)? ';'
;
type
: primitiveType ('[' ']')*
| objType ('[' ']')*
;
primitiveType
: 'byte'
| 'short'
| 'int'
| 'long'
| 'char'
| 'boolean'
| 'float'
| 'double'
;
objType
: (Identifier '.')? structName
;
functionDec
: 'def' functionName '(' parameterList? ')' ':' type '->' functionBody
;
functionBody
: statement
;
parameterList
: parameter (',' parameter)*
;
parameter
: type fieldName
;
IntegerLiteral
: DecimalIntegerLiteral
| HexIntegerLiteral
| OctalIntegerLiteral
| BinaryIntegerLiteral
;
fragment
DecimalIntegerLiteral
: DecimalNumeral IntegerSuffix?
;
fragment
HexIntegerLiteral
: HexNumeral IntegerSuffix?
;
fragment
OctalIntegerLiteral
: OctalNumeral IntegerSuffix?
;
fragment
BinaryIntegerLiteral
: BinaryNumeral IntegerSuffix?
;
fragment
IntegerSuffix
: [lL]
;
fragment
DecimalNumeral
: Digit (Digits? | Underscores Digits)
;
fragment
Digits
: Digit (DigitsAndUnderscores? Digit)?
;
fragment
Digit
: [0-9]
;
fragment
DigitsAndUnderscores
: DigitOrUnderscore+
;
fragment
DigitOrUnderscore
: Digit
| '_'
;
fragment
Underscores
: '_'+
;
fragment
HexNumeral
: '0' [xX] HexDigits
;
fragment
HexDigits
: HexDigit (HexDigitsAndUnderscores? HexDigit)?
;
fragment
HexDigit
: [0-9a-fA-F]
;
fragment
HexDigitsAndUnderscores
: HexDigitOrUnderscore+
;
fragment
HexDigitOrUnderscore
: HexDigit
| '_'
;
fragment
OctalNumeral
: '0' [oO] Underscores? OctalDigits
;
fragment
OctalDigits
: OctalDigit (OctalDigitsAndUnderscores? OctalDigit)?
;
fragment
OctalDigit
: [0-7]
;
fragment
OctalDigitsAndUnderscores
: OctalDigitOrUnderscore+
;
fragment
OctalDigitOrUnderscore
: OctalDigit
| '_'
;
fragment
BinaryNumeral
: '0' [bB] BinaryDigits
;
fragment
BinaryDigits
: BinaryDigit (BinaryDigitsAndUnderscores? BinaryDigit)?
;
fragment
BinaryDigit
: [01]
;
fragment
BinaryDigitsAndUnderscores
: BinaryDigitOrUnderscore+
;
fragment
BinaryDigitOrUnderscore
: BinaryDigit
| '_'
;
// §3.10.2 Floating-Point Literals
FloatingPointLiteral
: DecimalFloatingPointLiteral FloatingPointSuffix?
| HexadecimalFloatingPointLiteral FloatingPointSuffix?
;
fragment
FloatingPointSuffix
: [fFdD]
;
fragment
DecimalFloatingPointLiteral
: Digits '.' Digits? ExponentPart?
| '.' Digits ExponentPart?
| Digits ExponentPart
| Digits
;
fragment
ExponentPart
: ExponentIndicator SignedInteger
;
fragment
ExponentIndicator
: [eE]
;
fragment
SignedInteger
: Sign? Digits
;
fragment
Sign
: [+-]
;
fragment
HexadecimalFloatingPointLiteral
: HexSignificand BinaryExponent
;
fragment
HexSignificand
: HexNumeral '.'?
| '0' [xX] HexDigits? '.' HexDigits
;
fragment
BinaryExponent
: BinaryExponentIndicator SignedInteger
;
fragment
BinaryExponentIndicator
: [pP]
;
BooleanLiteral
: 'true'
| 'false'
;
CharacterLiteral
: '\'' SingleCharacter '\''
| '\'' EscapeSequence '\''
;
fragment
SingleCharacter
: ~['\\]
;
StringLiteral
: '"' StringCharacters? '"'
;
fragment
StringCharacters
: StringCharacter+
;
fragment
StringCharacter
: ~["\\]
| EscapeSequence
;
fragment
EscapeSequence
: '\\' [btnfr"'\\]
| OctalEscape
| UnicodeEscape
;
fragment
OctalEscape
: '\\' OctalDigit
| '\\' OctalDigit OctalDigit
| '\\' ZeroToThree OctalDigit OctalDigit
;
fragment
ZeroToThree
: [0-3]
;
fragment
UnicodeEscape
: '\\' 'u' HexDigit HexDigit HexDigit HexDigit
;
NoneLiteral
: 'nil'
;
Identifier
: IdentifierStartChar IdentifierChar*
;
fragment
IdentifierStartChar
: [a-zA-Z$_] // these are the "java letters" below 0xFF
| // covers all characters above 0xFF which are not a surrogate
~[\u0000-\u00FF\uD800-\uDBFF]
{Character.isJavaIdentifierStart(_input.LA(-1))}?
| // covers UTF-16 surrogate pairs encodings for U+10000 to U+10FFFF
[\uD800-\uDBFF] [\uDC00-\uDFFF]
{Character.isJavaIdentifierStart(Character.toCodePoint((char)_input.LA(-2), (char)_input.LA(-1)))}?
;
fragment
IdentifierChar
: [a-zA-Z0-9$_] // these are the "java letters or digits" below 0xFF
| // covers all characters above 0xFF which are not a surrogate
~[\u0000-\u00FF\uD800-\uDBFF]
{Character.isJavaIdentifierPart(_input.LA(-1))}?
| // covers UTF-16 surrogate pairs encodings for U+10000 to U+10FFFF
[\uD800-\uDBFF] [\uDC00-\uDFFF]
{Character.isJavaIdentifierPart(Character.toCodePoint((char)_input.LA(-2), (char)_input.LA(-1)))}?
;
WS : [ \t\r\n\u000C]+ -> skip
;
LINE_COMMENT
: '#' ~[\r\n]* -> skip
;
The left recursive invocation needs to be the first, so no parenthesis can be placed before it.
You can rewrite it like this:
primary
: literal
| fieldName
| '(' expression ')'
| '(' type ')' (primary | unaryExpression)
| 'new' objType '(' expressionList? ')'
| primary '.' fieldName
| primary dimension
| primary '.' functionCall
| functionCall
;
which is equivalent.

Disambiguating unary and binary minus in antlr4 grammar

I am creating an antlr4 grammar for a moderately simple language. I am struggling to get the grammar to differentiate between unary and binary minus. I have read all the other posts that I can find on this topic here on Stackoverflow, but have found that the answers either apply to antlr3 in ways I cannot figure out how to express in antlr4, or that I seem not to be adept in translating the advice of these answers to my own situation. I often end with the problem that antlr cannot unambiguously resolve the rules if I play around with other alternatives.
Below is the antlr file in its entirety. The ambiguity in this version occurs around the production:
binop_expr
: SUMOP product
| product ( SUMOP product )*
;
(I had originally used UNARY_ABELIAN_OP instead of the second SUMOP, but that led to a different kind of ambiguity — the tool apparently couldn't recognise that it needed to differentiate between the same token in two different contexts. I mention this because one of the posts here recommends using a different name for the unary operator.)
grammar Kant;
program
: type_declaration_list main
;
type_declaration_list
: type_declaration
| type_declaration_list type_declaration
| /* null */
;
type_declaration
: 'context' JAVA_ID '{' context_body '}'
| 'class' JAVA_ID '{' class_body '}'
| 'class' JAVA_ID 'extends' JAVA_ID '{' class_body '}'
;
context_body
: context_body context_body_element
| context_body_element
| /* null */
;
context_body_element
: method_decl
| object_decl
| role_decl
| stageprop_decl
;
role_decl
: 'role' JAVA_ID '{' role_body '}'
| 'role' JAVA_ID '{' role_body '}' REQUIRES '{' self_methods '}'
| access_qualifier 'role' JAVA_ID '{' role_body '}'
| access_qualifier 'role' JAVA_ID '{' role_body '}' REQUIRES '{' self_methods '}'
;
role_body
: method_decl
| role_body method_decl
| object_decl // illegal
| role_body object_decl // illegal — for better error messages only
;
self_methods
: self_methods ';' method_signature
| method_signature
| self_methods /* null */ ';'
;
stageprop_decl
: 'stageprop' JAVA_ID '{' stageprop_body '}'
| 'stageprop' JAVA_ID '{' stageprop_body '}' REQUIRES '{' self_methods '}'
| access_qualifier 'stageprop' JAVA_ID '{' stageprop_body '}'
| access_qualifier 'stageprop' JAVA_ID '{' stageprop_body '}' REQUIRES '{' self_methods '}'
;
stageprop_body
: method_decl
| stageprop_body method_decl
;
class_body
: class_body class_body_element
| class_body_element
| /* null */
;
class_body_element
: method_decl
| object_decl
;
method_decl
: method_decl_hook '{' expr_and_decl_list '}'
;
method_decl_hook
: method_signature
| method_signature CONST
;
method_signature
: access_qualifier return_type method_name '(' param_list ')'
| access_qualifier return_type method_name
| access_qualifier method_name '(' param_list ')'
;
expr_and_decl_list
: object_decl
| expr ';' object_decl
| expr_and_decl_list object_decl
| expr_and_decl_list expr
| expr_and_decl_list /*null-expr */ ';'
| /* null */
;
return_type
: type_name
| /* null */
;
method_name
: JAVA_ID
;
access_qualifier
: 'public' | 'private' | /* null */
;
object_decl
: access_qualifier compound_type_name identifier_list ';'
| access_qualifier compound_type_name identifier_list
| compound_type_name identifier_list /* null expr */ ';'
| compound_type_name identifier_list
;
compound_type_name
: type_name '[' ']'
| type_name
;
type_name
: JAVA_ID
| 'int'
| 'double'
| 'char'
| 'String'
;
identifier_list
: JAVA_ID
| identifier_list ',' JAVA_ID
| JAVA_ID ASSIGN expr
| identifier_list ',' JAVA_ID ASSIGN expr
;
param_list
: param_decl
| param_list ',' param_decl
| /* null */
;
param_decl
: type_name JAVA_ID
;
main
: expr
;
expr
: block
| expr '.' message
| expr '.' CLONE
| expr '.' JAVA_ID
| ABELIAN_INCREMENT_OP expr '.' JAVA_ID
| expr '.' JAVA_ID ABELIAN_INCREMENT_OP
| /* this. */ message
| JAVA_ID
| constant
| if_expr
| for_expr
| while_expr
| do_while_expr
| switch_expr
| BREAK
| CONTINUE
| boolean_expr
| binop_expr
| '(' expr ')'
| <assoc=right> expr ASSIGN expr
| NEW message
| NEW type_name '[' expr ']'
| RETURN expr
| RETURN
;
relop_expr
: sexpr RELATIONAL_OPERATOR sexpr
;
// This is just a duplication of expr. We separate it out
// because a top-down antlr4 parser can't handle the
// left associative ambiguity. It is used only
// for abelian types.
sexpr
: block
| sexpr '.' message
| sexpr '.' CLONE
| sexpr '.' JAVA_ID
| ABELIAN_INCREMENT_OP sexpr '.' JAVA_ID
| sexpr '.' JAVA_ID ABELIAN_INCREMENT_OP
| /* this. */ message
| JAVA_ID
| constant
| if_expr
| for_expr
| while_expr
| do_while_expr
| switch_expr
| BREAK
| CONTINUE
| '(' sexpr ')'
| <assoc=right> sexpr ASSIGN sexpr
| NEW message
| NEW type_name '[' expr ']'
| RETURN expr
| RETURN
;
block
: '{' expr_and_decl_list '}'
| '{' '}'
;
expr_or_null
: expr
| /* null */
;
if_expr
: 'if' '(' boolean_expr ')' expr
| 'if' '(' boolean_expr ')' expr 'else' expr
;
for_expr
: 'for' '(' object_decl boolean_expr ';' expr ')' expr // O.K. — expr can be a block
| 'for' '(' JAVA_ID ':' expr ')' expr
;
while_expr
: 'while' '(' boolean_expr ')' expr
;
do_while_expr
: 'do' expr 'while' '(' boolean_expr ')'
;
switch_expr
: SWITCH '(' expr ')' '{' ( switch_body )* '}'
;
switch_body
: ( CASE constant | DEFAULT ) ':' expr_and_decl_list
;
binop_expr
: SUMOP product
| product ( SUMOP product )*
;
product
: atom ( MULOP atom )*
;
atom
: null_expr
| JAVA_ID
| JAVA_ID ABELIAN_INCREMENT_OP
| ABELIAN_INCREMENT_OP JAVA_ID
| constant
| '(' expr ')'
| array_expr '[' sexpr ']'
| array_expr '[' sexpr ']' ABELIAN_INCREMENT_OP
| ABELIAN_INCREMENT_OP array_expr '[' sexpr ']'
;
null_expr
: NULL
;
array_expr
: sexpr
;
boolean_expr
: boolean_product ( BOOLEAN_SUMOP boolean_product )*
;
boolean_product
: boolean_atom ( BOOLEAN_MULOP boolean_atom )*
;
boolean_atom
: BOOLEAN
| JAVA_ID
| '(' boolean_expr ')'
| LOGICAL_NOT boolean_expr
| relop_expr
;
constant
: STRING
| INTEGER
| FLOAT
| BOOLEAN
;
message
: <assoc=right> JAVA_ID '(' argument_list ')'
;
argument_list
: expr
| argument_list ',' expr
| /* null */
;
// Lexer rules
STRING : '"' ( ~'"' | '\\' '"' )* '"' ;
INTEGER : ('1' .. '9')+ ('0' .. '9')* | '0';
FLOAT : (('1' .. '9')* | '0') '.' ('0' .. '9')* ;
BOOLEAN : 'true' | 'false' ;
SWITCH : 'switch' ;
CASE : 'case' ;
DEFAULT : 'default' ;
BREAK : 'break' ;
CONTINUE : 'continue' ;
RETURN : 'return' ;
REQUIRES : 'requires' ;
NEW : 'new' ;
CLONE : 'clone' ;
NULL : 'null' ;
CONST : 'const' ;
RELATIONAL_OPERATOR : '!=' | '==' | '>' | '<' | '>=' | '<=';
LOGICAL_NOT : '!' ;
BOOLEAN_MULOP : '&&' ;
BOOLEAN_SUMOP : '||' | '^' ;
SUMOP : '+' | '-' ;
MULOP : '*' | '/' ;
ABELIAN_INCREMENT_OP : '++' | '--' ;
JAVA_ID: (('a' .. 'z') | ('A' .. 'Z')) (('a' .. 'z') | ('A' .. 'Z') | ('0' .. '9') | '_')* ;
INLINE_COMMENT: '//' ~[\r\n]* -> channel(HIDDEN) ;
C_COMMENT: '/*' .*? '*/' -> channel(HIDDEN) ;
WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ -> channel(HIDDEN) ;
ASSIGN : '=' ;
Typical of the problem is that the parser can't recognise the unary minus in this expression (it simply does not accept the construct):
Base b1 = new Base(-threetwoone);
Try removing the unary expression from binop_expr, and add it to the expr rule:
expr
: ...
| unary_expr
| binop_expr
| ...
;
unary_expr
: SUMOP binop_expr
;
binop_expr
: product ( SUMOP product )*
;
The problem seems to arise from having two kinds of expressions in the grammar that need better to be grouped with each other instead of being intermixed as much as they are. The overall intent is to create a language where everything is an expression, including, for example, loops and conditionals. (You can use your imagination to deduce the values they generate; however, that is a semantic rather than syntactic issue).
One set of these expressions have nice abelian properties (they behave like ordinary arithmetic types with well-behaved operations). The language specifies how lower-level productions are tied together by operations that give them context: so two identifiers, "a" and "b", can be associated by the operator "+" as "a + b". It's a highly contexutalized grammar.
The other of these sets of expressions comprises productions with a much lower degree of syntactic cohesion. One can combine "if" expressions, "for" expressions, and declarations in any order. The only linguistic property associating them is catenation, as we find in productions for object_body, class_body, and expr_and_decl_list.
The basic problem is that these two grammars became intertwined in the antlr spec. Combined with the fact that operators like "+" are context-sensitive, and that the context can arise either from catenation or the more contextualised abelian alternative, the parser was left with two alternative ways of parsing some expressions. I managed to wiggle around most of the ambiguity, but it got pushed into the area of unary prefix operators. So if "a" is an expression (it is), and "b" is an expression, then the string "a + b" is ambiguous. It can mean either product ( SUMOP product )*, or it can mean expr_and_decl_list expr where expr_and_decl_list was earlier reduced from the expr "a". The expr in expr_and_decl_list expr can reduce from "+b".
I re-wrote the grammar, inspired by other examples I've found on the web. (One of the more influential was one that Bart Kiers pointed me to: If/else statements in ANTLR using listeners) You can find the new grammar below. Stuff that belonged together with the grammar for the abelian expressions but which used to reduce to the expr target, like NEW expr and expr '.' JAVA_ID, has now been separated out to reduce to an abelian_expr target. All the productions that reduce down to an abelian_expr are now more or less direct descendants of the abelian_expr node in the grammar; that helps antlr deal intelligently with what otherwise would be ambiguities. A unary "+" expression no longer can reduce directly to expr but can find its way to expr only through abelian_expr, so it will never be treated as production that can simply be catenated as a largely uncontextualized expression.
I can't prove that that's what was going on, but signs point in that direction. If someone else has a more formal or deductive analysis (particularly if it points to another conclusion) I'd love to hear your reasoning.
The lesson here seems to be that what amounted to a high-level, fundamental flaw in the design of the language managed to show up as only an isolated and in some sense minor bug: the inability to disambiguate between the unary and binary uses of an operator.
grammar Kant;
program
: type_declaration_list main
| type_declaration_list // missing main
;
main
: expr
;
type_declaration_list
: type_declaration
| type_declaration_list type_declaration
| /* null */
;
type_declaration
: 'context' JAVA_ID '{' context_body '}'
| 'class' JAVA_ID '{' class_body '}'
| 'class' JAVA_ID 'extends' JAVA_ID '{' class_body '}'
;
context_body
: context_body context_body_element
| context_body_element
| /* null */
;
context_body_element
: method_decl
| object_decl
| role_decl
| stageprop_decl
;
role_decl
: 'role' JAVA_ID '{' role_body '}'
| 'role' JAVA_ID '{' role_body '}' REQUIRES '{' self_methods '}'
| access_qualifier 'role' JAVA_ID '{' role_body '}'
| access_qualifier 'role' JAVA_ID '{' role_body '}' REQUIRES '{' self_methods '}'
;
role_body
: method_decl
| role_body method_decl
| object_decl // illegal
| role_body object_decl // illegal — for better error messages only
;
self_methods
: self_methods ';' method_signature
| method_signature
| self_methods /* null */ ';'
;
stageprop_decl
: 'stageprop' JAVA_ID '{' stageprop_body '}'
| 'stageprop' JAVA_ID '{' stageprop_body '}' REQUIRES '{' self_methods '}'
| access_qualifier 'stageprop' JAVA_ID '{' stageprop_body '}'
| access_qualifier 'stageprop' JAVA_ID '{' stageprop_body '}' REQUIRES '{' self_methods '}'
;
stageprop_body
: method_decl
| stageprop_body method_decl
| object_decl // illegal
| stageprop_body object_decl // illegal — for better error messages only
;
class_body
: class_body class_body_element
| class_body_element
| /* null */
;
class_body_element
: method_decl
| object_decl
;
method_decl
: method_decl_hook '{' expr_and_decl_list '}'
;
method_decl_hook
: method_signature
;
method_signature
: access_qualifier return_type method_name '(' param_list ')' CONST*
| access_qualifier return_type method_name CONST*
| access_qualifier method_name '(' param_list ')' CONST*
;
expr_and_decl_list
: object_decl
| expr ';' object_decl
| expr_and_decl_list object_decl
| expr_and_decl_list expr
| expr_and_decl_list /*null-expr */ ';'
| /* null */
;
return_type
: type_name
| /* null */
;
method_name
: JAVA_ID
;
access_qualifier
: 'public' | 'private' | /* null */
;
object_decl
: access_qualifier compound_type_name identifier_list ';'
| access_qualifier compound_type_name identifier_list
| compound_type_name identifier_list /* null expr */ ';'
| compound_type_name identifier_list
;
compound_type_name
: type_name '[' ']'
| type_name
;
type_name
: JAVA_ID
| 'int'
| 'double'
| 'char'
| 'String'
;
identifier_list
: JAVA_ID
| identifier_list ',' JAVA_ID
| JAVA_ID ASSIGN expr
| identifier_list ',' JAVA_ID ASSIGN expr
;
param_list
: param_decl
| param_list ',' param_decl
| /* null */
;
param_decl
: type_name JAVA_ID
;
expr
: abelian_expr
| boolean_expr
| block
| if_expr
| for_expr
| while_expr
| do_while_expr
| switch_expr
| BREAK
| CONTINUE
| RETURN expr
| RETURN
;
abelian_expr
: <assoc=right>abelian_expr POW abelian_expr
| ABELIAN_SUMOP expr
| LOGICAL_NEGATION expr
| NEW message
| NEW type_name '[' expr ']'
| abelian_expr ABELIAN_MULOP abelian_expr
| abelian_expr ABELIAN_SUMOP abelian_expr
| abelian_expr ABELIAN_RELOP abelian_expr
| null_expr
| /* this. */ message
| JAVA_ID
| JAVA_ID ABELIAN_INCREMENT_OP
| ABELIAN_INCREMENT_OP JAVA_ID
| constant
| '(' abelian_expr ')'
| abelian_expr '[' expr ']'
| abelian_expr '[' expr ']' ABELIAN_INCREMENT_OP
| ABELIAN_INCREMENT_OP expr '[' expr ']'
| ABELIAN_INCREMENT_OP expr '.' JAVA_ID
| abelian_expr '.' JAVA_ID ABELIAN_INCREMENT_OP
| abelian_expr '.' message
| abelian_expr '.' CLONE
| abelian_expr '.' CLONE '(' ')'
| abelian_expr '.' JAVA_ID
| <assoc=right> abelian_expr ASSIGN expr
;
message
: <assoc=right> JAVA_ID '(' argument_list ')'
;
boolean_expr
: boolean_expr BOOLEAN_MULOP expr
| boolean_expr BOOLEAN_SUMOP expr
| constant // 'true' / 'false'
| abelian_expr
;
block
: '{' expr_and_decl_list '}'
| '{' '}'
;
expr_or_null
: expr
| /* null */
;
if_expr
: 'if' '(' expr ')' expr
| 'if' '(' expr ')' expr 'else' expr
;
for_expr
: 'for' '(' object_decl expr ';' expr ')' expr // O.K. — expr can be a block
| 'for' '(' JAVA_ID ':' expr ')' expr
;
while_expr
: 'while' '(' expr ')' expr
;
do_while_expr
: 'do' expr 'while' '(' expr ')'
;
switch_expr
: SWITCH '(' expr ')' '{' ( switch_body )* '}'
;
switch_body
: ( CASE constant | DEFAULT ) ':' expr_and_decl_list
;
null_expr
: NULL
;
constant
: STRING
| INTEGER
| FLOAT
| BOOLEAN
;
argument_list
: expr
| argument_list ',' expr
| /* null */
;
// Lexer rules
STRING : '"' ( ~'"' | '\\' '"' )* '"' ;
INTEGER : ('1' .. '9')+ ('0' .. '9')* | '0';
FLOAT : (('1' .. '9')* | '0') '.' ('0' .. '9')* ;
BOOLEAN : 'true' | 'false' ;
SWITCH : 'switch' ;
CASE : 'case' ;
DEFAULT : 'default' ;
BREAK : 'break' ;
CONTINUE : 'continue' ;
RETURN : 'return' ;
REQUIRES : 'requires' ;
NEW : 'new' ;
CLONE : 'clone' ;
NULL : 'null' ;
CONST : 'const' ;
ABELIAN_RELOP : '!=' | '==' | '>' | '<' | '>=' | '<=';
LOGICAL_NOT : '!' ;
POW : '**' ;
BOOLEAN_SUMOP : '||' | '^' ;
BOOLEAN_MULOP : '&&' ;
ABELIAN_SUMOP : '+' | '-' ;
ABELIAN_MULOP : '*' | '/' | '%' ;
MINUS : '-' ;
PLUS : '+' ;
LOGICAL_NEGATION : '!' ;
ABELIAN_INCREMENT_OP : '++' | '--' ;
JAVA_ID: (('a' .. 'z') | ('A' .. 'Z')) (('a' .. 'z') | ('A' .. 'Z') | ('0' .. '9') | '_')* ;
INLINE_COMMENT: '//' ~[\r\n]* -> channel(HIDDEN) ;
C_COMMENT: '/*' .*? '*/' -> channel(HIDDEN) ;
WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ -> channel(HIDDEN) ;
ASSIGN : '=' ;

Issues with quoted string regex in antlr4

I want to parse strings like "TOM*", "TOM" , "*TOM" , "TOM", "*" and all these without quotes. I created 2 rules name_with_quotes & name without quotes, but string with quotes are giving expected token: <EOF> error
I have following tokens in lexer.g4 file
ID : [a-zA-Z0-9-_]+ ;
WILDCARD_STARTS_WITH_STRING: ID'*';
WILDCARD_ENDS_WITH_STRING: '*'ID;
WILDCARD_CONTAINS_STRING : '*'ID'*' ;
STRING : ('"' | '\'') ( ( (~('"' | '\\' | '\r' | '\n') | '\\'('"' ) )*) | ) ('"' | '\'');
QUOTED_ID : ('"' | '\'') (((STAR)? ID (STAR)?) | ID | STAR) ('"' | '\'');
I have following rules in my parser file:
name_without_quotes : ID | WILDCARD_STARTS_WITH_STRING | WILDCARD_ENDS_WITH_STRING | WILDCARD_CONTAINS_STRING | STAR ;
name_with_quotes : QUOTED_ID;
name : name_with_quotes | name_without_quotes;
I also tried using following rules.
WITHOUT_QUOTES : '"' (ID | ID'*' | '*'ID | '*'ID'*' ) '"';
WITH_QUOTES : ID | WILDCARD_STARTS_WITH_STRING | WILDCARD_ENDS_WITH_STRING | WILDCARD_CONTAINS_STRING | STAR ;
But no luck. Any clue what I could be doing wrong?
Many Thanks.
Found solution here.
http://www.antlr3.org/pipermail/antlr-interest/2012-March/044273.html.
Just changed the order and it worked.

ANTLR 4 precedence not as expected

I've defined a flavor of SQL that we use in my company as the following:
/** Grammars always start with a grammar header. This grammar is called
* GigyaSQL and must match the filename: GigyaSQL.g4
*/
grammar GigyaSQL;
parse
: selectClause
fromClause
( whereClause )?
( filterClause )?
( groupByClause )?
( limitClause )?
;
selectClause
: K_SELECT result_column ( ',' result_column )*
;
result_column
: '*' # selectAll
| table_name '.' '*' # selectAllFromTable
| select_expr ( K_AS? column_alias )? # selectExpr
| with_table # selectWithTable
;
fromClause
: K_FROM table_name
;
table_name
: any_name # simpleTable
| any_name K_WITH with_table # tableWithTable
;
any_name
: IDENTIFIER
| STRING_LITERAL
| '(' any_name ')'
;
with_table
: COUNTERS
;
select_expr
: literal_value
| range_function_in_select
| interval_function_in_select
| ( table_name '.' )? column_name
| function_name '(' argument_list ')'
;
whereClause
: K_WHERE condition_expr
;
condition_expr
: literal_value # literal
| ( table_name '.' )? column_name # column_name_expr
| unary_operator condition_expr # unary_expr
| condition_expr binary_operator condition_expr # binary_expr
| K_IFELEMENT '(' with_table ',' condition_expr ')' # if_element
| function_name '(' argument_list ')' # function_expr
| '(' condition_expr ')' # brackets_expr
| condition_expr K_NOT? K_LIKE condition_expr # like_expr
| condition_expr K_NOT? K_CONTAINS condition_expr # contains_expr
| condition_expr K_IS K_NOT? condition_expr # is_expr
//| condition_expr K_NOT? K_BETWEEN condition_expr K_AND condition_expr
| condition_expr K_NOT? K_IN '(' ( literal_value ( ',' literal_value )*) ')' # in_expr
;
filterClause
: K_FILTER with_table K_BY condition_expr
;
groupByClause
: K_GROUP K_BY group_expr ( ',' group_expr )*
;
group_expr
: literal_value
| ( table_name '.' )? column_name
| function_name '(' argument_list ')'
| range_function_in_group
| interval_function_in_group
;
limitClause
: K_LIMIT NUMERIC_LITERAL
;
argument_list
: ( select_expr ( ',' select_expr )* | '*' )
;
unary_operator
: MINUS
| PLUS
| '~'
| K_NOT
;
binary_operator
: ( '*' | DIVIDE | MODULAR )
| ( PLUS | MINUS )
//| ( '<<' | '>>' | '&' | '|' )
| ( LTH | LEQ | GTH | GEQ )
| ( EQUAL | NOT_EQUAL | K_IN | K_LIKE )
//| ( '=' | '==' | '!=' | '<>' | K_IS | K_IS K_NOT | K_IN | K_LIKE | K_GLOB | K_MATCH | K_REGEXP )
| K_AND
| K_OR
;
range_function_in_select
: K_RANGE '(' select_expr ')'
;
range_function_in_group
: K_RANGE '(' select_expr ',' range_pair (',' range_pair)* ')'
;
range_pair // Tried to use INT instead (for decimal numbers) but that didn't work fine (didn't parse a = 1 correctly)
: '"' NUMERIC_LITERAL ',' NUMERIC_LITERAL '"'
| '"' ',' NUMERIC_LITERAL '"'
| '"' NUMERIC_LITERAL ',' '"'
;
interval_function_in_select
: K_INTERVAL '(' select_expr ')'
;
interval_function_in_group
: K_INTERVAL '(' select_expr ',' NUMERIC_LITERAL ')'
;
function_name
: any_name
;
literal_value
: NUMERIC_LITERAL
| STRING_LITERAL
// | BLOB_LITERAL
| K_NULL
// | K_CURRENT_TIME
// | K_CURRENT_DATE
// | K_CURRENT_TIMESTAMP
;
column_name
: any_name
;
column_alias
: IDENTIFIER
| STRING_LITERAL
;
SPACES
: [ \u000B\t\r\n] -> skip
;
COUNTERS : 'counters' | 'COUNTERS';
//INT : '0' | DIGIT+ ;
EQUAL : '=';
NOT_EQUAL : '<>' | '!=';
LTH : '<' ;
LEQ : '<=';
GTH : '>';
GEQ : '>=';
//MULTIPLY: '*';
DIVIDE : '/';
MODULAR : '%';
PLUS : '+';
MINUS : '-';
K_AND : A N D;
K_AS : A S;
K_BY : B Y;
K_CONTAINS: C O N T A I N S;
K_DISTINCT : D I S T I N C T;
K_FILTER : F I L T E R;
K_FROM : F R O M;
K_GROUP : G R O U P;
K_IFELEMENT : I F E L E M E N T;
K_IN : I N;
K_INTERVAL : I N T E R V A L;
K_IS : I S;
K_LIKE : L I K E;
K_LIMIT : L I M I T;
K_NOT : N O T;
K_NULL : N U L L;
K_OR : O R;
K_RANGE : R A N G E;
K_REGEXP : R E G E X P;
K_SELECT : S E L E C T;
K_WHERE : W H E R E;
K_WITH : W I T H;
IDENTIFIER
: '"' (~'"' | '""')* '"'
| '`' (~'`' | '``')* '`'
| '[' ~']'* ']'
| [a-zA-Z_] [.a-zA-Z_0-9]* // TODO - need to check if the period is correcly handled
| [a-zA-Z_] [a-zA-Z_0-9]* // TODO check: needs more chars in set
;
STRING_LITERAL
: '\'' ( ~'\'' | '\'\'' )* '\''
;
NUMERIC_LITERAL
:// INT
DIGIT+ ('.' DIGIT*)? ( E [-+]? DIGIT+ )?
| '.' DIGIT+ ( E [-+]? DIGIT+ )?
;
fragment DIGIT : [0-9];
fragment A : [aA];
fragment B : [bB];
fragment C : [cC];
fragment D : [dD];
fragment E : [eE];
fragment F : [fF];
fragment G : [gG];
fragment H : [hH];
fragment I : [iI];
fragment J : [jJ];
fragment K : [kK];
fragment L : [lL];
fragment M : [mM];
fragment N : [nN];
fragment O : [oO];
fragment P : [pP];
fragment Q : [qQ];
fragment R : [rR];
fragment S : [sS];
fragment T : [tT];
fragment U : [uU];
fragment V : [vV];
fragment W : [wW];
fragment X : [xX];
fragment Y : [yY];
fragment Z : [zZ];
and I try to parse the following query:
SELECT * from accounts where not data.zzz > 124
I get the following tree:
But I wanted to get the tree similar to when I use parenthesis:
SELECT * from accounts where not (data.zzz > 124)
I don't understand why it's working that way sinze the unary rule is before others.
Any suggestion?
That is the correct result for the given grammar. As you've already mentioned, the unary_operator is before the binary_operator meaning any operand for the NOT keyword is binded to it first before other operators. And since it is unary, it takes the data.zzz as its operand and after that the whole NOT expression becomes an operand of the binary_operator.
To get what you want, just shift down the unary_operator according to the precedence level of it (as I recall, in SQL, NOT's precedence is lower than that of binary operators, and the NOT operator should not have the same precedence as the MINUS PLUS and ~ just like what your grammar does) e.g.
condition_expr
: literal_value # literal
| ( table_name '.' )? column_name # column_name_expr
| condition_expr binary_operator condition_expr # binary_expr
| unary_operator condition_expr # unary_expr
| K_IFELEMENT '(' with_table ',' condition_expr ')' # if_element
| function_name '(' argument_list ')' # function_expr
| '(' condition_expr ')' # brackets_expr
| condition_expr K_NOT? K_LIKE condition_expr # like_expr
| condition_expr K_NOT? K_CONTAINS condition_expr # contains_expr
| condition_expr K_IS K_NOT? condition_expr # is_expr
//| condition_expr K_NOT? K_BETWEEN condition_expr K_AND condition_expr
| condition_expr K_NOT? K_IN '(' ( literal_value ( ',' literal_value )*) ')' # in_expr
;
And this gives what you want:

Resources