ANTLR4.7 listener for a rule when sub rules are labeled - antlr4

I have an antlr4.7 grammar like this, where all sub rules are labeled.
date_expr
: attr op=( '+' | '-' ) dt_interval=ISO8601_INTERVAL
#dateexpr_Op
| DATETIME_NAME
#dateexpr_Named
| d=( DATETIME_LITERAL | DATE_LITERAL | TIME_LITERAL )
#dateexpr_Literal
| attr
#dateexpr_Attr
| '(' date_expr ')'
#dateexpr_Paren
;
I would like to annotate the tree when a date_expr rule completes. However, looking at the generated listener class, I see no exitDate_expr. How can I add this? Or do I have to use a visitor interface for it? I am not very familiar with grammar tools.
Thanks.

To achieve beforeAllLabeledAlts and afterAllLabeledAlts visit points, wrap the rule with the labeled alternatives in a singleton rule:
anyDate : dateExpr ;
dateExpr
: attr op=( '+' | '-' ) dt_interval=ISO8601_INTERVAL #dateexpr_Op
| DATETIME_NAME #dateexpr_Named
| d=( DATETIME_LITERAL | DATE_LITERAL | TIME_LITERAL ) #dateexpr_Literal
| attr #dateexpr_Attr
| '(' dateExpr ')' #dateexpr_Paren
;
The ANTLR tool will then generate the listener interface (and/or visitor interface) with enterAnyDate and exitAnyDate methods (visitAnyDate for the visitor), each taking an AnyDateContext.

Related

Antlr4 Grammar fails to parse

I am new to ANTLR. Here is a grammar that I am working on; it's failing for the input string A.B() && ((C.D() || E.F())).
I have tried a number of combinations, but it keeps failing in the same place.
grammar Expressions;
expression
: logicBlock (logicalConnector logicBlock)*
| NOT? '('? logicBlock ')'? (logicalConnector NOT? '('? logicBlock ')'?)*
;
logicBlock
: logicUnit comparator THINGS
| logicUnit comparator logicUnit
| logicUnit
;
logicUnit
: NOT? '(' method ')'
| NOT? method
;
method
: object '.' function ('.' function)*
;
object
: THINGS
|'(' THINGS ')'
;
function
: THINGS '(' arguments? ')'
;
arguments
: (object | function | method | logicUnit | logicBlock)
(
','
(object | function | method | logicUnit | logicBlock)
)*
;
logicalConnector
: AND | OR | PLUS | MINUS
;
comparator
: GT | LT | GTE | LTE | EQUALS | NOTEQUALS
;
AND : '&&' ;
OR : '||' ;
EQUALS : '==' ;
ASSIGN : '=' ;
GT : '>' ;
LT : '<' ;
GTE : '>=' ;
LTE : '<=' ;
NOTEQUALS : '!=' ;
NOT : '!' ;
PLUS : '+' ;
MINUS : '-' ;
IF : 'if' ;
THINGS
: [a-zA-Z] [a-zA-Z0-9]*
| '"' .*? '"'
| [0-9]+
| ([0-9]+)? '.' ([0-9])+
;
WS : [ \t\r\n]+ -> skip
;
The error that I am getting for this input, A.B() && ((C.D() || E.F())), is below. Any help and/or suggestions for improvement would be highly appreciated.
This rule looks odd to me:
expression
: logicBlock (logicalConnector logicBlock)*
| NOT? '('? logicBlock ')'? (logicalConnector NOT? '('? logicBlock ')'?)*
//      ^               ^                           ^               ^
//      |               |                           |               |
;
All the optional parentheses can't be right: they mean the first can be present while the other 3 are omitted (or any other combination), leaving your expression with unbalanced parentheses.
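A quick way to see the problem is a regex analogue of that alternative. This is only a sketch: the pattern below is my own simplification (logicBlock reduced to a bare name, NOT written as !), not something ANTLR generates, but it shows that independently optional parentheses accept unbalanced input:

```python
import re

# Regex analogue of:  NOT? '('? logicBlock ')'?
# Assumption for the sketch: logicBlock is reduced to a plain name.
OPTIONAL_PARENS = re.compile(r"!?\(?[A-Za-z]+\)?")


def accepts(text):
    """Return True if the whole text matches the optional-paren pattern."""
    return OPTIONAL_PARENS.fullmatch(text) is not None


if __name__ == "__main__":
    for text in ["A", "(A)", "(A", "A)"]:
        # Every one of these is accepted, including the unbalanced ones.
        print(text, accepts(text))
```

Each optional parenthesis is decided on its own, so nothing ties an opening one to a closing one.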
To get your grammar working on the input A.B() && ((C.D() || E.F())), you'll want to do something like this:
expression
: logicBlock (logicalConnector logicBlock)*
| NOT? logicBlock (logicalConnector NOT? logicBlock)*
;
logicBlock
: '(' expression ')'
| logicUnit comparator THINGS
| logicUnit comparator logicUnit
| logicUnit
;
logicUnit
: '(' expression ')'
| NOT? '(' method ')'
| NOT? method
;
But ANTLR4 supports left recursive rules, allowing you to define the expression rule like this:
expression
: '(' expression ')'
| NOT expression
| expression comparator expression
| expression logicalConnector expression
| method
| object
| THINGS
;
method
: object '.' function ( '.' function )*
;
object
: THINGS
|'(' THINGS ')'
;
function
: THINGS '(' arguments? ')'
;
arguments
: expression ( ',' expression )*
;
logicalConnector
: AND | OR | PLUS | MINUS
;
comparator
: GT | LT | GTE | LTE | EQUALS | NOTEQUALS
;
This makes it much more readable. Sure, it doesn't produce the exact parse tree your original grammar was producing, but you might be flexible in that regard. The proposed grammar also matches more than yours: for example, it accepts NOT A, which your grammar does not allow.

How to programmatically handle objects in Xtext

I have a grammar defined like:
Key:
name=ID;
Step:
name+=[Key]+;
@Override terminal ID:
('a'..'z' | 'A'..'Z' | '_' | '-' | '0'..'9')
('a'..'z' | 'A'..'Z' | '_' | ' ' | '-' | '0'..'9')+;
And my input is:
When I open window Kitchen Window
Then I (see) Beautiful Garden
The question: how can I programmatically handle the input and split it into Key references according to some internal rules?
For example, I want the first string to become:
[When] [I open window] [Kitchen Window]
And the second to be:
[Then] [I] [(see)] [Beautiful Garden]
I can't know how to split it until it reaches scoping or linking, so the decision has to be made somewhere in the code. Where should I do it?

ANTLR 4 building parse tree incorrectly

I'm making a grammar for a subset of SQL, which I've pasted below:
grammar Sql;
sel_stmt : SEL QUANT? col_list FROM tab_list;
as_stmt : 'as' ID;
col_list : '*' | col_spec (',' col_spec | ',' col_group)*;
col_spec : (ID | ID '.' ID) as_stmt?;
col_group : ID '.' '*';
tab_list : (tab_spec | join_tab) (',' tab_spec | ',' join_tab)*;
tab_spec : (ID | sub_query) as_stmt?;
sub_query : '(' sel_stmt ')';
join_tab : tab_spec (join_type? 'join' tab_spec 'on' cond_list)+;
join_type : 'inner' | 'full' | 'left' | 'right';
cond_list : cond_term (BOOL_OP cond_term)*;
cond_term : col_spec (COMP_OP val | between | like);
val : INT | FLT | SQ_STR | DQ_STR;
between : ('not')? 'between' val 'and' val;
like : ('not')? 'like' (SQ_STR | DQ_STR);
WS : (' ' | '\t' | '\n')+ -> skip;
INT : ('0'..'9')+;
FLT : INT '.' INT;
SQ_STR : '\'' ~('\\'|'\'')* '\'';
DQ_STR : '"' ~('\\'|'"')* '"';
BOOL_OP : ',' | 'or' | 'and';
COMP_OP : '=' | '<>' | '<' | '>' | '<=' | '>=';
SEL : 'select' | 'SELECT';
QUANT: 'distinct' | 'DISTINCT' | 'all' | 'ALL';
FROM: 'from' | 'FROM';
ID : ('A'..'Z' | 'a'..'z')('A'..'Z' | 'a'..'z' | '0'..'9')*;
The input I'm testing is select distinct test.col1 as col, test2.* from test join test2 on col='something', test2.col1=1.4. The output parse tree matches the last appearance of test2 as a and thus doesn't know what to do with the rest of the input. The comma before the last 'test2' token is made a child of the node, when it should be a child of .
My question is what is going on behind the scenes to cause this?
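Apparently, ANTLR does not like it when you use a literal symbol as a token and have it be part of another token. In my case, the comma was used both as a literal in the parser rules and as part of the BOOL_OP token. A literal in a parser rule gets its own implicit token, defined ahead of the explicit lexer rules, and on equal-length matches the lexer picks the rule defined first, so ',' never came out as BOOL_OP.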
A future question may be how ANTLR disambiguates when the same symbol is used as parts of different tokens that apply only under specific scopes. However, this question is answered for now.
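For anyone who wants to see the effect in isolation, here is a toy model in Python. It is not ANTLR, just a sketch of the two selection rules the lexer applies (longest match first, then the earliest-defined rule on a tie). The rule list is cut down from the grammar above, and the token name "','" for the implicit literal is made up for the example:

```python
import re


def make_lexer(rules):
    """Build a tokenizer from (name, regex) pairs, in definition order."""
    compiled = [(name, re.compile(pattern)) for name, pattern in rules]

    def tokenize(text):
        pos, out = 0, []
        while pos < len(text):
            best_name, best_len = None, 0
            for name, rx in compiled:
                m = rx.match(text, pos)
                # Strict '>' means an earlier rule wins a tie on length.
                if m and len(m.group()) > best_len:
                    best_name, best_len = name, len(m.group())
            if best_name is None:
                raise ValueError(f"no rule matches at position {pos}")
            if best_name != "WS":  # WS is skipped, as in the grammar
                out.append((best_name, text[pos:pos + best_len]))
            pos += best_len
        return out

    return tokenize


EXPLICIT_RULES = [
    ("WS", r"[ \t\n]+"),
    ("BOOL_OP", r",|or|and"),
    ("ID", r"[A-Za-z][A-Za-z0-9]*"),
]

# The ',' literal used in parser rules becomes an implicit token that
# sits in front of every explicit lexer rule.
tokenize_with_literal = make_lexer([("','", r",")] + EXPLICIT_RULES)
tokenize_without_literal = make_lexer(EXPLICIT_RULES)

if __name__ == "__main__":
    print(tokenize_with_literal("a, b or c"))
    print(tokenize_without_literal("a, b or c"))
```

With the implicit literal in front, the comma is always tokenized as the literal and BOOL_OP only ever matches or/and. Without it, the comma comes out as BOOL_OP. One way out is to use the comma in only one role, for example by dropping ',' from BOOL_OP.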

ANTLR: VERY slow parsing

I have successfully split my expressions into arithmetic and boolean expressions like this:
/* entry point */
parse: formula EOF;
formula : (expr|boolExpr);
/* boolean expressions : take math expr and use boolean operators on them */
boolExpr
: bool
| l=expr operator=(GT|LT|GEQ|LEQ) r=expr
| l=boolExpr operator=(OR|AND) r=boolExpr
| l=expr (not='!')? EQUALS r=expr
| l=expr BETWEEN low=expr AND high=expr
| l=expr IS (NOT)? NULL
| l=atom LIKE regexp=string
| l=atom ('IN'|'in') '(' string (',' string)* ')'
| '(' boolExpr ')'
;
/* arithmetic expressions */
expr
: atom
| (PLUS|MINUS) expr
| l=expr operator=(MULT|DIV) r=expr
| l=expr operator=(PLUS|MINUS) r=expr
| function=IDENTIFIER '(' (expr ( ',' expr )* ) ? ')'
| '(' expr ')'
;
atom
: number
| variable
| string
;
But now I have HUGE performance problems. Some formulas I try to parse are utterly slow, to the point that it has become unbearable: more than an hour (I stopped at that point) to parse this:
-4.77+[V1]*-0.0071+[V1]*[V1]*0+[V2]*-0.0194+[V2]*[V2]*0+[V3]*-0.00447932+[V3]*[V3]*-0.0017+[V4]*-0.00003298+[V4]*[V4]*0.0017+[V5]*-0.0035+[V5]*[V5]*0+[V6]*-4.19793004+[V6]*[V6]*1.5962+[V7]*12.51966636+[V7]*[V7]*-5.7058+[V8]*-19.06596752+[V8]*[V8]*28.6281+[V9]*9.47136506+[V9]*[V9]*-33.0993+[V10]*0.001+[V10]*[V10]*0+[V11]*-0.15397774+[V11]*[V11]*-0.0021+[V12]*-0.027+[V12]*[V12]*0+[V13]*-2.02963068+[V13]*[V13]*0.1683+[V14]*24.6268688+[V14]*[V14]*-5.1685+[V15]*-6.17590512+[V15]*[V15]*1.2936+[V16]*2.03846688+[V16]*[V16]*-0.1427+[V17]*9.02302288+[V17]*[V17]*-1.8223+[V18]*1.7471106+[V18]*[V18]*-0.1255+[V19]*-30.00770912+[V19]*[V19]*6.7738
Do you have any idea on what the problem is?
The parsing stops when the parser enters the formula grammar rule.
edit Original problem here:
My grammar allows this:
// ( 1 LESS_EQUALS 2 )
1 <= 2
But the way I expressed it in my G4 file makes it also accept this:
// ( ( 1 LESS_EQUALS 2 ) LESS_EQUALS 3 )
1 <= 2 <= 3
Which I don't want.
My grammar contains this:
expr
: atom # atomArithmeticExpr
| (PLUS|MINUS) expr # plusMinusExpr
| l=expr operator=('*'|'/') r=expr # multdivArithmeticExpr
| l=expr operator=('+'|'-') r=expr # addsubtArithmeticExpr
| l=expr operator=('>'|'<'|'>='|'<=') r=expr # comparisonExpr
[...]
How can I tell Antlr that this is not acceptable?
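Just split the root rule in two: rename the root 'expr' to 'rootExpr' (or vice versa), and keep the comparison alternative only in the root rule, so that its operands cannot themselves be comparisons.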
rootExpr
: atom # atomArithmeticExpr
| (PLUS|MINUS) expr # plusMinusExpr
| l=expr operator=('*'|'/') r=expr # multdivArithmeticExpr
| l=expr operator=('+'|'-') r=expr # addsubtArithmeticExpr
| l=expr operator=('>'|'<'|'>='|'<=') r=expr # comparisonExpr
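EDIT: To forbid chaining, you cannot have a cyclic reference here, i.e. the operands of the comparison alternative must not lead back to the rule that contains the comparison.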
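To illustrate the idea of splitting the rule, here is a hand-written recursive-descent sketch in Python, not ANTLR output. The names root_expr, expr, term and factor and the reduced token set (numbers, + - * /, parentheses, four comparison operators) are assumptions made for the example. The comparison lives only in the top rule and its operands are plain arithmetic, so a second comparison has nowhere to attach:

```python
import re

TOKEN = re.compile(r"\s*(\d+(?:\.\d+)?|<=|>=|[-+*/<>()])")
COMPARE_OPS = ("<", ">", "<=", ">=")


def tokenize(text):
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        tok = self.peek()
        if tok is None:
            raise SyntaxError("unexpected end of input")
        self.pos += 1
        return tok

    # rootExpr : expr (COMPARE expr)? EOF ;   at most one comparison
    def root_expr(self):
        node = self.expr()
        if self.peek() in COMPARE_OPS:
            op = self.take()
            node = (op, node, self.expr())
        if self.peek() is not None:
            raise SyntaxError(f"unexpected token {self.peek()!r}")
        return node

    # expr : term (('+'|'-') term)* ;
    def expr(self):
        node = self.term()
        while self.peek() in ("+", "-"):
            op = self.take()
            node = (op, node, self.term())
        return node

    # term : factor (('*'|'/') factor)* ;
    def term(self):
        node = self.factor()
        while self.peek() in ("*", "/"):
            op = self.take()
            node = (op, node, self.factor())
        return node

    # factor : ('+'|'-') factor | '(' expr ')' | NUMBER ;
    def factor(self):
        tok = self.take()
        if tok in ("+", "-"):
            return (tok, self.factor())
        if tok == "(":
            node = self.expr()
            if self.take() != ")":
                raise SyntaxError("expected ')'")
            return node
        if tok[0].isdigit():
            return float(tok)
        raise SyntaxError(f"unexpected token {tok!r}")


def parse(text):
    return Parser(tokenize(text)).root_expr()
```

Here parse("1 <= 2") returns a comparison node, while parse("1 <= 2 <= 3") raises SyntaxError at the second <=, because root_expr is never re-entered from within its own operands.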

Struggling to parse array notation

I have a very simple grammar to parse statements.
Here are examples of the type of statements that need be parsed:
a.b.c
a.b.c == "88"
The issue I am having is that array notation is not matching. For example, things that are not working:
a.b[0].c
a[3][4]
I hope someone can point out what I am doing wrong here. (I am testing in ANTLRWorks)
Here is the grammar (generationUnit is my entry point):
grammar RatBinding;
generationUnit: testStatement | statement;
arrayAccesor : identifier arrayNotation+;
arrayNotation: '[' Number ']';
testStatement:
(statement | string | Number | Bool )
(greaterThanAndEqual
| lessThanOrEqual
| greaterThan
| lessThan | notEquals | equals)
(statement | string | Number | Bool )
;
part: identifier | arrayAccesor;
statement: part ('.' part )*;
string: ('"' identifier '"') | ('\'' identifier '\'');
greaterThanAndEqual: '>=';
lessThanOrEqual: '<=';
greaterThan: '>';
lessThan: '<';
notEquals : '!=';
equals: '==';
identifier: Letter (Letter|Digit)*;
Bool : 'true' | 'false';
ArrayLeft: '\u005B';
ArrayRight: '\u005D';
Letter
: '\u0024' |
'\u0041'..'\u005a' |
'\u005f '|
'\u0061'..'\u007a' |
'\u00c0'..'\u00d6' |
'\u00d8'..'\u00f6' |
'\u00f8'..'\u00ff' |
'\u0100'..'\u1fff' |
'\u3040'..'\u318f' |
'\u3300'..'\u337f' |
'\u3400'..'\u3d2d' |
'\u4e00'..'\u9fff' |
'\uf900'..'\ufaff'
;
Digit
: '\u0030'..'\u0039' |
'\u0660'..'\u0669' |
'\u06f0'..'\u06f9' |
'\u0966'..'\u096f' |
'\u09e6'..'\u09ef' |
'\u0a66'..'\u0a6f' |
'\u0ae6'..'\u0aef' |
'\u0b66'..'\u0b6f' |
'\u0be7'..'\u0bef' |
'\u0c66'..'\u0c6f' |
'\u0ce6'..'\u0cef' |
'\u0d66'..'\u0d6f' |
'\u0e50'..'\u0e59' |
'\u0ed0'..'\u0ed9' |
'\u1040'..'\u1049'
;
WS : [ \r\t\u000C\n]+ -> channel(HIDDEN)
;
You referenced the non-existent rule Number in the arrayNotation parser rule.
A Digit rule does exist in the lexer, but it will only match a single-digit number. For example, 1 is a Digit, but 10 is two separate Digit tokens so a[10] won't match the arrayAccesor rule. You probably want to resolve this in two parts:
Create a Number token consisting of one or more digits.
Number
: Digit+
;
Mark Digit as a fragment rule to indicate that it doesn't form tokens on its own, but is merely intended to be referenced from other lexer rules.
fragment // prevents a Digit token from being created on its own
Digit
: ...
You will not need to change arrayNotation because it already references the Number rule you created here.
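The difference between a single-digit token and a Digit+ token can be seen with a small regex analogue. This is only an illustration in Python restricted to ASCII digits, not how ANTLR is implemented, and the function names are made up for the example:

```python
import re

DIGIT = re.compile(r"[0-9]")    # one token per digit, like the original Digit rule
NUMBER = re.compile(r"[0-9]+")  # one token per run of digits, like Number : Digit+


def digit_tokens(text):
    """Return the digit tokens found when each digit is its own token."""
    return DIGIT.findall(text)


def number_tokens(text):
    """Return the number tokens found when a run of digits is one token."""
    return NUMBER.findall(text)


if __name__ == "__main__":
    print(digit_tokens("a[10]"))   # ['1', '0']
    print(number_tokens("a[10]"))  # ['10']
```

With only the single-digit rule, the index 10 arrives as two tokens, which a rule expecting one token between the brackets cannot match.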
Bah, waste of space. I used Number instead of Digit in my array declaration.

Resources