Struggling to parse array notation - antlr4

I have a very simple grammar to parse statements.
Here are examples of the type of statements that need be parsed:
a.b.c
a.b.c == "88"
The issue I am having is that array notation is not matching. For example, things that are not working:
a.b[0].c
a[3][4]
I hope someone can point out what I am doing wrong here. (I am testing in ANTLRWorks)
Here is the grammar (generationUnit is my entry point):
grammar RatBinding;
generationUnit: testStatement | statement;
arrayAccesor : identifier arrayNotation+;
arrayNotation: '[' Number ']';
testStatement:
(statement | string | Number | Bool )
(greaterThanAndEqual
| lessThanOrEqual
| greaterThan
| lessThan | notEquals | equals)
(statement | string | Number | Bool )
;
part: identifier | arrayAccesor;
statement: part ('.' part )*;
string: ('"' identifier '"') | ('\'' identifier '\'');
greaterThanAndEqual: '>=';
lessThanOrEqual: '<=';
greaterThan: '>';
lessThan: '<';
notEquals : '!=';
equals: '==';
identifier: Letter (Letter|Digit)*;
Bool : 'true' | 'false';
ArrayLeft: '\u005B';
ArrayRight: '\u005D';
Letter
: '\u0024' |
'\u0041'..'\u005a' |
'\u005f '|
'\u0061'..'\u007a' |
'\u00c0'..'\u00d6' |
'\u00d8'..'\u00f6' |
'\u00f8'..'\u00ff' |
'\u0100'..'\u1fff' |
'\u3040'..'\u318f' |
'\u3300'..'\u337f' |
'\u3400'..'\u3d2d' |
'\u4e00'..'\u9fff' |
'\uf900'..'\ufaff'
;
Digit
: '\u0030'..'\u0039' |
'\u0660'..'\u0669' |
'\u06f0'..'\u06f9' |
'\u0966'..'\u096f' |
'\u09e6'..'\u09ef' |
'\u0a66'..'\u0a6f' |
'\u0ae6'..'\u0aef' |
'\u0b66'..'\u0b6f' |
'\u0be7'..'\u0bef' |
'\u0c66'..'\u0c6f' |
'\u0ce6'..'\u0cef' |
'\u0d66'..'\u0d6f' |
'\u0e50'..'\u0e59' |
'\u0ed0'..'\u0ed9' |
'\u1040'..'\u1049'
;
WS : [ \r\t\u000C\n]+ -> channel(HIDDEN)
;

You referenced the non-existent rule Number in the arrayNotation parser rule.
A Digit rule does exist in the lexer, but it will only match a single-digit number. For example, 1 is a Digit, but 10 is two separate Digit tokens so a[10] won't match the arrayAccesor rule. You probably want to resolve this in two parts:
Create a Number token consisting of one or more digits.
Number
: Digit+
;
Mark Digit as a fragment rule to indicate that it doesn't form tokens on its own, but is merely intended to be referenced from other lexer rules.
fragment // prevents a Digit token from being created on its own
Digit
: ...
You will not need to change arrayNotation because it already references the Number rule you created here.

Bah, waste of space. I Used Number instead of Digit in my array declaration.

Related

antlr4 grammar with negative option

In antlr4 I want to define a string but exclude from it the combination := permitting the respective single characters. What is syntax to define the grammar
EQUAL : '=';
NUMBER: DIGIT+;
DIGIT : ('0'..'9');
LITERALEQUAL: ((CHAR | NUMBER | EQUAL | OTHERS) ' '?)+;
fragment CHAR :[a-z]| [A-Z];
fragment OTHERS: '.' | '/' | ':' | '-' | '#' | '?' | '&' | '_' | '[' | ']' | '^' | ';' | '"' | '=';
As long as you don't make a lexer rule or implicit token like:
stmt : value ':=' something ; <-- implicit token
or
BADEquals : ':=' ; <-- explicit lexer definition
your eventual grammar won't allow it if your goal is to a allow : and = but exclude the combination := .

ANTLR 4 building parse tree incorrectly

I'm making a grammar for a subset of SQL, which I've pasted below:
grammar Sql;
sel_stmt : SEL QUANT? col_list FROM tab_list;
as_stmt : 'as' ID;
col_list : '*' | col_spec (',' col_spec | ',' col_group)*;
col_spec : (ID | ID '.' ID) as_stmt?;
col_group : ID '.' '*';
tab_list : (tab_spec | join_tab) (',' tab_spec | ',' join_tab)*;
tab_spec : (ID | sub_query) as_stmt?;
sub_query : '(' sel_stmt ')';
join_tab : tab_spec (join_type? 'join' tab_spec 'on' cond_list)+;
join_type : 'inner' | 'full' | 'left' | 'right';
cond_list : cond_term (BOOL_OP cond_term)*;
cond_term : col_spec (COMP_OP val | between | like);
val : INT | FLT | SQ_STR | DQ_STR;
between : ('not')? 'between' val 'and' val;
like : ('not')? 'like' (SQ_STR | DQ_STR);
WS : (' ' | '\t' | '\n')+ -> skip;
INT : ('0'..'9')+;
FLT : INT '.' INT;
SQ_STR : '\'' ~('\\'|'\'')* '\'';
DQ_STR : '"' ~('\\'|'"')* '"';
BOOL_OP : ',' | 'or' | 'and';
COMP_OP : '=' | '<>' | '<' | '>' | '<=' | '>=';
SEL : 'select' | 'SELECT';
QUANT: 'distinct' | 'DISTINCT' | 'all' | 'ALL';
FROM: 'from' | 'FROM';
ID : ('A'..'Z' | 'a'..'z')('A'..'Z' | 'a'..'z' | '0'..'9')*;
The input I'm testing is select distinct test.col1 as col, test2.* from test join test2 on col='something', test2.col1=1.4. The output parse tree matches the last appearance of test2 as a and thus doesn't know what to do with the rest of the input. The comma before the last 'test2' token is made a child of the node, when it should be a child of .
My question is what is going on behind the scenes to cause this?
Apparently, ANTLR does not like it when you use a literal symbol as a token and have it be part of another token. In my case, the comma token was both used as a literal token and as part of the BOOL_OP token.
A future question may be how ANTLR disambiguates when the same symbol is used as parts of different tokens that apply only under specific scopes. However, this question is answered for now.

ANTLR: VERY slow parsing

I have successfully split my expressions into arithmetic and boolean expressions like this:
/* entry point */
parse: formula EOF;
formula : (expr|boolExpr);
/* boolean expressions : take math expr and use boolean operators on them */
boolExpr
: bool
| l=expr operator=(GT|LT|GEQ|LEQ) r=expr
| l=boolExpr operator=(OR|AND) r=boolExpr
| l=expr (not='!')? EQUALS r=expr
| l=expr BETWEEN low=expr AND high=expr
| l=expr IS (NOT)? NULL
| l=atom LIKE regexp=string
| l=atom ('IN'|'in') '(' string (',' string)* ')'
| '(' boolExpr ')'
;
/* arithmetic expressions */
expr
: atom
| (PLUS|MINUS) expr
| l=expr operator=(MULT|DIV) r=expr
| l=expr operator=(PLUS|MINUS) r=expr
| function=IDENTIFIER '(' (expr ( ',' expr )* ) ? ')'
| '(' expr ')'
;
atom
: number
| variable
| string
;
But now I have HUGE performance problems. Some formulas I try to parse are utterly slow, to the point that it has become unbearable: more than an hour (I stopped at that point) to parse this:
-4.77+[V1]*-0.0071+[V1]*[V1]*0+[V2]*-0.0194+[V2]*[V2]*0+[V3]*-0.00447932+[V3]*[V3]*-0.0017+[V4]*-0.00003298+[V4]*[V4]*0.0017+[V5]*-0.0035+[V5]*[V5]*0+[V6]*-4.19793004+[V6]*[V6]*1.5962+[V7]*12.51966636+[V7]*[V7]*-5.7058+[V8]*-19.06596752+[V8]*[V8]*28.6281+[V9]*9.47136506+[V9]*[V9]*-33.0993+[V10]*0.001+[V10]*[V10]*0+[V11]*-0.15397774+[V11]*[V11]*-0.0021+[V12]*-0.027+[V12]*[V12]*0+[V13]*-2.02963068+[V13]*[V13]*0.1683+[V14]*24.6268688+[V14]*[V14]*-5.1685+[V15]*-6.17590512+[V15]*[V15]*1.2936+[V16]*2.03846688+[V16]*[V16]*-0.1427+[V17]*9.02302288+[V17]*[V17]*-1.8223+[V18]*1.7471106+[V18]*[V18]*-0.1255+[V19]*-30.00770912+[V19]*[V19]*6.7738
Do you have any idea on what the problem is?
The parsing stops when the parser enters the formula grammar rule.
edit Original problem here:
My grammar allows this:
// ( 1 LESS_EQUALS 2 )
1 <= 2
But the way I expressed it in my G4 file makes it also accept this:
// ( ( 1 LESS_EQUALS 2 ) LESS_EQUALS 3 )
1 <= 2 <= 3
Which I don't want.
My grammar contains this:
expr
: atom # atomArithmeticExpr
| (PLUS|MINUS) expr # plusMinusExpr
| l=expr operator=('*'|'/') r=expr # multdivArithmeticExpr
| l=expr operator=('+'|'-') r=expr # addsubtArithmeticExpr
| l=expr operator=('>'|'<'|'>='|'<=') r=expr # comparisonExpr
[...]
How can I tell Antlr that this is not acceptable?
Just split root into two. Either rename root 'expr' into 'rootexpr', or vice versa.
rootExpr
: atom # atomArithmeticExpr
| (PLUS|MINUS) expr # plusMinusExpr
| l=expr operator=('*'|'/') r=expr # multdivArithmeticExpr
| l=expr operator=('+'|'-') r=expr # addsubtArithmeticExpr
| l=expr operator=('>'|'<'|'>='|'<=') r=expr # comparisonExpr
EDIT: You cannot have cyclic reference => expr node in expr rule.

Issues with quoted string regex in antlr4

I want to parse strings like "TOM*", "TOM" , "*TOM" , "TOM", "*" and all these without quotes. I created 2 rules name_with_quotes & name without quotes, but string with quotes are giving expected token: <EOF> error
I have following tokens in lexer.g4 file
ID : [a-zA-Z0-9-_]+ ;
WILDCARD_STARTS_WITH_STRING: ID'*';
WILDCARD_ENDS_WITH_STRING: '*'ID;
WILDCARD_CONTAINS_STRING : '*'ID'*' ;
STRING : ('"' | '\'') ( ( (~('"' | '\\' | '\r' | '\n') | '\\'('"' ) )*) | ) ('"' | '\'');
QUOTED_ID : ('"' | '\'') (((STAR)? ID (STAR)?) | ID | STAR) ('"' | '\'');
I have following rules in my parser file:
name_without_quotes : ID | WILDCARD_STARTS_WITH_STRING | WILDCARD_ENDS_WITH_STRING | WILDCARD_CONTAINS_STRING | STAR ;
name_with_quotes : QUOTED_ID;
name : name_with_quotes | name_without_quotes;
I also tried using following rules.
WITHOUT_QUOTES : '"' (ID | ID'*' | '*'ID | '*'ID'*' ) '"';
WITH_QUOTES : ID | WILDCARD_STARTS_WITH_STRING | WILDCARD_ENDS_WITH_STRING | WILDCARD_CONTAINS_STRING | STAR ;
But no luck. Any clue what I could be doing wrong?
Many Thanks.
Found solution here.
http://www.antlr3.org/pipermail/antlr-interest/2012-March/044273.html.
Just changed the order and it worked.

What characters are permitted for Haskell operators?

Is there a complete list of allowed characters somewhere, or a rule that determines what can be used in an identifier vs an operator?
From the Haskell report, this is the syntax for allowed symbols:
a | b means a or b and
a<b> means a except b
special -> ( | ) | , | ; | [ | ] | `| { | }
symbol -> ascSymbol | uniSymbol<special | _ | : | " | '>
ascSymbol -> ! | # | $ | % | & | * | + | . | / | < | = | > | ? | #
\ | ^ | | | - | ~
uniSymbol -> any Unicode symbol or punctuation
So, symbols are ASCII symbols or Unicode symbols except from those in special | _ | : | " | ', which are reserved.
Meaning the following characters can't be used: ( ) | , ; [ ] ` { } _ : " '
A few paragraphs below, the report gives the complete definition for Haskell operators:
varsym -> ( symbol {symbol | :})<reservedop | dashes>
consym -> (: {symbol | :})<reservedop>
reservedop -> .. | : | :: | = | \ | | | <- | -> | # | ~ | =>
Operator symbols are formed from one or more symbol characters, as
defined above, and are lexically distinguished into two namespaces
(Section 1.4):
An operator symbol starting with a colon is a constructor.
An operator symbol starting with any other character is an ordinary identifier.
Notice that a colon by itself, ":", is reserved solely for use as the
Haskell list constructor; this makes its treatment uniform with other
parts of list syntax, such as "[]" and "[a,b]".
Other than the special syntax for prefix negation, all operators are
infix, although each infix operator can be used in a section to yield
partially applied operators (see Section 3.5). All of the standard
infix operators are just predefined symbols and may be rebound.
From the Haskell 2010 Report §2.4:
Operator symbols are formed from one or more symbol characters...
§2.2 defines symbol characters as being any of !#$%&*+./<=>?#\^|-~: or "any [non-ascii] Unicode symbol or punctuation".
NOTE: User-defined operators cannot begin with a : as, quoting the language report, "An operator symbol starting with a colon is a constructor."
What I was looking for was the complete list of characters. Based on the other answers, the full list is;
Unicode Punctuation:
http://www.fileformat.info/info/unicode/category/Pc/list.htm
http://www.fileformat.info/info/unicode/category/Pd/list.htm
http://www.fileformat.info/info/unicode/category/Pe/list.htm
http://www.fileformat.info/info/unicode/category/Pf/list.htm
http://www.fileformat.info/info/unicode/category/Pi/list.htm
http://www.fileformat.info/info/unicode/category/Po/list.htm
http://www.fileformat.info/info/unicode/category/Ps/list.htm
Unicode Symbols:
http://www.fileformat.info/info/unicode/category/Sc/list.htm
http://www.fileformat.info/info/unicode/category/Sk/list.htm
http://www.fileformat.info/info/unicode/category/Sm/list.htm
http://www.fileformat.info/info/unicode/category/So/list.htm
But excluding the following characters with special meaning in Haskell:
(),;[]`{}_:"'
A : is only permitted as the first character of the operator, and denotes a constructor (see An operator symbol starting with a colon is a constructor).

Resources