How can I modify ANTLR4 grammar such that left association of operator can be supported? - antlr4

I want to support the following input as it is with a provided grammar.
This does not work
with [{id: 1},{id: 2}] as a
return a[0].id, a[1].id
It works if we change it to be:
with [{id: 1},{id: 2}] as a
return (a[0]).id, (a[1]).id
The relevant part of the grammar is here:
oC_PropertyOrLabelsExpression
: oC_Atom ( SP? oC_PropertyLookup )* ( SP? oC_NodeLabels )? ;
This is the full grammar:
https://s3.amazonaws.com/artifacts.opencypher.org/M18/Cypher.g4
Is there anything we can do by modifying the grammar?

Change:
oC_StringListNullOperatorExpression
: oC_PropertyOrLabelsExpression ( oC_StringOperatorExpression
| oC_ListOperatorExpression
| oC_NullOperatorExpression
)*
;
into:
oC_StringListNullOperatorExpression
: oC_PropertyOrLabelsExpression ( oC_StringOperatorExpression
| oC_ListOperatorExpression
| oC_NullOperatorExpression
| oC_PropertyLookup
)*
;

Related

Antlr4 parsing inconsistency

in a little test-parser I just wrote, I encountered a weird problem, which I don't quite understand.
Stripping it down to the smallest example showing the problem, let's start with the following grammar:
Testing.g4:
grammar Testing;
cscript // This is the construct I shortened
: (statement_list)* ;
statement_list
: statement ';' statement_list?
| block
;
statement
: assignment_statement
;
block : '{' statement_list? '}' ;
expression
: left=expression op=('*'|'/') right=expression # arithmeticExpression
| left=expression op=('+'|'-') right=expression # arithmeticExpression
| left=expression op=Comparison_operator right=expression # comparisonExpression
| ID # variableValueExpression
| constant # ignore // will be executed with the rule name
;
assignment_statement
: ID op=Assignment_operator expression
;
constant
: INT
| REAL;
Assignment_operator : ('=' | '+=' | '-=') ;
Comparison_operator : ('<' | '>' | '==' | '!=') ;
Comment : '//' .*? '\n' -> skip;
fragment NUM : [0-9];
INT : NUM+;
REAL
: NUM* '.' NUM+
| '.' NUM+
| INT
;
ID : [a-zA-Z_] [a-zA-Z_0-9]*;
WS : [ \t\r\n]+ -> skip;
Using the input
z = x + y;
everything is fine, we get a parse tree which goes from cscript to statement_list, statement, assignment_statement, id and expression. Great!
Now, if I add the possibility to declare variables, all goes down the drain:
This is the change to the grammar:
cscript
: (statement_list | variable_declaration ';')* ;
variable_declaration
: type ID ('=' expression)?
;
type
: 'int'
| 'real'
;
statement_list
: statement ';' statement_list?
| block
;
statement
: assignment_statement
;
// (continue as before)
All of a sudden, the same test-input gets wrongly dissected into two statement_lists, each continued to a statement with a "missing ';'" warning, the first going to an incomplete assignment_statement of "z =" and the second to an incomplete assignment_statement "x +".
My attempt to show the parse tree in text-form:
cscript
statement_list
statement
assignment_statement
'z'
'=' [marked as error]
[warning: missing ';']
statement_list
statement
assignment_statement
'x'
'+' [marked as error]
'y' [marked as error]
';'
Can anyone tell me what the problem is? (And how to fix it? ;-))
Edit on 2016-12-26, after Mike's comment:
After replacing all implicit lexer rules with explicit declarations, all of a sudden, the input "z = x + y" worked. (thumbs up)
The next thing I did was restoring more of the original example I had in mind, and adding a new input line
int x = 22;
to the input (which worked previously, but did not make it into the minimal example). Now, that line fails. This is the -token output of the test rig:
[#0,0:2='int',<4>,1:0]
[#1,4:4='x',<22>,1:4]
[#2,6:6='=',<1>,1:6]
[#3,8:9='22',<20>,1:8]
[#4,10:10=';',<12>,1:10]
[#5,13:13='z',<22>,2:0]
[#6,15:15='=',<1>,2:2]
[#7,17:17='x',<22>,2:4]
[#8,19:19='+',<18>,2:6]
[#9,21:21='y',<22>,2:8]
[#10,22:22=';',<12>,2:9]
[#11,25:24='<EOF>',<-1>,3:0]
line 1:6 mismatched input '=' expecting '='
As the problem seemed to be in the variable_declaration part, I even tried to split this into two parsing rules like this:
cscript
: (statement_list | variable_declaration_and_assignment SEMICOLON | variable_declaration SEMICOLON)* ;
variable_declaration_and_assignment
: type ID EQUAL expression
;
variable_declaration
: type ID
;
With the result:
line 1:6 no viable alternative at input 'intx='
Still stuck :-(
BTW: Splitting the "int x = 22;" into "int x;" and "x = 22;" works. sigh
Edit on 2016-12-26, after Mike's next comment:
Double-checked, and everything is lexer rules. Still, the mismatch between '=' and '=' (which I unfortunately cannot reconstruct anymore) gave me the idea to check the token types. The current status is:
(Shortened grammar)
cscript
: (statement_list | variable_declaration)* ;
...
variable_declaration
: type ID (EQUAL expression)? SEMICOLON
;
...
Assignment_operator : (EQUAL | PLUS_EQ | MINUS_EQ) ;
// among others
PLUS_EQ : '+=';
MINUS_EQ : '-=';
EQUAL: '=';
...
Shortened output:
[#0,0:2='int',<4>,1:0]
[#1,4:4='x',<22>,1:4]
[#2,6:6='=',<1>,1:6]
...
line 1:6 mismatched input '=' expecting ';'
Here, if I understand this correctly, the '=' is parsed to token type 1, which - according to the lexer.tokens output - is Assignment_Operator, while the expected EQUAL would be 13.
Might this be the problem?
Ok, seems the main take away here is: think about your definitions and how you define them. Create explicit lexer rules for your literals instead of defining them implicitly in the parser rules. Check the token values you get from the lexer if the parser gives you weird errors, because they must be correct in the first place or your parse has no chance to do its job.

How to write context free grammar aspect the priority of operations in antlr4

we knew the priority of logical operation from strong to low:
Not
And
Or
I want to add logical operation to my grammar in way respect the priority of logical operation. ...
My grammar is:
expression : factor ( PLUS factor | MINUS factor )* ;
factor : term ( MULT term | DIV term )* ;
term : NUMBER | ID | PAR_OPEN expression PAR_CLOSE ;
With ANTLR3 and ANTLR 4, you can doe something like this:
expression
: or_expression
;
// lowest precedence
or_expression
: and_expression ( '||' and_expression )*
;
and_expression
: rel_expression ( '&&' rel_expression )*
;
rel_expression
: add_expression ( ( '<' | '<=' | '>' | '>=' ) add_expression )*
;
add_expression
: mult_expression ( ( '+' | '-' ) mult_expression )*
;
mult_expression
: unary_expression ( ( '*' | '/' ) unary_expression )*
;
unary_expression
: '-' atom
| atom
;
// highest precedence
atom
: NUMBER
| ID
| '(' expression ')'
;
And with ANTLR4, you can also write it like this (which is equivalent to the grammar above!):
expression
: '!' expression
| expression ( '*' | '/' ) expression // higher than '+' | '-'
| expression ( '+' | '-' ) expression // higher than '<' | '<=' | '>' | '>='
| expression ( '<' | '<=' | '>' | '>=' ) expression // higher than '&&'
| expression '&&' expression // higher than '||'
| expression '||' expression
| NUMBER
| ID
| '(' expression ')'
;

antlr 4 listeners stop in two places - weird

I am parsing a SQL like language.
I want to parse full SQL sentence like : SELECT .. FROM .. WHERE and also a simple expr line which can be a function, where clause, expr and arithmethics.
this is the important parts of the grammar:
// main rule
parse : (statments)* EOF;
// All optional statements
statments : select_statement
| virtual_column_statement
;
select_statement :
SELECT select_item ( ',' select_item )*
FROM from_cluase ( ',' from_cluase )*
(WHERE where_clause )?
( GROUP BY group_by_item (',' group_by_item)* )?
( HAVING having_condition (AND having_condition)* )?
( ORDER BY order_clause (',' order_clause)* )?
( LIMIT limit_clause)?
| '(' select_statement ')'
virtual_column_statement:
virtual_column_expression
;
virtual_column_expression :
expr
| where_clause
| function
| virtual_column_expression arithmetichOp=('*'|'/'|'+'|'-'|'%') virtual_column_expression
| '(' virtual_column_expression ')'
;
virtual_columns works great.
Select queries works too but after it finishes, it goes to virtual_column_statement too.
I want it to choose one.
How can I fix this?
EDIT:
After some research I found out antlr takes my query and seperate it to two different parts.
How can I fix this?
Thanks,
id
Your 'virtual_column_statement' appears to be part of the 'select_statement'. I expect that you are missing a ';' between the two rules.
Most of your 'select_statement' clauses are optional, so after matching the select and from clauses, if Antlr thinks that the balance of the input is better matched as a 'virtual_column_statement', then it will take that path.
Your choices are:
1) make your select_statement comprehensive and at least as general as your 'virtual_column_statement';
2) require a keyword at the beginning of the 'virtual_column_statement' to prevent Antlr from considering it as a partial alternate;
3) put the 'virtual_column_statement' in a separate parser grammar and don't send it any select input text.

Antlr String can not match

I'm working on a little antlr problem. In my small custom DSL I want to be able to do a compare action between fields. I've got three fieldtypes (String, Int, Identifier) the Identifier is a variable name. I made a big specification but i've reduced my problem to a smaller grammer.
The problem is that when I try to use the String grammar notation, which you can add to your grammer using antlrworks, my Strings are seen as an identifier. This is my grammar:
grammar test;
x
: 'FROM' field_value EOF
;
field_value
: STRING
| INT
| identifier
;
identifier
: ID (('.' '`' ID '`')|('.' ID))?
| '`' ID '`' (('.' '`' ID '`')|('.' ID))?
;
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
;
INT : '0'..'9'+
;
STRING
: '"' ( ESC_SEQ | ~('\\'|'"') )* '"'
;
fragment
HEX_DIGIT : ('0'..'9'|'a'..'f'|'A'..'F') ;
fragment
ESC_SEQ
: '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
| UNICODE_ESC
| OCTAL_ESC
;
fragment
OCTAL_ESC
: '\\' ('0'..'3') ('0'..'7') ('0'..'7')
| '\\' ('0'..'7') ('0'..'7')
| '\\' ('0'..'7')
;
fragment
UNICODE_ESC
: '\\' 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
;
When I try to parse the following string FROM "Hello!" it returns a parsetree like this
<grammar test>
|
x
|
----------------------------
| | |
FROM field_value !
|
identifier
|
"Hello
It parses what I think should be a string to an identifier thought my identifier doesn't say anything about double quoets so it shouldn't match.
Besides I think my definition for a string is wrong, even though antlrworks generated it for me. Does anybody know why this happens?
Cheers!
There's nothing wrong with your grammar. The things that is messing it up for you is most probably the fact that you're using ANTLRWorks' interpreter. Don't. The interpreter doesn't work well.
Use ANTLRWorks' debugger instead (in your grammar, press CTRL + D), which works like a charm. This is what the debugger shows after parsing FROM "Hello!":

Nested brackets/chars '(' and ')' in grammar/ANTLRWorks warning: Decision can match input such as ... using multiple alternatives

The grammar below parses ( left part = right part # comment ), # comment is optional.
Two questions:
Sometimes warning (ANTLRWorks 1.4.2):
Decision can match input such as "{Int, Word}" using multiple alternatives: 1, 2 (referencing id2)
But only sometimes!
The next extension should be that the comment (id2) can contain chars '(' and ')'.
The grammar:
grammar NestedBrackets1a1;
//==========================================================
// Lexer Rules
//==========================================================
Int
: Digit+
;
fragment Digit
: '0'..'9'
;
Special
: ( TCUnderscore | TCQuote )
;
TCListStart : '(' ;
TCListEnd : ')' ;
fragment TCUnderscore : '_' ;
fragment TCQuote : '"' ;
// A word must start with a letter
Word
: ( 'a'..'z' | 'A'..'Z' | Special ) ('a'..'z' | 'A'..'Z' | Special | Digit )*
;
Space
: ( ' ' | '\t' | '\r' | '\n' ) { $channel = HIDDEN; }
;
//==========================================================
// Parser Rules
//==========================================================
assignment
: TCListStart id1 '=' id1 ( comment )? TCListEnd
;
id1
: Word+
;
comment
: '#' ( id2 )*
;
id2
: ( Word | Int )+
;
ANTLRStarter wrote:
Sometimes warning (ANTLRWorks 1.4.2): Decision can match input such as "{Int, Word}" using multiple alternatives: 1, 2 (referencing id2)
But only sometimes!
No, the grammar you posted will always produce this warning. Perhaps you don't always notice it (your IDE-plugin or ANTLRWorks might show it in a tab you don't have opened), but the warning is there. Convince yourself by creating a lexer/parser from the command line:
java -cp antlr-3.4-complete.jar org.antlr.Tool NestedBrackets1a1.g
will produce:
warning(200): NestedBrackets1a1.g:49:19:
Decision can match input such as "{Int, Word}" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
This is because you have a * after ( id2 ) inside your comment rule, and id2 also is a repetition of tokens: ( Word | Int )+. Let's say your input is "# foo bar" (a # followed by two Word tokens). ANTLR can now parse the input in more than 1 way: the 2 tokens "foo" and "bar" could be matched by ( id2 )*, where id2 matches a single Word token at a time, but "foo" and "bar" could also be matches in one go of the id2 rule.
Look at the merged rules:
comment
: '#' ( ( Word | Int )+ )*
;
See how you're repeating a repetition: ( ( ... )+ )*? This is usually a problem, as it is in your case.
Resolve this problem by either replacing the * with a ?:
comment
: '#' ( id2 )?
;
id2
: ( Word | Int )+
;
or by removing the +:
comment
: '#' ( id2 )*
;
id2
: ( Word | Int )
;
ANTLRStarter wrote:
The next extension should be that the comment (id2) can contain chars '(' and ')'.
That is asking for trouble since a comment is followed by a TCListEnd, which is a ). I don't recommend letting a comment match ).
EDIT
Note that comments are usually stripped from the source file while tokenizing the input source. That way you don't need to account for them in your parser rules. You can do that by "skipping" these tokens in a lexer rule:
Comment
: '#' ~('\r' | '\n')* {skip();}
;

Resources