I am parsing an SQL-like language in which I need to handle arithmetic with operator precedence.
Expressions can look like this:
(a + b) - c
(a + b) / 1000
a + (b - c)
a + (SELECT...)
(SELECT... ) + (SELECT ...)
etc.
I am using the ANTLR4 listener pattern, and I can't find a way to build a tree representation of these arithmetic clauses.
Relevant parts of the grammar:
arithmetic_select_clause:
result_column arithmeticExpression result_column # ArithmeticSelect
| result_column arithmeticExpression arithmetic_select_clause # ArithmeticSelect
| arithmetic_select_clause arithmeticExpression result_column # ArithmeticSelect
| '(' arithmetic_select_clause ')' # ArithmeticSelectParentheses
;
arithmeticExpression : '+' # arithmeticsAdd
| '-' # arithmeticsSubtract
| '*' # arithmeticsMultiply
| '/' # arithmeticsDivide
| '%' # arithmeticsModulus
;
I can create a tree using the ANTLR listeners, but I can't handle precedence.
Help please
ANTLR can help you there but you need to follow a few rules for it to do so. The arithmeticExpression rule needs to contain both operands and be directly recursive so that ANTLR can figure out how to rewrite it.
Here's an example of what you could do:
expression : '(' expression ')'
| expression op=('*'|'/'|'%') expression
| expression op=('+'|'-') expression
| result_column
| arithmetic_select_clause
;
This rule is left-recursive, but ANTLR will rewrite it to eliminate the left recursion. See the ANTLR documentation on left-recursive rules.
Notice how the precedence levels are ordered: alternatives listed earlier bind more tightly. Each level gets its own alternative, and operators with the same precedence share one alternative.
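To make the effect of ordered precedence levels concrete, here is a minimal sketch of precedence climbing in Python. It is not what ANTLR generates; it only illustrates the same idea by hand. The `PRECEDENCE` table, `tokenize`, `parse` and `parse_atom` are names made up for this illustration.

```python
import re

# Binding strength per operator: a higher number binds tighter.
# These levels mirror the two operator alternatives in the rule above.
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2, "%": 2}

def tokenize(text):
    # Identifiers, integers, operators and parentheses; whitespace is dropped.
    return re.findall(r"[A-Za-z_]\w*|\d+|[-+*/%()]", text)

def parse(tokens, min_prec=1):
    # Parse one operand, then fold in operators that bind tightly enough.
    left = parse_atom(tokens)
    while tokens and tokens[0] in PRECEDENCE and PRECEDENCE[tokens[0]] >= min_prec:
        op = tokens.pop(0)
        # Left-associative: the right operand may only hold tighter operators.
        right = parse(tokens, PRECEDENCE[op] + 1)
        left = (op, left, right)
    return left

def parse_atom(tokens):
    token = tokens.pop(0)
    if token == "(":
        inner = parse(tokens)
        tokens.pop(0)  # consume the closing ")"
        return inner
    return token

print(parse(tokenize("a + b * c")))       # ('+', 'a', ('*', 'b', 'c'))
print(parse(tokenize("(a + b) / 1000")))  # ('/', ('+', 'a', 'b'), '1000')
print(parse(tokenize("a - b - c")))       # ('-', ('-', 'a', 'b'), 'c')
```

The nested tuples play the role of the parse tree: the deeper a node sits, the earlier it is evaluated.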
Also, for processing math expressions it's much easier to use a visitor than a listener. ANTLR can generate the base classes for you. It'll be much easier to traverse/process the parse tree in the precedence order this way.
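As a rough illustration of why a visitor is convenient here, the sketch below evaluates an expression tree by returning a value from each visit call, so no explicit stack is needed. It does not use the ANTLR runtime: the tuple-based tree, the `EvalVisitor` class and the column values are assumptions made for the example.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul,
       "/": operator.truediv, "%": operator.mod}

class EvalVisitor:
    """Evaluates a tree of (op, left, right) tuples, numbers and column names."""

    def __init__(self, columns):
        self.columns = columns  # hypothetical column values for the demo

    def visit(self, node):
        if isinstance(node, tuple):
            return self.visit_binary(node)
        if isinstance(node, str):
            return self.visit_column(node)
        return node  # a plain number

    def visit_binary(self, node):
        op, left, right = node
        # Children are evaluated first, so nesting depth encodes precedence.
        return OPS[op](self.visit(left), self.visit(right))

    def visit_column(self, name):
        return self.columns[name]

tree = ("/", ("+", "a", "b"), 1000)  # (a + b) / 1000
result = EvalVisitor({"a": 1500, "b": 500}).visit(tree)
print(result)  # 2.0
```

With a listener you would have to push and pop intermediate results yourself in the enter/exit callbacks; with a visitor the return value carries them.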
I have an ANTLR4 rule "expression" that can be either "maths" or "comparison", but "comparison" can contain "maths". Here is the concrete code:
expression
: ID
| maths
| comparison
;
maths
: maths_atom ((PLUS | MINUS) maths_atom) ? // "?" because in fact there is first multiplication then pow and I don't want to force a multiplication to make an addition
;
maths_atom
: NUMBER
| ID
| OPEN_PAR expression CLOSE_PAR
;
comparison
: comp_atom ((EQUALS | NOT_EQUALS) comp_atom) ?
;
comp_atom
: ID
| maths // here is the expression of interest
| OPEN_PAR expression CLOSE_PAR
;
If I give, for instance, 6 as input, the parse tree is fine, because it detects maths. But the ANTLR4 plugin for IntelliJ IDEA marks my expression rule in red as ambiguous. Should I say goodbye to a short parse tree and allow maths only through comparison in expression, so that it is no longer ambiguous?
The problem is that when the parser sees 6, which is a NUMBER, there are two ways to reach it through your grammar:
expression - maths - maths_atom - NUMBER
or
expression - comparison - comp_atom - maths - maths_atom - NUMBER
This ambiguity triggers the error that you see.
You can fix this by flattening your parser grammar as shown in this tutorial:
start
: expr EOF
;
expr
: expr (PLUS | MINUS) expr # ADDGRP
| expr (EQUALS | NOT_EQUALS) expr # COMPGRP
| OPEN_PAR expr CLOSE_PAR # PARENGRP
| NUMBER # NUM
| ID # IDENT
;
I am creating an interpreter in Java using ANTLR. I have a grammar which I have been using for a long time and I have built a lot of code around classes generated from this grammar.
In the grammar, 'false' is defined as a literal, and there is also a definition of variable names that allows them to be built from letters, digits, underscores and dots (see the definition below).
The problem arises when I use 'false' as part of a variable name, as in
varName.nestedVar.false. The rule that marks false as falseLiteral takes precedence.
I tried playing with the whitespace handling, using everything I could find on the internet. Removing WHITESPACE : [ \t\r\n] -> channel (HIDDEN); and using explicit WS* or WS+ in every rule would work for the parser, but I would have to adjust a lot of code in the AST visitors. I also tried telling the boolLiteral rule that it must have some whitespace before the actual literal, like WHITESPACE* trueLiteral, but this doesn't work when whitespace is sent to the HIDDEN channel, and disabling the hidden channel altogether again means a lot of code rewriting (since I often rely on the order of tokens). I also tried reordering the non-terminals in the literal rule, but this had no effect whatsoever.
...
literal:
boolLiteral
| doubleLiteral
| longLiteral
| stringLiteral
| nullLiteral
| varExpression
;
boolLiteral:
trueLiteral | falseLiteral
;
trueLiteral:
TRUE
;
falseLiteral:
FALSE
;
varExpression:
name=qualifiedName ...
;
...
qualifiedName:
ID ('.' (ID | INT))*
...
TRUE : [Tt] [Rr] [Uu] [Ee];
FALSE : [Ff] [Aa] [Ll] [Ss] [Ee];
ID : (LETTER | '_') (LETTER | DIGIT | '_')* ;
INT : DIGIT+ ;
POINT : '.' ;
...
WHITESPACE : [ \t\r\n] -> channel (HIDDEN);
My best bet was to move the qualifiedName definition into a lexer rule:
qualifiedName:
QUAL_NAME
;
QUAL_NAME: ID ('.' (ID | INT))* ;
Then it works for
varName.false AND false
varName.whatever.ntimes AND false
The result is correct: varExpression -> qualifiedName on the left-hand side and boolLiteral -> falseLiteral on the right-hand side.
But with this definition the following doesn't work, and I really don't know why:
varName AND false
A qualified name without a dot produces
line 1:8 no viable alternative at input 'varName AND'
The expected solution would be either to enable/disable WHITESPACE -> channel (HIDDEN) for specific rules only,
or to tell the boolLiteral rule that it cannot start with a dot, something like ~POINT falseLiteral (I tried this as well, with no luck),
or to get qualifiedName working without a dot when the rule is moved to the lexer.
Thanks.
You could do something like this:
qualifiedName
: ID ('.' (anyId | INT))*
;
anyId
: ID
| TRUE
| FALSE
;
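Here is a small hand-written sketch of what the anyId approach does: the lexer keeps producing TRUE and FALSE tokens everywhere, and only the parser decides that they are acceptable as name segments after a dot. This is plain Python rather than ANTLR output; the `KEYWORDS` table, the AND token type and both functions are assumptions made for the illustration.

```python
import re

# Keywords get their own token types, matched case-insensitively as in the lexer.
KEYWORDS = {"true": "TRUE", "false": "FALSE", "and": "AND"}

def tokenize(text):
    tokens = []
    for lexeme in re.findall(r"[A-Za-z_]\w*|\d+|\.", text):
        if lexeme == ".":
            tokens.append(("POINT", lexeme))
        elif lexeme.isdigit():
            tokens.append(("INT", lexeme))
        else:
            tokens.append((KEYWORDS.get(lexeme.lower(), "ID"), lexeme))
    return tokens

def qualified_name(tokens, pos=0):
    """qualifiedName : ID ('.' (anyId | INT))* ; returns (name, next position)."""
    kind, text = tokens[pos]
    if kind != "ID":
        raise SyntaxError(f"expected ID, got {kind}")
    parts, pos = [text], pos + 1
    while pos + 1 < len(tokens) and tokens[pos][0] == "POINT":
        kind, text = tokens[pos + 1]
        if kind not in ("ID", "TRUE", "FALSE", "INT"):  # anyId | INT
            break
        parts.append(text)
        pos += 2
    return ".".join(parts), pos

tokens = tokenize("varName.nestedVar.false AND false")
name, pos = qualified_name(tokens)
print(name)          # varName.nestedVar.false
print(tokens[pos:])  # [('AND', 'AND'), ('FALSE', 'false')]
```

The trailing false is still a FALSE token, so boolLiteral keeps working, and whitespace can stay on the hidden channel.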
I believed that sequencing (implicitly given by the order of subrules) had a higher priority in an ANTLR4 parser than alternation (explicitly given by the | character), meaning that
a : x | y z ;
was semantically identical to
a : x | ( y z) ;
Looking in the ANTLR4 book and searching generally, I can't find a clear statement of this, but it seems reasonable. However, given a rule
expression :
pmqident
|
constant
|
[snip]
|
'(' scalar_subquery ')'
|
unary_operator expression // this is unbracketed
|
expression binary_operator expression
[snip]
;
and I feed it select - 2 / 3, I get this parse tree
whereas if I just add brackets around unary_operator expression and change absolutely nothing else, to get this
expression :
[snip]
'(' scalar_subquery ')'
|
( unary_operator expression ) // brackets added here
|
expression binary_operator expression
[snip]
;
and give it the same SQL, I get this
What am I misunderstanding?
(BTW and separately, the freaky parse of "- 2 / 3" into "(- ( 2 / 3))" is actually the one I want. That's how MSSQL does it. Mad world)
------
Ok, to reproduce (works for me), not utterly minimal but heavily stripped code. File is named MSSQL.g4:
grammar MSSQL;
expression :
constant
|
unary_operator expression // bracket/unbracket this
|
expression binary_operator expression
;
constant : INTEGER_CONST ;
INTEGER_CONST : [0-9]+ ;
binary_operator :
arithmetic_operator
;
arithmetic_operator :
subtract
|
divide
;
add_symbol : PLUS_SIGN ;
subtract : MINUS_SIGN ;
divide : DIVIDE_SIGN ;
unary_operator :
SIGN
;
SIGN : PLUS_SIGN | MINUS_SIGN ;
DIVIDE_SIGN : '/' ;
PLUS_SIGN : '+' ;
MINUS_SIGN : '-' ;
SKIPWS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines
The DOS crud to compile it (relevant parts given):
set CurrDir=%~dp0
set CurrDir=%CurrDir:~0,-1%
cd %CurrDir%
java org.antlr.v4.Tool -Werror -o %CurrDir%\MSSQL MSSQL.g4
IF %ERRORLEVEL% NEQ 0 goto problem
javac %CurrDir%\MSSQL\MSSQL*.java
IF %ERRORLEVEL% NEQ 0 goto problem
cd ./MSSQL
echo enter sql...
java org.antlr.v4.gui.TestRig MSSQL expression -gui -trace -tokens
input is - 2 / 3
Running on win2k8R2, versions of bits are as follows
C:\Users\jan>java -version
java version "1.8.0_65"
Java(TM) SE Runtime Environment (build 1.8.0_65-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.65-b01, mixed mode)
C:\Users\jan>java org.antlr.v4.Tool
ANTLR Parser Generator Version 4.5.1
Anything else needed? Can anyone reproduce?
Frankly I'm struggling to believe this is a bug. It's just too elemental.
FYI I found this originally not by bracketing/unbracketing but by hoisting the body of a subrule into the rule, and noticed that the behaviour changed.
This answer is being written in the context of antlr/antlr4#564 not being fixed.
During the code generation process, ANTLR looks for a few specific patterns when rewriting left-recursive rules to work in a recursive-descent parser.
Consider the following rule:
expression
: INT
| '++' expression
| expression '++'
| expression '+' expression
;
Suffix: Top-level alternatives which start with a recursive invocation. In the example, the alternative expression '++' falls into this category.
Prefix: Top-level alternatives which end with a recursive invocation. In the example, the alternative '++' expression falls into this category.
Binary: Top-level alternatives which start and end with a recursive invocation. In the example, the alternative expression '+' expression falls into this category.
Other: Everything else. In the example, the alternative INT falls into this category.
When matching these patterns, no simplifications are performed. This includes removing otherwise-unnecessary parentheses, which is the basis of issue antlr/antlr4#564.
By including parentheses around a top-level alternative in a left-recursive rule, you force the alternative to be treated as Other. For alternatives that would normally be Suffix or Binary, this results in a compilation error due to left recursion that was not eliminated. For Prefix alternatives (which you have), the grammar still compiles but changes behavior because the alternative is treated as a primary expression instead of an operator which overrides its original position in the operator precedence sequence.
Note that including parentheses around a top-level alternative which was already in the Other category will not change behavior at all. Likewise, including parentheses around an alternative in a rule which is not left-recursive will not change behavior.
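The classification described above can be sketched in a few lines of Python. This is only an illustration of the pattern matching as explained in this answer, not ANTLR's actual implementation; the list-based representation of alternatives and the `classify` function are made up for the example.

```python
def classify(rule, alternative):
    """Classify one top-level alternative of a left-recursive rule.

    An alternative is a list of symbols. A nested list stands for a
    parenthesized block, which is deliberately not looked into, mirroring
    the "no simplifications are performed" behavior.
    """
    starts = alternative[0] == rule
    ends = alternative[-1] == rule
    if len(alternative) > 1:
        if starts and ends:
            return "Binary"
        if starts:
            return "Suffix"
        if ends:
            return "Prefix"
    return "Other"

alternatives = [
    ["constant"],
    ["unary_operator", "expression"],                   # unbracketed
    [["unary_operator", "expression"]],                 # bracketed
    ["expression", "binary_operator", "expression"],
]
kinds = [classify("expression", alt) for alt in alternatives]
print(kinds)  # ['Other', 'Prefix', 'Other', 'Binary']
```

The bracketed alternative is a single opaque block, so it no longer ends with a visible recursive invocation and falls into Other, which is exactly the change in behavior seen in the question.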
I am parsing an SQL-like language.
I want to parse full SQL statements like SELECT .. FROM .. WHERE, and also a simple expression line, which can be a function, a where clause, an expr or arithmetic.
These are the important parts of the grammar:
// main rule
parse : (statments)* EOF;
// All optional statements
statments : select_statement
| virtual_column_statement
;
select_statement :
SELECT select_item ( ',' select_item )*
FROM from_cluase ( ',' from_cluase )*
(WHERE where_clause )?
( GROUP BY group_by_item (',' group_by_item)* )?
( HAVING having_condition (AND having_condition)* )?
( ORDER BY order_clause (',' order_clause)* )?
( LIMIT limit_clause)?
| '(' select_statement ')'
virtual_column_statement:
virtual_column_expression
;
virtual_column_expression :
expr
| where_clause
| function
| virtual_column_expression arithmetichOp=('*'|'/'|'+'|'-'|'%') virtual_column_expression
| '(' virtual_column_expression ')'
;
virtual_column_statement works great.
Select queries work too, but after a select finishes, the parser goes on to virtual_column_statement as well.
I want it to choose one or the other.
How can I fix this?
EDIT:
After some research I found out that ANTLR takes my query and separates it into two different parts.
How can I fix this?
Thanks,
id
Your 'virtual_column_statement' appears to be part of the 'select_statement'. I expect that you are missing a ';' between the two rules.
Most of your 'select_statement' clauses are optional, so after matching the select and from clauses, if Antlr thinks that the balance of the input is better matched as a 'virtual_column_statement', then it will take that path.
Your choices are:
1) make your select_statement comprehensive and at least as general as your 'virtual_column_statement';
2) require a keyword at the beginning of the 'virtual_column_statement' to prevent Antlr from considering it as a partial alternate;
3) put the 'virtual_column_statement' in a separate parser grammar and don't send it any select input text.
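Options 2 and 3 both come down to deciding up front which entry rule should see the input. A minimal sketch of such a dispatcher, assuming the decision can be made from the first word alone; the function name and the returned rule names used as plain strings are made up for the illustration, and a parenthesized select would need extra handling.

```python
def choose_entry_rule(text):
    """Pick the entry rule from the first word of the input (hypothetical)."""
    words = text.split(None, 1)
    first_word = words[0].upper() if words else ""
    if first_word == "SELECT":
        return "select_statement"
    return "virtual_column_statement"

print(choose_entry_rule("SELECT a FROM t WHERE a > 1"))  # select_statement
print(choose_entry_rule("(a + b) / 1000"))               # virtual_column_statement
```

With the input routed like this, the select grammar never gets a chance to hand the tail of a query to virtual_column_statement.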
How do I build a lexer token that can handle recursion inside it, as in this string?
${*anything*${*anything*}*anything*}
Yes, you can use recursion inside lexer rules.
Take the following example:
${a ${b} ${c ${ddd} c} a}
which will be parsed correctly by the following grammar:
parse
: DollarVar
;
DollarVar
: '${' (DollarVar | EscapeSequence | ~Special)+ '}'
;
fragment
Special
: '\\' | '$' | '{' | '}'
;
fragment
EscapeSequence
: '\\' Special
;
as the interpreter inside ANTLRWorks shows:
[screenshot: ANTLRWorks interpreter output]
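For readers without ANTLRWorks at hand, the recursive DollarVar rule can be mimicked by a small hand-written matcher. This is only a sketch of the same recursion in Python, not generated lexer code; `SPECIAL` and `match_dollar_var` are names chosen for the example.

```python
SPECIAL = set("\\${}")  # the characters of the Special fragment

def match_dollar_var(text, pos=0):
    """Return the index just past a DollarVar starting at pos, or None."""
    if not text.startswith("${", pos):
        return None
    i = pos + 2
    items = 0
    while i < len(text):
        if text.startswith("${", i):             # nested DollarVar
            end = match_dollar_var(text, i)
            if end is None:
                return None
            i = end
        elif text[i] == "\\":                    # EscapeSequence: '\\' Special
            if i + 1 >= len(text) or text[i + 1] not in SPECIAL:
                return None
            i += 2
        elif text[i] not in SPECIAL:             # ~Special
            i += 1
        else:
            break
        items += 1
    # The body needs at least one item ('+') followed by the closing brace.
    if items and i < len(text) and text[i] == "}":
        return i + 1
    return None

sample = "${a ${b} ${c ${ddd} c} a}"
print(match_dollar_var(sample) == len(sample))  # True
```

The whole nested string is consumed as one unit, which is what a single DollarVar token means.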
ANTLR's lexers do support recursion, as @BartK adeptly points out in his post, but you will only see a single token within the parser. If you need to interpret the various pieces within that token, you'll probably want to handle it within the parser.
IMO, you'd be better off doing something in the parser:
variable: DOLLAR LBRACE (id | variable)+ RBRACE;
By doing something like the above, you'll see all the necessary pieces and can build an AST or otherwise handle accordingly.
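To show what "seeing all the pieces" can look like, here is a rough Python sketch that parses the same kind of input into a nested structure instead of one flat token. It treats a variable body as a mix of plain text and nested variables, which is an assumption about the intended language; `parse_variable` is a made-up helper, not ANTLR output.

```python
def parse_variable(text, pos=0):
    """Parse '${' ... '}' at pos into a nested list; returns (tree, next position)."""
    if not text.startswith("${", pos):
        raise SyntaxError(f"expected '${{' at {pos}")
    children, chunk, i = [], "", pos + 2
    while i < len(text):
        if text.startswith("${", i):
            if chunk:                      # flush plain text seen so far
                children.append(chunk)
                chunk = ""
            child, i = parse_variable(text, i)
            children.append(child)
        elif text[i] == "}":
            if chunk:
                children.append(chunk)
            return children, i + 1
        else:
            chunk += text[i]
            i += 1
    raise SyntaxError("missing closing '}'")

tree, end = parse_variable("${a ${b} c}")
print(tree)  # ['a ', ['b'], ' c']
```

Each nested list corresponds to one variable node, so an AST builder can walk the pieces individually.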