Eliminating need for ALL(*) parsing in ANTLR 4 Grammar - antlr4

I am writing an ANTLR 4 grammar for a language that will have switch statements that do not allow fallthrough (similar to C#). All case statements must be terminated by a break statement. Multiple case statements can follow each other without any code in between (again, just like in C#). Here is a snippet of the grammar that captures this:
grammar MyGrammar;
switchStmt : 'switch' '(' expression ')' '{' caseStmt+ '}' ;
caseStmt : (caseOpener)+ statementList breakStmt ;
caseOpener : 'case' literal ':'
| 'default' ':'
;
statementList : statement (statement)* ;
breakStmt : 'break' ';' ;
I left out the definitions of expression and statement for brevity. However, it's important to note that the definition for statement includes breakStmt. This is because break statements can also be used to break out of loops.
In general the grammar is fine - it parses input as expected. However, I get warnings during the parse like "line 18:0 reportAttemptingFullContext d=10 (statementList), input='break;" and "line 18:0 reportContextSensitivity d=10 (statementList), input='break;" This makes sense because the parser is not sure whether to match a break statement as statement or as breakStmt and needs to fall back on ALL(*) parsing. My question is, how can I change my grammar in order to eliminate the need for this during the parse and avoid the performance hit? Is it even possible to do without changing the language syntax?

You should remove the breakStmt reference from the end of caseStmt, and instead perform this validation in a listener or visitor after the parse is complete. This offers you the following advantages:
Improved error handling when a user omits the required break statement.
Improved parser performance by removing the ambiguity between the breakStmt at the end of caseStmt and the statementList that precedes it.
I would use the following rules:
switchStmt
: 'switch' '(' expression ')' '{' caseStmt* '}'
;
caseStmt
: caseOpener statementList?
;
statementList
: statement+
;
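The post-parse check suggested above could look roughly like the sketch below. It is plain Python over a hypothetical, simplified tree: the `CaseStmt` class and its list of statement-kind strings are stand-ins invented for illustration, not ANTLR-generated context classes. With the rewritten rules, consecutive case labels show up as case statements with no body, so the check assumes an empty body is legal only when another case follows.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CaseStmt:
    """Hypothetical stand-in for one parsed caseStmt."""
    opener: str                                           # e.g. "case 1" or "default"
    statements: List[str] = field(default_factory=list)   # statement kinds, in order


def check_switch(cases: List[CaseStmt]) -> List[str]:
    """Return one error message per case that violates the no-fallthrough rule."""
    errors = []
    for index, case in enumerate(cases):
        is_last = index == len(cases) - 1
        if not case.statements:
            # An empty case is only legal when it shares the body of a following case.
            if is_last:
                errors.append(f"'{case.opener}' has no body and no following case")
        elif case.statements[-1] != "break":
            errors.append(f"'{case.opener}' must end with a break statement")
    return errors
```

In a real ANTLR listener the same test would presumably live in the exit method generated for caseStmt, reporting errors with the token positions available on the context object.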

Related

Antlr4 grammar wouldn't parse multiline input

I want to write a grammar using Antlr4 that will parse some definitions, but I've been struggling to get Antlr to co-operate.
The definition has two kinds of lines, a type and a property. I can get my grammar to parse the type line correctly but it either ignores the property lines or fails to identify PROPERTY_TYPE depending on how I tweak my grammar.
Here is my grammar (attempt # 583):
grammar TypeDefGrammar;
start
: statement+ ;
statement
: type NEWLINE
| property NEWLINE
| NEWLINE ;
type
: TYPE_KEYWORD TYPE_NAME; // e.g. 'type MyType1'
property
: PROPERTY_NAME ':' PROPERTY_TYPE ; // e.g. 'someProperty1: int'
TYPE_KEYWORD
: 'type' ;
TYPE_NAME
: IDENTIFIER ;
PROPERTY_NAME
: IDENTIFIER ;
PROPERTY_TYPE
: IDENTIFIER ;
fragment IDENTIFIER
: (LETTER | '_') (LETTER | DIGIT | '_' )* ;
fragment LETTER
: [a-zA-Z] ;
fragment DIGIT
: [0-9] ;
NEWLINE
: '\r'? '\n' ;
WS
: [ \t] -> skip ;
Here is a sample input:
type SimpleType
intProp1: int
stringProp2 : String
(returns the type but ignores intProp1, stringProp2.)
What am I doing wrong?
Usually when a rule does not match the whole input, but does match a prefix of it, it will simply match that prefix and leave the rest of the input in the stream without producing an error. If you want your rule to always match the whole input, you can add EOF to the end of the rule. That way you'll get proper error messages when it can't match the entire input.
So let's change your start rule to start : statement+ EOF;. Now applying start to your input will lead to the following error messages:
line 3:0 extraneous input 'intProp1' expecting {<EOF>, 'type', PROPERTY_NAME, NEWLINE}
line 4:0 extraneous input 'stringProp2' expecting {<EOF>, 'type', PROPERTY_NAME, NEWLINE}
So apparently intProp1 and stringProp2 aren't recognized as PROPERTY_NAMEs. So let's look at which tokens are generated (you can do that using the -tokens option to grun or by just iterating over the token stream in your code):
[#0,0:3='type',<'type'>,1:0]
[#1,5:14='SimpleType',<TYPE_NAME>,1:5]
[#2,15:15='\n',<NEWLINE>,1:15]
[#3,16:16='\n',<NEWLINE>,2:0]
[#4,17:24='intProp1',<TYPE_NAME>,3:0]
[#5,25:25=':',<':'>,3:8]
[#6,27:29='int',<TYPE_NAME>,3:10]
[#7,30:30='\n',<NEWLINE>,3:13]
[#8,31:41='stringProp2',<TYPE_NAME>,4:0]
[#9,43:43=':',<':'>,4:12]
[#10,45:50='String',<TYPE_NAME>,4:14]
[#11,51:51='\n',<NEWLINE>,4:20]
[#12,52:51='<EOF>',<EOF>,5:0]
So all of the identifiers in the code are recognized as TYPE_NAMEs, not PROPERTY_NAMEs. In fact, it is not clear what should distinguish a TYPE_NAME from a PROPERTY_NAME, so now let's actually look at your grammar:
TYPE_NAME
: IDENTIFIER ;
PROPERTY_NAME
: IDENTIFIER ;
PROPERTY_TYPE
: IDENTIFIER ;
fragment IDENTIFIER
: (LETTER | '_') (LETTER | DIGIT | '_' )* ;
Here you have three lexer rules with exactly the same definition. That's a bad sign.
Whenever multiple lexer rules can match on the current input, ANTLR chooses the one that would produce the longest match, picking the one that comes first in the grammar in case of ties. This is known as the maximum munch rule.
If you have multiple rules with the same definition, that means those rules will always match on the same input and they will always produce matches of the same length. So by the maximum munch rule, the first definition (TYPE_NAME) will always be used and the other ones might as well not exist.
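The selection rule described above can be modelled in a few lines. This is a toy model in Python using regular expressions, not ANTLR's actual lexer; the `RULES` table and the `pick_rule` helper are names made up for this sketch, with the patterns copied from the question's lexer rules.

```python
import re

IDENT = r"[A-Za-z_][A-Za-z0-9_]*"

# (name, pattern) pairs in the same order as the lexer rules in the grammar.
RULES = [
    ("TYPE_KEYWORD", r"type"),
    ("TYPE_NAME", IDENT),
    ("PROPERTY_NAME", IDENT),
    ("PROPERTY_TYPE", IDENT),
]


def pick_rule(text, pos=0):
    """Return (rule name, matched text) for the token starting at pos."""
    best_name, best_len = None, 0
    for name, pattern in RULES:
        match = re.compile(pattern).match(text, pos)
        # A strictly longer match wins; on a tie the earlier rule is kept.
        if match and len(match.group()) > best_len:
            best_name, best_len = name, len(match.group())
    return best_name, text[pos:pos + best_len]
```

Because the three name rules share one pattern, `pick_rule` can never return PROPERTY_NAME or PROPERTY_TYPE: TYPE_NAME always ties with them and comes first.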
The problem basically boils down to the fact that there's nothing that lexically distinguishes the different types of names, so there's no basis on which the lexer could decide which type of name a given identifier represents. That tells us that the names should not be lexer rules. Instead IDENTIFIER should be a lexer rule and the FOO_NAMEs should either be (somewhat unnecessary) parser rules or removed altogether (you can just use IDENTIFIER wherever you're currently using FOO_NAME).

Using getCharPositionInLine() with leading spaces in ANTLR4

I am writing grammar for a script which is based on VBScript.
In the script, the variable assignment is done in the usual manner of i=10 and in addition a variation with: Set i=10
The method calls can be done in several ways along with calling methods on objects, like:
Another(10).Call(20).Chain(30)
I consider 'Set' as a keyword in my grammar. However, in some pre-defined classes, the developer is allowed to name a method 'Set', so there may be calls like this (let me mark this as line A):
Another(10).Call(20,30).Set 40,50
my grammar:
definition: body EOF;
body: NL_WS* bodyElement NL_WS*;
bodyElement: statement (NL_WS+ statement)* ;
statement: assignment | chainCall;
assignment: (START_SET)? IDENTIFIER WS? EQUALS WS? (chainCall | VALID_NUMBER) ;
chainCall: methodCall ('.' methodCall)* ;
methodCall: IDENTIFIER WS? LPAREN? WS? argumentList? WS? RPAREN?;
argumentList: VALID_NUMBER (WS? COMMA WS? VALID_NUMBER)* ;
START_SET: 'Set' WS;
VALID_NUMBER: [1-9] NUMBER? ;
IDENTIFIER: LETTER LETTER_OR_DIGIT*;
LETTER: [a-zA-Z_];
NUMBER: [0-9];
LETTER_OR_DIGIT: [a-zA-Z0-9_];
EQUALS: '=' ;
LPAREN: '(';
RPAREN: ')';
COMMA: ',';
NL_WS: WS? NEWLINE WS?;
NEWLINE: [\r\n];
WS: [ \t]+;
This fails in what I have marked as line A (where Set is a method call inside an object):
line 10:24 mismatched input 'Set ' expecting IDENTIFIER
1) I am not able to understand why. My thinking is that as in the assignment rule, the (START_SET)? is defined at the beginning, it should expect Set at the beginning and so, the method call at the end should match with IDENTIFIER.
2) When I try with getCharPositionInLine, like:
START_SET: {getCharPositionInLine() == 0}? 'Set' WS;
it works fine, but then I have to deal with another problem: there may be leading whitespace before the 'Set' assignment, like:
' Set k=10'
and in such cases, it fails saying:
line 16:8 mismatched input 'k' expecting {<EOF>, '.', NL_WS}
(in this case, I think it matches with chainCall and not assignment which is understandable as it is not the first character in line).
So, is there an alternate method which will be like 'first character in line minus spaces'?
I also tried START_SET: {getCharPositionInLine() == 0}? WS? 'Set' WS;
thinking that the initial WS? would cover the leading whitespace, but I get the same error.
Any help is appreciated.
I found a way to deal with this issue. I wrote a pre-processor which strips all the leading spaces from each line, and then had the result parsed by ANTLR. This way I could use {getCharPositionInLine() == 0} successfully.
Also, this helped in keeping the grammar simpler.
HTH.
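A pre-processor of the kind described here can be very small. The following is a minimal sketch in Python, assuming the script is available as a single string; the function name is made up for illustration and is not taken from the original answer.

```python
def strip_leading_whitespace(source: str) -> str:
    """Remove leading spaces and tabs from every line, keeping line endings."""
    lines = source.splitlines(keepends=True)
    return "".join(line.lstrip(" \t") for line in lines)
```

One trade-off of this approach is that column numbers in ANTLR's error messages then refer to the stripped text rather than to the original file.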

Precedence of alternation vs sequencing in ANTLR4

I believed that sequencing (implicitly given by the order of subrules) had a higher priority in the ANTLR4 parser than alternation (explicitly given by the | character), meaning that
a : x | y z ;
was semantically identical to
a : x | ( y z) ;
Looking in the ANTLR4 book and searching generally I can't find a clear statement of this but it seems reasonable, however given a rule
expression :
pmqident
|
constant
|
[snip]
|
'(' scalar_subquery ')'
|
unary_operator expression // this is unbracketed
|
expression binary_operator expression
[snip]
;
and I feed it this select - 2 / 3 I get this parse tree
whereas if I just add brackets around unary_operator expression and change absolutely nothing else, to get this
expression :
[snip]
'(' scalar_subquery ')'
|
( unary_operator expression ) // brackets added here
|
expression binary_operator expression
[snip]
;
and give it the same SQL, I get this
What am I misunderstanding?
(BTW and separately, the freaky parse of "- 2 / 3" into "(- ( 2 / 3))" is actually the one I want. That's how MSSQL does it. Mad world)
------
Ok, to reproduce (works for me), not utterly minimal but heavily stripped code. File is named MSSQL.g4:
grammar MSSQL;
expression :
constant
|
unary_operator expression // bracket/unbracket this
|
expression binary_operator expression
;
constant : INTEGER_CONST ;
INTEGER_CONST : [0-9]+ ;
binary_operator :
arithmetic_operator
;
arithmetic_operator :
subtract
|
divide
;
add_symbol : PLUS_SIGN ;
subtract : MINUS_SIGN ;
divide : DIVIDE_SIGN ;
unary_operator :
SIGN
;
SIGN : PLUS_SIGN | MINUS_SIGN ;
DIVIDE_SIGN : '/' ;
PLUS_SIGN : '+' ;
MINUS_SIGN : '-' ;
SKIPWS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines
The DOS crud to compile it (relevant parts given):
set CurrDir=%~dp0
set CurrDir=%CurrDir:~0,-1%
cd %CurrDir%
java org.antlr.v4.Tool -Werror -o %CurrDir%\MSSQL MSSQL.g4
IF %ERRORLEVEL% NEQ 0 goto problem
javac %CurrDir%\MSSQL\MSSQL*.java
IF %ERRORLEVEL% NEQ 0 goto problem
cd ./MSSQL
echo enter sql...
java org.antlr.v4.gui.TestRig MSSQL expression -gui -trace -tokens
input is - 2 / 3
Running on win2k8R2, versions of bits are as follows
C:\Users\jan>java -version
java version "1.8.0_65"
Java(TM) SE Runtime Environment (build 1.8.0_65-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.65-b01, mixed mode)
C:\Users\jan>java org.antlr.v4.Tool
ANTLR Parser Generator Version 4.5.1
Anything else needed? Can anyone reproduce?
Frankly I'm struggling to believe this is a bug. It's just too elemental.
FYI I found this originally not by bracketing/unbracketing but by hoisting the body of a subrule into rule, and noticed behaviour changed.
This answer is being written in the context of antlr/antlr4#564 not being fixed.
During the code generation process, ANTLR looks for a few specific patterns when rewriting left-recursive rules to work in a recursive-descent parser.
Consider the following rule:
expression
: INT
| '++' expression
| expression '++'
| expression '+' expression
;
Suffix: Top-level alternatives which start with a recursive invocation. In the example, the alternative expression '++' falls into this category.
Prefix: Top-level alternatives which end with a recursive invocation. In the example, the alternative '++' expression falls into this category.
Binary: Top-level alternatives which start and end with a recursive invocation. In the example, the alternative expression '+' expression falls into this category.
Other: Everything else. In the example, the alternative INT falls into this category.
When matching these patterns, no simplifications are performed. This includes removing otherwise-unnecessary parentheses, which is the basis of issue antlr/antlr4#564.
By including parentheses around a top-level alternative in a left-recursive rule, you force the alternative to be treated as Other. For alternatives that would normally be Suffix or Binary, this results in a compilation error due to left recursion that was not eliminated. For Prefix alternatives (which you have), the grammar still compiles but changes behavior because the alternative is treated as a primary expression instead of an operator which overrides its original position in the operator precedence sequence.
Note that including parentheses around a top-level alternative which was already in the Other category will not change behavior at all. Likewise, including parentheses around an alternative in a rule which is not left-recursive will not change behavior.
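The four categories, and the effect of wrapping an alternative in parentheses, can be mimicked with a toy classifier. This Python sketch only mirrors the description above and is not ANTLR's actual implementation; it assumes an alternative is given as a list of its top-level elements, with a parenthesized group represented as a nested list.

```python
def classify(rule_name, alternative):
    """Classify one top-level alternative of a left-recursive rule.

    Only the outermost elements are inspected, so a parenthesized group
    (a nested list) is never recognised as a recursive invocation.
    """
    multi = len(alternative) > 1
    starts = multi and alternative[0] == rule_name
    ends = multi and alternative[-1] == rule_name
    if starts and ends:
        return "Binary"
    if starts:
        return "Suffix"
    if ends:
        return "Prefix"
    return "Other"
```

With this model, `["unary_operator", "expression"]` is classified as Prefix, while the bracketed form `[["unary_operator", "expression"]]` falls through to Other, which is the change in category the answer describes.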

antlr 4 listeners stop in two places - weird

I am parsing a SQL like language.
I want to parse a full SQL sentence like SELECT .. FROM .. WHERE, and also a simple expr line which can be a function, a where clause, an expr or arithmetic.
These are the important parts of the grammar:
// main rule
parse : (statments)* EOF;
// All optional statements
statments : select_statement
| virtual_column_statement
;
select_statement :
SELECT select_item ( ',' select_item )*
FROM from_cluase ( ',' from_cluase )*
(WHERE where_clause )?
( GROUP BY group_by_item (',' group_by_item)* )?
( HAVING having_condition (AND having_condition)* )?
( ORDER BY order_clause (',' order_clause)* )?
( LIMIT limit_clause)?
| '(' select_statement ')'
virtual_column_statement:
virtual_column_expression
;
virtual_column_expression :
expr
| where_clause
| function
| virtual_column_expression arithmetichOp=('*'|'/'|'+'|'-'|'%') virtual_column_expression
| '(' virtual_column_expression ')'
;
virtual_column_statement works great.
Select queries work too, but after a select query finishes, the parser goes on to virtual_column_statement as well.
I want it to choose one.
How can I fix this?
EDIT:
After some research I found out that antlr takes my query and separates it into two different parts.
How can I fix this?
Thanks,
id
Your 'virtual_column_statement' appears to be part of the 'select_statement'. I expect that you are missing a ';' between the two rules.
Most of your 'select_statement' clauses are optional, so after matching the select and from clauses, if Antlr thinks that the balance of the input is better matched as a 'virtual_column_statement', then it will take that path.
Your choices are:
1) make your select_statement comprehensive and at least as general as your 'virtual_column_statement';
2) require a keyword at the beginning of the 'virtual_column_statement' to prevent Antlr from considering it as a partial alternate;
3) put the 'virtual_column_statement' in a separate parser grammar and don't send it any select input text.

string recursion antlr lexer token

How do I build a token in lexer that can handle recursion inside as this string:
${*anything*${*anything*}*anything*}
?
Yes, you can use recursion inside lexer rules.
Take the following example:
${a ${b} ${c ${ddd} c} a}
which will be parsed correctly by the following grammar:
parse
: DollarVar
;
DollarVar
: '${' (DollarVar | EscapeSequence | ~Special)+ '}'
;
fragment
Special
: '\\' | '$' | '{' | '}'
;
fragment
EscapeSequence
: '\\' Special
;
as the interpreter inside ANTLRWorks shows (screenshot: http://img185.imageshack.us/img185/5471/recq.png).
ANTLR's lexers do support recursion, as @BartK adeptly points out in his post, but you will only see a single token within the parser. If you need to interpret the various pieces within that token, you'll probably want to handle it within the parser.
IMO, you'd be better off doing something in the parser:
variable: DOLLAR LBRACE id variable id RBRACE;
By doing something like the above, you'll see all the necessary pieces and can build an AST or otherwise handle accordingly.
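To illustrate the difference between one opaque token and a structure whose nested pieces are visible, here is a small hand-written recursive-descent sketch in Python. It is not ANTLR output, and `parse_var` is a name invented for this example. As a simplification it treats a backslash as escaping whatever character follows, and it accepts a lone `$` or `{` as literal text, which the lexer grammar above does not.

```python
def parse_var(text, pos=0):
    """Parse one ${...} starting at pos; return (tree, next_pos).

    The tree is a list whose items are literal strings or nested lists.
    """
    if not text.startswith("${", pos):
        raise ValueError(f"expected '${{' at offset {pos}")
    pos += 2
    parts, literal = [], []
    while pos < len(text):
        if text.startswith("${", pos):
            if literal:
                parts.append("".join(literal))
                literal = []
            # Recurse for the nested variable, then continue after it.
            child, pos = parse_var(text, pos)
            parts.append(child)
        elif text[pos] == "}":
            if literal:
                parts.append("".join(literal))
            return parts, pos + 1
        elif text[pos] == "\\" and pos + 1 < len(text):
            literal.append(text[pos + 1])   # escaped character, kept verbatim
            pos += 2
        else:
            literal.append(text[pos])
            pos += 1
    raise ValueError("unterminated '${'")
```

Applied to the example string from the first answer, this yields a nested list in which each inner `${...}` is its own sub-list, which is roughly the shape a parser rule would expose as a tree.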
