Using getCharPositionInLine() with leading spaces in ANTLR4 - antlr4

I am writing grammar for a script which is based on VBScript.
In the script, the variable assignment is done in the usual manner of i=10 and in addition a variation with: Set i=10
The method calls can be done in several ways along with calling methods on objects, like:
Another(10).Call(20).Chain(30)
I treat 'Set' as a keyword in my grammar. However, in some pre-defined classes the developer is allowed to name a method 'Set', so there may be calls like this (let me mark this as line A):
Another(10).Call(20,30).Set 40,50
My grammar:
definition: body EOF;
body: NL_WS* bodyElement NL_WS*;
bodyElement: statement (NL_WS+ statement)* ;
statement: assignment | chainCall;
assignment: (START_SET)? IDENTIFIER WS? EQUALS WS? (chainCall | VALID_NUMBER) ;
chainCall: methodCall ('.' methodCall)* ;
methodCall: IDENTIFIER WS? LPAREN? WS? argumentList? WS? RPAREN?;
argumentList: VALID_NUMBER (WS? COMMA WS? VALID_NUMBER)* ;
START_SET: 'Set' WS;
VALID_NUMBER: [1-9] NUMBER? ;
IDENTIFIER: LETTER LETTER_OR_DIGIT*;
LETTER: [a-zA-Z_];
NUMBER: [0-9];
LETTER_OR_DIGIT: [a-zA-Z0-9_];
EQUALS: '=' ;
LPAREN: '(';
RPAREN: ')';
COMMA: ',';
NL_WS: WS? NEWLINE WS?;
NEWLINE: [\r\n];
WS: [ \t]+;
This fails in what I have marked as line A (where Set is a method call inside an object):
line 10:24 mismatched input 'Set ' expecting IDENTIFIER
1) I do not understand why. My thinking is that, because (START_SET)? appears at the beginning of the assignment rule, 'Set' should only be expected at the start of a statement, so the method call at the end should match IDENTIFIER.
2) When I try with getCharPositionInLine, like:
START_SET: {getCharPositionInLine() == 0}? 'Set' WS;
it works fine, but I then have to deal with another problem: there may be leading whitespace before the 'Set' assignment, like:
' Set k=10'
and in such cases, it fails saying:
line 16:8 mismatched input 'k' expecting {<EOF>, '.', NL_WS}
(in this case, I think it matches with chainCall and not assignment which is understandable as it is not the first character in line).
So, is there an alternate method which will be like 'first character in line minus spaces'?
I also tried:
START_SET: {getCharPositionInLine() == 0}? WS? 'Set' WS;
thinking that the initial WS? will cover the first character in line, but I get the same error.
Any help is appreciated.

I found a way to deal with this issue. I wrote a pre-processor that strips all the leading spaces from each line and then passes the result to ANTLR for parsing. This way I could use {getCharPositionInLine() == 0} successfully.
Also, this helped in keeping the grammar simpler.
HTH.
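
The answer does not show the pre-processor itself. A minimal sketch of what such a step might look like, written in Python, follows; the function name strip_leading_whitespace and the sample script are assumptions made for illustration and are not taken from the original code.

```python
def strip_leading_whitespace(source: str) -> str:
    """Remove leading spaces and tabs from every line, keeping line endings."""
    lines = source.splitlines(keepends=True)
    return "".join(line.lstrip(" \t") for line in lines)


# Hypothetical input: an indented 'Set' assignment and an indented method call.
script = "  Set k=10\n\tAnother(10).Call(20,30).Set 40,50\n"
cleaned = strip_leading_whitespace(script)
print(cleaned)
```

After this step every statement starts at column 0, so the predicate on START_SET can fire for an indented 'Set' assignment. One likely trade-off is that column numbers in ANTLR's error messages then refer to the stripped text rather than to the original file.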

Related

ANTLR: how to debug a misidentified token

I am trying to implement a grammar in Antlr4 for a simple template engine. This engine consists of 3 different clauses:
IF ANSWERED ( variable )
END IF
Variable
Variable can be any upper or lowercase letter including white spaces. Both IF ANSWERED and END IF are always uppercase.
I have written the following grammar/lexer rules so far, but my problem is that IF ANSWERED keeps getting recognized as a Variable and not as 2 tokens IF and ANSWERED.
grammar program;
/**grammar */
command: (ifStart | ifEnd | VARIABLE ) EOF;
ifStart: IF ANSWERED '(' VARIABLE ')';
ifEnd: 'END IF';
/** lexer */
IF: 'IF';
ANSWERED: 'ANSWERED';
TEXT: (LOWERCASE | UPPERCASE | NUMBER) ;
VARIABLE: (TEXT | [ \t\r\n])+;
fragment LOWERCASE: [a-z];
fragment UPPERCASE: [A-Z];
fragment NUMBER: [0-9];
If I try to parse IF ANSWERED ( FirstName ) I get the following output:
[#0,0:10='IF ANSWERED',**<VARIABLE>**,1:0]
[#1,11:11='(',<'('>,1:11]
[#2,12:25='Execution date',<VARIABLE>,1:12]
[#3,26:26=')',<')'>,1:26]
[#4,27:26='<EOF>',<EOF>,1:27]
line 1:0 mismatched input 'IF ANSWERED' expecting 'IF'
I read that Antlr4 is greedy and tries to match the biggest possible token, but I fail to understand what is the correct approach, or how to think through the problem to find a solution.
Correct: ANTLR's lexer is greedy and tries to consume as much input as possible. Because VARIABLE also matches whitespace, IF ANSWERED is tokenised as a single VARIABLE token instead of two separate keywords. You'll need to change the lexer so that the text rule does not match spaces.
Something like this could get you started:
parse
: command* EOF
;
command
: (ifStatement | variable)+
;
ifStatement
: IF ANSWERED '(' variable ')' command* END IF
;
variable
: TEXT
;
IF : 'IF';
END : 'END';
ANSWERED : 'ANSWERED';
TEXT : [a-zA-Z0-9]+;
SPACES : [ \t\r\n]+ -> skip;

Antlr4 grammar wouldn't parse multiline input

I want to write a grammar using Antlr4 that will parse a definition, but I've been struggling to get Antlr to co-operate.
The definition has two kinds of lines, a type and a property. I can get my grammar to parse the type line correctly but it either ignores the property lines or fails to identify PROPERTY_TYPE depending on how I tweak my grammar.
Here is my grammar (attempt # 583):
grammar TypeDefGrammar;
start
: statement+ ;
statement
: type NEWLINE
| property NEWLINE
| NEWLINE ;
type
: TYPE_KEYWORD TYPE_NAME; // e.g. 'type MyType1'
property
: PROPERTY_NAME ':' PROPERTY_TYPE ; // e.g. 'someProperty1: int'
TYPE_KEYWORD
: 'type' ;
TYPE_NAME
: IDENTIFIER ;
PROPERTY_NAME
: IDENTIFIER ;
PROPERTY_TYPE
: IDENTIFIER ;
fragment IDENTIFIER
: (LETTER | '_') (LETTER | DIGIT | '_' )* ;
fragment LETTER
: [a-zA-Z] ;
fragment DIGIT
: [0-9] ;
NEWLINE
: '\r'? '\n' ;
WS
: [ \t] -> skip ;
Here is a sample input:
type SimpleType
intProp1: int
stringProp2 : String
(returns the type but ignores intProp1, stringProp2.)
What am I doing wrong?
Usually when a rule does not match the whole input, but does match a prefix of it, it will simply match that prefix and leave the rest of the input in the stream without producing an error. If you want your rule to always match the whole input, you can add EOF to the end of the rule. That way you'll get proper error messages when it can't match the entire input.
So let's change your start rule to start : statement+ EOF;. Now applying start to your input will lead to the following error messages:
line 3:0 extraneous input 'intProp1' expecting {<EOF>, 'type', PROPERTY_NAME, NEWLINE}
line 4:0 extraneous input 'stringProp2' expecting {<EOF>, 'type', PROPERTY_NAME, NEWLINE}
So apparently intProp1 and stringProp2 aren't recognized as PROPERTY_NAMEs. So let's look at which tokens are generated (you can do that using the -tokens option to grun or by just iterating over the token stream in your code):
[#0,0:3='type',<'type'>,1:0]
[#1,5:14='SimpleType',<TYPE_NAME>,1:5]
[#2,15:15='\n',<NEWLINE>,1:15]
[#3,16:16='\n',<NEWLINE>,2:0]
[#4,17:24='intProp1',<TYPE_NAME>,3:0]
[#5,25:25=':',<':'>,3:8]
[#6,27:29='int',<TYPE_NAME>,3:10]
[#7,30:30='\n',<NEWLINE>,3:13]
[#8,31:41='stringProp2',<TYPE_NAME>,4:0]
[#9,43:43=':',<':'>,4:12]
[#10,45:50='String',<TYPE_NAME>,4:14]
[#11,51:51='\n',<NEWLINE>,4:20]
[#12,52:51='<EOF>',<EOF>,5:0]
So all of the identifiers in the code are recognized as TYPE_NAMEs, not PROPERTY_NAMEs. In fact, it is not clear what should distinguish a TYPE_NAME from a PROPERTY_NAME, so now let's actually look at your grammar:
TYPE_NAME
: IDENTIFIER ;
PROPERTY_NAME
: IDENTIFIER ;
PROPERTY_TYPE
: IDENTIFIER ;
fragment IDENTIFIER
: (LETTER | '_') (LETTER | DIGIT | '_' )* ;
Here you have three lexer rules with exactly the same definition. That's a bad sign.
Whenever multiple lexer rules can match on the current input, ANTLR chooses the one that would produce the longest match, picking the one that comes first in the grammar in case of ties. This is known as the maximum munch rule.
If you have multiple rules with the same definition, that means those rules will always match on the same input and they will always produce matches of the same length. So by the maximum munch rule, the first definition (TYPE_NAME) will always be used and the other ones might as well not exist.
The problem basically boils down to the fact that there's nothing that lexically distinguishes the different types of names, so there's no basis on which the lexer could decide which type of name a given identifier represents. That tells us that the names should not be lexer rules. Instead IDENTIFIER should be a lexer rule and the FOO_NAMEs should either be (somewhat unnecessary) parser rules or removed altogether (you can just use IDENTIFIER wherever you're currently using FOO_NAME).
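
To make the maximum munch rule concrete, here is a small Python simulation of the selection logic described above. It is only a sketch: it is not how ANTLR is implemented, the RULES table and the tokenize function are names invented for this example, and the regular expressions are approximations of the lexer rules in the question (the COLON entry stands in for the implicit ':' token).

```python
import re

# Rules in grammar order; patterns approximate the question's lexer rules.
RULES = [
    ("TYPE_KEYWORD", re.compile(r"type")),
    ("TYPE_NAME", re.compile(r"[A-Za-z_][A-Za-z0-9_]*")),
    ("PROPERTY_NAME", re.compile(r"[A-Za-z_][A-Za-z0-9_]*")),
    ("COLON", re.compile(r":")),
    ("NEWLINE", re.compile(r"\r?\n")),
    ("WS", re.compile(r"[ \t]")),  # dropped below, like '-> skip'
]


def tokenize(text):
    """Return (rule name, matched text) pairs using longest match, first rule on ties."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            match = pattern.match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so on a tie the rule listed first is kept.
            if match and match.end() - pos > best_len:
                best_name, best_len = name, match.end() - pos
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        if best_name != "WS":
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


print(tokenize("type SimpleType\nintProp1: int\n"))
```

In this simulation 'type' ties between TYPE_KEYWORD and TYPE_NAME and goes to the earlier rule, every other identifier becomes TYPE_NAME, and PROPERTY_NAME is never produced, which mirrors the token dump shown earlier.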

Eliminating need for ALL(*) parsing in ANTLR 4 Grammar

I am writing an ANTLR 4 grammar for a language that will have switch statements that do not allow fallthrough (similar to C#). All case statements must be terminated by a break statement. Multiple case statements can follow each other without any code in between (again, just like in C#). Here is a snippet of the grammar that captures this:
grammar MyGrammar;
switchStmt : 'switch' '(' expression ')' '{' caseStmt+ '}' ;
caseStmt : (caseOpener)+ statementList breakStmt ;
caseOpener : 'case' literal ':'
| 'default' ':'
;
statementList : statement (statement)* ;
breakStmt : 'break' ';' ;
I left out the definitions of expression and statement for brevity. However, it's important to note that the definition for statement includes breakStmt. This is because break statements can also be used to break out of loops.
In general the grammar is fine: it parses input as expected. However, I get warnings during the parse like "line 18:0 reportAttemptingFullContext d=10 (statementList), input='break;'" and "line 18:0 reportContextSensitivity d=10 (statementList), input='break;'". This makes sense because the parser is not sure whether to match a break statement as statement or as breakStmt and needs to fall back on ALL(*) parsing. My question is, how can I change my grammar in order to eliminate the need for this during the parse and avoid the performance hit? Is it even possible to do without changing the language syntax?
You should remove the breakStmt reference from the end of caseStmt, and instead perform this validation in a listener or visitor after the parse is complete. This offers you the following advantages:
Improved error handling when a user omits the required break statement.
Improved parser performance by removing the ambiguity between the breakStmt at the end of caseStmt and the statementList that precedes it.
I would use the following rules:
switchStmt
: 'switch' '(' expression ')' '{' caseStmt* '}'
;
caseStmt
: caseOpener statementList?
;
statementList
: statement+
;
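
The post-parse validation itself is not shown in the answer. The following Python sketch illustrates the kind of check a listener or visitor could perform once the parse is complete. It works on a simplified stand-in for the parse tree: the CaseStmt class, its fields, and check_switch are hypothetical names, and statements are reduced to plain strings naming their kind.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CaseStmt:
    """Simplified stand-in for one caseStmt node (hypothetical structure)."""
    label: str
    statements: List[str] = field(default_factory=list)  # statement kinds, e.g. "assign", "break"
    line: int = 0


def check_switch(cases: List[CaseStmt]) -> List[str]:
    """Collect an error for every case group that is not terminated by a break."""
    errors = []
    for index, case in enumerate(cases):
        is_last = index == len(cases) - 1
        if not case.statements:
            # An empty body is fine when another case follows (stacked labels).
            if is_last:
                errors.append(f"line {case.line}: '{case.label}' has no body and nothing follows it")
            continue
        if case.statements[-1] != "break":
            errors.append(f"line {case.line}: '{case.label}' must end with 'break;'")
    return errors


cases = [
    CaseStmt("case 1", [], line=1),
    CaseStmt("case 2", ["assign", "break"], line=2),
    CaseStmt("default", ["assign"], line=5),
]
print(check_switch(cases))
```

With an ANTLR-generated listener, the same logic would probably sit in the exit method for caseStmt and inspect the last child of statementList. Because the check runs on an already parsed tree, it can report a precise message for a missing break instead of a generic syntax error.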

antlr does not parse when token only mentions one other token

I am trying to learn EBNF grammars with ANTLR. So I thought I would convert the Wikipedia EBNF grammar to ANTLR 4 and play with it. However I have had a terrible time at it. I was able to reduce the grammar to the one step that generates the problem.
It seems that if I have one token rule that refers solely to another token rule, then ANTLR 4 can't parse the input.
Here is my grammar:
grammar Hello;
program : statement+ ;
statement : IDENTIFIER STATEMENTEND /*| LETTERS STATEMENTEND */ ;
LETTERS : [a-z]+ ;
IDENTIFIER : LETTERS ;
SEMICOLON : [;] ;
STATEMENTEND : SEMICOLON NEWLINE* | NEWLINE+ ;
fragment NEWLINE : '\r' '\n' | '\n' | '\r';
Notice IDENTIFIER refers only to LETTERS.
If I provide this input:
a;
Then I get this error:
line 1:0 mismatched input 'a' expecting IDENTIFIER
(program a ;\n)
However if I uncomment the code and provide the same input I get legit output:
(program (statement a ;\n))
I do not understand why one works and the other does not.
The token a will only be assigned one token type. Since this input text matches both the LETTERS and IDENTIFIER rules, ANTLR 4 will assign the type according to the first rule appearing in the lexer, which means the input a will be a token of type LETTERS.
If you only meant for LETTERS to be a sub-part of other lexer rules, and not form LETTERS tokens themselves, you can declare it as a fragment rule.
fragment LETTERS : [a-z]+;
IDENTIFIER : LETTERS;
In this case, a would be assigned the token type IDENTIFIER and the original parser rule would work.

string recursion antlr lexer token

How do I build a token in lexer that can handle recursion inside as this string:
${*anything*${*anything*}*anything*}
?
Yes, you can use recursion inside lexer rules.
Take the following example:
${a ${b} ${c ${ddd} c} a}
which will be parsed correctly by the following grammar:
parse
: DollarVar
;
DollarVar
: '${' (DollarVar | EscapeSequence | ~Special)+ '}'
;
fragment
Special
: '\\' | '$' | '{' | '}'
;
fragment
EscapeSequence
: '\\' Special
;
as the interpreter inside ANTLRWorks shows (screenshot: http://img185.imageshack.us/img185/5471/recq.png).
ANTLR's lexers do support recursion, as @BartK adeptly points out in his post, but you will only see a single token within the parser. If you need to interpret the various pieces within that token, you'll probably want to handle it within the parser.
IMO, you'd be better off doing something in the parser:
variable: DOLLAR LBRACE id variable id RBRACE;
By doing something like the above, you'll see all the necessary pieces and can build an AST or otherwise handle accordingly.
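
To illustrate what "seeing all the necessary pieces" can mean, here is a hand-written recursive-descent sketch in Python. It is not ANTLR output and not the answerer's code: parse_variable is a name invented for this example, and unlike the lexer rule in the first answer it does not handle escape sequences.

```python
def parse_variable(text, pos=0):
    """Parse one '${ ... }' group starting at pos; return (parts, next_pos).

    parts is a list mixing literal strings and nested lists for inner groups.
    """
    if not text.startswith("${", pos):
        raise ValueError(f"expected '${{' at offset {pos}")
    pos += 2
    parts = []
    literal = []
    while pos < len(text):
        if text.startswith("${", pos):
            # Flush pending literal text, then recurse into the nested group.
            if literal:
                parts.append("".join(literal))
                literal = []
            child, pos = parse_variable(text, pos)
            parts.append(child)
        elif text[pos] == "}":
            if literal:
                parts.append("".join(literal))
            return parts, pos + 1
        else:
            literal.append(text[pos])
            pos += 1
    raise ValueError("unterminated '${'")


tree, end = parse_variable("${a ${b} ${c ${ddd} c} a}")
print(tree)
```

Each nested list corresponds to one inner ${...} group, so later processing can walk the structure instead of re-scanning a single large token.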
