Lexer and Parser rules for a simple command processor - antlr4

I am attempting to build a simple command processor for a legacy language.
I am attempting to work with C# with antlr4 version "ANTLR", "4.6.6")
I am unable to make progress against one scenario, of several.
The following examples shows various sample invocations of the command PKS.
PKS
PKS?
PKStext_that_is_a_filename
The scenario that I can not solve is the PKS command followed by filename.
Command:
PKS
(block (line (expr (command PKS)) (eol \r\n)) <EOF>)
Command:
PKS?
(block (line (expr (command PKS) (query ?)) (eol \r\n)) <EOF>)
Command:
PKSFILENAME
line 1:0 mismatched input 'PKSFILENAME' expecting COMMAND
(block PKSFILENAME \r\n)
Command:
what I believe to be the relevant snippet of grammar:
block : line+ EOF;
line : (expr eol)+;
expr : command file
| command listOfDouble
| command query
| command
;
command : COMMAND
;
query : QUERY;
file : TEXT ;
eol : EOL;
listOfDouble: DOUBLE (COMMA DOUBLE)* ;
From the lexer:
COMMAND : PKS;
PKS :'PKS' ;
QUERY : '?'
;
fragment LETTER : [A-Z];
fragment DIGIT : [0-9];
fragment UNDER : [_];
TEXT : (LETTER) (LETTER|DIGIT|UNDER)* ;

The main problem here is that your TEXT rule also matches what PKS is supposed to match. And since PKStext_that_is_a_filename can entirely be matched by that TEXT rule it is preferred over the PKS rule, even though it appears first in the grammar (if 2 rules match the same input then the first one wins).
In order to fix that problem you have 2 options:
Require whitespace(s) between the keyword (PKS) and the rest of the expression.
Change the TEXT rule to explicitly exclude "PKS" as valid input.
Option 2 is certainly possible, but will get very messy if you have have more keywords (as they all would have to be excluded). With a whitespace between the keywords and the text the lexer would automatically do that for you.
And let me give you a hint to approach such kind of problems: always check the token list produced by the lexer to see if it generated the tokens you expected. I reworked your grammar a bit, added missing tokens and ran it through my ANTLR4 debugger, which gave me:
Parser error (5, 1): extraneous input 'PKStext_that_is_a_filename' expecting {<EOF>, COMMAND, EOL}
Tokens:
[#0,0:2='PKS',<1>,1:0]
[#1,3:3='\n',<8>,1:3]
[#2,4:4='\n',<8>,2:0]
[#3,5:7='PKS',<1>,3:0]
[#4,8:8='?',<3>,3:3]
[#5,9:9='\n',<8>,3:4]
[#6,10:10='\n',<8>,4:0]
[#7,11:36='PKStext_that_is_a_filename',<7>,5:0]
[#8,37:37='\n',<8>,5:26]
[#9,38:37='<EOF>',<-1>,6:0]
For this input:
PKS
PKS?
PKStext_that_is_a_filename
Here's the grammar I used:
grammar Example;
start: block;
block: line+ EOF;
line: expr? eol;
expr: command (file | listOfDouble | query)?;
command: COMMAND;
query: QUERY;
file: TEXT;
eol: EOL;
listOfDouble: DOUBLE (COMMA DOUBLE)*;
COMMAND: PKS;
PKS: 'PKS';
QUERY: '?';
fragment LETTER: [a-zA-Z];
fragment DIGIT: [0-9];
fragment UNDER: [_];
COMMA: ',';
DOUBLE: DIGIT+ (DOT DIGIT*)?;
DOT: '.';
TEXT: LETTER (LETTER | DIGIT | UNDER)*;
EOL: [\n\r];
and the generated visual parse tree:

Related

ANTLR: how to debug a misidentified token

I am trying to implement a grammar in Antlr4 for a simple template engine. This engine consists of 3 different clauses:
IF ANSWERED ( variable )
END IF
Variable
Variable can be any upper or lowercase letter including white spaces. Both IF ANSWERED and END IF are always uppercase.
I have written the following grammar/lexer rules so far, but my problem is that IF ANSWERED keeps getting recognized as a Variable and not as 2 tokens IF and ANSWERED.
grammar program;
/**grammar */
command: (ifStart | ifEnd | VARIABLE ) EOF;
ifStart: IF ANSWERED '(' VARIABLE ')';
ifEnd: 'END IF';
/** lexer */
IF: 'IF';
ANSWERED: 'ANSWERED';
TEXT: (LOWERCASE | UPPERCASE | NUMBER) ;
VARIABLE: (TEXT | [ \t\r\n])+;
fragment LOWERCASE: [a-z];
fragment UPPERCASE: [A-Z];
fragment NUMBER: [0-9];
If I try to parse IF ANSWERED ( FirstName ) I get the following output:
[#0,0:10='IF ANSWERED',**<VARIABLE>**,1:0]
[#1,11:11='(',<'('>,1:11]
[#2,12:25='Execution date',<VARIABLE>,1:12]
[#3,26:26=')',<')'>,1:26]
[#4,27:26='<EOF>',<EOF>,1:27]
line 1:0 mismatched input 'IF ANSWERED' expecting 'IF'
I read that Antlr4 is greedy and tries to match the biggest possible token, but I fail to understand what is the correct approach, or how to think through the problem to find a solution.
Correct: ANTLR's lexer is greedy, and tries to consume as much as possible. That is why IF ANSWERED is tokenised as a TEXT token instead of 2 separate keywords. You'll need to change TEXT so that it does not match spaces.
Something like this could get you started:
parse
: command* EOF
;
command
: (ifStatement | variable)+
;
ifStatement
: IF ANSWERED '(' variable ')' command* END IF
;
variable
: TEXT
;
IF : 'IF';
END : 'END';
ANSWERED : 'ANSWERED';
TEXT : [a-zA-Z0-9]+;
SPACES : [ \t\r\n]+ -> skip;

Antlr4 grammar wouldn't parse multiline input

I want to write a grammar using Antlr4 that will parse a some definition but I've been struggling to get Antlr to co-operate.
The definition has two kinds of lines, a type and a property. I can get my grammar to parse the type line correctly but it either ignores the property lines or fails to identify PROPERTY_TYPE depending on how I tweak my grammar.
Here is my grammar (attempt # 583):
grammar TypeDefGrammar;
start
: statement+ ;
statement
: type NEWLINE
| property NEWLINE
| NEWLINE ;
type
: TYPE_KEYWORD TYPE_NAME; // e.g. 'type MyType1'
property
: PROPERTY_NAME ':' PROPERTY_TYPE ; // e.g. 'someProperty1: int'
TYPE_KEYWORD
: 'type' ;
TYPE_NAME
: IDENTIFIER ;
PROPERTY_NAME
: IDENTIFIER ;
PROPERTY_TYPE
: IDENTIFIER ;
fragment IDENTIFIER
: (LETTER | '_') (LETTER | DIGIT | '_' )* ;
fragment LETTER
: [a-zA-Z] ;
fragment DIGIT
: [0-9] ;
NEWLINE
: '\r'? '\n' ;
WS
: [ \t] -> skip ;
Here is a sample input:
type SimpleType
intProp1: int
stringProp2 : String
(returns the type but ignores intProp1, stringProp2.)
What am I doing wrong?
Usually when a rule does not match the whole input, but does match a prefix of it, it will simply match that prefix and leave the rest of the input in the stream without producing an error. If you want your rule to always match the whole input, you can add EOF to the end of the rule. That way you'll get proper error messages when it can't match the entire input.
So let's change your start rule to start : statement+ EOF;. Now applying start to your input will lead to the following error messages:
line 3:0 extraneous input 'intProp1' expecting {, 'type', PROPERTY_NAME, NEWLINE}
line 4:0 extraneous input 'stringProp2' expecting {, 'type', PROPERTY_NAME, NEWLINE}
So apparently intProp1 and stringProp2 aren't recognized as PROPERTY_NAMEs. So let's look at which tokens are generated (you can do that using the -tokens option to grun or by just iterating over the token stream in your code):
[#0,0:3='type',<'type'>,1:0]
[#1,5:14='SimpleType',<TYPE_NAME>,1:5]
[#2,15:15='\n',<NEWLINE>,1:15]
[#3,16:16='\n',<NEWLINE>,2:0]
[#4,17:24='intProp1',<TYPE_NAME>,3:0]
[#5,25:25=':',<':'>,3:8]
[#6,27:29='int',<TYPE_NAME>,3:10]
[#7,30:30='\n',<NEWLINE>,3:13]
[#8,31:41='stringProp2',<TYPE_NAME>,4:0]
[#9,43:43=':',<':'>,4:12]
[#10,45:50='String',<TYPE_NAME>,4:14]
[#11,51:51='\n',<NEWLINE>,4:20]
[#12,52:51='<EOF>',<EOF>,5:0]
So all of the identifiers in the code are recognized as TYPE_NAMEs, not PROPERTY_NAMEs. In fact, it is not clear what should distinguish a TYPE_NAME from a PROPERTY_NAME, so now let's actually look at your grammar:
TYPE_NAME
: IDENTIFIER ;
PROPERTY_NAME
: IDENTIFIER ;
PROPERTY_TYPE
: IDENTIFIER ;
fragment IDENTIFIER
: (LETTER | '_') (LETTER | DIGIT | '_' )* ;
Here you have three lexer rules with exactly the same definition. That's a bad sign.
Whenever multiple lexer rules can match on the current input, ANTLR chooses the one that would produce the longest match, picking the one that comes first in the grammar in case of ties. This is known as the maximum munch rule.
If you have multiple rules with the same definition, that means those rules will always match on the same input and they will always produce matches of the same length. So by the maximum much rule, the first definition (TYPE_NAME) will always be used and the other ones might as well not exist.
The problem basically boils down to the fact that there's nothing that lexically distinguishes the different types of names, so there's no basis on which the lexer could decide which type of name a given identifier represents. That tells us that the names should not be lexer rules. Instead IDENTIFIER should be a lexer rule and the FOO_NAMEs should either be (somewhat unnecessary) parser rules or removed altogether (you can just use IDENTIFIER wherever you're currently using FOO_NAME).

antlr tokenizer starts with the last token

I have the following grammar:
grammar Aligner;
line
: emptyLine
| codeLine
;
emptyLine
: ( KW_EMPTY KW_LINE )?
( EOL | EOF )
;
codeLine
: KW_LINE COLON
indent
CODE
( EOL | EOF )
;
indent
: absolute_indent
| relative_indent
;
absolute_indent
: NUMBER
;
relative_indent
: sign NUMBER
;
sign
: PLUS
| MINUS
;
COLON: ':';
MINUS: '-';
PLUS: '+';
KW_EMPTY: 'empty';
KW_LINE: 'line';
NUMBER
: DIGIT+
;
EOL
: ('\n' | '\r\n')
;
SPACING
: LINE_WS -> skip
;
CODE
: (~('\n' | '\r'))+
;
fragment
DIGIT
: '0'..'9'
;
fragment
LINE_WS
: ' '
| '\t'
| '\u000C'
;
when I try to parse - empty line I receive error: line 1:0 no viable alternative at input 'empty line'. When I debug what is going on, the very first token is from type CODE and includes the whole line.
What I am doing wrong?
ANTLR will try to match the longest possible token. When two lexer rules match the same string of a given length, the first rule that appears in the grammar wins.
You rule CODE is basically a catch-all: it will match whole lines of text. So here ANTLR has the choice of matching empty line as one single token of type CODE, and as no other rule can produce a token of length 10, the CODE rule will consume the whole line.
You should rewrite the CODE rule to make it match only what you mean by a code. Right now it's way too broad.

Incorrect Result When ANTLR4 Lexer Action Invokes getText()

It seems that the getText() in a lexer action cannot retrieve the token being matched correctly. Is it a normal behaviour? For example, part of my grammar has these rules for
parsing a C++ style identifier that support a \u sequence to embed unicode characters as part of the identifier name:
grammar CPPDefine;
cppCompilationUnit: (id_token|ALL_OTHER_SYMBOL)+ EOF;
id_token:IDENTIFIER //{System.out.println($text);}
;
CRLF: '\r'? '\n' -> skip;
ALL_OTHER_SYMBOL: '\\';
IDENTIFIER: (NONDIGIT (NONDIGIT | DIGIT)*)
{System.out.println(getText());}
;
fragment DIGIT: [0-9];
fragment NONDIGIT: [_a-zA-Z] | UNIVERSAL_CHARACTER_NAME ;
fragment UNIVERSAL_CHARACTER_NAME: ('\\u' HEX_QUAD | '\\U' HEX_QUAD HEX_QUAD ) ;
fragment HEX_QUAD: [0-9A-Fa-f] [0-9A-Fa-f] [0-9A-Fa-f] [0-9A-Fa-f];
Tested with this 1 line input containing an identifier with incorrect unicode escape sequence:
dkk\uzzzz
The $text of the id_token parser rule action produces this correct result:
dkk
uzzzz
i.e. input interpreted as 2 identifiers separated by a symbol '\' (symbol '\' not printed by any parser rule).
However, the getText() of IDENTIFIER lexer rule action produces this incorrect result:
dkk\u
uzzzz
Why the lexer rule IDENTIFIER's getText() is different from the parser id_token rule's $text. Afterall, the parser rule contains only this lexer rule?
EDIT:
Issue observed in ANTLR4.1 but not in ANTLR4.2 so it could have been fixed already.
It's hard to tell based on your example, but my instinct is you are using an old version of ANTLR. I am unable to reproduce this issue in ANTLR 4.2.

antlr does not parse when token only mentions one other token

I am trying to learn EBNF grammars with ANTLR. So I thought I would convert the Wikipedia EBNF grammar to ANTLR 4 and play with it. However I have had a terrible time at it. I was able to reduce the grammar to the one step that generates the problem.
It seems if I have one token reference solely another token then ANTLR 4 can't parse the input.
Here is my grammar:
grammar Hello;
program : statement+ ;
statement : IDENTIFIER STATEMENTEND /*| LETTERS STATEMENTEND */ ;
LETTERS : [a-z]+ ;
IDENTIFIER : LETTERS ;
SEMICOLON : [;] ;
STATEMENTEND : SEMICOLON NEWLINE* | NEWLINE+ ;
fragment NEWLINE : '\r' '\n' | '\n' | '\r';
Notice IDENTIFIER refers only to LETTERS.
If I provide this input:
a;
Then I get this error:
line 1:0 mismatched input 'a' expecting IDENTIFIER
(program a ;\n)
However if I uncomment the code and provide the same input I get legit output:
(program (statement a ;\n))
I do not understand why one works and the other does not.
The token a will only be assigned one token type. Since this input text matches both the LETTERS and IDENTIFIER rules, ANTLR 4 will assign the type according to the first rule appearing in the lexer, which means the input a will be a token of type LETTERS.
If you only meant for LETTERS to be a sub-part of other lexer rules, and not form LETTERS tokens themselves, you can declare it as a fragment rule.
fragment LETTERS : [a-z]+;
IDENTIFIER : LETTERS;
In this case, a would be assigned the token type IDENTIFIER and the original parser rule would work.

Resources