Handling line feed in ANTLR4 grammar with Python target - python-3.x

I am working on an ANTLR4 grammar for parsing Python DSL scripts (basically a subset of Python), with Python 3 as the target. I am having difficulties handling line feeds.
In my grammar, I use lexer::members and NEWLINE embedded code based on Bart Kiers's Python3 grammar for ANTLR4 which are ported to Python so that they can be used with Python 3 runtime for ANTLR instead of Java. My grammar differs from the one provided by Bart (which is almost the same used in the Python 3 spec) since in my DSL I need to target only certain elements of Python. Based on extensive testing of my grammar, I do think that the Python part of the grammar in itself is not the source of the problem and so I won't post it here in full for now.
The input for the grammar is a file, matched by the file_input rule:
file_input: (NEWLINE | statement)* EOF;
The grammar performs rather well on my DSL and produces correct ASTs. The only problem I have is that my lexer rule NEWLINE clutters the AST with \r\n nodes and proves troublesome when trying to extend the generated MyGrammarListener with my own ExtendedListener which inherits from it.
Here is my NEWLINE lexer rule:
NEWLINE
 : ( {self.at_start_of_input()}? SPACES
   | ( '\r'? '\n' | '\r' | '\f' ) SPACES?
   )
   {
import re
from MyParser import MyParser
new_line = re.sub(r"[^\r\n\f]+", "", self._interp.getText(self._input))
spaces = re.sub(r"[\r\n\f]+", "", self._interp.getText(self._input))
next = self._input.LA(1)
if self.opened > 0 or next == '\r' or next == '\n' or next == '\f' or next == '#':
    self.skip()
else:
    self.emit_token(self.common_token(self.NEWLINE, new_line))
    indent = self.get_indentation_count(spaces)
    if len(self.indents) == 0:
        previous = 0
    else:
        previous = self.indents[-1]
    if indent == previous:
        self.skip()
    elif indent > previous:
        self.indents.append(indent)
        self.emit_token(self.common_token(MyParser.INDENT, spaces))
    else:
        while len(self.indents) > 0 and self.indents[-1] > indent:
            self.emit_token(self.create_dedent())
            del self.indents[-1]
   };
The SPACES lexer rule fragment that NEWLINE uses is here:
fragment SPACES
: [ \t]+
;
I feel I should also add that both SPACES and COMMENTS are ultimately being skipped by the grammar, but only after the NEWLINE lexer rule is declared, which, as far as I know, should mean that there are no adverse effects from that, but I wanted to include it just in case.
SKIP_
: ( SPACES | COMMENT ) -> skip
;
When the input file contains no empty lines between statements, everything runs as it should. However, if there are empty lines in my file (such as between import statements and variable assignments), I get the following errors:
line 15:4 extraneous input '\r\n ' expecting {<EOF>, 'from', 'import', NEWLINE, NAME}
line 15:0 extraneous input '\r\n' expecting {<EOF>, 'from', 'import', NEWLINE, NAME}
As I said before, when line feeds are omitted in my input file, the grammar and my ExtendedListener perform as they should, so the problem is definitely with the \r\n not being matched by the NEWLINE lexer rule - even the error statement I get says that it does not match alternative NEWLINE.
The AST produced by my grammar looks like this:
I would really appreciate any help with this, since I cannot see why my NEWLINE lexer rule would fail to match \r\n as it should, and I would like to allow empty lines in my DSL.

"so the problem is definitely with the \r\n not being matched by the NEWLINE lexer rule"
There is another explanation. An LL(1) parser would stop at the first mismatch, but ANTLR4 uses a much smarter ALL(*) strategy with error recovery: it tries to keep matching the input past the mismatch.
As I don't have your statement rule and your input around line 15, I'll demonstrate a possible case with the following grammar :
grammar Question;
/* Extraneous input parsing NL and spaces. */
@lexer::members {
public boolean at_start_of_input() {return true;}; // even if it always returns true, it's not the cause of the problem
}
question
@init {System.out.println("Question last update 2108");}
: ( NEWLINE
| statement
{System.out.println("found <<" + $statement.text + ">>");}
)* EOF
;
statement
: 'line ' NUMBER NEWLINE 'something else' NEWLINE
;
NUMBER : [0-9]+ ;
NEWLINE
: ( {at_start_of_input()}? SPACES
| ( '\r'? '\n' | '\r' | '\f' ) SPACES?
)
;
SKIP_
: SPACES -> skip
;
fragment SPACES
: [ \t]+
;
Input file t.text :
line 1
   something else
Execution :
$ export CLASSPATH=".:/usr/local/lib/antlr-4.6-complete.jar"
$ alias
alias a4='java -jar /usr/local/lib/antlr-4.6-complete.jar'
alias grun='java org.antlr.v4.gui.TestRig'
$ hexdump -C t.text
00000000  6c 69 6e 65 20 31 0a 20  20 20 73 6f 6d 65 74 68  |line 1.   someth|
00000010  69 6e 67 20 65 6c 73 65  0a                       |ing else.|
00000019
$ a4 Question.g4
$ javac Q*.java
$ grun Question question -tokens -diagnostics t.text
[#0,0:4='line ',<'line '>,1:0]
[#1,5:5='1',<NUMBER>,1:5]
[#2,6:9='\n   ',<NEWLINE>,1:6]
[#3,10:23='something else',<'something else'>,2:3]
[#4,24:24='\n',<NEWLINE>,2:17]
[#5,25:24='<EOF>',<EOF>,3:0]
Question last update 2108
found <<line 1
something else
>>
Now change statement like so :
statement
// : 'line ' NUMBER NEWLINE 'something else' NEWLINE
: 'line ' NUMBER 'something else' NEWLINE // now NL will be extraneous
;
and execute again :
$ a4 Question.g4
$ javac Q*.java
$ grun Question question -tokens -diagnostics t.text
[#0,0:4='line ',<'line '>,1:0]
[#1,5:5='1',<NUMBER>,1:5]
[#2,6:9='\n   ',<NEWLINE>,1:6]
[#3,10:23='something else',<'something else'>,2:3]
[#4,24:24='\n',<NEWLINE>,2:17]
[#5,25:24='<EOF>',<EOF>,3:0]
Question last update 2114
line 1:6 extraneous input '\n   ' expecting 'something else'
found <<line 1
something else
>>
Note that the NL character and spaces have been correctly matched by the NEWLINE lexer rule.
You can find the explanation in section 9.1 of The Definitive ANTLR 4 Reference :
$ grun Simple prog
➾ class T ; { int i; }
➾ EOF
❮ line 1:8 extraneous input ';' expecting '{'
The parser reports an error at the ; but gives a slightly more informative answer because it knows that the next token is what it was actually looking for. This feature is called single-token deletion because the parser can simply pretend the extraneous token isn’t there and keep going.
Similarly, the parser can do single-token insertion when it detects a missing token.
In other words, ANTLR4 is so powerful that it can resynchronize the input with the grammar even if several tokens are mismatching. If you run with the -gui option
$ grun Question question -gui t.text
you can see that ANTLR4 has parsed the whole file, despite the fact that a NEWLINE is missing in the statement rule, and that the input does not match exactly the grammar.
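The single-token deletion described in the quote can be pictured with a toy matcher. This is only a sketch of the idea, not ANTLR's actual recovery code: the function name and the flat list of token kinds are made up for illustration, and real ANTLR works on the full grammar rather than on a fixed sequence.

```python
def match_with_single_token_deletion(expected, tokens):
    """Match `tokens` against the fixed sequence `expected` (token kinds).

    On a mismatch, if the token right after the offending one is the
    expected one, report the offender as extraneous and skip over it.
    Returns (matched_ok, errors).
    """
    errors = []
    i = 0
    for want in expected:
        if i < len(tokens) and tokens[i] == want:
            i += 1
            continue
        # Single-token deletion: pretend tokens[i] is not there.
        if i + 1 < len(tokens) and tokens[i + 1] == want:
            errors.append(f"extraneous input {tokens[i]!r} expecting {want!r}")
            i += 2
            continue
        got = tokens[i] if i < len(tokens) else "<EOF>"
        errors.append(f"mismatched input {got!r} expecting {want!r}")
        return False, errors
    return i == len(tokens), errors


# The modified statement rule no longer expects a NEWLINE after NUMBER,
# but the token stream still contains one.
expected = ["line ", "NUMBER", "something else", "NEWLINE"]
tokens = ["line ", "NUMBER", "NEWLINE", "something else", "NEWLINE"]
ok, errors = match_with_single_token_deletion(expected, tokens)
print(ok, errors)
```

The match still succeeds, and the only trace of the problem is the "extraneous input" message, which mirrors what the grun run above reports.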
To summarize: extraneous input is quite a common error when developing a grammar. It can come from a mismatch between the input being parsed and what the rule expects, or from a piece of input being matched by a different token than the one you had in mind; the latter can be detected by examining the list of tokens produced by the -tokens option.
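One more thing that may be worth checking in the Python port of the NEWLINE action: the Java original compares `_input.LA(1)` with char literals. Assuming the Python 3 runtime's `LA(1)` returns an integer code point (this is my understanding of its input stream, not something shown in the question), a comparison such as `next == '\n'` is always False in Python, so the branch that skips blank lines would never be taken. A minimal sketch of the difference, with a hypothetical helper name:

```python
EOF = -1  # assumption: the runtime signals end of input with -1


def next_char_triggers_skip(opened, next_code_point):
    """Decide whether the NEWLINE token should be skipped.

    `next_code_point` is assumed to be an int, as returned by LA(1).
    """
    if opened > 0:
        return True
    if next_code_point == EOF:
        return False
    # Convert the code point to a one-character string before comparing.
    return chr(next_code_point) in ("\r", "\n", "\f", "#")


# A direct port of the Java comparison: an int never equals a str.
direct_port_matches = (ord("\n") == "\n")
print(direct_port_matches)                      # False
print(next_char_triggers_skip(0, ord("\n")))    # True
```

If that assumption holds for your runtime version, converting with `chr()` (and guarding against EOF) before the comparison would restore the behaviour of the Java original.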

Related

Lexer rule to handle escape of quote with quote or backslash in ANTLR4?

I'm trying to expand the answer to How do I escape an escape character with ANTLR 4? to work when the " can be escaped both with " and \. I.e. both
"Rob ""Commander Taco"" Malda is smart."
and
"Rob \"Commander Taco\" Malda is smart."
are valid and equivalent. I've tried
StringLiteral : '"' ('""'|'\\"'|~["])* '"';
but it fails to match
"Entry Flag for Offset check and for \"don't start Chiller Water Pump Request\""
with the tokenizer consuming more characters than intended, i.e. consumes beyond \""
Anyone who knows how to define the lexer rule?
A bit more detail...
"" succeeds
"""" succeeds
"\" " succeeds
"\"" succeeds (at EOF)
"\""\n"" fails (it greedily pulls in the \n and the ")
Example: (test.txt)
""
""""
"\" "
"\""
""
grun test tokens -tokens < test.txt
line 5:1 token recognition error at: '"'
[#0,0:1='""',<StringLiteral>,1:0]
[#1,2:2='\n',<'
'>,1:2]
[#2,3:6='""""',<StringLiteral>,2:0]
[#3,7:7='\n',<'
'>,2:4]
[#4,8:12='"\" "',<StringLiteral>,3:0]
[#5,13:13='\n',<'
'>,3:5]
[#6,14:19='"\""\n"',<StringLiteral>,4:0]
[#7,21:20='<EOF>',<EOF>,5:2]
\"" and """ at the end of a StringLiteral are not being handled the same.
Here's the ATN for that rule:
From this diagram it's not clear why they should be handled differently. They appear to be parallel constructs.
More research
Test Grammar (small change to simplify ATN):
grammar test
;
start: StringLiteral (WS? StringLiteral)+;
StringLiteral: '"' ( (('\\' | '"') '"') | ~["])* '"';
WS: [ \t\n\r]+;
The ATN for StringLiteral in this grammar:
OK, let's walk through this ATN with the input "\""\n"
unconsumed input    transition
"\""\n"             1 -ε-> 5
"\""\n"             5 -"-> 11
\""\n"              11 -ε-> 9
\""\n"              9 -ε-> 6
\""\n"              6 -\-> 7
""\n"               7 -"-> 10
"\n"                10 -ε-> 13
"\n"                13 -ε-> 11
"\n"                11 -ε-> 12
"\n"                12 -ε-> 14
"\n"                14 -"-> 15
\n"                 15 -ε-> 2
We should reach State 2 with the " before the \n, which would be the desired behavior.
Instead, we see it continue on to consume the \n and the next "
line 2:1 token recognition error at: '"'
[#0,0:5='"\""\n"',<StringLiteral>,1:0]
[#1,7:6='<EOF>',<EOF>,2:2]
In order for this to be valid, there must be a path from state 11 to state 2 that consumes a \n and a " (and I'm not seeing it)
Maybe I'm missing something, but it's looking more and more like a bug to me.
The problem is handling the \ properly.
Bart found the path through the ATN that I missed, the one that allows the rule to match the extra \n". The \ is matched as a ~["], and the lexer then comes back through and matches the " to terminate the string.
We could disallow \ in the "everything but a "" alternative (~["\\]), but then a stand-alone \ has to be accepted some other way. We'd want to add an alternative that allows a \ followed by anything other than a ". You'd think that '\\' ~["] does that, and you'd be right, up to a point, but it also consumes the character following the \. That is a problem for a string like "test \\" string": once the second \ has been consumed, the \" alternative can no longer match. What you're looking for is a lookahead (i.e. consume the \ if it's not followed by a ", but don't consume the following character). ANTLR lexer rules have no syntax for lookahead, though; the closest you can get is a target-specific semantic predicate that inspects _input.LA(1).
You'll notice that most grammars that allow \" as an escape sequence in strings also require a bare \ to be escaped (\\), and frequently treat other \ (other character) sequences as just the "other character".
If escaping the \ character is acceptable, the rule could be simplified to:
StringLiteral: '"' ('\\' . | '""' | ~["\\])* '"';
"Flag for \\"Chiller Water\\"" would not parse correctly, but "Flag for \\\"Chiller Water\\\"" would. Without lookahead, I'm not seeing a way to lex the first version.
Also, note that if you don't escape the \, then you have an ambiguous interpretation of \"". Is it \" followed by a " to terminate the string, or \ followed by "" allowing the string to continue? ANTLR will take whichever interpretation consumes the most input, so we see it using the second interpretation and pulling in characters until it finds a "
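To make the ambiguity concrete, here is a small sketch that enumerates every position at which a StringLiteral starting at offset 0 could end, for both the original rule and the simplified one. It is a hand-written simulation of the two rules (the function name and the flag are made up for illustration), not something generated by ANTLR; the lexer's choice corresponds to the largest end position.

```python
def literal_ends(text, escape_backslash):
    """Return all end offsets of a StringLiteral that starts at offset 0.

    escape_backslash=False models  '"' ( '""' | '\\"' | ~["] )* '"'
    escape_backslash=True  models  '"' ( '\\' . | '""' | ~["\\] )* '"'
    """
    if not text.startswith('"'):
        return set()
    ends, seen, stack = set(), set(), [1]
    while stack:
        pos = stack.pop()
        if pos in seen or pos >= len(text):
            continue
        seen.add(pos)
        ch = text[pos]
        if ch == '"':
            ends.add(pos + 1)                  # closing quote
            if text[pos:pos + 2] == '""':
                stack.append(pos + 2)          # '""' alternative
        elif ch == "\\":
            if escape_backslash:
                if pos + 1 < len(text):
                    stack.append(pos + 2)      # '\\' . alternative
            else:
                if text[pos:pos + 2] == '\\"':
                    stack.append(pos + 2)      # '\\"' alternative
                stack.append(pos + 1)          # backslash matched by ~["]
        else:
            stack.append(pos + 1)              # ordinary character
    return ends


sample = '"\\""\n"'   # the six characters  " \ " " <LF> "
print(sorted(literal_ends(sample, escape_backslash=False)))  # [3, 4, 6]
print(sorted(literal_ends(sample, escape_backslash=True)))   # [4]
```

With the original rule there are three ways to end the token, and the longest one (offset 6) swallows the line break and the following quote, which matches the `0:5` token in the output above. With the simplified rule only the intended end (offset 4) remains.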
I cannot reproduce it. Given the grammar:
grammar T;
parse
: .*? EOF
;
StringLiteral
: '"' ( '""' | '\\"' | ~["] )* '"'
;
Other
: . -> skip
;
The following code:
String source =
"\"Rob \"\"Commander Taco\"\" Malda is smart.\"\n" +
"\"Rob \\\"Commander Taco\\\" Malda is smart.\"\n" +
"\"Entry Flag for Offset check and for \\\"don't start Chiller Water Pump Request\\\"\"\n";
TLexer lexer = new TLexer(CharStreams.fromString(source));
CommonTokenStream stream = new CommonTokenStream(lexer);
stream.fill();
for (Token t : stream.getTokens()) {
System.out.printf("%-20s '%s'\n",
TLexer.VOCABULARY.getSymbolicName(t.getType()),
t.getText().replace("\n", "\\n"));
}
produces the following output:
StringLiteral '"Rob ""Commander Taco"" Malda is smart."'
StringLiteral '"Rob \"Commander Taco\" Malda is smart."'
StringLiteral '"Entry Flag for Offset check and for \"don't start Chiller Water Pump Request\""'
Tested with ANTLR 4.9.3 and 4.10.1: both produce the same output.

ANTLR: how to debug a misidentified token

I am trying to implement a grammar in Antlr4 for a simple template engine. This engine consists of 3 different clauses:
IF ANSWERED ( variable )
END IF
Variable
Variable can be any upper or lowercase letter including white spaces. Both IF ANSWERED and END IF are always uppercase.
I have written the following grammar/lexer rules so far, but my problem is that IF ANSWERED keeps getting recognized as a Variable and not as 2 tokens IF and ANSWERED.
grammar program;
/**grammar */
command: (ifStart | ifEnd | VARIABLE ) EOF;
ifStart: IF ANSWERED '(' VARIABLE ')';
ifEnd: 'END IF';
/** lexer */
IF: 'IF';
ANSWERED: 'ANSWERED';
TEXT: (LOWERCASE | UPPERCASE | NUMBER) ;
VARIABLE: (TEXT | [ \t\r\n])+;
fragment LOWERCASE: [a-z];
fragment UPPERCASE: [A-Z];
fragment NUMBER: [0-9];
If I try to parse IF ANSWERED ( FirstName ) I get the following output:
[#0,0:10='IF ANSWERED',<VARIABLE>,1:0]
[#1,11:11='(',<'('>,1:11]
[#2,12:25='Execution date',<VARIABLE>,1:12]
[#3,26:26=')',<')'>,1:26]
[#4,27:26='<EOF>',<EOF>,1:27]
line 1:0 mismatched input 'IF ANSWERED' expecting 'IF'
I read that Antlr4 is greedy and tries to match the biggest possible token, but I fail to understand what is the correct approach, or how to think through the problem to find a solution.
Correct: ANTLR's lexer is greedy, and tries to consume as much as possible. That is why IF ANSWERED is tokenised as a single VARIABLE token instead of 2 separate keywords. You'll need a text rule that does not match spaces, and skip the spaces separately.
Something like this could get you started:
parse
: command* EOF
;
command
: (ifStatement | variable)+
;
ifStatement
: IF ANSWERED '(' variable ')' command* END IF
;
variable
: TEXT
;
IF : 'IF';
END : 'END';
ANSWERED : 'ANSWERED';
TEXT : [a-zA-Z0-9]+;
SPACES : [ \t\r\n]+ -> skip;
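The effect of the greedy matching can be reproduced with a toy maximal-munch tokenizer. This is a rough sketch, not how ANTLR is implemented: the rule tables below are hand-translated from the two grammars into Python regular expressions, the names LPAREN and RPAREN stand in for the implicit '(' and ')' tokens, and the input is written without spaces around the parentheses.

```python
import re


def tokenize(text, rules):
    """Toy maximal-munch lexer.

    `rules` is an ordered list of (name, regex, skip) triples. At each
    position the longest match wins; on a tie the earlier rule wins.
    """
    tokens = []
    pos = 0
    while pos < len(text):
        best = None  # (name, lexeme, skip)
        for name, pattern, skip in rules:
            m = re.compile(pattern).match(text, pos)
            # Strict '>' keeps the earlier rule when lengths are equal.
            if m and m.end() > pos and (best is None or len(m.group()) > len(best[1])):
                best = (name, m.group(), skip)
        if best is None:
            raise ValueError(f"token recognition error at: {text[pos]!r}")
        name, lexeme, skip = best
        if not skip:
            tokens.append((name, lexeme))
        pos += len(lexeme)
    return tokens


ORIGINAL_RULES = [
    ("IF", r"IF", False),
    ("ANSWERED", r"ANSWERED", False),
    ("LPAREN", r"\(", False),
    ("RPAREN", r"\)", False),
    ("TEXT", r"[a-zA-Z0-9]", False),
    ("VARIABLE", r"[a-zA-Z0-9 \t\r\n]+", False),
]

FIXED_RULES = [
    ("IF", r"IF", False),
    ("END", r"END", False),
    ("ANSWERED", r"ANSWERED", False),
    ("LPAREN", r"\(", False),
    ("RPAREN", r"\)", False),
    ("TEXT", r"[a-zA-Z0-9]+", False),
    ("SPACES", r"[ \t\r\n]+", True),
]

source = "IF ANSWERED(FirstName)"
print(tokenize(source, ORIGINAL_RULES))
print(tokenize(source, FIXED_RULES))
```

With the original rules the first token is a VARIABLE spanning "IF ANSWERED"; with the rules from this answer, IF and ANSWERED come out as separate keyword tokens because TEXT can no longer match across the space, and the keyword rules win the tie against TEXT by being listed first.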

Antlr4: Skip line when it start with * unless the second char is

In my input, a line starting with * is a comment line unless it starts with *+ or *-. I can ignore the comments but need to capture the other lines.
These are my lexer rules:
WhiteSpaces : [ \t]+;
Newlines : [\r\n]+;
Commnent : '*' .*? Newlines -> skip ;
SkipTokens : (WhiteSpaces | Newlines) -> skip;
An example:
* this is a comment line
** another comment line
*+ type value
So, the first two are comment lines, and I can skip them. But I don't know how to define a lexer/parser rule that can catch the last line.
Your SkipTokens lexer rule will never be matched because the rules WhiteSpaces and Newlines are placed before it. See this Q&A for an explanation how the lexer matches tokens: ANTLR Lexer rule only seems to work as part of parser rule, and not part of another lexer rule
For it to work as you expect, do this:
SkipTokens : (WhiteSpaces | Newlines) -> skip;
fragment WhiteSpaces : [ \t]+;
fragment Newlines : [\r\n]+;
For what a fragment is, check this Q&A: What does "fragment" mean in ANTLR?
Now, for your question. You defined a Comment rule to always end with a line break. This means that there can't be a comment at the end of your input. So you should let a comment either end with a line break or the EOF.
Something like this should do the trick:
COMMENT
: '*' ~[+\-\r\n] ~[\r\n]* // a '*' must be followed by something other than '+', '-' or a line break
| '*' ( [\r\n]+ | EOF ) // a '*' is a valid comment if directly followed by a line break, or the EOF
;
STAR_MINUS
: '*-'
;
STAR_PLUS
: '*+'
;
SPACES
: [ \t\r\n]+ -> skip
;
This, of course, does not mandate the * to be at the start of the line. If you want that, check out this Q&A: Handle strings starting with whitespaces
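The character classes in the COMMENT rule can be tried out quickly with an ordinary regular expression. The sketch below is a line-based approximation (so, unlike the lexer rule, it does tie the * to the start of the line); the function name and the returned labels are only illustrative.

```python
import re

# '*' followed by something other than '+', '-' or a line break, then the
# rest of the line; or a lone '*' (comment that ends right away).
COMMENT_LINE = re.compile(r"\*(?:[^+\-\r\n][^\r\n]*)?")


def classify(line):
    """Classify one input line (without its trailing line break)."""
    if COMMENT_LINE.fullmatch(line):
        return "COMMENT"
    if line.startswith("*+"):
        return "STAR_PLUS"
    if line.startswith("*-"):
        return "STAR_MINUS"
    return "OTHER"


for line in ["* this is a comment line", "** another comment line", "*+ type value", "*"]:
    print(f"{line!r:30} {classify(line)}")
```

Because the second character is excluded from the comment pattern, the *+ and *- lines fall through to their own branches without any ordering tricks.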

Lexer and Parser rules for a simple command processor

I am attempting to build a simple command processor for a legacy language.
I am working in C# with ANTLR4, version 4.6.6.
I am unable to make progress against one scenario, of several.
The following examples show various sample invocations of the command PKS.
PKS
PKS?
PKStext_that_is_a_filename
The scenario that I can not solve is the PKS command followed by filename.
Command:
PKS
(block (line (expr (command PKS)) (eol \r\n)) <EOF>)
Command:
PKS?
(block (line (expr (command PKS) (query ?)) (eol \r\n)) <EOF>)
Command:
PKSFILENAME
line 1:0 mismatched input 'PKSFILENAME' expecting COMMAND
(block PKSFILENAME \r\n)
Command:
What I believe to be the relevant snippet of the grammar:
block : line+ EOF;
line : (expr eol)+;
expr : command file
| command listOfDouble
| command query
| command
;
command : COMMAND
;
query : QUERY;
file : TEXT ;
eol : EOL;
listOfDouble: DOUBLE (COMMA DOUBLE)* ;
From the lexer:
COMMAND : PKS;
PKS :'PKS' ;
QUERY : '?'
;
fragment LETTER : [A-Z];
fragment DIGIT : [0-9];
fragment UNDER : [_];
TEXT : (LETTER) (LETTER|DIGIT|UNDER)* ;
The main problem here is that your TEXT rule also matches what PKS is supposed to match. Since PKStext_that_is_a_filename can be matched in its entirety by the TEXT rule, TEXT is preferred over the PKS rule, even though PKS appears first in the grammar: the lexer picks the longest match, and rule order only decides when two rules match the same input.
In order to fix that problem you have 2 options:
1. Require whitespace between the keyword (PKS) and the rest of the expression.
2. Change the TEXT rule to explicitly exclude "PKS" as valid input.
Option 2 is certainly possible, but it will get very messy if you have more keywords (as they would all have to be excluded). With whitespace between the keyword and the text, the lexer separates the two for you automatically.
And let me give you a hint for approaching this kind of problem: always check the token list produced by the lexer to see if it generated the tokens you expected. I reworked your grammar a bit, added missing tokens and ran it through my ANTLR4 debugger, which gave me:
Parser error (5, 1): extraneous input 'PKStext_that_is_a_filename' expecting {<EOF>, COMMAND, EOL}
Tokens:
[#0,0:2='PKS',<1>,1:0]
[#1,3:3='\n',<8>,1:3]
[#2,4:4='\n',<8>,2:0]
[#3,5:7='PKS',<1>,3:0]
[#4,8:8='?',<3>,3:3]
[#5,9:9='\n',<8>,3:4]
[#6,10:10='\n',<8>,4:0]
[#7,11:36='PKStext_that_is_a_filename',<7>,5:0]
[#8,37:37='\n',<8>,5:26]
[#9,38:37='<EOF>',<-1>,6:0]
For this input:
PKS
PKS?
PKStext_that_is_a_filename
Here's the grammar I used:
grammar Example;
start: block;
block: line+ EOF;
line: expr? eol;
expr: command (file | listOfDouble | query)?;
command: COMMAND;
query: QUERY;
file: TEXT;
eol: EOL;
listOfDouble: DOUBLE (COMMA DOUBLE)*;
COMMAND: PKS;
PKS: 'PKS';
QUERY: '?';
fragment LETTER: [a-zA-Z];
fragment DIGIT: [0-9];
fragment UNDER: [_];
COMMA: ',';
DOUBLE: DIGIT+ (DOT DIGIT*)?;
DOT: '.';
TEXT: LETTER (LETTER | DIGIT | UNDER)*;
EOL: [\n\r];
and the generated visual parse tree:

Distinguishing literal \n vs embedded newline

I am working on validating the Rust parser's handwritten stuff against a model written in antlr. I am hitting a problem with antlr escaping strings for me:
[15:48:50]~/src/rust2/src/grammar> grun RustLexer tokens -tokens
"\n"
[#0,0:3='"\n"',<46>,1:0]
and
[15:51:15]~/src/rust2/src/grammar> grun RustLexer tokens -tokens
"
"
[#0,0:2='"\n"',<46>,1:0]
Create the same string content. Is there a way for antlr to behave in any other way here? In particular, it would be acceptable if it escaped literal \ to \\, I could then collapse those down in my tool. As it stands, I am losing information about the input.
grun is probably doing the expanding of "\n" to a line break, because the lexer surely won't do this (luckily).
Given the grammar Test:
grammar Test;
parse
: .*? EOF
;
LINE_BREAK
: '\n'
;
OTHER
: .
;
that parses "\n\\n":
TestLexer lexer = new TestLexer(new ANTLRInputStream("\n\\n"));
for (Token token : lexer.getAllTokens()) {
System.out.printf("%s -> <%s>%n", TestLexer.ruleNames[token.getType() - 1], token.getText());
}
which will print the following:
LINE_BREAK -> <
>
OTHER -> <\>
OTHER -> <n>
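One way to picture why the two inputs in the question look identical in the token dump, assuming the dump escapes control characters when it prints the token text (the `display` helper below is a stand-in for that step, not ANTLR code):

```python
def display(text):
    """Escape control characters for printing (assumed display step)."""
    return text.replace("\n", "\\n").replace("\r", "\\r").replace("\t", "\\t")


escaped_source = '"\\n"'    # quote, backslash, n, quote: 4 characters
embedded_newline = '"\n"'   # quote, line feed, quote: 3 characters

print(display(escaped_source), len(escaped_source))
print(display(embedded_newline), len(embedded_newline))
```

Both print as `"\n"`, but the underlying texts differ in length (4 versus 3 characters), which agrees with the stop indices `0:3` and `0:2` in the two dumps from the question. The information is still there in the token itself; only the printed form is ambiguous.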
B.t.w., I presume you are aware of this repository?
