Distinguishing literal \n vs embedded newline - antlr4

I am working on validating the Rust parser's handwritten code against a model written in ANTLR. I am hitting a problem with ANTLR escaping strings for me:
[15:48:50]~/src/rust2/src/grammar> grun RustLexer tokens -tokens
"\n"
[@0,0:3='"\n"',<46>,1:0]
and
[15:51:15]~/src/rust2/src/grammar> grun RustLexer tokens -tokens
"
"
[@0,0:2='"\n"',<46>,1:0]
produce the same displayed token text. Is there a way to make ANTLR behave differently here? In particular, it would be acceptable if it escaped a literal \ to \\; I could then collapse those down in my tool. As it stands, I am losing information about the input.

grun is probably the one rendering the embedded line break as \n when it prints the token text; the lexer itself surely won't do this (luckily).
Given the grammar Test:
grammar Test;
parse
: .*? EOF
;
LINE_BREAK
: '\n'
;
OTHER
: .
;
that parses "\n\\n":
TestLexer lexer = new TestLexer(new ANTLRInputStream("\n\\n"));
for (Token token : lexer.getAllTokens()) {
    System.out.printf("%s -> <%s>%n", TestLexer.ruleNames[token.getType() - 1], token.getText());
}
which will print the following:
LINE_BREAK -> <
>
OTHER -> <\>
OTHER -> <n>
B.t.w., I presume you are aware of this repository?
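If you need grun-style output that keeps the two inputs distinguishable, one option is to print the tokens yourself and escape the backslash before encoding the line break. A minimal sketch, reusing the TestLexer from above:

TestLexer lexer = new TestLexer(new ANTLRInputStream("\n\\n"));
for (Token token : lexer.getAllTokens()) {
    String text = token.getText()
            .replace("\\", "\\\\")   // escape literal backslashes first
            .replace("\n", "\\n");   // then encode real line breaks
    System.out.printf("%s -> <%s>%n", TestLexer.ruleNames[token.getType() - 1], text);
}

A literal \n in the input now prints as \\n while an embedded newline prints as \n, so nothing is lost.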

Related

Lexer rule to handle escape of quote with quote or backslash in ANTLR4?

I'm trying to expand the answer to How do I escape an escape character with ANTLR 4? to work when the " can be escaped with either " or \. I.e. both
"Rob ""Commander Taco"" Malda is smart."
and
"Rob \"Commander Taco\" Malda is smart."
are valid and equivalent. I've tried
StringLiteral : '"' ('""'|'\\"'|~["])* '"';
but it fails to match
"Entry Flag for Offset check and for \"don't start Chiller Water Pump Request\""
with the tokenizer consuming more characters than intended, i.e. it consumes beyond the \""
Does anyone know how to define the lexer rule?
A bit more detail...
"" succeeds
"""" succeeds
\" " succeeds
"\"" succeeds (at EOF)
"\""\n"" fails (it greedily pulls in the \n and "
Example: (test.txt)
""
""""
"\" "
"\""
""
grun test tokens -tokens < test.txt
line 5:1 token recognition error at: '"'
[@0,0:1='""',<StringLiteral>,1:0]
[@1,2:2='\n',<'
'>,1:2]
[@2,3:6='""""',<StringLiteral>,2:0]
[@3,7:7='\n',<'
'>,2:4]
[@4,8:12='"\" "',<StringLiteral>,3:0]
[@5,13:13='\n',<'
'>,3:5]
[@6,14:19='"\""\n"',<StringLiteral>,4:0]
[@7,21:20='<EOF>',<EOF>,5:2]
\"" and """ at the end of a StringListeral are not being handled the same.
Here's the ATN for that rule:
From this diagram it's not clear why they should be handled differently. They appear to be parallel constructs.
More research
Test Grammar (small change to simplify ATN):
grammar test;
start: StringLiteral (WS? StringLiteral)+;
StringLiteral: '"' ( (('\\' | '"') '"') | ~["])* '"';
WS: [ \t\n\r]+;
The ATN for StringLiteral in this grammar:
OK, let's walk through this ATN with the input "\""\n"
unconsumed input    transition
"\""\n"             1 -ε-> 5
"\""\n"             5 -"-> 11
\""\n"              11 -ε-> 9
\""\n"              9 -ε-> 6
\""\n"              6 -\-> 7
""\n"               7 -"-> 10
"\n"                10 -ε-> 13
"\n"                13 -ε-> 11
"\n"                11 -ε-> 12
"\n"                12 -ε-> 14
"\n"                14 -"-> 15
\n"                 15 -ε-> 2
We should reach state 2 having consumed the " before the \n, which would be the desired behavior.
Instead, we see it continue on to consume the \n and the next "
line 2:1 token recognition error at: '"'
[@0,0:5='"\""\n"',<StringLiteral>,1:0]
[@1,7:6='<EOF>',<EOF>,2:2]
In order for this to be valid, there must be a path from state 11 to state 2 that consumes a \n and a " (and I'm not seeing it)
Maybe I'm missing something, but it's looking more and more like a bug to me.
The problem is handling the \ properly.
Bart found the path through the ATN that I missed, which allows it to match the extra \n". The \ is matched as ~["], which lets the string stay open and run on until a later " terminates it.
We could disallow \ in the "everything but a "" alternative (~["\\]), but then we have to allow a stand-alone \ to be acceptable. We'd want to add an alternative that allows a \ followed by anything other than a ". You'd think that '\\' ~["] does that, and you'd be right, up to a point, but it also consumes the character following the \, which is a problem if you want a string like "test \\" string": since it has consumed the second \, you can't match the \" alternative. What you're looking for is a lookahead (i.e. consume the \ if it's not followed by a ", but don't consume the following character), but ANTLR lexer rules don't allow lookaheads (ANTLR lexer can't lookahead at all).
You'll notice that most grammars that allow \" as an escape sequence in strings also require a bare \ to be escaped (\\), and frequently treat other \ (other character) sequences as just the "other character".
If escaping the \ character is acceptable, the rule could be simplified to:
StringLiteral: '"' ('\\' . | '""' | ~["\\])* '"';
"Flag for \\"Chiller Water\\"" would not parse correctly, but "Flag for \\\"Chiller Water\\\"" would. Without lookahead, I'm not seeing a way to Lex the first version.
Also, note that if you don't escape the \, then you have an ambiguous interpretation of \"": is it \" followed by a " to terminate the string, or \ followed by "" allowing the string to continue? ANTLR will take whichever interpretation consumes the most input, so we see it using the second interpretation and pulling in characters until it finds a ".
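A minimal sketch of the escaped-backslash rule in action, assuming a lexer generated from a grammar containing the StringLiteral rule above plus a WS : [ \t\r\n]+ -> skip ; rule (the name FixedLexer is made up for the example):

// The previously problematic input: "\""  <line break>  ""
FixedLexer lexer = new FixedLexer(CharStreams.fromString("\"\\\"\"\n\"\"\n"));
for (Token t : lexer.getAllTokens()) {
    System.out.printf("%s '%s'%n",
            FixedLexer.VOCABULARY.getSymbolicName(t.getType()),
            t.getText().replace("\n", "\\n"));
}
// Since '\\' . consumes the escape pair, the loop cannot swallow the closing
// quote, so this should yield two StringLiteral tokens: "\"" and "".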
I cannot reproduce it. Given the grammar:
grammar T;
parse
: .*? EOF
;
StringLiteral
: '"' ( '""' | '\\"' | ~["] )* '"'
;
Other
: . -> skip
;
The following code:
String source =
        "\"Rob \"\"Commander Taco\"\" Malda is smart.\"\n" +
        "\"Rob \\\"Commander Taco\\\" Malda is smart.\"\n" +
        "\"Entry Flag for Offset check and for \\\"don't start Chiller Water Pump Request\\\"\"\n";
TLexer lexer = new TLexer(CharStreams.fromString(source));
CommonTokenStream stream = new CommonTokenStream(lexer);
stream.fill();
for (Token t : stream.getTokens()) {
    System.out.printf("%-20s '%s'\n",
            TLexer.VOCABULARY.getSymbolicName(t.getType()),
            t.getText().replace("\n", "\\n"));
}
produces the following output:
StringLiteral '"Rob ""Commander Taco"" Malda is smart."'
StringLiteral '"Rob \"Commander Taco\" Malda is smart."'
StringLiteral '"Entry Flag for Offset check and for \"don't start Chiller Water Pump Request\""'
Tested with ANTLR 4.9.3 and 4.10.1: both produce the same output.

Atleast ONE Space around Parenthesis in ANTLR4

I want spaces around the parentheses in an IF condition. At least one space is required. But when I use SPACE in the grammar, it throws an error when I use an ELSE block with it. Please help me accomplish this; I have seen many examples but none covers this case.
I only need spaces around the parentheses of the IF condition.
prog: stat_block EOF;
stat_block: OBRACE block CBRACE;
block: (stat (stat)*)?;
stat: expr ';'
| IF condition_block (ELSE stat_block)?
;
expr
: expr SPACE ('*' | '/') SPACE expr
| ID
| INT
| STRING
;
exprList: expr (',' expr)*;
condition_block: SPACE OPAR SPACE expr SPACE CPAR SPACE stat_block;
IF: 'IF';
ELSE: 'ELSE';
OPAR: '(';
CPAR: ')';
OBRACE: '{';
CBRACE: '}';
SPACE: SINGLE_SPACE+;
SINGLE_SPACE: ' ';
ID: [a-zA-Z]+;
INT: [0-9]+;
NEWLINE: '\r'? '\n' -> skip;
WS: [ \t]+ -> skip;
Expected input to parse
IF ( 3 ) { } ELSE { }
Current Input
There's a reason that almost all languages ignore whitespace. If you don't ignore it, then you have to deal with its possible existence in the token stream anywhere it might, or might not, be in ALL of your parser rules.
You can try to include the spaces in the lexer rules for tokens that you want wrapped in spaces, but you may still find surprises.
Suggestion: instead of -> skip for your WS rule, use -> channel(HIDDEN). This keeps the tokens in the token stream so you can look for them in your code, but "hides" the whitespace tokens from the parser rules. This also allows ANTLR to get a proper interpretation of your input and build a parse tree that represents it correctly.
If you REALLY want to insist on the spaces before/after, you can write code in a listener that looks before/after the tokens in the input stream to see if you have whitespace, and generate your own error (that can be very specific about your requirement).
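A minimal sketch of that listener approach, assuming WS goes to channel(HIDDEN), the explicit SPACE/SINGLE_SPACE tokens are dropped from the grammar, and ANTLR generated MyGrammarParser and MyGrammarBaseListener (those names, and the Condition_blockContext accessors, are assumptions based on the grammar above):

import org.antlr.v4.runtime.BufferedTokenStream;
import org.antlr.v4.runtime.Token;

public class SpaceCheckListener extends MyGrammarBaseListener {
    private final BufferedTokenStream tokens;

    public SpaceCheckListener(BufferedTokenStream tokens) {
        this.tokens = tokens;
    }

    @Override
    public void enterCondition_block(MyGrammarParser.Condition_blockContext ctx) {
        requireSpaceAround(ctx.OPAR().getSymbol());
        requireSpaceAround(ctx.CPAR().getSymbol());
    }

    private void requireSpaceAround(Token paren) {
        // WS tokens sit on the hidden channel; if none border the parenthesis, complain.
        if (tokens.getHiddenTokensToLeft(paren.getTokenIndex()) == null
                || tokens.getHiddenTokensToRight(paren.getTokenIndex()) == null) {
            System.err.println("line " + paren.getLine() + ":" + paren.getCharPositionInLine()
                    + " '" + paren.getText() + "' must be surrounded by at least one space");
        }
    }
}

After parsing with a CommonTokenStream (which is a BufferedTokenStream), walk the tree with ParseTreeWalker.DEFAULT.walk(new SpaceCheckListener(tokens), tree) and the check runs on every IF condition.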
At least one space is required.
Then you either:
cannot -> skip the WS rule, which will cause all spaces and tabs to become tokens that your parser rules must handle explicitly (which is likely going to become a complete mess in your parser rules!), or
you leave WS -> skip as-is, but include a space in your PAREN rules: OPAR : ' ( '; CPAR: ' ) '; (or with tabs as well if that is possible)

Handling line feed in ANTLR4 grammar with Python target

I am working on an ANTLR4 grammar for parsing Python DSL scripts (a subset of Python, basically) with the target set to Python 3. I am having difficulties handling the line feed.
In my grammar, I use lexer::members and NEWLINE embedded code based on Bart Kiers's Python3 grammar for ANTLR4, ported to Python so that they can be used with the Python 3 runtime for ANTLR instead of Java. My grammar differs from the one provided by Bart (which is almost the same as the one in the Python 3 spec) since in my DSL I only need to target certain elements of Python. Based on extensive testing of my grammar, I do think that the Python part of the grammar in itself is not the source of the problem, so I won't post it here in full for now.
The input for the grammar is a file, matched by the file_input rule:
file_input: (NEWLINE | statement)* EOF;
The grammar performs rather well on my DSL and produces correct ASTs. The only problem I have is that my lexer rule NEWLINE clutters the AST with \r\n nodes and proves troublesome when trying to extend the generated MyGrammarListener with my own ExtendedListener which inherits from it.
Here is my NEWLINE lexer rule:
NEWLINE
: ( {self.at_start_of_input()}? SPACES
| ( '\r'? '\n' | '\r' | '\f' ) SPACES?
)
{
import re
from MyParser import MyParser
new_line = re.sub(r"[^\r\n\f]+", "", self._interp.getText(self._input))
spaces = re.sub(r"[\r\n\f]+", "", self._interp.getText(self._input))
next = self._input.LA(1)
if self.opened > 0 or next == '\r' or next == '\n' or next == '\f' or next == '#':
    self.skip()
else:
    self.emit_token(self.common_token(self.NEWLINE, new_line))
    indent = self.get_indentation_count(spaces)
    if len(self.indents) == 0:
        previous = 0
    else:
        previous = self.indents[-1]
    if indent == previous:
        self.skip()
    elif indent > previous:
        self.indents.append(indent)
        self.emit_token(self.common_token(MyParser.INDENT, spaces))
    else:
        while len(self.indents) > 0 and self.indents[-1] > indent:
            self.emit_token(self.create_dedent())
            del self.indents[-1]
};
The SPACES lexer rule fragment that NEWLINE uses is here:
fragment SPACES
: [ \t]+
;
I feel I should also add that both SPACES and COMMENTS are ultimately being skipped by the grammar, but only after the NEWLINE lexer rule is declared, which, as far as I know, should mean that there are no adverse effects from that, but I wanted to include it just in case.
SKIP_
: ( SPACES | COMMENT ) -> skip
;
When the input file is run without any empty lines between statements, everything runs as it should. However, if there are empty lines in my file (such as between import statements and variable assignment), I get the following errors:
line 15:4 extraneous input '\r\n ' expecting {<EOF>, 'from', 'import', NEWLINE, NAME}
line 15:0 extraneous input '\r\n' expecting {<EOF>, 'from', 'import', NEWLINE, NAME}
As I said before, when line feeds are omitted from my input file, the grammar and my ExtendedListener perform as they should, so the problem is definitely with the \r\n not being matched by the NEWLINE lexer rule; even the error message says that it does not match the NEWLINE alternative.
The AST produced by my grammar looks like this:
I would really appreciate any help with this since I cannot see why my NEWLINE lexer rule would fail to match \r\n as it should, and I would like to allow empty lines in my DSL.
so the problem is definitely with the \r\n not being matched by the
NEWLINE lexer rule
There is another explanation. An LL(1) parser would stop at the first mismatch, but ANTLR4 is a very smart LL(*): it tries to match the input past the mismatch.
As I don't have your statement rule or your input around line 15, I'll demonstrate a possible case with the following grammar:
grammar Question;
/* Extraneous input parsing NL and spaces. */
@lexer::members {
public boolean at_start_of_input() {return true;}; // even if it always returns true, it's not the cause of the problem
}
question
@init {System.out.println("Question last update 2108");}
: ( NEWLINE
| statement
{System.out.println("found <<" + $statement.text + ">>");}
)* EOF
;
statement
: 'line ' NUMBER NEWLINE 'something else' NEWLINE
;
NUMBER : [0-9]+ ;
NEWLINE
: ( {at_start_of_input()}? SPACES
| ( '\r'? '\n' | '\r' | '\f' ) SPACES?
)
;
SKIP_
: SPACES -> skip
;
fragment SPACES
: [ \t]+
;
Input file t.text :
line 1
something else
Execution :
$ export CLASSPATH=".:/usr/local/lib/antlr-4.6-complete.jar"
$ alias
alias a4='java -jar /usr/local/lib/antlr-4.6-complete.jar'
alias grun='java org.antlr.v4.gui.TestRig'
$ hexdump -C t.text
00000000 6c 69 6e 65 20 31 0a 20 20 20 73 6f 6d 65 74 68 |line 1. someth|
00000010 69 6e 67 20 65 6c 73 65 0a |ing else.|
00000019
$ a4 Question.g4
$ javac Q*.java
$ grun Question question -tokens -diagnostics t.text
[@0,0:4='line ',<'line '>,1:0]
[@1,5:5='1',<NUMBER>,1:5]
[@2,6:9='\n ',<NEWLINE>,1:6]
[@3,10:23='something else',<'something else'>,2:3]
[@4,24:24='\n',<NEWLINE>,2:17]
[@5,25:24='<EOF>',<EOF>,3:0]
Question last update 2108
found <<line 1
something else
>>
Now change statement like so :
statement
// : 'line ' NUMBER NEWLINE 'something else' NEWLINE
: 'line ' NUMBER 'something else' NEWLINE // now NL will be extraneous
;
and execute again :
$ a4 Question.g4
$ javac Q*.java
$ grun Question question -tokens -diagnostics t.text
[@0,0:4='line ',<'line '>,1:0]
[@1,5:5='1',<NUMBER>,1:5]
[@2,6:9='\n ',<NEWLINE>,1:6]
[@3,10:23='something else',<'something else'>,2:3]
[@4,24:24='\n',<NEWLINE>,2:17]
[@5,25:24='<EOF>',<EOF>,3:0]
Question last update 2114
line 1:6 extraneous input '\n ' expecting 'something else'
found <<line 1
something else
>>
Note that the NL character and spaces have been correctly matched by the NEWLINE lexer rule.
You can find the explanation in section 9.1 of The Definitive ANTLR 4 Reference :
$ grun Simple prog
➾ class T ; { int i; }
➾ EOF
❮ line 1:8 extraneous input ';' expecting '{'
The parser reports an error at the ; but gives a slightly more informative answer because it knows that the next token is what it was actually looking for. This feature is called single-token deletion because the parser can simply pretend the extraneous token isn't there and keep going.
Similarly, the parser can do single-token insertion when it detects a missing token.
In other words, ANTLR4 is so powerful that it can resynchronize the input with the grammar even if several tokens do not match. If you run with the -gui option
$ grun Question question -gui t.text
you can see that ANTLR4 has parsed the whole file, despite the fact that a NEWLINE is missing in the statement rule, and that the input does not match exactly the grammar.
To summarize: extraneous input is quite a common error when developing a grammar. It can come from a mismatch between the input to parse and the rule expectations, or because some piece of input has been matched as a different token than the one we expect, which can be detected by examining the list of tokens produced by the -tokens option.
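If you prefer checking from code instead of grun, the same token list can be dumped with a few lines of Java; a minimal sketch, assuming the QuestionLexer generated from the grammar above and the t.text input file:

import org.antlr.v4.runtime.*;

public class DumpTokens {
    public static void main(String[] args) throws Exception {
        QuestionLexer lexer = new QuestionLexer(CharStreams.fromFileName("t.text"));
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        tokens.fill();
        for (Token t : tokens.getTokens()) {
            // CommonToken.toString() prints the same [@index,start:stop='text',<type>,line:col] form
            System.out.println(t);
        }
    }
}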

Incorrect Result When ANTLR4 Lexer Action Invokes getText()

It seems that getText() in a lexer action cannot correctly retrieve the token being matched. Is this normal behaviour? For example, part of my grammar has these rules for parsing a C++-style identifier that supports a \u sequence to embed Unicode characters as part of the identifier name:
grammar CPPDefine;
cppCompilationUnit: (id_token|ALL_OTHER_SYMBOL)+ EOF;
id_token:IDENTIFIER //{System.out.println($text);}
;
CRLF: '\r'? '\n' -> skip;
ALL_OTHER_SYMBOL: '\\';
IDENTIFIER: (NONDIGIT (NONDIGIT | DIGIT)*)
{System.out.println(getText());}
;
fragment DIGIT: [0-9];
fragment NONDIGIT: [_a-zA-Z] | UNIVERSAL_CHARACTER_NAME ;
fragment UNIVERSAL_CHARACTER_NAME: ('\\u' HEX_QUAD | '\\U' HEX_QUAD HEX_QUAD ) ;
fragment HEX_QUAD: [0-9A-Fa-f] [0-9A-Fa-f] [0-9A-Fa-f] [0-9A-Fa-f];
Tested with this one-line input containing an identifier with an incorrect Unicode escape sequence:
dkk\uzzzz
The $text of the id_token parser rule action produces this correct result:
dkk
uzzzz
i.e. the input is interpreted as two identifiers separated by the symbol '\' (the '\' is not printed by any parser rule).
However, the getText() of IDENTIFIER lexer rule action produces this incorrect result:
dkk\u
uzzzz
Why is the lexer rule IDENTIFIER's getText() different from the parser rule id_token's $text? After all, the parser rule contains only this lexer rule.
EDIT:
Issue observed in ANTLR 4.1 but not in ANTLR 4.2, so it could have been fixed already.
It's hard to tell based on your example, but my instinct is you are using an old version of ANTLR. I am unable to reproduce this issue in ANTLR 4.2.
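A minimal sketch for checking the behaviour on a given ANTLR version: lex the sample input directly and compare what the IDENTIFIER action prints with the token texts handed back to the caller (CPPDefineLexer is the lexer ANTLR generates from the CPPDefine grammar above):

CPPDefineLexer lexer = new CPPDefineLexer(new ANTLRInputStream("dkk\\uzzzz\n"));
for (Token t : lexer.getAllTokens()) {
    // The IDENTIFIER action's println(getText()) output appears interleaved with these
    // lines; on a version without the bug both views agree: dkk, \, uzzzz.
    System.out.println("token text: <" + t.getText() + ">");
}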

string recursion antlr lexer token

How do I build a lexer token that can handle recursion inside it, as in the following string?
${*anything*${*anything*}*anything*}
Yes, you can use recursion inside lexer rules.
Take the following example:
${a ${b} ${c ${ddd} c} a}
which will be parsed correctly by the following grammar:
parse
: DollarVar
;
DollarVar
: '${' (DollarVar | EscapeSequence | ~Special)+ '}'
;
fragment
Special
: '\\' | '$' | '{' | '}'
;
fragment
EscapeSequence
: '\\' Special
;
as the interpreter inside ANTLRWorks shows:
(ANTLRWorks interpreter screenshot: http://img185.imageshack.us/img185/5471/recq.png)
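For what it's worth, the same idea still works in ANTLR4, where lexer rules may recurse as long as the recursion is not left-recursive; the negated fragment just has to be rewritten as a negated set, e.g. DollarVar : '${' (DollarVar | EscapeSequence | ~[\\${}])+ '}'; . A minimal sketch of driving such a lexer (DollarLexer is a made-up name):

DollarLexer lexer = new DollarLexer(CharStreams.fromString("${a ${b} ${c ${ddd} c} a}"));
for (Token t : lexer.getAllTokens()) {
    System.out.printf("%s '%s'%n",
            DollarLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
}
// The whole nested input should come back as a single DollarVar token.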
ANTLR's lexers do support recursion, as @BartK adeptly points out in his post, but you will only see a single token within the parser. If you need to interpret the various pieces within that token, you'll probably want to handle it within the parser.
IMO, you'd be better off doing something in the parser:
variable: DOLLAR LBRACE id variable id RBRACE;
By doing something like the above, you'll see all the necessary pieces and can build an AST or otherwise handle it accordingly.
