Lexer rule to handle escape of quote with quote or backslash in ANTLR4? - antlr4

I'm trying to expand the answer to How do I escape an escape character with ANTLR 4? to work when the " can be escaped both with " and \. I.e. both
"Rob ""Commander Taco"" Malda is smart."
and
"Rob \"Commander Taco\" Malda is smart."
are both valid and equivalent. I've tried
StringLiteral : '"' ('""'|'\\"'|~["])* '"';
but if fails to match
"Entry Flag for Offset check and for \"don't start Chiller Water Pump Request\""
with the tokenizer consuming more characters than intended, i.e. consumes beyond \""
Anyone who knows how to define the lexer rule?
A bit more detail...
"" succeeds
"""" succeeds
\" " succeeds
"\"" succeeds (at EOF)
"\""\n"" fails (it greedily pulls in the \n and "
Example: (text.txt)
""
""""
"\" "
"\""
""
grun test tokens -tokens < test.txt
line 5:1 token recognition error at: '"'
[#0,0:1='""',<StringLiteral>,1:0]
[#1,2:2='\n',<'
'>,1:2]
[#2,3:6='""""',<StringLiteral>,2:0]
[#3,7:7='\n',<'
'>,2:4]
[#4,8:12='"\" "',<StringLiteral>,3:0]
[#5,13:13='\n',<'
'>,3:5]
[#6,14:19='"\""\n"',<StringLiteral>,4:0]
[#7,21:20='<EOF>',<EOF>,5:2]
\"" and """ at the end of a StringListeral are not being handled the same.
Here's the ATN for that rule:
From this diagram it's not clear why they should be handled differently. They appear to be parallel constructs.
More research
Test Grammar (small change to simplify ATN):
grammar test
;
start: StringLiteral (WS? StringLiteral)+;
StringLiteral: '"' ( (('\\' | '"') '"') | ~["])* '"';
WS: [ \t\n\r]+;
The ATN for StringLiteral in this grammar:
OK, let's walk through this ATN with the input "\""\n"
unconsumed input
transition
"\""\n"
1 -ε-> 5
"\""\n"
5 -"-> 11
\""\n"
11 -ε-> 9
\""\n"
9 -ε-> 6
\""\n"
6 -\-> 7
""\n"
7 -"-> 10
"\n"
10 -ε-> 13
"\n"
13 -ε-> 11
"\n"
11 -ε-> 12
"\n"
12 -ε-> 14
"\n"
14 -"-> 15
\n"
15 -ε-> 2
We should reach State 2 with the " before the \n, which would be the desired behavior.
Instead, we see it continue on to consume the \n and the next "
line 2:1 token recognition error at: '"'
[#0,0:5='"\""\n"',<StringLiteral>,1:0]
[#1,7:6='<EOF>',<EOF>,2:2]
In order for this to be valid, there must be a path from state 11 to state 2 that consumes a \n and a " (and I'm not seeing it)
Maybe I'm missing something, but it's looking more and more like a bug to me.

The problem is handling the \ properly.
Bart found the path through the ATN that I missed and allowed it to match the extra \n". The \ is matched as a ~["] and then comes back through and matches the " to terminate the string.
We could disallow \ in the "everything but a " alternative (~["\\]), but then we have to allow a stand-alone \ to be acceptable. We'd want to add an alternative that allows a \ followed by anything other than a ". You'd think that '\\' ~["] does that, and you'd be right, to a point, but it also consumes the character following the \, which is a problem if you want a string like "test \\" string" since it's consumed the second \ you can't match the \" alternative. What you're looking for is a lookahead (i.e. consume the \ if it's not followed by a ", but don't consume the following character). But ANTLR Lexer rules don't allow for lookaheads (ANTLR lexer can't lookahead at all).
You'll notice that most grammars that allow \" as an escape sequence in strings also require a bare \ to be escaped (\\), and frequently treat other \ (other character) sequences as just the "other character").
If escaping the \ character is acceptable, the rule could be simplified to:
StringLiteral: '"' ('\\' . | '""' | ~["\\])* '"';
"Flag for \\"Chiller Water\\"" would not parse correctly, but "Flag for \\\"Chiller Water\\\"" would. Without lookahead, I'm not seeing a way to Lex the first version.
Also, note that if you don't escape the \, then you have an ambiguous interpretation of \"". Is it \" followed by a " to terminate the string, or \ followed by "" allowing the string to continue? ANTLR will take whichever interpretation consumes the most input, so we see it using the second interpretation and pulling in characters until if finds a "

I cannot reproduce it. Given the grammar:
grammar T;
parse
: .*? EOF
;
StringLiteral
: '"' ( '""' | '\\"' | ~["] )* '"'
;
Other
: . -> skip
;
The following code:
String source =
"\"Rob \"\"Commander Taco\"\" Malda is smart.\"\n" +
"\"Rob \\\"Commander Taco\\\" Malda is smart.\"\n" +
"\"Entry Flag for Offset check and for \\\"don't start Chiller Water Pump Request\\\"\"\n";
TLexer lexer = new TLexer(CharStreams.fromString(source));
CommonTokenStream stream = new CommonTokenStream(lexer);
stream.fill();
for (Token t : stream.getTokens()) {
System.out.printf("%-20s '%s'\n",
TLexer.VOCABULARY.getSymbolicName(t.getType()),
t.getText().replace("\n", "\\n"));
}
produces the following output:
StringLiteral '"Rob ""Commander Taco"" Malda is smart."'
StringLiteral '"Rob \"Commander Taco\" Malda is smart."'
StringLiteral '"Entry Flag for Offset check and for \"don't start Chiller Water Pump Request\""'
Tested with ANTLR 4.9.3 and 4.10.1: both produce the same output.

Related

How to include quotes in string in ANTLR4

How can I include quotes for string and characters as part of the string. Example is "This is a \" string" which should result in one string instead of "This is a \" as one string and string" as an error in this case. The same goes for the characters. Example is '\'', but
in my case it's only '\'.
This is my current solution which works only without quotes.
CHARACTER
: '\'' ~('\'')+ '\''
;
STRING
: '"' ~('"')+ '"'
;
Your string/char rules don't handle escape sequences correctly. For the character it should be:
CHARACTER: '\'' '\\'? '.' '\'';
Here we make the escape char (backshlash) be part of the rule and require an additional char (whatever it is) follow it. Similar for the string:
STRING: '"' ('\\'? .)+? '"';
By using +? we are telling ANTLR4 to match in a non-greedy manner, stopping at the first non-escaped quote char after the initial one.

Handling line feed in ANTLR4 grammar with Python target

I am working on an ANTLR4 grammar for parsing Python DSL scripts (a subset of Python, basically) with the target set as the Python 3. I am having difficulties handling the line feed.
In my grammar, I use lexer::members and NEWLINE embedded code based on Bart Kiers's Python3 grammar for ANTLR4 which are ported to Python so that they can be used with Python 3 runtime for ANTLR instead of Java. My grammar differs from the one provided by Bart (which is almost the same used in the Python 3 spec) since in my DSL I need to target only certain elements of Python. Based on extensive testing of my grammar, I do think that the Python part of the grammar in itself is not the source of the problem and so I won't post it here in full for now.
The input for the grammar is a file, catched by the file_input rule:
file_input: (NEWLINE | statement)* EOF;
The grammar performs rather well on my DSL and produces correct ASTs. The only problem I have is that my lexer rule NEWLINE clutters the AST with \r\n nodes and proves troublesome when trying to extend the generated MyGrammarListener with my own ExtendedListener which inherits from it.
Here is my NEWLINE lexer rule:
NEWLINE
: ( {self.at_start_of_input()}? SPACES
| ( '\r'? '\n' | '\r' | '\f' ) SPACES?
)
{
import re
from MyParser import MyParser
new_line = re.sub(r"[^\r\n\f]+", "", self._interp.getText(self._input))
spaces = re.sub(r"[\r\n\f]+", "", self._interp.getText(self._input))
next = self._input.LA(1)
if self.opened > 0 or next == '\r' or next == '\n' or next == '\f' or next == '#':
self.skip()
else:
self.emit_token(self.common_token(self.NEWLINE, new_line))
indent = self.get_indentation_count(spaces)
if len(self.indents) == 0:
previous = 0
else:
previous = self.indents[-1]
if indent == previous:
self.skip()
elif indent > previous:
self.indents.append(indent)
self.emit_token(self.common_token(MyParser.INDENT, spaces))
else:
while len(self.indents) > 0 and self.indents[-1] > indent:
self.emit_token(self.create_dedent())
del self.indents[-1]
};
The SPACES lexer rule fragment that NEWLINE uses is here:
fragment SPACES
: [ \t]+
;
I feel I should also add that both SPACES and COMMENTS are ultimately being skipped by the grammar, but only after the NEWLINE lexer rule is declared, which, as far as I know, should mean that there are no adverse effects from that, but I wanted to include it just in case.
SKIP_
: ( SPACES | COMMENT ) -> skip
;
When the input file is run without any empty lines between statements, everything runs as it should. However, if there are empty lines in my file (such as between import statements and variable assignement), I get the following errors:
line 15:4 extraneous input '\r\n ' expecting {<EOF>, 'from', 'import', NEWLINE, NAME}
line 15:0 extraneous input '\r\n' expecting {<EOF>, 'from', 'import', NEWLINE, NAME}
As I said before, when line feeds are omitted in my input file, the grammar and my ExtendedListener perform as they should, so the problem is definitely with the \r\n not being matched by the NEWLINE lexer rule - even the error statement I get says that it does not match alternative NEWLINE.
The AST produced by my grammar looks like this:
I would really appreciate any help with this since I cannot see why my NEWLINE lexer rule woud fail to match \r\n as it should and I would like to allow empty lines in my DSL.
so the problem is definitely with the \r\n not being matched by the
NEWLINE lexer rule
There is another explanation. An LL(1) parser would stop at the first mismatch, but ANTLR4 is a very smart LL(*) : it tries to match the input past the mismatch.
As I don't have your statement rule and your input around line 15, I'll demonstrate a possible case with the following grammar :
grammar Question;
/* Extraneous input parsing NL and spaces. */
#lexer::members {
public boolean at_start_of_input() {return true;}; // even if it always returns true, it's not the cause of the problem
}
question
#init {System.out.println("Question last update 2108");}
: ( NEWLINE
| statement
{System.out.println("found <<" + $statement.text + ">>");}
)* EOF
;
statement
: 'line ' NUMBER NEWLINE 'something else' NEWLINE
;
NUMBER : [0-9]+ ;
NEWLINE
: ( {at_start_of_input()}? SPACES
| ( '\r'? '\n' | '\r' | '\f' ) SPACES?
)
;
SKIP_
: SPACES -> skip
;
fragment SPACES
: [ \t]+
;
Input file t.text :
line 1
something else
Execution :
$ export CLASSPATH=".:/usr/local/lib/antlr-4.6-complete.jar"
$ alias
alias a4='java -jar /usr/local/lib/antlr-4.6-complete.jar'
alias grun='java org.antlr.v4.gui.TestRig'
$ hexdump -C t.text
00000000 6c 69 6e 65 20 31 0a 20 20 20 73 6f 6d 65 74 68 |line 1. someth|
00000010 69 6e 67 20 65 6c 73 65 0a |ing else.|
00000019
$ a4 Question.g4
$ javac Q*.java
$ grun Question question -tokens -diagnostics t.text
[#0,0:4='line ',<'line '>,1:0]
[#1,5:5='1',<NUMBER>,1:5]
[#2,6:9='\n ',<NEWLINE>,1:6]
[#3,10:23='something else',<'something else'>,2:3]
[#4,24:24='\n',<NEWLINE>,2:17]
[#5,25:24='<EOF>',<EOF>,3:0]
Question last update 2108
found <<line 1
something else
>>
Now change statement like so :
statement
// : 'line ' NUMBER NEWLINE 'something else' NEWLINE
: 'line ' NUMBER 'something else' NEWLINE // now NL will be extraneous
;
and execute again :
$ a4 Question.g4
$ javac Q*.java
$ grun Question question -tokens -diagnostics t.text
[#0,0:4='line ',<'line '>,1:0]
[#1,5:5='1',<NUMBER>,1:5]
[#2,6:9='\n ',<NEWLINE>,1:6]
[#3,10:23='something else',<'something else'>,2:3]
[#4,24:24='\n',<NEWLINE>,2:17]
[#5,25:24='<EOF>',<EOF>,3:0]
Question last update 2114
line 1:6 extraneous input '\n ' expecting 'something else'
found <<line 1
something else
>>
Note that the NL character and spaces have been correctly matched by the NEWLINE lexer rule.
You can find the explanation in section 9.1 of The Definitive ANTLR 4 Reference :
$ grun Simple prog ➾ class T ; { int i; } ➾EOF ❮ line 1:8 extraneous
input ';' expecting '{'
A Parade of Errors • 153
The parser reports an error at the ; but gives a slightly more
informative answer because it knows that the next token is what it was
actually looking for. This feature is called single-token deletion
because the parser can simply pretend the extraneous token isn’t there
and keep going.
Similarly, the parser can do single-token insertion when it detects a
missing token.
In other word, ANTLR4 is so powerful that it can resynchronize the input with the grammar even if several tokens are mismatching. If you run with the -gui option
$ grun Question question -gui t.text
you can see that ANTLR4 has parsed the whole file, despite the fact that a NEWLINE is missing in the statement rule, and that the input does not match exactly the grammar.
To summary : extraneous input is quite a common error when developing a grammar. It can come from a mismatch between input to parse and rule expectations, or also because some piece of input has been interpreted by another token than the one we believe, which can be detected by examining the list of tokens produced by the -tokens option.

ANTLR4 - "Illegal Escape in String" expression for Lexer

The requirement for the assignment is:
"Illegal escape in string: " + wrong string: When the lexer detects an illegal
escape in string. The wrong string is from the beginning of the string to the
illegal escape.
All the supported escape sequences are as follows:
\b backspace
\f formfeed
\r carriage return
\n newline
\t horizontal tab
\’ single quote
\" double quote
\ backslash
I use the code for "String" as same as this post recommended:
ANTLR4 - Need an explanation on this String Literals
STRINGLIT: '"' ( '\\' [btnfr"'\\] | ~[\b\t\f\r\n\\"] )* '"';
And also fix a little bit for "Unterminated (or Unclosed) String" as follow:
UNCLOSE_STRING: '"' ( '\\' [btnfr"'\\] | ~[\b\t\f\r\n\\"] )* ;
So I tried to write down the prototype for that requirement like this:
ILLEGAL_ESCAPE: '"' .*? ESCAPE ;
fragment ESCAPE: [\b\f\r\n\t'"\\]
Can someone help me to figure out if had done something wrong to it, I think there is something not clear between STRING and ILLEGAL_ESCAPE so the result is not right.
I appreciate if you can fix it again to meet the requirement as I mentioned earlier. Thanks in advance!!
Try to use the following lexer rule:
ILLEGAL_ESCAPE: '"' ('\\' ~[btnfr"'\\] | ~'\\')*;

ANTLRv4 : Read double quotes escaped with both \ and "

I'm trying to implement a parser using ANTLRv4 for a language that accepts both "" and \" as a way escaping " characters in " delimited strings.
The answers to this question show how to do it for "" escaping. However when I try to extend it to also cover the \" case, it almost works but becomes too greedy when two strings are on the same line.
Here is my grammar:
grammar strings;
strings : STRING (',' STRING )* ;
STRING
: '"' (~[\r\n"] | '""' | '\"' )* '"'
;
Here is my input of three strings:
"This is ""my string\"",
"cat","fish"
This correctly recognises "This is ""my string\"", but thinks that "cat","fish" is all one string.
If I move "fish" down on to the next line it works correctly.
Can anyone figure out how to make it work if "cat" and "fish" are on the same line?
Make your STRING rule non greedy to stop at the first quote char it encounters, instead of trying to get as much as possible:
STRING
: '"' (~[\r\n"] | '""' | '\"' )*? '"'
;
I've found what I need to do to get this to work as I wanted, though to be honest I'm still not entirely sure why Antlr was doing what it did.
Simply by adding another backslash character to the '\"' clause it works!
So my final STRINGS definition is : '"' (~[\r\n"] | '""' | '\\"' )* '"'
Going back to first principles, I hand drew a state transition diagram of the problem and then realised that the two escaping mechanism sequences are not the same and cannot be treated similarly. Then trying to implement the two patterns in AntlrWorks it became apparent that I needed to add the second backslash at which point it all started working.
Does a single backslash followed by some arbitrary character simply mean that character?

Handling String Literals which End in an Escaped Quote in ANTLR4

How do I write a lexer rule to match a String literal which does not end in an escaped quote?
Here's my grammar:
lexer grammar StringLexer;
// from The Definitive ANTLR 4 Reference
STRING: '"' (ESC|.)*? '"';
fragment ESC : '\\"' | '\\\\' ;
Here's my java block:
String s = "\"\\\""; // looks like "\"
StringLexer lexer = new StringLexer(new ANTLRInputStream(s));
Token t = lexer.nextToken();
if (t.getType() == StringLexer.STRING) {
System.out.println("Saw a String");
}
else {
System.out.println("Nope");
}
This outputs Saw a String. Should "\" really match STRING?
Edit: Both 280Z28 and Bart's solutions are great solutions, unfortunately I can only accept one.
For properly formed input, the lexer will match the text you expect. However, the use of the non-greedy operator will not prevent it from matching something with the following form:
'"' .*? '"'
To ensure strings are tokens in the most "sane" way possible, I recommended using the following rules.
StringLiteral
: UnterminatedStringLiteral '"'
;
UnterminatedStringLiteral
: '"' (~["\\\r\n] | '\\' (. | EOF))*
;
If your language allows string literals to span across multiple lines, you would likely need to modify UnterminatedStringLiteral to allow matching end-of-line characters.
If you do not include the UnterminatedStringLiteral rule, the lexer will handle unterminated strings by simply ignoring the opening " character of the string and proceeding to tokenize the content of the string.
Yes, "\" is matched by the STRING rule:
STRING: '"' (ESC|.)*? '"';
^ ^ ^
| | |
// matches: " \ "
If you don't want the . to match the backslash (and quote), do something like this:
STRING: '"' ( ESC | ~[\\"] )* '"';
And if your string can't be spread over multiple lines, do:
STRING: '"' ( ESC | ~[\\"\r\n] )* '"';

Resources