How do I write a lexer rule to match a String literal which does not end in an escaped quote?
Here's my grammar:
lexer grammar StringLexer;
// from The Definitive ANTLR 4 Reference
STRING: '"' (ESC|.)*? '"';
fragment ESC : '\\"' | '\\\\' ;
Here's my Java block:
String s = "\"\\\""; // looks like "\"
StringLexer lexer = new StringLexer(new ANTLRInputStream(s));
Token t = lexer.nextToken();
if (t.getType() == StringLexer.STRING) {
System.out.println("Saw a String");
}
else {
System.out.println("Nope");
}
This outputs Saw a String. Should "\" really match STRING?
Edit: Both 280Z28 and Bart's solutions are great solutions, unfortunately I can only accept one.
For properly formed input, the lexer will match the text you expect. However, the use of the non-greedy operator will not prevent it from matching something with the following form:
'"' .*? '"'
To ensure strings are tokenized in the most "sane" way possible, I recommend using the following rules.
StringLiteral
: UnterminatedStringLiteral '"'
;
UnterminatedStringLiteral
: '"' (~["\\\r\n] | '\\' (. | EOF))*
;
If your language allows string literals to span across multiple lines, you would likely need to modify UnterminatedStringLiteral to allow matching end-of-line characters.
If you do not include the UnterminatedStringLiteral rule, the lexer will handle unterminated strings by simply ignoring the opening " character of the string and proceeding to tokenize the content of the string.
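To make the split between the two rules concrete, here is a hand-written sketch in plain Java that mirrors what they accept. It is not ANTLR-generated code; the class StringScanDemo, the Kind enum and the classify method are names invented for this illustration, and it assumes the candidate string starts at index 0.

```java
final class StringScanDemo {
    enum Kind { STRING_LITERAL, UNTERMINATED_STRING_LITERAL, NOT_A_STRING }

    // Mirrors (by hand, as an approximation):
    //   UnterminatedStringLiteral : '"' (~["\\\r\n] | '\\' (. | EOF))* ;
    //   StringLiteral             : UnterminatedStringLiteral '"' ;
    static Kind classify(String input) {
        if (input.isEmpty() || input.charAt(0) != '"') {
            return Kind.NOT_A_STRING;
        }
        int i = 1;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == '\\') {
                // A backslash consumes the following character as well,
                // or simply stops at the end of the input.
                i = Math.min(i + 2, input.length());
            } else if (c == '"') {
                return Kind.STRING_LITERAL;              // closing quote found
            } else if (c == '\r' || c == '\n') {
                return Kind.UNTERMINATED_STRING_LITERAL; // raw line break ends the token
            } else {
                i++;
            }
        }
        return Kind.UNTERMINATED_STRING_LITERAL;         // ran out of input
    }

    public static void main(String[] args) {
        System.out.println(classify("\"abc\""));  // STRING_LITERAL
        System.out.println(classify("\"\\\""));   // UNTERMINATED_STRING_LITERAL
    }
}
```

With this shape, the input "\" from the question is reported as an unterminated string instead of a complete one.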
Yes, "\" is matched by the STRING rule:
             STRING: '"' (ESC|.)*? '"';
                      ^       ^     ^
                      |       |     |
// matches:           "       \     "
If you don't want the . to match the backslash (and quote), do something like this:
STRING: '"' ( ESC | ~[\\"] )* '"';
And if your string can't be spread over multiple lines, do:
STRING: '"' ( ESC | ~[\\"\r\n] )* '"';
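As a rough way to see the difference outside ANTLR, the two rule shapes can be approximated with java.util.regex patterns. This is only an analogy: a backtracking regex is not an ANTLR lexer, and the assumption is merely that both behave the same on these small inputs. The class name StringRuleDemo and its members are made up for this sketch.

```java
import java.util.regex.Pattern;

final class StringRuleDemo {
    // Approximates  STRING: '"' (ESC|.)*? '"';  with  ESC : '\\"' | '\\\\'
    static final Pattern LOOSE =
        Pattern.compile("\"(?:\\\\\"|\\\\\\\\|.)*?\"");

    // Approximates  STRING: '"' ( ESC | ~[\\"] )* '"';
    static final Pattern STRICT =
        Pattern.compile("\"(?:\\\\\"|\\\\\\\\|[^\\\\\"])*\"");

    static boolean looseMatches(String s)  { return LOOSE.matcher(s).matches(); }
    static boolean strictMatches(String s) { return STRICT.matcher(s).matches(); }

    public static void main(String[] args) {
        String s = "\"\\\"";                  // the three characters  "\"
        System.out.println(looseMatches(s));  // true: the wildcard swallows the backslash
        System.out.println(strictMatches(s)); // false: a backslash must start an ESC
    }
}
```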
Related
How can I allow escaped quotes inside string and character literals? For example, "This is a \" string" should result in one string, but currently "This is a \" is matched as one string and string" is reported as an error. The same goes for characters: the input '\'' should be one character literal, but in my case only '\' is matched.
This is my current solution, which works only when the literal contains no quotes.
CHARACTER
: '\'' ~('\'')+ '\''
;
STRING
: '"' ~('"')+ '"'
;
Your string/char rules don't handle escape sequences correctly. For the character it should be:
CHARACTER: '\'' '\\'? . '\'';
Here we make the escape char (backslash) part of the rule and require that one additional char (whatever it is) follows it. Note that the wildcard . is written without quotes; '.' would match only a literal dot. Similar for the string:
STRING: '"' ('\\'? .)+? '"';
By using +? we are telling ANTLR4 to match in a non-greedy manner, stopping at the first non-escaped quote char after the initial one.
The requirement for the assignment is:
"Illegal escape in string: " + wrong string: When the lexer detects an illegal
escape in string. The wrong string is from the beginning of the string to the
illegal escape.
All the supported escape sequences are as follows:
\b backspace
\f formfeed
\r carriage return
\n newline
\t horizontal tab
\' single quote
\" double quote
\\ backslash
For "String" I use the rule recommended in this post:
ANTLR4 - Need an explanation on this String Literals
STRINGLIT: '"' ( '\\' [btnfr"'\\] | ~[\b\t\f\r\n\\"] )* '"';
I also adapted it slightly for "Unterminated (or Unclosed) String", as follows:
UNCLOSE_STRING: '"' ( '\\' [btnfr"'\\] | ~[\b\t\f\r\n\\"] )* ;
So I tried to write a first version of a rule for that requirement, like this:
ILLEGAL_ESCAPE: '"' .*? ESCAPE ;
fragment ESCAPE: [\b\f\r\n\t'"\\] ;
Can someone help me figure out what I have done wrong? I think STRINGLIT and ILLEGAL_ESCAPE overlap in a way I don't understand, so the result is not right.
I would appreciate it if you could fix the rule so that it meets the requirement I described above. Thanks in advance!
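Try the following lexer rule. It accepts the same content as STRINGLIT up to the first backslash that is followed by a character other than b, t, n, f, r, ", ' or \, and stops right after that character:
ILLEGAL_ESCAPE: '"' ( '\\' [btnfr"'\\] | ~[\b\t\f\r\n\\"] )* '\\' ~[btnfr"'\\] ;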
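To check what text the requirement expects ("from the beginning of the string to the illegal escape"), the same logic can be sketched in plain Java. This is an independent illustration, not generated lexer code; IllegalEscapeDemo and illegalEscapePrefix are hypothetical names, and the sketch assumes the string starts at index 0 of the input.

```java
final class IllegalEscapeDemo {
    // Characters allowed after a backslash, taken from the list of supported escapes.
    private static final String LEGAL = "btnfr\"'\\";
    // Raw characters the STRINGLIT rule does not allow inside a string.
    private static final String FORBIDDEN_RAW = "\b\t\f\r\n";

    // Returns the text from the opening quote up to and including the first
    // illegal escape, or null if the string closes (or breaks off) before one occurs.
    static String illegalEscapePrefix(String input) {
        if (input.isEmpty() || input.charAt(0) != '"') {
            return null;
        }
        for (int i = 1; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '"' || FORBIDDEN_RAW.indexOf(c) >= 0) {
                return null;                       // closed or unclosed, but no illegal escape
            }
            if (c == '\\') {
                if (i + 1 >= input.length()) {
                    return null;                   // dangling backslash at end of input
                }
                char next = input.charAt(i + 1);
                if (LEGAL.indexOf(next) < 0) {
                    return input.substring(0, i + 2);
                }
                i++;                               // skip the escaped character
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Input text: "abc\hdef"
        System.out.println(illegalEscapePrefix("\"abc\\hdef\"")); // "abc\h
    }
}
```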
So I have this lexical rule for string:
STRINGLIT: '"' ( ('\'[\"bftrn]) | ~[\n\"] )* '"' ;
For example, with the input "abc", I expect abc,<EOF> discarding the "
I read here http://www.antlr2.org/doc/lexer.html that you can use the ! operator. Then I would have:
STRINGLIT: '"'! ( ('\'[\"bftrn]) | ~[\n\"] )* '"'! ;
But I can't get this to work in my grammar.
The v2 functionality of the ! operator has not been supported since v3 (you're using v4).
There is no equivalent operator in v3 or v4. The only ways to strip the quotes are to do so in a listener or visitor after parsing, or to embed target-specific code in your lexer:
STRINGLIT
: '"' ( ( '\\' [\\bftrn"] ) | ~[\\\r\n"] )* '"'
{
// Get all the text that this rule matched
String matched = getText();
// Strip the first and the last characters (the quotes)
String matchedWithoutQuotes = matched.substring(1, matched.length() - 1);
// possibly do some more replacements here like replace `\\n` with `\n` etc.
// Set the new string to this token
setText(matchedWithoutQuotes);
}
;
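The "more replacements" step mentioned in the comment could look roughly like this in plain Java. It is a sketch under the assumption that only the escapes accepted by the rule above (\\ \b \f \t \r \n \") can occur in the matched text; StringLiteralText and unquote are hypothetical names, not part of the ANTLR runtime.

```java
final class StringLiteralText {
    // Strips the surrounding quotes and resolves the escapes the rule accepts.
    // Assumes 'matched' starts and ends with a quote, as the lexer rule guarantees.
    static String unquote(String matched) {
        String body = matched.substring(1, matched.length() - 1);
        StringBuilder out = new StringBuilder(body.length());
        for (int i = 0; i < body.length(); i++) {
            char c = body.charAt(i);
            if (c != '\\' || i + 1 == body.length()) {
                out.append(c);                     // ordinary character
                continue;
            }
            char next = body.charAt(++i);          // character after the backslash
            switch (next) {
                case 'b': out.append('\b'); break;
                case 'f': out.append('\f'); break;
                case 't': out.append('\t'); break;
                case 'r': out.append('\r'); break;
                case 'n': out.append('\n'); break;
                default:  out.append(next); break; // covers \\ and \"
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(unquote("\"abc\""));    // abc
    }
}
```

In the lexer action, the two substring/setText steps could then be replaced by a single call along the lines of setText(StringLiteralText.unquote(getText())), provided the helper class is visible to the generated lexer.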
I'm trying to implement a parser using ANTLR v4 for a language that accepts both "" and \" as ways of escaping " characters in " delimited strings.
The answers to this question show how to do it for "" escaping. However when I try to extend it to also cover the \" case, it almost works but becomes too greedy when two strings are on the same line.
Here is my grammar:
grammar strings;
strings : STRING (',' STRING )* ;
STRING
: '"' (~[\r\n"] | '""' | '\"' )* '"'
;
Here is my input of three strings:
"This is ""my string\"",
"cat","fish"
This correctly recognises "This is ""my string\"", but thinks that "cat","fish" is all one string.
If I move "fish" down on to the next line it works correctly.
Can anyone figure out how to make it work if "cat" and "fish" are on the same line?
Make your STRING rule non-greedy so that it stops at the first quote char it encounters, instead of trying to match as much as possible:
STRING
: '"' (~[\r\n"] | '""' | '\"' )*? '"'
;
I've found what I need to do to get this to work as I wanted, though to be honest I'm still not entirely sure why ANTLR was doing what it did.
Simply adding another backslash character to the '\"' clause makes it work!
So my final STRING rule is: '"' (~[\r\n"] | '""' | '\\"' )* '"'
Going back to first principles, I hand drew a state transition diagram of the problem and then realised that the two escaping mechanism sequences are not the same and cannot be treated similarly. Then trying to implement the two patterns in AntlrWorks it became apparent that I needed to add the second backslash at which point it all started working.
Does a single backslash followed by some arbitrary character simply mean that character?
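The behaviour reported above is at least consistent with that reading: if the '\"' clause is taken as a lone quote character, every " becomes legal string content and the rule runs to the last quote on the line. The effect can be reproduced with java.util.regex, where \" also matches just a quote. This is only an analogy under that assumption, not a statement about how ANTLR parses literals; EscapeAlternativeDemo and firstMatch are names invented for this sketch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EscapeAlternativeDemo {
    // Third alternative is a lone quote, which is how '\"' is assumed to behave.
    static final Pattern LONE_QUOTE =
        Pattern.compile("\"(?:[^\r\n\"]|\"\"|\")*\"");

    // Third alternative is backslash followed by quote, like '\\"'.
    static final Pattern BACKSLASH_QUOTE =
        Pattern.compile("\"(?:[^\r\n\"]|\"\"|\\\\\")*\"");

    // Returns the first match of p in input, or null if there is none.
    static String firstMatch(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String line = "\"cat\",\"fish\"";                      // "cat","fish"
        System.out.println(firstMatch(LONE_QUOTE, line));      // "cat","fish"
        System.out.println(firstMatch(BACKSLASH_QUOTE, line)); // "cat"
    }
}
```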
I have the following definition of a fragment:
fragment CHAR :'a'..'z'|'A'..'Z'|'\n'|'\t'|'\\'|EOF;
Now I have to define a lexer rule for strings. I did the following:
STRING : '"'(CHAR)*'"' ;
However, in a string I want to match all of my characters except the newline '\n'. Any ideas how I can achieve that?
You'll also need to exclude " besides line breaks. Try this:
STRING : '"' ~('\r' | '\n' | '"')* '"' ;
The ~ negates char-sets.
But I want to negate only the new line from my CHAR set
No other way than this AFAIK:
STRING : '"' CHAR_NO_NL* '"' ;
fragment CHAR_NO_NL : 'a'..'z'|'A'..'Z'|'\t'|'\\'|EOF;