ANTLRv4: How to read double quote escaped double quotes in string? - antlr4

In ANTLR v4, how do we parse a string in which double quotes are escaped by doubling them, as in VBA?
For the text:
"some string with ""john doe"" in it"
the goal would be to identify the string: some string with "john doe" in it
And is it possible to rewrite it to turn doubled double quotes into single double quotes? "" -> "?

Like this:
STRING
: '"' (~[\r\n"] | '""')* '"'
;
where ~[\r\n"] | '""' means:
~[\r\n"] # any char other than '\r', '\n' and double quotes
| # OR
'""' # two successive double quotes
And is it possible to rewrite it to turn doubled double quotes into single double quotes?
Not without embedding custom code. In Java that could look like:
STRING
: '"' (~[\r\n"] | '""')* '"'
{
String s = getText();
s = s.substring(1, s.length() - 1); // strip the leading and trailing quotes
s = s.replace("\"\"", "\""); // replace each pair of double quotes with one double quote
setText(s);
}
;
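The same post-processing can be tried outside a grammar. The following is a minimal sketch in plain Java, assuming no ANTLR runtime is available: the class name VbaString and its method names are hypothetical, and the regular expression is a hand translation of the STRING rule above, not something ANTLR generates.

```java
import java.util.regex.Pattern;

class VbaString {
    // Hand-written regex counterpart of: '"' (~[\r\n"] | '""')* '"'
    private static final Pattern STRING =
            Pattern.compile("\"(?:[^\r\n\"]|\"\")*\"");

    /** Returns true if the whole input is one VBA-style string literal. */
    static boolean isStringLiteral(String text) {
        return STRING.matcher(text).matches();
    }

    /** Strips the delimiters and collapses each "" into a single ". */
    static String unquote(String literal) {
        if (!isStringLiteral(literal)) {
            throw new IllegalArgumentException("not a string literal: " + literal);
        }
        String body = literal.substring(1, literal.length() - 1);
        return body.replace("\"\"", "\"");
    }

    public static void main(String[] args) {
        // prints: some string with "john doe" in it
        System.out.println(unquote("\"some string with \"\"john doe\"\" in it\""));
    }
}
```

Validating before unquoting keeps the helper from silently mangling text that the lexer rule would not have accepted as a single token.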

Related

Movlet Failing if double quotes exists in string

I have a string that contains double quotes, for example:
"Diagnosed with "Covid19 -Ve"
Rendering it in a MEL variable fails.
My string may contain multiple double quotes in the future.
How do I escape all the double quotes in the string value in MEL?
You can escape a double quote character by putting a backslash in front of it, like this:
myVariable = "Diagnosed with \"Covid19 -Ve" ;
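If the value is not known in advance, the escaping has to be applied to every quote it contains. Below is a minimal sketch in Java, assuming the literal is built by Java code before it reaches the expression; it uses no Mule or MEL API, and the class name QuoteEscaper and its methods are hypothetical.

```java
class QuoteEscaper {
    /** Prefixes every backslash and every double quote with a backslash. */
    static String escape(String raw) {
        // Escape backslashes first so the ones added for quotes are not doubled.
        return raw.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    /** Wraps the escaped text in double quotes to form a string literal. */
    static String toLiteral(String raw) {
        return "\"" + escape(raw) + "\"";
    }

    public static void main(String[] args) {
        // prints: "Diagnosed with \"Covid19 -Ve"
        System.out.println(toLiteral("Diagnosed with \"Covid19 -Ve"));
    }
}
```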

How to include quotes in string in ANTLR4

How can I include quotes as part of a string or character literal? For example, "This is a \" string" should result in one string, instead of "This is a \" as one string followed by string" as an error. The same goes for characters: '\'' should be one character, but in my case only '\' is matched.
This is my current solution, which only works without quotes.
CHARACTER
: '\'' ~('\'')+ '\''
;
STRING
: '"' ~('"')+ '"'
;
Your string/char rules don't handle escape sequences correctly. For the character it should be:
CHARACTER: '\'' '\\'? . '\'';
Here we make the escape char (backslash) part of the rule and require an additional char (whatever it is) to follow it. Similar for the string:
STRING: '"' ('\\'? .)+? '"';
By using +? we are telling ANTLR4 to match in a non-greedy manner, stopping at the first non-escaped quote char after the initial one.
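To experiment with backslash escapes without generating a lexer, the idea can be approximated with java.util.regex. This is a sketch under assumptions: it swaps the non-greedy operator for a negated character class, the class name EscapedLiterals is hypothetical, and regex matching does not behave exactly like an ANTLR lexer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapedLiterals {
    // '"', then any run of (backslash + any char) or (char that is neither
    // a quote nor a backslash), then the closing '"'
    private static final Pattern STRING =
            Pattern.compile("\"(?:\\\\.|[^\"\\\\])*\"");
    // "'", then one escaped char or one plain char, then the closing "'"
    private static final Pattern CHARACTER =
            Pattern.compile("'(?:\\\\.|[^'\\\\])'");

    /** Returns every string literal found in the input, delimiters included. */
    static List<String> findStrings(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = STRING.matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    /** Returns true if the whole input is one character literal. */
    static boolean isCharacter(String text) {
        return CHARACTER.matcher(text).matches();
    }
}
```

With this pattern, the example from the question is found as a single literal, and '\'' is accepted as one character while '\' is rejected.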

Why are there some backslashes in the error string?

Here is my stdout after I ran my tests.
expect(received).toBe(expected) // Object.is equality
Expected: "child 'path1' fails because ['path1' is not allowed to be empty]"
Received: "child \"path1\" fails because [\"path1\" is not allowed to be empty]"
39 | } catch (error) {
40 | expect(error.name).toBe('ValidationError');
> 41 | expect(error.message).toBe("child 'path1' fails because ['path1' is not allowed to be empty]");
| ^
42 | }
43 | });
44 | });
at Object.<anonymous> (src/__tests__/models/adChannel/googleadwords/AdGroupAd.spec.ts:41:29)
As you can see, the Received value contains backslashes (\), so it doesn't match the Expected value.
I think maybe the string is escaped? I expected the string not to contain \
Short answer
Change your expect to this:
expect(error.message).toBe('child "path1" fails because ["path1" is not allowed to be empty]');
...and it will work.
Details
JavaScript allows strings to be defined using either single quotes: 'a string' or double quotes: "a string".
From the MDN doc:
Unlike some other languages, JavaScript makes no distinction between single-quoted strings and double-quoted strings
...so it doesn't matter which approach you use.
Single quotes work fine in a string defined with double quotes:
const singleQuotes = "has 'single quotes' in it";
...and the same is true for double quotes in a string defined with single quotes:
const doubleQuotes = 'has "double quotes" in it';
...but single quotes need to be escaped if they are in a string defined with single quotes:
const singleQuotes = 'has \'single quotes\' in it';
...and the same is true for double quotes in a string defined with double quotes:
const doubleQuotes = "has \"double quotes\" in it";
You are seeing escape characters in the Jest output because Jest is formatting the output string with double quotes around it so the double quotes within it need to be escaped.

Handling String Literals which End in an Escaped Quote in ANTLR4

How do I write a lexer rule to match a String literal which does not end in an escaped quote?
Here's my grammar:
lexer grammar StringLexer;
// from The Definitive ANTLR 4 Reference
STRING: '"' (ESC|.)*? '"';
fragment ESC : '\\"' | '\\\\' ;
Here's my Java block:
String s = "\"\\\""; // looks like "\"
StringLexer lexer = new StringLexer(new ANTLRInputStream(s));
Token t = lexer.nextToken();
if (t.getType() == StringLexer.STRING) {
System.out.println("Saw a String");
}
else {
System.out.println("Nope");
}
This outputs Saw a String. Should "\" really match STRING?
Edit: Both 280Z28 and Bart's solutions are great solutions, unfortunately I can only accept one.
For properly formed input, the lexer will match the text you expect. However, the use of the non-greedy operator will not prevent it from matching something with the following form:
'"' .*? '"'
To ensure strings are tokenized in the most "sane" way possible, I recommend using the following rules.
StringLiteral
: UnterminatedStringLiteral '"'
;
UnterminatedStringLiteral
: '"' (~["\\\r\n] | '\\' (. | EOF))*
;
If your language allows string literals to span across multiple lines, you would likely need to modify UnterminatedStringLiteral to allow matching end-of-line characters.
If you do not include the UnterminatedStringLiteral rule, the lexer will handle unterminated strings by simply ignoring the opening " character of the string and proceeding to tokenize the content of the string.
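The split between a terminated and an unterminated literal can be illustrated with a regular expression. This is a sketch only, assuming a hand translation of UnterminatedStringLiteral into java.util.regex; the class StringTokens and its Kind enum are hypothetical names, and the code classifies just the token at the start of the input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StringTokens {
    enum Kind { STRING_LITERAL, UNTERMINATED_STRING_LITERAL, NO_MATCH }

    // Regex rendering of: '"' (~["\\\r\n] | '\\' (. | EOF))*
    // DOTALL lets '.' match line terminators, like the ANTLR wildcard.
    private static final Pattern UNTERMINATED = Pattern.compile(
            "\"(?:[^\"\\\\\r\n]|\\\\(?:.|\\z))*", Pattern.DOTALL);

    /** Classifies the token that starts at the beginning of the input. */
    static Kind classify(String input) {
        Matcher m = UNTERMINATED.matcher(input);
        if (!m.lookingAt()) {
            return Kind.NO_MATCH;
        }
        int end = m.end();
        // A closing quote right after the body makes it a complete literal.
        boolean closed = end < input.length() && input.charAt(end) == '"';
        return closed ? Kind.STRING_LITERAL : Kind.UNTERMINATED_STRING_LITERAL;
    }
}
```

Under this translation the three-character input "\" is classified as unterminated, because the backslash consumes the quote that follows it and no closing quote remains.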
Yes, "\" is matched by the STRING rule:
STRING: '"' (ESC|.)*? '"';
         ^       ^     ^
         |       |     |
//       "       \     "   <- the character each part matches
If you don't want the . to match the backslash (and quote), do something like this:
STRING: '"' ( ESC | ~[\\"] )* '"';
And if your string can't be spread over multiple lines, do:
STRING: '"' ( ESC | ~[\\"\r\n] )* '"';
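The difference between the original rule and the stricter one can be checked with regular expressions. This is a rough sketch, assuming hand-written java.util.regex counterparts of both rules; the class name StrictString is hypothetical, and a backtracking regex engine is not an ANTLR lexer, although both agree on the inputs shown here.

```java
import java.util.regex.Pattern;

class StrictString {
    // Counterpart of: '"' (ESC|.)*? '"'   with ESC : '\\"' | '\\\\'
    private static final Pattern PERMISSIVE =
            Pattern.compile("\"(?:\\\\\"|\\\\\\\\|.)*?\"");
    // Counterpart of: '"' ( ESC | ~[\\"] )* '"'
    private static final Pattern STRICT =
            Pattern.compile("\"(?:\\\\[\"\\\\]|[^\\\\\"])*\"");

    /** True if the whole input matches the original, wildcard-based rule. */
    static boolean permissive(String s) {
        return PERMISSIVE.matcher(s).matches();
    }

    /** True if the whole input matches the rule that excludes \ and ". */
    static boolean strict(String s) {
        return STRICT.matcher(s).matches();
    }

    public static void main(String[] args) {
        String lone = "\"\\\"";                 // the three characters "\"
        System.out.println(permissive(lone));   // true
        System.out.println(strict(lone));       // false
    }
}
```

Because the strict pattern never lets a bare backslash through, a backslash can only appear as the first half of an escape sequence.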

How to use backslash escape char for new line in JavaCC?

I have an assignment to create a lexical analyser and I've got everything working except for one bit.
I need to create a string that will accept a new line, and the string is delimited by double quotes.
The string accepts any number, letter, some specified punctuation, backslashes and double quotes within the delimiters.
I can't seem to figure out how to escape a new line character.
Is there a certain way of escaping characters like new line and tab?
Here's some of my code that might help
< STRING : ( < QUOTE> (< QUOTE > | < BACKSLASH > | < ID > | < NUM > | " " )* <QUOTE>) >
< #QUOTE : "\"" >
< #BACKSLASH : "\\" >
So my string should allow for a quote, then any of the following characters like a backslash, a whitespace, a number etc, and then followed by another quote.
The newline char like "\n" is what's not working.
Thanks in advance!
For string literals, JavaCC borrows the syntax of Java. So, a single-character literal comprising a carriage return is escaped as "\r", and a single-character literal comprising a line feed is escaped as "\n".
However, the processed string value is just a single character; it is not the escape itself. So, suppose you define a token for line feed:
< LF : "\n" >
A match of the token <LF> will be a single line-feed character. When substituting the token in the definition of another token, the single character is effectively substituted. So, suppose you have the higher-level definition:
< STRING : "\"" ( <LF> ) "\"" >
A match of the token <STRING> will be three characters: a quotation mark, followed by a line feed, followed by a quotation mark. What you seem to want instead is for the escape sequence to be recognized:
< STRING : "\"" ( "\\n" ) "\"" >
Now a match of the token <STRING> will be four characters: a quotation mark, followed by the two-character escape sequence representing a line feed (a backslash and the letter n), followed by a quotation mark.
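Since JavaCC borrows Java's literal syntax, the character counts above can be confirmed in plain Java. This is a small illustration only; the class name EscapeLengths and its members are hypothetical and no JavaCC API is involved.

```java
class EscapeLengths {
    // A line feed written with the Java escape: one character at run time.
    static final String LINE_FEED = "\n";
    // A backslash followed by the letter n: two characters at run time.
    static final String ESCAPE_SEQUENCE = "\\n";

    /** Wraps the body in quotation marks, like the <STRING> tokens above. */
    static String quoted(String body) {
        return "\"" + body + "\"";
    }

    public static void main(String[] args) {
        System.out.println(quoted(LINE_FEED).length());       // prints 3
        System.out.println(quoted(ESCAPE_SEQUENCE).length()); // prints 4
    }
}
```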
In your current definition, I see that other often-escaped metacharacters like quotation mark and backslash are also being recognized literally, rather than as escape sequences.
