String literal in ANTLR4

I'm using antlr4 C++ runtime and I'd like to create a string literal in my lexer definition file. How can I do this?
What I have so far:
V_STRING : '"' ~('\\' | '"')* '"';
It doesn't work with
printf("string literal\n");
but works with
printf("string literal\\n");
I don't want to explicitly escape the newline character.
My assumption is that ANTLR interprets the newline character as a regular newline (when reading a file, for example).
Thanks in advance.

It's always a good idea to list out your token stream to see if your lexer rules really do what you expect. (Look into the -tokens option of the TestRig; also, some plugins will show you your tokens.)
In your case, your rule essentially says that a string is a " followed by 0 or more characters that are not a \ or a ", and then a ".
So, when the lexer encounters your \, it stops matching the ~('\\' | '"')* part of the rule and looks for a " (which it does not find, since the \ is followed by an n), so it won't recognize "string literal\n" as a V_STRING token. (It also fails to match "string literal\\n", so I'm not quite sure what's going on with the example that "works".)
Try:
V_STRING: '"' ~["]* '"';
Note: this is a very simple string rule, but it accepts your input. You probably want to examine grammars for other languages to see how you might want to handle strings in your language; there are several approaches (and many of them involve using lexer modes).
If you want the \n to be treated as a newline, just understand that the parser won't do that for you; you'll just see the characters \ and n. It'll be up to you to handle decoding the escaped characters (and it's once you try to handle \" that it'll get more complicated and you'll need to look into lexer modes).
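As a rough illustration of that post-lexing step, here is a minimal Python sketch that turns the raw text of a string token into its value. The helper name `unescape` and the escape table are assumptions made for this example; they are not part of ANTLR or its runtimes.

```python
# Hypothetical helper: decode the raw text of a string token (quotes included).
# The set of supported escapes is an assumption; adjust it to your language.
SIMPLE_ESCAPES = {"n": "\n", "t": "\t", "r": "\r", '"': '"', "\\": "\\"}

def unescape(token_text):
    body = token_text[1:-1]  # drop the surrounding double quotes
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\\" and i + 1 < len(body):
            nxt = body[i + 1]
            # Unknown escapes are kept verbatim (backslash plus character).
            out.append(SIMPLE_ESCAPES.get(nxt, "\\" + nxt))
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

Called on the token text "string literal\n" (backslash followed by n), this returns the text with a real newline at the end.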

Related

Catching String Literal Errors using Flex-Lexer

I have managed to effectively match valid string literals in my Flex program, but I would also like to match unterminated string literals and string literals with bad escape sequences.
For example, my string literals are matched using simple regex as such:
\"(\\.|[^\\"])*\"
Then I tried to catch the case where a string literal starts with a ", continues with some text, and then hits \n. This is incorrect syntax for my lexer, and I'd like to catch it and produce an error.
My current regex I came up with for that is this:
\"(\\.|[^\\"])*\n
Which does correctly catch the error, but then seems to eat up the rest of the tokens, because there's no output after that.
Additionally, I was also looking to have a special case error for when an unterminated string literal had an invalid escape sequence. For example:
"some text \
int abc
So my question boils down to, is there something wrong with my current way of matching string literals that's affecting my ability to catch these errors, or is my pattern matching unnecessarily consuming tokens? It's also possible I have no idea what I'm doing!
Some examples of strings:
"a correct string literal"
"an unterminated string literal
"an unterminated string literal with escape \
All string literals are single-line and follow the form:
"(.*)"\n
The correct flex pattern for string literals is (see below for escape sequences):
\"(\\(.|\n)|[^\\"\n])*\"
This differs from your pattern in that it allows a newline after an escape character (which is technically a splice, rather than part of the syntax of string literals [Note 1]), and bans newlines otherwise. That has to be done explicitly because [^...] includes a newline unless \n is part of the list of characters to be rejected. Only . implicitly bans newlines.
To match incorrect string literals, you only need the same pattern without the terminating ":
\"(\\(.|\n)|[^\\"\n])*
You don't need to worry about that pattern matching correct string literals, because flex always chooses the longest match, and the match with the terminating quote is guaranteed to be longer.
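To see why the longest-match rule makes the error pattern safe, here is a small Python sketch that emulates it by hand. Python's `re` module is not flex, so the "try every rule, keep the longest match, earlier rule wins ties" loop is written explicitly; the rule names and the function `next_token` are made up for this example.

```python
import re

# Python translations of the two flex patterns above.
VALID = re.compile(r'"(?:\\(?:.|\n)|[^\\"\n])*"')
UNTERMINATED = re.compile(r'"(?:\\(?:.|\n)|[^\\"\n])*')

# Order matters only for ties, as in flex (earlier rule wins).
RULES = [("STRING", VALID), ("UNTERMINATED_STRING", UNTERMINATED)]

def next_token(text, pos=0):
    """Return (rule_name, lexeme) for the longest match at pos, or None."""
    best = None
    for name, pattern in RULES:
        m = pattern.match(text, pos)
        # Strictly longer matches replace the current best; ties keep the earlier rule.
        if m and (best is None or m.end() > best[1].end()):
            best = (name, m)
    if best is None:
        return None
    return best[0], best[1].group()
```

For a correctly terminated literal the STRING rule always wins, because its match includes the closing quote and is therefore one character longer than what the error pattern can match.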
If you want to be more accurate about escape characters, you would need something like:
\"(\\([abfnrtv'"?\\\n]|[0-7]{1,3}|x[[:xdigit:]]+|u[[:xdigit:]]{4}|U[[:xdigit:]]{8})|[^\\"\n])*\"
You can use the same technique to match errors, but you might want to distinguish between unterminated quote errors and invalid escape errors, which you can do by using two error patterns:
\"(\\([abfnrtv'"?\\\n]|[0-7]{1,3}|x[[:xdigit:]]+|u[[:xdigit:]]{4}|U[[:xdigit:]]{8})|[^\\"\n])*\" { /* Valid string */ }
\"(\\([abfnrtv'"?\\\n]|[0-7]{1,3}|x[[:xdigit:]]+|u[[:xdigit:]]{4}|U[[:xdigit:]]{8})|[^\\"\n])*/\\ { /* Invalid escape sequence */ }
\"(\\([abfnrtv'"?\\\n]|[0-7]{1,3}|x[[:xdigit:]]+|u[[:xdigit:]]{4}|U[[:xdigit:]]{8})|[^\\"\n])* { /* Missing terminating quote */ }
Notes
A "splice" is a backslash at the end of a line. You commonly see these only in definitions of long macros, but C allows a splice anywhere: the backslash and the following newline are just deleted from the program text, so a splice can even be placed in the middle of an identifier or multicharacter operator. (But don't do that!)
Using splices to continue strings over multiple lines is not good style; it's better to use string concatenation. But the C standard does allow it.
However, splices are removed before tokenisation starts, which means that you cannot backslash escape a backslash at the end of a line:
"This is a string literal which includes a \\
t tab, with a splice in the middle of the escape."
Please don't use that in production code, either :-)

String lexical rule in ANTLR with greedy wildcard and escape character

From the book "The Definitive ANTLR 4 Reference":
Our STRING rule isn’t quite good enough yet because it doesn’t allow
double quotes inside strings. To support that, most languages define
escape sequences starting with a backslash. To get a double quote
inside a double-quoted string, we use \". To support the common escape
characters, we need something like the following:
STRING : '"' ( ESC |.)*? '"' ;
fragment
ESC : '\\"' | '\\\\' ; // 2-char sequences \" and \\
ANTLR itself needs to escape the escape character, so that’s why we need \\ to
specify the backslash character. The loop in STRING now matches either
an escape character sequence, by calling fragment rule ESC, or any
single character via the dot wildcard. The *? subrule operator
terminates the (ESC |.)*?
That sounds fine, but when I read that I noticed a certain ambiguity in the choice between ESC and .. As far as STRING is concerned, it is possible to match an input "Hi\"" by matching the escape character \ to the ., and to consider the following escaped double-quote as closing the string. This would even be less greedy and so would conform better to the use of ?.
The problem, of course, is that if we do that, then we have an extra double-quote at the end that does not get matched to anything.
So I wrote the following grammar:
grammar String;
anything: STRING '"'? '\r\n';
STRING: '"' (ESC|.)*? '"';
fragment
ESC: '\\"' | '\\\\';
which accepts an optional lonely double-quote character right after the string. This grammar still parses "Orange\"" as a full string.
So my question is: why is this the accepted parse, as opposed to the one taking "Orange\" as the STRING, followed by an isolated double-quote "? Note that the latter would be less greedy, which would seem to conform better to the use of ?, so one could think it would be preferable.
After some more experimentation, I realize the explanation is that the choice operator | is order-dependent (but only under the non-greedy operator ?): ESC is tried before .. If I invert the two and write (.|ESC)*?, I do get the other parse: "Orange\" as the STRING, followed by an isolated double quote.
This is not really surprising, but an interesting reminder that ANTLR is not as declarative as we may sometimes expect (in the sense that logic-or is order-independent but | is not). It is also a good reminder that the non-greedy operator ? does not extend its minimization capabilities to all choices, but just to the first one that matches the input (@sepp2k adds that order dependency only applies to the non-greedy case).
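The same order dependence can be reproduced outside ANTLR. The sketch below uses Python's `re` module purely as an analogy: it is a backtracking regex engine, not ANTLR's lexer, but its lazy quantifier combined with ordered alternation shows the same effect on the input from the question.

```python
import re

# Analogue of (ESC|.)*? : the escape alternatives are tried before the wildcard.
ESC_FIRST = re.compile(r'"(?:\\"|\\\\|.)*?"')
# Analogue of (.|ESC)*? : the wildcard is tried first and swallows the backslash.
DOT_FIRST = re.compile(r'"(?:.|\\"|\\\\)*?"')

text = r'"Orange\""'  # the ten characters "Orange\""

esc_first = ESC_FIRST.match(text).group()
dot_first = DOT_FIRST.match(text).group()

print(esc_first)  # the whole input
print(dot_first)  # stops at the escaped quote
```

With the escape alternative first, the backslash and the quote after it are consumed together, so the loop only stops at the final quote. With the wildcard first, the backslash is consumed alone and the very next quote closes the string.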

ANTLR: How to write a rule for enforcing line continuation character while writing a string?

I want to write a rule for parsing a string inside double quotes. I want to allow any character, with the only condition being that there MUST be a line continuation character \, when splitting the string on multiple lines.
Example:
variable = "first line \n second line \
still second line \n \
third line"
If the line continuation character is not found before a newline character is found, I want the parser to barf.
My current rule is this:
STRING : '"' (ESC|.)*? '"';
fragment ESC : '\\' [btnr"\\] ;
So I am allowing the string to contain any character, including a bunch of escape sequences. But I am not really enforcing that the line continuation character \ is required when splitting text across lines.
How can I make the grammar enforce that rule?
Even though there is already an accepted answer, let me put in my two cents. I strongly recommend not handling this type of error in a lexer rule. The reason is that you will not be able to give the user a good error message. First, lexer errors are usually not reported separately in ANTLR4; they appear as follow-up parser errors. Second, the produced error (likely something like "no viable alt at \n") is anything but helpful.
The better solution is to accept both variants (line break with or without escape) and do a semantic check afterwards. Then you know exactly what is wrong and can tell the user what you really expected.
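A minimal sketch of such a semantic check, written in Python for brevity. It assumes the lexer accepted the string with or without continuation characters and handed over the raw token text; the function name `find_unescaped_newlines` is hypothetical, and the check assumes "\n" line endings.

```python
def find_unescaped_newlines(token_text):
    """Return the offsets of newlines that are not preceded by a continuation backslash."""
    offsets = []
    i = 0
    while i < len(token_text):
        ch = token_text[i]
        if ch == "\\" and i + 1 < len(token_text):
            # Skip the backslash and the character it escapes,
            # which may be the newline of a line continuation.
            i += 2
            continue
        if ch == "\n":
            offsets.append(i)
        i += 1
    return offsets
```

Each returned offset can be combined with the token's start position to report a precise message such as "line break inside string without a continuation character".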
Solution
fragment ESCAPE
: '\\' .
;
STRING
: '"' (ESCAPE | ~[\\\n"])* '"'
;
Explanation
Fragment ESCAPE will match escaped characters (especially backslash and a new line character acting as a continuation sign).
Token STRING will match inside double quotation marks:
Escaped characters (fragment ESCAPE)
Everything except a backslash, a new line, and double quotation marks. (Excluding the backslash here ensures it can only appear as the start of an ESCAPE, so an escaped quote cannot close the string.)

Antlr4: syntax error: extraneous input ... expecting GT while looking for rule element

This is probably simple but I can't see the solution. Antlr v 4.0 tells me:
error(50): C:\Users\Brenden\Dev\proj\WikiParser\antlr\wiki\wikigrammar.g4:27:8:
syntax error: extraneous input '' LINK_BODY '' expecting GT while looking for rule element
This is for the input line:
link: '<' LINK_BODY '>' ;
27:8 refers to the < character. Not sure what is going on. Does < need to be escaped or something? I didn't see that on the wiki. The rest of the file seems to parse OK, there's several lines, the one above this one seems OK, it's terminated with a ; so I don't think anything else is messing this line up. Halp?
Edit: here's LINK_BODY, if it matters:
LINK_BODY: ~[<">]+ ;
The problem with this grammar started with the use of an improper escape sequence '\' in a rule which resulted in an unterminated string literal. Since ANTLR 4.0 allows a lexer set (e.g. [a-z]) or lexer string literal (e.g. 'parser') to contain newline characters, every ' character from the point of the mistake to the end of the file was causing it to switch into and out of string literals at all the wrong points.
This behavior has been altered for [the not-yet-released] ANTLR 4.0.1 by disallowing embedded newline characters in both of these tokens.
https://github.com/antlr/antlr4/pull/169

Unable to replace \" with " in HTML tags using C#

I want to replace \" with " in HTML tags using C#. I could not achieve this using Regex.Replace, as follows:
Regex.Replace(html, "\\"", "\"");
Executing this command just produces the original output again.
Has anyone already faced an issue like this? Any help would be greatly appreciated.
Regards,
Ganesan
First of all, "\\"" produces a compiler error, since you are only escaping one backslash but not the quote.
You are working with two escape mechanisms here: one is from the C# compiler, and the other is from the regex interpreter.
Which means:
When you give this C# string as a regex: "\\\"", then after compilation the string looks like \", which is then interpreted by the regex engine, which also uses \ as the escape character. Therefore the regex will escape the ", so your code will replace " with ".
So if you now use "\\\\\"", first the C# compiler will make \\" out of that, then the regex engine will make \" out of that (both use \ as the escape character).
Now, C# has a nice little feature to make such strings easier to write.
If you add an @ before your string, \ will no longer be the escape character, but now you have to escape " with "".
That means "\\\"" == @"\""" and "\\\\\"" == @"\\""".
So you could write Regex.Replace(html,@"\\""","\"")
which is easier to read than Regex.Replace(html,"\\\\\"","\"")
i think i got it right this time :D
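The same two layers of escaping exist in other languages. As a cross-check, here is a small Python sketch of the idea; the sample `html` value is made up for the example, and Python's raw string plays roughly the role that the verbatim string plays in C#.

```python
import re

# Hypothetical input containing literal \" sequences.
html = '<a href=\\"index.html\\">home</a>'

# Layer 1 (string literal) and layer 2 (regex) both use the backslash:
# the regex \\" needs every backslash doubled again in a normal literal.
with_regular_literal = re.sub("\\\\\"", "\"", html)

# A raw string removes the first layer, so only the regex escaping remains.
with_raw_literal = re.sub(r'\\"', '"', html)

# For a fixed substring no regex is needed at all.
with_plain_replace = html.replace('\\"', '"')

print(with_raw_literal)
```

All three calls produce the same result; the plain replacement is the easiest to read when the text to find is a fixed substring and not a pattern.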

Resources