Catching String Literal Errors using Flex-Lexer

I have managed to effectively match valid string literals in my Flex program, but I would also like to match unterminated string literals and string literals with bad escape sequences.
For example, my string literals are matched using simple regex as such:
\"(\\.|[^\\"])*\"
Then I tried to match the case where a string literal starts with a ", continues with some text, and then hits \n. This is incorrect syntax for my lexer, and I'd like to catch it and report an error.
My current regex I came up with for that is this:
\"(\\.|[^\\"])*\n
Which does correctly catch the error, but then seems to eat up the rest of the tokens, because there's no output after that.
Additionally, I was also looking to have a special case error for when an unterminated string literal had an invalid escape sequence. For example:
"some text \
int abc
So my question boils down to, is there something wrong with my current way of matching string literals that's affecting my ability to catch these errors, or is my pattern matching unnecessarily consuming tokens? It's also possible I have no idea what I'm doing!
Some examples of strings:
"a correct string literal"
"an unterminated string literal
"an unterminated string literal with escape \
All string literals are single-line and follow the form:
"(.*)"\n

The correct flex pattern for string literals is (see below for escape sequences):
\"(\\(.|\n)|[^\\"\n])*\"
This differs from your pattern in that it allows a newline after an escape character (which is technically a splice, rather than part of the syntax of string literals [Note 1]), and bans newlines otherwise. That has to be done explicitly because [^...] includes a newline unless \n is part of the list of characters to be rejected. Only . implicitly bans newlines.
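The newline behaviour of a negated character class can be checked outside flex. The following is only a sketch using Python's re module, which is a backtracking engine rather than flex, with a made-up input; it suggests why the error pattern from the question swallows the following lines while the version that excludes \n stops at the end of the line.

```python
import re

# Hypothetical input: an unterminated string followed by two more lines.
source = '"unterminated\nint abc;\nint def;\n'

# The error pattern from the question: [^\\"] also matches newlines.
greedy = re.compile(r'"(?:\\.|[^\\"])*\n')
# The same idea with \n excluded from the character class.
bounded = re.compile(r'"(?:\\(?:.|\n)|[^\\"\n])*')

swallowed = greedy.match(source).group()
stopped = bounded.match(source).group()

print(repr(swallowed))  # the whole input, up to the last newline
print(repr(stopped))    # only the text up to the end of the first line
```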
To match incorrect string literals, you only need the same pattern without the terminating ":
\"(\\(.|\n)|[^\\"\n])*
You don't need to worry about that pattern matching correct string literals, because flex always chooses the longest match, and the match with the terminating quote is guaranteed to be longer.
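As a rough illustration of the longest-match argument, here is a Python sketch that emulates flex's rule selection by hand. The rule names and the longest_match helper are invented for this example; flex does this selection internally.

```python
import re

# Python translations of the two flex patterns. The names STRING and
# UNTERMINATED are hypothetical labels, not part of flex.
RULES = [
    ("STRING", re.compile(r'"(?:\\(?:.|\n)|[^\\"\n])*"')),
    ("UNTERMINATED", re.compile(r'"(?:\\(?:.|\n)|[^\\"\n])*')),
]

def longest_match(text, pos=0):
    """Return (rule_name, lexeme) for the longest match; earlier rules win ties."""
    best = None
    for name, pattern in RULES:
        m = pattern.match(text, pos)
        if m and (best is None or m.end() > best[1].end()):
            best = (name, m)
    if best is None:
        return None
    return best[0], best[1].group()

print(longest_match('"ok" rest'))
print(longest_match('"oops\nint abc'))
```

For a correctly terminated literal both rules match, but the first one is longer by the closing quote, so the error rule only fires when the closing quote is missing.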
If you want to be more accurate about escape characters, you would need something like:
\"(\\([abfnrtv'"?\\\n]|[0-7]{1,3}|x[[:xdigit:]]+|u[[:xdigit:]]{4}|U[[:xdigit:]]{8})|[^\\"\n])*\"
You can use the same technique to match errors, but you might want to distinguish between unterminated quote errors and invalid escape errors, which you can do by using two error patterns:
\"(\\([abfnrtv'"?\\\n]|[0-7]{1,3}|x[[:xdigit:]]+|u[[:xdigit:]]{4}|U[[:xdigit:]]{8})|[^\\"\n])*\" { /* Valid string */ }
\"(\\([abfnrtv'"?\\\n]|[0-7]{1,3}|x[[:xdigit:]]+|u[[:xdigit:]]{4}|U[[:xdigit:]]{8})|[^\\"\n])*/\\ { /* Invalid escape sequence */ }
\"(\\([abfnrtv'"?\\\n]|[0-7]{1,3}|x[[:xdigit:]]+|u[[:xdigit:]]{4}|U[[:xdigit:]]{8})|[^\\"\n])* { /* Missing terminating quote */ }
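The three-way classification can be prototyped outside flex before committing to the scanner rules. This is a sketch under assumptions: it uses Python's re module, replaces the trailing-context test with an explicit check of the next character, and the classify helper and its labels are hypothetical.

```python
import re

# Valid escape sequences, translated from the flex pattern
# ([[:xdigit:]] becomes [0-9A-Fa-f]).
ESC = r'''\\(?:[abfnrtv'"?\\\n]|[0-7]{1,3}|x[0-9A-Fa-f]+|u[0-9A-Fa-f]{4}|U[0-9A-Fa-f]{8})'''
BODY = r'"(?:' + ESC + r'|[^\\"\n])*'

VALID = re.compile(BODY + '"')
PREFIX = re.compile(BODY)

def classify(text, pos=0):
    """Return (kind, lexeme) for a string literal starting at pos, or None."""
    m = VALID.match(text, pos)
    if m:
        return "valid", m.group()
    m = PREFIX.match(text, pos)
    if m is None:
        return None  # no opening quote at pos
    # Stopping right before a backslash means the escape was not recognised.
    if text.startswith("\\", m.end()):
        return "invalid escape", m.group()
    return "unterminated", m.group()

print(classify('"tab\\there"'))
print(classify('"bad \\q"'))
print(classify('"open\nint abc'))
```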
Notes
A "splice" is a backslash at the end of a line. You commonly see these only in definitions of long macros, but C allows a splice anywhere: the backslash and the following newline are just deleted from the program text, so a splice can even be placed in the middle of an identifier or multicharacter operator. (But don't do that!)
Using splices to continue strings over multiple lines is not good style; it's better to use string concatenation. But the C standard does allow it.
However, splices are removed before tokenisation starts, which means that you cannot backslash escape a backslash at the end of a line:
"This is a string literal which includes a \\
t tab, with a splice in the middle of the escape."
Please don't use that in production code, either :-)

Related

String literal in ANTLR4

I'm using antlr4 C++ runtime and I'd like to create a string literal in my lexer definition file. How can I do this?
What I have so far:
V_STRING : '"' ~('\\' | '"')* '"';
It doesn't work with
printf("string literal\n");
but works with
printf("string literal\\n");
I don't want to explicitly escape the new line character.
My assumption is that ANTLR interprets the newline character as a regular newline (when reading a file, for example).
Thanks in advance.
It's always a good idea to list out your token stream to see if your Lexer rules really do what you expect. (Look into the tokens option of the TestRig; also, some plugins will show you your tokens)
In your case, your rule essentially says that a string is a " followed by zero or more characters that are not a \ or a ", and then a ".
So, when the lexer encounters your \, it stops matching the ~('\\' | '"')* part of the rule and then looks for a " (which it does not find, since the \ is followed by an n), so it won't recognize "string literal\n" as a V_STRING token (it also fails to match "string literal\\n", so I'm not quite sure what's going on with the example that "works").
try:
V_STRING: '"' ~["]* '"';
Note: this is a very simple string rule, but it accepts your input. You probably want to examine grammars for other languages to see how you might want to handle strings in your language; there are several approaches (and many of them involve using lexer modes). You can find examples here.
If you want the "\n" to be treated as a newline, just understand that the parser won't do that for you; you'll just see the characters "\" and "n". It'll be up to you to translate the escape sequences (and it's once you try to handle \" that it'll get more complicated and you'll need to look into lexer modes).
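Translating the escape sequences is typically done on the token text after lexing. The sketch below shows the idea in Python rather than in the C++ runtime; the unescape helper and its small escape table are assumptions made for illustration, not part of ANTLR.

```python
import re

# Hypothetical escape table; extend it to match your language.
ESCAPES = {"n": "\n", "t": "\t", "r": "\r", '"': '"', "\\": "\\"}

def unescape(token_text):
    """Strip the surrounding quotes and translate backslash escapes.

    Unknown escapes are left untouched.
    """
    body = token_text[1:-1]
    return re.sub(
        r"\\(.)",
        lambda m: ESCAPES.get(m.group(1), m.group(0)),
        body,
    )

# The token text contains a backslash followed by the letter n.
print(repr(unescape('"string literal\\n"')))
```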

How to split strings separated by comma with escapes?

I have a string that looks like this:
(The whole code block is a string, i.e., this string contains quotation marks.)
"he\"llo", "world\n", "fro,m"
[update] Aka, the "actual" string is this:
"\"he\\\"llo\", \"world\\n\", \"fro,m\""
I want to get an array of strings like this:
[ "\"he\\\"llo\"", "\"world\\n\"", "\"fro,m\"" ]
[update] Commas inside quotation marks should be kept.
In my opinion, there are several ways to solve this:
build an automaton (DFA or NFA) for this syntax
use several status flags such as inQuote, and handle the logic with lots of if/else
write a complex but clever regular expression for this
Are there any general solutions to this problem? Or how should I actually implement one of the ideas above?
P.S. It would be even better if syntax errors such as "unclosed quotation mark" could be detected.
You need to first define your grammar. This is a simple grammar for your case:
document = *WS [string *WS *(',' *WS string *WS)]
string = %x22 *char %x22
char = %x20-21 / %x23-5B / escape / %x5D-10FFFF
escape = %x5C (%x5C / %x22 / 't' / 'n' / 'r')
WS = %x9 / %x20
You can read it as:
A document may begin/end with a white space, then may have one or more strings separated by commas. Before and after each comma there may be some white space.
A string is made of characters and begins and ends with double quotes Unicode/ASCII hex code 22.
Each character (char) may be: 1) any non-control Unicode character before the double quote, i.e. hex 20 (space) or hex 21 (exclamation mark); 2) any character after the double quote and before the escape backslash \ (hex 5C); 3) an escape character sequence; 4) any other Unicode character after the backslash (hex 5C).
The escape sequence (rule escape) begins with the escape backslash \ and is followed by another backslash, a double quote (hex 22), or one of the characters t for tab, n for line feed, and r for carriage return. You may also add other escapable characters if you want; for the C++ string syntax, see https://en.cppreference.com/w/cpp/language/escape .
A white space (WS) is a tab or a space; you may also add %xA and %xD for line feed and carriage return respectively.
By the use of this grammar you will get this tree for your input:
The screenshot is from the Tunnel Grammar Studio online laboratory, which can run ABNF grammars (such as the one above) and which I work on.
After you have the grammar, you may use tools to generate a parser, or you can write one yourself. If you want to do it by hand (preferable for such a small and simple grammar), you can write one function per grammar rule, each of which reads characters and checks whether they are the expected ones. If your input ends while you are parsing the string rule, then the input has a string that was started but not finished.
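A hand-written parser along those lines might look like the following Python sketch. The function names, the ParseError class, and the error messages are invented for this example; the rules follow the ABNF grammar above, with one function per rule (char and escape are folded into parse_string for brevity).

```python
class ParseError(Exception):
    """Raised when the input does not follow the grammar."""

ESCAPABLE = {"\\", '"', "t", "n", "r"}

def skip_ws(text, i):
    # WS = %x9 / %x20
    while i < len(text) and text[i] in " \t":
        i += 1
    return i

def parse_string(text, i):
    # string = %x22 *char %x22; returns (raw string with quotes, next index)
    if i >= len(text) or text[i] != '"':
        raise ParseError(f"expected a double quote at index {i}")
    start = i
    i += 1
    while i < len(text):
        c = text[i]
        if c == '"':
            return text[start:i + 1], i + 1
        if c == "\\":
            if i + 1 >= len(text) or text[i + 1] not in ESCAPABLE:
                raise ParseError(f"bad or incomplete escape at index {i}")
            i += 2
        elif c < " ":
            raise ParseError(f"control character at index {i}")
        else:
            i += 1
    raise ParseError(f"unclosed quotation mark starting at index {start}")

def parse_document(text):
    # document = *WS [string *WS *(',' *WS string *WS)]
    items = []
    i = skip_ws(text, 0)
    if i == len(text):
        return items
    while True:
        item, i = parse_string(text, i)
        items.append(item)
        i = skip_ws(text, i)
        if i == len(text):
            return items
        if text[i] != ",":
            raise ParseError(f"expected ',' at index {i}")
        i = skip_ws(text, i + 1)

print(parse_document('"he\\"llo", "world\\n", "fro,m"'))
```

Because the comma check only happens between strings, a comma inside quotation marks is kept as part of its string, and running out of input inside parse_string reports the unclosed quotation mark.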
Your actual string syntax tree will look like that:

.encode("utf-8") does not seem to support my emoji?

Sorry, this is likely a simple solution. In a server response, I am trying to return an emoji. Right now, I have this line:
return [b"Hello World " “👋”.encode(“utf-8”)]
However, I get the following error:
return [b"Hello World " “👋”.encode(“utf-8”)]
^
SyntaxError: invalid character in identifier
What I'd like to see is: Hello World 👋
The problem is that a byte string b'...' cannot contain a character which does not fit into a byte. But you don't need a byte string here anyway; encode will convert a string to bytes - that's what it does.
Try
return ["Hello World “👋”".encode("utf-8")]
The quotes in your question were wacky; I assume you want curly quotes around the emoji and honest-to-GvR quotes around the Python string literals.
In Python, you can put (almost) any Unicode character in a string literal.
Also, you can use most Unicode letters in identifiers, e.g. if you think it's appropriate to define a variable α (Greek lower-case alpha).
But you can't use "fancy quotes" for delimiting string literals.
Look carefully: the double quotes around the emoji (and also around utf-8) aren't straight ASCII quotes, but rather typographical ones, the ones word processors substitute when typing a double quote in text.
Make sure you use a proper programming editor or an IDE for coding.
Then the line will look like this:
return [b"Hello World " "👋".encode("utf-8")]
Edit:
I realise that this still doesn't work: you cannot mix byte string and Unicode string literals (even though here the Unicode literal gets converted to bytes later).
Instead, you have to concatenate them with the + operator:
return [b"Hello World " + "👋".encode("utf-8")]
Or use a single string literal, like tripleee suggested.
There are several problems. Your code:
return [b"Hello World " “👋”.encode(“utf-8”)]
You see, you are using “ and ” twice, instead of the proper double quote character ("). You should use a proper code editor, or disable automatic character conversion. You should also take care when copying and pasting.
As you can see from the error, the problem is not the emoji in the string but an identifier problem (and it is a syntax error): the unknown characters are outside any string.
But if you correct to:
return [b"Hello World " "👋".encode("utf-8")]
you still have an error: SyntaxError: cannot mix bytes and nonbytes literals.
This is because Python merges adjacent string literals before calling the encode method, but one is a bytes literal and the other is a normal string.
So you could use one of the following:
return [b"Hello World " + "👋".encode("utf-8")] # explicit + instead of implicit literal concatenation
or either of the following two (which are equivalent):
return ["Hello World 👋".encode("utf-8")]
return [b"Hello World \xf0\x9f\x91\x8b"]
Python 3 uses UTF-8 as the default source encoding, so the emoji can be written directly in a normal string literal. A bytes literal, however, may contain only ASCII characters, so there the emoji has to be spelled out as its UTF-8 bytes with \x escapes.
Note: relying on the default source encoding is not a completely safe assumption: one could manually force the source code to be in another encoding, but then saving a file that contains an emoji would probably be a problem in the first place.
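To compare the working variants side by side, here is a small sketch. The variable names are made up, and it assumes the file is saved as UTF-8 (Python 3's default source encoding). Bytes literals accept only ASCII characters, so the emoji appears either in a str literal or as \x escapes.

```python
wave = "👋"  # U+1F44B, written directly in a str literal

# Encode the str part, then join the two bytes objects with +.
joined = b"Hello World " + wave.encode("utf-8")

# Encode one single str literal.
single = "Hello World 👋".encode("utf-8")

# Spell out the UTF-8 bytes of the emoji inside a bytes literal.
escaped = b"Hello World \xf0\x9f\x91\x8b"

print(joined == single == escaped)
print(joined.decode("utf-8"))
```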

String lexical rule in ANTLR with greedy wildcard and escape character

From the book "The Definitive ANTLR 4 Reference":
Our STRING rule isn’t quite good enough yet because it doesn’t allow
double quotes inside strings. To support that, most languages define
escape sequences starting with a backslash. To get a double quote
inside a double-quoted string, we use \". To support the common escape
characters, we need something like the following:
STRING : '"' ( ESC |.)*? '"' ;
fragment
ESC : '\\"' | '\\\\' ; // 2-char sequences \" and \\
ANTLR itself needs to escape the escape character, so that’s why we need \\ to
specify the backslash character. The loop in STRING now matches either
an escape character sequence, by calling fragment rule ESC, or any
single character via the dot wildcard. The *? subrule operator
terminates the (ESC |.)*?
That sounds fine, but when I read that I noticed a certain ambiguity in the choice between ESC and .. As far as STRING is concerned, it is possible to match an input "Hi\"" by matching the escape character \ to the ., and to consider the following escaped double-quote as closing the string. This would even be less greedy and so would conform better to the use of ?.
The problem, of course, is that if we do that, then we have an extra double-quote at the end that does not get matched to anything.
So I wrote the following grammar:
grammar String;
anything: STRING '"'? '\r\n';
STRING: '"' (ESC|.)*? '"';
fragment
ESC: '\\"' | '\\\\';
which accepts an optional lonely double-quote character right after the string. This grammar still parses "Orange\"" as a full string:
So my question is: why is this the accepted parse, as opposed to the one taking "Orange\" as the STRING, followed by an isolated double-quote "? Note that the latter would be less greedy, which would seem to conform better to the use of ?, so one could think it would be preferable.
After some more experimentation, I realize the explanation is that the choice operator | is order-dependent (but only under the non-greedy operator ?): ESC is tried before .. If I invert the two and write (.|ESC)*?, I do get the other parse, with "Orange\" as the STRING followed by an isolated double-quote.
This is not really surprising, but an interesting reminder that ANTLR is not as declarative as we may sometimes expect (in the sense that logic-or is order-independent but | is not). It is also a good reminder that the non-greedy operator ? does not extend its minimization capabilities to all choices, but just to the first one that matches the input (@sepp2k adds that order dependency only applies to the non-greedy case).
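An analogous order dependence can be observed with a lazy quantifier in a backtracking regex engine. The sketch below uses Python's re module, which is not ANTLR's lexer and works differently internally, so treat it as an analogy rather than as an explanation of ANTLR itself.

```python
import re

# The input is: "Orange\"" (opening quote, Orange, backslash, two quotes).
text = '"Orange\\""'

# ESC alternatives first, then the wildcard.
esc_first = re.compile(r'"(?:\\"|\\\\|.)*?"')
# Wildcard first, then the ESC alternatives.
dot_first = re.compile(r'"(?:.|\\"|\\\\)*?"')

full = esc_first.match(text).group()
short = dot_first.match(text).group()

print(full)   # the whole input is one string
print(short)  # stops at the escaped quote, leaving a lone "
```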

How do I put a single backslash into an ES6 template literal's output?

I'm struggling to get an ES6 template literal to produce a single backslash in its result.
> `\s`
's'
> `\\s`
'\\s'
> `\\\s`
'\\s'
> `\\\\s`
'\\\\s'
> `\u005Cs`
'\\s'
Tested with Node 8.9.1 and 10.0.0 by inspecting the value at a Node REPL (rather than printing it using console.log)
If I get your question right, how about \\?
I tried using $ node -i and run
console.log(`\\`);
Which successfully output a backslash. Keep in mind that the output might be escaped as well, so the only way to know you are successfully getting a backslash is getting the character code:
const myBackslash = `\\`;
console.log(myBackslash.charCodeAt(0)); // 92
And to make sure you are not actually getting \\ (i.e. a double-backslash), check the length:
console.log(myBackslash.length); // 1
It is a known issue that unknown string escape sequences lose their escaping backslash in JavaScript normal and template string literals:
When a character in a string literal or regular expression literal is
preceded by a backslash, it is interpreted as part of an escape
sequence. For example, the escape sequence \n in a string literal
corresponds to a single newline character, and not the \ and n
characters. However, not all characters change meaning when used in an
escape sequence. In this case, the backslash just makes the character
appear to mean something else, and the backslash actually has no
effect. For example, the escape sequence \k in a string literal just
means k. Such superfluous escape sequences are usually benign, and do
not change the behavior of the program.
In regular string literals, one needs to double the backslash in order to introduce a literal backslash char:
console.log("\s \\s"); // => s \s
console.log('\s \\s'); // => s \s
console.log(`\s \\s`); // => s \s
There is a better idea: use String.raw:
The static String.raw() method is a tag function of template
literals. This is similar to the r prefix in Python, or the @
prefix in C# for string literals. (But it is not identical; see
explanations in this issue.) It's used to get the raw string form
of template strings, that is, substitutions (e.g. ${foo}) are
processed, but escapes (e.g. \n) are not.
So, you may simply use String.raw`\s` to define a \s text:
console.log(String.raw`s \s \\s`); // => s \s \\s
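Since the quoted passage compares String.raw to Python's r prefix, here is the Python side of that comparison as a short sketch (Python, not JavaScript; the variable names are made up). The same checks used earlier in this thread, character code and length, confirm that there is exactly one backslash.

```python
raw = r"\s"       # raw literal: the backslash is kept as written
doubled = "\\s"   # regular literal: the backslash has to be doubled

print(raw == doubled)  # both spell the same two-character string
print(len(raw))        # 2: one backslash plus the letter s
print(ord(raw[0]))     # 92, the code point of the backslash
```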
