I'm confused about the purpose of "r". As I understand it, it makes a backslash be read as a normal character rather than as an escape character.
I tried several variants, shown below, and they all give the same output, which leaves me confused about what "r" really does. The first three make sense to me; the fourth is where I'm confused.
1.re.sub("n\'t", " not", " i am n't happy")
2.re.sub("n\'t", " not", " i am n\'t happy")
3.re.sub(r"n\'t", " not", " i am n\'t happy")
4.re.sub(r"n\'t", " not", " i am n't happy")
The result of all 4 above is:
' i am not happy'
import re
re.sub(r"n\'t", " not", " i am n't happy")
Given that I have used "r", I expected the backslash to be treated as a character and not as an escape character.
Actual Output
' i am not happy'
Expected Output
' i am n't happy'
The thing is that there are two layers of escaping: in the string literal, and in the regex. In both, \' simply means ': the string literal turns \' into a plain ', and the regex treats an escaped ' as a literal '.
What using r"" does here is to skip the first string-literal escaping, so that a literal \ is included in the string, but then the regex sees the string \' and just treats it as '.
So all four come down to replacing n't with not.
You still need double backslashes to match a literal backslash.
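As a quick check of the two layers, here is a minimal sketch (the variable names are only for illustration). It compares the plain and raw patterns, then shows the double backslash needed to match a real backslash in the input.

```python
import re

text = " i am n't happy"

plain = "n\'t"   # the string literal turns \' into ', so the pattern is n't (3 characters)
raw = r"n\'t"    # the raw literal keeps the backslash, so the pattern is n\'t (4 characters)

# The regex engine reads \' as a literal ', so both patterns match the same text.
same_result = re.sub(plain, " not", text) == re.sub(raw, " not", text)
print(len(plain), len(raw), same_result)

# Input that really contains a backslash before the quote.
with_backslash = " i am n\\'t happy"

# r"n\'t" needs n directly followed by ', so nothing is replaced here.
unchanged = re.sub(r"n\'t", " not", with_backslash)

# r"n\\'t" matches n, a literal backslash, ', t.
replaced = re.sub(r"n\\'t", " not", with_backslash)
print(unchanged == with_backslash, "\\" in replaced)
```

The raw prefix only changes what ends up in the pattern string; what that string then means is still decided by the regex engine.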
I'm trying to expand the answer to How do I escape an escape character with ANTLR 4? to work when the " can be escaped both with " and \. I.e. both
"Rob ""Commander Taco"" Malda is smart."
and
"Rob \"Commander Taco\" Malda is smart."
are both valid and equivalent. I've tried
StringLiteral : '"' ('""'|'\\"'|~["])* '"';
but it fails to match
"Entry Flag for Offset check and for \"don't start Chiller Water Pump Request\""
with the tokenizer consuming more characters than intended, i.e. consumes beyond \""
Anyone who knows how to define the lexer rule?
A bit more detail...
"" succeeds
"""" succeeds
"\" " succeeds
"\"" succeeds (at EOF)
"\""\n"" fails (it greedily pulls in the \n and ")
Example: (text.txt)
""
""""
"\" "
"\""
""
grun test tokens -tokens < test.txt
line 5:1 token recognition error at: '"'
[#0,0:1='""',<StringLiteral>,1:0]
[#1,2:2='\n',<'
'>,1:2]
[#2,3:6='""""',<StringLiteral>,2:0]
[#3,7:7='\n',<'
'>,2:4]
[#4,8:12='"\" "',<StringLiteral>,3:0]
[#5,13:13='\n',<'
'>,3:5]
[#6,14:19='"\""\n"',<StringLiteral>,4:0]
[#7,21:20='<EOF>',<EOF>,5:2]
\"" and """ at the end of a StringLiteral are not being handled the same.
Here's the ATN for that rule:
From this diagram it's not clear why they should be handled differently. They appear to be parallel constructs.
More research
Test Grammar (small change to simplify ATN):
grammar test
;
start: StringLiteral (WS? StringLiteral)+;
StringLiteral: '"' ( (('\\' | '"') '"') | ~["])* '"';
WS: [ \t\n\r]+;
The ATN for StringLiteral in this grammar:
OK, let's walk through this ATN with the input "\""\n"
unconsumed input    transition
"\""\n"             1 -ε-> 5
"\""\n"             5 -"-> 11
\""\n"              11 -ε-> 9
\""\n"              9 -ε-> 6
\""\n"              6 -\-> 7
""\n"               7 -"-> 10
"\n"                10 -ε-> 13
"\n"                13 -ε-> 11
"\n"                11 -ε-> 12
"\n"                12 -ε-> 14
"\n"                14 -"-> 15
\n"                 15 -ε-> 2
We should reach State 2 with the " before the \n, which would be the desired behavior.
Instead, we see it continue on to consume the \n and the next "
line 2:1 token recognition error at: '"'
[#0,0:5='"\""\n"',<StringLiteral>,1:0]
[#1,7:6='<EOF>',<EOF>,2:2]
In order for this to be valid, there must be a path from state 11 to state 2 that consumes a \n and a " (and I'm not seeing it)
Maybe I'm missing something, but it's looking more and more like a bug to me.
The problem is handling the \ properly.
Bart found the path through the ATN that I missed and allowed it to match the extra \n". The \ is matched as a ~["] and then comes back through and matches the " to terminate the string.
We could disallow \ in the "everything but a " alternative (~["\\]), but then we have to allow a stand-alone \ to be acceptable. We'd want to add an alternative that allows a \ followed by anything other than a ". You'd think that '\\' ~["] does that, and you'd be right, to a point, but it also consumes the character following the \. That is a problem if you want a string like "test \\" string": since it has consumed the second \, you can't match the \" alternative. What you're looking for is a lookahead (i.e. consume the \ if it's not followed by a ", but don't consume the following character). But ANTLR lexer rules don't allow for lookahead.
You'll notice that most grammars that allow \" as an escape sequence in strings also require a bare \ to be escaped (\\), and frequently treat any other \ (other character) sequence as just the "other character".
If escaping the \ character is acceptable, the rule could be simplified to:
StringLiteral: '"' ('\\' . | '""' | ~["\\])* '"';
"Flag for \\"Chiller Water\\"" would not parse correctly, but "Flag for \\\"Chiller Water\\\"" would. Without lookahead, I'm not seeing a way to lex the first version.
Also, note that if you don't escape the \, then you have an ambiguous interpretation of \"". Is it \" followed by a " to terminate the string, or \ followed by "" allowing the string to continue? ANTLR will take whichever interpretation consumes the most input, so we see it using the second interpretation and pulling in characters until it finds a ".
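To try the simplified rule outside ANTLR, here is a rough Python sketch that transcribes it into a regular expression. This is only an analogy: a backtracking regex is not an ANTLR lexer, and the names STRING_LITERAL, sample and escaped are made up for the example. The sample text mirrors the five-line test file from the question.

```python
import re

# Regex counterpart of: '"' ('\\' . | '""' | ~["\\])* '"'
# re.DOTALL lets "." match a newline, as the ANTLR wildcard does.
STRING_LITERAL = re.compile(r'"(?:\\.|""|[^"\\])*"', re.DOTALL)

# The five lines of the test file: "", """", "\" ", "\"", ""
sample = '""\n""""\n"\\" "\n"\\""\n""\n'

tokens = STRING_LITERAL.findall(sample)
for tok in tokens:
    print(tok)

# With every backslash escaped, the whole literal is matched as one string.
escaped = r'"Flag for \\\"Chiller Water\\\""'
print(STRING_LITERAL.fullmatch(escaped) is not None)
```

Because a backslash always consumes exactly one following character here, \"" can only be read as an escaped quote followed by the closing quote, so the fourth line no longer swallows the newline.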
I cannot reproduce it. Given the grammar:
grammar T;
parse
: .*? EOF
;
StringLiteral
: '"' ( '""' | '\\"' | ~["] )* '"'
;
Other
: . -> skip
;
The following code:
String source =
"\"Rob \"\"Commander Taco\"\" Malda is smart.\"\n" +
"\"Rob \\\"Commander Taco\\\" Malda is smart.\"\n" +
"\"Entry Flag for Offset check and for \\\"don't start Chiller Water Pump Request\\\"\"\n";
TLexer lexer = new TLexer(CharStreams.fromString(source));
CommonTokenStream stream = new CommonTokenStream(lexer);
stream.fill();
for (Token t : stream.getTokens()) {
System.out.printf("%-20s '%s'\n",
TLexer.VOCABULARY.getSymbolicName(t.getType()),
t.getText().replace("\n", "\\n"));
}
produces the following output:
StringLiteral '"Rob ""Commander Taco"" Malda is smart."'
StringLiteral '"Rob \"Commander Taco\" Malda is smart."'
StringLiteral '"Entry Flag for Offset check and for \"don't start Chiller Water Pump Request\""'
Tested with ANTLR 4.9.3 and 4.10.1: both produce the same output.
Are there any "quote" symbols that you can use if you run out of them?
I know one can use " and ' (and, perhaps a combination of them) like "''" (somehow)
An example of an "additional" "quote" symbol would be
\" inside quotes.
If I write a Python script and have "used up" both " and ', are there any more characters that I can use to indicate a quote?
No. From what I can tell, ' and " are the only quotes that can be used in a string literal. The Lexical Analysis page contains this information:
shortstring ::= "'" shortstringitem* "'" | '"' shortstringitem* '"'
longstring ::= "'''" longstringitem* "'''" | '"""' longstringitem* '"""'
In plain English: Both types of literals can be enclosed in matching single quotes (') or double quotes ("). They can also be enclosed in matching groups of three single or double quotes (these are generally referred to as triple-quoted strings).
Which suggests that those are the only two options.
No, " and ' are the only quote characters in Python.
If you run out of quote symbols you can "escape" quotes with a backslash (\), which makes Python treat them as part of the string instead of as delimiters.
For example, this would be a valid string:
print("hello \"bob\"")
This would print hello "bob"
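For a string that needs both kinds of quote, a small sketch of the usual options (the variable names are only illustrative): backslash escapes, a triple-quoted literal, and adjacent literals, which Python joins into one string.

```python
# Backslash-escape the delimiter that is in use.
escaped = "She said \"it's fine\""

# A triple-quoted literal can hold both quote characters unescaped.
triple = '''She said "it's fine"'''

# Adjacent string literals are concatenated, so the quoting style can change midway.
joined = 'She said "it' "'s fine" '"'

print(escaped)
print(escaped == triple == joined)
```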
No, there are no other quote characters in Python for right now.
As far as I know, ' and " are the only quote characters in Python,
but if you count a multi line string then it would look like this:
multiline = """
string
string
string
"""
and to make a list of them you would use:
multiline.splitlines()
I used string.gsub(str, "%s+") to remove spaces from a string, but I don't want it to remove newlines. Example:
str = "string with\nnew line"
string.gsub(str, "%s+")
print(str)
and I'm expecting the output to be like:
stringwith
newline
What pattern should I use to get that result?
It seems you want to match any whitespace matched with %s but exclude a newline char from the pattern.
You can use a reverse %S pattern (that matches any non-whitespace char) in a negated character set, [^...], and add a \n there:
local str = "string with\nnew line"
str = string.gsub(str, "[^%S\n]+", "")
print(str)
See an online Lua demo yielding
stringwith
newline
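The same double-negation trick carries over to other regex flavours. As a sketch in Python rather than Lua (so \S plays the role of %S, and the variable names are only illustrative):

```python
import re

text = "string with\nnew line"

# [^\S\n] means "not a non-whitespace character and not a newline",
# i.e. any whitespace character except \n.
result = re.sub(r"[^\S\n]+", "", text)
print(result)
```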
"%s" matches any whitespace character. if you want to match a space use " ". If you want to define a specific number of spaces either explicitly write them down " " or use string.rep(" ", 5)
I'm trying to implement a parser using ANTLRv4 for a language that accepts both "" and \" as a way escaping " characters in " delimited strings.
The answers to this question show how to do it for "" escaping. However when I try to extend it to also cover the \" case, it almost works but becomes too greedy when two strings are on the same line.
Here is my grammar:
grammar strings;
strings : STRING (',' STRING )* ;
STRING
: '"' (~[\r\n"] | '""' | '\"' )* '"'
;
Here is my input of three strings:
"This is ""my string\"",
"cat","fish"
This correctly recognises "This is ""my string\"", but thinks that "cat","fish" is all one string.
If I move "fish" down on to the next line it works correctly.
Can anyone figure out how to make it work if "cat" and "fish" are on the same line?
Make your STRING rule non greedy to stop at the first quote char it encounters, instead of trying to get as much as possible:
STRING
: '"' (~[\r\n"] | '""' | '\"' )*? '"'
;
I've found what I need to do to get this to work as I wanted, though to be honest I'm still not entirely sure why ANTLR was doing what it did.
Simply by adding another backslash character to the '\"' clause it works!
So my final STRING definition is : '"' (~[\r\n"] | '""' | '\\"' )* '"'
Going back to first principles, I hand drew a state transition diagram of the problem and then realised that the two escaping mechanism sequences are not the same and cannot be treated similarly. Then trying to implement the two patterns in AntlrWorks it became apparent that I needed to add the second backslash at which point it all started working.
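One possible explanation, offered as an assumption rather than a confirmed reading of ANTLR's behaviour: inside an ANTLR literal, '\"' is itself an escape that stands for a bare ", so that alternative lets any quote continue the string. The Python sketch below imitates both versions of the rule with regular expressions (an analogy only; the names loose and strict are made up here).

```python
import re

line = '"cat","fish"'

# Analog of the first rule, assuming '\"' degenerates to a bare quote.
loose = re.compile(r'"(?:""|"|[^\r\n"])*"')

# Analog of the fixed rule: a backslash followed by a quote.
strict = re.compile(r'"(?:""|\\"|[^\r\n"])*"')

print(loose.findall(line))   # the whole line comes back as a single match
print(strict.findall(line))  # "cat" and "fish" are matched separately
```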
Does a single backslash followed by some arbitrary character simply mean that character?
I use this to detect space in a string in Lua:
if string.byte(" ")==32 then blah blah
What is the return number (instead of 32) for enter key or new line in Lua?
These numbers denote the ASCII codes for each character. Here's a chart for future reference (but only to 127, as extended ASCII is not supported) so newline is 10.
You can also print a list with the following code:
for i=1,127 do
print(i .. " = " .. string.char(i))
end
However, control characters (such as newline) are not printable, so they are hard to read in that output.
You can check them with the \n and \r characters.
> =string.byte '\r'
13
> =string.byte '\n'
10
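For comparison, the same lookup sketched in Python, where ord plays the role of string.byte (this is an analogy, not Lua code; the dictionary name is only illustrative):

```python
# Character codes for space, carriage return and newline.
codes = {name: ord(ch) for name, ch in [("space", " "), ("CR", "\r"), ("LF", "\n")]}
print(codes)

# chr goes the other way, from the code back to the character.
print(chr(codes["LF"]) == "\n")
```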
I don't know the number, but you could find it by running print(string.byte("\n"))