Why does "\1" inside a triple-quoted string evaluate to a unicode 0x1 code point - groovy

I wanted a String containing a text \1.
What I did was (the real string was longer but it's not important):
'''
\1
'''
Which resulted in a String containing a unicode 0x1 codepoint.
I think what I should've done is just escape the backslash like this:
'''
\\1
'''
What I don't understand is why Groovy didn't report an error here. I thought unicode escapes are supposed to look like \u1?
Instead of a syntax error I got a runtime exception when I tried to put this String into an XML element:
An invalid XML character (Unicode: 0x1) was found in the element content of the document.

The \ (backward slash) symbol is an escape symbol. If you mean to use it literally, you must escape it itself: \\.
Escaping a character gives the sequence a special meaning. In the case of \1, the sequence is read as an octal escape, which denotes the 0x01 code point.
This is the same in Java Strings.
If you want to not have to escape characters in Groovy, use slashy strings:
def x = /\1/
assert x == "\\1"
which also works as multiline:
def x = /
\1
/
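For comparison, here is a minimal sketch of the same behaviour in Python rather than Groovy. Python string literals also treat \1 as an octal escape, and raw literals play roughly the role of slashy strings. The variable names are only illustrative.

```python
# "\1" in an ordinary literal is an octal escape: one character, code point 1.
escaped = "\1"
assert len(escaped) == 1
assert ord(escaped) == 0x01

# Doubling the backslash yields a literal backslash followed by "1".
doubled = "\\1"
assert doubled == chr(0x5C) + "1"

# A raw literal keeps the backslash without doubling it.
raw = r"\1"
assert raw == doubled
assert len(raw) == 2
```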

Related

Is it possible to use a "plain" long string?

In Julia, you can't store a string like that:
str = "\mwe"
Because there is a backslash. So the following allows you to prevent that:
str = "\\mwe"
The same occurs for "$, \n" and many other symbols. My question is: given an extremely long string of thousands of characters, where handling all the different cases is not very convenient even with search and replace (Ctrl+H), is there a way to assign it directly to a variable?
Maybe the following (which I tried) gives an idea of what I'd like:
str = """\$$$ \\\nn\nn\m this is a very long and complicated (\n^$" string"""
Here """ is not suitable, what should I use instead?
Quick answer: raw string literals like raw"\$$$ \\\nn..." will get you most of the way there.
Raw string literals allow you to put nearly anything you like between quotes and Julia will keep the characters as typed with no replacements, expansions, or interpolations. That means you can do this sort of thing easily:
a = raw"\mwe"
#assert codepoint(a[1]) == 0x5c # Unicode point for backslash
b = raw"$(a)"
#assert codepoint(b[1]) == 0x24 # Unicode point for dollar symbol
The problem is always the delimiters that define where the string begins and ends. You have to have some way of telling Julia what is included in the string literal and what is not, and Julia uses double inverted commas to do that, meaning if you want double inverted commas in your string literal, you still have to escape those:
c = raw"\"quote" # note the backslash
#assert codepoint(c[1]) == 0x22 # Unicode point for double quote marks
If this bothers you, you can combine triple quotes with raw, but then if you want to represent literal triple quotes in your string, you still have to escape those:
d = raw""""quote""" # the three quotes at the beginning and three at the end delimit the string, the fourth is read literally
#assert codepoint(d[1]) == 0x22 # Unicode point for double quote marks
e = raw"""\"\"\"""" # In triple quoted strings, you do not need to escape the backslash
#assert codeunits(e) == [0x22, 0x22, 0x22] # Three Unicode double quote marks
If this bothers you, you can try to write a macro that avoids these limitations, but you will always end up having to tell Julia where you want to start processing a string literal and where you want to end processing a string literal, so you will always have to choose some way to delimit the string literal from the rest of the code and escape that delimiter within the string.
Edit: You don't need to escape backslashes in raw string literals in order to include quotation marks in the string, you just need to escape the quotes. But if you want a literal backslash followed by a literal quotation mark, you have to escape both:
f = raw"\"quote"
#assert codepoint(f[1]) == 0x22 # double quote marks
g = raw"\\\"quote" # note the three backslashes
#assert codepoint(g[1]) == 0x5c # backslash
#assert codepoint(g[2]) == 0x22 # double quote marks
If you escape the backslash and not the quote marks, Julia will get confused:
h = raw"\\"quote"
# ERROR: syntax: cannot juxtapose string literal
This is explained in the caveat in the documentation.
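Python's raw literals face the same delimiter trade-off, with one difference from the Julia behaviour described above: in Python the backslash placed before a quote stays in the resulting string. The sketch below is only an illustration in Python, not a statement about Julia, and the variable names are arbitrary.

```python
# In a Python raw literal, a backslash before the delimiter keeps the quote
# from ending the string, but the backslash itself stays in the value.
c = r"\"quote"
assert c[0] == chr(0x5C) and c[1] == '"'
assert len(c) == 7

# Switching to a delimiter that does not occur in the text avoids escaping.
d = r'"quote'
assert d == '"quote'

# Triple quotes allow single and double quotes verbatim in the same literal.
e = r'''say "hi" and it's fine: \n stays two characters'''
assert chr(0x5C) + "n" in e
```

Whichever delimiter is chosen, that delimiter is the one thing that cannot appear unescaped inside the literal.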

Groovy How to replace the exact match word in a String

I want to replace the exact matching word in a given string in Groovy, but when I tried the code below I did not get an exact word match.
def str="My Name is Richards and Richardson"
log.info(str)
str=str.replace("Richards","Praveen")
log.info("After"+str)
Output after executing the above
My Name is Richards and Richardson
AfterMy Name is Praveen and Praveenon
Am Looking for the output like : AfterMy Name is Praveen and Richardson
I tried the boundaries \b
str=str.replace("\bRichards\b","Praveen")
which is what Java uses, and it's not working. It looks like \b is a backslash escape sequence in Groovy.
can someone help
Using word boundaries (\b) will not work with String::replace because the method does not take a regular expression pattern but a plain literal string.
You have two options to get the expected outcome:
Instead of using String::replace you can use String::replaceFirst. As the method name suggests it will replace only the first occurrence of the Richards substring leaving the Richardson as is.
str = str.replaceFirst("Richards", "Praveen")
Instead of using String::replace you can use String::replaceAll. Unlike String::replace, it supports regular expressions, so you can use word-boundary tokens:
str = str.replaceAll("\\bRichards\\b","Praveen")
Mind the double backslashes!
Also, according to the String::replaceAll documentation:
Note that backslashes (\) and dollar signs ($) in the replacement string may cause the results to be different than if it were being treated as a literal replacement string; see Matcher.replaceAll. Use Matcher.quoteReplacement(java.lang.String) to suppress the special meaning of these characters, if desired.
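The difference between plain substring replacement and a word-boundary regex can be sketched in Python as well. This is an analogous illustration, not the Groovy API: str.replace and re.sub stand in for String::replace and String::replaceAll.

```python
import re

text = "My Name is Richards and Richardson"

# Plain substring replacement also rewrites the start of "Richardson".
plain = text.replace("Richards", "Praveen")

# r"\b" reaches the regex engine as a word boundary; in a non-raw literal
# "\b" would already have become a backspace character (code point 8).
whole_word = re.sub(r"\bRichards\b", "Praveen", text)

print(plain)       # My Name is Praveen and Praveenon
print(whole_word)  # My Name is Praveen and Richardson
```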

python Using variable in re.search source.error("bad escape %s" % escape, len(escape)) [duplicate]

I want to use input from a user as a regex pattern for a search over some text. It works, but how I can handle cases where user puts characters that have meaning in regex?
For example, the user wants to search for Word (s): regex engine will take the (s) as a group. I want it to treat it like a string "(s)" . I can run replace on user input and replace the ( with \( and the ) with \) but the problem is I will need to do replace for every possible regex symbol.
Do you know some better way ?
Use the re.escape() function for this:
4.2.3 re Module Contents
escape(string)
Return string with all non-alphanumerics backslashed; this is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it.
A simplistic example: search for any occurrence of the provided string, optionally followed by 's', and return the match object.
import re

def simplistic_plural(word, text):
    word_or_plural = re.escape(word) + 's?'
    # re.search scans the whole text; re.match would only try the start.
    return re.search(word_or_plural, text)
You can use re.escape():
re.escape(string)
Return string with all non-alphanumerics backslashed; this is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it.
>>> import re
>>> re.escape('^a.*$')
'\\^a\\.\\*\\$'
If you are using a Python version < 3.7, this will escape non-alphanumerics that are not part of regular expression syntax as well.
If you are using a Python version < 3.7 but >= 3.3, this will escape non-alphanumerics that are not part of regular expression syntax, except for specifically underscore (_).
Unfortunately, re.escape() is not suited for the replacement string:
>>> re.sub('a', re.escape('_'), 'aa')
'\\_\\_'
A solution is to put the replacement in a lambda:
>>> re.sub('a', lambda _: '_', 'aa')
'__'
because the return value of the lambda is treated by re.sub() as a literal string.
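Both halves of this advice can be put together in a short sketch: re.escape() for the pattern side and a callable for the replacement side. The helper names find_literal and replace_literal are hypothetical and exist only for this illustration.

```python
import re

def find_literal(user_input: str, text: str):
    """Search text for user_input taken literally (hypothetical helper)."""
    return re.search(re.escape(user_input), text)

def replace_literal(pattern: str, replacement: str, text: str) -> str:
    """Replace regex matches with replacement taken literally (hypothetical helper)."""
    # A callable's return value is inserted as-is, so backslashes in
    # replacement are not processed as escapes or group references.
    return re.sub(pattern, lambda _match: replacement, text)

# "(s)" is matched as literal parentheses, not as a capture group.
match = find_literal("Word (s)", "Look up Word (s) in the index")

# Passed directly as a replacement string, r"\1" would be read as a
# reference to group 1; through the callable it is inserted verbatim.
swapped = replace_literal("a", r"\1", "aa")
```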
Usually you escape the string you feed into a regex so that the regex treats those characters literally. Remember that what you type in your editor is just characters: when you see \n there, it is two characters, a backslash followed by n, and it only becomes a newline when a parser decides it is. In a normal string literal Python's own parser makes that conversion, so print shows a line break. If you write r"\n" instead, Python keeps the raw text you typed (as far as I understand). To complicate things further, regexes have their own syntax, and the regex parser interprets the string it receives differently than Python's print would. I believe this is why we are advised to pass raw strings like r"(\n+)", so that the regex receives what you actually typed. However, the regex will still receive a parenthesis and won't match it as a literal parenthesis unless you tell it to, using the regex's own syntax rules. For that you need something like r"(\fun \( x : nat \) :)": the first pair of parentheses is not matched literally, since without backslashes it is a capture group, but the second pair is matched as literal parentheses.
Thus we usually call re.escape(regex) to escape the things we want interpreted literally, i.e. things the regex parser would otherwise not match literally (parens, spaces, etc.). E.g. code I have in my app:
# Escape non-alphanumeric characters so the string is matched literally. I think this also helps separate the escaped literal text from the regex we insert on the next line.
__ppt = re.escape(_ppt)  # e.g. parentheses are matched literally instead of acting as a group
e.g. see these strings:
_ppt
Out[4]: '(let H : forall x : bool, negb (negb x) = x := fun x : bool =>HEREinHERE)'
__ppt
Out[5]: '\\(let\\ H\\ :\\ forall\\ x\\ :\\ bool,\\ negb\\ \\(negb\\ x\\)\\ =\\ x\\ :=\\ fun\\ x\\ :\\ bool\\ =>HEREinHERE\\)'
print(rf'{_ppt=}')
_ppt='(let H : forall x : bool, negb (negb x) = x := fun x : bool =>HEREinHERE)'
print(rf'{__ppt=}')
__ppt='\\(let\\ H\\ :\\ forall\\ x\\ :\\ bool,\\ negb\\ \\(negb\\ x\\)\\ =\\ x\\ :=\\ fun\\ x\\ :\\ bool\\ =>HEREinHERE\\)'
the double backslashes I believe are there so that the regex receives a literal backslash.
btw, I am surprised it printed double backslashes instead of a single one. If anyone can comment on that it would be appreciated. I'm also curious how to match literal backslashes now in the regex. I assume it's 4 backslashes but I honestly expected only 2 would have been needed due to the raw string r construct.
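On the doubled backslashes: a short sketch suggests they come from repr(), which both the interactive prompt and the f-string {name=} form use by default, rather than from the string itself. The literal-backslash question is covered at the end. The sample values are illustrative.

```python
import re

escaped = re.escape("(s)")

# The value holds one backslash before each parenthesis: 5 characters.
assert len(escaped) == 5

# repr() doubles each backslash so that its output is a valid literal;
# print() of the string itself shows the actual characters.
print(repr(escaped))
print(escaped)

# To match one literal backslash the regex itself must contain two.
# In a raw literal that is r"\\"; without the r prefix it takes four.
path = "C:" + chr(0x5C) + "temp"
assert re.search(r"\\", path) is not None
assert r"\\" == "\\\\"
```

So two backslashes are enough in a raw string, and four are only needed when the r prefix is left out.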

regex with named capture fails in XRegExp but works fine on regex101.com [duplicate]

In the regex below, \s denotes a whitespace character. I imagine the regex parser is going through the string, sees \ and knows that the next character is special.
But this is not the case as double escapes are required.
Why is this?
var res = new RegExp('(\\s|^)' + foo).test(moo);
Is there a concrete example of how a single escape could be mis-interpreted as something else?
You are constructing the regular expression by passing a string to the RegExp constructor.
\ is an escape character in string literals.
The \ is consumed by the string literal parsing…
const foo = "foo";
const string = '(\s|^)' + foo;
console.log(string);
… so the data you pass to the RegEx compiler is a plain s and not \s.
You need to escape the \ to express the \ as data instead of being an escape character itself.
Inside the code where you're creating a string, the backslash is a javascript escape character first, which means the escape sequences like \t, \n, \", etc. will be translated into their javascript counterpart (tab, newline, quote, etc.), and that will be made a part of the string. Double-backslash represents a single backslash in the actual string itself, so if you want a backslash in the string, you escape that first.
So when you generate a string by saying var someString = '(\\s|^)', what you're really doing is creating an actual string with the value (\s|^).
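The two layers described above (string literal first, regex second) can be checked with a small sketch, written here in Python rather than JavaScript. The variable foo stands in for the one in the question, and wrapping it in re.escape() is an added assumption so that its content is matched literally.

```python
import re

# Both literals produce the same six characters: ( \ s | ^ )
doubled = "(\\s|^)"
raw = r"(\s|^)"
assert doubled == raw
assert len(raw) == 6

foo = "foo"  # hypothetical stand-in for the variable in the question
pattern = re.compile(raw + re.escape(foo))

# "foo" must follow whitespace or sit at the very start of the text.
print(bool(pattern.search("a foo")), bool(pattern.search("afoo")))  # True False
```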
The Regex needs a string representation of \s, which in JavaScript can be produced using the literal "\\s".
Here's a live example to illustrate why "\s" is not enough:
alert("One backslash: \s\nDouble backslashes: \\s");
Note how an extra \ before \s changes the output.
As has been said, inside a string literal, a backslash indicates an escape sequence, rather than a literal backslash character, but the RegExp constructor often needs literal backslash characters in the string passed to it, so the code should have \\s to represent a literal backslash, in most cases.
A problem is that double-escaping metacharacters is tedious. There is one way to pass a string to new RegExp without having to double escape them: use the String.raw template tag, an ES6 feature, which allows you to write a string that will be parsed by the interpreter verbatim, without any parsing of escape sequences. For example:
console.log('\\'.length); // length 1: an escaped backslash
console.log(`\\`.length); // length 1: an escaped backslash
console.log(String.raw`\\`.length); // length 2: no escaping in String.raw!
So, if you wish to keep your code readable, and you have many backslashes, you may use String.raw to type only one backslash, when the pattern requires a backslash:
const sentence = 'foo bar baz';
const regex = new RegExp(String.raw`\bfoo\sbar\sbaz\b`);
console.log(regex.test(sentence));
But there's a better option. Generally, there's not much good reason to use new RegExp unless you need to dynamically create a regular expression from existing variables. Otherwise, you should use regex literals instead, which do not require double-escaping of metacharacters, and do not require writing out String.raw to keep the pattern readable:
const sentence = 'foo bar baz';
const regex = /\bfoo\sbar\sbaz\b/;
console.log(regex.test(sentence));
Best to only use new RegExp when the pattern must be created on-the-fly, like in the following snippet:
const sentence = 'foo bar baz';
const wordToFind = 'foo'; // from user input
const regex = new RegExp(String.raw`\b${wordToFind}\b`);
console.log(regex.test(sentence));
\ is used in Strings to escape special characters. If you want a backslash in your string (e.g. for the \ in \s) you have to escape it via a backslash. So \ becomes \\ .
EDIT: Even had to do it here, because \\ in my answer turned to \.

Syntax error in removing bad character in groovy

Hello, I have a string like a= " $ 2 187.00" . I tried removing all the white space and the bad characters with a.replaceAll("\\s","").replace("$","") , but I am getting this error:
Impossible to parse JSON response: SyntaxError: JSON.parse: bad escaped character
How can I remove the bad characters in this expression so that the value becomes 2187.00? Kindly help me. Thanks in advance.
def a = ' $ 2 187.00'
a.replaceAll(/\s/,"").replaceAll(/\$/,"")
// or simply
a.replaceAll(/[\s\$]/,"")
It should return 2187.00.
Note that $ has a special meaning in double-quoted string literals (""), which Groovy calls GStrings.
In Groovy you can use a regex literal (a slashy string), which is better than writing the regex as a string with multiple escape sequences.
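The same clean-up can be sketched in Python, where a raw string plays the role of the slashy regex literal. This is an analogous illustration rather than Groovy code.

```python
import re

a = " $ 2 187.00"

# Inside a character class "$" is a literal dollar sign, so one pass
# removes both whitespace and the currency symbol.
cleaned = re.sub(r"[\s$]", "", a)
print(cleaned)  # 2187.00
```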
