scala string.split does not work - string

Following is my REPL output. I am not sure why string.split does not work here.
scala> val s = "Pedro|groceries|apple|1.42"
s: java.lang.String = Pedro|groceries|apple|1.42
scala> s.split("|")
res27: Array[java.lang.String] = Array("", P, e, d, r, o, |, g, r, o, c, e, r, i, e, s, |, a, p, p, l, e, |, 1, ., 4, 2)

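If you use double quotes, you're passing a String, which asks for a regular expression split. | is the regex "or" operator, so your pattern means "the empty string or the empty string", which matches between every pair of characters. So everything is split.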
If you use split('|') or split("""\|""") you should get what you want.

| is a special regular expression character which is used as a logical operator for OR operations.
Since java.lang.String#split(String regex) takes a regular expression, you're splitting the string on "nothing OR nothing". That is a special case of regular expression splitting: an empty pattern matches between every pair of characters.
To get what you want, you need to escape your regex pattern properly. To escape the pattern, you need to prepend the character with \ and since \ is a special String character (think \t and \r for example), you need to actually double escape so that you'll end up with s.split("\\|").
For full Java regular expression syntax, see java.util.regex.Pattern javadoc.

Split takes a regex as first argument, so your call is interpreted as "empty string or empty string". To get the expected behavior you need to escape the pipe character "\\|".

Related

python Using variable in re.search source.error("bad escape %s" % escape, len(escape)) [duplicate]

I want to use input from a user as a regex pattern for a search over some text. It works, but how I can handle cases where user puts characters that have meaning in regex?
For example, the user wants to search for Word (s): regex engine will take the (s) as a group. I want it to treat it like a string "(s)" . I can run replace on user input and replace the ( with \( and the ) with \) but the problem is I will need to do replace for every possible regex symbol.
Do you know a better way?
Use the re.escape() function for this:
4.2.3 re Module Contents
escape(string)
Return string with all non-alphanumerics backslashed; this is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it.
A simplistic example: search for any occurrence of the provided string, optionally followed by 's', and return the match object.
import re
def simplistic_plural(word, text):
    word_or_plural = re.escape(word) + 's?'
    # re.search scans the whole text; re.match would only test the start of it
    return re.search(word_or_plural, text)
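As a rough illustration of the asker's "Word (s)" case, the sketch below compares the unescaped and escaped patterns. The helper name find_literal and the sample text are made up for this example.

```python
import re

def find_literal(user_input, text):
    """Search text for user_input taken literally, not as a regex (hypothetical helper)."""
    return re.search(re.escape(user_input), text)

# Made-up sample text for illustration.
text = "Please enter the Word (s) you are looking for."

# Unescaped, "(s)" is a capture group, so the pattern looks for "Word s".
unescaped = re.search("Word (s)", text)

# Escaped, the parentheses are matched as ordinary characters.
escaped = find_literal("Word (s)", text)

print(unescaped)
print(escaped.group(0))
```

The unescaped search finds nothing here because the text never contains "Word s", while the escaped one finds the literal "Word (s)".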
You can use re.escape():
re.escape(string)
Return string with all non-alphanumerics backslashed; this is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it.
>>> import re
>>> re.escape('^a.*$')
'\\^a\\.\\*\\$'
If you are using a Python version < 3.7, this will escape non-alphanumerics that are not part of regular expression syntax as well.
If you are using a Python version < 3.7 but >= 3.3, this will escape non-alphanumerics that are not part of regular expression syntax, except for specifically underscore (_).
Unfortunately, re.escape() is not suited for the replacement string:
>>> re.sub('a', re.escape('_'), 'aa')
'\\_\\_'
A solution is to put the replacement in a lambda:
>>> re.sub('a', lambda _: '_', 'aa')
'__'
because the return value of the lambda is treated by re.sub() as a literal string.
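The same distinction can be sketched on a recent Python (3.7 or later is assumed here), where re.escape() no longer touches the underscore but still escapes the dot. The sample strings are made up for illustration. Doubling the backslashes in the replacement is another option when the replacement text is not under your control.

```python
import re

literal = "."  # made-up replacement text

# re.escape() adds a backslash, and the replacement template keeps it.
misused = re.sub("a", re.escape(literal), "aa")

# A function's return value is inserted as-is, with no template processing.
via_function = re.sub("a", lambda _match: literal, "aa")

# Alternative: double each backslash so the template yields it literally.
backslashed = "C:\\new"  # made-up text containing one backslash
via_doubling = re.sub("a", backslashed.replace("\\", "\\\\"), "a")

print(misused)
print(via_function)
print(via_doubling)
```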
Usually you escape the string that you feed into a regex so that the regex treats those characters literally. Remember that you type strings into your computer, and something later decides what the characters mean. When you see \n in your editor, it is not a newline until a parser decides it is; it is two characters, a backslash followed by n. In an ordinary Python string literal, "\n" is turned into a single newline character. If you write r"\n" instead, Python keeps exactly the two characters you typed.
To complicate things further, regexes have a syntax of their own. The regex parser interprets the string it receives by its own rules, independently of Python's string-literal rules. I believe this is why we are recommended to pass raw strings like r"(\n+)", so that the regex receives what you actually typed. However, the regex will still treat a parenthesis as the start of a group, not as a literal parenthesis, unless you say so explicitly using the regex's own syntax. For that you need something like r"(fun \( x : nat \) :)": the outer parentheses form a capture group because they have no backslashes, while the inner, backslashed ones match literal parentheses.
Thus we usually call re.escape() on the text we want to be interpreted literally, i.e. characters that the regex parser would otherwise treat as syntax (parentheses, spaces, etc.) get escaped. For example, here is code I have in my app:
# Escape the non-alphanumeric characters so the string matches as an arbitrary literal. I think this also helps separate the escaped literal text from the regex syntax we insert on the next line.
__ppt = re.escape(_ppt)  # e.g. a parenthesis ( is no longer interpreted as a group but matched literally
e.g. see these strings:
_ppt
Out[4]: '(let H : forall x : bool, negb (negb x) = x := fun x : bool =>HEREinHERE)'
__ppt
Out[5]: '\\(let\\ H\\ :\\ forall\\ x\\ :\\ bool,\\ negb\\ \\(negb\\ x\\)\\ =\\ x\\ :=\\ fun\\ x\\ :\\ bool\\ =>HEREinHERE\\)'
print(rf'{_ppt=}')
_ppt='(let H : forall x : bool, negb (negb x) = x := fun x : bool =>HEREinHERE)'
print(rf'{__ppt=}')
__ppt='\\(let\\ H\\ :\\ forall\\ x\\ :\\ bool,\\ negb\\ \\(negb\\ x\\)\\ =\\ x\\ :=\\ fun\\ x\\ :\\ bool\\ =>HEREinHERE\\)'
the double backslashes I believe are there so that the regex receives a literal backslash.
btw, I am surprised it printed double backslashes instead of a single one. If anyone can comment on that it would be appreciated. I'm also curious how to match literal backslashes now in the regex. I assume it's 4 backslashes but I honestly expected only 2 would have been needed due to the raw string r construct.
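On the doubled backslashes: a small sketch, using a made-up input "(x)", suggests they are only how repr() displays the string (the REPL and the f-string = specifier both use repr() by default). The string itself holds a single backslash before each escaped character, and matching one literal backslash takes two backslashes in a raw-string pattern, or four in an ordinary string literal.

```python
import re

escaped = re.escape("(x)")  # made-up input for illustration

# The string itself holds one backslash before each parenthesis: \ ( x \ )
length = len(escaped)

# repr() doubles each backslash for display only.
shown = repr(escaped)

# One literal backslash in the text needs \\ in a raw-string pattern.
hit = re.search(r"\\", "a\\b")

print(escaped)  # print shows the actual characters
print(shown)    # repr shows the escaped form
print(length, hit.group(0))
```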

Wrong matching regex

So I'm using re module to compile my regex, and my regex looks like this:
"(^~\w+?[ & ~\w+?]*?$)"
So I compile it using pattern = re.compile(regex) and then I use re.findall(pattern, string) to find if the given string is matching and to give me the group if it is.
String that I'm matching is "v1 V ~v2_ V ~~v3".
I'd expect to not have a match but it says that it matches the regular expression. I suspect that \w+ matches white spaces so that it matches the whole string but I could not find in the documentation that is correct. What am I missing?
Here is a minimal reproducible example:
import re
test_string = "v1 V ~v2_ V ~~v3"
regex = r"(^~*\w+?[ & ~*\w+?]*?$)"  # raw string, so \w reaches the regex engine unchanged
pattern = re.compile(regex)
for elem in pattern.findall(test_string):
    print(elem)
If you expect it not to match, I think your problem is the [ & ~*\w+?]* part.
Square brackets define a character class, which matches one occurrence of any single character listed inside: in this case a space, &, ~, *, +, ?, or a word character. Inside the brackets, *, + and ? are literal characters, not quantifiers. The asterisk (*) after the closing bracket then allows zero or more occurrences of such characters, so the class can consume the rest of your string.
If what you wanted was to match the sub-regex & ~*\w+? zero or more times, use parentheses.
So I would say that you wanted this regex: (^~*\w+?( & ~*\w+?)*?$) (just change the brackets to parentheses).
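A quick way to check the difference is to run both patterns against the string from the question. This is only a sketch: the outer capture group is dropped, the inner group is made non-capturing, and the string "~v1 & v2 & ~~v3" is a made-up example of input the pattern is presumably meant to accept.

```python
import re

test_string = "v1 V ~v2_ V ~~v3"

# Character class: any run of spaces, &, ~, *, +, ? and word characters.
bracket_pattern = re.compile(r"^~*\w+?[ & ~*\w+?]*?$")

# Group: the literal sequence " & ", optional ~ signs, then a word, repeated.
group_pattern = re.compile(r"^~*\w+?(?: & ~*\w+?)*?$")

bracket_hit = bracket_pattern.search(test_string) is not None
group_hit = group_pattern.search(test_string) is not None
valid_hit = group_pattern.search("~v1 & v2 & ~~v3") is not None

print(bracket_hit, group_hit, valid_hit)
```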

What does "/" signify in split(self, /, sep=None, maxsplit=-1)?

str.split = split(self, /, sep=None, maxsplit=-1)
Return a list of the words in the string, using sep as the delimiter string.
sep
The delimiter according which to split the string.
None (the default value) means split according to any whitespace,
and discard empty strings from the result.
maxsplit
Maximum number of splits to do.
-1 (the default value) means no limit.
The /, which would seem to be a second argument, is a new notation to me. What is it doing there?
From What's New in Python 3.8:
Positional-only parameters
There is a new function parameter syntax / to indicate that some function parameters must be specified positionally and cannot be used as keyword arguments.
In the following example, parameters a and b are positional-only, while c or d can be positional or keyword, and e or f are required to be keywords:
def f(a, b, /, c, d, *, e, f):
    print(a, b, c, d, e, f)
One use case for this notation is that it allows pure Python functions to fully emulate behaviors of existing C coded functions.
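A minimal sketch of what this means in practice, assuming Python 3.8 or later. The function name demo is made up; its signature mirrors the one quoted above. For str.split, only self sits before the /, so sep and maxsplit can still be passed by keyword.

```python
def demo(a, b, /, c, d, *, e, f):
    """Hypothetical function: a and b are positional-only, e and f keyword-only."""
    return (a, b, c, d, e, f)

# a and b by position, c either way, d/e/f by keyword.
accepted = demo(10, 20, 30, d=40, e=50, f=60)

# Passing a positional-only parameter by keyword raises TypeError.
try:
    demo(a=10, b=20, c=30, d=40, e=50, f=60)
    keyword_rejected = False
except TypeError:
    keyword_rejected = True

# In str.split, sep and maxsplit come after the /, so keywords are allowed.
parts = "a,b,c".split(sep=",", maxsplit=1)

print(accepted, keyword_rejected, parts)
```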

What does the backward slash (\) do in a format mask?

Can't seem to find the answer in Google. Is it in a similar category of symbols like $ and ! or something else entirely?
The formula I found on Google is:
=text(A2,"\0.0,,\M")
which converts 1500000 to 1.5M.
In your given example (a format mask), the backslash acts as an escape character. Basically, it is equivalent to wrapping the next character in double quotes. This is done to get literal 0 and M characters, since a number of characters have special meanings unless you escape them.
Date-formatting and time-formatting characters (a, c, d, h, m, n, p, q, s, t, w, y, /, and :), the numeric-formatting characters (#, 0, %, E, e, comma, and period), and the string-formatting characters (#, &, <, >, and !) all must be escaped to be used literally.
Because it was not entirely clear in which context the \ confused you, I have added some additional information.
In cell formatting:
The backslash \ is used to escape special characters, like a colon.
For instance, if you wanted
100 : 1
since the colon is a special character, you would have to use \ (an escape character) to get it as a literal, like this:
100 \: 1
which outputs 100 : 1 as desired.
Another example is the \n newline escape found in languages such as C (VBA string literals have no backslash escapes and use the vbNewLine constant instead). There, the \ removes the literal meaning of n and produces a new line in your output.
In many programming languages, \ followed by a character is an escape sequence, used either to suppress or to access a special meaning of that character.
In VBA:
It is also worth noting that in VBA, the backslash character can be used to force an evaluation of the integer equivalent of a quotient, for instance:
100\33 = Int(100/33)
This is an often overlooked way to divide as well as round down to an integer in a single step.
In workbooks:
One of the options to create a range name in your workbook is to precede the name of your desired range name with a backslash. For instance, a valid range name would be \HLF1

How to split a string by a string in Scala

In Ruby, I did:
"string1::string2".split("::")
In Scala, I can't find how to split using a string, not a single character.
The REPL is even easier than Stack Overflow. I just pasted your example as is.
Welcome to Scala version 2.8.1.final (Java HotSpot Server VM, Java 1.6.0_22).
Type in expressions to have them evaluated.
Type :help for more information.
scala> "string1::string2".split("::")
res0: Array[java.lang.String] = Array(string1, string2)
In your example it does not make a difference, but the String#split method in Scala actually takes a String that represents a regular expression. So be sure to escape certain characters as needed, e.g. "a..b.c".split("""\.\."""). To make that fact more obvious, you can call the split method on a Regex instead: """\.\.""".r.split("a..b.c").
That line of Ruby should work just like it is in Scala too and return an Array[String].
If you look at the Java implementation, you will see that the parameter to String#split is in fact compiled to a regular expression.
There is no problem with "string1::string2".split("::") because ":" is just a character in a regular expression, but for instance "string1|string2".split("|") will not yield the expected result. "|" is the special symbol for alternation in a regular expression.
scala> "string1|string2".split("|")
res0: Array[String] = Array(s, t, r, i, n, g, 1, |, s, t, r, i, n, g, 2)
