In Ruby, I did:
"string1::string2".split("::")
In Scala, I can't find how to split using a string, not a single character.
The REPL is even easier than Stack Overflow. I just pasted your example as is.
Welcome to Scala version 2.8.1.final (Java HotSpot Server VM, Java 1.6.0_22).
Type in expressions to have them evaluated.
Type :help for more information.
scala> "string1::string2".split("::")
res0: Array[java.lang.String] = Array(string1, string2)
In your example it does not make a difference, but the String#split method in Scala actually takes a String that represents a regular expression. So be sure to escape certain characters as needed, like e.g. in "a..b.c".split("""\.\.""") or to make that fact more obvious you can call the split method on a RegEx: """\.\.""".r.split("a..b.c").
That line of Ruby should work just like it is in Scala too and return an Array[String].
If you look at the Java implementation you see that the parameter to String#split will be in fact compiled to a regular expression.
There is no problem with "string1::string2".split("::") because ":" is just a character in a regular expression, but for instance "string1|string2".split("|") will not yield the expected result. "|" is the special symbol for alternation in a regular expression.
scala> "string1|string2".split("|")
res0: Array[String] = Array(s, t, r, i, n, g, 1, |, s, t, r, i, n, g, 2)
Related
I want to use input from a user as a regex pattern for a search over some text. It works, but how I can handle cases where user puts characters that have meaning in regex?
For example, the user wants to search for Word (s): regex engine will take the (s) as a group. I want it to treat it like a string "(s)" . I can run replace on user input and replace the ( with \( and the ) with \) but the problem is I will need to do replace for every possible regex symbol.
Do you know some better way ?
Use the re.escape() function for this:
4.2.3 re Module Contents
escape(string)
Return string with all non-alphanumerics backslashed; this is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it.
A simplistic example, search any occurence of the provided string optionally followed by 's', and return the match object.
def simplistic_plural(word, text):
word_or_plural = re.escape(word) + 's?'
return re.match(word_or_plural, text)
You can use re.escape():
re.escape(string)
Return string with all non-alphanumerics backslashed; this is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it.
>>> import re
>>> re.escape('^a.*$')
'\\^a\\.\\*\\$'
If you are using a Python version < 3.7, this will escape non-alphanumerics that are not part of regular expression syntax as well.
If you are using a Python version < 3.7 but >= 3.3, this will escape non-alphanumerics that are not part of regular expression syntax, except for specifically underscore (_).
Unfortunately, re.escape() is not suited for the replacement string:
>>> re.sub('a', re.escape('_'), 'aa')
'\\_\\_'
A solution is to put the replacement in a lambda:
>>> re.sub('a', lambda _: '_', 'aa')
'__'
because the return value of the lambda is treated by re.sub() as a literal string.
Usually escaping the string that you feed into a regex is such that the regex considers those characters literally. Remember usually you type strings into your compuer and the computer insert the specific characters. When you see in your editor \n it's not really a new line until the parser decides it is. It's two characters. Once you pass it through python's print will display it and thus parse it as a new a line but in the text you see in the editor it's likely just the char for backslash followed by n. If you do \r"\n" then python will always interpret it as the raw thing you typed in (as far as I understand). To complicate things further there is another syntax/grammar going on with regexes. The regex parser will interpret the strings it's receives differently than python's print would. I believe this is why we are recommended to pass raw strings like r"(\n+) -- so that the regex receives what you actually typed. However, the regex will receive a parenthesis and won't match it as a literal parenthesis unless you tell it to explicitly using the regex's own syntax rules. For that you need r"(\fun \( x : nat \) :)" here the first parens won't be matched since it's a capture group due to lack of backslashes but the second one will be matched as literal parens.
Thus we usually do re.escape(regex) to escape things we want to be interpreted literally i.e. things that would be usually ignored by the regex paraser e.g. parens, spaces etc. will be escaped. e.g. code I have in my app:
# escapes non-alphanumeric to help match arbitrary literal string, I think the reason this is here is to help differentiate the things escaped from the regex we are inserting in the next line and the literal things we wanted escaped.
__ppt = re.escape(_ppt) # used for e.g. parenthesis ( are not interpreted as was to group this but literally
e.g. see these strings:
_ppt
Out[4]: '(let H : forall x : bool, negb (negb x) = x := fun x : bool =>HEREinHERE)'
__ppt
Out[5]: '\\(let\\ H\\ :\\ forall\\ x\\ :\\ bool,\\ negb\\ \\(negb\\ x\\)\\ =\\ x\\ :=\\ fun\\ x\\ :\\ bool\\ =>HEREinHERE\\)'
print(rf'{_ppt=}')
_ppt='(let H : forall x : bool, negb (negb x) = x := fun x : bool =>HEREinHERE)'
print(rf'{__ppt=}')
__ppt='\\(let\\ H\\ :\\ forall\\ x\\ :\\ bool,\\ negb\\ \\(negb\\ x\\)\\ =\\ x\\ :=\\ fun\\ x\\ :\\ bool\\ =>HEREinHERE\\)'
the double backslashes I believe are there so that the regex receives a literal backslash.
btw, I am surprised it printed double backslashes instead of a single one. If anyone can comment on that it would be appreciated. I'm also curious how to match literal backslashes now in the regex. I assume it's 4 backslashes but I honestly expected only 2 would have been needed due to the raw string r construct.
I have a string:
Hi there, "Bananas are, by nature, evil.", Hey there.
I want to split the string with commas as the delimiter. How do I get the .split method to ignore the comma inside the quotes, so that it returns 3 strings and not 5.
You can use regex in split method
According to this answer the following regex only matches , outside of the " mark
,(?=(?:[^\"]\"[^\"]\")[^\"]$)
so try this code:
str.split(",(?=(?:[^\\\"]*\\\"[^\\\"]*\\\")*[^\\\"]*\$)".toRegex())
You can use split overload that accepts regular expressions for that:
val text = """Hi there, "Bananas are, by nature, evil.", Hey there."""
val matchCommaNotInQuotes = Regex("""\,(?=([^"]*"[^"]*")*[^"]*$)""")
println(text.split(matchCommaNotInQuotes))
Would print:
[Hi there, "Bananas are, by nature, evil.", Hey there.]
Consider reading this answer on how the regular expression works in this case.
You have to use a regular expression capable of handling quoted values. See Java: splitting a comma-separated string but ignoring commas in quotes and C#, regular expressions : how to parse comma-separated values, where some values might be quoted strings themselves containing commas
The following code shows a very simple version of such a regular expression.
fun main(args: Array<String>) {
"Hi there, \"Bananas are, by nature, evil.\", Hey there."
.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)".toRegex())
.forEach { println("> $it") }
}
outputs
> Hi there
> "Bananas are, by nature, evil."
> Hey there.
Be aware of the regex backtracking problem: https://www.regular-expressions.info/catastrophic.html. You might be better off writing a parser.
If you don't want regular expressions:
val s = "Hi there, \"Bananas are, by nature, evil.\", Hey there."
val hold = s.substringAfter("\"").substringBefore("\"")
val temp = s.split("\"")
val splitted: MutableList<String> = (temp[0] + "\"" + temp[2]).split(",").toMutableList()
splitted[1] = "\"" + hold + "\""
splitted is the List you want
I'm a new learner of spark. There's one line of code estimating pi but I don't quite understand how it works.
scala>val pi_approx = f"pi = ${355f/113}%.5f"
pi_approx: String = pi = 3.14159
I don't understand the 'f' '$' and '%' in the expression above. Could anyone explain the usage of them? Thanks!
This is the example of String Interpolation that allows users to embed variable references directly in processed string literals. For e.g.
scala> val name = "Scala"
name: String = Scala
scala> println(s"Hello, $name")
Hello, Scala
In above example the literal s"Hello, $name" is a processed string literal.
Scala provides three string interpolation methods out of the box: s, f and raw.
Prepending f to any string literal allows the creation of simple formatted strings, similar to printf in other languages.
The formats allowed after the % character tells that result is formatted as a decimal number while ${} allows any arbitrary expression to be embedded. For e.g.
scala> println(s"1 + 1 = ${1 + 1}")
1 + 1 = 2
More detailed information can be found on:
Scala String Interpolation
Java Formatter
Scala has triple quoted strings """String\nString""" to use special characters in the string without escaping. Scala 2.10 also added raw"String\nString" for the same purpose.
Is there any difference in how raw"" and """""" work? Can they produce different output for the same string?
Looking at the source for the default interpolators (found here: https://github.com/scala/scala/blob/2.11.x/src/library/scala/StringContext.scala) it looks like the "raw" interpolator calls the identity function on each letter, so what you put in is what you get out. The biggest difference that you will find is that if you are providing a string literal in your source that includes the quote character, the raw interpolator still won't work. i.e. you can't say
raw"this whole "thing" should be one string object"
but you can say
"""this whole "thing" should be one string object"""
So you might be wondering "Why would I ever bother using the raw interpolator then?" and the answer is that the raw interpolator still performs variable substitution. So
val helloVar = "hello"
val helloWorldString = raw"""$helloVar, "World"!\n"""
Will give you the string "hello, "World"!\n" with the \n not being converted to a newline, and the quotes around the word world.
It is surprising that using the s-interpolator turns escapes back on, even when using triple quotes:
scala> "hi\nthere."
res5: String =
hi
there.
scala> """hi\nthere."""
res6: String = hi\nthere.
scala> s"""hi\nthere."""
res7: String =
hi
there.
The s-interpolator doesn't know that it's processing string parts that were originally triple-quoted. Hence:
scala> raw"""hi\nthere."""
res8: String = hi\nthere.
This matters when you're using backslashes in other ways, such as regexes:
scala> val n = """\d"""
n: String = \d
scala> s"$n".r
res9: scala.util.matching.Regex = \d
scala> s"\d".r
scala.StringContext$InvalidEscapeException: invalid escape character at index 0 in "\d"
at scala.StringContext$.loop$1(StringContext.scala:231)
at scala.StringContext$.replace$1(StringContext.scala:241)
at scala.StringContext$.treatEscapes0(StringContext.scala:245)
at scala.StringContext$.treatEscapes(StringContext.scala:190)
at scala.StringContext$$anonfun$s$1.apply(StringContext.scala:94)
at scala.StringContext$$anonfun$s$1.apply(StringContext.scala:94)
at scala.StringContext.standardInterpolator(StringContext.scala:124)
at scala.StringContext.s(StringContext.scala:94)
... 33 elided
scala> s"""\d""".r
scala.StringContext$InvalidEscapeException: invalid escape character at index 0 in "\d"
at scala.StringContext$.loop$1(StringContext.scala:231)
at scala.StringContext$.replace$1(StringContext.scala:241)
at scala.StringContext$.treatEscapes0(StringContext.scala:245)
at scala.StringContext$.treatEscapes(StringContext.scala:190)
at scala.StringContext$$anonfun$s$1.apply(StringContext.scala:94)
at scala.StringContext$$anonfun$s$1.apply(StringContext.scala:94)
at scala.StringContext.standardInterpolator(StringContext.scala:124)
at scala.StringContext.s(StringContext.scala:94)
... 33 elided
scala> raw"""\d$n""".r
res12: scala.util.matching.Regex = \d\d
Following is my REPL output. I am not sure why string.split does not work here.
val s = "Pedro|groceries|apple|1.42"
s: java.lang.String = Pedro|groceries|apple|1.42
scala> s.split("|")
res27: Array[java.lang.String] = Array("", P, e, d, r, o, |, g, r, o, c, e, r, i, e, s, |, a, p, p, l, e, |, 1, ., 4, 2)
If you use quotes, you're asking for a regular expression split. | is the "or" character, so your regex matches nothing or nothing. So everything is split.
If you use split('|') or split("""\|""") you should get what you want.
| is a special regular expression character which is used as a logical operator for OR operations.
Since java.lang.String#split(String regex); takes in a regular expression, you're splitting the string with "none OR none", which is a whole another speciality about regular expression splitting, where none essentially means "between every single character".
To get what you want, you need to escape your regex pattern properly. To escape the pattern, you need to prepend the character with \ and since \ is a special String character (think \t and \r for example), you need to actually double escape so that you'll end up with s.split("\\|").
For full Java regular expression syntax, see java.util.regex.Pattern javadoc.
Split takes a regex as first argument, so your call is interpreted as "empty string or empty string". To get the expected behavior you need to escape the pipe character "\\|".