Surrounding Scala Strings - string

If you do something in a single statement like "abc" + stringval + "abc", is that one immutable string copy, or two (noting that abc and 123 are constant at compile time)
Bonus round: would using a StringBuilder like the following have more or less overhead?
def surround(s:String, ss:String):String = {
val surrounded = new StringBuilder(s.length() + 2*ss.length(), s)
surrounded.insert(0,ss)
surrounded.append(ss)
surrounded.mkString
}
Or is there a more idiomatic way that I'm unaware of?

It has less overhead than concatenation. But the insert in your example is not efficient. The following is a little cleaner and uses only appends for efficiency.
def surround(s:String, ss:String) =
new StringBuilder(s.length() + 2*ss.length(), ss).append(s).append(ss).mkString

My first impulse is to look at the bytecode and see. So,
// test.scala
object Comparison {
def surround1(s: String, ss: String) = {
val surrounded = new StringBuilder(s.length() + 2*ss.length(), s)
surrounded.insert(0, ss)
surrounded.append(ss)
surrounded.mkString
}
def surround2(s: String, ss: String) = ss + s + ss
def surround3(s: String, ss: String) = // Neil Essy
new StringBuilder(s.length() + 2*ss.length(), ss).append(s).append(ss).mkString
}
and then:
$ scalac -optimize test.scala
$ javap -verbose Comparison$
[... lots of output ...]
Roughly, Neil Essy's and yours are identical but for the one method call (and some stack noise). surround2 is compiled into something like
val sb = new StringBuilder()
sb.append(ss)
sb.append(s)
sb.append(ss)
sb.toString
I'm new to Scala (and Java), so I don't know how generally useful it is to look at the bytecode -- but it tells you what this scalac does with this code.

Doing a little testing on this in Java and now in Scala the value of using a StringBuilder is questionable unless you are doing a lot of appending of non-constant strings.
object AppendTimeTest {
val tries = 500000
def surround(s:String, ss:String) = {
(1 to tries).foreach(_ => {
new StringBuilder(s.length() + 2*ss.length(), ss).append(s).append(ss).mkString
})
val start = System.currentTimeMillis()
(1 to tries).foreach(_ => {
new StringBuilder(s.length() + 2*ss.length(), ss).append(s).append(ss).mkString
})
val stop = System.currentTimeMillis()
val delta:Double = stop -start
println("Total time: " + delta + ".\n Avg. time: " + (delta/tries))
}
def surroundStatic(s:String) = {
(1 to tries).foreach(_ => {
"ABC" + s + "ABC"
})
val start = System.currentTimeMillis()
(1 to tries).foreach(_ => {
"ABC" + s + "ABC"
})
val stop = System.currentTimeMillis()
val delta:Double = stop -start
println("Total time: " + delta + ".\n Avg. time: " + (delta/tries))
}
}
Calling this a few times in the interpreter yields:
scala> AppendTimeTest.surroundStatic("foo")
Total time: 241.0.
Avg. time: 4.82E-4
scala> AppendTimeTest.surround("foo", "ABC")
Total time: 222.0.
Avg. time: 4.44E-4
scala> AppendTimeTest.surroundStatic("foo")
Total time: 231.0.
Avg. time: 4.62E-4
scala> AppendTimeTest.surround("foo", "ABC")
Total time: 247.0.
Avg. time: 4.94E-4
So unless you are appending many different non-constant strings I believe you won't see any big difference in performance. Alsol concatenating constants (i.e. "ABC" + "foo" + "ABC") is as you probably know handled by the compiler (this is atleast the case in Java, but I believe it's also holds true for Scala)

Scala is pretty close to java for string manipulation. Your example:
val stringval = "bar"
"abc" + stringval + "abc"
actually ends up (in java style) as:
(new StringBuilder()).append("abc").append(stringval()).append("abc").toString()
This is the same behaviour as java, + between strings are generally translated to instances of StringBuilder, which are more efficient. So here, you're doing three copies to the StringBuilder, and one final String creation, and depending upon the length of the strings, potentially three reallocations.
In your example:
def surround(s:String, ss:String):String = {
val surrounded = new StringBuilder(s.length() + 2*ss.length(), s)
surrounded.insert(0,ss)
surrounded.append(ss)
surrounded.mkString
}
you're doing the same number of copies (3) but you're only allocating once.
Advice: If the strings are small, use +, it makes very little difference. If the strings are relatively large, then initialise the StringBuilder with the relevant length, but then simply append. This is much clearer for the other developers.
As always with performance, measure it, and if the differences are small, use the simpler solution

Generally in Java / Scala, String literals in source code are interned for efficiency, which means that all copies of it in your code will refer to the same object. So there will only be one "copy" of "abc".
Java and Scala don't have "constants" like in C++. Variables initialised with val are immutable, but in the general case vals from one instance to the next are not the same (specified via constructor).
So in theory the compiler could check for simple cases where the val will always initialise the same value and optimise accordingly, but it would add extra complicaton. And you could just optimise it yourself by writing it as "abc123abc".
Others have already addressed your bonus question.

You can also use mkString on a String.
def surround(s:String, ss:String) = s.mkString(ss, "", ss)
Internally it is using StringBuilder too, but I like the way it reads.
I would maybe just even inline it.

Related

Split string every n characters

What would be an idiomatic way to split a string into strings of 2 characters each?
Examples:
"" -> [""]
"ab" -> ["ab"]
"abcd" -> ["ab", "cd"]
We can assume that the string has a length which is a multiple of 2.
I could use a regex like in this Java answer but I was hoping to find a better way (i.e. using one of kotlin's additional methods).
Once Kotlin 1.2 is released, you can use the chunked function that is added to kotlin-stdlib by the KEEP-11 proposal. Example:
val chunked = myString.chunked(2)
You can already try this with Kotlin 1.2 M2 pre-release.
Until then, you can implement the same with this code:
fun String.chunked(size: Int): List<String> {
val nChunks = length / size
return (0 until nChunks).map { substring(it * size, (it + 1) * size) }
}
println("abcdef".chunked(2)) // [ab, cd, ef]
This implementation drops the remainder that is less than size elements. You can modify it do add the remainder to the result as well.
A functional version of chunked using generateSequence:
fun String.split(n: Int) = Pair(this.drop(n), this.take(n))
fun String.chunked(n: Int): Sequence<String> =
generateSequence(this.split(n), {
when {
it.first.isEmpty() -> null
else -> it.first.split(n)
}
})
.map(Pair<*, String>::second)
Output:
"".chunked(2) => []
"ab".chunked(2) => [ab]
"abcd".chunked(2) => [ab, cd]
"abc".chunked(2) => [ab, c]

Grails convert String to Map with comma in string values

I want convert string to Map in grails. I already have a function of string to map conversion. Heres the code,
static def StringToMap(String reportValues){
Map result=[:]
result=reportValues.replace('[','').replace(']','').replace(' ','').split(',').inject([:]){map,token ->
List tokenizeStr=token.split(':');
tokenizeStr.size()>1?tokenizeStr?.with {map[it[0]?.toString()?.trim()]=it[1]?.toString()?.trim()}:tokenizeStr?.with {map[it[0]?.toString()?.trim()]=''}
map
}
return result
}
But, I have String with comma in the values, so the above function doesn't work for me. Heres my String
[program_type:, subsidiary_code:, groupName:, termination_date:, effective_date:, subsidiary_name:ABC, INC]
my function returns ABC only. not ABC, INC. I googled about it but couldnt find any concrete help.
Generally speaking, if I have to convert a Stringified Map to a Map object I try to make use of Eval.me. Your example String though isn't quite right to do so, if you had the following it would "just work":
// Note I have added '' around the values.
​String a = "[program_type:'', subsidiary_code:'', groupName:'', termination_date:'', effective_date:'', subsidiary_name:'ABC']"
Map b = Eval.me(a)​
// returns b = [program_type:, subsidiary_code:, groupName:, termination_date:, effective_date:, subsidiary_name:ABC]
If you have control of the String then if you can create it following this kind of pattern, it would be the easiest solution I suspect.
In case it is not possible to change the input parameter, this might be a not so clean and not so short option. It relies on the colon instead of comma values.
​String reportValues = "[program_type:, subsidiary_code:, groupName:, termination_date:, effective_date:, subsidiary_name:ABC, INC]"
reportValues = reportValues[1..-2]
def m = reportValues.split(":")
def map = [:]
def length = m.size()
m.eachWithIndex { v, i ->
if(i != 0) {
List l = m[i].split(",")
if (i == length-1) {
map.put(m[i-1].split(",")[-1], l.join(","))
} else {
map.put(m[i-1].split(",")[-1], l[0..-2].join(","))
}
}
}
map.each {key, value -> println "key: " + key + " value: " + value}
BTW: Only use eval on trusted input, AFAIK it executes everything.
You could try messing around with this bit of code:
String tempString = "[program_type:11, 'aa':'bb', subsidiary_code:, groupName:, termination_date:, effective_date:, subsidiary_name:ABC, INC]"
List StringasList = tempString.tokenize('[],')
def finalMap=[:]
StringasList?.each { e->
def f = e?.split(':')
finalMap."${f[0]}"= f.size()>1 ? f[1] : null
}
println """-- tempString: ${tempString.getClass()} StringasList: ${StringasList.getClass()}
finalMap: ${finalMap.getClass()} \n Results\n finalMap ${finalMap}
"""
Above produces:
-- tempString: class java.lang.String StringasList: class java.util.ArrayList
finalMap: class java.util.LinkedHashMap
Results
finalMap [program_type:11, 'aa':'bb', subsidiary_code:null, groupName:null, termination_date:null, effective_date:null, subsidiary_name:ABC, INC:null]
It tokenizes the String then converts ArrayList by iterating through the list and passing each one again split against : into a map. It also has to check to ensure the size is greater than 1 otherwise it will break on f[1]

Time complexity of a string compression algorithm

I tried to do an exercice from cracking the code interview book about compressing a string :
Implement a method to perform basic string compression using the counts of
repeated characters. For example, the string aabcccccaaa would become
a2blc5a3. If the "compressed" string would not become smaller than the original
string, your method should return the original string.
The first proposition given by the author is the following :
public static String compressBad(String str) {
int size = countCompression(str);
if (size >= str.length()) {
return str;
}
String mystr = "";
char last = str.charAt(0);
int count = 1;
for (int i = 1; i < str.length(); i++) {
if (str.charAt(i) == last) {
count++;
} else {
mystr += last + "" + count;
last = str.charAt(i);
count = 1;
}
}
return mystr + last + count;
}
About the time complexity of thie algorithm, she said that :
The runtime is 0(p + k^2), where p is the size of the original string and k is the number of character sequences. For example, if the string is aabccdeeaa, then there are six character sequences. It's slow because string concatenation operates in 0(n^2) time
I have two main questions :
1) Why the time complexity is O(p+k^2), it should be only O(p) (where p is the size of the string) no? Because we do only p iterations not p + k^2.
2) Why the time complexity of a string concatenation is 0(n^2)???
I thought that if we have a string3 = string1 + string2 then we have a complexity of size(string1) + size(string2), because we have to create a copy of the two strings before adding them to a new string (create a string is O(1)). So, we will have an addition not a multiplication, no? It isn't the same thing for an array (if we used a char array for example) ?
Can you clarify these points please? I didn't understand how we calculate the complexity...
String concatenation is O(n) but in this case it's being concatenated K times.
Every time you find a sequence you have to copy the entire string + what you found. for example if you had four total sequences in the original string
the cost to get the final string will be:
(k-3)+ //first sequence
(k-3)+(k-2) + //copying previous string and adding new sequence
((k-3)+(+k-2)+k(-1) + //copying previous string and adding new sequence
((k-3)+(k-2)+(k-1)+k) //copying previous string and adding new sequence
Thus complexity is O(p + K^2)

Performance difference in toString.map and toString.toArray.map

While coding Euler problems, I ran across what I think is bizarre:
The method toString.map is slower than toString.toArray.map.
Here's an example:
def main(args: Array[String])
{
def toDigit(num : Int) = num.toString.map(_ - 48) //2137 ms
def toDigitFast(num : Int) = num.toString.toArray.map(_ - 48) //592 ms
val startTime = System.currentTimeMillis;
(1 to 1200000).map(toDigit)
println(System.currentTimeMillis - startTime)
}
Shouldn't the method map on String fallback to a map over the array? Why is there such a noticeable difference? (Note that increasing the number even causes an stack overflow on the non-array case).
Original
Could be because toString.map uses the WrappedString implicit, while toString.toArray.map uses the WrappedArray implicit to resolve map.
Let's see map, as defined in TraversableLike:
def map[B, That](f: A => B)(implicit bf: CanBuildFrom[Repr, B, That]): That = {
val b = bf(repr)
b.sizeHint(this)
for (x <- this) b += f(x)
b.result
}
WrappedString uses a StringBuilder as builder:
def +=(x: Char): this.type = { append(x); this }
def append(x: Any): StringBuilder = {
underlying append String.valueOf(x)
this
}
The String.valueOf call for Any uses Java Object.toString on the Char instances, possibly getting boxed first. These extra ops might be the cause of speed difference, versus the supposedly shorter code paths of the Array builder.
This is a guess though, would have to measure.
Edit
After revising, the general point still stands, but the I referred the wrong implicits, since the toDigit methods return an Int sequence (or like), not a translated string as I misread.
toDigit uses LowPriorityImplicits.fallbackStringCanBuildFrom[T]: CanBuildFrom[String, T, immutable.IndexedSeq[T]], with T = Int, which just defers to a general IndexedSeq builder.
toDigitFast uses a direct Array implicit of type CanBuildFrom[Array[_], T, Array[T]], which is unarguably faster.
Passing the following CBF for toDigit explicitly makes the two methods on par:
object FastStringToArrayBuild {
def canBuildFrom[T : ClassManifest] = new CanBuildFrom[String, T, Array[T]] {
private def newBuilder = scala.collection.mutable.ArrayBuilder.make()
def apply(from: String) = newBuilder
def apply() = newBuilder
}
}
You're being fooled by running out of memory. The toDigit version does create more intermediate objects, but if you have plenty of memory then the GC won't be heavily impacted (and it'll all run faster). For example, if instead of creating 1.2 million numbers, I create 12k 100x in a row, I get approximately equal times for the two methods. If I create 1.2k 5-digit numbers 1000x in a row, I find that toDigit is about 5% faster.
Given that the toDigit method produces an immutable collection, which is better when all else is equal since it is easier to reason about, and given that all else is equal for all but highly demanding tasks, I think the library is as it should be.
When trying to improve performance, of course one needs to keep all sorts of tricks in mind; one of these is that arrays have better memory characteristics for collections of known length than do the fancy collections in the Scala library. Also, one needs to know that map isn't the fastest way to get things done; if you really wanted this to be fast you should
final def toDigitReallyFast(num: Int, accum: Long = 0L, iter: Int = 0): Array[Byte] = {
if (num==0) {
val ans = new Array[Byte](math.max(1,iter))
var i = 0
var ac = accum
while (i < ans.length) {
ans(ans.length-i-1) = (ac & 0xF).toByte
ac >>= 4
i += 1
}
ans
}
else {
val next = num/10
toDigitReallyFast(next, (accum << 4) | (num-10*next), iter+1)
}
}
which on my machine is at 4x faster than either of the others. And you can get almost 3x faster yet again if you leave everything in a Long and pack the results in an array instead of using 1 to N:
final def toDigitExtremelyFast(num: Int, accum: Long = 0L, iter: Int = 0): Long = {
if (num==0) accum | (iter.toLong << 48)
else {
val next = num/10
toDigitExtremelyFast(next, accum | ((num-10*next).toLong<<(4*iter)), iter+1)
}
}
// loop, instead of 1 to N map, for the 1.2k number case
{
var i = 10000
val a = new Array[Long](1201)
while (i<=11200) {
a(i-10000) = toDigitReallyReallyFast(i)
i += 1
}
a
}
As with many things, performance tuning is highly dependent on exactly what you want to do. In contrast, library design has to balance many different concerns. I do think it's worth noticing where the library is sub-optimal with respect to performance, but this isn't really one of those cases IMO; the flexibility is worth it for the common use cases.

Reading Multiple Inputs from the Same Line the Scala Way

I tried to use readInt() to read two integers from the same line but that is not how it works.
val x = readInt()
val y = readInt()
With an input of 1 727 I get the following exception at runtime:
Exception in thread "main" java.lang.NumberFormatException: For input string: "1 727"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:492)
at java.lang.Integer.parseInt(Integer.java:527)
at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:231)
at scala.collection.immutable.StringOps.toInt(StringOps.scala:31)
at scala.Console$.readInt(Console.scala:356)
at scala.Predef$.readInt(Predef.scala:201)
at Main$$anonfun$main$1.apply$mcVI$sp(Main.scala:11)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:75)
at Main$.main(Main.scala:10)
at Main.main(Main.scala)
I got the program to work by using readf but it seems pretty awkward and ugly to me:
val (x,y) = readf2("{0,number} {1,number}")
val a = x.asInstanceOf[Int]
val b = y.asInstanceOf[Int]
println(function(a,b))
Someone suggested that I just use Java's Scanner class, (Scanner.nextInt()) but is there a nice idiomatic way to do it in Scala?
Edit:
My solution following paradigmatic's example:
val Array(a,b) = readLine().split(" ").map(_.toInt)
Followup Question: If there were a mix of types in the String how would you extract it? (Say a word, an int and a percentage as a Double)
If you mean how would you convert val s = "Hello 69 13.5%" into a (String, Int, Double) then the most obvious way is
val tokens = s.split(" ")
(tokens(0).toString,
tokens(1).toInt,
tokens(2).init.toDouble / 100)
// (java.lang.String, Int, Double) = (Hello,69,0.135)
Or as mentioned you could match using a regex:
val R = """(.*) (\d+) (\d*\.?\d*)%""".r
s match {
case R(str, int, dbl) => (str, int.toInt, dbl.toDouble / 100)
}
If you don't actually know what data is going to be in the String, then there probably isn't much reason to convert it from a String to the type it represents, since how can you use something that might be a String and might be in Int? Still, you could do something like this:
val int = """(\d+)""".r
val pct = """(\d*\.?\d*)%""".r
val res = s.split(" ").map {
case int(x) => x.toInt
case pct(x) => x.toDouble / 100
case str => str
} // Array[Any] = Array(Hello, 69, 0.135)
now to do anything useful you'll need to match on your values by type:
res.map {
case x: Int => println("It's an Int!")
case x: Double => println("It's a Double!")
case x: String => println("It's a String!")
case _ => println("It's a Fail!")
}
Or if you wanted to take things a bit further, you could define some extractors which will do the conversion for you:
abstract class StringExtractor[A] {
def conversion(s: String): A
def unapply(s: String): Option[A] = try { Some(conversion(s)) }
catch { case _ => None }
}
val intEx = new StringExtractor[Int] {
def conversion(s: String) = s.toInt
}
val pctEx = new StringExtractor[Double] {
val pct = """(\d*\.?\d*)%""".r
def conversion(s: String) = s match { case pct(x) => x.toDouble / 100 }
}
and use:
"Hello 69 13.5%".split(" ").map {
case intEx(x) => println(x + " is Int: " + x.isInstanceOf[Int])
case pctEx(x) => println(x + " is Double: " + x.isInstanceOf[Double])
case str => println(str)
}
prints
Hello
69 is Int: true
0.135 is Double: true
Of course, you can make the extrators match on anything you want (currency mnemonic, name begging with 'J', URL) and return whatever type you want. You're not limited to matching Strings either, if instead of StringExtractor[A] you make it Extractor[A, B].
You can read the line as a whole, split it using spaces and then convert each element (or the one you want) to ints:
scala> "1 727".split(" ").map( _.toInt )
res1: Array[Int] = Array(1, 727)
For most complex inputs, you can have a look at parser combinators.
The input you are describing is not two Ints but a String which just happens to be two Ints. Hence you need to read the String, split by the space and convert the individual Strings into Ints as suggested by #paradigmatic.
One way would be splitting and mapping:
// Assuming whatever is being read is assigned to "input"
val input = "1 727"
val Array(x, y) = input split " " map (_.toInt)
Or, if you have things a bit more complicated than that, a regular expression is usually good enough.
val twoInts = """^\s*(\d+)\s*(\d+)""".r
val Some((x, y)) = for (twoInts(a, b) <- twoInts findFirstIn input) yield (a, b)
There are other ways to use regex. See the Scala API docs about them.
Anyway, if regex patterns are becoming too complicated, then you should appeal to Scala Parser Combinators. Since you can combine both, you don't loose any of regex's power.
import scala.util.parsing.combinator._
object MyParser extends JavaTokenParsers {
def twoInts = wholeNumber ~ wholeNumber ^^ { case a ~ b => (a.toInt, b.toInt) }
}
val MyParser.Success((x, y), _) = MyParser.parse(MyParser.twoInts, input)
The first example was more simple, but harder to adapt to more complex patterns, and more vulnerable to invalid input.
I find that extractors provide some machinery that makes this type of processing nicer. And I think it works up to a certain point nicely.
object Tokens {
def unapplySeq(line: String): Option[Seq[String]] =
Some(line.split("\\s+").toSeq)
}
class RegexToken[T](pattern: String, convert: (String) => T) {
val pat = pattern.r
def unapply(token: String): Option[T] = token match {
case pat(s) => Some(convert(s))
case _ => None
}
}
object IntInput extends RegexToken[Int]("^([0-9]+)$", _.toInt)
object Word extends RegexToken[String]("^([A-Za-z]+)$", identity)
object Percent extends RegexToken[Double](
"""^([0-9]+\.?[0-9]*)%$""", _.toDouble / 100)
Now how to use:
List("1 727", "uptime 365 99.999%") collect {
case Tokens(IntInput(x), IntInput(y)) => "sum " + (x + y)
case Tokens(Word(w), IntInput(i), Percent(p)) => w + " " + (i * p)
}
// List[java.lang.String] = List(sum 728, uptime 364.99634999999995)
To use for reading lines at the console:
Iterator.continually(readLine("prompt> ")).collect{
case Tokens(IntInput(x), IntInput(y)) => "sum " + (x + y)
case Tokens(Word(w), IntInput(i), Percent(p)) => w + " " + (i * p)
case Tokens(Word("done")) => "done"
}.takeWhile(_ != "done").foreach(println)
// type any input and enter, type "done" and enter to finish
The nice thing about extractors and pattern matching is that you can add case clauses as necessary, you can use Tokens(a, b, _*) to ignore some tokens. I think they combine together nicely (for instance with literals as I did with done).

Resources