Performance difference in toString.map and toString.toArray.map - string

While coding Euler problems, I ran across what I think is bizarre:
The method toString.map is slower than toString.toArray.map.
Here's an example:
def main(args: Array[String])
{
def toDigit(num : Int) = num.toString.map(_ - 48) //2137 ms
def toDigitFast(num : Int) = num.toString.toArray.map(_ - 48) //592 ms
val startTime = System.currentTimeMillis;
(1 to 1200000).map(toDigit)
println(System.currentTimeMillis - startTime)
}
Shouldn't the method map on String fallback to a map over the array? Why is there such a noticeable difference? (Note that increasing the number even causes an stack overflow on the non-array case).

Original
Could be because toString.map uses the WrappedString implicit, while toString.toArray.map uses the WrappedArray implicit to resolve map.
Let's see map, as defined in TraversableLike:
def map[B, That](f: A => B)(implicit bf: CanBuildFrom[Repr, B, That]): That = {
val b = bf(repr)
b.sizeHint(this)
for (x <- this) b += f(x)
b.result
}
WrappedString uses a StringBuilder as builder:
def +=(x: Char): this.type = { append(x); this }
def append(x: Any): StringBuilder = {
underlying append String.valueOf(x)
this
}
The String.valueOf call for Any uses Java Object.toString on the Char instances, possibly getting boxed first. These extra ops might be the cause of speed difference, versus the supposedly shorter code paths of the Array builder.
This is a guess though, would have to measure.
Edit
After revising, the general point still stands, but the I referred the wrong implicits, since the toDigit methods return an Int sequence (or like), not a translated string as I misread.
toDigit uses LowPriorityImplicits.fallbackStringCanBuildFrom[T]: CanBuildFrom[String, T, immutable.IndexedSeq[T]], with T = Int, which just defers to a general IndexedSeq builder.
toDigitFast uses a direct Array implicit of type CanBuildFrom[Array[_], T, Array[T]], which is unarguably faster.
Passing the following CBF for toDigit explicitly makes the two methods on par:
object FastStringToArrayBuild {
def canBuildFrom[T : ClassManifest] = new CanBuildFrom[String, T, Array[T]] {
private def newBuilder = scala.collection.mutable.ArrayBuilder.make()
def apply(from: String) = newBuilder
def apply() = newBuilder
}
}

You're being fooled by running out of memory. The toDigit version does create more intermediate objects, but if you have plenty of memory then the GC won't be heavily impacted (and it'll all run faster). For example, if instead of creating 1.2 million numbers, I create 12k 100x in a row, I get approximately equal times for the two methods. If I create 1.2k 5-digit numbers 1000x in a row, I find that toDigit is about 5% faster.
Given that the toDigit method produces an immutable collection, which is better when all else is equal since it is easier to reason about, and given that all else is equal for all but highly demanding tasks, I think the library is as it should be.
When trying to improve performance, of course one needs to keep all sorts of tricks in mind; one of these is that arrays have better memory characteristics for collections of known length than do the fancy collections in the Scala library. Also, one needs to know that map isn't the fastest way to get things done; if you really wanted this to be fast you should
final def toDigitReallyFast(num: Int, accum: Long = 0L, iter: Int = 0): Array[Byte] = {
if (num==0) {
val ans = new Array[Byte](math.max(1,iter))
var i = 0
var ac = accum
while (i < ans.length) {
ans(ans.length-i-1) = (ac & 0xF).toByte
ac >>= 4
i += 1
}
ans
}
else {
val next = num/10
toDigitReallyFast(next, (accum << 4) | (num-10*next), iter+1)
}
}
which on my machine is at 4x faster than either of the others. And you can get almost 3x faster yet again if you leave everything in a Long and pack the results in an array instead of using 1 to N:
final def toDigitExtremelyFast(num: Int, accum: Long = 0L, iter: Int = 0): Long = {
if (num==0) accum | (iter.toLong << 48)
else {
val next = num/10
toDigitExtremelyFast(next, accum | ((num-10*next).toLong<<(4*iter)), iter+1)
}
}
// loop, instead of 1 to N map, for the 1.2k number case
{
var i = 10000
val a = new Array[Long](1201)
while (i<=11200) {
a(i-10000) = toDigitReallyReallyFast(i)
i += 1
}
a
}
As with many things, performance tuning is highly dependent on exactly what you want to do. In contrast, library design has to balance many different concerns. I do think it's worth noticing where the library is sub-optimal with respect to performance, but this isn't really one of those cases IMO; the flexibility is worth it for the common use cases.

Related

Parallel data fetches that are batched

Is this pattern of batching a subset of a collection for parallel processing ok? Is there a better way to do this that I am missing?
When given a collection of entity ids that need to be fetched from a service which returns a scala Future instead of making all the requests at once we batch them because the service can only handle a certain number of requests at a time. In a way it is a primitive throttling mechanism to avoid overwhelming the data store. It looks like a code smell.
object FutureHelper{
def batchSerially[A, B, M[a] <: TraversableOnce[a]](l: M[A])(dbFetch: A => Future[B])(
implicit ctx: ExecutionContext, buildFrom: CanBuildFrom[M[A], B, M[B]]): Future[M[B]] =
l.foldLeft(Future.successful(buildFrom(l))){
case (accF, curr) => for {
acc <- accF
b <- dbFetch(curr)
} yield acc += b
}.map(s => s.result())
}
object FutureBatching extends App {
implicit val e: ExecutionContext = scala.concurrent.ExecutionContext.Implicits.global
val entityIds = List(1,2,3,4,5,6)
val batchSize = 2
val listOfFetchedResults =
FutureHelper.batchSerially(entityIds.grouped(batchSize)) {groupedByBatchSize =>
Future.sequence{
groupedByBatchSize.map( i => Future.successful(i))
}
}.map(_.flatten.toList)
}
I believe by default scala.Future will start executing as soon as the Future is created, so the invocations of dbFetch() will kick-off the connections right away. Since the foldLeft transforms all the suspended A => Future[B] to the actual Future objects, I don't believe the batching will happen the way you want.
Yes, I believe that code works correctly (see comments).
Another way is to let the pool define the level of parallelism, but that doesn't always work, depending on your execution environment.
I've had some success doing batching using the parallel collections. For instance, if you create a collection where the number of elements represent the number of concurrent activities, you can use .par. For instance,
// partition xs into numBatches Set elements, and invoke processBatch on each Set in parallel
def batch[A,B](xs: Iterable[A], numBatches: Int)
(processBatch: Set[A] => Set[B]): ParSeq[B] = split(xs,numBatches).par.flatMap(processBatch)
// Split the input iterable into numBatches sub-sets.
// For example split(Seq(1,2,3,4,5,6), 3) = Seq(Set(1, 4), Set(2, 5), Set(3, 6))
def split[A](xs: Iterable[A], numBatches: Int): Seq[Set[A]] = {
val buffers: Vector[VectorBuilder[A]] = Vector.fill(numBatches)(new VectorBuilder[A]())
val elems = xs.toIndexedSeq
for (i <- 0 until elems.length) {
buffers(i % numBatches) += elems(i)
}
buffers.map(_.result.toSet)
}

Implicit class holding mutable variable in multithreaded environment

I need to implement a parallel method, which takes two computation blocks, a and b, and starts each of them in a new thread. The method must return a tuple with the result values of both the computations. It should have the following signature:
def parallel[A, B](a: => A, b: => B): (A, B)
I managed to solve the exercise by using straight Java-like approach. Then I decided to make up a solution with implicit class. Here's it:
object ParallelApp extends App {
implicit class ParallelOps[A](a: => A) {
var result: A = _
def spawn(): Unit = {
val thread = new Thread {
override def run(): Unit = {
result = a
}
}
thread.start()
thread.join()
}
}
def parallel[A, B](a: => A, b: => B): (A, B) = {
a.spawn()
b.spawn()
(a.result, b.result)
}
println(parallel(1 + 2, "a" + "b"))
}
For unknown reason, I receive output (null,null). Could you please point me out where is the problem?
Spoiler alert: It's not complicated. It's funny, like a magic trick (if you consider reading the documentation about Java Memory Model "funny", that is). If you haven't figured it out yet, I would highly recommend to try to figure it out, otherwise it won't be funny. Someone should make a "division-by-zero proves 2 = 4"-riddle out of it.
Consider the following shorter example:
implicit class Foo[A](a: A) {
var result: String = "not initialized"
def computeResult(): Unit = result = "Yay, result!"
}
val a = "a string"
a.computeResult()
println(a.result)
When run, it prints
not initialized
despite the fact that we invoked computeResult() and set result to "Yay, result!". The problem is that the two invocations a.computeResult() and a.result belong to two completely independent instances of Foo. The implicit conversion is performed twice, and the second implicitly created object doesn't know anything about the changes in the first implicitly created object. It has nothing to do with threads or JMM at all.
By the way: your code is not parallel. Calling join right after calling start doesn't bring you anything, your main thread will simply go idle and wait until another thread finishes. At no point will there be two threads that do any useful work concurrently.
EDIT: Fixed a bug pointed out by Andrey Tyukin
One way to solve your problem is to use Scala Futures
Documentation. Tutorial.
Useful Klang Blog.
You'll typically need some combination of these libraries:
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.{Await, Future}
import scala.util.{Failure, Success}
import scala.concurrent.duration._
an asynchronous example:
def parallelAsync[A,B](a: => A, b: => B): Future[(A,B)] = {
// as per Andrey Tyukin's comments, this line runs
// the two futures sequentially and we do not get
// any benefit from it. I will leave this line here
// so others will not fall in my trap
//for {i <- Future(a); j <- Future(b) } yield (i,j)
Future(a) zip Future(b)
}
parallelAsync(1 + 2, "a" + "b").onComplete {
case Success(x) => println(x)
case Failure(e) => e.printStackTrace()
}
If you must block until both are complete, you can use this:
def parallelSync[A,B](a: => A, b: => B): (A,B) = {
// see comment above
//val f = for { i <- Future(a); j <- Future(b) } yield (i,j)
val tuple = Future(a) zip Future(b)
Await.result(tuple, 5 second)
}
println(parallelSync(3 + 4, "c" + "d"))
When running these little examples, don't forget to sleep a little bit at the end so the program won't end before the results come back
Thread.sleep(3000)

Reduce a string using grammar-like rules

I'm trying to find a suitable DP algorithm for simplifying a string. For example I have a string a b a b and a list of rules
a b -> b
a b -> c
b a -> a
c c -> b
The purpose is to get all single chars that can be received from the given string using these rules. For this example it will be b, c. The length of the given string can be up to 200 symbols. Could you please prompt an effective algorithm?
Rules always are 2 -> 1. I've got an idea of creating a tree, root is given string and each child is a string after one transform, but I'm not sure if it's the best way.
If you read those rules from right to left, they look exactly like the rules of a context free grammar, and have basically the same meaning. You could apply a bottom-up parsing algorithm like the Earley algorithm to your data, along with a suitable starting rule; something like
start <- start a
| start b
| start c
and then just examine the parse forest for the shortest chain of starts. The worst case remains O(n^3) of course, but Earley is fairly effective, these days.
You can also produce parse forests when parsing with derivatives. You might be able to efficiently check them for short chains of starts.
For a DP problem, you always need to understand how you can construct the answer for a big problem in terms of smaller sub-problems. Assume you have your function simplify which is called with an input of length n. There are n-1 ways to split the input in a first and a last part. For each of these splits, you should recursively call your simplify function on both the first part and the last part. The final answer for the input of length n is the set of all possible combinations of answers for the first and for the last part, which are allowed by the rules.
In Python, this can be implemented like so:
rules = {'ab': set('bc'), 'ba': set('a'), 'cc': set('b')}
all_chars = set(c for cc in rules.values() for c in cc)
# memoize
def simplify(s):
if len(s) == 1: # base case to end recursion
return set(s)
possible_chars = set()
# iterate over all the possible splits of s
for i in range(1, len(s)):
head = s[:i]
tail = s[i:]
# check all possible combinations of answers of sub-problems
for c1 in simplify(head):
for c2 in simplify(tail):
possible_chars.update(rules.get(c1+c2, set()))
# speed hack
if possible_chars == all_chars: # won't get any bigger
return all_chars
return possible_chars
Quick check:
In [53]: simplify('abab')
Out[53]: {'b', 'c'}
To make this fast enough for large strings (to avoiding exponential behavior), you should use a memoize decorator. This is a critical step in solving DP problems, otherwise you are just doing a brute-force calculation. A further tiny speedup can be obtained by returning from the function as soon as possible_chars == set('abc'), since at that point, you are already sure that you can generate all possible outcomes.
Analysis of running time: for an input of length n, there are 2 substrings of length n-1, 3 substrings of length n-2, ... n substrings of length 1, for a total of O(n^2) subproblems. Due to the memoization, the function is called at most once for every subproblem. Maximum running time for a single sub-problem is O(n) due to the for i in range(len(s)), so the overall running time is at most O(n^3).
Let N - length of given string and R - number of rules.
Expanding a tree in a top down manner yields computational complexity O(NR^N) in the worst case (input string of type aaa... and rules aa -> a).
Proof:
Root of the tree has (N-1)R children, which have (N-1)R^2 children, ..., which have (N-1)R^N children (leafs). So, the total complexity is O((N-1)R + (N-1)R^2 + ... (N-1)R^N) = O(N(1 + R^2 + ... + R^N)) = (using binomial theorem) = O(N(R+1)^N) = O(NR^N).
Recursive Java implementation of this naive approach:
public static void main(String[] args) {
Map<String, Character[]> rules = new HashMap<String, Character[]>() {{
put("ab", new Character[]{'b', 'c'});
put("ba", new Character[]{'a'});
put("cc", new Character[]{'b'});
}};
System.out.println(simplify("abab", rules));
}
public static Set<String> simplify(String in, Map<String, Character[]> rules) {
Set<String> result = new HashSet<String>();
simplify(in, rules, result);
return result;
}
private static void simplify(String in, Map<String, Character[]> rules, Set<String> result) {
if (in.length() == 1) {
result.add(in);
}
for (int i = 0; i < in.length() - 1; i++) {
String two = in.substring(i, i + 2);
Character[] rep = rules.get(two);
if (rep != null) {
for (Character c : rep) {
simplify(in.substring(0, i) + c + in.substring(i + 2, in.length()), rules, result);
}
}
}
}
Bas Swinckels's O(RN^3) Java implementation (with HashMap as a memoization cache):
public static Set<String> simplify2(final String in, Map<String, Character[]> rules) {
Map<String, Set<String>> cache = new HashMap<String, Set<String>>();
return simplify2(in, rules, cache);
}
private static Set<String> simplify2(final String in, Map<String, Character[]> rules, Map<String, Set<String>> cache) {
final Set<String> cached = cache.get(in);
if (cached != null) {
return cached;
}
Set<String> ret = new HashSet<String>();
if (in.length() == 1) {
ret.add(in);
return ret;
}
for (int i = 1; i < in.length(); i++) {
String head = in.substring(0, i);
String tail = in.substring(i, in.length());
for (String c1 : simplify2(head, rules)) {
for (String c2 : simplify2(tail, rules, cache)) {
Character[] rep = rules.get(c1 + c2);
if (rep != null) {
for (Character c : rep) {
ret.add(c.toString());
}
}
}
}
}
cache.put(in, ret);
return ret;
}
Output in both approaches:
[b, c]

Threadsafe mutable collection with fast elements removal and random get

I need a thread safe data structure with three operations: remove, getRandom, reset.
I have only two ideas by now.
First: Seq in syncronized var.
val all: Array[String] = ... //all possible.
var current: Array[String] = Array.empty[String]
def getRandom(): = {
val currentAvailable = current
currentAvailable(Random.nextInt(currentAvailable.length))
}
def remove(s: String) = {
this.syncronized {
current = current diff Seq(s)
}
}
def reset(s: String) = {
this.syncronized {
current = all
}
}
Second:
Maintain some Map[String,Boolean], there bool is true when element currently is present. The main problem is to make a fast getRandom method (not something like O(n) in worst case).
Is there a better way(s) to implement this?
Scala's Trie is a lock free data structure that supports snapshots (aka your currentAvailable) and fast removals
Since I'm not a Scala expert so this answer is general as an example I used Java coding.
in short the answer is YES.
if you use a map such as :
Map<Integer,String> map=new HashMap<Integer,String>(); //is used to get random in constant time
Map<String,Integer> map1=new HashMap<String,Integer>(); //is used to remove in constant time
to store date,
the main idea is to keep the key( in this case the integer) synchronized to be {1 ... size of map}
for example to fill this structure, you need something like this:
int counter=0; //this is a global variable
for(/* all your string (s) in all */ ){
map.put(counter++, s);
}
//then , if you want the removal to be in constant time you need to fill the second map
for(Entry e : map.EntrySet(){
map1.put(e.getValue(),e.getKey());
}
The above code is the initialization. everytime you want to set things you need to do that
then you can achieve a random value with O(1) complexity
String getRandom(){
int i; /*random number between 0 to counter*/
return map.get(i);
}
Now to remove things you use map1 to achive it in constant time O(1);
void remove(String s){
if(!map1.containsKey(s))
return; //s doesn't exists
String val=map.get(counter); //value of the last
map.remove(counter) //removing the last element
int thisCounter= map1.get(s); //pointer to this
map1.remove(s); // remove from map1
map.remove(counter); //remove from map
map1.put(thisCounter,val); //the val of the last element with the current pointer
counter--; //reducing the counter by one
}
obviously the main issue here is to keep the synchronization ensured. but by carefully analyzing the code you should be able to do that.

Surrounding Scala Strings

If you do something in a single statement like "abc" + stringval + "abc", is that one immutable string copy, or two (noting that abc and 123 are constant at compile time)
Bonus round: would using a StringBuilder like the following have more or less overhead?
def surround(s:String, ss:String):String = {
val surrounded = new StringBuilder(s.length() + 2*ss.length(), s)
surrounded.insert(0,ss)
surrounded.append(ss)
surrounded.mkString
}
Or is there a more idiomatic way that I'm unaware of?
It has less overhead than concatenation. But the insert in your example is not efficient. The following is a little cleaner and uses only appends for efficiency.
def surround(s:String, ss:String) =
new StringBuilder(s.length() + 2*ss.length(), ss).append(s).append(ss).mkString
My first impulse is to look at the bytecode and see. So,
// test.scala
object Comparison {
def surround1(s: String, ss: String) = {
val surrounded = new StringBuilder(s.length() + 2*ss.length(), s)
surrounded.insert(0, ss)
surrounded.append(ss)
surrounded.mkString
}
def surround2(s: String, ss: String) = ss + s + ss
def surround3(s: String, ss: String) = // Neil Essy
new StringBuilder(s.length() + 2*ss.length(), ss).append(s).append(ss).mkString
}
and then:
$ scalac -optimize test.scala
$ javap -verbose Comparison$
[... lots of output ...]
Roughly, Neil Essy's and yours are identical but for the one method call (and some stack noise). surround2 is compiled into something like
val sb = new StringBuilder()
sb.append(ss)
sb.append(s)
sb.append(ss)
sb.toString
I'm new to Scala (and Java), so I don't know how generally useful it is to look at the bytecode -- but it tells you what this scalac does with this code.
Doing a little testing on this in Java and now in Scala the value of using a StringBuilder is questionable unless you are doing a lot of appending of non-constant strings.
object AppendTimeTest {
val tries = 500000
def surround(s:String, ss:String) = {
(1 to tries).foreach(_ => {
new StringBuilder(s.length() + 2*ss.length(), ss).append(s).append(ss).mkString
})
val start = System.currentTimeMillis()
(1 to tries).foreach(_ => {
new StringBuilder(s.length() + 2*ss.length(), ss).append(s).append(ss).mkString
})
val stop = System.currentTimeMillis()
val delta:Double = stop -start
println("Total time: " + delta + ".\n Avg. time: " + (delta/tries))
}
def surroundStatic(s:String) = {
(1 to tries).foreach(_ => {
"ABC" + s + "ABC"
})
val start = System.currentTimeMillis()
(1 to tries).foreach(_ => {
"ABC" + s + "ABC"
})
val stop = System.currentTimeMillis()
val delta:Double = stop -start
println("Total time: " + delta + ".\n Avg. time: " + (delta/tries))
}
}
Calling this a few times in the interpreter yields:
scala> AppendTimeTest.surroundStatic("foo")
Total time: 241.0.
Avg. time: 4.82E-4
scala> AppendTimeTest.surround("foo", "ABC")
Total time: 222.0.
Avg. time: 4.44E-4
scala> AppendTimeTest.surroundStatic("foo")
Total time: 231.0.
Avg. time: 4.62E-4
scala> AppendTimeTest.surround("foo", "ABC")
Total time: 247.0.
Avg. time: 4.94E-4
So unless you are appending many different non-constant strings I believe you won't see any big difference in performance. Alsol concatenating constants (i.e. "ABC" + "foo" + "ABC") is as you probably know handled by the compiler (this is atleast the case in Java, but I believe it's also holds true for Scala)
Scala is pretty close to java for string manipulation. Your example:
val stringval = "bar"
"abc" + stringval + "abc"
actually ends up (in java style) as:
(new StringBuilder()).append("abc").append(stringval()).append("abc").toString()
This is the same behaviour as java, + between strings are generally translated to instances of StringBuilder, which are more efficient. So here, you're doing three copies to the StringBuilder, and one final String creation, and depending upon the length of the strings, potentially three reallocations.
In your example:
def surround(s:String, ss:String):String = {
val surrounded = new StringBuilder(s.length() + 2*ss.length(), s)
surrounded.insert(0,ss)
surrounded.append(ss)
surrounded.mkString
}
you're doing the same number of copies (3) but you're only allocating once.
Advice: If the strings are small, use +, it makes very little difference. If the strings are relatively large, then initialise the StringBuilder with the relevant length, but then simply append. This is much clearer for the other developers.
As always with performance, measure it, and if the differences are small, use the simpler solution
Generally in Java / Scala, String literals in source code are interned for efficiency, which means that all copies of it in your code will refer to the same object. So there will only be one "copy" of "abc".
Java and Scala don't have "constants" like in C++. Variables initialised with val are immutable, but in the general case vals from one instance to the next are not the same (specified via constructor).
So in theory the compiler could check for simple cases where the val will always initialise the same value and optimise accordingly, but it would add extra complicaton. And you could just optimise it yourself by writing it as "abc123abc".
Others have already addressed your bonus question.
You can also use mkString on a String.
def surround(s:String, ss:String) = s.mkString(ss, "", ss)
Internally it is using StringBuilder too, but I like the way it reads.
I would maybe just even inline it.

Resources