Synchronizing on function parameter for multithreaded memoization

My core question is: how can I implement synchronization in a method on the combination of the object instance and the method parameter?
Here are the details of my situation. I'm using the following code to implement memoization, adapted from this answer:
/**
 * Memoizes a unary function.
 * @param f the function to memoize
 * @tparam T the argument type
 * @tparam R the result type
 */
class Memoized[-T, +R](f: T => R) extends (T => R) {
  import scala.collection.mutable
  private[this] val cache = mutable.Map.empty[T, R]
  def apply(x: T): R = cache.getOrElse(x, {
    val y = f(x)
    cache += ((x, y))
    y
  })
}
In my project, I'm memoizing Futures to deduplicate asynchronous API calls. This worked fine when I used for...yield to map over the resulting futures, created with the standard ExecutionContext. But when I upgraded to Scala Async for nicer handling of these futures, I realized that the multithreading that library uses allowed multiple threads to enter apply, defeating memoization: the async blocks all executed in parallel and entered the getOrElse thunk before the cache could be updated with a new Future.
To work around this, I put the main apply function in a this.synchronized block:
def apply(x: T): R = this.synchronized {
  cache.getOrElse(x, {
    val y = f(x)
    cache += ((x, y))
    y
  })
}
This restored the memoized behavior. The drawback is that it blocks calls with different parameters, at least until the Future is created. I'm wondering if there is a way to set up finer-grained synchronization on the combination of the Memoized instance and the value of the x parameter to apply. That way, only calls that would be deduplicated would be blocked.
As a side note, I'm not sure this is truly performance critical, because the synchronized block will release once the Future is created and returned (I think?). But if there are any concerns with this that I'm not thinking of, I would also like to know.
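One finer-grained approach, sketched here as a suggestion rather than taken from the thread (ConcurrentMemoized is an illustrative name, and the lambda needs Scala 2.12+ for the SAM conversion), is to delegate per-key atomicity to java.util.concurrent.ConcurrentHashMap, whose computeIfAbsent locks only the bin holding the key being computed:
import java.util.concurrent.ConcurrentHashMap
import java.util.function.{Function => JFunction}

// Sketch: computeIfAbsent runs f at most once per key and only blocks
// concurrent callers that ask for the *same* key while f is running.
class ConcurrentMemoized[T, R](f: T => R) extends (T => R) {
  private[this] val cache = new ConcurrentHashMap[T, R]()
  private[this] val compute: JFunction[T, R] = (k: T) => f(k)
  def apply(x: T): R = cache.computeIfAbsent(x, compute)
}
When R is a Future, f only constructs the Future, so the bin lock is held briefly; note that f must not touch the cache itself, since computeIfAbsent may not be re-entered on the same map.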

Akka actors combined with futures provide a powerful way to wrap mutable state without blocking. Here is a simple example of how to use an Actor for memoization:
import akka.actor._
import akka.util.Timeout
import akka.pattern.ask
import scala.concurrent._
import scala.concurrent.duration._

class Memoize(system: ActorSystem) {
  class CacheActor(f: Any => Future[Any]) extends Actor {
    private[this] val cache = scala.collection.mutable.Map.empty[Any, Future[Any]]
    def receive = {
      case x => sender ! cache.getOrElseUpdate(x, f(x))
    }
  }
  def apply[K, V](f: K => Future[V]): K => Future[V] = {
    val fCast = f.asInstanceOf[Any => Future[Any]]
    val actorRef = system.actorOf(Props(new CacheActor(fCast)))
    implicit val timeout = Timeout(5.seconds)
    import system.dispatcher
    x => actorRef.ask(x).asInstanceOf[Future[Future[V]]].flatMap(identity)
  }
}
We can use it like:
val system = ActorSystem()
val memoize = new Memoize(system)
val f = memoize { x: Int =>
  println("Computing for " + x)
  scala.concurrent.Future.successful {
    Thread.sleep(1000)
    x + 1
  }
}
import system.dispatcher
f(5).foreach(println)
f(5).foreach(println)
And "Computing for 5" will only print a single time, but "6" will print twice.
There are some scary-looking asInstanceOf calls, but usage is type-safe: the casts are hidden behind apply[K, V], so the actor only ever receives keys of type K and replies with the Future[V] values produced by f.

Related

Parallel data fetches that are batched

Is this pattern of batching a subset of a collection for parallel processing OK? Is there a better way to do this that I am missing?
When given a collection of entity IDs that need to be fetched from a service that returns a Scala Future, instead of making all the requests at once we batch them, because the service can only handle a certain number of requests at a time. In a way it is a primitive throttling mechanism to avoid overwhelming the data store. It looks like a code smell.
import scala.collection.generic.CanBuildFrom
import scala.concurrent.{ExecutionContext, Future}

object FutureHelper {
  def batchSerially[A, B, M[a] <: TraversableOnce[a]](l: M[A])(dbFetch: A => Future[B])(
      implicit ctx: ExecutionContext, buildFrom: CanBuildFrom[M[A], B, M[B]]): Future[M[B]] =
    l.foldLeft(Future.successful(buildFrom(l))) {
      case (accF, curr) => for {
        acc <- accF
        b <- dbFetch(curr)
      } yield acc += b
    }.map(s => s.result())
}
object FutureBatching extends App {
  implicit val e: ExecutionContext = scala.concurrent.ExecutionContext.Implicits.global

  val entityIds = List(1, 2, 3, 4, 5, 6)
  val batchSize = 2

  val listOfFetchedResults =
    FutureHelper.batchSerially(entityIds.grouped(batchSize)) { groupedByBatchSize =>
      Future.sequence {
        groupedByBatchSize.map(i => Future.successful(i))
      }
    }.map(_.flatten.toList)
}
I believe that by default a scala.Future starts executing as soon as it is created, so the invocations of dbFetch() would kick off the connections right away. Since the foldLeft transforms all the suspended A => Future[B] calls into actual Future objects, I don't believe the batching will happen the way you want.
Yes, I believe that code works correctly (see the comments): dbFetch(curr) is invoked inside the flatMap on accF, so each Future is created, and each fetch started, only after the previous batch has completed.
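A standalone sketch makes this easy to verify (slowFetch and the timings are illustrative stand-ins, not from the question); the "starting batch" lines print roughly 500 ms apart, because the next batch is only created after the previous one finishes:
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// Hypothetical stand-in for dbFetch: prints when the batch actually starts
def slowFetch(batch: List[Int]): Future[List[Int]] = {
  println(s"starting batch $batch")
  Future { Thread.sleep(500); batch.map(_ + 1) }
}

val res = List(1, 2, 3, 4).grouped(2).foldLeft(Future.successful(Vector.empty[List[Int]])) {
  case (accF, batch) =>
    for {
      acc <- accF             // wait for the previous batch to finish...
      b   <- slowFetch(batch) // ...before this batch is even created
    } yield acc :+ b
}
println(Await.result(res, 10.seconds).flatten) // Vector(2, 3, 4, 5)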
Another way is to let the pool define the level of parallelism, but that doesn't always work, depending on your execution environment.
I've had some success doing batching with the parallel collections. If you create a collection where the number of elements represents the number of concurrent activities, you can simply use .par. For instance:
import scala.collection.immutable.VectorBuilder
import scala.collection.parallel.ParSeq

// Partition xs into numBatches Set elements, and invoke processBatch on each Set in parallel
def batch[A, B](xs: Iterable[A], numBatches: Int)
    (processBatch: Set[A] => Set[B]): ParSeq[B] =
  split(xs, numBatches).par.flatMap(processBatch)

// Split the input iterable into numBatches sub-sets.
// For example split(Seq(1,2,3,4,5,6), 3) = Seq(Set(1, 4), Set(2, 5), Set(3, 6))
def split[A](xs: Iterable[A], numBatches: Int): Seq[Set[A]] = {
  val buffers: Vector[VectorBuilder[A]] = Vector.fill(numBatches)(new VectorBuilder[A]())
  val elems = xs.toIndexedSeq
  for (i <- 0 until elems.length) {
    buffers(i % numBatches) += elems(i)
  }
  buffers.map(_.result.toSet)
}
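Hypothetical usage of the above (written for Scala 2.12, where .par is built in; Scala 2.13 needs the separate scala-parallel-collections module):
// Square 1..10 in three concurrent batches; each Set is processed as its own task
val squares: ParSeq[Int] = batch(1 to 10, 3)(set => set.map(n => n * n))
println(squares.seq.toVector.sorted) // Vector(1, 4, 9, 16, 25, 36, 49, 64, 81, 100)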

Implicit class holding mutable variable in multithreaded environment

I need to implement a parallel method, which takes two computation blocks, a and b, and starts each of them in a new thread. The method must return a tuple with the result values of both the computations. It should have the following signature:
def parallel[A, B](a: => A, b: => B): (A, B)
I managed to solve the exercise using a straightforward Java-like approach. Then I decided to come up with a solution using an implicit class. Here it is:
object ParallelApp extends App {
  implicit class ParallelOps[A](a: => A) {
    var result: A = _
    def spawn(): Unit = {
      val thread = new Thread {
        override def run(): Unit = {
          result = a
        }
      }
      thread.start()
      thread.join()
    }
  }

  def parallel[A, B](a: => A, b: => B): (A, B) = {
    a.spawn()
    b.spawn()
    (a.result, b.result)
  }

  println(parallel(1 + 2, "a" + "b"))
}
For some unknown reason, I receive the output (null,null). Could you please point out where the problem is?
Spoiler alert: it's not complicated. It's funny, like a magic trick (if you consider reading the documentation about the Java Memory Model "funny", that is). If you haven't figured it out yet, I would highly recommend trying to figure it out yourself first, otherwise it won't be funny. Someone should make a "division-by-zero proves 2 = 4"-style riddle out of it.
Consider the following shorter example:
implicit class Foo[A](a: A) {
  var result: String = "not initialized"
  def computeResult(): Unit = result = "Yay, result!"
}

val a = "a string"
a.computeResult()
println(a.result)
When run, it prints
not initialized
despite the fact that we invoked computeResult() and set result to "Yay, result!". The problem is that the two invocations a.computeResult() and a.result belong to two completely independent instances of Foo. The implicit conversion is performed twice, and the second implicitly created object doesn't know anything about the changes in the first implicitly created object. It has nothing to do with threads or JMM at all.
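To see the fix, create the instance explicitly so that both calls go through the same object; a minimal sketch using the Foo class above:
val foo = new Foo("a string") // one explicit instance instead of two implicit conversions
foo.computeResult()
println(foo.result) // prints "Yay, result!"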
By the way: your code is not parallel. Calling join right after calling start doesn't bring you anything, your main thread will simply go idle and wait until another thread finishes. At no point will there be two threads that do any useful work concurrently.
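For reference, here is one corrected thread-based version, as a sketch: both threads are started before either is joined, so the computations overlap, and join() provides the happens-before edge that makes reading the plain vars safe afterwards.
def parallel[A, B](a: => A, b: => B): (A, B) = {
  var resultA: Option[A] = None
  var resultB: Option[B] = None
  val ta = new Thread { override def run(): Unit = resultA = Some(a) }
  val tb = new Thread { override def run(): Unit = resultB = Some(b) }
  ta.start(); tb.start() // start both before joining either
  ta.join(); tb.join()   // join() guarantees the writes above are visible
  (resultA.get, resultB.get)
}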
EDIT: Fixed a bug pointed out by Andrey Tyukin
One way to solve your problem is to use Scala Futures. See the official documentation and tutorials, and the useful Klang blog.
You'll typically need some combination of these imports:
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.{Await, Future}
import scala.util.{Failure, Success}
import scala.concurrent.duration._
An asynchronous example:
def parallelAsync[A, B](a: => A, b: => B): Future[(A, B)] = {
  // As per Andrey Tyukin's comments, a for-comprehension like
  //   for { i <- Future(a); j <- Future(b) } yield (i, j)
  // runs the two futures sequentially, so we get no benefit from it.
  // I leave this note here so others will not fall into my trap.
  Future(a) zip Future(b)
}
parallelAsync(1 + 2, "a" + "b").onComplete {
  case Success(x) => println(x)
  case Failure(e) => e.printStackTrace()
}
If you must block until both are complete, you can use this:
def parallelSync[A, B](a: => A, b: => B): (A, B) = {
  // See the comment above about the sequential for-comprehension variant
  val tuple = Future(a) zip Future(b)
  Await.result(tuple, 5.seconds)
}

println(parallelSync(3 + 4, "c" + "d"))
When running these little examples, don't forget to sleep a little bit at the end so the program won't end before the results come back
Thread.sleep(3000)

Why am I getting a race condition in multi-threading scala?

I am trying to parallelise a p-norm calculation over an array.
To achieve that I tried the following. I understand I could solve this differently, but I am interested in understanding where the race condition occurs:
val toSum = Array(0, 1, 2, 3, 4, 5, 6)

// Calculate the sum over a segment of an array
def sumSegment(a: Array[Int], p: Double, s: Int, t: Int): Int = {
  val res = {for (i <- s until t) yield scala.math.pow(a(i), p)}.reduceLeft(_ + _)
  res.toInt
}

// Calculate the p-norm over an Array a
def parallelpNorm(a: Array[Int], p: Double): Double = {
  var acc = 0L
  // The worker that should calculate the sum over a slice of the array
  class sumSegmenter(s: Int, t: Int) extends Thread {
    override def run(): Unit = {
      // Calculate the sum over the slice
      val subsum = sumSegment(a, p, s, t)
      // Add the sum of the slice to the accumulator in a synchronized fashion
      val x = new AnyRef {}
      x.synchronized {
        acc = acc + subsum
      }
    }
  }
  val split = a.size / 2
  val seg_one = new sumSegmenter(0, split)
  val seg_two = new sumSegmenter(split, a.size)
  seg_one.start()
  seg_two.start()
  seg_one.join()
  seg_two.join()
  scala.math.pow(acc, 1.0 / p)
}

println(parallelpNorm(toSum, 2))
Expected output is 9.5393920142 but instead some runs give me 9.273618495495704 or even 2.23606797749979.
Any recommendations where the race condition could happen?
The problem has been explained in the other answer, but a better way to avoid this race condition while also improving performance is to use an AtomicInteger:
import java.util.concurrent.atomic.AtomicInteger

// Calculate the p-norm over an Array a
def parallelpNorm(a: Array[Int], p: Double): Double = {
  val acc = new AtomicInteger(0)
  // The worker that should calculate the sum over a slice of the array
  class sumSegmenter(s: Int, t: Int) extends Thread {
    override def run(): Unit = {
      // Calculate the sum over the slice
      val subsum = sumSegment(a, p, s, t)
      // Add the sum of the slice to the accumulator atomically
      acc.getAndAdd(subsum)
    }
  }
  val split = a.length / 2
  val seg_one = new sumSegmenter(0, split)
  val seg_two = new sumSegmenter(split, a.length)
  seg_one.start()
  seg_two.start()
  seg_one.join()
  seg_two.join()
  scala.math.pow(acc.get, 1.0 / p)
}
Modern processors can do atomic operations without blocking which can be much faster than explicit synchronisation. In my tests this runs twice as fast as the original code (with correct placement of x)
Move val x = new AnyRef{} outside sumSegmenter (that is, into parallelpNorm) -- the problem is that each thread is using its own mutex rather than sharing one.
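Spelled out, that fix looks like this (a sketch reusing sumSegment from the question):
def parallelpNorm(a: Array[Int], p: Double): Double = {
  var acc = 0L
  val lock = new AnyRef // one lock shared by both workers, created once here
  class SumSegmenter(s: Int, t: Int) extends Thread {
    override def run(): Unit = {
      val subsum = sumSegment(a, p, s, t)
      lock.synchronized { acc = acc + subsum } // both threads use the same monitor
    }
  }
  val split = a.length / 2
  val segOne = new SumSegmenter(0, split)
  val segTwo = new SumSegmenter(split, a.length)
  segOne.start(); segTwo.start()
  segOne.join(); segTwo.join()
  scala.math.pow(acc.toDouble, 1.0 / p)
}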

Scala Future/Promise fast-fail pipeline

I want to launch two or more Futures/Promises in parallel and fail as soon as one of them fails, without waiting for the rest to complete.
What is the most idiomatic way to compose this pipeline in Scala?
EDIT: more contextual information.
I have to launch two external processes, one writing to a fifo file and the other reading from it. If the writer process fails, the reader might hang forever waiting for input from the file. So I want to launch both processes in parallel and fail fast if either Future/Promise fails, without waiting for the other to complete.
Below is some sample code to make this more precise. The commands are not exactly cat and tail; I have used them for brevity.
val future1 = Future { executeShellCommand("cat file.txt > fifo.pipe") }
val future2 = Future { executeShellCommand("tail fifo.pipe") }
If I understand the question correctly, what we are looking for is a fail-fast sequence implementation, akin to a failure-biased version of firstCompletedOf.
Here, we eagerly register a failure callback in case one of the futures fails early on, ensuring that we fail as soon as any of the futures fail.
import scala.concurrent.{Future, Promise}
import scala.concurrent.ExecutionContext.Implicits.global

def failFast[T](futures: Seq[Future[T]]): Future[Seq[T]] = {
  val promise = Promise[Seq[T]]()
  // tryFailure (rather than failure) avoids an IllegalStateException when a
  // second future fails after the promise has already been completed
  futures.foreach(f => f.failed.foreach(ex => promise.tryFailure(ex)))
  val res = Future.sequence(futures)
  promise.completeWith(res).future
}
In contrast to Future.sequence, this implementation will fail as soon as any of the futures fail, regardless of ordering.
Let's show that with an example:
import scala.util.Try

// Helper method to measure elapsed time
def resilientTime[T](t: => T): (Try[T], Long) = {
  val t0 = System.currentTimeMillis
  val res = Try(t)
  (res, System.currentTimeMillis - t0)
}
import scala.concurrent.duration._
import scala.concurrent.Await
The first future fails (failure after 2 seconds):
val f1 = Future[Int]{Thread.sleep(2000); throw new Exception("boom")}
val f2 = Future[Int]{Thread.sleep(5000); 42}
val f3 = Future[Int]{Thread.sleep(10000); 101}
val res = failFast(Seq(f1,f2,f3))
resilientTime(Await.result(res, 10.seconds))
// res: (scala.util.Try[Seq[Int]], Long) = (Failure(java.lang.Exception: boom),1998)
The last future fails, also after 2 seconds (note the order in the sequence construction):
val f1 = Future[Int]{Thread.sleep(2000); throw new Exception("boom")}
val f2 = Future[Int]{Thread.sleep(5000); 42}
val f3 = Future[Int]{Thread.sleep(10000); 101}
val res = failFast(Seq(f3,f2,f1))
resilientTime(Await.result(res, 10.seconds))
// res: (scala.util.Try[Seq[Int]], Long) = (Failure(java.lang.Exception: boom),1998)
Compare with Future.sequence, where the failure time depends on the ordering (here, failure after 10 seconds):
val f1 = Future[Int]{Thread.sleep(2000); throw new Exception("boom")}
val f2 = Future[Int]{Thread.sleep(5000); 42}
val f3 = Future[Int]{Thread.sleep(10000); 101}
val seq = Seq(f3,f2,f1)
resilientTime(Await.result(Future.sequence(seq), 10.seconds))
//res: (scala.util.Try[Seq[Int]], Long) = (Failure(java.lang.Exception: boom),10000)
Use Future.sequence:
val both = Future.sequence(Seq(
  firstFuture,
  secondFuture))
This is the correct way to aggregate two or more futures where the failure of one fails the aggregated future, and where the aggregated future completes when all inner futures complete. An older version of this answer suggested a for-comprehension, which, while very common, would not fail immediately when a later future fails, but would instead wait for the earlier ones to complete first.
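For contrast, here is a sketch of the for-comprehension version that older revision described; because flatMap sequences the observation of results, f2's early failure only surfaces after f1 completes:
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

val f1 = Future { Thread.sleep(5000); 1 }
val f2 = Future[Int] { Thread.sleep(1000); throw new Exception("boom") }

val both = for { a <- f1; b <- f2 } yield (a, b)
// Fails with "boom", but only after about 5 seconds (f1's duration), not 1.
println(Await.ready(both, 10.seconds).value)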
Zip the futures
val f1 = Future { doSomething() }
val f2 = Future { doSomething() }
val resultF = f1 zip f2
resultF fails if either f1 or f2 fails.
The time taken to fail is roughly min(f1Time, f2Time); note, though, that with the flatMap-based zip of older standard-library versions, a failure in f2 only surfaces once f1 has completed.
scala> import scala.util._
import scala.util._
scala> import scala.concurrent._
import scala.concurrent._
scala> import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.ExecutionContext.Implicits.global
scala> val f = Future { Thread.sleep(10000); throw new Exception("f") }
f: scala.concurrent.Future[Nothing] = scala.concurrent.impl.Promise$DefaultPromise#da1f03e
scala> val g = Future { Thread.sleep(20000); throw new Exception("g") }
g: scala.concurrent.Future[Nothing] = scala.concurrent.impl.Promise$DefaultPromise#634a98e3
scala> val x = f zip g
x: scala.concurrent.Future[(Nothing, Nothing)] = scala.concurrent.impl.Promise$DefaultPromise#3447e854
scala> x onComplete { case Success(x) => println(x) case Failure(th) => println(th)}
result: java.lang.Exception: f

Performance difference in toString.map and toString.toArray.map

While coding Project Euler problems, I ran across something I think is bizarre:
The method toString.map is slower than toString.toArray.map.
Here's an example:
def main(args: Array[String]): Unit = {
  def toDigit(num: Int) = num.toString.map(_ - 48)             // 2137 ms
  def toDigitFast(num: Int) = num.toString.toArray.map(_ - 48) // 592 ms

  val startTime = System.currentTimeMillis
  (1 to 1200000).map(toDigit)
  println(System.currentTimeMillis - startTime)
}
Shouldn't the method map on String fall back to a map over the array? Why is there such a noticeable difference? (Note that increasing the number even causes a stack overflow in the non-array case.)
Original
Could be because toString.map uses the WrappedString implicit, while toString.toArray.map uses the WrappedArray implicit to resolve map.
Let's see map, as defined in TraversableLike:
def map[B, That](f: A => B)(implicit bf: CanBuildFrom[Repr, B, That]): That = {
  val b = bf(repr)
  b.sizeHint(this)
  for (x <- this) b += f(x)
  b.result
}
WrappedString uses a StringBuilder as builder:
def +=(x: Char): this.type = { append(x); this }

def append(x: Any): StringBuilder = {
  underlying append String.valueOf(x)
  this
}
The String.valueOf call for Any uses Java's Object.toString on the Char instances, which likely get boxed first. These extra operations might be the cause of the speed difference, versus the supposedly shorter code paths of the Array builder.
This is a guess though, would have to measure.
Edit
After revising, the general point still stands, but I referred to the wrong implicits, since the toDigit methods return an Int sequence (or the like), not a transformed String as I had misread.
toDigit uses LowPriorityImplicits.fallbackStringCanBuildFrom[T]: CanBuildFrom[String, T, immutable.IndexedSeq[T]], with T = Int, which just defers to a general IndexedSeq builder.
toDigitFast uses a direct Array implicit of type CanBuildFrom[Array[_], T, Array[T]], which is unarguably faster.
Passing the following CBF for toDigit explicitly makes the two methods on par:
import scala.collection.generic.CanBuildFrom
import scala.collection.mutable.ArrayBuilder

object FastStringToArrayBuild {
  def canBuildFrom[T : ClassManifest] = new CanBuildFrom[String, T, Array[T]] {
    private def newBuilder = ArrayBuilder.make[T]()
    def apply(from: String) = newBuilder
    def apply() = newBuilder
  }
}
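Hypothetical usage, passing the CanBuildFrom explicitly at the call site:
// Builds an Array[Int] directly from the String, skipping the generic IndexedSeq builder
def toDigitExplicit(num: Int): Array[Int] =
  num.toString.map(_ - 48)(FastStringToArrayBuild.canBuildFrom[Int])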
You're being fooled by running out of memory. The toDigit version does create more intermediate objects, but if you have plenty of memory then the GC won't be heavily impacted (and it'll all run faster). For example, if instead of creating 1.2 million numbers once, I create 12k numbers 100 times in a row, I get approximately equal times for the two methods. If I create 1.2k 5-digit numbers 1000 times in a row, I find that toDigit is about 5% faster.
Given that the toDigit method produces an immutable collection, which is better when all else is equal since it is easier to reason about, and given that all else is equal for all but highly demanding tasks, I think the library is as it should be.
When trying to improve performance, of course one needs to keep all sorts of tricks in mind; one of these is that arrays have better memory characteristics for collections of known length than do the fancy collections in the Scala library. Also, one needs to know that map isn't the fastest way to get things done; if you really wanted this to be fast you should
final def toDigitReallyFast(num: Int, accum: Long = 0L, iter: Int = 0): Array[Byte] = {
  if (num == 0) {
    val ans = new Array[Byte](math.max(1, iter))
    var i = 0
    var ac = accum
    while (i < ans.length) {
      ans(i) = (ac & 0xF).toByte // the lowest nibble holds the most significant remaining digit
      ac >>= 4
      i += 1
    }
    ans
  }
  else {
    val next = num / 10
    toDigitReallyFast(next, (accum << 4) | (num - 10*next), iter + 1)
  }
}
which on my machine is 4x faster than either of the others. And you can get almost 3x faster yet again if you leave everything packed in a Long, and fill an array with a while loop instead of using a 1 to N map:
final def toDigitExtremelyFast(num: Int, accum: Long = 0L, iter: Int = 0): Long = {
  if (num == 0) accum | (iter.toLong << 48)
  else {
    val next = num / 10
    toDigitExtremelyFast(next, accum | ((num - 10*next).toLong << (4*iter)), iter + 1)
  }
}
// Loop, instead of a 1 to N map, for the 1.2k-number case
{
  var i = 10000
  val a = new Array[Long](1201)
  while (i <= 11200) {
    a(i - 10000) = toDigitExtremelyFast(i) // the packed-Long variant defined above
    i += 1
  }
  a
}
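For completeness, a hypothetical helper (not part of the original answer) to unpack that Long back into digits, given the layout above: the digit count sits in bits 48 and up, and the 10^i digit sits in nibble i:
def unpackDigits(packed: Long): Array[Byte] = {
  val n = math.max(1, (packed >>> 48).toInt) // digit count stored above bit 48
  val out = new Array[Byte](n)
  var ac = packed & 0xFFFFFFFFFFFFL          // keep only the digit nibbles
  var i = 0
  while (i < n) {
    out(n - 1 - i) = (ac & 0xF).toByte       // nibble i is the 10^i digit
    ac >>= 4
    i += 1
  }
  out
}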
As with many things, performance tuning is highly dependent on exactly what you want to do. In contrast, library design has to balance many different concerns. I do think it's worth noticing where the library is sub-optimal with respect to performance, but this isn't really one of those cases IMO; the flexibility is worth it for the common use cases.
