Parallel data fetches that are batched - multithreading

Is this pattern of batching a subset of a collection for parallel processing ok? Is there a better way to do this that I am missing?
We are given a collection of entity ids that need to be fetched from a service which returns a Scala Future. Instead of making all the requests at once, we batch them, because the service can only handle a certain number of requests at a time. In a way it is a primitive throttling mechanism to avoid overwhelming the data store. It looks like a code smell.
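import scala.collection.generic.CanBuildFrom
import scala.concurrent.{ExecutionContext, Future}

object FutureHelper {
  // Folds over l, calling dbFetch for an element only after the previous Future has completed.
  def batchSerially[A, B, M[a] <: TraversableOnce[a]](l: M[A])(dbFetch: A => Future[B])(
    implicit ctx: ExecutionContext, buildFrom: CanBuildFrom[M[A], B, M[B]]): Future[M[B]] =
    l.foldLeft(Future.successful(buildFrom(l))) {
      case (accF, curr) => for {
        acc <- accF
        b <- dbFetch(curr)
      } yield acc += b
    }.map(s => s.result())
}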
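object FutureBatching extends App {
  implicit val e: ExecutionContext = scala.concurrent.ExecutionContext.Implicits.global
  val entityIds = List(1,2,3,4,5,6)
  val batchSize = 2
  val listOfFetchedResults =
    FutureHelper.batchSerially(entityIds.grouped(batchSize)) { groupedByBatchSize =>
      // Stand-in for the real service call: each id in the batch becomes its own Future.
      Future.sequence(groupedByBatchSize.map(i => Future.successful(i)))
    }
}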

I believe that by default a scala.Future starts executing as soon as it is created, so the invocations of dbFetch() will kick off the connections right away. Since the foldLeft transforms all the suspended A => Future[B] into actual Future objects, I don't believe the batching will happen the way you want.
Yes, I believe that code works correctly (see comments).
Another way is to let the pool define the level of parallelism, but that doesn't always work, depending on your execution environment.
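To make the ordering argument concrete, here is a minimal sketch of batch-at-a-time fetching that relies only on the standard library. The names BatchedFetch, inBatches, fetch and BatchedFetchDemo are illustrative and do not come from the question. The point it tries to show is that fetch is only called inside flatMap, so the requests of one batch are not created until the previous batch has completed.

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object BatchedFetch {
  // Runs the batches one after another; the requests inside a batch run concurrently.
  def inBatches[A, B](ids: Seq[A], batchSize: Int)(fetch: A => Future[B])(
      implicit ec: ExecutionContext): Future[Vector[B]] =
    ids.grouped(batchSize).foldLeft(Future.successful(Vector.empty[B])) { (accF, batch) =>
      // fetch is invoked here, i.e. only once the accumulated Future has completed.
      accF.flatMap { acc =>
        Future.sequence(batch.map(fetch)).map(fetched => acc ++ fetched)
      }
    }
}

object BatchedFetchDemo {
  // Hypothetical usage: the "service" just multiplies each id by 10.
  def run(): Vector[Int] = {
    import ExecutionContext.Implicits.global
    val result = BatchedFetch.inBatches(List(1, 2, 3, 4, 5, 6), 2)(i => Future(i * 10))
    Await.result(result, 5.seconds)
  }
}
```

Blocking with Await is only there to keep the demo self-contained; real code would normally keep composing the returned Future.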
I've had some success doing batching using the parallel collections. If you create a collection where the number of elements represents the number of concurrent activities, you can use .par. For instance,
import scala.collection.immutable.VectorBuilder
import scala.collection.parallel.ParSeq

// partition xs into numBatches Set elements, and invoke processBatch on each Set in parallel
def batch[A,B](xs: Iterable[A], numBatches: Int)
              (processBatch: Set[A] => Set[B]): ParSeq[B] = split(xs,numBatches).par.flatMap(processBatch)

// Split the input iterable into numBatches sub-sets.
// For example split(Seq(1,2,3,4,5,6), 3) = Seq(Set(1, 4), Set(2, 5), Set(3, 6))
def split[A](xs: Iterable[A], numBatches: Int): Seq[Set[A]] = {
  val buffers: Vector[VectorBuilder[A]] = Vector.fill(numBatches)(new VectorBuilder[A]())
  val elems = xs.toIndexedSeq
  for (i <- 0 until elems.length) {
    buffers(i % numBatches) += elems(i)
  }
  buffers.map(_.result().toSet)
}


Implementing a multithreading function for running "foreach, map and reduce" parallel

I am quite new to Scala but I am learning about Threads and Multithreading.
As the title says, I am trying to implement a way to divide the work across a variable number of threads.
We are given this code:
/** Executes the provided function for each entry in the input sequence in parallel.
  *
  * @param input the input sequence
  * @param parallelism the number of threads to use
  * @param f the function to run
  */
def parallelForeach[A](input: IndexedSeq[A], parallelism: Int, f: A => Unit): Unit = ???
I tried implementing it like this:
def parallelForeach[A](input: IndexedSeq[A], parallelism: Int, f: A => Unit): Unit = {
  if (parallelism < 1) {
    throw new IllegalArgumentException("a degree of parallelism < 1 is not allowed for parallel foreach")
  }
  val threads = (0 until parallelism).map { threadId =>
    val startIndex = threadId * input.size / parallelism
    val endIndex = (threadId + 1) * input.size / parallelism
    val task: Runnable = () => {
      (startIndex until endIndex).foreach { A =>
        val key = input.grouped(input.size / parallelism)
        val x: Unit = input.foreach(A => f(A))
      }
    }
    new Thread(task)
  }
  threads.foreach(_.start())
  threads.foreach(_.join())
}
for this test:
test("parallel foreach should perform the given function once for each element in the sequence") {
  val counter = AtomicLong(0L)
  parallelForeach((1 to 100), 16, counter.addAndGet(_))
  assert(counter.get() == 5050)
}
But, as you can guess, it doesn't work this way as my result isn't 5050 but 505000.
Now here is my question. How do I implement a way to use multithreading efficiently, so there are for example 16 different threads working at the same time?
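Check your test: it uses "1 to 100".
In your code, every index in a thread's range loops over the whole input again (input.foreach), so f is applied to each of the 100 elements 100 times. This is why your result is 100 times too large.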
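A minimal sketch of one possible fix, keeping the start/end index calculation from the question: each thread applies f only to the elements in its own index range. The wrapper object ParallelForeachSketch is a hypothetical name, and the sketch assumes f is safe to call from several threads at once.

```scala
object ParallelForeachSketch {
  def parallelForeach[A](input: IndexedSeq[A], parallelism: Int, f: A => Unit): Unit = {
    require(parallelism >= 1, "parallelism must be at least 1")
    val threads = (0 until parallelism).map { threadId =>
      // Each thread gets the half-open index range [startIndex, endIndex).
      val startIndex = threadId * input.size / parallelism
      val endIndex = (threadId + 1) * input.size / parallelism
      new Thread(new Runnable {
        // Only the elements of this thread's slice are visited.
        def run(): Unit = (startIndex until endIndex).foreach(i => f(input(i)))
      })
    }
    threads.foreach(_.start()) // start all threads first ...
    threads.foreach(_.join())  // ... then wait for all of them
  }
}
```

Starting every thread before joining any of them is what lets the slices run at the same time. With 100 elements and 16 threads the ranges cover the indices 0 to 99 exactly once.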

Implicit class holding mutable variable in multithreaded environment

I need to implement a parallel method, which takes two computation blocks, a and b, and starts each of them in a new thread. The method must return a tuple with the result values of both the computations. It should have the following signature:
def parallel[A, B](a: => A, b: => B): (A, B)
I managed to solve the exercise with a straightforward Java-like approach. Then I decided to come up with a solution using an implicit class. Here it is:
object ParallelApp extends App {
  implicit class ParallelOps[A](a: => A) {
    var result: A = _
    def spawn(): Unit = {
      val thread = new Thread {
        override def run(): Unit = {
          result = a
        }
      }
      thread.start()
      thread.join()
    }
  }
  def parallel[A, B](a: => A, b: => B): (A, B) = {
    a.spawn()
    b.spawn()
    (a.result, b.result)
  }
  println(parallel(1 + 2, "a" + "b"))
}
For some unknown reason, I receive the output (null,null). Could you please point out where the problem is?
Spoiler alert: It's not complicated. It's funny, like a magic trick (if you consider reading the documentation about Java Memory Model "funny", that is). If you haven't figured it out yet, I would highly recommend to try to figure it out, otherwise it won't be funny. Someone should make a "division-by-zero proves 2 = 4"-riddle out of it.
Consider the following shorter example:
implicit class Foo[A](a: A) {
  var result: String = "not initialized"
  def computeResult(): Unit = result = "Yay, result!"
}

val a = "a string"
a.computeResult()
println(a.result)
When run, it prints
not initialized
despite the fact that we invoked computeResult() and set result to "Yay, result!". The problem is that the two invocations a.computeResult() and a.result belong to two completely independent instances of Foo. The implicit conversion is performed twice, and the second implicitly created object doesn't know anything about the changes in the first implicitly created object. It has nothing to do with threads or JMM at all.
By the way: your code is not parallel. Calling join right after calling start doesn't bring you anything, your main thread will simply go idle and wait until another thread finishes. At no point will there be two threads that do any useful work concurrently.
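For comparison, a minimal thread-based sketch that avoids both problems: the results live in local variables rather than in an implicitly created wrapper, and both threads are started before either is joined. The object name ParallelSketch and the Option-based result holders are choices made for this illustration only.

```scala
object ParallelSketch {
  def parallel[A, B](a: => A, b: => B): (A, B) = {
    // Written by the worker threads, read by the caller after join().
    var resA: Option[A] = None
    var resB: Option[B] = None
    val ta = new Thread(new Runnable { def run(): Unit = { resA = Some(a) } })
    val tb = new Thread(new Runnable { def run(): Unit = { resB = Some(b) } })
    ta.start(); tb.start() // both computations are now running
    ta.join(); tb.join()   // join() makes the writes visible to this thread
    (resA.get, resB.get)
  }
}
```

One limitation of this sketch: if a or b throws, the exception stays in the worker thread and the final .get fails with a NoSuchElementException.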
EDIT: Fixed a bug pointed out by Andrey Tyukin
One way to solve your problem is to use Scala Futures (see the documentation, the tutorial, and the useful Klang blog).
You'll typically need some combination of these imports:
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.{Failure, Success}
import scala.concurrent.duration._
An asynchronous example:
def parallelAsync[A,B](a: => A, b: => B): Future[(A,B)] = {
  // as per Andrey Tyukin's comments, this line runs
  // the two futures sequentially and we do not get
  // any benefit from it. I will leave this line here
  // so others will not fall in my trap
  //for {i <- Future(a); j <- Future(b) } yield (i,j)
  Future(a) zip Future(b)
}

parallelAsync(1 + 2, "a" + "b").onComplete {
  case Success(x) => println(x)
  case Failure(e) => e.printStackTrace()
}
If you must block until both are complete, you can use this:
def parallelSync[A,B](a: => A, b: => B): (A,B) = {
  // see comment above
  //val f = for { i <- Future(a); j <- Future(b) } yield (i,j)
  val tuple = Future(a) zip Future(b)
  Await.result(tuple, 5.seconds)
}

println(parallelSync(3 + 4, "c" + "d"))
When running these little examples, don't forget to sleep a little at the end so the program doesn't exit before the results come back.
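The difference between the commented-out for comprehension and the zip version can be made visible with a small timing sketch. Everything here (ZipVsFor, slow, timed, the 200 ms sleep) is made up for the illustration. In sequential() the second Future is only created once the first has completed. In concurrent() both are created up front, so they can overlap if the pool has more than one thread.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ZipVsFor {
  // Simulates a slow computation.
  def slow[T](value: T): T = { Thread.sleep(200); value }

  // Future(slow("a")) is only constructed inside the flatMap, after the first completes.
  def sequential(): Future[(Int, String)] =
    for { i <- Future(slow(1)); j <- Future(slow("a")) } yield (i, j)

  // Both futures are already running before they are combined.
  def concurrent(): Future[(Int, String)] =
    Future(slow(1)) zip Future(slow("a"))

  // Returns the result together with the elapsed wall-clock time in milliseconds.
  def timed[T](f: => Future[T]): (T, Long) = {
    val start = System.nanoTime()
    val result = Await.result(f, 5.seconds)
    (result, (System.nanoTime() - start) / 1000000L)
  }
}
```

Calling ZipVsFor.timed(ZipVsFor.sequential()) should take roughly two sleeps, while the concurrent variant will typically take about one on a multi-core machine.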

How to ensure garbage collection of unused accumulators?

I have run into a problem where accumulators in Spark cannot be garbage collected.
def newIteration (lastParams: Accumulable[Params, (Int, Int, Int)], lastChosens: RDD[Document], i: Int): Params = {
  if (i == maxIteration)
    return lastParams.value

  val size1: Int = 100
  val size2: Int = 1000

  // each iteration generates a new accumulator
  val params = sc.accumulable(Params(size1, size2))

  // there is a map operation here
  // if I only use lastParams, the result is not updated
  // but params can solve this problem
  val chosen = lastChosens.map {
    case(Document(docID, content)) => {
      lastParams += (docID, content, -1)
      val newContent = lastParams.localValue.update(docID, content)
      lastParams += (docID, newContent, 1)
      params += (docID, newContent, 1)
      Document(docID, newContent)
    }
  }
  return newIteration(params, chosen, i + 1)
}
The problem is that the memory it allocates keeps growing until it hits the memory limit. It seems that lastParams is never garbage collected. The RDD and Broadcast classes have an unpersist() method, but I cannot find anything like it for accumulators in the documentation.
Why can't an Accumulable be garbage collected automatically, or is there a better solution?
UPDATE (April 22nd, 2016): SPARK-3885 Provide mechanism to remove accumulators once they are no longer used is now resolved.
There's ongoing work to add support for automatically garbage-collecting accumulators once they are no longer referenced. See SPARK-3885 for tracking progress on this feature. Spark PR #4021, currently under review, is a patch for this feature. I expect this to be included in Spark 1.3.0.

Synchronizing on function parameter for multithreaded memoization

My core question is: how can I implement synchronization in a method on the combination of the object instance and the method parameter?
Here are the details of my situation. I'm using the following code to implement memoization, adapted from this answer:
/**
 * Memoizes a unary function
 * @param f the function to memoize
 * @tparam T the argument type
 * @tparam R the result type
 */
class Memoized[-T, +R](f: T => R) extends (T => R) {
  import scala.collection.mutable
  private[this] val cache = mutable.Map.empty[T, R]

  def apply(x: T): R = cache.getOrElse(x, {
    val y = f(x)
    cache += ((x, y))
    y
  })
}
In my project, I'm memoizing Futures to deduplicate asynchronous API calls. This worked fine when using for...yield to map over the resulting futures, created with the standard ExecutionContext. Then I upgraded to Scala Async for nicer handling of these futures, and realized that the multithreading that library uses allowed multiple threads to enter apply, defeating memoization: the async blocks all executed in parallel, entering the "orElse" thunk before cache could be updated with a new Future.
To work around this, I put the main apply function in a this.synchronized block:
def apply(x: T): R = this.synchronized {
  cache.getOrElse(x, {
    val y = f(x)
    cache += ((x, y))
    y
  })
}
This restored the memoized behavior. The drawback is that this will block calls with different params, at least until the Future is created. I'm wondering if there is a way to set up finer grained synchronization on the combination of the Memoized instance and the value of the x parameter to apply. That way, only calls that would be deduplicated will be blocked.
As a side note, I'm not sure this is truly performance critical, because the synchronized block will release once the Future is created and returned (I think?). But if there are any concerns with this that I'm not thinking of, I would also like to know.
Akka actors combined with futures provide a powerful way to wrap over mutable state without blocking. Here is a simple example of how to use an Actor for memoization:
import akka.actor.{Actor, ActorSystem, Props}
import akka.util.Timeout
import akka.pattern.ask
import scala.concurrent._
import scala.concurrent.duration._

class Memoize(system: ActorSystem) {
  // The actor handles one message at a time, so the cache needs no locking.
  class CacheActor(f: Any => Future[Any]) extends Actor {
    private[this] val cache = scala.collection.mutable.Map.empty[Any, Future[Any]]

    def receive = {
      case x => sender ! cache.getOrElseUpdate(x, f(x))
    }
  }

  def apply[K, V](f: K => Future[V]): K => Future[V] = {
    val fCast = f.asInstanceOf[Any => Future[Any]]
    val actorRef = system.actorOf(Props(new CacheActor(fCast)))
    implicit val timeout = Timeout(5.seconds)
    import system.dispatcher
    x => actorRef.ask(x).asInstanceOf[Future[Future[V]]].flatMap(identity)
  }
}
We can use it like:
val system = ActorSystem()
val memoize = new Memoize(system)
val f = memoize { x: Int =>
  println("Computing for " + x)
  scala.concurrent.Future.successful {
    x + 1
  }
}
import system.dispatcher
f(5).foreach(println)
f(5).foreach(println)
And "Computing for 5" will only print a single time, but "6" will print twice.
There are some scary looking asInstanceOf calls, but it is perfectly type-safe.
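If pulling in Akka is not an option, a different technique (not the one used in this answer) is to let java.util.concurrent.ConcurrentHashMap do the per-key coordination. The sketch below is a rough illustration under that assumption; ConcurrentMemoized is a hypothetical name. computeIfAbsent runs the function at most once per key, and callers asking for unrelated keys are generally not blocked, although keys that share a hash bin may briefly wait for each other.

```scala
import java.util.concurrent.ConcurrentHashMap
import java.util.function.{Function => JFunction}

class ConcurrentMemoized[T, R](f: T => R) extends (T => R) {
  private[this] val cache = new ConcurrentHashMap[T, R]()

  // Adapter from the Scala function to the Java functional interface.
  private[this] val compute: JFunction[T, R] = new JFunction[T, R] {
    def apply(x: T): R = f(x)
  }

  // Atomic per key: concurrent callers with the same x get the same cached value.
  def apply(x: T): R = cache.computeIfAbsent(x, compute)
}
```

When R is a Future, f only has to create the Future, so the time spent inside computeIfAbsent stays short. Two caveats: f must not return null, and f must not call back into the same memoized instance.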

Performance difference in toString.map and toString.toArray.map

While coding Euler problems, I ran across what I think is bizarre:
The toString.map method is slower than toString.toArray.map.
Here's an example:
def main(args: Array[String]): Unit = {
  def toDigit(num : Int) = num.toString.map(_ - 48) //2137 ms
  def toDigitFast(num : Int) = num.toString.toArray.map(_ - 48) //592 ms

  val startTime = System.currentTimeMillis;

  (1 to 1200000).map(toDigit)

  println(System.currentTimeMillis - startTime)
}
Shouldn't the map method on String fall back to a map over the array? Why is there such a noticeable difference? (Note that increasing the number even causes a stack overflow in the non-array case.)
Could be because toString.map uses the WrappedString implicit, while toString.toArray.map uses the WrappedArray implicit to resolve map.
Let's see map, as defined in TraversableLike:
def map[B, That](f: A => B)(implicit bf: CanBuildFrom[Repr, B, That]): That = {
  val b = bf(repr)
  for (x <- this) b += f(x)
  b.result
}
WrappedString uses a StringBuilder as builder:
def +=(x: Char): this.type = { append(x); this }

def append(x: Any): StringBuilder = {
  underlying append String.valueOf(x)
  this
}
The String.valueOf call for Any uses Java Object.toString on the Char instances, possibly getting boxed first. These extra ops might be the cause of speed difference, versus the supposedly shorter code paths of the Array builder.
This is a guess though, would have to measure.
After revising, the general point still stands, but I referred to the wrong implicits, since the toDigit methods return an Int sequence (or similar), not a translated string as I misread.
toDigit uses LowPriorityImplicits.fallbackStringCanBuildFrom[T]: CanBuildFrom[String, T, immutable.IndexedSeq[T]], with T = Int, which just defers to a general IndexedSeq builder.
toDigitFast uses a direct Array implicit of type CanBuildFrom[Array[_], T, Array[T]], which is unarguably faster.
Passing the following CBF for toDigit explicitly makes the two methods on par:
object FastStringToArrayBuild {
  def canBuildFrom[T : ClassManifest] = new CanBuildFrom[String, T, Array[T]] {
    private def newBuilder = scala.collection.mutable.ArrayBuilder.make()
    def apply(from: String) = newBuilder
    def apply() = newBuilder
  }
}
You're being fooled by running out of memory. The toDigit version does create more intermediate objects, but if you have plenty of memory then the GC won't be heavily impacted (and it'll all run faster). For example, if instead of creating 1.2 million numbers, I create 12k 100x in a row, I get approximately equal times for the two methods. If I create 1.2k 5-digit numbers 1000x in a row, I find that toDigit is about 5% faster.
Given that the toDigit method produces an immutable collection, which is better when all else is equal since it is easier to reason about, and given that all else is equal for all but highly demanding tasks, I think the library is as it should be.
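A rough sketch of the "smaller batches, repeated" measurement described above. It assumes the two variants are num.toString.map(_ - 48) and num.toString.toArray.map(_ - 48); the names DigitTiming and millis are made up for the illustration. This is a naive timer with no JIT warm-up, so treat the numbers as indicative at best.

```scala
object DigitTiming {
  // Assumed forms of the two variants under discussion.
  def toDigit(num: Int): IndexedSeq[Int] = num.toString.map(_ - 48)
  def toDigitFast(num: Int): Array[Int] = num.toString.toArray.map(_ - 48)

  // Runs body `rounds` times and returns the total elapsed milliseconds.
  def millis(rounds: Int)(body: => Unit): Long = {
    val start = System.nanoTime()
    var r = 0
    while (r < rounds) { body; r += 1 }
    (System.nanoTime() - start) / 1000000L
  }

  def main(args: Array[String]): Unit = {
    // 12k numbers, 100 times in a row, instead of 1.2 million at once,
    // so that the intermediate results can be collected between rounds.
    val slow = millis(100) { (1 to 12000).map(toDigit); () }
    val fast = millis(100) { (1 to 12000).map(toDigitFast); () }
    println(s"toDigit: $slow ms, toDigitFast: $fast ms")
  }
}
```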
When trying to improve performance, of course one needs to keep all sorts of tricks in mind; one of these is that arrays have better memory characteristics for collections of known length than do the fancy collections in the Scala library. Also, one needs to know that map isn't the fastest way to get things done; if you really wanted this to be fast you should
final def toDigitReallyFast(num: Int, accum: Long = 0L, iter: Int = 0): Array[Byte] = {
  if (num==0) {
    val ans = new Array[Byte](math.max(1,iter))
    var i = 0
    var ac = accum
    while (i < ans.length) {
      ans(ans.length-i-1) = (ac & 0xF).toByte
      ac >>= 4
      i += 1
    }
    ans
  }
  else {
    val next = num/10
    toDigitReallyFast(next, (accum << 4) | (num-10*next), iter+1)
  }
}
which on my machine is at least 4x faster than either of the others. And you can get almost 3x faster yet again if you leave everything in a Long and pack the results in an array instead of using 1 to N:
final def toDigitExtremelyFast(num: Int, accum: Long = 0L, iter: Int = 0): Long = {
  if (num==0) accum | (iter.toLong << 48)
  else {
    val next = num/10
    toDigitExtremelyFast(next, accum | ((num-10*next).toLong<<(4*iter)), iter+1)
  }
}

// loop, instead of 1 to N map, for the 1.2k number case
var i = 10000
val a = new Array[Long](1201)
while (i<=11200) {
  a(i-10000) = toDigitExtremelyFast(i)
  i += 1
}
As with many things, performance tuning is highly dependent on exactly what you want to do. In contrast, library design has to balance many different concerns. I do think it's worth noticing where the library is sub-optimal with respect to performance, but this isn't really one of those cases IMO; the flexibility is worth it for the common use cases.
