Means for performing background operations in Scala

I'd like to run a time-consuming task in the background. I need to start the computation in a different thread, be able to check whether it has completed (or failed), and be able to abort it when it becomes unnecessary. When the computation ends, it should call a synchronized callback function to store the computed value.
This could be written as a wrapper over the Thread class, but I suppose this basic functionality is already implemented in some Scala library. I searched but found only Akka, which is too much for my simple task. scala.concurrent.ExecutionContext has a useful execute method, but it returns no object for checking the status of the computation or aborting it on demand.
Which library already provides the functionality described?
I've checked scala.concurrent.Future. It lacks the ability to abort a computation, which is crucial. I use the following strategy: compute some expensive function in the background and provide a reasonable default in the meantime. If the arguments to the function change, I drop the original computation and start a new one. I can't see how to rewrite this strategy in terms of Future.flatMap.

I'll give a demonstration of how to use futures with Twitter's implementation, since you asked for cancellation:
import com.twitter.util.{ Await, Future, FuturePool }
def computeFast(i: Int) = { Thread.sleep(1000); i + 1 }
def computeSlow(i: Int) = { Thread.sleep(1000000); i + 1 }
val fastComputation = FuturePool.unboundedPool(computeFast(1))
val slowComputation = FuturePool.unboundedPool(computeSlow(1))
Now you can poll for a result:
scala> fastComputation.poll
res0: Option[com.twitter.util.Try[Int]] = Some(Return(2))
scala> slowComputation.poll
res1: Option[com.twitter.util.Try[Int]] = None
Or set callbacks:
fastComputation.onSuccess(println)
slowComputation.onFailure(println)
Most of the time it's better to use map and flatMap to describe how to compose computations, though.
Cancellation is a little more complicated (this is just a demo—you'll want to provide your own cancellation logic):
import com.twitter.util.Promise
def cancellableComputation(i: Int): Future[Int] = {
  val p = Promise[Int]()
  p.setInterruptHandler {
    case t =>
      println("Cancelling the computation")
      p.setException(t)
  }
  FuturePool.unboundedPool(computeSlow(i)).onSuccess(p.setValue)
  p
}
And then:
scala> val myFuture = cancellableComputation(10)
myFuture: com.twitter.util.Future[Int] = Promise#129588027(state=Interruptible(List(),<function1>))
scala> myFuture.poll
res4: Option[com.twitter.util.Try[Int]] = None
scala> myFuture.raise(new Exception("Stop this thing"))
Cancelling the computation
scala> myFuture.poll
res6: Option[com.twitter.util.Try[Int]] = Some(Throw(java.lang.Exception: Stop this thing))
You could probably do something similar with the standard library's futures.
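For comparison, here is a minimal sketch of the same idea using only the standard library, assuming Scala 2.12 or later. The Cancellable class and its cancel method are hypothetical names, not a library API: the computation runs on a dedicated thread, and cancelling fails the promise and interrupts that thread. Interruption only stops code that responds to it (such as Thread.sleep or blocking IO), so treat this as a starting point rather than a definitive implementation.

```scala
import java.util.concurrent.CancellationException
import scala.concurrent.{Future, Promise}
import scala.util.{Failure, Success, Try}

// Hypothetical helper: a standard-library Future paired with a way to cancel it.
final class Cancellable[A] private (val future: Future[A], cancelFn: () => Boolean) {
  // Returns true if this call cancelled the computation,
  // false if it had already completed or been cancelled.
  def cancel(): Boolean = cancelFn()
}

object Cancellable {
  def apply[A](body: => A): Cancellable[A] = {
    val promise = Promise[A]()
    val worker = new Thread(new Runnable {
      def run(): Unit = {
        // Catch everything, including InterruptedException, so the thread ends quietly.
        val result: Try[A] =
          try Success(body)
          catch { case t: Throwable => Failure(t) }
        // tryComplete is a no-op if cancel() already failed the promise.
        promise.tryComplete(result)
        ()
      }
    })
    worker.setDaemon(true)
    worker.start()
    new Cancellable(promise.future, () => {
      val cancelled = promise.tryFailure(new CancellationException("cancelled"))
      if (cancelled) worker.interrupt()
      cancelled
    })
  }
}
```

With this sketch, val c = Cancellable { computeSlow(1) } starts the work, c.future.value polls it the way poll does above, and c.cancel() drops it when the arguments change and a new computation should replace it.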

Related

Scala - Executing every element until they all have finished

I cannot figure out why my function invokeAll does not give out the correct output/work properly. Any solutions? (No futures or parallel collections allowed and the return type needs to be Seq[Int])
def invokeAll(work: Seq[() => Int]): Seq[Int] = {
//this is what we should return as an output "return res.toSeq"
//res cannot be changed!
val res = new Array[Int](work.length)
var list = mutable.Set[Int]()
var n = res.size
val procedure = (0 until n).map(work =>
new Runnable {
def run {
//add the finished element/Int to list
list += work
}
}
)
val threads = procedure.map(new Thread(_))
threads.foreach(x => x.start())
threads.foreach (x => (x.join()))
res ++ list
//this should be the final output ("return res.toSeq")
return res.toSeq
}

OMG, I know a java programmer, when I see one :)
Don't do this, it's not java!
val results: Future[Seq[Int]] = Future.traverse(work)(task => Future(task()))
This is how you do it in scala.
This gives you a Future with the results of all executions, that will be satisfied when all work is finished. You can use .map, .flatMap etc. to access and transform those results. For example
val sumOfAll: Future[Int] = results.map(_.sum)
Or (in the worst case, when you just want to hand the result back to imperative code), you could block and wait on the future to get hold of the actual result (don't do this unless you are absolutely desperate): Await.result(results, 365.days)
If you want the results as an array, results.map(_.toArray) will do that ... but you really should not: arrays aren't really a good choice for the vast majority of use cases in scala. Just stick with Seq.
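Put together, the approach above might look like the following sketch. It assumes the global execution context and an arbitrary 10-second timeout; TraverseDemo and invokeAllAsync are hypothetical names chosen for illustration.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object TraverseDemo {
  // Runs every task on the global execution context; results keep the input order.
  def invokeAllAsync(work: Seq[() => Int]): Future[Seq[Int]] =
    Future.traverse(work)(task => Future(task()))

  // Blocking bridge for callers that must receive a plain Seq[Int].
  def invokeAll(work: Seq[() => Int]): Seq[Int] =
    Await.result(invokeAllAsync(work), 10.seconds)
}
```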

The main problem in your code is that you are using a fixed-size array and trying to add elements to it with the ++ (concatenation) operator: res ++ list. That expression produces a new collection, but you never store it anywhere.
If you remove the last line, return res.toSeq, you will see that res ++ list becomes the return value: your work.length array of zeros, res, followed by the elements of list. Read more about Scala collections: most of them are immutable, and using immutable data structures is good practice. In Scala, ++ does not accumulate values into its left operand, and arrays are fixed size.
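If futures really are off the table, one possible fix that keeps the fixed-size res array is to have each thread write its result into its own slot instead of concatenating. This is a sketch under the question's constraints, not the asker's final code; ThreadedInvoke is a hypothetical wrapper object.

```scala
object ThreadedInvoke {
  def invokeAll(work: Seq[() => Int]): Seq[Int] = {
    val res = new Array[Int](work.length)
    // One thread per task; thread i only ever writes res(i), so no locking is needed.
    val threads = work.zipWithIndex.map { case (task, i) =>
      new Thread(new Runnable {
        def run(): Unit = { res(i) = task() }
      })
    }
    threads.foreach(_.start())
    // join() makes each thread's write visible to the calling thread.
    threads.foreach(_.join())
    res.toSeq
  }
}
```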

How to concurrently create Future from Try?

The scala Future has a fromTry method which
Creates an already completed Future with the specified result or
exception.
The problem is that the newly created Future is already completed. Is it possible to have the evaluation of the Try done concurrently?
As an example, given a function that returns a Try:
def foo() : Try[Int] = {
Thread sleep 1000
Success(43)
}
How can the evaluation of foo be done concurrently?
A cursory approach would be to simply wrap a Future around the function:
val f : Future[Try[Int]] = Future { foo() }
But the desired return type would be a Future[Int]
val f : Future[Int] = ???
Effectively, how can the Try be flattened within a Future similar to the fromTry method?
There is a similar question, however the question & answer wait for the evaluation of the Try before constructing the completed Future.

Least ceremony is probably:
Future.unit.transform(_ => foo())
Special mention goes to @Dima for the suggestion of Future(foo().get), which is a bit shorter but might be slightly less readable.

Based on the comments,
Scala >= 2.13:
val f : Future[Int] = Future.delegate(Future.fromTry(foo()))
Scala < 2.13:
val f : Future[Int] = Future(foo()).flatMap(Future.fromTry)
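The variants suggested in these answers can be compared side by side in a small sketch, assuming Scala 2.12 or later and the global execution context. FlattenTryDemo, failing and the via* helper names are hypothetical; foo is shortened to sleep 100 ms.

```scala
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.{Failure, Success, Try}

object FlattenTryDemo {
  def foo(): Try[Int] = { Thread.sleep(100); Success(43) }

  // Hypothetical failing variant, to show that a Failure becomes a failed Future.
  def failing(): Try[Int] = Failure(new IllegalStateException("boom"))

  // Each helper evaluates f on the execution context and flattens the Try.
  def viaTransform(f: () => Try[Int]): Future[Int] = Future.unit.transform(_ => f())
  def viaFlatMap(f: () => Try[Int]): Future[Int] = Future(f()).flatMap(Future.fromTry)
  def viaGet(f: () => Try[Int]): Future[Int] = Future(f().get)
}
```

In all three cases the Try is evaluated asynchronously, and a Failure surfaces as a failed Future rather than as a Future holding a Failure value.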

Scala Stream tail laziness and synchronization

In one of his videos (concerning Scala's lazy evaluation, namely lazy keyword), Martin Odersky shows the following implementation of cons operation used to construct a Stream:
def cons[T](hd: T, tl: => Stream[T]) = new Stream[T] {
def head = hd
lazy val tail = tl
...
}
So tail operation is written concisely using lazy evaluation feature of the language.
But in reality (in Scala 2.11.7), the implementation of tail is a bit less elegant:
@volatile private[this] var tlVal: Stream[A] = _
@volatile private[this] var tlGen = tl _
def tailDefined: Boolean = tlGen eq null
override def tail: Stream[A] = {
  if (!tailDefined)
    synchronized {
      if (!tailDefined) {
        tlVal = tlGen()
        tlGen = null
      }
    }
  tlVal
}
Double-checked locking and two volatile fields: that's roughly how you would implement a thread-safe lazy computation in Java.
So the questions are:
Doesn't lazy keyword of Scala provide any 'evaluated maximum once' guarantee in a multi-threaded case?
Is the pattern used in real tail implementation an idiomatic way to do a thread-safe lazy evaluation in Scala?

Doesn't lazy keyword of Scala provide any 'evaluated maximum once'
guarantee in a multi-threaded case?
Yes, it does, as others have stated.
Is the pattern used in real tail implementation an idiomatic way to do
a thread-safe lazy evaluation in Scala?
Edit:
I think I have the actual answer as to why not lazy val. Stream has public-facing API methods such as hasDefiniteSize, inherited from TraversableOnce. In order to know whether a Stream has a finite size or not, we need a way of checking without materializing the underlying Stream tail. Since lazy val doesn't expose the underlying initialization bit, we can't do that.
This is backed by SI-1220
To strengthen this point, @Jasper-M points out that the new LazyList api in strawman (Scala 2.13 collection makeover) no longer has this issue, since the entire collection hierarchy has been reworked and there are no longer such concerns.
Performance related concerns
I would say "it depends" on which angle you're looking at this problem from. From a line-of-business (LOB) point of view, I'd say definitely go with lazy val for conciseness and clarity of implementation. But if you look at it from the point of view of a Scala collections library author, things start to look different. Think of it this way: you're creating a library which will potentially be used by many people and run on many machines across the world. This means that you should be thinking of the memory overhead of each structure, especially if you're creating such an essential data structure yourself.
I say this because when you use lazy val, by design you generate an additional Boolean field which flags whether the value has been initialized, and I am assuming this is what the library authors were aiming to avoid. The size of a Boolean on the JVM is of course VM dependent, but even a byte is something to consider, especially when people are generating large Streams of data. Again, this is definitely not something I would usually consider and is definitely a micro-optimization towards memory usage.
The reason I think performance is one of the key points here is SI-7266 which fixes a memory leak in Stream. Note how it is of importance to track the byte code to make sure no extra values are retained inside the generated class.
The difference in the implementation is that the definition of tail being initialized or not is a method implementation which checks the generator:
def tailDefined: Boolean = tlGen eq null
Instead of a field on the class.
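To make the difference concrete, here is a sketch of the same pattern extracted into a standalone cell. LazyCell is a hypothetical class, not part of the standard library, and it uses null.asInstanceOf[A] where the Stream source uses = _. The point is that isDefined can be asked without forcing the value, which a plain lazy val does not allow.

```scala
// Hypothetical class illustrating the pattern; not part of the standard library.
final class LazyCell[A](gen: () => A) {
  @volatile private var value: A = null.asInstanceOf[A]
  @volatile private var thunk: () => A = gen

  // Cheap check that does not force evaluation: the generator doubles as the flag.
  def isDefined: Boolean = thunk eq null

  def get: A = {
    if (!isDefined) synchronized {
      // Second check under the lock, so the generator runs at most once.
      if (!isDefined) {
        value = thunk()
        thunk = null // release the generator once the value is stored
      }
    }
    value
  }
}
```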

Scala lazy values are evaluated only once in multi-threaded cases. This is because the evaluation of lazy members is actually wrapped in a synchronized block in the generated code.
Let's take a look at a simple class,
class LazyTest {
lazy val x = 5
}
Now, let's compile this with scalac,
scalac -Xprint:all LazyTest.scala
This will result in,
package <empty> {
class LazyTest extends Object {
final <synthetic> lazy private[this] var x: Int = _;
@volatile private[this] var bitmap$0: Boolean = _;
private def x$lzycompute(): Int = {
LazyTest.this.synchronized(if (LazyTest.this.bitmap$0.unary_!())
{
LazyTest.this.x = (5: Int);
LazyTest.this.bitmap$0 = true
});
LazyTest.this.x
};
<stable> <accessor> lazy def x(): Int = if (LazyTest.this.bitmap$0.unary_!())
LazyTest.this.x$lzycompute()
else
LazyTest.this.x;
def <init>(): LazyTest = {
LazyTest.super.<init>();
()
}
}
}
You should be able to see... that the lazy evaluation is thread-safe. And you will also see some similarity to that "less elegant" implementation in Scala 2.11.7
You can also experiment with tests similar to the following,
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
case class A(i: Int) {
lazy val j = {
println("calculating j")
i + 1
}
}
def checkLazyInMultiThread(): Unit = {
val a = A(6)
val futuresList = Range(1, 20).toList.map(i => Future{
println(s"Future $i :: ${a.j}")
})
Future.sequence(futuresList).onComplete(_ => println("completed"))
}
checkLazyInMultiThread()
Now, the implementation in standard library avoids using lazy because they are able to provide a more efficient solution than this generic lazy translation.
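A variant of the experiment above that counts evaluations instead of printing them makes the "at most once" guarantee easy to assert. This is only a sketch: CountingHolder and LazyOnce are hypothetical names, and the 50 ms sleep is there just to widen the window in which several threads race for the first access.

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Hypothetical holder that counts how often its lazy initializer runs.
class CountingHolder {
  val evaluations = new AtomicInteger(0)
  lazy val j: Int = {
    evaluations.incrementAndGet()
    Thread.sleep(50) // widen the race window
    7
  }
}

object LazyOnce {
  // Reads the lazy val from 20 futures and returns the number of initializer runs.
  def run(): Int = {
    val holder = new CountingHolder
    val reads = (1 to 20).map(_ => Future(holder.j))
    Await.result(Future.sequence(reads), 10.seconds)
    holder.evaluations.get
  }
}
```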

You are correct, lazy vals use locking precisely to guard against double evaluation when accessed at the same time by two threads. Future developments, furthermore, will give the same guarantees without locking.
What is idiomatic, in my humble opinion, is a highly debatable subject when it comes to a language that, by design, allows for a wide range of different idioms to be adopted. In general, however, application code tends to be considered idiomatic when going more into the direction of pure functional programming, as it gives a series of interesting advantages in terms of ease of testing and reasoning that would make sense to give up only in case of serious concerns. This concern can be one of performance, which is why the current implementation of the Scala Collection API, while exposing in most cases a functional interface, makes heavy use (internally and in restricted scopes) of vars, while loops and established patterns from imperative programming (as the one you highlighted in your question).

Know when all jobs are completed in ExecutionContext.Implicits.global

I have a simple program that does blocking IO using futures in ExecutionContext.Implicits.global. I want to benchmark the performance of executing this IO 100 times, but my loop ends before all the futures complete. Is there a way to terminate only when all the tasks pushed to the ExecutionContext have completed?

In the absence of any code, I'll provide a simple version, with the caveat that this is an inaccurate way of measuring and not great for things that take a small amount of time. It is only really useful if the operation you're measuring takes a decent amount of time, so that most of the time is spent in the Future and not in the test code.
import scala.concurrent._
import scala.concurrent.duration._
import ExecutionContext.Implicits.global
val myList = (1 to 10)
val start = System.currentTimeMillis
val seqFutures = for (elem <- myList) yield {
Future {
Thread.sleep(5000) // blocking operation
}
}
val result = Future.sequence(seqFutures)
Await.result(result, 30.seconds)
val end = System.currentTimeMillis
println(s"${end - start} ms")
If you are going to have a large number of blocking IO calls, it can be worth configuring a separate execution context that has a larger thread pool just for the blocking IO to avoid the blocking IO consuming all your threads.
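A dedicated pool for blocking work could be set up roughly as follows. This is a sketch, assuming one thread per blocking task is acceptable for the benchmark; BlockingPoolDemo, timeBlockingTasks and the 30-second timeout are illustrative choices, not a recommendation for production sizing.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object BlockingPoolDemo {
  // Runs `tasks` blocking sleeps on a dedicated pool and returns elapsed milliseconds.
  def timeBlockingTasks(tasks: Int, sleepMillis: Long): Long = {
    val pool = Executors.newFixedThreadPool(tasks)
    // Futures created below run on this pool rather than on the global context.
    implicit val blockingEc: ExecutionContext = ExecutionContext.fromExecutorService(pool)
    try {
      val start = System.nanoTime()
      val futures = (1 to tasks).map(_ => Future { Thread.sleep(sleepMillis) })
      Await.result(Future.sequence(futures), 30.seconds)
      (System.nanoTime() - start) / 1000000
    } finally pool.shutdown()
  }
}
```

Because every task gets its own thread here, the elapsed time should stay close to a single sleep rather than growing with the number of tasks.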

How to use scala actors

How to solve this using scala Actors:
I have a program that finds out the frequencies of identifiers in files under a given path. The encoding assumed is UTF-8. I want to solve the same problem with scala actors.
//program to find frequencies of identifiers
import java.io._
import java.util.concurrent._
import java.util.concurrent.atomic._
object Main {
// visit all files in dir
def processDirectory(dir: File, visit: (File) => Unit) {
for (f <- dir.listFiles)
if (f.isDirectory) processDirectory(f, visit)
else visit(f)
}
//counters for all identifiers
val frequencies = new scala.collection.mutable.HashMap[String, Int]
// Finds all identifiers in a file and increments their counters
def process(f: File) {
val contents = scala.io.Source.fromFile(f, "UTF-8").mkString
val pattern = "[a-zA-Z_][0-9a-zA-Z_]*".r
for (m <- pattern.findAllIn(contents))
frequencies(m) = frequencies.getOrElse(m, 0) + 1
}
def main(args: Array[String]) { //Give path of a directory here
processDirectory(new File(args(0)), process _)
println("Ten most common identifiers:")
val sorted = frequencies.values.toBuffer.sortWith(_ > _)
for (i <- 0 until 10)
for ((k, v) <- frequencies)
if (v == sorted(i)) println(k + " " + v)
}
}
Also please explain the concept of scala actors. I am confused about scala actors.

Actors help with concurrent design. There's nothing concurrent about this. People who want parallelism, for performance, sometimes want to do exactly what you're doing: take some simple filesystem-munging thing, throw extra threads at it, and see if it's faster. However, this is a disk, and random access is extremely expensive, so you've nothing to gain from parallel processing, Actor-abusing or otherwise.
Scala's Actors come from Erlang. So please see if Concurrency Oriented Programming in Erlang (pdf), by one of Erlang's designers, helps you get an idea of what they're about. They're not really about throwing threads at tasks to make those tasks go faster.
Some resources to help with Scala's Actors:
Actors in Scala -- it's published at the end of the month, but PrePrint PDFs are available now.
Scala Actors: A Short Tutorial
