Asynchronous IO in Scala with futures - multithreading

Let's say I'm getting a (potentially big) list of images to download from some URLs. I'm using Scala, so what I would do is :
import scala.actors.Futures._
// Retrieve URLs from somewhere
val urls: List[String] = ...
// Download image (blocking operation)
val fimages: List[Future[...]] = urls.map (url => future { download url })
// Do something (display) when complete
fimages.foreach (_.foreach (display _))
I'm a bit new to Scala, so this still looks a little like magic to me :
Is this the right way to do it? Any alternatives if it is not?
If I have 100 images to download, will this create 100 threads at once, or will it use a thread pool?
Will the last instruction (display _) be executed on the main thread, and if not, how can I make sure it is?
Thanks for your advice!

Use Futures in Scala 2.10. They were joint work between the Scala team, the Akka team, and Twitter to reach a more standardized future API and implementation for use across frameworks. We just published a guide at: http://docs.scala-lang.org/overviews/core/futures.html
Beyond being completely non-blocking (by default, though we provide the ability to do managed blocking operations) and composable, Scala's 2.10 futures come with an implicit thread pool to execute your tasks on, as well as some utilities to manage time outs.
import scala.concurrent.{future, blocking, Future, Await, ExecutionContext.Implicits.global}
import scala.concurrent.duration._
// Retrieve URLs from somewhere
val urls: List[String] = ...
// Download image (blocking operation)
val imagesFuts: List[Future[...]] = urls.map {
url => future { blocking { download url } }
}
// Do something (display) when complete
val futImages: Future[List[...]] = Future.sequence(imagesFuts)
Await.result(futImages, 10 seconds).foreach(display)
Above, we first import a number of things:
future: API for creating a future.
blocking: API for managed blocking.
Future: Future companion object which contains a number of useful methods for collections of futures.
Await: singleton object used for blocking on a future (transferring its result to the current thread).
ExecutionContext.Implicits.global: the default global thread pool, a ForkJoin pool.
duration._: utilities for managing durations for time outs.
imagesFuts remains largely the same as what you originally did- the only difference here is that we use managed blocking- blocking. It notifies the thread pool that the block of code you pass to it contains long-running or blocking operations. This allows the pool to temporarily spawn new workers to make sure that it never happens that all of the workers are blocked. This is done to prevent starvation (locking up the thread pool) in blocking applications. Note that the thread pool also knows when the code in a managed blocking block is complete- so it will remove the spare worker thread at that point, which means that the pool will shrink back down to its expected size.
(If you want to absolutely prevent additional threads from ever being created, then you ought to use an AsyncIO library, such as Java's NIO library.)
Then we use the collection methods of the Future companion object to convert imagesFuts from List[Future[...]] to a Future[List[...]].
The Await object is how we can ensure that display is executed on the calling thread-- Await.result simply forces the current thread to wait until the future that it is passed is completed. (This uses managed blocking internally.)

val all = Future.traverse(urls){ url =>
val f = future(download url) /*(downloadContext)*/
f.onComplete(display)(displayContext)
f
}
Await.result(all, ...)
Use scala.concurrent.Future in 2.10, which is RC now.
which uses an implicit ExecutionContext
The new Future doc is explicit that onComplete (and foreach) may evaluate immediately if the value is available. The old actors Future does the same thing. Depending on what your requirement is for display, you can supply a suitable ExecutionContext (for instance, a single thread executor). If you just want the main thread to wait for loading to complete, traverse gives you a future to await on.

Yes, seems fine to me, but you may want to investigate more powerful twitter-util or Akka Future APIs (Scala 2.10 will have a new Future library in this style).
It uses a thread pool.
No, it won't. You need to use the standard mechanism of your GUI toolkit for this (SwingUtilities.invokeLater for Swing or Display.asyncExec for SWT). E.g.
fimages.foreach (_.foreach(im => SwingUtilities.invokeLater(new Runnable { display im })))

Related

How to wrap Web Worker response messages in futures?

Please consider a scala.js application which runs in the browser and consists of a main program and a web worker.
The main thread delegates long running operations to the web worker by passing messages that contain the names of methods and the parameters required to invoke them. The worker passes method return values back to the main thread in the form of response messages.
In simpler terms, this program abstracts web worker messaging so that code in the main thread can call methods in the worker thread in idiomatic and asynchronous Scala syntax.
Because web workers do not associate messages with their responses in any way, the abstraction relies on a registry, an intermediary object, that governs each cross context method call to associate the invocation with the result. This singleton could also bind callback functions but is there a way to accomplish this with futures instead of callbacks?
How can I build an abstraction over this registry that allows programmers to use it with the standard asynchronous programming structures in Scala: futures and promises?
How should I write this functionality so that scala programmers can interact with it in the canonical way? For example:
// long running method in the web worker
val f: Future[String] = Registry.ultimateQuestion(42) // async
f onSuccess { case q => println("The ultimate question is: " + q) }
I'm new to futures and promises, but it seems like they usually complete when some execution block terminates. In this case, receiving a response from the web worker signifies completion of the future. Is there a way to write a custom future that delegates its completion status to an external process? Is there another way to link the web worker response message to the status of the future?
Can/Should I extend the Future trait? Is this possible in Scala.js? Is there a concrete class that I should extend? Is there some other way to encapsulate these cross context web worker method calls in existing asynchronous Scala functionality?
Thank you for your consideration.
Hmm. Just spitballing here (I haven't used workers yet), but it seems like associating the request with the Future is fairly easy in the single-threaded JavaScript world you're working in.
Here's a hypothetical design. Say that each request/response to the worker is automatically wrapped in an Envelope; the Envelope contains a RequestId. So the send side looks something like (this is pseudo-code, but real-ish):
def sendRequest[R](msg:Message):Future[R] = {
val promise = Promise[R]
val id = nextRequestId()
val envelope = Envelope(id, msg)
register(id, promise)
sendToWorker(envelope)
promise.future
}
The worker processes msg, wraps the result in another Envelope, and the result gets handled back in the main thread with something like:
def handleResult(resultEnv:Envelope):Unit = {
val promise = findRegistered(resultEnv.id)
val result = resultEnv.msg
promise.success(result)
}
That needs some filling in, and some thought about what the types like R should be, but that sort of outline would probably work decently well. If this was the JVM you'd have to worry about all sorts of race conditions, but in the single-threaded JS world it probably can be as simple as using an autoincrementing integer for the request ID, and storing away the Promise...

Play Framework: thread-pool-executor vs fork-join-executor

Let's say we have a an action below in our controller. At each request performLogin will be called by many users.
def performLogin( ) = {
Async {
// API call to the datasource1
val id = databaseService1.getIdForUser();
// API call to another data source different from above
// This process depends on id returned by the call above
val user = databaseService2.getUserGivenId(id);
// Very CPU intensive task
val token = performProcess(user)
// Very CPU intensive calculations
val hash = encrypt(user)
Future.successful(hash)
}
}
I kind of know what the fork-join-executor does. Basically from the main thread which receives a request, it spans multiple worker threads which in tern will divide the work into few chunks. Eventually main thread will join those result and return from the function.
On the other hand, if I were to choose the thread-pool-executor, my understanding is that a thread is chosen from the thread pool, this selected thread will do the work, then go back to the thread pool to listen to more work to do. So no sub dividing of the task happening here.
In above code parallelism by fork-join executor is not possible in my opinion. Each call to the different methods/functions requires something from the previous step. If I were to choose the fork-join executor for the threading how would that benefit me? How would above code execution differ among fork-join vs thread-pool executor.
Thanks
This isn't parallel code, everything inside of your Async call will run in one thread. In fact, Play! never spawns new threads in response to requests - it's event-based, there is an underlying thread pool that handles whatever work needs to be done.
The executor handles scheduling the work from Akka actors and from most Futures (not those created with Future.successful or Future.failed). In this case, each request will be a separate task that the executor has to schedule onto a thread.
The fork-join-executor replaced the thread-pool-executor because it allows work stealing, which improves efficiency. There is no difference in what can be parallelized with the two executors.

Distributed\Parallel computing using app-engine (java api)

I want to use the master-slave (worker) paradigm, to solve a problem. I have read that opening new threads manually (for example using thread pool) is not available and I need to use queue, attached code example:
class MyDeferred implements DeferredTask {
#Override
public void run() {
// Do something interesting
}
};
MyDeferred task = new MyDeferred();
// Set instance variables etc as you wish
Queue queue = QueueFactory.getDefaultQueue();
queue.add(withPayload(task));
How can I get the result of the workers (which were added to the queue)?
I need this info, in-order to solve the bigger problem.
Actually you can use threads on GAE, but there are limitations. If you need long-running threads you can use background threads, but this requires you to use backend instances.
If you opt to use task queue, then keep in mind that tasks do not "return" to caller. To aggregate results you'll need to use datastore.
You will have to write the results into the datastore.
Just as a starting point to think about it, you might pass a JobId as a parameter to the tasks, have each task write an entity with the result and the JobId, and then later query the datstore for the given JobId to get all the results.

Difference between the TPL & async/await (Thread handling)

Trying to understanding the difference between the TPL & async/await when it comes to thread creation.
I believe the TPL (TaskFactory.StartNew) works similar to ThreadPool.QueueUserWorkItem in that it queues up work on a thread in the thread pool. That's of course unless you use TaskCreationOptions.LongRunning which creates a new thread.
I thought async/await would work similarly so essentially:
TPL:
Factory.StartNew( () => DoSomeAsyncWork() )
.ContinueWith(
(antecedent) => {
DoSomeWorkAfter();
},TaskScheduler.FromCurrentSynchronizationContext());
Async/Await:
await DoSomeAsyncWork();
DoSomeWorkAfter();
would be identical. From what I've been reading it seems like async/await only "sometimes" creates a new thread. So when does it create a new thread and when doesn't it create a new thread? If you were dealing with IO completion ports i can see it not having to create a new thread but otherwise I would think it would have to. I guess my understanding of FromCurrentSynchronizationContext always was a bit fuzzy also. I always throught it was, in essence, the UI thread.
I believe the TPL (TaskFactory.Startnew) works similar to ThreadPool.QueueUserWorkItem in that it queues up work on a thread in the thread pool.
Pretty much.
From what i've been reading it seems like async/await only "sometimes" creates a new thread.
Actually, it never does. If you want multithreading, you have to implement it yourself. There's a new Task.Run method that is just shorthand for Task.Factory.StartNew, and it's probably the most common way of starting a task on the thread pool.
If you were dealing with IO completion ports i can see it not having to create a new thread but otherwise i would think it would have to.
Bingo. So methods like Stream.ReadAsync will actually create a Task wrapper around an IOCP (if the Stream has an IOCP).
You can also create some non-I/O, non-CPU "tasks". A simple example is Task.Delay, which returns a task that completes after some time period.
The cool thing about async/await is that you can queue some work to the thread pool (e.g., Task.Run), do some I/O-bound operation (e.g., Stream.ReadAsync), and do some other operation (e.g., Task.Delay)... and they're all tasks! They can be awaited or used in combinations like Task.WhenAll.
Any method that returns Task can be awaited - it doesn't have to be an async method. So Task.Delay and I/O-bound operations just use TaskCompletionSource to create and complete a task - the only thing being done on the thread pool is the actual task completion when the event occurs (timeout, I/O completion, etc).
I guess my understanding of FromCurrentSynchronizationContext always was a bit fuzzy also. I always throught it was, in essence, the UI thread.
I wrote an article on SynchronizationContext. Most of the time, SynchronizationContext.Current:
is a UI context if the current thread is a UI thread.
is an ASP.NET request context if the current thread is servicing an ASP.NET request.
is a thread pool context otherwise.
Any thread can set its own SynchronizationContext, so there are exceptions to the rules above.
Note that the default Task awaiter will schedule the remainder of the async method on the current SynchronizationContext if it is not null; otherwise it goes on the current TaskScheduler. This isn't so important today, but in the near future it will be an important distinction.
I wrote my own async/await intro on my blog, and Stephen Toub recently posted an excellent async/await FAQ.
Regarding "concurrency" vs "multithreading", see this related SO question. I would say async enables concurrency, which may or may not be multithreaded. It's easy to use await Task.WhenAll or await Task.WhenAny to do concurrent processing, and unless you explicitly use the thread pool (e.g., Task.Run or ConfigureAwait(false)), then you can have multiple concurrent operations in progress at the same time (e.g., multiple I/O or other types like Delay) - and there is no thread needed for them. I use the term "single-threaded concurrency" for this kind of scenario, though in an ASP.NET host, you can actually end up with "zero-threaded concurrency". Which is pretty sweet.
async / await basically simplifies the ContinueWith methods ( Continuations in Continuation Passing Style )
It does not introduce concurrency - you still have to do that yourself ( or use the Async version of a framework method. )
So, the C# 5 version would be:
await Task.Run( () => DoSomeAsyncWork() );
DoSomeWorkAfter();

How can I execute multiple tasks in Scala?

I have 50,000 tasks and want to execute them with 10 threads.
In Java I should create Executers.threadPool(10) and pass runnable to is then wait to process all. Scala as I understand especially useful for that task, but I can't find solution in docs.
Scala 2.9.3 and later
THe simplest approach is to use the scala.concurrent.Future class and associated infrastructure. The scala.concurrent.future method asynchronously evaluates the block passed to it and immediately returns a Future[A] representing the asynchronous computation. Futures can be manipulated in a number of non-blocking ways, including mapping, flatMapping, filtering, recovering errors, etc.
For example, here's a sample that creates 10 tasks, where each tasks sleeps an arbitrary amount of time and then returns the square of the value passed to it.
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global
val tasks: Seq[Future[Int]] = for (i <- 1 to 10) yield future {
println("Executing task " + i)
Thread.sleep(i * 1000L)
i * i
}
val aggregated: Future[Seq[Int]] = Future.sequence(tasks)
val squares: Seq[Int] = Await.result(aggregated, 15.seconds)
println("Squares: " + squares)
In this example, we first create a sequence of individual asynchronous tasks that, when complete, provide an int. We then use Future.sequence to combine those async tasks in to a single async task -- swapping the position of the Future and the Seq in the type. Finally, we block the current thread for up to 15 seconds while waiting for the result. In the example, we use the global execution context, which is backed by a fork/join thread pool. For non-trivial examples, you probably would want to use an application specific ExecutionContext.
Generally, blocking should be avoided when at all possible. There are other combinators available on the Future class that can help program in an asynchronous style, including onSuccess, onFailure, and onComplete.
Also, consider investigating the Akka library, which provides actor-based concurrency for Scala and Java, and interoperates with scala.concurrent.
Scala 2.9.2 and prior
This simplest approach is to use Scala's Future class, which is a sub-component of the Actors framework. The scala.actors.Futures.future method creates a Future for the block passed to it. You can then use scala.actors.Futures.awaitAll to wait for all tasks to complete.
For example, here's a sample that creates 10 tasks, where each tasks sleeps an arbitrary amount of time and then returns the square of the value passed to it.
import scala.actors.Futures._
val tasks = for (i <- 1 to 10) yield future {
println("Executing task " + i)
Thread.sleep(i * 1000L)
i * i
}
val squares = awaitAll(20000L, tasks: _*)
println("Squares: " + squares)
You want to look at either the Scala actors library or Akka. Akka has cleaner syntax, but either will do the trick.
So it sounds like you need to create a pool of actors that know how to process your tasks. An Actor can basically be any class with a receive method - from the Akka tutorial (http://doc.akkasource.org/tutorial-chat-server-scala):
class MyActor extends Actor {
def receive = {
case "test" => println("received test")
case _ => println("received unknown message")
}}
val myActor = Actor.actorOf[MyActor]
myActor.start
You'll want to create a pool of actor instances and fire your tasks to them as messages. Here's a post on Akka actor pooling that may be helpful: http://vasilrem.com/blog/software-development/flexible-load-balancing-with-akka-in-scala/
In your case, one actor per task may be appropriate (actors are extremely lightweight compared to threads so you can have a LOT of them in a single VM), or you might need some more sophisticated load balancing between them.
EDIT:
Using the example actor above, sending it a message is as easy as this:
myActor ! "test"
The actor will then output "received test" to standard output.
Messages can be of any type, and when combined with Scala's pattern matching, you have a powerful pattern for building flexible concurrent applications.
In general Akka actors will "do the right thing" in terms of thread sharing, and for the OP's needs, I imagine the defaults are fine. But if you need to, you can set the dispatcher the actor should use to one of several types:
* Thread-based
* Event-based
* Work-stealing
* HawtDispatch-based event-driven
It's trivial to set a dispatcher for an actor:
class MyActor extends Actor {
self.dispatcher = Dispatchers.newExecutorBasedEventDrivenDispatcher("thread-pool-dispatch")
.withNewThreadPoolWithBoundedBlockingQueue(100)
.setCorePoolSize(10)
.setMaxPoolSize(10)
.setKeepAliveTimeInMillis(10000)
.build
}
See http://doc.akkasource.org/dispatchers-scala
In this way, you could limit the thread pool size, but again, the original use case could probably be satisfied with 50K Akka actor instances using default dispatchers and it would parallelize nicely.
This really only scratches the surface of what Akka can do. It brings a lot of what Erlang offers to the Scala language. Actors can monitor other actors and restart them, creating self-healing applications. Akka also provides Software Transactional Memory and many other features. It's arguably the "killer app" or "killer framework" for Scala.
If you want to "execute them with 10 threads", then use threads. Scala's actor model, which is usually what people is speaking of when they say Scala is good for concurrency, hides such details so you won't see them.
Using actors, or futures with all you have are simple computations, you'd just create 50000 of them and let them run. You might be faced with issues, but they are of a different nature.
Here's another answer similar to mpilquist's response but without deprecated API and including the thread settings via a custom ExecutionContext:
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Await, Future}
import scala.concurrent.duration._
val numJobs = 50000
var numThreads = 10
// customize the execution context to use the specified number of threads
implicit val ec = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(numThreads))
// define the tasks
val tasks = for (i <- 1 to numJobs) yield Future {
// do something more fancy here
i
}
// aggregate and wait for final result
val aggregated = Future.sequence(tasks)
val oneToNSum = Await.result(aggregated, 15.seconds).sum

Resources