How does Spark send closures to workers?

How does Spark send closures to workers? - apache-spark

When I write an RDD transformation, e.g.
val rdd = sc.parallelise(1 to 1000)
rdd.map(x => x * 3)
I understand that the closure (x => x * 3) which is simply a Function1 needs to be Serializable and I think I read somewhereEDIT: it's right there implied in the documentation: http://spark.apache.org/docs/latest/programming-guide.html#passing-functions-to-spark that it is "sent" to the workers for execution. (e.g. Akka sending an "executable piece of code" down the wire to workers to run)
Is that how it works?
Someone at a meetup I attended commented and said that it is not actually sending any serialized code, but since each worker get a "copy" of the jar anyway, it just needs a reference to which function to run or something like this (but I'm not sure I quote that person correctly)
I'm now at an utter confusion on how it actually works.
So my questions are
how are transformation closures sent to workers? Serialized via akka? or they are "already there" because spark sends the entire uber jar to each worker (sounds unlikely to me...)
if so, then how the rest of the jar is sent to the workers? is this is what the "cleanupClosure" doing? e.g. sending only the relevant bytecode to the worker instead of the entire uberjar? (e.g. only dependent code to the closure?)
so to summarise, does spark, at any point, syncs the jars in the --jars classpath with the workers somehow? or does it sends "just the right amount" of code to workers? and if it does send closures, are they being cached for the need of recalculation? or does it send the closure with the task every time a task is scheduled? sorry if this is silly questions but I really don't know.
Please add sources if you can for your answer, I couldn't find it explicit in the documentation, and I'm too wary to try and conclude it just by reading the code.

The closures are most certainly serialized at runtime. I have plenty of instances seen Closure Not Serializable exceptions at runtime - from pyspark and from scala. There is complex code called
From ClosureCleaner.scala
def clean(
closure: AnyRef,
checkSerializable: Boolean = true,
cleanTransitively: Boolean = true): Unit = {
clean(closure, checkSerializable, cleanTransitively, Map.empty)
}
that attempts to minify the code being serialized. The code is then sent across the wire - if it were serializable. Otherwise an exception will be thrown.
Here is another excerpt from ClosureCleaner to check the ability to serialize an incoming function:
private def ensureSerializable(func: AnyRef) {
try {
if (SparkEnv.get != null) {
SparkEnv.get.closureSerializer.newInstance().serialize(func)
}
} catch {
case ex: Exception => throw new SparkException("Task not serializable", ex)
}
}

Related

How to wrap Web Worker response messages in futures?

Please consider a scala.js application which runs in the browser and consists of a main program and a web worker.
The main thread delegates long running operations to the web worker by passing messages that contain the names of methods and the parameters required to invoke them. The worker passes method return values back to the main thread in the form of response messages.
In simpler terms, this program abstracts web worker messaging so that code in the main thread can call methods in the worker thread in idiomatic and asynchronous Scala syntax.
Because web workers do not associate messages with their responses in any way, the abstraction relies on a registry, an intermediary object, that governs each cross context method call to associate the invocation with the result. This singleton could also bind callback functions but is there a way to accomplish this with futures instead of callbacks?
How can I build an abstraction over this registry that allows programmers to use it with the standard asynchronous programming structures in Scala: futures and promises?
How should I write this functionality so that scala programmers can interact with it in the canonical way? For example:
// long running method in the web worker
val f: Future[String] = Registry.ultimateQuestion(42) // async
f onSuccess { case q => println("The ultimate question is: " + q) }
I'm new to futures and promises, but it seems like they usually complete when some execution block terminates. In this case, receiving a response from the web worker signifies completion of the future. Is there a way to write a custom future that delegates its completion status to an external process? Is there another way to link the web worker response message to the status of the future?
Can/Should I extend the Future trait? Is this possible in Scala.js? Is there a concrete class that I should extend? Is there some other way to encapsulate these cross context web worker method calls in existing asynchronous Scala functionality?
Thank you for your consideration.

Hmm. Just spitballing here (I haven't used workers yet), but it seems like associating the request with the Future is fairly easy in the single-threaded JavaScript world you're working in.
Here's a hypothetical design. Say that each request/response to the worker is automatically wrapped in an Envelope; the Envelope contains a RequestId. So the send side looks something like (this is pseudo-code, but real-ish):
def sendRequest[R](msg:Message):Future[R] = {
val promise = Promise[R]
val id = nextRequestId()
val envelope = Envelope(id, msg)
register(id, promise)
sendToWorker(envelope)
promise.future
}
The worker processes msg, wraps the result in another Envelope, and the result gets handled back in the main thread with something like:
def handleResult(resultEnv:Envelope):Unit = {
val promise = findRegistered(resultEnv.id)
val result = resultEnv.msg
promise.success(result)
}
That needs some filling in, and some thought about what the types like R should be, but that sort of outline would probably work decently well. If this was the JVM you'd have to worry about all sorts of race conditions, but in the single-threaded JS world it probably can be as simple as using an autoincrementing integer for the request ID, and storing away the Promise...

Companion object variable update while other actors reading

We are building REST using spray and akka. For this we need to read more than 10k files from disk (Mostly static, updates might come twice per day). Reading from disk for each request is giving performance hit, we put all required information in DataMap (Map object). Using akka scheduler updating DataMap for each 15min(Needs to be up-to date with disk data).
class SampleScheduler extends Actor with ActorLogging {
import context._
val tick = context.system.scheduler.schedule(1.second,15.minute, self,"mytick")
override def postStop() = tick.cancel()
override def receive: Receive = {
case "mytick" => {
println(s"Yes got the tick now ${new Date().toGMTString}")
Test.setDataMap()
}
}
}
object Test {
var DataMap:Map[String,List[String]]=Map()
def setDataMap()={
DataMap = //Read files from disk
}
}
object Main extends App {
//For each new request look into DataMap
if(Test.DataMap.isEmpty) {
//How to handle this, can i use like this
Thread.sleep(1000)
}
}
So when the new request comes, it searches required data from map and get information, process accordingly.
How to achieve below requirements with above said design.
Now for each request, creating one actor and reading above DataMap and starts processing. After started processing, if DataMap becomes empty and re-loaded, how to handle this?
If DataMap found empty, how to retry? Can i use Thread.sleep method?
Is storing DataMap and resetting it for each 15min in "Object" good practice?

First and most important of all, you must avoid (3). Storing state outside of actor is evil! Besides, storing state and maintaining it is what actors are good at.
And after you put state inside of actor and share it via messages, (2) will be obsolete. The request actor will ask state actor and if state actor is busy with re-reading files, it will answer after its job done.
Lastly, there are 2 different strategies you can follow to solve (1). State actor will process each messages in order so it can (a) reply message with last know state or (b) hold messages and reply with newly populated state if it thinks it should re-read files.

Distributed\Parallel computing using app-engine (java api)

I want to use the master-slave (worker) paradigm, to solve a problem. I have read that opening new threads manually (for example using thread pool) is not available and I need to use queue, attached code example:
class MyDeferred implements DeferredTask {
#Override
public void run() {
// Do something interesting
}
};
MyDeferred task = new MyDeferred();
// Set instance variables etc as you wish
Queue queue = QueueFactory.getDefaultQueue();
queue.add(withPayload(task));
How can I get the result of the workers (which were added to the queue)?
I need this info, in-order to solve the bigger problem.

Actually you can use threads on GAE, but there are limitations. If you need long-running threads you can use background threads, but this requires you to use backend instances.
If you opt to use task queue, then keep in mind that tasks do not "return" to caller. To aggregate results you'll need to use datastore.

You will have to write the results into the datastore.
Just as a starting point to think about it, you might pass a JobId as a parameter to the tasks, have each task write an entity with the result and the JobId, and then later query the datstore for the given JobId to get all the results.

Asynchronous IO in Scala with futures

Let's say I'm getting a (potentially big) list of images to download from some URLs. I'm using Scala, so what I would do is :
import scala.actors.Futures._
// Retrieve URLs from somewhere
val urls: List[String] = ...
// Download image (blocking operation)
val fimages: List[Future[...]] = urls.map (url => future { download url })
// Do something (display) when complete
fimages.foreach (_.foreach (display _))
I'm a bit new to Scala, so this still looks a little like magic to me :
Is this the right way to do it? Any alternatives if it is not?
If I have 100 images to download, will this create 100 threads at once, or will it use a thread pool?
Will the last instruction (display _) be executed on the main thread, and if not, how can I make sure it is?
Thanks for your advice!

Use Futures in Scala 2.10. They were joint work between the Scala team, the Akka team, and Twitter to reach a more standardized future API and implementation for use across frameworks. We just published a guide at: http://docs.scala-lang.org/overviews/core/futures.html
Beyond being completely non-blocking (by default, though we provide the ability to do managed blocking operations) and composable, Scala's 2.10 futures come with an implicit thread pool to execute your tasks on, as well as some utilities to manage time outs.
import scala.concurrent.{future, blocking, Future, Await, ExecutionContext.Implicits.global}
import scala.concurrent.duration._
// Retrieve URLs from somewhere
val urls: List[String] = ...
// Download image (blocking operation)
val imagesFuts: List[Future[...]] = urls.map {
url => future { blocking { download url } }
}
// Do something (display) when complete
val futImages: Future[List[...]] = Future.sequence(imagesFuts)
Await.result(futImages, 10 seconds).foreach(display)
Above, we first import a number of things:
future: API for creating a future.
blocking: API for managed blocking.
Future: Future companion object which contains a number of useful methods for collections of futures.
Await: singleton object used for blocking on a future (transferring its result to the current thread).
ExecutionContext.Implicits.global: the default global thread pool, a ForkJoin pool.
duration._: utilities for managing durations for time outs.
imagesFuts remains largely the same as what you originally did- the only difference here is that we use managed blocking- blocking. It notifies the thread pool that the block of code you pass to it contains long-running or blocking operations. This allows the pool to temporarily spawn new workers to make sure that it never happens that all of the workers are blocked. This is done to prevent starvation (locking up the thread pool) in blocking applications. Note that the thread pool also knows when the code in a managed blocking block is complete- so it will remove the spare worker thread at that point, which means that the pool will shrink back down to its expected size.
(If you want to absolutely prevent additional threads from ever being created, then you ought to use an AsyncIO library, such as Java's NIO library.)
Then we use the collection methods of the Future companion object to convert imagesFuts from List[Future[...]] to a Future[List[...]].
The Await object is how we can ensure that display is executed on the calling thread-- Await.result simply forces the current thread to wait until the future that it is passed is completed. (This uses managed blocking internally.)

val all = Future.traverse(urls){ url =>
val f = future(download url) /*(downloadContext)*/
f.onComplete(display)(displayContext)
f
}
Await.result(all, ...)
Use scala.concurrent.Future in 2.10, which is RC now.
which uses an implicit ExecutionContext
The new Future doc is explicit that onComplete (and foreach) may evaluate immediately if the value is available. The old actors Future does the same thing. Depending on what your requirement is for display, you can supply a suitable ExecutionContext (for instance, a single thread executor). If you just want the main thread to wait for loading to complete, traverse gives you a future to await on.

Yes, seems fine to me, but you may want to investigate more powerful twitter-util or Akka Future APIs (Scala 2.10 will have a new Future library in this style).
It uses a thread pool.
No, it won't. You need to use the standard mechanism of your GUI toolkit for this (SwingUtilities.invokeLater for Swing or Display.asyncExec for SWT). E.g.
fimages.foreach (_.foreach(im => SwingUtilities.invokeLater(new Runnable { display im })))

How can I execute multiple tasks in Scala?

I have 50,000 tasks and want to execute them with 10 threads.
In Java I should create Executers.threadPool(10) and pass runnable to is then wait to process all. Scala as I understand especially useful for that task, but I can't find solution in docs.

Scala 2.9.3 and later
THe simplest approach is to use the scala.concurrent.Future class and associated infrastructure. The scala.concurrent.future method asynchronously evaluates the block passed to it and immediately returns a Future[A] representing the asynchronous computation. Futures can be manipulated in a number of non-blocking ways, including mapping, flatMapping, filtering, recovering errors, etc.
For example, here's a sample that creates 10 tasks, where each tasks sleeps an arbitrary amount of time and then returns the square of the value passed to it.
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global
val tasks: Seq[Future[Int]] = for (i <- 1 to 10) yield future {
println("Executing task " + i)
Thread.sleep(i * 1000L)
i * i
}
val aggregated: Future[Seq[Int]] = Future.sequence(tasks)
val squares: Seq[Int] = Await.result(aggregated, 15.seconds)
println("Squares: " + squares)
In this example, we first create a sequence of individual asynchronous tasks that, when complete, provide an int. We then use Future.sequence to combine those async tasks in to a single async task -- swapping the position of the Future and the Seq in the type. Finally, we block the current thread for up to 15 seconds while waiting for the result. In the example, we use the global execution context, which is backed by a fork/join thread pool. For non-trivial examples, you probably would want to use an application specific ExecutionContext.
Generally, blocking should be avoided when at all possible. There are other combinators available on the Future class that can help program in an asynchronous style, including onSuccess, onFailure, and onComplete.
Also, consider investigating the Akka library, which provides actor-based concurrency for Scala and Java, and interoperates with scala.concurrent.
Scala 2.9.2 and prior
This simplest approach is to use Scala's Future class, which is a sub-component of the Actors framework. The scala.actors.Futures.future method creates a Future for the block passed to it. You can then use scala.actors.Futures.awaitAll to wait for all tasks to complete.
For example, here's a sample that creates 10 tasks, where each tasks sleeps an arbitrary amount of time and then returns the square of the value passed to it.
import scala.actors.Futures._
val tasks = for (i <- 1 to 10) yield future {
println("Executing task " + i)
Thread.sleep(i * 1000L)
i * i
}
val squares = awaitAll(20000L, tasks: _*)
println("Squares: " + squares)

You want to look at either the Scala actors library or Akka. Akka has cleaner syntax, but either will do the trick.
So it sounds like you need to create a pool of actors that know how to process your tasks. An Actor can basically be any class with a receive method - from the Akka tutorial (http://doc.akkasource.org/tutorial-chat-server-scala):
class MyActor extends Actor {
def receive = {
case "test" => println("received test")
case _ => println("received unknown message")
}}
val myActor = Actor.actorOf[MyActor]
myActor.start
You'll want to create a pool of actor instances and fire your tasks to them as messages. Here's a post on Akka actor pooling that may be helpful: http://vasilrem.com/blog/software-development/flexible-load-balancing-with-akka-in-scala/
In your case, one actor per task may be appropriate (actors are extremely lightweight compared to threads so you can have a LOT of them in a single VM), or you might need some more sophisticated load balancing between them.
EDIT:
Using the example actor above, sending it a message is as easy as this:
myActor ! "test"
The actor will then output "received test" to standard output.
Messages can be of any type, and when combined with Scala's pattern matching, you have a powerful pattern for building flexible concurrent applications.
In general Akka actors will "do the right thing" in terms of thread sharing, and for the OP's needs, I imagine the defaults are fine. But if you need to, you can set the dispatcher the actor should use to one of several types:
* Thread-based
* Event-based
* Work-stealing
* HawtDispatch-based event-driven
It's trivial to set a dispatcher for an actor:
class MyActor extends Actor {
self.dispatcher = Dispatchers.newExecutorBasedEventDrivenDispatcher("thread-pool-dispatch")
.withNewThreadPoolWithBoundedBlockingQueue(100)
.setCorePoolSize(10)
.setMaxPoolSize(10)
.setKeepAliveTimeInMillis(10000)
.build
}
See http://doc.akkasource.org/dispatchers-scala
In this way, you could limit the thread pool size, but again, the original use case could probably be satisfied with 50K Akka actor instances using default dispatchers and it would parallelize nicely.
This really only scratches the surface of what Akka can do. It brings a lot of what Erlang offers to the Scala language. Actors can monitor other actors and restart them, creating self-healing applications. Akka also provides Software Transactional Memory and many other features. It's arguably the "killer app" or "killer framework" for Scala.

If you want to "execute them with 10 threads", then use threads. Scala's actor model, which is usually what people is speaking of when they say Scala is good for concurrency, hides such details so you won't see them.
Using actors, or futures with all you have are simple computations, you'd just create 50000 of them and let them run. You might be faced with issues, but they are of a different nature.

Here's another answer similar to mpilquist's response but without deprecated API and including the thread settings via a custom ExecutionContext:
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Await, Future}
import scala.concurrent.duration._
val numJobs = 50000
var numThreads = 10
// customize the execution context to use the specified number of threads
implicit val ec = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(numThreads))
// define the tasks
val tasks = for (i <- 1 to numJobs) yield Future {
// do something more fancy here
i
}
// aggregate and wait for final result
val aggregated = Future.sequence(tasks)
val oneToNSum = Await.result(aggregated, 15.seconds).sum

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string