How to solve this using Scala Actors:
I have a program that finds the frequencies of identifiers in files under a given path; the assumed encoding is UTF-8. I want to solve the same problem with Scala Actors.
// Program to find frequencies of identifiers
import java.io._
import java.util.concurrent._
import java.util.concurrent.atomic._

object Main {
  // Visit all files in dir
  def processDirectory(dir: File, visit: (File) => Unit) {
    for (f <- dir.listFiles)
      if (f.isDirectory) processDirectory(f, visit)
      else visit(f)
  }

  // Counters for all identifiers
  val frequencies = new scala.collection.mutable.HashMap[String, Int]

  // Finds all identifiers in a file and increments their counters
  def process(f: File) {
    val contents = scala.io.Source.fromFile(f, "UTF-8").mkString
    val pattern = "[a-zA-Z_][0-9a-zA-Z_]*".r
    for (m <- pattern.findAllIn(contents))
      frequencies(m) = frequencies.getOrElse(m, 0) + 1
  }

  def main(args: Array[String]) { // Give the path of a directory here
    processDirectory(new File(args(0)), process _)
    println("Ten most common identifiers:")
    val sorted = frequencies.values.toBuffer.sortWith(_ > _)
    for (i <- 0 until 10)
      for ((k, v) <- frequencies)
        if (v == sorted(i)) println(k + " " + v)
  }
}
Also, please explain the concept of Scala Actors; I am confused about them.
Actors help with concurrent design. There's nothing concurrent about this problem. People who want parallelism for performance sometimes want to do exactly what you're doing: take some simple filesystem-munging task, throw extra threads at it, and see if it's faster. However, the bottleneck here is the disk, and random access is extremely expensive, so you have nothing to gain from parallel processing, Actor-abusing or otherwise.
Scala's Actors come from Erlang. So please see if Concurrency Oriented Programming in Erlang (pdf), by one of Erlang's designers, helps you get an idea of what they're about. They're not really about throwing threads at tasks to make those tasks go faster.
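To give a flavour of what the model looks like in code, here is a minimal sketch using the old scala.actors library (since deprecated in favour of Akka); the message type and names are purely illustrative, not taken from your program:
import scala.actors.Actor
import scala.actors.Actor._

// Actors communicate by sending immutable messages like this one.
case class Greet(name: String)

val greeter = actor {
  loop {
    react {                                   // handle one message at a time, no shared state
      case Greet(name) => println("Hello, " + name)
      case "stop"      => exit()
    }
  }
}

greeter ! Greet("world")   // '!' is an asynchronous send: the caller does not block
greeter ! "stop"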
Some resources to help with Scala's Actors:
Actors in Scala -- it's published at the end of the month, but PrePrint PDFs are available now.
Scala Actors: A Short Tutorial
Related
In one of his videos (concerning Scala's lazy evaluation, namely the lazy keyword), Martin Odersky shows the following implementation of the cons operation used to construct a Stream:
def cons[T](hd: T, tl: => Stream[T]) = new Stream[T] {
  def head = hd
  lazy val tail = tl
  ...
}
So the tail operation is written concisely using the language's lazy evaluation feature.
But in reality (in Scala 2.11.7), the implementation of tail is a bit less elegant:
@volatile private[this] var tlVal: Stream[A] = _
@volatile private[this] var tlGen = tl _

def tailDefined: Boolean = tlGen eq null

override def tail: Stream[A] = {
  if (!tailDefined)
    synchronized {
      if (!tailDefined) {
        tlVal = tlGen()
        tlGen = null
      }
    }

  tlVal
}
Double-checked locking and two volatile fields: that's roughly how you would implement a thread-safe lazy computation in Java.
So the questions are:
Doesn't the lazy keyword of Scala provide an 'evaluated at most once' guarantee in a multi-threaded case?
Is the pattern used in the real tail implementation an idiomatic way to do thread-safe lazy evaluation in Scala?
Doesn't the lazy keyword of Scala provide an 'evaluated at most once' guarantee in a multi-threaded case?
Yes, it does, as others have stated.
Is the pattern used in the real tail implementation an idiomatic way to do thread-safe lazy evaluation in Scala?
Edit:
I think I have the actual answer as to why not lazy val. Stream has public-facing API methods such as hasDefiniteSize, inherited from TraversableOnce. In order to know whether a Stream has a finite size or not, we need a way of checking without materializing the underlying Stream tail. Since lazy val doesn't actually expose the underlying initialization bit, we can't do that.
This is backed by SI-1220
To strengthen this point, @Jasper-M points out that the new LazyList API in strawman (the Scala 2.13 collections makeover) no longer has this issue, since the entire collection hierarchy has been reworked and such concerns no longer exist.
Performance related concerns
I would say "it depends" on which angle you're looking at this problem from. From a LOB (line-of-business) point of view, I'd say definitely go with lazy val for conciseness and clarity of implementation. But if you look at it from the point of view of a Scala collections library author, things start to look different. Think of it this way: you're creating a library which will potentially be used by many people and run on many machines across the world. This means that you should be thinking of the memory overhead of each structure, especially if you're creating such an essential data structure yourself.
I say this because when you use lazy val, by design you generate an additional Boolean field which flags whether the value has been initialized, and I am assuming this is what the library authors were aiming to avoid. The size of a Boolean on the JVM is of course VM-dependent, but even a byte is something to consider, especially when people are generating large Streams of data. Again, this is definitely not something I would usually consider, and it is definitely a micro-optimization towards memory usage.
The reason I think performance is one of the key points here is SI-7266, which fixes a memory leak in Stream. Note how important it is to track the bytecode to make sure no extra values are retained inside the generated class.
The difference in the implementation is that whether tail has been initialized is determined by a method which checks the generator:
def tailDefined: Boolean = tlGen eq null
instead of by a field on the class.
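To make that contrast concrete, here is a rough sketch (the class names are made up for illustration) of why a plain lazy val cannot offer a tailDefined-style check, while the hand-rolled version can:
// With a lazy val, the "has it been forced?" bit lives in a compiler-generated
// bitmap that user code cannot inspect without forcing the value.
class ConsWithLazyVal[A](hd: A, tl: => Stream[A]) {
  lazy val tail: Stream[A] = tl
  // def tailDefined: Boolean = ???   // no way to implement this without evaluating tl
}

// The hand-rolled version keeps the generator in a field, so "defined" is a null check.
class ConsHandRolled[A](hd: A, tl: => Stream[A]) {
  @volatile private[this] var tlGen: () => Stream[A] = () => tl
  @volatile private[this] var tlVal: Stream[A] = _

  def tailDefined: Boolean = tlGen eq null

  def tail: Stream[A] = {
    if (!tailDefined)
      synchronized {
        if (!tailDefined) {
          tlVal = tlGen()
          tlGen = null
        }
      }
    tlVal
  }
}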
Scala lazy values are evaluated only once in multi-threaded cases. This is because the evaluation of lazy members is actually wrapped in a synchronized block in the generated code.
Let's take a look at a simple class:
class LazyTest {
  lazy val x = 5
}
Now, let's compile this with scalac:
scalac -Xprint:all LazyTest.scala
This will result in,
package <empty> {
  class LazyTest extends Object {
    final <synthetic> lazy private[this] var x: Int = _;
    @volatile private[this] var bitmap$0: Boolean = _;
    private def x$lzycompute(): Int = {
      LazyTest.this.synchronized(if (LazyTest.this.bitmap$0.unary_!())
        {
          LazyTest.this.x = (5: Int);
          LazyTest.this.bitmap$0 = true
        });
      LazyTest.this.x
    };
    <stable> <accessor> lazy def x(): Int = if (LazyTest.this.bitmap$0.unary_!())
      LazyTest.this.x$lzycompute()
    else
      LazyTest.this.x;
    def <init>(): LazyTest = {
      LazyTest.super.<init>();
      ()
    }
  }
}
You should be able to see that the lazy evaluation is thread-safe, and you will also notice the similarity to that "less elegant" implementation in Scala 2.11.7.
You can also experiment with a test similar to the following:
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

case class A(i: Int) {
  lazy val j = {
    println("calculating j")
    i + 1
  }
}

def checkLazyInMultiThread(): Unit = {
  val a = A(6)
  val futuresList = Range(1, 20).toList.map(i => Future {
    println(s"Future $i :: ${a.j}")
  })
  Future.sequence(futuresList).onComplete(_ => println("completed"))
}

checkLazyInMultiThread()
Now, the implementation in the standard library avoids using lazy because the authors are able to provide a more efficient solution than this generic lazy translation.
You are correct, lazy vals use locking precisely to guard against double evaluation when accessed at the same time by two threads. Future developments, furthermore, will give the same guarantees without locking.
What is idiomatic is, in my humble opinion, a highly debatable subject when it comes to a language that, by design, allows for a wide range of different idioms. In general, however, application code tends to be considered idiomatic when it leans towards pure functional programming, as this gives a series of interesting advantages in terms of ease of testing and reasoning that it would only make sense to give up in case of serious concerns. Such a concern can be performance, which is why the current implementation of the Scala collections API, while exposing in most cases a functional interface, makes heavy use (internally and in restricted scopes) of vars, while loops and established patterns from imperative programming (like the one you highlighted in your question).
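As an illustration of the two styles (this is not code from the collections library itself), compare a functional formulation with the kind of imperative loop a library might use internally in a restricted scope:
// Idiomatic application code: no mutation, easy to reason about and test.
def sumFunctional(xs: List[Int]): Int = xs.foldLeft(0)(_ + _)

// The kind of loop a library might use internally: the vars never escape the method,
// so callers still see a pure function, but the hot path avoids extra allocations.
def sumImperative(xs: List[Int]): Int = {
  var acc = 0
  var rest = xs
  while (rest.nonEmpty) {
    acc += rest.head
    rest = rest.tail
  }
  acc
}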
I'm relatively new to Spark and might even be conceptually wrong somewhere while building up the scenario questions below, so feel free to stop reading and point out where you find I'm wrong. Thanks!
Imagine a piece of driver code like this:
val A = ... (some transformation)
val B = A.filter( fun1 )
val C = A.filter( fun2 )
...
B.someAction()... //do sth with B
...
C.someAction()... //do sth with C
Transformation RDDs B and C both depend on A, which might itself be a complex transformation. So will A be computed twice? I argue that it will, because Spark can't do anything across transformations, right? Spark is intelligent about optimizing one transformation execution at a time, because the bundled tasks in it can be thoroughly analyzed. For example, it's possible that some state change occurs after B.someAction but before C.someAction which affects the value of A, so the recomputation becomes necessary. For a further example, it could happen like this:
val arr = Array(...)
val A = sc.parallelize(...).flatMap(e => arr.map(_ * e)) //now A depends on some local array
... //B and C stays the same as above
B.someAction()
...
arr(i) = arr(i) + 10 //local state modified
...
C.someAction() //should A be recomputed? YES
This is easy to verify, so I did a quick experiment, and the result supports my reasoning.
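For reference, a quick experiment along these lines could be sketched with an accumulator counting how often A's closure runs (the numbers and names below are made up, not the asker's actual code):
// Count how many times A's flatMap closure executes; running a second action on a
// dependent RDD roughly doubles the count, showing that A is recomputed.
val evaluations = sc.longAccumulator("evaluations of A")
val arr = Array(1, 2, 3)

val A = sc.parallelize(1 to 10).flatMap { e =>
  evaluations.add(1)
  arr.map(_ * e)
}
val B = A.filter(_ % 2 == 0)
val C = A.filter(_ % 2 != 0)

B.count()
println(evaluations.value)   // counted once per input element
C.count()
println(evaluations.value)   // roughly doubled: A was computed again for C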
However, if B and C just independently depend on A and no other logic like the above exists, then a programmer or some tool could statically analyze the code and say: hey, it's feasible to add a cache on A so that it doesn't get recomputed unnecessarily! But Spark can do nothing about this, and sometimes it's even hard for a human to decide:
val A = ... (some transformation)
var B = A.filter( fun1 )
var C: ??? = null
var D: ??? = null

if (cond) {
  //now whether multiple dependencies exist is runtime determined
  C = A.filter( fun2 )
  D = A.filter( fun3 )
}

B.someAction()... //do sth with B

if (cond) {
  C.someAction()... //do sth with C
  D.someAction()... //do sth with D
}
If the condition is true, then it's tempting to cache A, but you'll never know until runtime. I know this is an artificial, crappy example, but these are already simplified models; things can get more complicated in practice, and the dependencies can be quite long, implicit, and spread across modules. So my question is: what's the general principle for dealing with this kind of problem? When should the common ancestors in the transformation dependency graph be cached (provided memory is not an issue)?
I'd like to hear something like "always follow functional programming paradigms when doing Spark" or "always cache them if you can". However, there's another situation in which I may not need to:
val A = ... (some transformation)
val B = A.filter( fun1 )
val C = A.filter( fun2 )
...
B.join(C).someAction()
Again, B and C both depend on A, but instead of calling two actions separately, they are joined to form one single transformation. This time I believe Spark is smart enough to compute A exactly once. I haven't found a proper way to run and examine this yet, but it should be obvious in the web UI DAG. Furthermore, I think Spark can even reduce the two filter operations into a single traversal of A to get B and C at the same time. Is this true?
There's a lot to unpack here.
Transformation RDDs B and C both depend on A which might itself be a complex transformation. So will A be computed twice ? I argue that it will because spark can't do anything that's inter-transformations, right ?
Yes, it will be computed twice, unless you call A.cache() or A.persist(), in which case it will be calculated only once.
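For example, a minimal sketch of that (the input path and the filter functions are placeholders, not from the question):
// Mark A for reuse before the first action; the second action then reads the
// cached partitions instead of recomputing the whole lineage of A.
val A = sc.textFile("/tmp/input")
  .flatMap(_.split("\\s+"))        // stand-in for "some transformation"
  .cache()                         // or .persist(StorageLevel.MEMORY_AND_DISK)

val B = A.filter(_.startsWith("a"))
val C = A.filter(_.startsWith("b"))

B.count()   // first action: A is computed and its partitions are cached
C.count()   // second action: A is served from the cache, not recomputed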
For example it's possible that some state change occurs after B.someAction but before C.someAction which may affect the value of A so the re-computation becomes necessary
No, this is not correct. A is immutable, therefore its state cannot change. B and C are also immutable RDDs that represent transformations of A.
sc.parallelize(...).flatMap(e => arr.map(_ * e)) //now A depends on some local array
No, it doesn't depend on the local array; it is an immutable RDD containing a copy of the elements of the (driver-)local array. If the array changes, A does not change. To obtain that behaviour you would have to use var A = sc.parallelize(...) and then set A again when the local array changes: A = sc.parallelize(...). In that scenario, A isn't 'updated', it is replaced by a new RDD representation of the local array, and as such any cached version of A is invalid.
The subsequent examples you have posted benefit from caching A. Again because RDDs are immutable.
In our use case of groupByKey(...): RDD[(K, Iterable[V])], there might be a case where, even for a single key (an extreme case, though), the associated Iterable[V] could result in OOM.
Is it possible to provide the above-mentioned 'groupByKeyWithRDD', i.e. something like groupByKey(...): RDD[(K, RDD[V])]?
And ideally, it would be great if the internal implementation of the RDD[V] were smart enough to only spill the data to disk upon a configured threshold. That way, we wouldn't sacrifice performance in the normal cases either.
Any suggestions/comments are welcomed. Thanks a lot!
Just a side note: we do understand the points mentioned here: https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html, but 'reduceByKey' and 'foldByKey' don't quite fit our needs right now; that is to say, we can't really avoid 'groupByKey'.
Assuming that the #(of-unique-keys) << #(key-value-pairs), which seems to be the case, there should be no need for RDD[(K, RDD[V])]. Instead you can transform into Map[(K, RDD[V])] by mapping unique keys with filter:
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

def splitByKey[K : ClassTag, V : ClassTag](rdd: RDD[(K, V)]): Map[K, RDD[V]] = {
  val keys = rdd.keys.distinct.collect.toSeq
  keys.map(key => (key -> rdd.filter { case (k, _) => k == key }.values)).toMap
}
It requires multiple scans over the data, so it is not exactly cheap, but it doesn't require shuffling, gives you much better control over caching, and is rather unlikely to cause OOM as long as the initial RDD fits into memory.
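For illustration, a hypothetical call to the helper above might look like this (the keys and values are made up):
// Each value of the resulting Map is a lazily-filtered view of the original RDD,
// so no single key's values ever have to fit into one in-memory Iterable.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4)))
val perKey: Map[String, org.apache.spark.rdd.RDD[Int]] = splitByKey(pairs)

perKey("a").collect()   // Array(1, 3), computed by scanning the original RDD
perKey("b").count()     // 2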
I'd like to do some time consuming task in background. So I need to start computation in a different thread, be able to check if it is completed (maybe failed) and be able to abort the computation when it becomes unnecessary. After computation is ended it should call synchronized callback function to store computed value.
It could be programmed as some wrapper over the Thread class, but I suppose this basic functionality is already implemented in some Scala library. I've tried to search but found only Akka, which is too much for my simple task. scala.concurrent.ExecutionContext has a useful execute method, but it returns no object with which to check the status of the computation and abort it on demand.
What library contains already described functionality?
I've checked scala.concurrent.Future. It lacks the ability to abort a computation, which is crucial for me. I use the following strategy: compute some expensive function in the background and provide a reasonable default in the meantime. If the arguments to the function change, I drop the original computation and start a new one. I could not figure out how to express this strategy in terms of Future.flatMap.
I'll give a demonstration of how to use futures with Twitter's implementation, since you asked for cancellation:
import com.twitter.util.{ Await, Future, FuturePool }
def computeFast(i: Int) = { Thread.sleep(1000); i + 1 }
def computeSlow(i: Int) = { Thread.sleep(1000000); i + 1 }
val fastComputation = FuturePool.unboundedPool(computeFast(1))
val slowComputation = FuturePool.unboundedPool(computeSlow(1))
Now you can poll for a result:
scala> fastComputation.poll
res0: Option[com.twitter.util.Try[Int]] = Some(Return(2))
scala> slowComputation.poll
res1: Option[com.twitter.util.Try[Int]] = None
Or set callbacks:
fastComputation.onSuccess(println)
slowComputation.onFailure(println)
Most of the time it's better to use map and flatMap to describe how to compose computations, though.
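To illustrate that point with the same definitions as above:
// Compose instead of polling: run computeFast, feed its result into another
// pooled computation, then transform the result, all without blocking.
val composed: Future[Int] =
  fastComputation
    .flatMap(a => FuturePool.unboundedPool(computeFast(a)))
    .map(_ * 2)

// Await.result(composed) would block for the value; fine in a demo or test,
// but production code would normally keep composing instead.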
Cancellation is a little more complicated (this is just a demo—you'll want to provide your own cancellation logic):
import com.twitter.util.Promise
def cancellableComputation(i: Int): Future[Int] = {
val p = Promise[Int]
p.setInterruptHandler {
case t =>
println("Cancelling the computation")
p.setException(t)
}
FuturePool.unboundedPool(computeSlow(i)).onSuccess(p.setValue)
p
}
And then:
scala> val myFuture = cancellableComputation(10)
myFuture: com.twitter.util.Future[Int] = Promise#129588027(state=Interruptible(List(),<function1>))
scala> myFuture.poll
res4: Option[com.twitter.util.Try[Int]] = None
scala> myFuture.raise(new Exception("Stop this thing"))
Cancelling the computation
scala> myFuture.poll
res6: Option[com.twitter.util.Try[Int]] = Some(Throw(java.lang.Exception: Stop this thing))
You could probably do something similar with the standard library's futures.
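For completeness, here is one rough way to approximate that with the standard library; the Cancellable class below is made up for illustration, and cancellation is co-operative (the task polls a flag) rather than a true interruption:
import java.util.concurrent.atomic.AtomicBoolean
import scala.concurrent.{ExecutionContext, Future, Promise}
import ExecutionContext.Implicits.global

// Pairs a standard Future with a flag the computation is expected to poll.
final class Cancellable[T](body: AtomicBoolean => T) {
  private val cancelled = new AtomicBoolean(false)
  private val promise = Promise[T]()

  promise.tryCompleteWith(Future(body(cancelled)))

  def future: Future[T] = promise.future

  def cancel(): Unit = {
    cancelled.set(true)                                        // ask the body to stop
    promise.tryFailure(new InterruptedException("cancelled"))  // fail the future promptly
  }
}

// Usage: the long computation checks the flag between chunks of work.
val task = new Cancellable[Int](stop => {
  var i = 0
  while (i < 1000000000 && !stop.get()) i += 1
  i
})
task.cancel()   // task.future fails; the loop exits at its next flag check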
This code comes from a paper called "Lazy v. Yield". It's about a way to decouple producers and consumers of streams of data. I understand the Haskell portion of the code, but the OCaml/F# eludes me. I don't understand this code, for the following reasons:
What kind of behavior can I expect from a function that takes as argument an exception and returns unit?
How does the consumer project into a specific exception? (what does that mean?)
What would be an example of a consumer?
module SimpleGenerators

type 'a gen = unit -> 'a
type producer = unit gen
type consumer = exn -> unit (* consumer will project into specific exception *)
type 'a transducer = 'a gen -> 'a gen

let yield_handler : (exn -> unit) ref =
  ref (fun _ -> failwith "yield handler is not set")

let iterate (gen : producer) (consumer : consumer) : unit =
  let oldh = !yield_handler in
  let rec newh x =
    try
      yield_handler := oldh;
      consumer x;
      yield_handler := newh
    with e -> yield_handler := newh; raise e
  in
  try
    yield_handler := newh;
    let r = gen () in
    yield_handler := oldh;
    r
  with e -> yield_handler := oldh; raise e
I'm not familiar with the paper, so others will probably be more enlightening. Here are some quick answers/guesses in the meantime.
A function of type exn -> unit is basically an exception handler.
Exceptions can contain data. They're quite similar to polymorphic variants that way--i.e., you can add a new exception whenever you want, and it can act as a data constructor.
It looks like the consumer is going to look for a particular exception(s) that give it the data it wants. Others it will just re-raise. So, it's only looking at a projection of the space of possible exceptions (I guess).
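Since most of this thread is Scala-centric, here is a small Scala analogue (not from the paper) of that idea: exception values are used as an open, extensible data type carrying the yielded data, and the consumer pattern-matches (projects) only on the constructors it knows about, re-raising the rest:
// An "exception constructor" carrying the yielded data; it mirrors OCaml's use of
// exceptions as data constructors rather than as control flow.
case class YieldedInt(i: Int) extends Exception

// The consumer: a handler that projects onto the cases it understands
// and re-raises everything else.
val printInts: Throwable => Unit = {
  case YieldedInt(i) => println("consumed: " + i)
  case other         => throw other
}

printInts(YieldedInt(42))   // the exception value is passed as data, not thrown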
I think the OCaml sample is using a few constructs and design patterns that you would not typically use in F#, so it is quite OCaml-specific. As Jeffrey says, OCaml programs often use exceptions for control flow (while in F# they are only used for exceptional situations).
Also, F# has a really powerful sequence expression mechanism that can be used quite nicely to separate producers of data from consumers of data. I did not read the paper in detail, so maybe they have something more complicated, but a simple example in F# could look like this:
// Generator: Produces infinite sequence of numbers from 'start'
// and prints the numbers as they are being generated (to show I/O behaviour)
let rec numbers start = seq {
  printfn "generating: %d" start
  yield start
  yield! numbers (start + 1) }
A simple consumer can be implemented using a for loop, but since the stream is infinite, we need to say how many elements to consume using Seq.take:
// Consumer: takes a sequence of numbers generated by the
// producer and consumes first 100 elements
let consumer nums =
  for n in nums |> Seq.take 100 do
    printfn "consuming: %d" n
When you run consumer (numbers 0) the code starts printing:
generating: 0
consuming: 0
generating: 1
consuming: 1
generating: 2
consuming: 2
So you can see that the effects of the producer and the consumer are interleaved. I think this is quite a simple & powerful mechanism, but maybe I'm missing the point of the paper and they have something even more interesting. If so, please let me know! Although I think the idiomatic F# solution will probably look quite similar to the above.