Need sth like "def groupByKeyWithRDD(partitioner: Partitioner): RDD[(K, RDD[V])] = ???" - apache-spark

In our use of groupByKey(...): RDD[(K, Iterable[V])], there might be a case where even for a single key (an extreme case, though) the associated Iterable[V] could result in an OOM.
Is it possible to provide the above 'groupByKeyWithRDD'?
And, ideally, it would be great if the internal impl of the RDD[V] is smart enough to only spill the data into disk upon a configured threshold. That way, we won't sacrifice the performance for the normal cases as well.
Any suggestions/comments are welcomed. Thanks a lot!
Just a side note: we do understand the points mentioned here: https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html, and the 'reduceByKey', 'foldByKey' don't quite fit our needs right now, that is to say, we couldn't really avoid 'groupByKey'.

Assuming that #(unique keys) << #(key-value pairs), which seems to be the case, there should be no need for RDD[(K, RDD[V])]. Instead you can transform the data into a Map[K, RDD[V]] by filtering once per unique key:
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD
def splitByKey[K : ClassTag, V: ClassTag](rdd: RDD[(K, V)]): Map[K, RDD[V]] = {
  val keys = rdd.keys.distinct.collect.toSeq
  keys.map(key => (key -> rdd.filter{case (k, _) => k == key}.values)).toMap
}
It requires multiple scans over the data (one per key), so it is not exactly cheap, but it doesn't require shuffling, gives you much better control over caching, and is rather unlikely to cause an OOM as long as the initial RDD fits into memory.


Scala Stream tail laziness and synchronization

In one of his videos (concerning Scala's lazy evaluation, namely lazy keyword), Martin Odersky shows the following implementation of cons operation used to construct a Stream:
def cons[T](hd: T, tl: => Stream[T]) = new Stream[T] {
def head = hd
lazy val tail = tl
...
}
So tail operation is written concisely using lazy evaluation feature of the language.
But in reality (in Scala 2.11.7), the implementation of tail is a bit less elegant:
@volatile private[this] var tlVal: Stream[A] = _
@volatile private[this] var tlGen = tl _
def tailDefined: Boolean = tlGen eq null
override def tail: Stream[A] = {
  if (!tailDefined)
    synchronized {
      if (!tailDefined) {
        tlVal = tlGen()
        tlGen = null
      }
    }
  tlVal
}
Double-checked locking and two volatile fields: that's roughly how you would implement a thread-safe lazy computation in Java.
So the questions are:
Doesn't lazy keyword of Scala provide any 'evaluated maximum once' guarantee in a multi-threaded case?
Is the pattern used in real tail implementation an idiomatic way to do a thread-safe lazy evaluation in Scala?
Doesn't lazy keyword of Scala provide any 'evaluated maximum once'
guarantee in a multi-threaded case?
Yes, it does, as others have stated.
Is the pattern used in real tail implementation an idiomatic way to do
a thread-safe lazy evaluation in Scala?
Edit:
I think I have the actual answer as to why not lazy val. Stream has public-facing API methods such as hasDefiniteSize, inherited from TraversableOnce. In order to know whether a Stream has a finite size or not, we need a way of checking without materializing the underlying Stream tail. Since lazy val doesn't actually expose the underlying initialization bit, we can't do that.
This is backed by SI-1220
To strengthen this point, @Jasper-M points out that the new LazyList API in strawman (the Scala 2.13 collections makeover) no longer has this issue, since the entire collection hierarchy has been reworked and there are no longer such concerns.
Performance related concerns
I would say "it depends" on which angle you're looking at this problem from. From a LOB point of view, I'd say definitely go with lazy val for conciseness and clarity of implementation. But if you look at it from the point of view of a Scala collections library author, things start to look different. Think of it this way: you're creating a library which will potentially be used by many people and run on many machines across the world. This means that you should be thinking of the memory overhead of each structure, especially if you're creating such an essential data structure yourself.
I say this because when you use lazy val, by design you generate an additional Boolean field which flags whether the value has been initialized, and I am assuming this is what the library authors were aiming to avoid. The size of a Boolean on the JVM is of course VM dependent, but even a byte is something to consider, especially when people are generating large Streams of data. Again, this is definitely not something I would usually consider, and it is definitely a micro-optimization towards memory usage.
The reason I think performance is one of the key points here is SI-7266 which fixes a memory leak in Stream. Note how it is of importance to track the byte code to make sure no extra values are retained inside the generated class.
The difference in the implementation is that the definition of tail being initialized or not is a method implementation which checks the generator:
def tailDefined: Boolean = tlGen eq null
Instead of a field on the class.
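To make the pattern concrete, here is a minimal sketch of a hand-rolled lazy cell that follows the same double-checked locking shape as the tail implementation quoted in the question. LazyCell, isDefined and LazyCellDemo are hypothetical names used only for illustration; they are not part of the standard library. The point it shows is that "has it been evaluated yet?" can be answered from the generator reference itself, without a separate flag field and without forcing the value, which a plain lazy val does not let you do.

```scala
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical helper for illustration only, not a standard library class.
final class LazyCell[A](init: () => A) {
  // The generator doubles as the "initialized" flag: null means evaluated.
  @volatile private[this] var gen: () => A = init
  @volatile private[this] var value: A = _

  // Inspect the evaluation state without forcing the value.
  def isDefined: Boolean = gen eq null

  def get: A = {
    if (!isDefined)
      synchronized {
        if (!isDefined) {
          value = gen()
          gen = null // drop the generator so it can be garbage collected
        }
      }
    value
  }
}

object LazyCellDemo {
  val evaluations = new AtomicInteger(0)
  val cell = new LazyCell[Int](() => { evaluations.incrementAndGet(); 42 })

  val definedBefore: Boolean = cell.isDefined // checked before any access
  val first: Int = cell.get                   // forces the computation
  val second: Int = cell.get                  // reuses the stored value
  val definedAfter: Boolean = cell.isDefined
}
```

With a lazy val the equivalent of isDefined is hidden in the compiler-generated bitmap field, so a method like tailDefined could not be written on top of it.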
Scala lazy values are evaluated only once in multi-threaded cases. This is because the evaluation of lazy members is actually wrapped in a synchronized block in the generated code.
Let's take a look at this simple class,
class LazyTest {
lazy val x = 5
}
Now, let's compile this with scalac,
scalac -Xprint:all LazyTest.scala
This will result in,
package <empty> {
class LazyTest extends Object {
final <synthetic> lazy private[this] var x: Int = _;
@volatile private[this] var bitmap$0: Boolean = _;
private def x$lzycompute(): Int = {
LazyTest.this.synchronized(if (LazyTest.this.bitmap$0.unary_!())
{
LazyTest.this.x = (5: Int);
LazyTest.this.bitmap$0 = true
});
LazyTest.this.x
};
<stable> <accessor> lazy def x(): Int = if (LazyTest.this.bitmap$0.unary_!())
LazyTest.this.x$lzycompute()
else
LazyTest.this.x;
def <init>(): LazyTest = {
LazyTest.super.<init>();
()
}
}
}
You should be able to see that the lazy evaluation is thread-safe: the computation runs inside a synchronized block guarded by the volatile bitmap$0 flag. You will also see some similarity to that "less elegant" implementation in Scala 2.11.7.
You can also experiment with tests similar to the following,
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
case class A(i: Int) {
lazy val j = {
println("calculating j")
i + 1
}
}
def checkLazyInMultiThread(): Unit = {
val a = A(6)
val futuresList = Range(1, 20).toList.map(i => Future{
println(s"Future $i :: ${a.j}")
})
Future.sequence(futuresList).onComplete(_ => println("completed"))
}
checkLazyInMultiThread()
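The println-based experiment above relies on reading the console output. A variant that counts evaluations instead might look like the sketch below; LazyOnceCheck, Holder and run are hypothetical names, and the 10-second timeout is an arbitrary choice. It reads the same lazy val from 20 futures and records how many times the initializer body actually ran.

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object LazyOnceCheck {
  // Counts how many times the lazy initializer body runs.
  val evaluations = new AtomicInteger(0)

  class Holder(i: Int) {
    lazy val j: Int = {
      evaluations.incrementAndGet()
      i + 1
    }
  }

  // Reads holder.j concurrently from 20 futures and returns what each one saw.
  def run(): Seq[Int] = {
    val holder = new Holder(6)
    val reads = (1 to 20).map(_ => Future(holder.j))
    Await.result(Future.sequence(reads), 10.seconds)
  }
}
```

After a call to LazyOnceCheck.run(), every future should have observed the same value and the counter should show a single evaluation.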
Now, the implementation in standard library avoids using lazy because they are able to provide a more efficient solution than this generic lazy translation.
You are correct, lazy vals use locking precisely to guard against double evaluation when accessed at the same time by two threads. Future developments, furthermore, will give the same guarantees without locking.
What is idiomatic, in my humble opinion, is a highly debatable subject when it comes to a language that, by design, allows for a wide range of different idioms to be adopted. In general, however, application code tends to be considered idiomatic when going more into the direction of pure functional programming, as it gives a series of interesting advantages in terms of ease of testing and reasoning that would make sense to give up only in case of serious concerns. This concern can be one of performance, which is why the current implementation of the Scala Collection API, while exposing in most cases a functional interface, makes heavy use (internally and in restricted scopes) of vars, while loops and established patterns from imperative programming (as the one you highlighted in your question).

Should the common ancestors on the transformation dependency graph be cached?

I'm relatively new to Spark and my reasoning may already go wrong somewhere while I build up the scenario, so feel free to skip ahead and point out wherever you find I'm conceptually wrong. Thanks!
Imagine a piece of driver code like this:
val A = ... (some transformation)
val B = A.filter( fun1 )
val C = A.filter( fun2 )
...
B.someAction()... //do sth with B
...
C.someAction()... //do sth with C
Transformation RDDs B and C both depend on A, which might itself be a complex transformation. So will A be computed twice? I argue that it will, because Spark can't do anything that's inter-transformation, right? Spark is intelligent about optimizing one transformation execution at a time because the bundled tasks in it can be thoroughly analyzed. For example, it's possible that some state change occurs after B.someAction but before C.someAction which may affect the value of A, so the re-computation becomes necessary. As a further example, it could happen like this:
val arr = Array(...)
val A = sc.parallelize(...).flatMap(e => arr.map(_ * e)) //now A depends on some local array
... //B and C stays the same as above
B.someAction()
...
arr(i) = arr(i) + 10 //local state modified
...
C.someAction() //should A be recomputed? YES
This is easy to verify so I did a quick experiment and the result supports my reasoning.
However, if B and C just independently depend on A and no other logic like the above exists, then a programmer or some tool could statically analyze the code and say "hey, it's feasible to add a cache on A so that it isn't unnecessarily recomputed!" But Spark can do nothing about this, and sometimes it's hard even for a human to decide:
val A = ... (some transformation)
var B = A.filter( fun1 )
var C: ??? = null
var D: ??? = null
if (cond) {
//now whether multiple dependencies exist is runtime determined
C = A.filter( fun2 )
D = A.filter( fun3 )
}
B.someAction()... //do sth with B
if (cond) {
C.someAction()... //do sth with C
D.someAction()... //do sth with D
}
If the condition is true then it's tempting to cache A, but you'll never know until runtime. I know this is an artificial, crappy example, but these are already simplified models. Things could get more complicated in practice: the dependencies could be quite long, implicit, and spread across modules. So my question is: what's the general principle for dealing with this kind of problem? When should the common ancestors on the transformation dependency graph be cached (provided memory is not an issue)?
I'd like to hear something like "always follow functional programming paradigms when doing Spark" or "always cache them if you can". However, there's another situation where I may not need to:
val A = ... (some transformation)
val B = A.filter( fun1 )
val C = A.filter( fun2 )
...
B.join(C).someAction()
Again, B and C both depend on A, but instead of calling two actions separately they are joined to form one single transformation. This time I believe Spark is smart enough to compute A exactly once. I haven't found a proper way to run and examine this yet, but it should be obvious in the web UI DAG. What's more, I think Spark can even fuse the two filter operations into one traversal of A to get B and C at the same time. Is this true?
There's a lot to unpack here.
Transformation RDDs B and C both depend on A which might itself be a complex transformation. So will A be computed twice ? I argue that it will because spark can't do anything that's inter-transformations, right ?
Yes, it will be computed twice, unless you call A.cache() or A.persist(), in which case it will be calculated only once.
For example it's possible that some state change occurs after B.someAction but before C.someAction which may affect the value of A so the re-computation becomes necessary
No, this is not correct. A is immutable, therefore its state cannot change. B and C are also immutable RDDs that represent transformations of A.
sc.parallelize(...).flatMap(e => arr.map(_ * e)) //now A depends on some local array
No, it doesn't depend on the local array; it is an immutable RDD containing a copy of the elements of the (driver) local array. If the array changes, A does not change. To obtain that behaviour you would have to declare var A = sc.parallelize(...) and then assign A again when the local array changes: A = sc.parallelize(...). In that scenario, A isn't 'updated'; it is replaced by a new RDD representation of the local array, and as such any cached version of the old A is invalid.
The subsequent examples you have posted benefit from caching A. Again because RDDs are immutable.
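The recompute-versus-cache behaviour described in this answer can be imitated without a cluster. The sketch below is only a plain-Scala analogy, not Spark code: a() plays the role of an uncached lineage (a recipe that is re-run for every consumer), and materializing it into a Vector plays the role of A.cache(). RecomputeAnalogy, expensive, withoutCache and withCache are all hypothetical names.

```scala
import java.util.concurrent.atomic.AtomicInteger

object RecomputeAnalogy {
  // Counts how often the "complex transformation" is applied to an element.
  val evaluations = new AtomicInteger(0)

  def expensive(x: Int): Int = { evaluations.incrementAndGet(); x * 2 }

  // "A" as a recipe: each call rebuilds the data from the source,
  // much like an uncached RDD is recomputed for every action.
  def a(): Iterator[Int] = (1 to 5).iterator.map(expensive)

  // Two independent consumers of A, each triggering a full recomputation.
  def withoutCache(): Int = {
    evaluations.set(0)
    val b = a().filter(_ > 4).toList
    val c = a().filter(_ % 4 == 0).toList
    evaluations.get
  }

  // Materialize A once (the analogue of caching), then derive B and C from it.
  def withCache(): Int = {
    evaluations.set(0)
    val cached = a().toVector
    val b = cached.filter(_ > 4)
    val c = cached.filter(_ % 4 == 0)
    evaluations.get
  }
}
```

In this analogy withoutCache() applies expensive to each of the five elements twice, while withCache() applies it once per element.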

Can we prevent laziness of Apache Spark Transformation?

Recently, an employer asked me how we can prevent the laziness of Apache Spark transformations. I know that we can persist and cache an RDD, but in case of failure it is recomputed from its parent.
Can anyone please explain whether there is any function to stop the laziness of Spark transformations?
By design, Spark transformations are lazy, and you must use an action in order to retrieve a concrete value out of them.
For example, the following transformations will always remain lazy:
JavaRDD<String> lines = sc.textFile("data.txt");
JavaRDD<Integer> lineLengths = lines.map(s -> s.length());
Functions like map return RDDs, and you can only turn those RDDs into real values by performing actions, such as reduce:
int totalLength = lineLengths.reduce((a, b) -> a + b);
There is no flag that will make map return a concrete value (for example, a list of integers).
The bottom line is that you can use collect or any other Spark action to 'prevent the laziness' of a transformation:
JavaRDD<String> lines = sc.textFile("data.txt");
JavaRDD<Integer> lineLengths = lines.map(s -> s.length());
List<Integer> collectedLengths = lineLengths.collect();
Remember, though, that using collect on a large dataset is usually very bad practice, because it can make your driver run out of memory.

Is there any scenario where Spark RDDs fail to satisfy immutability?

Spark RDDs are constructed in an immutable, fault-tolerant and resilient manner.
Do RDDs satisfy immutability in all scenarios? Or is there any case, be it in Streaming or Core, where an RDD might fail to satisfy immutability?
It depends on what you mean when you talk about an RDD. Strictly speaking, an RDD is just a description of lineage which exists only on the driver, and it doesn't provide any methods which can be used to mutate its lineage.
When data is processed we can no longer talk about RDDs but about tasks; nevertheless, data is exposed using immutable data structures (scala.collection.Iterator in Scala, itertools.chain in Python).
So far so good. Unfortunately, immutability of a data structure doesn't imply immutability of the stored data. Let's create a small example to illustrate that:
val rdd = sc.parallelize(Array(0) :: Array(0) :: Array(0) :: Nil)
rdd.map(a => { a(0) +=1; a.head }).sum
// Double = 3.0
You can execute this as many times as you want and get the same result. Now let's cache rdd and repeat the whole process:
rdd.cache
rdd.map(a => { a(0) +=1; a.head }).sum
// Double = 3.0
rdd.map(a => { a(0) +=1; a.head }).sum
// Double = 6.0
rdd.map(a => { a(0) +=1; a.head }).sum
// Double = 9.0
Since the function we use in the first map is not pure and modifies its mutable argument in place, these changes accumulate with each execution and result in unpredictable output. For example, if rdd is evicted from the cache we can once again get 3.0. If only some partitions are cached, you can get mixed results.
PySpark provides stronger isolation, so obtaining a result like this is not possible there, but that is a matter of architecture, not of immutability.
The take-away message here is that you should be extremely careful when working with mutable data and avoid any in-place modifications unless they are explicitly allowed (fold, aggregate).
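The accumulation effect is not specific to Spark, and it can be reproduced with ordinary collections. The following sketch is a local analogy only: a Vector of arrays stands in for cached partitions that keep handing out the same array objects. InPlaceMutationAnalogy, fresh, impureSum and pureSum are hypothetical names. It contrasts the in-place update from the example with a pure function that leaves its input untouched.

```scala
object InPlaceMutationAnalogy {
  // Stand-in for cached data: the same array objects are reused on every pass.
  def fresh(): Vector[Array[Int]] = Vector(Array(0), Array(0), Array(0))

  // Mirrors the impure map from the example: mutates each array, then reads it.
  def impureSum(data: Vector[Array[Int]]): Int =
    data.map { a => a(0) += 1; a.head }.sum

  // Pure alternative: computes the incremented value without touching the array.
  def pureSum(data: Vector[Array[Int]]): Int =
    data.map(a => a.head + 1).sum
}
```

Calling impureSum repeatedly on the same data yields a different sum each time, whereas pureSum keeps returning the same value no matter how often it runs.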
Take this example:
sc.makeRDD(1 to 100000).map(x => {
  println(x)
  x + 1
}).collect
If a node fails after the map has been completed, but the full results have yet to be sent back to the driver, then the map will recompute on a different machine. The final results will always be the same, as any value double computed will only be sent back once. However, the println will have occurred twice for some calls. So, yes, immutability of the DAG itself is guaranteed, but you must still write your code with the assumption that it will be run more than once.

Unexpected behaviour of iterator on String

Can anyone explain why these iterators behave differently? I generally expect a String to act like an IndexedSeq[Char]. Is this documented anywhere?
val si: Iterator[Char] = "uvwxyz".iterator
val vi: Iterator[Char] = "uvwxyz".toIndexedSeq.iterator
val sr = for (i <- 1 to 3)
yield si take 2 mkString
//sr: scala.collection.immutable.IndexedSeq[String] = Vector(uv, uv, uv)
val vr = for (i <- 1 to 3)
yield vi take 2 mkString
//vr: scala.collection.immutable.IndexedSeq[String] = Vector(uv, wx, yz)
There are no guarantees about the state of the iterator after you invoke take on it.
The problem with iterators is that many useful operations can only be implemented by causing side effects. All these operations have a specified direct effect but may also have side effects that cannot be specified (or that would complicate the implementation).
In the case of take, there are implementations that clone the internal state of the iterator and others that advance the iterator. If you want to guarantee the absence of side effects you will have to use immutable data structures; in any other case your code should rely only on the direct effects.
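If the goal of the original snippet is to read the characters in chunks of two, one way to stay within the specified direct effects is to never reuse an iterator after calling take on it, and to let a single derived iterator do the chunking instead. The sketch below uses Iterator.grouped for that; IteratorChunks and chunks are hypothetical names chosen for illustration.

```scala
object IteratorChunks {
  // Splits a string into consecutive chunks of `size` characters.
  // Only the iterator returned by grouped is consumed; the source
  // iterator is never touched again after grouped is called on it.
  def chunks(s: String, size: Int): Vector[String] =
    s.iterator.grouped(size).map(_.mkString).toVector
}
```

For example, IteratorChunks.chunks("uvwxyz", 2) produces the three chunks uv, wx and yz regardless of whether the underlying iterator comes from a String or an IndexedSeq.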
