Recently, one employer ask me a question that how can we prevent laziness of Apache Spark transformation. I know that we can persists and cache RDD data-set but in case of failure, it recompute from parent.
Can anyone please explain me, is there any function to stop the laziness of Spark transformation?
By design, Spark transformations are lazy, and you must use an action in order to retrieve a concrete value out of them.
For example, the following transformations will always remain lazy:
JavaRDD<String> lines = sc.textFile("data.txt");
JavaRDD<Integer> lineLengths = lines.map(s -> s.length());
Functions like map return RDDs, and you can only turn those RDDs into real values by performing actions, such as reduce:
int totalLength = lineLengths.reduce((a, b) -> a + b);
There is no flag that will make map return a concrete value (for example, a list of integers).
The bottom line is that you can use collect or any other Spark action to 'prevent the laziness' of a transformation:
JavaRDD<String> lines = sc.textFile("data.txt");
JavaRDD<Integer> lineLengths = lines.map(s -> s.length());
List<Integer> collectedLengths = lineLengths.collect()
Remember, though, the using collect on a large dataset will probably be a very bad practice, making your driver run out of memory.
Related
I'm relatively new to spark and might even be wrong before finishing building up the scenario questions so feel free to skip reading and point it out where you find I'm conceptually wrong, thanks!
Imagine a piece of driver code like this:
val A = ... (some transformation)
val B = A.filter( fun1 )
val C = A.filter( fun2 )
...
B.someAction()... //do sth with B
...
C.someAction()... //do sth with C
Transformation RDDs B and C both depend on A which might itself be a complex transformation. So will A be computed twice ? I argue that it will because spark can't do anything that's inter-transformations, right ? Spark is intelligent on optimizing one transformation execution at a time because the bundled tasks in it could be throughly analyzed. For example it's possible that some state change occurs after B.someAction but before C.someAction which may affect the value of A so the re-computation becomes necessary. For further example It could happen like this:
val arr = Array(...)
val A = sc.parallelize(...).flatMap(e => arr.map(_ * e)) //now A depends on some local array
... //B and C stays the same as above
B.someAction()
...
arr(i) = arr(i) + 10 //local state modified
...
C.someAction() //should A be recomputed? YES
This is easy to verify so I did a quick experiment and the result supports my reasoning.
However if B and C just independently depend on A and no other logic like above exists then a programmer or some tool could statically analyze the code and say hey it’s feasible to add a cache on A so that it doesn’t unnecessarily recompute! But spark can do nothing about this and sometimes it’s even hard for human to decide:
val A = ... (some transformation)
var B = A.filter( fun1 )
var C: ??? = null
var D: ??? = null
if (cond) {
//now whether multiple dependencies exist is runtime determined
C = A.filter( fun2 )
D = A.filter( fun3 )
}
B.someAction()... //do sth with B
if (cond) {
C.someAction()... //do sth with C
D.someAction()... //do sth with D
}
If the condition is true then it’s tempting to cache A but you’ll never know until runtime. I know this is an artificial crappy example but these are already simplified models things could get more complicated in practice and the dependencies could be quite long and implicit and spread across modules so my question is what’s the general principle to deal with this kind of problem. When should the common ancestors on the transformation dependency graph be cached (provided memory is not an issue) ?
I’d like to hear something like always follow functional programming paradigms doing spark or always cache them if you can however there’s another situation that I may not need to:
val A = ... (some transformation)
val B = A.filter( fun1 )
val C = A.filter( fun2 )
...
B.join(C).someAction()
Again B and C both depend on A but instead of calling two actions separately they are joined to form one single transformation. This time I believe spark is smart enough to compute A exactly once. Haven’t found a proper way to run and examine yet but should be obvious in the web UI DAG. What's further I think spark can even reduce the two filter operations into one traversal on A to get B and C at the same time. Is this true?
There's a lot to unpack here.
Transformation RDDs B and C both depend on A which might itself be a complex transformation. So will A be computed twice ? I argue that it will because spark can't do anything that's inter-transformations, right ?
Yes, it will be computed twice, unless you call A.cache() or A.persist(), in which case it will be calculated only once.
For example it's possible that some state change occurs after B.someAction but before C.someAction which may affect the value of A so the re-computation becomes necessary
No, this is not correct, A is immutable, therefore it's state cannot change. B and C are also immutable RDDs that represent transformations of A.
sc.parallelize(...).flatMap(e => arr.map(_ * e)) //now A depends on some local array
No, it doesn't depend on the local array, it is an immutable RDD containing the copy of the elements of the (driver) local array. If the array changes, A does not change. To obtain that behaviour you would have to var A = sc. parallelize(...) and then set A again when local array changes A = sc.paralellize(...). In that scenario, A isn't 'updated' it is replaced by a new RDD representation of the local array, and as such any cached version of A is invalid.
The subsequent examples you have posted benefit from caching A. Again because RDDs are immutable.
I know map function can do like
val a=5
map(data=>data+5)
Is that possible variable a can be dynamic?
For example, the value of variable a is between 1 to 5 so a=1,2,3,4,5.
When I call map function, it can distributed execute like
data + 1
data + 2
data + 3
data + 4
data + 5
If I'm understanding your question correctly, it doesn't make sense from a Spark perspective. What you're asking for makes sense in a non-distributed, sequential processing environment (where each datum can be deterministically applied a different function). However, Spark applies transformations across distributed datasets and the functions applied by these transformations are identical.
One way to achieve what you are trying to do is to use some inherent qualities of the input in transforming your data. This way, even if your transformation function is identical, the arguments provided to it will allow it do behave like (what you described as) a "dynamic variable". In your example, the zipWithIndex() function can suffice. Though it is important to note that if ordering is not guaranteed, the indexes are subject to change on each run of the transformation.
scala> val rdd = sc.parallelize(Array(1,1,1,1,1,1))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:12
scala> val newRDD = rdd.zipWithIndex().map { case (elem, idx) => elem + idx }
...
scala> newRDD.take(6)
...
res0: Array[Long] = Array(1, 2, 3, 4, 5, 6)
I'm looking for a way to split an RDD into two or more RDDs. The closest I've seen is Scala Spark: Split collection into several RDD? which is still a single RDD.
If you're familiar with SAS, something like this:
data work.split1, work.split2;
set work.preSplit;
if (condition1)
output work.split1
else if (condition2)
output work.split2
run;
which resulted in two distinct data sets. It would have to be immediately persisted to get the results I intend...
It is not possible to yield multiple RDDs from a single transformation*. If you want to split a RDD you have to apply a filter for each split condition. For example:
def even(x): return x % 2 == 0
def odd(x): return not even(x)
rdd = sc.parallelize(range(20))
rdd_odd, rdd_even = (rdd.filter(f) for f in (odd, even))
If you have only a binary condition and computation is expensive you may prefer something like this:
kv_rdd = rdd.map(lambda x: (x, odd(x)))
kv_rdd.cache()
rdd_odd = kv_rdd.filter(lambda kv: kv[1]).keys()
rdd_even = kv_rdd.filter(lambda kv: not kv[1]).keys()
It means only a single predicate computation but requires additional pass over all data.
It is important to note that as long as an input RDD is properly cached and there no additional assumptions regarding data distribution there is no significant difference when it comes to time complexity between repeated filter and for-loop with nested if-else.
With N elements and M conditions number of operations you have to perform is clearly proportional to N times M. In case of for-loop it should be closer to (N + MN) / 2 and repeated filter is exactly NM but at the end of the day it is nothing else than O(NM). You can see my discussion** with Jason Lenderman to read about some pros-and-cons.
At the very high level you should consider two things:
Spark transformations are lazy, until you execute an action your RDD is not materialized
Why does it matter? Going back to my example:
rdd_odd, rdd_even = (rdd.filter(f) for f in (odd, even))
If later I decide that I need only rdd_odd then there is no reason to materialize rdd_even.
If you take a look at your SAS example to compute work.split2 you need to materialize both input data and work.split1.
RDDs provide a declarative API. When you use filter or map it is completely up to Spark engine how this operation is performed. As long as the functions passed to transformations are side effects free it creates multiple possibilities to optimize a whole pipeline.
At the end of the day this case is not special enough to justify its own transformation.
This map with filter pattern is actually used in a core Spark. See my answer to How does Sparks RDD.randomSplit actually split the RDD and a relevant part of the randomSplit method.
If the only goal is to achieve a split on input it is possible to use partitionBy clause for DataFrameWriter which text output format:
def makePairs(row: T): (String, String) = ???
data
.map(makePairs).toDF("key", "value")
.write.partitionBy($"key").format("text").save(...)
* There are only 3 basic types of transformations in Spark:
RDD[T] => RDD[T]
RDD[T] => RDD[U]
(RDD[T], RDD[U]) => RDD[W]
where T, U, W can be either atomic types or products / tuples (K, V). Any other operation has to be expressed using some combination of the above. You can check the original RDD paper for more details.
** https://chat.stackoverflow.com/rooms/91928/discussion-between-zero323-and-jason-lenderman
*** See also Scala Spark: Split collection into several RDD?
As other posters mentioned above, there is no single, native RDD transform that splits RDDs, but here are some "multiplex" operations that can efficiently emulate a wide variety of "splitting" on RDDs, without reading multiple times:
http://silex.freevariable.com/latest/api/#com.redhat.et.silex.rdd.multiplex.MuxRDDFunctions
Some methods specific to random splitting:
http://silex.freevariable.com/latest/api/#com.redhat.et.silex.sample.split.SplitSampleRDDFunctions
Methods are available from open source silex project:
https://github.com/willb/silex
A blog post explaining how they work:
http://erikerlandson.github.io/blog/2016/02/08/efficient-multiplexing-for-spark-rdds/
def muxPartitions[U :ClassTag](n: Int, f: (Int, Iterator[T]) => Seq[U],
persist: StorageLevel): Seq[RDD[U]] = {
val mux = self.mapPartitionsWithIndex { case (id, itr) =>
Iterator.single(f(id, itr))
}.persist(persist)
Vector.tabulate(n) { j => mux.mapPartitions { itr => Iterator.single(itr.next()(j)) } }
}
def flatMuxPartitions[U :ClassTag](n: Int, f: (Int, Iterator[T]) => Seq[TraversableOnce[U]],
persist: StorageLevel): Seq[RDD[U]] = {
val mux = self.mapPartitionsWithIndex { case (id, itr) =>
Iterator.single(f(id, itr))
}.persist(persist)
Vector.tabulate(n) { j => mux.mapPartitions { itr => itr.next()(j).toIterator } }
}
As mentioned elsewhere, these methods do involve a trade-off of memory for speed, because they operate by computing entire partition results "eagerly" instead of "lazily." Therefore, it is possible for these methods to run into memory problems on large partitions, where more traditional lazy transforms will not.
One way is to use a custom partitioner to partition the data depending upon your filter condition. This can be achieved by extending Partitioner and implementing something similar to the RangePartitioner.
A map partitions can then be used to construct multiple RDDs from the partitioned RDD without reading all the data.
val filtered = partitioned.mapPartitions { iter => {
new Iterator[Int](){
override def hasNext: Boolean = {
if(rangeOfPartitionsToKeep.contains(TaskContext.get().partitionId)) {
false
} else {
iter.hasNext
}
}
override def next():Int = iter.next()
}
Just be aware that the number of partitions in the filtered RDDs will be the same as the number in the partitioned RDD so a coalesce should be used to reduce this down and remove the empty partitions.
If you split an RDD using the randomSplit API call, you get back an array of RDDs.
If you want 5 RDDs returned, pass in 5 weight values.
e.g.
val sourceRDD = val sourceRDD = sc.parallelize(1 to 100, 4)
val seedValue = 5
val splitRDD = sourceRDD.randomSplit(Array(1.0,1.0,1.0,1.0,1.0), seedValue)
splitRDD(1).collect()
res7: Array[Int] = Array(1, 6, 11, 12, 20, 29, 40, 62, 64, 75, 77, 83, 94, 96, 100)
Spark RDD's are constructed in immutable, fault tolerant and resilient manner.
Does RDDs satisfy immutability in all scenarios? Or is there any case, be it in Streaming or Core, where RDD might fail to satisfy immutability?
It depends on what you mean when you talk about RDD. Strictly speaking RDD is just a description of lineage which exists only on the driver and it doesn't provide any methods which can be used to mutate its lineage.
When data is processed we can no longer talk about about RDDs but tasks nevertheless data is exposed using immutable data structures (scala.collection.Iterator in Scala, itertools.chain in Python).
So far so good. Unfortunately immutability of a data structure doesn't imply immutability of the stored data. Lets create a small example to illustrate that:
val rdd = sc.parallelize(Array(0) :: Array(0) :: Array(0) :: Nil)
rdd.map(a => { a(0) +=1; a.head }).sum
// Double = 3.0
You can execute this as many times as you want and get the same result. Now lets cache rdd and repeat a whole process:
rdd.cache
rdd.map(a => { a(0) +=1; a.head }).sum
// Double = 3.0
rdd.map(a => { a(0) +=1; a.head }).sum
// Double = 6.0
rdd.map(a => { a(0) +=1; a.head }).sum
// Double = 9.0
Since function we use in the first map is not pure and modifies its mutable argument in place these changes are accumulated with each execution and result in unpredictable output. For example if rdd is evicted from cache we can once again get 3.0. If some partitions are not cached you can mixed results.
PySpark provides stronger isolation and obtaining result like this is not possible but it is a matter of architecture not a immutability.
Take away message here is that you should be extremely careful when working with mutable data and avoid any modifications in place unless it is explicitly allowed (fold, aggregate).
Take this example:
sc.makeRDD(1 to 100000).map(x=>{
println(x)
x + 1
}.collect
If a node fails after the map has been completed, but the full results have yet to be sent back to the driver, then the map will recompute on a different machine. The final results will always be the same, as any value double computed will only be sent back once. However, the println will have occurred twice for some calls. So, yes, immutability of the DAG itself is guaranteed, but you must still write your code with the assumption that it will be run more than once.
I am looking for a Spark RDD operation like top or takeOrdered, but that returns another RDD, not an Array, that is, does not collect the full result to RAM.
It can be a sequence of operations, but ideally, in no step trying to collect the full result into the memory of a single node.
Let's say you want to have the top 50% of an RDD.
def top50(rdd: RDD[(Double, String)]) = {
val sorted = rdd.sortByKey(ascending = false)
val partitions = sorted.partitions.size
// Throw away the contents of the lower partitions.
sorted.mapPartitionsWithIndex { (pid, it) =>
if (pid <= partitions / 2) it else Nil
}
}
This is an approximation — you may get more or less than 50%. You could do better but it would cost an extra evaluation of the RDD. For the use cases I have in mind this would not be worth it.
Take a look at
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/MLPairRDDFunctions.scala
import org.apache.spark.mllib.rdd.MLPairRDDFunctions._
val rdd: RDD[(String, Int)] // the first string is the key, the rest is the value
val topByKey:RDD[(String, Array[Int])] = rdd.topByKey(n)
Or use aggregate with BoundedPriorityQueue.