Spark RDD performance: pipelining - apache-spark

If we have, say:
val rdd1 = rdd0.map( ...
followed by
val rdd2 = rdd1.filter( ...
Then, when execution is actually triggered by an action, can rdd2 start consuming rdd1 results as they are produced, or must it wait until all of rdd1's work is complete? It is not apparent to me from the Spark documentation. Informatica pipelining does this, so I assume Spark probably does as well.

Spark transformations are lazy, so neither call does anything beyond building the dependency DAG. Your code doesn't even touch the data.
For anything to be computed, you have to execute an action on rdd2 or one of its descendants.
By default RDDs are also forgetful, so unless you cache rdd1 it will be evaluated all over again every time rdd2 is evaluated.
Finally, because of lazy evaluation, multiple narrow transformations are combined into a single stage, and your code will interleave calls to the map and filter functions.
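A minimal sketch of that interleaving (names and numbers are made up; assuming local mode so the println output lands on the driver console):
val rdd0 = sc.parallelize(1 to 3, numSlices = 1)            // one partition keeps the output readable
val rdd1 = rdd0.map    { x => println(s"map: $x");    x + 100 }
val rdd2 = rdd1.filter { x => println(s"filter: $x"); x % 2 == 0 }
rdd2.collect()   // only the action triggers work; output interleaves: map: 1, filter: 101, map: 2, filter: 102, ...
If map were fully evaluated before filter, you would see all the map lines first; instead each element flows through both functions before the next one is read.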

Related

Does Spark recompute an intermediate result if two different transformations are applied to it?

Suppose we start from some source data and get an intermediate result df_intermediate. Along the pipeline from the source data to df_intermediate, all transformations are lazy and nothing is actually computed.
Then I would like to perform two different transformations on df_intermediate. For example, I would like to calculate df_intermediate.agg({"col":"max"}) and df_intermediate.approxQuantile("col", [0.1,0.2,0.3], 0.01) using two separate commands.
I wonder, in this scenario, does Spark need to recompute df_intermediate when it performs the second transformation? In other words, does Spark start the calculation for each of the two transformations from the raw data, without storing the intermediate result? Obviously I can cache the intermediate result, but I'm just wondering whether Spark does this kind of optimization internally.
The answer is somewhat disappointing, but first you need to look at it in terms of Actions. I will not consider caching here.
If you do the following, there will certainly be optimization:
val df1 = df0.withColumn(...
val df2 = df1.withColumn(...
Your example needs an Action such as count before anything runs. But the two statements are too different for any processing to be skipped, so there is no sharing of the intermediate result.
In general, Action = Job is the correct way to look at it. For DataFrames, the Catalyst Optimizer can kick off a Job even though you may not realize it. For RDDs (the legacy API) this was a little different.
This does not get optimized either:
import org.apache.spark.sql.functions._
val df = spark.range(1,10000).toDF("c1")
val df_intermediate = df.withColumn("c2", col("c1") + 100)
val x = df_intermediate.agg(max("c2"))
val y = df_intermediate.agg(min("c2"))
val z = x.union(y).count
x and y both go back to the source. One might have expected that work to be shared, and it is a single Action here. You would need to run .explain to confirm, but the idea is to leave it to Spark, relying on lazy evaluation, etc.
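If the goal is to share df_intermediate between the two aggregations, the usual remedy is an explicit cache. A minimal sketch reusing the names above (df_shared is a made-up name; the cache is only populated by the first action that touches it):
val df_shared = df_intermediate.cache()    // marks the data for reuse; nothing is materialized yet
val x2 = df_shared.agg(max("c2"))
val y2 = df_shared.agg(min("c2"))
val z2 = x2.union(y2).count                // the first aggregation fills the cache, the second reads from it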
As an aside, see also: "Is it efficient to cache a dataframe for a single Action Spark application in which that dataframe is referenced more than once?" and "In which situations are the stages of DAG skipped?"

Transformation vs Action in the context of Laziness

As mentioned in the "Learning Spark: Lightning-Fast Big Data Analysis" book:
Transformations and actions are different because of the way Spark computes RDDs.
After some explanation about laziness, it seemed to me that both transformations and actions work lazily. So the question is: what does the quoted sentence mean?
It is not necessarily valid to contrast laziness of RDD actions vs transformations.
The correct statement would be that RDDs are lazily evaluated, from the perspective of an RDD as a collection of data: there's not necessarily "data" in memory when the RDD instance is created.
The question this statement raises is: when does the RDD's data get loaded into memory? In other words, when does the RDD get evaluated? This is where the distinction between actions and transformations comes in:
Consider the following sequence of code:
Line #1:
rdd = sc.textFile("text-file-path")
Does the RDD exist? Yes.
Is the data loaded in memory? No.
--> RDD evaluation is lazy
Line #2:
rdd2 = rdd.map(lambda line: line.split())
Does the RDD exist? Yes. In fact, there are 2 RDDs.
Is the data loaded in memory? No.
--> Still lazy: all Spark does is record how to load the data and transform it, remembering the lineage (how to derive one RDD from another).
Line #3:
print(rdd2.collect())
Does the RDD exist? Yes (2 RDDs still).
Is the data loaded in memory? Yes.
What's the difference? collect() forces Spark to return the result of the transformations. Spark now does all that it recorded in steps #1, #2, and #3.
In spark's terminology, #1 and #2 are transformations.
Transformations typically return another RDD instance, and that's a hint for recognizing the lazy part.
#3 contains an action: an operation that causes the plans recorded by the transformations to be carried out, in order to return a result or perform a final step such as saving the results (i.e. the actual collection of data loaded into memory).
So, in short, I'd say that RDDs are lazily evaluated, but, in my opinion, it's incorrect to label operations (actions or transformations) as lazy or not.
Transformations are lazy, actions are not.
Definitions:
Transformation - A function that derives a new dataset from data out on the cluster (nothing is changed in place, since RDDs are immutable). Examples of this are map and filter. These are not executed until an action is called.
Action - Any function that results in data being persisted or returned to the driver (plus foreach, which doesn't really fall into either of those two categories).
In order to run an action (like saving the data), all the transformations you have requested up to that point have to be run to materialize the data. Spark can apply optimizations when it looks at the total execution plan of the operations you want to run, so it is beneficial not to compute anything until it is required.
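One way to observe this laziness (a sketch, assuming the path below does not exist): the transformations succeed without complaint, and the missing-input error only surfaces when the action forces evaluation.
val missing = sc.textFile("/path/that/does/not/exist")   // no error: nothing is read yet
val lengths = missing.map(_.length)                      // still no error: only the lineage is recorded
lengths.count()                                          // the action fails here with an "input path does not exist" error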

Is persistence required for a single action in Spark?

I have a workflow like below:
rdd1 = sc.textFile(input);
rdd2 = rdd1.filter(filterfunc1);
rdd3 = rdd1.filter(filterfunc2);
rdd4 = rdd2.map(maptrans1);
rdd5 = rdd3.map(maptrans2);
rdd6 = rdd4.union(rdd5);
rdd6.foreach(somefunc);
1. Do I need to persist rdd1? Or is it not required, since there is only one action at rdd6, which creates only one job, and within a single job there is no need to persist?
2. Also, what if the transformation on rdd2 were reduceByKey instead of map? Would it be the same thing, no need to persist, since it is still a single job?
You only need to persist if you plan to reuse the RDD in more than one action. Within a single action, Spark does a good job of deciding when to recalculate and when to reuse.
You can check the DAG in the UI to make sure rdd1 is only read from the file once.
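If you did reuse rdd1 across more than one action, that is where persist pays off. A rough sketch (Scala here; the input path and filter functions are placeholders from the question):
import org.apache.spark.storage.StorageLevel
val rdd1 = sc.textFile(input).persist(StorageLevel.MEMORY_ONLY)
val n1 = rdd1.filter(filterfunc1).count()   // first action: reads the file and fills the cache
val n2 = rdd1.filter(filterfunc2).count()   // second action: served from the cache, no second read
rdd1.unpersist()                            // release the memory when done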

Create JavaPairRDD from a collection with a custom partitioner

Is it possible to create a JavaPairRDD<K,V> from a List<Tuple2<K,V>> with a specified partitioner? The method parallelizePairs in JavaSparkContext only takes the number of slices and does not allow using a custom partitioner, and invoking partitionBy(...) results in a shuffle, which I would like to avoid.
Why do I need this? Let's say I have rdd1 of some type JavaPairRDD<K,V> which is partitioned according to the hashCode of K. Now, I would like to create rdd2 of another type JavaPairRDD<K,U> from a List<Tuple2<K,U>>, in order to finally obtain rdd3 = rdd1.join(rdd2).mapValues(...). If rdd2 is not partitioned the same way rdd1 is, the cogroup call in join will result in expensive data movement across the machines. Calling rdd2.partitionBy(rdd1.partitioner()) does not help either, since it also invokes a shuffle. Therefore, it seems the only remedy is to ensure rdd2 is created with the same partitioner as rdd1 to begin with. Any suggestions?
P.S. If the List<Tuple2<K,U>> is small, another option is a broadcast hash join, i.e. making a HashMap<K,U> from the List<Tuple2<K,U>>, broadcasting it to all partitions of rdd1, and performing a map-side join. This turns out to be faster than repartitioning rdd2; however, it is not an ideal solution.
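For reference, a rough sketch of that broadcast map-side join (written in Scala rather than the Java API; rdd1 and smallList are assumed to exist, with rdd1: RDD[(String, Int)] already partitioned as desired and smallList: List[(String, Double)] as the small side):
val smallMap = smallList.toMap            // assumes keys on the small side are unique
val bcast = sc.broadcast(smallMap)
val joined = rdd1.mapPartitions({ iter =>
  val lookup = bcast.value
  iter.flatMap { case (k, v) => lookup.get(k).map(u => (k, (v, u))) }   // inner-join semantics
}, preservesPartitioning = true)          // keys are unchanged, so rdd1's partitioning is preserved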

What is the result of RDD transformation in Spark?

Can anyone explain what the result of an RDD transformation is? Is it a new set of data (a copy of the data), or is it only a new set of pointers to filtered blocks of the old data?
RDD transformations allow you to create dependencies between RDDs. Dependencies are only steps for producing results (a program). Each RDD in lineage chain (string of dependencies) has a function for calculating its data and has a pointer (dependency) to its parent RDD. Spark will divide RDD dependencies into stages and tasks and send those to workers for execution.
So if you do this:
val lines = sc.textFile("...")
val words = lines.flatMap(line => line.split(" "))
val localwords = words.collect()
words will be an RDD containing a reference to the lines RDD. When the program is executed, first lines' function will be executed (load the data from the text file), then words' function will be executed on the resulting data (split the lines into words). Spark is lazy, so nothing gets executed until you call an action that triggers job creation and execution (collect in this example).
So, an RDD (transformed RDD, too) is not 'a set of data', but a step in a program (might be the only step) telling Spark how to get the data and what to do with it.
Transformations create a new RDD based on an existing RDD; RDDs are immutable.
All transformations in Spark are lazy. Data in RDDs is not processed until an action is performed.
Examples of RDD transformations:
map, filter, flatMap, groupByKey, reduceByKey
As others have mentioned, an RDD maintains a list of all the transformations which have been programmatically applied to it. These are lazily evaluated, so though (in the REPL, for example) you may get a result back of a different parameter type (after applying a map, for example), the 'new' RDD doesn't yet contain anything, because nothing has forced the original RDD to evaluate the transformations and filters in its lineage. Methods such as count and the various reduction methods will cause the transformations to be applied. Checkpointing also materializes the RDD once an action runs, saving its data and leaving an RDD with no lineage (this can be a performance advantage, especially with iterative applications).
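For reference, a minimal checkpoint sketch (the directory is a made-up example); note that the lineage is only truncated after an action has materialized the RDD:
sc.setCheckpointDir("/tmp/spark-checkpoints")           // hypothetical path; use a reliable store (e.g. HDFS) on a cluster
val rdd = sc.parallelize(1 to 1000).map(_ * 2).filter(_ % 3 == 0)
rdd.checkpoint()                                        // only marks the RDD for checkpointing
rdd.count()                                             // the action computes the RDD and writes the checkpoint files
println(rdd.toDebugString)                              // the lineage now starts from the checkpoint data, not the original chain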
All the answers are perfectly valid. I just want to add a quick recap :-)
Transformations are operations that transform your RDD data from one form to another. When you apply such an operation to an RDD, you get back a new RDD with the transformed data (remember, RDDs in Spark are immutable). Operations like map, filter and flatMap are transformations.
Note that applying a transformation to an RDD does not perform the operation immediately. Spark creates a DAG (Directed Acyclic Graph) from the applied operation, the source RDD and the function used for the transformation, and it keeps building this graph of references until you apply an action on the last RDD in the chain. That is why transformations in Spark are lazy.
The other answers already give a good explanation. Here are my two cents:
To understand what is inside the returned RDD, it's best to check what the RDD abstract class comprises (quoting the source code):
Internally, each RDD is characterized by five main properties:
A list of partitions
A function for computing each split
A list of dependencies on other RDDs
Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
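To make those properties concrete, here is a toy sketch (not how you would normally create RDDs) of a custom RDD that only fills in the first two properties, its list of partitions and the function for computing each split; dependencies, the partitioner and preferred locations are left at their defaults. The class and names are made up for illustration.
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// One partition per "slice"; a Partition only needs an index.
case class SlicePartition(index: Int) extends Partition

// Toy RDD producing 0 until n, split round-robin across numSlices partitions.
class RangeLikeRDD(sc: SparkContext, n: Int, numSlices: Int) extends RDD[Int](sc, Nil) {   // Nil: no parent dependencies

  override protected def getPartitions: Array[Partition] =
    Array.tabulate[Partition](numSlices)(i => SlicePartition(i))

  override def compute(split: Partition, context: TaskContext): Iterator[Int] =
    (split.index until n by numSlices).iterator       // the "function for computing each split"
}

// Example: new RangeLikeRDD(sc, 10, 2).collect()  ==>  Array(0, 2, 4, 6, 8, 1, 3, 5, 7, 9)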
