What is the result of RDD transformation in Spark? - apache-spark

Can anyone explain what the result of an RDD transformation is? Is it a new set of data (a copy of the data), or is it only a new set of pointers to filtered blocks of the old data?

RDD transformations allow you to create dependencies between RDDs. Dependencies are only steps for producing results (a program). Each RDD in the lineage chain (the chain of dependencies) has a function for calculating its data and a pointer (dependency) to its parent RDD. Spark will divide the RDD dependencies into stages and tasks and send those to workers for execution.
So if you do this:
val lines = sc.textFile("...")
val words = lines.flatMap(line => line.split(" "))
val localwords = words.collect()
words will be an RDD containing a reference to the lines RDD. When the program is executed, first lines' function will be executed (load the data from the text file), then words' function will be executed on the resulting data (split lines into words). Spark is lazy, so nothing gets executed until you call an action that triggers job creation and execution (collect in this example).
So an RDD (a transformed RDD, too) is not 'a set of data', but a step in a program (which might be the only step) telling Spark how to get the data and what to do with it.
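For example, a minimal sketch (assuming a spark-shell session where sc is available; the file path is a placeholder) that makes the recorded steps visible with toDebugString:
val lines = sc.textFile("hdfs:///tmp/input.txt")    // step 1: how to load the data
val words = lines.flatMap(line => line.split(" "))  // step 2: how to transform it
println(words.toDebugString)                        // prints the lineage (MapPartitionsRDD <- HadoopRDD); nothing has run yet
val localwords = words.collect()                    // the action: only now is the program executed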

Transformations create a new RDD based on an existing RDD; RDDs are immutable.
All transformations in Spark are lazy: data in an RDD is not processed until an action is performed.
Examples of RDD transformations, with a small sketch below:
map, filter, flatMap, groupByKey, reduceByKey
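A small sketch of some of these chained together, assuming a spark-shell session (the data is made up):
val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))
val doubled = nums.map(_ * 2)             // transformation, lazy
val evens = doubled.filter(_ % 4 == 0)    // transformation, lazy
val keyed = evens.map(n => (n % 3, n))    // transformation, lazy
val summed = keyed.reduceByKey(_ + _)     // transformation, lazy
println(summed.collect().toList)          // action: everything above runs only now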

As others have mentioned, an RDD maintains a list of all the transformations which have been programmatically applied to it. These are lazily evaluated, so though (in the REPL, for example) you may get a result back of a different parameter type (after applying a map, for example), the 'new' RDD doesn't yet contain anything, because nothing has forced the original RDD to evaluate the transformations / filters which are in its lineage. Methods such as count, the various reduction methods, etc. will cause the transformations to be applied. Checkpointing also truncates the lineage: once the RDD has been materialized, its data is saved and what remains is an RDD which is the result of the transformations but has no lineage (this can be a performance advantage, especially with iterative applications).
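A minimal sketch of the checkpoint idea, assuming a spark-shell session (the checkpoint directory is a placeholder):
sc.setCheckpointDir("/tmp/spark-checkpoints")   // where checkpointed data gets written
val base = sc.parallelize(1 to 1000)
val derived = base.map(_ + 1).filter(_ % 2 == 0)
derived.checkpoint()            // only marks the RDD; data is written on the next action
derived.count()                 // the action materializes the RDD and checkpoints it
println(derived.toDebugString)  // the lineage now points at the checkpoint data instead of the full chain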

All answers are perfectly valid. I just want to add a quick picture :-)

Transformations are the kind of operations that transform your RDD data from one form to another. When you apply such an operation to an RDD, you get a new RDD with the transformed data (RDDs in Spark are immutable, remember?). Operations like map, filter, and flatMap are transformations.
There is a point to be noted here: when you apply a transformation to an RDD, it does not perform the operation immediately. Spark creates a DAG (Directed Acyclic Graph) from the applied operation, the source RDD, and the function used for the transformation, and it keeps building this graph using those references until you apply an action on the last RDD in the chain. That is why transformations in Spark are lazy.
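A quick sketch of that laziness, assuming a local spark-shell session (the println is only there to show when work actually happens):
val data = sc.parallelize(1 to 5)
val transformed = data.map { n =>
  println(s"processing $n")   // nothing is printed when map is called...
  n * 10
}
// ...only the DAG has been recorded so far; the action below triggers the actual work
transformed.collect()         // now the "processing ..." lines appear (on the console in local mode)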

The other answers give a good explanation already. Here are my two cents:
To know what's inside that returned RDD, it's best to check what's inside the RDD abstract class (quoted from the source code):
Internally, each RDD is characterized by five main properties:
A list of partitions
A function for computing each split
A list of dependencies on other RDDs
Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
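Those five properties correspond to members of the RDD abstract class; a trimmed paraphrase (not the exact source, with names kept close to the real ones) could look like this:
import org.apache.spark.{Dependency, Partition, Partitioner, TaskContext}
abstract class SketchRDD[T] {
  def compute(split: Partition, context: TaskContext): Iterator[T]          // a function for computing each split
  protected def getPartitions: Array[Partition]                             // a list of partitions
  protected def getDependencies: Seq[Dependency[_]]                         // a list of dependencies on other RDDs
  val partitioner: Option[Partitioner] = None                               // optionally, a Partitioner for key-value RDDs
  protected def getPreferredLocations(split: Partition): Seq[String] = Nil  // optionally, preferred locations per split
}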

Related

What algorithm does Spark use to bring the same keys together?

What algorithm does Spark use to identify similar keys and push the data to the next stage?
Scenarios include:
When I apply distinct(), I know a pre-distinct is applied in the current stage and then the data is shuffled to the next stage. In this case, all the similar keys need to be in the same partition in the next stage.
When Dataset1 joins with Dataset2 (SortMergeJoin). In this case, all the similar keys in Dataset1 and Dataset2 need to be in the same partition in the next stage.
There are other scenarios as well, but the overall picture is this.
How does Spark do this efficiently, and will there be any time lag between Stage 1 and Stage 2 while identifying the similar keys?
The algorithm Spark uses to partition the data is hash partitioning by default. Also, stages don't push data; they pull it from the previous stage.
Spark creates a stage boundary whenever a shuffle is needed. The second stage waits until all the tasks in the first stage complete and write their output to temporary files. The second stage then starts pulling the data needed for its partitions from across the partitions written in stage 1.
Distinct, as you can see, isn't as simple as it looks. Spark does distinct by applying aggregates, and shuffling is needed because duplicates can be in multiple partitions. One of the conditions for shuffling is that Spark needs a pair RDD, and if your parent RDD isn't one, it will create intermediary pair RDDs.
If you look at the logical plan of distinct, it is more or less:
Parent RDD ---> Mapped RDD (record as key, null value) ---> MapPartitionsRDD (running distinct at partition level) ---> ShuffledRDD (pulling the needed partition data) ---> MapPartitionsRDD (distinct across the shuffled partitions for each key) ---> Mapped RDD (collecting only the keys and discarding the null values for the result)
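That plan lines up with expressing distinct on top of reduceByKey; a simplified sketch of the same idea, assuming a spark-shell session:
val withDuplicates = sc.parallelize(Seq("a", "b", "a", "c", "b"))
val deduped = withDuplicates
  .map(word => (word, null))         // each record becomes a key, paired with a null value
  .reduceByKey((first, _) => first)  // partition-level combine, shuffle, then merge per key
  .map(_._1)                         // keep only the keys, discard the null values
println(deduped.collect().sorted.toList)   // List(a, b, c)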
Spark uses RDD dependencies to determine how data is shuffled to the next stage, and knowing which dependency applies is a complex process.
The getDependencies function in RDD.scala is responsible for returning how an RDD depends on its parents:
/**
* Implemented by subclasses to return how this RDD depends on parent RDDs. This method will only
* be called once, so it is safe to implement a time-consuming computation in it.
*/
protected def getDependencies: Seq[Dependency[_]] = deps
Some RDDs don't have a parent RDD to fetch from, so their compute function doesn't need to pull data from a parent (a DataSource RDD, for example).
A ShuffledRowRDD usually appears in the middle of a compute chain, so it usually does have parent data to fetch.
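A small sketch of inspecting those dependencies from user code, assuming a spark-shell session:
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val reduced = pairs.reduceByKey(_ + _)
println(pairs.dependencies)    // List() - the source RDD has no parent to fetch from
println(reduced.dependencies)  // contains a ShuffleDependency back to the parent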

Is a PairRDD faster than a non-Pair RDD?

In Spark, operations on an RDD (like map) are applied to the whole RDD, while operations on a pair RDD are applied to each element in parallel.
I want to know which one is faster for operations on larger data sets.
No, and there is no such comparison to be made.
What is relevant is that a pair RDD has more capabilities, e.g. the use of a join.
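For example, join only exists once you have key-value pairs; a minimal sketch with made-up data, assuming a spark-shell session:
val ages = sc.parallelize(Seq(("alice", 30), ("bob", 25)))
val cities = sc.parallelize(Seq(("alice", "Paris"), ("bob", "Berlin")))
val joined = ages.join(cities)      // available only because both are pair RDDs
joined.collect().foreach(println)   // (alice,(30,Paris)) and (bob,(25,Berlin))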

What's the overhead of converting an RDD to a DataFrame and back again?

It was my assumption that Spark DataFrames were built from RDDs. However, I recently learned that this is not the case, and Difference between DataFrame, Dataset, and RDD in Spark does a good job of explaining that they are not.
So what is the overhead of converting an RDD to a DataFrame, and back again? Is it negligible or significant?
In my application, I create a DataFrame by reading a text file into an RDD and then custom-encoding every line with a map function that returns a Row() object. Should I not be doing this? Is there a more efficient way?
RDDs have a double role in Spark. First, they are the internal data structure for tracking changes between stages in order to manage failures; second, until Spark 1.3 they were the main interface for interacting with users. Therefore, after Spark 1.3, DataFrames constitute the main interface, offering much richer functionality than RDDs.
There is no significant overhead when converting a DataFrame to an RDD with df.rdd, since DataFrames already keep an instance of their RDD initialized, so returning a reference to this RDD has no additional cost. On the other hand, generating a DataFrame from an RDD requires some extra effort. There are two ways to convert an RDD to a DataFrame: 1st by calling rdd.toDF() and 2nd with spark.createDataFrame(rdd, schema). Both methods are evaluated lazily, although there is extra overhead for schema validation and the execution plan (you can check the toDF() code here for more details). Of course, that would be identical to the overhead you get just by initializing your data with spark.read.text(...), but with one less step, the conversion from RDD to DataFrame.
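A sketch of those conversions, assuming a SparkSession named spark (column names and data are made up):
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import spark.implicits._

val rdd = spark.sparkContext.parallelize(Seq("a", "b", "c"))
val df1 = rdd.toDF("value")                                    // 1st way: toDF() via the implicits
val schema = StructType(Seq(StructField("value", StringType, nullable = true)))
val df2 = spark.createDataFrame(rdd.map(Row(_)), schema)       // 2nd way: explicit schema
val backToRdd = df1.rdd                                        // cheap: the DataFrame already holds its RDD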
This is the first reason I would go directly with DataFrames instead of working with two different Spark interfaces.
The second reason is that when using the RDD interface you are missing some significant performance features that DataFrames and Datasets offer, related to the Spark optimizer (Catalyst) and memory management (Tungsten).
Finally, I would use the RDD interface only if I needed features that are missing in DataFrames, such as key-value pairs, the zipWithIndex function, etc. But even then you can access those via df.rdd, which is costless, as already mentioned. As for your case, I believe it would be faster to use a DataFrame directly and use that DataFrame's map function, so that Spark leverages Tungsten and ensures efficient memory management.
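Following that advice, a hypothetical sketch that reads the text file directly as a DataFrame instead of mapping an RDD to Row objects (the path and the "name,age" line format are assumptions):
import spark.implicits._
val people = spark.read.text("/tmp/people.txt")   // DataFrame with a single "value" column
  .map { row =>
    val Array(name, age) = row.getString(0).split(",")   // hypothetical line format
    (name, age.toInt)
  }
  .toDF("name", "age")                            // Catalyst/Tungsten handle the rest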

Spark - do transformations also involve driver operations

My course notes have the following sentence: "RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset." But I think this is misleading because the transformation reduceByKey is performed locally on the workers and then on the driver as well (although the change does not take place until there's an action to be performed). Could you please correct me if I am wrong.
Here are the concepts:
In Spark, a transformation is an operation where one RDD generates one or more RDDs. Every time, a new RDD is created: RDDs are immutable, so any transformation on an RDD generates a new RDD, which is added to the DAG.
Actions in Spark are the functions where no new RDD is generated; they produce other data types like String, Int, etc., and the result is returned to the driver or to another storage system.
Transformations are lazy in nature and nothing happens until an action is triggered.
reduceByKey - it's a transformation, since it generates an RDD from the input RDD, and it's a wide transformation. With reduceByKey, nothing happens until an action is triggered.
reduce - it's an action, since it generates a non-RDD type.
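A small sketch contrasting the two, assuming a spark-shell session:
val sales = sc.parallelize(Seq(("apples", 3), ("pears", 2), ("apples", 5)))
val totals = sales.reduceByKey(_ + _)        // transformation: returns a new RDD, nothing runs yet
val grandTotal = sales.values.reduce(_ + _)  // action: returns a plain Int (10) to the driver
totals.collect()                             // another action: only now is reduceByKey computed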
As a matter of fact, the driver's first responsibility is managing the job. Moreover, the RDD objects are not located on the driver for actions to be applied to them, so all intermediate results stay on the workers until the action's turn comes. What I mean is Spark's lazy execution: at the start of execution, the plan is scanned up to the first action, and if no action is found, the whole program produces nothing. Otherwise, the whole program is executed on the input data, represented as RDD objects on the worker nodes, until the action is reached; all the data during this period stays on the workers, and only the result, according to the type of the action, is sent to (or at least managed by) the driver.

Transformation vs Action in the context of Laziness

As mentioned in the "Learning Spark: Lightning-Fast Big Data Analysis" book:
Transformations and actions are different because of the way Spark computes RDDs.
After some explanation about laziness, it seemed to me that both transformations and actions work lazily. The question, therefore, is: what does the quoted sentence mean?
It is not necessarily valid to contrast laziness of RDD actions vs transformations.
The correct statement would be that RDDs are lazily evaluated, from the perspective of an RDD as a collection of data: there's not necessarily "data" in memory when the RDD instance is created.
The question raised by this statement is: when does the RDD's data get loaded in memory? Which can be rephrased as "when does the RDD get evaluated?". It's here that we have the distinction between actions and transformations:
Consider the following sequence of code:
Line #1:
rdd = sc.textFile("text-file-path")
Does the RDD exist? Yes.
Is the data loaded in memory? No.
--> RDD evaluation is lazy
Line #2:
rdd2 = rdd.map(lambda line: line.split())
Does the RDD exist? Yes. In fact, there are 2 RDDs.
Is the data loaded in memory? No.
--> Still, it's lazy, all Spark does is record how to load the data and transform it, remembering the lineage (how to derive RDDs one from another).
Line #3:
print(rdd2.collect())
Does the RDD exist? Yes (2 RDDs still).
Is the data loaded in memory? Yes.
What's the difference? collect() forces Spark to return the result of the transformations. Spark now carries out everything it recorded in #1 and #2 and returns the result to the caller in #3.
In spark's terminology, #1 and #2 are transformations.
Transformations typically return another RDD instance, and that's a hint for recognizing the lazy part.
#3 has an action, which simply means an operation that causes the plans recorded by the transformations to be carried out, in order to return a result or perform a final action such as saving results (yes, "such as saving the actual collection of data loaded in memory").
So, in short, I'd say that RDDs are lazily evaluated, but, in my opinion, it's incorrect to label operations (actions or transformations) as lazy or not.
Transformations are lazy, actions are not.
Definitions:
Transformation - A function that transforms the data out on the cluster, producing a new RDD from an existing one. Examples of this are map, filter, and aggregateByKey. These are not executed until an action is called.
Action - Any function that results in data being persisted or returned to the driver (also foreach, which doesn't really fall into those two categories).
In order to run an action (like saving the data), all the transformations you have requested up to that point have to be run to materialize the data. Spark can apply optimizations when it sees the total execution plan of the operations you want to run, so it is beneficial not to compute anything until it is required.
