How does lineage get passed down in RDDs in Apache Spark

Does each RDD point to the same lineage graph,
or
when a parent RDD gives its lineage to a new RDD, is the lineage graph copied by the child as well, so that the parent and the child have different graphs? In that case, isn't it memory intensive?

Each RDD maintains a pointer to one or more parents, along with metadata about the type of relationship it has with each parent. For example, when we call val b = a.map() on an RDD, the RDD b just keeps a reference to its parent a (it never copies it); that's its lineage.
When the driver submits a job, the RDD graph is serialized to the worker nodes so that each worker node can apply the series of transformations (map, filter, etc.) to its partitions. This RDD lineage is also used to recompute the data if a failure occurs.
To display the lineage of an RDD, Spark provides the toDebugString() method.
Consider the following example:
val input = sc.textFile("log.txt")
val splitedLines = input.map(line => line.split(" "))
.map(words => (words(0), 1))
.reduceByKey{(a,b) => a + b}
Executing toDebugString() on the splitedLines RDD will output the following:
(2) ShuffledRDD[6] at reduceByKey at <console>:25 []
+-(2) MapPartitionsRDD[5] at map at <console>:24 []
| MapPartitionsRDD[4] at map at <console>:23 []
| log.txt MapPartitionsRDD[1] at textFile at <console>:21 []
| log.txt HadoopRDD[0] at textFile at <console>:21 []
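You can also inspect those parent pointers directly via RDD.dependencies, which returns the dependency objects an RDD keeps for its parents. A minimal sketch (the exact dependency class names shown in the comments may vary by Spark version):
val a = sc.textFile("log.txt")
val b = a.map(line => line.length)       // narrow dependency on a
val c = b.groupBy(len => len % 2 == 0)   // shuffle dependency on b

// Each RDD holds only references to its parents plus the dependency type:
b.dependencies.foreach(d => println(d.getClass.getSimpleName)) // e.g. OneToOneDependency
c.dependencies.foreach(d => println(d.getClass.getSimpleName)) // e.g. ShuffleDependency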
For more information about how Spark works internally, please read my other post.

When a transformation (map, filter, etc.) is called, it is not executed by Spark immediately; instead, a lineage is created for each transformation.
The lineage keeps track of all the transformations that have to be applied to that RDD,
including the location from which it has to read the data.
For example, consider the following:
val myRdd = sc.textFile("spam.txt")
val filteredRdd = myRdd.filter(line => line.contains("wonder"))
filteredRdd.count()
sc.textFile() and myRdd.filter() do not get executed immediately;
they are executed only when an Action is called on the RDD, here filteredRdd.count().
An Action is used either to save a result to some location or to display it.
The RDD lineage information can also be printed using filteredRdd.toDebugString (where filteredRdd is the RDD in question).
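For the spam.txt example above, the output would look roughly like this (a sketch; the RDD ids and console line numbers are illustrative and will differ between sessions):
scala> filteredRdd.toDebugString
res0: String =
(2) MapPartitionsRDD[2] at filter at <console>:26 []
 |  spam.txt MapPartitionsRDD[1] at textFile at <console>:24 []
 |  spam.txt HadoopRDD[0] at textFile at <console>:24 []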
Also, the DAG Visualization in the Spark UI shows the complete graph in a very intuitive manner.

Related

When is an RDD lineage created? How do I find the lineage graph?

I am learning Apache Spark and trying to get the lineage graph of the RDDs.
But I could not find out when a particular lineage is created.
Also, where do I find the lineage of an RDD?
RDD Lineage is the logical execution plan of a distributed computation that is created and expanded every time you apply a transformation on any RDD.
Note the word "logical", not "physical"; the physical execution only happens after you've executed an action.
Quoting Mastering Apache Spark 2 gitbook:
RDD Lineage (aka RDD operator graph or RDD dependency graph) is a graph of all the parent RDDs of a RDD. It is built as a result of applying transformations to the RDD and creates a logical execution plan.
An RDD lineage graph is hence a graph of the transformations that need to be executed after an action has been called.
Any RDD has an RDD lineage, even if that means the lineage is just a single node, i.e. the RDD itself. That's because an RDD may or may not be the result of a series of transformations (and no transformations at all is a "zero-effect" transformation :))
You can check out the RDD lineage of an RDD using RDD.toDebugString:
toDebugString: String A description of this RDD and its recursive dependencies for debugging.
val nums = sc.parallelize(0 to 9)
scala> nums.toDebugString
res0: String = (8) ParallelCollectionRDD[0] at parallelize at <console>:24 []
val doubles = nums.map(_ * 2)
scala> doubles.toDebugString
res1: String =
(8) MapPartitionsRDD[1] at map at <console>:25 []
| ParallelCollectionRDD[0] at parallelize at <console>:24 []
val groups = doubles.groupBy(_ < 10)
scala> groups.toDebugString
res2: String =
(8) ShuffledRDD[3] at groupBy at <console>:25 []
+-(8) MapPartitionsRDD[2] at groupBy at <console>:25 []
| MapPartitionsRDD[1] at map at <console>:25 []
| ParallelCollectionRDD[0] at parallelize at <console>:24 []

Is Dataset.rdd an action or transformation?

One of the ways to evaluate whether a DataFrame is empty or not is to do df.rdd.isEmpty(); however, I see rdd at mycode.scala:123 among the executions in the Spark UI, which makes me wonder whether this rdd() function is actually an action instead of a transformation.
I know that isEmpty() is an action, but I also see a separate stage with isEmpty() at mycode.scala:234, so I think they are different actions?
rdd is generated to represent a structured query in "RDD terms" so Spark can execute it. It is an RDD of JVM objects of your type T. If used incorrectly, it can cause memory problems since it:
Transfers internally-managed optimized rows that live outside JVM to the memory space in JVM
Transforms the binary rows to your business objects (the JVM "true" representation)
The first will increase the JVM memory required for the computation while the latter is an extra transformation step.
For such a simple calculation where you count the number of rows, you'd rather stick to count as the optimized and fairly cheap computation (that can avoid copying objects and applying schema).
Internally, Dataset keeps rows in their binary InternalRow format, which decreases the JVM memory requirements of your Spark application. The RDD (from rdd) is computed to represent the Spark transformations that are going to be executed once a Spark action is executed.
Please note that executing rdd creates an RDD and does require some computation too.
So, yes, rdd might be considered an action as it "executes" the query (i.e. the physical plan of the Dataset that sits behind), but in the end it just gives an RDD (so it can't be an action by definition since Spark actions return a non-RDD value).
As you can see in the code:
lazy val rdd: RDD[T] = {
  val objectType = exprEnc.deserializer.dataType
  val deserialized = CatalystSerde.deserialize[T](logicalPlan) // <-- HERE, see explanation below
  sparkSession.sessionState.executePlan(deserialized).toRdd.mapPartitions { rows =>
    rows.map(_.get(0, objectType).asInstanceOf[T])
  }
}
rdd is computed lazily and only once.
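You can observe that caching directly; a small sketch (eq compares JVM object identity, so it shows the same RDD instance being reused):
val ds = spark.range(5)
val rdd1 = ds.rdd
val rdd2 = ds.rdd
println(rdd1 eq rdd2) // true -- rdd is a lazy val, so it is computed once and reused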
One of the ways to evaluate whether a DataFrame is empty or not is to do df.rdd.isEmpty()
I wonder where you found that. I'd just count:
count(): Long Returns the number of rows in the Dataset.
toRdd Lazy Value
If you insist on going fairly low-level to check whether your Dataset is empty or not, I'd rather use Dataset.queryExecution.toRdd instead. That's almost like rdd without this extra copying and applying schema.
df.queryExecution.toRdd.isEmpty
Compare the following RDD lineages and think which may seem better.
val dataset = spark.range(5).withColumn("group", 'id % 2)
scala> dataset.rdd.toDebugString
res1: String =
(8) MapPartitionsRDD[8] at rdd at <console>:26 [] // <-- extra deserialization step
| MapPartitionsRDD[7] at rdd at <console>:26 []
| MapPartitionsRDD[6] at rdd at <console>:26 []
| MapPartitionsRDD[5] at rdd at <console>:26 []
| ParallelCollectionRDD[4] at rdd at <console>:26 []
// Compare with a more memory-optimized alternative
// Avoids copies and has no schema
scala> dataset.queryExecution.toRdd.toDebugString
res2: String =
(8) MapPartitionsRDD[11] at toRdd at <console>:26 []
| MapPartitionsRDD[10] at toRdd at <console>:26 []
| ParallelCollectionRDD[9] at toRdd at <console>:26 []
From Spark's perspective, the transformations are fairly cheap since they don't cause any shuffles, but given that the memory requirements differ between the two computations, I'd use the latter (with toRdd).
rdd Lazy Value
rdd represents the content of the Dataset as a (lazily-created) RDD with rows of the JVM type T.
rdd: RDD[T]
As you can see in the source code (pasted above), requesting rdd in the end triggers one extra computation just to get the RDD.
It creates a new logical plan to deserialize the Dataset's logical plan, i.e. you get an extra deserialization step from the internal binary row format that is managed outside the JVM to its corresponding representation as JVM objects living inside the JVM (think of the GC pressure you should avoid at all costs).

What happens to previous RDD when the next RDD is materialized?

In Spark, I would like to know what happens to the previous RDD when the next RDD is materialized.
Let's say I have the Scala code below:
val lines = sc.textFile("/user/cloudera/data.txt")
val lineLengths = lines.map(s => s.length)
val totalLength = lineLengths.reduce((a, b) => a + b)
lines is a base RDD,
and similarly I have the lineLengths RDD.
I know that these two RDDs get materialized when the reduce Action is invoked.
My question is: while the data is flowing through these two RDDs, what happens to lines when lineLengths gets materialized?
Once lineLengths is materialized, does the data inside lines get removed?
Let's say a production Spark job might have 100 RDDs, with a single Action called against the 100th RDD.
What happens to the data in the 1st RDD when the 99th RDD gets materialized?
Does the data in all the RDDs get deleted only once the final Action has returned its output?
Or
Does the data in each RDD get removed automatically once that RDD has passed its data to the next RDD as per the DAG?
Actually, both lines and lineLengths will still hold their RDDs after the reduce. You can think of an RDD as a DAG of transformations, as you mentioned. So if you later want to perform some other transformations on lines or lineLengths, you can. Even though they are materialized during the reduce, unless you cache them explicitly, their transformations will be run again when another action is invoked on a DAG they belong to.
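If you do want later actions to reuse the computed data instead of re-running the lineage, mark the RDD for caching. A minimal sketch (cache() only marks the RDD; the data is actually materialized on the first action):
val lines = sc.textFile("/user/cloudera/data.txt").cache() // mark lines for caching
val lineLengths = lines.map(s => s.length)

lineLengths.reduce((a, b) => a + b)  // first action: lines is read, computed, and cached
lines.filter(_.nonEmpty).count()     // second action: reuses the cached partitions of lines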

Adding new elements to a batch RDD from a DStream RDD

The only way to join/union/cogroup a DStream RDD with a Batch RDD is via the "transform" method, which returns another DStream RDD and hence gets discarded at the end of the micro-batch.
Is there any way to e.g. union a DStream RDD with a Batch RDD, producing a new Batch RDD that contains the elements of both the DStream RDD and the Batch RDD?
And once such a Batch RDD is created in the above way, can it be used by other DStream RDDs to e.g. join with, as this time the result can be another DStream RDD?
Effectively, the functionality described above would result in periodic updates (additions) of elements to a Batch RDD; the additional elements keep coming from DStream RDDs which keep streaming in with every micro-batch.
Also, newly arriving DStream RDDs would be able to join with the previously updated Batch RDD and produce a result DStream RDD.
Something almost like that can be achieved with updateStateByKey, but is there a way to do it as described here?
Another approach would be to transform the batch input into a DStream and union it with your streaming input. Then you write it out using foreachRDD, which is now your batch input to other jobs.
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

val batch = sc.textFile(...)
val ssc = new StreamingContext(sc, Seconds(30))
val stream = ssc.textFileStream(...)
import scala.collection.mutable
val batchStream = ssc.queueStream(mutable.Queue.empty[RDD[String]], oneAtATime = false, defaultRDD = batch) // empty queue, so the default (batch) RDD is returned in every interval
val union = ssc.union(Seq(stream, batchStream))
union.print()
union.foreachRDD { rdd =>
  // Delete the previous output, or use a SchemaRDD with .insertInto(..., overwrite = true)
  rdd.saveAsTextFile(...)
}
ssc.start()
ssc.awaitTermination()

Apache Spark applying map transformation on RDDs

I have a HadoopRDD from which I'm creating a first RDD with a simple map function, and then a second RDD from the first RDD with another simple map function. Something like:
HadoopRDD -> RDD1 -> RDD2.
My question is whether Spark will iterate over the HadoopRDD record by record to generate RDD1 and then iterate over RDD1 record by record to generate RDD2, or whether it iterates over the HadoopRDD and then generates RDD1 and then RDD2 in one go.
Short answer: rdd.map(f).map(g) will be executed in one pass.
In more detail:
Spark splits a job into stages. A stage applied to a partition of data is a task.
In a stage, Spark will try to pipeline as many operations as possible. "Possible" is determined by the need to rearrange data: an operation that requires a shuffle will typically break the pipeline and create a new stage.
In practical terms:
Given `rdd.map(...).map(...).filter(...).sort(...).map(...)`,
Spark will create two stages:
.map(...).map(..).filter(...)
.sort(...).map(...)
This can be retrieved from an RDD using rdd.toDebugString.
The same job as in the example above produces this output:
val mapped = rdd.map(identity).map(identity).filter(_>0).sortBy(x=>x).map(identity)
scala> mapped.toDebugString
res0: String =
(6) MappedRDD[9] at map at <console>:14 []
| MappedRDD[8] at sortBy at <console>:14 []
| ShuffledRDD[7] at sortBy at <console>:14 []
+-(8) MappedRDD[4] at sortBy at <console>:14 []
| FilteredRDD[3] at filter at <console>:14 []
| MappedRDD[2] at map at <console>:14 []
| MappedRDD[1] at map at <console>:14 []
| ParallelCollectionRDD[0] at parallelize at <console>:12 []
Now, coming to the key point of your question: pipelining is very efficient. The complete pipeline will be applied to each element of each partition once. This means that rdd.map(f).map(g) will perform as fast as rdd.map(f andThen g) (with some negligible overhead).
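A quick way to convince yourself, as a sketch (both versions run as a single narrow stage and produce the same result):
val f = (x: Int) => x + 1
val g = (x: Int) => x * 2

val chained  = sc.parallelize(1 to 5).map(f).map(g)    // two chained MapPartitionsRDDs, one stage
val composed = sc.parallelize(1 to 5).map(f andThen g) // a single MapPartitionsRDD, one stage

println(chained.collect().sameElements(composed.collect())) // true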
Apache Spark will iterate over the HadoopRDD record by record in no specific order (data will be split and sent to the workers) and "apply" the first transformation to compute RDD1. After that, the second transformation is applied to each element of RDD1 to get RDD2, again in no specific order, and so on for successive transformations. You can notice it from the map method signature:
// Return a new RDD by applying a function to all elements of this RDD.
def map[U](f: (T) ⇒ U)(implicit arg0: ClassTag[U]): RDD[U]
Apache Spark follows a DAG (Directed Acyclic Graph) execution engine. It won't actually trigger any computation until a value is required, so you have to distinguish between transformations and actions.
EDIT:
In terms of performance, I am not completely aware of the underlying implementation of Spark, but I understand there shouldn't be a significant performance loss other than adding extra (unnecessary) tasks in the related stage. In my experience, you don't normally use transformations of the same "nature" successively (in this case, two successive maps). You should be more concerned about performance when shuffling operations take place, because you are moving data around, and that has a clear impact on your job's performance. Here you can find a common issue regarding that.
