What happens to the previous RDD when the next RDD is materialized? - apache-spark

In Spark, I would like to know what happens to the previous RDD when the next RDD is materialized.
Let's say I have the below Scala code:
val lines = sc.textFile("/user/cloudera/data.txt")
val lineLengths = lines.map(s => s.length)
val totalLength = lineLengths.reduce((a, b) => a + b)
Here lines is a base RDD,
and similarly lineLengths is a derived RDD.
I know that both RDDs get materialized when the reduce action is invoked.
My question is: while the data is flowing through these two RDDs, what happens to lines when lineLengths gets materialized?
Once lineLengths is materialized, does the data inside lines get removed?
Let's say a production Spark job has 100 RDDs and a single action is called against the 100th RDD.
What happens to the data in the 1st RDD when the 99th RDD gets materialized?
Is the data in all RDDs deleted only after the final action has returned its output?
Or
Is the data in each RDD removed automatically once that RDD passes its data to the next RDD, as per the DAG?

Actually both lines and lineLengths will still hold their RDDs after the reduce. You can think of an RDD as a DAG of transformations, as you mentioned. So if later you would like to perform some other transformation on lines or lineLengths, you can. Even though they materialize during the reduce, unless you cache them explicitly, they will run through their transformations again whenever another action is invoked on a DAG they belong to.
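For example, here is a minimal sketch (reusing the code above) of how re-computation versus caching plays out:
val lines = sc.textFile("/user/cloudera/data.txt")
val lineLengths = lines.map(s => s.length)

val totalLength = lineLengths.reduce((a, b) => a + b)        // reads the file, maps, reduces
val maxLength = lineLengths.reduce((a, b) => math.max(a, b)) // reads the file and maps again

lineLengths.cache()                            // or persist(StorageLevel.MEMORY_ONLY)
lineLengths.reduce((a, b) => a + b)            // materializes and caches lineLengths
lineLengths.reduce((a, b) => math.max(a, b))   // reuses the cached partitions, no re-read of the file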

Related

Is Dataset.rdd an action or transformation?

One of the ways to evaluate whether a DataFrame is empty is to do df.rdd.isEmpty(). However, I see rdd at mycode.scala:123 in the Spark UI executions, which makes me wonder whether this rdd() function is actually an action instead of a transformation.
I know that isEmpty() is an action, but I also see a separate stage showing isEmpty() at mycode.scala:234, so I think they are different actions?
rdd is generated to represent a structured query in "RDD terms" so Spark can execute it. It is an RDD of JVM objects of your type T. If used incorrectly, it can cause memory problems since it:
Transfers internally-managed, optimized rows that live outside the JVM into the JVM's memory space
Transforms the binary rows into your business objects (the "true" JVM representation)
The first increases the JVM memory required for the computation, while the latter is an extra transformation step.
For such a simple calculation where you count the number of rows, you'd rather stick to count as the optimized and fairly cheap computation (that can avoid copying objects and applying schema).
Internally, a Dataset keeps rows in their InternalRow format, which keeps the JVM memory requirement of your Spark application low. The RDD (from rdd) is computed to represent the Spark transformations that are going to be executed once a Spark action is invoked.
Please note that executing rdd creates an RDD and requires some computation too.
So, yes, rdd might be considered an action as it "executes" the query (i.e. the physical plan that sits behind the Dataset), but in the end it just gives an RDD (so it can't be an action by definition, since Spark actions return a non-RDD value).
As you can see in the code:
lazy val rdd: RDD[T] = {
  val objectType = exprEnc.deserializer.dataType
  val deserialized = CatalystSerde.deserialize[T](logicalPlan) // <-- HERE see explanation below
  sparkSession.sessionState.executePlan(deserialized).toRdd.mapPartitions { rows =>
    rows.map(_.get(0, objectType).asInstanceOf[T])
  }
}
rdd is computed lazily and only once.
one of the ways to evaluate if a dataframe is empty or not is to do df.rdd.isEmpty()
I wonder where you found that. I'd just use count:
count(): Long Returns the number of rows in the Dataset.
toRdd Lazy Value
If you insist on going fairly low-level to check whether your Dataset is empty or not, I'd rather use Dataset.queryExecution.toRdd instead. That's almost like rdd, but without the extra copying and schema application.
df.queryExecution.toRdd.isEmpty
Compare the following RDD lineages and consider which seems better.
val dataset = spark.range(5).withColumn("group", 'id % 2)
scala> dataset.rdd.toDebugString
res1: String =
(8) MapPartitionsRDD[8] at rdd at <console>:26 [] // <-- extra deserialization step
| MapPartitionsRDD[7] at rdd at <console>:26 []
| MapPartitionsRDD[6] at rdd at <console>:26 []
| MapPartitionsRDD[5] at rdd at <console>:26 []
| ParallelCollectionRDD[4] at rdd at <console>:26 []
// Compare with a more memory-optimized alternative
// Avoids copies and has no schema
scala> dataset.queryExecution.toRdd.toDebugString
res2: String =
(8) MapPartitionsRDD[11] at toRdd at <console>:26 []
| MapPartitionsRDD[10] at toRdd at <console>:26 []
| ParallelCollectionRDD[9] at toRdd at <console>:26 []
From a Spark perspective, the transformations are fairly cheap since they don't cause any shuffles, but given how the memory requirements differ between the two computations, I'd use the latter (with toRdd).
rdd Lazy Value
rdd represents the content of the Dataset as a (lazily-created) RDD with rows of the JVM type T.
rdd: RDD[T]
As you can see in the source code (pasted above), requesting rdd in the end will trigger one extra computation just to get the RDD.
It creates a new logical plan to deserialize the Dataset's logical plan, i.e. you get an extra deserialization step from the internal binary row format (managed outside the JVM) to its corresponding representation as JVM objects living inside the JVM (think of the GC pressure you should avoid at all costs).
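To sum up, here is a minimal sketch of the three options discussed above, assuming a DataFrame df is in scope:
// count: stays in the optimized, internal binary representation
df.count() == 0

// queryExecution.toRdd: low-level RDD of InternalRow, no deserialization to T
df.queryExecution.toRdd.isEmpty()

// rdd: triggers the extra deserialization from InternalRow to JVM objects
df.rdd.isEmpty()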

In spark Streaming how to reload a lookup non stream rdd after n batches

Suppose I have a streaming context which does a lot of steps and then, at the end of the micro-batch, looks up or joins to a preloaded RDD. I have to refresh that preloaded RDD every 12 hours. How can I do this? To my understanding, anything I do which does not relate to the streaming context is not replayed, so how do I get this called from one of the streaming RDDs? I need to make only one call, no matter how many partitions the streaming DStream has.
This is possible by re-creating the external RDD at the time it needs to be reloaded. It requires defining a mutable variable to hold the RDD reference that's active at a given moment in time. Within the dstream.foreachRDD we can then check for the moment when the RDD reference needs to be refreshed.
This is an example of how that could look:
val stream:DStream[Int] = ??? //let's say that we have some DStream of Ints
// Some external data as an RDD of (x,x)
def externalData(): RDD[(Int,Int)] = sparkContext.textFile(dataFile)
  .flatMap { line => try { Some((line.toInt, line.toInt)) } catch { case ex: Throwable => None } }
  .cache()
// this mutable var will hold the reference to the external data RDD
var cache:RDD[(Int,Int)] = externalData()
// force materialization - useful for experimenting, not needed in reality
cache.count()
// a var to count iterations -- use to trigger the reload in this example
var tick = 1
// reload frequency
val ReloadFrequency = 5
stream.foreachRDD { rdd =>
  if (tick == 0) { // will reload the RDD every 5 iterations
    // unpersist the previous RDD, otherwise it will linger in memory, taking up resources.
    cache.unpersist(false)
    // generate a new RDD
    cache = externalData()
  }
  // join the DStream RDD with our reference data, do something with it...
  val matches = rdd.keyBy(identity).join(cache).count()
  updateData(dataFile, (matches + 1).toInt) // so I'm adding data to the static file in order to see when the new records become alive
  tick = (tick + 1) % ReloadFrequency
}
streaming.start
Before coming up with this solution, I studied the possibility of playing with the persist flag on the RDD, but it didn't work as expected. It looks like unpersist() does not force re-materialization of the RDD when it's used again.

Apache Spark: stepwise execution

For a performance measurement I want to execute my Scala program written for Spark stepwise, i.e.
execute first operator; materialize result;
execute second operator; materialize result;
...
and so on. The original code:
var filename = new String("<filename>")
var text_file = sc.textFile(filename)
var counts = text_file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
counts.saveAsTextFile("file://result")
So I want the execution of var counts = text_file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b) to be stepwise.
Is calling counts.foreachPartition(x => {}) after every operator the right way to do it?
Or is writing to /dev/null with saveAsTextFile() a better alternative? And does Spark actually have something like a NullSink for that purpose? I wasn't able to write to /dev/null with saveAsTextFile() because /dev/null already exists. Is there a way to overwrite the Spark result folder?
And should the temporary result after each operation be cached with cache()?
What is the best way to separate the execution?
Spark supports two types of operations: actions and transformations. Transformations, as the name implies, turn datasets into new ones through the combination of the transformation operator and (in some cases, optionally) a function provided to the transformation. Actions, on the other hand, run through a dataset with some computation to provide a value to the driver.
There are two things Spark does that make your desired task a little difficult: it bundles non-shuffling transformations into execution blocks called stages, and stages in the scheduling graph must be triggered through actions.
For your case, provided your input isn't massive, I think it would be easiest to trigger your transformations with a dummy action (e.g. count(), collect()) as the RDD will be materialized. During RDD computation, you can check the Spark UI to gather any performance statistics about the steps/stages/jobs used to create it.
This would look something like:
val text_file = sc.textFile(filename)
val words = text_file.flatMap(line => line.split(" "))
words.count()
val wordCount = words.map(word => (word, 1))
wordCount.count()
val wordCounts = wordCount.reduceByKey(_ + _)
wordCounts.count()
Some notes:
Since RDDs are, for all intents and purposes, immutable, they should be stored in vals
You can shorten your reduceByKey() syntax with underscore notation
Your approach with foreachPartition() could work since it is an action, but it would require a change in your functions since you are operating over an iterator on your partition
Caching only makes sense if you either create multiple RDDs from a parent RDD (branching out) or run iterated computations over the same RDD (perhaps in a loop)
You can also simply invoke RDD.persist() or RDD.cache() after every transformation, but make sure that you have the right StorageLevel defined (see the sketch below).
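A minimal sketch of persisting each intermediate RDD with an explicit StorageLevel, reusing text_file from the snippet above (which level to pick depends on how much memory you can spare):
import org.apache.spark.storage.StorageLevel

val words = text_file.flatMap(line => line.split(" "))
  .persist(StorageLevel.MEMORY_AND_DISK)   // spills to disk if the partitions don't fit in memory
words.count()                              // materializes and stores the partitions

val wordCounts = words.map(word => (word, 1)).reduceByKey(_ + _)
  .persist(StorageLevel.MEMORY_ONLY)       // the same level that cache() uses
wordCounts.count()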

How does lineage get passed down in RDDs in Apache Spark

Does each RDD point to the same lineage graph,
or,
when a parent RDD gives its lineage to a new RDD, is the lineage graph copied by the child as well, so that the parent and the child have different graphs? In that case, isn't it memory intensive?
Each RDD maintains a pointer to one or more parents, along with metadata about what type of relationship it has with each parent. For example, when we call val b = a.map() on an RDD, the RDD b just keeps a reference to (and never a copy of) its parent a; that's the lineage.
And when the driver submits the job, the RDD graph is serialized to the worker nodes so that each worker node can apply the series of transformations (map, filter, etc.) on different partitions. This RDD lineage is also used to recompute the data if a failure occurs.
To display the lineage of an RDD, Spark provides the debug method toDebugString().
Consider the following example:
val input = sc.textFile("log.txt")
val splitedLines = input.map(line => line.split(" "))
  .map(words => (words(0), 1))
  .reduceByKey { (a, b) => a + b }
Executing toDebugString() on the splitedLines RDD will output the following:
(2) ShuffledRDD[6] at reduceByKey at <console>:25 []
+-(2) MapPartitionsRDD[5] at map at <console>:24 []
| MapPartitionsRDD[4] at map at <console>:23 []
| log.txt MapPartitionsRDD[1] at textFile at <console>:21 []
| log.txt HadoopRDD[0] at textFile at <console>:21 []
For more information about how Spark works internally, please read my other post.
When a transformation (map, filter, etc.) is called, it is not executed by Spark immediately; instead, a lineage is created for each transformation.
The lineage keeps track of all the transformations that have to be applied on that RDD,
including the location from which it has to read the data.
For example, consider the following:
val myRdd = sc.textFile("spam.txt")
val filteredRdd = myRdd.filter(line => line.contains("wonder"))
filteredRdd.count()
sc.textFile() and myRdd.filter() do not get executed immediately;
they are executed only when an action is called on the RDD - here filteredRdd.count().
An action is used to either save the result to some location or to display it.
RDD lineage information can also be printed by using the command filteredRdd.toDebugString (filteredRdd is the RDD here).
Also, the DAG Visualization in the Spark UI shows the complete graph in a very intuitive manner.
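For reference, a rough sketch of what filteredRdd.toDebugString would print for the example above (RDD ids, partition count, and console line numbers will differ per session):
scala> filteredRdd.toDebugString
res0: String =
(2) MapPartitionsRDD[2] at filter at <console>:26 []
 |  spam.txt MapPartitionsRDD[1] at textFile at <console>:24 []
 |  spam.txt HadoopRDD[0] at textFile at <console>:24 []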

Does the DStream returned by the updateStateByKey function contain only one RDD?

Does the DStream returned by the updateStateByKey function contain only one RDD? If not, under what circumstances will the DStream contain more than one RDD?
It contains one RDD per batch. The DStream returned by updateStateByKey is a "state" DStream. You can still view this DStream as a normal DStream, though. For every batch, the RDD represents the latest state (key-value pairs) according to the update function that you pass in to updateStateByKey.
That doesn't seem to match what you said: the code below, which is part of my application, only prints once per batch, so I think every stateful DStream has just one RDD.
@transient val statefulDStream = lines.transform(...).map(x => (x, 1)).updateStateByKey(updateFuncs)
statefulDStream.foreachRDD { rdd =>
  println(rdd.first())
}
Yes, the DStream returned by updateStateByKey only has one RDD per batch.
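For context, a minimal sketch of what an update function like updateFuncs could look like (a running count per key); here ssc, lines, and the checkpoint path are assumed placeholders, and checkpointing is required for stateful operations:
import org.apache.spark.streaming.dstream.DStream

// running-count update function: new values for a key plus the key's previous state
def updateFuncs(newValues: Seq[Int], state: Option[Int]): Option[Int] =
  Some(newValues.sum + state.getOrElse(0))

ssc.checkpoint("/tmp/checkpoint")   // stateful DStreams require a checkpoint directory

val pairs: DStream[(String, Int)] = lines.map(x => (x, 1))
val stateDStream = pairs.updateStateByKey(updateFuncs _)

// every batch interval yields exactly one state RDD holding the latest counts
stateDStream.foreachRDD { rdd => println(rdd.take(5).mkString(", ")) }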
