spark - RDD process twice after persist - apache-spark

I made an RDD and created another RDD from the original one, like below.
val RDD2 = RDD1.map({
  println("RDD1")
  ....
}).persist(StorageLevel.MEMORY_AND_DISK)

RDD2.foreach({
  println("RDD2")
  ...
})
...so on..
I expected RDD1's processing to run ONLY once, because RDD1 is saved to memory or disk by the persist method.
BUT somehow "RDD1" is printed again after "RDD2" is printed, like below.
RDD1
RDD1
RDD1
RDD1
RDD2
RDD2
RDD2
RDD2
RDD2
RDD1 -- repeat RDD1 process. WHY?
RDD1
RDD1
RDD1
RDD2
RDD2
RDD2
RDD2
RDD2

This is the expected behaviour of Spark. Like most operations, persist in Spark is also lazy. So even if you add persist on the first RDD, Spark doesn't cache the data unless you add an action after the persist call. The map operation is not an action in Spark; it is lazy as well.
One way to enforce the caching is to add a count action after the persist of RDD2:
val RDD2 = RDD1.map({
  println("RDD1")
  ....
}).persist(StorageLevel.MEMORY_AND_DISK)

RDD2.count // Forces the caching
Now if you perform any other operation, RDD2 won't be recomputed.
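A minimal standalone sketch of the same fix (the RDD contents here are illustrative and not from the question; assuming local mode so the println output from the map closure is visible in the console):
```
import org.apache.spark.storage.StorageLevel

val rdd1 = sc.parallelize(1 to 5)

// The println runs every time a partition of rdd2 is actually computed.
val rdd2 = rdd1.map { x => println(s"computing $x"); x * 2 }
  .persist(StorageLevel.MEMORY_AND_DISK)

rdd2.count()          // first action: "computing ..." lines appear and the results are cached
rdd2.foreach(_ => ()) // second action: served from the cache, nothing is printed again
```
Any first action would populate the cache; count is simply a cheap way to force it up front.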

Related

Persistence of RDD

Consider the following code.
val rdd1 = sc.textFile("...").persist()
val rdd2 = rdd1.map(_.length).persist()
val cnt = rdd2.count()
val rdd3 = rdd1.map(_.split(" ")).persist()
After rdd2.count() is called, is rdd1 persisted? Or is rdd1 persisted only if an action is called on it?
rdd1 is persisted during the rdd2.count action, because that action evaluates the whole lineage, which includes rdd1.
You can check the Spark UI to better understand the DAG.
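If you want to check this outside the UI as well, one option (a sketch; getRDDStorageInfo is a developer API and its exact fields can differ between Spark versions) is to inspect the storage info right after the count:
```
// After rdd2.count() above:
sc.getRDDStorageInfo.foreach { info =>
  println(s"RDD ${info.id}: ${info.numCachedPartitions}/${info.numPartitions} partitions cached")
}
// Both rdd1 and rdd2 should report cached partitions at this point.
```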

Shuffle intermediate files shared between jobs?

Referring to https://spark.apache.org/docs/1.6.2/programming-guide.html#performance-impact
Shuffle also generates a large number of intermediate files on disk. As of Spark 1.3, these files are preserved until the corresponding RDDs are no longer used and are garbage collected. This is done so the shuffle files don’t need to be re-created if the lineage is re-computed
I understand why these files will be retained. However, I can't seem to figure out whether these intermediate files are shared between jobs.
My experimentations show that these shuffle files are NOT shared between jobs. Can anyone confirm?
The scenario I am talking about:
```
val rdd1 = sc.text...
val rdd2 = sc.text...
val rdd3 = rdd1.join(rdd2)
// at this point the shuffle takes place

// Now, if I do this again:
val rdd4 = rdd1.join(rdd2)
// will the shuffle files be reused? And I think I've got the answer,
// which is no, since the RDDs do not share the lineage
```
Between jobs - yes. That's the whole purpose of preserving shuffle files (see: What does "Stage Skipped" mean in Apache Spark web UI?). Consider the following session transcript:
scala> val rdd1 = sc.parallelize(Seq((1, None), (2, None)), 4)
rdd1: org.apache.spark.rdd.RDD[(Int, None.type)] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> val rdd2 = sc.parallelize(Seq((1, None), (2, None)), 4)
rdd2: org.apache.spark.rdd.RDD[(Int, None.type)] = ParallelCollectionRDD[1] at parallelize at <console>:24
scala> val rdd3 = rdd1.join(rdd2)
rdd3: org.apache.spark.rdd.RDD[(Int, (None.type, None.type))] = MapPartitionsRDD[4] at join at <console>:27
scala> rdd3.count // First job
res0: Long = 2
scala> rdd3.foreach(_ => ()) // Second job
and corresponding state of the Spark UI
Between applications - no. Shuffle files are discarded when SparkContext is closed.
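For a within-application check, a possible continuation of the session above (a sketch, not part of the original answer) submits a third job that reuses rdd3 and therefore the shuffle files already written for it:
```
// rdd4 is derived from rdd3, so this third job can reuse the shuffle output produced
// for rdd3; its map-side stages show up as "skipped" in the Spark UI.
val rdd4 = rdd3.map { case (k, _) => k }
rdd4.count // Third job
```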
The shuffle files are meant for the stages within a job. Other jobs won't be able to use these shuffle files. So, as far as I know, no, shuffle files cannot be shared between jobs.

Apache Spark DAG behaviour cogrouped operation

I would like some clarification about the DAG behaviour, and how exactly the following job is handled:
val rdd = sc.parallelize(List(1 to 10).flatMap(x => x).zipWithIndex, 3)
  .partitionBy(new HashPartitioner(4))

val rdd1 = sc.parallelize(List(1 to 10).flatMap(x => x).zipWithIndex, 2)
  .partitionBy(new HashPartitioner(3))

val rdd2 = rdd.join(rdd1)
rdd2.collect()
This is the related rdd2.toDebugString:
(4) MapPartitionsRDD[6] at join at IntegrationStatusJob.scala:92 []
 |  MapPartitionsRDD[5] at join at IntegrationStatusJob.scala:92 []
 |  CoGroupedRDD[4] at join at IntegrationStatusJob.scala:92 []
 |  ShuffledRDD[1] at partitionBy at IntegrationStatusJob.scala:90 []
 +-(3) ParallelCollectionRDD[0] at parallelize at IntegrationStatusJob.scala:90 []
 +-(3) ShuffledRDD[3] at partitionBy at IntegrationStatusJob.scala:91 []
    +-(2) ParallelCollectionRDD[2] at parallelize at IntegrationStatusJob.scala:91 []
This is the spark UI image:
Looking at the toDebugString and at the Spark UI, if I understood it correctly: in order to perform the join, the DAG looks at which partitioner should be used, and because both RDDs are HashPartitioned, it chooses the partitioner with the greater number of partitions, i.e. rdd's partitioner.
Now, from the Spark UI it seems that rdd's partitionBy and the join are performed in the same stage. Under these conditions, is the shuffle needed to perform the join done from just one side? By "one side" I mean that only rdd1 will be shuffled, not both.
Is my assumption correct?
You're right. If the two RDDs are partitioned with different partitioners, Spark picks one of them as the reference and repartitions / shuffles only the other one.
If both have the same partitioner, there is no need for a shuffle.
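For contrast, a small sketch (not from the original answer) where both sides already use the same partitioner, so the join itself adds no further shuffle:
```
import org.apache.spark.HashPartitioner

val left  = sc.parallelize((1 to 10).zipWithIndex).partitionBy(new HashPartitioner(4))
val right = sc.parallelize((1 to 10).zipWithIndex).partitionBy(new HashPartitioner(4))

// Both inputs share the partitioner, so the join sees only narrow dependencies:
// toDebugString shows no extra ShuffledRDD on top of the two partitionBy shuffles.
println(left.join(right).toDebugString)
```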

compute new RDD from 2 original RDD

I have 2 RDDs in key-value form: RDD1 is [K, V] and RDD2 is [K, U].
The set of keys K is the same for RDD1 and RDD2.
I need to map to a new RDD with [K, (U-V)/(U+V)].
My way is to first join RDD1 with RDD2:
val newRDD = RDD1.join(RDD2)
Then map the new RDD:
newRDD.map(line => (line._1, (line._2._1 - line._2._2) / (line._2._1 + line._2._2)))
The problem is that RDD1 (and RDD2) has over 100 million entries, so the join between the two sets is very expensive and takes a long time (3 minutes) to execute.
Are there any better ways to reduce the time of this task?
Try converting them to DataFrames first:
import spark.implicits._ // assuming a SparkSession named `spark`; needed for toDF and the $"..." syntax

val df1 = RDD1.toDF("v_key", "v")
val df2 = RDD2.toDF("u_key", "u")
val newDf = df1.join(df2, $"v_key" === $"u_key")
newDf.select($"v_key", ($"u" - $"v") / ($"u" + $"v")).rdd
Aside from being a lot faster (because Spark will do the optimizing for you) I think it reads better.
I should also note that if it were me, I wouldn't do the .rdd at the end -- I would leave it a DataFrame. But that's me.
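If you do keep it as a DataFrame, a possible follow-up (a sketch; the "ratio" alias is just illustrative) is to name the computed column so downstream code stays readable:
```
val result = newDf.select($"v_key", (($"u" - $"v") / ($"u" + $"v")).as("ratio"))
result.show(5) // inspect a few rows instead of converting back to an RDD
```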

Understanding Spark's caching

I'm trying to understand how Spark's cache work.
Here is my naive understanding, please let me know if I'm missing something:
val rdd1 = sc.textFile("some data")
rdd1.cache() //marks rdd1 as cached
val rdd2 = rdd1.filter(...)
val rdd3 = rdd1.map(...)
rdd2.saveAsTextFile("...")
rdd3.saveAsTextFile("...")
In the above, rdd1 will be loaded from disk (e.g. HDFS) only once (when rdd2 is saved, I assume), and then read from the cache (assuming there is enough RAM) when rdd3 is saved.
Now here is my question. Let's say I want to cache rdd2 and rdd3, as they will both be used later on, but I don't need rdd1 after creating them.
Basically there is duplication, isn't there? Once rdd2 and rdd3 are calculated, I don't need rdd1 anymore, so I should probably unpersist it, right? The question is when.
Will this work? (Option A)
val rdd1 = sc.textFile("some data")
rdd1.cache() // marks rdd as cached
val rdd2 = rdd1.filter(...)
val rdd3 = rdd1.map(...)
rdd2.cache()
rdd3.cache()
rdd1.unpersist()
Does Spark add the unpersist call to the DAG, or is it done immediately? If it's done immediately, then rdd1 will basically not be cached anymore when I read from rdd2 and rdd3, right?
Should I do it this way instead (Option B)?
val rdd1 = sc.textFile("some data")
rdd1.cache() // marks rdd as cached
val rdd2 = rdd1.filter(...)
val rdd3 = rdd1.map(...)
rdd2.cache()
rdd3.cache()
rdd2.saveAsTextFile("...")
rdd3.saveAsTextFile("...")
rdd1.unpersist()
So the question is this:
Is Option A good enough? i.e. will rdd1 still load the file only once?
Or do I need to go with Option B?
It would seem that Option B is required. The reason is related to how persist/cache and unpersist are executed by Spark. Since RDD transformations merely build DAG descriptions without executing anything, in Option A, by the time you call unpersist you still only have job descriptions and not a running execution.
This is relevant because a cache or persist call just adds the RDD to a Map of RDDs that have marked themselves to be persisted during job execution. However, unpersist directly tells the blockManager to evict the RDD from storage and removes the reference from the Map of persistent RDDs.
persist function
unpersist function
So you would need to call unpersist after Spark actually executed and stored the RDD with the block manager.
The comments for the RDD.persist method hint towards this:
rdd.persist
In Option A, you have not shown when you are calling the action (the call to save):
val rdd1 = sc.textFile("some data")
rdd1.cache() // marks rdd1 as cached
val rdd2 = rdd1.filter(...)
val rdd3 = rdd1.map(...)
rdd2.cache()
rdd3.cache()
rdd1.unpersist()
rdd2.saveAsTextFile("...")
rdd3.saveAsTextFile("...")
If the sequence is as above, Option A should use the cached version of rdd1 for computing both rdd2 and rdd3.
Option B is an optimal approach with a small tweak: use less expensive action methods. In the approach shown in your code, saveAsTextFile is an expensive operation; replace it with the count method.
The idea here is to remove the big rdd1 from the DAG if it is not relevant for further computation (after rdd2 and rdd3 are created).
Updated code:
val rdd1 = sc.textFile("some data").cache()
val rdd2 = rdd1.filter(...).cache()
val rdd3 = rdd1.map(...).cache()
rdd2.count
rdd3.count
rdd1.unpersist()
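As an optional check, not part of the original answer: getPersistentRDDs (a Spark developer API) lists the RDDs the SparkContext still tracks as persisted, so you can confirm that rdd1 was evicted while rdd2 and rdd3 stay cached:
```
// After rdd1.unpersist() above; if unpersist behaves asynchronously in your Spark
// version, rdd1.unpersist(blocking = true) waits until the blocks are actually removed.
sc.getPersistentRDDs.values.foreach(r => println(s"still persisted: RDD[${r.id}]"))
// rdd1 should no longer be listed; rdd2 and rdd3 (and their cached data) remain.
```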

Resources