Does Spark cache RDDs automatically after shuffle? - apache-spark

I tried the following code and rdd2 is computed only once. Does Spark cache all shuffled RDDs automatically?
I noticed that DataFrame shuffle results are not automatically cached.
import org.apache.spark.rdd.RDD

val rdd2: RDD[(String, Int)] = spark.sparkContext.parallelize(Array("jan"))
  .map(x => {
    println("---")
    (x, 1)
  }).reduceByKey(_ + _)
rdd2.count()
rdd2.count()

Yes, some data (the intermediate shuffle data) is persisted automatically.
The official documentation has the following sentence:
Spark also automatically persists some intermediate data in shuffle operations (e.g. reduceByKey), even without users calling persist. This is done to avoid recomputing the entire input if a node fails during the shuffle. We still recommend users call persist on the resulting RDD if they plan to reuse it.
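Even so, shuffle-file reuse is a fault-tolerance optimization, not a substitute for caching. A minimal sketch of following that recommendation (assuming a SparkSession named spark; the storage level is just an illustrative choice):

import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

val rdd2: RDD[(String, Int)] = spark.sparkContext
  .parallelize(Array("jan"))
  .map(x => { println("---"); (x, 1) })
  .reduceByKey(_ + _)
  .persist(StorageLevel.MEMORY_ONLY) // explicit cache, rather than relying on shuffle files

rdd2.count() // computes and caches the result
rdd2.count() // served from the cache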

Related

Is there a way to write RDD rows to HDFS or S3 from inside a map transformation?

I'm aware that the typical way of writing RDD or DataFrame rows to HDFS or S3 is by using saveAsTextFile or df.write. However, I would like to figure out how to write individual records from inside a map transformation like this:
myRDD.map(row => {
  if (row.contains("something")) {
    // write record to HDFS or S3
  }
  row
})
I know that this can be accomplished with the following code,
val newRDD = myRDD.filter(row => row.contains("something"))
newRDD.saveAsTextFile("myFile")
but I want to continue processing the original myRDD after writing to HDFS and that would require caching myRDD and I am low on memory resources.
I want to continue processing the original myRDD after writing to HDFS and that would require caching myRDD and I am low on memory resources.
The above statement is not correct. You can operate on an RDD further without caching if you have low memory.
You can write from inside a map() function using the Hadoop FileSystem API, but it's not a good idea to perform writes or other terminal actions inside map(); map() operations should be side-effect free. If you do need per-record writes, use mapPartitions() so that you open one output stream per partition instead of per record, as in the sketch below.
You don't need to cache an RDD for doing subsequent operations on it. Caching helps in avoiding recomputation, but RDDs are immutable. A new RDD will be created (preserving the lineage) on each and every transformation.
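A minimal sketch of writing matching records from inside mapPartitionsWithIndex with the Hadoop FileSystem API (it assumes myRDD holds plain strings; the output path is a placeholder, not from the question):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

val written = myRDD.mapPartitionsWithIndex { (idx, rows) =>
  // One output file per partition; the path is a placeholder.
  val path = new Path(s"hdfs:///tmp/matches/part-$idx")
  val out  = path.getFileSystem(new Configuration()).create(path)
  val result = rows.map { row =>
    if (row.contains("something")) out.writeBytes(row + "\n")
    row
  }.toList          // force evaluation before closing the stream
  out.close()
  result.iterator
}
written.count()      // transformations are lazy; an action triggers the writes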

Does updateStateByKey in Spark shuffle the data across nodes?

I'm a newbie in Spark and I would like to understand whether I need to aggregate the DStream data by key before calling updateStateByKey.
My application counts the number of words in every second using Spark Streaming, where I perform a couple of map operations before doing a stateful update as follows:
val words = inputDstream.flatMap(x => x.split(" "))
val wordDstream = words.map(x => (x, 1))
val stateDstream = wordDstream.updateStateByKey(UpdateFunc _)
stateDstream.print()
Say after the second map operation, the same keys (words) might be present on different worker nodes due to partitioning, so I assume that updateStateByKey internally shuffles, aggregates the values for each key into a Seq[Int], and calls the updateFunc. Is my assumption correct?
Correct: as you can see in the method signature, it takes an optional numPartitions/Partitioner argument, which denotes the number of reducers, i.e. state updaters. This leads to a shuffle.
Also, I suggest passing a number explicitly; otherwise Spark may significantly decrease your job's parallelism by trying to run tasks locally with respect to the location of the blocks of the HDFS checkpoint files.
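A sketch of that Int overload, using the question's wordDstream (the update function body and the value 8 are illustrative):

// Sums the new counts for a key into the running state.
def updateFunc(newCounts: Seq[Int], state: Option[Int]): Option[Int] =
  Some(newCounts.sum + state.getOrElse(0))

// The second argument sets the number of state partitions (reducers),
// which is where the shuffle of the (word, 1) pairs happens.
val stateDstream = wordDstream.updateStateByKey(updateFunc _, 8)
stateDstream.print()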
updateStateByKey() does not shuffle the state; rather, the new data is brought to the nodes containing the state for the same key.
Link to Tathagata's answer to a similar question: https://www.mail-archive.com/user@spark.apache.org/msg43512.html

Why does mapPartitionsWithIndex cause a shuffle in Spark?

I'm new to Spark. I'm investigating shuffling in a test application and I don't know why the mapPartitionsWithIndex method causes a shuffle in my program. As you can see in the picture, my initial RDD has two 16 MB partitions and the shuffle write is about 49.8 MB.
I know that map, mapPartitions, and mapPartitionsWithIndex are not shuffling transformations like groupByKey, but I see that they also cause a shuffle in Spark. Why?
I think you are performing some join/group operation after mapPartitionsWithIndex and that is what causes the shuffle.
You can verify this by modifying your code.
Current code:
val rdd = inputRDD1.mapPartitionsWithIndex(....)
val outRDD = rdd.join(inputRDD2)
Modified code:
val rdd = inputRDD1.mapPartitionsWithIndex(....)
println(rdd.count)
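If the shuffle write disappears from the UI with only the count, the shuffle belongs to the join stage, not to mapPartitionsWithIndex. You can also inspect the lineage directly with toDebugString; a small sketch with placeholder inputs (it assumes a SparkContext named sc):

// Placeholder RDDs just to make the lineage inspectable.
val inputRDD1 = sc.parallelize(1 to 100, 2)
val inputRDD2 = sc.parallelize(Seq((0, "a"), (1, "b")))

val mapped = inputRDD1.mapPartitionsWithIndex((idx, it) => it.map(x => (idx, x)))
println(mapped.toDebugString)   // narrow lineage only: no shuffle stage

val joined = mapped.join(inputRDD2)
println(joined.toDebugString)   // the join adds a shuffle stage (CoGroupedRDD)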

How to duplicate RDD into multiple RDDs?

Is it possible to duplicate an RDD into two or several RDDs?
I want to use the cassandra-spark driver to save an RDD into a Cassandra table and, in addition, keep going with more calculations (and eventually save the result to Cassandra as well).
RDDs are immutable and transformations on RDDs create new RDDs. Therefore, it's not necessary to create copies of an RDD to apply different operations.
You could save the base RDD to secondary storage and further apply operations to it.
This is perfectly OK:
val rdd = ???
val base = rdd.byKey(...)
base.saveToCassandra(ks, table)
val processed = base.map(...).reduceByKey(...)
processed.saveToCassandra(ks, processedTable)
val analyzed = base.map(...).join(suspectsRDD).reduceByKey(...)
analyzed.saveAsTextFile("./path/to/save")
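Since base feeds several actions, it is worth persisting it so the keyed RDD is not recomputed for each branch. A minimal sketch (the keyspace, table, transformations, and SparkContext sc are placeholders, and it assumes the spark-cassandra-connector is on the classpath):

import org.apache.spark.storage.StorageLevel
import com.datastax.spark.connector._   // adds saveToCassandra to RDDs

// Placeholder keyed RDD standing in for rdd.byKey(...) above.
val base = sc.parallelize(Seq(("jan", 1), ("feb", 2))).reduceByKey(_ + _)
base.persist(StorageLevel.MEMORY_AND_DISK)   // reused by both branches below

base.saveToCassandra("my_keyspace", "my_table")   // placeholder keyspace/table

val analyzed = base.mapValues(_ * 2).reduceByKey(_ + _)   // illustrative analysis
analyzed.saveAsTextFile("./path/to/save")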

How far will Spark RDD cache go?

Say I have three RDD transformation functions called on rdd1:
def rdd2 = rdd1.f1
def rdd3 = rdd2.f2
def rdd4 = rdd3.f3
Now I want to cache rdd4, so I call rdd4.cache().
My question:
Will only the result from the action on rdd4 be cached or will every RDD above rdd4 be cached? Say I want to cache both rdd3 and rdd4, do I need to cache them separately?
The whole idea of cache is that Spark does not keep results in memory unless you tell it to. If you cache the last RDD in the chain, it only keeps the results of that one in memory. So, yes, you do need to cache them separately, but keep in mind that you only need to cache an RDD if you are going to use it more than once. For example:
rdd4.cache()
val v1 = rdd4.lookup("key1")
val v2 = rdd4.lookup("key2")
If you do not call cache in this case, rdd4 will be recalculated for every call to lookup (or any other action that requires evaluation). You might want to read the paper on RDDs; it is pretty easy to understand and explains the ideas behind certain choices made in how RDDs work.
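A small sketch of caching two levels of the chain separately (the transformations and the SparkSession name spark are placeholders):

val rdd1 = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val rdd2 = rdd1.mapValues(_ + 1)
val rdd3 = rdd2.filter(_._2 > 2).cache()    // cached explicitly because it is reused
val rdd4 = rdd3.mapValues(_ * 10).cache()   // caching rdd4 does not cache rdd3

rdd4.count()     // materializes the chain and fills both caches
rdd3.count()     // served from rdd3's cache, no recomputation of rdd1/rdd2
rdd4.lookup("a") // served from rdd4's cache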
