Is persist required when there is only a single action? - apache-spark

I have a workflow like below:
rdd1 = sc.textFile(input);
rdd2 = rdd1.filter(filterfunc1);
rdd3 = rdd1.filter(filterfunc2);
rdd4 = rdd2.map(maptrans1);
rdd5 = rdd3.map(maptrans2);
rdd6 = rdd4.union(rdd5);
rdd6.foreach(some transformation);
1. Do I need to persist rdd1? Or is it not required, since there is only one action at rdd6, which creates only one job, and within a single job there is no need to persist?
2. Also, what if the transformation on rdd2 is reduceByKey instead of map? Is it still the same, with no need to persist, since it is a single job?

You only need to persist if you plan to reuse the RDD in more than one action. Within a single action, Spark does a good job of deciding when to recalculate and when to reuse.
You can check the DAG in the UI to make sure rdd1 is only read once from the file.
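If the DAG shows the file being scanned once per branch instead, persisting rdd1 before the two filters is the usual fix. A minimal sketch, reusing the names from the question (input, filterfunc1 and friends are assumed to be defined):
val rdd1 = sc.textFile(input).persist()   // cache the source because two branches read from it
val rdd2 = rdd1.filter(filterfunc1)
val rdd3 = rdd1.filter(filterfunc2)
val rdd6 = rdd2.map(maptrans1).union(rdd3.map(maptrans2))
println(rdd6.toDebugString)               // the lineage marks rdd1 as persisted
rdd6.foreach(println)                     // single action, single job
rdd1.unpersist()                          // release the cached blocks afterwards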

Related

Is Spark DAG execution order in parallel or sequential?

I have two sources; they can be different types of source (database or files) or the same type.
Dataset1 = source1.load;
Dataset2 = source2.load;
Will Spark load the data into the different Datasets in parallel, or will it load them in sequence?
Actions occur sequentially. So the answer to your statement "... will load ... in parallel into different datasets ..." is: sequentially, because these are Actions.
The data pipelines required for an Action, including its Transformations, occur in parallel where possible. E.g. creating a DataFrame from 4 loads that are subject to a union, say, will cause those loads to occur in parallel, provided enough Executors (slots) can be allocated.
So, as the comment also states, you need an Action, and the DAG path will determine the flow and any parallelism that can be applied. You can see that in the Spark UI.
To demonstrate:
rdd1 = get some data
rdd2 = get some other data
rdd3 = get some other other data
rddA = rdd1 union rdd2 union rdd3
rddA.toDF.write ...
// followed by
rdd1' = get some data
rdd2' = get some other data
rdd3' = get some other other data
rddA' = rdd1' union rdd2' union rdd3'
rddA'.toDF.write ...
rddA'.toDF.write ... will occur after rddA.toDF.write ... has completed. None of the rdd1', rdd2' or rdd3' Transformations run in parallel with the Transformations / Action of rddA.toDF.write ...; that cannot be the case. This means that if you want write parallelism you need two separate Spark apps running concurrently, provided resources allow that, of course.
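A concrete, runnable variant of the demo above, assuming a spark-shell style environment (sc and spark in scope) and hypothetical paths standing in for "get some data":
import spark.implicits._                      // needed for .toDF on an RDD

val rdd1 = sc.textFile("/data/source1")       // hypothetical inputs
val rdd2 = sc.textFile("/data/source2")
val rdd3 = sc.textFile("/data/source3")
val rddA = rdd1 union rdd2 union rdd3
rddA.toDF("line").write.text("/data/out1")    // action 1: the three reads can run in parallel inside this job

val rdd1b = sc.textFile("/data/source4")
val rdd2b = sc.textFile("/data/source5")
val rdd3b = sc.textFile("/data/source6")
val rddAb = rdd1b union rdd2b union rdd3b
rddAb.toDF("line").write.text("/data/out2")   // action 2: starts only after action 1 has finished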

Spark RDD performance pipelining

If we have, say:
val rdd1 = rdd0.map( ...
followed by
val rdd2 = rdd1.filter( ...
Then, when actually running due to an action, can rdd2 start computing on the rdd1 results that have already been computed, or must it wait until all of the rdd1 work is complete? It is not apparent to me when reading the Spark material. Informatica pipelining does do this, so I assume it probably happens in Spark as well.
Spark transformations are lazy, so neither call does anything beyond computing the dependency DAG. Your code doesn't even touch the data.
For anything to be computed you have to execute an action on rdd2 or one of its descendants.
By default RDDs are also forgetful, so unless you cache rdd1 it will be evaluated all over again every time rdd2 is evaluated.
Finally, due to lazy evaluation, multiple narrow transformations are combined into a single stage, and your code will interleave calls to the map and filter functions.
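A small sketch that makes the interleaving visible (local mode assumed, so the println output lands in the console); the trace shows map and filter alternating element by element within one task, rather than map finishing over the whole partition first:
val rdd0 = sc.parallelize(1 to 3, 1)                          // one partition keeps the output readable
val rdd1 = rdd0.map    { x => println(s"map    $x"); x * 10 }
val rdd2 = rdd1.filter { x => println(s"filter $x"); x > 10 }
rdd2.collect()
// Typical output:
// map    1
// filter 10
// map    2
// filter 20
// map    3
// filter 30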

Create JavaPairRDD from a collection with a custom partitioner

Is it possible to create a JavaPairRDD<K,V> from a List<Tuple2<K,V>> with a specified partitioner? The method parallelizePairs in JavaSparkContext only takes the number of slices and does not allow using a custom partitioner. Invoking partitionBy(...) results in a shuffle, which I would like to avoid.
Why do I need this? Let's say I have rdd1 of some type JavaPairRDD<K,V>, which is partitioned according to the hashCode of K. Now, I would like to create rdd2 of another type JavaPairRDD<K,U> from a List<Tuple2<K,U>>, in order to finally obtain rdd3 = rdd1.join(rdd2).mapValues(...). If rdd2 is not partitioned the same way rdd1 is, the cogroup call inside join will result in expensive data movement across the machines. Calling rdd2.partitionBy(rdd1.partitioner()) does not help either, since it also invokes a shuffle. Therefore, it seems like the only remedy is to ensure rdd2 is created with the same partitioner as rdd1 to begin with. Any suggestions?
P.S. If the List<Tuple2<K,U>> is small, another option is a broadcast hash join, i.e. making a HashMap<K,U> from the List<Tuple2<K,U>>, broadcasting it to all partitions of rdd1, and performing a map-side join. This turns out to be faster than repartitioning rdd2; however, it is not an ideal solution.
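For what it's worth, a sketch of that broadcast map-side join, written in Scala for brevity (the Java version would use JavaSparkContext.broadcast and flatMapToPair in the same way); the types and data here are illustrative only:
import org.apache.spark.rdd.RDD

val rdd1: RDD[(String, Int)] = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
val smallList = Seq(("a", 0.5), ("c", 1.5))               // the small List<Tuple2<K,U>>

val smallMap = sc.broadcast(smallList.toMap)              // ship the small side to every executor

// Map-side join: look each key up in the broadcast map; no shuffle of rdd1 is needed.
val joined: RDD[(String, (Int, Double))] =
  rdd1.flatMap { case (k, v) => smallMap.value.get(k).map(u => (k, (v, u))).toList }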

How to duplicate RDD into multiple RDDs?

Is it possible to duplicate an RDD into two or several RDDs?
I want to use the cassandra-spark driver and save an RDD into a Cassandra table, and, in addition, keep going with more calculations (and eventually save the result to Cassandra as well).
RDDs are immutable and transformations on RDDs create new RDDs. Therefore, it's not necessary to create copies of an RDD to apply different operations.
You could save the base RDD to secondary storage and further apply operations to it.
This is perfectly OK:
val rdd = ???
val base = rdd.byKey(...)
base.saveToCassandra(ks,table)
val processed = base.map(...).reduceByKey(...)
processed.saveToCassandra(ks,processedTable)
val analyzed = base.map(...).join(suspectsRDD).reduceByKey(...)
analyzed.saveAsTextFile("./path/to/save")
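One detail worth noting, in line with the first answer on this page: base is reused by several actions here (its own save plus the processed and analyzed branches), so persisting it avoids recomputing the byKey step. A minimal sketch:
val base = rdd.byKey(...).persist()   // reused by the three actions below
// ... the three save actions shown above ...
base.unpersist()                      // free the cached partitions afterwards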

How far will Spark RDD cache go?

Say I have three RDD transformations called on rdd1:
def rdd2 = rdd1.f1
def rdd3 = rdd2.f2
def rdd4 = rdd3.f3
Now I want to cache rdd4, so I call rdd4.cache().
My question:
Will only rdd4 be cached, or will every RDD above rdd4 be cached as well? Say I want to cache both rdd3 and rdd4: do I need to cache them separately?
The whole idea of cache is that Spark does not keep results in memory unless you tell it to. So if you cache the last RDD in the chain, it only keeps the results of that one in memory. So, yes, you do need to cache them separately, but keep in mind you only need to cache an RDD if you are going to use it more than once, for example:
rdd4.cache()
val v1 = rdd4.lookup("key1")
val v2 = rdd4.lookup("key2")
If you do not call cache in this case, rdd4 will be recalculated for every call to lookup (or any other function that requires evaluation). You might want to read the paper on RDDs; it is pretty easy to understand and explains the ideas behind certain choices made regarding how RDDs work.
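To make the "cache them separately" point concrete, here is a small sketch with stand-ins for f1, f2 and f3 (the real transformations are not shown in the question):
val rdd1 = sc.parallelize(Seq(("key1", 1), ("key2", 2), ("key3", 3)))
val rdd2 = rdd1.mapValues(_ * 10)        // stands in for f1
val rdd3 = rdd2.mapValues(_ + 1)         // stands in for f2
rdd3.cache()                             // must be cached explicitly if rdd3 is reused
val rdd4 = rdd3.mapValues(_.toString)    // stands in for f3
rdd4.cache()                             // caching rdd4 does not cache rdd3 or anything above it

val v1 = rdd4.lookup("key1")             // first action: computes the whole chain, fills both caches
val v2 = rdd4.lookup("key2")             // served from the cached rdd4
val v3 = rdd3.lookup("key3")             // served from the cached rdd3; rdd1 and rdd2 are not recomputed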
