Spark RDD - avoiding shuffle - Does partitioning help to process huge files?

I have an application with around 10 flat files, each containing more than 200MM records. The business logic involves joining all of them sequentially.
My environment:
1 master, 3 slaves (for testing I have assigned 1GB of memory to each node)
Most of the code just does the following for each join:
RDD1 = sc.textFile(file1).mapToPair(..)
RDD2 = sc.textFile(file2).mapToPair(..)
join = RDD1.join(RDD2).map(peopleObject)
Are there any suggestions for tuning, such as repartitioning or parallelizing? If so, are there any best practices for coming up with a good number of partitions?
With the current config the job takes more than an hour, and I see that the shuffle write for almost every file is > 3GB.

In practice, with large datasets (5, 100G+ each), I have seen that the join works best when you co-partition all the RDDs involved in a series of joins before you start joining them.
RDD1 = sc.textFile(file1).mapToPair(..).partitionBy(new HashPartitioner(2048))
RDD2 = sc.textFile(file2).mapToPair(..).partitionBy(new HashPartitioner(2048))
RDDN = sc.textFile(fileN).mapToPair(..).partitionBy(new HashPartitioner(2048))
//start joins
Side note:
I refer to this kind of join as a tree join (each RDD used once). The rationale is presented in a picture taken from the Spark UI, which is not reproduced here.

If we are always joining one RDD (say rdd1) with all the others, the idea is to partition that RDD and then persist it.
Here is a pseudo-Scala implementation (it can easily be converted to Python or Java):
val rdd1 = sc.textFile(file1).mapToPair(..).partitionBy(new HashPartitioner(200)).cache()
At this point rdd1 is hash-partitioned into 200 partitions. It will be persisted (cached) the first time it is evaluated.
Now let's read two more rdds and join them.
val rdd2 = sc.textFile(file2).mapToPair(..)
val join1 = rdd1.join(rdd2).map(peopleObject)
val rdd3 = sc.textFile(file3).mapToPair(..)
val join2 = rdd1.join(rdd3).map(peopleObject)
Note that we neither partition nor cache the remaining RDDs.
Spark will see that rdd1 is already hash-partitioned and will use the same partitioning for all remaining joins. So rdd2 and rdd3 will shuffle their keys to the locations where the matching keys of rdd1 already sit.
To make this clearer, assume that we don't partition and instead use the code shown in the question: each time we do a join, both RDDs are shuffled. This means that if we have N joins against rdd1, the non-partitioned version shuffles rdd1 N times, while the partitioned approach shuffles rdd1 just once.
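The N-versus-1 argument above can be checked with a small counting model. This is a plain-Python sketch, not Spark code: hash_partition and records_moved are hypothetical helpers that only mimic how a hash partitioner assigns keys to buckets, and the data sizes are made up for illustration.

```python
def hash_partition(pairs, num_partitions):
    """Place each (key, value) pair in bucket hash(key) % num_partitions."""
    buckets = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        buckets[hash(key) % num_partitions].append((key, value))
    return buckets


def records_moved(base, others, num_partitions, prepartition):
    """Count how many records of `base` get shuffled while joining it with each of `others`."""
    moved = 0
    if prepartition:
        # One up-front shuffle of the base data set; its layout is reused afterwards.
        hash_partition(base, num_partitions)
        moved += len(base)
    for other in others:
        if not prepartition:
            # No known layout: the base side is shuffled again for this join.
            hash_partition(base, num_partitions)
            moved += len(base)
        # The other side is shuffled once per join in both variants.
        hash_partition(other, num_partitions)
    return moved


base = [(k, "base") for k in range(1000)]
others = [[(k, f"other{i}") for k in range(1000)] for i in range(3)]

unpartitioned = records_moved(base, others, 200, prepartition=False)
partitioned = records_moved(base, others, 200, prepartition=True)
print(unpartitioned, partitioned)
```

With three joins, the model moves the base records three times in the first variant and once in the second, which is the saving that partitionBy followed by cache is meant to buy.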


How to effectively cache/persist Spark RDD join result

What I am trying to do:
read a large, terabyte-size RDD
filter it using a broadcast variable, which reduces it to a few gigabytes
join the filtered RDD with another RDD, which is also a few gigabytes
persist the join result and reuse it multiple times
What I expect:
the join is executed once
the join result is persisted
the join result is reused several times without recomputation
What actually happens:
the join is recomputed several times
half of the entire job runtime is spent recomputing the same thing several times
My pseudo-code:
val nonPartitioned = sparkContext.readData("path")
val terabyteSizeRDD = nonPartitioned
  .partitionBy(new HashPartitioner(nonPartitioned.getNumPartitions))
// filters down to a few gigabytes
val filteredTerabyteSizeRDD = terabyteSizeRDD.mapPartitions(filterAndMapPartitionFunc, preservesPartitioning = true)
val (joined, count) = {
  val result = filteredTerabyteSizeRDD
    .leftOuterJoin(anotherFewGbRDD, filteredTerabyteSizeRDD.partitioner.get)
  result -> result.count()
}
The DAG says that the join is executed several times:
the first time
another time for .count() (I don't know how to trigger persist in another way)
three more times, since the code uses joined three times to create other RDDs
How can I align expectation and reality?
You can cache or persist data in Spark with df.cache() or df.persist(). persist gives you more options than cache because it accepts a storage level; persist without an argument behaves just like a plain cache(). Why don't you cache your filteredTerabyteSizeRDD? It should fit in memory if it is just a few GB. If it doesn't fit in memory, you could try filteredTerabyteSizeRDD.persist(StorageLevel.MEMORY_AND_DISK).
Hope I could answer your question.
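One thing worth checking is whether persist() is actually called on the joined RDD before the first action. In Spark, persist() only marks the RDD, and the data is materialized by the first action that runs afterwards. The toy model below illustrates that behaviour in plain Python; LazyDataset and expensive_join are hypothetical names, not part of any Spark API.

```python
class LazyDataset:
    """Toy stand-in for an RDD: recomputed on every action unless persisted."""

    def __init__(self, compute):
        self._compute = compute
        self._persist = False
        self._cached = None
        self.runs = 0  # how many times the expensive computation actually ran

    def persist(self):
        # Only marks the data set; nothing is computed until the first action.
        self._persist = True
        return self

    def collect(self):
        if self._persist and self._cached is not None:
            return self._cached
        self.runs += 1
        result = self._compute()
        if self._persist:
            self._cached = result
        return result

    def count(self):
        return len(self.collect())


def expensive_join():
    # Stands in for the left outer join from the question.
    left = {1: "a", 2: "b", 3: "c"}
    right = {2: "x", 3: "y"}
    return [(k, (v, right.get(k))) for k, v in left.items()]


not_persisted = LazyDataset(expensive_join)
not_persisted.count()
for _ in range(3):          # three downstream uses, as in the question
    not_persisted.collect()

persisted = LazyDataset(expensive_join).persist()
persisted.count()           # first action materializes the cached result
for _ in range(3):
    persisted.collect()

print(not_persisted.runs, persisted.runs)
```

In the question's terms, this would correspond to calling persist on result before result.count(), so that count() is the action that fills the cache and the three later uses read from it.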

Joining RDDs in Spark per partition to avoid shuffle

I have to perform a join between two rdds, of the form rdd1.join(rdd2).
In order to avoid shuffling, I have partitioned the two rdds based on the expected queries. Both of them have the same number of partitions, generated using the same partitioner.
The problem is now reduced to a per-partition join, i.e. I'd like to join partition i from rdd1 with partition i from rdd2 and collect the results.
How can this be achieved (in scala)?
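As a rough illustration of what a per-partition join does, here is a plain-Python model. It is not Spark code: hash_partition, join_partition and zip_partitions_join are hypothetical helpers. In the Scala RDD API, zipPartitions is the method that hands partition i of two RDDs to one function, and a regular join on two RDDs that share a partitioner should already avoid re-shuffling them.

```python
def hash_partition(pairs, num_partitions):
    """Place each (key, value) pair in bucket hash(key) % num_partitions."""
    buckets = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        buckets[hash(key) % num_partitions].append((key, value))
    return buckets


def join_partition(left_part, right_part):
    """Inner-join two partitions that were built with the same partitioner."""
    right_by_key = {}
    for key, value in right_part:
        right_by_key.setdefault(key, []).append(value)
    return [(key, (lv, rv))
            for key, lv in left_part
            for rv in right_by_key.get(key, [])]


def zip_partitions_join(left_parts, right_parts):
    """Join partition i of the left side with partition i of the right side."""
    assert len(left_parts) == len(right_parts), "both sides need the same partition count"
    joined = []
    for left_part, right_part in zip(left_parts, right_parts):
        joined.extend(join_partition(left_part, right_part))
    return joined


left = hash_partition([(1, "a"), (2, "b"), (3, "c"), (4, "d")], 2)
right = hash_partition([(2, "x"), (3, "y"), (3, "z"), (5, "w")], 2)
result = sorted(zip_partitions_join(left, right))
print(result)
```

Because both sides use the same partitioner, every key can only match inside its own partition pair, so no record has to leave its partition.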

Can I put back a partitioner to a PairRDD after transformations?

It seems that the "partitioner" of a pairRDD is reset to None after most transformations (e.g. values() , or toDF() ). However my understanding is that the partitioning may not always be changed for these transformations.
Since cogroup and maybe other examples perform more efficiently when the partitioning is known to be co-partitioned, I'm wondering if there's a way to tell spark that the rdd's are still co-partitioned.
See the simple example below where I create two co-partitioned rdd's, then cast them to DFs and perform cogroup on the resulting rdds. A similar example could be done with values, and then adding the right pairs back on.
Although this example is simple, my real case might be loading two parquet DataFrames that have the same partitioning.
Is this possible and would it result in a performance benefit in this case?
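from pyspark.sql import Row
data1 = [Row(a=1,b=2),Row(a=2,b=3)]
data2 = [Row(a=1,c=4),Row(a=2,c=5)]
rdd1 = sc.parallelize(data1)
rdd2 = sc.parallelize(data2)
# key each row by column a, then hash-partition both RDDs into 2 partitions
rdd1 = rdd1.map(lambda x: (x.a,x)).partitionBy(2)
rdd2 = rdd2.map(lambda x: (x.a,x)).partitionBy(2)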
print(rdd1.cogroup(rdd2).getNumPartitions()) #2 partitions
rdd3 = rdd1.toDF(["a","b"]).rdd
rdd4 = rdd2.toDF(["a","c"]).rdd
print(rdd3.cogroup(rdd4).getNumPartitions()) #4 partitions (2 empty)
In the Scala API most transformations include the preservesPartitioning=true option. Some of the Python RDD APIs retain that capability, but there are significant exceptions. As for the DataFrame API, the partitioning scheme seems to be mostly outside of end-user control, even on the Scala side.
It is likely then that you would have to:
restrict yourself to using RDDs, i.e. refrain from the DataFrame/Dataset approach
be choosy about which RDD transformations you use: take a look at the ones that allow either
retaining the parent's partitioning scheme
using preservesPartitioning=true

In Apache Spark, why does RDD.union not preserve the partitioner?

As everyone knows partitioners in Spark have a huge performance impact on any "wide" operations, so it's usually customized in operations. I was experimenting with the following code:
val rdd1 =
sc.parallelize(1 to 50).keyBy(_ % 10)
.partitionBy(new HashPartitioner(10))
val rdd2 =
sc.parallelize(200 to 230).keyBy(_ % 13)
val cogrouped = rdd1.cogroup(rdd2)
println("cogrouped: " + cogrouped.partitioner)
val unioned = rdd1.union(rdd2)
println("union: " + unioned.partitioner)
I see that by default cogroup() always yields an RDD with the customized partitioner, but union() doesn't, it will always revert back to default. This is counterintuitive as we usually assume that a PairRDD should use its first element as partition key. Is there a way to "force" Spark to merge 2 PairRDDs to use the same partition key?
union is a very efficient operation, because it doesn't move any data around. If rdd1 has 10 partitions and rdd2 has 20 partitions then rdd1.union(rdd2) will have 30 partitions: the partitions of the two RDDs put after each other. This is just a bookkeeping change, there is no shuffle.
But necessarily it discards the partitioner. A partitioner is constructed for a given number of partitions. The resulting RDD has a number of partitions that is different from both rdd1 and rdd2.
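After taking the union you can shuffle the data and organize it by key again, for example with partitionBy and the partitioner you want. Note that a plain repartition redistributes records without regard to their keys.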
There is one exception to the above. If rdd1 and rdd2 have the same partitioner (with the same number of partitions), union behaves differently. It will join the partitions of the two RDDs pairwise, giving it the same number of partitions as each of the inputs had. This may involve moving data around (if the partitions were not co-located) but will not involve a shuffle. In this case the partitioner is retained. (The code for this is in PartitionerAwareUnionRDD.scala.)
This is no longer true. Iff two RDDs have exactly the same partitioner and number of partitions, the unioned RDD will also have those same partitions. This was incorporated into Spark 1.3.
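The two union behaviours described above can be modelled in a few lines of plain Python. This is only a sketch of the bookkeeping, not Spark code: hash_partition, plain_union and partitioner_aware_union are hypothetical helpers, and the partition counts (10 and 20) follow the example in the answer.

```python
def hash_partition(pairs, num_partitions):
    """Place each (key, value) pair in bucket hash(key) % num_partitions."""
    buckets = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        buckets[hash(key) % num_partitions].append((key, value))
    return buckets


def plain_union(left_parts, right_parts):
    """Put the partitions of both sides after each other; no record moves."""
    return left_parts + right_parts


def partitioner_aware_union(left_parts, right_parts):
    """Merge partition i with partition i; valid only if both sides share a partitioner."""
    assert len(left_parts) == len(right_parts)
    return [l + r for l, r in zip(left_parts, right_parts)]


rdd1 = hash_partition([(k % 10, k) for k in range(1, 51)], 10)
rdd2 = hash_partition([(k % 13, k) for k in range(200, 231)], 20)

unioned = plain_union(rdd1, rdd2)
# Key 3 now lives in two different partitions, so no single partitioner describes the layout.
partitions_with_key_3 = [i for i, part in enumerate(unioned)
                         if any(key == 3 for key, _ in part)]

rdd3 = hash_partition([(k % 13, k) for k in range(200, 231)], 10)
aware = partitioner_aware_union(rdd1, rdd3)

print(len(unioned), partitions_with_key_3, len(aware))
```

In the plain case the result has 30 partitions and the same key can appear in more than one of them, which is why the partitioner has to be dropped. In the partitioner-aware case the 10 partitions still satisfy the original hash layout.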

How to duplicate RDD into multiple RDDs?

Is it possible to duplicate an RDD into two or several RDDs?
I want to use the cassandra-spark driver to save an RDD into a Cassandra table and, in addition, keep going with more calculations (and eventually save the result to Cassandra as well).
RDDs are immutable and transformations on RDDs create new RDDs. Therefore, it's not necessary to create copies of an RDD to apply different operations.
You could save the base RDD to secondary storage and further apply operations to it.
This is perfectly OK:
val rdd = ???
val base = rdd.byKey(...)
val processed =
val analyzed =
