Partition data when reading a file in Spark

I am new to Spark. Consider the following code:
val rdd = sc
.objectFile[(Int, Int)]("path")
.partitionBy(new HashPartitioner(sc.defaultParallelism))
.persist()
rdd.count()
Is each tuple read from the file sent directly to the partition specified by the hash partitioner? Or is the whole file first read into memory without considering the partitioner and then distributed according to it? To me, the former seems more efficient, since the data is shuffled only once, while the latter would need two shuffles.

See the comments in the code below:
val rdd = sc
.objectFile[(Int, Int)]("path") // Reads the file into the default minimum number of partitions (from the input splits); no partitioner is attached at this point
.partitionBy(new HashPartitioner(sc.defaultParallelism)) // Shuffles the data once, redistributing the tuples according to the HashPartitioner
.persist()
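In other words, the read itself does not consult the partitioner: objectFile just splits the file into its default input partitions, and the single shuffle happens afterwards in partitionBy. As a quick check (a small sketch, reusing the rdd defined above), the lineage shows exactly one ShuffledRDD introduced by partitionBy on top of the RDD that reads the file:
println(rdd.partitioner)    // Some(HashPartitioner), set by partitionBy
println(rdd.toDebugString)  // one ShuffledRDD above the RDD(s) created by objectFile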

Related

Spark: How can we remove partitioner from RDD?

I am grouping an RDD based on a key.
rdd.groupBy(_.key).partitioner
=> org.apache.spark.HashPartitioner#a
I see that by default Spark associates a HashPartitioner with this RDD, which is fine by me, because I agree that we need some kind of partitioner to bring similar data to one executor. But later in the program I want the RDD to forget its partitioning strategy, because I want to join it with another RDD that follows a different partitioning strategy. How can we remove the partitioner from the RDD?
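One common approach (a sketch, not from the original thread; it assumes rdd has a key field as in the question): map-style transformations do not promise to preserve keys, so Spark drops the partitioner on their output, which effectively makes the RDD forget its partitioning.
val grouped = rdd.groupBy(_.key)
println(grouped.partitioner)        // Some(org.apache.spark.HashPartitioner@...)

val forgotten = grouped.map(identity)
println(forgotten.partitioner)      // None -- map does not preserve the partitioner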

repartition() is not affecting RDD partition size

I am trying to change the number of partitions of an RDD using the repartition() method. The method call on the RDD succeeds, but when I explicitly check the partition count using the partitions.size property of the RDD, I get back the same number of partitions that it originally had:
scala> rdd.partitions.size
res56: Int = 50
scala> rdd.repartition(10)
res57: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[19] at repartition at <console>:27
At this stage I perform an action like rdd.take(1) to force evaluation, in case that matters. Then I check the partition count again:
scala> rdd.partitions.size
res58: Int = 50
As one can see, it's not changing. Can someone answer why?
First, it does matter that you run an action, because repartition is indeed lazy. Second, repartition returns a new RDD with the partitioning changed, so you must use the returned RDD; otherwise you are still working off the old partitioning. Finally, when shrinking the number of partitions you should use coalesce, which avoids a full shuffle: it keeps data on its existing nodes where possible and merges partitions rather than redistributing everything.
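A minimal sketch of the fix, reusing the rdd from the question: keep the RDD returned by repartition (or coalesce) and inspect that one instead of the original.
val repartitioned = rdd.repartition(10)   // returns a new RDD; `rdd` itself is unchanged
repartitioned.take(1)                     // optional: force evaluation, as in the question
println(repartitioned.partitions.size)    // 10

val shrunk = rdd.coalesce(10)             // preferred when reducing the partition count: no full shuffle
println(shrunk.partitions.size)           // 10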

Why does mapPartitionsWithIndex cause a shuffle in Spark?

I'm new to Spark. I'm investigating shuffling in a test application, and I don't understand why the mapPartitionsWithIndex method causes a shuffle in my program. As you can see in the picture, my initial RDD has two 16 MB partitions and a shuffle write of about 49.8 MB.
I know that map, mapPartitions, and mapPartitionsWithIndex are not shuffling transformations like groupByKey, but I see that they also cause a shuffle in Spark. Why?
I think you are performing some join/group operation after mapPartitionsWithIndex, and that is what causes the shuffle.
You can verify this by modifying your code.
Current code:
val rdd = inputRDD1.mapPartitionsWithIndex(....)
val outRDD = rdd.join(inputRDD2)
Modified code:
val rdd = inputRDD1.mapPartitionsWithIndex(....)
println(rdd.count)
If the shuffle write disappears with this change, the shuffle comes from the join with inputRDD2, not from mapPartitionsWithIndex.

In Apache Spark, why does RDD.union not preserve the partitioner?

As everyone knows, partitioners in Spark have a huge performance impact on any "wide" operations, so they are usually customized. I was experimenting with the following code:
val rdd1 = sc.parallelize(1 to 50).keyBy(_ % 10)
  .partitionBy(new HashPartitioner(10))
val rdd2 = sc.parallelize(200 to 230).keyBy(_ % 13)
val cogrouped = rdd1.cogroup(rdd2)
println("cogrouped: " + cogrouped.partitioner)
val unioned = rdd1.union(rdd2)
println("union: " + unioned.partitioner)
I see that by default cogroup() always yields an RDD with the customized partitioner, but union() doesn't; it always reverts to the default. This is counterintuitive, as we usually assume that a PairRDD should use its first element as the partition key. Is there a way to "force" Spark to merge two PairRDDs so they use the same partition key?
union is a very efficient operation, because it doesn't move any data around. If rdd1 has 10 partitions and rdd2 has 20 partitions then rdd1.union(rdd2) will have 30 partitions: the partitions of the two RDDs put after each other. This is just a bookkeeping change, there is no shuffle.
But it necessarily discards the partitioner. A partitioner is constructed for a given number of partitions, and the resulting RDD has a number of partitions different from both rdd1 and rdd2.
After taking the union you can run repartition to shuffle the data and organize it by key.
There is one exception to the above. If rdd1 and rdd2 have the same partitioner (with the same number of partitions), union behaves differently. It will join the partitions of the two RDDs pairwise, giving it the same number of partitions as each of the inputs had. This may involve moving data around (if the partitions were not co-located) but will not involve a shuffle. In this case the partitioner is retained. (The code for this is in PartitionerAwareUnionRDD.scala.)
This is no longer always true. If two RDDs have exactly the same partitioner and number of partitions, the unioned RDD will also have that same partitioner. This was introduced in https://github.com/apache/spark/pull/4629 and incorporated into Spark 1.3.
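A minimal sketch of that exception (it assumes a running SparkContext sc and Spark 1.3 or later; both inputs are keyed the same way and share one HashPartitioner, unlike rdd2 above):
import org.apache.spark.HashPartitioner

val p = new HashPartitioner(10)
val a = sc.parallelize(1 to 50).keyBy(_ % 10).partitionBy(p)
val b = sc.parallelize(200 to 230).keyBy(_ % 10).partitionBy(p)

val u = a.union(b)
println(u.partitioner)       // Some(HashPartitioner) -- retained, since a and b share p
println(u.partitions.size)   // 10, not 20: partitions are combined pairwise, no shuffle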

How to duplicate RDD into multiple RDDs?

Is it possible to duplicate an RDD into two or several RDDs?
I want to use the cassandra-spark driver to save an RDD into a Cassandra table, and in addition keep going with more calculations (and eventually save the result to Cassandra as well).
RDDs are immutable and transformations on RDDs create new RDDs. Therefore, it's not necessary to create copies of an RDD to apply different operations.
You could save the base RDD to secondary storage and further apply operations to it.
This is perfectly OK:
val rdd = ???
val base = rdd.keyBy(...)
base.saveToCassandra(ks, table)
val processed = base.map(...).reduceByKey(...)
processed.saveToCassandra(ks, processedTable)
val analyzed = base.map(...).join(suspectsRDD).reduceByKey(...)
analyzed.saveAsTextFile("./path/to/save")
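One addition worth considering (not part of the original answer): since base feeds several actions, caching it keeps each save from recomputing the lineage from the source. A self-contained sketch, with the Cassandra saves replaced by counts and a made-up source RDD for illustration:
val source = sc.parallelize(1 to 100).map(i => (i % 10, i))
val base = source.reduceByKey(_ + _).cache()   // or persist(StorageLevel.MEMORY_AND_DISK)

println(base.count())                          // first action computes and caches base
println(base.mapValues(_ * 2).count())         // later actions reuse the cached partitions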
