Spark: How can we remove partitioner from RDD? - apache-spark

I am grouping an RDD based on a key.
rdd.groupBy(_.key).partitioner
=> org.apache.spark.HashPartitioner#a
I see that by default Spark associates a HashPartitioner with this RDD, which is fine by me, because I agree that we need some kind of partitioner to bring like data to one executor. But later in the program I want the RDD to forget its partitioning strategy, because I want to join it with another RDD that follows a different partitioning strategy. How can we remove the partitioner from the RDD?
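A minimal sketch of one way to do this (an illustration, not an answer taken from the thread; it assumes a spark-shell where sc is in scope and uses a made-up Record type): transformations that are not partitioning-preserving, such as map, return an RDD whose partitioner is None, because Spark can no longer assume the keys are unchanged.

// Hypothetical record type, for illustration only
case class Record(key: String, value: Int)

val grouped = sc.parallelize(Seq(Record("a", 1), Record("a", 2), Record("b", 3)))
  .groupBy(_.key)
grouped.partitioner        // Some(org.apache.spark.HashPartitioner@...)

// map is not partitioning-preserving, so the result carries no partitioner and a
// later join is free to follow the other RDD's partitioning strategy
val forgotten = grouped.map(identity)
forgotten.partitioner      // None

Since the eventual goal is a join, it is also worth noting that join accepts an explicit Partitioner argument, so both sides can be brought under a single strategy at join time.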

Related

Partition data when reading file in Spark

I am new to Spark. Consider the following code:
val rdd = sc
.objectFile[(Int, Int)]("path")
.partitionBy(new HashPartitioner(sc.defaultParallelism))
.persist()
rdd.count()
Is each tuple read from the file sent directly to the partition specified by the hash partitioner? Or is the whole file first read into memory without considering the partitioner and then distributed according to it? To me, the former seems more efficient, since the data is shuffled once, while the latter needs two shuffles.
Please find the comments in the code below:
val rdd = sc
.objectFile[(Int, Int)]("path") // Loads the file with the default minimum number of partitions; no partitioner is set at this point
.partitionBy(new HashPartitioner(sc.defaultParallelism)) // Re-partitions the RDD with a HashPartitioner, which is what triggers the shuffle
.persist()
In other words, the file is first loaded into partitions that follow the input splits, without considering the partitioner, and the data is then shuffled once when partitionBy redistributes it.
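A quick way to observe this ordering from a spark-shell (a sketch; "path" is just the placeholder from the question): the RDD returned by objectFile reports no partitioner, and only the partitionBy step attaches one, which is where the single shuffle happens.

import org.apache.spark.HashPartitioner

val loaded = sc.objectFile[(Int, Int)]("path")
loaded.partitioner        // None: partitions simply follow the input splits

val repartitioned = loaded
  .partitionBy(new HashPartitioner(sc.defaultParallelism))
  .persist()
repartitioned.partitioner // Some(org.apache.spark.HashPartitioner@...), set by the shuffle
repartitioned.count()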

Create JavaPairRDD from a collection with a custom partitioner

Is it possible to create a JavaPairRDD<K,V> from a List<Tuple2<K,V>> with a specified partitioner? The method parallelizePairs in JavaSparkContext only takes the number of slices and does not allow using a custom partitioner, and invoking partitionBy(...) results in a shuffle, which I would like to avoid.
Why do I need this? Let's say I have rdd1 of some type JavaPairRDD<K,V> which is partitioned according to the hashCode of K. Now, I would like to create rdd2 of another type JavaPairRDD<K,U> from a List<Tuple2<K,U>> in order to finally obtain rdd3 = rdd1.join(rdd2).mapValues(...). If rdd2 is not partitioned the same way rdd1 is, the cogroup call inside join will result in expensive data movement across the machines. Calling rdd2.partitionBy(rdd1.partitioner()) does not help either, since it also invokes a shuffle. Therefore, it seems like the only remedy is to ensure rdd2 is created with the same partitioner as rdd1 to begin with. Any suggestions?
PS. If the List<Tuple2<K,U>> is small, another option is a broadcast hash join, i.e. making a HashMap<K,U> from the List<Tuple2<K,U>>, broadcasting it to all partitions of rdd1, and performing a map-side join, as sketched below. This turns out to be faster than repartitioning rdd2; however, it is not an ideal solution.
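A minimal sketch of that broadcast hash join, written in Scala for brevity (the JavaSparkContext version has the same shape); rdd1, smallList and the element types below are illustrative, not from the question:

// rdd1: the large pair RDD, already partitioned by the hash code of its keys
val rdd1 = sc.parallelize(Seq((1, "a"), (2, "b"))).partitionBy(new org.apache.spark.HashPartitioner(4))

// The small list is turned into a map and broadcast to every executor
val smallList = Seq((1, 10.0), (2, 20.0))
val lookup = sc.broadcast(smallList.toMap)

// Map-side join: each partition of rdd1 is joined locally against the broadcast
// map, so rdd2 never exists as an RDD and no shuffle is triggered at all
val joined = rdd1.mapPartitions(
  iter => iter.flatMap { case (k, v) => lookup.value.get(k).map(u => (k, (v, u))) },
  preservesPartitioning = true)

As the question notes, this only pays off when the list comfortably fits in executor memory.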

Worker Behavior with two (or more) dataframes having the same key

I'm using PySpark (Spark 1.4.1) on a cluster. I have two DataFrames, each containing the same key values but different data in the other fields.
I partitioned each DataFrame separately using the key and wrote a parquet file to HDFS. I then read the parquet file back into memory as a new DataFrame. If I join the two DataFrames, will the processing for the join happen on the same workers?
For example:
dfA contains {userid, firstname, lastname} partitioned by userid
dfB contains {userid, activity, job, hobby} partitioned by userid
dfC = dfA.join(dfB, dfA.userid==dfB.userid)
Is dfC already partitioned by userid?
Is dfC already partitioned by userid
The answer depends on what you mean by partitioned. Records with the same userid should be located on the same partition, but DataFrames don't support partitioning understood as having a Partitioner; only pair RDDs (RDD[(T, U)]) can have a partitioner in Spark. This means that for most applications the answer is no: neither the DataFrame nor the underlying RDD is partitioned in that sense.
You'll find more details about DataFrames and partitioning in How to define partitioning of DataFrame? Another question you can follow is Co-partitioned joins in spark SQL.
If I join the two DataFrames, will the processing for the join happen on the same workers?
Once again, it depends on what you mean. Records with the same userid have to be transferred to the same node before the joined rows can be produced. If you are asking whether this is guaranteed to happen without any network traffic, the answer is no.
To be clear, it would be exactly the same even if the DataFrame had a partitioner. Data co-partitioning is not equivalent to data co-location; it simply means that the join operation can be performed with a one-to-one mapping rather than a shuffle. You can find more in Daniel Darbos' answer to Does a join of co-partitioned RDDs cause a shuffle in Apache Spark?.
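To make the "DataFrames have no Partitioner" point concrete, here is a small check that can be run from a Spark 1.x shell (Scala, assuming sc and sqlContext are the shell's SparkContext and SQLContext; the same probe works from PySpark through df.rdd):

import org.apache.spark.HashPartitioner

// A pair RDD can carry a Partitioner after partitionBy...
val pairs = sc.parallelize(Seq((1, "a"), (2, "b"))).partitionBy(new HashPartitioner(4))
pairs.partitioner      // Some(org.apache.spark.HashPartitioner@...)

// ...but a DataFrame does not expose one, even though rows with equal keys
// may well end up in the same partition
val df = sqlContext.createDataFrame(Seq((1, "a"), (2, "b"))).toDF("userid", "name")
df.rdd.partitioner     // None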

Can the partition number of a Spark RDD be manually changed without repartitioning

In Spark I have two PairRDDs (let us call them A and B) consisting of n partitions each. I want to join those RDDs based upon their keys.
Both RDDs are consistently partitioned, i.e., if keys x and y are in the same partition in RDD A, they are also in the same partition in RDD B. For RDD A, I can assure that the partitioning is done using a particular Partitioner. But for RDD B, the partition indices may be different than those from RDD A (RDD B is the output of some legacy library that I am reluctant to touch if not absolutely necessary).
I would like to efficiently join RDD A and B without performing a shuffle. In theory this would be easy if I could reassign the partition numbers of RDD B such that they match those in RDD A.
My question now is: Is it possible to edit the partition numbers of an RDD (basically permuting them)? Or alternatively can one assign a partitioner without causing a shuffle operation? Or do you see another way for solving this task that I am currently too blind to see?
Yes, you can change the partitioning, but to reduce shuffling the data must be co-located on the same cluster nodes.
Control the partitioning at the data-source level and/or with the partitionBy operator (see the sketch after this answer).
If the small RDD can fit in the memory of all workers, then using a broadcast variable is the faster option.
As you mentioned, the partitioning is already consistent, so you do not need to repartition (or edit the existing number of partitions).
Keep in mind that guaranteeing data co-location is hard to achieve.
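As an illustration of the "control partitioning at the source" point, a sketch with made-up data (not the asker's legacy RDDs): when both pair RDDs are built with the same Partitioner instance, the join can reuse it, so no additional shuffle stage is needed for the join itself.

import org.apache.spark.HashPartitioner

val partitioner = new HashPartitioner(8)

// Both RDDs are partitioned with the same partitioner up front
val a = sc.parallelize(Seq((1, "a1"), (2, "a2"))).partitionBy(partitioner)
val b = sc.parallelize(Seq((1, "b1"), (2, "b2"))).partitionBy(partitioner)

// Matching keys already sit in partitions with the same index, so the join
// uses narrow one-to-one dependencies instead of shuffling either side
val joined = a.join(b)
joined.partitioner     // Some(org.apache.spark.HashPartitioner@...)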

Is there a way to check if a variable in Spark is parallelizable?

So I am using the groupByKey function in Spark, but it's not being parallelized: during its execution I can see that only one core is being used. It seems that the data I'm working with doesn't allow parallelization. Is there a way in Spark to know whether the input data is amenable to parallelization, or whether it's not a proper RDD?
The unit of parallelization in Spark is the partition: RDDs are split into partitions and transformations are applied to each partition in parallel. How RDD data is distributed across partitions is determined by the Partitioner. By default, key-based shuffles such as groupByKey use a HashPartitioner, which should work fine for most purposes.
You can check how many partitions your RDD is split into using:
rdd.partitions.length // number of partitions
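If only one core is busy, the usual culprits are an RDD that ends up with a single partition or a single key holding most of the data. A sketch of how one might check and widen the parallelism (the data and numbers are arbitrary):

val pairs = sc.parallelize(Seq(("k1", 1), ("k1", 2), ("k2", 3)))

pairs.partitions.length    // number of partitions = maximum number of parallel tasks
pairs.partitioner          // None until a shuffle applies a partitioner

// groupByKey accepts an explicit partition count, so the grouping stage can
// run across more tasks than the input had; heavy skew on a single key still
// concentrates that key's work in one task, though
val grouped = pairs.groupByKey(16)
grouped.partitions.length  // 16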
