Ordering of rows in JavaRdds after union - apache-spark

I am trying to find out any information on the ordering of the rows in a RDD.
Here is what I am trying to do:
Rdd1, Rdd2
Rdd3 = Rdd1.union(rdd2);
In Rdd3, is there any guarantee that rdd1's records will appear first and rdd2's afterwards?
In my tests I saw this behavior, but I wasn't able to find it in any docs.
Just FYI, I do not care about the ordering within each RDD (i.e. the order of rdd1's or rdd2's own data is not a concern), but the requirement is that after the union rdd1's records come first.

In Spark, the elements within a particular partition are unordered; however, the partitions themselves are ordered: http://spark.apache.org/docs/latest/programming-guide.html#background
If you check your RDD3, you should find that it is just all the partitions of RDD1 followed by all the partitions of RDD2, so in this case the results happen to be ordered the way you want. That simply concatenating the partitions of the two RDDs is Spark's standard behaviour is discussed in the question "In Apache Spark, why does RDD.union not preserve the partitioner?"
So in this case, it appears that union will give you what you want. However, this behaviour is an implementation detail of union, not part of its interface definition, so you cannot rely on it not being reimplemented with different behaviour in the future.
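Since the concatenation is only an implementation detail, one defensive option is to make the required order explicit instead of relying on it. The following is a minimal plain-Python sketch, not Spark code: RDDs are modeled as lists of partitions, and the helper names (`union`, `tag`, `collect`) are made up for illustration. It shows the observed concatenation and a variant that tags each record with its source so the "rdd1 first" requirement can be restored by a stable sort, even if the records were interleaved.

```python
import random
from itertools import chain


def collect(partitions):
    """Flatten a list of partitions into one list, partition by partition."""
    return list(chain.from_iterable(partitions))


def union(parts_a, parts_b):
    """Model of the observed behaviour: a's partitions followed by b's."""
    return list(parts_a) + list(parts_b)


def tag(partitions, source_id):
    """Attach a source id to every record so the origin survives any reordering."""
    return [[(source_id, rec) for rec in part] for part in partitions]


rdd1 = [["a1", "a2"], ["a3"]]
rdd2 = [["b1"], ["b2", "b3"]]

# What the answer describes: rdd1's partitions, then rdd2's.
observed = collect(union(rdd1, rdd2))

# Defensive variant: tag, then sort by the tag.
tagged = collect(union(tag(rdd1, 0), tag(rdd2, 1)))
random.Random(0).shuffle(tagged)  # pretend a future implementation interleaves records
restored = [rec for _, rec in sorted(tagged, key=lambda t: t[0])]
```

In the model, `restored` always has every rdd1 record before every rdd2 record, while the order inside each group is arbitrary, which matches the stated requirement. In Spark the same idea would involve a sort by the tag and therefore a shuffle, so it trades performance for a guarantee.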

Related

How to preserve partitioning through dataframe operations

Is there a reliable way to predict which Spark dataframe operations will preserve partitioning and which won't?
Specifically, let's say my dataframes are all partitioned with .repartition(500,'field1','field2').
Can I expect an output with 500 partitions arranged by these same fields if I apply:
select()
filter()
groupBy() followed by agg() when grouping happens on 'field1' and 'field2' (as in the above)
join() on 'field1' and 'field2' when both dataframes are partitioned as above
Given the special way my data is prepartitioned, I'd expect no extra shuffling to take place. However, I always seem to end up with at least few stages having number of tasks equal to spark.sql.shuffle.partitions. Any way to avoid that extra shuffling bit?
Thanks
This is a well-known issue with Spark. Even if you have repartitioned the data, Spark will still shuffle it.
What is the Problem
Repartitioning ensures that each partition contains the data for a single column value.
Good example here:
val people = List(
(10, "blue"),
(13, "red"),
(15, "blue"),
(99, "red"),
(67, "blue")
)
val peopleDf = people.toDF("age", "color")
val colorDf = peopleDf.repartition($"color")
Partition 00091
13,red
99,red
Partition 00168
10,blue
15,blue
67,blue
However, Spark doesn't remember this information for subsequent operations. Also, no total ordering across partitions is kept in Spark, i.e. Spark knows that a single partition holds data for one column value, but it doesn't know which other partitions hold data for the same column. The data also needs to be sorted to ensure that no shuffle is required.
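The placement shown above can be mimicked with a small plain-Python model of hash partitioning. This is only a sketch: `zlib.crc32` is a stand-in for Spark's own hash function (so the partition numbers will not match the ones listed above), the helper names are invented, and 200 is used as the partition count because that is the default of spark.sql.shuffle.partitions.

```python
import zlib
from collections import defaultdict


def partition_index(key, num_partitions):
    """Deterministic stand-in hash; Spark uses a different hash function."""
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def repartition_by(rows, key_fn, num_partitions):
    """Group rows by the partition index of their key."""
    parts = defaultdict(list)
    for row in rows:
        parts[partition_index(key_fn(row), num_partitions)].append(row)
    return dict(parts)


people = [(10, "blue"), (13, "red"), (15, "blue"), (99, "red"), (67, "blue")]
color_parts = repartition_by(people, key_fn=lambda r: r[1], num_partitions=200)

# Record which partitions each color ended up in.
colors_to_partitions = defaultdict(set)
for idx, rows in color_parts.items():
    for _, color in rows:
        colors_to_partitions[color].add(idx)
```

Every color lands in exactly one partition, because the index depends only on the key. What the model cannot show is the point made above: after later operations Spark may no longer know that the data is laid out this way.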
How can you solve
You need to use the Spark bucketing feature to ensure no shuffle in subsequent stages.
I found this wiki to be pretty detailed about the bucketing features.
Bucketing is an optimization technique in Spark SQL that uses buckets and bucketing columns to determine data partitioning.
The motivation is to optimize performance of a join query by avoiding shuffles (aka exchanges) of tables participating in the join. Bucketing results in fewer exchanges (and so stages).
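To illustrate why matching buckets remove the exchange, here is a minimal plain-Python model, not Spark code. It assumes both tables were written with the same bucket count and the same bucketing function (`zlib.crc32` is again only a stand-in hash, and `write_bucketed` and `bucketed_join` are hypothetical helper names). Under that assumption, bucket i of one table can only match bucket i of the other, so the join runs bucket by bucket with no data movement between buckets.

```python
import zlib


def bucket_id(key, num_buckets):
    """Stand-in bucketing function shared by both tables."""
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


def write_bucketed(rows, num_buckets):
    """Place each (key, value) row into the bucket chosen by its key."""
    buckets = [[] for _ in range(num_buckets)]
    for key, value in rows:
        buckets[bucket_id(key, num_buckets)].append((key, value))
    return buckets


def bucketed_join(left_buckets, right_buckets):
    """Inner join that only ever compares bucket i with bucket i."""
    if len(left_buckets) != len(right_buckets):
        raise ValueError("both sides must use the same number of buckets")
    joined = []
    for left, right in zip(left_buckets, right_buckets):
        lookup = {}
        for key, value in right:
            lookup.setdefault(key, []).append(value)
        for key, value in left:
            for other in lookup.get(key, []):
                joined.append((key, (value, other)))
    return joined


orders = [(1, "book"), (2, "pen"), (1, "lamp"), (3, "desk")]
users = [(1, "ann"), (2, "bob"), (4, "cy")]
result = bucketed_join(write_bucketed(orders, 4), write_bucketed(users, 4))
```

If the two sides used different bucket counts or different bucketing columns, the bucket-by-bucket pairing would no longer be valid, which is the situation where a shuffle is needed again.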

Create JavaPairRDD from a collection with a custom partitioner

Is it possible to create a JavaPairRDD<K,V> from a List<Tuple2<K,V>> with a specified partitioner? The method parallelizePairs in JavaSparkContext only takes the number of slices and does not allow using a custom partitioner. Invoking partitionBy(...) results in a shuffle, which I would like to avoid.
Why do I need this? Let's say I have rdd1 of some type JavaPairRDD<K,V> which is partitioned according to the hashCode of K. Now I would like to create rdd2 of another type JavaPairRDD<K,U> from a List<Tuple2<K,U>> in order to finally obtain rdd3 = rdd1.join(rdd2).mapValues(...). If rdd2 is not partitioned the same way rdd1 is, the cogroup call in join will result in expensive data movement across the machines. Calling rdd2.partitionBy(rdd1.partitioner()) does not help either, since it also invokes a shuffle. Therefore, it seems like the only remedy is to ensure rdd2 is created with the same partitioner as rdd1 to begin with. Any suggestions?
P.S. If List<Tuple2<K,U>> is small, another option is a broadcast hash join, i.e. making a HashMap<K,U> from List<Tuple2<K,U>>, broadcasting it to all partitions of rdd1, and performing a map-side join. This turns out to be faster than repartitioning rdd2; however, it is not an ideal solution.
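The broadcast hash join mentioned in the postscript can be sketched in plain Python as follows. This is a model only, not Spark code: rdd1 is represented as a list of partitions, the "broadcast" is just a dictionary visible to every partition, and the sketch assumes keys in the small list are unique.

```python
def broadcast_hash_join(large_partitions, small_pairs):
    """Map-side inner join: each partition looks keys up in the shared dict."""
    lookup = dict(small_pairs)  # assumption: at most one value per key
    return [
        [(key, (value, lookup[key])) for key, value in part if key in lookup]
        for part in large_partitions
    ]


rdd1_partitions = [[("a", 1), ("b", 2)], [("c", 3), ("a", 4)]]
small_list = [("a", "x"), ("c", "y")]

joined = broadcast_hash_join(rdd1_partitions, small_list)
```

Each output partition is derived from exactly one input partition, so rdd1's layout is untouched and no records move between partitions. The cost is that every partition needs the whole small table in memory.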

Can I put back a partitioner to a PairRDD after transformations?

It seems that the "partitioner" of a pairRDD is reset to None after most transformations (e.g. values() , or toDF() ). However my understanding is that the partitioning may not always be changed for these transformations.
Since cogroup and maybe other examples perform more efficiently when the partitioning is known to be co-partitioned, I'm wondering if there's a way to tell spark that the rdd's are still co-partitioned.
See the simple example below where I create two co-partitioned rdd's, then cast them to DFs and perform cogroup on the resulting rdds. A similar example could be done with values, and then adding the right pairs back on.
Although this example is simple, in my real case I might load two Parquet dataframes with the same partitioning.
Is this possible and would it result in a performance benefit in this case?
data1 = [Row(a=1,b=2),Row(a=2,b=3)]
data2 = [Row(a=1,c=4),Row(a=2,c=5)]
rdd1 = sc.parallelize(data1)
rdd2 = sc.parallelize(data2)
rdd1 = rdd1.map(lambda x: (x.a,x)).partitionBy(2)
rdd2 = rdd2.map(lambda x: (x.a,x)).partitionBy(2)
print(rdd1.cogroup(rdd2).getNumPartitions()) #2 partitions
rdd3 = rdd1.toDF(["a","b"]).rdd
rdd4 = rdd2.toDF(["a","c"]).rdd
print(rdd3.cogroup(rdd4).getNumPartitions()) #4 partitions (2 empty)
In the Scala API most transformations include the preservesPartitioning=true option. Some of the Python RDD APIs retain that capability, but groupBy, for example, is a significant exception. As for the DataFrame APIs, the partitioning scheme seems to be mostly outside of end-user control, even on the Scala side.
It is likely then that you would have to:
restrict yourself to using rdds - i.e. refrain from the DataFrame/Dataset approach
be choosy about which RDD transformations you use: take a look at the ones that allow either
retaining the parent's partitioning scheme
using preservesPartitioning=true
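The difference between these kinds of transformations can be modeled in a few lines of plain Python. `ModelRDD` below is a hypothetical stand-in class, not the Spark API; it only tracks whether the partitioner label survives. It mirrors the rule that a general map may change keys and therefore drops the partitioner, whereas a values-only transformation can safely keep it.

```python
class ModelRDD:
    """Toy pair-RDD: a list of partitions plus an optional partitioner label."""

    def __init__(self, partitions, partitioner=None):
        self.partitions = partitions
        self.partitioner = partitioner

    def map_partitions(self, fn, preserves_partitioning=False):
        new_parts = [list(fn(iter(part))) for part in self.partitions]
        kept = self.partitioner if preserves_partitioning else None
        return ModelRDD(new_parts, kept)

    def map(self, fn):
        # Keys may change, so the partitioner is forgotten.
        return self.map_partitions(lambda it: (fn(x) for x in it))

    def map_values(self, fn):
        # Keys are untouched, so the partitioner can be kept.
        return self.map_partitions(
            lambda it: ((k, fn(v)) for k, v in it),
            preserves_partitioning=True,
        )


pairs = ModelRDD([[(0, "a"), (2, "b")], [(1, "c")]], partitioner="hash(2)")
after_map = pairs.map(lambda kv: (kv[0], kv[1].upper()))
after_map_values = pairs.map_values(str.upper)
```

Both results hold the same records in the same partitions; only `after_map_values` still carries the partitioner. That is the situation described in the question: the data layout is unchanged, but the metadata a later cogroup would use is gone.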

Can the partition number of a Spark RDD be manually changed without repartitioning

In Spark I have two PairRDDs (let us call them A and B) consisting of n partitions each. I want to join those RDDs based upon their keys.
Both RDDs are consistently partitioned, i.e., if keys x and y are in the same partition in RDD A, they are also in the same partition in RDD B. For RDD A, I can assure that the partitioning is done using a particular Partitioner. But for RDD B, the partition indices may be different than those from RDD A (RDD B is the output of some legacy library that I am reluctant to touch if not absolutely necessary).
I would like to efficiently join RDD A and B without performing a shuffle. In theory this would be easy if I could reassign the partition numbers of RDD B such that they match those in RDD A.
My question now is: Is it possible to edit the partition numbers of an RDD (basically permuting them)? Or alternatively can one assign a partitioner without causing a shuffle operation? Or do you see another way for solving this task that I am currently too blind to see?
Yes, you can change the partitioning, but to reduce shuffling the data must be co-located on the same cluster nodes.
Control the partitioning at the data source level and/or using the .partition operator.
If the smaller RDD can fit in the memory of every worker, then using a broadcast variable is the faster option.
As you mentioned, the partitioning is consistent, so you do not need to repartition (or edit the existing number of partitions).
Keep in mind that a guarantee of data co-location is hard to achieve.
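The permutation idea from the question can be modeled in plain Python to make the required assumptions concrete. This is not a Spark API and does not show how to do it inside Spark; `align_partitions` and `partition_of_key` are hypothetical names. The sketch assumes that every non-empty partition of B maps, under A's partitioner, to exactly one target index, and that no two partitions of B map to the same index.

```python
def align_partitions(b_partitions, partition_of_key):
    """Reorder B's partitions so that index i matches A's partitioner."""
    aligned = [[] for _ in b_partitions]
    used = set()
    for part in b_partitions:
        if not part:
            continue
        targets = {partition_of_key(key) for key, _ in part}
        if len(targets) != 1:
            raise ValueError("partition mixes keys from several target partitions")
        target = targets.pop()
        if target in used:
            raise ValueError("two partitions map to the same target index")
        used.add(target)
        aligned[target] = part
    return aligned


# A's partitioner in this model: key modulo the number of partitions.
def partition_of_key(key):
    return key % 3


# B groups keys consistently with A, but under different indices.
rdd_b = [[(1, "x"), (4, "y")], [(2, "z")], [(0, "w"), (3, "v")]]
aligned_b = align_partitions(rdd_b, partition_of_key)
```

The reordering only moves whole partitions, which is why it would be cheap in principle. The two checks show what has to hold for the result to be correct, and that is exactly what is hard to guarantee for the output of a library you do not control.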

Which operations preserve RDD order?

An RDD has a meaningful order (as opposed to some random order imposed by the storage model) if it was processed by sortBy(), as explained in this reply.
Now, which operations preserve that order?
E.g., is it guaranteed that (after a.sortBy())
a.map(f).zip(a) ===
a.map(x => (f(x),x))
How about
a.filter(f).map(g) ===
a.map(x => (x,g(x))).filter(f(_._1)).map(_._2)
what about
a.filter(f).flatMap(g) ===
a.flatMap(x => g(x).map((x,_))).filter(f(_._1)).map(_._2)
Here "equality" === is understood as "functional equivalence", i.e., there is no way to distinguish the outcome using user-level operations (i.e., without reading logs &c).
All operations preserve the order, except those that explicitly do not. Ordering is always "meaningful", not just after a sortBy. For example, if you read a file (sc.textFile) the lines of the RDD will be in the order that they were in the file.
Without trying to give a complete list, map, filter and flatMap do preserve the order. sortBy, partitionBy, join do not preserve the order.
The reason is that most RDD operations work on Iterators inside the partitions. So map or filter just has no way to mess up the order. You can take a look at the code to see for yourself.
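The iterator argument can be checked on a small plain-Python model, where an RDD is a list of partitions and each operation is applied to an iterator over one partition. This is a sketch of the reasoning, not Spark code; the `rdd_*` helpers are invented for illustration, and `rdd_zip` assumes both sides have the same number of partitions and the same number of elements per partition.

```python
from itertools import chain


def map_partitions(partitions, fn):
    """Apply fn to an iterator over each partition, keeping partition order."""
    return [list(fn(iter(part))) for part in partitions]


def rdd_map(partitions, f):
    return map_partitions(partitions, lambda it: (f(x) for x in it))


def rdd_filter(partitions, pred):
    return map_partitions(partitions, lambda it: (x for x in it if pred(x)))


def rdd_zip(left, right):
    """Pair up elements partition by partition."""
    return [list(zip(l, r)) for l, r in zip(left, right)]


def collect(partitions):
    return list(chain.from_iterable(partitions))


a = [[1, 2, 3], [4, 5], [6]]


def f(x):
    return x * x


# First equivalence from the question: a.map(f).zip(a) vs a.map(x => (f(x), x))
lhs = rdd_zip(rdd_map(a, f), a)
rhs = rdd_map(a, lambda x: (f(x), x))

evens = collect(rdd_filter(a, lambda x: x % 2 == 0))
```

Because each helper walks a partition front to back and never moves an element to another partition, the relative order of surviving elements cannot change, so both sides of the first equivalence come out identical in the model.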
You may now ask: What if I have an RDD with a HashPartitioner. What happens when I use map to change the keys? Well, they will stay in place, and now the RDD is not partitioned by the key. You can use partitionBy to restore the partitioning with a shuffle.
In Spark 2.0.0+, coalesce doesn't guarantee partition order during a merge. DefaultPartitionCoalescer has an optimization algorithm based on partition locality. When a partition contains information about its locality, DefaultPartitionCoalescer tries to merge partitions on the same host. Only when there is no locality information does it simply split the partitions based on their index and preserve partition order.
UPDATE:
If you load a DataFrame from files, such as Parquet, Spark breaks the order when it plans file splits. You can see this in DataSourceScanExec.scala#L629, or in FileScan#L152 in the newer Spark 3.x if you use it. It just sorts partitions by size, and the splits that are smaller than spark.sql.files.maxPartitionBytes end up in the last partitions.
So, if you need to load a sorted dataset from files, you need to implement your own reader.

Resources