How to preserve partitioning through dataframe operations - apache-spark

Is there a reliable way to predict which Spark dataframe operations will preserve partitioning and which won't?
Specifically, let's say my dataframes are all partitioned with .repartition(500,'field1','field2').
Can I expect an output with 500 partitions arranged by these same fields if I apply:
groupBy() followed by agg() when grouping happens on 'field1' and 'field2' (as in the above)
join() on 'field1' and 'field2' when both dataframes are partitioned as above
Given the special way my data is prepartitioned, I'd expect no extra shuffling to take place. However, I always seem to end up with at least a few stages whose number of tasks equals spark.sql.shuffle.partitions. Is there any way to avoid that extra shuffling?

This is a well-known issue with Spark. Even if you have repartitioned the data, Spark will still shuffle it.
What is the Problem
Repartitioning by a column ensures that all rows with the same column value end up in the same partition.
Good example here:
import spark.implicits._ // needed for toDF and the $"..." column syntax
val people = List(
  (10, "blue"),
  (13, "red"),
  (15, "blue"),
  (99, "red"),
  (67, "blue")
)
val peopleDf = people.toDF("age", "color")
val colorDf = peopleDf.repartition($"color")
Partition 00091
Partition 00168
However, Spark doesn't remember this information for subsequent operations. Nor does it keep a total ordering across partitions: for a single partition Spark knows which column value it holds, but it doesn't know which other partitions hold data for the same column value. The data also needs to be sorted to ensure that no shuffle is required.
How can you solve
You need to use the Spark bucketing feature to ensure no shuffle in subsequent stages.
I found this wiki to be pretty detailed about the bucketing features.
Bucketing is an optimization technique in Spark SQL that uses buckets and bucketing columns to determine data partitioning.
The motivation is to optimize performance of a join query by avoiding shuffles (aka exchanges) of tables participating in the join. Bucketing results in fewer exchanges (and so stages).
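To make the idea of avoiding exchanges concrete, here is a toy pure-Python model (not Spark code). It assumes rows are assigned to buckets by hash(key) % num_buckets, which only stands in for Spark's own hash function, and bucketize and bucket_join are hypothetical helper names. When both tables use the same bucket count on the same key, bucket i of one table can only match bucket i of the other, so the join can proceed bucket by bucket without moving any rows.

```python
from collections import defaultdict

def bucketize(rows, num_buckets):
    """Assign each (key, value) row to bucket hash(key) % num_buckets."""
    buckets = [[] for _ in range(num_buckets)]
    for key, value in rows:
        buckets[hash(key) % num_buckets].append((key, value))
    return buckets

def bucket_join(left_buckets, right_buckets):
    """Inner-join two bucketed tables one bucket pair at a time."""
    assert len(left_buckets) == len(right_buckets), "bucket counts must match"
    joined = []
    for left, right in zip(left_buckets, right_buckets):
        # Build a lookup for this bucket only; no other bucket is consulted.
        lookup = defaultdict(list)
        for key, value in right:
            lookup[key].append(value)
        for key, value in left:
            for other in lookup.get(key, []):
                joined.append((key, value, other))
    return joined

# Made-up sample data for illustration.
orders = [(1, "book"), (2, "pen"), (3, "ink"), (1, "lamp")]
users = [(1, "ann"), (2, "bob"), (4, "cy")]

result = bucket_join(bucketize(orders, 4), bucketize(users, 4))
print(sorted(result))
```

The result is the same as a plain inner join on the key; the point of the sketch is only that each bucket pair was processed independently.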


Difference between shuffle partition and repartition

I am a newbie in Spark and I am trying to understand shuffle partitions and the repartition function, but I still don't understand how they are different. Do both reduce the number of partitions?
The biggest difference between shuffle partitions and repartition is when each one takes effect.
spark.sql.shuffle.partitions is a configuration property which, according to the documentation:
Configures the number of partitions to use when shuffling data for joins or aggregations.
That means every time you run a join or any type of aggregation in Spark, the data is shuffled into the number of partitions given by this configuration, whose default value is 200. So if you join two datasets, the shuffle will produce 200 partitions.
The repartition(numPartitions, *cols) function is called explicitly in your code and lets you define how many partitions you end up with, either by partition columns or just by number; it is usually used ahead of writing output. The example in the documentation shows this well.
So in general, shuffle partitions apply to joins and aggregations during execution, while repartition controls the number of output files, based on a number or on partition columns.

Spark's Shuffle Sort Merge Join. One DataFrame is bucketed. Does Spark take advantage of this?

I remember from working with RDDs that if one key-value RDD (rdd1) has a known partitioning, then performing a join with a different, unpartitioned, key-value RDD (rdd2) would give performance benefits. This is because 1) only the data of rdd2 would need to be transferred across the network, and 2) each element of rdd2 would only need to be transferred to one node rather than all, by applying the partitioning of the key of rdd1 to the key of rdd2.
I'm learning about Shuffle Sort Merge Joins with DataFrames. The example in the book I am reading (Learning Spark, 2nd Edition) is for joining two DataFrames based on user_id columns. The example is attempting to demonstrate the elimination of the Exchange stage from the join operation, so, prior to the join, both DataFrames are bucketed into an equal number of buckets by the column to be joined on.
My question is, what happens if only one of the DataFrames has been bucketed? Clearly the Exchange stage will reappear. But if we know that DataFrame1 is bucketed into N buckets by the column we want to join on, will Spark use this bucketing information to efficiently transfer the rows of DataFrame2 over the network, as in the RDD case? Would Spark leave the rows of DataFrame1 where they are, and just apply an identical bucketing to DataFrame2? (Assuming that N buckets results in a reasonable amount of data in the partitions to be joined by the executors) Or instead, does Spark inefficiently shuffle both DataFrames?
In particular, I can imagine a situation where I have a single 'master' DataFrame against which I will need to perform many independent joins with other supplemental DataFrames on the same column. Surely it should only be necessary to pre-bucket the master DataFrame in order to see the performance benefits for all joins? (Although taking the trouble to bucket the supplemental DataFrames wouldn't hurt either, I think.)
This explains it all, with some embellishment over their original postings, which I summarize.
Bottom line:
val t1 = spark.table("unbucketed")
val t2 = spark.table("bucketed")
val t3 = spark.table("bucketed")
// Unbucketed - bucketed join. Both sides need to be repartitioned.
t1.join(t2, Seq("key")).explain()
// Unbucketed with repartition - bucketed join. Unbucketed side is
// correctly repartitioned, and only one shuffle is needed.
t1.repartition(16, $"key").join(t2, Seq("key")).explain()
// Unbucketed with incorrect repartitioning (default of 200) - bucketed join.
// Unbucketed side is incorrectly repartitioned, and two shuffles are needed.
t1.repartition($"key").join(t2, Seq("key")).explain()
// Bucketed - bucketed join. Ideal case, both sides have the same
// bucketing, and no shuffles are needed.
t3.join(t2, Seq("key")).explain()
So both sides need the same bucketing for optimal performance.
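Why the partition count matters can be seen with a little arithmetic. The sketch below is a pure-Python toy, not Spark code: it assumes the partition index is simply key % n, as a stand-in for Spark's real hash partitioning, and aligned_fraction is a hypothetical helper. With 16 partitions on both sides every key lands at the same index, whereas 200 against 16 leaves most keys at different indices, so the two layouts cannot be joined partition by partition.

```python
def aligned_fraction(keys, left_parts, right_parts):
    """Fraction of keys that get the same partition index on both sides."""
    same = sum(1 for k in keys if k % left_parts == k % right_parts)
    return same / len(keys)

keys = range(1000)  # made-up integer join keys
matching = aligned_fraction(keys, 16, 16)
mismatched = aligned_fraction(keys, 200, 16)

print(f"16 vs 16:  {matching:.1%} of keys aligned")
print(f"200 vs 16: {mismatched:.1%} of keys aligned")
```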

Does spark do local aggregation when groupBy is used?

I know that rdd.groupByKey() shuffles everything and only then proceeds with subsequent operations. So if you need to group rows and transform them, groupByKey will shuffle all the data and only then do the transformation. In the case of reductive transformations and a large number of rows with the same grouping key, this is inefficient, because the number of rows inside each partition could be reduced greatly by a local reduction before the shuffle. Does dataset.groupBy() act the same?
I'm using Spark 1.6
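The saving described in the question can be illustrated with a toy pure-Python model (not Spark code; shuffle_without_combine, shuffle_with_combine and final_sum are hypothetical names). It counts how many records would cross the shuffle when a per-key sum is computed with and without first reducing inside each partition.

```python
from collections import Counter

def shuffle_without_combine(partitions):
    """Every (key, value) record is sent through the shuffle unchanged."""
    return [record for part in partitions for record in part]

def shuffle_with_combine(partitions):
    """Each partition sums its own values per key, then sends one record per key."""
    shuffled = []
    for part in partitions:
        local = Counter()
        for key, value in part:
            local[key] += value
        shuffled.extend(local.items())
    return shuffled

def final_sum(records):
    """Reduce side: merge whatever arrived into one total per key."""
    totals = Counter()
    for key, value in records:
        totals[key] += value
    return dict(totals)

# Two made-up input partitions with repeated keys.
partitions = [
    [("a", 1), ("a", 2), ("b", 5), ("a", 3)],
    [("b", 1), ("b", 1), ("a", 4)],
]

plain = shuffle_without_combine(partitions)
combined = shuffle_with_combine(partitions)
print(len(plain), "records shuffled without local reduction")
print(len(combined), "records shuffled with local reduction")
print(final_sum(plain) == final_sum(combined))
```

Both paths give the same totals; only the number of records that have to move differs.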

DataFrame partitionBy to a single Parquet file (per partition)

I would like to repartition / coalesce my data so that it is saved into one Parquet file per partition. I would also like to use the Spark SQL partitionBy API. So I could do that like this:
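df.coalesce(1).write // df and location are placeholders for the DataFrame and output path
  .partitionBy("entity", "year", "month", "day", "status").parquet(location)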
I've tested this and it doesn't seem to perform well. This is because there is only one partition to work on in the dataset and all the partitioning, compression and saving of files has to be done by one CPU core.
I could rewrite this to do the partitioning manually (using filter with the distinct partition values for example) before calling coalesce.
But is there a better way to do this using the standard Spark SQL API?
I had the exact same problem and I found a way to do this using DataFrame.repartition(). The problem with using coalesce(1) is that your parallelism drops to 1, and it can be slow at best and error out at worst. Increasing that number doesn't help either -- if you do coalesce(10) you get more parallelism, but end up with 10 files per partition.
To get one file per partition without using coalesce(), use repartition() with the same columns you want the output to be partitioned by. So in your case, do this:
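import spark.implicits._
df // df and location are placeholders for your DataFrame and output path
  .repartition($"entity", $"year", $"month", $"day", $"status")
  .write
  .partitionBy("entity", "year", "month", "day", "status")
  .parquet(location)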
Once I do that I get one parquet file per output partition, instead of multiple files.
I tested this in Python, but I assume in Scala it should be the same.
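Why this gives one file per output partition can be shown with a toy pure-Python model (not Spark code). It assumes that rows are assigned to in-memory partitions by hashing the partition-column values, and that each in-memory partition writes one file into every output directory for which it holds rows; repartition_by and files_per_directory are hypothetical helpers and the rows are made up.

```python
from collections import defaultdict

def files_per_directory(partitions, partition_cols):
    """Count files per output directory: one per (in-memory partition, directory) pair."""
    writers = defaultdict(set)
    for index, rows in enumerate(partitions):
        for row in rows:
            directory = tuple(row[c] for c in partition_cols)
            writers[directory].add(index)
    return {d: len(ix) for d, ix in writers.items()}

def repartition_by(rows, cols, num_partitions):
    """Hash-partition rows on the given columns (toy stand-in for repartition)."""
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        parts[hash(tuple(row[c] for c in cols)) % num_partitions].append(row)
    return parts

cols = ["year", "month"]
rows = [{"year": 2020 + i % 2, "month": 1 + i % 3, "value": i} for i in range(12)]

# Arbitrary starting layout: rows dealt round-robin across 4 partitions.
round_robin = [rows[i::4] for i in range(4)]
before = files_per_directory(round_robin, cols)
after = files_per_directory(repartition_by(rows, cols, 4), cols)

print("files per directory before:", before)
print("files per directory after: ", after)
```

After hashing on the partition columns, all rows for a given directory sit in a single in-memory partition, so only one writer ever touches that directory.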
By definition :
coalesce(numPartitions: Int): DataFrame
Returns a new DataFrame that has exactly numPartitions partitions.
You can use it to decrease the number of partitions in the RDD/DataFrame with the numPartitions parameter. It's useful for running operations more efficiently after filtering down a large dataset.
Concerning your code, it doesn't perform well because what you are actually doing is:
putting everything into 1 partition, which overloads a single task since all the data is pulled into that one partition (it is also not good practice)
moving all the data into that one partition also sends most of it over the network, which may result in a performance loss, much as a shuffle does.
The shuffle is Spark’s mechanism for re-distributing data so that it’s grouped differently across partitions. This typically involves copying data across executors and machines, making the shuffle a complex and costly operation.
The shuffle concept is very important to manage and understand. It's always preferable to shuffle the minimum possible because it is an expensive operation since it involves disk I/O, data serialization, and network I/O. To organize data for the shuffle, Spark generates sets of tasks - map tasks to organize the data, and a set of reduce tasks to aggregate it. This nomenclature comes from MapReduce and does not directly relate to Spark’s map and reduce operations.
Internally, results from individual map tasks are kept in memory until they can’t fit. Then, these are sorted based on the target partition and written to a single file. On the reduce side, tasks read the relevant sorted blocks.
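The map-side and reduce-side steps described above can be sketched with a toy pure-Python model (not Spark's implementation). It assumes the target partition is hash(key) % num_reducers and keeps everything in memory instead of writing files; map_task and reduce_task are hypothetical names.

```python
def map_task(records, num_reducers):
    """Tag each record with its target partition, then sort by that partition id."""
    tagged = [(hash(key) % num_reducers, key, value) for key, value in records]
    return sorted(tagged, key=lambda t: t[0])

def reduce_task(map_outputs, partition_id):
    """Read only the block destined for this partition from every map output."""
    return [(k, v) for output in map_outputs for p, k, v in output if p == partition_id]

# Two made-up input partitions of (key, value) records.
inputs = [[(1, "a"), (2, "b"), (3, "c")], [(2, "d"), (4, "e"), (1, "f")]]

map_outputs = [map_task(part, 2) for part in inputs]
reduced = [reduce_task(map_outputs, p) for p in range(2)]
print(reduced)
```

After the exchange, every record for a given key sits in exactly one reduce partition, which is what a grouping or join needs.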
Concerning partitioning parquet, I suggest that you read the answer here about Spark DataFrames with Parquet Partitioning and also this section in the Spark Programming Guide for Performance Tuning.
I hope this helps !
It isn't much on top of @mortada's solution, but here's a little abstraction that ensures you are using the same partitioning to repartition and write, and demonstrates sorting as well:
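from datetime import datetime

def one_file_per_partition(df, path, partitions, sort_within_partitions, VERBOSE = False):
    start = datetime.now()
    (df.repartition(*partitions)  # same columns as partitionBy below
       .sortWithinPartitions(*sort_within_partitions)
       .write.partitionBy(*partitions)
       # TODO: Format of your choosing here
       .mode("overwrite").parquet(path)
       # or, e.g.:
       #.option("compression", "gzip").option("header", "true").mode("overwrite").csv(path)
    )
    print(f"Wrote data partitioned by {partitions} and sorted by {sort_within_partitions} to:" +
          f"\n {path}\n Time taken: {(datetime.now() - start).total_seconds():,.2f} seconds")
# sort_cols is a placeholder for your own list of columns to sort by
one_file_per_partition(df, location, ["entity", "year", "month", "day", "status"], sort_cols)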

Ordering of rows in JavaRdds after union

I am trying to find out any information on the ordering of the rows in a RDD.
Here is what I am trying to do:
Rdd1, Rdd2
Rdd3 = Rdd1.union(rdd2);
in Rdd3, is there any guarantee that rdd1 records will appear first and rdd2 afterwards?
In my tests I saw this behaviour happening, but I wasn't able to find it in any docs.
Just FYI, I really do not care about the ordering within each RDD (i.e. the order of rdd2's or rdd1's own data is really not a concern), but after the union rdd1's records must come first; that is the requirement.
In Spark, the elements within a particular partition are unordered; however, the partitions themselves are ordered.
If you check your RDD3, you should find that it is just all the partitions of RDD1 followed by all the partitions of RDD2, so in this case the results happen to be ordered in the way you want. You can read in "In Apache Spark, why does RDD.union not preserve the partitioner?" that simply concatenating the partitions from the 2 RDDs is the standard behaviour of Spark.
So in this case, it appears that union will give you what you want. However, this behaviour is an implementation detail of union and not part of its interface definition, so you cannot rely on it not being reimplemented with different behaviour in the future.
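The behaviour described above can be pictured with a toy pure-Python model (not Spark code). It assumes an RDD is represented as an ordered list of partitions and that union simply concatenates those lists, as this answer describes; union and collect here are hypothetical stand-ins for the real methods.

```python
def union(rdd_a, rdd_b):
    """Toy union: the partitions of rdd_a followed by the partitions of rdd_b."""
    return rdd_a + rdd_b

def collect(rdd):
    """Read the partitions in order and flatten them into one list."""
    return [record for partition in rdd for record in partition]

# Made-up records; each inner list is one partition.
rdd1 = [["a1", "a2"], ["a3"]]
rdd2 = [["b1"], ["b2", "b3"]]

rdd3 = union(rdd1, rdd2)
print(len(rdd3), "partitions:", collect(rdd3))
```

In this model rdd1's records come first only because its partitions are listed first, which is exactly the kind of implementation detail that should not be relied upon.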
