I went through the documentation here: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html
It says:
for repartition: resulting DataFrame is hash partitioned.
for repartitionByRange: resulting DataFrame is range partitioned.
And a previous question also mentions it. However, I still don't understand how exactly they differ and what the impact will be when choosing one over the other?
More importantly, if repartition does hash partitioning, what impact does providing columns as its argument have?
I think it is best to look into the difference with some experiments.
Test Dataframes
For this experiment, I am using the following two Dataframes (I am showing the code in Scala but the concept is identical to Python APIs):
// Dataframe with one column "value" containing the values ranging from 0 to 1000000
val df = Seq(0 to 1000000: _*).toDF("value")
// Dataframe with one column "value" containing the number 0 repeated 1000001 times, plus the numbers 5000, 10000 and 100000
val df2 = Seq((0 to 1000000).map(_ => 0) :+ 5000 :+ 10000 :+ 100000: _*).toDF("value")
Theory
repartition applies the HashPartitioner when one or more columns are provided and the RoundRobinPartitioner when no column is provided. If one or more columns are provided (HashPartitioner), those values will be hashed and used to determine the partition number by calculating something like partition = hash(columns) % numberOfPartitions. If no column is provided (RoundRobinPartitioner) the data gets evenly distributed across the specified number of partitions.
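To make the two strategies concrete, here is a minimal pure-Python sketch (no Spark involved). The function names are made up for illustration, and zlib.crc32 is only a stand-in for Spark's internal hash function, so the partition numbers will not match what Spark produces.

```python
import zlib
from collections import Counter

def hash_partition(value, num_partitions):
    # Stand-in hash (assumption): Spark uses its own hash function internally.
    return zlib.crc32(str(value).encode("utf-8")) % num_partitions

def round_robin_partition(row_index, num_partitions, start=0):
    # Rows are dealt out one after another, regardless of their content.
    return (start + row_index) % num_partitions

# Mostly zeros plus three other values, similar in spirit to df2 above.
values = [0] * 1000 + [5000, 10000, 100000]

hashed = Counter(hash_partition(v, 4) for v in values)
dealt = Counter(round_robin_partition(i, 4) for i, _ in enumerate(values))

print("hash:       ", sorted(hashed.items()))
print("round robin:", sorted(dealt.items()))
```

Equal values always hash to the same partition, so the zeros pile up in one place, while the round robin counts differ by at most one row.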
repartitionByRange will partition the data based on a range of the column values. This is usually used for continuous (not discrete) values such as any kind of numbers. Note that for performance reasons this method uses sampling to estimate the ranges. Hence, the output may not be consistent between runs, since sampling can return different values. The sample size can be controlled by the config spark.sql.execution.rangeExchange.sampleSizePerPartition.
It is also worth mentioning that for both methods, if numPartitions is not given, the Dataframe is by default partitioned into the number of partitions configured by spark.sql.shuffle.partitions in your Spark session, and these could be coalesced by Adaptive Query Execution (available since Spark 3.x).
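The sampling idea can be sketched in plain Python. This is a simplified model under my own assumptions (a single random sample, evenly spaced cut points, duplicate cut points dropped); Spark's real implementation is more elaborate, and the helper names are hypothetical.

```python
import bisect
import random

def estimate_bounds(values, num_partitions, sample_size, seed=42):
    # Sample the data, sort the sample and pick evenly spaced cut points.
    rng = random.Random(seed)
    sample = sorted(rng.sample(values, min(sample_size, len(values))))
    bounds = []
    for i in range(1, num_partitions):
        candidate = sample[i * len(sample) // num_partitions]
        # Skip duplicate cut points: they would only create empty ranges.
        if not bounds or candidate > bounds[-1]:
            bounds.append(candidate)
    return bounds

def range_partition(value, bounds):
    # Values <= bounds[0] go to partition 0, values <= bounds[1] to partition 1, ...
    return bisect.bisect_left(bounds, value)

uniform = list(range(100_000))
skewed = [0] * 100_000 + [5000, 10000, 100000]

uniform_bounds = estimate_bounds(uniform, 4, sample_size=1000)
skewed_bounds = estimate_bounds(skewed, 4, sample_size=1000)

print("uniform bounds:", uniform_bounds)
print("skewed bounds: ", skewed_bounds)
```

With evenly spread values the sample yields three distinct cut points and therefore four ranges. When one value dominates, the cut points collapse into a single bound, so fewer partitions than requested are actually produced.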
Test Setup
I always apply the same code to the test data:
val testDf = df
// here I will insert the partition logic
.withColumn("partition", spark_partition_id()) // applying SQL built-in function to determine actual partition
.groupBy(col("partition"))
.agg(
count(col("value")).as("count"),
min(col("value")).as("min_value"),
max(col("value")).as("max_value"))
.orderBy(col("partition"))
testDf.show(false)
Test Results
df.repartition(4, col("value"))
As expected, we get 4 partitions and because the values of df are ranging from 0 to 1000000 we see that their hashed values will result in a well distributed Dataframe.
+---------+------+---------+---------+
|partition|count |min_value|max_value|
+---------+------+---------+---------+
|0 |249911|12 |1000000 |
|1 |250076|6 |999994 |
|2 |250334|2 |999999 |
|3 |249680|0 |999998 |
+---------+------+---------+---------+
df.repartitionByRange(4, col("value"))
Also in this case, we get 4 partitions, but this time the min and max values clearly show the ranges of values within a partition. The data is almost equally distributed, with about 250000 values per partition.
+---------+------+---------+---------+
|partition|count |min_value|max_value|
+---------+------+---------+---------+
|0 |244803|0 |244802 |
|1 |255376|244803 |500178 |
|2 |249777|500179 |749955 |
|3 |250045|749956 |1000000 |
+---------+------+---------+---------+
df2.repartition(4, col("value"))
Now, we are using the other Dataframe df2. Here, the hashing algorithm is hashing the values, which are only 0, 5000, 10000 or 100000. Of course, the hash of the value 0 will always be the same, so all zeros end up in the same partition (in this case partition 3). The other three partitions each contain only one value.
+---------+-------+---------+---------+
|partition|count |min_value|max_value|
+---------+-------+---------+---------+
|0 |1 |100000 |100000 |
|1 |1 |10000 |10000 |
|2 |1 |5000 |5000 |
|3 |1000001|0 |0 |
+---------+-------+---------+---------+
df2.repartition(4)
Without using the content of the column "value", the repartition method will distribute the rows on a round robin basis. All partitions have almost the same amount of data.
+---------+------+---------+---------+
|partition|count |min_value|max_value|
+---------+------+---------+---------+
|0 |250002|0 |5000 |
|1 |250002|0 |10000 |
|2 |249998|0 |100000 |
|3 |250002|0 |0 |
+---------+------+---------+---------+
df2.repartitionByRange(4, col("value"))
This case shows that the Dataframe df2 is not well suited for repartitioning by range, as almost all values are 0. Therefore, we end up with only two partitions, where partition 0 contains all the zeros.
+---------+-------+---------+---------+
|partition|count |min_value|max_value|
+---------+-------+---------+---------+
|0 |1000001|0 |0 |
|1 |3 |5000 |100000 |
+---------+-------+---------+---------+
By using df.explain you can get a lot of information about these operations.
I'm using this DataFrame for the example :
df = spark.createDataFrame([(i, f"value {i}") for i in range(1, 22, 1)], ["id", "value"])
Repartition
Depending on whether a key expression (column) is specified or not, the partitioning method will be different. It is not always hash partitioning as you said.
df.repartition(3).explain(True)
== Parsed Logical Plan ==
Repartition 3, true
+- LogicalRDD [id#0L, value#1], false
== Analyzed Logical Plan ==
id: bigint, value: string
Repartition 3, true
+- LogicalRDD [id#0L, value#1], false
== Optimized Logical Plan ==
Repartition 3, true
+- LogicalRDD [id#0L, value#1], false
== Physical Plan ==
Exchange RoundRobinPartitioning(3)
+- Scan ExistingRDD[id#0L,value#1]
We can see in the generated physical plan that RoundRobinPartitioning is used:
Represents a partitioning where rows are distributed evenly across
output partitions by starting from a random target partition number
and distributing rows in a round-robin fashion. This partitioning is
used when implementing the DataFrame.repartition() operator.
When using repartition by column expression:
df.repartition(3, "id").explain(True)
== Parsed Logical Plan ==
'RepartitionByExpression ['id], 3
+- LogicalRDD [id#0L, value#1], false
== Analyzed Logical Plan ==
id: bigint, value: string
RepartitionByExpression [id#0L], 3
+- LogicalRDD [id#0L, value#1], false
== Optimized Logical Plan ==
RepartitionByExpression [id#0L], 3
+- LogicalRDD [id#0L, value#1], false
== Physical Plan ==
Exchange hashpartitioning(id#0L, 3)
+- Scan ExistingRDD[id#0L,value#1]
Now the picked partitioning method is hashpartitioning.
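In the hash partitioning method, a hash is calculated over the key expressions to determine the destination partition id via a modulo. Note that for DataFrames this is not Java's Object.hashCode (that is what the RDD HashPartitioner uses): Spark SQL applies a Murmur3 hash to the key expressions and takes a non-negative modulo, conceptually pmod(hash(key), numPartitions).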
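One detail of the modulo step is that hash values can be negative, so a plain remainder in a JVM language could yield a negative partition id. Here is a minimal sketch of a non-negative modulo; java_remainder is a hypothetical helper that mimics the JVM's % operator, and neither function is Spark code.

```python
def java_remainder(a, n):
    # Mimics the JVM's % operator: the result takes the sign of the dividend.
    r = abs(a) % n
    return r if a >= 0 else -r

def pmod(a, n):
    # Non-negative modulo: the result always lies in the range [0, n).
    r = java_remainder(a, n)
    return r + n if r < 0 else r

for h in (7, -7, -12):
    print(h, java_remainder(h, 5), pmod(h, 5))
```

Only the pmod result is usable as a partition index for every possible hash value.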
RepartitionByRange
This partitioning method creates numPartitions consecutive, non-overlapping ranges of values based on the partitioning key. Thus, at least one key expression is required, and it needs to be orderable.
df.repartitionByRange(3, "id").explain(True)
== Parsed Logical Plan ==
'RepartitionByExpression ['id ASC NULLS FIRST], 3
+- LogicalRDD [id#0L, value#1], false
== Analyzed Logical Plan ==
id: bigint, value: string
RepartitionByExpression [id#0L ASC NULLS FIRST], 3
+- LogicalRDD [id#0L, value#1], false
== Optimized Logical Plan ==
RepartitionByExpression [id#0L ASC NULLS FIRST], 3
+- LogicalRDD [id#0L, value#1], false
== Physical Plan ==
Exchange rangepartitioning(id#0L ASC NULLS FIRST, 3)
+- Scan ExistingRDD[id#0L,value#1]
Looking at the generated physical plan, we can see that rangepartitioning differs from the two others described above by the presence of the ordering clause in the partitioning expression. When no explicit sort order is specified in the expression, it uses ascending order by default.
Some interesting links:
Repartition Logical Operators — Repartition and RepartitionByExpression
Range partitioning in Apache SparkSQL
hash vs range partitioning
Seems like I'm missing something about repartition in spark.
AFAIK, you can repartition with a key:
df.repartition("key") , in which case spark will use a hash partitioning method.
And you can repartition with setting only partitions number:
df.repartition(10), in which spark will use a round robin partitioning method.
If repartitioning with only a number of partitions is done in a round robin manner, in which case would a round robin partitioning have a data skew that requires using a salt to spread the rows evenly?
With df.repartition(10) you cannot have a skew. As you mention, Spark uses a round robin partitioning method so that partitions have the same size.
We can check that:
spark.range(100000).repartition(5).explain
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Exchange RoundRobinPartitioning(5), REPARTITION_BY_NUM, [id=#1380]
+- Range (0, 100000, step=1, splits=16)
spark.range(100000).repartition(5).groupBy(spark_partition_id).count.show
+--------------------+-----+
|SPARK_PARTITION_ID()|count|
+--------------------+-----+
| 0|20000|
| 1|20000|
| 2|20000|
| 3|20000|
| 4|20000|
+--------------------+-----+
If you use df.repartition("key"), something different happens:
// let's specify the number of partitions as well
spark.range(100000).repartition(5, 'id).explain
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Exchange hashpartitioning(id#352L, 5), REPARTITION_BY_NUM, [id=#1424]
+- Range (0, 100000, step=1, splits=16)
Let's try:
spark.range(100000).repartition(5, 'id).groupBy(spark_partition_id).count.show
+--------------------+-----+
|SPARK_PARTITION_ID()|count|
+--------------------+-----+
| 0|20128|
| 1|20183|
| 2|19943|
| 3|19940|
| 4|19806|
+--------------------+-----+
Each element of the column is hashed and hashes are split between partitions. Therefore partitions have similar sizes but they don't have exactly the same size. However, two rows with the same key necessarily end up in the same partition. So if your key is skewed (one or more particular keys are over-represented in the dataframe), your partitioning will be skewed as well:
spark.range(100000)
.withColumn("key", when('id < 1000, 'id).otherwise(lit(0)))
.repartition(5, 'key)
.groupBy(spark_partition_id).count.show
+--------------------+-----+
|SPARK_PARTITION_ID()|count|
+--------------------+-----+
| 0|99211|
| 1| 196|
| 2| 190|
| 3| 200|
| 4| 203|
+--------------------+-----+
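Since the question mentions salting, here is a rough pure-Python sketch of what a salt does to such a skewed key. It is a model, not Spark code: zlib.crc32 stands in for Spark's hash function, and NUM_SALTS is a hypothetical setting.

```python
import random
import zlib
from collections import Counter

NUM_PARTITIONS = 5
NUM_SALTS = 50  # hypothetical choice; tune to the degree of skew

def partition_of(*parts):
    # Stand-in hash (assumption), not the hash function Spark uses.
    raw = ":".join(str(p) for p in parts).encode("utf-8")
    return zlib.crc32(raw) % NUM_PARTITIONS

# Same shape as the skewed example above: ids below 1000 keep their value, the rest become 0.
keys = [i if i < 1000 else 0 for i in range(100_000)]

rng = random.Random(0)
plain = Counter(partition_of(k) for k in keys)
salted = Counter(partition_of(k, rng.randrange(NUM_SALTS)) for k in keys)

print("without salt:", sorted(plain.items()))
print("with salt:   ", sorted(salted.items()))
```

The price of the salt is that rows with the same key are no longer guaranteed to share a partition, so an aggregation per key then needs a second step that merges the partial results.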
Say we have the following dataframe (which is borrowed from 'PySpark by Examples' website):
simpleData = [("James","Sales","NY",90000,34,10000), \
("Michael","Sales","NY",86000,56,20000), \
("Robert","Sales","CA",81000,30,23000), \
("Maria","Finance","CA",90000,24,23000), \
("Raman","Finance","CA",99000,40,24000), \
("Scott","Finance","NY",83000,36,19000), \
("Jen","Finance","NY",79000,53,15000), \
("Jeff","Marketing","CA",80000,25,18000), \
("Kumar","Marketing","NY",91000,50,21000) \
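]
columns = ["employee_name", "department", "state", "salary", "age", "bonus"]
df = spark.createDataFrame(data=simpleData, schema=columns)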
Then, if we run the two following sort (orderBy) commands:
df.sort("department","state").show(truncate=False)
or
df.sort(col("department"),col("state")).show(truncate=False)
We get the same result:
+-------------+----------+-----+------+---+-----+
|employee_name|department|state|salary|age|bonus|
+-------------+----------+-----+------+---+-----+
|Maria |Finance |CA |90000 |24 |23000|
|Raman |Finance |CA |99000 |40 |24000|
|Jen |Finance |NY |79000 |53 |15000|
|Scott |Finance |NY |83000 |36 |19000|
|Jeff |Marketing |CA |80000 |25 |18000|
|Kumar |Marketing |NY |91000 |50 |21000|
|Robert |Sales |CA |81000 |30 |23000|
|James |Sales |NY |90000 |34 |10000|
|Michael |Sales |NY |86000 |56 |20000|
+-------------+----------+-----+------+---+-----+
I know the first one takes the DataFrame column name as a string and the next one takes columns in Column type. But is there a difference between these two in case of tasks such as processing or future uses? Is one of them better than the other or pySpark standard form? Or are they just aliases?
PS: In addition to the above, one of the reasons I'm asking this question is that someone told me there is a 'standard' business form for using Spark. For example, 'alias' is more popular than 'withColumnRenamed' in the business. Of course, this doesn't sound right to me.
If you look at the explain plan you'll see that both queries generate the same physical plan, so processing wise they are identical.
df_sort1 = df.sort("department", "state")
df_sort2 = df.sort(col("department"), col("state"))
df_sort1.explain()
df_sort2.explain()
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [department#1 ASC NULLS FIRST, state#2 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(department#1 ASC NULLS FIRST, state#2 ASC NULLS FIRST, 200), ENSURE_REQUIREMENTS, [id=#8]
+- Scan ExistingRDD[employee_name#0,department#1,state#2,salary#3L,age#4L,bonus#5L]
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [department#1 ASC NULLS FIRST, state#2 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(department#1 ASC NULLS FIRST, state#2 ASC NULLS FIRST, 200), ENSURE_REQUIREMENTS, [id=#18]
+- Scan ExistingRDD[employee_name#0,department#1,state#2,salary#3L,age#4L,bonus#5L]
Businesses might have coding guidelines in which they specify what to use. If they exist, follow them. If not, and you're working on existing code, it is usually best to follow what is there already. Otherwise it's mainly preference; I'm not aware of a 'standard business form' of PySpark.
In the case of alias vs withColumnRenamed there is an argument to be made in favor of alias if you're renaming multiple columns. Selecting with alias will generate a single projection in the parsed logical plan, whereas using multiple withColumnRenamed calls will generate multiple projections.
To be certain that the two versions do the same thing, we can have a look at the source code of dataframe.py. Here is the signature of the sort method:
def sort(
self, *cols: Union[str, Column, List[Union[str, Column]]], **kwargs: Any
) -> "DataFrame":
When you follow the various method calls, you end up on this line:
jcols = [_to_java_column(cast("ColumnOrName", c)) for c in cols]
This converts every column argument, whether it was passed as a string or as a Column (cf. the method signature), to a Java column. From then on only the Java columns are used, regardless of how they were passed to the method, so the two versions of the sort call do exactly the same thing with exactly the same code.
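The normalisation pattern can be illustrated with a toy model. The Column class and the two helpers below are hypothetical stand-ins written for this example; they are not PySpark's actual classes or functions.

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Column:
    # Toy stand-in for pyspark.sql.Column (hypothetical, for illustration only).
    name: str

def to_column(c: Union[str, Column]) -> Column:
    # Strings are wrapped; Column objects pass through unchanged.
    return Column(c) if isinstance(c, str) else c

def sort_keys(*cols: Union[str, Column]):
    # After normalisation the rest of the code only ever sees Column objects.
    return [to_column(c) for c in cols]

by_name = sort_keys("department", "state")
by_column = sort_keys(Column("department"), Column("state"))
print(by_name == by_column)
```

Because both call styles are reduced to the same internal representation at the very first step, nothing downstream can behave differently.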
Recently I was asked in an interview about the algorithm of spark df.show() function.
how will spark decide from which executor/executors the records will be fetched?
Without undermining @thebluephantom's and @Hristo Iliev's answers (each gives some insight into what's happening under the hood), I also wanted to add my answer to this list.
I came to the same conclusion(s), albeit by observing the behaviour of the underlying partitions.
Partitions have an index associated with them. This is seen in the code below.
(Taken from original spark source code here).
trait Partition extends Serializable {
def index: Int
:
So amongst partitions, there is an order.
And as already mentioned in other answers, the df.show() is the same as df.show(20) or the top 20 rows. So the underlying partition indexes determine which partition (and hence executor) those rows come from.
The partition indexes are assigned at the time of read, or (re-assigned) during a shuffle.
Here is some code to see this behaviour:
val df = Seq((5,5), (6,6), (7,7), (8,8), (1,1), (2,2), (3,3), (4,4)).toDF("col1", "col2")
// above sequence is defined out of order - to make behaviour visible
// see partition structure
df.rdd.glom().collect()
/* Array(Array([5,5]), Array([6,6]), Array([7,7]), Array([8,8]), Array([1,1]), Array([2,2]), Array([3,3]), Array([4,4])) */
df.show(4, false)
/*
+----+----+
|col1|col2|
+----+----+
|5 |5 |
|6 |6 |
|7 |7 |
|8 |8 |
+----+----+
only showing top 4 rows
*/
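In the above code, we see 8 partitions (each inner Array is a partition). This is because Spark falls back to its default parallelism when we create a dataframe from a local collection, which was 8 in this session (in local mode it typically equals the number of available cores).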
Now let's repartition the dataframe.
// Now let's repartition df
val df2 = df.repartition(2)
// lets see the partition structure
df2.rdd.glom().collect()
/* Array(Array([5,5], [6,6], [7,7], [8,8], [1,1], [2,2], [3,3], [4,4]), Array()) */
// lets see output
df2.show(4,false)
/*
+----+----+
|col1|col2|
+----+----+
|5 |5 |
|6 |6 |
|7 |7 |
|8 |8 |
+----+----+
only showing top 4 rows
*/
In the above code, the top 4 rows came from the first partition (which actually holds all elements of the original data). Also note the skew in partition sizes: repartition(2) without a column did not balance this tiny dataset, whose eight input partitions hold only one row each.
Now let's try to create 3 partitions
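One plausible reading of this skew, as far as I understand Spark's round robin exchange, is that every input partition deals out its own rows starting from its own start position. The sketch below is a simplified model in which the start positions are passed in explicitly (an assumption; Spark derives them from a random number generator).

```python
from collections import Counter

def round_robin(input_partitions, num_out, start_positions):
    # Each input partition deals out its own rows, beginning at its own start position.
    counts = Counter()
    for rows, start in zip(input_partitions, start_positions):
        for offset, _row in enumerate(rows):
            counts[(start + offset) % num_out] += 1
    return [counts[p] for p in range(num_out)]

rows = [5, 6, 7, 8, 1, 2, 3, 4]

# One input partition holding all 8 rows: the output is perfectly even.
one_big = round_robin([rows], 2, start_positions=[0])

# Eight single-row input partitions that happen to share a start position:
# every row lands in the same output partition.
many_small = round_robin([[r] for r in rows], 2, start_positions=[0] * 8)

print(one_big, many_small)
```

With realistic data volumes each input partition holds many rows, so the start position hardly matters and the output is close to even.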
val df3 = df.repartition(3)
// lets see partition structures
df3.rdd.glom().collect()
/*
Array(Array([8,8], [1,1], [2,2]), Array([5,5], [6,6]), Array([7,7], [3,3], [4,4]))
*/
// And lets see the top 4 rows this time
df3.show(4, false)
/*
+----+----+
|col1|col2|
+----+----+
|8 |8 |
|1 |1 |
|2 |2 |
|5 |5 |
+----+----+
only showing top 4 rows
*/
From the above code, we observe that Spark went to the first partition and tried to get 4 rows. Since only 3 were available, it grabbed those. Then moved on to the next partition, and got one more row. Thus, the order that you see from show(4, false), is due to the underlying data partitioning and the index ordering amongst partitions.
This example uses show(4), but this behaviour can be extended to show() or show(20).
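The observed behaviour can be mimicked with a few lines of plain Python. This is a simplified model with a hypothetical take helper, not Spark's implementation; it only reproduces the "walk the partitions in index order" idea.

```python
def take(partitions, n):
    # Walk the partitions in index order and stop once n rows are collected.
    rows = []
    for partition in partitions:
        for row in partition:
            if len(rows) == n:
                return rows
            rows.append(row)
    return rows

# Partition layout of df3 as printed by glom() above.
df3_partitions = [
    [(8, 8), (1, 1), (2, 2)],
    [(5, 5), (6, 6)],
    [(7, 7), (3, 3), (4, 4)],
]

first_four = take(df3_partitions, 4)
print(first_four)
```

The result contains the three rows of partition 0 followed by the first row of partition 1, the same rows that show(4, false) printed for df3.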
It's simple.
In Spark 2+, show() calls showString() to format the data as a string and then prints it. showString() calls getRows() to get the top rows of the dataset as a collection of strings. getRows() calls take() to take the original rows and transforms them into strings. take() simply wraps head(). head() calls limit() to build a limit query and executes it. limit() adds a Limit(n) node at the front of the logical plan which is really a GlobalLimit(n, LocalLimit(n)). Both GlobalLimit and LocalLimit are subclasses of OrderPreservingUnaryNode that override its maxRows (in GlobalLimit) or maxRowsPerPartition (in LocalLimit) methods. The logical plan now looks like:
GlobalLimit n
+- LocalLimit n
+- ...
This goes through analysis and optimisation by Catalyst, where limits are removed if something down the tree produces fewer rows than the limit, and ends up as CollectLimitExec(m) (where m <= n) in the execution strategy, so the physical plan looks like:
CollectLimit m
+- ...
CollectLimitExec executes its child plan, then checks how many partitions the RDD has. If none, it returns an empty dataset. If one, it runs mapPartitionsInternal(_.take(m)) to take the first m elements. If more than one, it applies take(m) on each partition in the RDD using mapPartitionsInternal(_.take(m)), builds a shuffle RDD that collects the results in a single partition, then again applies take(m).
In other words, it depends (because of the optimisation phase), but in the general case it takes the top rows of the concatenation of the top rows of each partition and so involves all executors holding a part of the dataset.
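The multi-partition case described above can be modelled in plain Python as a local limit followed by a global limit. This is only a sketch with a hypothetical collect_limit helper; in particular, it concatenates the partial results in partition order, which a real shuffle does not necessarily guarantee.

```python
from itertools import islice

def collect_limit(partitions, m):
    # Local step: keep at most m rows from every partition.
    local = [list(islice(p, m)) for p in partitions]
    # "Shuffle" the survivors into a single partition (here: in partition order).
    single = [row for part in local for row in part]
    # Global step: apply the limit once more.
    return single[:m]

partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
result = collect_limit(partitions, 4)
print(result)
```

The local step bounds how much data each partition contributes, so at most m rows per partition ever have to be moved.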
OK, perhaps not so simple.
A poor question, as it is not something you would use in prod.
It is a smart action that looks at what you have in terms of transformations.
show() is in fact show(20). For a plain show, Spark looks at the first and then consecutive partitions until it has 20 rows. An order by is also optimized. A count does need complete processing.
Many google posts btw.
I have a performance issue on query after partitioning.
I have a daily parquet file of around 30 millions rows and 20 columns. For example, the file data_20210721.parquet looks like:
+-----------+---------------------+---------------------+------------+-----+
| reference | date_from | date_to | daytime | ... |
+-----------+---------------------+---------------------+------------+-----+
| A | 2021-07-21 17:30:25 | 2021-07-22 02:21:57 | 2021-07-22 | ... |
| A | 2021-07-21 12:10:10 | 2021-07-21 13:00:00 | 2021-07-21 | ... |
| A | ... | ... | ... | ... |
+-----------+---------------------+---------------------+------------+-----+
We have code that processes it so that each row covers only a single day, cutting at midnight, such that we have:
+-----------+---------------------+---------------------+------------+-----+
| reference | date_from | date_to | daytime | ... |
+-----------+---------------------+---------------------+------------+-----+
| A | 2021-07-21 17:30:25 | 2021-07-22 00:00:00 | 2021-07-21 | ... | <- split at midnight + daytime update
| A | 2021-07-22 00:00:00 | 2021-07-22 02:21:57 | 2021-07-22 | ... | <- residual
| A | 2021-07-21 12:10:10 | 2021-07-21 13:00:00 | 2021-07-21 | ... |
| A | ... | ... | ... | ... |
+-----------+---------------------+---------------------+------------+-----+
Line 2 can be called a residual because it is not from the same day as the file.
Then we wanted to generate 1 parquet per daytime so the default solution was to process each file and save the dataframe with:
df.write.partitionBy(["id", "daytime"]).mode("append").parquet("hdfs/path")
The mode is set to append because the next day, we may have residuals from past / future days.
There are also other levels of partitioning such as:
ID: it is fixed for around a year (quite good for saving some storage ;) )
weeknumber
country
Even though the partitions are quite "balanced" in terms of rows, the processing time becomes incredibly slow.
For example, to count the number of rows per day for a given set of date:
Original df (7 seconds):
spark.read.parquet("path/to/data_2021071[0-5].parquet")\
.groupBy("DayTime")\
.count()\
.show()
Partitioned data (several minutes)
spark.read.parquet("path/to/data")\
.filter( (col("DayTime") >= "2021-07-10") & (col("DayTime") <= "2021-07-15") )\
.groupBy("DayTime")\
.count()\
.show()
We thought that there were too many small files at the final level (because of the append, there are around 600 very small files of a few KB/MB each), so we tried to coalesce them for each partition, but there was no improvement. We also tried to partition only on daytime (in case having too many levels of partitioning creates issues).
Is there any solution to improve the performance (or to understand where the bottleneck is)?
Can it be linked to the fact that we are partitioning on a date column? I saw a lot of examples with partitioning by year/month/day, for example, which are 3 integers, but that does not fit our need.
This solution was perfect for solving a lot of problems we had, but the loss of performance is far too large for it to be kept as is. Any suggestion is welcome :)
EDIT 1 :
The issue comes from the fact that the plan is not the same between:
spark.read.parquet("path/to/data/DayTime=2021-07-10")
and
spark.read.parquet("path/to/data/").filter(col("DayTime")=="2021-07-10")
Here is the plan for a small example where DayTime has been converted to a "long" as I thought maybe the slowness was due to the datatype:
spark.read.parquet("path/to/test/").filter(col("ts") == 20200103).explain(extended=True)
== Parsed Logical Plan ==
'Filter ('ts = 20200103)
+- AnalysisBarrier
+- Relation[date_from#4297,date_to#4298, ....] parquet
== Analyzed Logical Plan ==
date_from: timestamp, date_to: timestamp, ts: int, ....
Filter (ts#4308 = 20200103)
+- Relation[date_from#4297,date_to#4298,ts#4308, ....] parquet
== Optimized Logical Plan ==
Filter (isnotnull(ts#4308) && (ts#4308 = 20200103))
+- Relation[date_from#4297,date_to#4298,ts#4308, ....] parquet
== Physical Plan ==
*(1) FileScan parquet [date_from#4297,date_to#4298,ts#4308, ....] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://.../test_perf], PartitionCount: 1, PartitionFilters: [isnotnull(ts#4308), (ts#4308 = 20200103)], PushedFilters: [], ReadSchema: struct<date_from:timestamp,date_to:timestamp, ....
vs
spark.read.parquet("path/to/test/ts=20200103").explain(extended=True)
== Parsed Logical Plan ==
Relation[date_from#2086,date_to#2087, ....] parquet
== Analyzed Logical Plan ==
date_from: timestamp, date_to: timestamp,, ....] parquet
== Optimized Logical Plan ==
Relation[date_from#2086,date_to#2087, ....] parquet
== Physical Plan ==
*(1) FileScan parquet [date_from#2086,date_to#2087, .....] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://.../test_perf/ts=20200103], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<date_from:timestamp,date_to:timestamp, ....
Thanks in advance,
Nicolas
You have to ensure that your filter is actually utilising the partitioned structure, pruning at disk level rather than bringing all data into memory and then applying filter.
Try to check your physical plan
spark.read.parquet("path/to/data")\
.filter( (col("DayTime") >= "2021-07-10") & (col("DayTime") <= "2021-07-15") )\
.explain()
It should contain an entry similar to PartitionFilters: [isnotnull(DayTime#123), (DayTime#123 = your condition)].
My guess is that in your case it is not utilising PartitionFilters and the whole dataset is scanned.
I would suggest experimenting with your syntax / repartition strategy on a small data set until you see PartitionFilters in the plan.
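To illustrate what pruning buys you, here is a small pure-Python model of the idea: the partition value is read from the directory name, so directories outside the requested range are never opened. The paths and the prune helper are hypothetical and only mimic the layout written by partitionBy.

```python
def prune(paths, column, low, high):
    # Keep only directories whose partition value lies in [low, high].
    kept = []
    for path in paths:
        for segment in path.split("/"):
            name, _, value = segment.partition("=")
            if name == column and low <= value <= high:
                kept.append(path)
    return kept

# Hypothetical directory layout produced by partitionBy("DayTime").
paths = [f"path/to/data/DayTime=2021-07-{day:02d}" for day in range(1, 32)]

selected = prune(paths, "DayTime", "2021-07-10", "2021-07-15")
print(len(selected), "of", len(paths), "directories need to be read")
```

The string comparison works here because ISO formatted dates sort in the same order as the dates themselves.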
If I have columns [a,b,c] in df1 and [a,b,c] in df2, and also a column d, in both where d=concat_ws('_', *[a,b,c]) would there be a performance difference between:
df1.join(df2, [a,b,c])
df1.join(df2, d)
?
The question cannot be answered with yes or no as the answer depends on the details of the DataFrames.
The performance of a join depends to a large extent on how much shuffling is necessary to execute it. If both sides of the join are partitioned by the same column(s), the join will be faster. You can see the effect of partitioning by looking at the execution plan of the join.
We create two DataFrames df1 and df2 with the columns a, b, c and d:
val sparkSession = ...
sparkSession.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
import sparkSession.implicits._
val cols = Seq("a","b","c")
def createDf = (1 to 3).map(i => (i,i,i)).toDF(cols:_*).withColumn("d", concat_ws("_", cols.map(col):_*))
val df1 = createDf
val df2 = createDf
df1 and df2 both look the same:
+---+---+---+-----+
| a| b| c| d|
+---+---+---+-----+
| 1| 1| 1|1_1_1|
| 2| 2| 2|2_2_2|
| 3| 3| 3|3_3_3|
+---+---+---+-----+
When we partition both DataFrames by column d and use this column as join condition
df1.repartition(4, col("d")).join(df2.repartition(4, col("d")), "d").explain()
we get the execution plan
== Physical Plan ==
*(3) Project [d#13, a#7, b#8, c#9, a#25, b#26, c#27]
+- *(3) SortMergeJoin [d#13], [d#31], Inner
:- *(1) Sort [d#13 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(d#13, 4)
: +- LocalTableScan [a#7, b#8, c#9, d#13]
+- *(2) Sort [d#31 ASC NULLS FIRST], false, 0
+- ReusedExchange [a#25, b#26, c#27, d#31], Exchange hashpartitioning(d#13, 4)
Partitioning both DataFrames by d but joining over a, b and c
df1.repartition(4, col("d")).join(df2.repartition(4, col("d")), cols).explain()
leads to the execution plan
== Physical Plan ==
*(3) Project [a#7, b#8, c#9, d#13, d#31]
+- *(3) SortMergeJoin [a#7, b#8, c#9], [a#25, b#26, c#27], Inner
:- *(1) Sort [a#7 ASC NULLS FIRST, b#8 ASC NULLS FIRST, c#9 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(a#7, b#8, c#9, 200)
: +- Exchange hashpartitioning(d#13, 4)
: +- LocalTableScan [a#7, b#8, c#9, d#13]
+- *(2) Sort [a#25 ASC NULLS FIRST, b#26 ASC NULLS FIRST, c#27 ASC NULLS FIRST], false, 0
+- ReusedExchange [a#25, b#26, c#27, d#31], Exchange hashpartitioning(a#7, b#8, c#9, 200)
which contains one Exchange hashpartitioning more than the first plan. In this case the join by a, b, c would be slower.
On the other hand, if the DataFrames are partitioned by a, b and c, the join by a, b, c would be faster than a join by d.
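Why matching partitioning saves an exchange can be shown with a toy model: if both sides are hash partitioned on the join key with the same number of partitions, equal keys already sit in partitions with the same id, so each pair of partitions can be joined locally. The helpers are hypothetical, and zlib.crc32 is a stand-in for Spark's hash function.

```python
import zlib
from collections import defaultdict

def hash_partition(rows, key, num_partitions):
    # Stand-in hash (assumption), not Spark's actual hash function.
    parts = defaultdict(list)
    for row in rows:
        pid = zlib.crc32(str(row[key]).encode("utf-8")) % num_partitions
        parts[pid].append(row)
    return parts

def partition_wise_join(left_parts, right_parts, key):
    # Only partitions with the same id are compared: no data moves between them.
    joined = []
    for pid, left_rows in left_parts.items():
        for l in left_rows:
            for r in right_parts.get(pid, []):
                if l[key] == r[key]:
                    joined.append((l, r))
    return joined

rows = [{"a": i, "b": i, "c": i, "d": f"{i}_{i}_{i}"} for i in range(1, 4)]
left = hash_partition(rows, "d", 4)
right = hash_partition(rows, "d", 4)

matches = partition_wise_join(left, right, "d")
print(len(matches))
```

If the join key differs from the partitioning key, this shortcut no longer finds all matches, which is why Spark has to add another exchange in that case.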
I'd suspect the join without the concatenation to be faster because it's likely cheaper to just hash the individual strings instead of concatenating and then hashing. The former involves fewer Java objects that need to be GC'd, but this isn't the full answer.
Keep in mind that this may not be the performance-limiting step of your query, in which case either way would be just as fast. When it comes to performance tuning, it's best to test rather than guess without data.
Also as mentioned above, leaving the columns unconcatenated gives the optimizer a chance to eliminate an exchange on the join if the input data is already partitioned correctly.