Optimize Join of two large pyspark dataframes - apache-spark

I have two large pyspark dataframes df1 and df2 containing GBs of data.
The columns in first dataframe are id1, col1.
The columns in second dataframe are id2, col2.
The dataframes have an equal number of rows.
Also, all values of id1 and id2 are unique.
Also, each value of id1 corresponds to exactly one value of id2.
For example, the first few entries of df1 and df2 are as follows:
df1:
id1 | col1
12 | john
23 | chris
35 | david
df2:
id2 | col2
23 | lewis
35 | boon
12 | cena
So I need to join the two dataframes on key id1 and id2.
df = df1.join(df2, df1.id1 == df2.id2)
I am afraid this may suffer from shuffling.
How can I optimize the join operation for this special case?

To avoid shuffling at join time, repartition the data on your id columns beforehand.
The repartition operation itself is a full shuffle, but it will optimize further joins if there is more than one join on the same key.
df1 = df1.repartition('id1')
df2 = df2.repartition('id2')
Another way to avoid a shuffle at join time is to leverage bucketing.
Save both dataframes with a bucketBy clause on the id column; when you later read them back, rows with the same id are colocated in the same partitions, so the join can skip the shuffle.
But to benefit from bucketing you need a Hive metastore, because that is where the bucketing metadata is stored.
It also adds the extra steps of writing the bucketed tables and then reading them back, as sketched below.
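A minimal sketch of the bucketing route, assuming a Hive-enabled SparkSession named spark; the table names and the bucket count of 200 are arbitrary choices:
(df1.write
    .bucketBy(200, "id1")
    .sortBy("id1")
    .mode("overwrite")
    .saveAsTable("df1_bucketed"))
(df2.write
    .bucketBy(200, "id2")
    .sortBy("id2")
    .mode("overwrite")
    .saveAsTable("df2_bucketed"))

b1 = spark.table("df1_bucketed")
b2 = spark.table("df2_bucketed")
# With the same bucket count on both join keys, this join can avoid the shuffle.
df = b1.join(b2, b1.id1 == b2.id2)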

Related

Should I drop columns that are not required from DataFrames before joining them in Spark?

Does it make sense to drop columns that are not required before joining Spark data frames?
For example:
DF1 has 10 columns, DF2 has 15 columns, DF3 has 25 columns.
I want to join them, select needed 10 columns and save it in .parquet.
Does it make sense to select only the needed columns from the DataFrames before the join, or will the Spark engine optimize the join by itself and avoid operating on all 50 columns during the join operation?
Yes, it makes perfect sense because it reduces the amount of data shuffled between executors. And it's better to select only the necessary columns as early as possible: in most cases, if the file format allows it (Parquet, Delta Lake), Spark will read data only for the necessary columns, not for all columns. For example:
df1 = spark.read.parquet("file1") \
.select("col1", "col2", "col3")
df2 = spark.read.parquet("file2") \
.select("col1", "col5", "col6")
joined = df1.join(df2, "col1")
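As a side note, you can verify the column pruning by inspecting the physical plan; for Parquet sources, the ReadSchema shown on the FileScan node should list only the selected columns:
# The FileScan parquet node's ReadSchema should contain only the needed columns.
joined.explain()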

Partitioning the data while reading from hive/hdfs based on column values in Spark

I have 2 Spark dataframes that I read from Hive using the sqlContext. Let's call these dataframes df1 and df2. The data in both dataframes is sorted on a column called PolicyNumber at the Hive level. PolicyNumber also happens to be the primary key for both dataframes. Below are sample values for both dataframes, although in reality both of my dataframes are huge and spread across 5 executors as 5 partitions. For simplicity's sake, I will assume that each partition has one record.
Sample df1
PolicyNumber FirstName
1 A
2 B
3 C
4 D
5 E
Sample df2
PolicyNumber PremiumAmount
1 450
2 890
3 345
4 563
5 2341
Now, I want to join df1 and df2 on PolicyNumber column. I can run the below piece of code and get my required output.
df1.join(df2, df1.PolicyNumber == df2.PolicyNumber)
Now, I want to avoid as much shuffle as possible to make this join efficient. So to avoid shuffle, while reading from hive, I want to partition df1 based on values of PolicyNumber Column in such a way that the row with PolicyNumber 1 will go to Executor 1, row with PolicyNumber 2 will go to Executor 2, row with PolicyNumber 3 will go to Executor 3 and so on. And I want to partition df2 in the exact same way I did for df1 as well.
This way, Executor 1 will now have the row from df1 with PolicyNumber=1 and also the row from df2 with PolicyNumber=1 as well.
Similarly, Executor 2 will have the row from df1 with PolicyNumber=2 and also the row from df2 with PolicyNumber=2, and so on.
This way, there will not be any shuffle required as now, the data is local to that executor.
My question is, is there a way to control the partitions in this granularity? And if yes, how do I do it.
Unfortunately there is no direct control over which data ends up on each executor. However, while reading the data into each dataframe, you can use CLUSTER BY on the join column, which helps distribute the sorted data to the right executors.
ex:
df1 = sqlContext.sql("select * from CLSUTER BY JOIN_COLUMN")
df2 = sqlContext.sql("SELECT * FROM TABLE2 CLSUTER BY JOIN_COLUMN")
hope it helps.
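For comparison, a sketch of the same idea through the DataFrame API (an assumption, using hypothetical table names table1 and table2): repartitioning both sides on PolicyNumber costs one shuffle up front, but the subsequent join can then reuse that partitioning instead of shuffling each side again.
# Not guaranteed to be shuffle-free end to end; the repartition itself shuffles.
df1 = sqlContext.table("table1").repartition("PolicyNumber")
df2 = sqlContext.table("table2").repartition("PolicyNumber")
joined = df1.join(df2, "PolicyNumber")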

Difference (if there is one) between spark.sql.shuffle.partitions and df.repartition

I'm having a bit of difficulty reconciling the difference (if one exists) between sqlContext.sql("set spark.sql.shuffle.partitions=n") and re-partitioning a Spark DataFrame utilizing df.repartition(n).
The Spark documentation indicates that set spark.sql.shuffle.partitions=n configures the number of partitions that are used when shuffling data, while df.repartition seems to return a new DataFrame partitioned by the key or number of partitions specified.
To make this question clearer, here is a toy example of how I believe df.repartition and spark.sql.shuffle.partitions work:
Let's say we have a DataFrame, like so:
ID | Val
--------
A | 1
A | 2
A | 5
A | 7
B | 9
B | 3
C | 2
Scenario 1: 3 shuffle partitions, repartition DF by ID:
If I were to set sqlContext.sql("set spark.sql.shuffle.partitions=3") and then did df.repartition($"ID"), I would expect my data to be repartitioned into 3 partitions, with one partition holding 3 vals of all the rows with ID "A", another holding 2 vals of all the rows with ID "B", and the final partition holding 1 val of all the rows with ID "C".
Scenario 2: 5 shuffle partitions, repartition DF by ID: In this scenario, I would still expect each partition to ONLY hold data tagged with the same ID. That is to say, there would be NO mixing of rows with different IDs within the same partition.
Is my understanding off base here? In general, my questions are:
I am trying to optimize the partitioning of a dataframe so as to avoid skew, while having each partition hold as much of the same key information as possible. How do I achieve that with set spark.sql.shuffle.partitions and df.repartition?
Is there a link between set spark.sql.shuffle.partitions and df.repartition? If so, what is that link?
Thanks!
I would expect my data to be repartitioned into 3 partitions, with one partition holding 3 vals of all the rows with ID "A", another holding 2 vals of all the rows with ID "B", and the final partition holding 1 val of all the rows with ID "C".
No
5 shuffle partitions, repartition DF by ID: In this scenario, I would still expect each partition to ONLY hold data tagged with the same ID. That is to say, there would be NO mixing of rows with different IDs within the same partition.
and no.
This is not how partitioning works. Partitioners map values to partitions, but the mapping is in general not unique: many distinct keys can map to the same partition (see How does HashPartitioner work? for a detailed explanation).
Is there a link between set spark.sql.shuffle.partitions and df.repartition? If so, what is that link?
Indeed there is. If you call df.repartition without providing a number of partitions, spark.sql.shuffle.partitions is used as the target number of partitions.
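A quick way to see this, as a sketch assuming a SparkSession named spark and the toy DataFrame df with the ID column above: spark_partition_id() reports the partition each row ended up in, and with hash partitioning several distinct IDs can land in the same partition while other partitions stay empty.
from pyspark.sql import functions as F

spark.conf.set("spark.sql.shuffle.partitions", "3")
repartitioned = df.repartition("ID")   # no explicit count, so 3 partitions are used
# pid can repeat across different IDs, and some of 0..2 may never appear,
# because the mapping is essentially hash(ID) % 3.
repartitioned.withColumn("pid", F.spark_partition_id()).show()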

Why do I get so many empty partitions when repartionning a Spark Dataframe?

I want to partition a dataframe "df1" on 3 columns. This dataframe has exactly 990 unique combinations of values for those 3 columns:
In [17]: df1.createOrReplaceTempView("df1_view")
In [18]: spark.sql("select count(*) from (select distinct(col1,col2,col3) from df1_view) as t").show()
+--------+
|count(1)|
+--------+
| 990|
+--------+
In order to optimize the processing of this dataframe, I want to partition df1 in order to get 990 partitions, one for each key possibility:
In [19]: df1.rdd.getNumPartitions()
Out[19]: 24
In [20]: df2 = df1.repartition(990, "col1", "col2", "col3")
In [21]: df2.rdd.getNumPartitions()
Out[21]: 990
I wrote a simple way to count rows in each partition:
In [22]: def f(iterator):
    ...:     a = 0
    ...:     for row in iterator:   # the iterator yields the rows of one partition
    ...:         a = a + 1
    ...:     print(a)               # prints one count per partition (on the executors)
    ...:
In [23]: df2.foreachPartition(f)
And I notice that what I actually get is 628 partitions with one or more key values, and 362 empty partitions.
I assumed Spark would repartition evenly (1 key value = 1 partition), but that does not seem to be the case, and I feel like this repartitioning is adding data skew even though it should be the other way around...
What's the algorithm Spark uses to partition a dataframe on columns ?
Is there a way to achieve what I thought was possible ?
I'm using Spark 2.2.0 on Cloudera.
To distribute data across partitions, Spark needs some way to convert a column value into a partition index. There are two default partitioners in Spark, HashPartitioner and RangePartitioner, and different transformations can apply different partitioners; for example, join applies the hash partitioner.
Basically, for the hash partitioner the formula to convert a value into a partition index is value.hashCode() % numOfPartitions. In your case multiple values map to the same partition index.
You could implement your own partitioner if you want a better distribution, for example along the lines of the sketch below.
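For illustration only, here is one possible sketch (an assumption, not the only way to do it) of giving each distinct key combination its own partition. It assumes the df1 with col1, col2, col3 from the question and a SparkSession named spark, and drops to the RDD API because the DataFrame API does not expose custom partitioners:
# Map each distinct (col1, col2, col3) combination to its own partition index,
# then partition a pair RDD with that mapping.
key_to_part = dict(
    df1.select("col1", "col2", "col3").distinct().rdd
       .zipWithIndex()                                 # (Row, unique index)
       .map(lambda kv: (tuple(kv[0]), kv[1]))
       .collect()                                      # 990 entries, small enough to collect
)
num_parts = len(key_to_part)

pair_rdd = df1.rdd.map(lambda r: ((r["col1"], r["col2"], r["col3"]), r))
partitioned = pair_rdd.partitionBy(num_parts, lambda k: key_to_part[k])
df_custom = spark.createDataFrame(partitioned.values(), schema=df1.schema)
df_custom.rdd.getNumPartitions()   # 990; each key combination's rows are colocated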

Efficient way to do a join between dataframe when joining field is unique

I have 2 dataframes in Spark. Both of them have an id which is unique.
The structure is the following
df1:
id_df1 values
abc abc_map_value
cde cde_map_value
fgh fgh_map_value
df2:
id_df2 array_id_df1
123 [abc, fgh]
456 [cde]
I want to get the following dataframe result:
result_df:
id_df2 array_values
123 [map(abc,abc_map_value), map(fgh,fgh_map_value)]
456 [map(cde,cde_map_value)]
I can use Spark SQL to do so, but I don't think it's the most efficient way since the ids are unique.
Is there a way to store a key/value dictionary in memory and look up the value by key rather than doing a join? Would that be more efficient than a join?
If you explode df2 into key/value pairs, the join becomes easy and only a groupBy is needed.
You could experiment with other aggregations and reductions for more efficiency / parallelism.
import org.apache.spark.sql.functions.{collect_list, explode, struct}
import spark.implicits._   // for the 'symbol column syntax; assumes a SparkSession named `spark`

val result_df = df2
  .select('id_df2, explode('array_id_df1).alias("id_df1"))
  .join(df1, "id_df1")   // single-column equi-join on the exploded id
  .groupBy('id_df2)
  .agg(collect_list(struct('id_df1, 'values)).alias("array_values"))
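If you prefer PySpark, a roughly equivalent sketch, assuming df1 and df2 laid out as in the question:
from pyspark.sql import functions as F

result_df = (
    df2
    .select("id_df2", F.explode("array_id_df1").alias("id_df1"))
    .join(df1, "id_df1")
    .groupBy("id_df2")
    .agg(F.collect_list(F.struct("id_df1", "values")).alias("array_values"))
)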
