Spark 3 Join Optimization in Sort Merge Join - apache-spark

I have two big tables, each around 500 GB. There is no skew, as the keys are evenly distributed. A sort-merge join is happening, and I am not able to work out what optimization can be done here. I am attaching the stats summary.
One issue I see: the median task processes records in KBs, whereas the max task processes MBs.
Code is :
select a.*, b.id from fact a inner join dim b on a.x_id = b.id;
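(Not part of the original question: a sketch of a common first diagnostic step on Spark 3.x - confirm the plan with explain() and make sure Adaptive Query Execution is on, since it can coalesce the small shuffle partitions behind the KB-vs-MB task imbalance. Table names follow the query above; a SparkSession named spark is assumed.)
spark.conf.set("spark.sql.adaptive.enabled", "true")  # AQE; default true since Spark 3.2
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
joined = spark.sql("""
    select a.*, b.id
    from fact a
    inner join dim b on a.x_id = b.id
""")
joined.explain(mode="formatted")  # confirm SortMergeJoin and the exchanges around it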

Related

Why does spark repartition take so much time on skewed data?

Below is an analogous schema
Table 1 - Players (up to 10M players per team, HUGE)
player_number - INT
team_id - bigint (indexed)
Table 2 - Team (Relatively small table which I want to broadcast)
team_id - bigInt (indexed)
team_size - INT
Use case - I want to join the Players and Team tables on team_id. The Players table can have a large skew on team_id: for a given team_id, the number of players can range from 10 to 10M.
Approach 1 - I repartition players and team on team_id and do a sort-merge join. Repartitioning players takes close to 2.5 minutes due to the large skew, and joining takes close to 3 seconds, so the overall time is close to 2.6 minutes.
Approach 2 - The teams and players tables are uniformly distributed, and I do a join without repartitioning on the same key. This takes around 20 seconds, so the overall time is 20 seconds.
Question - Why does the non-repartitioned join take less time than the repartitioned one? In a sort-merge join, won't Spark do an internal repartitioning on team_id, and how is that different from Approach 1? The skew is the same in both cases.
We would need to look at the physical plan by using .explain() on the joined DataFrame to be completely sure, but here is my assumption:
Spark detects that Table 2 is small enough to be broadcast to each executor. Each worker can then join the tables on its own, avoiding the shuffle and reducing the overall execution time.
In Approach 1 you are forcing Spark to do a repartition, which is not necessary.
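To illustrate the point, the broadcast can also be requested explicitly so the plan shows a BroadcastHashJoin with no exchange on the players side (a PySpark sketch; the DataFrame names are placeholders, not from the question):
from pyspark.sql.functions import broadcast

# Broadcast the small Team table so each executor joins locally and the
# large, skewed Players table is never shuffled.
joined = players_df.join(broadcast(team_df), on="team_id", how="inner")
joined.explain()  # expect BroadcastHashJoin, no Exchange on the players side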

How to build a custom spark partitioner to avoid exchange / shuffle steps

Version: DBR 8.4 | Spark 3.1.2
While reading solutions to How to avoid shuffles while joining DataFrames on unique keys?, I've found a few mentions of the need to create a "custom partitioner", but I can't find any information on that.
I've noticed that in the ~4 hour job I'm currently trying to optimize, most of the time goes to exchanging terabytes of data from a temporary cross-join-and-reduce operation.
Here is a visualization of the current operation:
I'm hoping that if I can set up the cross-join operation with a "custom partitioner" I can force the ~29 billion rows from the cross join operation (which shares the same 2-column primary key with the left joined ~0.6 billion row table) to stay on the workers they were generated on until the whole dataset can be reduced to a mere 1 million rows. i.e. I'm hoping to avoid any shuffles during this time.
The steps in the operation are:
Generate 28 billion rows temporary "TableA" partitioned by 'columnA', keyed by ['columnA', 'columnB']
Left join 1 billion rows "TableB" also partitioned by 'columnA', keyed by ['columnA', 'columnB'] (Kind of a sparse version of temp Table A)
Project a new column (TableC.columnC = TableA.columnC - Coalesce(TableB.columnC, 0) in this specific case)
Project a new row_order() column within each partition, e.g. F.row_number().over(Window.partitionBy(['columnA', 'columnB']).orderBy(F.col('columnC').desc()))
Take the top N (say 2) - so keep only the rows with rank (row_number) < 3 (for example), throwing away the other 49,998 rows per partition.
Since all of these operations are independently performed within each partition ['columnA', 'columnB'] (no interactions between partitions), I was hoping there was some way that I can get through all 5 of those steps without ever reshuffling partitions between workers.
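For reference, a minimal PySpark reconstruction of steps 3-5 under the question's placeholder names (an illustrative sketch, not the asker's actual code):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Derive columnC, rank rows within each (columnA, columnB) group, keep the top N.
a = table_a.select("columnA", "columnB", F.col("columnC").alias("a_c"))
b = table_b.select("columnA", "columnB", F.col("columnC").alias("b_c"))
table_c = (
    a.join(b, on=["columnA", "columnB"], how="left")
     .withColumn("columnC", F.col("a_c") - F.coalesce(F.col("b_c"), F.lit(0)))
)
w = Window.partitionBy("columnA", "columnB").orderBy(F.col("columnC").desc())
top_n = (
    table_c.withColumn("rn", F.row_number().over(w))
           .filter(F.col("rn") < 3)  # keep the top 2 rows per key
)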
What I've tried:
I've tried not specifying any repartitioning instructions at all; this leads to the ~3.5 hour runtime and the DAG below.
I've tried explicitly specifying .repartition(9600, 'columnA') on each data source on both sides of the join (excluding the broadcast join case), right before joining. (Note that 9600 is configured as the default number of shuffle partitions.) This code change resulted in no changes to the query plan - there is still an exchange step both before and after the sort-merge join.
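One workaround commonly suggested in this situation (a sketch, not from the original post; whether it pays off depends on how often the bucketed output is reused) is to persist both sides as tables bucketed and sorted on the join keys, so a later sort-merge join can reuse that layout instead of exchanging again:
# Write both sides bucketed on the join keys; a join between tables bucketed
# the same way can skip the Exchange before the SortMergeJoin.
(table_a.write
    .bucketBy(9600, "columnA", "columnB")
    .sortBy("columnA", "columnB")
    .mode("overwrite")
    .saveAsTable("table_a_bucketed"))
(table_b.write
    .bucketBy(9600, "columnA", "columnB")
    .sortBy("columnA", "columnB")
    .mode("overwrite")
    .saveAsTable("table_b_bucketed"))
joined = spark.table("table_a_bucketed").join(
    spark.table("table_b_bucketed"), ["columnA", "columnB"], "left")
joined.explain()  # check whether the Exchange before the join is gone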

PySpark - Issue with CPU heavy cartesian join when using multiple join columns

Background / scenario:
I have two tables: One 1-2 million entry table with transactions of the form
TRX-ID , PROCESS-ID , ACTOR-ID
Additionally a participant-lookup (one of multiple categories of users of the system) table of the form
USER-ID , PARTICIPANT-ID
The transaction table is for historical reasons a bit messy. The PROCESS-ID can be a participant-id and the ACTOR-ID the user-id of a different kind of user. In some situations the PROCESS-ID is something else and the ACTOR-ID is the user-id of the participant.
I need to join the transaction and the participant-lookup table in order to get the participant-id for all transactions. I tried this in two ways.
(I left out some code steps in the snippets and focused on the join parts. Assume that the df variables are DataFrames and that I select the right columns to support e.g. the unions.)
First approach:
transactions_df.join(pt_lookup_df, (transactions_df['actor-id'] == pt_lookup_df['user-id']) | (transactions_df['process-id'] == pt_lookup_df['participant-id']))
The code with this join is extremely slow. It ends up with my job running for 45 minutes on a 10-instance AWS Glue cluster with nearly 99% load on all executors.
Second approach:
I realised that some of the transactions already have the participant-id and I do not need to join for them. So I changed to:
transactions_df_1.join(pt_lookup_df, (transactions_df_1['actor-id'] == pt_lookup_df['user-id']))
transactions_df_2 = transactions_df_2.withColumnRenamed('process-id', 'participant-id')
transactions_df_1.union(transactions_df_2)
This finished in 5 minutes!
Both approaches give the correct results.
Question
I do not understand why one approach is so slow and the other is not. The amount of data excluded in the second approach is minimal, so transactions_df_2 contains only a very small subset of the total data.
Looking at the plans, the effect is mainly the Cartesian product that is done in approach 1 but not in approach 2, so I assume this is the performance bottleneck. I still do not understand how this can lead to a 40-minute difference in compute time.
Can someone give explanations?
Would a Cartesian product in the DAG be in general a warning sign in Spark?
Summary
It seems that a join with multiple columns in the condition triggers an extremely slow Cartesian product operation. Should I have done a broadcast operation on the smaller data set to avoid this?
DAG approach 1
DAG approach 2
This is because a Cartesian product join and a regular join do not involve the same data shuffling process. Even though the amount of data is similar, the amount of shuffling is different.
This article explains where the extra shuffling comes from.
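A pattern that often avoids the Cartesian product altogether (a sketch, not taken from the linked article) is to split the OR condition into two equi-joins and union the results, de-duplicating in case a transaction matches both branches:
# Each equi-join can use a hash or broadcast join instead of a Cartesian product.
by_actor = transactions_df.join(
    pt_lookup_df, transactions_df['actor-id'] == pt_lookup_df['user-id'])
by_process = transactions_df.join(
    pt_lookup_df, transactions_df['process-id'] == pt_lookup_df['participant-id'])
matched = by_actor.unionByName(by_process).dropDuplicates()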

ORDER BY vs SORT BY in Spark SQL

I use Spark 2.4 and use the %sql mode to query tables.
If I am using a window function on a large dataset, which of ORDER BY vs SORT BY will be more efficient from a query performance standpoint?
I understand that ORDER BY ensures global ordering, but the computation gets pushed to only 1 reducer. However, SORT BY sorts within each partition, and the partitions may receive overlapping ranges.
I want to understand whether SORT BY could also be used in this case, and which one will be more efficient when processing a large dataset (say 100M rows).
For e.g.
ROW_NUMBER() OVER (PARTITION BY prsn_id ORDER BY purch_dt desc) AS RN
VS
ROW_NUMBER() OVER (PARTITION BY prsn_id SORT BY purch_dt desc) AS RN
Can anyone please help? Thanks.
It does not matter whether you use SORT BY or ORDER BY. There is a notion about Hive that you are likely referring to, but you are using Spark, which has no such issue.
As for PARTITION BY: the single-reducer aspect is only an issue if you have nothing to partition by. You do have prsn_id, so it is not an issue.
SORT BY is applied within each bucket and does not guarantee that the entire dataset is sorted.
ORDER BY, on the other hand, is applied to the entire dataset (in a single reducer).
Since your query is partitioned and sorted/ordered per partition key, both usages return the same output.
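One way to check this yourself (a PySpark sketch; the purchases table name is an assumption) is to compare the physical plans of the two forms, which are expected to be identical:
q_order = "SELECT *, ROW_NUMBER() OVER (PARTITION BY prsn_id ORDER BY purch_dt DESC) AS rn FROM purchases"
q_sort = "SELECT *, ROW_NUMBER() OVER (PARTITION BY prsn_id SORT BY purch_dt DESC) AS rn FROM purchases"
spark.sql(q_order).explain()  # compare the two plans
spark.sql(q_sort).explain()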

Is there a data architecture for efficient joins in Spark (a la RedShift)?

I have data that I would like to do a lot of analytic queries on and I'm trying to figure out if there is a mechanism I can use to store it so that Spark can efficiently do joins on it. I have a solution using RedShift, but would ideally prefer to have something that is based on files in S3 instead of having a whole RedShift cluster up 24/7.
Introduction to the data
This is a simplified example. We have 2 initial CSV files.
Person records
Event records
The two tables are linked via the person_id field. person_id is unique in the Person table. Events have a many-to-one relationship with person.
The goal
I'd like to understand how to set up the data so I can efficiently perform the following query. I will need to perform many queries like this (all queries are evaluated on a per person basis):
The query is to produce a data frame with 1 row for every person and the following columns:
person_id - person_id for each person in the data set
age - "age" field from the person record
cost - The sum of the "cost" field for all event records for that person where "date" is during the month of 6/2013
All current solutions I have with Spark to this problem involve reshuffling all the data, which ends up making the process slow for large amounts (hundreds of millions of people). I am happy with a solution that requires me to reshuffle the data and write it to a different format once if that can then speed up later queries.
The solution using RedShift
I can accomplish this solution using RedShift in a fairly straightforward way:
Both files are loaded in as RedShift tables, with DISTKEY person_id and SORTKEY person_id. This distributes the data so that all the data for a person is on a single node. The following query will produce the desired data frame:
select person_id, age, e.cost from person
left join (select person_id, sum(cost) as cost from events
where date between '2013-06-01' and '2013-06-30'
group by person_id) as e using (person_id)
The solution using Spark/Parquet
I have thought of several potential ways to handle this in Spark, but none accomplishes what I need. My ideas and the issues are listed below:
Spark Dataset write 'bucketBy' - Read the CSV files and then rewrite them out as parquet files using "bucketBy". Queries on these parquet files could then be very fast. This would produce a data setup similar to RedShift, but parquet files don't support bucketBy.
Spark parquet partitioning - Parquet does support partitioning. Because parquet creates a separate set of files for each partition key, you have to create a computed column to partition on and use a hash of person_id to create the partitionKey. However, when you later join these tables in spark based on "partition_key" and "person_id", the query plan still does a full hash partition. So this approach is no better than just reading the CSVs and shuffling every time.
Stored in some other data format besides parquet - I am open to this, but don't know of another data source that will work.
Using a compound record format - Parquet supports hierarchical data formats, so I can pre-join both tables into a hierarchical record (where a person record has an "events" field which is an array of struct elements) and then do processing on that. When you have a hierarchical record, there are two approaches to processing it:
Use explode to create separate records - With this approach you explode array fields into full rows, then use standard data frame operations to do analytics, and then join them back to the main table. Unfortunately, I've been unable to get this approach to compile queries efficiently.
Use UDFs to perform operations on subrecords - This preserves the structure and executes without shuffles, but is an awkward and verbose way to program. Also, it requires lots of UDFs, which aren't great for performance (although they beat large-scale shuffling of data).
For my use cases, Spark has advantages over RedShift which aren't obvious in this simple example, so I'd prefer to do this with Spark. Please let me know if I am missing something and there is a good approach to this.
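For what it's worth, here is a sketch of the first idea above: bucketBy is available when writing through the table catalog with saveAsTable (rather than to a bare parquet path), which gives a layout roughly analogous to the RedShift DISTKEY/SORTKEY setup. DataFrame and table names are placeholders:
# Write both datasets as tables bucketed and sorted by person_id so later
# aggregate-and-join queries on person_id can avoid re-shuffling.
(person_df.write
    .bucketBy(512, "person_id")
    .sortBy("person_id")
    .mode("overwrite")
    .saveAsTable("person_bucketed"))
(events_df.write
    .bucketBy(512, "person_id")
    .sortBy("person_id")
    .mode("overwrite")
    .saveAsTable("events_bucketed"))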
Edited per comment.
Assumptions:
Using parquet
Here's what I would try:
val eventAgg = spark.sql("""select person_id, sum(cost) as cost
from events
where date between '2013-06-01' and '2013-06-30'
group by person_id""")
eventAgg.cache.count
val personDF = spark.sql("""SELECT person_id, age from person""")
personDF.cache.count // cache is less important here, so feel free to omit
eventAgg.join(personDF, Seq("person_id"), "left")
I just did this with some of my data and here's how it went (9 node / 140 vCPU cluster, ~600GB RAM):
27,000,000,000 "events" (aggregated to 14,331,487 "people")
64,000,000 "people" (~20 columns)
aggregated events building and caching took ~3 min
people caching took ~30 seconds (pulling from network, not parquet)
left joining took several seconds
Not caching the "people" led to the join taking a few seconds longer. Then forcing Spark to broadcast the couple hundred MB of aggregated events made the join take under 1 second.
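The broadcast mentioned in the last step can be requested explicitly. A PySpark sketch (assuming personDF and eventAgg are the equivalent PySpark DataFrames, and orienting the join to keep one row per person):
from pyspark.sql.functions import broadcast

# Broadcast the small aggregated-events side so the final join needs no shuffle.
result = personDF.join(broadcast(eventAgg), "person_id", "left")
result.explain()  # expect BroadcastHashJoin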
