Joining Dataframe performance in Spark - apache-spark

I am joining two dataframes that read CSV files from S3, using df.join. The join takes 9 minutes to complete with the default spark.sql.shuffle.partitions (200).
When I change spark.sql.shuffle.partitions to 10, it still takes almost the same time.
Is there any way to improve the performance of this job?
Also, how do I dynamically decide the value of spark.sql.shuffle.partitions in a production scenario?

One of the most effective ways to speed up Spark joins is to minimize the amount of data in each dataframe; for example, apply as many filters as possible to the dataframes before joining them. Another way is to broadcast the smaller dataframe (keep in mind that a broadcast dataframe must be orders of magnitude smaller than the other one); a sketch of both ideas follows the links below. For more details, you can use the following tips on Spark join optimization:
databricks presentation on optimizing apache-spark SQL joins
Performance Tuning of apache-spark
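
A minimal PySpark sketch of both suggestions (filter before joining, broadcast the smaller side); the S3 paths and column names (user_id, event_date, country, amount) are hypothetical:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("join-tuning").getOrCreate()

    large_df = spark.read.csv("s3://bucket/large/*.csv", header=True)
    small_df = spark.read.csv("s3://bucket/small/*.csv", header=True)

    # 1) Trim both sides before the join: filter rows and keep only the needed columns.
    large_df = (large_df
                .where(F.col("event_date") >= "2021-01-01")
                .select("user_id", "event_date", "amount"))
    small_df = small_df.select("user_id", "country")

    # 2) Broadcast the much smaller side so the large side is not shuffled at all.
    joined = large_df.join(F.broadcast(small_df), on="user_id", how="inner")
    joined.write.mode("overwrite").parquet("s3://bucket/output/")

With a broadcast join there is no shuffle of the large dataframe, so spark.sql.shuffle.partitions stops mattering for that join; when a shuffle is unavoidable, a common rule of thumb is to size the setting so each shuffle partition handles roughly 100-200 MB of data.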

Related

Spark SQL output multiple small files

We have multiple joins involving a large table (about 500 GB in size). The output of the joins is stored as many small files, each 800 KB-1.5 MB in size. Because of this the job is split into a huge number of tasks and takes a long time to complete.
We have tried Spark tuning configurations like broadcast joins, changing the partition size, changing the max records per file, etc., but there is no performance improvement with these methods and the issue is not fixed. Using coalesce makes the job get stuck at that stage with no progress.
Please see this link for a Spark UI metrics screenshot: https://i.stack.imgur.com/FfyYy.png
The Spark UI confirms your report of too many small files. You get a file for every Spark partition, and you have 33,479 partitions in the final stage where you're writing the output. 33k partitions was probably the right number for your join, but not the right number for your write.
You need to add another stage to your job that comes after your join. That second stage needs to reduce the number of Spark partitions to a reasonable number (one that produces 32 MB - ~128 MB files).
Something like a coalesce, or repartition. Maybe even a sort :(
You want to target ~350 partitions; see the sketch below.
This diagram shows what you want to do, either manually or automatically (with Spark on Databricks).
If you're using Databricks then it's easy: with Delta Lake you can turn on Auto Optimize.
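
A rough sketch of that extra stage, assuming a Parquet sink; the dataframe names, output path, and the ~350 figure are placeholders:

    # The join itself keeps its many partitions; only the write stage is reduced.
    result = big_df.join(other_df, on="id")

    # Coalesce down to ~350 partitions so each output file lands in the 32 MB-128 MB range.
    result.coalesce(350).write.mode("overwrite").parquet("s3://bucket/joined/")

    # If coalesce stalls (as described in the question), repartition forces a shuffle
    # but spreads the write evenly:
    # result.repartition(350).write.mode("overwrite").parquet("s3://bucket/joined/")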

Suggestion for multiple joins in spark

Recently I got a requirement to perform a combination of joins: around 30 to 36 joins in Spark.
Building the execution plan was consuming a lot of time, so I truncated the plan at intermediate stages using df.localCheckpoint().
Is this a good way to do it? Any thoughts, please share.
Yes, it is fine.
This is mostly discussed for iterative ML algorithms, but it can equally be applied to a Spark app with many steps - e.g. joins.
Quoting from https://medium.com/@adrianchang/apache-spark-checkpointing-ebd2ec065371:
Spark programs take a huge performance hit when fault tolerance occurs, as the entire set of transformations to a DataFrame or RDD has to be recomputed when fault tolerance occurs, or for each additional transformation that is applied on top of an RDD or DataFrame.
localCheckpoint() is not "reliable".
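
A minimal sketch of how localCheckpoint() is typically used in a long chain of joins; dim_tables, the join key, and the checkpoint interval are assumptions for illustration:

    # Truncate the lineage every few joins so the planner doesn't have to re-analyze
    # an ever-growing plan. localCheckpoint() materializes the data on the executors'
    # local storage; it is fast but, as noted above, not fault tolerant.
    df = fact_df
    for i, dim in enumerate(dim_tables):      # ~30-36 dimension dataframes
        df = df.join(dim, on="key", how="left")
        if (i + 1) % 10 == 0:
            df = df.localCheckpoint()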
Caching is definitely a strategy to optimize your performance. In general, given that your data size and the resources of your Spark application stay unchanged, there are three points to consider when you want to optimize a join:
Data skewness: most of the time, when I try to find out why a join is taking so long, data skew turns out to be one of the reasons. In fact, not only joins: any transformation needs an even data distribution, otherwise you end up with one skewed partition holding most of the data while everything else waits on that single task. Make sure your data is well distributed (a quick check is sketched after this list).
Data broadcasting: when we do a join, data shuffling is inevitable. In some cases we use a relatively small dataframe as a reference to filter a very big dataframe; shuffling the big dataframe is then a very expensive operation. Instead, we can broadcast the small dataframe to every node and avoid the costly shuffle.
Keep your joining data as lean as possible: as mentioned in point 2, data shuffling is inevitable when you join. Therefore, keep your dataframes as lean as possible, which means dropping unnecessary rows/columns to reduce the amount of data that has to move across the network during the shuffle.
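
A quick sketch of how to check point 1 (skew on the join key) and apply point 3 (lean dataframes); the dataframe and column names are hypothetical:

    from pyspark.sql import functions as F

    # Point 1: inspect the key distribution before joining. A handful of keys
    # dominating the counts is a sign of skew.
    (big_df.groupBy("customer_id")
           .count()
           .orderBy(F.desc("count"))
           .show(20))

    # Point 3: project only the columns the join and the downstream logic need,
    # so less data moves across the network during the shuffle.
    lean_big   = big_df.select("customer_id", "amount")
    lean_small = small_df.select("customer_id", "segment")
    joined = lean_big.join(lean_small, on="customer_id")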

Improve Spark denormalization/partition performance

I have a denormalization use case - one Hive Avro fact table to join with 14 smaller dimension tables and produce a denormalized Parquet output table. Both the input fact table and the output table are partitioned the same way (Category=TEST1, YearMonthId=202101). I also run historical processing, which means processing and loading several months for a given category at once.
I am using Spark 2.4.0/pyspark dataframes, broadcast joins for all the table joins, dynamic partition inserts, and coalesce at the end to control the number of output files (I see a shuffle at the last stage, probably because of the dynamic partition inserts).
I would like to know the optimizations possible w.r.t. managing partitions - say, maintaining the partitioning consistently from the input to the output stage so that no shuffle is involved. I want to leverage the fact that the input and output storage tables are partitioned by the same columns.
I am also thinking about this - use static partition writes by determining the partitions and writing to them in parallel - would this help speed things up or avoid the shuffle?
Appreciate any help that would lead me in the right direction.
A couple of options below that I tried that improved the performance (both runtime and avoiding small files).
Tried using repartition (instead of coalesce) on the dataframe before doing the broadcast join, which minimized the shuffle and hence the shuffle spill (see the sketch below):
-- repartition(count, *PartitionColumnList, AnyOtherSaltingColumn) (add a salting column if the repartition is not even)
Make sure that the base tables are properly compacted. This might even eliminate the need for #1 in some cases, and reduces the number of tasks, resulting in less overhead due to task scheduling.
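
A rough sketch of option 1 using the partitioning scheme from the question (Category, YearMonthId); the partition count, salt range, and dataframe names are illustrative:

    from pyspark.sql import functions as F

    # Optional salting column, for the case where one partition value dominates.
    salted = fact_df.withColumn("salt", (F.rand() * 8).cast("int"))

    # Repartition on the output partition columns (plus the salt) before the joins,
    # so the final dynamic-partition write doesn't reshuffle unevenly.
    repartitioned = salted.repartition(200, "Category", "YearMonthId", "salt")

    denorm = repartitioned.join(F.broadcast(dim_df), on="dim_id")

    (denorm.drop("salt")
           .write
           .mode("overwrite")
           .partitionBy("Category", "YearMonthId")
           .parquet("s3://bucket/denormalized/"))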

Iterative Broadcast Join in Spark SQL

I recently came across this talk about dealing with skew in Spark SQL by using "iterative" broadcast joins to improve query performance when joining a large table with another not-so-small table. The talk advises tackling such scenarios using "iterative broadcast joins", but unfortunately it doesn't probe deep enough for me to understand the implementation.
Hence, I was hoping someone could shed some light on how to implement this iterative broadcast join in Spark SQL with a few examples. How do I implement the same using Spark SQL queries with the SQL API?
Note: I am using Spark 2.4.
Any help is appreciated. Thanks
Iterative Broadcast Join: when the smaller table is still too large to broadcast in one go, it might be worth considering the approach of iteratively taking slices of your smaller (but not that small) table, broadcasting those, joining each slice with the larger table, and then unioning the results. A rough sketch follows.
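
A rough PySpark sketch of that idea, since the talk only describes it at a high level; the chunk count, join key, and table names are placeholders, and this form only applies cleanly to inner joins:

    from pyspark.sql import functions as F

    # Split the medium-sized table into N disjoint slices, broadcast-join each slice
    # with the large table, then union the partial results.
    N = 4
    medium = medium_df.withColumn("chunk", F.monotonically_increasing_id() % N)

    result = None
    for i in range(N):
        slice_df = medium.filter(F.col("chunk") == i).drop("chunk")
        part = large_df.join(F.broadcast(slice_df), on="key", how="inner")
        result = part if result is None else result.unionByName(part)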
To solve this there is a concept called:
i) Salting technique: we add a random number to the key so the data is evenly distributed across the cluster. Let's see this through an example (a code sketch follows this explanation).
Suppose we are performing a join on a large and a small table, the data is divided across three executors for keys X, Y, and Z, and the results are later unioned. Since we have data skew, all of the X rows land on one executor, the Y rows on another, and the Z rows on another.
Since the Y and Z data is relatively small, those tasks complete quickly and then wait for the X executor to finish, which takes a long time.
So to improve performance we should get the X executor's data evenly distributed across all executors.
Since the data is stuck on one executor, we add a random number to the keys (on both the large and the small table) and run the process again.
Adding a random number: key = explode(key, range(1,3)), which gives key_1, key_2, key_3.
Now the data is evenly distributed across executors, hence faster performance.
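
A minimal sketch of the salting idea in PySpark, assuming a skewed join column named key and three salt buckets; all names are illustrative:

    from pyspark.sql import functions as F

    SALT_BUCKETS = 3

    # Large (skewed) side: append a random salt 0-2 to the key.
    large_salted = large_df.withColumn(
        "salted_key",
        F.concat_ws("_", F.col("key"), (F.rand() * SALT_BUCKETS).cast("int").cast("string")))

    # Small side: duplicate every row once per salt value (key_0, key_1, key_2),
    # so each salted key on the large side still finds its match.
    small_salted = (small_df
        .withColumn("salt", F.explode(F.array([F.lit(i) for i in range(SALT_BUCKETS)])))
        .withColumn("salted_key", F.concat_ws("_", F.col("key"), F.col("salt").cast("string"))))

    joined = large_salted.join(small_salted, on="salted_key", how="inner")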
If you need more help, please see this video:
https://www.youtube.com/watch?v=d41_X78ojCg&ab_channel=TechIsland
and this link:
https://dzone.com/articles/improving-the-performance-of-your-spark-job-on-ske#:~:text=Iterative%20(Chunked)%20Broadcast%20Join,table%2C%20then%20unioning%20the%20result.

Spark Broadcasting Alternatives

Our application uses a long-running Spark context (just like the Spark REPL) to enable users to perform tasks online. We use Spark broadcasts heavily to process dimensional data. As is common practice, we broadcast the dimension tables and use the dataframe APIs to join the fact table with the dimension tables. One of the dimension tables is quite big: about 100k records and 15 MB in memory (Kryo serialized is just a few MBs smaller).
We see that every Spark job on the denormalized dataframe causes all the dimensions to be broadcast over and over again. The bigger table takes ~7 secs every time it is broadcast. We are trying to find a way to have the dimension tables broadcast only once per context lifespan. We tried both sqlContext and sparkContext broadcasting.
Are there any other alternatives to Spark broadcasting? Or is there a way to reduce the in-memory footprint of the dataframe (compression/serialization etc. - post-Kryo it is still 15 MB :( )?
Possible alternative:
We use the Ignite Spark integration to load a large amount of data at the start of the job and keep mutating it as needed.
In embedded mode you can start Ignite when the Spark context boots and kill it at the end.
You can read more about it here:
https://ignite.apache.org/features/igniterdd.html
Finally we were able to find a stopgap solution until Spark supports pinning of RDDs (or preferably DataFrames) in a later version. This is apparently not addressed even in v2.1.0.
The solution relies on RDD mapPartitions; below is a brief summary of the approach (a code sketch follows):
Collect the dimension table records as a map of key-value pairs and broadcast it using the Spark context. You can possibly use RDD.keyBy.
Map the fact rows using the RDD mapPartitions method. For each fact row, mapPartitions:
collects the dimension ids in the fact row and looks up the dimension records,
yields a new fact row by denormalizing the dimension ids in the fact table.
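
A hedged sketch of that approach; the dataframes fact_df and dim_df, the key dim_id, and the single denormalized column are assumptions, not the poster's exact code:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Collect the (small enough) dimension table once and broadcast it as a dict.
    dim_map = dict(dim_df.rdd.keyBy(lambda r: r["dim_id"]).collect())
    bc_dims = spark.sparkContext.broadcast(dim_map)

    def denormalize(rows):
        dims = bc_dims.value                  # one lookup table per executor
        for row in rows:
            fact = row.asDict()
            dim = dims.get(fact["dim_id"])
            # yield a new, denormalized fact row (hypothetical dimension column)
            fact["dim_name"] = dim["name"] if dim is not None else None
            yield fact

    denorm_rdd = fact_df.rdd.mapPartitions(denormalize)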
