I recently came across a talk about dealing with skew in Spark SQL when joining a large table with another, not-so-small table. The talk advises tackling such scenarios with "Iterative Broadcast Joins" to improve query performance, but unfortunately it doesn't probe deep enough for me to understand the implementation.
Hence, I was hoping someone could shed some light on how to implement this Iterative Broadcast Join in Spark SQL, with a few examples. How do I implement the same thing using Spark SQL queries with the SQL API?
Note: I am using Spark 2.4.
Any help is appreciated. Thanks
Iterative Broadcast Join: when the smaller (but not that small) table is still too large to broadcast in one go, it can be worth iteratively taking slices of it, broadcasting each slice, joining it with the larger table, and then unioning the results.
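For concreteness, here is a minimal PySpark sketch of that idea. The DataFrame and column names (large_df, medium_df, key) are hypothetical and the number of slices is something you would tune; this illustrates the pattern rather than the talk's exact implementation:

from functools import reduce
from pyspark.sql import functions as F

num_slices = 4  # tune so each slice comfortably fits under the broadcast limit

# Tag every row of the medium table with a deterministic slice id (0..num_slices-1).
medium_tagged = medium_df.withColumn(
    "slice_id", F.abs(F.hash("key")) % num_slices)

# Broadcast-join one slice at a time against the large table, then union the pieces.
partials = []
for i in range(num_slices):
    slice_df = medium_tagged.filter(F.col("slice_id") == i).drop("slice_id")
    partials.append(large_df.join(F.broadcast(slice_df), on="key", how="inner"))

result = reduce(lambda a, b: a.unionByName(b), partials)

With the pure SQL API you can express the same pattern by registering each slice as a temp view, running one query per slice with a /*+ BROADCAST(slice) */ hint, and combining the results with UNION ALL.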
To tackle the skew itself there is also a concept called:
i) Salting: we add a random number to the join key so that the data is distributed evenly across the cluster. Let's see this through an example.
Suppose we are performing a join between a large table and a small table on a key whose values are X, Y and Z. The rows are shuffled to three executors and later unioned, and since the data is skewed, all X rows end up on one executor, all Y rows on another, and all Z rows on the third.
Since the Y and Z data is relatively small, those tasks finish quickly and then sit idle waiting for the X executor, which takes much longer to complete.
So to improve performance we need the X data spread evenly across all executors.
Since the data is stuck on one executor, we add a random number to the keys (on both the large and the small table) and run the join on the salted keys.
Adding a random number: the small table's key is exploded over the salt values, giving key_1, key_2, key_3, while each row of the large table gets one random suffix from the same range, so every salted key still finds its match.
Now the keys are evenly distributed across the executors, which gives much better performance.
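To make the salting idea concrete, here is a small PySpark sketch. The names (skewed_df, small_df, key) and the salt range N = 3 are made up for illustration:

from pyspark.sql import functions as F

N = 3  # number of salt buckets; more buckets spread a hot key over more tasks

# Large (skewed) table: append one random salt value 0..N-1 to each row's key.
salted_large = skewed_df.withColumn(
    "salted_key", F.concat_ws("_", F.col("key"), (F.rand() * N).cast("int")))

# Small table: replicate each row once per salt value, so every salted key on
# the large side still has a matching row on the small side.
salted_small = (small_df
    .withColumn("salt", F.explode(F.array([F.lit(i) for i in range(N)])))
    .withColumn("salted_key", F.concat_ws("_", F.col("key"), F.col("salt")))
    .drop("salt"))

joined = salted_large.join(salted_small, on="salted_key", how="inner")

The trade-off is that the small table grows by a factor of N, so N should only be as large as needed to break up the hot keys.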
If you need more help, please see this video:
https://www.youtube.com/watch?v=d41_X78ojCg&ab_channel=TechIsland
and this link:
https://dzone.com/articles/improving-the-performance-of-your-spark-job-on-ske#:~:text=Iterative%20(Chunked)%20Broadcast%20Join,table%2C%20then%20unioning%20the%20result.
Related
Recently I got a requirement to perform a combination of joins.
I have to perform around 30 to 36 joins in Spark.
Building the execution plan was taking more and more time, so I checkpointed intermediate results using df.localCheckpoint() to cache the data and truncate the plan.
Is this a good way to do it? Any thoughts, please share.
Yes, it is fine.
This is mostly discussed for iterative ML algorithms, but can be equally applied for a Spark App with many steps - e.g. joins.
Quoting from https://medium.com/#adrianchang/apache-spark-checkpointing-ebd2ec065371:
Spark programs take a huge performance hit when fault tolerance occurs
as the entire set of transformations to a DataFrame or RDD have to be
recomputed when fault tolerance occurs or for each additional
transformation that is applied on top of an RDD or DataFrame.
Note that localCheckpoint() is not "reliable": it keeps the checkpointed data on the executors rather than in a fault-tolerant location, unlike checkpoint().
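For what it's worth, here is a minimal sketch of the pattern being asked about (the DataFrame names are made up). The idea is simply to cut the lineage every few joins so the plan Catalyst has to analyse stays small:

# A few joins from a long chain of ~30 joins.
intermediate = df1.join(df2, "id").join(df3, "id").join(df4, "id")

# localCheckpoint() materialises the result on the executors and truncates the
# logical plan; it is fast, but the data is lost if an executor dies
# (unlike checkpoint(), which writes to a reliable filesystem).
intermediate = intermediate.localCheckpoint()

result = intermediate.join(df5, "id").join(df6, "id")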
Caching is definitely a strategy to optimize performance. In general, given that your data size and the resources of your Spark application remain unchanged, there are three points to consider when you want to optimize a join:
Data skew: most of the time, when I try to find out why a join is taking so long, data skew turns out to be one of the reasons. In fact, not just joins: any transformation needs an even data distribution, so that you don't end up with one skewed partition holding most of the data while everything else waits on that single task. Make sure your data is well distributed.
Data broadcasting: when we join, data shuffling is normally inevitable. In some cases we use a relatively small dataframe as a reference to filter the data in a very big dataframe, and shuffling the big dataframe for that is very expensive. Instead, we can broadcast the small dataframe to every node and avoid the costly shuffle (see the sketch after this list).
Keep your joining data as lean as possible: as mentioned in point 2, data shuffling is inevitable when you join. Therefore, keep your dataframes as lean as possible: remove unnecessary rows and columns to reduce the amount of data that has to move across the network during the shuffle.
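A minimal sketch of points 2 and 3 together (table and column names are hypothetical): filter and prune the reference dataframe first, then broadcast it so the big side is never shuffled for the join.

from pyspark.sql import functions as F

# Keep only the rows and columns the join actually needs.
small = (dim_df
    .filter(F.col("status") == "active")
    .select("key", "category"))

# Broadcasting the lean small side avoids shuffling the large fact table.
result = fact_df.select("key", "amount").join(F.broadcast(small), on="key")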
I am trying to understand Spark partitions, and in a blog I came across this passage:
However, you should understand that you can drastically reduce the parallelism of your data processing — coalesce is often pushed up further in the chain of transformation and can lead to fewer nodes for your processing than you would like. To avoid this, you can pass shuffle = true. This will add a shuffle step, but it also means that the reshuffled partitions will be using full cluster resources if possible.
I understand that coalesce means taking the data from the executors containing the least data and shuffling it to already existing executors via a hash partitioner. I am not able to understand what the author is trying to say in this passage, though. Can somebody please explain what is being said here?
Coalesce has some not-so-obvious effects due to Spark Catalyst.
E.g.
Let’s say you had a parallelism of 1000, but you only wanted to write
10 files at the end. You might think you could do:
load().map(…).filter(…).coalesce(10).save()
However, Spark will effectively push down the coalesce operation to as early a point as possible, so this will execute as:
load().coalesce(10).map(…).filter(…).save()
You can read the details in an excellent article, which I quote from above and chanced upon some time ago: https://medium.com/airbnb-engineering/on-spark-hive-and-small-files-an-in-depth-look-at-spark-partitioning-strategies-a9a364f908
In summary: Catalyst's treatment of coalesce can reduce concurrency early in the pipeline. I think this is what is being alluded to, though of course each case is different, and JOINs and aggregations are generally not subject to such effects because of the default of 200 shuffle partitions that applies to those Spark operations.
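Put differently, if you want full parallelism for the expensive work but only a small number of output files, the usual workaround is to use repartition instead of coalesce just before the write. A hedged sketch with made-up names: repartition adds a shuffle, but it is not pushed up the plan, so the earlier transformations keep their original parallelism.

from pyspark.sql import functions as F

result = (events                       # hypothetical large DataFrame
    .filter(F.col("status") == "ok")   # runs with the full input parallelism
    .withColumn("day", F.to_date(F.col("ts")))
    .repartition(10))                  # shuffle down to 10 partitions only for the write

result.write.mode("overwrite").parquet("/tmp/output")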
As you said in your question, "coalesce means to take the data on some of the least data containing executors and shuffle them to already existing executors via a hash partitioner". This effectively means the following:
The number of partitions is reduced.
The main difference between repartition and coalesce is that coalesce moves less data across partitions than repartition, which reduces the amount of shuffle and therefore makes it more efficient.
Adding the flag shuffle=true just distributes the data evenly across the nodes, which is the same as using repartition(). You can use shuffle=true if you feel your data might end up skewed across the nodes after a plain coalesce.
Hope this answers your question
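As a concrete illustration of the shuffle flag (assuming an existing SparkSession named spark): the flag lives on the RDD API, while for DataFrames repartition() plays that role.

rdd = spark.sparkContext.parallelize(range(1000), 100)

rdd.coalesce(10).getNumPartitions()                 # 10, no shuffle: partitions are merged locally
rdd.coalesce(10, shuffle=True).getNumPartitions()   # 10, full shuffle, data redistributed evenly
rdd.repartition(10).getNumPartitions()              # 10, always shuffles (same as above)

df = spark.range(1000)
df.coalesce(10)      # DataFrame coalesce never shuffles
df.repartition(10)   # DataFrame equivalent of coalesce with shuffle=true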
I am joining two dataframes that read CSV files from S3, using df.join. The join takes 9 minutes to complete with the default spark.sql.shuffle.partitions (200).
When I change spark.sql.shuffle.partitions to 10, it still takes almost the same time.
Is there any way to improve the performance of this job?
Also, how do I dynamically decide the value of spark.sql.shuffle.partitions in a production scenario?
One of the most effective ways to speed up Spark joins is to minimize the number of elements in each dataframe; for example, apply as many filters as possible on the dataframes before joining them. Another way is to broadcast the smaller dataframe (keep in mind that the broadcast dataframe must be an order of magnitude smaller than the other); a small SQL-hint sketch follows the links below. For more details, you can use the following tips on Spark join optimization:
databricks presentation on optimizing apache-spark SQL joins
Performance Tuning of apache-spark
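For example, a hedged sketch of the broadcast approach using the SQL API (view and column names are made up); the hint tells Spark to ship the small table to every executor instead of shuffling both sides:

big_df.createOrReplaceTempView("big")
small_df.createOrReplaceTempView("small")

joined = spark.sql("""
    SELECT /*+ BROADCAST(small) */ big.key, big.amount, small.category
    FROM big
    JOIN small ON big.key = small.key
    WHERE big.amount > 0   -- filter as early as possible to keep the data lean
""")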
This question already has an answer here:
Spark final task takes 100x times longer than first 199, how to improve
I have a basic Spark job that does a couple of joins. The 3 dataframes being joined are fairly big, nearly 2 billion records in each of them. I have Spark infrastructure that automatically scales up nodes whenever necessary. It is a very simple Spark SQL query whose results I write to disk, but the job always gets stuck at 99% when I look at it in the Spark UI.
A bunch of things I have tried:
Increase the number of executors and executor memory.
Use repartition while writing the file.
Use the native Spark join instead of the Spark SQL join, etc.
However, none of these has worked. It would be great if somebody could share their experience of solving this problem. Thanks in advance.
Because of the join operation, all records with the same key are shuffled to the same executor. If your data is skewed, meaning there are one or a few keys that are very dominant in terms of the number of rows, then that single executor has to process all of those rows. Essentially your Spark job becomes single-threaded, since that one key has to be processed by a single task.
Repartitioning will not help, since the join will shuffle the data again by hashing the join key. You could try increasing the number of partitions in case of an unlucky hash.
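If you want to confirm the skew before changing anything, a quick check (the column name is hypothetical) is to look at the row count per join key; one or a few keys with counts orders of magnitude above the rest is the classic signature:

from pyspark.sql import functions as F

df.groupBy("join_key").count().orderBy(F.desc("count")).show(20)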
This video explains the problems, and suggests a solution:
https://www.youtube.com/watch?v=6zg7NTw-kTQ
Cheers, Fokko
I am trying to "train" a DecisionTreeClassifier using Apache Spark running in a cluster on Amazon EMR. I can see that around 50 executors are added, and that the features are created by querying a Postgres database with Spark SQL and stored in a DataFrame.
Even so, the DecisionTree fit method takes many hours, even though the dataset is not that big (10,000 database entries with a couple of hundred bytes per row). I can see that there is only one task for this stage, so I assume that is the reason it is so slow.
Where should I look for the reason that this is running in one task?
Is it the way that I retrieve the data?
I am sorry if this is a bit vague, but I don't know whether the code that retrieves the data is relevant, whether it is a parameter of the algorithm (although I didn't find anything online), or whether it is just Spark tuning.
I would appreciate any direction!
Thanks in advance.
Spark relies on data locality. It seems that all the data is located in a single place, hence Spark uses a single partition to process it. You could apply a repartition, or state the number of partitions you would like to use at load time. I would also look into the decision tree API to see whether you can set the number of partitions for it specifically.
Basically, partitions are your level of parallelism.
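To make that concrete, here is a hedged sketch of both options (the JDBC URL, table and column names are made up): either partition the JDBC read itself, or repartition the DataFrame before calling fit().

# Option 1: parallelise the read from Postgres instead of pulling one partition.
features = (spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://dbhost:5432/mydb")
    .option("dbtable", "features")
    .option("user", "user").option("password", "secret")
    .option("partitionColumn", "id")   # a numeric column to split the read on
    .option("lowerBound", "1")
    .option("upperBound", "10000")
    .option("numPartitions", "50")
    .load())

# Option 2: if the data already arrived as a single partition, spread it out
# before training so the fit can use all executors.
features = features.repartition(50)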