Spark Dataframe needs to be repartition after filter like RDD? - apache-spark

According so many good resources, it is advisable to re-partition a RDD after filter operation. since, there is a possibility that most of the partitions are now empty.
I have a doubt that in case of Data Frames has this been handled in current versions or do we still need to repartition it after a filter operation?

I have a doubt that in case of Data Frames has this been handled in current versions or do we still need to repartition it after a filter operation?
If you ask if Spark automatically repartitions data the answer is negative (and I hope it won't change in the future)
According so many good resources, it is advisable to re-partition a RDD after filter operation. since, there is a possibility that most of the partitions are now empty.
This really depends on two factors:
How selective is the filter (what is the expected fraction of the records preserved).
What is the distribution of data, in respect to predicate, prior to filter.
Unless you expect that predicate prunes majority of data or prior distribution will leave significant fraction of partitions empty, costs of repartitioning usually outweigh potential benefits, so the main reason to call repartition is to limit the number of the output files.

Spark does not automatically repartition data. It would be a good idea to repartition the data after filtering if you need to do operations such as join and aggregate. Based on your needs you should either use repartition or coalesce. Typically coalesce is preferable since it tries to group data together without shuffling, therefore it only decreases the # of partitions. (good link for understanding coalesce and repartition)
There aren't huge performance boost if you don't do any heavy computation after your filtering operation. Keep in mind that repartition by itself could also be expensive. You must know your data to make that decision

I am assuming that this is your question.
Shall I run a filter operation before repartition or after repartition?
Based on this assumption, a filter will always try to find records matching some conditions. So, the resultant data frame/RDD is always either less than or equal to the previous data frame/RDD. In most cases, the resultant set is less than the previous one.
Whereas repartition is one of the most expensive operations because it does a shuffle. Always remember whenever we are performing a repartition the less the data is in memory the better the performance we can get out of it.
I don't even have to talk more about how Spark handles it etc, in
general filter before repartition is good for performance!
For example, catalyst optimizer itself uses before and after filter to improve performance.
Blog Link:
For example, Spark knows how and when to do things like combine
filters, or move filters before joins. Spark 2.0 even allows you to
define, add, and test out your own additional optimization rules at
runtime. 1[2]

Related

Suggestion for multiple joins in spark

Recently I got a requirement to perform combination joins.
I have to perform around 30 to 36 joins in Spark.
It was consuming more time to build the execution plan. So I cached the execution plan in intermediate stages using df.localCheckpoint().
Is this a good way to do? Any thoughts, please share.
Yes, it is fine.
This is mostly discussed for iterative ML algorithms, but can be equally applied for a Spark App with many steps - e.g. joins.
Quoting from https://medium.com/#adrianchang/apache-spark-checkpointing-ebd2ec065371:
Spark programs take a huge performance hit when fault tolerance occurs
as the entire set of transformations to a DataFrame or RDD have to be
recomputed when fault tolerance occurs or for each additional
transformation that is applied on top of an RDD or DataFrame.
localCheckpoint() is not "reliable".
Caching is definitely a strategy to optimize your performance. In general, given that your data size and resource of your spark application remains unchanged, there are three points that need to be considered when you want to optimize your joining operation:
Data skewness: In most of the time, when I'm trying to find out the reason why the joining takes a lot of time, data skewness is always be one of the reasons. In fact, not only the joining operation, any transformation need a even data distribution so that you won't have a skewed partition that have lots of data and wait the single task in single partition. Make sure your data are well distributed.
Data broadcasting: When we do the joining operation, data shuffling is inevitable. In some case, we use a relatively small dataframe as a reference to filter the data in a very big dataframe. In this case, it's a very expensive operation to shuffle the dataframe. Instead, we can use the dataframe broadcasting to broadcast your small dataframe to every single node and prevent the costly shuffling.
Keep your joining data as lean as possible: like what I mentioned in point 2, data shuffling is inevitable when you do the joining operation. Therefore, please keep your dataframe as lean as possible, which means remove the rows / columns if it's unnecessary to reduce the size of data that need to be moved across the network during the data shuffling.

Why can coalesce lead to too few nodes for processing?

I am trying to understand spark partitions and in a blog I come across this passage
However, you should understand that you can drastically reduce the parallelism of your data processing — coalesce is often pushed up further in the chain of transformation and can lead to fewer nodes for your processing than you would like. To avoid this, you can pass shuffle = true. This will add a shuffle step, but it also means that the reshuffled partitions will be using full cluster resources if possible.
I understand that coalesce means to take the data on some of the least data containing executors and shuffle them to already existing executors via a hash partitioner. I am not able to understand what the author is trying to say in this para though. Can somebody please explain to me what is being said in this paragraph?
Coalesce has some not so obvious effects due to Spark
Catalyst.
E.g.
Let’s say you had a parallelism of 1000, but you only wanted to write
10 files at the end. You might think you could do:
load().map(…).filter(…).coalesce(10).save()
However, Spark’s will effectively push down the coalesce operation to
as early a point as possible, so this will execute as:
load().coalesce(10).map(…).filter(…).save()
You can read in detail here an excellent article, that I quote from, that I chanced upon some time ago: https://medium.com/airbnb-engineering/on-spark-hive-and-small-files-an-in-depth-look-at-spark-partitioning-strategies-a9a364f908
In summary: Catalyst treatment of coalesce can reduce concurrency early in the pipeline. This I think is what is being alluded to, though of course each case is different and JOIN and aggregating are not subject to such effects in general due to 200 default partitioning that applies for such Spark operations.
As what you have said in your question "coalesce means to take the data on some of the least data containing executors and shuffle them to already existing executors via a hash practitioner". This effectively means the following
The number of partitions have reduced
The main difference between repartition and coalesce is that in coalesce the movement of the data across the partitions is fewer than in repartition thus reducing the level of shuffle thus being more efficient.
Adding the property shuffle=true is just to distribute the data evenly across the nodes which is the same as using repartition(). You can use shuffle=true if you feel that your data might get skewed in the nodes after performing coalesce.
Hope this answers your question

Difference between repartition(1) and coalesce(1)

In our project, we are using repartition(1) to write data into table, I am interested to know why coalesce(1) cannot be used here because repartition is a costly operation compared to coalesce.
I know repartition distributes data evenly across partitions, but when the output file is of single part file, why can't we use coalesce(1)?
coalesce has an issue where if you're calling it using a number smaller than your current number of executors, the number of executors used to process that step will be limited by the number you passed in to the coalesce function.
The repartition function avoids this issue by shuffling the data. In any scenario where you're reducing the data down to a single partition (or really, less than half your number of executors), you should almost always use repartition over coalesce because of this. The shuffle caused by repartition is a small price to pay compared to the single-threaded operation of a call to coalesce(1)
You state nothing else in terms of logic.
coalesce will use existing partitions to minimize shuffling. In case of coalsece(1) and counterpart may be not a big deal, but one can take this guiding principle that repartition creates new partitions and hence does a full shuffle. That said, coalsece can be said to minimize the amount of shuffling.
In my spare time I chanced upon this https://medium.com/airbnb-engineering/on-spark-hive-and-small-files-an-in-depth-look-at-spark-partitioning-strategies-a9a364f908 excellent article. Look for the quote: Coalesce sounds useful in some cases, but has some problems.

How to spark partitionBy/bucketBy correctly?

Q1. Will adhoc (dynamic) repartition of the data a line before a join help to avoid shuffling or will the shuffling happen anyway at the repartition and there is no way to escape it?
Q2. should I repartition/partitionBy/bucketBy? what is the right approach if I will join according to column day and user_id in the future? (I am saving the results as hive tables with .write.saveAsTable). I guess to partition by day and bucket by user_id but that seems to create thousands of files (see Why is Spark saveAsTable with bucketBy creating thousands of files?)
Some 'guidance' off the top of my head, noting that title and body of text differ to a degree:
Question 1:
A JOIN will do any (hash) partitioning / repartitioning required automatically - if needed and if not using a Broadcast JOIN. You may
set the number of partitions for shuffling or use the default - 200.
There are more parties (DF's) to consider.
repartition is a transformation, so any up-front repartition may not be executed at all due to Catalyst optimization - see the physical plan generated from the .explain. That's the deal with lazy
evaluation - determining if something is necessary upon Action
invocation.
Question 2:
If you have a use case to JOIN certain input / output regularly, then using Spark's bucketBy is a good approach. It obviates shuffling. The
databricks docs show this clearly.
A Spark schema using bucketBy is NOT compatible with Hive. so these remain Spark only tables, unless this changed recently.
Using Hive partitioning as you state depend on push-down logic, partition pruning etc. It should work as well but you may have have
different number of partitions inside Spark framework after the read.
It's a bit more complicated than saying I have N partitions so I will
get N partitions on the initial read.

Do wide dependency and shuffle always occur at the same time in Spark?

I am reading a spark book, and can hardly understand one sentence below. To me, I cannot imagine a case which is wide dependency, but we don't need shuffle. Can anyone give me an example?
"In certain instances, for example, when Spark already knows the data is partitioned in a certain way, operations with wide dependencies do not cause a shuffle." -- "High Performance Spark" by Holden Karau
RDD dependencies are actually in terms of partitions and how partitions are created.
Note: Below definitions are for ease of understanding:
If each of the partitions of an RDDs are created from only one partition of a single RDD, then it is a narrow dependency.
On the other hand, if a partition in a RDD is created from more than one partition(from same or different RDD), then it is a wide dependency.
Shuffle operation is required whenever data required to create a partition is not at one place(that means, it has to be taken from different locations/partitions).
If data is already grouped in one or more partitions(using operations like groupBy, partitionBy etc), you just have to take the corresponding items from each of the partitions and merge them. In this case, shuffle operation is not necessary.
For more details refer this, especially the visual example images.

Resources