Spark DataFrame / Dataset groupBy optimization via bucketBy - apache-spark

I'm researching options for a use-case where we store the dataset as parquet files and want to run efficient groupBy queries for a specific key later on when we read the data.
I've read a bit about the optimizations for groupBy, however couldn't really find much about it (other than RDD level reduceByKey).
What I have in mind is, if the dataset is written bucketed by the key that will also be used in the groupBy. Theoretically the groupBy could be optimized, since all the rows containing the key will be co-located (and even consecutive if it's also stored sorted on the same key).
One idea i have in mind is to apply the transformation via mapPartitions then groupBy, however, this will require breaking down my functions into two, it's not really desirable. I believe for some class of functions (say sum/count) spark would optimize the query with a similar fashion as well, but the optimization would be kicked in by the choice of function, and would work regardless of the co-location of the rows, but not because of the co-location.
Can spark leverage the co-location of the rows to optimize the groupBy using any function subsequently?

It seems like bucketing's main use-case is for doing JOINs on the bucketed key, which allows Spark to avoid doing a shuffle across the whole table. If Spark knows that the rows are already partitioned across the buckets, I don't see why it wouldn't know to use the pre-partitioned buckets in a GROUP BY. You might need to sort by the group by key as well though.
I'm also interested in this use-case so will be trying it out and seeing if a shuffle occurs.

Related

Suggestion for multiple joins in spark

Recently I got a requirement to perform combination joins.
I have to perform around 30 to 36 joins in Spark.
It was consuming more time to build the execution plan. So I cached the execution plan in intermediate stages using df.localCheckpoint().
Is this a good way to do? Any thoughts, please share.
Yes, it is fine.
This is mostly discussed for iterative ML algorithms, but can be equally applied for a Spark App with many steps - e.g. joins.
Quoting from https://medium.com/#adrianchang/apache-spark-checkpointing-ebd2ec065371:
Spark programs take a huge performance hit when fault tolerance occurs
as the entire set of transformations to a DataFrame or RDD have to be
recomputed when fault tolerance occurs or for each additional
transformation that is applied on top of an RDD or DataFrame.
localCheckpoint() is not "reliable".
Caching is definitely a strategy to optimize your performance. In general, given that your data size and resource of your spark application remains unchanged, there are three points that need to be considered when you want to optimize your joining operation:
Data skewness: In most of the time, when I'm trying to find out the reason why the joining takes a lot of time, data skewness is always be one of the reasons. In fact, not only the joining operation, any transformation need a even data distribution so that you won't have a skewed partition that have lots of data and wait the single task in single partition. Make sure your data are well distributed.
Data broadcasting: When we do the joining operation, data shuffling is inevitable. In some case, we use a relatively small dataframe as a reference to filter the data in a very big dataframe. In this case, it's a very expensive operation to shuffle the dataframe. Instead, we can use the dataframe broadcasting to broadcast your small dataframe to every single node and prevent the costly shuffling.
Keep your joining data as lean as possible: like what I mentioned in point 2, data shuffling is inevitable when you do the joining operation. Therefore, please keep your dataframe as lean as possible, which means remove the rows / columns if it's unnecessary to reduce the size of data that need to be moved across the network during the data shuffling.

How to spark partitionBy/bucketBy correctly?

Q1. Will adhoc (dynamic) repartition of the data a line before a join help to avoid shuffling or will the shuffling happen anyway at the repartition and there is no way to escape it?
Q2. should I repartition/partitionBy/bucketBy? what is the right approach if I will join according to column day and user_id in the future? (I am saving the results as hive tables with .write.saveAsTable). I guess to partition by day and bucket by user_id but that seems to create thousands of files (see Why is Spark saveAsTable with bucketBy creating thousands of files?)
Some 'guidance' off the top of my head, noting that title and body of text differ to a degree:
Question 1:
A JOIN will do any (hash) partitioning / repartitioning required automatically - if needed and if not using a Broadcast JOIN. You may
set the number of partitions for shuffling or use the default - 200.
There are more parties (DF's) to consider.
repartition is a transformation, so any up-front repartition may not be executed at all due to Catalyst optimization - see the physical plan generated from the .explain. That's the deal with lazy
evaluation - determining if something is necessary upon Action
invocation.
Question 2:
If you have a use case to JOIN certain input / output regularly, then using Spark's bucketBy is a good approach. It obviates shuffling. The
databricks docs show this clearly.
A Spark schema using bucketBy is NOT compatible with Hive. so these remain Spark only tables, unless this changed recently.
Using Hive partitioning as you state depend on push-down logic, partition pruning etc. It should work as well but you may have have
different number of partitions inside Spark framework after the read.
It's a bit more complicated than saying I have N partitions so I will
get N partitions on the initial read.

When should we go for Spark-sql and when should we go for Spark RDD

On which scenario we should prefer spark RDD to write a solution and on which scenario we should choose to go for spark-sql. I know spark-sql gives better performance and it works best with structure and semistructure data. But what else factors are there that we need to take into consideration while choosing betweeen spark Rdd and spark-sql.
I don't see much reasons to still use RDDs.
Assuming you are using JVM based language, you can use DataSet that is the mix of SparkSQL+RDD (DataFrame == DataSet[Row]), according to spark documentation:
Dataset is a new interface added in Spark 1.6 that provides the benefits of RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL’s optimized execution engine.
The problem is python is not support DataSet so, you will use RDD and lose spark-sql optimization when you work with non-structed data.
I found using DFs easier to use than DSs - the latter are still subject to development imho. The comment on pyspark indeed still relevant.
RDDs still handy for zipWithIndex to put asc, contiguous sequence numbers on items.
DFs / DSs have a columnar store and have a better Catalyst (Optimizer) support.
Also, may things with RDDs are painful, like a JOIN requiring a key, value and multi-step join if needing to JOIN more than 2 tables. They are legacy. Problem is the internet is full of legacy and thus RDD jazz.
RDD
RDD is a collection of data across the clusters and it handles both unstructured and structured data. It's typically a function part of handling data.
DF
Data frames are basically two dimensional array of objects defining the data in a rows and columns. It's similar to relations tables in the database. Data frame handles only the structured data.

Spark Dataframe needs to be repartition after filter like RDD?

According so many good resources, it is advisable to re-partition a RDD after filter operation. since, there is a possibility that most of the partitions are now empty.
I have a doubt that in case of Data Frames has this been handled in current versions or do we still need to repartition it after a filter operation?
I have a doubt that in case of Data Frames has this been handled in current versions or do we still need to repartition it after a filter operation?
If you ask if Spark automatically repartitions data the answer is negative (and I hope it won't change in the future)
According so many good resources, it is advisable to re-partition a RDD after filter operation. since, there is a possibility that most of the partitions are now empty.
This really depends on two factors:
How selective is the filter (what is the expected fraction of the records preserved).
What is the distribution of data, in respect to predicate, prior to filter.
Unless you expect that predicate prunes majority of data or prior distribution will leave significant fraction of partitions empty, costs of repartitioning usually outweigh potential benefits, so the main reason to call repartition is to limit the number of the output files.
Spark does not automatically repartition data. It would be a good idea to repartition the data after filtering if you need to do operations such as join and aggregate. Based on your needs you should either use repartition or coalesce. Typically coalesce is preferable since it tries to group data together without shuffling, therefore it only decreases the # of partitions. (good link for understanding coalesce and repartition)
There aren't huge performance boost if you don't do any heavy computation after your filtering operation. Keep in mind that repartition by itself could also be expensive. You must know your data to make that decision
I am assuming that this is your question.
Shall I run a filter operation before repartition or after repartition?
Based on this assumption, a filter will always try to find records matching some conditions. So, the resultant data frame/RDD is always either less than or equal to the previous data frame/RDD. In most cases, the resultant set is less than the previous one.
Whereas repartition is one of the most expensive operations because it does a shuffle. Always remember whenever we are performing a repartition the less the data is in memory the better the performance we can get out of it.
I don't even have to talk more about how Spark handles it etc, in
general filter before repartition is good for performance!
For example, catalyst optimizer itself uses before and after filter to improve performance.
Blog Link:
For example, Spark knows how and when to do things like combine
filters, or move filters before joins. Spark 2.0 even allows you to
define, add, and test out your own additional optimization rules at
runtime. 1[2]

join DataFrames within partitions in PySpark

I have two dataframes with a large (millions to tens of millions) number of rows. I'd like to do a join between them.
In the BI system I'm currently using, you make this fast by first partitioning on a particular key, then doing the join on that key.
Is this a pattern that I need to be following in Spark, or does that not matter? It seems at first glance like a lot of time is wasted shuffling data between partitions, because it hasn't been pre-partitioned correctly.
If it is necessary, then how do I do that?
If it is necessary, then how do I do that?
How to define partitioning of DataFrame?
However it makes sense only under two conditions:
There multiple joins withing the same application. Partitioning shuffles itself, so if it is a single join there is no added value.
It is long lived application where shuffled data will be reused. Spark cannot take advantage of the partitioning of the data stored in the external format.

Resources