How to use Spark partitionBy/bucketBy correctly? - apache-spark

Q1. Will an ad hoc (dynamic) repartition of the data on the line before a join help to avoid shuffling, or will the shuffle happen anyway at the repartition, with no way to escape it?
Q2. Should I use repartition, partitionBy, or bucketBy? What is the right approach if I will join on the columns day and user_id in the future? (I am saving the results as Hive tables with .write.saveAsTable.) My guess is to partition by day and bucket by user_id, but that seems to create thousands of files (see Why is Spark saveAsTable with bucketBy creating thousands of files?)

Some guidance off the top of my head, noting that the title and the body of the question differ to a degree:
Question 1:
A JOIN does any (hash) partitioning or repartitioning it requires automatically, if needed and if it is not a broadcast JOIN. You may set the number of shuffle partitions or use the default of 200.
There is more than one party (DataFrame) to consider.
repartition is a transformation, so any up-front repartition may not be executed at all because of Catalyst optimization; see the physical plan produced by .explain(). That is the deal with lazy evaluation: whether something is necessary is determined when an Action is invoked.
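One way to check this for a concrete join is to print the physical plan with and without the up-front repartition and compare the Exchange operators. The following is a minimal PySpark sketch, not a definitive recipe: the function name compare_join_plans is hypothetical, and the join columns day and user_id are taken from the question.

```python
from typing import TYPE_CHECKING, Sequence

if TYPE_CHECKING:  # used for type hints only; nothing is imported at runtime
    from pyspark.sql import DataFrame


def compare_join_plans(left: "DataFrame", right: "DataFrame",
                       keys: Sequence[str] = ("day", "user_id")) -> None:
    """Print the physical plan of a join with and without an up-front repartition."""
    key_list = list(keys)

    # Plain join: Spark plans whatever Exchange (shuffle) operators it needs.
    left.join(right, on=key_list).explain()

    # Explicit repartition on the join keys first. The repartition is itself
    # a shuffle, so expect the Exchange to move rather than disappear.
    left.repartition(*key_list).join(right, on=key_list).explain()
```

Nothing is executed by either call; explain() only prints the plan, which is enough to see whether the extra repartition changes anything for your data sources.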
Question 2:
If you regularly JOIN the same inputs or outputs, Spark's bucketBy is a good approach: it obviates shuffling for those joins. The Databricks docs show this clearly.
A Spark table written with bucketBy is NOT compatible with Hive, so these remain Spark-only tables, unless this has changed recently.
Using Hive partitioning as you describe depends on push-down logic, partition pruning, etc. It should work as well, but you may have a different number of partitions inside Spark after the read. It is a bit more complicated than saying "I have N partitions, so I will get N partitions on the initial read".
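The write the question describes (partition by day, bucket by user_id, save with saveAsTable) could look roughly like the sketch below. The function name, the table name passed in, and the default of 16 buckets are hypothetical. The repartition before the write is an assumption on my part: it is a commonly suggested way to keep each task from writing a file for every bucket, which is one cause of the "thousands of files" effect.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # used for type hints only; nothing is imported at runtime
    from pyspark.sql import DataFrame


def save_partitioned_bucketed(df: "DataFrame", table_name: str,
                              num_buckets: int = 16) -> None:
    """Write df as a Spark table partitioned by day and bucketed by user_id."""
    (
        df
        # Assumption: grouping rows by the bucket column into num_buckets
        # tasks first reduces the number of files each task has to write.
        .repartition(num_buckets, "user_id")
        .write
        .partitionBy("day")                # one directory per day
        .bucketBy(num_buckets, "user_id")  # fixed number of buckets per table
        .sortBy("user_id")                 # sortBy is only valid with bucketBy
        .mode("overwrite")
        .saveAsTable(table_name)           # bucketing requires saveAsTable
    )
```

For a later join on user_id to benefit, both tables would typically need to be bucketed on that column with the same number of buckets.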

Related

How to distribute data into X partitions on read with Spark?

I’m trying to read data from Hive into a Spark DataFrame and distribute it into a specific, configurable number of partitions (correlated with the number of cores). My job is pretty straightforward and does not contain any joins or aggregations. I’ve read about the spark.sql.shuffle.partitions property, but the documentation says:
Configures the number of partitions to use when shuffling data for joins or aggregations.
Does this mean that configuring this property would be irrelevant for me? Or is the read operation considered a shuffle? If not, what is the alternative? repartition and coalesce seem like overkill for this.
To verify my understanding of your problem: you want to increase the number of partitions in the RDD/DataFrame that is created immediately after reading the data.
In this case the property you are after is spark.sql.files.maxPartitionBytes, which controls the maximum number of bytes packed into a single partition when reading files (please refer to https://spark.apache.org/docs/2.4.0/sql-performance-tuning.html)
The default value is 128 MB; it can be lowered to increase parallelism.
A read is not a shuffle as such. You need to get the data in at some stage.
You can use the setting from the other answer; otherwise an algorithm in Spark sets the number of partitions on read.
You do not state whether you are using an RDD or a DataFrame. With an RDD you can set the number of partitions. With a DataFrame you generally need to repartition after the read.
Your point on controlling parallelism is less relevant when joining or aggregating, as you note.
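To get a feel for how spark.sql.files.maxPartitionBytes affects the partition count, here is a plain-Python approximation of the split planning as I understand it for splittable file sources. It is a simplified model, not Spark's code: the function name is hypothetical, the 4 MB open cost mirrors the default of spark.sql.files.openCostInBytes, and reads that go through the Hive SerDe path may behave differently.

```python
from typing import List

MIB = 1024 * 1024


def estimate_read_partitions(file_sizes: List[int],
                             default_parallelism: int,
                             max_partition_bytes: int = 128 * MIB,
                             open_cost_bytes: int = 4 * MIB) -> int:
    """Rough estimate of the partition count for a splittable file-based read."""
    if not file_sizes:
        return 0

    # Target split size: capped by maxPartitionBytes, but smaller when the
    # input is small compared with the available parallelism.
    total = sum(size + open_cost_bytes for size in file_sizes)
    bytes_per_core = total // default_parallelism
    split_size = min(max_partition_bytes, max(open_cost_bytes, bytes_per_core))

    # Cut every file into chunks of at most split_size bytes.
    chunks = []
    for size in file_sizes:
        offset = 0
        while offset < size:
            chunks.append(min(split_size, size - offset))
            offset += split_size

    # Pack chunks, largest first, into partitions of roughly split_size bytes.
    partitions, current = 0, 0
    for chunk in sorted(chunks, reverse=True):
        if current > 0 and current + chunk > split_size:
            partitions += 1
            current = 0
        current += chunk + open_cost_bytes
    return partitions + (1 if current > 0 else 0)
```

Under this model, ten 1 GiB files on 8 cores give 80 partitions with the 128 MB default and 160 partitions when the limit is halved to 64 MB, which is the kind of effect the setting is meant to have.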

Improve Spark denormalization/partition performance

I have a denormalization use case - one hive avro fact table to join with 14 smaller dimension tables and produce a denormalized parquet output table. Both the input fact table and output table are partitioned in the same way (Category=TEST1, YearMonthId=202101). And I do run historical processing, which means processing and loading several months for a given category at once.
I am using Spark 2.4.0 with PySpark DataFrames, broadcast joins for all the table joins, dynamic partition inserts, and coalesce at the end to control the number of output files. (I see a shuffle at the last stage, probably because of the dynamic partition inserts.)
I would like to know which optimizations are possible with respect to managing partitions, e.g. keeping the partitioning consistent from the input to the output stage so that no shuffle is involved. I want to leverage the fact that the input and output tables are partitioned by the same columns.
I am also thinking about using static partition writes: determine the partitions and write to them in parallel. Would this help to speed things up or avoid the shuffle?
Appreciate any help that would lead me in the right direction.
A couple of options that I tried and that improved performance (both run time and avoiding small files):
1. Use repartition (instead of coalesce) on the DataFrame before doing the broadcast joins, which minimized the shuffle and hence the shuffle spill.
-- repartition(count, *PartitionColumnList, AnyOtherSaltingColumn) (add a salting column if the repartition is not even)
2. Make sure that the base tables are properly compacted. This might even eliminate the need for #1 in some cases, and it reduces the number of tasks and therefore the task-scheduling overhead.

Spark SQL joining multiple tables design

I am developing a Spark SQL analytics solution using a set of tables. Suppose there are 5 tables that I need to build my solution, and at the end I create one output table.
Here is my flow:
dataframe1 = table1 join table2
dataframe2 = dataframe1 join table3
dataframe3 = dataframe2 + filter + agg
dataframe4 = dataframe3 join table4 join table5
// finally
dataframe4.saveAsTable
Only when I save the final DataFrame are all of the DataFrames above evaluated.
Is my approach good? Or
do I need to cache/persist the intermediate DataFrames?
This is a very generic question and it is hard to provide a definitive answer.
Depending on the size of the tables, you may want to add a broadcast hint for any table that is relatively small.
You can do this via
table_i.join(broadcast(table_j), ....)
This behaviour depends on the value in:
The broadcast hint will be honoured only if Spark is able to evaluate the size of the table, so you might need to cache().
Another option is Spark checkpoints, which help to truncate the logical plan for optimisation (this also allows you to resume jobs from the checkpoint location; it is similar to writing to HDFS but with some overhead).
If you broadcast tables of a few hundred MB, you might need to increase your Kryo buffer:
--conf spark.kryoserializer.buffer.max=1g
It also depends on which join types you use.
You probably want to filter and aggregate as early as possible, since that reduces the amount of data that takes part in the joins.
There are many other considerations for optimising this properly. If the join keys in any of the joins follow a power-law distribution, you would need to salt the keys and explode the smaller table.
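The "salt and explode" idea can be illustrated without a cluster. The sketch below is plain Python over lists of (key, value) pairs rather than Spark code, and all names in it are hypothetical. It appends a random salt to each row of the large, skewed side, replicates each row of the small side once per salt value, and joins on (key, salt); the result is the same as an ordinary join, but a hot key is spread over several join keys.

```python
import random
from collections import defaultdict
from typing import List, Tuple

Row = Tuple[str, object]


def salted_join(large: List[Row], small: List[Row],
                num_salts: int, seed: int = 0) -> List[Tuple[str, object, object]]:
    """Inner join of two (key, value) lists using a salted join key."""
    rng = random.Random(seed)

    # Large side: add a random salt, so one hot key becomes num_salts keys.
    salted_large = [((key, rng.randrange(num_salts)), value)
                    for key, value in large]

    # Small side: "explode" every row into one copy per possible salt value.
    lookup = defaultdict(list)
    for key, value in small:
        for salt in range(num_salts):
            lookup[(key, salt)].append(value)

    # Join on (key, salt) and drop the salt from the output.
    return [(key, large_value, small_value)
            for (key, salt), large_value in salted_large
            for small_value in lookup.get((key, salt), [])]
```

In Spark the same shape would be a random salt column on the large DataFrame, an exploded range of salts on the small one, and a join on both columns; the price is that the small side grows by a factor of num_salts.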
In your case, in principle, no cache or persist is really required. Why?
There are no evident reuse paths (for other Actions, or for other Transformations within the same Action); it is all sequential.
Also, lazy evaluation and Catalyst apply.
Try .explain and see how Spark will process the query.
However, because memory may be evicted on the cluster, a Worker may need to re-compute. There are various settings that you can apply via .cache and .persist, but Spark handles memory and disk spills without an explicit .cache or .persist. See https://sparkbyexamples.com/spark/spark-difference-between-cache-and-persist/
Also, using .cache can affect performance, so use .explain. See this excellent posting: Spark: Explicit caching can interfere with Catalyst optimizer's ability to optimize some queries?
So each case is different, but yours seems fine to answer as I have. In summary: an RDD or DataFrame that is neither cached nor checkpointed is re-evaluated each time an Action is invoked on it, or when it is re-accessed within the current Action and no skipped-stage situation applies. In your case that is not an issue; caching would in fact slow your app down.

Spark Dataframe needs to be repartition after filter like RDD?

According to many good resources, it is advisable to repartition an RDD after a filter operation, since there is a possibility that most of the partitions are now empty.
For DataFrames, is this handled automatically in current versions, or do we still need to repartition after a filter operation?
On "has this been handled in current versions or do we still need to repartition it after a filter operation?":
If you are asking whether Spark automatically repartitions data, the answer is no (and I hope that won't change in the future).
On "it is advisable to re-partition a RDD after filter operation":
This really depends on two factors:
How selective the filter is (what fraction of the records is expected to be preserved).
How the data is distributed, with respect to the predicate, prior to the filter.
Unless you expect the predicate to prune the majority of the data, or the prior distribution to leave a significant fraction of partitions empty, the cost of repartitioning usually outweighs the potential benefit, so the main reason to call repartition is to limit the number of output files.
Spark does not automatically repartition data. It can be a good idea to repartition the data after filtering if you need to do operations such as joins and aggregations. Based on your needs you should use either repartition or coalesce. Typically coalesce is preferable, since it tries to group data together without shuffling; for that reason it can only decrease the number of partitions.
There isn't a huge performance boost if you don't do any heavy computation after the filter. Keep in mind that repartition by itself can also be expensive. You must know your data to make that decision.
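As a small illustration of the advice above, a selective filter followed by coalesce could be wrapped like this in PySpark. This is only a sketch: the function name and the example condition are hypothetical, and a sensible target partition count depends entirely on how much data survives the filter.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # used for type hints only; nothing is imported at runtime
    from pyspark.sql import DataFrame


def filter_then_shrink(df: "DataFrame", condition: str,
                       target_partitions: int) -> "DataFrame":
    """Apply a filter, then merge the remaining partitions without a shuffle."""
    # Filter first so that any later data movement handles fewer rows.
    filtered = df.filter(condition)

    # coalesce merges existing partitions and avoids a full shuffle; it can
    # only lower the partition count, never raise it.
    return filtered.coalesce(target_partitions)
```

A call might look like filter_then_shrink(events, "day = '2021-01-01'", 8). If the remaining data is badly skewed across partitions, repartition(target_partitions) would even it out at the price of a shuffle.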
I am assuming that this is your question: should I run a filter operation before or after repartition?
Based on this assumption: a filter keeps only the records that match some condition, so the resulting DataFrame/RDD is never larger than the previous one, and in most cases it is smaller.
repartition, on the other hand, is one of the most expensive operations because it performs a shuffle. The less data there is to move when we repartition, the better the performance we get out of it.
Without going into more detail on how Spark handles it: in general, filtering before repartitioning is good for performance!
For example, the Catalyst optimizer itself moves filters earlier in the plan to improve performance. As one blog post puts it:
"For example, Spark knows how and when to do things like combine filters, or move filters before joins. Spark 2.0 even allows you to define, add, and test out your own additional optimization rules at runtime."

Spark: Most efficient way to sort and partition data to be written as parquet

My data is in principle a table, which contains a column ID and a column GROUP_ID, besides other 'data'.
In the first step I read CSVs into Spark, do some processing to prepare the data for the second step, and write the data as parquet.
The second step does a lot of groupBy('GROUP_ID') and Window.partitionBy('GROUP_ID').orderBy('ID').
The goal now is to lay out the data efficiently in the first step, which runs only once, in order to avoid shuffling in the second step.
Question Part 1: AFAIK, Spark preserves the partitioning when loading from parquet (which is actually the basis of any "optimized write consideration" to be made) - correct?
I came up with three possibilities:
df.orderBy('ID').write.partitionBy('GROUP_ID').parquet('/path/to/parquet')
df.orderBy('ID').repartition(n, 'GROUP_ID').write.parquet('/path/to/parquet')
df.repartition(n, 'GROUP_ID').sortWithinPartitions('ID').write.parquet('/path/to/parquet')
I would set n such that the individual parquet files would be ~100MB.
Question Part 2: Is it correct that the three options produce the same or similar results with regard to the goal (avoiding shuffling in the second step)? If not, what is the difference? And which one is better?
Question Part 3: Which of the three options performs better regarding step 1?
Thanks for sharing your knowledge!
EDIT 2017-07-24
After doing some tests (writing to and reading from parquet), it seems that Spark is not able to recover the partitionBy and orderBy information by default in the second step. The number of partitions (as obtained from df.rdd.getNumPartitions()) seems to be determined by the number of cores and/or by spark.default.parallelism (if set), but not by the number of parquet partitions. So the answer to question 1 would be NO, and questions 2 and 3 would be irrelevant.
So it turns out the REAL QUESTION is: is there a way to tell Spark that the data is already partitioned by column X and sorted by column Y?
You will probably be interested in the bucketing support in Spark.
See details here:
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-bucketing.html
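import org.apache.spark.sql.SaveMode
// large is an existing Dataset; bucketedTableName is a String
large.write
  .bucketBy(4, "id")
  .sortBy("id")
  .mode(SaveMode.Overwrite)
  .saveAsTable(bucketedTableName)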
Note that Spark 2.4 added support for bucket pruning (like partition pruning).
The more direct functionality you are looking for is Hive's bucketed-sorted tables:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-BucketedSortedTables
This is not yet available in Spark (see the PS section below).
Also note that Spark will not load the sorting information automatically, but since the data is already sorted, sorting it again will be much faster because there is not much work to do, e.g. one pass over the data just to confirm that it is already sorted.
PS.
Spark and Hive bucketing are slightly different.
This is the umbrella ticket for making Spark compatible with bucketed tables created in Hive:
https://issues.apache.org/jira/browse/SPARK-19256
As far as I know, NO: there is no way to read data from parquet and tell Spark that it is already partitioned by some expression and ordered.
In short, one file on HDFS etc. is too big for one Spark partition. And even if you read each whole file into one partition by playing with Parquet properties such as parquet.split.files=false, parquet.task.side.metadata=true, etc., it would cost more than just one shuffle.
Try bucketBy. Also, partition discovery can help.
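Whether bucketing actually removes the shuffle for the second step can be checked from the physical plan. A minimal PySpark sketch, assuming the data was saved with bucketBy on GROUP_ID via saveAsTable; the function name and the table name passed in are hypothetical:

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # used for type hints only; nothing is imported at runtime
    from pyspark.sql import SparkSession


def explain_group_by_on_bucketed(spark: "SparkSession", table_name: str,
                                 bucket_column: str = "GROUP_ID") -> None:
    """Print the physical plan of a groupBy on the bucket column of a table."""
    # Read through the metastore: bucketing metadata is stored with the table
    # and is not available when the parquet files are read by path.
    counts = spark.table(table_name).groupBy(bucket_column).count()

    # If the bucketing is used, the aggregation should show no Exchange
    # operator in this plan.
    counts.explain()
```

If an Exchange still appears, the table was probably read by path instead of by name, or bucketing is disabled in the session.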

Resources