How to choose the optimal repartition value in Spark - apache-spark

I have 3 input files:
File1 - 27 GB
File2 - 3 GB
File3 - 12 MB
My cluster configuration:
2 executors
Each executor has 2 cores
Executor memory - 13 GB (2 GB overhead)
The transformation I'm going to perform is a left join, in which the left table is file1 and the right tables are file2 and file3.
I need to repartition file1 and file2 to the optimal number of partitions so that I don't waste time/resources.
Thanks in advance

You are not writing about any other transformations, so I am assuming that you want to create a very simple job that performs only this one join.
You are not asking about file3, so I am assuming that you are going to broadcast it with a hint, and that is a good direction.
If you are not doing anything before this join, I am not sure it is worth repartitioning file1/file2, because most probably they are going to be joined with an SMJ (sort-merge join, which shuffles both datasets based on the columns from the join condition), and the output DataFrame from this join will have a number of partitions equal to spark.sql.shuffle.partitions. So you may instead try to tune that parameter (it will also affect other shuffles, so keep in mind my assumption from the first line).
You may try to adjust this parameter to the bigger dataset (file1) so that it creates partitions of around 100-200 MB. I think it's worth reading this blog post: Medium blog post
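A minimal PySpark sketch of that setup, assuming Parquet inputs, a shared join column named join_key, and a shuffle partition count of 180 - all of these are placeholders, not values from the question:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

# Target roughly 100-200 MB per shuffle partition for the 27 GB side:
# 27 GB / ~150 MB ≈ 180 partitions (tune after checking the real on-disk sizes).
spark = (
    SparkSession.builder
    .config("spark.sql.shuffle.partitions", "180")
    .getOrCreate()
)

file1 = spark.read.parquet("path/to/file1")   # 27 GB, left side
file2 = spark.read.parquet("path/to/file2")   # 3 GB, shuffled by the sort-merge join
file3 = spark.read.parquet("path/to/file3")   # 12 MB, small enough to broadcast

result = (
    file1
    .join(file2, on="join_key", how="left")              # SMJ; output has 180 partitions
    .join(broadcast(file3), on="join_key", how="left")   # broadcast hash join, no extra shuffle
)
```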

Related

What are the main differences, in the compute resources, between left join and cross join?

I am very new to Spark resource configuration, and I would like to understand the main differences between using a left join vs a cross join in Spark in terms of resource/compute behaviour.
With the supplied record volume and Spark configuration (cores and memory) being the same, I guess the major gain comes from the underlying filtering of rows (the join condition) in non-Cartesian joins, which uses relatively fewer cores and less memory.
When both of your tables have a similar size/record count:
Cartesian or cross joins will be extremely expensive, as they can easily explode the number of output rows.
Imagine 10,000 x 10,000 = 100 million.
All rows from both datasets will be read, sorted, and written (using n cores) and must fit into memory for the join, giving a larger footprint.
Inner/outer joins work on the principles of map/reduce and co-locality:
rows matching the join condition are filtered (map stage) from the data tables using n cores, followed by a shuffle and sort on the local executors, and the result is output (reduce).
But when one of your tables has a smaller size/record count:
the smaller table will be read, built into a hash table, and written using (maybe) a single partition, i.e. broadcast to each executor that reads X partitions of the larger table.
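To make the contrast concrete, here is a small PySpark sketch of the two join shapes; the 10,000-row ranges and the key column are illustrative only:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

left = spark.range(10_000).withColumnRenamed("id", "key")
right = spark.range(10_000).withColumnRenamed("id", "key")

# Cross join: every row pairs with every row, 10,000 x 10,000 = 100 million output rows.
cartesian = left.crossJoin(right)

# Keyed left join: rows are matched on "key", so the output stays at 10,000 rows,
# and the small side can be broadcast instead of shuffling both sides.
keyed = left.join(broadcast(right), on="key", how="left")

print(cartesian.count(), keyed.count())   # 100000000 vs 10000
```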

Improve Spark denormalization/partition performance

I have a denormalization use case - one Hive Avro fact table to join with 14 smaller dimension tables and produce a denormalized Parquet output table. Both the input fact table and the output table are partitioned in the same way (Category=TEST1, YearMonthId=202101). I also run historical processing, which means processing and loading several months for a given category at once.
I am using Spark 2.4.0 / PySpark DataFrames, broadcast joins for all the table joins, dynamic partition inserts, and coalesce at the end to control the number of output files. (I am seeing a shuffle at the last stage, probably because of the dynamic partition inserts.)
I would like to know the optimizations possible w.r.t. managing partitions - say, maintaining partitions consistently from the input to the output stage so that no shuffle is involved. I want to leverage the fact that the input and output storage tables are partitioned by the same columns.
I am also thinking about this - use static partition writes by determining the partitions and writing to them in parallel - would this help speed things up or avoid the shuffle?
Appreciate any help that would lead me in the right direction.
A couple of options I tried that improved the performance (both runtime and avoiding small files) are below.
I tried using repartition (instead of coalesce) on the DataFrame before doing the broadcast joins, which minimized the shuffle and hence the shuffle spill.
-- repartition(count, *PartitionColumnList, AnyOtherSaltingColumn) (add a salting column if the repartitioning is not even)
Make sure that the base tables are properly compacted. This might even eliminate the need for #1 in some cases, and it reduces the number of tasks, which lowers the overhead from task scheduling.
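A rough PySpark sketch of option #1 under assumptions: the table names, the dimension key columns, the salt range, and the partition count of 200 are all placeholders for illustration, not values from the question:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast, rand, floor

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

fact = spark.table("source_db.fact_table")   # hypothetical Avro fact table
dim1 = spark.table("source_db.dim_one")      # hypothetical dimension tables
dim2 = spark.table("source_db.dim_two")

# A salting column spreads rows more evenly when a few (Category, YearMonthId)
# partitions dominate the data.
salted = fact.withColumn("salt", floor(rand() * 8))

# Repartition on the output partition columns (plus salt) before the broadcast
# joins, as in #1 above, so the dynamic-partition write is less likely to need
# another shuffle and does not produce a pile of small files.
repartitioned = salted.repartition(200, "Category", "YearMonthId", "salt")

denorm = (
    repartitioned
    .join(broadcast(dim1), on="dim1_key", how="left")   # assumed key names
    .join(broadcast(dim2), on="dim2_key", how="left")
)

(denorm.drop("salt")
       .write.mode("overwrite")
       .partitionBy("Category", "YearMonthId")
       .format("parquet")
       .saveAsTable("target_db.denorm_table"))
```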

How to create partitions for two DataFrames so that corresponding pairs of partitions are located on the same instance/machine in Spark?

We have two DataFrames: df_A, df_B
Let's say both have a huge number of rows, and we need to partition them.
How can we partition them in matching pairs?
For example, partition number is 5:
df_A partitions: partA_1, partA_2, partA_3, partA_4, partA_5
df_B partitions: partB_1, partB_2, partB_3, partB_4, partB_5
If we have 5 machines:
machine_1: partA_1 and partB_1
machine_2: partA_2 and partB_2
machine_3: partA_3 and partB_3
machine_4: partA_4 and partB_4
machine_5: partA_5 and partB_5
If we have 3 machines:
machine_1: partA_1 and partB_1
machine_2: partA_2 and partB_2
machine_3: partA_3 and partB_3
...(when machines free up)...
machine_1: partA_4 and partB_4
machine_2: partA_5 and partB_5
Note: if one of the DataFrames is small enough, we can use the broadcast technique.
What should we do (how should we partition) when both (or more than two) DataFrames are large?
I think we need to take a step back here and look only at the large-size case, not at broadcasting.
Spark is a framework that manages things for your app in terms of co-location of DataFrame partitions, taking into account the resources allocated vs. the resources available and the type of Action, and thus whether Workers need to acquire partitions for processing.
Repartitions are Transformations. When an Action, such as a write:
peopleDF.select("name", "age").write.format("parquet").save("namesAndAges.parquet")
occurs, then things kick in.
If you have a JOIN, then Spark will work out whether re-partitioning and data movement are required.
That is to say, if you join on c1 for both DFs, then re-partitioning will most likely occur on the c1 column, so that rows with the same c1 value from both DFs are shuffled to the same Nodes, where a free Executor is waiting to serve that JOIN of 2 or more partitions.
That only happens when an Action is invoked. This way, if you do unnecessary Transformations, Catalyst can obviate them.
Also, for the number of partitions used, this is a good link imho: spark.sql.shuffle.partitions of 200 default partitions conundrum
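As a rough illustration of co-partitioning under these caveats, here is a PySpark sketch; the input paths, the join column c1, and the partition count are assumptions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df_A = spark.read.parquet("path/to/df_A")   # hypothetical inputs
df_B = spark.read.parquet("path/to/df_B")

# Hash-partition both DataFrames on the join column with the same partition count.
# Rows with the same c1 value land in partitions with the same id, which the
# scheduler can then bring together on the same executor for the join.
n = 5
a_parts = df_A.repartition(n, "c1")
b_parts = df_B.repartition(n, "c1")

# Spark may still insert an Exchange here if the existing partitioning does not
# match what the join requires, so check joined.explain() for extra shuffles.
joined = a_parts.join(b_parts, on="c1", how="inner")
```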

How does merge-sort join work in Spark and why can it throw OOM?

I want to understand the concept of merge-sort join in Spark in depth.
I understand the overall idea: it is the same approach as in the merge sort algorithm: take 2 sorted datasets, compare the first rows, write out the smaller one, repeat.
I also understand how I could implement a distributed merge sort.
But I cannot work out how it is implemented in Spark with respect to the concepts of partitions and executors.
Here is my take.
Say I need to join 2 tables, A and B. The tables are read from Hive via Spark SQL, if that matters.
By default Spark uses 200 partitions.
Spark will then calculate the join key range (from minKey(A,B) to maxKey(A,B)) and split it into 200 parts. Both datasets are split by key range into 200 parts: A-partitions and B-partitions.
Each A-partition and each B-partition that relate to the same key range are sent to the same executor and are sorted there separately from each other.
Now 200 executors can join 200 A-partitions with 200 B-partitions, with the guarantee that they share the same key range.
The join happens via the merge-sort algorithm: take the smallest key from the A-partition, compare it with the smallest key from the B-partition, write out a match, or iterate.
Finally, I have 200 partitions of my data, which are joined.
Does it make sense?
Issues:
Skewed keys. If some key range comprises 50% of the dataset's keys, some executor will suffer, because too many rows go to the same partition.
It can even fail with OOM while trying to sort a too-large A-partition or B-partition in memory (I cannot see why Spark cannot sort with a disk spill, as Hadoop does). Or maybe it fails because it tries to read both partitions into memory for the join?
So, this was my guess. Could you please correct me and help me understand the way Spark works?
This is a common problem with joins on MPP databases, and Spark is no different. As you say, to perform a join, all the data for the same join key value must be co-located, so if you have a skewed distribution on the join key, you have a skewed distribution of data and one node gets overloaded.
If one side of the join is small, you could use a map-side join. The Spark query planner really ought to do this for you, but it is tunable - I am not sure how current this is, but it looks useful.
Did you run ANALYZE TABLE on both tables?
If you have an additional key on both sides that won't break the join semantics, you could include it in the join condition.
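For example, a hedged sketch of both suggestions in PySpark; the table names, the key column, and extra_col are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Collect table-level statistics so the planner can pick a broadcast (map-side)
# join on its own when one side is under spark.sql.autoBroadcastJoinThreshold.
spark.sql("ANALYZE TABLE db.table_a COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE db.table_b COMPUTE STATISTICS")

# Or force the map-side join explicitly when you know table_b is small enough.
result = spark.sql("""
    SELECT /*+ BROADCAST(b) */ a.*, b.extra_col
    FROM db.table_a a
    LEFT JOIN db.table_b b ON a.key = b.key
""")
```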
why Spark cannot sort with disk spill, as Hadoop does?
Spark's merge-sort join does spill to disk. Looking at Spark's SortMergeJoinExec class, it uses ExternalAppendOnlyUnsafeRowArray, which is described as:
An append-only array for UnsafeRows that strictly keeps content in an in-memory array until numRowsInMemoryBufferThreshold is reached post which it will switch to a mode which would flush to disk after numRowsSpillThreshold is met (or before if there is excessive memory consumption)
This is consistent with the experience of seeing tasks spill to disk during a join operation in the Web UI.
why [merge-sort join] can throw OOM?
From the Spark Memory Management overview:
Spark’s shuffle operations (sortByKey, groupByKey, reduceByKey, join, etc) build a hash table within each task to perform the grouping, which can often be large. The simplest fix here is to increase the level of parallelism, so that each task’s input set is smaller.
i.e. in the case of a join, increase spark.sql.shuffle.partitions to reduce the size of the partitions and of the resulting hash table, and correspondingly reduce the risk of OOM.
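A minimal sketch of that tuning in PySpark, with an illustrative partition count and hypothetical table and column names:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# More shuffle partitions means each join task sorts and buffers a smaller
# slice of A and B, which shrinks the per-task memory footprint and the OOM risk.
spark.conf.set("spark.sql.shuffle.partitions", "800")   # value is only illustrative

joined = (
    spark.table("db.table_a")                 # hypothetical Hive tables A and B
         .join(spark.table("db.table_b"), on="key", how="inner")
)
```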

Spark SQL: why doesn't Spark broadcast all the time?

I work on a project with Spark 2.4 on AWS S3 and EMR, and I have a left join between two huge pieces of data. The Spark execution is not stable; it fails frequently with memory issues.
The cluster has 10 machines of type m3.2xlarge; each machine has 16 vCores, 30 GiB of memory, and 160 GB of SSD storage.
I have a configuration like this:
"--executor-memory",
"6512M",
"--driver-memory",
"12g",
"--conf",
"spark.driver.maxResultSize=4g",
"--conf",
"spark.sql.autoBroadcastJoinThreshold=1073741824",
The left join happens between a left side of 150 GB and a right side of around 30 GB, so there is a lot of shuffling. My solution would be to cut the right side into pieces small enough, like 1 GB, so that instead of shuffling, the data is broadcast. The only problem is that after the first left join, the left side will already have the new columns from the right side, so the following left joins will have duplicated columns, like col1_right_1, col2_right_1, col1_right_2, col2_right_2, and I have to rename col1_right_1/col1_right_2 to col1_left and col2_right_1/col2_right_2 to col2_left.
So I wonder: why does Spark allow the shuffle to happen instead of using broadcast everywhere? Shouldn't broadcast always be faster than shuffle? Why doesn't Spark do the join the way I described, cutting one side into small pieces and broadcasting them?
Let's look at the two options.
If I understood correctly, you are performing a broadcast and a join for each piece of the DataFrame, where the size of each piece is the max broadcast threshold.
The advantage here is that you are basically sending only one DataFrame over the network, but you are performing multiple joins, and each join has an overhead. From:
Once the broadcasted Dataset is available on an executor machine, it is joined with each partition of the other Dataset. That is, for the values of the join columns for each row (in each partition) of the other Dataset, the corresponding row is fetched from the broadcasted Dataset and the join is performed.
This means that for each batch of the broadcast join, every partition of the other dataset has to be scanned in full and the join performed again.
A sort-merge or shuffled hash join has to perform a shuffle (if the datasets are not already partitioned the same way), but the join itself is far more efficient.
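A small PySpark sketch of the trade-off, assuming hypothetical S3 paths and a join column named key; comparing the two physical plans with explain() makes the difference visible:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

left = spark.read.parquet("s3://bucket/left_150gb")    # hypothetical paths
right = spark.read.parquet("s3://bucket/right_30gb")

# Option 1: let Spark shuffle both sides once and run a sort-merge join.
smj = left.join(right, on="key", how="left")
smj.explain()          # physical plan should show SortMergeJoin + Exchange

# Option 2: force the oversized right side to be broadcast. Every executor
# then has to hold the full right side in memory, which is exactly the
# situation where a forced broadcast becomes slower or fails with OOM.
bhj = left.join(broadcast(right), on="key", how="left")
bhj.explain()          # physical plan should show BroadcastHashJoin
```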

Resources