I want to understand the concept of merge-sort join in Spark in depth.
I understand the overall idea: it is the same approach as in the merge sort algorithm: take 2 sorted datasets, compare the first rows, write the smallest one, repeat.
I also understand how I can implement distributed merge sort.
But I cannot get how it is implemented in Spark with respect to concepts of partitions and executors.
Here is my take.
Given I need to join 2 tables A and B. Tables are read from Hive via Spark SQL, if this matters.
By default Spark uses 200 partitions.
Spark will then calculate the join key range (from minKey(A,B) to maxKey(A,B)) and split it into 200 parts.
Both datasets are split by key ranges into 200 parts: A-partitions and B-partitions.
Each A-partition and each B-partition that relate to the same key range are sent to the same executor and are sorted there separately from each other.
Now 200 executors can join 200 A-partitions with 200 B-partitions, with the guarantee that they share the same key range.
The join happens via the merge-sort algorithm: take the smallest key from the A-partition, compare it with the smallest key from the B-partition, write a match, or iterate.
Finally, I have 200 partitions of my data which are joined.
Does it make sense?
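To make the merge step concrete, here is a minimal plain-Python sketch (not Spark code) of the flow described above. One assumption differs from the guess in the question: as far as I know, the shuffle in front of a sort-merge join hash-partitions both sides on the join key rather than splitting a min/max key range, and the 200 partitions are processed as tasks on whatever executors are available, not one executor per partition. The function names and the crc32-based partitioner are made up for illustration.

```python
from zlib import crc32


def partition_by_key(rows, num_partitions):
    """Assign each (key, value) row to a partition by hashing its key."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in rows:
        parts[crc32(str(key).encode()) % num_partitions].append((key, value))
    return parts


def merge_join(left, right):
    """Inner-join two lists of (key, value) rows that are already sorted by key."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on the right, then pair it with every matching left row.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            while i < len(left) and left[i][0] == lk:
                out.extend((lk, left[i][1], r[1]) for r in right[j:j_end])
                i += 1
            j = j_end
    return out


def sort_merge_join(a, b, num_partitions=4):
    """Co-partition both sides, sort each partition by key, then merge partition by partition."""
    joined = []
    for pa, pb in zip(partition_by_key(a, num_partitions), partition_by_key(b, num_partitions)):
        joined.extend(merge_join(sorted(pa, key=lambda r: r[0]), sorted(pb, key=lambda r: r[0])))
    return joined


a = [(1, "a1"), (2, "a2"), (2, "a2b"), (5, "a5")]
b = [(2, "b2"), (3, "b3"), (5, "b5"), (5, "b5b")]
result = sorted(sort_merge_join(a, b))
```

Because the same partitioner is applied to both sides, rows with equal keys always land in the same partition index, which is what lets each pair of partitions be joined independently.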
Skewed keys. If some key range comprises 50% of dataset keys, some executor would suffer, because too many rows would go to the same partition.
It can even fail with OOM while trying to sort an A-partition or B-partition that is too big for memory (I cannot get why Spark cannot sort with disk spill, as Hadoop does?..) Or maybe it fails because it tries to read both partitions into memory for joining?
So, this was my guess. Could you please correct me and help to understand the way Spark works?

This is a common problem with joins on MPP databases and Spark is no different. As you say, to perform a join, all the data for the same join key value must be colocated so if you have a skewed distribution on the join key, you have a skewed distribution of data and one node gets overloaded.
If one side of the join is small you could use a map side join. The Spark query planner really ought to do this for you but it is tunable - not sure how current this is but it looks useful.
Did you run ANALYZE TABLE on both tables?
If you have a key on both sides that won't break the join semantics you could include that in the join.
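A rough plain-Python sketch of why an extra join column can help with skew. Everything here is hypothetical: the customer/day columns, the data, and the toy partitioner (a simple modulo, not Spark's hash function). It only shows how the rows of one hot key spread out once a second column is part of the partitioning key; it only applies when that column exists on both sides and adding it does not change which rows should match.

```python
from collections import Counter

NUM_PARTITIONS = 8


def partition_sizes(rows, partition_fn):
    """Count how many rows land in each partition for a given partitioning function."""
    sizes = Counter(partition_fn(row) % NUM_PARTITIONS for row in rows)
    return [sizes[p] for p in range(NUM_PARTITIONS)]


# Hypothetical rows of (customer_id, day): customer 7 owns 90% of the data.
rows = [(7, d % 30) for d in range(900)] + [(100 + i, i % 30) for i in range(100)]

# Partition on the skewed key alone: every row of customer 7 lands in one partition.
by_customer = partition_sizes(rows, lambda r: r[0])

# Partition on (customer_id, day): the hot customer is spread over many partitions.
by_customer_and_day = partition_sizes(rows, lambda r: r[0] * 31 + r[1])
```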

why Spark cannot sort with disk spill, as Hadoop does?
Spark merge-sort join does spill to disk. Taking a look at Spark SortMergeJoinExec class, it uses ExternalAppendOnlyUnsafeRowArray which is described as:
An append-only array for UnsafeRows that strictly keeps content in an in-memory array until numRowsInMemoryBufferThreshold is reached post which it will switch to a mode which would flush to disk after numRowsSpillThreshold is met (or before if there is excessive memory consumption)
This is consistent with the experience of seeing tasks spilling to disk during a join operation from the Web UI.
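The idea in that description can be sketched in plain Python. This is only an illustration of the "keep in memory until a threshold, then flush to disk" behaviour, not how ExternalAppendOnlyUnsafeRowArray is actually written: the class name SpillableBuffer is made up, and it uses a single row-count threshold where the real class is described as having two.

```python
import pickle
import tempfile


class SpillableBuffer:
    """Append-only buffer that keeps rows in memory up to a threshold, then spills them to a temp file."""

    def __init__(self, spill_threshold):
        self.spill_threshold = spill_threshold
        self.in_memory = []
        self.spill_file = None
        self.spilled_rows = 0

    def append(self, row):
        self.in_memory.append(row)
        if len(self.in_memory) >= self.spill_threshold:
            self._spill()

    def _spill(self):
        if self.spill_file is None:
            self.spill_file = tempfile.TemporaryFile()
        self.spill_file.seek(0, 2)  # always write at the end of the file
        for row in self.in_memory:
            pickle.dump(row, self.spill_file)
        self.spilled_rows += len(self.in_memory)
        self.in_memory = []

    def __iter__(self):
        # Replay spilled rows first (they were appended first), then the in-memory tail.
        if self.spill_file is not None:
            self.spill_file.seek(0)
            for _ in range(self.spilled_rows):
                yield pickle.load(self.spill_file)
        yield from self.in_memory


buf = SpillableBuffer(spill_threshold=3)
for i in range(8):
    buf.append(i)
rows_back = list(buf)
```

With a threshold of 3 and 8 appended rows, six rows end up on disk and two stay in memory, yet iteration still returns them in insertion order.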
why [merge-sort join] can throw OOM?
From the Spark Memory Management overview:
Spark’s shuffle operations (sortByKey, groupByKey, reduceByKey, join, etc) build a hash table within each task to perform the grouping, which can often be large. The simplest fix here is to increase the level of parallelism, so that each task’s input set is smaller.
i.e. in the case of join, increase spark.sql.shuffle.partitions to reduce the size of the partitions and the resulting hash table and correspondingly reduce the risk of OOM.


Suggestion for multiple joins in spark

Recently I got a requirement to perform combination joins.
I have to perform around 30 to 36 joins in Spark.
Building the execution plan was taking a long time, so I cached the execution plan at intermediate stages using df.localCheckpoint().
Is this a good approach? Any thoughts, please share.
Yes, it is fine.
This is mostly discussed for iterative ML algorithms, but can be equally applied for a Spark App with many steps - e.g. joins.
Quoting from
Spark programs take a huge performance hit when fault tolerance occurs
as the entire set of transformations to a DataFrame or RDD have to be
recomputed when fault tolerance occurs or for each additional
transformation that is applied on top of an RDD or DataFrame.
localCheckpoint() is not "reliable".
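A toy model of what cutting the lineage buys you, in plain Python. LazyPlan and its methods are invented for this illustration and are not Spark classes; the point is only that a checkpoint materialises the data once and hands back a plan with no ancestors. As far as I understand it, "not reliable" refers to localCheckpoint() keeping that data on executor-local storage, so it cannot be recomputed from the truncated lineage if an executor is lost.

```python
class LazyPlan:
    """A chain of deferred transformations; depth counts the steps in the plan."""

    def __init__(self, compute, depth=1):
        self._compute = compute
        self.depth = depth

    def map(self, fn):
        parent = self
        # The new plan re-runs the whole parent chain every time it is collected.
        return LazyPlan(lambda: [fn(x) for x in parent.collect()], self.depth + 1)

    def collect(self):
        return self._compute()

    def local_checkpoint(self):
        data = self.collect()  # materialise once
        return LazyPlan(lambda: data, depth=1)  # the new plan has no ancestors


plan = LazyPlan(lambda: [1, 2, 3])
for _ in range(30):
    plan = plan.map(lambda x: x + 1)

checkpointed = plan.local_checkpoint()
```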
Caching is definitely a strategy to optimize your performance. In general, given that your data size and the resources of your Spark application remain unchanged, there are three points to consider when you want to optimize your join operation:
Data skewness: Most of the time, when I try to find out why a join takes a lot of time, data skewness is one of the reasons. In fact, not only joins: any transformation needs an even data distribution, so that you don't end up with one skewed partition that holds most of the data while everything waits for the single task processing it. Make sure your data is well distributed.
Data broadcasting: When we do a join, data shuffling is usually inevitable. In some cases, we use a relatively small dataframe as a reference to filter the data in a very big dataframe. In this case, shuffling the big dataframe is a very expensive operation. Instead, we can broadcast the small dataframe to every single node and avoid the costly shuffle.
Keep your joining data as lean as possible: as mentioned in point 2, data shuffling is usually inevitable when you do a join. Therefore, keep your dataframe as lean as possible, which means removing unnecessary rows / columns to reduce the amount of data that has to be moved across the network during the shuffle.
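The broadcasting point can be pictured with a small plain-Python sketch (not Spark code; the function name is made up). The small side is turned into a lookup table that every partition of the large side gets a copy of, so the large side is joined in place and never has to be repartitioned:

```python
def broadcast_hash_join(large_partitions, small_rows):
    """Inner-join each partition of the large side against a lookup built from the small side."""
    lookup = {}
    for key, value in small_rows:
        lookup.setdefault(key, []).append(value)
    # Each partition is processed independently; rows of the large side stay where they are.
    return [
        [(key, lv, sv) for key, lv in part for sv in lookup.get(key, [])]
        for part in large_partitions
    ]


large = [[(1, "x"), (2, "y")], [(2, "z"), (3, "w")]]
small = [(2, "S2"), (3, "S3")]
joined = broadcast_hash_join(large, small)
```

Note that the output keeps the partitioning of the large side, which is also why a skewed join key does not hurt here: no partition is built per key.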

Spark SQL: why does not Spark do broadcast all the time

I work on a project with Spark 2.4 on AWS S3 and EMR and I have a left join between two huge datasets. The Spark execution is not stable; it frequently fails with memory issues.
The cluster has 10 machines of type m3.2xlarge; each machine has 16 vCores, 30 GiB of memory and 160 GB of SSD storage.
I have configuration like this:
The left join happens between a left side of 150 GB and a right side of around 30 GB, so there is a lot of shuffling. My solution would be to cut the right side into pieces that are small enough, like 1 GB, so that instead of being shuffled, the data is broadcast. The only problem is that after the first left join, the left side will already have the new columns from the right side, so the following left joins will produce duplicated columns, like col1_right_1, col2_right_1, col1_right_2, col2_right_2, and I have to rename col1_right_1/col1_right_2 to col1_left, col2_right_1/col2_right_2 to col2_left.
So I wonder, why does Spark allow a shuffle to happen instead of using broadcast everywhere? Shouldn't broadcast always be faster than shuffle? Why doesn't Spark do the join the way I described: cut one side into small pieces and broadcast them?
Let’s see the two options.
If I understood correctly, you are performing a broadcast and a join for each piece of the dataframe, where the size of the piece is the maximum broadcast threshold.
Here the advantage is that you are basically sending just one dataframe over the network, but you are performing multiple joins. Each join to be performed has an overhead. From:
Once the broadcasted Dataset is available on an executor machine, it
is joined with each partition of the other Dataset. That is, for the
values of the join columns for each row (in each partition) of the
other Dataset, the corresponding row is fetched from the broadcasted
Dataset and the join is performed.
This means that for each batch of the broadcast join, in each partition you would have to look through the whole other dataset and perform the join.
Sort-merge or shuffled hash joins have to perform a shuffle (unless both datasets are already partitioned the same way), but the join itself is far more efficient.
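A plain-Python sketch of the "cut one side into pieces and broadcast each piece" approach from the question, just to make the cost visible. The function name and data are hypothetical and this is not what Spark does internally. Each piece needs a full pass over the left side, and the per-piece results have to be merged back into one row set afterwards (the column renaming described in the question).

```python
def chunked_broadcast_left_join(left_rows, right_rows, chunk_size):
    """Left join done as repeated broadcast joins over slices of the right side.

    Returns (joined rows, number of left rows scanned in total).
    """
    matches = {i: [] for i in range(len(left_rows))}
    scanned = 0
    for start in range(0, len(right_rows), chunk_size):
        # "Broadcast" one slice of the right side as a lookup table.
        lookup = {}
        for key, value in right_rows[start:start + chunk_size]:
            lookup.setdefault(key, []).append(value)
        # Every slice requires another full pass over the left side.
        for i, (key, _) in enumerate(left_rows):
            scanned += 1
            matches[i].extend(lookup.get(key, []))
    # Merge step: left rows without any match keep a None on the right.
    joined = []
    for i, (key, lv) in enumerate(left_rows):
        joined.extend((key, lv, rv) for rv in (matches[i] or [None]))
    return joined, scanned


left = [(1, "a"), (2, "b"), (3, "c")]
right = [(1, "r1"), (2, "r2"), (2, "r2b"), (9, "r9")]
joined, scanned = chunked_broadcast_left_join(left, right, chunk_size=2)
```

With the sizes from the question (30 GB right side cut into 1 GB pieces), this pattern would mean roughly 30 passes over the 150 GB left side.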

Does spark check for empty Datasets before joining?

I have a Spark job that inner joins a large Hive table (5bn rows, 400MB x 1000 partitions, compressed parquet) with a much smaller table which is likely to contain less than a few hundred rows and on some/most weeks may be empty.
The data in the large table is not partitioned/bucketed by the join key and in any case the join key is very heavily skewed such that attempting a non-broadcast join causes some executors to exceed memory limits.
Luckily the smaller table size will always be way below the broadcast threshold so by using broadcast(rhs) I can avoid shuffling the large Dataset by the skewed key.
Now when the RHS is empty Spark still seems to do a fair amount of work when it seems fairly obvious the result will be an empty Dataset.
I can only assume Spark does not check for empty Datasets before (inner) joining because the check may be expensive but would appreciate a definitive answer.
In my case I know the RHS will be small so invoking rhs.rdd.count will be cheap and I can skip the join if unnecessary.
I have had to omit business sensitive code but the basic algorithm is:
// Note small and large tables are cached for later re-use
// Complex DAG
// write to hive
// read from hive
.join(broadcast("r")), $"l.key" === $"r.key", "inner")
Thanks for any insight.
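The workaround mentioned above (check the small side first, skip the join if it is empty) looks like this as a plain-Python sketch. It is not Spark code: the function names are invented, and the generator with a read log only stands in for "work done on the large table".

```python
def inner_join_skipping_empty(large_rows, small_rows):
    """Inner join that returns immediately when the small side has no rows."""
    small = list(small_rows)  # assumed small enough to hold in memory
    if not small:
        return []  # nothing can match, so the large side is never read
    lookup = {}
    for key, value in small:
        lookup.setdefault(key, []).append(value)
    return [(key, lv, sv) for key, lv in large_rows for sv in lookup.get(key, [])]


def large_table(read_log):
    """Hypothetical large table; records every row that is actually read."""
    for i in range(5):
        read_log.append(i)
        yield (i % 2, f"row{i}")


reads_when_empty = []
empty_result = inner_join_skipping_empty(large_table(reads_when_empty), [])

reads_when_present = []
result = inner_join_skipping_empty(large_table(reads_when_present), [(1, "s")])
```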

Spark containers killed by YARN during group by

I have a data set extracted from HBase, which is a long form of a wide table, i.e. it has rowKey, columnQualifier and value columns. To get a form of pivot, I need to group by rowKey, which is a string UUID, into a collection and make an object out of the collection. The problem is that the only group-by I manage to perform is counting the number of elements in the groups; other group-bys fail because the container is killed for exceeding the YARN container memory limits. I experimented a lot with the memory sizes, including overhead, partitioning with and without sorting, etc. I even went to a high number of partitions, i.e. about 10 000, but the job dies all the same. I tried both DataFrame groupBy and collect_list, as well as Dataset groupByKey and mapGroups.
The code works on a small data set but not on the larger one. The data set is about 500 GB in Parquet files. The data is not skewed, as the largest group in the group-by has only 50 elements. Thus, as far as I can tell, the partitions should easily fit in memory, as the aggregated data per rowKey is not really large. The keys and values are mostly strings and they are not long.
I am using Spark 2.0.2; the above computations were all done in Scala.
You're probably running into the dreaded groupByKey shuffle. Please read this Databricks article on avoiding groupByKey, which details the underlying differences between groupByKey and reduceByKey.
If you don't want to read the article, the short story is this: Though groupByKey and reduceByKey produce the same results, groupByKey instantiates a shuffle of ALL data, while reduceByKey tries to minimize data shuffle by reducing first. A bit like MapReduce Combiners, if you're familiar with that concept.
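A small plain-Python sketch of the combiner idea (not Spark code, function names made up): reducing inside each partition first means at most one record per key leaves a partition, while the groupByKey-style path ships every record. This only pays off when the aggregation actually shrinks the data, such as a sum or count; collecting all values of a key into a collection does not get smaller by pre-combining.

```python
def shuffle_without_combine(partitions):
    """groupByKey-style: every (key, value) record crosses the shuffle."""
    return [record for part in partitions for record in part]


def shuffle_with_combine(partitions, reduce_fn):
    """reduceByKey-style: each partition pre-reduces per key before the shuffle."""
    shuffled = []
    for part in partitions:
        local = {}
        for key, value in part:
            local[key] = reduce_fn(local[key], value) if key in local else value
        shuffled.extend(local.items())
    return shuffled


def reduce_by_key(records, reduce_fn):
    """Final reduce on the receiving side of the shuffle."""
    out = {}
    for key, value in records:
        out[key] = reduce_fn(out[key], value) if key in out else value
    return out


def add(x, y):
    return x + y


partitions = [[("a", 1), ("a", 1), ("b", 1)], [("a", 1), ("b", 1), ("b", 1)]]
plain = shuffle_without_combine(partitions)
combined = shuffle_with_combine(partitions, add)
```

Both paths give the same totals after the final reduce, but the combined path moved four records instead of six.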

Does spark keep all elements of an RDD[K,V] for a particular key in a single partition after "groupByKey" even if the data for a key is very huge?

Consider I have a PairedRDD of, say, 10 partitions. But the keys are not evenly distributed, i.e. 9 of the partitions hold data belonging to a single key, say a, and the rest of the keys, say b and c, are in the last partition only. This is represented by the below figure:
Now if I do a groupByKey on this RDD, from my understanding all data for the same key will eventually go to the same partition, i.e. data for the same key will not be spread across multiple partitions. Please correct me if I am wrong.
If that is the case, then there is a chance that the partition for key a will be of a size that does not fit in a worker's RAM. In that case, what will Spark do? My assumption is that it will spill the data to the worker's disk.
Is that correct?
Or how does Spark handle such situations?
Does spark keep all elements (...) for a particular key in a single partition after groupByKey
Yes, it does. This is the whole point of the shuffle.
the partition for key a can be of size that may not fit in a worker's RAM. In that case what spark will do
Size of a particular partition is not the biggest issue here. Partitions are represented using lazy Iterators and can easily store data which exceeds amount of available memory. The main problem is non-lazy local data structure generated in the process of grouping.
All values for the particular key are stored in memory as a CompactBuffer so a single large group can result in OOM. Even if each record separately fits in memory you may still encounter serious GC issues.
In general:
It is safe, although not optimal performance wise, to repartition data where amount of data assigned to a partition exceeds amount of available memory.
It is not safe to use PairRDDFunctions.groupByKey in the same situation.
Note: You shouldn't extrapolate this to different implementations of groupByKey though. In particular both Spark Dataset and PySpark RDD.groupByKey use more sophisticated mechanisms.
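The difference between a lazily iterated partition and a materialised group can be illustrated in plain Python. This is not how Spark implements either one, and the names are made up. The eager version has to hold every value of the big key at once, much like the CompactBuffer case above, while the lazy version consumes a key-sorted stream one record at a time.

```python
from itertools import groupby
from operator import itemgetter


def records():
    """Lazy stream of (key, value) pairs sorted by key; key 'a' is the huge one."""
    for i in range(1000):
        yield ("a", i)
    yield ("b", 1)
    yield ("c", 2)


def group_eagerly(stream):
    """Builds a full list of values per key, so the largest group must fit in memory."""
    groups = {}
    for key, value in stream:
        groups.setdefault(key, []).append(value)
    return groups


def aggregate_lazily(stream):
    """Walks each group as an iterator and keeps only a running sum per key."""
    return {key: sum(v for _, v in group) for key, group in groupby(stream, key=itemgetter(0))}


largest_buffer = max(len(values) for values in group_eagerly(records()).values())
sums = aggregate_lazily(records())
```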
