join DataFrames within partitions in PySpark - apache-spark

I have two dataframes with a large (millions to tens of millions) number of rows. I'd like to do a join between them.
In the BI system I'm currently using, you make this fast by first partitioning on a particular key, then doing the join on that key.
Is this a pattern that I need to be following in Spark, or does that not matter? It seems at first glance like a lot of time is wasted shuffling data between partitions, because it hasn't been pre-partitioned correctly.
If it is necessary, then how do I do that?

If it is necessary, then how do I do that?
How to define partitioning of DataFrame?
However, it makes sense only under two conditions:
There are multiple joins within the same application. Partitioning is itself a shuffle, so if there is only a single join there is no added value.
It is a long-lived application where the shuffled data will be reused. Spark cannot take advantage of the partitioning of data stored in an external format.
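For concreteness, here is a minimal PySpark sketch of that pattern, assuming the two conditions above hold; the table names, the join key customer_id, and the partition count of 200 are all illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical source tables and join key.
df_a = spark.table("orders")
df_b = spark.table("customers")

# Shuffle both sides onto the same key layout once, and keep the result
# around so later joins on the same key can reuse it.
df_a = df_a.repartition(200, "customer_id").cache()
df_b = df_b.repartition(200, "customer_id").cache()

first = df_a.join(df_b, "customer_id")
# Further joins or aggregations on customer_id in the same application
# benefit from the cached, pre-partitioned data.
```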

Related

Suggestion for multiple joins in spark

Recently I got a requirement to perform combination joins.
I have to perform around 30 to 36 joins in Spark.
Building the execution plan was consuming more and more time, so I truncated the plan at intermediate stages using df.localCheckpoint().
Is this a good way to do it? Any thoughts, please share.
Yes, it is fine.
This is mostly discussed for iterative ML algorithms, but can be equally applied for a Spark App with many steps - e.g. joins.
Quoting from https://medium.com/@adrianchang/apache-spark-checkpointing-ebd2ec065371:
Spark programs take a huge performance hit when fault tolerance occurs
as the entire set of transformations to a DataFrame or RDD have to be
recomputed when fault tolerance occurs or for each additional
transformation that is applied on top of an RDD or DataFrame.
Note that localCheckpoint() is not "reliable": it writes to executor-local storage rather than a reliable file system, so the checkpointed data is lost if an executor fails.
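For illustration, a rough sketch of checkpointing every few joins; the input tables, the key id, and the interval of 5 are placeholders, not part of the original question:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical inputs: ~36 DataFrames sharing a join key "id".
dataframes = [spark.table(f"step_{i}") for i in range(36)]

result = dataframes[0]
for i, df in enumerate(dataframes[1:], start=1):
    result = result.join(df, "id")
    if i % 5 == 0:
        # Materialize the data and truncate the lineage so the planner does
        # not re-analyze an ever-growing plan. The data goes to executor-local
        # storage, so this is fast but not fault tolerant (not "reliable").
        result = result.localCheckpoint()
```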
Caching is definitely a strategy to optimize your performance. In general, given that your data size and the resources of your Spark application remain unchanged, there are three points to consider when you want to optimize a join operation:
Data skewness: most of the time, when I try to find out why a join is taking so long, data skew turns out to be one of the reasons. In fact, not only joins: any transformation needs an even data distribution, so that you don't end up with one skewed partition holding most of the data while everything waits on that single task. Make sure your data is well distributed.
Data broadcasting: when we do a join, data shuffling is inevitable. In some cases we use a relatively small dataframe as a reference to filter the data in a very big dataframe, and shuffling the big dataframe is very expensive. Instead, we can broadcast the small dataframe to every node and avoid the costly shuffle, as sketched below.
Keep your joining data as lean as possible: as mentioned in point 2, data shuffling is inevitable when you join. So keep your dataframes as lean as possible: drop unnecessary rows and columns to reduce the amount of data that has to move across the network during the shuffle.
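A short PySpark sketch of points 2 and 3 together; the table and column names are made up:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

# Keep only the columns the join and downstream logic need (point 3),
# then broadcast the small reference table (point 2).
small_ref = spark.table("country_codes").select("country_code", "country_name")
big_facts = spark.table("events").select("event_id", "country_code", "amount")

# The small side is shipped to every executor; the big side is not shuffled.
joined = big_facts.join(broadcast(small_ref), "country_code")
```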

How does merge-sort join work in Spark and why can it throw OOM?

I want to understand the concept of merge-sort join in Spark in depth.
I understand the overall idea: this is the same approach as in merge sort algorithm: Take 2 sorted datasets, compare first rows, write smallest one, repeat.
I also understand how I can implement distributed merge sort.
But I cannot get how it is implemented in Spark with respect to concepts of partitions and executors.
Here is my take.
Say I need to join two tables, A and B. The tables are read from Hive via Spark SQL, if that matters.
By default Spark uses 200 partitions.
Spark will then calculate the join key range (from minKey(A,B) to maxKey(A,B)) and split it into 200 parts. Both datasets are split by key range into 200 parts: A-partitions and B-partitions.
Each A-partition and B-partition relating to the same key range are sent to the same executor and are sorted there separately from each other.
Now 200 executors can join 200 A-partitions with 200 B-partitions with the guarantee that they share the same key range.
The join happens via the merge-sort algorithm: take the smallest key from the A-partition, compare it with the smallest key from the B-partition, write a match, or iterate.
Finally, I have 200 partitions of my data, which are joined.
Finally, I have 200 partitions of my data which are joined.
Does it make sense?
Issues:
Skewed keys. If some key range comprises 50% of the dataset's keys, some executor will suffer, because too many rows will go to the same partition.
It can even fail with OOM while trying to sort a too-large A-partition or B-partition in memory (I can't see why Spark cannot sort with a disk spill, as Hadoop does). Or maybe it fails because it tries to read both partitions into memory for the join?
So, this was my guess. Could you please correct me and help to understand the way Spark works?
This is a common problem with joins on MPP databases and Spark is no different. As you say, to perform a join, all the data for the same join key value must be colocated so if you have a skewed distribution on the join key, you have a skewed distribution of data and one node gets overloaded.
If one side of the join is small you could use a map side join. The Spark query planner really ought to do this for you but it is tunable - not sure how current this is but it looks useful.
Did you run ANALYZE TABLE on both tables?
If you have a key on both sides that won't break the join semantics you could include that in the join.
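To make those suggestions concrete, a hedged sketch with invented table and column names (dim_customer, fact_orders, customer_id, region):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Gather table-level statistics so the planner can choose a broadcast
# (map-side) join when one side is small enough.
spark.sql("ANALYZE TABLE dim_customer COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE fact_orders COMPUTE STATISTICS")

fact_orders = spark.table("fact_orders")
dim_customer = spark.table("dim_customer")

# If a second column exists on both sides and does not change the join
# semantics, adding it to the join condition spreads a hot key across more
# partitions.
joined = fact_orders.join(dim_customer, on=["customer_id", "region"])
```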
why Spark cannot sort with disk spill, as Hadoop does?
Spark merge-sort join does spill to disk. Taking a look at Spark SortMergeJoinExec class, it uses ExternalAppendOnlyUnsafeRowArray which is described as:
An append-only array for UnsafeRows that strictly keeps content in an in-memory array until numRowsInMemoryBufferThreshold is reached post which it will switch to a mode which would flush to disk after numRowsSpillThreshold is met (or before if there is excessive memory consumption)
This is consistent with the experience of seeing tasks spilling to disk during a join operation from the Web UI.
why [merge-sort join] can throw OOM?
From the Spark Memory Management overview:
Spark’s shuffle operations (sortByKey, groupByKey, reduceByKey, join, etc) build a hash table within each task to perform the grouping, which can often be large. The simplest fix here is to increase the level of parallelism, so that each task’s input set is smaller.
i.e. in the case of a join, increase spark.sql.shuffle.partitions to reduce the size of each partition and of the resulting hash table, and correspondingly reduce the risk of OOM.
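For example (a sketch only; the table names follow the question's A and B, and the value 2000 is arbitrary):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Raise shuffle parallelism so each sort-merge-join task sorts and joins a
# smaller slice of data.
spark.conf.set("spark.sql.shuffle.partitions", "2000")

df_a = spark.table("A")
df_b = spark.table("B")

joined = df_a.join(df_b, "join_key")
joined.explain()  # the shuffle before the SortMergeJoin now produces 2000 partitions
```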

Is it possible to outperform the Catalyst optimizer on highly skewed data using only RDDs

I am reading High Performance Spark and the author introduces a technique that can be used to perform joins on highly skewed data by selectively filtering the data to build a HashMap with the data containing the most common keys. This HashMap is then sent to all the partitions to perform a broadcast join. The resulting data are concatenated with a union operation at the very end.
Apologies in advance, but the text does not give an example of this technique using code, so I cannot share a code snippet to illustrate it.
Text follows.
Sometimes not all of our smaller RDD will fit into memory, but some keys are so overrepresented in the large dataset that you want to broadcast just the most common keys. This is especially useful if one key is so large that it can't fit on a single partition. In this case you can use countByKeyApprox on the large RDD to get an approximate idea of which keys would most benefit from a broadcast. You then filter the smaller RDD for only these keys, collecting the result locally in a HashMap. Using sc.broadcast you can broadcast the HashMap so that each worker only has one copy and manually perform the join against the HashMap. Using the same HashMap you can then filter your large RDD down to not include the large number of duplicate keys and perform your standard join, uniting it with the result of your manual join. This approach is quite convoluted but may allow you to handle highly skewed data you couldn't otherwise process.
For those who don't know, a broadcast join is a technique where the user can avoid a shuffle incurred when joining two chunks of data by sending the smaller chunk to every single executor. Each executor then performs the join on its own. The idea is that the shuffle is so expensive that having each executor perform the join and then discard the data it doesn't need is sometimes the best way to go.
The text describes a situation where part of a chunk of data can be extracted and joined using a broadcast join. The result of the join is then unioned with the rest of the data.
The reason why this might be necessary is that excessive shuffling can usually be avoided by making sure data consisting of the same keys in the two chunks are both present in the same partition, so that the same executor handles it. However, there are situations where a single key is too large to fit on a single partition. In that case, the author suggests that separating the overrepresented key into a HashMap and performing a broadcast join on just the overrepresented key may be a good idea.
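To make the description concrete, here is a rough PySpark RDD sketch of how I read the technique; it is not the book's code, and the file paths, threshold, and sampling fraction are invented (the book uses countByKeyApprox, for which sampling plus countByKey stands in here):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Two pair RDDs of (key, value) tuples; the sources are hypothetical.
big_rdd = sc.textFile("big.tsv").map(lambda line: tuple(line.split("\t", 1)))
small_rdd = sc.textFile("small.tsv").map(lambda line: tuple(line.split("\t", 1)))

# 1) Estimate which keys are over-represented in the large RDD.
sampled_counts = big_rdd.sample(False, 0.01).countByKey()
hot_keys = {k for k, c in sampled_counts.items() if c > 10_000}

# 2) Collect the small RDD's rows for those keys into a local map and broadcast it.
hot_map = dict(small_rdd.filter(lambda kv: kv[0] in hot_keys).collect())
hot_map_bc = sc.broadcast(hot_map)

# 3) Manual map-side join for the hot keys: no shuffle of the big RDD's hot rows.
hot_joined = (big_rdd
              .filter(lambda kv: kv[0] in hot_map_bc.value)
              .map(lambda kv: (kv[0], (kv[1], hot_map_bc.value[kv[0]]))))

# 4) Ordinary shuffle join for everything else, then union the two results.
rest_joined = (big_rdd
               .filter(lambda kv: kv[0] not in hot_map_bc.value)
               .join(small_rdd.filter(lambda kv: kv[0] not in hot_keys)))

result = hot_joined.union(rest_joined)
```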
Is this a good idea? Moreover, a technique like this seems very situational, so presumably Catalyst does not use it. Is that correct? And if Catalyst really does not use this technique, does that mean that on highly skewed data this RDD-based approach can beat Catalyst operating on DataFrames or Datasets?

Spark DataFrame / Dataset groupBy optimization via bucketBy

I'm researching options for a use-case where we store the dataset as parquet files and want to run efficient groupBy queries for a specific key later on when we read the data.
I've read a bit about the optimizations for groupBy, however couldn't really find much about it (other than RDD level reduceByKey).
What I have in mind is this: if the dataset is written bucketed by the key that will also be used in the groupBy, then the groupBy could theoretically be optimized, since all the rows containing the key will be co-located (and even consecutive, if it's also stored sorted on the same key).
One idea I have is to apply the transformation via mapPartitions and then groupBy; however, that requires breaking my functions into two, which isn't really desirable. I believe that for some classes of functions (say sum/count) Spark would optimize the query in a similar fashion anyway, but that optimization would be triggered by the choice of function and would work regardless of the co-location of the rows, not because of it.
Can spark leverage the co-location of the rows to optimize the groupBy using any function subsequently?
It seems like bucketing's main use-case is for doing JOINs on the bucketed key, which allows Spark to avoid doing a shuffle across the whole table. If Spark knows that the rows are already partitioned across the buckets, I don't see why it wouldn't know to use the pre-partitioned buckets in a GROUP BY. You might need to sort by the group by key as well though.
I'm also interested in this use-case so will be trying it out and seeing if a shuffle occurs.
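If you do try it out, a minimal sketch for checking the plan might look like this; the table name, bucketing key, and bucket count are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Write the data bucketed (and sorted) on the grouping key.
# Bucketing requires saveAsTable rather than a plain parquet path.
(spark.table("events")
      .write
      .bucketBy(64, "user_id")
      .sortBy("user_id")
      .saveAsTable("events_bucketed"))

# Group by the bucketing key on read and inspect the plan for the presence
# or absence of an Exchange (shuffle) step.
grouped = spark.table("events_bucketed").groupBy("user_id").count()
grouped.explain()
```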

Spark containers killed by YARN during group by

I have a data set extracted from HBase, which is a long form of a wide table, i.e. it has rowKey, columnQualifier and value columns. To get a pivoted form, I need to group by rowKey, which is a string UUID, into a collection and build an object out of the collection. The problem is that the only group-by I manage to perform is counting the number of elements per group; other group-bys fail because containers are killed for exceeding YARN memory limits. I experimented a lot with memory sizes, including overhead, partitioning with and without sorting, etc. I even went up to a high number of partitions, about 10,000, but the job dies just the same. I tried both DataFrame groupBy with collect_list, as well as Dataset groupByKey with mapGroups.
The code works on a small data set but not on the larger one. The data set is about 500 GB in Parquet files. The data is not skewed, as the largest group in the group-by has only 50 elements. Thus, by all means known to me, the partitions should easily fit in memory, since the aggregated data per rowKey is not really large. The keys and values are mostly strings, and they are not long.
I am using Spark 2.0.2; the above computations were all done in Scala.
You're probably running into the dreaded groupByKey shuffle. Please read this Databricks article on avoiding groupByKey, which details the underlying differences between the two functions.
If you don't want to read the article, the short story is this: though groupByKey and reduceByKey produce the same results, groupByKey shuffles ALL of the data, while reduceByKey minimizes the data shuffle by reducing within each partition first. A bit like MapReduce combiners, if you're familiar with that concept.
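The contrast in miniature, with made-up data:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1), ("a", 1)])

# groupByKey ships every (key, value) pair across the network before summing.
sums_via_group = pairs.groupByKey().mapValues(sum)

# reduceByKey combines values within each partition first, so much less data
# is shuffled for the same result.
sums_via_reduce = pairs.reduceByKey(lambda a, b: a + b)

print(sorted(sums_via_reduce.collect()))  # [('a', 3), ('b', 1)]
```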
