Why is shuffle write huge compared to shuffle read? - apache-spark

Let's say I have a task that joins two DataFrames or RDDs:
dfA: col1, col2
dfB: col1, col3
dfA.join(dfB, dfA("col1") === dfB("col1"))
Looking at the executor summary, my max task has 1 GB of shuffle read and 40 GB of shuffle write.
There is no imbalance issue: across all the tasks the shuffle reads are similar (900 MB - 1 GB). I used a salted-key technique to make sure the keys are spread out uniformly.
So why the 40x difference on the max task? The joined output (col1, col2, col3) is not 40x the input, so where is all that extra data going?

I figured it out. It's because of the groupBy step I have right after this join.
The imbalance is in the groupBy key.
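For anyone hitting the same thing, here is a minimal sketch of salting the groupBy key as well, the same way the join key was salted. The DataFrame name joined, the 32-way salt, and the sum aggregate are placeholders for illustration, not the actual job:
import org.apache.spark.sql.functions.{col, rand, sum}

// joined stands in for the (col1, col2, col3) result of the join above.
// Spread each col1 value over 32 salt buckets so one hot key cannot
// land on a single reducer.
val salted = joined.withColumn("salt", (rand() * 32).cast("int"))

// First pass: partial aggregates per (col1, salt).
val partial = salted
  .groupBy(col("col1"), col("salt"))
  .agg(sum("col2").as("partial_sum"))

// Second pass: combine the partials back to one row per col1.
val result = partial
  .groupBy(col("col1"))
  .agg(sum("partial_sum").as("col2_sum"))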

Related

Repartitioned data bottlenecks in few tasks in Spark

I have a simple spark job which does the following
val dfIn = spark.read.parquet(PATH_IN)
val dfOut = dfIn.repartition($"col1", $"col2", $"col3")
dfOut.write.mode(SaveMode.Append).partitionBy("col1", "col2", "col3").parquet(PATH_OUT)
I noticed a big performance deterioration in this job. Inspecting the Spark UI showed that the write was bottlenecked in a few tasks, which had huge memory spill and much bigger output sizes than the fast partitions.
So I suspected that the issue was caused by data skew and changed the way the data is repartitioned to
import org.apache.spark.sql.functions.rand
val dfOut = dfIn.withColumn("rand", rand()).repartitionByRange($"col1", $"col2", $"col3", $"rand")
However, this did not resolve the performance issues.
In the Spark UI you can now see that the data is very evenly distributed across ALL partitions (based on the output size), but a few tasks still run very long.
I have no idea what else could cause this and would be thankful for any ideas.
While this is not a final answer for your issue, this tip might help: you can easily inspect your actual data for possible skewness with
for i, part in enumerate(dfIn.rdd.glom().collect()):
    print({i: len(part)})
and then salt as needed. Of course, all the data might not fit on the driver; limit as appropriate to get a proper sample :)
PS: example in Python but you get the idea
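A roughly equivalent sketch in Scala (same dfIn; it counts rows per partition without collecting the rows themselves):
// Count rows per partition instead of pulling whole partitions to the driver.
dfIn.rdd
  .mapPartitionsWithIndex((idx, rows) => Iterator((idx, rows.size)))
  .collect()
  .foreach { case (idx, n) => println(s"partition $idx: $n rows") }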

Spark's Shuffle Sort Merge Join. One DataFrame is bucketed. Does Spark take advantage of this?

I remember from working with RDDs that if one key-value RDD (rdd1) has a known partitioning, then performing a join with a different, unpartitioned, key-value RDD (rdd2) would give performance benefits. This is because 1) only the data of rdd2 would need to be transferred across the network, and 2) each element of rdd2 would only need to be transferred to one node rather than to all of them, by applying rdd1's partitioning to rdd2's keys.
I'm learning about Shuffle Sort Merge Joins with DataFrames. The example in the book I am reading (Learning Spark, 2nd Edition) is for joining two DataFrames based on user_id columns. The example is attempting to demonstrate the elimination of the Exchange stage from the join operation, so, prior to the join, both DataFrames are bucketed into an equal number of buckets by the column to be joined on.
My question is, what happens if only one of the DataFrames has been bucketed? Clearly the Exchange stage will reappear. But if we know that DataFrame1 is bucketed into N buckets by the column we want to join on, will Spark use this bucketing information to efficiently transfer the rows of DataFrame2 over the network, as in the RDD case? Would Spark leave the rows of DataFrame1 where they are, and just apply an identical bucketing to DataFrame2? (Assuming that N buckets results in a reasonable amount of data in the partitions to be joined by the executors) Or instead, does Spark inefficiently shuffle both DataFrames?
In particular, I can imagine a situation where I have a single 'master' DataFrame against which I will need to perform many independent joins with other supplemental DataFrames on the same column. Surely it should only be necessary to pre-bucket the master DataFrame in order to see the performance benefits for all joins? (Although taking the trouble to bucket the supplemental DataFrames wouldn't hurt either, I think)
https://kb.databricks.com/data/bucketing.html explains it all, with some embellishment over their original postings, which I summarize below.
Bottom line:
val t1 = spark.table("unbucketed")
val t2 = spark.table("bucketed")
val t3 = spark.table("bucketed")
// Unbucketed - bucketed join. Both sides need to be repartitioned.
t1.join(t2, Seq("key")).explain()

// Unbucketed with repartition - bucketed join. The unbucketed side is
// correctly repartitioned, and only one shuffle is needed.
t1.repartition(16, $"key").join(t2, Seq("key")).explain()

// Unbucketed with incorrect repartitioning (the default 200 partitions) - bucketed join.
// The unbucketed side is incorrectly repartitioned, and two shuffles are needed.
t1.repartition($"key").join(t2, Seq("key")).explain()

// Bucketed - bucketed join. Ideal case: both sides have the same
// bucketing, and no shuffles are needed.
t3.join(t2, Seq("key")).explain()
So, both sides need the same bucketing for optimal performance.
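For completeness, a hedged sketch of how two such bucketed tables could be produced in the first place. The table names, the join column key, and the bucket count of 16 are illustrative; saveAsTable is used because the bucketing metadata lives in the metastore:
// dfLeft / dfRight stand in for the two DataFrames to be joined.
dfLeft.write
  .bucketBy(16, "key")
  .sortBy("key")
  .mode("overwrite")
  .saveAsTable("bucketed")

dfRight.write
  .bucketBy(16, "key")
  .sortBy("key")
  .mode("overwrite")
  .saveAsTable("bucketed_other")

// With matching bucket counts on the same column, the join plan
// should show no Exchange on either side.
spark.table("bucketed")
  .join(spark.table("bucketed_other"), Seq("key"))
  .explain()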

GroupByKey vs Join performance in Spark

I have an RDD like (id, (val1, val2)). I want to normalize the val2 values for each id by dividing by the sum of all val2 for that particular id, so my output should look like (id, (val1, val2normalized)).
There are two ways of doing this:
Do a groupByKey on id, followed by normalizing the value using mapValues.
Do a reduceByKey to get an RDD like (id, val2sum), join this RDD with the original RDD to get (id, ((val1, val2), val2sum)), and then use mapValues to normalize.
Which one should be chosen?
If you limit yourself to:
the RDD API, and
groupByKey + mapValues vs. reduceByKey + join,
then the former will be preferred. Since RDD.join is implemented using cogroup, the cost of the latter strategy can only be higher than that of groupByKey (cogroup on the unreduced RDD is equivalent to groupByKey, but you additionally need a full shuffle for reduceByKey). Keep in mind that if the groups are too large, neither solution will be feasible.
This, however, might not be the optimal choice. Depending on the size of each group and the total number of groups, you might be able to achieve much better performance using a broadcast join.
At the same time, the DataFrame API comes with significantly improved shuffle internals and can automatically apply some optimizations, including broadcast joins.
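As a rough sketch of the broadcast idea for this particular normalization (the types, and the assumption that the per-id sums fit on the driver, are mine):
// rdd is assumed to be RDD[(String, (String, Double))], i.e. (id, (val1, val2)).
val sums = rdd
  .mapValues { case (_, v2) => v2 }
  .reduceByKey(_ + _)
  .collectAsMap()                     // assumes the distinct ids fit on the driver

val sumsBc = rdd.sparkContext.broadcast(sums)

// One pass over the data: no groupByKey and no join shuffle.
val normalized = rdd.map { case (id, (v1, v2)) =>
  (id, (v1, v2 / sumsBc.value(id)))
}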

Spark: Apply multiple transformations without recalculating or caching

Is it possible to take the output of a transformation (RDD/Dataframe) and feed it to two independent transformations without recalculating the first transformation and without caching the whole dataset?
Long version
Consider the case.
I have a very large dataset that doesn't fit in memory. Now I do some transformations on it which prepare the data to be worked on efficiently (grouping, filtering, sorting....):
DATASET --(TF1: transformation with group by, etc)--> DF1
DF1 --(TF2: more_transformations_some_columns)--> output
DF1 --(TF3: more_transformations_other_columns)--> output2
I was wondering if there is any way (or planned in dev) to tell Spark that, after TF1, it must reuse the same results (at partition level, without caching everything!) to serve both TF2 and TF3.
This can be conceptually imagined as a cache() at each partition, with an automatic unpersist() once the partition has been consumed by the downstream transformations.
I searched for a long time but couldn't find any way of doing it.
My attempt:
DF1 = spark.read()... .groupBy().agg()...
DF2 = DF1.select("col1").cache() # col1 fits in mem
DF3 = DF1.select("col1", transformation(other_cols)).write()... # Force evaluation of col1
Unfortunately, DF3 cannot guess that it could do the caching of col1. So apparently it isn't possible to ask Spark to cache only a few columns; that would already alleviate the problem.
Any ideas?
I don't think it is possible to cache just some of the columns,
but will this solve your problem?
DF1 = spark.read()... .groupBy().agg()...
DF3 = DF1.select("col1", transformation(other_cols)).cache()
DF3.write()
DF2 = DF3.select("col1")

How to avoid shuffles while joining DataFrames on unique keys?

I have two DataFrames A and B:
A has columns (id, info1, info2) with about 200 million rows
B only has the column id with 1 million rows
The id column is unique in both DataFrames.
I want a new DataFrame which filters A to only include values from B.
If B were very small, I know I would do something along the lines of
A.filter($("id") isin B("id"))
but B is still pretty large, so not all of it can fit as a broadcast variable.
and I know I could use
A.join(B, Seq("id"))
but that wouldn't harness the uniqueness, and I'm afraid it will cause unnecessary shuffles.
What is the optimal method to achieve that task?
If you have not applied any partitioner to DataFrame A, maybe this will help you understand the join and shuffle concepts.
Without a partitioner:
A.join(B, Seq("id"))
By default, this operation will hash all the keys of both DataFrames, sending elements with the same key hash across the network to the same machine, and then join together the elements with the same key on that machine. Notice that both DataFrames are shuffled across the network.
With a HashPartitioner:
Call partitionBy() when building A as a pair RDD (this is an RDD-level technique; the DataFrame API does not expose a HashPartitioner directly). Spark will then know that it is hash-partitioned, and calls to join() on it will take advantage of this information. In particular, when we call A.join(B), Spark will shuffle only the B RDD. Since B has less data than A, you don't need to apply a partitioner to B.
ex:
import org.apache.spark.HashPartitioner
// A is read as a pair RDD of (id, (info1, info2)); the types are illustrative.
val A = sc.sequenceFile[String, (String, String)]("hdfs://...")
  .partitionBy(new HashPartitioner(100)) // Create 100 partitions
  .persist()
A.join(B)
The reference is from the Learning Spark book.
My default advice on how to optimize joins is:
Use a broadcast join if you can (from your question it seems your tables are large and a broadcast join is not an option).
One option in Spark is to perform a broadcast join (a.k.a. map-side join in the Hadoop world). With a broadcast join, you can very effectively join a large table (fact) with relatively small tables (dimensions) by avoiding sending all the data of the large table over the network.
You can use the broadcast function to mark a dataset to be broadcast when used in a join operator. Spark uses the spark.sql.autoBroadcastJoinThreshold setting to control the size of a table that will be broadcast to all worker nodes when performing a join.
Use the same partitioner.
If two RDDs have the same partitioner, the join will not cause a shuffle. Note however, that the lack of a shuffle does not mean that no data will have to be moved between nodes. It's possible for two RDDs to have the same partitioner (be co-partitioned) yet have the corresponding partitions located on different nodes (not be co-located).
This situation is still better than doing a shuffle, but it's something to keep in mind. Co-location can improve performance, but is hard to guarantee.
If the data is huge and/or your clusters cannot grow such that even (2) above leads to OOM, use a two-pass approach. First, re-partition the data and persist using partitioned tables (dataframe.write.partitionBy()). Then, join sub-partitions serially in a loop, "appending" to the same final result table.
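A hedged sketch of that two-pass approach; the bucket column, the bucket count of 32, and the paths are invented purely for illustration:
import org.apache.spark.sql.functions.{col, hash, lit, pmod}

val numBuckets = 32  // illustrative

// Pass 1: persist both sides partitioned by a hash bucket of the join key.
A.withColumn("bucket", pmod(hash(col("id")), lit(numBuckets)))
  .write.partitionBy("bucket").parquet("/tmp/A_buckets")
B.withColumn("bucket", pmod(hash(col("id")), lit(numBuckets)))
  .write.partitionBy("bucket").parquet("/tmp/B_buckets")

// Pass 2: join the matching sub-partitions serially and append to one result.
(0 until numBuckets).foreach { b =>
  val aPart = spark.read.parquet(s"/tmp/A_buckets/bucket=$b")
  val bPart = spark.read.parquet(s"/tmp/B_buckets/bucket=$b")
  aPart.join(bPart, Seq("id"))
    .write.mode("append").parquet("/tmp/joined")
}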
If I understand your question correctly, you want to use a broadcast join that replicates DataFrame B on every node so that the semi-join computation (i.e., using a join to filter id from DataFrame A) can be computed independently on every node instead of having to communicate information back and forth between nodes (i.e., a shuffle join).
You can run join functions that explicitly call for a broadcast join to achieve what you're trying to do:
import org.apache.spark.sql.functions.broadcast
val joinExpr = A.col("id") === B.col("id")
val filtered_A = A.join(broadcast(B), joinExpr, "left_semi")
You can run filtered_A.explain() to verify that a broadcast join is being used.
