GroupByKey vs Join performance in Spark - apache-spark

I have an RDD like (id, (val1, val2)). I want to normalize the val2 values for each id by dividing by the sum of all val2 for that particular id, so my output should look like (id, (val1, val2normalized)).
There are two ways of doing this:
1. Do a groupByKey on id, followed by normalizing the values using mapValues.
2. Do a reduceByKey to get an RDD like (id, val2sum), join this RDD with the original RDD to get (id, ((val1, val2), val2sum)), and then use mapValues to normalize.
Which one should be chosen?

If you limit yourself to:
RDD API.
groupByKey + mapValues vs. reduceByKey + join
the former will be preferred. Since RDD.join is implemented using cogroup, the cost of the latter strategy can only be higher than that of groupByKey (cogroup on the unreduced RDD is equivalent to groupByKey, but you additionally need a full shuffle for reduceByKey). Keep in mind that if the groups are too large, neither solution will be feasible.
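As a minimal sketch of the two strategies (assuming sc is an active SparkContext and the RDD holds (id, (val1, val2)) pairs as in the question):

val rdd = sc.parallelize(Seq(("a", (1.0, 2.0)), ("a", (2.0, 6.0)), ("b", (3.0, 4.0))))

// Strategy 1: groupByKey, then normalize inside each group
// (flatMapValues restores one record per input element)
val viaGroup = rdd.groupByKey().flatMapValues { values =>
  val total = values.map(_._2).sum
  values.map { case (v1, v2) => (v1, v2 / total) }
}

// Strategy 2: reduceByKey to get per-id sums, then join back and normalize
val sums = rdd.mapValues(_._2).reduceByKey(_ + _)
val viaJoin = rdd.join(sums).mapValues { case ((v1, v2), total) => (v1, v2 / total) }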
This, however, might not be the optimal choice. Depending on the size of each group and the total number of groups, you might be able to achieve much better performance using a broadcast join.
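For example, when the number of distinct ids is small enough to collect to the driver, the per-id sums can be broadcast instead of joined; a hedged sketch, reusing rdd and sums from the snippet above:

// Collect the (small) per-id sums and broadcast them to every executor
val sumsMap = sc.broadcast(sums.collectAsMap())
// A single map pass; no shuffle for the normalization itself
val viaBroadcast = rdd.map { case (id, (v1, v2)) => (id, (v1, v2 / sumsMap.value(id))) }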
At the same time, the DataFrame API comes with significantly improved shuffle internals and can automatically apply some optimizations, including broadcast joins.
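A hedged DataFrame sketch of the same normalization using a window aggregate (assumes a SparkSession named spark):

import spark.implicits._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

val df = Seq(("a", 1.0, 2.0), ("a", 2.0, 6.0), ("b", 3.0, 4.0)).toDF("id", "val1", "val2")
// Divide each val2 by the per-id total; Catalyst plans the shuffle
val normalized = df.withColumn("val2normalized", $"val2" / sum($"val2").over(Window.partitionBy("id")))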

Related

How to avoid shuffles while joining DataFrames on unique keys?

I have two DataFrames A and B:
A has columns (id, info1, info2) with about 200 Million rows
B only has the column id with 1 million rows
The id column is unique in both DataFrames.
I want a new DataFrame which filters A to only include values from B.
If B were very small, I know I would do something along the lines of
A.filter($("id") isin B("id"))
but B is still pretty large, so not all of it can fit as a broadcast variable.
and I know I could use
A.join(B, Seq("id"))
but that wouldn't harness the uniqueness, and I'm afraid it will cause unnecessary shuffles.
What is the optimal method to achieve that task?
If you have not applied a partitioner to DataFrame A, maybe this will help you understand the join and shuffle concepts.
Without a partitioner:
A.join(B, Seq("id"))
By default, this operation hashes all the keys of both DataFrames, sending elements with the same key hash across the network to the same machine, and then joins the elements with the same key on that machine. Notice that both DataFrames are shuffled across the network.
With a HashPartitioner:
Call partitionBy() when building A, and Spark will know that it is hash-partitioned; calls to join() on it will take advantage of this information. In particular, when we call A.join(B), Spark will shuffle only the B RDD. Since B has less data than A, you don't need to apply a partitioner to B.
Example (RDD API, adapted from the Learning Spark book; the key/value types are placeholders):
val A = sc.sequenceFile[Id, Info]("hdfs://...")
  .partitionBy(new HashPartitioner(100)) // create 100 partitions
  .persist()
A.join(B)
The reference is the Learning Spark book.
My default advice on how to optimize joins is:
1. Use a broadcast join if you can (from your question it seems your tables are large and a broadcast join is not an option).
One option in Spark is to perform a broadcast join (aka map-side join in the Hadoop world). With a broadcast join, you can very effectively join a large table (the fact table) with relatively small tables (the dimension tables) by avoiding sending all the data of the large table over the network.
You can use the broadcast function to mark a dataset to be broadcast when used in a join operator. Spark uses the spark.sql.autoBroadcastJoinThreshold setting to control the size below which a table will be broadcast to all worker nodes when performing a join. See the first sketch after this list.
2. Use the same partitioner.
If two RDDs have the same partitioner, the join will not cause a shuffle. Note, however, that the lack of a shuffle does not mean that no data will have to be moved between nodes: it is possible for two RDDs to have the same partitioner (to be co-partitioned) yet have the corresponding partitions located on different nodes (not be co-located).
This situation is still better than doing a shuffle, but it is something to keep in mind. Co-location can improve performance but is hard to guarantee. See the second sketch after this list.
3. If the data is huge and/or your clusters cannot grow to the point where even (2) above avoids OOM, use a two-pass approach. First, re-partition the data and persist it using partitioned tables (dataframe.write.partitionBy()). Then, join the sub-partitions serially in a loop, "appending" to the same final result table, as in the third sketch below.
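For (1), a hedged sketch of the broadcast hint and the auto-broadcast threshold (the threshold value here is purely illustrative):

import org.apache.spark.sql.functions.broadcast
// Explicitly mark B for broadcasting, regardless of table statistics
A.join(broadcast(B), Seq("id"))
// Tables smaller than this size (in bytes) are broadcast automatically
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 100L * 1024 * 1024)

For (2), a minimal RDD sketch of co-partitioning (assuming sc is an active SparkContext):

import org.apache.spark.HashPartitioner
val part = new HashPartitioner(8)
val left = sc.parallelize(Seq((1, "a"), (2, "b"))).partitionBy(part).cache()
val right = sc.parallelize(Seq((1, "x"), (2, "y"))).partitionBy(part).cache()
left.join(right) // same partitioner on both sides: the join itself needs no shuffle

For (3), a hedged sketch of the two-pass approach; the bucket count, column names, and paths are hypothetical:

import org.apache.spark.sql.functions.{col, hash, lit, pmod}
val numBuckets = 16
// Pass 1: bucket both tables by a hash of the join key and persist them partitioned
A.withColumn("bucket", pmod(hash(col("id")), lit(numBuckets)))
  .write.partitionBy("bucket").parquet("/tmp/a_bucketed")
B.withColumn("bucket", pmod(hash(col("id")), lit(numBuckets)))
  .write.partitionBy("bucket").parquet("/tmp/b_bucketed")
// Pass 2: join matching buckets serially, appending to one result table
(0 until numBuckets).foreach { b =>
  spark.read.parquet(s"/tmp/a_bucketed/bucket=$b")
    .join(spark.read.parquet(s"/tmp/b_bucketed/bucket=$b"), Seq("id"))
    .write.mode("append").parquet("/tmp/joined")
}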
If I understand your question correctly, you want to use a broadcast join that replicates DataFrame B on every node so that the semi-join computation (i.e., using a join to filter id from DataFrame A) can be computed independently on every node instead of having to communicate information back and forth (i.e., a shuffle join).
You can run join functions that explicitly call for a broadcast join to achieve what you're trying to do:
import org.apache.spark.sql.functions.broadcast
val joinExpr = A.col("id") === B.col("id")
val filtered_A = A.join(broadcast(B), joinExpr, "left_semi")
You can run filtered_A.explain() to verify that a broadcast join is being used.

reduceByKey and aggregateByKey in Spark DataFrame

I am using Spark 2.0 to read data from a Parquet file:
val Df = sqlContext.read.parquet("c:/data/parquet1")
val dfSelect = Df.select("id", "Currency", "balance")
val dfSumForeachId=dfSelect.groupBy("id").sum("balance")
val total=dfSumForeachId.agg(sum("sum(balance)")).first().getDouble(0)
In order to get the total balance value, is this the best way of getting it, using the action first() on a DataFrame?
In Spark 2.0, is it fine to use groupBy on a key? Does it have the same performance issue as groupByKey on an RDD (i.e., does it need to shuffle all the data over the network and then perform the aggregation), or is the aggregation performed locally, like reduceByKey in earlier versions of Spark?
Thanks
Getting the data by using first is a perfectly valid way. That said, doing:
val total = dfSelect.agg(sum("balance")).first().getDouble(0)
would probably give you better performance for getting the total.
group by key and reduce by key work exactly the same as in previous versions, and for the same reasons: group by key makes no assumptions about the action you want to perform, and therefore cannot do partial aggregations the way reduce by key does.
When you do a DataFrame groupBy and sum, you are actually doing a reduceByKey with +, and the second aggregation you did is a reduce with +. That said, the DataFrame version does it more efficiently because, knowing exactly what is being done, it can perform many optimizations, such as whole-stage code generation.
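As a hedged illustration of that equivalence (column names follow the question above; it assumes id is a string and balance a double):

// RDD version: an explicit reduceByKey with +, with map-side partial aggregation
val totals = dfSelect.rdd
  .map(row => (row.getAs[String]("id"), row.getAs[Double]("balance")))
  .reduceByKey(_ + _)

// DataFrame version: the same logical operation, but open to Catalyst
// optimizations such as whole-stage code generation
val totalsDf = dfSelect.groupBy("id").sum("balance")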

How to reduce Spark task counts & avoid group by

All, I am using PySpark and need to join two RDDs, but to join them I need to group all the elements of each RDD by the joining key and then perform a join function. This causes additional overhead and I am not sure what a workaround can be. It is also creating a high number of tasks, which in turn increases the number of files to write to HDFS and slows the overall process down a lot. Here is an example:
RDD1 = [join_col, {All_Elements of RDD1}]  # derived by using groupBy(join_col)
RDD2 = [join_col, {All_Elements of RDD2}]  # derived by using groupBy(join_col)
RDD3 = RDD1.join(RDD2)
If the desired output is grouped and both RDDs are too large to be broadcast, there is not much you can do at the code level. It could be cleaner to simply apply cogroup:
rdd1.cogroup(rdd2)
but there should be no significant difference performance-wise. If you suspect there can be large data / hash skew, you can try different partitioning, for example by using sortByKey, but it is unlikely to help in the general case.
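A minimal sketch (in Scala, assuming sc is a SparkContext) of what cogroup gives you in one pass:

val rdd1 = sc.parallelize(Seq(("k1", 1), ("k1", 2), ("k2", 3)))
val rdd2 = sc.parallelize(Seq(("k1", "x"), ("k2", "y")))
// One grouping shuffle; result: RDD[(String, (Iterable[Int], Iterable[String]))]
val grouped = rdd1.cogroup(rdd2)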

Spark RDD groupByKey + join vs join performance

I am using Spark on a cluster which I am sharing with other users, so it is not reliable to tell which version of my code runs more efficiently just based on the running time: when I am running the more efficient code, someone else may be running huge data jobs that make my code execute for a longer time.
So can I ask 2 questions here:
I was using the join function to join 2 RDDs, and I am trying to use groupByKey() before the join, like this:
rdd1.groupByKey().join(rdd2)
It seems that it took a longer time; however, I remember that when I was using Hadoop Hive, the group by made my query run faster. Since Spark uses lazy evaluation, I am wondering whether groupByKey before a join makes things faster.
I have noticed Spark has a SQL module. So far I really haven't had time to try it, but can I ask what the differences are between the SQL module and the RDD SQL-like functions?
There is no good reason for groupByKey followed by join to be faster than join alone. If rdd1 and rdd2 have no partitioner, or their partitioners differ, then the limiting factor is simply the shuffling required for the HashPartitioning.
By using groupByKey you not only increase the total cost by keeping the mutable buffers required for grouping but, more importantly, you use an additional transformation, which results in a more complex DAG. groupByKey + join:
rdd1 = sc.parallelize([("a", 1), ("a", 3), ("b", 2)])
rdd2 = sc.parallelize([("a", 5), ("c", 6), ("b", 7)])
rdd1.groupByKey().join(rdd2)
vs. join alone:
rdd1.join(rdd2)
Finally, these two plans are not even equivalent: to get the same results, you have to add an additional flatMap to the first one, as in the sketch below.
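A hedged Scala sketch of that difference: groupByKey().join() yields (key, (Iterable[V], W)) pairs, so a flatMap is needed to flatten them back to the (key, (V, W)) shape that a plain join produces:

val r1 = sc.parallelize(Seq(("a", 1), ("a", 3), ("b", 2)))
val r2 = sc.parallelize(Seq(("a", 5), ("c", 6), ("b", 7)))

// groupByKey + join + flatMap: equivalent output, more complex DAG
val viaGroup = r1.groupByKey().join(r2).flatMap {
  case (k, (vs, w)) => vs.map(v => (k, (v, w)))
}
val viaJoin = r1.join(r2) // same elements, simpler DAG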
This is quite a broad question, but to highlight the main differences:
PairwiseRDDs are homogeneous collections of arbitrary Tuple2 elements. For the default operations you want the key to be hashable in a meaningful way; otherwise there are no strict requirements regarding the type. In contrast, DataFrames exhibit much more dynamic typing, but each column can only contain values from a supported set of defined types. It is possible to define UDTs, but they still have to be expressed using the basic ones.
DataFrames use the Catalyst optimizer, which generates logical and physical execution plans and can produce highly optimized queries without the need for manual low-level optimizations. RDD-based operations simply follow the dependency DAG. That means worse performance without custom optimization, but much better control over execution and some potential for fine-grained tuning.
Some other things to read:
Difference between DataFrame and RDD in Spark
Why spark.ml don't implement any of spark.mllib algorithms?
I mostly agree with zero323's answer, but I think there is reason to expect the join to be faster after groupByKey: groupByKey reduces the amount of data and partitions the data by the key. Both of these help with the performance of a subsequent join.
I don't think the former (the reduced data size) is significant. And to reap the benefits of the latter (partitioning), you need the other RDD to be partitioned the same way.
For example:
val a = sc.parallelize((1 to 10).map(_ -> 100)).groupByKey() // a now carries a partitioner
val b = sc.parallelize((1 to 10).map(_ -> 100)).partitionBy(a.partitioner.get) // co-partition b with a
a.join(b).collect // the join reuses the shared partitioning instead of a full shuffle

'map-side' aggregation in Spark

I am learning Spark using the book Learning Spark and came across this term (page 54):
"We can disable map-side aggregation in combineByKey() if we know that our data won't benefit from it."
I am confused about what is meant by map-side aggregation here. The only thing that comes to my mind is Mapper & Reducer in Hadoop MapReduce, but I believe that is in no way related to Spark.
The idea behind map-side aggregation is pretty much the same as that of Hadoop combiners: if a single mapper can yield multiple values for the same key, you can reduce shuffling by reducing the values locally.
One example of an operation which can benefit from map-side aggregation is creating a set of values for each key, especially when you partition an RDD before combining.
First, let's create some dummy data:
val pairs = sc.parallelize(
("foo", 1) :: ("foo", 1) :: ("foo", 2) ::
("bar", 3) :: ("bar", 4) :: ("bar", 5) :: Nil
)
Then partition it and merge the data using combineByKey:
import org.apache.spark.HashPartitioner
import collection.mutable.{Set => MSet}

// Partition by key before combining; 4 partitions chosen arbitrarily
val partitionedPairs = pairs.partitionBy(new HashPartitioner(4))

val combined = partitionedPairs.combineByKey(
  (v: Int) => MSet[Int](v),
  (set: MSet[Int], v: Int) => set += v,
  (set1: MSet[Int], set2: MSet[Int]) => set1 ++= set2
)
Depending on the data distribution, this can significantly reduce network traffic. Overall,
reduceByKey,
combineByKey with mapSideCombine set to true
aggregateByKey
foldByKey
will use map-side aggregation, while groupByKey and combineByKey with mapSideCombine set to false won't, as the sketch below shows.
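A hedged sketch of explicitly disabling map-side combine, reusing pairs, MSet, and HashPartitioner from the snippets above (mapSideCombine is a parameter of the RDD combineByKey overload that takes a partitioner; the partition count is illustrative):

val noMapSide = pairs.combineByKey(
  (v: Int) => MSet[Int](v),
  (set: MSet[Int], v: Int) => set += v,
  (set1: MSet[Int], set2: MSet[Int]) => set1 ++= set2,
  new HashPartitioner(4),
  mapSideCombine = false // ship raw values; combine only after the shuffle
)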
The choice between applying map-side aggregation or not is, however, not always obvious: the cost of maintaining the required data structures and the subsequent garbage collection can in many cases exceed the cost of the shuffle.
You're right, the term map-side reduce does come from the Map/Reduce land, and the idea is a bit complicated on the Apache Spark side of things. If it's possible to combine multiple elements within a partition before shuffling them (and the combined elements take up less space), then performing a per-partition reduction prior to shuffling the data is useful.
One case where map-side reduction is disabled in Spark is with groupByKey: even if we can combine some of the elements in the same partition, they will take up about the same amount of space anyway, so there is no corresponding reduction in network/serialization work.
Hope that helps and glad you are reading Learning Spark :)
For example, you cannot use map-side aggregation (a combiner) if you group values by key (the groupByKey operation does not use a combiner). The reason is that all the values for each key must be present after the groupByKey operation has finished, so a local reduction of the values (a combiner) is not possible.
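A minimal sketch of the contrast (assuming sc is a SparkContext): both lines compute per-key sums, but only reduceByKey can combine values map-side before the shuffle:

val nums = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
val withCombiner = nums.reduceByKey(_ + _) // partial sums computed before shuffling
val withoutCombiner = nums.groupByKey().mapValues(_.sum) // every value crosses the network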
