Spark difference between reduceByKey vs. groupByKey vs. aggregateByKey vs. combineByKey - apache-spark

Can anyone explain the difference between reducebykey, groupbykey, aggregatebykey and combinebykey? I have read the documents regarding this, but couldn't understand the exact differences.
An explanation with examples would be great.

groupByKey:
Syntax:
sparkContext.textFile("hdfs://")
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .groupByKey()
  .map { case (word, counts) => (word, counts.sum) }
groupByKey can cause out-of-memory problems, as all the values for a key are sent over the network and collected on the reducing workers.
reduceByKey:
Syntax:
sparkContext.textFile("hdfs://")
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey((x, y) => x + y)
Data is combined at each partition, so only one output per key leaves each partition to be sent over the network. reduceByKey requires combining all your values into another value with the exact same type.
aggregateByKey:
Similar to reduceByKey, but it takes an initial (zero) value.
3 parameters as input:
initial (zero) value
sequence op logic (merges a value into the accumulator within a partition)
combiner logic (merges accumulators across partitions)
Example:
val keysWithValuesList = Array("foo=A", "foo=A", "foo=A", "foo=A", "foo=B", "bar=C", "bar=D", "bar=D")
val data = sc.parallelize(keysWithValuesList)
//Create key value pairs
val kv = data.map(_.split("=")).map(v => (v(0), v(1))).cache()
val initialCount = 0;
val addToCounts = (n: Int, v: String) => n + 1
val sumPartitionCounts = (p1: Int, p2: Int) => p1 + p2
val countByKey = kv.aggregateByKey(initialCount)(addToCounts, sumPartitionCounts)
Output:
Aggregate By Key sum Results
bar -> 3
foo -> 5
combineByKey:
3 parameters as input:
initial value (createCombiner): unlike aggregateByKey, you need not always pass a constant; you pass a function that builds the initial value from the first value seen for a key
merging function (mergeValue: merges a value into the accumulator within a partition)
combine function (mergeCombiners: merges accumulators across partitions)
Example:
// rdd is assumed to be an RDD[(String, Int)]; this computes the average value per key
val result = rdd.combineByKey(
  (v: Int) => (v, 1),
  (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),
  (acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2)
).map { case (k, v) => (k, v._1 / v._2.toDouble) }
result.collect.foreach(println)
reduceByKey, aggregateByKey and combineByKey are preferred over groupByKey.
Reference:
Avoid groupByKey

groupByKey() just groups your dataset based on a key. It results in a data shuffle when the RDD is not already partitioned.
reduceByKey() is something like grouping + aggregation. We can say reduceByKey() is equivalent to dataset.group(...).reduce(...). It shuffles less data than groupByKey().
aggregateByKey() is logically the same as reduceByKey(), but it lets you return the result in a different type. In other words, it lets you have an input of type x and an aggregate result of type y, for example (1,2),(1,4) as input and (1,"six") as output. It also takes a zero value that is applied at the beginning of each key.
Note: one similarity is that they are all wide (shuffle) operations.
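A minimal sketch of that type change (illustrative data, not taken from the answer above): the values are Ints, but the zero value and both functions work on Strings, so the aggregated result is a String.
val nums = sc.parallelize(Seq((1, 2), (1, 4), (2, 3)))
val joined = nums.aggregateByKey("")(
  (acc, v) => if (acc.isEmpty) v.toString else acc + "+" + v,          // seqOp: fold an Int into the String accumulator
  (a, b)   => if (a.isEmpty) b else if (b.isEmpty) a else a + "+" + b  // combOp: merge partition-level Strings
)
// e.g. (1, "2+4") and (2, "3"); the order inside the String depends on partitioning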

While both reduceByKey and groupByKey will produce the same answer, the
reduceByKey example works much better on a large dataset. That's
because Spark knows it can combine output with a common key on each
partition before shuffling the data.
On the other hand, when calling groupByKey, all the key-value pairs
are shuffled around. This is a lot of unnecessary data being
transferred over the network.
For more details, check the link below:
https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html
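As a minimal sketch of that difference (the data here is illustrative), both lines below return exactly the same counts, but reduceByKey pre-combines the (word, 1) pairs inside each partition before the shuffle, while groupByKey ships every single pair across the network and only sums afterwards:
val words = sc.parallelize(Seq("spark", "scala", "spark")).map(w => (w, 1))
val viaReduce = words.reduceByKey(_ + _)             // map-side combine, then shuffle
val viaGroup  = words.groupByKey().mapValues(_.sum)  // shuffle everything, then sum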

Although both of them will fetch the same results, there is a significant difference in the performance of the two functions. reduceByKey() works better with larger datasets than groupByKey().
In reduceByKey(), pairs on the same machine with the same key are combined (using the function passed into reduceByKey()) before the data is shuffled. Then the function is called again to reduce all the values from each partition and produce one final result.
In groupByKey(), all the key-value pairs are shuffled around. This is a lot of unnecessary data being transferred over the network.

ReduceByKey - reduceByKey(func, [numTasks])
Data is combined so that each partition holds at most one output value for each key.
Then the shuffle happens and the combined data is sent over the network to a particular executor for some action such as reduce.
GroupByKey - groupByKey([numTasks])
It doesn't merge the values for a key; the shuffle happens directly,
and a lot of data gets sent to each partition, almost the same as the initial data.
The merging of values for each key is done after the shuffle.
A lot of data ends up stored on the final worker nodes, which can result in out-of-memory issues.
AggregateByKey - aggregateByKey(zeroValue)(seqOp, combOp, [numTasks])
It is similar to reduceByKey but you can provide initial values when performing aggregation.
Use of reduceByKey
reduceByKey can be used when we run on a large data set.
Prefer reduceByKey over aggregateByKey when the input and output value types are the same.
Moreover, it is recommended not to use groupByKey and to prefer reduceByKey. For details you can refer here.
You can also refer to this question to understand in more detail how reduceByKey and aggregateByKey work.

Then apart from these 4, we have
foldByKey, which is the same as reduceByKey but with a user-defined zero value.
AggregateByKey takes 3 parameters as input and uses 2 functions for merging (one for merging within the same partition and another for merging values across partitions; the first parameter is the zero value),
whereas
ReduceByKey takes only 1 parameter, which is a function for merging.
CombineByKey takes 3 parameters and all 3 are functions. It is similar to aggregateByKey, except that it takes a function for the zero value.
GroupByKey takes no parameters and groups everything. It also adds the overhead of transferring data across partitions.
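A quick side-by-side sketch of those signatures on a toy pair RDD (the data and names are illustrative):
val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
pairs.reduceByKey(_ + _)                           // 1 parameter: the merge function
pairs.foldByKey(0)(_ + _)                          // zero value + merge function
pairs.aggregateByKey(0)(_ + _, _ + _)              // zero value + seqOp + combOp
pairs.combineByKey((v: Int) => v,                  // createCombiner (a function, not a constant)
                   (acc: Int, v: Int) => acc + v,  // mergeValue (within a partition)
                   (a: Int, b: Int) => a + b)      // mergeCombiners (across partitions)
pairs.groupByKey()                                 // no parameters: just groups the values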

Related

How to get some elements from each group after using groupBy in Spark

I have a Spark RDD; let's suppose it has 1000 elements that can be grouped into 10 groups. What I want to do is select 2 elements which meet my special requirement in each group, and then get a new RDD with 20 elements.
suppose the rdd data is like
((1,a1),
(1,a2),
(1,a3),
...
(1,a100),
(2,b1),
(2,b2),
(2,b3)
...
(2,b100))
what I want is
((1,a1),
(1,a99),
(2,b1),
(2,b99)
)
and I select a1, a99, b1, b99 with a function called my_func
I think the code may be something like:
myrdd.groupby(x => x._1)....(my_func)...
Not convinced you need groupBy. Not sure of structure of RDD.
This uses my own contrived data, so you will need to adapt:
import org.apache.spark.rdd.RDD

// Gen some data. My data. Adapt to yours.
val rdd = spark.sparkContext.parallelize(Seq((1, "x"), (2, "y"), (3, "z"), (4, "z"), (5, "bbb")))
// Compare list.
val l = List("x", "y", "z")
// Function to filter, could be inline or via mapPartitions.
def my_function(l: List[String], r: RDD[(Int, String)]) = {
  r.map(x => (x._1, x._2)).filter(x => l.contains(x._2))
}
// Run it all.
val rdd2 = my_function(l, rdd)
rdd2.collect
returns:
res24: Array[(Int, String)] = Array((1,x), (2,y), (3,z), (4,z))
I strongly discourage you from using groupBy() or even mapPartitions() on a big dataset when you need to subsequently aggregate your data. The purpose of the RDD and MapReduce programming model is to distribute computation: computing the max/min/sum etc. in the driver or on a single node means using only the HDFS part of Spark.
Besides, there are many ways to perform your task, but focusing on finding a pattern that fits every type of aggregation you need is just wrong and inevitably makes your code inefficient.
Here is a possible PySpark solution for the problem you have:
rdd.reduceByKey(lambda x, y: x if x < y else y)\
.union(rdd.reduceByKey(lambda x, y: x if x > y else y)).sortByKey().collect()
In the first reduceByKey I find the smallest value for each key and in the second one the biggest value for each key. Then I can union them and, if necessary, sort the resulting RDD to obtain the result you showed us.

Internals of reduce function in spark-shell

Input file contains 20 lines. I am trying to count the total number of records using the reduce function. Can anyone please explain why there is a difference in the results? Because here the value of y is nothing but 1.
Default number of partitions : 4
scala> val rdd = sc.textFile("D:\\LearningPythonTomaszDenny\\Codebase\\wholeTextFiles\\names1.txt")
scala> rdd.map(x=>1).reduce((acc,y) => acc+1)
res17: Int = 8
scala> rdd.map(x=>1).reduce((acc,y) => acc+y)
res18: Int = 20
Because here value of y is nothing but only 1.
That is simply not true. reduce consists of three stages (not in the strict Spark meaning of the word):
Distributed reduce on each partition.
Collection of the partial results to the driver (synchronous or asynchronous depending on the backend).
Local driver reduction.
In your case the results of the first and second stages will be the same for both expressions, but the first approach simply ignores the partial results in the final step. In other words, no matter what the result for a partition was, it always adds only 1.
Such an approach would work only with a non-parallel, strictly sequential reduce implementation.
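To make the merge step concrete, here is a minimal local sketch, assuming the 20 ones land in 4 partitions of 5 elements (the actual split may differ): each partition reduces to 5 with either function, and the difference only shows up when the driver merges the partial results.
val partials = Seq(5, 5, 5, 5)          // per-partition results of the distributed reduce
partials.reduce((acc, y) => acc + 1)    // 8: each merged partial contributes only 1
partials.reduce((acc, y) => acc + y)    // 20: the partials are actually summed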

Spark Accumulator vs Count

I have a use case where I want to count types of elements in an RDD matching some filter.
e.g. RDD.filter(F1) and RDD.filter(!F1)
I have 2 options
Use accumulators: e.g.
LongAccumulator l1 = sparkContext.longAccumulator("Count1");
LongAccumulator l2 = sparkContext.longAccumulator("Count2");
rdd.foreachPartition(it -> {
    while (it.hasNext()) {
        if (F1(it.next())) l1.add(1);   // F1 is the filter predicate from above
        else               l2.add(1);
    }
});
Use Count
RDD.filter(F1).count(); RDD.filter(!F1).count()
One benefit of the first approach is that we only need to iterate over the data once (useful since my data set is tens of TB).
What is the use of count() if the same effect can be achieved by using accumulators?
The major difference is that if your code fails during a transformation, the accumulators may still have been (partially) updated, while count() simply won't return a result.
Other option is to use pure map-reduce:
val counts = rdd.map(x => (F1(x), 1)).reduceByKey(_ + _).collectAsMap()
The network cost should also be low, as only a few numbers are sent back. This creates pairs of (F1(x) is true/false, 1) and then sums the ones; the resulting counts map gives you the number of items matching F1(x) and !F1(x).
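For instance, a minimal runnable sketch with an assumed predicate (here F1 means "the number is even"; names are illustrative):
val nums = sc.parallelize(1 to 10)
val counts = nums.map(x => (x % 2 == 0, 1)).reduceByKey(_ + _).collectAsMap()
// counts(true)  -> number of items matching F1
// counts(false) -> number of items matching !F1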

Pyspark filter top three matches when performing cosine similarity

I have two collection of documents. I have computed cosine similarity between each pair of the cartesian product and got an RDD of the form
(k1,(k2,c))
Where k1 is a document from the first collection, k2 is one from the second and c is the cosine similarity between them.
I'm interested in getting, for each document k1 in the first collection, the three most similar from the second collection. I have performed a group by key:
grouped = (pairRddWithCosine
.groupByKey()
.map(lambda (k, v): (k, sorted(v, key=lambda x: -x[1])))
.map(lambda (x,y): (x, y[0][0],y[0][1], y[1][0], y[1][1], y[2][0] , y[2][1]))
)
It turns out that this groupBy is performing very badly. Could you please tell me how I could tune it or, even better, use something that does not shuffle the data?
If you want to obtain a sum/count/part of the values for a key you should avoid groupByKey, because groupByKey shuffles all data so that all values for a given key end up in the same reducer. For large datasets this is very expensive. Instead you should use reduceByKey or combineByKey. For these operations you could specify the function for accumulating data on each partition and the merge function between accumulators from different partitions. You can read this for more details: https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html
I think you should try reduceByKey because you're only interested in part of the values
# wrap values in lists first so the binary reduce function can merge lists and keep the top 3
k_with_top_c = rdd.mapValues(lambda v: [v]).reduceByKey(lambda a, b: sorted(a + b, key=lambda x: -x[1])[:3])
reduceByKey will try a local reduce first so it runs faster than groupByKey. However, I don't think you could avoid shuffle in this case.
Alternatively, I think that if we take
smallRdd = pairRddWithCosine.map(lambda (k1, (k2, c)): (k1, c))  # keep only the cosine value per key (assumed intent)
then
Combined = (smallRdd
            .combineByKey(lambda value: [value],
                          lambda acc, value: [max(acc[0], value)],
                          lambda acc1, acc2: [max(acc1[0], acc2[0])])
            .map(lambda (k, best): (k, best[0]))
            .map(lambda x: (x, 0))
            )
followed by a join would provide the first match. We may get all the elements from pairRddWithCosine that are not best matches by performing a leftOuterJoin()
with the best matches to get the second best.

Low performance of groupbykey in spark

After reading the Spark documentation, I find that the groupByKey function has low performance compared with reduceByKey. But what I need is to get the average, maximum and minimum value for a certain key. I don't think this can be done with the reduceByKey method. Can I just create a customized reduce function to achieve those goals?
Let's say you have an RDD[(String, Double)] and you want to calculate avg, min, max over the double values using reduceByKey.
This can be done by duplicating the values as many times as the operations you want to apply, and then applying the different operations with reduceByKey.
Like this:
val srcData: RDD[(String, Double)] = ???
srcData.cache()
val count = srcData.count
val baseData = srcData.map { case (k, v) => (k, (v, 1, v, v)) }
val aggregates = baseData.reduceByKey { case (v1, v2) =>
  (v1._1 + v2._1, v1._2 + v2._2, Math.max(v1._3, v2._3), Math.min(v1._4, v2._4)) }
val result = aggregates.collect()
  .map { case (id, (sum, count, max, min)) => (id, sum / count, max, min) }
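For comparison, a minimal sketch of the same statistics written with aggregateByKey, which supplies the (sum, count, max, min) accumulator as the zero value instead of building it with the preliminary map (variable names are illustrative):
val aggregated = srcData.aggregateByKey((0.0, 0, Double.MinValue, Double.MaxValue))(
  (acc, v) => (acc._1 + v, acc._2 + 1, math.max(acc._3, v), math.min(acc._4, v)),    // seqOp: fold one Double into the accumulator
  (a, b)   => (a._1 + b._1, a._2 + b._2, math.max(a._3, b._3), math.min(a._4, b._4)) // combOp: merge partition accumulators
)
val stats = aggregated.mapValues { case (sum, count, max, min) => (sum / count, max, min) }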
