Low performance of groupByKey in Spark - apache-spark

After reading the Spark documentation, I found that groupByKey performs poorly compared with reduceByKey. But what I need is the average, maximum and minimum value for each key, and I don't think this can be done with reduceByKey. Can I write a customized reduceByKey function to achieve this?

Let's say you have an RDD[(String, Double)] and you want to calculate the avg, min and max of the Double values using reduceByKey.
This can be done by duplicating each value as many times as the number of operations you want to apply, and then applying the different operations inside a single reduceByKey.
Like this:
import org.apache.spark.rdd.RDD

val srcData: RDD[(String, Double)] = ???
// One slot per operation: (sum, count, max, min)
val baseData = srcData.map { case (k, v) => (k, (v, 1, v, v)) }
// Merge two partial tuples slot by slot
val aggregates = baseData.reduceByKey { case (v1, v2) =>
  (v1._1 + v2._1, v1._2 + v2._2, Math.max(v1._3, v2._3), Math.min(v1._4, v2._4)) }
// Only one small tuple per key reaches the driver
val result = aggregates.collect()
  .map { case (id, (sum, count, max, min)) => (id, sum / count, max, min) }
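To see why the tuple trick works without a cluster, here is a minimal plain-Python sketch of the same idea. The helper `reduce_by_key` and the sample data are made up for illustration; it only imitates the per-key merging that reduceByKey performs and is not Spark code.

```python
def reduce_by_key(pairs, func):
    """Plain-Python stand-in for reduceByKey: merge the values of each key with func."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged

# Hypothetical sample data in place of the RDD[(String, Double)].
src = [("a", 1.0), ("a", 3.0), ("b", 10.0), ("a", 5.0), ("b", 20.0)]

# Duplicate each value into one slot per operation: (sum, count, max, min).
base = [(k, (v, 1, v, v)) for k, v in src]

def merge(x, y):
    # Combine two partial tuples slot by slot.
    return (x[0] + y[0], x[1] + y[1], max(x[2], y[2]), min(x[3], y[3]))

aggregates = reduce_by_key(base, merge)
# Final step: turn (sum, count) into the average.
stats = {k: (s / n, mx, mn) for k, (s, n, mx, mn) in aggregates.items()}
```

Because `merge` is associative and commutative, the order in which partial tuples meet does not matter, which is the property reduceByKey relies on.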

Related

How to get some elements from each group after using groupBy in Spark

I have a Spark RDD. Let's suppose it has 1000 elements that can be grouped into 10 groups. What I want to do is select the 2 elements in each group that meet my special requirement, and so get a new RDD with 20 elements.
Suppose the RDD data looks like
((1,a1),
(1,a2),
(1,a3),
...
(1,a100),
(2,b1),
(2,b2),
(2,b3)
...
(2,b100))
What I want is
((1,a1),
(1,a99),
(2,b1),
(2,b99)
)
and I select a1, a99, b1 and b99 with a function called my_func.
I think the code may be something like:
myrdd.groupby(x => x._1)....(my_func)...
I'm not convinced you need groupBy, and I'm not sure of the structure of your RDD.
This uses my own contrived data, so you will need to adapt it:
import org.apache.spark.rdd.RDD

// Generate some data. My data; adapt to yours.
val rdd = spark.sparkContext.parallelize(Seq((1, "x"), (2, "y"), (3, "z"), (4, "z"), (5, "bbb")))
// Values to keep.
val l = List("x", "y", "z")
// Function to filter; could be inline or via mapPartitions.
def my_function(l: List[String], r: RDD[(Int, String)]): RDD[(Int, String)] = {
  r.filter(x => l.contains(x._2))
}
// Run it all.
val rdd2 = my_function(l, rdd)
rdd2.collect
returns:
res24: Array[(Int, String)] = Array((1,x), (2,y), (3,z), (4,z))
I strongly discourage you from using groupBy() or even mapPartitions() on a big dataset when you subsequently need to aggregate your data. The purpose of RDDs and the MapReduce programming model is to distribute computation: computing the max/min/sum etc. in the driver or on a single node means using Spark only for distributed storage and not for distributed computation.
Besides, there are many ways to perform your task, but trying to find one pattern that fits every type of aggregation you need is a mistake and will inevitably make your code inefficient.
Here is a possible PySpark solution for the problem you have:
rdd.reduceByKey(lambda x, y: x if x < y else y)\
.union(rdd.reduceByKey(lambda x, y: x if x > y else y)).sortByKey().collect()
In the first reduceByKey I find the smallest value for each key, and in the second one the biggest value for each key. Then I union them and, if necessary, sort the resulting RDD to obtain the result you showed us.

How to compute the dot product of two distributed RowMatrix in Apache Spark?

Let Q be a distributed RowMatrix in Spark. I want to calculate the cross product of Q with its transpose Q'.
However, although RowMatrix does have a multiply() method, it only accepts a local Matrix as an argument.
Code illustration (Scala):
val phi = new RowMatrix(phiRDD) // phiRDD is an instance of RDD[Vector]
val phiTranspose = transposeRowMatrix(phi) // transposeRowMatrix()
// returns the transpose of a RowMatrix
val crossMat = ? // phi * phiTranspose
Note that I want the product of two distributed RowMatrix instances, not of a distributed matrix with a local one.
One solution is to use an IndexedRowMatrix as follows:
val phi = new IndexedRowMatrix(phiRDD) // phiRDD is an instance of RDD[IndexedRow]
val phiTranspose = transposeMatrix(phi) // transposeMatrix()
// returns the transpose of a Matrix
val crossMat = phi.toBlockMatrix().multiply( phiTranspose.toBlockMatrix()
).toIndexedRowMatrix()
However, I want to use RowMatrix methods such as tallSkinnyQR(), which means that I should convert crossMat to a RowMatrix using the .toRowMatrix() method:
val crossRowMat = crossMat.toRowMatrix()
and finally I can apply
crossRowMat.tallSkinnyQR()
But this process involves many conversions between distributed matrix types, and according to what I understood from the MLlib Programming Guide this is expensive:
It is very important to choose the right format to store large and distributed matrices. Converting a distributed matrix to a different format may require a global shuffle, which is quite expensive.
Could someone elaborate, please?
The only distributed matrices that support matrix-matrix multiplication are BlockMatrices. You have to convert your data accordingly; artificial indices are good enough:
new IndexedRowMatrix(
rowMatrix.rows.zipWithIndex.map(x => IndexedRow(x._2, x._1))
).toBlockMatrix match { case m => m.multiply(m.transpose) }
I used the algorithm listed on this page, which turns the multiplication from a dot-product problem into a distributed scalar-product problem by using the outer product of vectors:
The outer product between two vectors is the scalar product of the
second vector with all the elements in the first vector, resulting in
a matrix
My own multiplication function for RowMatrices (it could be optimized further) ended up like this:
import org.apache.spark.mllib.linalg.{DenseVector, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.sql.SparkSession

def multiplyRowMatrices(m1: RowMatrix, m2: RowMatrix)(implicit ctx: SparkSession): RowMatrix = {
  // Zip m1 columns with m2 rows (RDD.zip needs matching partitioning and element counts)
  val m1Cm2R = transposeRowMatrix(m1).rows.zip(m2.rows)
  // Outer product: multiply each entry of the m1 column by the whole m2 row
  val scalar = m1Cm2R.map {
    case (column: DenseVector, row: DenseVector) => column.toArray.map {
      columnValue => row.toArray.map {
        rowValue => columnValue * rowValue
      }
    }
  }
  // Add all the resulting matrices element-wise
  val sum = scalar.reduce {
    case (matrix1, matrix2) => matrix1.zip(matrix2).map {
      case (array1, array2) => array1.zip(array2).map {
        case (value1, value2) => value1 + value2
      }
    }
  }
  new RowMatrix(ctx.sparkContext.parallelize(sum.map(array => Vectors.dense(array))))
}
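The identity behind the function above (a matrix product equals the sum, over k, of the outer product of column k of the first matrix with row k of the second) can be checked on small local matrices. The following is a plain-Python sketch with made-up 3x2 and 2x3 inputs; `matmul` and `matmul_outer` are illustrative names, not MLlib calls.

```python
def matmul(a, b):
    """Textbook product: entry (i, j) is the dot product of row i of a and column j of b."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def matmul_outer(a, b):
    """Same product, built as a sum of outer products (column k of a) x (row k of b)."""
    rows, cols = len(a), len(b[0])
    total = [[0.0] * cols for _ in range(rows)]
    for k in range(len(b)):
        column = [a[i][k] for i in range(rows)]  # k-th column of a
        row = b[k]                               # k-th row of b
        for i in range(rows):
            for j in range(cols):
                total[i][j] += column[i] * row[j]
    return total

a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]    # 3x2
b = [[7.0, 8.0, 9.0], [10.0, 11.0, 12.0]]   # 2x3
product = matmul_outer(a, b)
```

Each outer product is a full rows-by-cols matrix, so in the distributed version every record being reduced is as large as the final result; that may be part of why the approach moves more data than the block-matrix route.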
After that I tested both approaches, my own function and the block-matrix conversion, using a 300*10 matrix on a single machine.
Using my own function:
val PhiMat = new RowMatrix(phi)
val TphiMat = transposeRowMatrix(PhiMat)
val product = multiplyRowMatrices(PhiMat,TphiMat)
Using matrix transformation:
val MatRow = new RowMatrix(phi)
val MatBlock = new IndexedRowMatrix(MatRow.rows.zipWithIndex.map(x => IndexedRow(x._2, x._1))).toBlockMatrix()
val TMatBlock = MatBlock.transpose
val productMatBlock = MatBlock.multiply(TMatBlock)
val productMatRow = productMatBlock.toIndexedRowMatrix().toRowMatrix()
The first approach spawned 1 job with 5 stages and took 2s to finish in total, while the second approach spawned 4 jobs, three with one stage and one with two stages, and took 0.323s in total. The second approach also outperformed the first with respect to shuffle read/write size.
Yet I am still confused by the MLlib Programming Guide statement:
It is very important to choose the right format to store large and
distributed matrices. Converting a distributed matrix to a different
format may require a global shuffle, which is quite expensive.

Spark difference between reduceByKey vs. groupByKey vs. aggregateByKey vs. combineByKey

Can anyone explain the difference between reduceByKey, groupByKey, aggregateByKey and combineByKey? I have read the documentation on this, but couldn't understand the exact differences.
An explanation with examples would be great.
groupByKey:
Syntax:
sparkContext.textFile("hdfs://")
.flatMap(line => line.split(" ") )
.map(word => (word,1))
.groupByKey()
.map { case (word, counts) => (word, counts.sum) }
groupByKey can cause out-of-disk problems, because all the data is sent over the network and collected on the reduce workers.
reduceByKey:
Syntax:
sparkContext.textFile("hdfs://")
.flatMap(line => line.split(" "))
.map(word => (word,1))
.reduceByKey((x,y)=> (x+y))
Data is combined within each partition, so only one output per key per partition is sent over the network. reduceByKey requires combining all your values into another value of exactly the same type.
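The difference in shuffled data can be made concrete with a small plain-Python model. The two lists below stand in for two partitions of words (made-up data), and "shuffled" simply means the list of records that would have to cross the network; this is a sketch of the idea, not of Spark's actual shuffle machinery.

```python
from collections import Counter

# Two hypothetical partitions of the word RDD.
partitions = [
    ["a", "b", "a", "a"],
    ["b", "b", "a", "c"],
]

# groupByKey-style: every (word, 1) pair is shipped unchanged.
group_shuffled = [(w, 1) for part in partitions for w in part]

# reduceByKey-style: sum inside each partition first,
# then ship one (word, partial_count) pair per key per partition.
reduce_shuffled = [pair for part in partitions for pair in Counter(part).items()]

def final_counts(shuffled):
    """Reduce-side merge: add up whatever arrived for each word."""
    totals = Counter()
    for word, n in shuffled:
        totals[word] += n
    return dict(totals)
```

Both routes end with the same counts, but here the first ships 8 records and the second only 5; the gap grows with the number of repeated keys per partition.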
aggregateByKey:
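Same as reduceByKey, except that it takes an initial value.
3 parameters as input:
initial value
sequence op logic (merges a value into the accumulator within a partition)
combiner logic (merges accumulators across partitions)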
Example:
val keysWithValuesList = Array("foo=A", "foo=A", "foo=A", "foo=A", "foo=B", "bar=C", "bar=D", "bar=D")
val data = sc.parallelize(keysWithValuesList)
//Create key value pairs
val kv = data.map(_.split("=")).map(v => (v(0), v(1))).cache()
val initialCount = 0;
val addToCounts = (n: Int, v: String) => n + 1
val sumPartitionCounts = (p1: Int, p2: Int) => p1 + p2
val countByKey = kv.aggregateByKey(initialCount)(addToCounts, sumPartitionCounts)
output:
Aggregate By Key sum Results
bar -> 3
foo -> 5
combineByKey:
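3 parameters as input:
Create-combiner function: unlike aggregateByKey, you need not always pass a constant; you pass a function that returns the initial accumulator from the first value seen for a key.
Merge-value function (merges a value into the accumulator within a partition)
Merge-combiners function (merges accumulators across partitions)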
Example:
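// Assumes rdd is a pair RDD whose values are Int
val result = rdd.combineByKey(
  (v: Int) => (v, 1),                                     // first value for a key: (sum, count)
  (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),  // merge a value within a partition
  (acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2)  // merge partitions
).map { case (k, v) => (k, v._1 / v._2.toDouble) }        // average per key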
result.collect.foreach(println)
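As a rough mental model of the three combineByKey functions, here is a plain-Python sketch that runs them over two made-up partitions. `combine_by_key` and `aggregate_by_key` are illustrative helpers, not Spark APIs; the second shows how an aggregateByKey-style call can be expressed through the first, which assumes an immutable zero value.

```python
def combine_by_key(partitions, create_combiner, merge_value, merge_combiners):
    """Build one combiner per key inside each partition, then merge across partitions."""
    merged = {}
    for part in partitions:
        local = {}
        for key, value in part:
            if key in local:
                local[key] = merge_value(local[key], value)   # same partition
            else:
                local[key] = create_combiner(value)           # first value for this key
        for key, comb in local.items():                       # "after the shuffle"
            merged[key] = merge_combiners(merged[key], comb) if key in merged else comb
    return merged

def aggregate_by_key(partitions, zero, seq_op, comb_op):
    # Special case: the first combiner is seq_op(zero, value).
    return combine_by_key(partitions, lambda v: seq_op(zero, v), seq_op, comb_op)

# Hypothetical data: two partitions of (key, score) pairs.
scores = [[("x", 4), ("y", 10), ("x", 6)], [("x", 8), ("y", 20)]]

sums_counts = combine_by_key(
    scores,
    lambda v: (v, 1),
    lambda acc, v: (acc[0] + v, acc[1] + 1),
    lambda a, b: (a[0] + b[0], a[1] + b[1]),
)
averages = {k: s / n for k, (s, n) in sums_counts.items()}
counts = aggregate_by_key(scores, 0, lambda n, _v: n + 1, lambda a, b: a + b)
```

The accumulator type (a pair) differs from the value type (an integer), which is exactly what reduceByKey cannot express on its own.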
reduceByKey, aggregateByKey and combineByKey are preferred over groupByKey.
Reference:
Avoid groupByKey
groupByKey() just groups your dataset by key. It results in data shuffling when the RDD is not already partitioned.
reduceByKey() is something like grouping + aggregation. We can say reduceByKey() is equivalent to dataset.group(...).reduce(...). Unlike groupByKey(), it shuffles less data.
aggregateByKey() is logically the same as reduceByKey(), but it lets you return the result in a different type. In other words, it lets you have input of type x and an aggregate result of type y. For example, (1,2),(1,4) as input and (1,"six") as output. It also takes a zero value that is applied at the beginning for each key.
Note: one similarity is that they are all wide operations.
While both reduceByKey and groupByKey will produce the same answer, the
reduceByKey example works much better on a large dataset. That's
because Spark knows it can combine output with a common key on each
partition before shuffling the data.
On the other hand, when calling groupByKey, all the key-value pairs
are shuffled around. This is a lot of unnecessary data being
transferred over the network.
For more detail, see the link below:
https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html
Although both of them will produce the same results, there is a significant difference in their performance. reduceByKey() works better with larger datasets than groupByKey().
In reduceByKey(), pairs on the same machine with the same key are combined (by using the function passed into reduceByKey()) before the data is shuffled. Then the function is called again to reduce all the values from each partition to produce one final result.
In groupByKey(), all the key-value pairs are shuffled around. This is a lot of unnecessary data being transferred over the network.
reduceByKey - reduceByKey(func, [numTasks])
Data is combined so that each partition holds at most one value for each key.
Then the shuffle happens and the data is sent over the network to a particular executor for the final step, such as reduce.
groupByKey - groupByKey([numTasks])
It doesn't merge the values for a key; the shuffle happens directly,
so a lot of data gets sent to each partition, almost the same as the initial data.
The merging of values for each key is done after the shuffle.
A lot of data is therefore stored on the final worker node, which can result in out-of-memory issues.
aggregateByKey - aggregateByKey(zeroValue)(seqOp, combOp, [numTasks])
It is similar to reduceByKey, but you can provide an initial value for the aggregation.
Use of reduceByKey
reduceByKey can be used when we run on a large data set.
Prefer reduceByKey over aggregateByKey when the input and output value types are the same.
Moreover, it is recommended not to use groupByKey and to prefer reduceByKey. For details you can refer here.
You can also refer to this question to understand in more detail how reduceByKey and aggregateByKey work.
Apart from these 4, we also have
foldByKey, which is the same as reduceByKey but with a user-defined zero value.
aggregateByKey takes 3 parameters as input and uses 2 functions for merging (one for merging within a partition and another for merging values across partitions). The first parameter is the zero value.
whereas
reduceByKey takes only 1 parameter, which is a function for merging.
combineByKey takes 3 parameters and all 3 are functions. It is similar to aggregateByKey except that it can have a function for the zero value.
groupByKey takes no parameters and groups everything. It also incurs overhead for data transfer across partitions.

Spark Accumulator vs Count

I have a use case where I want to count types of elements in an RDD matching some filter.
e.g. RDD.filter(F1) and RDD.filter(!F1)
I have 2 options
Use accumulators: e.g.
LongAccumulator l1 = sparkContext.longAccumulator("Count1");
LongAccumulator l2 = sparkContext.longAccumulator("Count2");
RDD.foreach(x -> {
  if (F1(x)) l1.add(1);
  else l2.add(1);
});
Use Count
RDD.filter(F1).count(); RDD.filter(!F1).count()
One benefit of the first approach is that we only need to iterate over the data once (useful since my data set is 10s of TB).
What is the use of count if the same effect can be achieved with accumulators?
The major difference is that if your code fails partway through a transformation, the accumulators will already have been updated, whereas count() returns no result.
Another option is to use pure map-reduce:
val counts = rdd.map(x => (F1(x), 1)).reduceByKey(_ + _).collectAsMap()
The network cost should also be low, as only a few numbers will be sent. It creates pairs of (F1(x), 1), where F1(x) is true or false, and then sums all the ones; the counts map will give you the number of items for both F1(x) and !F1(x).
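The single-pass idea can be sketched in plain Python as follows. `count_by_predicate` and `is_even` are hypothetical stand-ins (for the map/reduceByKey pipeline and for F1 respectively), and the data is made up; the point is only that both counts fall out of one traversal.

```python
def count_by_predicate(items, predicate):
    """One pass over items: count how many satisfy predicate and how many do not."""
    counts = {True: 0, False: 0}
    for item in items:
        counts[bool(predicate(item))] += 1
    return counts

def is_even(n):
    # Hypothetical stand-in for the filter F1.
    return n % 2 == 0

data = list(range(1, 8))  # 1..7
counts = count_by_predicate(data, is_even)

# The two-pass equivalent, i.e. filter(F1).count() and filter(!F1).count().
two_pass = {
    True: len([n for n in data if is_even(n)]),
    False: len([n for n in data if not is_even(n)]),
}
```

The result is keyed by the predicate's outcome, mirroring the (F1(x), 1) pairs in the map-reduce version.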

Incrementally load big RDD file into memory

val locations = filelines.map(line => line.split("\t")).map(t => (t(5).toLong, (t(2).toDouble, t(3).toDouble))).distinct().collect()
val cartesienProduct=locations.cartesian(locations).map(t=> Edge(t._1._1,t._2._1,distanceAmongPoints(t._1._2._1,t._1._2._2,t._2._2._1,t._2._2._2)))
The code executes perfectly fine up to here, but when I try to use "cartesienProduct" it gets stuck, i.e.
val count =cartesienProduct.count()
Any help in doing this efficiently would be highly appreciated.
First, the map transformation can be made more readable if written as:
locations.cartesian(locations).map {
  case ((a1, (b1, c1)), (a2, (b2, c2))) =>
    Edge(a1, a2, distanceAmongPoints(b1, c1, b2, c2))
}
It seems the objective is to calculate the distance between two points for all pairs. cartesian will give each pair twice, effectively computing the same distance twice.
To avoid that, one approach could be to broadcast a copy of all the points and then compare in parts.
val points: Array[Point] = ???  // an array of points; Point stands for your own point type
val pointsRDD = sc.parallelize(points.zipWithIndex)
val bPoints = sc.broadcast(points)
pointsRDD.map { case (point, index) =>
  // Compare only with points that come later, so each pair is visited once
  (index + 1 until bPoints.value.size).map { i =>
    distanceBetweenPoints(point, bPoints.value(i))
  }
}
If the number of points is N, it will compare point-0 with (point-1 to point-N-1), point-1 with (point-2 to point-N-1), etc.
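The index pattern described above can be sketched locally in plain Python. The sample points are made up, and Euclidean distance is an assumption standing in for distanceBetweenPoints, whose definition is not given in the thread.

```python
import math

def pairwise_distances(points):
    """Distance for each unordered pair (i, j) with i < j, computed exactly once."""
    result = {}
    for i, (x1, y1) in enumerate(points):
        # Only look at points after index i, as in the broadcast version.
        for j in range(i + 1, len(points)):
            x2, y2 = points[j]
            result[(i, j)] = math.hypot(x2 - x1, y2 - y1)  # assumed metric
    return result

points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0), (0.0, 8.0)]
distances = pairwise_distances(points)
```

For N points this visits N*(N-1)/2 pairs, compared with the N*N pairs (including each point with itself) that a full cartesian product generates.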
