Using PartitionBy to split and efficiently compute RDD groups by Key - apache-spark

I've implemented a solution that groups an RDD[K, V] by key and computes data for each group (K, RDD[V]), using partitionBy and a Partitioner. Nevertheless, I'm not sure whether it is really efficient, and I'd like to have your point of view.
Here is a sample case: given a list of [K: Int, V: Int], compute the mean of the Vs for each group of K, knowing that the computation should be distributed and that there may be very many V values. The result should be:
List[K, V] => (K, mean(V))
The simple Partitioner class:
class MyPartitioner(maxKey: Int) extends Partitioner {
  def numPartitions = maxKey
  def getPartition(key: Any): Int = key match {
    case i: Int if i < maxKey => i
  }
}
The partition code:
val l = List((1, 1), (1, 8), (1, 30), (2, 4), (2, 5), (3, 7))
val rdd = sc.parallelize(l)
val p = rdd.partitionBy(new MyPartitioner(4)).cache()

p.foreachPartition(x => {
  try {
    val r = sc.parallelize(x.toList)
    val id = r.first() // get the K partition id
    val v = r.map(x => x._2)
    println(id._1 + "->" + mean(v))
  } catch {
    case e: UnsupportedOperationException => 0
  }
})
The output is:
1->13, 2->4, 3->7
My questions are:
What really happens when partitionBy is called? (Sorry, I didn't find enough specs on it.)
Is it really efficient to map by partition, knowing that in my production case there would not be many keys (around 50) but a very large number of values per key (around 1 million)?
What is the cost of parallelize(x.toList)? Is it reasonable to do it? (I need an RDD as input to mean().)
How would you do it yourself?
Regards

Your code should not work. You cannot pass the SparkContext object to the executors. (It's not Serializable.) Also I don't see why you would need to.
To calculate the mean, you need to calculate the sum and the count and take their ratio. The default partitioner will do fine.
import org.apache.spark.rdd.RDD

def meanByKey(rdd: RDD[(Int, Int)]): RDD[(Int, Double)] = {
  case class SumCount(sum: Double, count: Double)
  val sumCounts = rdd.aggregateByKey(SumCount(0.0, 0.0))(
    (sc, v) => SumCount(sc.sum + v, sc.count + 1.0),                  // fold a value into the per-partition accumulator
    (sc1, sc2) => SumCount(sc1.sum + sc2.sum, sc1.count + sc2.count)) // merge accumulators across partitions
  sumCounts.mapValues(sc => sc.sum / sc.count)                        // keep the key, emit the mean
}
This is an efficient single-pass calculation that generalizes well.
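As a quick sanity check, here is a minimal usage sketch on the sample list from the question (assuming an existing SparkContext named sc):
val l = List((1, 1), (1, 8), (1, 30), (2, 4), (2, 5), (3, 7))
val means = meanByKey(sc.parallelize(l)).collect().sortBy(_._1)
means.foreach { case (k, m) => println(s"$k -> $m") }
// 1 -> 13.0
// 2 -> 4.5
// 3 -> 7.0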

Related

Update CoordinateMatrix entry

Is there an efficient way to update a value for a certain index (i,j) for CoordinateMatrix?
Currently I'm using map to iterate all values and update only when I find those certain indexes, but I don't think this is the right way to do it.
There is not. CoordinateMatrix is backed by an RDD and is immutable. Even if you optimize access by:
Getting its entries:
val mat: CoordinateMatrix = ???
val entries = mat.entries
Converting to RDD of ((row, col), value) and hash partitioning.
val n: Int = ???
val partitioner = new org.apache.spark.HashPartitioner(n)
val pairs = entries.map(e => ((e.i, e.j), e.value)).partitionBy(partitioner)
Mapping only a single partition:
def update(mat: RDD[((Long, Long), Double)], i: Long, j: Long, v: Double) = {
  val p = mat.partitioner.map(_.getPartition((i, j)))
  p.map(p => mat.mapPartitionsWithIndex {
    case (pi, iter) if pi == p => iter.map {
      case ((ii, jj), _) if ii == i && jj == j => ((ii, jj), v)
      case x => x
    }
    case (_, iter) => iter
  })
}
you'll still make a new RDD for each update.
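A hedged usage sketch of the update helper above, reusing the pairs RDD from the earlier step (the coordinates and value here are made up):
// update returns None when the RDD has no partitioner; pairs was partitionBy'd, so we get Some(newRdd)
val updated: Option[RDD[((Long, Long), Double)]] = update(pairs, 0L, 1L, 42.0)
updated.foreach(_.count()) // forces evaluation; the original pairs RDD is left untouched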

how to compute average degree of neighbors with GraphX

I want to compute the average degree of neighbors for each node in my graph. Say we have a graph like this:
val users: RDD[(VertexId, String)] =
  sc.parallelize(Array((3L, "rxin"), (7L, "jgonzal"),
                       (5L, "franklin"), (2L, "istoica")))

// Create an RDD for edges
val relationships: RDD[Edge[Int]] = sc.parallelize(
  Array(Edge(3L, 7L, 12), Edge(5L, 3L, 1),
        Edge(2L, 5L, 3), Edge(5L, 7L, 5)))

// Build the initial Graph
val graph = Graph(users, relationships)
EDIT
To have an idea of the outcome, take node 5 and its neighbors:
node 3 which has degree = 2
node 7 which has degree = 2
node 2 which has degree = 1
The output for this measure is simply the mean degree of node 5's neighbors: (2 + 2 + 1) / 3 ≈ 1.667
Ideally, you want to remove links with node 5 in this computation, but that doesn't really matter to me now...
END EDIT
I am trying to apply aggregateMessages, but I don't know how to retrieve the degree of each node while I am inside the aggregateMessages call:
val neideg = g.aggregateMessages[(Long, Double)](
  triplet => {
    val comparedAttrs = compareAttrs(triplet.dstAttr, triplet.srcAttr) // BUT HERE I SHOULD GIVE ALSO THE DEGREE
    triplet.sendToDst(1L, comparedAttrs)
    triplet.sendToSrc(1L, comparedAttrs)
  },
  { case ((cnt1, v1), (cnt2, v2)) => (cnt1 + cnt2, v1 + v2) })
val aveneideg = neideg.mapValues(kv => kv._2 / kv._1.toDouble).toDF("id", "aveneideg")
Then I have a function that does the sum:
def compareAttrs(xs: (Int, String), ys: (Int, String)): Double = {
  xs._1.toDouble + ys._1.toDouble
}
How can I also pass the degree of those nodes to compareAttrs?
Of course, I'm more than happy to see a simpler and smarter solution for this task than the one I am trying to craft...
I'm not sure if that's what you're after, but this is something you could go with:
val degrees = graph.degrees

// now we have a graph where the attribute of each vertex is its degree
val graphWithDegrees = graph.outerJoinVertices(degrees) { (_, _, optDegree) =>
  optDegree.getOrElse(1)
}

// now each vertex sends its degree to its neighbours,
// and we aggregate them in a list so each vertex gets all values
// of its neighbours
val neighboursDegreeAndCount = graphWithDegrees.aggregateMessages[List[Long]](
  sendMsg = triplet => {
    val srcDegree = triplet.srcAttr
    val dstDegree = triplet.dstAttr
    triplet.sendToDst(List(srcDegree))
    triplet.sendToSrc(List(dstDegree))
  },
  mergeMsg = (x, y) => x ++ y
).mapValues(degrees => degrees.sum / degrees.size.toDouble)

// now if you want it in the original graph you can do
// outerJoinVertices again, and then the attr of a vertex
// in the graph is the avg degree of its neighbours
graph.outerJoinVertices(neighboursDegreeAndCount) { (_, _, optAvgDegree) =>
  optAvgDegree.getOrElse(1)
}
So for your example the output is: Array((5,1.6666666666666667), (2,3.0), (3,2.5), (7,2.5))

Accessing rows outside of window while aggregating in Spark dataframe

In short, in the example below I want to pin 'b to be the value in the row that the result will appear in.
Given:
a,b
1,2
4,6
3,7 ==> 'special would be: (1-7 + 4-7 + 3-7) == -13 in this row
val baseWin = Window.partitionBy("something_I_forgot").orderBy("whatever")
val win = baseWin.rowsBetween(-2, 0)
frame.withColumn("special", sum('a - 'b).over(win))
Or another way to think of it is I want to close over the row when I calculate the sum so that I can pass in the value of 'b (in this case 7)
* Update *
Here is what I want to accomplish as a UDF. In short, I used a foldLeft.
def mad(field: Column, numPeriods: Integer): Column = {
  val baseWin = Window.partitionBy("exchange", "symbol").orderBy("datetime")
  val win = baseWin.rowsBetween(numPeriods + 1, 0)
  val subFunc: (Seq[Double], Int) => Double = { (input: Seq[Double], numPeriods: Int) =>
    val agg = grizzled.math.stats.mean(input: _*)
    val fooBar = (1.0 / -numPeriods) * input.foldLeft(0.0)((a, b) => a + Math.abs(b - agg))
    fooBar
  }
  val myUdf = udf(subFunc)
  myUdf(collect_list(field.cast(DoubleType)).over(win), lit(numPeriods))
}
If I understood correctly what you're trying to do, I think you can refactor your logic a bit to achieve it. The way you have it right now, you're probably getting "-7" instead of -13.
For the "special" column, (1-7 + 4-7 + 3-7), you can calculate it like (sum(a) - count(*) * b):
dfA.withColumn("special", sum('a).over(win) - count("*").over(win) * 'b)
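A minimal end-to-end sketch of that idea on the sample data from the question (the column names and window bounds are taken from the question; it assumes a SparkSession named spark):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{count, sum}
import spark.implicits._

val frame = Seq(("x", 1, 1, 2), ("x", 2, 4, 6), ("x", 3, 3, 7))
  .toDF("something_I_forgot", "whatever", "a", "b")

val win = Window.partitionBy("something_I_forgot").orderBy("whatever").rowsBetween(-2, 0)

frame.withColumn("special", sum($"a").over(win) - count("*").over(win) * $"b").show()
// last row: (1 + 4 + 3) - 3 * 7 = -13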

How to find out the machine in the cluster which stores a given element in RDD and send a message to it?

I want to know if in an RDD, for example, RDD = {"0", "1", "2",... "99999"}, can I find out the machine in the cluster which stores a given element (e.g.: 100)?
And then in shuffle, can I aggregate some data and send it to the certain machine? I know that the partition of RDD is transparent for users but could I use some method like key/value to achieve that?
Generally speaking the answer is no or at least not with RDD API. If you can express your logic using graphs then you can try message based API in GraphX or Giraph. If not then using Akka directly instead of Spark could be a better choice.
Still, there are some workarounds, but I wouldn't expect high performance. Let's start with some dummy data:
import org.apache.spark.rdd.RDD

val toPairs = (s: Range) => s.map(_.toChar.toString)

val rdd: RDD[(Int, String)] = sc.parallelize(Seq(
  (0, toPairs(97 to 100)),  // a-d
  (1, toPairs(101 to 107)), // e-k
  (2, toPairs(108 to 115))  // l-s
)).flatMap { case (i, vs) => vs.map(v => (i, v)) }
and partition it using a custom partitioner:
import org.apache.spark.Partitioner

class IdentityPartitioner(n: Int) extends Partitioner {
  def numPartitions: Int = n
  def getPartition(key: Any): Int = key.asInstanceOf[Int]
}

val partitioner = new IdentityPartitioner(4)
val parts = rdd.partitionBy(partitioner)
Now we have an RDD with 4 partitions, including one empty:
parts.mapPartitionsWithIndex((i, iter) => Iterator((i, iter.size))).collect
// Array[(Int, Int)] = Array((0,4), (1,7), (2,8), (3,0))
The simplest thing you can do is to leverage partitioning itself. First a dummy function and a helper:
// Dummy map function
def transform(s: String) =
  Map("e" -> "x", "k" -> "y", "l" -> "z").withDefault(identity)(s)

// Map String to partition
def address(curr: Int, s: String) = {
  val m = Map("x" -> 3, "y" -> 3, "z" -> 3).withDefault(x => curr)
  (m(s), s)
}
and "send" data:
val transformed: RDD[(Int, String)] = parts
  // Emit pairs (partition, string)
  .map { case (i, s) => address(i, transform(s)) }
  // Repartition
  .partitionBy(partitioner)

transformed
  .mapPartitionsWithIndex((i, iter) => Iterator((i, iter.size)))
  .collect
// Array[(Int, Int)] = Array((0,4), (1,5), (2,7), (3,3))
Another approach is to collect "messages":
val tmp = parts.mapValues(s => transform(s))

val messages: Map[Int, Iterable[String]] = tmp
  .flatMap { case (i, s) =>
    val target = address(i, s)
    if (target != (i, s)) Seq(target) else Seq()
  }
  .groupByKey
  .collectAsMap
Create a broadcast variable:
val messagesBD = sc.broadcast(messages)
and use it to send messages:
val transformed = tmp
  // Keep the pairs that stay on their current partition
  .filter { case (i, s) => address(i, s) == (i, s) }
  .mapPartitionsWithIndex((i, iter) => {
    // Append the messages addressed to this partition
    val combined = iter.map(_._2) ++ messagesBD.value.getOrElse(i, Seq())
    combined.map((i, _))
  }, true)

transformed
  .mapPartitionsWithIndex((i, iter) => Iterator((i, iter.size)))
  .collect
// Array[(Int, Int)] = Array((0,4), (1,5), (2,7), (3,3))
Note the following line:
val combined = iter.map(_._2) ++ messagesBD.value.getOrElse(i, Seq())
messagesBD.value is the entire broadcast data, which is actually a Map[Int, Iterable[String]], but the getOrElse call returns only the data that was mapped to i (if available).
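For the dummy data above, only partition 3 receives any messages ("e" and "k" from partition 1 and "l" from partition 2 are re-addressed to it), so a quick sanity check could look like this sketch:
// With the sample data, messagesBD.value is Map(3 -> Iterable("x", "y", "z"))
messagesBD.value.foreach { case (p, msgs) =>
  println(s"partition $p receives: ${msgs.mkString(", ")}")
}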

Matrix Transpose on RowMatrix in Spark

Suppose I have a RowMatrix.
How can I transpose it? The API documentation does not seem to have a transpose method.
The Matrix class has a transpose() method, but it is not distributed. If I have a large matrix that is bigger than memory, how can I transpose it?
I have converted a RowMatrix to a DenseMatrix as follows:
DenseMatrix Mat = new DenseMatrix(m, n, MatArr);
which requires converting the RowMatrix to a JavaRDD and then converting the JavaRDD to an array.
Is there any other convenient way to do the conversion?
Thanks in advance
In case anybody is interested, I've implemented the distributed version that @javadba proposed.
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

def transposeRowMatrix(m: RowMatrix): RowMatrix = {
  val transposedRowsRDD = m.rows.zipWithIndex.map { case (row, rowIndex) => rowToTransposedTriplet(row, rowIndex) }
    .flatMap(x => x)       // now we have triplets (newRowIndex, (newColIndex, value))
    .groupByKey
    .sortByKey().map(_._2) // sort rows and remove row indexes
    .map(buildRow)         // restore order of elements in each row and remove column indexes
  new RowMatrix(transposedRowsRDD)
}

def rowToTransposedTriplet(row: Vector, rowIndex: Long): Array[(Long, (Long, Double))] = {
  val indexedRow = row.toArray.zipWithIndex
  indexedRow.map { case (value, colIndex) => (colIndex.toLong, (rowIndex, value)) }
}

def buildRow(rowWithIndexes: Iterable[(Long, Double)]): Vector = {
  val resArr = new Array[Double](rowWithIndexes.size)
  rowWithIndexes.foreach { case (index, value) =>
    resArr(index.toInt) = value
  }
  Vectors.dense(resArr)
}
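A minimal usage sketch of the function above, assuming an existing SparkContext named sc (the 2x2 matrix is made-up test data):
val small = new RowMatrix(sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0),
  Vectors.dense(3.0, 4.0))))

transposeRowMatrix(small).rows.collect().foreach(println)
// [1.0,3.0]
// [2.0,4.0]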
You can use BlockMatrix, which can be created from an IndexedRowMatrix:
BlockMatrix matA = new IndexedRowMatrix(...).toBlockMatrix().cache();
matA.validate();
BlockMatrix matB = matA.transpose();
Then it can easily be converted back to an IndexedRowMatrix. This is described in the Spark documentation.
You are correct: there is no RowMatrix.transpose() method. You will need to do this operation manually.
Here is the non-distributed/local matrix version:
def transpose(m: Array[Array[Double]]): Array[Array[Double]] = {
  (for {
    c <- m(0).indices
  } yield m.map(_(c))).toArray
}
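A quick local check of this helper (made-up 2x3 input):
val local = Array(Array(1.0, 2.0, 3.0), Array(4.0, 5.0, 6.0))
transpose(local).foreach(r => println(r.mkString(",")))
// 1.0,4.0
// 2.0,5.0
// 3.0,6.0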
The distributed version would be along the following lines:
origMatRdd.rows.zipWithIndex.flatMap { case (rvect, i) =>
  rvect.zipWithIndex.map { case (ax, j) => (j, (i, ax)) }
}.groupByKey
 .sortBy { case (i, ax) => i }
 .foldByKey(new DenseVector(origMatRdd.numRows())) { case (dv, (ix, ax)) =>
   dv(ix) = ax
 }
Caveat: I have not tested the above: it will have bugs. But the basic approach is valid - and similar to work I had done in the past for a small LinAlg library for spark.
For a very large and sparse matrix (like the one you get from text feature extraction), the best and easiest way is:
def transposeRowMatrix(m: RowMatrix): RowMatrix = {
  val indexedRM = new IndexedRowMatrix(m.rows.zipWithIndex.map({
    case (row, idx) => new IndexedRow(idx, row)}))
  val transposed = indexedRM.toCoordinateMatrix().transpose.toIndexedRowMatrix()
  new RowMatrix(transposed.rows
    .map(idxRow => (idxRow.index, idxRow.vector))
    .sortByKey().map(_._2))
}
For a not-so-sparse matrix, you can use BlockMatrix as the bridge, as mentioned in aletapool's answer above.
However, aletapool's answer misses a very important point: when you go from RowMatrix -> IndexedRowMatrix -> BlockMatrix -> transpose -> BlockMatrix -> IndexedRowMatrix -> RowMatrix, you have to do a sort in the last step (IndexedRowMatrix -> RowMatrix). By default, converting from IndexedRowMatrix to RowMatrix simply drops the index, so the row order gets messed up.
val data = Array(
  MllibVectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
  MllibVectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
  MllibVectors.dense(4.0, 0.0, 0.0, 6.0, 7.0),
  MllibVectors.sparse(5, Seq((2, 2.0), (3, 7.0))))

val dataRDD = sc.parallelize(data, 4)
val testMat: RowMatrix = new RowMatrix(dataRDD)

testMat.rows.collect().map(_.toDense).foreach(println)
// [0.0,1.0,0.0,7.0,0.0]
// [2.0,0.0,3.0,4.0,5.0]
// [4.0,0.0,0.0,6.0,7.0]
// [0.0,0.0,2.0,7.0,0.0]

transposeRowMatrix(testMat).rows.collect().map(_.toDense).foreach(println)
// [0.0,2.0,4.0,0.0]
// [1.0,0.0,0.0,0.0]
// [0.0,3.0,0.0,2.0]
// [7.0,4.0,6.0,7.0]
// [0.0,5.0,7.0,0.0]
Getting the transpose of a RowMatrix in Java:
public static RowMatrix transposeRM(JavaSparkContext jsc, RowMatrix mat) {
  // Note: collect() pulls every row to the driver, so the matrix must fit in driver memory.
  List<Vector> newList = new ArrayList<Vector>();
  List<Vector> vs = mat.rows().toJavaRDD().collect();
  double[][] tmp = new double[(int) mat.numCols()][(int) mat.numRows()];

  for (int i = 0; i < vs.size(); i++) {
    double[] rr = vs.get(i).toArray();
    for (int j = 0; j < mat.numCols(); j++) {
      tmp[j][i] = rr[j];
    }
  }

  for (int i = 0; i < mat.numCols(); i++) {
    newList.add(Vectors.dense(tmp[i]));
  }

  JavaRDD<Vector> rows2 = jsc.parallelize(newList);
  RowMatrix newmat = new RowMatrix(rows2.rdd());
  return newmat;
}
This is a variant of the previous solution, but it works for sparse row matrices and keeps the transposed matrix sparse when needed:
def transpose(X: RowMatrix): RowMatrix = {
  val m = X.numRows().toInt
  val n = X.numCols().toInt
  val transposed = X.rows.zipWithIndex.flatMap {
    case (sp: SparseVector, i: Long) => sp.indices.zip(sp.values).map { case (j, value) => (i, j, value) }
    case (dp: DenseVector, i: Long)  => Range(0, n).toArray.zip(dp.values).map { case (j, value) => (i, j, value) }
  }.sortBy(t => t._1).groupBy(t => t._2).map { case (i, g) =>
    val (indices, values) = g.map { case (i, j, value) => (i.toInt, value) }.unzip
    if (indices.size == m) {
      (i, Vectors.dense(values.toArray))
    } else {
      (i, Vectors.sparse(m, indices.toArray, values.toArray))
    }
  }.sortBy(t => t._1).map(t => t._2)
  new RowMatrix(transposed)
}
Hope this helps!
