Suppose I have a RowMatrix.
How can I transpose it. The API documentation does not seem to have a transpose method.
The Matrix has the transpose() method. But it is not distributed. If I have a large matrix greater that the memory how can I transpose it?
I have converted a RowMatrix to DenseMatrix as follows
DenseMatrix Mat = new DenseMatrix(m,n,MatArr);
which requires converting the RowMatrix to JavaRDD and converting JavaRDD to an array.
Is there any other convenient way to do the conversion?
Thanks in advance
If anybody interested, I've implemented the distributed version #javadba had proposed.
def transposeRowMatrix(m: RowMatrix): RowMatrix = {
val transposedRowsRDD = m.rows.zipWithIndex.map{case (row, rowIndex) => rowToTransposedTriplet(row, rowIndex)}
.flatMap(x => x) // now we have triplets (newRowIndex, (newColIndex, value))
.groupByKey
.sortByKey().map(_._2) // sort rows and remove row indexes
.map(buildRow) // restore order of elements in each row and remove column indexes
new RowMatrix(transposedRowsRDD)
}
def rowToTransposedTriplet(row: Vector, rowIndex: Long): Array[(Long, (Long, Double))] = {
val indexedRow = row.toArray.zipWithIndex
indexedRow.map{case (value, colIndex) => (colIndex.toLong, (rowIndex, value))}
}
def buildRow(rowWithIndexes: Iterable[(Long, Double)]): Vector = {
val resArr = new Array[Double](rowWithIndexes.size)
rowWithIndexes.foreach{case (index, value) =>
resArr(index.toInt) = value
}
Vectors.dense(resArr)
}
You can use BlockMatrix, which can be created from an IndexedRowMatrix:
BlockMatrix matA = (new IndexedRowMatrix(...).toBlockMatrix().cache();
matA.validate();
BlockMatrix matB = matA.transpose();
Then, can be easily put back as IndexedRowMatrix. This is described in the spark documentation.
You are correct: there is no
RowMatrix.transpose()
method. You will need to do this operation manually.
Here is the non-distributed/local matrix versions:
def transpose(m: Array[Array[Double]]): Array[Array[Double]] = {
(for {
c <- m(0).indices
} yield m.map(_(c)) ).toArray
}
The distributed version would be along the following lines:
origMatRdd.rows.zipWithIndex.map{ case (rvect, i) =>
rvect.zipWithIndex.map{ case (ax, j) => ((j,(i,ax))
}.groupByKey
.sortBy{ case (i, ax) => i }
.foldByKey(new DenseVector(origMatRdd.numRows())) { case (dv, (ix,ax)) =>
dv(ix) = ax
}
Caveat: I have not tested the above: it will have bugs. But the basic approach is valid - and similar to work I had done in the past for a small LinAlg library for spark.
For very large and sparse matrix, (like the one you get from text feature extraction), the best and easiest way is:
def transposeRowMatrix(m: RowMatrix): RowMatrix = {
val indexedRM = new IndexedRowMatrix(m.rows.zipWithIndex.map({
case (row, idx) => new IndexedRow(idx, row)}))
val transposed = indexedRM.toCoordinateMatrix().transpose.toIndexedRowMatrix()
new RowMatrix(transposed.rows
.map(idxRow => (idxRow.index, idxRow.vector))
.sortByKey().map(_._2))
}
For not so sparse matrix, you can use BlockMatrix as the bridge as mentioned by aletapool's answer above.
However aletapool's answer misses a very important point: When you start from RowMaxtrix -> IndexedRowMatrix -> BlockMatrix -> transpose -> BlockMatrix -> IndexedRowMatrix -> RowMatrix, in the last step (IndexedRowMatrix -> RowMatrix), you have to do a sort. Because by default, converting from IndexedRowMatrix to RowMatrix, the index is simply dropped and the order will be messed up.
val data = Array(
MllibVectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
MllibVectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
MllibVectors.dense(4.0, 0.0, 0.0, 6.0, 7.0),
MllibVectors.sparse(5, Seq((2, 2.0), (3, 7.0))))
val dataRDD = sc.parallelize(data, 4)
val testMat: RowMatrix = new RowMatrix(dataRDD)
testMat.rows.collect().map(_.toDense).foreach(println)
[0.0,1.0,0.0,7.0,0.0]
[2.0,0.0,3.0,4.0,5.0]
[4.0,0.0,0.0,6.0,7.0]
[0.0,0.0,2.0,7.0,0.0]
transposeRowMatrix(testMat).
rows.collect().map(_.toDense).foreach(println)
[0.0,2.0,4.0,0.0]
[1.0,0.0,0.0,0.0]
[0.0,3.0,0.0,2.0]
[7.0,4.0,6.0,7.0]
[0.0,5.0,7.0,0.0]
Getting the transpose of RowMatrix in Java:
public static RowMatrix transposeRM(JavaSparkContext jsc, RowMatrix mat){
List<Vector> newList=new ArrayList<Vector>();
List<Vector> vs = mat.rows().toJavaRDD().collect();
double [][] tmp=new double[(int)mat.numCols()][(int)mat.numRows()] ;
for(int i=0; i < vs.size(); i++){
double[] rr=vs.get(i).toArray();
for(int j=0; j < mat.numCols(); j++){
tmp[j][i]=rr[j];
}
}
for(int i=0; i < mat.numCols();i++)
newList.add(Vectors.dense(tmp[i]));
JavaRDD<Vector> rows2 = jsc.parallelize(newList);
RowMatrix newmat = new RowMatrix(rows2.rdd());
return (newmat);
}
This is a variant of the previous solution but working for sparse row matrix and keeping the transposed sparse too when needed:
def transpose(X: RowMatrix): RowMatrix = {
val m = X.numRows ().toInt
val n = X.numCols ().toInt
val transposed = X.rows.zipWithIndex.flatMap {
case (sp: SparseVector, i: Long) => sp.indices.zip (sp.values).map {case (j, value) => (i, j, value)}
case (dp: DenseVector, i: Long) => Range (0, n).toArray.zip (dp.values).map {case (j, value) => (i, j, value)}
}.sortBy (t => t._1).groupBy (t => t._2).map {case (i, g) =>
val (indices, values) = g.map {case (i, j, value) => (i.toInt, value)}.unzip
if (indices.size == m) {
(i, Vectors.dense (values.toArray) )
} else {
(i, Vectors.sparse (m, indices.toArray, values.toArray))
}
}.sortBy(t => t._1).map (t => t._2)
new RowMatrix (transposed)
}
Hope this help!
Related
Is there an efficient way to update a value for a certain index (i,j) for CoordinateMatrix?
Currently I'm using map to iterate all values and update only when I find those certain indexes but I don't think this is the right way to do it
There is not. CoordinateMatrix is backed by and RDD and is immutable. Even if you optimize access by:
Getting its entries:
val mat: CoordinateMatrix = ???
val entries = mat.entries
Converting to RDD of ((row, col), value) and hash partitioning.
val n: Int = ???
val partitioner = new org.apache.spark.HashPartitioner(n)
val pairs = entries.map(e => ((e.i, e.j), e.value)).partitionBy(partitioner)
Mapping only a single partition:
def update(mat: RDD[((Long, Long), Double)], i: Long, j: Long, v: Double) = {
val p = mat.partitioner.map(_.getPartition((i, j)))
p.map(p => mat.mapPartitionsWithIndex{
case (pi, iter) if pi == p => iter.map {
case ((ii, jj), _) if ii == i && jj == j => ((ii, jj), v)
case x => x
}
case (_, iter) => iter
})
}
you'll still make a new RDD for each update.
I am loading data from phoenix through this:
val tableDF = sqlContext.phoenixTableAsDataFrame("Hbtable", Array("ID", "distance"), conf = configuration)
and want to carry out the following computation on the column values distance:
val list=Array(10,20,30,40,10,20,0,10,20,30,40,50,60)//list of values from the column distance
val first=list(0)
val last=list(list.length-1)
var m = 0;
for (a <- 0 to list.length-2) {
if (list(a + 1) < list(a) && list(a+1)>=0)
{
m = m + list(a)
}
}
val totalDist=(m+last-first)
You can do something like this. It returns Array[Any]
`val array = df.select("distance").rdd.map(r => r(0)).collect()
If you want to get the data type properly, then you can use. It returns the Array[Int]
val array = df.select("distance").rdd.map(r => r(0).asInstanceOf[Int]).collect()
I have tried pairing the samples but it costs huge amount of memory as 100 samples leads to 9900 samples which is more costly. What could be the more effective way of computing distance matrix in distributed environment in spark
Here is a snippet of pseudo code what i'm trying
val input = (sc.textFile("AirPassengers.csv",(numPartitions/2)))
val i = input.map(s => (Vectors.dense(s.split(',').map(_.toDouble))))
val indexed = i.zipWithIndex() //Including the index of each sample
val indexedData = indexed.map{case (k,v) => (v,k)}
val pairedSamples = indexedData.cartesian(indexedData)
val filteredSamples = pairedSamples.filter{ case (x,y) =>
(x._1.toInt > y._1.toInt) //to consider only the upper or lower trainagle
}
filteredSamples.cache
filteredSamples.count
Above code creates the pairs but even if my dataset contains 100 samples, by pairing filteredSamples (above) results in 4950 sample which could be very costly for big data
I recently answered a similar question.
Basically, it will arrive to computing n(n-1)/2 pairs, which would be 4950 computations in your example. However, what makes this approach different is that I use joins instead of cartesian. With your code, the solution would look like this:
val input = (sc.textFile("AirPassengers.csv",(numPartitions/2)))
val i = input.map(s => (Vectors.dense(s.split(',').map(_.toDouble))))
val indexed = i.zipWithIndex()
// including the index of each sample
val indexedData = indexed.map { case (k,v) => (v,k) }
// prepare indices
val count = i.count
val indices = sc.parallelize(for(i <- 0L until count; j <- 0L until count; if i > j) yield (i, j))
val joined1 = indices.join(indexedData).map { case (i, (j, v)) => (j, (i,v)) }
val joined2 = joined1.join(indexedData).map { case (j, ((i,v1),v2)) => ((i,j),(v1,v2)) }
// after that, you can then compute the distance using your distFunc
val distRDD = joined2.mapValues{ case (v1, v2) => distFunc(v1, v2) }
Try this method and compare it with the one you already posted. Hopefully, this can speedup your code a bit.
As far as I can see from checking various sources and the Spark mllib clustering site, Spark doesn't currently support the distance or pdist matrices.
In my opinion, 100 samples will always output at least 4950 values; so manually creating a distributed matrix solver using a transformation (like .map) would be the best solution.
This can serve as the java version of jtitusj's answer..
public JavaPairRDD<Tuple2<Long, Long>, Double> getDistanceMatrix(Dataset<Row> ds, String vectorCol) {
JavaRDD<Vector> rdd = ds.toJavaRDD().map(new Function<Row, Vector>() {
private static final long serialVersionUID = 1L;
public Vector call(Row row) throws Exception {
return row.getAs(vectorCol);
}
});
List<Vector> vectors = rdd.collect();
long count = ds.count();
List<Tuple2<Tuple2<Long, Long>, Double>> distanceList = new ArrayList<Tuple2<Tuple2<Long, Long>, Double>>();
for(long i=0; i < count; i++) {
for(long j=0; j < count && i > j; j++) {
Tuple2<Long, Long> indexPair = new Tuple2<Long, Long>(i, j);
double d = DistanceMeasure.getDistance(vectors.get((int)i), vectors.get((int)j));
distanceList.add(new Tuple2<Tuple2<Long, Long>, Double>(indexPair, d));
}
}
return distanceList;
}
I want to know if in an RDD, for example, RDD = {"0", "1", "2",... "99999"}, can I find out the machine in the cluster which stores a given element (e.g.: 100)?
And then in shuffle, can I aggregate some data and send it to the certain machine? I know that the partition of RDD is transparent for users but could I use some method like key/value to achieve that?
Generally speaking the answer is no or at least not with RDD API. If you can express your logic using graphs then you can try message based API in GraphX or Giraph. If not then using Akka directly instead of Spark could be a better choice.
Still, there are some workarounds but I wouldn't expect high performance. Lets start with some dummy data:
import org.apache.spark.rdd.RDD
val toPairs = (s: Range) => s.map(_.toChar.toString)
val rdd: RDD[(Int, String)] = sc.parallelize(Seq(
(0, toPairs(97 to 100)), // a-d
(1, toPairs(101 to 107)), // e-k
(2, toPairs(108 to 115)) // l-s
)).flatMap{ case (i, vs) => vs.map(v => (i, v)) }
and partition it using custom partitioner:
import org.apache.spark.Partitioner
class IdentityPartitioner(n: Int) extends Partitioner {
def numPartitions: Int = n
def getPartition(key: Any): Int = key.asInstanceOf[Int]
}
val partitioner = new IdentityPartitioner(4)
val parts = rdd.partitionBy(partitioner)
Now we have RDD with 4 partitions including one empty:
parts.mapPartitionsWithIndex((i, iter) => Iterator((i, iter.size))).collect
// Array[(Int, Int)] = Array((0,4), (1,7), (2,8), (3,0))
The simplest thing you can do is to leverage partitioning itself. First a dummy function and a helper:
// Dummy map function
def transform(s: String) =
Map("e" -> "x", "k" -> "y", "l" -> "z").withDefault(identity)(s)
// Map String to partition
def address(curr: Int, s: String) = {
val m = Map("x" -> 3, "y" -> 3, "z" -> 3).withDefault(x => curr)
(m(s), s)
}
and "send" data:
val transformed: RDD[(Int, String)] = parts
// Emit pairs (partition, string)
.map{case (i, s) => address(i, transform(s))}
// Repartition
.partitionBy(partitioner)
transformed
.mapPartitionsWithIndex((i, iter) => Iterator((i, iter.size)))
.collect
// Array[(Int, Int)] = Array((0,4), (1,5), (2,7), (3,3))
another approach is to collect "messages":
val tmp = parts.mapValues(s => transform(s))
val messages: Map[Int,Iterable[String]] = tmp
.flatMap{case (i, s) => {
val target = address(i, s)
if (target != (i, s)) Seq(target) else Seq()
}}
.groupByKey
.collectAsMap
create broadcast
val messagesBD = sc.broadcast(messages)
and use it to send messages:
val transformed = tmp
.filter{case (i, s) => address(i, s) == (i, s)}
.mapPartitionsWithIndex((i, iter) => {
val combined = iter ++ messagesBD.value.getOrElse(i, Seq())
combined.map((i, _))
}, true)
transformed
.mapPartitionsWithIndex((i, iter) => Iterator((i, iter.size)))
.collect
// Array[(Int, Int)] = Array((0,4), (1,5), (2,7), (3,3))
Note the following line:
val combined = iter ++ messagesBD.value.getOrElse(i, Seq())
messagesBD.value is the entire broadcast data, which is actually a Map[Int,Iterable[String]], but then getOrElse method returns only the data that was mapped to i (if available).
I've implemented a solution to group RDD[K, V] by key and to compute data according to each group (K, RDD[V]), using partitionBy and Partitioner. Nevertheless, I'm not sure if it is really efficient and I'd like to have your point of view.
Here is a sample case : according to a list of [K: Int, V: Int], compute the Vs mean for each group of K, knowing that it should be distributed and that V values may be very large. That should give :
List[K, V] => (K, mean(V))
The simple Partitioner class:
class MyPartitioner(maxKey: Int) extends Partitioner {
def numPartitions = maxKey
def getPartition(key: Any): Int = key match {
case i: Int if i < maxKey => i
}
}
The partition code :
val l = List((1, 1), (1, 8), (1, 30), (2, 4), (2, 5), (3, 7))
val rdd = sc.parallelize(l)
val p = rdd.partitionBy(new MyPartitioner(4)).cache()
p.foreachPartition(x => {
try {
val r = sc.parallelize(x.toList)
val id = r.first() //get the K partition id
val v = r.map(x => x._2)
println(id._1 + "->" + mean(v))
} catch {
case e: UnsupportedOperationException => 0
}
})
The output is :
1->13, 2->4, 3->7
My questions are :
what does it really happen when calling partitionBy ? (sorry, I didn't find enough specs on it)
Is it really efficient to map by partition, knowing that in my production case it would not be too much keys (as 50 for sample) by very much values (as 1 million for sample)
What is the cost of paralellize(x.toList) ? Is it consistent to do it ? (I need a RDD in input of mean())
How would you do it by yourself ?
Regards
Your code should not work. You cannot pass the SparkContext object to the executors. (It's not Serializable.) Also I don't see why you would need to.
To calculate the mean, you need to calculate the sum and the count and take their ratio. The default partitioner will do fine.
def meanByKey(rdd: RDD[(Int, Int)]): RDD[(Int, Double)] = {
case class SumCount(sum: Double, count: Double)
val sumCounts = rdd.aggregateByKey(SumCount(0.0, 0.0))(
(sc, v) => SumCount(sc.sum + v, sc.count + 1.0),
(sc1, sc2) => SumCount(sc1.sum + sc2.sum, sc1.count + sc2.count))
sumCounts.map(sc => sc.sum / sc.count)
}
This is an efficient single-pass calculation that generalizes well.