How to compute the distance matrix in Spark? - apache-spark

I have tried pairing the samples, but it costs a huge amount of memory: 100 samples lead to 9,900 pairs, which is costly. What would be a more effective way of computing a distance matrix in a distributed environment in Spark?
Here is a snippet of pseudocode showing what I'm trying:
val input = sc.textFile("AirPassengers.csv", numPartitions / 2)
val i = input.map(s => Vectors.dense(s.split(',').map(_.toDouble)))
val indexed = i.zipWithIndex() // include the index of each sample
val indexedData = indexed.map { case (k, v) => (v, k) }

val pairedSamples = indexedData.cartesian(indexedData)

val filteredSamples = pairedSamples.filter { case (x, y) =>
  x._1.toInt > y._1.toInt // keep only the upper (or lower) triangle
}
filteredSamples.cache
filteredSamples.count
The code above creates the pairs, but even though my dataset contains only 100 samples, filteredSamples still holds 4,950 of them, which could be very costly for big data.

I recently answered a similar question.
Basically, it comes down to computing n(n-1)/2 pairs, which would be 4,950 computations in your example. However, what makes this approach different is that I use joins instead of cartesian. With your code, the solution would look like this:
val input = sc.textFile("AirPassengers.csv", numPartitions / 2)
val i = input.map(s => Vectors.dense(s.split(',').map(_.toDouble)))
val indexed = i.zipWithIndex()
// including the index of each sample
val indexedData = indexed.map { case (k, v) => (v, k) }

// prepare indices
val count = i.count
val indices = sc.parallelize(
  for (i <- 0L until count; j <- 0L until count if i > j) yield (i, j))

val joined1 = indices.join(indexedData).map { case (i, (j, v)) => (j, (i, v)) }
val joined2 = joined1.join(indexedData).map { case (j, ((i, v1), v2)) => ((i, j), (v1, v2)) }

// after that, you can then compute the distance using your distFunc
val distRDD = joined2.mapValues { case (v1, v2) => distFunc(v1, v2) }
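distFunc is left undefined above. A minimal sketch, assuming a plain Euclidean distance over mllib vectors (the function name and the choice of metric are assumptions, not part of the original answer):

import org.apache.spark.mllib.linalg.Vector

// Hypothetical distFunc: plain Euclidean distance between two vectors.
def distFunc(v1: Vector, v2: Vector): Double = {
  val a = v1.toArray
  val b = v2.toArray
  math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)
}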
Try this method and compare it with the one you already posted. Hopefully, this can speed up your code a bit.

As far as I can tell from checking various sources and the Spark MLlib clustering documentation, Spark doesn't currently provide a distance-matrix or pdist equivalent.
In my opinion, 100 samples will always produce at least 4,950 values; so manually creating a distributed solver using a transformation (like .map) would be the best solution.
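For instance, a minimal sketch reusing filteredSamples from the question and a distFunc like the one sketched above:

// Each element of filteredSamples is ((rowIndex, vector), (rowIndex, vector)).
val distances = filteredSamples.map { case ((i, v1), (j, v2)) =>
  ((i, j), distFunc(v1, v2))
}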

This can serve as the Java version of jtitusj's answer:
public JavaPairRDD<Tuple2<Long, Long>, Double> getDistanceMatrix(JavaSparkContext jsc, Dataset<Row> ds, String vectorCol) {
    JavaRDD<Vector> rdd = ds.toJavaRDD().map(new Function<Row, Vector>() {
        private static final long serialVersionUID = 1L;
        public Vector call(Row row) throws Exception {
            return row.getAs(vectorCol);
        }
    });
    // Collecting to the driver is only viable while all vectors fit in memory.
    List<Vector> vectors = rdd.collect();
    long count = ds.count();
    List<Tuple2<Tuple2<Long, Long>, Double>> distanceList =
        new ArrayList<Tuple2<Tuple2<Long, Long>, Double>>();
    for (long i = 0; i < count; i++) {
        for (long j = 0; j < i; j++) { // lower triangle only
            Tuple2<Long, Long> indexPair = new Tuple2<Long, Long>(i, j);
            // DistanceMeasure stands in for whatever distance helper you use.
            double d = DistanceMeasure.getDistance(vectors.get((int) i), vectors.get((int) j));
            distanceList.add(new Tuple2<Tuple2<Long, Long>, Double>(indexPair, d));
        }
    }
    // Distribute the computed triangle as a pair RDD.
    return jsc.parallelizePairs(distanceList);
}

Related

Optimization (in terms of speed)

Is there any other way to optimize this code? Can anyone come up with a better way? This is taking a lot of time in the main code. Thanks a lot ;)
HashMap<String, Integer> hmap = new HashMap<String, Integer>();
List<String> dup = new ArrayList<String>();
List<String> nondup = new ArrayList<String>();

for (String num : nums) {
    String result = num.toLowerCase();
    if (hmap.containsKey(result)) {
        hmap.put(result, hmap.get(result) + 1);
    } else {
        hmap.put(result, 1);
    }
}

for (String num : nums) {
    int count = hmap.get(num.toLowerCase());
    if (count == 1) {
        nondup.add(num);
    } else {
        dup.add(num);
    }
}
output:
[A/tea, C/SEA.java, C/clock, aep, aeP, C/SEA.java]
Dups: [C/SEA.java, aep, aeP, C/SEA.java]
nondups: [A/tea, C/clock]
How much time is "a lot of time"? Is your input bigger than what you've actually shown us?
You could parallelize this with something like Arrays.parallelStream(nums).collect(Collectors.groupingByConcurrent(k -> k.toLowerCase(), Collectors.counting())), which would get you a Map<String, Long>, but that would only speed up your code if you have a lot of input, which it doesn't look like you have right now.
You could parallelize the next step, if you liked, like so:
Map<String, Long> counts = Arrays.parallelStream(nums)
    .collect(Collectors.groupingByConcurrent(k -> k.toLowerCase(), Collectors.counting()));

Map<Boolean, List<String>> hasDup =
    counts.entrySet().parallelStream()
          .collect(Collectors.partitioningBy(
              entry -> entry.getValue() > 1,
              Collectors.mapping(Entry::getKey, Collectors.toList())));

List<String> dup = hasDup.get(true);
List<String> nodup = hasDup.get(false);
The algorithms in the other answers can speed up execution using multiple threads.
This can theoretically reduce the processing time by a factor of M, where M is the maximum number of threads your system can run concurrently. However, as M is a constant, this does not change the order of complexity, which therefore remains O(N).
At a glance, I do not see a way to solve your problem in less than O(N), I am afraid.

Why am I getting a race condition in multi-threaded Scala?

I am trying to parallelise a p-norm calculation over an array.
To achieve that I try the following. I understand I can solve this differently, but I am interested in understanding where the race condition is occurring:
val toSum = Array(0, 1, 2, 3, 4, 5, 6)

// Calculate the sum over a segment of an array
def sumSegment(a: Array[Int], p: Double, s: Int, t: Int): Int = {
  val res = { for (i <- s until t) yield scala.math.pow(a(i), p) }.reduceLeft(_ + _)
  res.toInt
}

// Calculate the p-norm over an Array a
def parallelpNorm(a: Array[Int], p: Double): Double = {
  var acc = 0L

  // The worker that should calculate the sum over a slice of the array
  class sumSegmenter(s: Int, t: Int) extends Thread {
    override def run() {
      // Calculate the sum over the slice
      val subsum = sumSegment(a, p, s, t)
      // Add the sum of the slice to the accumulator in a synchronized fashion
      val x = new AnyRef {}
      x.synchronized {
        acc = acc + subsum
      }
    }
  }

  val split = a.size / 2
  val seg_one = new sumSegmenter(0, split)
  val seg_two = new sumSegmenter(split, a.size)
  seg_one.start
  seg_two.start
  seg_one.join
  seg_two.join
  scala.math.pow(acc, 1.0 / p)
}

println(parallelpNorm(toSum, 2))
The expected output is 9.5393920142, but instead some runs give me 9.273618495495704 or even 2.23606797749979.
Any recommendations on where the race condition could happen?
The problem has been explained in the previous answer, but a better way to avoid this race condition and improve performance is to use an AtomicInteger:
import java.util.concurrent.atomic.AtomicInteger

// Calculate the p-norm over an Array a
def parallelpNorm(a: Array[Int], p: Double): Double = {
  val acc = new AtomicInteger(0)

  // The worker that should calculate the sum over a slice of the array
  class sumSegmenter(s: Int, t: Int) extends Thread {
    override def run() {
      // Calculate the sum over the slice
      val subsum = sumSegment(a, p, s, t)
      // Add the sum of the slice to the accumulator atomically
      acc.getAndAdd(subsum)
    }
  }

  val split = a.length / 2
  val seg_one = new sumSegmenter(0, split)
  val seg_two = new sumSegmenter(split, a.length)
  seg_one.start()
  seg_two.start()
  seg_one.join()
  seg_two.join()
  scala.math.pow(acc.get, 1.0 / p)
}
Modern processors can do atomic operations without blocking, which can be much faster than explicit synchronisation. In my tests this runs twice as fast as the original code (with a correct placement of x).
Move val x = new AnyRef{} outside sumSegmenter (that is, into parallelpNorm). The problem is that each thread is using its own mutex rather than sharing one.
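A minimal sketch of that fix, leaving everything else from the question unchanged:

def parallelpNorm(a: Array[Int], p: Double): Double = {
  var acc = 0L
  val x = new AnyRef {} // one lock object, shared by both worker threads

  class sumSegmenter(s: Int, t: Int) extends Thread {
    override def run() {
      val subsum = sumSegment(a, p, s, t)
      // Both threads now contend on the same monitor, so the update is exclusive.
      x.synchronized {
        acc = acc + subsum
      }
    }
  }

  val split = a.size / 2
  val seg_one = new sumSegmenter(0, split)
  val seg_two = new sumSegmenter(split, a.size)
  seg_one.start()
  seg_two.start()
  seg_one.join()
  seg_two.join()
  scala.math.pow(acc, 1.0 / p)
}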

Spark/GraphX program does not utilize CPU and memory

I have a function that takes the neighbors of a node (which I pass around as a broadcast variable) plus the id of the node itself, and calculates the closeness centrality for that node. I map each node of the graph to the result of that function. When I open the task manager, the CPU is not utilized at all, as if the job were not running in parallel; the same goes for memory. But every node should be executing the function in parallel, and the data is large and takes time to complete, so it's not as if the job doesn't need the resources. Any help is truly appreciated, thank you.
For loading the graph I use val graph = GraphLoader.edgeListFile(sc, path).cache
object ClosenessCentrality {

  case class Vertex(id: VertexId)

  def run(graph: Graph[Int, Float], sc: SparkContext): Unit = {
    // Have to reverse edges and make the graph undirected because it is bipartite
    val neighbors = CollectNeighbors.collectWeightedNeighbors(graph).collectAsMap()
    val bNeighbors = sc.broadcast(neighbors)

    val result = graph.vertices.map(f => shortestPaths(f._1, bNeighbors.value))
    // result.coalesce(1)
    result.count()
  }

  def shortestPaths(source: VertexId,
                    neighbors: Map[VertexId, Map[VertexId, Float]]): Double = {
    val predecessors = new mutable.HashMap[VertexId, ListBuffer[VertexId]]()
    val distances = new mutable.HashMap[VertexId, Double]()
    val q = new FibonacciHeap[Vertex]
    val nodes = new mutable.HashMap[VertexId, FibonacciHeap.Node[Vertex]]()

    distances.put(source, 0)
    for (w <- neighbors) {
      if (w._1 != source)
        distances.put(w._1, Int.MaxValue)
      predecessors.put(w._1, ListBuffer[VertexId]())
      val node = q.insert(Vertex(w._1), distances(w._1))
      nodes.put(w._1, node)
    }

    while (!q.isEmpty) {
      val u = q.minNode
      val node = u.data.id
      q.removeMin()
      // discover paths: relax every edge out of the current node
      for (w <- neighbors(node).keys) {
        val alt = distances(node) + neighbors(node)(w)
        if (alt < distances(w)) {
          distances(w) = alt
          predecessors(w) += node
          q.decreaseKey(nodes(w), alt)
        }
      }
    }
    distances.values.sum
  }
}
To provide somewhat of an answer to your original question, I suspect that your RDD only has a single partition, and is thus processed on a single core.
The edgeListFile method has an argument to specify the minimum number of partitions you want.
Also, you can use repartition to get more partitions.
You mentioned coalesce, but by default that only reduces the number of partitions; see this question: Spark Coalesce More Partitions
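A sketch (the partition count of 16 is an arbitrary placeholder; tune it to your cluster):

import org.apache.spark.graphx.GraphLoader

// Split the edge file into more partitions up front, so the per-vertex
// map in run() can use more than one core.
val graph = GraphLoader.edgeListFile(sc, path, numEdgePartitions = 16).cache()

// Alternatively, repartition an already-loaded vertex RDD:
val vertices = graph.vertices.repartition(16)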

Apache Spark Streaming: Median of windowed PairDStream by key

I want to calculate the median of the values for each key of a PairDStream.
I already tried the following, which is very inefficient:
JavaPairDStream<String, Iterable<Float>> groupedByKey = pairDstream.groupByKey();

JavaPairDStream<String, Float> medianPerPlug1h = groupedByKey.transformToPair(
    new Function<JavaPairRDD<String, Iterable<Float>>, JavaPairRDD<String, Float>>() {
        public JavaPairRDD<String, Float> call(JavaPairRDD<String, Iterable<Float>> v1) throws Exception {
            return v1.mapValues(new Function<Iterable<Float>, Float>() {
                public Float call(Iterable<Float> v1) throws Exception {
                    List<Float> buffer = new ArrayList<Float>();
                    long count = 0L;
                    Iterator<Float> iterator = v1.iterator();
                    while (iterator.hasNext()) {
                        buffer.add(iterator.next());
                        count++;
                    }
                    float[] values = new float[(int) count];
                    for (int i = 0; i < buffer.size(); i++) {
                        values[i] = buffer.get(i);
                    }
                    Arrays.sort(values);
                    float median;
                    int startIndex;
                    if (count % 2 == 0) {
                        startIndex = (int) (count / 2 - 1);
                        float a = values[startIndex];
                        float b = values[startIndex + 1];
                        median = (a + b) / 2.0f;
                    } else {
                        startIndex = (int) (count / 2);
                        median = values[startIndex];
                    }
                    return median;
                }
            });
        }
    });

medianPerPlug1h.print();
medianPerPlug1h.print();
Can somebody help me with a more efficient transformation? I have about 1,950 different keys, each of which can accumulate up to 3,600 values (1 data point per second, window of 1 hour) to find the median of.
Thank you!
First off, I don't know why you are using Spark for this kind of task. It doesn't seem related to big data, considering you have just a few thousand values; it may make things more complicated. But let's assume you're planning to scale up to bigger datasets.
I would try to use a more optimized algorithm for finding the median than just sorting the values. Sorting an array of values runs in O(n log n) time.
You could think about using a linear-time selection algorithm, like median of medians.
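As a sketch, here is an expected-linear quickselect (a random pivot keeps it short; median of medians would make the pivot choice worst-case linear):

import scala.util.Random

// Select the k-th smallest element (0-based) in expected O(n) time.
def quickselect(xs: Array[Float], k: Int): Float = {
  val pivot = xs(Random.nextInt(xs.length))
  val (smaller, rest) = xs.partition(_ < pivot)
  val (equal, larger) = rest.partition(_ == pivot)
  if (k < smaller.length) quickselect(smaller, k)
  else if (k < smaller.length + equal.length) pivot
  else quickselect(larger, k - smaller.length - equal.length)
}

// Median of an odd-sized array; for even sizes, average elements k-1 and k.
def median(xs: Array[Float]): Float = quickselect(xs, xs.length / 2)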
1) Avoid groupByKey; reduceByKey is more efficient.
2) reduceByKeyAndWindow(Function2, windowDuration, slideDuration) can serve you better.
Example:
JavaPairDStream<String, String> merged = yourDStream.reduceByKeyAndWindow(
    new Function2<String, String, String>() {
        public String call(String arg0, String arg1) throws Exception {
            return arg0 + "," + arg1;
        }
    }, Durations.seconds(windowDur), Durations.seconds(slideDur));
Assume the output for a given key looks like this:
(key, "1,2,3,4,5,6,7")
Now, for each key, you can split that string; this gives you back all of the window's values and their count, so you can sort them and take the middle element to get the median.
Note: I used a String just to concatenate.
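In Scala, a minimal sketch of that last step (assuming merged is the windowed pair stream from above, with the values already concatenated per key):

// Parse the concatenated window back into numbers and take the middle one.
val medianPerKey = merged.mapValues { concatenated =>
  val values = concatenated.split(',').map(_.toFloat).sorted
  values(values.length / 2) // odd counts; for even counts, average the middle two
}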
I hope it helps :)

Matrix Transpose on RowMatrix in Spark

Suppose I have a RowMatrix.
How can I transpose it? The API documentation does not seem to have a transpose method.
The local Matrix class has a transpose() method, but it is not distributed. If I have a matrix larger than memory, how can I transpose it?
I have converted the RowMatrix to a DenseMatrix as follows:
DenseMatrix Mat = new DenseMatrix(m, n, MatArr);
which requires converting the RowMatrix to a JavaRDD and the JavaRDD to an array.
Is there any other convenient way to do the conversion?
Thanks in advance
If anybody is interested, I've implemented the distributed version that @javadba proposed:
def transposeRowMatrix(m: RowMatrix): RowMatrix = {
  val transposedRowsRDD = m.rows.zipWithIndex.map { case (row, rowIndex) =>
      rowToTransposedTriplet(row, rowIndex)
    }
    .flatMap(x => x)       // now we have triplets (newRowIndex, (newColIndex, value))
    .groupByKey
    .sortByKey().map(_._2) // sort rows and remove row indexes
    .map(buildRow)         // restore order of elements in each row and remove column indexes
  new RowMatrix(transposedRowsRDD)
}

def rowToTransposedTriplet(row: Vector, rowIndex: Long): Array[(Long, (Long, Double))] = {
  val indexedRow = row.toArray.zipWithIndex
  indexedRow.map { case (value, colIndex) => (colIndex.toLong, (rowIndex, value)) }
}

def buildRow(rowWithIndexes: Iterable[(Long, Double)]): Vector = {
  val resArr = new Array[Double](rowWithIndexes.size)
  rowWithIndexes.foreach { case (index, value) =>
    resArr(index.toInt) = value
  }
  Vectors.dense(resArr)
}
You can use BlockMatrix, which can be created from an IndexedRowMatrix:
BlockMatrix matA = new IndexedRowMatrix(...).toBlockMatrix().cache();
matA.validate();
BlockMatrix matB = matA.transpose();
Then it can easily be converted back to an IndexedRowMatrix. This is described in the Spark documentation.
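A sketch of the full round trip in Scala, assuming mat is the RowMatrix to transpose (note the final sort; see the sparse-matrix answer further down for why it matters):

import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix, RowMatrix}

def transposeViaBlocks(mat: RowMatrix): RowMatrix = {
  val indexed = new IndexedRowMatrix(
    mat.rows.zipWithIndex.map { case (row, i) => IndexedRow(i, row) })
  val transposed = indexed.toBlockMatrix().transpose.toIndexedRowMatrix()
  // Converting back keeps the row indices but not the row order,
  // so re-sort by index before discarding them.
  new RowMatrix(transposed.rows.sortBy(_.index).map(_.vector))
}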
You are correct: there is no RowMatrix.transpose() method. You will need to do this operation manually.
Here is the non-distributed/local version:
def transpose(m: Array[Array[Double]]): Array[Array[Double]] = {
  (for {
    c <- m(0).indices
  } yield m.map(_(c))).toArray
}
The distributed version would be along the following lines:
origMatRdd.rows.zipWithIndex.map { case (rvect, i) =>
  rvect.zipWithIndex.map { case (ax, j) => (j, (i, ax)) }
}.groupByKey
 .sortBy { case (i, ax) => i }
 .foldByKey(new DenseVector(origMatRdd.numRows())) { case (dv, (ix, ax)) =>
   dv(ix) = ax
 }
Caveat: I have not tested the above; it will have bugs. But the basic approach is valid, and is similar to work I did in the past on a small LinAlg library for Spark.
For a very large and sparse matrix (like the one you get from text feature extraction), the best and easiest way is:
def transposeRowMatrix(m: RowMatrix): RowMatrix = {
  val indexedRM = new IndexedRowMatrix(m.rows.zipWithIndex.map({
    case (row, idx) => new IndexedRow(idx, row)
  }))
  val transposed = indexedRM.toCoordinateMatrix().transpose.toIndexedRowMatrix()
  new RowMatrix(transposed.rows
    .map(idxRow => (idxRow.index, idxRow.vector))
    .sortByKey().map(_._2))
}
For a not-so-sparse matrix, you can use BlockMatrix as the bridge, as mentioned in aletapool's answer above.
However, aletapool's answer misses a very important point: when you go from RowMatrix -> IndexedRowMatrix -> BlockMatrix -> transpose -> BlockMatrix -> IndexedRowMatrix -> RowMatrix, in the last step (IndexedRowMatrix -> RowMatrix) you have to sort. By default, converting an IndexedRowMatrix to a RowMatrix simply drops the index, so the row order gets messed up.
val data = Array(
  MllibVectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
  MllibVectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
  MllibVectors.dense(4.0, 0.0, 0.0, 6.0, 7.0),
  MllibVectors.sparse(5, Seq((2, 2.0), (3, 7.0))))

val dataRDD = sc.parallelize(data, 4)
val testMat: RowMatrix = new RowMatrix(dataRDD)
testMat.rows.collect().map(_.toDense).foreach(println)

[0.0,1.0,0.0,7.0,0.0]
[2.0,0.0,3.0,4.0,5.0]
[4.0,0.0,0.0,6.0,7.0]
[0.0,0.0,2.0,7.0,0.0]

transposeRowMatrix(testMat).rows.collect().map(_.toDense).foreach(println)

[0.0,2.0,4.0,0.0]
[1.0,0.0,0.0,0.0]
[0.0,3.0,0.0,2.0]
[7.0,4.0,6.0,7.0]
[0.0,5.0,7.0,0.0]
Getting the transpose of RowMatrix in Java:
public static RowMatrix transposeRM(JavaSparkContext jsc, RowMatrix mat) {
    List<Vector> newList = new ArrayList<Vector>();
    List<Vector> vs = mat.rows().toJavaRDD().collect();
    double[][] tmp = new double[(int) mat.numCols()][(int) mat.numRows()];

    for (int i = 0; i < vs.size(); i++) {
        double[] rr = vs.get(i).toArray();
        for (int j = 0; j < mat.numCols(); j++) {
            tmp[j][i] = rr[j];
        }
    }

    for (int i = 0; i < mat.numCols(); i++)
        newList.add(Vectors.dense(tmp[i]));

    JavaRDD<Vector> rows2 = jsc.parallelize(newList);
    return new RowMatrix(rows2.rdd());
}
This is a variant of the previous solution, but it works for a sparse row matrix and keeps the transpose sparse where appropriate:
def transpose(X: RowMatrix): RowMatrix = {
  val m = X.numRows().toInt
  val n = X.numCols().toInt
  val transposed = X.rows.zipWithIndex.flatMap {
    case (sp: SparseVector, i: Long) =>
      sp.indices.zip(sp.values).map { case (j, value) => (i, j, value) }
    case (dp: DenseVector, i: Long) =>
      Range(0, n).toArray.zip(dp.values).map { case (j, value) => (i, j, value) }
  }.sortBy(t => t._1).groupBy(t => t._2).map { case (i, g) =>
    val (indices, values) = g.map { case (i, j, value) => (i.toInt, value) }.unzip
    if (indices.size == m) {
      (i, Vectors.dense(values.toArray))
    } else {
      (i, Vectors.sparse(m, indices.toArray, values.toArray))
    }
  }.sortBy(t => t._1).map(t => t._2)
  new RowMatrix(transposed)
}
Hope this helps!
