Incremently load big RDD file into memory

Incremently load big RDD file into memory - apache-spark

val locations = filelines.map(line => line.split("\t")).map(t => (t(5).toLong, (t(2).toDouble, t(3).toDouble))).distinct().collect()
val cartesienProduct=locations.cartesian(locations).map(t=> Edge(t._1._1,t._2._1,distanceAmongPoints(t._1._2._1,t._1._2._2,t._2._2._1,t._2._2._2)))
Code executes perfectly fine up till here but when i try to use "cartesienProduct" it got stuck i.e.
val count =cartesienProduct.count()
Any help to efficiently do this will be highly appreciated.

First, the map transformation can be made more readable if written as:
locations.cartesian(locations).map {
case ((a1, (b1, c1)), (a2, (b2, c2)) =>
Edge(a1, a2, distanceAmongPoints(b1,c1,b2,c2)))
}
It seems the objective is to calculate distance between two points for all pairs. cartesian will give the pair twice, effectively computing same distance twice.
To avoid that, one approach could be to broadcast a copy of all points and then compare in parts.
val points: // an array of points.
val pointsRDD = sc.parallelize(points.zipWithIndex)
val bPoints = sc.broadcast(points)
pointsRDD.map { case (point, index) =>
(index + 1 until bPoints.value.size).map { i =>
distanceBetweenPoints(point, bPoints.value.get(i))
}
}
If size of points is N, it will compare point-0 with (point-1 to point-N-1), point-1 with (point-2 to point-N-1) etc.

Related

Distributed DBSCAN on spark

I'm trying to implement the DBSCAN algorithm on Spark, so I'm following the paper A Parallel DBSCAN Algorithm Based on Spark.
They propose an algorithm with 4 main steps:
Data partition
Computing a local DBSCAN
Merging the data partition
Global clustering
So I'm implementing the second step using GraphX, and the pseudocode is something like this:
Select an arbitrary point p in the current partition
Compute the N_{e} and if N_{e} >= minPts mark p as a core point, otherwise as a noise point.
If p is a core point then make a cluster c by p, adding all the points belonging to cluster c to a list for recursive calls.
...
And here my code (I'm aware that it does not work):
def dataPartition() : Graph[Array[String], Int] = {
graph.partitionBy(PartitionStrategy.RandomVertexCut)
}
def computingLocalDBSCAN() : Unit = {
graph = dataPartition()
//val neighbors = graph.mapVertices((id, attr) => localDBSCANMap(id, attr))
}
def localDBSCANMap(id: VertexId, attr:Array[String], cluster:Int):Unit = {
val neighbors = graph.collectNeighbors(EdgeDirection.Out).lookup(id)
if (neighbors.size >= eps) {
attr(0) = PointType.Core.toString
attr(1) = cluster.toString
} else {
attr(0) = PointType.Noise.toString
}
neighbors.foreach(it => {
for (item <- it) {
localDBSCANMap(item._1, item._2, cluster)
}
})
}
I have multiple questions:
How could I change the value of one attribute of the vertex? I understand that the vertices are immutable, but I will like to flag the nodes with Noise, Core, Border or Unclassified.
How could I pick a random node inside a partition? Because my problem with the map method is that I have to modify the values at the same time that I'm traversing the graph.
How could I call a recursive method and modify the attribute values? at the same time?

How to compute the dot product of two distributed RowMatrix in Apache Spark?

Let Q be a distributed Row Matrix in Spark, I want to calculate the cross product of Q with its transpose Q'.
However although a Row Matrix does have a multiply() method, but it can only accept local Matrices as an argument.
Code illustration ( Scala ):
val phi = new RowMatrix(phiRDD) // phiRDD is an instance of RDD[Vector]
val phiTranspose = transposeRowMatrix(phi) // transposeRowMatrix()
// returns the transpose of a RowMatrix
val crossMat = ? // phi * phiTranspose
Note that I want to perform the dot product of 2 Distributed RowMatrix not a distributed one with a local one.
One solution is to use an IndexedRowMatrix as following:
val phi = new IndexedRowMatrix(phiRDD) // phiRDD is an instance of RDD[IndexedRow]
val phiTranspose = transposeMatrix(phi) // transposeMatrix()
// returns the transpose of a Matrix
val crossMat = phi.toBlockMatrix().multiply( phiTranspose.toBlockMatrix()
).toIndexedRowMatrix()
However, I want to use the Row Matrix-Methods such as tallSkinnyQR() and this means that I sholud transform crossMat to a RowMatrix, using .toRowMatrix() method:
val crossRowMat = crossMat.toRowMatrix()
and finally I can apply
crossRowMat.tallSkinnyQR()
but this process includes many transformations between the types of the Distributed Matrices and according to what I understood from MLlib Programming Guide this is expensive:
It is very important to choose the right format to store large and distributed matrices. Converting a distributed matrix to a different format may require a global shuffle, which is quite expensive.
Would someone elaborate, please.

Only distributed matrices which support matrix - matrix multiplication are BlockMatrices. You have to convert your data accordingly - artificial indices are good enough:
new IndexedRowMatrix(
rowMatrix.rows.zipWithIndex.map(x => IndexedRow(x._2, x._1))
).toBlockMatrix match { case m => m.multiply(m.transpose) }

I used the algorithm listed on this page which moves the multiplication problem from dot product to distributed scalar product problem by using vectors outer product:
The outer product between two vectors is the scalar product of the
second vector with all the elements in the first vector, resulting in
a matrix
My own created multiplication function (can be more optimized) for Row Matrices ended up like that.
def multiplyRowMatrices(m1: RowMatrix, m2: RowMatrix)(implicit ctx: SparkSession): RowMatrix = {
// Zip m1 columns with m2 rows
val m1Cm2R = transposeRowMatrix(m1).rows.zip(m2.rows)
// Apply scalar product between each entry in m1 vector with m2 row
val scalar = m1Cm2R.map{
case(column:DenseVector,row:DenseVector) => column.toArray.map{
columnValue => row.toArray.map{
rowValue => columnValue*rowValue
}
}
}
// Add all the resulting matrices point wisely
val sum = scalar.reduce{
case(matrix1,matrix2) => matrix1.zip(matrix2).map{
case(array1,array2)=> array1.zip(array2).map{
case(value1,value2)=> value1+value2
}
}
}
new RowMatrix(ctx.sparkContext.parallelize(sum.map(array=> Vectors.dense(array))))
}
After that I tested both approaches- My own function and using block matrix - using a 300*10 Matrix on a one machine
Using my own function:
val PhiMat = new RowMatrix(phi)
val TphiMat = transposeRowMatrix(PhiMat)
val product = multiplyRowMatrices(PhiMat,TphiMat)
Using matrix transformation:
val MatRow = new RowMatrix(phi)
val MatBlock = new IndexedRowMatrix(MatRow.rows.zipWithIndex.map(x => IndexedRow(x._2, x._1))).toBlockMatrix()
val TMatBlock = MatBlock.transpose
val productMatBlock = MatBlock.multiply(TMatBlock)
val productMatRow = productMatBlock.toIndexedRowMatrix().toRowMatrix()
The first approach spanned 1 job with 5 stages and took 2s to finish in total. While the second approach spanned 4 jobs, three with one stage and one with two stages, and took 0.323s in total. Also the second approach outperformed the first with respect to the Shuffle Read/Write size.
Yet I am still confused by the MLlib Programming Guide statement:
It is very important to choose the right format to store large and
distributed matrices. Converting a distributed matrix to a different
format may require a global shuffle, which is quite expensive.

Spark Accumulator vs Count

I have a use case where I want to count types of elements in an RDD matching some filter.
e.g. RDD.filter(F1) and RDD.filter(!F1)
I have 2 options
Use accumulators: e.g.
LongAccumulator l1 = sparkContext.longAccumulator("Count1")
LongAccumulator l2 = sparkContext.longAccumulator("Count2")
RDD.forEachPartition(f -> {
if(F1) l1.add(1)
else l2.add(1)
});
Use Count
RDD.filter(F1).count(); RDD.filter(!F1).count()
One benefit of the first approach is that we only need to iterate data once (useful since my data set is 10s of TB)
What is the use of count if same affect can be achieved by using Accumulators ?

Major difference is that if your code will fail in transformation, then Accumulators will be updated and count() result not.
Other option is to use pure map-reduce:
val counts = rdd.map(x => (F1(x), 1)).reduceByKey(_ + _).collectAsMap()
Network cost should be also low as only few numbers will be sent. It creates pairs of (is F1(x) true/false, 1) and then sum all ones - it will give you number of items both F1(x) and !F1(x) in counts map

Spark Streaming: How to feedback output into input

Is it possible to implement the above shown scenario?
The system starts with one key-value pair and will discover new pairs. First the number of key-value pairs will increase and then shrink across iterations.
Update: I have to shift to Flink Streaming for Iteration support. Will try with kafka though!

With Apache Flink it is possible to define feedback edges via the iterate API call. The iterate method expects a step function which, given the an input stream, produces a feedback stream and an output stream. The former stream is fed back to the step function and the latter stream is send to down stream operators.
A simple example looks like:
val env = StreamExecutionEnvironment.getExecutionEnvironment
val input = env.fromElements(1).map(x => (x, math.random))
val output = input.iterate {
inputStream =>
val iterationBody = inputStream.flatMap {
randomWalk =>
val (step, position) = randomWalk
val direction = 2 * (math.random - 0.5)
val bifurcate = math.random >= 0.75
Seq(
Some((step + 1, position + direction)),
if (bifurcate) Some((step + 1, position - direction)) else None).flatten
}
val feedback = iterationBody.filter {
randomWalk => math.abs(randomWalk._2) < 1.0
}
val output = iterationBody.filter {
randomWalk => math.abs(randomWalk._2) >= 1.0
}
(feedback, output)
}
output.print()
// execute program
env.execute("Random Walk with Bifurcation")
Here we calculate a random walk where we randomly split our walk to proceed in the opposite direction. A random walk is finished iff its absolute position value is greater or equal to 1.0.

Low performance of groupbykey in spark

After reading the Spark documentation, I find that groupByKey function has a low performance compared with reduceByKey. But what I need is to get the average, maximum and minimum value of a certain key. I don't think this could be done by reduceByKey method. I can just create an customized reduceByKey function to realize those goals?

Let's say you have an RDD[(String, Double)] and you want to calculate avg, min, max over the double values using reduceByKey.
This could by done by duplicating the values as many times as operations you like to apply and then applying the different operations with reduceByKey.
Like this:
val srcData:RDD[(String, Double)] = ???
srcData.cache
val count = srcData.count
val baseData = srcData.map{case (k,v) => (k,(v,1,v,v))}
val aggregates = baseData.reduceByKey { case (v1,v2) =>
(v1._1 + v2._1, v1._2 + v2._2, Math.max(v1._3, v2._3), Math.min(v1._4,v2._4))}
val result = aggregates.collect()
.map{case (id, (sum, count, max, min)) => (id, sum/count, max, min)}

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Incremently load big RDD file into memory - apache-spark

Related

Distributed DBSCAN on spark

How to compute the dot product of two distributed RowMatrix in Apache Spark?

Spark Accumulator vs Count

Spark Streaming: How to feedback output into input

Low performance of groupbykey in spark

Categories

Resources