Spark Streaming: How to feed output back into input

Is it possible to implement the scenario shown above?
The system starts with one key-value pair and will discover new pairs. The number of key-value pairs will first increase and then shrink across iterations.
Update: I have to switch to Flink Streaming for iteration support. I will still try it with Kafka, though!

With Apache Flink it is possible to define feedback edges via the iterate API call. The iterate method expects a step function which, given an input stream, produces a feedback stream and an output stream. The former is fed back into the step function and the latter is sent to the downstream operators.
A simple example looks like:
import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment
// start with a single walk: (step, position)
val input = env.fromElements(1).map(x => (x, math.random))
val output = input.iterate {
  inputStream =>
    val iterationBody = inputStream.flatMap {
      randomWalk =>
        val (step, position) = randomWalk
        val direction = 2 * (math.random - 0.5)
        val bifurcate = math.random >= 0.75
        Seq(
          Some((step + 1, position + direction)),
          if (bifurcate) Some((step + 1, position - direction)) else None).flatten
    }
    // walks that are still inside (-1.0, 1.0) go around the loop again
    val feedback = iterationBody.filter {
      randomWalk => math.abs(randomWalk._2) < 1.0
    }
    // finished walks leave the iteration
    val output = iterationBody.filter {
      randomWalk => math.abs(randomWalk._2) >= 1.0
    }
    (feedback, output)
}
output.print()
// execute program
env.execute("Random Walk with Bifurcation")
Here we compute a random walk that randomly splits, with the new branch proceeding in the opposite direction. A random walk is finished iff the absolute value of its position is greater than or equal to 1.0.
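To see the feedback semantics without a cluster, the same loop can be imitated in plain Scala with a work queue. This is only a sketch and not Flink code: the names FeedbackLoopSketch, step and run, the seeded Random and the maxElements safety cap are all assumptions made for illustration.

```scala
import scala.collection.mutable
import scala.util.Random

object FeedbackLoopSketch {
  type Walk = (Int, Double) // (step, position)

  // One pass through the step function: advance a walk and maybe bifurcate.
  def step(walk: Walk, rnd: Random): Seq[Walk] = {
    val (n, position) = walk
    val direction = 2 * (rnd.nextDouble() - 0.5)
    val bifurcate = rnd.nextDouble() >= 0.75
    val forward = (n + 1, position + direction)
    if (bifurcate) Seq(forward, (n + 1, position - direction)) else Seq(forward)
  }

  // Processes queued walks until the queue is empty or maxElements (an
  // assumed safety cap) have been handled.
  // Returns (finished walks, walks still queued when the loop stopped).
  def run(seed: Long, maxElements: Int): (Vector[Walk], Vector[Walk]) = {
    val rnd = new Random(seed)
    val pending = mutable.Queue[Walk]((1, rnd.nextDouble()))
    val finished = Vector.newBuilder[Walk]
    var processed = 0
    while (pending.nonEmpty && processed < maxElements) {
      val (feedback, output) =
        step(pending.dequeue(), rnd).partition(w => math.abs(w._2) < 1.0)
      pending ++= feedback // plays the role of the feedback edge
      finished ++= output  // plays the role of the output edge
      processed += 1
    }
    (finished.result(), pending.toVector)
  }
}
```

Calling FeedbackLoopSketch.run(42L, 10000) returns the finished walks together with whatever is still queued; in the streaming job the queue corresponds to the elements travelling along the feedback edge.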

Related

Distributed DBSCAN on spark

I'm trying to implement the DBSCAN algorithm on Spark, so I'm following the paper A Parallel DBSCAN Algorithm Based on Spark.
They propose an algorithm with 4 main steps:
Data partition
Computing a local DBSCAN
Merging the data partition
Global clustering
So I'm implementing the second step using GraphX, and the pseudocode is something like this:
Select an arbitrary point p in the current partition
Compute the N_{e} and if N_{e} >= minPts mark p as a core point, otherwise as a noise point.
If p is a core point then make a cluster c by p, adding all the points belonging to cluster c to a list for recursive calls.
...
And here is my code (I'm aware that it does not work):
def dataPartition(): Graph[Array[String], Int] = {
  graph.partitionBy(PartitionStrategy.RandomVertexCut)
}

def computingLocalDBSCAN(): Unit = {
  graph = dataPartition()
  //val neighbors = graph.mapVertices((id, attr) => localDBSCANMap(id, attr))
}

def localDBSCANMap(id: VertexId, attr: Array[String], cluster: Int): Unit = {
  val neighbors = graph.collectNeighbors(EdgeDirection.Out).lookup(id)
  if (neighbors.size >= eps) {
    attr(0) = PointType.Core.toString
    attr(1) = cluster.toString
  } else {
    attr(0) = PointType.Noise.toString
  }
  neighbors.foreach(it => {
    for (item <- it) {
      localDBSCANMap(item._1, item._2, cluster)
    }
  })
}
I have multiple questions:
How could I change the value of one attribute of a vertex? I understand that the vertices are immutable, but I would like to flag the nodes as Noise, Core, Border or Unclassified.
How could I pick a random node inside a partition? My problem with the map method is that I have to modify the values while I'm traversing the graph.
How could I call a recursive method and modify the attribute values at the same time?
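On the first question, one common pattern with immutable data is to derive a new, labelled collection instead of mutating attributes in place; in GraphX the analogous call would be something like graph.mapVertices, which returns a new graph. The following is only a plain Scala sketch of the core/noise decision from step 2 of the pseudocode, with no Spark involved. The names LocalLabelSketch and classify, the Euclidean distance and the choice to count a point as its own neighbour are assumptions for illustration.

```scala
object LocalLabelSketch {
  sealed trait PointType
  case object Core extends PointType
  case object Noise extends PointType

  type Point = (Double, Double)

  // Assumption: plain Euclidean distance between 2-D points.
  def distance(a: Point, b: Point): Double =
    math.hypot(a._1 - b._1, a._2 - b._2)

  // Returns a new map id -> label; the input map is left untouched.
  def classify(points: Map[Long, Point], eps: Double, minPts: Int): Map[Long, PointType] =
    points.map { case (id, p) =>
      // Assumption: the eps-neighbourhood includes the point itself.
      val neighbours = points.values.count(q => distance(p, q) <= eps)
      val label: PointType = if (neighbours >= minPts) Core else Noise
      id -> label
    }
}
```

Border points and cluster expansion are deliberately left out, since the pseudocode above elides those steps.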

How to compute the dot product of two distributed RowMatrix in Apache Spark?

Let Q be a distributed RowMatrix in Spark. I want to calculate the cross product of Q with its transpose Q'.
However, although RowMatrix does have a multiply() method, it only accepts a local Matrix as its argument.
Code illustration ( Scala ):
val phi = new RowMatrix(phiRDD) // phiRDD is an instance of RDD[Vector]
val phiTranspose = transposeRowMatrix(phi) // transposeRowMatrix() returns the transpose of a RowMatrix
val crossMat = ? // phi * phiTranspose
Note that I want to compute the product of two distributed RowMatrices, not of a distributed matrix with a local one.
One solution is to use an IndexedRowMatrix as follows:
val phi = new IndexedRowMatrix(phiRDD) // phiRDD is an instance of RDD[IndexedRow]
val phiTranspose = transposeMatrix(phi) // transposeMatrix() returns the transpose of a Matrix
val crossMat = phi.toBlockMatrix()
  .multiply(phiTranspose.toBlockMatrix())
  .toIndexedRowMatrix()
However, I want to use RowMatrix methods such as tallSkinnyQR(), which means that I should convert crossMat to a RowMatrix using the .toRowMatrix() method:
val crossRowMat = crossMat.toRowMatrix()
and finally I can apply
crossRowMat.tallSkinnyQR()
but this process involves many conversions between distributed matrix types, and according to what I understood from the MLlib Programming Guide this is expensive:
It is very important to choose the right format to store large and distributed matrices. Converting a distributed matrix to a different format may require a global shuffle, which is quite expensive.
Could someone elaborate, please?
The only distributed matrices that support matrix-matrix multiplication are BlockMatrices. You have to convert your data accordingly; artificial indices are good enough:
new IndexedRowMatrix(
  rowMatrix.rows.zipWithIndex.map(x => IndexedRow(x._2, x._1))
).toBlockMatrix match { case m => m.multiply(m.transpose) }
I used the algorithm listed on this page, which turns the matrix multiplication from a dot-product problem into a distributed scalar-product problem by using the outer product of vectors:
The outer product between two vectors is the scalar product of the second vector with all the elements in the first vector, resulting in a matrix
My own multiplication function for RowMatrices (it could be optimized further) ended up like this:
def multiplyRowMatrices(m1: RowMatrix, m2: RowMatrix)(implicit ctx: SparkSession): RowMatrix = {
  // Zip m1 columns with m2 rows
  val m1Cm2R = transposeRowMatrix(m1).rows.zip(m2.rows)
  // Apply scalar product between each entry in m1 vector with m2 row
  val scalar = m1Cm2R.map {
    case (column: DenseVector, row: DenseVector) => column.toArray.map {
      columnValue => row.toArray.map {
        rowValue => columnValue * rowValue
      }
    }
  }
  // Add all the resulting matrices pointwise
  val sum = scalar.reduce {
    case (matrix1, matrix2) => matrix1.zip(matrix2).map {
      case (array1, array2) => array1.zip(array2).map {
        case (value1, value2) => value1 + value2
      }
    }
  }
  new RowMatrix(ctx.sparkContext.parallelize(sum.map(array => Vectors.dense(array))))
}
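The identity this function relies on, namely that A * B equals the sum over k of the outer product of column k of A with row k of B, can be checked on small local arrays. The sketch below is plain Scala without Spark, and the names OuterProductSketch, outer, add and multiply are made up for illustration.

```scala
object OuterProductSketch {
  type Matrix = Array[Array[Double]]

  // Outer product of one column of A with the matching row of B.
  def outer(column: Array[Double], row: Array[Double]): Matrix =
    column.map(c => row.map(r => c * r))

  // Element-wise sum of two matrices of the same shape.
  def add(x: Matrix, y: Matrix): Matrix =
    x.zip(y).map { case (rx, ry) => rx.zip(ry).map { case (a, b) => a + b } }

  // A * B computed as the sum over k of outer(column k of A, row k of B).
  def multiply(a: Matrix, b: Matrix): Matrix = {
    require(a.nonEmpty && a.head.length == b.length, "inner dimensions must match")
    a.transpose.zip(b)
      .map { case (column, row) => outer(column, row) }
      .reduce((x, y) => add(x, y))
  }
}
```

Note that the pairing only works if column k of the first matrix really meets row k of the second. On RDDs, zip additionally requires both sides to have the same number of partitions and the same number of elements per partition, so the distributed version depends on the row order being preserved.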
After that I tested both approaches (my own function and the block matrix conversion) using a 300*10 matrix on a single machine.
Using my own function:
val PhiMat = new RowMatrix(phi)
val TphiMat = transposeRowMatrix(PhiMat)
val product = multiplyRowMatrices(PhiMat,TphiMat)
Using matrix transformation:
val MatRow = new RowMatrix(phi)
val MatBlock = new IndexedRowMatrix(MatRow.rows.zipWithIndex.map(x => IndexedRow(x._2, x._1))).toBlockMatrix()
val TMatBlock = MatBlock.transpose
val productMatBlock = MatBlock.multiply(TMatBlock)
val productMatRow = productMatBlock.toIndexedRowMatrix().toRowMatrix()
The first approach spawned 1 job with 5 stages and took 2s in total, while the second approach spawned 4 jobs (three with one stage and one with two stages) and took 0.323s in total. The second approach also outperformed the first with respect to the Shuffle Read/Write size.
Yet I am still confused by the MLlib Programming Guide statement:
It is very important to choose the right format to store large and
distributed matrices. Converting a distributed matrix to a different
format may require a global shuffle, which is quite expensive.

How does SGD work in Spark MLlib with Streaming

I am new to Spark and wonder what happens in the background, in terms of RDDs on the cluster, when the trainOn method is invoked on a StreamingLinearAlgorithm, e.g.
val regression = new StreamingLinearRegressionWithSGD()
regression.trainOn(someStream)
StreamingLinearRegressionWithSGD contains the actual model (LinearRegressionModel) as well as the learning algorithm (LinearRegressionWithSGD) as a protected member. In method trainOn, the following is happening:
data.foreachRDD { (rdd, time) =>
  if (!rdd.isEmpty) {
    model = Some(algorithm.run(rdd, model.get.weights))
  }
}
From my understanding, the streaming data is chunked every x seconds and each chunk is put into an RDD. So every x seconds the learning algorithm is applied to a new RDD. The actual learning magic happens somewhere in GradientDescent, where the RDD data and the previous model weight vector are the input and the updated weight vector is the output.
val bcWeights = data.context.broadcast(weights)
// Sample a subset (fraction miniBatchFraction) of the total data
// compute and sum up the subgradients on this subset (this is one map-reduce)
val (gradientSum, lossSum, miniBatchSize) = data.sample(false, miniBatchFraction, 42 + i)
  .treeAggregate((BDV.zeros[Double](n), 0.0, 0L))(
    seqOp = (c, v) => {
      // c: (grad, loss, count), v: (label, features)
      val l = gradient.compute(v._2, v._1, bcWeights.value, Vectors.fromBreeze(c._1))
      (c._1, c._2 + l, c._3 + 1)
    },
    combOp = (c1, c2) => {
      // c: (grad, loss, count)
      (c1._1 += c2._1, c1._2 + c2._2, c1._3 + c2._3)
    })
...
val update = updater.compute(
  weights, Vectors.fromBreeze(gradientSum / miniBatchSize.toDouble),
  stepSize, i, regParam)
weights = update._1
regVal = update._2
How is the weight vector updated? Does it work in parallel? I thought an RDD is split up into partitions and then distributed. I can see that the input weights are broadcast via the Spark context and that treeAggregate is used to sum up the gradients, but I still cannot grasp which actual map-reduce task takes place, which part happens in the driver and which part runs in parallel. Can anyone explain what is happening in detail?
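One way to picture the split is a plain Scala model in which each inner Seq stands for a partition. This is only a sketch under stated assumptions, not the MLlib implementation: the names MiniBatchSketch, partitionStats, combine and step are hypothetical, the gradient is a simple least-squares one, and the update is a bare step without the step-size decay or regularization that the real updater may apply.

```scala
object MiniBatchSketch {
  type Example = (Double, Array[Double])   // (label, features)
  type Stats = (Array[Double], Double, Long) // (gradient sum, loss sum, count)

  // Role of seqOp: runs per partition against the (broadcast) weights.
  def partitionStats(part: Seq[Example], w: Array[Double]): Stats =
    part.foldLeft((Array.fill(w.length)(0.0), 0.0, 0L)) {
      case ((grad, loss, n), (label, x)) =>
        val diff = x.zip(w).map { case (xi, wi) => xi * wi }.sum - label
        val newGrad = grad.zip(x).map { case (g, xi) => g + diff * xi }
        (newGrad, loss + diff * diff / 2.0, n + 1)
    }

  // Role of combOp: merges two partial results.
  def combine(a: Stats, b: Stats): Stats =
    (a._1.zip(b._1).map { case (x, y) => x + y }, a._2 + b._2, a._3 + b._3)

  // One iteration: partitions are processed independently, then the merged
  // gradient is averaged and applied to the weights in a single place.
  def step(partitions: Seq[Seq[Example]], w: Array[Double], stepSize: Double): Array[Double] = {
    val (gradSum, _, count) =
      partitions.map(p => partitionStats(p, w)).reduce((a, b) => combine(a, b))
    w.zip(gradSum).map { case (wi, g) => wi - stepSize * g / count }
  }
}
```

In this picture partitionStats and combine correspond to the work that treeAggregate distributes, while the last line of step corresponds to the updater.compute call, which operates on the already aggregated gradient.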

Is there a way to sample a Spark RDD for exactly a specified number of elements instead of a percentage?

I currently need to randomly sample k elements from an RDD in Spark. I noticed that there is the takeSample method. The method signature is as follows.
takeSample(withReplacement: Boolean, num: Int, seed: Long = Utils.random.nextLong): Array[T]
However, this does not return an RDD. There is another sampling method that does return an RDD, sample.
sample(withReplacement: Boolean, fraction: Double, seed: Long = Utils.random.nextLong): RDD[T]
I don't want to use the first method, takeSample, because it does not return an RDD and will pull a significant amount of data back to the driver program (memory issues). I went ahead and used the sample method, but I had to compute the fraction (percentage) as follows.
val rdd = sc.textFile("some/path") //creates the rdd
val N = rdd.count() //total items in the rdd
val fraction = k / N.toDouble
val sampledRdd = rdd.sample(false, fraction, 67L)
The problem with this approach is that I may not get an RDD with exactly k items. For example, if we assume N = 10, then
k = 2, fraction = 20%, sampled items = 2
k = 3, fraction = 30%, sampled items = 3
and so on
But with N = 11, then
k = 2, fraction = 18.1818%, sampled items = ?
k = 3, fraction = 27.2727%, sampled items = ?
In the last example, for fraction = 18.1818%, how many items will be in the resulting RDD?
Also, this is what the documentation says about the fraction argument.
expected size of the sample as a fraction of this RDD's size
- without replacement: probability that each element is chosen; fraction must be [0, 1]
- with replacement: expected number of times each element is chosen; fraction must be greater than or equal to 0
Since I have chosen sampling without replacement, it seems that my fraction should be computed as follows. Note that each item has an equal probability of being selected (which is what I'm trying to express).
val N = rdd.count()
val fraction = 1 / N.toDouble
val sampleRdd = rdd.sample(false, fraction, 67L)
So, is it k / N or 1 / N? It seems as if the documentation is pointing in all different directions with sample size and sampling probability.
And lastly, the documentation notes.
This is NOT guaranteed to provide exactly the fraction of the count of the given RDD.
Which then brings me back to my original question/concern: if the RDD API doesn't guarantee sampling exactly k items from an RDD, how do we do so efficiently?
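The "probability that each element is chosen" wording suggests independent per-element selection, and under that reading fraction = k / N gives a sample whose size is k only on average. A small plain Scala simulation (no Spark; the names BernoulliSampleSketch, sample and sampleSizes are hypothetical) makes the variation visible:

```scala
import scala.util.Random

object BernoulliSampleSketch {
  // Keeps each element independently with probability `fraction`.
  def sample[T](items: Seq[T], fraction: Double, rnd: Random): Seq[T] =
    items.filter(_ => rnd.nextDouble() < fraction)

  // Sizes of `trials` independent samples drawn with fraction = k / n.
  def sampleSizes(n: Int, k: Int, trials: Int, seed: Long): Seq[Int] = {
    val rnd = new Random(seed)
    val items = 0 until n
    Seq.fill(trials)(sample(items, k / n.toDouble, rnd).size)
  }
}
```

For N = 11 and k = 2, BernoulliSampleSketch.sampleSizes(11, 2, 2000, 7L) yields a mix of sizes whose mean is close to 2, which is consistent with the documentation's note that the exact fraction is not guaranteed.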
As I was writing this post, I discovered there is already another SO post asking nearly the same question. I found the accepted answer unacceptable. Here, I also wanted to clarify the fraction argument.
I wonder if there is a way to do so using Datasets and DataFrames?
This solution is not that elegant, but I hope it is a helpful starting point.
The trick is to attach a random score to every element, take the score at position k of the ascending sort order as the threshold, and keep the k elements whose scores are below it.
val k = 100
val rdd = sc.parallelize(0 until 1000)
val rddWithScore = rdd.map((_, Math.random))
rddWithScore.cache()
val threshold = rddWithScore.map(_._2)
  .sortBy(t => t)
  .zipWithIndex()
  .filter(_._2 == k)
  .collect()
  .head._1
val rddSample = rddWithScore.filter(_._2 < threshold).map(_._1)
rddSample.count()
The output would be
k: Int = 100
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[58] at parallelize at <console>:31
rddWithScore: org.apache.spark.rdd.RDD[(Int, Double)] = MapPartitionsRDD[59] at map at <console>:32
threshold: Double = 0.1180443408900893
rddSample: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[69] at map at <console>:40
res10: Long = 100

Incrementally load a big RDD file into memory

val locations = filelines.map(line => line.split("\t")).map(t => (t(5).toLong, (t(2).toDouble, t(3).toDouble))).distinct().collect()
val cartesienProduct=locations.cartesian(locations).map(t=> Edge(t._1._1,t._2._1,distanceAmongPoints(t._1._2._1,t._1._2._2,t._2._2._1,t._2._2._2)))
The code runs fine up to this point, but it gets stuck as soon as I try to use "cartesienProduct", e.g.
val count =cartesienProduct.count()
Any help with doing this efficiently would be highly appreciated.
First, the map transformation can be made more readable if written as:
locations.cartesian(locations).map {
  case ((a1, (b1, c1)), (a2, (b2, c2))) =>
    Edge(a1, a2, distanceAmongPoints(b1, c1, b2, c2))
}
It seems the objective is to calculate the distance between every pair of points. cartesian yields each pair twice, so the same distance is effectively computed twice.
To avoid that, one approach could be to broadcast a copy of all points and then compare in parts.
val points: Array[(Double, Double)] = ??? // an array of points
val pointsRDD = sc.parallelize(points.zipWithIndex)
val bPoints = sc.broadcast(points)
pointsRDD.map { case (point, index) =>
  // compare each point only with the points that come after it
  (index + 1 until bPoints.value.length).map { i =>
    distanceBetweenPoints(point, bPoints.value(i))
  }
}
If the number of points is N, this compares point-0 with point-1 through point-(N-1), point-1 with point-2 through point-(N-1), and so on.
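The index pattern can be checked on a local collection: pairing each index i only with the indices j > i should produce N * (N - 1) / 2 distances. The sketch below is plain Scala without Spark; the names PairwiseSketch and distances are hypothetical, and Euclidean distance is an assumption standing in for whatever distanceBetweenPoints computes.

```scala
object PairwiseSketch {
  type Point = (Double, Double)

  // Assumption: Euclidean distance; the original distance function is not shown.
  def distance(a: Point, b: Point): Double =
    math.hypot(a._1 - b._1, a._2 - b._2)

  // For every index i, pair point i only with the points after it,
  // so each unordered pair appears exactly once.
  def distances(points: IndexedSeq[Point]): Seq[(Int, Int, Double)] =
    for {
      i <- points.indices
      j <- i + 1 until points.size
    } yield (i, j, distance(points(i), points(j)))
}
```

With four points this yields six (i, j, distance) triples instead of the sixteen pairs a full cartesian product would generate.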
