I am new to Spark and wonder what will happen in the background in terms of RDDs on the cluster if the trainOn method is invoked on a StreamingLinearAlgorithm e.g.
val regression = new StreamingLinearRegressionWithSGD()
regression.trainOn(someStream)
StreamingLinearRegressionWithSGD contains the actual model (LinearRegressionModel) as well as the learning algorithm (LinearRegressionWithSGD) as a protected member. In method trainOn, the following is happening:
data.foreachRDD { (rdd, time) =>
if (!rdd.isEmpty) {
model = Some(algorithm.run(rdd, model.get.weights))
}
}
From my understanding, the streaming data is chunked every x seconds and each chunk is put in an RDD. So every x seconds the learning algorithm is applied to a new RDD. The actual learning happens in GradientDescent, where the RDD data and the previous model weight vector are the input and the updated weight vector is the output.
val bcWeights = data.context.broadcast(weights)
// Sample a subset (fraction miniBatchFraction) of the total data
// compute and sum up the subgradients on this subset (this is one map-reduce)
val (gradientSum, lossSum, miniBatchSize) = data.sample(false, miniBatchFraction, 42 + i)
.treeAggregate((BDV.zeros[Double](n), 0.0, 0L))(
seqOp = (c, v) => {
// c: (grad, loss, count), v: (label, features)
val l = gradient.compute(v._2, v._1, bcWeights.value, Vectors.fromBreeze(c._1))
(c._1, c._2 + l, c._3 + 1)
},
combOp = (c1, c2) => {
// c: (grad, loss, count)
(c1._1 += c2._1, c1._2 + c2._2, c1._3 + c2._3)
})
...
val update = updater.compute(
weights, Vectors.fromBreeze(gradientSum / miniBatchSize.toDouble),
stepSize, i, regParam)
weights = update._1
regVal = update._2
How is the weight vector updated? Does it work in parallel? I thought an RDD is split up into partitions and then distributed. I can see that the input weights are broadcast via the Spark context and that treeAggregate is used to sum up the gradients, but I still cannot grasp which map-reduce task is actually taking place, which part happens in the driver, and which part runs in parallel. Can anyone explain what is happening in detail?
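For orientation, one iteration of the loop above can be mimicked with plain Scala collections. This is only a sketch under assumptions, not the Spark implementation: a Seq of Seqs stands in for the partitions of the RDD, the least-squares gradient and a step size of stepSize / sqrt(iter) are assumed, sampling and regularization are left out, and the names MiniBatchStepSketch, seqOp, combOp and step are hypothetical. The part that would run per partition on the executors is seqOp; merging partial results is combOp; the weight update itself is a small local computation that would happen once per iteration on the driver.

```scala
object MiniBatchStepSketch {
  // One labeled example: (label, features)
  type Example = (Double, Array[Double])
  // Aggregation buffer: (gradient sum, loss sum, example count)
  type Acc = (Array[Double], Double, Long)

  // "Executor side": fold one example into a partition-local buffer,
  // using the least-squares gradient (w.x - y) * x (an assumption).
  def seqOp(weights: Array[Double])(acc: Acc, ex: Example): Acc = {
    val (label, features) = ex
    val diff = features.zip(weights).map { case (x, w) => x * w }.sum - label
    val grad = acc._1.zip(features).map { case (g, x) => g + diff * x }
    (grad, acc._2 + diff * diff / 2.0, acc._3 + 1L)
  }

  // Merge the partial results of two partitions.
  def combOp(a: Acc, b: Acc): Acc =
    (a._1.zip(b._1).map { case (x, y) => x + y }, a._2 + b._2, a._3 + b._3)

  // "Driver side": average the summed gradient and take one step.
  def step(weights: Array[Double],
           partitions: Seq[Seq[Example]],
           stepSize: Double,
           iter: Int): Array[Double] = {
    val zero: Acc = (Array.fill(weights.length)(0.0), 0.0, 0L)
    // Each inner foldLeft corresponds to the work done on one partition.
    val partials = partitions.map(_.foldLeft(zero)(seqOp(weights)))
    val (gradSum, _, count) = partials.reduce(combOp)
    val rate = stepSize / math.sqrt(iter.toDouble)
    weights.zip(gradSum).map { case (w, g) => w - rate * g / count }
  }
}
```

In this picture the broadcast of the weights corresponds to passing weights into seqOp, and only the small (gradient, loss, count) triples travel back to the place where the update is computed.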
Related
I would like to repartition data from RDD[LabeledPoint] to K partitions, and use the K partition to train K ml models respectively in parallel.
The closest operation I know of that can produce this is .mapPartitionsWithIndex, but it passes an Iterator[LabeledPoint] to the function, while the model takes an RDD[LabeledPoint] as input. I think I can convert the Iterator[LabeledPoint] to an RDD[LabeledPoint], but it seems redundant to convert from RDD to Iterator and back again.
Below is what I have so far.
val K = models.size // models: Array[(idx, Model)]
val repartitioned = data.repartition(K) // RDD[LabeledPoint]; repartition returns a new RDD
repartitioned.mapPartitionsWithIndex {
case (i, dat) => {
val datByPartition = dat // dat:Iterator[LabeledPoint]
models(i).train(datByPartition) // however input of "train" needs to be RDD[LabeledPoint]
...
}
}
Any suggestions would be greatly appreciated!
"Spark MLlib: building classifiers for each data group" is similar to the problem I am having; apparently there was no good solution at the time. But that was 3 years ago, so I am wondering whether it is solvable now.
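One direction that avoids converting back to an RDD is to use a trainer that consumes an Iterator directly inside each partition. The sketch below only illustrates the shape of that idea with plain Scala: trainLocal is a hypothetical single-pass learner (a least-squares slope through the origin on one feature), a Seq of Seqs stands in for the K partitions, and none of these names come from MLlib. MLlib's own algorithms expect an RDD, so they could not be dropped in here as-is.

```scala
object PerPartitionTrainingSketch {
  // (label, feature); a stand-in for LabeledPoint with a single feature
  type Example = (Double, Double)

  // Hypothetical local trainer: least-squares slope through the origin,
  // computed in a single pass over an Iterator (no RDD needed).
  def trainLocal(examples: Iterator[Example]): Double = {
    val (sxy, sxx) = examples.foldLeft((0.0, 0.0)) {
      case ((xy, xx), (y, x)) => (xy + x * y, xx + x * x)
    }
    if (sxx == 0.0) 0.0 else sxy / sxx
  }

  // Same shape as the function passed to mapPartitionsWithIndex:
  // (partition index, Iterator of records) => Iterator of results.
  def trainPartition(index: Int, examples: Iterator[Example]): Iterator[(Int, Double)] =
    Iterator(index -> trainLocal(examples))

  // Plain-Scala stand-in for running the function on every partition.
  def trainAll(partitions: Seq[Seq[Example]]): Map[Int, Double] =
    partitions.zipWithIndex.flatMap {
      case (part, i) => trainPartition(i, part.iterator)
    }.toMap
}
```

With Spark, a function shaped like trainPartition is what mapPartitionsWithIndex would apply to each partition, and collecting the result would yield one small model per partition index.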
I'm trying to implement the DBSCAN algorithm on Spark, so I'm following the paper A Parallel DBSCAN Algorithm Based on Spark.
They propose an algorithm with 4 main steps:
1. Data partition
2. Computing a local DBSCAN
3. Merging the data partition
4. Global clustering
So I'm implementing the second step using GraphX, and the pseudocode is something like this:
Select an arbitrary point p in the current partition
Compute the N_{e} and if N_{e} >= minPts mark p as a core point, otherwise as a noise point.
If p is a core point then make a cluster c by p, adding all the points belonging to cluster c to a list for recursive calls.
...
And here is my code (I'm aware that it does not work):
def dataPartition() : Graph[Array[String], Int] = {
graph.partitionBy(PartitionStrategy.RandomVertexCut)
}
def computingLocalDBSCAN() : Unit = {
graph = dataPartition()
//val neighbors = graph.mapVertices((id, attr) => localDBSCANMap(id, attr))
}
def localDBSCANMap(id: VertexId, attr:Array[String], cluster:Int):Unit = {
val neighbors = graph.collectNeighbors(EdgeDirection.Out).lookup(id)
if (neighbors.size >= eps) {
attr(0) = PointType.Core.toString
attr(1) = cluster.toString
} else {
attr(0) = PointType.Noise.toString
}
neighbors.foreach(it => {
for (item <- it) {
localDBSCANMap(item._1, item._2, cluster)
}
})
}
I have multiple questions:
How could I change the value of one attribute of a vertex? I understand that the vertices are immutable, but I would like to flag the nodes as Noise, Core, Border or Unclassified.
How could I pick a random node inside a partition? My problem with the map method is that I have to modify the values while I'm traversing the graph.
How could I call a recursive method and modify the attribute values at the same time?
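Regarding the first question, the usual way around immutability is to compute a new set of labels instead of modifying attributes in place. The sketch below shows only the core/noise test from the pseudocode, on plain Scala collections and with 2D points, Euclidean distance and a neighbourhood that includes the point itself as assumptions; CorePointSketch and classify are hypothetical names. In GraphX the analogous step would produce a new graph (for example through mapVertices) rather than writing into attr.

```scala
object CorePointSketch {
  sealed trait PointType
  case object Core extends PointType
  case object Noise extends PointType

  type Point = (Double, Double)

  // Euclidean distance between two 2D points (an assumption of this sketch).
  def dist(a: Point, b: Point): Double =
    math.hypot(a._1 - b._1, a._2 - b._2)

  // Returns a NEW map of labels; the input map is never modified.
  def classify(points: Map[Long, Point], eps: Double, minPts: Int): Map[Long, PointType] =
    points.map { case (id, p) =>
      // Size of the eps-neighbourhood, counting the point itself.
      val neighbours = points.values.count(q => dist(p, q) <= eps)
      val label: PointType = if (neighbours >= minPts) Core else Noise
      id -> label
    }
}
```

Note that the comparison is against minPts, while eps is only used as the distance threshold.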
Let Q be a distributed RowMatrix in Spark. I want to calculate the cross product of Q with its transpose Q'.
However, although RowMatrix does have a multiply() method, it only accepts a local Matrix as an argument.
Code illustration ( Scala ):
val phi = new RowMatrix(phiRDD) // phiRDD is an instance of RDD[Vector]
val phiTranspose = transposeRowMatrix(phi) // transposeRowMatrix()
// returns the transpose of a RowMatrix
val crossMat = ? // phi * phiTranspose
Note that I want to multiply two distributed RowMatrix instances, not a distributed one with a local one.
One solution is to use an IndexedRowMatrix as follows:
val phi = new IndexedRowMatrix(phiRDD) // phiRDD is an instance of RDD[IndexedRow]
val phiTranspose = transposeMatrix(phi) // transposeMatrix()
// returns the transpose of a Matrix
val crossMat = phi.toBlockMatrix().multiply( phiTranspose.toBlockMatrix()
).toIndexedRowMatrix()
However, I want to use RowMatrix methods such as tallSkinnyQR(), which means that I should convert crossMat to a RowMatrix using the .toRowMatrix() method:
val crossRowMat = crossMat.toRowMatrix()
and finally I can apply
crossRowMat.tallSkinnyQR()
but this process involves many conversions between the distributed matrix types, and according to what I understood from the MLlib Programming Guide this is expensive:
It is very important to choose the right format to store large and distributed matrices. Converting a distributed matrix to a different format may require a global shuffle, which is quite expensive.
Would someone elaborate, please.
The only distributed matrices that support matrix-matrix multiplication are BlockMatrices. You have to convert your data accordingly; artificial indices are good enough:
new IndexedRowMatrix(
rowMatrix.rows.zipWithIndex.map(x => IndexedRow(x._2, x._1))
).toBlockMatrix match { case m => m.multiply(m.transpose) }
I used the algorithm listed on this page which moves the multiplication problem from dot product to distributed scalar product problem by using vectors outer product:
The outer product between two vectors is the scalar product of the
second vector with all the elements in the first vector, resulting in
a matrix
My own multiplication function for RowMatrix instances (it could be optimized further) ended up like this:
def multiplyRowMatrices(m1: RowMatrix, m2: RowMatrix)(implicit ctx: SparkSession): RowMatrix = {
// Zip m1 columns with m2 rows
val m1Cm2R = transposeRowMatrix(m1).rows.zip(m2.rows)
// Apply scalar product between each entry in m1 vector with m2 row
val scalar = m1Cm2R.map{
case(column:DenseVector,row:DenseVector) => column.toArray.map{
columnValue => row.toArray.map{
rowValue => columnValue*rowValue
}
}
}
// Add all the resulting matrices point wisely
val sum = scalar.reduce{
case(matrix1,matrix2) => matrix1.zip(matrix2).map{
case(array1,array2)=> array1.zip(array2).map{
case(value1,value2)=> value1+value2
}
}
}
new RowMatrix(ctx.sparkContext.parallelize(sum.map(array=> Vectors.dense(array))))
}
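The identity the function above relies on is that a matrix product equals the sum, over the shared index k, of the outer product of column k of the first matrix with row k of the second. A minimal local sketch of that identity, using plain arrays instead of RowMatrix (OuterProductSketch and its methods are hypothetical names, and the matrices are assumed to be small enough to fit in memory):

```scala
object OuterProductSketch {
  type Matrix = Array[Array[Double]]

  // Outer product: every element of `col` scales the whole vector `row`.
  def outer(col: Array[Double], row: Array[Double]): Matrix =
    col.map(c => row.map(r => c * r))

  // Element-wise sum of two matrices of the same shape.
  def add(a: Matrix, b: Matrix): Matrix =
    a.zip(b).map { case (ra, rb) => ra.zip(rb).map { case (x, y) => x + y } }

  // a * b as the sum of outer products of a's columns with b's rows.
  def multiply(a: Matrix, b: Matrix): Matrix = {
    require(a.nonEmpty && a.head.length == b.length, "inner dimensions must match")
    val columnsOfA = a.transpose
    columnsOfA.zip(b).map { case (col, row) => outer(col, row) }.reduce(add)
  }
}
```

Each outer product has the full shape of the result, which is why the distributed version has to reduce whole matrices and tends to move more data than a blocked multiplication.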
After that I tested both approaches, my own function and the block matrix conversion, on a 300*10 matrix on a single machine.
Using my own function:
val PhiMat = new RowMatrix(phi)
val TphiMat = transposeRowMatrix(PhiMat)
val product = multiplyRowMatrices(PhiMat,TphiMat)
Using matrix transformation:
val MatRow = new RowMatrix(phi)
val MatBlock = new IndexedRowMatrix(MatRow.rows.zipWithIndex.map(x => IndexedRow(x._2, x._1))).toBlockMatrix()
val TMatBlock = MatBlock.transpose
val productMatBlock = MatBlock.multiply(TMatBlock)
val productMatRow = productMatBlock.toIndexedRowMatrix().toRowMatrix()
The first approach spawned 1 job with 5 stages and took 2s to finish in total, while the second approach spawned 4 jobs, three with one stage and one with two stages, and took 0.323s in total. The second approach also outperformed the first with respect to shuffle read/write size.
Yet I am still confused by the MLlib Programming Guide statement:
It is very important to choose the right format to store large and
distributed matrices. Converting a distributed matrix to a different
format may require a global shuffle, which is quite expensive.
Is it possible to implement the scenario shown above?
The system starts with one key-value pair and will discover new pairs. First the number of key-value pairs will increase and then shrink across iterations.
Update: I have to shift to Flink Streaming for iteration support. I will try with Kafka though!
With Apache Flink it is possible to define feedback edges via the iterate API call. The iterate method expects a step function which, given an input stream, produces a feedback stream and an output stream. The former is fed back to the step function and the latter is sent to downstream operators.
A simple example looks like:
import org.apache.flink.streaming.api.scala._
val env = StreamExecutionEnvironment.getExecutionEnvironment
val input = env.fromElements(1).map(x => (x, math.random))
val output = input.iterate {
inputStream =>
val iterationBody = inputStream.flatMap {
randomWalk =>
val (step, position) = randomWalk
val direction = 2 * (math.random - 0.5)
val bifurcate = math.random >= 0.75
Seq(
Some((step + 1, position + direction)),
if (bifurcate) Some((step + 1, position - direction)) else None).flatten
}
val feedback = iterationBody.filter {
randomWalk => math.abs(randomWalk._2) < 1.0
}
val output = iterationBody.filter {
randomWalk => math.abs(randomWalk._2) >= 1.0
}
(feedback, output)
}
output.print()
// execute program
env.execute("Random Walk with Bifurcation")
Here we calculate a random walk in which we randomly split the walk to also proceed in the opposite direction. A random walk is finished iff its absolute position value is greater than or equal to 1.0.
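The feedback semantics can be imitated locally with a work queue, which may help when reasoning about when such a loop terminates. This is a plain-Scala sketch and not Flink code: FeedbackLoopSketch and its iterate method are hypothetical, elements are processed one at a time, and the loop only ends if the step function eventually stops producing feedback.

```scala
import scala.collection.immutable.Queue

object FeedbackLoopSketch {
  // `step` maps one element to (elements fed back, elements emitted).
  def iterate[A, B](initial: Seq[A])(step: A => (Seq[A], Seq[B])): Seq[B] = {
    @annotation.tailrec
    def loop(pending: Queue[A], out: Vector[B]): Vector[B] =
      pending.dequeueOption match {
        case None => out // nothing left to feed back: the iteration is done
        case Some((head, rest)) =>
          val (feedback, emitted) = step(head)
          // Feedback goes to the back of the queue, output leaves the loop.
          loop(rest ++ feedback, out ++ emitted)
      }
    loop(Queue(initial: _*), Vector.empty)
  }
}
```

For example, starting from a single element and letting every element below a threshold produce two successors makes the number of pending elements grow first and then shrink to zero, which matches the grow-then-shrink behaviour described in the question.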
In the Spark MLlib documentation for Dimensionality Reduction there is a section about PCA that describes how to use PCA in Spark. The computePrincipalComponents method requires a parameter that determines the number of top components that we want.
The problem is that I don't know how many components I want; I want as few as possible. In some other tools, PCA produces a table showing that, for example, choosing the first 3 components covers 95 percent of the variance. Does Spark have this functionality in its libraries, and if not, how can I implement it in Spark?
Spark 2.0+:
This should be available out of the box. See SPARK-11530 for details.
Spark <= 1.6
Spark doesn't provide this functionality yet, but it is not hard to implement using existing Spark code and the definition of explained variance. Let's say we want to explain 75 percent of the total variance:
val targetVar = 0.75
First let's reuse Spark code to compute the SVD:
import breeze.linalg.{DenseMatrix => BDM, DenseVector => BDV, svd => brzSvd}
import breeze.linalg.accumulate
import java.util.Arrays
import org.apache.spark.mllib.linalg.Matrices // needed for Matrices.dense in the last step
// Compute covariance matrix
val cov = mat.computeCovariance()
// Compute SVD
val brzSvd.SVD(u: BDM[Double], e: BDV[Double], _) = brzSvd(
new BDM(cov.numRows, cov.numCols, cov.toArray))
Next we can find the cumulative fraction of explained variance:
val varExplained = accumulate(e).map(x => x / e.toArray.sum).toArray
and the number of components we need (k is a zero-based index, so k + 1 components are kept):
val (v, k) = varExplained.zipWithIndex.filter{
case (v, _) => v >= targetVar
}.head
Finally we can subset U once again reusing Spark code:
val n = mat.numCols.toInt
Matrices.dense(n, k + 1, Arrays.copyOfRange(u.data, 0, n * (k + 1)))
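The selection step can be checked in isolation without Spark or Breeze. The sketch below assumes the eigenvalues are already sorted in descending order, as the singular values returned by the SVD are; ExplainedVarianceSketch and componentsNeeded are hypothetical names, and targetVar plays the same role as above.

```scala
object ExplainedVarianceSketch {
  // Smallest number of leading components whose cumulative share of the
  // total variance reaches targetVar. Assumes eigenvalues sorted descending.
  def componentsNeeded(eigenvalues: Seq[Double], targetVar: Double): Int = {
    val total = eigenvalues.sum
    // Running sums divided by the total give the cumulative explained variance.
    val cumulative = eigenvalues.scanLeft(0.0)(_ + _).tail.map(_ / total)
    val idx = cumulative.indexWhere(_ >= targetVar)
    // idx is zero-based; fall back to all components if the target is never met.
    if (idx >= 0) idx + 1 else eigenvalues.length
  }
}
```

With eigenvalues 4, 3, 2, 1 the cumulative fractions are 0.4, 0.7, 0.9 and 1.0, so a target of 0.75 needs three components.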