Spark RDD Inner Join Performance after collecting a sample of data - apache-spark

I have two Spark jobs that each read two datasets from a Hadoop cluster, load the data into RDDs and perform an inner join. The only difference between the two jobs is a sampling operation before the inner join, which I need for building a histogram of the datasets. The sampling is performed on both datasets as rdd.sample(false, 0.01, 261).collect() (I'm not using the takeSample() method). The job that includes the sampling step runs much faster than the job that simply performs the inner join (its execution time is about 25% lower!).
JavaRDD<String> rddTxtFilesA = jsc.textFile("hdfs://node1:9000/user/datasetA.csv");
JavaRDD<String> rddTxtFilesB = jsc.textFile("hdfs://node1:9000/user/datasetB.csv");
JavaRDD<Tuple3<String, Double, Double>> rddTuplesA = rddTxtFilesA.map((String line) -> {
    String[] elements = line.split("\t");
    return new Tuple3<>(elements[0], Double.parseDouble(elements[1]), Double.parseDouble(elements[2]));
});
JavaRDD<Tuple3<String, Double, Double>> rddTuplesB = rddTxtFilesB.map((String line) -> {
    String[] elements = line.split("\t");
    return new Tuple3<>(elements[0], Double.parseDouble(elements[1]), Double.parseDouble(elements[2]));
});
// The only part that differs between the two Spark jobs: the sampling operation
rddTuplesA.sample(false, 0.01, 261).collect().forEach((tuple) -> {
    // build histogram
});
rddTuplesB.sample(false, 0.01, 261).collect().forEach((tuple) -> {
    // build histogram
});
// End of the sampling operation
JavaPairRDD<Integer, Tuple3<String, Double, Double>> pairRDDA = rddTuplesA.flatMapToPair(...);
JavaPairRDD<Integer, Tuple3<String, Double, Double>> pairRDDB = rddTuplesB.flatMapToPair(...);
CustomPartitioner cp = ...
JavaPairRDD<Integer, Tuple2<Tuple3<String, Double, Double>, Tuple3<String, Double, Double>>> joinedRDD = pairRDDA.join(pairRDDB, cp);
long count = joinedRDD.count();
It seems that the collect() operation somehow affects the job and makes it run faster. Any thoughts on why this is happening? I'm using Spark 3.1.3.
Note also that when I replace the collect() method with the count() method, the same thing happens. Thus, I suspect that it is the action itself that somehow affects the job's execution time.

Related

How to operate on RDD by partition?

I would like to repartition data from an RDD[LabeledPoint] into K partitions, and use the K partitions to train K ML models, one per partition, in parallel.
The closest operation I know of that can produce this is .mapPartitionsWithIndex, but it hands each partition to my function as an Iterator[LabeledPoint], while the model takes an RDD[LabeledPoint] as input. I think I could convert the Iterator[LabeledPoint] to an RDD[LabeledPoint], but it seems redundant to convert from RDD to Iterator and back again.
Below is what I have so far.
val K = models.size // models: Array[(idx, Model)]
val repartitioned = data.repartition(K) // RDD[LabeledPoint]; repartition returns a new RDD
repartitioned.mapPartitionsWithIndex {
  case (i, dat) => {
    val datByPartition = dat // dat: Iterator[LabeledPoint]
    models(i).train(datByPartition) // however, the input of "train" needs to be RDD[LabeledPoint]
    ...
  }
}
Any suggestions would be greatly appreciated!
The question "Spark MLlib: building classifiers for each data group" is similar to the problem I am having, but apparently there was no good solution at the time. That was 3 years ago, so I am wondering whether it is solvable now.

Distributed DBSCAN on spark

I'm trying to implement the DBSCAN algorithm on Spark, so I'm following the paper A Parallel DBSCAN Algorithm Based on Spark.
They propose an algorithm with 4 main steps:
Data partition
Computing a local DBSCAN
Merging the data partition
Global clustering
So I'm implementing the second step using GraphX, and the pseudocode is something like this:
Select an arbitrary point p in the current partition
Compute the N_{e} and if N_{e} >= minPts mark p as a core point, otherwise as a noise point.
If p is a core point then make a cluster c by p, adding all the points belonging to cluster c to a list for recursive calls.
...
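For reference, the first steps of that pseudocode can be tried out on a single machine before moving them to GraphX. The following is only a minimal sketch in plain Java, not the paper's implementation: it assumes 2-D points and Euclidean distance, the names LocalDbscan, neighbors and cluster are made up for illustration, and it uses an explicit queue instead of recursive calls.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical single-machine sketch of the local DBSCAN step (not GraphX, not the paper's code).
class LocalDbscan {
    static final int UNCLASSIFIED = 0;
    static final int NOISE = -1;

    // Indices of all points within distance eps of points[p], including p itself.
    static List<Integer> neighbors(double[][] points, int p, double eps) {
        List<Integer> result = new ArrayList<>();
        for (int q = 0; q < points.length; q++) {
            double dx = points[p][0] - points[q][0];
            double dy = points[p][1] - points[q][1];
            if (Math.sqrt(dx * dx + dy * dy) <= eps) {
                result.add(q);
            }
        }
        return result;
    }

    // Returns one label per point: NOISE (-1) or a cluster id starting at 1.
    static int[] cluster(double[][] points, double eps, int minPts) {
        int[] labels = new int[points.length]; // every point starts as UNCLASSIFIED
        int clusterId = 0;
        for (int p = 0; p < points.length; p++) {
            if (labels[p] != UNCLASSIFIED) continue;
            List<Integer> seeds = neighbors(points, p, eps);
            if (seeds.size() < minPts) {
                labels[p] = NOISE; // may later be relabeled as a border point
                continue;
            }
            clusterId++; // p is a core point: start a new cluster
            labels[p] = clusterId;
            Deque<Integer> queue = new ArrayDeque<>(seeds); // replaces the recursive calls
            while (!queue.isEmpty()) {
                int q = queue.poll();
                if (labels[q] == NOISE) labels[q] = clusterId; // border point, not expanded
                if (labels[q] != UNCLASSIFIED) continue;
                labels[q] = clusterId;
                List<Integer> reachable = neighbors(points, q, eps);
                if (reachable.size() >= minPts) queue.addAll(reachable); // q is a core point too
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 1}, {1, 0}, {10, 10}};
        System.out.println(java.util.Arrays.toString(cluster(points, 1.5, 3)));
    }
}
```

Because the labels live in a separate array that is returned at the end, nothing has to be mutated in place, which is roughly the shape an immutable vertex-attribute update would need as well.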
And here is my code (I'm aware that it does not work):
def dataPartition() : Graph[Array[String], Int] = {
  graph.partitionBy(PartitionStrategy.RandomVertexCut)
}
def computingLocalDBSCAN() : Unit = {
  graph = dataPartition()
  //val neighbors = graph.mapVertices((id, attr) => localDBSCANMap(id, attr))
}
def localDBSCANMap(id: VertexId, attr:Array[String], cluster:Int):Unit = {
  val neighbors = graph.collectNeighbors(EdgeDirection.Out).lookup(id)
  if (neighbors.size >= eps) {
    attr(0) = PointType.Core.toString
    attr(1) = cluster.toString
  } else {
    attr(0) = PointType.Noise.toString
  }
  neighbors.foreach(it => {
    for (item <- it) {
      localDBSCANMap(item._1, item._2, cluster)
    }
  })
}
I have multiple questions:
How could I change the value of one attribute of a vertex? I understand that the vertices are immutable, but I would like to flag the nodes as Noise, Core, Border or Unclassified.
How could I pick a random node inside a partition? My problem with the map method is that I have to modify the values while I'm traversing the graph.
How could I call a recursive method and modify the attribute values at the same time?

How to compute the dot product of two distributed RowMatrix in Apache Spark?

Let Q be a distributed RowMatrix in Spark. I want to calculate the matrix product of Q with its transpose Q'.
However, although a RowMatrix does have a multiply() method, it only accepts a local Matrix as its argument.
Code illustration ( Scala ):
val phi = new RowMatrix(phiRDD) // phiRDD is an instance of RDD[Vector]
val phiTranspose = transposeRowMatrix(phi) // transposeRowMatrix()
// returns the transpose of a RowMatrix
val crossMat = ? // phi * phiTranspose
Note that I want to multiply two distributed RowMatrix instances, not a distributed one with a local one.
One solution is to use an IndexedRowMatrix as follows:
val phi = new IndexedRowMatrix(phiRDD) // phiRDD is an instance of RDD[IndexedRow]
val phiTranspose = transposeMatrix(phi) // transposeMatrix()
// returns the transpose of a Matrix
val crossMat = phi.toBlockMatrix().multiply( phiTranspose.toBlockMatrix()
).toIndexedRowMatrix()
However, I want to use RowMatrix methods such as tallSkinnyQR(), which means that I should convert crossMat to a RowMatrix using the .toRowMatrix() method:
val crossRowMat = crossMat.toRowMatrix()
and finally I can apply
crossRowMat.tallSkinnyQR()
but this process includes many transformations between the types of the Distributed Matrices and according to what I understood from MLlib Programming Guide this is expensive:
It is very important to choose the right format to store large and distributed matrices. Converting a distributed matrix to a different format may require a global shuffle, which is quite expensive.
Would someone elaborate, please.
The only distributed matrices that support matrix-matrix multiplication are BlockMatrices. You have to convert your data accordingly; artificial indices are good enough:
new IndexedRowMatrix(
rowMatrix.rows.zipWithIndex.map(x => IndexedRow(x._2, x._1))
).toBlockMatrix match { case m => m.multiply(m.transpose) }
I used the algorithm listed on this page, which recasts the multiplication from a dot-product problem into a distributed scalar-product problem by using the outer product of vectors:
The outer product between two vectors is the scalar product of the second vector with all the elements in the first vector, resulting in a matrix
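To make the idea concrete: the product of an m x k matrix A and a k x n matrix B equals the sum, over t, of the outer product of column t of A with row t of B. Here is a small local sketch of that identity in plain Java (no Spark involved; the class and method names are only illustrative assumptions):

```java
import java.util.Arrays;

// Illustrative local sketch: matrix product as a sum of outer products.
class OuterProductMultiply {
    // Outer product of a column vector (length m) and a row vector (length n): an m x n matrix.
    static double[][] outer(double[] col, double[] row) {
        double[][] out = new double[col.length][row.length];
        for (int i = 0; i < col.length; i++) {
            for (int j = 0; j < row.length; j++) {
                out[i][j] = col[i] * row[j];
            }
        }
        return out;
    }

    // Computes A * B by pairing column t of A with row t of B and summing the outer products.
    static double[][] multiply(double[][] a, double[][] b) {
        int m = a.length, k = b.length, n = b[0].length;
        double[][] sum = new double[m][n];
        for (int t = 0; t < k; t++) {
            double[] col = new double[m];
            for (int i = 0; i < m; i++) {
                col[i] = a[i][t]; // column t of A
            }
            double[][] term = outer(col, b[t]);
            for (int i = 0; i < m; i++) {
                for (int j = 0; j < n; j++) {
                    sum[i][j] += term[i][j]; // element-wise sum of the partial matrices
                }
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        double[][] a = {{1, 2}, {3, 4}};
        double[][] b = {{5, 6}, {7, 8}};
        System.out.println(Arrays.deepToString(multiply(a, b))); // [[19.0, 22.0], [43.0, 50.0]]
    }
}
```

In the distributed version, each (column, row) pair can be processed independently and the partial matrices are then added up, which is what makes the formulation attractive for an RDD of rows.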
My own multiplication function for RowMatrix (it could be optimized further) ended up like this:
def multiplyRowMatrices(m1: RowMatrix, m2: RowMatrix)(implicit ctx: SparkSession): RowMatrix = {
  // Zip m1 columns with m2 rows
  val m1Cm2R = transposeRowMatrix(m1).rows.zip(m2.rows)
  // Apply scalar product between each entry in m1 vector with m2 row
  val scalar = m1Cm2R.map{
    case(column:DenseVector,row:DenseVector) => column.toArray.map{
      columnValue => row.toArray.map{
        rowValue => columnValue*rowValue
      }
    }
  }
  // Add all the resulting matrices pointwise
  val sum = scalar.reduce{
    case(matrix1,matrix2) => matrix1.zip(matrix2).map{
      case(array1,array2)=> array1.zip(array2).map{
        case(value1,value2)=> value1+value2
      }
    }
  }
  new RowMatrix(ctx.sparkContext.parallelize(sum.map(array=> Vectors.dense(array))))
}
After that I tested both approaches, my own function and the block-matrix conversion, using a 300*10 matrix on a single machine.
Using my own function:
val PhiMat = new RowMatrix(phi)
val TphiMat = transposeRowMatrix(PhiMat)
val product = multiplyRowMatrices(PhiMat,TphiMat)
Using matrix transformation:
val MatRow = new RowMatrix(phi)
val MatBlock = new IndexedRowMatrix(MatRow.rows.zipWithIndex.map(x => IndexedRow(x._2, x._1))).toBlockMatrix()
val TMatBlock = MatBlock.transpose
val productMatBlock = MatBlock.multiply(TMatBlock)
val productMatRow = productMatBlock.toIndexedRowMatrix().toRowMatrix()
The first approach spawned 1 job with 5 stages and took 2s to finish in total, while the second approach spawned 4 jobs, three with one stage and one with two stages, and took 0.323s in total. The second approach also outperformed the first with respect to shuffle read/write size.
Yet I am still confused by the MLlib Programming Guide statement:
It is very important to choose the right format to store large and
distributed matrices. Converting a distributed matrix to a different
format may require a global shuffle, which is quite expensive.

How does SGD work in Spark MLlib with Streaming

I am new to Spark and wonder what will happen in the background in terms of RDDs on the cluster if the trainOn method is invoked on a StreamingLinearAlgorithm e.g.
val regression = new StreamingLinearRegressionWithSGD()
regression.trainOn(someStream)
StreamingLinearRegressionWithSGD contains the actual model (LinearRegressionModel) as well as the learning algorithm (LinearRegressionWithSGD) as a protected member. In method trainOn, the following is happening:
data.foreachRDD { (rdd, time) =>
  if (!rdd.isEmpty) {
    model = Some(algorithm.run(rdd, model.get.weights))
  }
}
From my understanding, the streaming data is chunked every x seconds and each chunk is put in an RDD. So every x seconds the learning algorithm is applied to a new RDD. The actual learning magic happens somewhere in GradientDescent, where the RDD data and the previous model weight vector are the input and the updated weight vector is the output.
val bcWeights = data.context.broadcast(weights)
// Sample a subset (fraction miniBatchFraction) of the total data
// compute and sum up the subgradients on this subset (this is one map-reduce)
val (gradientSum, lossSum, miniBatchSize) = data.sample(false, miniBatchFraction, 42 + i)
  .treeAggregate((BDV.zeros[Double](n), 0.0, 0L))(
    seqOp = (c, v) => {
      // c: (grad, loss, count), v: (label, features)
      val l = gradient.compute(v._2, v._1, bcWeights.value, Vectors.fromBreeze(c._1))
      (c._1, c._2 + l, c._3 + 1)
    },
    combOp = (c1, c2) => {
      // c: (grad, loss, count)
      (c1._1 += c2._1, c1._2 + c2._2, c1._3 + c2._3)
    })
...
val update = updater.compute(
  weights, Vectors.fromBreeze(gradientSum / miniBatchSize.toDouble),
  stepSize, i, regParam)
weights = update._1
regVal = update._2
How is the weight vector updated? Does it work in parallel? I thought an RDD is split up into partitions that are then distributed. I can see that the input weights are broadcast through the Spark context and that treeAggregate is used to sum up the gradients, but I still cannot grasp which actual map-reduce task is taking place, which part happens on the driver, and which part runs in parallel. Can anyone explain what is happening in detail?
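One way to read the excerpt is as three phases per iteration: seqOp folds the examples of each partition into a local (gradient sum, loss sum, count) triple, combOp merges those triples, and the weight update is then computed once from the merged result. The sketch below imitates that shape in plain Java with lists standing in for partitions. It is an illustration under assumptions, not Spark's code: sampling is left out (as if miniBatchFraction were 1.0), a least-squares gradient is assumed, the step size is assumed to shrink with 1/sqrt(iter), regularization is ignored, and all names are made up.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative single-process imitation of one mini-batch gradient step.
class MiniBatchStep {
    // Accumulator mirroring the (gradientSum, lossSum, count) triple in the excerpt.
    static final class Acc {
        final double[] grad;
        double loss;
        long count;
        Acc(int n) { grad = new double[n]; }
    }

    // seqOp: fold one (label, features) example into a partition-local accumulator.
    static Acc seqOp(Acc c, double label, double[] x, double[] w) {
        double diff = -label;
        for (int j = 0; j < x.length; j++) diff += w[j] * x[j]; // prediction minus label
        for (int j = 0; j < x.length; j++) c.grad[j] += diff * x[j]; // assumed least-squares gradient
        c.loss += diff * diff / 2.0;
        c.count++;
        return c;
    }

    // combOp: merge two partial accumulators.
    static Acc combOp(Acc a, Acc b) {
        for (int j = 0; j < a.grad.length; j++) a.grad[j] += b.grad[j];
        a.loss += b.loss;
        a.count += b.count;
        return a;
    }

    // One iteration. Each row is {label, x1, ..., xn}; each inner list stands in for a partition.
    static double[] step(List<List<double[]>> partitions, double[] w, double stepSize, int iter) {
        Acc total = new Acc(w.length);
        for (List<double[]> partition : partitions) { // in Spark, this part would run as parallel tasks
            Acc local = new Acc(w.length);
            for (double[] row : partition) {
                seqOp(local, row[0], Arrays.copyOfRange(row, 1, row.length), w);
            }
            combOp(total, local);
        }
        double[] next = w.clone(); // from here on: the single, non-parallel update
        if (total.count == 0) return next;
        double rate = stepSize / Math.sqrt(iter); // assumed step-size schedule
        for (int j = 0; j < next.length; j++) {
            next[j] -= rate * total.grad[j] / total.count; // average gradient over the batch
        }
        return next;
    }

    public static void main(String[] args) {
        List<List<double[]>> partitions = List.of(
            List.of(new double[] {2, 1}), List.of(new double[] {4, 2}));
        System.out.println(Arrays.toString(step(partitions, new double[] {0}, 1.0, 1)));
    }
}
```

Under this reading, only the per-partition folding is data-parallel; the weight vector itself is small, so it is shipped out to the tasks (the broadcast in the excerpt) and updated in one place.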

Spark Streaming: How to feedback output into input

Is it possible to implement the above shown scenario?
The system starts with one key-value pair and will discover new pairs. First the number of key-value pairs will increase and then shrink across iterations.
Update: I have to shift to Flink Streaming for Iteration support. Will try with kafka though!
With Apache Flink it is possible to define feedback edges via the iterate API call. The iterate method expects a step function which, given an input stream, produces a feedback stream and an output stream. The former stream is fed back to the step function and the latter stream is sent to downstream operators.
A simple example looks like:
val env = StreamExecutionEnvironment.getExecutionEnvironment
val input = env.fromElements(1).map(x => (x, math.random))
val output = input.iterate {
  inputStream =>
    val iterationBody = inputStream.flatMap {
      randomWalk =>
        val (step, position) = randomWalk
        val direction = 2 * (math.random - 0.5)
        val bifurcate = math.random >= 0.75
        Seq(
          Some((step + 1, position + direction)),
          if (bifurcate) Some((step + 1, position - direction)) else None).flatten
    }
    val feedback = iterationBody.filter {
      randomWalk => math.abs(randomWalk._2) < 1.0
    }
    val output = iterationBody.filter {
      randomWalk => math.abs(randomWalk._2) >= 1.0
    }
    (feedback, output)
}
output.print()
// execute program
env.execute("Random Walk with Bifurcation")
Here we calculate a random walk where we randomly split our walk to proceed in the opposite direction. A random walk is finished iff its absolute position value is greater or equal to 1.0.
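If it helps to see the feedback/output split without a streaming runtime, the same loop can be imitated with a plain queue. This is only a rough single-threaded sketch in Java, not Flink code: the class name, the fixed seed and the maxSteps safety cap are assumptions added so that the example is repeatable and always terminates.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Random;

// Illustrative imitation of an iteration with a feedback edge, using a queue.
class FeedbackLoop {
    // Each element is {step, position}. Returns the elements sent to the output stream.
    static List<double[]> run(long seed, int maxSteps) {
        Random rnd = new Random(seed);
        Deque<double[]> feedback = new ArrayDeque<>(); // plays the role of the feedback edge
        List<double[]> output = new ArrayList<>();
        feedback.add(new double[] {1, rnd.nextDouble()}); // the single initial element
        while (!feedback.isEmpty()) {
            double[] walk = feedback.poll();
            double step = walk[0];
            double position = walk[1];
            if (step >= maxSteps) continue; // hypothetical cap: drop walks that run too long
            double direction = 2 * (rnd.nextDouble() - 0.5);
            List<double[]> produced = new ArrayList<>();
            produced.add(new double[] {step + 1, position + direction});
            if (rnd.nextDouble() >= 0.75) { // bifurcate into the opposite direction
                produced.add(new double[] {step + 1, position - direction});
            }
            for (double[] next : produced) {
                if (Math.abs(next[1]) < 1.0) {
                    feedback.add(next); // still inside (-1, 1): goes around the loop again
                } else {
                    output.add(next); // finished walk: leaves the loop
                }
            }
        }
        return output;
    }

    public static void main(String[] args) {
        for (double[] walk : run(42L, 30)) {
            System.out.println((int) walk[0] + " " + walk[1]);
        }
    }
}
```

As in the description of the original scenario, the number of elements in the loop can first grow (through bifurcation) and then shrink as walks finish.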
