Spark: How to perform prediction using trained data set (MLlib: SVMWithSGD)

I am new to Spark. I am able to train on the data set, but I am not able to use the trained model to make predictions.
Here is the code to train on the data, which is a 1800x4000 matrix:
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
// Load and parse the data
val data = sc.textFile("data/mllib/ridge-data/myfile.txt")
val parsedData = data.map { line =>
  val parts = line.split(' ')
  // Note: after splitting on ' ', parts(1) contains no spaces, so this yields a single feature.
  // If each line is "label f1 f2 ...", parts.tail.map(_.toDouble) may be what is intended.
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}
val firstDataPoint = parsedData.take(1)(0)
// Building the model
val numIterations = 100
val model = SVMWithSGD.train(parsedData, numIterations)
//val model = LinearRegressionWithSGD.train(parsedData,numIterations)
// Evaluate the model on the training examples
val labelAndPreds = parsedData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val trainErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / parsedData.count
println("Training Error = " + trainErr)
Now I load the data to be used for prediction; it is a vector of 1800 values:
val test = sc.textFile("data/mllib/ridge-data/data.txt")
But I am not sure how to perform prediction using this data. Please help.

First load the LabeledPoints from the text file (keep in mind that the RDD must have been saved with saveAsTextFile):
JavaRDD<LabeledPoint> test = MLUtils.loadLabeledPoints(init.context, "hdfs://../test/", 30).toJavaRDD();
JavaRDD<Tuple2<Object, Object>> scoreAndLabels = test.map(
new Function<LabeledPoint, Tuple2<Object, Object>>() {
public Tuple2<Object, Object> call(LabeledPoint p) {
Double score = model.predict(p.features());
return new Tuple2<Object, Object>(score, p.label());
}
}
);
Now collect the scores and iterate over them:
List<Tuple2<Object, Object>> scores = scoreAndLabels.collect();
for(Tuple2<Object, Object> score : scores){
System.out.println(score._1 + " \t" + score._2);
}
It is in Java, but maybe you can convert it :)
But the prediction values do not make sense:
-18.841544889249917 0.0
168.32916035523283 1.0
420.67763915879794 1.0
-974.1942589201286 0.0
71.73602841256813 1.0
233.13636224524993 1.0
-1000.5902168199027 0.0
Does anybody know what they mean?
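The first column looks like raw SVM margins (w·x + b) rather than class labels. In MLlib, SVMModel.predict returns the raw margin once the threshold has been cleared with clearThreshold(); otherwise it compares the margin with the threshold (0.0 by default) and returns 0.0 or 1.0. That would be consistent with the output above, where every negative score has label 0.0 and every positive score has label 1.0. Here is a minimal plain-Python sketch of that logic; the weights, intercept and input line are made-up illustration values, and the function names are hypothetical, not Spark API:

```python
# Plain-Python sketch of how a linear SVM turns features into a score and a label.
# All numbers below are made up for illustration; none of this is Spark API.

def parse_features(line):
    """Parse one line of space-separated numbers into a list of floats."""
    return [float(tok) for tok in line.split()]

def margin(weights, intercept, features):
    """Raw score: dot product of weights and features, plus the intercept."""
    return sum(w * x for w, x in zip(weights, features)) + intercept

def predict(weights, intercept, features, threshold=0.0):
    """Return the raw margin if threshold is None, otherwise a 0.0/1.0 label."""
    score = margin(weights, intercept, features)
    if threshold is None:
        return score
    return 1.0 if score > threshold else 0.0

weights = [2.0, -1.0, 0.5]
intercept = -0.5
features = parse_features("1.0 3.0 2.0")

raw_score = predict(weights, intercept, features, threshold=None)  # like a cleared threshold
label = predict(weights, intercept, features)                      # like the default threshold
print(raw_score, label)
```

So if you want 0/1 labels, keep (or restore) a threshold on the model; if you want a ranking score, e.g. for ROC curves, the raw margins are what you need.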

Related

How can I select balanced sampling for binary classification?

Here is my code, which loads data from Hive and balances the samples:
// Load SubSet Data
val dataList = DataLoader.loadSubTrainTestData(hiveContext.sql(sampleDataHql))
// Split Data to Train and Test
val data = dataList.randomSplit(Array(0.7, 0.3), seed = 11L)
// Random balance train data
val sampleCount = data(0).map(rec => (rec.label, 1)).reduceByKey(_ + _)
val positiveSample = data(0).filter(_.label == 1).cache()
val positiveSize = positiveSample.count()
val negativeSample = data(0).filter(_.label == 0).cache()
val negativeSize = negativeSample.count()
// Build train data
val trainData = positiveSample ++
  negativeSample.sample(withReplacement = false, 1.0 * positiveSize.toFloat / negativeSize, System.nanoTime())
// Data size
val trainDataSize = positiveSize + negativeSize
val testDataSize = trainDataSize * 3.0 / 7.0
I also calculate trainDataSize and testDataSize to evaluate the model confidence.
OK, I haven't tested this code, but it should go like this:
val data: RDD[LabeledPoint] = ???
val fractions: Map[Double, Double] = Map(0.0 -> 0.5, 1.0 -> 0.5)
val sampledData: RDD[LabeledPoint] = data
  .keyBy(_.label)
  .sampleByKeyExact(false, fractions) // Optionally with seed
  .values
You can convert your LabeledPoints into a pair RDD, then apply sampleByKeyExact with the fractions you wish to use.
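One thing to keep in mind: the fractions map gives a sampling rate per key, so 0.5 for both labels keeps half of each class and leaves the class ratio unchanged. To actually balance the classes, the fractions can be derived from the class counts. A minimal plain-Python sketch of that idea follows; the counts and the helper names are made up for illustration, and the sampler here is a simple Bernoulli sampler, not Spark's exact one:

```python
import random

def balancing_fractions(class_counts):
    """Per-class sampling fraction that downsamples every class to the smallest one."""
    smallest = min(class_counts.values())
    return {label: smallest / count for label, count in class_counts.items()}

def sample_by_label(rows, fractions, seed=11):
    """Keep each (label, payload) row with the probability assigned to its label."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fractions[row[0]]]

# Made-up class counts: 900 negatives, 100 positives.
counts = {0.0: 900, 1.0: 100}
fractions = balancing_fractions(counts)

rows = [(0.0, i) for i in range(900)] + [(1.0, i) for i in range(100)]
balanced = sample_by_label(rows, fractions)

n_pos = sum(1 for label, _ in balanced if label == 1.0)
n_neg = sum(1 for label, _ in balanced if label == 0.0)
print(fractions, n_pos, n_neg)
```

The resulting map (here 1.0 for the minority class and 1/9 for the majority class) is the kind of value you would pass as fractions to sampleByKeyExact.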

Probability of predictions using Spark LogisticRegressionWithLBFGS for multiclass classification

I am using LogisticRegressionWithLBFGS() to train a model with multiple classes.
The MLlib documentation says that clearThreshold() can be used only if the classification is binary. Is there a way to use something similar for multiclass classification in order to output the probabilities of each class for a given input to the model?
There are two ways to accomplish this. One is to create a method that assumes the responsibility of predictPoint in LogisticRegression.scala:
object ClassificationUtility {
  def predictPoint(dataMatrix: Vector, model: LogisticRegressionModel):
      (Double, Array[Double]) = {
    require(dataMatrix.size == model.numFeatures)
    val dataWithBiasSize: Int = model.weights.size / (model.numClasses - 1)
    val weightsArray: Array[Double] = model.weights match {
      case dv: DenseVector => dv.values
      case _ =>
        throw new IllegalArgumentException(s"weights only supports dense vector but got type ${model.weights.getClass}.")
    }
    var bestClass = 0
    var maxMargin = 0.0
    val withBias = dataMatrix.size + 1 == dataWithBiasSize
    val classProbabilities: Array[Double] = new Array[Double](model.numClasses)
    (0 until model.numClasses - 1).foreach { i =>
      var margin = 0.0
      dataMatrix.foreachActive { (index, value) =>
        if (value != 0.0) margin += value * weightsArray((i * dataWithBiasSize) + index)
      }
      // Intercept is required to be added into margin.
      if (withBias) {
        margin += weightsArray((i * dataWithBiasSize) + dataMatrix.size)
      }
      if (margin > maxMargin) {
        maxMargin = margin
        bestClass = i + 1
      }
      classProbabilities(i+1) = 1.0 / (1.0 + Math.exp(-margin))
    }
    return (bestClass.toDouble, classProbabilities)
  }
}
Note that it is only slightly different from the original method: it additionally calculates the logistic function of each class margin. It also defines some vals and vars that are originally private and declared outside of this method. Finally, it stores the scores in an Array and returns it along with the best class. I call my method like so:
// Compute raw scores on the test set.
val predictionAndLabelsAndProbabilities = test
  .map { case LabeledPoint(label, features) =>
    val (prediction, probabilities) = ClassificationUtility
      .predictPoint(features, model)
    (prediction, label, probabilities)
  }
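A caveat on the scores returned by the method above: a per-class sigmoid of the margin does not give values that sum to 1, and the entry for class 0 stays at 0.0. MLlib's multinomial logistic regression treats class 0 as the pivot (its margin is implicitly 0), so normalized probabilities can be obtained with a softmax over the margins with 0.0 prepended. A minimal plain-Python sketch, assuming the margins for classes 1..K-1 have already been computed as in predictPoint (the function name and the numbers are made up):

```python
import math

def multinomial_probabilities(margins):
    """Turn margins of classes 1..K-1 into probabilities for classes 0..K-1.

    Class 0 is the pivot class and has an implicit margin of 0.0.
    """
    all_margins = [0.0] + list(margins)
    # Subtract the maximum before exponentiating, for numerical stability.
    top = max(all_margins)
    exps = [math.exp(m - top) for m in all_margins]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up margins for classes 1 and 2 of a 3-class model.
probs = multinomial_probabilities([1.0, -2.0])
print(probs)
```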
However:
It seems the Spark contributors are discouraging the use of MLlib in favor of ML. The ML logistic regression API currently does not support multi-class classification. I am now using OneVsRest which acts as a wrapper for one vs all classification. You can obtain the raw scores by iterating through the models:
val lr = new LogisticRegression().setFitIntercept(true)
val ovr = new OneVsRest()
ovr.setClassifier(lr)
val ovrModel = ovr.fit(training)
ovrModel.models.zipWithIndex.foreach {
case (model: LogisticRegressionModel, i: Int) =>
model.save(s"model-${model.uid}-$i")
}
val model0 = LogisticRegressionModel.load("model-logreg_457c82141c06-0")
val model1 = LogisticRegressionModel.load("model-logreg_457c82141c06-1")
val model2 = LogisticRegressionModel.load("model-logreg_457c82141c06-2")
Now that you have the individual models, you can obtain the probabilities by calculating the sigmoid of the rawPrediction:
def sigmoid(x: Double): Double = {
1.0 / (1.0 + Math.exp(-x))
}
val newPredictionAndLabels0 = model0.transform(newRescaledData)
  .select("prediction", "rawPrediction")
  .map(row => (row.getDouble(0),
    sigmoid(row.getAs[org.apache.spark.mllib.linalg.DenseVector](1).values(1)) ))
newPredictionAndLabels0.foreach(println)
val newPredictionAndLabels1 = model1.transform(newRescaledData)
  .select("prediction", "rawPrediction")
  .map(row => (row.getDouble(0),
    sigmoid(row.getAs[org.apache.spark.mllib.linalg.DenseVector](1).values(1)) ))
newPredictionAndLabels1.foreach(println)
val newPredictionAndLabels2 = model2.transform(newRescaledData)
  .select("prediction", "rawPrediction")
  .map(row => (row.getDouble(0),
    sigmoid(row.getAs[org.apache.spark.mllib.linalg.DenseVector](1).values(1)) ))
newPredictionAndLabels2.foreach(println)

model.predictProbabilities() for LogisticRegression in Spark?

I'm running a multi-class logistic regression (LogisticRegressionWithLBFGS) with Spark 1.6.
Given x and possible labels {1.0, 2.0, 3.0}, the final model only outputs the best prediction, say 2.0.
If I'm interested in knowing the second best prediction, say 3.0, how could I retrieve that information?
In NaiveBayes I would use the model.predictProbabilities() function, which for each sample outputs a vector with the probabilities of each possible outcome.
There are two ways to do logistic regression in Spark: spark.ml and spark.mllib.
With DataFrames you can use spark.ml:
import org.apache.spark
import sqlContext.implicits._
def p(label: Double, a: Double, b: Double) =
new spark.mllib.regression.LabeledPoint(
label, new spark.mllib.linalg.DenseVector(Array(a, b)))
val data = sc.parallelize(Seq(p(1.0, 0.0, 0.5), p(0.0, 0.5, 1.0)))
val df = data.toDF
val model = new spark.ml.classification.LogisticRegression().fit(df)
model.transform(df).show
You get the raw predictions and probabilities:
+-----+---------+--------------------+--------------------+----------+
|label| features| rawPrediction| probability|prediction|
+-----+---------+--------------------+--------------------+----------+
| 1.0|[0.0,0.5]|[-19.037302860930...|[5.39764620520461...| 1.0|
| 0.0|[0.5,1.0]|[18.9861466274786...|[0.99999999431904...| 0.0|
+-----+---------+--------------------+--------------------+----------+
With RDDs you can use spark.mllib:
val model = new spark.mllib.classification.LogisticRegressionWithLBFGS().run(data)
This model does not expose the raw predictions and probabilities. You can take a look at predictPoint. It multiplies the vectors and picks the class with the highest prediction. The weights are publicly accessible, so you could copy that algorithm and save the predictions instead of just returning the highest one.
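To make that last suggestion concrete, here is a minimal plain-Python sketch of the idea: compute one margin per class from the flattened weights, then sort the classes by margin to get the best, second best, and so on. It assumes the layout used by predictPoint (K-1 consecutive rows of weights, each optionally ending with an intercept, with class 0 as the pivot at margin 0.0). The function names and numbers are made up for illustration:

```python
def class_margins(weights, features, num_classes):
    """Margins for classes 0..K-1 from a flattened (K-1) x stride weight array.

    Class 0 is the pivot and gets margin 0.0. If stride == len(features) + 1,
    the last entry of each row is treated as that class's intercept.
    """
    stride = len(weights) // (num_classes - 1)
    with_bias = stride == len(features) + 1
    margins = [0.0]
    for i in range(num_classes - 1):
        row = weights[i * stride:(i + 1) * stride]
        # zip stops at the shorter sequence, so the intercept is not multiplied here.
        m = sum(w * x for w, x in zip(row, features))
        if with_bias:
            m += row[-1]
        margins.append(m)
    return margins

def ranked_classes(margins):
    """Class indices ordered from highest to lowest margin."""
    return sorted(range(len(margins)), key=lambda k: margins[k], reverse=True)

# Made-up 3-class model with two features and an intercept per row.
weights = [1.0, 0.0, 0.5,    # class 1: two feature weights, then intercept
           0.0, 2.0, -1.0]   # class 2: two feature weights, then intercept
features = [1.0, 1.0]

margins = class_margins(weights, features, num_classes=3)
ranking = ranked_classes(margins)
print(margins, ranking)
```

The first entry of the ranking is what predict would return, and the second entry is the second best class.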
Following the suggestions from @Daniel Darabos:
I tried to use the LogisticRegression class from ml instead of mllib. Unfortunately, it only supports binary logistic regression, not multi-class.
I took a look at predictPoint and modified it so that it returns the probabilities for each class. Here is what it looks like:
// Assumed imports (not shown in the original post):
import scala.collection.mutable.HashMap
import org.apache.spark.mllib.linalg.Vector

def predictPointForMulticlass(featurizedVector: Vector, weightsArray: Vector, intercept: Double, numClasses: Int, numFeatures: Int): HashMap[Int, Double] = {
  val weightsArraySize = weightsArray.size
  val dataWithBiasSize = weightsArraySize / (numClasses - 1)
  val withBias = false
  var bestClass = 0
  var maxMargin = 0.0
  var margins = new Array[Double](numClasses - 1)
  var temp_marginMap = new HashMap[Int, Double]()
  var res = new HashMap[Int, Double]()
  (0 until numClasses - 1).foreach { i =>
    var margin = 0.0
    var index = 0
    featurizedVector.toArray.foreach(value => {
      if (value != 0.0) {
        margin += value * weightsArray((i * dataWithBiasSize) + index)
      }
      index += 1
    })
    // Intercept is required to be added into margin.
    if (withBias) {
      margin += weightsArray((i * dataWithBiasSize) + featurizedVector.size)
    }
    val prob = 1.0 / (1.0 + Math.exp(-margin))
    margins(i) = margin
    temp_marginMap += (i -> margin)
    if (margin > maxMargin) {
      maxMargin = margin
      bestClass = i + 1
    }
  }
  // Convert each margin into a score relative to the largest margin.
  for ((k, v) <- temp_marginMap) {
    val calc = probCalc(maxMargin, v)
    res += (k -> calc)
  }
  return res
}
where probCalc() is simply defined as:
def probCalc(maxMargin: Double, margin: Double): Double = {
  val res = 1.0 / (1.0 + Math.exp(-(margin - maxMargin)))
  res
}
I'm returning a HashMap[Int, Double], but that can be changed as you wish.
Hopefully this helps!

Spark: How to get probabilities and AUC for Bernoulli Naive Bayes?

I'm running a Bernoulli Naive Bayes using code:
val splits = MyData.randomSplit(Array(0.75, 0.25), seed = 2L)
val training = splits(0).cache()
val test = splits(1)
val model = NaiveBayes.train(training, lambda = 3.0, modelType = "bernoulli")
My question is how I can get the probability of membership in class 0 (or 1) and compute the AUC. I want to get a result similar to LogisticRegressionWithSGD or SVMWithSGD, where I was using this code:
val numIterations = 100
val model = SVMWithSGD.train(training, numIterations)
model.clearThreshold()
// Compute raw scores on the test set.
val labelAndPreds = test.map { point =>
val prediction = model.predict(point.features)
(prediction, point.label)
}
// Get evaluation metrics.
val metrics = new BinaryClassificationMetrics(labelAndPreds)
val auROC = metrics.areaUnderROC()
Unfortunately this code isn't working for NaiveBayes.
Concerning the probabilities for Bernoulli Naive Bayes, here is an example:
// Building dummy data
val data = sc.parallelize(List("0,1 0 0", "1,0 1 0", "1,0 0 1", "0,1 0 1","1,1 1 0"))
// Transforming dummy data into LabeledPoint
val parsedData = data.map { line =>
val parts = line.split(',')
LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}
// Prepare data for training
val splits = parsedData.randomSplit(Array(0.75, 0.25), seed = 2L)
val training = splits(0).cache()
val test = splits(1)
val model = NaiveBayes.train(training, lambda = 3.0, modelType = "bernoulli")
// labels
val labels = model.labels
// Probabilities for all feature vectors
val features = parsedData.map(lp => lp.features)
model.predictProbabilities(features).take(10) foreach println
// For one specific vector, I'm taking the first vector in the parsedData
val testVector = parsedData.first.features
println(s"For vector ${testVector} => probability : ${model.predictProbabilities(testVector)}")
As for the AUC :
// Compute raw scores on the test set.
val labelAndPreds = test.map { point =>
val prediction = model.predict(point.features)
(prediction, point.label)
}
// Get evaluation metrics.
val metrics = new BinaryClassificationMetrics(labelAndPreds)
val auROC = metrics.areaUnderROC()
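Note that model.predict returns a class label for NaiveBayes, so the pairs above contain hard 0/1 predictions. AUC is a ranking metric and is usually more informative when computed from a continuous score, for example the probability of class 1 taken from predictProbabilities. The plain-Python sketch below illustrates the difference on made-up (score, label) pairs; area_under_roc is a hypothetical helper using the pairwise definition of AUC, not the Spark implementation:

```python
def area_under_roc(scores_and_labels):
    """AUC as the chance that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    positives = [s for s, y in scores_and_labels if y == 1.0]
    negatives = [s for s, y in scores_and_labels if y == 0.0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Made-up pairs of (probability of class 1, true label).
by_probability = [(0.9, 1.0), (0.8, 0.0), (0.7, 1.0), (0.6, 0.0)]
# The same examples after thresholding the probability at 0.5.
by_hard_label = [(1.0 if s > 0.5 else 0.0, y) for s, y in by_probability]

auc_probability = area_under_roc(by_probability)
auc_hard_label = area_under_roc(by_hard_label)
print(auc_probability, auc_hard_label)
```

In this toy case the thresholded predictions are all 1.0, so their AUC collapses to 0.5 while the probabilities still carry ranking information.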
Concerning the inquiry from the chat :
val results = parsedData.map { lp =>
val probs: Vector = model.predictProbabilities(lp.features)
(for (i <- 0 to (probs.size - 1)) yield ((lp.label, labels(i), probs(i))))
}.flatMap(identity)
results.take(10).foreach(println)
// (0.0,0.0,0.59728640251696)
// (0.0,1.0,0.40271359748304003)
// (1.0,0.0,0.2546873180388961)
// (1.0,1.0,0.745312681961104)
// (1.0,0.0,0.47086939671877026)
// (1.0,1.0,0.5291306032812298)
// (0.0,0.0,0.6496075621805428)
// (0.0,1.0,0.3503924378194571)
// (1.0,0.0,0.4158585282373076)
// (1.0,1.0,0.5841414717626924)
and if you are only interested in the argmax classes :
val results = training.map { lp =>
  val probs: Vector = model.predictProbabilities(lp.features)
  val bestClass = probs.argmax
  (labels(bestClass), probs(bestClass))
}
results.take(10) foreach println
// (0.0,0.59728640251696)
// (1.0,0.745312681961104)
// (1.0,0.5291306032812298)
// (0.0,0.6496075621805428)
// (1.0,0.5841414717626924)
Note: Works with Spark 1.5+
EDIT: (for Pyspark users)
It seems that some people are having trouble getting probabilities using PySpark and MLlib. That's expected: spark-mllib doesn't expose that function in PySpark.
Thus you'll need to use the spark-ml DataFrame-based API:
from pyspark.sql import Row
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import NaiveBayes
df = spark.createDataFrame([
Row(label=0.0, features=Vectors.dense([0.0, 0.0])),
Row(label=0.0, features=Vectors.dense([0.0, 1.0])),
Row(label=1.0, features=Vectors.dense([1.0, 0.0]))])
nb = NaiveBayes(smoothing=1.0, modelType="bernoulli")
model = nb.fit(df)
model.transform(df).show(truncate=False)
# +---------+-----+-----------------------------------------+----------------------------------------+----------+
# |features |label|rawPrediction |probability |prediction|
# +---------+-----+-----------------------------------------+----------------------------------------+----------+
# |[0.0,0.0]|0.0 |[-1.4916548767777167,-2.420368128650429] |[0.7168141592920354,0.28318584070796465]|0.0 |
# |[0.0,1.0]|0.0 |[-1.4916548767777167,-3.1135153092103742]|[0.8350515463917526,0.16494845360824742]|0.0 |
# |[1.0,0.0]|1.0 |[-2.5902671654458262,-1.7272209480904837]|[0.29670329670329676,0.7032967032967034]|1.0 |
# +---------+-----+-----------------------------------------+----------------------------------------+----------+
You'll just need to select your prediction column and compute your AUC.
For more information about Naive Bayes in spark-ml, please refer to the official documentation.

Registering kmeans model as UDF

Hi, I am trying to use a Spark KMeans model to predict the cluster number. But when I register it as a UDF and use it in SQL, it gives me a
java.lang.reflect.InvocationTargetException
def findCluster(s:String):Int={
model.predict(feautarize(s))
}
I am using the following:
%sql select findCluster((text)) from tweets
The same works if I use it directly:
findCluster("hello am vishnu")
output 1
It is impossible to reproduce the problem with the code you've provided. Assuming that model is org.apache.spark.mllib.clustering.KMeansModel, here is a step-by-step solution.
First, let's import the required libraries and set the RNG seed:
import scala.util.Random
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors
Random.setSeed(0L)
Generate random train set:
// Generate random training set
val trainData = sc.parallelize((1 to 1000).map { _ =>
  val off = if(Random.nextFloat > 0.5) 0.5 else -0.5
  Vectors.dense(Random.nextFloat + off, Random.nextFloat + off)
})
Run KMeans
// Train KMeans with 2 clusters
val numClusters = 2
val numIterations = 20
val clusters = KMeans.train(trainData, numClusters, numIterations)
Create UDF
// Create broadcast variable with model and prediction function
val model = sc.broadcast(clusters)
def findCluster(v: org.apache.spark.mllib.linalg.Vector): Int = {
  model.value.predict(v)
}
// Register UDF
sqlContext.udf.register("findCluster", findCluster _)
Prepare test set
// Create test set
case class Coord(v: org.apache.spark.mllib.linalg.Vector)
val testData = sqlContext.createDataFrame(sc.parallelize((1 to 100).map { _ =>
val off = if(Random.nextFloat > 0.5) 0.5 else -0.5
Coord(Vectors.dense(Random.nextFloat + off, Random.nextFloat + off))
}))
// Register test set df
testData.registerTempTable("testData")
// Check if it works
sqlContext.sql("SELECT findCluster(v) FROM testData").take(1)
Result:
res3: Array[org.apache.spark.sql.Row] = Array([1])
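For reference, what the UDF computes for each row is simple: KMeansModel.predict returns the index of the cluster center closest to the input vector. A minimal plain-Python sketch of that lookup follows; the centers are made-up values roughly matching the two blobs generated above, and find_cluster is a hypothetical helper, not Spark API:

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equally long vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def find_cluster(centers, point):
    """Index of the center closest to the given point."""
    return min(range(len(centers)), key=lambda i: squared_distance(centers[i], point))

# Made-up cluster centers, one per blob.
centers = [[0.0, 0.0], [1.0, 1.0]]

first = find_cluster(centers, [0.9, 1.2])
second = find_cluster(centers, [0.1, -0.2])
print(first, second)
```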
