How to get the probabilities of classes in Spark Naive Bayes classifier? - apache-spark

I'm training a NaiveBayesModel in Spark; however, when I use it to predict a new instance I need the probabilities for each class. I looked at the code of the predict function in NaiveBayesModel and came up with the following code:
val thetaMatrix = new DenseMatrix(model.labels.length, model.theta(0).length, model.theta.flatten, true)
val piVector = new DenseVector(model.pi)
//val prob = thetaMatrix.multiply(test.features)
val x = test.map { p =>
  val prob = thetaMatrix.multiply(p.features)
  BLAS.axpy(1.0, piVector, prob)
  prob
}
Does this work properly? The line BLAS.axpy(1.0, piVector, prob) keeps giving me an error that the value 'axpy' is not found.

In a recent pull request this was added to the Spark trunk and will be released in Spark 1.5 (closing SPARK-4362). You can therefore call
def predictProbabilities(testData: RDD[Vector]): RDD[Vector]
or
def predictProbabilities(testData: Vector): Vector
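For example, on Spark 1.5+ this could look roughly as follows (a minimal sketch, assuming model is the trained NaiveBayesModel and test is the RDD of LabeledPoint from the question):

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// one probability vector per instance, with one entry per class label
val probs: RDD[Vector] = model.predictProbabilities(test.map(_.features))

// or for a single instance
val singleProb: Vector = model.predictProbabilities(test.first().features)

This replaces the manual thetaMatrix/piVector computation from the question and needs no call to BLAS.axpy.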

Related

PySpark LinearRegressionWithSGD, model predict dimensions mismatch

I've come across the following error:
AssertionError: dimension mismatch
I've trained a linear regression model using PySpark's LinearRegressionWithSGD.
However, when I try to make a prediction on the training set, I get a "dimension mismatch" error.
Worth mentioning:
Data was scaled using StandardScaler, but the predicted value was not.
As can be seen in the code, the features used for training were generated by PCA.
Some code:
pca_transformed = pca_model.transform(data_std)
X = pca_transformed.map(lambda x: (x[0], x[1]))
data = train_votes.zip(pca_transformed)
labeled_data = data.map(lambda x : LabeledPoint(x[0], x[1:]))
linear_regression_model = LinearRegressionWithSGD.train(labeled_data, iterations=10)
The prediction is the source of the error, and these are the variations I tried:
pred = linear_regression_model.predict(pca_transformed.collect())
pred = linear_regression_model.predict([pca_transformed.collect()])
pred = linear_regression_model.predict(X.collect())
pred = linear_regression_model.predict([X.collect()])
The regression weights:
DenseVector([1.8509, 81435.7615])
The vectors used:
pca_transformed.take(1)
[DenseVector([-0.1745, -1.8936])]
X.take(1)
[(-0.17449817243564397, -1.8935926689554488)]
labeled_data.take(1)
[LabeledPoint(22221.0, [-0.174498172436,-1.89359266896])]
This worked:
pred = linear_regression_model.predict(pca_transformed)
pca_transformed is of type RDD.
The function handles RDDs and arrays differently:
def predict(self, x):
    """
    Predict the value of the dependent variable given a vector or
    an RDD of vectors containing values for the independent variables.
    """
    if isinstance(x, RDD):
        return x.map(self.predict)
    x = _convert_to_vector(x)
    return self.weights.dot(x) + self.intercept
When a plain list is passed instead of an RDD, you can hit a dimension mismatch (the error in the question above).
As can be seen, if x is not an RDD it is converted to a vector. The catch is that the dot product will not work unless you take x[0], because pca_transformed.take(1) returns a one-element list containing a DenseVector rather than the vector itself.
Here is the error reproduced:
j = _convert_to_vector(pca_transformed.take(1))
linear_regression_model.weights.dot(j) + linear_regression_model.intercept
This works just fine:
j = _convert_to_vector(pca_transformed.take(1))
linear_regression_model.weights.dot(j[0]) + linear_regression_model.intercept

Spark ML Naive Bayes predict multiple classes with probabilities

Is there a way to let the model return a list of prediction labels with a probability score for each label?
For example
given feature (f1,f2,f3),
it returns something like this:
label1:0.50,label2:0.33...
Is it doable in Spark?
Yes, it is possible.
The rawPrediction column contains a Vector with one score per label (the companion probability column holds the normalized class probabilities).
In your example this would look like Array(0.5, 0.33, 0.17); you will have to write a UDF that transforms this Vector into a String.
It is important to note that if you used a StringIndexer to encode your label column, the resulting indices will differ from your original labels (the most frequent label gets index 0).
I had some code that does something similar, which can be adapted to your use case.
It just writes the top X predictions for each feature vector to a CSV file.
The df parameter of writeToCsv must be a DataFrame that has already been transformed by your Naive Bayes model.
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.mllib.linalg.Vector // org.apache.spark.ml.linalg.Vector on Spark 2.x+
import org.apache.spark.sql.{DataFrame, SaveMode}
import org.apache.spark.sql.functions.{col, concat_ws, lit, udf}

// Rank the labels by their score (descending) and keep the top X label strings.
def topXPredictions(v: Vector, labels: Broadcast[Array[String]], topX: Int): Array[String] = {
  val labelVal = labels.value
  v.toArray
    .zip(labelVal)
    .sortBy { case (score, label) => score }
    .reverse
    .map { case (score, label) => label }
    .take(topX)
}

// Write the top-10 predictions per row to a single CSV file, ordered by id.
def writeToCsv(df: DataFrame, labelsBroadcast: Broadcast[Array[String]], name: String = "output"): Unit = {
  val get_top_predictions = udf((v: Vector, x: Int) => topXPredictions(v, labelsBroadcast, x))
  df
    .select(
      col("id"),
      concat_ws(" ", get_top_predictions(col("rawPrediction"), lit(10))).alias("top10Predictions")
    )
    .orderBy("id")
    .coalesce(1)
    .write
    .mode(SaveMode.Overwrite)
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .save(name)
}
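For example, it could be wired together roughly like this (a hedged sketch; labelIndexerModel, nbModel and testDf are hypothetical names for your fitted StringIndexerModel, your fitted Naive Bayes model, and the DataFrame to score):

// hypothetical usage sketch; sc is the SparkContext
val labelStrings: Array[String] = labelIndexerModel.labels   // original label strings, index 0 = most frequent
val labelsBroadcast = sc.broadcast(labelStrings)
val scored = nbModel.transform(testDf)                       // adds rawPrediction / probability / prediction columns
writeToCsv(scored, labelsBroadcast, name = "top10Predictions")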

Cross validation with pipeline in Spark

Cross validation outside the pipeline:
val naiveBayes = ... // the NaiveBayes estimator, defined elsewhere
val indexer = ...    // the indexer stage, defined elsewhere
val pipeLine = new Pipeline().setStages(Array(indexer, naiveBayes))
val paramGrid = new ParamGridBuilder()
  .addGrid(naiveBayes.smoothing, Array(1.0, 0.1, 0.3, 0.5))
  .build()
val crossValidator = new CrossValidator()
  .setEstimator(pipeLine)
  .setEvaluator(new MulticlassClassificationEvaluator)
  .setNumFolds(2)
  .setEstimatorParamMaps(paramGrid)
val crossValidatorModel = crossValidator.fit(trainData)
val predictions = crossValidatorModel.transform(testData)
Cross validation inside the pipeline:
val naiveBayes = ... // the NaiveBayes estimator, defined elsewhere
val indexer = ...    // the indexer stage, defined elsewhere
// param grid over multiple parameter values
val paramGrid = new ParamGridBuilder()
  .addGrid(naiveBayes.smoothing, Array(0.35, 0.1, 0.2, 0.3, 0.5))
  .build()
// cross validator for naive bayes
val crossValidator = new CrossValidator()
  .setEstimator(naiveBayes)
  .setEvaluator(new MulticlassClassificationEvaluator)
  .setNumFolds(2)
  .setEstimatorParamMaps(paramGrid)
// pipeline to execute compound transformation
val pipeLine = new Pipeline().setStages(Array(indexer, crossValidator))
// pipeline model
val pipeLineModel = pipeLine.fit(trainData)
// transform data
val predictions = pipeLineModel.transform(testData)
So I want to know which way is better, and the pros & cons of each.
With both approaches I get the same result and accuracy; the second approach is even a little faster than the first.
As per a training I attended, this should be the best practice:
cv = CrossValidator(estimator=lr,..)
pipelineModel = Pipeline(stages=[idx,assembler,cv])
cv_model= pipelineModel.fit(train)
This way the pipeline stages outside the CrossValidator are fit only once, rather than once per run over the param grid, which makes it run faster.
Hope this helps!

Apache Spark in Yarn not using all cores [duplicate]

I am running Spark on my local machine (16 GB, 8 CPU cores). I was trying to train a linear regression model on a dataset of about 300 MB. Checking the CPU statistics and the running processes, it executes just one thread.
The documentation says they have implemented a distributed version of SGD.
http://spark.apache.org/docs/latest/mllib-linear-methods.html#implementation-developer
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD, LinearRegressionModel
from pyspark import SparkContext

def parsePoint(line):
    values = [float(x) for x in line.replace(',', ' ').split(' ')]
    return LabeledPoint(values[0], values[1:])

sc = SparkContext("local", "Linear Reg Simple")
data = sc.textFile("/home/guptap/Dropbox/spark_opt/test.txt")
data.cache()
parsedData = data.map(parsePoint)

model = LinearRegressionWithSGD.train(parsedData)

valuesAndPreds = parsedData.map(lambda p: (p.label, model.predict(p.features)))
MSE = valuesAndPreds.map(lambda (v, p): (v - p)**2).reduce(lambda x, y: x + y) / valuesAndPreds.count()
print("Mean Squared Error = " + str(MSE))

model.save(sc, "myModelPath")
sameModel = LinearRegressionModel.load(sc, "myModelPath")
I think what you want to do is explicitly state the number of cores to use with the local context. As you can see from the comments here, "local" (which is what you're doing) instantiates a context on one thread whereas "local[4]" will run with 4 cores. I believe you can also use "local[*]" to run on all cores on your system.
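In the question's PySpark script that means changing the master string, e.g. SparkContext("local[*]", "Linear Reg Simple"). The same setting in Scala, as a minimal sketch:

import org.apache.spark.{SparkConf, SparkContext}

// "local" = 1 thread, "local[4]" = 4 cores, "local[*]" = all available cores
val conf = new SparkConf()
  .setAppName("Linear Reg Simple")
  .setMaster("local[*]")
val sc = new SparkContext(conf)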

Spark MlLib linear regression (Linear least squares) giving random results

I'm new to Spark and machine learning in general.
I have successfully followed some of the MLlib tutorials, but I can't get this one working.
I found the sample code here:
https://spark.apache.org/docs/latest/mllib-linear-methods.html#linear-least-squares-lasso-and-ridge-regression
(section LinearRegressionWithSGD)
Here is the code:
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.LinearRegressionModel
import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors

// Load and parse the data
val data = sc.textFile("data/mllib/ridge-data/lpsa.data")
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}.cache()

// Building the model
val numIterations = 100
val model = LinearRegressionWithSGD.train(parsedData, numIterations)

// Evaluate model on training examples and compute training error
val valuesAndPreds = parsedData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val MSE = valuesAndPreds.map { case (v, p) => math.pow((v - p), 2) }.mean()
println("training Mean Squared Error = " + MSE)

// Save and load model
model.save(sc, "myModelPath")
val sameModel = LinearRegressionModel.load(sc, "myModelPath")
(that's exactly what is on the website)
The result is
training Mean Squared Error = 6.2087803138063045
and
valuesAndPreds.collect
gives
Array[(Double, Double)] = Array((-0.4307829,-1.8383286021929077),
(-0.1625189,-1.4955700806407322), (-0.1625189,-1.118820892849544),
(-0.1625189,-1.6134108278724875), (0.3715636,-0.45171266551058276),
(0.7654678,-1.861316066986158), (0.8544153,-0.3588282725617985),
(1.2669476,-0.5036812148225209), (1.2669476,-1.1534698170911792),
(1.2669476,-0.3561392231695041), (1.3480731,-0.7347031705813306),
(1.446919,-0.08564658011814863), (1.4701758,-0.656725375080344),
(1.4929041,-0.14020483324910105), (1.5581446,-1.9438858658143454),
(1.5993876,-0.02181165554398845), (1.6389967,-0.3778677315868635),
(1.6956156,-1.1710092824030043), (1.7137979,0.27583044213064634),
(1.8000583,0.7812664902440078), (1.8484548,0.94605507153074),
(1.8946169,-0.7217282082851512), (1.9242487,-0.24422843221437684),...
My problem here is that the predictions look totally random (and wrong), and since this is an exact copy of the website example with the same input data (training set), I don't know where to look. Am I missing something?
Please give me some advice or a clue about where to search; I can read and experiment.
Thanks
As explained by zero323 here, setting the intercept to true will solve the problem. If it is not set to true, your regression line is forced to go through the origin, which is not appropriate in this case. (Not sure why this is not included in the sample code.)
So, to fix your problem, change the following line in your code (PySpark):
model = LinearRegressionWithSGD.train(parsedData, numIterations)
to
model = LinearRegressionWithSGD.train(parsedData, numIterations, intercept=True)
Although not mentioned explicitly, this is also why the code from 'selvinsource' in the other answer is working. Changing the step size doesn't help much in this example.
Linear regression here is SGD-based and requires tweaking the step size; see http://spark.apache.org/docs/latest/mllib-optimization.html for more details.
In your example, if you set the step size to 0.1 you get better results (MSE = 0.5).
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.LinearRegressionModel
import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors

// Load and parse the data
val data = sc.textFile("data/mllib/ridge-data/lpsa.data")
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}.cache()

// Build the model
var regression = new LinearRegressionWithSGD().setIntercept(true)
regression.optimizer.setStepSize(0.1)
val model = regression.run(parsedData)

// Evaluate model on training examples and compute training error
val valuesAndPreds = parsedData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val MSE = valuesAndPreds.map { case (v, p) => math.pow((v - p), 2) }.mean()
println("training Mean Squared Error = " + MSE)
For another example on a more realistic dataset, see
https://github.com/selvinsource/spark-pmml-exporter-validator/blob/master/src/main/resources/datasets/winequalityred_linearregression.md
https://github.com/selvinsource/spark-pmml-exporter-validator/blob/master/src/main/resources/spark_shell_exporter/linearregression_winequalityred.scala
