ML Pipeline and metrics: Precision, Recall, AUC-ROC, F1Score - apache-spark

I'm using ML Pipeline, something like:
VectorAssembler assembler = new VectorAssembler()
.setInputCols(columns)
.setOutputCol("features");
LogisticRegression lr = new LogisticRegression().setLabelCol(targetColumn);
lr.setMaxIter(10).setRegParam(0.01).setFeaturesCol("features");
Pipeline logisticRegression = new Pipeline();
logisticRegression.setStages(new PipelineStage[] {assembler, lr});
PipelineModel logisticRegressionModel = logisticRegression.fit(learningData);
What I want is a way to get standard metrics such as precision, recall, AUC-ROC, F1 score, and accuracy for this model.
I've found BinaryClassificationMetrics, but I'm not sure whether it's compatible with the ML Pipeline API.
RegressionEvaluator seems to return only mse|rmse|r2|mae.
So what is the right way to extract Precision, Recall, etc with ML Pipeline?

A couple of things are missing from Ryan's answer.
I can confirm the following works (note: my use case was multiclass classification):
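import org.apache.spark.mllib.evaluation.MulticlassMetrics
val scoredTestSet = model.transform(testSet)
// MulticlassMetrics expects an RDD of (prediction, label) pairs
val predictionAndLabelsRDD = scoredTestSet.select("prediction", "label").rdd.map(r => (r.getDouble(0), r.getDouble(1)))
val multiModelMetrics = new MulticlassMetrics(predictionAndLabelsRDD)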
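To make it clearer what such a metrics object reports, here is a minimal plain-Python sketch that computes accuracy and per-label precision, recall and F1 from (prediction, label) pairs. It is not Spark code: the function name per_label_metrics and the sample pairs are made up for illustration, standing in for rows collected from the scored test set.

```python
from collections import Counter

def per_label_metrics(pairs):
    """Return accuracy and per-label precision/recall/F1 from (prediction, label) pairs."""
    pairs = list(pairs)
    correct = Counter()    # true positives per label
    predicted = Counter()  # how often each label was predicted
    actual = Counter()     # how often each label really occurs
    for prediction, label in pairs:
        predicted[prediction] += 1
        actual[label] += 1
        if prediction == label:
            correct[label] += 1
    report = {}
    for label in sorted(actual):
        tp = correct[label]
        precision = tp / predicted[label] if predicted[label] else 0.0
        recall = tp / actual[label]
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    accuracy = sum(correct.values()) / len(pairs)
    return accuracy, report

# Hypothetical (prediction, label) pairs for a three-class problem
pairs = [(0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (2.0, 2.0), (0.0, 2.0), (1.0, 1.0)]
accuracy, report = per_label_metrics(pairs)
print(accuracy)
print(report)
```

For anything beyond a quick sanity check, the distributed metrics classes are the better choice, since this sketch assumes all pairs fit in memory.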

Once you have scored data, get the prediction and label columns and pass them to BinaryClassificationMetrics,
something like the code below (though it's in Scala, I hope it helps):
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
val scoredTestSet = logisticRegressionModel.transform(testSet)
// BinaryClassificationMetrics expects an RDD of (score, label) pairs
val predictionAndLabels = scoredTestSet.select("prediction", "label").rdd.map(r => (r.getDouble(0), r.getDouble(1)))
val binMetrics = new BinaryClassificationMetrics(predictionAndLabels)
// binMetrics.areaUnderROC
Other examples are at https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html#binary-classification
The prediction in this case is 1.0 or 0.0.
You can also extract the probability and use it instead of the prediction, so that binMetrics can report metrics at multiple thresholds.
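As a rough illustration of why probabilities are more useful than hard 0.0/1.0 predictions, the plain-Python sketch below computes precision and recall at a chosen threshold, plus AUC-ROC. The function names and the sample scores are hypothetical. The AUC here uses the pairwise-ranking formulation (the chance that a random positive outscores a random negative), which is a swapped-in technique for illustration and not necessarily how Spark computes it internally.

```python
def precision_recall_at(threshold, scored):
    """Precision and recall when scores >= threshold are treated as positive."""
    tp = sum(1 for s, y in scored if s >= threshold and y == 1.0)
    fp = sum(1 for s, y in scored if s >= threshold and y == 0.0)
    fn = sum(1 for s, y in scored if s < threshold and y == 1.0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def area_under_roc(scored):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in scored if y == 1.0]
    neg = [s for s, y in scored if y == 0.0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical (probability of class 1, label) pairs
scored = [(0.9, 1.0), (0.8, 1.0), (0.7, 0.0), (0.4, 1.0), (0.3, 0.0), (0.1, 0.0)]
for threshold in (0.35, 0.5):
    print(threshold, precision_recall_at(threshold, scored))
print(area_under_roc(scored))
```

Lowering the threshold trades precision for recall, and that trade-off is only visible when the metrics are fed scores rather than already-thresholded predictions.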

Related

How to load logistic regression model?

I want to train a logistic regression model using Apache Spark in Java. As a first step, I would like to train the model just once and save the model parameters (intercept and coefficients), then use the saved parameters to score at a later point in time. I am able to save the model to a Parquet file using the following code:
LogisticRegressionModel trainedLRModel = logReg.fit(data);
trainedLRModel.write().overwrite().save("mypath");
When I load the model to score, I get the following error.
LogisticRegression lr = new LogisticRegression();
lr.load("//saved_model_path");
Exception in thread "main" java.lang.NoSuchMethodException: org.apache.spark.ml.classification.LogisticRegressionModel.<init>(java.lang.String)
at java.lang.Class.getConstructor0(Class.java:3082)
at java.lang.Class.getConstructor(Class.java:1825)
at org.apache.spark.ml.util.DefaultParamsReader.load(ReadWrite.scala:325)
at org.apache.spark.ml.util.MLReadable$class.load(ReadWrite.scala:215)
at org.apache.spark.ml.classification.LogisticRegression$.load(LogisticRegression.scala:672)
at org.apache.spark.ml.classification.LogisticRegression.load(LogisticRegression.scala)
Is there a way to train and save a model and then evaluate (score) with it later? I am using Spark ML 2.1.0 in Java.
I faced the same problem with PySpark 2.1.1. When I changed from LogisticRegression to LogisticRegressionModel, everything worked.
LogisticRegression.load("/model/path") # fails
LogisticRegressionModel.load("/model/path") # works
TL;DR Use LogisticRegressionModel.load.
load(path: String): LogisticRegressionModel Reads an ML instance from the input path, a shortcut of read.load(path).
As a matter of fact, as of Spark 2.0.0, the recommended approach to use Spark MLlib, incl. LogisticRegression estimator, is using the brand new and shiny Pipeline API.
import org.apache.spark.ml.classification._
val lr = new LogisticRegression()
import org.apache.spark.ml.feature._
val tok = new Tokenizer().setInputCol("body")
val hashTF = new HashingTF().setInputCol(tok.getOutputCol).setOutputCol("features")
import org.apache.spark.ml._
val pipeline = new Pipeline().setStages(Array(tok, hashTF, lr))
// training dataset
val emails = Seq(("hello world", 1)).toDF("body", "label")
val model = pipeline.fit(emails)
model.write.overwrite.save("mypath")
val loadedModel = PipelineModel.load("mypath")

Handling imbalanced class in Spark

I am experimenting with a credit card fraud detection dataset in Spark MLlib.
The dataset has many more 0s (non-fraud) than 1s (fraud).
Is there an algorithm available in Spark, such as SMOTE, for handling a class imbalance problem like this?
I am using logistic regression as the model.
You can try weightCol in logistic regression, something like this:
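from pyspark.sql.functions import countDistinct
from pyspark.ml.classification import LogisticRegression
# Count rows per label and attach that count to every training row
temp = train.groupby("LabelCol").count()
new_train = train.join(temp, "LabelCol", how = 'leftouter')
num_labels = train.select(countDistinct(train["LabelCol"])).first()[0]
# weight = total rows / (number of labels * rows with this label)
train1 = new_train.withColumn("weight",(new_train.count()/(num_labels * new_train["count"])))
# Logistic Regression Initiation
lr = LogisticRegression(weightCol = "weight", family = 'multinomial')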
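The weight used above is total rows / (number of labels * rows with that label). A small plain-Python sketch of the same arithmetic may make the effect easier to see. The function name balanced_class_weights and the 95/5 split are made up for illustration.

```python
from collections import Counter

def balanced_class_weights(labels):
    """weight(label) = total rows / (number of distinct labels * rows with that label)."""
    counts = Counter(labels)
    total = len(labels)
    num_labels = len(counts)
    return {label: total / (num_labels * count) for label, count in counts.items()}

# Hypothetical imbalanced dataset: 95 non-fraud rows (0) and 5 fraud rows (1)
labels = [0] * 95 + [1] * 5
weights = balanced_class_weights(labels)
print(weights)
```

With these weights each class contributes the same total weight (50 each in this example), so the rare class is no longer drowned out by the common one.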

How to increase the accuracy of neural network model in spark?

import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.sql.Row
// Load training data
val data = MLUtils.loadLibSVMFile(sc,"/home/.../neural.txt").toDF()
val splits = data.randomSplit(Array(0.6, 0.4), seed = 1234L)
val train = splits(0)
val test = splits(1)
val layers = Array[Int](4, 5, 4, 4)
val trainer = new MultilayerPerceptronClassifier().setLayers(layers).setBlockSize(128).setSeed(1234L).setMaxIter(100)
val model = trainer.fit(train)
// compute precision on the test set
val result = model.transform(test)
val predictionAndLabels = result.select("prediction", "label")
val evaluator = new MulticlassClassificationEvaluator().setMetricName("precision")
println("Precision:" + evaluator.evaluate(predictionAndLabels))
I am using MultilayerPerceptronClassifier to build a neural network in Spark, and I am getting 62.5% accuracy. Which parameters should I change to improve the accuracy?
As some people have said, the question is too broad and can't be answered without more detail, but some advice (independent of the models/algorithms used or the tools and libraries implementing them) would be:
Use a cross-validation set and perform cross-validation with different network architectures.
Plot learning curves.
Identify whether you have high bias or high variance.
See if you can or need to apply feature scaling and/or normalization.
Do some error analysis (manually inspect the examples that failed and categorize them to see if you can find a pattern).
Not necessarily in that order, but these steps can help you identify whether you are underfitting or overfitting, and whether you need more training data, added or removed features, more regularization, etc. In summary, perform machine learning debugging.
Hope that helps. You can find more detail in Andrew Ng's series of videos, starting with this one:
https://www.youtube.com/watch?v=qIfLZAa32H0
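For the feature scaling point in the list above, here is a minimal plain-Python sketch of z-score standardization (subtract the column mean, divide by the column standard deviation). The function name and sample rows are hypothetical, and using the population standard deviation is an assumption made here; a library scaler may use a different estimator.

```python
from statistics import mean, pstdev

def standardize_columns(rows):
    """Scale every column to zero mean and unit (population) standard deviation."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    # Fall back to 1.0 for constant columns to avoid dividing by zero
    stds = [pstdev(c) or 1.0 for c in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# Hypothetical rows with two features on very different scales
rows = [[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]]
scaled = standardize_columns(rows)
print(scaled)
```

After scaling, both features span the same range, which tends to matter for gradient-based models such as a multilayer perceptron.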

How to get Precision/Recall using CrossValidator for training NaiveBayes Model using Spark

Suppose I have a Pipeline like this:
val tokenizer = new Tokenizer().setInputCol("tweet").setOutputCol("words")
val hashingTF = new HashingTF().setNumFeatures(1000).setInputCol("words").setOutputCol("features")
val idf = new IDF().setInputCol("features").setOutputCol("idffeatures")
val nb = new org.apache.spark.ml.classification.NaiveBayes()
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, idf, nb))
val paramGrid = new ParamGridBuilder().addGrid(hashingTF.numFeatures, Array(10, 100, 1000)).addGrid(nb.smoothing, Array(0.01, 0.1, 1)).build()
val cv = new CrossValidator().setEstimator(pipeline).setEvaluator(new BinaryClassificationEvaluator()).setEstimatorParamMaps(paramGrid).setNumFolds(10)
val cvModel = cv.fit(df)
As you can see, I defined a CrossValidator using a BinaryClassificationEvaluator. I have seen many examples that report metrics like precision/recall during testing, but those metrics are computed on a separate test dataset (see, for example, this documentation).
As I understand it, CrossValidator creates folds, holds one fold out for testing in each round, and then chooses the best model. My question is: is it possible to get precision/recall metrics during the training process?
Well, the only metric which is actually stored is the one you define when you create an instance of an Evaluator. For the BinaryClassificationEvaluator this can take one of the two values:
areaUnderROC
areaUnderPR
with the former being the default; it can be changed using the setMetricName method.
These values are collected during the training process and can be accessed using CrossValidatorModel.avgMetrics. The order of values corresponds to the order of EstimatorParamMaps (CrossValidatorModel.getEstimatorParamMaps).
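Because the two sequences are aligned by position, pairing them is enough to see which parameter combination scored best. The plain-Python sketch below shows the idea. The dictionaries and metric values are made-up stand-ins for what getEstimatorParamMaps and avgMetrics would return, and rank_param_maps is a hypothetical helper, not a Spark function.

```python
def rank_param_maps(param_maps, avg_metrics, larger_is_better=True):
    """Pair each parameter map with its averaged metric and sort best-first."""
    paired = list(zip(param_maps, avg_metrics))
    return sorted(paired, key=lambda pair: pair[1], reverse=larger_is_better)

# Hypothetical stand-ins: one metric per parameter map, in the same order
param_maps = [
    {"numFeatures": 10, "smoothing": 0.01},
    {"numFeatures": 10, "smoothing": 1.0},
    {"numFeatures": 1000, "smoothing": 0.01},
    {"numFeatures": 1000, "smoothing": 1.0},
]
avg_metrics = [0.61, 0.64, 0.78, 0.81]  # e.g. areaUnderROC averaged over folds

ranked = rank_param_maps(param_maps, avg_metrics)
for params, metric in ranked:
    print(metric, params)
```

Note that this only surfaces the single metric the evaluator was configured with; other metrics such as precision and recall would still need to be computed separately on held-out predictions.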

How to calculate log-loss for the trained model?

I am building a ML pipeline for logistic regression.
val lr = new LogisticRegression()
lr.setMaxIter(100).setRegParam(0.001)
val pipeline = new Pipeline().setStages(Array(geoDimEncoder,clientTypeEncoder,
devTypeDimIdEncoder,pubClientIdEncoder,tmpltIdEncoder,
hourEncoder,assembler,lr))
val model = pipeline.fit(trainingDF)
Now, when the model is trained, I want to see the probabilities for the training set and compute certain validation parameters like log-loss. But, I am unable to find this using "model".
The only thing I could find everywhere is
model.transform(testDF).select(....)
How can I get these metrics on the training set, for training-set validation?
Please try the following; it should work for you:
val lr = new LogisticRegression()
.setMaxIter(10)
.setRegParam(0.3)
.setElasticNetParam(0.8)
val lrModel = lr.fit(data)
val trainingSummary = lrModel.summary
// Obtain the objective per iteration.
val objectiveHistory = trainingSummary.objectiveHistory
println("objectiveHistory:")
objectiveHistory.foreach(loss => println(loss))
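As far as I know, the objective history tracks the optimizer's objective (which includes the regularization term), so it is not exactly the log-loss asked about. If you have the predicted probability of class 1 and the true label for each row, log-loss can be computed directly. Here is a minimal plain-Python sketch, assuming the probabilities have already been pulled out of the scored DataFrame; the function name, the clipping constant eps, and the sample values are illustrative.

```python
import math

def log_loss(probabilities, labels, eps=1e-15):
    """Mean negative log-likelihood for binary labels (0 or 1)."""
    total = 0.0
    for p, y in zip(probabilities, labels):
        # Clip so that log(0) cannot occur for overconfident predictions
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)

# Hypothetical P(label = 1) values and true labels
probabilities = [0.9, 0.2, 0.7, 0.4]
labels = [1, 0, 1, 0]
loss = log_loss(probabilities, labels)
print(loss)
```

A model that always predicts 0.5 scores ln 2 (about 0.693), which is a handy baseline to compare against.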
