Handling imbalanced classes in Spark - apache-spark

I am trying to experiment with a credit card fraud detection dataset using Spark MLlib.
The dataset has many more 0s (non-fraud) than 1s (fraud).
I wanted to know whether Spark has any built-in algorithm, such as SMOTE, to handle a class imbalance problem like this.
I am using logistic regression as the model.

You can try weightCol in logistic regression. Something like this:
from pyspark.sql.functions import countDistinct
from pyspark.ml.classification import LogisticRegression

# count the examples per label and join the counts back onto the training data
temp = train.groupBy("LabelCol").count()
new_train = train.join(temp, "LabelCol", how="leftouter")
num_labels = train.select(countDistinct("LabelCol")).first()[0]
# weight each row inversely proportional to its class frequency
train1 = new_train.withColumn("weight", new_train.count() / (num_labels * new_train["count"]))
# logistic regression initialization
lr = LogisticRegression(weightCol="weight", family="multinomial")
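You would then fit the estimator on train1, making sure the label column matches what LogisticRegression expects (it defaults to "label", so either rename "LabelCol" or pass the labelCol parameter); the values in the "weight" column are applied as per-instance weights during training.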

Related

How to load logistic regression model?

I want to train a logistic regression model using Apache Spark in Java. As a first step, I would like to train the model just once and save the model parameters (intercept and coefficients), and subsequently use the saved model parameters to score at a later point in time. I am able to save the model to a Parquet file using the following code:
LogisticRegressionModel trainedLRModel = logReg.fit(data);
trainedLRModel.write().overwrite().save("mypath");
When I load the model to score, I get the following error.
LogisticRegression lr = new LogisticRegression();
lr.load("//saved_model_path");
Exception in thread "main" java.lang.NoSuchMethodException: org.apache.spark.ml.classification.LogisticRegressionModel.<init>(java.lang.String)
at java.lang.Class.getConstructor0(Class.java:3082)
at java.lang.Class.getConstructor(Class.java:1825)
at org.apache.spark.ml.util.DefaultParamsReader.load(ReadWrite.scala:325)
at org.apache.spark.ml.util.MLReadable$class.load(ReadWrite.scala:215)
at org.apache.spark.ml.classification.LogisticRegression$.load(LogisticRegression.scala:672)
at org.apache.spark.ml.classification.LogisticRegression.load(LogisticRegression.scala)
Is there a way to train and save model and then evaluate(score) later? I am using Spark ML 2.1.0 in Java.
I faced the same problem with PySpark 2.1.1. When I changed from LogisticRegression to LogisticRegressionModel, everything worked well.
LogisticRegression.load("/model/path")       # does not work
LogisticRegressionModel.load("/model/path")  # works well
TL;DR Use LogisticRegressionModel.load.
load(path: String): LogisticRegressionModel Reads an ML instance from the input path, a shortcut of read.load(path).
As a matter of fact, as of Spark 2.0.0 the recommended way to use Spark MLlib, including the LogisticRegression estimator, is through the Pipeline API.
import org.apache.spark.ml.classification._
val lr = new LogisticRegression()
import org.apache.spark.ml.feature._
val tok = new Tokenizer().setInputCol("body")
val hashTF = new HashingTF().setInputCol(tok.getOutputCol).setOutputCol("features")
import org.apache.spark.ml._
val pipeline = new Pipeline().setStages(Array(tok, hashTF, lr))
// training dataset
val emails = Seq(("hello world", 1)).toDF("body", "label")
val model = pipeline.fit(emails)
model.write.overwrite.save("mypath")
val loadedModel = PipelineModel.load("mypath")
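To complete the round trip, the loaded pipeline model scores data exactly like the model returned by fit, for example:
val scored = loadedModel.transform(emails)
scored.select("body", "prediction").show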

ML Pipeline and metrics: Precision, Recall, AUC-ROC, F1Score

I'm using ML Pipeline, something like:
VectorAssembler assembler = new VectorAssembler()
.setInputCols(columns)
.setOutputCol("features");
LogisticRegression lr = new LogisticRegression().setLabelCol(targetColumn);
lr.setMaxIter(10).setRegParam(0.01).setFeaturesCol("features");
Pipeline logisticRegression = new Pipeline();
logisticRegression.setStages(new PipelineStage[] {assembler, lr});
PipelineModel logisticRegressionModel = logisticRegression.fit(learningData);
What I want is a way to get standard metrics like precision, recall, AUC-ROC, F1-score, and accuracy for this model.
I've found BinaryClassificationMetrics, but I'm not sure whether it's compatible at all.
RegressionEvaluator seems to return only mse|rmse|r2|mae.
So what is the right way to extract precision, recall, etc. with an ML Pipeline?
A couple of things are missing from Ryan's answer above.
I can confirm the following works (note: my use case was multiclass classification):
import org.apache.spark.mllib.evaluation.MulticlassMetrics

val scoredTestSet = model.transform(testSet)
val predictionLabelsRDD = scoredTestSet.select("prediction", "label").rdd
  .map(r => (r.getDouble(0), r.getDouble(1)))
val multiModelMetrics = new MulticlassMetrics(predictionLabelsRDD)
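From the MulticlassMetrics instance you can then read the aggregate numbers the question asks for (these accessors are available on Spark 2.x's MulticlassMetrics):
println(s"accuracy           = ${multiModelMetrics.accuracy}")
println(s"weighted precision = ${multiModelMetrics.weightedPrecision}")
println(s"weighted recall    = ${multiModelMetrics.weightedRecall}")
println(s"weighted F1        = ${multiModelMetrics.weightedFMeasure}")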
Once you have scored data, get the prediction and label and pass them to BinaryClassificationMetrics.
Something like below (though it's in Scala, I hope it helps):
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

val scoredTestSet = logisticRegressionModel.transform(testSet)
val predictionLabelsRDD = scoredTestSet.select("prediction", "label").rdd
  .map(r => (r.getDouble(0), r.getDouble(1)))
val binMetrics = new BinaryClassificationMetrics(predictionLabelsRDD)
// binMetrics.areaUnderROC
Other examples can be found at https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html#binary-classification
The prediction in this case is 1.0 or 0.0.
You can also extract the probability and use that as the score instead of the hard prediction, so that binMetrics can report metrics at multiple thresholds.
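A minimal sketch of that, assuming the classifier's default "probability" vector column and a binary 0/1 label (column names are the defaults; adjust to yours):
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

// use the probability of the positive class as the score instead of the hard 0/1 prediction
val scoreAndLabels = scoredTestSet.select("probability", "label").rdd
  .map(r => (r.getAs[Vector](0)(1), r.getDouble(1)))
val binMetricsByThreshold = new BinaryClassificationMetrics(scoreAndLabels, 100)
binMetricsByThreshold.precisionByThreshold.collect.foreach { case (t, p) =>
  println(s"threshold=$t precision=$p")
}
println(s"area under ROC = ${binMetricsByThreshold.areaUnderROC}")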

How to increase the accuracy of neural network model in spark?

import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.sql.Row
// Load training data
val data = MLUtils.loadLibSVMFile(sc,"/home/.../neural.txt").toDF()
val splits = data.randomSplit(Array(0.6, 0.4), seed = 1234L)
val train = splits(0)
val test = splits(1)
val layers = Array[Int](4, 5, 4, 4)
val trainer = new MultilayerPerceptronClassifier().setLayers(layers).setBlockSize(128).setSeed(1234L).setMaxIter(100)
val model = trainer.fit(train)
// compute precision on the test set
val result = model.transform(test)
val predictionAndLabels = result.select("prediction", "label")
val evaluator = new MulticlassClassificationEvaluator().setMetricName("precision")
println("Precision:" + evaluator.evaluate(predictionAndLabels))
I am using MultilayerPerceptronClassifier to build a neural network in Spark. I am getting 62.5% accuracy. Which parameters should I change to get better accuracy?
As some people have said, the question is too broad and can't be answered without more detail, but some advice (independent of the models/algorithms used or the tools and libraries implementing them) would be:
Use a cross-validation set and perform cross-validation with different network architectures (see the sketch after this answer).
Plot "learning curves".
Identify whether you are suffering from high bias or high variance.
See if you can or need to apply feature scaling and/or normalization.
Do some "error analysis" (manually inspect which examples failed and categorize them to see whether you can find a pattern).
Not necessarily in that order, but that could help you identify whether you have underfitting or overfitting, whether you need more training data, whether to add or remove features, add regularization, etc. In summary, perform machine learning debugging.
Hope that helps. You can find more in-depth detail about this in Andrew Ng's series of videos, starting with this one:
https://www.youtube.com/watch?v=qIfLZAa32H0
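For the cross-validation point above, a minimal sketch with Spark ML's CrossValidator might look like this, using the train split from the question (4 input features, 4 classes); the candidate architectures, fold count, and metric are only illustrative:
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val mlp = new MultilayerPerceptronClassifier().setBlockSize(128).setSeed(1234L).setMaxIter(100)
// try a few architectures: 4 inputs and 4 output classes, hidden layers vary
val paramGrid = new ParamGridBuilder()
  .addGrid(mlp.layers, Array(Array(4, 5, 4, 4), Array(4, 8, 8, 4), Array(4, 10, 4)))
  .build()
val cv = new CrossValidator()
  .setEstimator(mlp)
  .setEvaluator(new MulticlassClassificationEvaluator().setMetricName("accuracy"))
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(5)
val cvModel = cv.fit(train)
println("best cross-validated accuracy: " + cvModel.avgMetrics.max)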

How to calculate log-loss for the trained model?

I am building a ML pipeline for logistic regression.
val lr = new LogisticRegression()
lr.setMaxIter(100).setRegParam(0.001)
val pipeline = new Pipeline().setStages(Array(geoDimEncoder, clientTypeEncoder,
  devTypeDimIdEncoder, pubClientIdEncoder, tmpltIdEncoder,
  hourEncoder, assembler, lr))
val model = pipeline.fit(trainingDF)
Now that the model is trained, I want to see the probabilities for the training set and compute certain validation metrics like log-loss, but I am unable to find a way to do this using "model".
The only thing I could find everywhere is
model.transform(testDF).select(....)
How to get the metrics using the trained set for training set validation?
Please check the following; it should work for you:
import org.apache.spark.ml.classification.LogisticRegression

val lr = new LogisticRegression()
.setMaxIter(10)
.setRegParam(0.3)
.setElasticNetParam(0.8)
val lrModel = lr.fit(data)
val trainingSummary = lrModel.summary
// Obtain the objective per iteration.
val objectiveHistory = trainingSummary.objectiveHistory
println("objectiveHistory:")
objectiveHistory.foreach(loss => println(loss))
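If you specifically want the log-loss on the training set, one option is to compute it by hand from the "probability" column the fitted model adds. A rough sketch, assuming model is the fitted PipelineModel and trainingDF the training DataFrame from the question, with a double-typed "label" column:
import org.apache.spark.ml.linalg.Vector

// score the training set and pull out the probability assigned to the true class
val scored = model.transform(trainingDF).select("label", "probability")
val logLoss = scored.rdd.map { r =>
  val label = r.getDouble(0).toInt
  val p = r.getAs[Vector](1)(label)   // probability of the true class
  -math.log(math.max(p, 1e-15))       // clip to avoid log(0)
}.mean()
println(s"training log-loss: $logLoss")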

Apache Spark Naive Bayes based Text Classification

I'm trying to use Apache Spark for document classification.
For example, I have two classes (C and J).
The training data is:
C, Chinese Beijing Chinese
C, Chinese Chinese Shanghai
C, Chinese Macao
J, Tokyo Japan Chinese
And the test data is:
Chinese Chinese Chinese Tokyo Japan // Is it J or C?
How can I train on and predict for data like the above? I did Naive Bayes text classification with Apache Mahout, but not with Apache Spark.
How can I do this with Apache Spark?
Yes, it doesn't look like there is any simple tool for that in Spark yet. But you can do it manually: first create a dictionary of terms, then compute the IDF for each term, and then convert each document into a vector using the TF-IDF scores.
There is a post at http://chimpler.wordpress.com/2014/06/11/classifiying-documents-using-naive-bayes-on-apache-spark-mllib/ that explains how to do it (with some code as well).
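A rough sketch of that manual route using MLlib's HashingTF and IDF (feature hashing stands in here for an explicit term dictionary; tokenizedDocs is a hypothetical RDD[Seq[String]] of tokenized documents):
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// tokenizedDocs: RDD[Seq[String]] of tokenized documents (hypothetical name)
val hashingTF = new HashingTF()
val tf: RDD[Vector] = hashingTF.transform(tokenizedDocs)
tf.cache()
val idf = new IDF().fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)
// tfidf can then be zipped with the labels to build LabeledPoints for NaiveBayes.train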
Spark can do this in a very simple way. The key steps are: 1) use HashingTF to get the term frequencies; 2) convert the data to the form the Naive Bayes model needs.
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.sql.{Row, SQLContext}

def testBayesClassifier(hiveCnt: SQLContext) {
  val trainData = hiveCnt.createDataFrame(Seq((0, "aa bb aa cc"), (1, "aa dd ee"))).toDF("category", "text")
  val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
  val wordsData = tokenizer.transform(trainData)
  val hashTF = new HashingTF().setInputCol("words").setOutputCol("features").setNumFeatures(20)
  val featureData = hashTF.transform(wordsData) // key step 1: term frequencies via HashingTF
  val trainDataRdd = featureData.select("category", "features").map {
    case Row(label: Int, features: Vector) => // key step 2: convert to LabeledPoint
      LabeledPoint(label.toDouble, Vectors.dense(features.toArray))
  }
  // train the model
  val model = NaiveBayes.train(trainDataRdd, lambda = 1.0, modelType = "multinomial")
  // same for the test data
  val testData = hiveCnt.createDataFrame(Seq((-1, "aa bb"), (-1, "cc ee ff"))).toDF("category", "text")
  val testWordData = tokenizer.transform(testData)
  val testFeatureData = hashTF.transform(testWordData)
  val testDataRdd = testFeatureData.select("category", "features").map {
    case Row(label: Int, features: Vector) =>
      LabeledPoint(label.toDouble, Vectors.dense(features.toArray))
  }
  val testPredictionAndLabel = testDataRdd.map(p => (model.predict(p.features), p.label))
  // inspect the predictions, e.g. testPredictionAndLabel.collect().foreach(println)
}
You can use MLlib's Naive Bayes classifier for this. A sample example is given in the link:
http://spark.apache.org/docs/latest/mllib-naive-bayes.html
There are many classification methods (logistic regression, SVMs, neural networks, LDA, QDA, ...). You can either implement your own or use MLlib's classification methods (logistic regression and SVMs, for example, are implemented in MLlib).
What you need to do is transform your features into a vector and your labels into doubles.
For example, your dataset would look like:
1, (2,1,0,0,0,0)
1, (2,0,1,0,0,0)
0, (1,0,0,1,0,0)
0, (1,0,0,0,1,1)
And your test vector:
(3,0,0,0,1,1)
Hope this helps
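Putting the answer above together, a minimal MLlib sketch (label 1.0 for class C and 0.0 for J, vocabulary order assumed to be Chinese, Beijing, Shanghai, Macao, Tokyo, Japan):
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// vocabulary order assumed: Chinese, Beijing, Shanghai, Macao, Tokyo, Japan
val training = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(2, 1, 0, 0, 0, 0)),  // C: Chinese Beijing Chinese
  LabeledPoint(1.0, Vectors.dense(2, 0, 1, 0, 0, 0)),  // C: Chinese Chinese Shanghai
  LabeledPoint(1.0, Vectors.dense(1, 0, 0, 1, 0, 0)),  // C: Chinese Macao
  LabeledPoint(0.0, Vectors.dense(1, 0, 0, 0, 1, 1))   // J: Tokyo Japan Chinese
))
val model = NaiveBayes.train(training, lambda = 1.0, modelType = "multinomial")
// test document: Chinese Chinese Chinese Tokyo Japan
val prediction = model.predict(Vectors.dense(3, 0, 0, 0, 1, 1))
println(if (prediction == 1.0) "C" else "J")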
