I have a multi-class classification problem (38 classes) and implemented a pipeline in Spark ML to solve it. This is how I generate my model:
val nb = new NaiveBayes()
.setLabelCol("id")
.setFeaturesCol("features")
.setThresholds(Array.fill(38)(1.25)) // one threshold per class, all set to 1.25
val pipeline = new Pipeline()
.setStages(Array(stages, assembler, nb))
val model = pipeline.fit(trainingSet)
I want my model to predict a class only if its confidence (probability) is greater than 0.8 (80%).
I searched the Spark documentation a lot to better understand what the threshold parameter means, but the only relevant piece of information I've found is this:
Thresholds in multi-class classification to adjust the probability of
predicting each class. Array must have length equal to the number of
classes, with values > 0 excepting that at most one value may be 0.
The class with largest value p/t is predicted, where p is the original
probability of that class and t is the class's threshold.
This is why my thresholds are all 1.25 (i.e. 1 / 0.8).
The problem is that no matter what value I set for the thresholds, it seems they don't affect my predictions at all.
Do you know if it is possible to predict a class only when its confidence (probability) is greater than a specific threshold?
Another way would be to select only the predictions that have a probability greater than that threshold, but I expect this can be achieved with the framework itself.
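For reference, this is roughly what that manual selection could look like, as a sketch: it assumes the default probability column produced by the NaiveBayes stage and a hypothetical held-out testSet DataFrame.

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

// Probability of the winning class, taken from the probability vector column.
val maxProbability = udf { (v: Vector) => v.toArray.max }

// Keep only predictions where the model is at least 80% confident.
val predictions = model.transform(testSet)
val confident = predictions.filter(maxProbability(col("probability")) > 0.8)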
Thanks.
Related
I am working on a binary classification problem with an imbalanced dataset, where 75% of the data belongs to the negative class (0.0) and the rest (25%) belongs to the positive class (1.0).
I am using a PySpark DataFrame where each row has a label (0.0 or 1.0) indicating its class. Due to the class imbalance, I would like to use appropriate class weights.
From the documentation and the example listed here, there's a parameter called weightCol in the line
blor = LogisticRegression(weightCol="weight")
The description of weightCol is mentioned here.
So can I go ahead and create a new column called weight, assign a value of 0.75 whenever the label is 1.0 and 0.25 when the label is 0.0 for every row and initialise the model as mentioned above?
I just want to check whether this is the right way to assign weights to an imbalanced dataset in Spark MLlib, as the documentation doesn't make it very clear.
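As a rough sketch of that idea (shown here in the Scala API; the PySpark when/otherwise and weightCol calls are analogous, and df stands for the labelled training DataFrame):

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.sql.functions.{col, when}

// Weight each row inversely to its class frequency:
// the rare positive class (25% of rows) gets 0.75, the negative class gets 0.25.
val weighted = df.withColumn("weight",
  when(col("label") === 1.0, 0.75).otherwise(0.25))

val blor = new LogisticRegression().setWeightCol("weight")
val model = blor.fit(weighted)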
I'm using Spark 2.0 and the new spark.ml package.
Is there a way to adjust the classification threshold so that I reduce the number of false positives?
If it matters, I'm also using the CrossValidator.
I see that RandomForestClassifier and DecisionTreeClassifier both output a probability column (which I could use manually), but GBTClassifier does not.
It sounds like you might be looking for the thresholds parameter:
final val thresholds: DoubleArrayParam
Param for Thresholds in multi-class classification to adjust the probability
of predicting each class. Array must have length equal to the number of
classes, with values >= 0. The class with largest value p/t is predicted,
where p is the original probability of that class and t is the class'
threshold.
You will need to set it by calling setThresholds(value: Array[Double]) on your classifier.
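For instance, with a binary RandomForestClassifier this might look roughly like the following; the threshold values are illustrative, and a larger threshold for class 1.0 makes that class harder to predict, which is one way to cut false positives.

import org.apache.spark.ml.classification.RandomForestClassifier

val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  // One entry per class; prediction is argmax of p/t, so raising the
  // second entry means class 1.0 needs a higher probability to win.
  .setThresholds(Array(1.0, 2.0))

Since thresholds is an ordinary Param, it should also be possible to tune it through ParamGridBuilder when you use the CrossValidator.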
I am using the ALS model from spark.ml to create a recommender system
using implicit feedback for a certain collection of items. I have noticed
that the output predictions of the model are much lower than 1 and they usually range in the interval of [0,0.1]. Thus, using MAE or MSE does not make any
sense in this case.
Therefore I use the areaUnderROC (AUC) to measure performance. I do that using Spark's BinaryClassificationEvaluator, and I get something close to 0.8. But I cannot clearly understand how that is possible, since most of the values range in [0, 0.1].
To my understanding, beyond a certain threshold the evaluator will consider all of the predictions to belong to class 0, which would essentially mean that the AUC equals the percentage of negative samples?
In general, how would you treat such low values if you need to compare your model's performance to, say, Logistic Regression?
I train the model as follows:
rank = 25
alpha = 1.0
numIterations = 10
als = ALS(rank=rank, maxIter=numIterations, alpha=alpha, userCol="id", itemCol="itemid", ratingCol="response", implicitPrefs=True, nonnegative=True)
als.setRegParam(0.01)
model = als.fit(train)
What #shuaiyuancn explained about BinaryClassificationEvaluator isn't completely correct. Obviously, using that kind of evaluator isn't correct if you don't have binary ratings and a proper threshold.
That said, you can consider a recommender system as a binary classification problem when your system deals with binary ratings (click-or-not, like-or-not).
In this case, the recommender defines a logistic model, where we assume that the rating r_uv in {-1, 1} that user u gives item v is generated by a logistic response model:
P(r_uv | score_uv) = 1 / (1 + exp(-r_uv * score_uv))
where score_uv is the score given by u to v.
For more information about Logistic Models, you can refer to Hastie et al. (2009) - section 4.4
That said, a recommender system can also be considered a multi-class classification problem. It always depends on your data and the problem at hand, but it can also follow some kind of regression model.
Sometimes I choose to evaluate my recommender system using RegressionMetrics, even though textbooks recommend RankingMetrics-like evaluations to compute metrics such as average precision at K, MAP, etc. It always depends on the task and data at hand; there is no general recipe for that.
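As a toy illustration of the RankingMetrics route (Scala; it assumes an available SparkContext sc, and in practice the recommended and ground-truth item lists would come from the fitted ALS model and the held-out data):

import org.apache.spark.mllib.evaluation.RankingMetrics

// For each user: the items recommended (in rank order) and the items actually interacted with.
val perUser = sc.parallelize(Seq(
  (Array(1, 2, 3, 4, 5), Array(1, 3, 6)),
  (Array(4, 1, 2, 3, 5), Array(4, 5))
))

val metrics = new RankingMetrics(perUser)
println(metrics.precisionAt(5))        // precision@5
println(metrics.meanAveragePrecision)  // MAP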
Nevertheless, I strongly advise you to read the Evaluation Metrics official documentation. It will help you better understand what you are measuring with respect to what you are trying to achieve.
References
Statistical Methods for Recommender Systems - Deepak K. Agarwal, Bee-Chung Chen.
The Elements of Statistical Learning - Hastie et al.
Spark official documentation - Evaluation Metrics.
EDIT: I ran into this answer today. It's an example implementation of a binary ALS in Python. I strongly advise you to take a look at it.
Using BinaryClassificationEvaluator on a recommender is wrong. Usually a recommender selects one or a few items from a collection as its prediction, but BinaryClassificationEvaluator only deals with two labels, hence Binary.
The reason you still get a result from BinaryClassificationEvaluator is that there is a prediction column in your result DataFrame, which is then used to compute the ROC. The number doesn't mean anything in your case; don't take it as a measure of your model's performance.
I have noticed that the output predictions of the model are much lower than 1 and they usually range in the interval of [0,0.1]. Thus, using MAE or MSE does not make any sense in this case.
Why doesn't MSE make sense? You're evaluating your model by looking at the difference (error) between the predicted rating and the true rating. [0, 0.1] simply means your model predicts the ratings to be in that range.
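For example, evaluating those predictions directly could look roughly like this (Scala API; PySpark's RegressionEvaluator takes the same parameters, the column names follow the question's ALS setup, and test is the held-out split):

import org.apache.spark.ml.evaluation.RegressionEvaluator

// Compare the ALS "prediction" column against the observed "response" column.
val evaluator = new RegressionEvaluator()
  .setMetricName("rmse")
  .setLabelCol("response")
  .setPredictionCol("prediction")

val rmse = evaluator.evaluate(model.transform(test))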
I am using Spark ML to optimise a Naive Bayes multi-class classifier.
I have about 300 categories and I am classifying text documents.
The training set is fairly balanced: there are about 300 training examples for each category.
Everything looks good and the classifier works with acceptable precision on unseen documents. But what I am noticing is that when classifying a new document, very often the classifier assigns a high probability to one of the categories (the prediction probability is almost equal to 1), while the other categories receive probabilities close to zero.
What are the possible reasons for this?
I would like to add that in Spark ML there is something called "raw prediction", and when I look at it I can see negative numbers of more or less comparable magnitude, so even the category with the high probability has a comparable raw prediction score, but I am finding it difficult to interpret these scores.
Let's start with a very informal description of the Naive Bayes classifier. If C is the set of all classes, d is a document and x_i are the features, Naive Bayes returns:
argmax_{c in C} P(c|d) = argmax_{c in C} P(d|c) * P(c) / P(d)
Since P(d) is the same for all classes, we can simplify this to
argmax_{c in C} P(d|c) * P(c)
where
P(d|c) = P(x_1, ..., x_n | c)
Since we assume that the features are conditionally independent (that is why it is naive), we can further simplify this (with Laplace correction to avoid zeros) to:
argmax_{c in C} P(c) * prod_{i=1..n} P(x_i|c)
The problem with this expression is that in any non-trivial case it is numerically equal to zero. To avoid that, we use the following property:
log(a * b) = log(a) + log(b)
and replace the initial expression with:
argmax_{c in C} [ log P(c) + sum_{i=1..n} log P(x_i|c) ]
These are the values you get as the raw prediction. Since each term is negative (the logarithm of a value in (0, 1]), the whole expression is negative as well. As you discovered yourself, these values are further normalized so that the maximum value (after exponentiation) is equal to 1, and then divided by the sum of the normalized values.
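As a small numeric sketch of that normalization (the numbers are made up; this mirrors the description above, not Spark's actual source):

// rawPrediction values: log P(c) + sum_i log P(x_i | c), one per class.
val raw = Array(-112.3, -134.1, -118.7)

// Shift by the maximum before exponentiating, so the largest value becomes 1
// and the rest don't underflow to zero.
val shifted = raw.map(v => math.exp(v - raw.max))

// Divide by the sum: the result sums to 1 and preserves the order and ratios.
val probabilities = shifted.map(_ / shifted.sum)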
It is important to note that while the values you get are not strictly P(c|d), they preserve all the important properties: the order and the ratios are exactly (ignoring possible numerical issues) the same. If no other class gets a prediction close to one, it means that, given the evidence, it is a very strong prediction. So it is actually something you want to see.
I am trying to use Spark MLlib Logistic Regression (LR) and/or Random Forest (RF) classifiers to create a model that discriminates between two classes represented by sets whose cardinalities differ quite a lot.
One set has 150 000 000 negative instances and the other just 50 000 positive instances.
After training both the LR and RF classifiers with default parameters, I get very similar results for both; for example, on the following test set:
Test instances: 26842
Test positives = 433.0
Test negatives = 26409.0
Classifier detects:
truePositives = 0.0
trueNegatives = 26409.0
falsePositives = 0.0
falseNegatives = 433.0
Precision = 0.9838685641904478
Recall = 0.9838685641904478
It looks like the classifier cannot detect any positive instances at all.
Also, no matter how the data is split into train and test sets, the classifier produces a number of false negatives exactly equal to the number of positives the test set really has.
The LR classifier's default threshold is 0.5. Setting the threshold to 0.8 does not make any difference:
val model = new LogisticRegressionWithLBFGS().run(training)
model.setThreshold(0.8)
Questions:
1) Please advise how to manipulate the classifier threshold to make the classifier more sensitive to the class with a tiny fraction of positive instances versus the class with a huge number of negative instances.
2) Are there any other MLlib classifiers that can solve this problem?
3) What does the intercept parameter do in the Logistic Regression algorithm?
val model = new LogisticRegressionWithSGD().setIntercept(true).run(training)
Well, I think what you have here is a very unbalanced data set problem:
150 000 000 in Class1
50 000 in Class2, i.e. 3000 times smaller.
So if you train a classifier that predicts Class1 for everything, you are going to get about 0.99967 accuracy. So the best trivial classifier is always "everything is Class1", and this is what your model is learning here.
There are different ways to address these cases. In general you can downsample the larger class or up-sample the smaller one (a rough downsampling sketch follows the links below). With random forests there are other options as well, for example sampling in a balanced (stratified) way, or adding class weights:
http://statistics.berkeley.edu/sites/default/files/tech-reports/666.pdf
Other methods also exist, like SMOTE, etc. (also based on sampling); for more details you can read here:
https://www3.nd.edu/~dial/papers/SPRINGER05.pdf
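Here is the promised rough sketch of the downsampling option, using the training RDD from the question (the sampling fraction is illustrative; it keeps roughly as many negatives as there are positives):

// training: RDD[LabeledPoint] with labels 0.0 (majority) and 1.0 (minority).
val positives = training.filter(_.label == 1.0)
val negatives = training.filter(_.label == 0.0)

// Keep roughly one negative for every positive (50 000 out of 150 000 000).
val sampledNegatives = negatives.sample(withReplacement = false, 50000.0 / 150000000.0)

val balanced = positives.union(sampledNegatives)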
The threshold you can change for your logistic regression is the probability threshold; you can try playing with "probabilityCol" in the parameters of the logistic regression example here:
http://spark.apache.org/docs/latest/ml-guide.html
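For reference, this is roughly how the threshold itself can be set on the MLlib model from the question; note that lowering it below the default 0.5 (rather than raising it to 0.8) is what makes the classifier more willing to predict the rare positive class. The value here is illustrative.

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

val model = new LogisticRegressionWithLBFGS().run(training)

// Default is 0.5; a lower threshold trades precision for recall on the positive class.
model.setThreshold(0.1)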
But a problem with MLlib right now is that not all classifiers return a probability; I asked them about this and it is on their roadmap.