How to Adjust Classification Threshold with a Spark Decision Tree - apache-spark

I'm using Spark 2.0 and the new spark.ml packages.
Is there a way to adjust the classification threshold so that I reduce the number of false positives?
If it matters, I'm also using the CrossValidator.
I see that RandomForestClassifier and DecisionTreeClassifier both output a probability column (which I could use manually), but GBTClassifier does not.

It sounds like you might be looking for the thresholds parameter:
final val thresholds: DoubleArrayParam
Param for Thresholds in multi-class classification to adjust the probability
of predicting each class. Array must have length equal to the number of
classes, with values >= 0. The class with largest value p/t is predicted,
where p is the original probability of that class and t is the class'
threshold.
You will need to set it by calling setThresholds(value: Array[Double]) on your classifier.
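To make the p/t rule concrete, here is a minimal sketch in plain Python (not Spark code). The function name and the probability values are made up for illustration; only the rule itself comes from the documentation quoted above.

```python
def predict_with_thresholds(probabilities, thresholds):
    """Return the index of the class with the largest p/t ratio."""
    if len(probabilities) != len(thresholds):
        raise ValueError("need exactly one threshold per class")
    # The quoted docs permit a zero threshold; this sketch requires t > 0
    # so that the division is always defined.
    if any(t <= 0 for t in thresholds):
        raise ValueError("thresholds must be positive in this sketch")
    ratios = [p / t for p, t in zip(probabilities, thresholds)]
    return max(range(len(ratios)), key=ratios.__getitem__)


# Hypothetical probability vector [P(class 0), P(class 1)] for one row.
probs = [0.4, 0.6]
print(predict_with_thresholds(probs, [0.5, 0.5]))  # 1: plain argmax
print(predict_with_thresholds(probs, [0.3, 0.7]))  # 0: class 1 now needs p > 0.7
```

For two classes, thresholds of the form [1 - t, t] mean that class 1 is predicted only when its probability exceeds t, so raising t should reduce false positives at the cost of more false negatives. On the classifier itself that would correspond to something like setThresholds(Array(0.3, 0.7)).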

Related

Random Forest "Feature Importance"

I am currently working with a Random Forest classifier. One of its parameters is "criterion", which has two options: gini or entropy. A low Gini value is preferred and a high entropy value is preferred. By default, the criterion for the Random Forest classifier is gini.
sklearn provides an attribute called feature_importances_, which gives an importance value for each feature. Using these values, we can keep some features and eliminate others with a threshold and SelectFromModel.
My question is: on what basis are these feature_importances_ calculated? Assume the default criterion, gini, is used. If the feature_importances_ are "Gini importances", then I would expect low values to be preferred, but for feature importances high values are preferred.
feature_importances_ always gives the importance of each feature: the larger the value, the more important the feature, regardless of whether the criterion is gini or entropy. The criterion is used to build the model. Feature importance is computed after the model is trained; it only lets you analyze which features were most relevant in the trained model.
Moreover, the feature_importances_ values sum to 1, so each importance can also be read as a fraction of the total.
Since a random forest is made up of several trees, the feature importances are averaged over all the trees.
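As a rough illustration of the averaging step, here is a plain-Python sketch (not sklearn's implementation). The per-tree numbers are invented; in scikit-learn each tree's importance for a feature is the normalized total reduction of the criterion brought by that feature, which is why a larger value means a more useful feature under either gini or entropy.

```python
def forest_importances(per_tree_importances):
    """Average per-tree importance vectors and rescale them to sum to 1."""
    n_trees = len(per_tree_importances)
    n_features = len(per_tree_importances[0])
    averaged = [
        sum(tree[j] for tree in per_tree_importances) / n_trees
        for j in range(n_features)
    ]
    total = sum(averaged)
    return [value / total for value in averaged]


# Hypothetical importances from three trees over three features.
trees = [
    [0.5, 0.3, 0.2],
    [0.7, 0.1, 0.2],
    [0.6, 0.2, 0.2],
]
importances = forest_importances(trees)
print([round(v, 3) for v in importances])  # [0.6, 0.2, 0.2]
```

Here feature 0 accounts for about 60% of the total importance, which is how the "percentage" reading above works.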

Find threshold for 1-D data

I have 1-D data and a binary classification problem. Points above a certain threshold belong to class 0 and points below the threshold belong to class 1. I want to find that threshold. I don't want to pass this data to a classifier from sklearn, but I do want to use functionality such as cross-validation and the ROC curve from sklearn. How can I do this?
Thanks
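One possible approach, sketched here in plain Python under the assumption that accuracy is the quantity to maximize: try the midpoints between consecutive distinct values as candidate thresholds and keep the best one. The function names and sample data are hypothetical.

```python
def accuracy_at(threshold, values, labels):
    """Accuracy of the rule: value > threshold -> class 0, otherwise class 1."""
    predictions = [0 if v > threshold else 1 for v in values]
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


def best_threshold(values, labels):
    """Pick the midpoint between consecutive distinct values with best accuracy."""
    distinct = sorted(set(values))
    if len(distinct) < 2:
        raise ValueError("need at least two distinct values")
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return max(candidates, key=lambda t: accuracy_at(t, values, labels))


# Hypothetical data: small values are class 1, large values are class 0.
values = [0.1, 0.2, 0.35, 0.6, 0.8, 0.9]
labels = [1, 1, 1, 0, 0, 0]
threshold = best_threshold(values, labels)
print(accuracy_at(threshold, values, labels))  # 1.0
```

For cross-validation, the same function could be fitted on each training fold and scored on the held-out fold. For a ROC curve, sklearn.metrics.roc_curve takes labels and scores directly, so the negated raw values could serve as scores for class 1 without training a classifier.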

Spark MLlib predict only if threshold greater than value

I have a multi-class classification problem (38 classes) and implemented a pipeline in Spark ML to solve it. This is how I generated my model:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.NaiveBayes

// One threshold per class (38 classes), all set to the same value.
val nb = new NaiveBayes()
  .setLabelCol("id")
  .setFeaturesCol("features")
  .setThresholds(Array.fill(38)(1.25))
// `stages`, `assembler` and `trainingSet` are defined elsewhere.
val pipeline = new Pipeline()
  .setStages(Array(stages, assembler, nb))
val model = pipeline.fit(trainingSet)
I want my model to predict a class only if its confidence (probability) is greater than 0.8 (80%).
I searched the Spark documentation to better understand what the thresholds parameter means, but the only relevant piece of information I found is this:
Thresholds in multi-class classification to adjust the probability of
predicting each class. Array must have length equal to the number of
classes, with values > 0 excepting that at most one value may be 0.
The class with largest value p/t is predicted, where p is the original
probability of that class and t is the class's threshold.
This is why my thresholds are 1.25.
The problem is that no matter what values I use for the thresholds, it seems they don't affect my predictions at all.
Do you know if there is a possibility to predict only classes that have the confidence (probability) greater than a specific threshold?
Another way would be to select only the predictions that have the probability greater than that threshold, but I expect this can be achieved using the framework.
Thanks.
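A short plain-Python sketch (not Spark code) of two points raised above, with made-up probabilities and hypothetical function names. First, under the p/t rule quoted in the question, giving every class the same threshold divides all probabilities by the same number, so the winning class cannot change. Second, the "predict only above 0.8" behaviour can be expressed as a filter on the largest probability.

```python
def argmax_ratio(probabilities, thresholds):
    """Index of the class with the largest p/t ratio."""
    ratios = [p / t for p, t in zip(probabilities, thresholds)]
    return max(range(len(ratios)), key=ratios.__getitem__)


def predict_or_abstain(probabilities, min_confidence=0.8):
    """Return the most probable class, or None if it is not confident enough."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return best if probabilities[best] > min_confidence else None


probs = [0.5, 0.3, 0.2]  # hypothetical probability vector for one row
# Uniform thresholds leave the prediction unchanged.
print(argmax_ratio(probs, [1.0] * 3) == argmax_ratio(probs, [1.25] * 3))  # True
print(predict_or_abstain(probs))              # None
print(predict_or_abstain([0.9, 0.06, 0.04]))  # 0
```

In Spark the second function would correspond to filtering rows on the maximum of the probability column after transform, which is the alternative already mentioned in the question.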

Spark Naive Bayes Result accuracy (Spark ML 1.6.0) [duplicate]

I am using Spark ML to optimise a Naive Bayes multi-class classifier.
I have about 300 categories and I am classifying text documents.
The training set is reasonably balanced and there are about 300 training examples for each category.
All looks good and the classifier works with acceptable precision on unseen documents. But I am noticing that when classifying a new document, the classifier very often assigns a high probability to one of the categories (the prediction probability is almost equal to 1), while the other categories receive very low probabilities (close to zero).
What are the possible reasons for this?
I would like to add that in Spark ML there is something called "raw prediction". When I look at it, I see negative numbers of more or less comparable magnitude, so even the category with the high probability has a raw prediction score comparable to the others, but I am having difficulty interpreting these scores.
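Let's start with a very informal description of the Naive Bayes classifier. If C is the set of all classes, d is a document, and x_i are its features, Naive Bayes returns:
$$\hat{c} = \arg\max_{c \in C} P(c \mid d) = \arg\max_{c \in C} \frac{P(d \mid c)\, P(c)}{P(d)}$$
Since P(d) is the same for all classes, we can simplify this to
$$\hat{c} = \arg\max_{c \in C} P(d \mid c)\, P(c)$$
where
$$P(d \mid c) = P(x_1, \ldots, x_n \mid c)$$
Since we assume that the features are conditionally independent (that is why it is naive), we can further simplify this (with Laplace correction to avoid zeros) to:
$$\hat{c} = \arg\max_{c \in C} P(c) \prod_{i=1}^{n} P(x_i \mid c)$$
The problem with this expression is that in any non-trivial case it is numerically equal to zero. To avoid this, we use the following property:
$$\log(a \cdot b) = \log a + \log b$$
and replace the initial condition with:
$$\hat{c} = \arg\max_{c \in C} \left( \log P(c) + \sum_{i=1}^{n} \log P(x_i \mid c) \right)$$
These are the values you get as the raw predictions. Since each term is the logarithm of a value in (0, 1], it is non-positive, so the whole expression is negative as well. As you discovered yourself, these values are further normalized: they are shifted and exponentiated so that the maximum value equals 1, and then divided by the sum of the resulting values.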
It is important to note that while the values you get are not strictly P(c|d), they preserve all the important properties: the order and the ratios are exactly the same (ignoring possible numerical issues). If one class gets a probability close to one and no other class comes close, it means that, given the evidence, it is a very strong prediction. So it is actually something you want to see.
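The effect described above can be reproduced with a few lines of plain Python. This is only a sketch of the normalization idea, not Spark's code, and the raw scores are invented: log-scores of similar magnitude still produce one probability close to 1, because only the differences between them matter after exponentiation.

```python
import math


def raw_score(log_prior, log_feature_probs):
    """log P(c) + sum_i log P(x_i | c) for a single class."""
    return log_prior + sum(log_feature_probs)


def normalize_log_scores(raw):
    """Turn per-class log-scores into probabilities that sum to 1."""
    largest = max(raw)
    # Shift so the largest score becomes 0, i.e. exp(...) == 1 for that class.
    shifted = [math.exp(r - largest) for r in raw]
    total = sum(shifted)
    return [s / total for s in shifted]


# Hypothetical raw predictions for three classes: comparable magnitudes.
raw = [-1050.0, -1060.0, -1075.0]
print(math.exp(raw[0]))  # 0.0: exponentiating directly underflows
print([round(p, 6) for p in normalize_log_scores(raw)])
```

A gap of 10 in log-space is already a factor of about 22000 between the unnormalized probabilities, so the first class ends up with nearly all of the probability mass.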

Spark, MLlib: Adjusting classifier discrimination threshold

I am trying to use the Spark MLlib Logistic Regression (LR) and/or Random Forest (RF) classifiers to create a model that discriminates between two classes represented by sets whose cardinalities differ quite a lot.
One set has 150 000 000 negative instances and the other just 50 000 positive instances.
After training both the LR and RF classifiers with default parameters, I get very similar results for both. For example, for the following test set:
Test instances: 26842
Test positives = 433.0
Test negatives = 26409.0
Classifier detects:
truePositives = 0.0
trueNegatives = 26409.0
falsePositives = 433.0
falseNegatives = 0.0
Precision = 0.9838685641904478
Recall = 0.9838685641904478
It looks like the classifier cannot detect any positive instances at all.
Also, no matter how the data was split into train and test sets, the classifier produces exactly the same number of false positives as the number of positives the test set really has.
The LR classifier's default threshold is 0.5. Setting the threshold to 0.8 does not make any difference.
val model = new LogisticRegressionWithLBFGS().run(training)
model.setThreshold(0.8)
Questions:
1) Please advise how to manipulate the classifier threshold to make the classifier more sensitive to a class with a tiny fraction of positive instances versus a class with a huge number of negative instances.
2) Are there any other MLlib classifiers that could solve this problem?
3) What does the intercept parameter do in the Logistic Regression algorithm?
val model = new LogisticRegressionWithSGD().setIntercept(true).run(training)
Well, I think what you have here is a very unbalanced data set problem:
150 000 000 Class1
50 000 Class2, which is 3000 times smaller.
So if you train a classifier that assumes every instance is Class1, you are going to have:
0.999667 accuracy. So the best classifier, measured by accuracy, will always be "all are Class1". This is what your model is learning here.
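The baseline figure follows directly from the class counts given in the question; a quick check in Python:

```python
negatives = 150_000_000  # Class1 count from the question
positives = 50_000       # Class2 count from the question

# Accuracy of a model that labels every instance as Class1.
baseline_accuracy = negatives / (negatives + positives)
print(round(baseline_accuracy, 6))  # 0.999667
print(negatives // positives)       # 3000
```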
There are different ways to handle these cases. In general, you can downsample the larger class or upsample the smaller class. With random forests, for example, you can also sample in a balanced (stratified) way, or add class weights:
http://statistics.berkeley.edu/sites/default/files/tech-reports/666.pdf
Other methods also exist, such as SMOTE (which also works by sampling). For more details you can read here:
https://www3.nd.edu/~dial/papers/SPRINGER05.pdf
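As a minimal sketch of the downsampling idea in plain Python (the function name, the ratio parameter and the toy rows are all hypothetical, and this is not a Spark API):

```python
import random


def downsample_majority(rows, label_of, majority_label, ratio=1.0, seed=42):
    """Keep every minority row and a random subset of the majority rows.

    ratio is the desired number of majority rows per minority row.
    """
    minority = [r for r in rows if label_of(r) != majority_label]
    majority = [r for r in rows if label_of(r) == majority_label]
    keep = min(len(majority), int(len(minority) * ratio))
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return minority + rng.sample(majority, keep)


# Toy data: 1000 negatives (label 0) and 10 positives (label 1).
rows = [(i, 0) for i in range(1000)] + [(i, 1) for i in range(10)]
balanced = downsample_majority(rows, label_of=lambda r: r[1], majority_label=0)
print(len(balanced))  # 20
```

A model trained on the balanced sample will produce shifted probabilities, so any threshold would still need to be tuned on data with the original class ratio.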
The threshold you can change for your logistic regression applies to the predicted probability. You can try playing with "probabilityCol" in the parameters of the logistic regression example here:
http://spark.apache.org/docs/latest/ml-guide.html
One problem with MLlib right now is that not all classifiers return a probability. I asked the developers about this and it is on their roadmap.
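To see how moving the probability threshold trades precision against recall, here is a small plain-Python sketch. The scores and labels are invented, and returning 0.0 when a ratio is undefined is a convention chosen for this sketch.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when score > threshold is predicted positive."""
    predicted = [s > threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y == 1)
    # Report 0.0 when a ratio is undefined (no predictions or no positives).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical positive-class probabilities from a model trained on
# imbalanced data: even the true positives score below 0.5.
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.45]
labels = [0, 0, 0, 1, 0, 1]
for threshold in (0.5, 0.4, 0.25):
    print(threshold, precision_recall(scores, labels, threshold))
```

With these numbers, the default threshold of 0.5 finds no positives at all, 0.4 finds one of the two with no false positives, and 0.25 finds both at the cost of one false positive. This also suggests why raising the threshold to 0.8, as tried in the question, could not help detect more positives.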
