Get the probability of a sample in sklearn.linear_model.LogisticRegression instead of class label - python-3.x

I am using sklearn.linear_model.LogisticRegression for a text classification project. With the features I have extracted, the samples mostly receive a low probability score. Therefore, when I use the predict() those samples always classified to class 0. But what I want to do is get the actual probabilities for samples and choose the top 25%-30% based on the probability score. How do I get the probability score for a sample? In linear regression, the predict() provides the actual output. But it is not the case for logistic regression. I am not restricted to sklearn. A different package also works.
To make it more clear, what I want from the predict function is to return actual probability value (output of the sigmoid function) instead of the class label like linear regression predict function.

Related

OneVsRest predict_proba() and predict() don't match?

I'm using OneVsRestClassifier on a multiclass problem with svm.SVC as the base estimator.
The argmax from the predict_proba() does not match the predicted class:
Is there some normalization going on in the background? How do I get predict_proba() and predict()
to match?
According to the scikit learn's SVC documentation on multi-class classification, there can be discrepancies between the output of predict and the argmax of predict_proba (emphasis mine):
The decision_function method of SVC and NuSVC gives per-class scores for each sample (or a single score per sample in the binary case). When the constructor option probability is set to True, class membership probability estimates (from the methods predict_proba and predict_log_proba) are enabled. In the binary case, the probabilities are calibrated using Platt scaling: logistic regression on the SVM’s scores, fit by an additional cross-validation on the training data. In the multiclass case, this is extended as per Wu et al. (2004).
Needless to say, the cross-validation involved in Platt scaling is an expensive operation for large datasets. In addition, the probability estimates may be inconsistent with the scores, in the sense that the “argmax” of the scores may not be the argmax of the probabilities. (E.g., in binary classification, a sample may be labeled by predict as belonging to a class that has probability <½ according to predict_proba.) Platt’s method is also known to have theoretical issues. If confidence scores are required, but these do not have to be probabilities, then it is advisable to set probability=False and use decision_function instead of predict_proba.
You cannot get them to match using a SVC. You can try another model if you need the probabilities. If you do not need probabilities, as stated in the documentation, you can use decision_function (see here for more details.)

How do I test my classifier accuracy against random values?

I've set up my first scikit-learn example to play with and I'm trying to gauge accuracy on my predictions. I've got training and test lists set up fine, but I'm getting ~0.95 accuracy even if I give it random values.
This looks to be because I'm checking for 0/1 labels, and 95% of the labels are zero's, so it's guessing on 0's and getting 0.95 accuracy (I think?). Obviously this isn't what I want.
How do I go about deciding if my classifiers are working, and how do I get meaningful accuracy values?
You have a clear class imbalance issue. Your classifier is predicting 0 all the time knowing it will be right 95% of the time. You can inspect this by calling predict(X_test) on your fitted classifier. If all the values are 0 you know this is the case.
To get a better idea on how the model performs you can upsample the data labelled with 1 or down sample the data labelled with 0. You can use this package which builds off scikit-learn and implements a number of resampling methods. Alternatively, you can use scikit learns resampling method. Which will bootstrap new data points for you.

sklearn NB classifier: How to get the actual probabilities of individual samples?

I am making a machine learning program which classifies words in one of the following categories: Hardware, Software, None_of_these. I make use of the Multinomial Naive Bayes classifier from sklearn.
The function predict() gives me the prediction of every word, however, I can't see the actual probability (float ranging for 0 to 1.0) that the word matches with the predicted categorie. I didn't find this on sklearn's site either.
Is there a function which gives me the probability of every sample?
Nevermind, I found the solution.:
predict_proba(X) Returns probability estimates for the test vector X.

sci-kit SVM multi-class classification with unseen label

is there any possibility to configure an svm classifier from sci-kit such that:
1.) the svm classifier is trained with examples from 0,...,n - 1
2.) If none of the single classifiers (one-vs-rest) delivers a positive result (class membership), then the output is a designated label n which means "none of them"
Thanks!
By construction, the OvR multiclass wrapper sklearn.multiclass.OneVsRestClassifier selects the maximum decision_function output or the maximum predict_proba to be decisive of predicted class. This means that there will always be a predicted class.
If you wanted e.g. to predict "None of these" when decision_function / predict_proba all stay under a certain threshold (for all OvR problems), then you would have to write this estimator yourself, but could get inspiration from the code of sklearn.multiclass.OneVsRestClassifier and just modify the decision logic.

Regarding Probability Estimates predicted by LIBSVM

I am attempting 3 class classification by using SVM classifier. How do we interpret the probabililty estimates predicted by LIBSVM. Is it based on perpendicular distance of the instance from the maximal margin hyperplane?.
Kindly through some light on the interpretation of probability estimates predicted by LIBSVM classifier. Parameters C and gamma are first tuned and then probability estimates are outputted by using -b option with both training and testing.
Multiclass SVM is always decomposed into several binary classifiers (typically a set of one vs all classifiers). Any binary SVM classifier's decision function outputs a (signed) distance to the separating hyperplane. In short, an SVM maps the input domain to a one-dimensional real number (the decision value). The predicted label is determined by the sign of the decision value. The most common technique to obtain probabilistic output from SVM models is through so-called Platt scaling (paper of LIBSVM authors).
Is it based on perpendicular distance of the instance from the maximal margin hyperplane?
Yes. Any classifier that outputs such a one-dimensional real value can be post-processed to yield probabilities, by calibrating a logistic function on the decision values of the classifier. This is the exact same approach as in standard logistic regression.
SVM performs binary classification. In order to achieve multiclass classification libsvm performs what it's called one vs all. What you get when you invoke -bis the probability related to this technique that you can found explained here .

Resources