Binary classifier too confident to plot ROC curve with sklearn? - python-3.x

I have created a binary classifier in TensorFlow that outputs a generator object containing predictions. I extract the predictions (e.g. [0.98, 0.02]) from the object into a list and later convert it into a numpy array. I also have the corresponding array of labels for these predictions. Using these two arrays, I believe I should be able to plot a ROC curve via:
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve
fpr, tpr, thr = roc_curve(labels, predictions[:,1])
plt.plot(fpr, tpr)
plt.show()
print(fpr)
print(tpr)
print(thr)
Here predictions[:,1] gives the positive class score. However, running this code produces only a flat line and just three values for each of fpr, tpr, and thr:
[Image: flat-line ROC plot and the limited fpr/tpr/thr outputs]
The only theory I have as to why this is happening is that my classifier is too sure of its predictions. Many, if not all, of the positive prediction scores are either 1.0 or incredibly close to zero:
[[9.9999976e-01 2.8635742e-07]
[3.3693312e-11 1.0000000e+00]
[1.0000000e+00 9.8642090e-09]
...
[1.0106111e-15 1.0000000e+00]
[1.0000000e+00 1.0030269e-09]
[8.6156778e-15 1.0000000e+00]]
According to a few sources, including this Stack Overflow thread and this Stack Overflow thread, the very polarized values of my predictions could be creating an issue for roc_curve().
Is my intuition correct? If so, is there anything I can do about it so that I can plot my ROC curve?
I've tried to include all the information I think is relevant to this issue, but if you would like any more details about my program, please ask.

The ROC curve is generated by sweeping a threshold over your predictions and computing the sensitivity and specificity at each threshold. As you raise the threshold, sensitivity generally decreases while specificity increases, and the resulting curve paints a picture of the overall quality of your predicted probabilities. In your case, since every score is either 0 or 1 (or very close to it), there are no meaningful thresholds to sweep. That's why the thr value is basically [ 1, 1, 1 ].
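As a toy illustration (made-up labels and scores, not the asker's data): when the scores are essentially binary, roc_curve collapses to just three points, whereas scores spread across (0, 1) produce many thresholds and a more informative curve.
import numpy as np
from sklearn.metrics import roc_curve

labels = np.array([0, 1, 0, 1, 0, 1])

# Saturated scores like the ones in the question: essentially 0 or 1
saturated = np.array([1e-9, 1.0, 1e-7, 1.0, 1e-8, 1.0])
fpr, tpr, thr = roc_curve(labels, saturated)
print(thr)  # only three thresholds, so the "curve" is just the corner points

# Scores spread across (0, 1) with some overlap between the classes
spread = np.array([0.3, 0.9, 0.6, 0.7, 0.1, 0.4])
fpr, tpr, thr = roc_curve(labels, spread)
print(thr)  # several thresholds, giving a proper curve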
You can try to arbitrarily pull the values closer to 0.5 or alternatively implement your own ROC curve calculation with more tolerance for small differences.
On the other hand, you might want to review your network, because such extreme output values often mean something is wrong there; maybe the labels leaked into the network somehow, and that is why it produces perfect results.

Related

PCA with sklearn discrepancies

I am trying to apply a PCA in a very specific context and ran into a behavior that I can not explain.
As a test I am running the following code with the file data that you can retrieve here: https://www.dropbox.com/s/vdnvxhmvbnssr34/test.npy?dl=0 (numpy array format).
from sklearn.decomposition import PCA
import numpy as np
test = np.load('test.npy')
pca = PCA()
X_proj = pca.fit_transform(test) ### Project in the basis of eigenvectors
proj = pca.inverse_transform(X_proj) ### Reconstruct vector
My issue is the following: because I do not specify a number of components, I should be reconstructing with all the computed components. I therefore expect my output proj to be the same as my input test. But a quick plot proves this not to be the case:
import matplotlib.pyplot as plt

plt.figure()
plt.plot(test[0] - proj[0])  # difference between the input and its reconstruction
plt.show()
The plot here will show some large discrepancies between projection and the input matrix.
Does anyone have an idea or explanation to help me understand why proj is different from test in my case?
I checked your test data and found the following:
mean = test.mean() # 1.9545972004854737e+24
std = test.std() # 9.610595443778275e+26
I interpret the standard deviation as representing, in some sense, the least count or the uncertainty in the values that are reported. By that I mean that if a numerical algorithm reports the answer to be a, then the real answer should be in the interval [a - std, a + std]. This is because numerical algorithms are imprecise by their very nature. They depend on floating point operations, which obviously can't represent real numbers in all their glory.
So if I plot:
plt.plot((test[0]-proj[0])/std)
plt.show()
I get the following plot which seems more reasonable.
You may be interested in plotting relative errors as well. Alternatively, you can normalize your data to have zero mean and unit variance, and then the PCA reconstruction should be more accurate.
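A minimal sketch of that normalization route, using a synthetic array in place of test.npy (the shape and scale below are assumptions, not the asker's data):
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
data = rng.normal(loc=2e24, scale=1e27, size=(100, 20))  # stand-in for test.npy

scaler = StandardScaler()
data_std = scaler.fit_transform(data)

pca = PCA()  # keep all components
proj_std = pca.inverse_transform(pca.fit_transform(data_std))
reconstructed = scaler.inverse_transform(proj_std)

# At this scale the relative error is the meaningful quantity to inspect
print(np.max(np.abs(data - reconstructed)) / data.std())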

What might be the best loss function when target is a gaussian label?

I have a simple CNN with the following inputs:
Cropped grayscale patches of size MxN centered on the object of interest. The intensity of each patch is rescaled to [0, 1].
A target Gaussian label of the same size MxN with values ranging in [5.0155e-173, 1]. This label is kept fixed throughout the training.
The goal is to learn the target label and use the learned model to detect the object in a test image. I am using the Adam optimizer with various loss functions such as categorical_crossentropy, mean_squared_error, and mean_absolute_error, but training stalls early, probably due to the low values returned by all these loss functions (vanishing gradients?). Increasing the batch size from 1 to 16~32 sometimes helps complete the iterations but gives undesired outcomes at test time.
Is it because the loss function is too sensitive to the lower values in the target and even treats them as outliers hence steering the whole learning process in the wrong direction?
I'll be grateful for your help in fixing the loss function in such a scenario.
I think the best choice here is to use some probability-distribution pseudo-distance. The first one that comes to mind is the Kullback-Leibler divergence, which is already implemented in both PyTorch and Keras (see KLDivLoss: https://pytorch.org/docs/stable/nn.html#kldivloss). Other well-known distances include the Jensen-Shannon divergence and the Earth Mover's distance (the same distance that was used in WGAN).
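For example, a hedged Keras sketch of the KL-divergence idea; the patch size, the softmax output head, and the normalization of the Gaussian label below are my assumptions, not details from the question:
import tensorflow as tf

M, N = 64, 64  # assumed patch size

def kl_heatmap_loss(y_true, y_pred):
    # Treat both the Gaussian label and the network output as probability
    # distributions over the MxN grid (the model is assumed to end in a
    # softmax over the flattened grid), then compare them with KL divergence.
    y_true = tf.reshape(y_true, (-1, M * N))
    y_true = y_true / tf.reduce_sum(y_true, axis=-1, keepdims=True)
    y_pred = tf.reshape(y_pred, (-1, M * N))
    return tf.keras.losses.KLDivergence()(y_true, y_pred)

# model.compile(optimizer="adam", loss=kl_heatmap_loss)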

How do I test my classifier accuracy against random values?

I've set up my first scikit-learn example to play with and I'm trying to gauge accuracy on my predictions. I've got training and test lists set up fine, but I'm getting ~0.95 accuracy even if I give it random values.
This looks to be because I'm checking for 0/1 labels, and 95% of the labels are zeros, so it's guessing 0 every time and getting 0.95 accuracy (I think?). Obviously this isn't what I want.
How do I go about deciding if my classifiers are working, and how do I get meaningful accuracy values?
You have a clear class imbalance issue. Your classifier is predicting 0 all the time knowing it will be right 95% of the time. You can inspect this by calling predict(X_test) on your fitted classifier. If all the values are 0 you know this is the case.
To get a better idea of how the model performs, you can upsample the data labelled with 1 or downsample the data labelled with 0. You can use this package, which builds on scikit-learn and implements a number of resampling methods. Alternatively, you can use scikit-learn's resample utility, which will bootstrap new data points for you.
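As a rough sketch of the upsampling route (synthetic data stands in for yours, and logistic regression is only a placeholder classifier), together with per-class metrics that are far more informative than plain accuracy here:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic data with roughly 95% zeros and 5% ones
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = (X[:, 0] + rng.normal(size=2000) > 2.3).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Upsample the minority class in the training set only
X_min, y_min = X_tr[y_tr == 1], y_tr[y_tr == 1]
X_maj, y_maj = X_tr[y_tr == 0], y_tr[y_tr == 0]
X_min_up, y_min_up = resample(X_min, y_min, replace=True,
                              n_samples=len(y_maj), random_state=0)
clf = LogisticRegression().fit(np.vstack([X_maj, X_min_up]),
                               np.concatenate([y_maj, y_min_up]))

# Per-class precision/recall and balanced accuracy expose what plain
# accuracy hides on imbalanced labels
print(classification_report(y_te, clf.predict(X_te)))
print(balanced_accuracy_score(y_te, clf.predict(X_te)))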

Different Linear Regression Coefficients with statsmodels and sklearn

I was planning to use sklearn linear_model to plot a graph of linear regression result, and statsmodels.api to get a detail summary of the learning result. However, the two packages produce very different results on the same input.
For example, the constant term from sklearn is 7.8e-14, but the constant term from statsmodels is 48.6. (I added a column of 1's to x for the constant term when using both methods.) My code for both methods is succinct:
import statsmodels.api as sm
from sklearn import linear_model

# Use statsmodels linear regression to get a result (summary) for the model.
def reg_statsmodels(y, x):
    results = sm.OLS(y, x).fit()
    return results

# Use sklearn linear regression to compute the coefficients for the prediction.
def reg_sklearn(y, x):
    lr = linear_model.LinearRegression()
    lr.fit(x, y)
    return lr.coef_
The input is too complicated to post here. Is it possible that a singular input x caused this problem?
By making a 3-d plot using PCA, it seems that the sklearn result is not a good approximation. What are some explanations? I still want to make a visualization, so it will be very helpful to fix the issues in the sklearn linear regression implementation.
You say that
I added a column of 1's in x for constant term when using both methods
But the documentation of LinearRegression says that
LinearRegression(fit_intercept=True, [...])
That is, it fits an intercept by default. This could explain the difference you see in the constant term.
Now for the other coefficients, differences can occur when two of the variables are highly correlated. Let's consider the most extreme case where two of your columns are identical. Then reducing the coefficient in front of any of the two can be compensated by increasing the other. This is the first thing I'd check.
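A small sketch of that first check, on made-up data (the intercept 48.6 and the coefficients below are purely illustrative): either let sklearn fit the intercept and drop the 1's column, or keep the column and turn fit_intercept off.
import numpy as np
import statsmodels.api as sm
from sklearn import linear_model

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 3))
y = 48.6 + x @ np.array([1.5, -2.0, 0.7]) + rng.normal(scale=0.1, size=100)

x_const = sm.add_constant(x)  # explicit column of 1's, as the asker did

print(sm.OLS(y, x_const).fit().params)  # intercept appears as the first coefficient

# Option 1: no 1's column, let sklearn fit the intercept itself
lr = linear_model.LinearRegression(fit_intercept=True).fit(x, y)
print(lr.intercept_, lr.coef_)

# Option 2: keep the 1's column and switch the built-in intercept off
lr0 = linear_model.LinearRegression(fit_intercept=False).fit(x_const, y)
print(lr0.coef_)  # first entry now plays the role of the intercept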

sklearn: AUC score for LinearSVC and OneSVM

One option of the SVM classifier (SVC) is probability, which is False by default. The documentation does not say what it does. Looking at the libsvm source code, it seems to do some sort of cross-validation.
This option does not exist for LinearSVC nor OneSVM.
I need to calculate AUC scores for several SVM models, including these last two. Should I calculate the AUC score using decision_function(X) as the thresholds?
Answering my own question.
Firstly, it is a common "myth" that you need probabilities to draw the ROC curve. No, you need some kind of threshold in your model that you can change. The ROC curve is then drawn by changing this threshold. The point of the ROC curve is, of course, to see how well your model reproduces the hypothesis by seeing how well it orders the observations.
In the case of SVM, there are two ways I see people drawing ROC curves for them:
using the distance to the decision boundary, as I mentioned in my own question
using the bias term as your threshold in the SVM: http://researchgate.net/post/How_can_I_plot_determine_ROC_AUC_for_SVM. In fact, if you use SVC(probability=True) then probabilities will be calculated for you in this manner, using CV, which you can then use to draw the ROC curve. But as mentioned in the link I provide, it is much faster to draw the ROC curve directly by varying the bias.
I think #2 is the same as #1 if we are using a linear kernel, as in my own case, because varying the bias is varying the distance in this particular case.
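A quick sketch of approach #1 on synthetic data (the dataset and classifier settings are placeholders): roc_curve and roc_auc_score only need a score that orders the observations, so the decision function can be passed in directly.
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LinearSVC(max_iter=10000).fit(X_tr, y_tr)

# Signed distance to the separating hyperplane serves as the threshold score
scores = clf.decision_function(X_te)
fpr, tpr, thr = roc_curve(y_te, scores)
print(roc_auc_score(y_te, scores))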
In order to calculate AUC using sklearn, you need a predict_proba method on your classifier; this is what the probability parameter on SVC enables (you are correct that it's calculated using cross-validation). From the docs:
probability : boolean, optional (default=False)
Whether to enable probability estimates. This must be enabled prior to calling fit, and will slow down that method.
You can't use the decision function directly to compute AUC, since it's not a probability. I suppose you could scale the decision function to take values in the range [0,1], and compute AUC, however I'm not sure what statistical properties this will have; you certainly won't be able to use it to compare with ROC calculated using probabilities.
