How does the predict function of StatsModels interact with roc_auc_score of scikit-learn? - python-3.x

I am trying to understand the predict function in Python statsmodels for a Logit model. Its documentation is here.
When I build a Logit model and use predict, it returns values from 0 to 1 as opposed to 0 or 1. I have read this question, which says these are probabilities and that we need a threshold: Python statsmodel.api logistic regression (Logit)
Now, I want to produce AUC numbers and I use roc_auc_score from sklearn (docs).
Here is when I start getting confused.
When I pass the raw predicted values (probabilities) from my Logit model to roc_auc_score as the second argument y_score, I get a reasonable AUC value of around 0.80. How does the roc_auc_score function know which of my probabilities equate to 1 and which equate to 0? Nowhere was I given an opportunity to set the threshold.
When I manually convert my probabilities into 0 or 1 using a threshold of 0.5, I get an AUC of around 0.50. Why would this happen?
Here's some code:
m1_result = m1.fit(disp=False)
roc_auc_score(y, m1_result.predict(X1))
# AUC: 0.80
roc_auc_score(y, [1 if x >= 0.5 else 0 for x in m1_result.predict(X1)])
# AUC: 0.50
Why is this the case?

Your 2nd way of calculating the AUC is wrong; by definition, AUC needs probabilities, and not hard class predictions 0/1 generated after thresholding, as you do here. So, your AUC is 0.80.
You don't set a threshold yourself in AUC calculation; roughly speaking, as I have explained elsewhere, the AUC measures the performance of a binary classifier averaged across all possible decision thresholds.
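As a hedged, self-contained sketch (synthetic data and made-up variable names, not your m1/X1/y), this is the difference in practice:
import numpy as np
import statsmodels.api as sm
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=500) > 0).astype(int)

X_const = sm.add_constant(X)                  # Logit does not add an intercept by itself
result = sm.Logit(y, X_const).fit(disp=False)

probs = result.predict(X_const)               # predicted probabilities in [0, 1]
hard = (probs >= 0.5).astype(int)             # hard 0/1 labels after thresholding

print(roc_auc_score(y, probs))                # correct: AUC over all thresholds
print(roc_auc_score(y, hard))                 # misleading: reflects a single 0.5 operating point
The first call matches the definition of AUC; the second throws away the ranking information and keeps only one operating point.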
It would be overkill to explain again here the rationale and details of AUC calculation; instead, these other SE threads (and the links therein) will help you get the idea:
In Classification, what is the difference between the test accuracy and the AUC score?
Advantages of AUC vs standard accuracy
Getting a low ROC AUC score but a high accuracy
Comparing AUC, log loss and accuracy scores between models

predict yields the estimated probability of the event according to your fitted model. That is, each element is the predicted probability that your model calculated for the corresponding observation.
The process behind building a ROC curve consists of selecting each predicted probability as a threshold, measuring its false positive and true positive rates and plotting these results as a line graph. The area below this curve is the AUC.
To visualize this, imagine you had the following data:
observation    observed_result    predicted_prob
1              0                  0.1
2              0                  0.5
3              1                  0.9
The function roc_auc_score will do the following:
Use 0.1 as the threshold such that all observations with predicted_prob ≤ 0.1 are classified as 0 and those with predicted_prob > 0.1 will be classified as 1
Use 0.5 as the threshold such that all observations with predicted_prob ≤ 0.5 are classified as 0 and those with predicted_prob > 0.5 will be classified as 1
Use 0.9 as the threshold such that all observations with predicted_prob ≤ 0.9 are classified as 0 and those with predicted_prob > 0.9 will be classified as 1
Each of the three different thresholds (0.1, 0.5 and 0.9) results in its own false positive and true positive rate. The false positive rates are plotted along the x-axis, while the true positive rates are plotted along the y-axis.
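To see those threshold sweeps directly, here is a small sketch (using the three observations from the table above) built on sklearn.metrics.roc_curve:
import numpy as np
from sklearn.metrics import roc_curve, auc

observed = np.array([0, 0, 1])
predicted_prob = np.array([0.1, 0.5, 0.9])

fpr, tpr, thresholds = roc_curve(observed, predicted_prob)
print(thresholds)     # candidate thresholds derived from the predicted probabilities
print(fpr, tpr)       # one (false positive rate, true positive rate) pair per threshold
print(auc(fpr, tpr))  # area under the resulting curve; 1.0 for this toy data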
As you can guess, you need to test many thresholds to plot a smooth curve. If you use 0.5 as a threshold and pass the resulting 0/1 labels to roc_auc_score, you are only testing the false positive and true positive rates of that single threshold. This is incorrect and is also the reason roc_auc_score returns a lower AUC than before.
Instead of doing this, you may want to test the performance of a single threshold (e.g. 0.5) by calculating its corresponding accuracy, true positive rate or false positive rate.
For instance, imagine we set a threshold of 0.5 in the data above.
observation    observed_result    predicted_prob    predicted_class
1              0                  0.1               0
2              0                  0.5               0
3              1                  0.9               1
This is a silly example, but by using 0.5 as the cutoff value, we made a perfect prediction because the observed_result matches predicted_class in all cases.
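A quick sketch of that single-threshold evaluation on the same toy data (0.5 as the cutoff, strictly greater than, as in the table):
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

observed = np.array([0, 0, 1])
predicted_prob = np.array([0.1, 0.5, 0.9])
predicted_class = (predicted_prob > 0.5).astype(int)   # 0.5 itself maps to class 0, as in the table

print(accuracy_score(observed, predicted_class))       # 1.0 for this toy example
print(confusion_matrix(observed, predicted_class))     # [[2 0], [0 1]]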

Related

Is there a reason why scikit-learn's classification report doesn't show you the number of predictions?

I use sklearn.metrics.classification_report often at work. One feature I had to implement myself was showing the number of predictions for each class as well, rather than just the support.
For example (I omitted some details for brevity):
<Original>
precision recall f1 support
class 0 0.5 0.5 0.5 100
class 1 0.5 0.5 0.5 200
class 2 0.5 0.5 0.5 300
<Mine>
precision recall f1 support preds
class 0 0.5 0.5 0.5 100 100
class 1 0.5 0.5 0.5 200 300
class 2 0.5 0.5 0.5 300 200
When performing error analysis I find it useful to compare the true label distribution to the predicted label distribution. However, since scikit-learn's function doesn't implement this, I made a simple change to it so that it does.
I'm curious why this isn't a feature to start with. Is there some reason the number of predictions is considered insignificant compared to the support?
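For reference, the same comparison can be produced without modifying scikit-learn itself; a minimal sketch with made-up labels:
import numpy as np
from sklearn.metrics import classification_report

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])

print(classification_report(y_true, y_pred))

# Count how often each class appears as a true label (support) and as a prediction
labels = np.unique(np.concatenate([y_true, y_pred]))
for lbl in labels:
    support = int((y_true == lbl).sum())
    preds = int((y_pred == lbl).sum())
    print(f"class {lbl}: support={support} preds={preds}")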

True Positive value difference in confusion matrix

To assess the accuracy of LULCC classification, I have used the confusion matrix from pandas_ml. However, the statistics report confuses me. The actual vs. predicted matrix indicates 20 points for the LSAgri class, but the TP value is 57 for LSAgri. Shouldn't these two values be identical? (image: class statistic vs CM)

LogisticRegression predict 1 if probability to 1 is bigger than 0.7

Using sklearn's LogisticRegression:
model = LogisticRegression().fit(X_train,y_train)
predictions = model.predict(X_test)
it gives 1 if the probability of class 1 is bigger than 0.5.
I want to change this so that it gives 1 if the probability of class 1 is bigger than 0.7.
You need to use model.predict_proba instead of model.predict, as it will give you the probabilities. You can then apply whatever threshold you need to those.
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
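For illustration, here is a minimal sketch on synthetic data (the dataset and split are made up; substitute your own X_train, y_train, X_test) that applies a 0.7 threshold on top of predict_proba:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]   # column 1 holds the probability of class 1
predictions = (probs > 0.7).astype(int)     # predict 1 only when P(class 1) > 0.7
print(predictions[:10])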

How to interpret the Confusion Matrix in Python for 2 classes

I am implementing a machine learning model in Python which predicts success or failure. I have created a dummy variable which is 1 when there is success and 0 when there is a failure. I understand the concept of a confusion matrix, but I have found examples online where the TPs and TNs are on opposite sides of the matrix. I would like to know how to interpret the results for my variables. Is the top-left corner of the matrix the True Positives? If so, does that translate to the number of successes predicted correctly or the number of failures predicted correctly?
Does the matrix match the diagram below and if so how?
Ideally, please describe each corner of the confusion matrix in the context where I have success as 1 and failure as 0.
Refer to the documentation: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html
Since you haven't specified the third parameter for labels in confusion_matrix, the labels in y_test_res will be used in sorted order, i.e. in this case 0 then 1. The row labels represent actual y, and column labels represent predicted y.
So the top-left corner is showing the number of correctly predicted failure observations, i.e. the actual y was 0 and was predicted 0, i.e. true negatives. The bottom-right corner is showing true positives, i.e. the actual y was 1 and was predicted 1.
The top-right corner would be actual y = 0 and predicted y = 1, i.e. false positive.
Using the confusion matrix plot would prettify things a little.
import matplotlib.pyplot as plt
from sklearn.metrics import plot_confusion_matrix

# forest is the fitted classifier from the question
plot_confusion_matrix(forest, X_test, y_test)
plt.show()
In the case of binary classification, where the classes are 0 and 1, and according to the doc:
1st row is for class 0
2nd row is for class 1
1st column is for predicted class 0
2nd column is for predicted class 1
Coefficient (0, 0) is the True Negative count (TN).
Coefficient (0, 1) is the False Positive count (FP).
Coefficient (1, 0) is the False Negative count (FN).
Coefficient (1, 1) is the True Positive count (TP).
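For completeness, here is a short sketch with made-up y values (not from the question) showing how the four cells unpack for the 1 = success, 0 = failure encoding:
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0]   # 1 = success, 0 = failure
y_pred = [1, 0, 0, 1, 0, 1]

# With labels sorted as [0, 1], ravel() returns TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)   # 2 1 1 2 for this toy data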

Mean Square Error (MSE) Root Mean Square Error (RMSE)

I'm working on a project for my studies in which I used mean_squared_error from sklearn to get my MSE and RMSE.
I can't understand what this information means.
I'm using a dataset about house sales and I want to predict the price of a house using linear regression. When I compare my predicted prices to the real prices, I get these results:
MSE: 1114197668.6920328 RMSE: 33379.59958855158
What does this information actually mean? That my predictions will differ from the real price by about 33379.60 on average?
import numpy as np
from sklearn.metrics import mean_squared_error
MSE = mean_squared_error(predict, testSalePrice)
RMSE = np.sqrt(MSE)
Mean Squared Error:
In statistics, the mean squared error (MSE) or mean squared deviation
(MSD) of an estimator (of a procedure for estimating an unobserved
quantity) measures the average of the squares of the errors.
So for example let's assume you have three datapoints:
Price    Predicted
1900     2000
2000     2000
2100     2000
Then the MSE is: 1/3 * ((-100)*(-100) + (0)*(0) + (100)*(100)) = 1/3 * 20000 ≈ 6666.67
A perfect value would be 0, but you will probably never reach that. You have to interpret it in comparison with your actual value range.
The RMSE in this case would be: SQRT(6666.67) ≈ 81.65
This is more interpretable: it means that on average you are about 82 away from your prediction, which makes sense if you look at the three results.
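The same toy numbers, reproduced with scikit-learn (a quick sketch, not the asker's actual data):
import numpy as np
from sklearn.metrics import mean_squared_error

price = np.array([1900, 2000, 2100])
predicted = np.array([2000, 2000, 2000])

mse = mean_squared_error(price, predicted)
rmse = np.sqrt(mse)
print(mse, rmse)   # roughly 6666.67 and 81.65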
