In scikit-learn's LogisticRegression:
model = LogisticRegression().fit(X_train, y_train)
predictions = model.predict(X_test)
predict returns 1 when the probability of class 1 is greater than 0.5.
I want to change this so that it returns 1 only when the probability of class 1 is greater than 0.7.
You need to use model.predict_proba instead of model.predict, as it returns class probabilities. You can then apply whatever threshold you need to those probabilities.
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
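For example, a minimal sketch reusing X_train, y_train and X_test from the question:

from sklearn.linear_model import LogisticRegression

model = LogisticRegression().fit(X_train, y_train)

# Column 1 of predict_proba holds P(class = 1) for each row of X_test.
proba_one = model.predict_proba(X_test)[:, 1]

# Predict 1 only when that probability exceeds 0.7 instead of the default 0.5.
predictions = (proba_one > 0.7).astype(int)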
Related
I am trying to understand the predict function in Python statsmodels for a Logit model. Its documentation is here.
When I build a Logit model and use predict, it returns values between 0 and 1 rather than 0 or 1. Now I read this post, which says these are probabilities and that we need a threshold: Python statsmodel.api logistic regression (Logit)
Now, I want to produce AUC numbers and I use roc_auc_score from sklearn (docs).
Here is when I start getting confused.
When I put in the raw predicted values (probabilities) from my Logit model into the roc_auc_score as the second argument y_score, I get a reasonable AUC value of around 80%. How does the roc_auc_score function know which of my probabilities equate to 1 and which equate to 0? Nowhere was I given an opportunity to set the threshold.
When I manually convert my probabilities into 0 or 1 using a threshold of 0.5, I get an AUC of around 50%. Why would this happen?
Here's some code:
m1_result = m1.fit(disp = False)
roc_auc_score(y, m1_result.predict(X1))
AUC: 0.80
roc_auc_score(y, [1 if X >=0.5 else 0 for X in m1_result.predict(X1)])
AUC: 0.50
Why is this the case?
Your 2nd way of calculating the AUC is wrong; by definition, AUC needs probabilities, and not hard class predictions 0/1 generated after thresholding, as you do here. So, your AUC is 0.80.
You don't set a threshold yourself in AUC calculation; roughly speaking, as I have explained elsewhere, the AUC measures the performance of a binary classifier averaged across all possible decision thresholds.
It would be overkill to explain again here the rationale and details of AUC calculation; instead, these other SE threads (and the links therein) will help you get the idea:
In Classification, what is the difference between the test accuracy and the AUC score?
Advantages of AUC vs standard accuracy
Getting a low ROC AUC score but a high accuracy
Comparing AUC, log loss and accuracy scores between models
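Coming back to the first point, here is a minimal sketch with made-up labels and probabilities (not your data) showing how thresholding before roc_auc_score throws away the ranking information:

from sklearn.metrics import roc_auc_score

# Toy labels and predicted probabilities, purely illustrative.
y_true = [0, 0, 0, 1, 1, 1]
y_prob = [0.1, 0.2, 0.3, 0.4, 0.6, 0.9]

print(roc_auc_score(y_true, y_prob))                                  # 1.0: the probabilities rank every 1 above every 0
print(roc_auc_score(y_true, [1 if p >= 0.5 else 0 for p in y_prob]))  # about 0.83: hard 0/1 labels lose that ranking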
predict yields the estimated probability of event according to your fitted model. That is, each element corresponds to the predicted probability that your model calculated for each observation.
The process behind building a ROC curve consists of selecting each predicted probability as a threshold, measuring its false positive and true positive rates and plotting these results as a line graph. The area below this curve is the AUC.
To visualize this, imagine you had the following data:
observation    observed_result    predicted_prob
1              0                  0.1
2              0                  0.5
3              1                  0.9
The function roc_auc_score will do the following:
Use 0.1 as the threshold such that all observations with predicted_prob ≤ 0.1 are classified as 0 and those with predicted_prob > 0.1 will be classified as 1
Use 0.5 as the threshold such that all observations with predicted_prob ≤ 0.5 are classified as 0 and those with predicted_prob > 0.5 will be classified as 1
Use 0.9 as the threshold such that all observations with predicted_prob ≤ 0.9 are classified as 0 and those with predicted_prob > 0.9 will be classified as 1
Each of the three different thresholds (0.1, 0.5 and 0.9) will result in its own false positive and true positive rates. The false positive rates are plotted along the x-axis, while the true positive rates are plotted in the y-axis.
As you can guess, you need to test many thresholds to plot a smooth curve. If you threshold at 0.5 and pass the resulting 0/1 labels to roc_auc_score, you are only testing the false positive and true positive rates of that single threshold. This is incorrect, and it is also the reason roc_auc_score returns a lower AUC than before.
If you do want to evaluate the performance of a single threshold (e.g. 0.5), calculate its corresponding accuracy, true positive rate or false positive rate instead.
For instance, imagine we set a threshold of 0.5 in the data above.
observation    observed_result    predicted_prob    predicted_class
1              0                  0.1               0
2              0                  0.5               0
3              1                  0.9               1
This is a silly example, but by using 0.5 as the cutoff value, we made a perfect prediction because the observed_result matches predicted_class in all cases.
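To make this concrete, a minimal sketch on the toy table above, using roc_auc_score on the probabilities and accuracy for the single 0.5 cutoff:

import numpy as np
from sklearn.metrics import roc_auc_score, accuracy_score

observed = np.array([0, 0, 1])              # observed_result column
predicted_prob = np.array([0.1, 0.5, 0.9])  # predicted_prob column

# AUC consumes the probabilities directly and sweeps the thresholds internally.
print(roc_auc_score(observed, predicted_prob))        # 1.0

# A single cutoff (prob > 0.5 -> class 1) is better judged with metrics like accuracy.
predicted_class = (predicted_prob > 0.5).astype(int)  # [0, 0, 1]
print(accuracy_score(observed, predicted_class))      # 1.0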
I am implementing a machine learning model in Python which predicts success or failure. I have created a dummy variable which is 1 when there is success and 0 when there is a failure. I understand the concept of confusion matrix but I have found some online where the TPs and TNs are on opposite sides of the matrix. I would like to know how to interpret the results for my variables. Is the top-left corner of the matrix predicting True Positive? If so would that translate to the amount of successes being predicted correctly or the amount of failures being predicted correctly?
Does the matrix match the diagram below and if so how?
Ideally, please describe each corner of the confusion matrix in the context where I have success as 1 and failure as 0.
Refer to the documentation: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html
Since you haven't specified the third parameter for labels in confusion_matrix, the labels in y_test_res will be used in sorted order, i.e. in this case 0 then 1. The row labels represent actual y, and column labels represent predicted y.
So the top-left corner is showing the number of failure observations predicted correctly, i.e. the actual y was 0 and was predicted 0: true negatives. The bottom-right corner is showing true positives, i.e. the actual y was 1 and was predicted 1.
The top-right corner would be actual y = 0 and predicted y = 1, i.e. false positive.
Using the confusion matrix plot would prettify things a little.
from sklearn.metrics import plot_confusion_matrix
import matplotlib.pyplot as plt

plot_confusion_matrix(forest, X_test, y_test)  # in newer scikit-learn, use ConfusionMatrixDisplay.from_estimator instead
plt.show()
In the case of binary classification where the classes are 0 and 1, according to the documentation:
1st row is for class 0
2nd row is for class 1
1st column is for predicted class 0
2nd column is for predicted class 1
Coefficient (0, 0) is the True Negative count (TN).
Coefficient (0, 1) is the False Positive count (FP).
Coefficient (1, 0) is the False Negative count (FN).
Coefficient (1, 1) is the True Positive count (TP).
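A minimal sketch with made-up labels (success = 1, failure = 0), just to show how the four cells unpack in that order:

from sklearn.metrics import confusion_matrix

# Toy labels, purely illustrative: 1 = success, 0 = failure.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[2 1]
#  [1 2]]

# For binary labels 0/1, ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)  # 2 1 1 2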
I'm working on a study project where I used mean_squared_error from sklearn to get my MSE and RMSE.
I can't understand what this information means.
I'm using a dataset about house sales and I want to predict the price of a house using linear regression. When I compare my predicted prices with the real prices, I get these results:
MSE: 1114197668.6920328 RMSE: 33379.59958855158
What does this information actually mean? That my predictions will differ from the real price by about 33379.60 on average?
from sklearn.metrics import mean_squared_error
import numpy as np
MSE = mean_squared_error(predict, testSalePrice)
RMSE = np.sqrt(MSE)
Mean Squared Error:
In statistics, the mean squared error (MSE) or mean squared deviation
(MSD) of an estimator (of a procedure for estimating an unobserved
quantity) measures the average of the squares of the errors.
So for example let's assume you have three datapoints:
Price    Predicted
1900     2000
2000     2000
2100     2000
Then the MSE is: 1/3 * ((-100)*(-100) + (0)*(0) + (100)*(100)) = 20000 / 3 ≈ 6666.67
A perfect value would be 0, but you will probably never reach it; you have to interpret the MSE relative to the range of your actual values.
The RMSE in this case would be: sqrt(20000 / 3) ≈ 81.65
This is more interpretable: it means that on average your prediction is about 81.65 away from the actual value, which makes sense if you look at the three data points.
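A minimal sketch reproducing these numbers with scikit-learn and NumPy:

import numpy as np
from sklearn.metrics import mean_squared_error

actual = [1900, 2000, 2100]
predicted = [2000, 2000, 2000]

mse = mean_squared_error(actual, predicted)
rmse = np.sqrt(mse)
print(mse)   # ≈ 6666.67 (20000 / 3)
print(rmse)  # ≈ 81.65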
I have a set of independent variables X and set of values of dependent variable Y. The task at hand is binomial classification, i.e. predict whether debtor will default on his debt (1) or not (0).
After filtering out statistically insignificant variables and variables that bring about multicollinearity, I am left with the following summary of the logistic regression model:
Accuracy ~0.87
Confusion matrix:
[[1038  254]
 [  72 1182]]

Parameter    Coefficient
intercept    -4.210
A             5.119
B             0.873
C            -1.414
D             3.757
Now, I convert these coefficients into a new continuous variable "default_probability" by applying the logistic function to the log-odds, i.e.
import math
e = math.e
power = (-4.210*1) + (A*5.119) + (B*0.873) + (C*-1.414) + (D*3.757)
default_probability = (e**power)/(1+(e**power))
When I divide my original dataset into quartiles according to this new continuos variable "default_probability", then:
1st quartile contains 65% of defaulted debts (577 out of 884 incidents)
2nd quartile contains 23% of defaulted debts (206 out of 884 incidents)
3rd quartile contains 9% of defaulted debts (77 out of 884 incidents)
4th quartile contains 3% of defaulted debts (24 out of 884 incidents)
At the same time:
overall quantity of debtors in 1st quartile - 1145
overall quantity of debtors in 2nd quartile - 516
overall quantity of debtors in 3rd quartile - 255
overall quantity of debtors in 4th quartile - 3043
I wanted to use "default_probability" to surgically remove the most problematic credits by imposing the business rule "no credit to the 1st quartile", but now I wonder whether it is "surgical" at all, since this rule would also cost me 1145 - 577 = 568 "good" clients. More generally, is it mathematically/logically correct to derive new continuous variables for the dataset from the coefficients of a logistic regression by the line of reasoning described above?
As long as the intercept is included when you compute power (you said in the comments that omitting it was just a typo), your approach is valid. However, you might want to use scikit-learn's predict_proba function, which will save you the trouble. Example:
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
import numpy as np
data = load_breast_cancer()
X = data.data
y = data.target
lr = LogisticRegression()
lr.fit(X,y)
Suppose I then want to compute the probability of belonging to class 1 for a given observation (say observation i). I can do essentially what you have done, using the regression coefficients and the intercept:
i = 0
1/(1+np.exp(-X[i].dot(lr.coef_[0])-lr.intercept_[0]))
Or just do:
lr.predict_proba(X)[i][1]
which is faster.
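As a quick sanity check, continuing with the fitted lr, X and index i from above, the two routes agree:

manual = 1 / (1 + np.exp(-X[i].dot(lr.coef_[0]) - lr.intercept_[0]))
print(np.isclose(manual, lr.predict_proba(X)[i][1]))  # True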
I have 2 series of 45 values in the interval [0,1]. The first series is a human-generated standard, the second one is computer-generated (full series here http://www.copypastecode.com/74844/).
The first series is sorted decreasingly.
0.909090909 0.216196598
0.909090909 0.111282099
0.9 0.021432587
0.9 0.033901106
...
0.1 0.003099256
0 0.001084533
0 0.008882249
0 0.006501463
Now what I want to assess is the degree to which the order is preserved in the second series, given that the first series is monotonic.
The Pearson correlation is 0.454763067, but I think that the relationship is not linear so this value is difficult to interpret.
A natural approach would be to use the Spearman rank correlation, which in this case is 0.670556181.
I noticed that with random values, while Pearson is very close to 0, the Spearman rank correlation goes up to 0.5, so a value of 0.67 seems very low.
What would you use to assess the order similarity between these 2 series?
I want to assess the degree to which the order is preserved
Since it's the order (rank) that you care about, Spearman rank correlation is the more meaningful metric here.
I noticed that with random values [...] the Spearman rank correlation goes up to 0.5
How do you generate those random values? I've just conducted a simple experiment with some random numbers generated using numpy, and I am not seeing that:
In [1]: import numpy as np
In [2]: import scipy.stats
In [3]: x = np.random.randn(1000)
In [4]: y = np.random.randn(1000)
In [5]: print(scipy.stats.spearmanr(x, y))
(-0.013847401847401847, 0.66184551507218536)
The first number (-0.01) is the rank correlation coefficient; the second number (0.66) is the associated p-value.
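And regarding the non-linearity you mention: a small sketch with synthetic data (not your series) showing why Spearman, which only looks at ranks, is the natural fit when the relationship is monotonic but not linear:

import numpy as np
import scipy.stats

x = np.linspace(0, 1, 45)
y = np.exp(10 * x)  # strictly increasing in x, so the order is perfectly preserved, but far from linear

print(scipy.stats.pearsonr(x, y)[0])   # noticeably below 1, despite the perfect ordering
print(scipy.stats.spearmanr(x, y)[0])  # exactly 1.0, because the ranks match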