Is there a reason why scikit-learn's classification report doesn't show you the number of predictions? - scikit-learn

I use sklearn.metrics.classification_report often at work. One feature I had to implement myself was showing the number of predictions for each class in addition to the support.
For example (I omitted some details for brevity):
Original:
           precision  recall  f1   support
  class 0        0.5     0.5  0.5      100
  class 1        0.5     0.5  0.5      200
  class 2        0.5     0.5  0.5      300

Mine:
           precision  recall  f1   support  preds
  class 0        0.5     0.5  0.5      100    100
  class 1        0.5     0.5  0.5      200    300
  class 2        0.5     0.5  0.5      300    200
When performing error analysis I find it useful to compare the true label distribution to the predicted label distribution. Since scikit-learn's function doesn't provide this, I made a small change to it myself so that it does.
I'm curious why this isn't a feature in the first place. Is there some reason the number of predictions is considered insignificant compared to the support?
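For anyone who wants the same column without patching scikit-learn, here is a minimal sketch (the function name and the "preds" key are my own, not part of the library) that builds the report as a dict and attaches a per-class prediction count next to the support:

from collections import Counter
from sklearn.metrics import classification_report

def report_with_pred_counts(y_true, y_pred):
    # output_dict=True keys each class row by str(label), plus summary rows
    report = classification_report(y_true, y_pred, output_dict=True)
    pred_counts = Counter(y_pred)
    avg_rows = {"accuracy", "macro avg", "weighted avg"}
    for key, row in report.items():
        if key in avg_rows or not isinstance(row, dict):
            continue
        # Counter keys keep the original label type; report keys are strings
        row["preds"] = sum(n for label, n in pred_counts.items() if str(label) == key)
    return report

y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 1, 1, 1, 2, 2]
print(report_with_pred_counts(y_true, y_pred))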

Related

LogisticRegression predict 1 if probability to 1 is bigger than 0.7

In sklearn's LogisticRegression,
model = LogisticRegression().fit(X_train, y_train)
predictions = model.predict(X_test)
gives 1 if the probability of class 1 is bigger than 0.5.
I want to change it so that it gives 1 if the probability of class 1 is bigger than 0.7.
You need to use model.predict_proba instead of model.predict, as it gives you probabilities. You can then apply the required threshold to those probabilities.
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
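A short sketch of that approach (make_classification and train_test_split are just stand-ins for your own X_train, X_test and y_train):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy data standing in for the question's training and test sets
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression().fit(X_train, y_train)

# predict_proba returns one column per class; column 1 is P(y = 1)
proba_positive = model.predict_proba(X_test)[:, 1]

# Label as 1 only when P(y = 1) exceeds 0.7 instead of the default 0.5
predictions = (proba_positive > 0.7).astype(int)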

Multi label regression summed to 1

I have data on renovation jobs. Each job requires at least one of three skills: carpenter, painter, and ceramics. For each row, my labels are the share of time each skill is required for the job (summing to 1).
Sample:
Job Description (free text)      Location   Estimated Cost  Main material  Carpenter  Painter  Ceramics
Paint Smiths' House and Parquet  Chicago    4000            Parquet        0.1        0.15     0.75
Total renovation and pool        New York   15700           Metal          0.6        0.2      0.2
Pink decorations                 New York   12000           Wallpaper      0.7        0.05     0.25
I want to train the model to predict the shares of the skills.
I was thinking about scikit-learn's MultiOutputRegressor, but my main issue is constraining the predictions to be >= 0 and to sum to 1.
Is there an off-the-shelf solution?
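I'm not aware of an estimator in scikit-learn that enforces this constraint directly; one common workaround (my suggestion, not an official feature) is to fit MultiOutputRegressor as planned and then clip and renormalize each prediction so it is non-negative and sums to 1:

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor

# Toy stand-ins for the encoded job features and the three skill shares (each row sums to 1)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Y = rng.dirichlet(alpha=[2.0, 2.0, 2.0], size=100)

model = MultiOutputRegressor(RandomForestRegressor(random_state=0)).fit(X, Y)

raw = model.predict(X)
# Post-process: clip negatives to zero, then renormalize each row to sum to 1
clipped = np.clip(raw, 0.0, None)
shares = clipped / clipped.sum(axis=1, keepdims=True)  # guard against all-zero rows in real use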

How does the predict function of StatsModels interact with roc_auc_score of scikit-learn?

I am trying to understand the predict function in Python statsmodels for a Logit model. Its documentation is here.
When I build a Logit model and use predict, it returns values from 0 to 1 rather than 0 or 1. I then read this answer, which says these are probabilities and that we need a threshold: Python statsmodel.api logistic regression (Logit)
Now, I want to produce AUC numbers and I use roc_auc_score from sklearn (docs).
Here is when I start getting confused.
When I put in the raw predicted values (probabilities) from my Logit model into the roc_auc_score as the second argument y_score, I get a reasonable AUC value of around 80%. How does the roc_auc_score function know which of my probabilities equate to 1 and which equate to 0? Nowhere was I given an opportunity to set the threshold.
When I manually convert my probabilities into 0 or 1 using a threshold of 0.5, I get an AUC of around 50%. Why would this happen?
Here's some code:
m1_result = m1.fit(disp = False)
roc_auc_score(y, m1_result.predict(X1))
AUC: 0.80
roc_auc_score(y, [1 if X >=0.5 else 0 for X in m1_result.predict(X1)])
AUC: 0.50
Why is this the case?
Your second way of calculating the AUC is wrong: by definition, AUC needs probabilities (or scores), not the hard 0/1 class predictions generated after thresholding, which is what you pass here. So your AUC is 0.80.
You don't set a threshold yourself in AUC calculation; roughly speaking, as I have explained elsewhere, the AUC measures the performance of a binary classifier averaged across all possible decision thresholds.
It would be overkill to explain again here the rationale and details of AUC calculation; instead, these other SE threads (and the links therein) will help you get the idea:
In Classification, what is the difference between the test accuracy and the AUC score?
Advantages of AUC vs standard accuracy
Getting a low ROC AUC score but a high accuracy
Comparing AUC, log loss and accuracy scores between models
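To see this point in code, a small sketch with synthetic data (make_classification stands in for your statsmodels fit; roc_auc_score behaves the same either way):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

proba = LogisticRegression().fit(X_train, y_train).predict_proba(X_test)[:, 1]

# AUC from probabilities: considers every possible decision threshold
print(roc_auc_score(y_test, proba))

# AUC from hard 0/1 labels: collapses everything to a single threshold, so the score
# is just the balanced accuracy at 0.5 and is generally lower than the AUC above
print(roc_auc_score(y_test, (proba >= 0.5).astype(int)))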
predict yields the estimated probability of the event according to your fitted model; each element is the predicted probability your model calculated for that observation.
The process behind building a ROC curve consists of taking each predicted probability as a threshold, measuring the false positive and true positive rates at that threshold, and plotting these results as a line graph. The area under this curve is the AUC.
To visualize this, imagine you had the following data:
observation  observed_result  predicted_prob
1            0                0.1
2            0                0.5
3            1                0.9
The function roc_auc_score will do the following:
Use 0.1 as the threshold such that all observations with predicted_prob ≤ 0.1 are classified as 0 and those with predicted_prob > 0.1 will be classified as 1
Use 0.5 as the threshold such that all observations with predicted_prob ≤ 0.5 are classified as 0 and those with predicted_prob > 0.5 will be classified as 1
Use 0.9 as the threshold such that all observations with predicted_prob ≤ 0.9 are classified as 0 and those with predicted_prob > 0.9 will be classified as 1
Each of the three different thresholds (0.1, 0.5 and 0.9) will result in its own false positive and true positive rates. The false positive rates are plotted along the x-axis, while the true positive rates are plotted in the y-axis.
As you can guess, you need to test many thresholds to plot a smooth curve. If you apply a 0.5 threshold yourself and pass the resulting 0/1 labels to roc_auc_score, you are only testing the false positive and true positive rates of that single threshold. This is incorrect and is also the reason roc_auc_score returns a lower AUC than before.
Instead of doing this, you may want to test the performance of a single threshold (i.e. 0.5) by calculating its corresponding accuracy, true positive rate or false positive rate.
For instance, imagine we set a threshold of 0.5 in the data above.
observation  observed_result  predicted_prob  predicted_class
1            0                0.1             0
2            0                0.5             0
3            1                0.9             1
This is a silly example, but by using 0.5 as the cutoff value, we made a perfect prediction because the observed_result matches predicted_class in all cases.
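For reference, here is a sketch that runs this toy example through scikit-learn's roc_curve and roc_auc_score (the variable names are mine):

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# The three observations from the tables above
observed = np.array([0, 0, 1])
predicted_prob = np.array([0.1, 0.5, 0.9])

# roc_curve sweeps the predicted probabilities as candidate thresholds and reports
# the false positive rate and true positive rate at each one
fpr, tpr, thresholds = roc_curve(observed, predicted_prob)
print(thresholds, fpr, tpr)

# roc_auc_score is the area under that (fpr, tpr) curve; 1.0 here because the one
# positive observation received the highest predicted probability
print(roc_auc_score(observed, predicted_prob))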

Odds ratio to Probability of Success

We ran a logistic regression model with passing the certification exam (0 or 1) as the outcome. We found that one of the strongest predictors is the student's program GPA: the higher the program GPA, the higher the odds of passing the certification exam.
Standardized GPA, p-value < .0001, B estimate = 1.7154, odds ratio = 5.559
I interpret this as: with every 0.33-unit (one standard deviation) increase in GPA, the odds of passing the certification exam increase by a factor of 5.559.
However, clients want to understand this in terms of probability. I calculated probability by:
(5.559 - 1) x 100 = 455.9 percent
I'm having trouble explaining this percentage to our client. I thought probability of success is only supposed to range from 0 to 1. So confused! Help please!
Your math is correct; you just need to work on the interpretation.
I suppose the client wants to know: "What is the probability of passing the exam if we increase the GPA by one standard deviation?"
Using your output, we know that the odds ratio (OR) is 5.559. As you said, this means the odds in favor of passing the exam increase by a factor of 5.559 for every one-SD increase in GPA. So what is the corresponding probability? If we take the odds after the increase to be the OR itself (which implicitly assumes baseline odds of 1, i.e. the other terms in the model contribute nothing), then:
odds(Y=1|X_GPA + 1) = 5.559 = p(Y=1|X_GPA + 1) / (1 - p(Y=1|X_GPA + 1))
Solving for p(Y=1|X_GPA + 1), we get:
p(Y=1|X_GPA + 1) = odds(Y=1|X_GPA + 1) / (1 + odds(Y=1|X_GPA + 1)) = 5.559 / 6.559 = 0.847
Note that another way to do this is to use the logit formula:
logit(p) = B_0 + B_1*X_1 + ... + B_GPA*X_GPA, and therefore
p = 1 / (1 + e^-(B_0 + B_1*X_1 + ... + B_GPA*X_GPA))
Since we know B_GPA = 1.7154, and again setting the intercept and other terms to zero, we can calculate that p = 1 / (1 + e^-1.7154) = 0.847.
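A quick numeric check of that conversion (keeping the simplifying assumption above that the intercept and the other terms contribute nothing):

import math

odds_ratio = 5.559   # from the regression output
b_gpa = 1.7154       # standardized GPA coefficient

# p = odds / (1 + odds), taking the odds after a one-SD increase to be the OR itself
p_from_odds = odds_ratio / (1 + odds_ratio)

# Equivalent logistic form, with the intercept and all other terms set to zero
p_from_logit = 1 / (1 + math.exp(-b_gpa))

print(p_from_odds, p_from_logit)   # both about 0.847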
The change in probability (the risk ratio, i.e. p2/p1) of the target depends on the baseline probability (p1), so there isn't a single value for a given odds ratio.
It can be calculated using the following formula:
RR = OR / (1 - p + (p x OR))
where p is the baseline probability.
For example:
            OR=0.1  OR=0.2  OR=0.3  OR=0.4  OR=0.5  OR=0.6
RR (p=0.1)    0.11    0.22    0.32    0.43    0.53    0.63
RR (p=0.2)    0.12    0.24    0.35    0.45    0.56    0.65
RR (p=0.3)    0.14    0.26    0.38    0.49    0.59    0.68
This link elaborates on the formula.
https://www.r-bloggers.com/how-to-convert-odds-ratios-to-relative-risks/
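In code (the helper name is my own), reproducing a few cells of the table above:

def odds_ratio_to_risk_ratio(odds_ratio, baseline_p):
    # RR = OR / (1 - p + p * OR), as in the r-bloggers post linked above
    return odds_ratio / (1 - baseline_p + baseline_p * odds_ratio)

print(round(odds_ratio_to_risk_ratio(0.1, 0.1), 2))  # 0.11
print(round(odds_ratio_to_risk_ratio(0.3, 0.2), 2))  # 0.35
print(round(odds_ratio_to_risk_ratio(0.6, 0.3), 2))  # 0.68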

How can I find the formula for this table of values?

Things I know:
Horizontally, the table follows 45*(1.25 + x*0.25), where x is the column number starting at 0.
Vertically, it follows 45*(1.25 + y*0.125), where y is the row number starting at 0.
I believe these rules only hold for the first row and first column, which is why I'm having trouble figuring out what's going on.
56.25   67.5    78.75    90
61.88   78.75   95.63   112.5
67.5    90     112.5    135
So throwing a regression tool at it, I find a model of
56.2513 + 11.2497*x + 5.625*y + 5.625*x*y
with parameter standard deviations at
0.0017078 0.00091287 0.0013229 0.00070711
A measure of the residual errors is 0.0018257, which is down near the rounding error in your data. I would point out that it is quite close to that given by Amadan.
I can get a slightly better model as
56.2505 + 11.2497*x + 5.63*y + 5.625*x*y - 0.0025*y^2
again, the parameter standard errors are
0.0014434 0.00074536 0.0024833 0.00057735 0.001118
with a residual error of 0.0013944. The improvement is minimal, and you can see the coefficient of y^2 is barely more than twice the standard deviation. I'd be very willing to believe this parameter does not belong in the model, but was just generated by rounding noise.
Perhaps more telling is to look at the residuals. The model posed by Amadan yields residuals of:
56.25 + 5.63*Y + 11.26*X + 5.63*X.*Y - Z
ans =
0 0.01 0.02 0.03
0 0.02 0.03 0.05
0.01 0.03 0.05 0.07
Instead, consider the model generated by the regression tool.
56.2513 + 11.2497*X + 5.625*Y + 5.625*X.*Y - Z
ans =
0.0013 0.001 0.0007 0.0004
-0.0037 0.001 -0.0043 0.0004
0.0013 0.001 0.0007 0.0004
The residuals here are better, but I can do slightly better yet, merely by looking at the coefficients and perturbing them in a logical manner. What does this tell me? That Amadan's model is not the model that originally generated the data, although it was close.
My better model is this one:
56.25 + 11.25*X + 5.625*Y + 5.625*X.*Y
ans =
56.25 67.5 78.75 90
61.875 78.75 95.625 112.5
67.5 90 112.5 135
See that it is exact, except for two cells which have now been "unrounded". It yields residuals of:
56.25 + 11.25*X + 5.625*Y + 5.625*X.*Y - Z
ans =
0 0 0 0
-0.005 0 -0.005 0
0 0 0 0
Regression analysis will not always yield the result you need. Sometimes pencil and paper are as good or even better. But it can give you some understanding if you look at the data. My conclusion is that the original model was
f(x,y) = 56.25 + 11.25*x + 5.625*y + 5.625*x*y
The coefficients are well behaved and simple, and they predict the data perfectly except for two cells, which were surely rounded.
This is the model that the answer above refers to as Amadan's: f(x,y) = 56.25 + 5.63 * ((x + 1) * y + 2 * x).
(And this is math, not programming.)
I think you need a least-squares fit for your data, given an assumed polynomial. This approach will work even if you give it more data points: least squares calculates the polynomial coefficients that minimize the mean squared error between the polynomial and the points.
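If you want to reproduce that kind of fit yourself, here is a minimal least-squares sketch with NumPy (the regression tool used in the earlier answer isn't named, so this is just one way to do it):

import numpy as np

# The table, rows indexed by y and columns by x (both starting at 0)
Z = np.array([
    [56.25, 67.5,  78.75, 90.0],
    [61.88, 78.75, 95.63, 112.5],
    [67.5,  90.0,  112.5, 135.0],
])
y_idx, x_idx = np.indices(Z.shape)
x = x_idx.ravel().astype(float)
y = y_idx.ravel().astype(float)
z = Z.ravel()

# Design matrix for the bilinear model a + b*x + c*y + d*x*y
A = np.column_stack([np.ones_like(x), x, y, x * y])
coeffs, residuals, rank, _ = np.linalg.lstsq(A, z, rcond=None)
print(coeffs)  # roughly [56.25, 11.25, 5.625, 5.625], up to rounding in the data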
