Mean Square Error (MSE) Root Mean Square Error (RMSE) - scikit-learn

I'm working on a study project in which I used mean_squared_error from sklearn to get my MSE and RMSE.
I can't understand what these numbers mean.
I'm using a dataset about house sales and I want to predict the price of a house using linear regression. When I compare my predicted prices with the real prices, I get these results:
MSE: 1114197668.6920328 RMSE: 33379.59958855158
What does this information actually mean? That my predictions differ from the real price by about 33379.60 on average?
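import numpy as np
from sklearn.metrics import mean_squared_error
MSE = mean_squared_error(testSalePrice, predict)  # signature is (y_true, y_pred); MSE is the same either way
RMSE = np.sqrt(MSE)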

Mean Squared Error:
In statistics, the mean squared error (MSE) or mean squared deviation
(MSD) of an estimator (of a procedure for estimating an unobserved
quantity) measures the average of the squares of the errors.
So for example let's assume you have three datapoints:
Price Predicted
1900 2000
2000 2000
2100 2000
Then the MSE is: 1/3 * ((-100)*(-100) + (0)*(0) + (100)*(100)) = 1/3 * (20000) ≈ 6666.67
A perfect model would score 0, but you will probably never reach that. You have to interpret the value relative to the range of your actual values.
The RMSE in this case would be: SQRT(6666.67) ≈ 81.65
This is more interpretable because it is in the same units as the price: your predictions are typically off by roughly 82, which makes sense if you look at the three results.
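The arithmetic for this example can be checked with a few lines of plain Python. This is a minimal sketch using only the three data points from the example; the mean absolute error (MAE) is added as an extra point of comparison and is not part of the original question.

```python
import math

actual = [1900, 2000, 2100]      # real prices from the example
predicted = [2000, 2000, 2000]   # model predictions from the example

errors = [a - p for a, p in zip(actual, predicted)]

mse = sum(e ** 2 for e in errors) / len(errors)    # mean of squared errors
rmse = math.sqrt(mse)                              # back in price units
mae = sum(abs(e) for e in errors) / len(errors)    # mean absolute error

print(f"MSE:  {mse:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"MAE:  {mae:.2f}")

# The same square-root step links the two numbers reported in the question.
question_rmse = math.sqrt(1114197668.6920328)
print(f"RMSE from the question's MSE: {question_rmse:.2f}")
```

RMSE weights large errors more heavily than MAE and is never smaller than it, so the 33379.60 in the question is best read as a typical error size in price units rather than an exact mean difference.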

Related

True Positive value difference in confusion matrix

To assess accuracy for LULCC, I used the confusion matrix from pandas_ml. However, the statistics report confuses me. The actual vs. predicted matrix shows 20 (points) for the LSAgri class, but the TP value for LSAgri is 57. Shouldn't these two values be identical? class statistic vs CM

How does the predict function of StatsModels interact with roc_auc_score of scikit-learn?

I am trying to understand the predict function in Python statsmodels for a Logit model. Its documentation is here.
When I build a Logit Model and use predict, it returns values from 0 to 1 as opposed to 0 or 1. Now I read this saying these are probabilities and we need a threshold. Python statsmodel.api logistic regression (Logit)
Now, I want to produce AUC numbers and I use roc_auc_score from sklearn (docs).
Here is when I start getting confused.
When I put in the raw predicted values (probabilities) from my Logit model into the roc_auc_score as the second argument y_score, I get a reasonable AUC value of around 80%. How does the roc_auc_score function know which of my probabilities equate to 1 and which equate to 0? Nowhere was I given an opportunity to set the threshold.
When I manually convert my probabilities into 0 or 1 using a threshold of 0.5, I get an AUC of around 50%. Why would this happen?
Here's some code:
m1_result = m1.fit(disp = False)
roc_auc_score(y, m1_result.predict(X1))
AUC: 0.80
roc_auc_score(y, [1 if X >=0.5 else 0 for X in m1_result.predict(X1)])
AUC: 0.50
Why is this the case?
Your 2nd way of calculating the AUC is wrong; by definition, AUC needs probabilities, and not hard class predictions 0/1 generated after thresholding, as you do here. So, your AUC is 0.80.
You don't set a threshold yourself in AUC calculation; roughly speaking, as I have explained elsewhere, the AUC measures the performance of a binary classifier averaged across all possible decision thresholds.
It would be overkill to explain again here the rationale and details of AUC calculation; instead, these other SE threads (and the links therein) will help you get the idea:
In Classification, what is the difference between the test accuracy and the AUC score?
Advantages of AUC vs standard accuracy
Getting a low ROC AUC score but a high accuracy
Comparing AUC, log loss and accuracy scores between models
predict yields the estimated probability of event according to your fitted model. That is, each element corresponds to the predicted probability that your model calculated for each observation.
The process behind building a ROC curve consists of selecting each predicted probability as a threshold, measuring its false positive and true positive rates and plotting these results as a line graph. The area below this curve is the AUC.
To visualize this, imagine you had the following data:
observation observed_result predicted_prob
1 0 0.1
2 0 0.5
3 1 0.9
The function roc_auc_score will do the following:
Use 0.1 as the threshold such that all observations with predicted_prob ≤ 0.1 are classified as 0 and those with predicted_prob > 0.1 will be classified as 1
Use 0.5 as the threshold such that all observations with predicted_prob ≤ 0.5 are classified as 0 and those with predicted_prob > 0.5 will be classified as 1
Use 0.9 as the threshold such that all observations with predicted_prob ≤ 0.9 are classified as 0 and those with predicted_prob > 0.9 will be classified as 1
Each of the three different thresholds (0.1, 0.5 and 0.9) will result in its own false positive and true positive rates. The false positive rates are plotted along the x-axis, while the true positive rates are plotted in the y-axis.
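The three thresholding steps can be sketched in plain Python. This follows the strict "greater than" rule exactly as described in the text; the helper name rates_at_threshold is made up for illustration, and scikit-learn's own implementation may treat values that fall exactly on a threshold differently.

```python
observed = [0, 0, 1]            # observed_result column
probs = [0.1, 0.5, 0.9]         # predicted_prob column

def rates_at_threshold(y_true, y_prob, threshold):
    """Return (false positive rate, true positive rate) for one threshold."""
    y_pred = [1 if p > threshold else 0 for p in y_prob]
    tp = sum(1 for t, c in zip(y_true, y_pred) if t == 1 and c == 1)
    fp = sum(1 for t, c in zip(y_true, y_pred) if t == 0 and c == 1)
    positives = sum(y_true)
    negatives = len(y_true) - positives
    return fp / negatives, tp / positives

# One (FPR, TPR) point on the ROC curve per threshold.
roc_points = {t: rates_at_threshold(observed, probs, t) for t in (0.1, 0.5, 0.9)}
for t, (fpr, tpr) in roc_points.items():
    print(f"threshold={t}: FPR={fpr:.2f}, TPR={tpr:.2f}")
```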
As you can guess, you need to test many thresholds to plot a smooth curve. If you apply 0.5 as a threshold and pass the resulting 0/1 classes to roc_auc_score, you are testing the false positive and true positive rates of a single threshold. This is incorrect and is also the reason roc_auc_score is returning a lower AUC than before.
Instead of doing this, you may want to test the performance of a single threshold (e.g. 0.5) by calculating its corresponding accuracy, true positive rate or false positive rate.
For instance, imagine we set a threshold of 0.5 in the data above.
observation observed_result predicted_prob predicted_class
1 0 0.1 0
2 0 0.5 0
3 1 0.9 1
This is a silly example, but by using 0.5 as the cutoff value, we made a perfect prediction because the observed_result matches predicted_class in all cases.
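To see how thresholding first can change the AUC, here is a minimal sketch that computes AUC as the fraction of (positive, negative) pairs ranked correctly, counting ties as half. That pairwise form is equivalent to the area under the ROC curve. The six data points and the function name pairwise_auc are hypothetical, chosen only for illustration.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted probabilities (not from the question).
y = [0, 0, 1, 0, 1, 1]
probs = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9]

auc_from_probs = pairwise_auc(y, probs)

# Thresholding at 0.5 collapses the scores to two values and creates many ties.
hard = [1 if p >= 0.5 else 0 for p in probs]
auc_from_classes = pairwise_auc(y, hard)

print(f"AUC from probabilities: {auc_from_probs:.3f}")
print(f"AUC from 0/1 classes:   {auc_from_classes:.3f}")
```

In this made-up example the score falls from about 0.89 to about 0.67, because the hard classes discard the ranking information within each predicted class.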

Deriving new continuous variable out of logistic regression coefficients

I have a set of independent variables X and set of values of dependent variable Y. The task at hand is binomial classification, i.e. predict whether debtor will default on his debt (1) or not (0).
After filtering out statistically insignificant variables and variables that bring about multicollinearity I am left with following summary of logistic regression model:
Accuracy ~0.87
Confusion matrix [[1038 254]
[72 1182]]
Parameters Coefficients
intercept -4.210
A 5.119
B 0.873
C -1.414
D 3.757
Now, I convert these coefficients into new continuous variable "default_probability" via log odds_ratio, i.e.
import math
e = math.e
power = (-4.210*1) + (A*5.119) + (B*0.873) + (C*-1.414) + (D*3.757)
default_probability = (e**power)/(1+(e**power))
When I divide my original dataset into quartiles according to this new continuous variable "default_probability", then:
1st quartile contains 65% of defaulted debts (577 out of 884 incidents)
2nd quartile contains 23% of defaulted debts (206 out of 884 incidents)
3rd quartile contains 9% of defaulted debts (77 out of 884 incidents)
4th quartile contains 3% of defaulted debts (24 out of 884 incidents)
At the same time:
overall quantity of debtors in 1st quartile - 1145
overall quantity of debtors in 2nd quartile - 516
overall quantity of debtors in 3rd quartile - 255
overall quantity of debtors in 4th quartile - 3043
I wanted to use "default probability" to surgically remove the most problematic credits by imposing the business rule "no credit to the 1st quartile", but now I wonder whether it is "surgical" at all, since by this rule I will lose 1145 - 577 = 568 "good" clients. Overall, is it mathematically/logically correct to derive new continuous variables for the dataset out of the coefficients of logistic regression by the line of reasoning described above?
You have forgotten the intercept when you compute power. But supposing this is only a typo like you said in the comments, then your approach is valid. However, you might want to use scikit-learn's predict_proba function, which will save you the trouble. Example:
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
import numpy as np
data = load_breast_cancer()
X = data.data
y = data.target
lr = LogisticRegression()
lr.fit(X,y)
Suppose I then want to compute the probability of belonging to class 1 for a given observation (say observation i). I can do what you have done, using the regression coefficients and the intercept:
i = 0
1/(1+np.exp(-X[i].dot(lr.coef_[0])-lr.intercept_[0]))
Or just do :
lr.predict_proba(X)[i][1]
which is faster
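The same conversion can also be written without any library. This is a small sketch using the coefficients from the question's summary table; the feature values passed in are made up for illustration, and the function name default_probability simply mirrors the variable in the question.

```python
import math

# Coefficients copied from the question's model summary.
INTERCEPT = -4.210
COEFS = {"A": 5.119, "B": 0.873, "C": -1.414, "D": 3.757}

def default_probability(features):
    """Map feature values to a probability with the logistic (sigmoid) function."""
    power = INTERCEPT + sum(COEFS[name] * features[name] for name in COEFS)
    # 1 / (1 + e^-power) is algebraically the same as e^power / (1 + e^power)
    return 1.0 / (1.0 + math.exp(-power))

# Hypothetical debtors: all features zero, then only A switched on.
p_baseline = default_probability({"A": 0, "B": 0, "C": 0, "D": 0})
p_with_a = default_probability({"A": 1, "B": 0, "C": 0, "D": 0})

print(f"baseline: {p_baseline:.4f}")
print(f"A = 1:    {p_with_a:.4f}")
```

With all features at zero the probability depends on the intercept alone, which is why leaving the intercept out of the exponent shifts every computed probability.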

why divide sample standard deviation by sqrt(sample size) when calculating z-score

I have been following Khan Academy videos to gain understanding of hypothesis testing, and I must confess that all my understanding thus far is based on that source.
Now, the following videos talk about z-score/hypothesis testing:
Hypothesis Testing
Z-statistic vs T-statistic
Now to my question, which is all about the denominator in the z-score:
For the z-score formula, which is z = (x – μ) / σ,
we use this directly when the standard deviation of the population (σ) is known.
But when it's unknown, and we use a sampling distribution,
then we have z = (x – μ) / (σ / √n), and we estimate σ with σs, where σs is the standard deviation of the sample and n is the sample size.
Then z-score = (x – μ) / (σs / √n). Why are we dividing by √n, when σs is already known?
Even in the video, Hypothesis Testing - Sal divides the sample's standard deviation by √n. Why are we doing this, when σs is directly given?
Please help me understand.
I tried applying this on the following question, and faced the problems below:
Question : Yardley designed new perfumes. Yardley company claimed that an average new
perfume bottle lasts 300 days. Another company randomly selects 35 new perfume bottles from
Yardley for testing. The sampled bottles last an average of 190 days, with a
standard deviation of 50 days. If the Yardley's claim were true,
what is the probability that 35 randomly selected bottles would have an average
life of no more than 190 days ?
So, the above question, when I do the following:
z = (190-300)/(50/√35), we get z = -13.05, which is not a possible score, since
z score should be between +-3.
And when I do, z = (190-110)/50, or rather z = (x – μ) / σ, I seem to be getting an acceptable answer over here.
Please help me figure out what I am missing.
I think the origin of the 1/√n is simply whether you're calculating the standard deviation of the lifetime of a single bottle, or the standard deviation of the (sample) mean of a set of bottles.
The question indicates that 50 days is the standard deviation of the lifetimes of the set of 35 bottles. That implies that the estimated mean lifetime (190 days) will have a margin of error (standard error) of about 50/√35 ≈ 8.45 days. Assuming that a similar margin of error applies to the claimed 300-day lifetime, one can calculate the probability that a set of 35 bottles would be measured to have a mean lifetime of 190 days or less, using the complementary error function.
Your z=-13.05 looks about right, implying that it is extremely unlikely that the claimed 300-day lifetime is consistent with that seen in the 35-bottle experiment.
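The calculation described in this answer can be sketched with Python's standard library. It assumes a normal approximation with the sample standard deviation standing in for the population value; with n = 35 a t distribution would be a reasonable alternative and is not shown here.

```python
import math

claimed_mean = 300.0   # days, Yardley's claim
sample_mean = 190.0    # days, observed in the sample
sample_sd = 50.0       # days, standard deviation of the sampled bottles
n = 35                 # sample size

# Standard deviation of the sample mean, not of a single bottle.
standard_error = sample_sd / math.sqrt(n)
z = (sample_mean - claimed_mean) / standard_error

# Lower-tail normal probability P(Z <= z) via the complementary error function.
p_lower = 0.5 * math.erfc(-z / math.sqrt(2))

print(f"standard error: {standard_error:.2f} days")
print(f"z: {z:.2f}")
print(f"P(mean <= 190 | claim true): {p_lower:.3e}")
```

A z-score this far from zero is not impossible; it simply corresponds to a probability that is vanishingly small, which is the evidence against the 300-day claim.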

understanding of result of logistic regression

Let us suppose we have the following data with a binary response (coupon).
Annual spending is given in units of 1000. My goal is to estimate whether a customer who spends more than 2000 and has a Simmons card will also have a coupon. First, I sorted the data by the response variable and got the following picture.
Next, I calculated the logit for each row, initially choosing the following coefficients:
B0 0.1
B1 0.1
B2 0.1
and I calculated L according to the following formula
Next, I calculated e^L (which can be done easily in Excel with the EXP function):
=EXP(D2)
After that, I calculated the probability:
=E2/(1+E2)
and finally, using the formula
I calculated the log-likelihood function.
Then I calculated the sum and used Solver to find the coefficients that minimize this sum (please note that the values are negative), but I got all coefficients equal to zero.
Am I wrong? Or does it mean that I can't predict the buying of a coupon on the basis of annual spending and owning a Simmons card? Thanks in advance.
You can predict the buying of a coupon on the basis of Annual spending (and knowing about the Simmons card doesn't help).
Admittedly I didn't solve it in Excel, but I suspect the problem might be that your optimization didn't converge (i.e., failed to reach the correct coefficients through the solving process) -- the correct coefficients are B0 = 5.63, B1 = -2.95, and B2 = 0. I found an online reference for the Excel logistic regression procedure at http://blog.excelmasterseries.com/2014/06/logistic-regression-performed-in-excel.html.
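For reference, the spreadsheet steps from the question (L, e^L, probability, log-likelihood, sum) can be sketched in Python as below. The five data rows are invented for illustration and are not the asker's data. One thing worth checking in any Solver setup is the direction of optimization: the summed log-likelihood is negative and should be maximized (or, equivalently, its negation minimized).

```python
import math

# Hypothetical rows: (annual spending in thousands, has card, coupon). Not the asker's data.
rows = [(1.0, 1, 1), (1.5, 0, 1), (2.0, 1, 0), (2.5, 0, 0), (1.8, 1, 1)]

def log_likelihood(b0, b1, b2, data):
    """Sum of per-row log-likelihoods for a logistic model."""
    total = 0.0
    for spending, card, coupon in data:
        logit = b0 + b1 * spending + b2 * card       # L
        p = math.exp(logit) / (1 + math.exp(logit))  # e^L / (1 + e^L)
        total += coupon * math.log(p) + (1 - coupon) * math.log(1 - p)
    return total

ll_start = log_likelihood(0.1, 0.1, 0.1, rows)   # starting coefficients from the question
ll_zero = log_likelihood(0.0, 0.0, 0.0, rows)    # every p is 0.5 here

print(f"log-likelihood at 0.1, 0.1, 0.1: {ll_start:.4f}")
print(f"log-likelihood at 0, 0, 0:       {ll_zero:.4f}")
```

With all coefficients at zero every row gets probability 0.5, so the sum is just the number of rows times ln(0.5); a fitted model should reach a value closer to zero than that.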
I ran the logistic regression myself and found that Annual spending is significant (at the 0.05 level) whereas Simmons card is not. Re-running the model with Simmons card removed yields the following equations:
L = 5.63 - 2.95 * Annual spending
P(1) = exp(L)/(1 + exp(L))
If P(1) > 0.5 => coupon = 1
Although the entropy Rsquare is low at 0.39 (and the number of data points is very low), the model is statistically significant.
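The fitted equations from this answer translate directly into a small prediction helper. This is a sketch only: the function names and the spending levels tried at the end are made up for illustration, while the two coefficients are the ones quoted in the answer.

```python
import math

B0 = 5.63    # intercept from the answer
B1 = -2.95   # coefficient on annual spending (in thousands) from the answer

def coupon_probability(annual_spending):
    """P(1) = exp(L) / (1 + exp(L)) with L = B0 + B1 * annual_spending."""
    logit = B0 + B1 * annual_spending
    return math.exp(logit) / (1 + math.exp(logit))

def predict_coupon(annual_spending, cutoff=0.5):
    """Return 1 if the estimated probability exceeds the cutoff, else 0."""
    return 1 if coupon_probability(annual_spending) > cutoff else 0

# P(1) crosses 0.5 where L = 0, i.e. at spending = -B0 / B1.
boundary = -B0 / B1

for spending in (1.0, 2.0, 3.0):   # hypothetical spending levels, in thousands
    print(spending, round(coupon_probability(spending), 3), predict_coupon(spending))
print(f"decision boundary: {boundary:.2f} (thousands)")
```

Because the spending coefficient is negative, the estimated probability of a coupon falls as annual spending rises, and the predicted class flips from 1 to 0 at the boundary value.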

Resources