How to interpret economic significance from the regression (OLS) coefficient? - statistics

I have the results from the following regression (pooled OLS with industry and year fixed effects):
y =0.02 - 0.31*X1 + 0.23*X2 + residual
Please correct me if I am wrong in interpreting the economic significance of X1 for y (the sample mean of X1 is 0.55 and its standard deviation is 0.1):
A one-standard-deviation increase in X1 (holding all other factors constant) decreases y by 0.031 (0.31*0.1), from its mean of 0.22 to 0.189, or about 14%.
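
The arithmetic in this interpretation can be checked with a short sketch. The function and variable names are illustrative only; the numbers are the ones quoted in the question.

```python
# Effect of a one-standard-deviation change in X1, using the question's figures.
beta_x1 = -0.31   # estimated coefficient on X1
sd_x1 = 0.10      # sample standard deviation of X1
mean_y = 0.22     # sample mean of y


def one_sd_effect(beta, sd, mean_outcome):
    """Return (absolute change in y, change relative to the mean of y)."""
    change = beta * sd
    return change, change / mean_outcome


change, relative = one_sd_effect(beta_x1, sd_x1, mean_y)
print(f"change in y:      {change:.3f}")
print(f"new level of y:   {mean_y + change:.3f}")
print(f"relative to mean: {relative:.1%}")
```

Expressing the change relative to the mean of y is what turns a raw coefficient into a statement about economic magnitude.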

Related

Mean Square Error (MSE) Root Mean Square Error (RMSE)

I'm working on a study project in which I used mean_squared_error from sklearn to get my MSE and RMSE.
I can't understand what the numbers mean.
I'm using a dataset about house sales and I want to predict the price of a house using linear regression. When I compare my predicted prices with the real prices, I get these results:
MSE: 1114197668.6920328 RMSE: 33379.59958855158
What does this information actually mean? That my predictions will differ from the real price by about 33379.60 on average?
import numpy as np
from sklearn.metrics import mean_squared_error
MSE = mean_squared_error(testSalePrice, predict)  # signature is (y_true, y_pred)
RMSE = np.sqrt(MSE)
Mean Squared Error:
In statistics, the mean squared error (MSE) or mean squared deviation
(MSD) of an estimator (of a procedure for estimating an unobserved
quantity) measures the average of the squares of the errors.
So for example let's assume you have three datapoints:
Price   Predicted
1900    2000
2000    2000
2100    2000
Then the MSE is: 1/3 * ((-100)*(-100) + (0)*(0) + (100)*(100)) = 1/3 * (20000) ≈ 6666.67
A perfect model would score 0, but you will probably never reach that. Interpret the value relative to the range of your actual prices.
The RMSE in this case would be: SQRT(6666.67) ≈ 81.65
This is more interpretable because it is in the same units as the price: a typical prediction error is about 82, which makes sense given the three results. Note that RMSE weights large errors more heavily than a plain average of absolute errors, so it is not literally the mean difference.
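
As a minimal sketch, the three-point example can be recomputed in plain Python. The mean absolute error is added here only for comparison and is not part of the original answer; it shows how RMSE differs from a simple average of the error sizes.

```python
import math

actual = [1900, 2000, 2100]      # observed prices from the example
predicted = [2000, 2000, 2000]   # model predictions from the example


def mse(y_true, y_pred):
    """Mean of the squared errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean of the absolute errors (comparison only)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


mse_value = mse(actual, predicted)
rmse_value = math.sqrt(mse_value)   # back in the units of the price
mae_value = mae(actual, predicted)

print(f"MSE:  {mse_value:.2f}")
print(f"RMSE: {rmse_value:.2f}")
print(f"MAE:  {mae_value:.2f}")
```

RMSE is at least as large as MAE because squaring gives the larger errors more weight.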

Odds ratio to Probability of Success

We ran a logistic regression model with passing the certification exam (0 or 1) as the outcome. We found that one of the strongest predictors is the student's program GPA: the higher the program GPA, the higher the odds of passing the certification exam.
Standardized GPA, p-value < .0001, B estimate = 1.7154, odds ratio = 5.559
I interpret this as: with every 0.33-unit (one standard deviation) increase in GPA, the odds of passing the certification exam are multiplied by 5.559.
However, clients want to understand this in terms of probability. I calculated probability by:
(5.559 - 1) x 100 = 455.9 percent
I'm having trouble explaining this percentage to our client. I thought probability of success is only supposed to range from 0 to 1. So confused! Help please!
Your arithmetic is fine, but 455.9 percent is the percentage increase in the odds, not a probability.
I suppose the client wants to know "What is the probability of passing the exam if we increase the GPA by 1 unit?" (Since GPA is standardized, 1 unit is one standard deviation.)
Using your output, we know that the odds ratio (OR) is 5.559. As you said, this means that the odds in favor of passing the exam are multiplied by 5.559 for every unit increase in GPA. So what's the increase in probability? That depends on the starting point. If the baseline odds are 1 (a baseline probability of 0.5), the odds after the increase are 5.559:
odds(Y=1|X_GPA + 1) = 5.559 = p(Y=1|X_GPA + 1) / (1 - p(Y=1|X_GPA + 1))
Solving for p(Y=1|X_GPA + 1), we get:
p(Y=1|X_GPA + 1) = odds(Y=1|X_GPA + 1) / (1 + odds(Y=1|X_GPA + 1) ) = 5.559 / 6.559 = 0.847.
Note that another way to do this is to make use of the formula for logit:
logit(p) = B_0 + B_1*X_1 +...+ B_GPA*X_GPA therefore
p = 1 / ( 1 + e^-(B_0 + B_1*X_1 +...+ B_GPA*X_GPA) )
With B_GPA = 1.7154, X_GPA = 1, and all other terms of the linear predictor summing to 0 (the same baseline of p = 0.5), we can calculate that p = 1 / ( 1 + e^-1.7154 ) = 0.847. For any other baseline, plug in the full linear predictor.
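
A minimal sketch of the conversion from an odds ratio to a probability, assuming a baseline probability is supplied. The function name and the baseline values in the loop are hypothetical; only the odds ratio 5.559 comes from the question.

```python
def prob_after_odds_ratio(baseline_p, odds_ratio):
    """Probability after multiplying the baseline odds by odds_ratio."""
    baseline_odds = baseline_p / (1 - baseline_p)
    new_odds = baseline_odds * odds_ratio
    return new_odds / (1 + new_odds)


odds_ratio = 5.559  # from the regression output in the question

# Hypothetical baseline probabilities, chosen only for illustration.
for p0 in (0.1, 0.3, 0.5, 0.7):
    p1 = prob_after_odds_ratio(p0, odds_ratio)
    print(f"baseline {p0:.1f} -> after +1 SD of GPA {p1:.3f}")
```

A baseline of 0.5 gives 5.559 / 6.559, the 0.847 figure above; other baselines give other probabilities, always between 0 and 1.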
The change in probability of the target (the risk ratio, i.e. p2/p1) depends on the baseline probability (p1), so a given odds ratio does not correspond to a single value.
It can be calculated using the following formula:
RR = OR / (1 - p + (p x OR))
where p is the baseline probability (p1).
E.g.:
Odds Ratio   0.1   0.2   0.3   0.4   0.5   0.6
RR (p=0.1)   0.11  0.22  0.32  0.43  0.53  0.63
RR (p=0.2)   0.12  0.24  0.35  0.45  0.56  0.65
RR (p=0.3)   0.14  0.26  0.38  0.49  0.59  0.68
This link elaborates on the formula.
https://www.r-bloggers.com/how-to-convert-odds-ratios-to-relative-risks/
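
The formula can be sketched in a few lines of Python. The function name is illustrative; the loop recomputes the rows of the table above, and the printed values may differ from it in the last digit because of rounding.

```python
def relative_risk(odds_ratio, baseline_p):
    """RR = OR / (1 - p + p * OR), with p the baseline probability."""
    return odds_ratio / (1 - baseline_p + baseline_p * odds_ratio)


odds_ratios = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]

for p in (0.1, 0.2, 0.3):
    row = "  ".join(f"{relative_risk(o, p):.2f}" for o in odds_ratios)
    print(f"RR (p={p}): {row}")
```

Multiplying the baseline probability by the resulting RR gives the new probability, which is why the same odds ratio implies different changes at different baselines.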

Calculate R2 when only the sum of squares of residuals and the sample size are known

I was trying to solve a mathematical problem of multiple linear regression. There is a model given as
Y = β0 + β1X2 + β2X3 + ε
The sum of squares of residuals is SSRes = 4.312, and the sample size is n = 108.
I need to find the coefficient of determination, R2, which is the ratio SSReg/SST. I know that SSRes = SST - SSReg, but how can I calculate R2 if I know neither SST nor SSReg?
SST=Total Sum of Squares, SSReg=Sum of Squares of Regression.
Please suggest any possible approach to find R2 from these given data only.
SSRes and n alone do not determine R2; you also need SST (or SSReg). If you have the 108 observations of Y, then SST = sum((y - mean(y))^2) and R2 = (SST - SSRes) / SST.
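
A minimal sketch of that calculation, assuming the observed Y values are available. The data and the residual sum of squares below are hypothetical placeholders, not the 108 observations or the 4.312 from the question.

```python
def r_squared(y, ss_res):
    """R2 = (SST - SSRes) / SST, with SST computed from the observed y."""
    mean_y = sum(y) / len(y)
    sst = sum((value - mean_y) ** 2 for value in y)
    return (sst - ss_res) / sst


y = [1.0, 2.0, 3.0, 4.0]   # hypothetical observations of Y
ss_res = 0.5               # hypothetical residual sum of squares

print(f"R2 = {r_squared(y, ss_res):.3f}")
```

Without the observations (or SST itself) the function cannot be evaluated, which is the point of the answer.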

Create empirical cumulative distribution function (CDF) and then use the CDF to find probabilities

I have a set of observed data and created an empirical cumulative distribution using Excel. I want to use this CDF to find probabilities like P(x < X) or P (X1 < x < X2 ).
The way I created the CDF is to arrange the data in ascending order and then create a column next to it with the probabilities:
I have 4,121 records and the sample here is for four records. Once I have this calculation done, the curve is plotted using xy scatter plot for Data in the x-axis and Probability in the y-axis. This is how I created the CDF.
How can I find probability below 2.5, P(x<=2.5), or P( 970 < x < 980 )?
I hope there is an easy way because I will have hundreds of probabilities to find.
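
One possible approach, sketched in Python rather than Excel: with the data sorted, an empirical CDF probability is a count of observations divided by the sample size, and a binary search finds that count quickly. The function names and the six data values are hypothetical.

```python
from bisect import bisect_left, bisect_right


def ecdf_prob_le(sorted_data, x):
    """Empirical P(X <= x): share of observations at or below x."""
    return bisect_right(sorted_data, x) / len(sorted_data)


def ecdf_prob_between(sorted_data, lo, hi):
    """Empirical P(lo < X < hi): share of observations strictly inside (lo, hi)."""
    count = bisect_left(sorted_data, hi) - bisect_right(sorted_data, lo)
    return max(count, 0) / len(sorted_data)


# Hypothetical observations; the real data set has 4,121 records.
data = sorted([1.2, 2.5, 3.1, 975.0, 978.4, 990.0])

print(ecdf_prob_le(data, 2.5))
print(ecdf_prob_between(data, 970, 980))
```

Because each lookup is a binary search on the sorted list, hundreds of probabilities can be computed in a loop over the query values.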

How to do a linear regression in case of incomplete information about output variable

I need to do a linear regression:
y <- x1 + x2 + x3 + x4
However, y itself is not observed. Instead we have f(y), which depends on y.
For example, y is the success probability (between 0 and 1) of a binomial distribution over the outcomes 0 and 1,
and instead of y we observe the number of 0s and the number of 1s out of (number of 0s + number of 1s) experiments.
How should I perform the regression to recover the correct y?
How should I account for the differing amounts of information? For some values of x1, x2, x3 we have many experiments, which give a high-confidence estimate of y, but for others we have only a few measurements and therefore a low-confidence estimate.
Sounds like you need something like BUGS (Bayesian inference Using Gibbs Sampling) for the unknown variable y.
It sounds like you might be asking for logistic regression.
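
A rough sketch of the logistic regression idea, fitted directly to counts of successes and trials. To stay short it uses a single predictor and a hand-written Newton's method in plain Python; both are simplifying assumptions, and in practice a statistics package's binomial GLM would handle x1 to x4 together. The function name and data are hypothetical.

```python
import math


def fit_binomial_logit(xs, successes, trials, iterations=25):
    """Fit p = 1 / (1 + exp(-(b0 + b1*x))) by maximum likelihood (Newton's method)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, s, n in zip(xs, successes, trials):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            resid = s - n * p          # observed minus expected successes
            w = n * p * (1.0 - p)      # weight grows with the number of trials
            g0 += resid
            g1 += resid * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det   # Newton step: solve the 2x2 system
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Hypothetical data: the middle row has 100 trials, the others far fewer.
xs = [0.0, 1.0, 2.0]
successes = [1, 40, 9]
trials = [5, 100, 10]

b0, b1 = fit_binomial_logit(xs, successes, trials)
for x in xs:
    p_hat = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
    print(f"x = {x:.1f}: estimated probability {p_hat:.3f}")
```

The binomial likelihood does the weighting automatically: rows with more trials contribute more to the fit, which addresses the concern about high-confidence and low-confidence observations.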
