Odds ratio to Probability of Success - statistics

We ran a logistic regression model with passing the certification exam (0 or 1) as the outcome. We found that one of the strongest predictors is the student's program GPA: the higher the program GPA, the higher the odds of passing the certification exam.
Standardized GPA, p-value < .0001, B estimate = 1.7154, odds ratio = 5.559
I interpret this as: with every 0.33-unit (one standard deviation) increase in GPA, the odds of succeeding in the certification exam increase by a factor of 5.559.
However, clients want to understand this in terms of probability. I calculated probability by:
(5.559 - 1) x 100 = 455.9 percent
I'm having trouble explaining this percentage to our client. I thought probability of success is only supposed to range from 0 to 1. So confused! Help please!

Your math is correct; you just need to work on the interpretation.
I suppose the client wants to know "What is the probability of passing the exam if we increase the GPA by 1 unit?"
Using your output, we know that the odds ratio (OR) is 5.559. As you said, this means that the odds in favor of passing the exam increase by a factor of 5.559 for every unit increase in (standardized) GPA. So what's the increase in probability?
If we take the baseline odds to be 1 (i.e. a baseline probability of 0.5), then
odds(Y=1|X_GPA + 1) = 5.559 = p(Y=1|X_GPA + 1) / (1 - p(Y=1|X_GPA + 1))
Solving for p(Y=1|X_GPA + 1), we get:
p(Y=1|X_GPA + 1) = odds(Y=1|X_GPA + 1) / (1 + odds(Y=1|X_GPA + 1) ) = 5.559 / 6.559 = 0.847.
Note that another way to do this is to make use of the formula for logit:
logit(p) = B_0 + B_1*X_1 +...+ B_GPA*X_GPA therefore
p = 1 / ( 1 + e^-(B_0 + B_1*X_1 +...+ B_GPA*X_GPA) )
Since we know B_GPA = 1.7154 (and setting the remaining terms to zero), we can calculate that p = 1 / ( 1 + e^-1.7154 ) = 0.847
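For a quick numeric check, here is a minimal sketch in Python (the variable names are mine, not from your output) that reproduces both routes above:

import math

odds_ratio = 5.559   # odds multiplier per 1 SD increase in GPA
b_gpa = 1.7154       # log-odds coefficient; exp(1.7154) is about 5.559

# Route 1: odds -> probability, treating the baseline odds as 1 (baseline p = 0.5)
p_from_odds = odds_ratio / (1 + odds_ratio)

# Route 2: inverse logit of the coefficient, with the other terms set to zero
p_from_logit = 1 / (1 + math.exp(-b_gpa))

print(round(p_from_odds, 3), round(p_from_logit, 3))   # both ~0.847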

The change in probability (the risk ratio, i.e. p2/p1) depends on the baseline probability (p1), so there isn't a single value for a given odds ratio.
It can be calculated using the following formula:
RR = OR / (1 - p + (p * OR))
where p is the baseline probability.
For example:
Odds Ratio     0.1    0.2    0.3    0.4    0.5    0.6
RR (p = 0.1)   0.11   0.22   0.32   0.43   0.53   0.63
RR (p = 0.2)   0.12   0.24   0.35   0.45   0.56   0.65
RR (p = 0.3)   0.14   0.26   0.38   0.49   0.59   0.68
This link elaborates on the formula.
https://www.r-bloggers.com/how-to-convert-odds-ratios-to-relative-risks/
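A minimal sketch of the same conversion in Python (the function name is mine), which reproduces the rows of the table above:

def or_to_rr(odds_ratio, baseline_p):
    # RR = OR / (1 - p + p*OR), where p is the baseline probability
    return odds_ratio / (1 - baseline_p + baseline_p * odds_ratio)

for p in (0.1, 0.2, 0.3):
    print(p, [round(or_to_rr(o, p), 2) for o in (0.1, 0.2, 0.3, 0.4, 0.5, 0.6)])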

Related

How to interpret economic significance from the regression (OLS) coefficient?

I have the results from the following regression (pooled OLS with industry and year fixed effects):
y = 0.02 - 0.31*X1 + 0.23*X2 + residual
Please correct me if I am wrong in interpreting the economic significance of X1 on y (where the sample mean of X1 is 0.55 and its standard deviation is 0.1):
A one standard deviation increase in X1 (holding all other factors constant) decreases y by 0.031 (0.31*0.1), from its mean of 0.22 to 0.189, or by about 14%.
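A quick check of that arithmetic in Python (a sketch; the mean of y, 0.22, is taken from the statement above):

beta_x1 = -0.31
sd_x1 = 0.1
mean_y = 0.22

effect = beta_x1 * sd_x1                # -0.031
new_y = mean_y + effect                 # 0.189
print(effect, new_y, effect / mean_y)   # -0.031, 0.189, about -14%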

How can I split my normalization in two according to column values?

Hi, I have a column of data in pandas with a hugely skewed distribution.
I split the data in two according to a cutoff value of 1000, and this is the distribution of the two groups.
Now I want to normalize the values to the 0-1 range, but as a 'differential' normalization: the left panel's values should be normalized to 0-0.5 and the right panel's values to 0.5-1, all within the same column. How can I do it?
It's not pretty, but it works.
import pandas as pd

df = pd.DataFrame({'dataExample': [0, 1, 2, 1001, 1002, 1003]})
# Values <= 1000: divide by twice the group's max so they land in 0 - 0.5
less1000 = df.loc[df['dataExample'] <= 1000]
df.loc[df['dataExample'] <= 1000, 'datanorm'] = less1000['dataExample'] / (less1000['dataExample'].max() * 2)
# Values > 1000: min-max scale within the group, then squeeze into 0.5 - 1
high1000 = df.loc[df['dataExample'] > 1000]
df.loc[df['dataExample'] > 1000, 'datanorm'] = ((high1000['dataExample'] - high1000['dataExample'].min()) / ((high1000['dataExample'].max() - high1000['dataExample'].min()) * 2) + 0.5)
output:
dataExample datanorm
0 0 0.00
1 1 0.25
2 2 0.50
3 1001 0.50
4 1002 0.75
5 1003 1.00
Let's assume your dataframe is called df, the column holding the data is called data and the column holding the counts is called counts. Then you could do something like this (assigning through .loc so the second line doesn't overwrite the first):
df.loc[df['counts'] <= 1000, 'data_norm'] = df['data'] / 1000 / 2
df.loc[df['counts'] > 1000, 'data_norm'] = df['data'] / df['counts'].max() + 0.5
... assuming I understood you correctly. But I'm not sure I properly understand either your problem or your approach to solving it.
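For completeness, here is a more general sketch of the same piecewise min-max idea (the function name and cutoff argument are mine, not from either answer); for the example data it gives the same output as the first answer:

import pandas as pd

def piecewise_normalize(s, cutoff=1000):
    # Map values <= cutoff onto [0, 0.5] and values > cutoff onto [0.5, 1],
    # min-max scaling each group separately.
    out = pd.Series(index=s.index, dtype=float)
    low, high = s[s <= cutoff], s[s > cutoff]
    out[low.index] = (low - low.min()) / (low.max() - low.min()) / 2
    out[high.index] = (high - high.min()) / (high.max() - high.min()) / 2 + 0.5
    return out

df = pd.DataFrame({'dataExample': [0, 1, 2, 1001, 1002, 1003]})
df['datanorm'] = piecewise_normalize(df['dataExample'])
print(df)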

Express a percentage as a range value

I have a range:
1.75 [which I need to be 100%] to 4 [which needs to be 0%] (inclusive).
I need to be able to put in the percent, and get back the value.
How would I find the value for a percentage, say 50%, using a formula in Excel?
What I have tried so far: If I 'pretend' to reverse the percentage so that 1.75 is 0% and 4 is 100%, it seems a lot easier: I can use = (x - 1.75) / (4 - 1.75) * 100 to return the percentage of x, which is to say (x - min) / (max - min) * 100 = percentage of a range.
But I can't get this to work when the max is actually lower than the min. And...I'm not looking for the percent, I'm looking for the value when I enter the percent. :-/
The percentage of the value in the range is
=(max - value) / (max - min)
The value at some percentage is
=(min - max) * percentage + max
Edit: Perhaps a more intuitive way to attack "the value at some percentage" (notice I changed the terms here):
= (max - min) * (1- percentage) + min
In other words,
= (total distance) * (complement of fractional distance) + baseline
The complement is needed because you have reversed the sense of upper and lower bounds.
Like so; I used =4-(2.25*A1), =4-(2.25*A2) and =4-(2.25*A3):
A (percentage)   value
0                4
0.5              2.875
1                1.75
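And although the question is about Excel, here is a quick Python sketch just to sanity-check the same formula on these values:

min_val, max_val = 1.75, 4.0

def value_at(percentage):
    # value = (min - max) * percentage + max, so 0% -> 4 and 100% -> 1.75
    return (min_val - max_val) * percentage + max_val

for pct in (0.0, 0.5, 1.0):
    print(pct, value_at(pct))   # 4.0, 2.875, 1.75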

How can I find the formula for this table of values?

Things I know:
This table horizontally is 45*1.25+(x*0.25) where x is column number starting at 0.
This table vertically is 45*1.25+(y*0.125) where y is row number starting at 0.
These rules only work for the first row and column, I believe, which is why I'm having trouble figuring out what's going on.
56.25 67.5 78.75 90
61.88 78.75 95.63 112.5
67.5 90 112.5 135
So throwing a regression tool at it, I find a model of
56.2513 + 11.2497*x + 5.625*y + 5.625*x*y
with parameter standard deviations at
0.0017078 0.00091287 0.0013229 0.00070711
A measure of the residual errors is 0.0018257, which is down near the rounding error in your data. I would point out that it is quite close to that given by Amadan.
I can get a slightly better model as
56.2505 + 11.2497*x + 5.63*y + 5.625*x*y - 0.0025*y^2
again, the parameter standard errors are
0.0014434 0.00074536 0.0024833 0.00057735 0.001118
with a residual error of 0.0013944. The improvement is minimal, and you can see the coefficient of y^2 is barely more than twice the standard deviation. I'd be very willing to believe this parameter does not belong in the model, but was just generated by rounding noise.
Perhaps more telling is to look at the residuals. The model posed by Amadan yields residuals of:
56.25 + 5.63*Y + 11.26*X + 5.63*X.*Y - Z
ans =
0 0.01 0.02 0.03
0 0.02 0.03 0.05
0.01 0.03 0.05 0.07
Instead, consider the model generated by the regression tool.
56.2513 + 11.2497*X + 5.625*Y + 5.625*X.*Y - Z
ans =
0.0013 0.001 0.0007 0.0004
-0.0037 0.001 -0.0043 0.0004
0.0013 0.001 0.0007 0.0004
The residuals here are better, but I can do slightly better yet, merely by looking at the coefficients and perturbing them in a logical manner. What does this tell me? That Amadan's model is not the model that originally generated the data, although it was close.
My better model is this one:
56.25 + 11.25*X + 5.625*Y + 5.625*X.*Y
ans =
56.25 67.5 78.75 90
61.875 78.75 95.625 112.5
67.5 90 112.5 135
See that it is exact, except for two cells which have now been "unrounded". It yields residuals of:
56.25 + 11.25*X + 5.625*Y + 5.625*X.*Y - Z
ans =
0 0 0 0
-0.005 0 -0.005 0
0 0 0 0
Regression analysis will not always yield the result you need. Sometimes pencil and paper are as good or even better. But it can give you some understanding if you look at the data. My conclusion is that the original model was
f(x,y) = 56.25 + 11.25*x + 5.625*y + 5.625*x*y
The coefficients are well behaved and simple, and they predict the data perfectly except for two cells, which were surely rounded.
f(x,y) = 56.25 + 5.63 * ((x + 1) * y + 2 * x)
And, not programming.
I think you need a least squares fit for your data, given an assumed polynomial. This approach will "work" even if you give it more data points. Least squares will calculate the polynomial coefficients that minimize the mean square error between the polynomial and the points.
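For instance, here is a minimal least-squares sketch in Python/NumPy (the variable names are mine) that fits the assumed bilinear model a + b*x + c*y + d*x*y to the table above and recovers coefficients close to 56.25, 11.25, 5.625 and 5.625:

import numpy as np

# The 3x4 table from the question (rows indexed by y, columns by x).
Z = np.array([[56.25, 67.5, 78.75, 90.0],
              [61.88, 78.75, 95.63, 112.5],
              [67.5, 90.0, 112.5, 135.0]])

y_idx, x_idx = np.indices(Z.shape)       # row index is y, column index is x
x = x_idx.ravel().astype(float)
y = y_idx.ravel().astype(float)
z = Z.ravel()

# Design matrix for the model a + b*x + c*y + d*x*y
A = np.column_stack([np.ones_like(x), x, y, x * y])
coef, _, _, _ = np.linalg.lstsq(A, z, rcond=None)
print(coef)   # approximately [56.25, 11.25, 5.625, 5.625]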

how to show that NDCG score is significant

Suppose the NDCG score for my retrieval system is 0.8. How do I interpret this score? How do I tell the reader that this score is significant?
To understand this, let's work through an example of Normalized Discounted Cumulative Gain (nDCG).
For nDCG we need the DCG and the Ideal DCG (IDCG).
Let's first understand what Cumulative Gain (CG) is.
Example: Suppose we have [Doc_1, Doc_2, Doc_3, Doc_4, Doc_5]
Doc_1 is 100% relevant
Doc_2 is 70% relevant
Doc_3 is 95% relevant
Doc_4 is 20% relevant
Doc_5 is 100% relevant
So our Cumulative Gain (CG) is
CG = 100 + 70 + 95 + 20 + 100    (the index of the doc doesn't matter)
= 385
and
Discounted Cumulative Gain (DCG) is
DCG = SUM( relevanceAt(index) / log2(index + 1) ),  where index runs from 1 to 5
Doc_1 is 100 / log2(2) = 100.00
Doc_2 is 70 / log2(3) = 044.17
Doc_3 is 95 / log2(4) = 047.50
Doc_4 is 20 / log2(5) = 008.61
Doc_5 is 100 / log2(6) = 038.69
DCG = 100 + 44.17 + 47.5 + 8.61 + 38.69
DCG = 238.97
and the Ideal DCG (IDCG) uses the ideal ordering: Doc_1, Doc_5, Doc_3, Doc_2, Doc_4
Doc_1 is 100 / log2(2) = 100.00
Doc_5 is 100 / log2(3) = 063.09
Doc_3 is 95 / log2(4) = 047.50
Doc_2 is 70 / log2(5) = 030.15
Doc_4 is 20 / log2(6) = 007.74
IDCG = 100 + 63.09 + 47.5 + 30.15 + 7.74
IDCG = 248.48
nDCG(5) = DCG / IDCG
= 238.97 / 248.48
= 0.96
Conclusion:
In the given example the nDCG was 0.96. This is not a prediction accuracy; it measures how effective the ranking of the documents is. The gain is accumulated from the top of the result list to the bottom, with the gain of each result discounted at lower ranks.
Wiki reference
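A short Python sketch that reproduces the example above (function and variable names are mine):

import numpy as np

# Graded relevances of the ranked documents, in retrieval order.
rel = np.array([100, 70, 95, 20, 100], dtype=float)

def dcg(relevances):
    # DCG = sum over ranks i (1-based) of relevance_i / log2(i + 1)
    ranks = np.arange(1, len(relevances) + 1)
    return np.sum(relevances / np.log2(ranks + 1))

ideal = np.sort(rel)[::-1]               # ideal ordering: 100, 100, 95, 70, 20
ndcg = dcg(rel) / dcg(ideal)
print(round(dcg(rel), 2), round(dcg(ideal), 2), round(ndcg, 2))   # ~238.97, ~248.48, ~0.96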
NDCG is a ranking metric. In the information retrieval field you predict a sorted list of documents and then compare it with the list of relevant documents. Imagine that you predicted a sorted list of 1000 documents and there are 100 relevant documents; an NDCG of 1 is reached when those 100 relevant docs occupy the 100 highest ranks in the list.
So an NDCG of 0.8 means you are at 80% of the best possible ranking.
This is an intuitive explanation; the real math includes some logarithms, but it is not far from this.
If you have a relatively big sample, you can use bootstrap resampling to compute confidence intervals, which will show whether your NDCG score is significantly better than zero.
Additionally, you can use pairwise bootstrap resampling to test whether your NDCG score differs significantly from another system's NDCG score.
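A minimal sketch of the percentile-bootstrap idea in Python, assuming you have one nDCG score per query (the scores below are made-up placeholders):

import numpy as np

rng = np.random.default_rng(0)
scores = rng.uniform(0.6, 1.0, size=50)   # replace with your per-query nDCG scores

# Percentile bootstrap confidence interval for the mean nDCG.
boot_means = np.array([rng.choice(scores, size=len(scores), replace=True).mean()
                       for _ in range(10000)])
low, high = np.percentile(boot_means, [2.5, 97.5])
print(f"mean nDCG = {scores.mean():.3f}, 95% CI = ({low:.3f}, {high:.3f})")

For the pairwise comparison, resample the per-query differences between the two systems in the same way and check whether the resulting interval excludes zero.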
