understanding of result of logistic regression - excel

Suppose we have the following data with a binary response (coupon).
Annual spending is given in units of 1,000. My goal is to estimate whether a customer who spends more than 2,000 and has a Simmons card will also have a coupon. First I sorted the data by the response variable and got the following picture.
Next I calculated the logit for each row, initially choosing the following coefficients:
B0 0.1
B1 0.1
B2 0.1
and I calculated L according to the following formula
Next I calculated e^L (which is easy in Excel with the EXP function):
=EXP(D2)
After that I calculated the probability:
=E2/(1+E2)
Finally, using the formula
I calculated the log-likelihood for each row.
Then I summed these values and used Solver to find the coefficients that minimize this sum (please note that the values are negative), but all the coefficients came out as zero.
Am I wrong? Or does it mean that I can't predict the buying of a coupon on the basis of annual spending and owning a Simmons card? Thanks in advance.
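The spreadsheet steps described in the question (logit, e^L, probability, per-row log-likelihood, sum) can be sketched in Python roughly as follows. The rows are hypothetical placeholders, not the asker's data, and the per-row term y*ln(P) + (1-y)*ln(1-P) is the standard log-likelihood form, assumed here because the original formula image is not available. Note that this sum is never positive, and a fit is obtained by maximizing it (equivalently, minimizing its negative).

```python
import math

def log_likelihood(b0, b1, b2, rows):
    """Sum of per-row log-likelihoods for a logistic model."""
    total = 0.0
    for spending, card, coupon in rows:
        logit = b0 + b1 * spending + b2 * card           # L
        p = math.exp(logit) / (1 + math.exp(logit))      # e^L / (1 + e^L)
        total += coupon * math.log(p) + (1 - coupon) * math.log(1 - p)
    return total

# Hypothetical rows: (annual spending in thousands, Simmons card, coupon)
rows = [(1.0, 1, 1), (1.5, 0, 1), (2.5, 1, 0), (3.0, 0, 0)]

ll_start = log_likelihood(0.1, 0.1, 0.1, rows)   # the starting coefficients
ll_zero = log_likelihood(0.0, 0.0, 0.0, rows)    # all-zero coefficients
ll_fit = log_likelihood(5.63, -2.95, 0.0, rows)  # coefficients quoted in the answer

print(ll_start, ll_zero, ll_fit)
```

With all coefficients at zero every probability is 0.5, so the sum is simply n * ln(0.5); on these placeholder rows the coefficients from the answer give a sum closer to zero, which is the direction a correct optimization should move in.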

You can predict the buying of a coupon on the basis of annual spending (knowing about the Simmons card doesn't help).
Admittedly I didn't solve it in Excel, but I suspect the problem might be that your optimization didn't converge (i.e., failed to reach the correct coefficients through the solving process) -- the correct coefficients are B0 = 5.63, B1 = -2.95, and B2 = 0. I found an online reference for the Excel logistic regression procedure at http://blog.excelmasterseries.com/2014/06/logistic-regression-performed-in-excel.html.
I ran the logistic regression myself and found that Annual spending is significant (at the 0.05 level) whereas Simmons card is not. Re-running the model with Simmons card removed yields the following equations:
L = 5.63 - 2.95 * Annual spending
P(1) = exp(L)/(1 + exp(L))
If P(1) > 0.5 => coupon = 1
Although the entropy R-squared is low at 0.39 (and the number of data points is very small), the model is statistically significant.
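As a rough illustration, the fitted equations can be written as a small Python function. The coefficients are the ones quoted in this answer; the function names are made up for the example, and annual spending is assumed to be in thousands, as in the question.

```python
import math

B0, B1 = 5.63, -2.95  # coefficients quoted in the answer

def coupon_probability(annual_spending):
    """P(1) = exp(L) / (1 + exp(L)) with L = B0 + B1 * annual spending."""
    logit = B0 + B1 * annual_spending
    return math.exp(logit) / (1 + math.exp(logit))

def predict_coupon(annual_spending):
    """Apply the 0.5 cut-off from the answer."""
    return 1 if coupon_probability(annual_spending) > 0.5 else 0

# The predicted class flips where L = 0, i.e. at spending = -B0 / B1
threshold = -B0 / B1

print(predict_coupon(1.0), predict_coupon(2.5), round(threshold, 2))
```

Under these coefficients the probability crosses 0.5 at an annual spending of about 1.91 (thousand), so lower spenders are predicted to have a coupon and higher spenders are not.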

Related

Weighted mean, sd and median - Size Weighted (Negative Numbers)

I need to calculate the weighted median, average, and standard deviation of PE funds' returns. I weighted the sample by each fund's committed capital, but I need to include negative products to analyze underperforming funds. However, I'm not sure whether I can use negative or zero values to derive these statistical measures.
Wμ = Σ(w·x)/Σw --> the formula I use for the weighted average
w = fund's size
x = net IRR
w·x = negative and positive values
How can I calculate these measures when negative and zero values are included? I'm doing it in Excel.
My reference point is Kaplan and Schoar's approach (Private Equity Performance: Returns, Persistence, and Capital Flows).
Any help on this matter is really appreciated!
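For illustration only, here is a minimal Python sketch of the weighted average formula above together with one common convention for a weighted standard deviation and a weighted median. The fund sizes and IRRs are hypothetical, and the choice of a population-style standard deviation and a "first value reaching half the total weight" median are assumptions, not necessarily what Kaplan and Schoar use. Nothing in these formulas requires x to be positive; only the weights need to be positive.

```python
import math

def weighted_mean(values, weights):
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

def weighted_sd(values, weights):
    # Population-style weighted standard deviation (assumed convention)
    mu = weighted_mean(values, weights)
    var = sum(w * (x - mu) ** 2 for x, w in zip(values, weights)) / sum(weights)
    return math.sqrt(var)

def weighted_median(values, weights):
    # Smallest value whose cumulative weight reaches half of the total weight
    half = sum(weights) / 2
    running = 0.0
    for value, weight in sorted(zip(values, weights)):
        running += weight
        if running >= half:
            return value

# Hypothetical data: net IRR (x) and fund size (w)
irr = [-0.10, 0.00, 0.05, 0.20]
size = [100, 50, 200, 150]

w_mean = weighted_mean(irr, size)
w_sd = weighted_sd(irr, size)
w_median = weighted_median(irr, size)
print(w_mean, w_sd, w_median)
```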

Mean Square Error (MSE) Root Mean Square Error (RMSE)

I'm working on a study project in which I used mean_squared_error from sklearn to get my MSE and RMSE.
I can't understand what the results mean.
I'm using a dataset about house sales and I want to predict house prices using linear regression. When I compare my predicted prices with the real prices, I get these results:
MSE: 1114197668.6920328 RMSE: 33379.59958855158
What does this information actually mean? That my predictions differ from the real price by about 33379.60 on average?
from sklearn.metrics import mean_squared_error
import numpy as np
MSE = mean_squared_error(testSalePrice, predict)  # (y_true, y_pred)
RMSE = np.sqrt(MSE)
Mean Squared Error:
In statistics, the mean squared error (MSE) or mean squared deviation
(MSD) of an estimator (of a procedure for estimating an unobserved
quantity) measures the average of the squares of the errors.
So for example let's assume you have three datapoints:
Price Predicted
1900 2000
2000 2000
2100 2000
Then the MSE is: 1/3 * ((-100)*(-100) + (0)*(0) + (100)*(100)) = 1/3 * (20000) ≈ 6666.67
The perfect value would be 0, but you will probably not reach it. You have to interpret it relative to your actual value range.
The RMSE in this case would be: SQRT(6666.67) ≈ 81.65
This is more interpretable: it means your predictions are typically about 82 away from the actual price, which makes sense given the three results.
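The three-point example can be checked with a few lines of plain Python. This is only a sketch using the table's numbers; the mean absolute error is added here as an extra point of comparison, since RMSE is close to, but not the same as, the mean absolute difference.

```python
import math

actual = [1900, 2000, 2100]
predicted = [2000, 2000, 2000]

errors = [a - p for a, p in zip(actual, predicted)]

mse = sum(e ** 2 for e in errors) / len(errors)   # mean of squared errors
rmse = math.sqrt(mse)                             # back in price units
mae = sum(abs(e) for e in errors) / len(errors)   # mean absolute error

print(mse, rmse, mae)
```

RMSE weights large errors more heavily than the mean absolute error does, so it is usually somewhat larger; both are in the same units as the price.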

Computing sum of progressively-increasing values in Excel

I am trying to solve an iterative problem in Excel. I want to be able to calculate the sum of rent for x years. The rent is increasing at a rate of 10 percent every year. I quickly came up with this python code on a REPL for clarity:
year = 6
rent = 192000
total_rent = rent
for x in range(1, year):
    rent = rent + .1*rent
    total_rent = total_rent + rent
print(total_rent) # 1481397.12 is what it prints
This is a trivial problem in programming, but I am not sure of the best way to achieve it in Excel.
In Excel I am doing something like this:
But the intermediate rent amounts are not really needed. I guess a for loop could be used here too, but is there a mathematical representation of this problem that I can use to get the expected result?
If you have a financial problem, you might try the financial functions of excel.
=-FV(0.1, 6, 192000)
or
=FV(0.1, 6, -192000)
Details, from the FV page on Office Support:
Description
FV, one of the financial functions, calculates the future value of an investment based on a constant interest rate. You can use FV with either periodic, constant payments, or a single lump sum payment.
Syntax
FV(rate, nper, pmt, [pv], [type])
For a more complete description of the arguments in FV and for more information on annuity functions, see PV.
The FV function syntax has the following arguments:
Rate Required
The interest rate per period.
Nper Required
The total number of payment periods in an annuity.
Pmt Required
The payment made each period; it cannot change over the life of the annuity. Typically, pmt contains principal and interest but no other fees or taxes. If pmt is omitted, you must include the pv argument.
Pv Optional
The present value, or the lump-sum amount that a series of future payments is worth right now. If pv is omitted, it is assumed to be 0 (zero), and you must include the pmt argument.
Type Optional
The number 0 or 1 and indicates when payments are due. If type is omitted, it is assumed to be 0.
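To see why FV gives the rent total, here is a small Python sketch of the usual future-value formula with the same argument order and sign convention as the Excel function. The function name and the `when` parameter name are made up for this example, and the sketch assumes a non-zero rate.

```python
def future_value(rate, nper, pmt, pv=0.0, when=0):
    """Future value with Excel-style signs: money paid out is negative.

    when=0 means payments at the end of each period, when=1 at the beginning.
    Assumes rate != 0.
    """
    growth = (1 + rate) ** nper
    return -(pv * growth + pmt * (1 + rate * when) * (growth - 1) / rate)

# Same call as =FV(0.1, 6, -192000)
total_rent = future_value(0.1, 6, -192000)
print(round(total_rent, 2))
```

With pv omitted and payments at the end of each period, the formula reduces to pmt * ((1 + rate)^nper - 1) / rate, which is the same sum the question's loop accumulates.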
Your problem is a geometric series where the initial term is a = 192000 and the common ratio is r = 1.1. (The ratio is not just the 10% added, it includes the 100% that is added to.) To refresh your Algebra II memory, a geometric series is
total = a + a*r + a*r**2 + ... + a*r**(n-1)
The closed-form formula for the sum of the geometric series is
total = a * (r**n - 1) / (r - 1)
(using Python syntax), or, using something closer to Excel syntax,
total = a * (r^n - 1) / (r - 1)
where n is the number of years. Just substitute your values for a, r, and n.
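As a quick check, the closed form can be compared with the term-by-term sum using the values from the question (a minimal sketch):

```python
a = 192000   # first year's rent
r = 1.1      # common ratio: 100% of last year's rent plus 10%
n = 6        # number of years

# Term-by-term: a + a*r + a*r**2 + ... + a*r**(n-1)
loop_total = sum(a * r ** k for k in range(n))

# Closed form for the sum of a geometric series
closed_total = a * (r ** n - 1) / (r - 1)

print(round(loop_total, 2), round(closed_total, 2))
```

Both agree with the 1481397.12 printed by the loop in the question, up to floating-point rounding.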
As the question is about excel it is possible by
Or by using the FV function.
FV returns the future value of an investment based on regular payments and a constant interest rate.
Arguments of the FV function:
Rate: The interest rate per period.
Nper: The total number of payment periods in an annuity.
Pmt: The payment made each period; it cannot change over the life of the annuity. Typically, pmt contains principal and interest but no other fees or taxes. If pmt is omitted, you must include the pv argument.
Pv: The present value, or the lump-sum amount that a series of future payments is worth right now. If pv is omitted, it is assumed to be 0 (zero), and you must include the pmt argument.
Type: The number 0 or 1 and indicates when payments are due. If type is omitted, it is assumed to be 0.
Yet another way is computing it as a geometric series with the non-financial function SERIESSUM:
=SERIESSUM(1.1,0,1,192000*{1,1,1,1,1,1})
The rate multiplier is 1.1, starting from 1.1^0 == 1 and increasing by 1 each year. The result is 1*a + 1.1*b + 1.1^2*c.... The array 192000*{1,1,...} provides the coefficients a, b, c, ... : one array value for the initial total_rent = rent, and one for each subsequent year 1..5 (from range(1,year)).
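For readers more comfortable with code, the power series that SERIESSUM(x, n, m, coefficients) evaluates can be sketched in Python like this. The helper name is made up for the example; it is not an existing library function.

```python
def series_sum(x, n, m, coefficients):
    """Sum of c_i * x**(n + i*m) for i = 0, 1, 2, ... over the coefficients."""
    return sum(c * x ** (n + i * m) for i, c in enumerate(coefficients))

# Same inputs as =SERIESSUM(1.1,0,1,192000*{1,1,1,1,1,1})
total_rent = series_sum(1.1, 0, 1, [192000] * 6)
print(round(total_rent, 2))
```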

why divide sample standard deviation by sqrt(sample size) when calculating z-score

I have been following Khan Academy videos to gain understanding of hypothesis testing, and I must confess that all my understanding thus far is based on that source.
Now, the following videos talk about z-score/hypothesis testing:
Hypothesis Testing
Z-statistic vs T-statistic
Now, coming to my doubt, which is all about the denominator in the z-score:
For the z-score formula, z = (x – μ) / σ,
we use this directly when the standard deviation of the population (σ) is known.
But when it is unknown and we use a sampling distribution,
we have z = (x – μ) / (σ / √n), and we estimate σ with σs, where σs is the standard deviation of the sample and n is the sample size.
Then z = (x – μ) / (σs / √n). Why are we dividing by √n when σs is already known?
Even in the Hypothesis Testing video, Sal divides the sample's standard deviation by √n. Why are we doing this when σs is directly given?
Please help me understand.
I tried applying this on the following question, and faced the problems below:
Question : Yardley designed new perfumes. Yardley company claimed that an average new
perfume bottle lasts 300 days. Another company randomly selects 35 new perfume bottles from
Yardley for testing. The sampled bottles last an average of 190 days, with a
standard deviation of 50 days. If the Yardley's claim were true,
what is the probability that 35 randomly selected bottles would have an average
life of no more than 190 days ?
So, for the above question, when I compute
z = (190-300)/(50/√35), I get z ≈ -13.02, which does not seem to be a possible score, since
a z-score should be between ±3.
And when I compute z = (190-300)/50, that is z = (x – μ) / σ, I seem to get an acceptable answer.
Please help me figure out what I am missing.
I think the origin of the 1/√n is simply whether you're calculating the standard deviation of the lifetime of a single bottle or the standard deviation of the (sample) mean of a set of bottles.
The question indicates that 50 days is the standard deviation of the lifetimes of the set of 35 bottles. That implies that the estimated mean lifetime (190 days) will have a margin of error of about 50/√35 ≈ 8.5 days. Assuming that a similar margin of error applies to the claimed 300-day lifetime, one can calculate the probability that a set of 35 bottles would be measured to last 190 days or less on average, using the complementary error function.
Your z of about -13 looks right, implying that it is extremely unlikely that the claimed 300-day lifetime is consistent with that seen in the 35-bottle experiment. (A z-score is not restricted to ±3; values that far out are just very improbable.)
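The calculation can be sketched in Python with the numbers from the perfume question. This treats the sample standard deviation as a stand-in for σ and uses a normal approximation for the sample mean, which is an assumption of the sketch.

```python
import math

claimed_mean = 300.0   # Yardley's claim, in days
sample_mean = 190.0    # average of the sampled bottles
sample_sd = 50.0       # standard deviation of individual bottle lifetimes
n = 35                 # sample size

# Standard deviation of the sample mean (standard error)
standard_error = sample_sd / math.sqrt(n)

# z-score of the observed sample mean under the claim
z = (sample_mean - claimed_mean) / standard_error

# P(Z <= z) for a standard normal, via the complementary error function
probability = 0.5 * math.erfc(-z / math.sqrt(2))

print(standard_error, z, probability)
```

Dividing by √n is what turns the spread of single bottles (50 days) into the much smaller spread of a 35-bottle average (about 8.5 days), which is why the z-score comes out so extreme.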

Why do I get wrong prediction when using this polynomial forecasting formula

I would like to forecast growth per period.
I have a polynomial regression formula:
y = -5E-05x^2 + 0.0348x + 0.7148.
I translated it to:
=EXP(-5)-0.5*(B4)^2+0.0348*B4+0.7148
where B4 is the running period number (I have 365 days, so B4 is the first period, C4 is the next period, etc.).
I get strange results (my prediction decreases over time instead of increasing), so I guess I didn't interpret Excel's formula correctly.
How can I resolve this problem?
An image of the chart and Excel's formula:
-5E-05 isn't EXP(-5)-0.5.
It is scientific notation for -5 * 10^(-5) = -0.00005, so the formula should be =-0.00005*B4^2+0.0348*B4+0.7148.
For clarification: -7E-05 means -7 * 10^(-5) = -0.00007.
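The difference between the two readings is easy to see in Python, which uses the same E notation for floating-point literals. The function names below are made up for the example.

```python
import math

def trendline(x):
    # Correct reading: -5E-05 is the single number -0.00005
    return -5e-05 * x ** 2 + 0.0348 * x + 0.7148

def misread(x):
    # The question's reading: EXP(-5) - 0.5*x^2 + ...
    return math.exp(-5) - 0.5 * x ** 2 + 0.0348 * x + 0.7148

for period in (1, 100, 300):
    print(period, trendline(period), misread(period))
```

With the correct coefficient the forecast grows over most of the 365 periods, whereas the misread version is dominated by the -0.5*x^2 term and falls rapidly, which matches the decreasing predictions described in the question.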
