How to choose between geometric and negative binomial distributions - statistics

A sample question from an actuarial science sample exam goes like this:
"Calculate the probability that there will be at least four months in which no accidents occur before the fourth month in which at least one accident occurs.
A company takes out an insurance policy to cover accidents that occur at its manufacturing plant. The probability that one or more accidents will occur during any given month is 3/5.
The number of accidents that occur in any given month is independent of the number of accidents that occur in all other months."
I interpreted this as: what is the probability (P) of no accidents in each of at least 3 months before one or more accidents occur in the following month?
I assumed a geometric distribution and calculated it two different ways, getting the same answer both times:
Given: "event": "one or more accidents in a month"
p(event) = 3/5; q(non event) = 1-p = 2/5
One event occurs after 3 or more months of no events: $P = q^3 p \sum_{k=0}^{\infty} q^k = q^3 p \cdot \frac{1}{1-q} = q^3 = (2/5)^3 = 0.064$
P = 1 - Prob(one or more accidents occur in one or more of the first three months). Same answer: 0.064.
But 0.064 is not among the answer choices.
The exam offers its solution as using the negative binomial distribution as follows:
"Solution: D
If a month with one or more accidents is regarded as success and k = the number of failures before the fourth success, then k follows a negative binomial distribution and the requested probability is:
Alternatively the solution is
which can be derived directly or by regarding the problem as a negative binomial distribution with
success taken as a month with no accidents
k = the number of failures before the fourth success, and calculating"
So my question is: how do I infer that the correct probability distribution to consider is the negative binomial? In my reading of the question, it is the first "success", not the fourth "success", that occurs after three failures, hence a geometric distribution (or, equivalently, an NB(1, p) distribution).
What am I missing?
Thanks in advance.

I think they are asking for the probability of an event defined relative to the R-th success, here the fourth month in which at least one accident occurs. That is the whole point of the negative binomial distribution: it describes the number of failures (N - R out of N trials) that occur before the R-th success. The geometric distribution is quite different: it only covers the wait for the first success, i.e. the special case R = 1.
I hope my explanation was understandable, also I just stumbled upon this.
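As a rough check of the negative binomial reading, here is a minimal Python sketch (the function and variable names are my own, not part of the exam solution). It assumes a "success" is a month with at least one accident (p = 3/5) and that k counts the accident-free months before the fourth success. It computes P(k >= 4) two ways: as 1 - P(k <= 3) from the negative binomial pmf, and as the probability of at most three successes in the first seven months, which describes the same event.

```python
from math import comb

p = 3 / 5      # probability of at least one accident in a month ("success")
q = 1 - p      # probability of an accident-free month ("failure")
r = 4          # we wait for the fourth success


def neg_binom_pmf(k, r, p):
    """P(exactly k failures occur before the r-th success)."""
    return comb(k + r - 1, k) * p**r * (1 - p) ** k


# At least four failures before the fourth success: 1 - P(k <= 3)
prob_nb = 1 - sum(neg_binom_pmf(k, r, p) for k in range(4))

# Same event: at most three successes among the first seven months
prob_binom = sum(comb(7, j) * p**j * q ** (7 - j) for j in range(4))

print(round(prob_nb, 4), round(prob_binom, 4))
```

Both routes give roughly 0.29 rather than 0.064, because the question counts accident-free months up to the fourth accident month, not just up to the first.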

Related

Statistically compare an experimental data and a theory value

I would appreciate it if someone could answer my question. I have two values. The first one is experimental data deduced from only one measurement. The uncertainty for this value is determined. The second value is a theory result. My question is how to statistically compare these two values. I tried to use a t-test, but failed because the number of degrees of freedom is df = 1 - 1 = 0 (only one experiment was conducted to measure the first value).

binomial distribution z-score value too large

I am trying to solve this question using
n = 500, p = 0.9/100 and q = 1 - 0.9/100,
but I am getting a very large z-score and mean.
Paycheck Errors: The payroll department of a hospital has found that in one year, 0.9% of its paychecks are calculated incorrectly. The hospital has 500 employees.
(a) What is the probability that in one month’s records no paycheck errors are made?
(b) What is the probability that in one month’s records at least one paycheck error is made?
The Z (normal) approximation is a poor approximation to the binomial distribution when npq < 10. For your problem npq = 500 × 0.009 × 0.991 = 4.4595, so the Z approximation is a no-go.
You'd do better to calculate it exactly as a binomial using software, or approximate it as a Poisson with rate λ=np. Once you solve part (a), part (b) is just the complement.
I went ahead and calculated part (a) both ways. The Poisson approximation differs from the exact calculation by only 0.00022.
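For reference, the two calculations for part (a) can be sketched in Python as below. This is only an illustration under the question's own setup (n = 500 paychecks, p = 0.009); the variable names are mine.

```python
from math import exp

n = 500          # paychecks in one month's records
p = 0.9 / 100    # probability that a single paycheck is wrong

# (a) exact binomial probability of zero errors: (1 - p)^n
exact_zero = (1 - p) ** n

# (a) Poisson approximation with rate lambda = n * p
poisson_zero = exp(-n * p)

# (b) at least one error is the complement of zero errors
exact_at_least_one = 1 - exact_zero

print(f"binomial P(X = 0)  = {exact_zero:.5f}")
print(f"Poisson  P(X = 0)  = {poisson_zero:.5f}")
print(f"difference         = {poisson_zero - exact_zero:.5f}")
print(f"binomial P(X >= 1) = {exact_at_least_one:.5f}")
```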
You should use the binomial distribution formula rather than the sampling distribution formula.

Trying to either pull or recreate trendline data using LINEST

I am trying to recreate the formula from a trendline on a graph. Basically, my company is trying to predict the corn yields for next year. All of the actual programmers are out for the week, so they passed it on to me (web developer :D). I've attempted the LINEST formula multiple times with no luck.
Basically, in column B I have the years (1-15, trying to project 16) and in column C I have the actual trend data. I am probably doing this wrong, however. Example:
=LINEST(C16:C30,B16:B30,FALSE,FALSE)
Any help would be appreciated. just tell me if you need the actual file or more information. Thanks in advance!
The fourth argument, concerning the return of additional regression statistics, is optional and is taken as FALSE if omitted, so seems not required for your purposes. The third argument, concerning the intercept with the Y-axis (the value of y when x is 0), is also optional but taken as TRUE if omitted. In your case TRUE seems appropriate so the third parameter seems not required for your purposes.
With your data spanning 15 years, if ending with the current year, it conveniently covers 2001-2015 inclusive and has no information about the value of y (production) in year 2000 (i.e. when x is 0). That value is unlikely to have been 0, which is what is assumed when the third argument is FALSE.
In a simplified example, take production of 50 in 2001, increasing by an (unrealistically!) constant 5 each year. By 2015 this has reached 120, so for 2016 at the same rate of increase production of 125 should be expected. Your formula returns 9.35 so would predict production of 129.35, though we know to expect 125, as given by:
=LINEST(C16:C30,B16:B30)
when added to the latest available (120).
The former is too high a predicted increase because it assumes growth was from 0 to 120 in sixteen years, rather than what I have taken to be from 50 to 120 in fifteen.
As has been mentioned by @Byron Wall, Excel has the TREND function that may be used for linear extrapolation to obtain the next (16th) value like so:
=TREND(C16:C30,B16:B30,16)
This directly returns 125 for the, simplified, sample data.
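The arithmetic behind the simplified example can be reproduced outside Excel. The sketch below is plain least-squares code meant to mirror what LINEST (with and without an intercept) and TREND compute for this data; it is not Excel's implementation, and the helper names are hypothetical.

```python
# Simplified example from above: 50 in year 1, rising by 5 per year to 120 in year 15
years = list(range(1, 16))
production = [50 + 5 * (x - 1) for x in years]


def slope_through_origin(xs, ys):
    """Least-squares slope with the intercept forced to 0 (third LINEST argument FALSE)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def slope_and_intercept(xs, ys):
    """Ordinary least-squares slope and intercept (third LINEST argument TRUE or omitted)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


forced_slope = slope_through_origin(years, production)
slope, intercept = slope_and_intercept(years, production)
forecast_16 = intercept + slope * 16   # linear extrapolation to year 16, as TREND does

print(round(forced_slope, 2), round(slope, 2), round(forecast_16, 2))
```

Forcing the line through the origin inflates the slope to about 9.35, while the fit with an intercept recovers the true increase of 5 per year and a year-16 value of 125.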
HOWEVER, all the above assumes growth is linear. Taking say Brazilian corn production (Million tons) over the period (offset one year) this has been roughly (based on USDA.gov):
The red line is the Linear trend and green a fourth order Polynomial. They happen both to end up at the same place for one year ahead (the hollow bar) but predict different results from the latest six years:
It may be worth charting the data you have, and adding different trend lines, before deciding whether linear extrapolation seems the most promising for forecasting purposes. ‘Wavy’ (cyclical) progress is evident in many datasets.

How do you calculate the probability of getting 1 new customer out of X using Excel?

How do you calculate the probability of getting 1 new customer out of X. I am expecting to come up with 2,3,4...10. I have tried using the probability function, but it doesn't seem to like what I am using for parameters.
=BINOM.DIST(1,500,1/500,TRUE)
I don't know in what sense this "doesn't work". Excel is returning exactly what you asked for: the cumulative probability of 1 success out of 500 trials where the probability of success is 1/500 in each independent trial.
Instead of looking at the cumulative distribution, though, consider each possible number of successes separately:
If the probability of getting a customer is 1/500, then if you ask 500 people, there is a 36.8% chance that you will get zero customers. There is also a 36.8% chance that you'll get one new customer, for a cumulative probability of 73.6%. There is a 99.6% chance that you will get 4 or fewer new customers from 500 tries.
To be clear, the formula for the "Prob" column is
=BINOM.DIST(cell_with_success_number,500,1/500,0)
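The same probabilities can be checked outside Excel. The following is a minimal Python sketch of the binomial mass and cumulative functions; the helper names are my own, with `binom_pmf` playing the role of BINOM.DIST with cumulative = FALSE and `binom_cdf` the role of cumulative = TRUE.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_cdf(k, n, p):
    """Probability of at most k successes in n independent trials."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))


n, p = 500, 1 / 500

# One row per number of new customers: exact and cumulative probability
for k in range(5):
    print(k, round(binom_pmf(k, n, p), 3), round(binom_cdf(k, n, p), 3))
```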
The correct syntax is:
BINOM.DIST(number_s,trials,probability_s,cumulative)
Number_s: The number of successes in trials.
Trials: The number of independent trials.
Probability_s: The probability of success on each trial.
Cumulative: A logical value that determines the form of the function. If cumulative is TRUE, then BINOM.DIST returns the cumulative distribution function, which is the probability that there are at most number_s successes; if FALSE, it returns the probability mass function, which is the probability that there are number_s successes.
http://office.microsoft.com/en-us/excel-help/binom-dist-function-HP010335671.aspx
It returns a probability, so a value between 0 and 1. Note that "Probability of success in each trial" is asking what the chances are of getting a new customer EACH time you call (one trial). By entering 1/500, it appears you are assuming a probability of 0.002 for each trial. Because the last argument is TRUE, the resulting answer of 0.736 is cumulative: it indicates you have a 73.6% chance of getting at most 1 new client (that is, 0 or 1) for each 500 calls, or "trials".
Do you have a number you can enter based on your past experience calling leads?

Compute statistical significance with Excel

I have 2 columns and multiple rows of data in excel. Each column represents an algorithm and the values in rows are the results of these algorithms with different parameters. I want to make statistical significance test of these two algorithms with excel. Can anyone suggest a function?
As a result, it would be nice to state something like "Algorithm A performs 8% better than Algorithm B with .9 probability (or 95% confidence interval)"
The wikipedia article explains accurately what I need:
http://en.wikipedia.org/wiki/Statistical_significance
It seems like a very easy task but I failed to find a scientific measurement function.
Any advice over a built-in function of excel or function snippets are appreciated.
Thanks..
Edit:
After tharkun's comments, I realized I should clarify some points:
The results are merely real numbers between 1-100 (they are percentage values). As each row represents a different parameter, values in a row represents an algorithm's result for this parameter. The results do not depend on each other.
When I take average of all values for Algorithm A and Algorithm B, I see that the mean of all results that Algorithm A produced are 10% higher than Algorithm B's. But I don't know if this is statistically significant or not. In other words, maybe for one parameter Algorithm A scored 100 percent higher than Algorithm B and for the rest Algorithm B has higher scores but just because of this one result, the difference in average is 10%.
And I want to do this calculation using just excel.
Thanks for the clarification. In that case you want to do an independent sample T-Test. Meaning you want to compare the means of two independent data sets.
Excel has a function TTEST, that's what you need.
For your example you should probably use two tails and type 2.
The formula will output a probability value (the p-value), also known as the probability of alpha error. This is the probability of the error you would make if you concluded that the two datasets are different when they actually aren't. The lower the alpha error probability, the stronger the evidence that your sets are different.
You should only accept the difference of the two datasets if the value is lower than 0.01 (1%), or for critical outcomes even 0.001 or lower. You should also know that the t-test needs at least around 30 values per dataset to be reliable enough, and that the type 2 test assumes equal variances of the two datasets. If the variances are not equal, you should use the type 3 test.
http://depts.alverno.edu/nsmt/stats.htm
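To make the difference between the type 2 and type 3 tests concrete, here is a small Python sketch of the two test statistics. It is an illustration only: the scores are made up, the function names are hypothetical, and it stops at the t statistic and degrees of freedom (turning those into the probability that TTEST reports needs the t-distribution, which the Python standard library does not provide).

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(a, b):
    """Two-sample t statistic assuming equal variances (the 'type 2' case)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / sqrt(pooled_var * (1 / na + 1 / nb))
    return t, na + nb - 2


def welch_t(a, b):
    """Two-sample t statistic without the equal-variance assumption (the 'type 3' case)."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (va + vb) ** 2 / (va**2 / (len(a) - 1) + vb**2 / (len(b) - 1))
    return t, df


# Hypothetical scores for two algorithms, for illustration only
algo_a = [1, 2, 3, 4, 5]
algo_b = [2, 4, 6, 8, 10]

print(pooled_t(algo_a, algo_b))
print(welch_t(algo_a, algo_b))
```

With equal sample sizes the two statistics coincide and only the degrees of freedom differ; with unequal sizes and unequal variances the statistics themselves diverge, which is why the choice of type matters.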
