Probability of selecting an element from a set - statistics

The expected probability of randomly selecting an element from a set of n elements is P=1.0/n .
Suppose I check P using an unbiased method sufficiently many times. What is the distribution type of P? It is clear that P is not normally distributed, since cannot be negative. Thus, may I correctly assume that P is gamma distributed? And if yes, what are the parameters of this distribution?
Histogram of probabilities of selecting an element from 100-element set for 1000 times is shown here.
Is there any way to convert this to a standard distribution
Now supposed that the observed probability of selecting the given element was P* (P* != P). How can I estimate whether the bias is statistically significant?
EDIT: This is not a homework. I'm doing a hobby project and I need this piece of statistics for it. I've done my last homework ~10 years ago:-)

With repetitions, your distribution will be binomial. So let X be the number of times you select some fixed object, with M total selections
P{ X = x } = ( M choose x ) * (1/N)^x * (N-1/N)^(M-x)
You may find this difficult to compute for large N. It turns out that for sufficiently large N, this actually converges to a normal distribution with probability 1 (Central Limit theorem).
In case P{X=x} will be given by a normal distribution. The mean will be M/N and the variance will be M * (1/N) * ( N-1) / N.

This is a clear binomial distribution with p=1/(number of elements) and n=(number of trials).
To test whether the observed result differs significantly from the expected result, you can do the binomial test.
The dice examples on the two Wikipedia pages should give you some good guidance on how to formulate your problem. In your 100-element, 1000 trial example, that would be like rolling a 100-sided die 1000 times.

As others have noted, you want the Binomial distribution. Your question seems to imply an interest in a continuous approximation to it, though. It can actually be approximated by the normal distribution, and also by the Poisson distribution.

Is your distribution a discrete uniform distribution?

Related

Why is Standard Deviation the square of difference of an obsevation from the mean?

I am learning statistics, and have some basic yet core questions on SD:
s = sample size
n = total number of observations
xi = ith observation
μ = arithmetic mean of all observations
σ = the usual definition of SD, i.e. ((1/(n-1))*sum([(xi-μ)**2 for xi in s])**(1/2) in Python lingo
f = frequency of an observation value
I do understand that (1/n)*sum([xi-μ for xi in s]) would be useless (= 0), but would not (1/n)*sum([abs(xi-μ) for xi in s]) have been a measure of variation?
Why stop at power of 1 or 2? Would ((1/(n-1))*sum([abs((xi-μ)**3) for xi in s])**(1/3) or ((1/(n-1))*sum([(xi-μ)**4 for xi in s])**(1/4) and so on have made any sense?
My notion of squaring is that it 'amplifies' the measure of variation from the arithmetic mean while the simple absolute difference is somewhat a linear scale notionally. Would it not amplify it even more if I cubed it (and made absolute value of course) or quad it?
I do agree computationally cubes and quads would have been more expensive. But with the same argument, the absolute values would have been less expensive... So why squares?
Why is the Normal Distribution like it is, i.e. f = (1/(σ*math.sqrt(2*pi)))*e**((-1/2)*((xi-μ)/σ))?
What impact would it have on the normal distribution formula above if I calculated SD as described in (1) and (2) above?
Is it only a matter of our 'getting used to the squares', it could well have been linear, cubed or quad, and we would have trained our minds likewise?
(I may not have been 100% accurate in my number of opening and closing brackets above, but you will get the idea.)
So, if you are looking for an index of dispersion, you actually don't have to use the standard deviation. You can indeed report mean absolute deviation, the summary statistic you suggested. You merely need to be aware of how each summary statistic behaves, for example the SD assigns more weight to outlying variables. You should also consider how each one can be interpreted. For example, with a normal distribution, we know how much of the distribution lies between ±2SD from the mean. For some discussion of mean absolute deviation (and other measures of average absolute deviation, such as the median average deviation) and their uses see here.
Beyond its use as a measure of spread though, SD is related to variance and this is related to some of the other reasons it's popular, because the variance has some nice mathematical properties. A mathematician or statistician would be able to provide a more informed answer here, but squared difference is a smooth function and is differentiable everywhere, allowing one to analytically identify a minimum, which helps when fitting functions to data using least squares estimation. For more detail and for a comparison with least absolute deviations see here. Another major area where variance shines is that it can be easily decomposed and summed, which is useful for example in ANOVA and regression models generally. See here for a discussion.
As to your questions about raising to higher powers, they actually do have uses in statistics! In general, the mean (which is related to average absolute mean), the variance (related to standard deviation), skewness (related to the third power) and kurtosis (related to the fourth power) are all related to the moments of a distribution. Taking differences raised to those powers and standardizing them provides useful information about the shape of a distribution. The video I linked provides some easy intuition.
For some other answers and a larger discussion of why SD is so popular, See here.
Regarding the relationship of sigma and the normal distribution, sigma is simply a parameter that stretches the standard normal distribution, just like the mean changes its location. This is simply a result of the way the standard normal distribution (a normal distribution with mean=0 and SD=variance=1) is mathematically defined, and note that all normal distributions can be derived from the standard normal distribution. This answer illustrates this. Now, you can parameterize a normal distribution in other ways as well, but I believe you do need to provide sigma, whether using the SD or precisions. I don't think you can even parametrize a normal distribution using just the mean and the mean absolute difference. Now, a deeper question is why normal distributions are so incredibly useful in representing widely different phenomena and crop up everywhere. I think this is related to the Central Limit Theorem, but I do not understand the proofs of the theorem well enough to comment further.

Generating Normally distributed Random Numbers without decimal in excel

I am trying to get random numbers that are normally distributed with a mean of 20 and standard deviation of 2 for a sample size of 225 in Excel but I am getting numbers with decimals ( like 17.5642 , 16.337).
if I round it off, normal distribution cant be achieved. Please help me to get round figures that are normally distributed too....I used the Excel FORMULA "* =NORMINV(RAND(),20,2) *" for generating those numbers. Please suggest to get round figures.
As #circular-ruin has observed, what you are asking for strictly speaking doesn't make sense.
But -- perhaps you can run the Central Limit Theorem backwards. CLT is often used to approximate discrete distributions by normal distributions. You can use it to approximate a normal distribution by a discrete distribution.
If X is binomial with parameters p and n, then it is a standard result that the mean of X is np and the variance of X is np(1-p). Elementary algebra yields that such an X has mean 20 and variance 4 (hence standard deviation 2) if and only if n = 25 and p = 0.8. Thus -- if you simulate a bin(25,0.8) random variable you will get integer values which will be approximately N(20,4). This seems a little more principled then simulating N(20,4) directly and then just rounding. It still isn't normal -- but you really need to drop that requirement if you want your values to be integers.
To simulate a bin(25,0.8) random variable in Excel, just use the formula
=BINOM.INV(25,0.8,RAND())
with just 225 observations the results would probably pass a Chi-squared goodness of fit test for N(20,4) (though the right tail would be under-represented).

How does Laplace Smoothing Effect Prior / Evidence

I have a probability / stats question related to implementing Naive Bayes Classifiers, in particular about implementing Laplace Smoothing to avoid the Zero count issue and overfitting.
From what Ive read, the basic NBC formula using MLE looks like this:
p(C│F_1 ...F_n )=(p(C)p(F_1 |C)...p(F_n |C))/(p(F_1)...p(F_n))
However if one of the p(F_i |C) is zero, the whole probability becomes 0. One solution is Lapace smooth
p(F_i│C)~(x_i+k)/(N+kd)
Where x_i is the number of times F_i appeared in class C, N is the number of times class C occurred and d is the number of distinct values F_i has been known to take on.
My question is this:
What if anything needs to be done to p(C) in the numerator, and p(F_i) in the denominator?
Let f = (f_1 ... f_n). Laplace smoothing affects the class likelihood term, p(f|C). It does not affect the class prior p(C). It does have an effect on the marginal probability of f, in the sense that:
p(f) = \sum_c p(C) * p(f|C)
Whereby the term p(f|C) is different if you're Laplace smoothing than if you're not. But since the denominator is constant for all C, you shouldn't be bothering to evaluate this anyway.
P.S. This isn't really a programming question!

How do I efficiently estimate a probability based on a small amount of evidence?

I've been trying to find an answer to this for months (to be used in a machine learning application), it doesn't seem like it should be a terribly hard problem, but I'm a software engineer, and math was never one of my strengths.
Here is the scenario:
I have a (possibly) unevenly weighted coin and I want to figure out the probability of it coming up heads. I know that coins from the same box that this one came from have an average probability of p, and I also know the standard deviation of these probabilities (call it s).
(If other summary properties of the probabilities of other coins aside from their mean and stddev would be useful, I can probably get them too.)
I toss the coin n times, and it comes up heads h times.
The naive approach is that the probability is just h/n - but if n is small this is unlikely to be accurate.
Is there a computationally efficient way (ie. doesn't involve very very large or very very small numbers) to take p and s into consideration to come up with a more accurate probability estimate, even when n is small?
I'd appreciate it if any answers could use pseudocode rather than mathematical notation since I find most mathematical notation to be impenetrable ;-)
Other answers:
There are some other answers on SO that are similar, but the answers provided are unsatisfactory. For example this is not computationally efficient because it quickly involves numbers way smaller than can be represented even in double-precision floats. And this one turned out to be incorrect.
Unfortunately you can't do machine learning without knowing some basic math---it's like asking somebody for help in programming but not wanting to know about "variables" , "subroutines" and all that if-then stuff.
The better way to do this is called a Bayesian integration, but there is a simpler approximation called "maximum a postieri" (MAP). It's pretty much like the usual thinking except you can put in the prior distribution.
Fancy words, but you may ask, well where did the h/(h+t) formula come from? Of course it's obvious, but it turns out that it is answer that you get when you have "no prior". And the method below is the next level of sophistication up when you add a prior. Going to Bayesian integration would be the next one but that's harder and perhaps unnecessary.
As I understand it the problem is two fold: first you draw a coin from the bag of coins. This coin has a "headsiness" called theta, so that it gives a head theta fraction of the flips. But the theta for this coin comes from the master distribution which I guess I assume is Gaussian with mean P and standard deviation S.
What you do next is to write down the total unnormalized probability (called likelihood) of seeing the whole shebang, all the data: (h heads, t tails)
L = (theta)^h * (1-theta)^t * Gaussian(theta; P, S).
Gaussian(theta; P, S) = exp( -(theta-P)^2/(2*S^2) ) / sqrt(2*Pi*S^2)
This is the meaning of "first draw 1 value of theta from the Gaussian" and then draw h heads and t tails from a coin using that theta.
The MAP principle says, if you don't know theta, find the value which maximizes L given the data that you do know. You do that with calculus. The trick to make it easy is that you take logarithms first. Define LL = log(L). Wherever L is maximized, then LL will be too.
so
LL = hlog(theta) + tlog(1-theta) + -(theta-P)^2 / (2*S^2)) - 1/2 * log(2*pi*S^2)
By calculus to look for extrema you find the value of theta such that dLL/dtheta = 0.
Since the last term with the log has no theta in it you can ignore it.
dLL/dtheta = 0 = (h/theta) + (P-theta)/S^2 - (t/(1-theta)) = 0.
If you can solve this equation for theta you will get an answer, the MAP estimate for theta given the number of heads h and the number of tails t.
If you want a fast approximation, try doing one step of Newton's method, where you start with your proposed theta at the obvious (called maximum likelihood) estimate of theta = h/(h+t).
And where does that 'obvious' estimate come from? If you do the stuff above but don't put in the Gaussian prior: h/theta - t/(1-theta) = 0 you'll come up with theta = h/(h+t).
If your prior probabilities are really small, as is often the case, instead of near 0.5, then a Gaussian prior on theta is probably inappropriate, as it predicts some weight with negative probabilities, clearly wrong. More appropriate is a Gaussian prior on log theta ('lognormal distribution'). Plug it in the same way and work through the calculus.
You can use p as a prior on your estimated probability. This is basically the same as doing pseudocount smoothing. I.e., use
(h + c * p) / (n + c)
as your estimate. When h and n are large, then this just becomes h / n. When h and n are small, this is just c * p / c = p. The choice of c is up to you. You can base it on s but in the end you have to decide how small is too small.
You don't have nearly enough info in this question.
How many coins are in the box? If it's two, then in some scenarios (for example one coin is always heads, the other always tails) knowing p and s would be useful. If it's more than a few, and especially if only some of the coins are only slightly weighted then it is not useful.
What is a small n? 2? 5? 10? 100? What is the probability of a weighted coin coming up heads/tail? 100/0, 60/40, 50.00001/49.99999? How is the weighting distributed? Is every coin one of 2 possible weightings? Do they follow a bell curve? etc.
It boils down to this: the differences between a weighted/unweighted coin, the distribution of weighted coins, and the number coins in your box will all decide what n has to be for you to solve this with a high confidence.
The name for what you're trying to do is a Bernoulli trial. Knowing the name should be helpful in finding better resources.
Response to comment:
If you have differences in p that small, you are going to have to do a lot of trials and there's no getting around it.
Assuming a uniform distribution of bias, p will still be 0.5 and all standard deviation will tell you is that at least some of the coins have a minor bias.
How many tosses, again, will be determined under these circumstances by the weighting of the coins. Even with 500 tosses, you won't get a strong confidence (about 2/3) detecting a .51/.49 split.
In general, what you are looking for is Maximum Likelihood Estimation. Wolfram Demonstration Project has an illustration of estimating the probability of a coin landing head, given a sample of tosses.
Well I'm no math man, but I think the simple Bayesian approach is intuitive and broadly applicable enough to put a little though into it. Others above have already suggested this, but perhaps if your like me you would prefer more verbosity.
In this lingo, you have a set of mutually-exclusive hypotheses, H, and some data D, and you want to find the (posterior) probabilities that each hypothesis Hi is correct given the data. Presumably you would choose the hypothesis that had the largest posterior probability (the MAP as noted above), if you had to choose one. As Matt notes above, what distinguishes the Bayesian approach from only maximum likelihood (finding the H that maximizes Pr(D|H)) is that you also have some PRIOR info regarding which hypotheses are most likely, and you want to incorporate these priors.
So you have from basic probability Pr(H|D) = Pr(D|H)*Pr(H)/Pr(D). You can estimate these Pr(H|D) numerically by creating a series of discrete probabilities Hi for each hypothesis you wish to test, eg [0.0,0.05, 0.1 ... 0.95, 1.0], and then determining your prior Pr(H) for each Hi -- above it is assumed you have a normal distribution of priors, and if that is acceptable you could use the mean and stdev to get each Pr(Hi) -- or use another distribution if you prefer. With coin tosses the Pr(D|H) is of course determined by the binomial using the observed number of successes with n trials and the particular Hi being tested. The denominator Pr(D) may seem daunting but we assume that we have covered all the bases with our hypotheses, so that Pr(D) is the summation of Pr(D|Hi)Pr(H) over all H.
Very simple if you think about it a bit, and maybe not so if you think about it a bit more.

Non-Uniform Random Number Generator Implementation?

I need a random number generator that picks numbers over a specified range with a programmable mean.
For example, I need to pick numbers between 2 and 14 and I need the average of the random numbers to be 5.
I use random number generators a lot. Usually I just need a uniform distribution.
I don't even know what to call this type of distribution.
Thank you for any assistance or insight you can provide.
You might be able to use a binomial distribution, if you're happy with the shape of that distribution. Set n=12 and p=0.25. This will give you a value between 0 and 12 with a mean of 3. Just add 2 to each result to get the range and mean you are looking for.
Edit: As for implementation, you can probably find a library for your chosen language that supports non-uniform distributions (I've written one myself for Java).
A binomial distribution can be approximated fairly easily using a uniform RNG. Simply perform n trials and record the number of successes. So if you have n=10 and p=0.5, it's just like flipping a coin 10 times in a row and counting the number of heads. For p=0.25 just generate uniformly-distributed values between 0 and 3 and only count zeros as successes.
If you want a more efficient implementation, there is a clever algorithm hidden away in the exercises of volume 2 of Knuth's The Art of Computer Programming.
You haven't said what distribution you are after. Regarding your specific example, a function which produced a uniform distribution between 2 and 8 would satisfy your requirements, strictly as you have written them :)
If you want a non-uniform distribution of the random number, then you might have to implement some sort of mapping, e.g:
// returns a number between 0..5 with a custom distribution
int MyCustomDistribution()
{
int r = rand(100); // random number between 0..100
if (r < 10) return 1;
if (r < 30) return 2;
if (r < 42) return 3;
...
}
Based on the Wikipedia sub-article about non-uniform generators, it would seem you want to apply the output of a uniform pseudorandom number generator to an area distribution that meets the desired mean.
You can create a non-uniform PRNG from a uniform one. This makes sense, as you can imagine taking a uniform PRNG that returns 0,1,2 and create a new, non-uniform PRNG by returning 0 for values 0,1 and 1 for the value 2.
There is more to it if you want specific characteristics on the distribution of your new, non-uniform PRNG. This is covered on the Wikipedia page on PRNGs, and the Ziggurat algorithm is specifically mentioned.
With those clues you should be able to search up some code.
My first idea would be:
generate numbers in the range 0..1
scale to the range -9..9 ( x-0.5; x*18)
shift range by 5 -> -4 .. 14 (add 5)
truncate the range to 2..14 (discard numbers < 2)
that should give you numbers in the range you want.
You need a distributed / weighted random number generator. Here's a reference to get you started.
Assign all numbers equal probabilities,
while currentAverage not equal to intendedAverage (whithin possible margin)
pickedNumber = pick one of the possible numbers (at random, uniform probability, if you pick intendedAverage pick again)
if (pickedNumber is greater than intendedAverage and currentAverage<intendedAverage) or (pickedNumber is less than intendedAverage and currentAverage>intendedAverage)
increase pickedNumber's probability by delta at the expense of all others, conserving sum=100%
else
decrease pickedNumber's probability by delta to the benefit of all others, conserving sum=100%
end if
delta=0.98*delta (the rate of decrease of delta should probably be experimented with)
end while

Resources