Non-Uniform Random Number Generator Implementation? - statistics

I need a random number generator that picks numbers over a specified range with a programmable mean.
For example, I need to pick numbers between 2 and 14 and I need the average of the random numbers to be 5.
I use random number generators a lot. Usually I just need a uniform distribution.
I don't even know what to call this type of distribution.
Thank you for any assistance or insight you can provide.

You might be able to use a binomial distribution, if you're happy with the shape of that distribution. Set n=12 and p=0.25. This will give you a value between 0 and 12 with a mean of 3. Just add 2 to each result to get the range and mean you are looking for.
Edit: As for implementation, you can probably find a library for your chosen language that supports non-uniform distributions (I've written one myself for Java).
A binomial distribution can be simulated fairly easily using a uniform RNG. Simply perform n trials and record the number of successes. So if you have n=10 and p=0.5, it's just like flipping a coin 10 times in a row and counting the number of heads. For p=0.25 just generate uniformly-distributed integers between 0 and 3 (inclusive) and only count zeros as successes.
If you want a more efficient implementation, there is a clever algorithm hidden away in the exercises of volume 2 of Knuth's The Art of Computer Programming.
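A minimal Python sketch of the trial-counting approach described above, assuming the standard `random` module as the uniform source. The function name `shifted_binomial` and its parameters are made up for illustration; this is the simple n-trials method, not the more efficient algorithm from Knuth.

```python
import random

def shifted_binomial(n=12, p=0.25, offset=2, rng=random):
    """Count successes in n independent trials with success probability p,
    then add offset. With the defaults the result lies in 2..14, mean 5."""
    successes = sum(1 for _ in range(n) if rng.random() < p)
    return successes + offset

rng = random.Random(42)  # fixed seed only so that runs are repeatable
samples = [shifted_binomial(rng=rng) for _ in range(10_000)]
print(min(samples), max(samples), sum(samples) / len(samples))
```

The sample mean should land close to 5, while values near 14 will be very rare because the binomial shape puts little weight in the upper tail.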

You haven't said what distribution you are after. Regarding your specific example, a function which produced a uniform distribution between 2 and 8 would satisfy your requirements, strictly as you have written them :)

If you want a non-uniform distribution of the random number, then you might have to implement some sort of mapping, e.g:
// returns a number between 0..5 with a custom distribution
int MyCustomDistribution()
{
    int r = rand(100); // random number between 0..100
    if (r < 10) return 1;
    if (r < 30) return 2;
    if (r < 42) return 3;
    ...
}
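The same threshold idea can be sketched in Python with a table lookup instead of a chain of `if` statements. The values and cumulative thresholds below are invented for illustration (they are not the elided part of the snippet above), and `pick` is a hypothetical helper name.

```python
import bisect
import random

# Hypothetical table: VALUES[i] is returned when r falls below THRESHOLDS[i]
# but not below the previous threshold. Thresholds are cumulative out of 100.
VALUES = [1, 2, 3, 4, 5]
THRESHOLDS = [10, 30, 42, 70, 100]

def pick(r, values=VALUES, thresholds=THRESHOLDS):
    """Map an integer r in 0..99 to the first value whose threshold exceeds r."""
    return values[bisect.bisect_right(thresholds, r)]

def my_custom_distribution(rng=random):
    """Draw r uniformly from 0..99 and map it through the table."""
    return pick(rng.randrange(100))

print([my_custom_distribution() for _ in range(10)])
```

Keeping `pick` separate from the random draw makes the mapping easy to check by hand, and changing the distribution only means editing the two lists.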

Based on the Wikipedia sub-article about non-uniform generators, it would seem you want to apply the output of a uniform pseudorandom number generator to an area distribution that meets the desired mean.

You can create a non-uniform PRNG from a uniform one. This makes sense, as you can imagine taking a uniform PRNG that returns 0,1,2 and creating a new, non-uniform PRNG by returning 0 for values 0,1 and 1 for the value 2.
There is more to it if you want specific characteristics on the distribution of your new, non-uniform PRNG. This is covered on the Wikipedia page on PRNGs, and the Ziggurat algorithm is specifically mentioned.
With those clues you should be able to search up some code.

My first idea would be:
generate numbers in the range 0..1
scale to the range -9..9 ( x-0.5; x*18)
shift range by 5 -> -4 .. 14 (add 5)
truncate the range to 2..14 (discard numbers < 2)
that should give you numbers in the range you want.

You need a distributed / weighted random number generator. Here's a reference to get you started.

Assign all numbers equal probabilities,
while currentAverage not equal to intendedAverage (within possible margin)
    pickedNumber = pick one of the possible numbers (at random, uniform probability; if you pick intendedAverage, pick again)
    if (pickedNumber is greater than intendedAverage and currentAverage<intendedAverage) or (pickedNumber is less than intendedAverage and currentAverage>intendedAverage)
        increase pickedNumber's probability by delta at the expense of all others, conserving sum=100%
    else
        decrease pickedNumber's probability by delta to the benefit of all others, conserving sum=100%
    end if
    delta=0.98*delta (the rate of decrease of delta should probably be experimented with)
end while

Related

How to generate a random stochastic matrix or ndarray?

I was looking for a crate that would allow me to easily and randomly generate probability vectors, stochastic matrices or, in general, ndarrays that are stochastic. For people not familiar with these concepts, a probability vector v is defined as follows
0 <= v[i] <= 1, for all i
sum(v[i]) = 1
Similarly, a stochastic matrix is a matrix where each column (or row) satisfies the conditions above.
More generally, a ndarray A would be stochastic if
0 <= A[i, j, k, ..., h] <= 1, for all indices
sum(A[i, j, k, ..., :]) = 1, for all indices
Here, ... just means other indices between k and the last index h. : is a notation to indicate all elements of that dimension.
Is there a crate that does this easily (i.e. you just need to call a function or something like that)? If not, how would you do it? I suppose one could just generate a random ndarray and then change the array by dividing the last dimension by the sum of the elements in that dimension, so, for a 1d array (a probability vector), we could do something like this
use ndarray::Array1;
use ndarray_rand::RandomExt;
use ndarray_rand::rand_distr::Uniform;

fn main() {
    let mut a = Array1::random(10, Uniform::new(0.0, 1.0));
    a = &a / a.sum();
    println!("The sum is {:?}", a.sum());
}
But how would you do it for higher dimensional arrays? We could use a for loop and iterate over all indices, but that doesn't look like it would be efficient. I suppose there must be a way to do this operation in a vectorized form. Is there a function (in the standard library, in the ndarray crate or some other crate) that does this for us? Could we use ndarray-rand to do this without having to divide by the sum?
Requirements
Efficiency is not strictly necessary, but it would be nice.
I am more looking for a simple solution (no more complicated than what I wrote above).
Numerical stability would also be great (e.g. generating random integers and then dividing by the sum may be a better idea than generating random floats and then doing the same thing).
I would like to use ndarrays and the related crate, but it's ok if you share also other solutions (which may be useful to others that don't use ndarrays)
I would argue that sampling with whatever distribution you have on hand (U(0,1), Exponential, abs Normal, ...) and then dividing by the sum is the wrong way to go.
Start with a distribution whose values lie in the [0, 1] range and sum to 1.
Fortunately, there is such a distribution: the Dirichlet distribution.
And, apparently, there is a Rust lib to do Dirichlet sampling. Cannot say anything about lib quality.
https://docs.rs/rand_distr/latest/rand_distr/struct.Dirichlet.html
UPDATE
Wrt sampling and then normalizing, the problem is that no one knows what the distribution of the RVs would be
U(0,1)/(U(0,1) + U(0,1) + ... + U(0,1))
Mean value? Median? Variance? Anything to say at all?
You could even construct it like
[U(0,1);Exp(2);|N(0,1)|;U(0,88);Exp(4.5);...] and as soon as you divide it by sum, values in the vector would be between 0 and 1 and summed to 1. Even less to say about properties of such RVs.
I assume you want to generate random vectors/matrices for some purpose, like Monte Carlo etc. Dealing with a known distribution with well-defined properties, mean values and variance looks like the right way to go.
If I understand correctly, the Dirichlet distribution allows you to generate a probability vector, where the probabilities depend on the initial parameters that you pass, but you would still need to pass these parameters (manually)
Yes, concentration parameters. By default all ones, which makes RVs uniformly distributed in the simplex.
So, are you suggesting the Dirichlet distribution because it was designed on purpose to generate probability vectors?
I'm suggesting Dirichlet because by default it will produce uniformly in-the-simplex distributed RVs, summed to 1 and with well-known statistical properties, starting with PDF, CDF, mean, median, variance, ...
UPDATE II
For Dirichlet
PDF = Prod(x_i^(a_i - 1)) / B(a)
So for the case where all a_i = 1
PDF = 1/B(a)
so given the constraint defining the simplex, Sum(x_i) = 1, this is as uniform as it gets.
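For readers who do not use Rust, here is a hedged Python sketch of Dirichlet sampling using only the standard library. It relies on the standard construction that independent Gamma(a_i, 1) draws divided by their sum follow Dirichlet(a); the helper names `dirichlet` and `stochastic_matrix` are made up for illustration, and in Rust the `rand_distr::Dirichlet` type linked above would take their place.

```python
import random

def dirichlet(alphas, rng=random):
    """Draw one probability vector from Dirichlet(alphas)
    by normalising independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def stochastic_matrix(rows, cols, rng=random):
    """Row-stochastic matrix: each row is an independent Dirichlet draw
    with all concentration parameters equal to 1 (uniform on the simplex)."""
    return [dirichlet([1.0] * cols, rng) for _ in range(rows)]

for row in stochastic_matrix(3, 4, random.Random(0)):
    print([round(x, 3) for x in row], round(sum(row), 6))
```

Higher-dimensional stochastic arrays follow the same pattern: draw one Dirichlet vector for every combination of the leading indices and store it along the last axis.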

Spark Hashing TF power of two feature dimension recommendation reasoning

According to https://spark.apache.org/docs/2.3.0/ml-features.html#tf-idf:
"HashingTF utilizes the hashing trick. A raw feature is mapped into an index (term) by applying a hash function. The hash function used here is MurmurHash 3."
...
"Since a simple modulo on the hashed value is used to determine the vector index, it is advisable to use a power of two as the feature dimension, otherwise the features will not be mapped evenly to the vector indices."
I tried to understand why using a power of two as the feature dimension will map words evenly and tried to find some helpful documentation on the internet to understand it, but both attempts were unsuccessful.
Does somebody know, or have useful sources on, why using a power of two maps words evenly to vector indices?
The output of a hash function is b-bit, i.e., there are 2^b possible values to which a feature can be hashed. Additionally, we assume that the 2^b possible values appear uniformly at random.
If d is the feature dimension, an index for a feature f is determined as hash(f) MOD d. Again, hash(f) takes on 2^b possible values. It is easy to see that d has to be a power of two (i.e., a divisor of 2^b) itself in order for uniformity to be maintained.
For a counter-example, consider a 2-bit hash function and a 3-dimensional feature space. As per our assumptions, the hash function outputs 0, 1, 2, or 3 with probability 1/4 each. However, taking mod 3 results in 0 with probability 1/2, or 1 or 2 with probability 1/4 each. Therefore, uniformity is not maintained. On the other hand, if the feature space were 2-dimensional, it is easy to see that the result would be 0 or 1 with probability 1/2 each.
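The counter-example can be checked by brute force. This short Python sketch enumerates every possible b-bit hash value and counts how many land on each index under a plain modulo; `bucket_counts` is a hypothetical helper, and the uniform-hash assumption from the answer is modelled by counting each value exactly once.

```python
from collections import Counter

def bucket_counts(hash_bits, dim):
    """How many of the 2**hash_bits possible hash values map to each index
    when the index is computed as hash % dim."""
    return Counter(h % dim for h in range(2 ** hash_bits))

print(sorted(bucket_counts(2, 3).items()))  # [(0, 2), (1, 1), (2, 1)]
print(sorted(bucket_counts(2, 2).items()))  # [(0, 2), (1, 2)]
```

With a realistic hash width the imbalance for a non-power-of-two dimension is much smaller than in the 2-bit toy case, but the counts are still not all equal, whereas any power of two up to 2^b splits the values exactly evenly.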

Generating Normally distributed Random Numbers without decimal in excel

I am trying to get random numbers that are normally distributed with a mean of 20 and standard deviation of 2 for a sample size of 225 in Excel, but I am getting numbers with decimals (like 17.5642, 16.337).
If I round them off, the distribution is no longer normal. Please help me get whole numbers that are still normally distributed. I used the Excel formula =NORMINV(RAND(),20,2) to generate those numbers.
As @circular-ruin has observed, what you are asking for strictly speaking doesn't make sense.
But -- perhaps you can run the Central Limit Theorem backwards. CLT is often used to approximate discrete distributions by normal distributions. You can use it to approximate a normal distribution by a discrete distribution.
If X is binomial with parameters p and n, then it is a standard result that the mean of X is np and the variance of X is np(1-p). Elementary algebra yields that such an X has mean 20 and variance 4 (hence standard deviation 2) if and only if n = 25 and p = 0.8. Thus -- if you simulate a bin(25,0.8) random variable you will get integer values which will be approximately N(20,4). This seems a little more principled than simulating N(20,4) directly and then just rounding. It still isn't normal -- but you really need to drop that requirement if you want your values to be integers.
To simulate a bin(25,0.8) random variable in Excel, just use the formula
=BINOM.INV(25,0.8,RAND())
With just 225 observations the results would probably pass a Chi-squared goodness of fit test for N(20,4) (though the right tail would be under-represented).
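The "elementary algebra" behind n = 25 and p = 0.8 is short: dividing np(1-p) = variance by np = mean gives 1-p = variance/mean, and then n = mean/p. A small Python sketch of that calculation follows; `binomial_params` is a made-up helper name, and the result is only usable when the variance is smaller than the mean and n comes out as a whole number.

```python
def binomial_params(mean, variance):
    """Solve n*p = mean and n*p*(1-p) = variance for (n, p).
    Requires 0 < variance < mean; n is not guaranteed to be an integer."""
    p = 1.0 - variance / mean
    n = mean / p
    return n, p

n, p = binomial_params(20, 4)
print(round(n), round(p, 3))  # 25 0.8
```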

approximate histogram for streaming string values (card catalog algorithm?)

I have a large list (or stream) of UTF-8 strings sorted lexicographically. I would like to create a histogram with approximately equal values for the counts, varying the bin width as necessary to keep the counts even. In the literature, these are sometimes called equi-height, or equi-depth histograms.
I'm not looking to do the usual word-count bar chart; I'm looking for something more like an old-fashioned library card catalog where you have a set of drawers (bins), and one might hold SAM-SOLD and the next SOLE-STE, while all of Y-ZZZ fits in a single bin. I want to calculate where to put the cutoffs for each bin.
Is there (A) a known algorithm for this, similar to approximate histograms for numeric values? or (B) suggestions on how to encode the strings in a way that a standard numeric histogram algorithm would work. The algorithm should not require prior knowledge of string population.
The best way I can think to do it so far is to simply wait until I have some reasonable amount of data, then form logical bins by:
number_of_strings / bin_count = number_of_strings_in_each_bin
Then, starting at 0, step forward by number_of_strings_in_each_bin to get the bin endpoints.
This has two weaknesses for my use-case. First, it requires two iterations over a potentially very large number of strings, one for the count, one to find the endpoints. More importantly, a good histogram implementation can give an estimate of where in a bin a value falls, and this would be really useful.
Thanks.
If we can't make any assumptions about the data, you are going to have to make a pass to determine bin size.
This means that you have to either start with a bin size rather than bin number or live with a two-pass model. I'd just use linear interpolation to estimate positions between bins, then do a binary search from there.
Of course, if you can make some assumptions about the data, here are some that might help:
For example, you might not know the exact size, but you might know that the value will fall in some interval [a, b]. If you want at most n bins, make the bin size == (b - a)/n.
Alternatively, if you're not particular about exactly equal-sized bins, you could do it in one pass by sampling every m elements on your pass and dump it into an array, where m is something reasonable based on context.
Then, to find the bin endpoints, you'd take the elements at multiples of size/n/m in your array, where size is the total number of elements seen and n is the number of bins.
The solution I came up with addresses the lack of up-front information about the population by using reservoir sampling. Reservoir sampling lets you efficiently take a random sample of a given size, from a population of an unknown size. See Wikipedia for more details. Reservoir sampling provides a random sample regardless of whether the stream is ordered or not.
We make one pass through the data, gathering a sample. For the sample we have explicit information about the number of elements as well as their distribution.
For the histogram, I used a Guava RangeMap. I picked the endpoints of the ranges to provide an even number of results in each range (sample_size / number_of_bins). The Integer in the map merely stores the order of the ranges, from 1 to n. This allows me to estimate the proportion of records that fall within two values: If there are 100 equal sized bins, and the values fall in bin 25 and bin 75, then I can estimate that approximately 50% of the population falls between those values.
This approach has the advantage of working for any Comparable data type.
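A rough Python sketch of the reservoir-sampling approach described above, not the Guava-based code from the answer. It uses the classic "Algorithm R" form of reservoir sampling and then reads bin cutoffs off the sorted sample; `reservoir_sample` and `bin_cutoffs` are hypothetical helper names.

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Algorithm R: keep a uniform random sample of up to k items
    from a stream whose length is not known in advance."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)           # fill the reservoir first
        else:
            j = rng.randrange(i + 1)      # uniform index in 0..i
            if j < k:
                sample[j] = item          # replace with probability k/(i+1)
    return sample

def bin_cutoffs(sample, bins):
    """Cut points splitting the sorted sample into `bins` roughly equal parts.
    Each cutoff is the first element of the next bin."""
    ordered = sorted(sample)
    return [ordered[(i * len(ordered)) // bins] for i in range(1, bins)]

words = (f"word{i:05d}" for i in range(100_000))  # stand-in for the real stream
sample = reservoir_sample(words, 1000, random.Random(7))
print(bin_cutoffs(sample, 4))
```

Because the cutoffs come from a sample, the resulting bins are only approximately equal in size on the full population; a larger reservoir tightens the estimate at the cost of memory.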

Probability of selecting an element from a set

The expected probability of randomly selecting an element from a set of n elements is P=1.0/n .
Suppose I check P using an unbiased method sufficiently many times. What is the distribution type of P? It is clear that P is not normally distributed, since it cannot be negative. Thus, may I correctly assume that P is gamma distributed? And if yes, what are the parameters of this distribution?
Histogram of probabilities of selecting an element from 100-element set for 1000 times is shown here.
Is there any way to convert this to a standard distribution?
Now suppose that the observed probability of selecting the given element was P* (P* != P). How can I estimate whether the bias is statistically significant?
EDIT: This is not a homework. I'm doing a hobby project and I need this piece of statistics for it. I've done my last homework ~10 years ago:-)
With repetitions, your distribution will be binomial. So let X be the number of times you select some fixed object, with M total selections:
P{ X = x } = ( M choose x ) * (1/N)^x * ((N-1)/N)^(M-x)
You may find this difficult to compute for large M. It turns out that for sufficiently large M, this is well approximated by a normal distribution (Central Limit Theorem).
In that case P{X=x} is given approximately by a normal distribution. The mean will be M/N and the variance will be M * (1/N) * (N-1)/N.
This is a clear binomial distribution with p=1/(number of elements) and n=(number of trials).
To test whether the observed result differs significantly from the expected result, you can do the binomial test.
The dice examples on the two Wikipedia pages should give you some good guidance on how to formulate your problem. In your 100-element, 1000 trial example, that would be like rolling a 100-sided die 1000 times.
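As a rough illustration of the binomial test for the 100-element, 1000-trial example, here is a standard-library Python sketch. It uses one common definition of the exact two-sided p-value (the total probability of all outcomes that are no more likely than the observed count); the function names are made up, and the direct formula is only suitable for moderate n, since very large n would overflow the floating-point conversion.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_test_two_sided(k, n, p):
    """Exact two-sided p-value: sum the probabilities of all outcomes
    that are at most as likely as the observed count k."""
    cutoff = binom_pmf(k, n, p) * (1 + 1e-9)  # small tolerance for float ties
    total = sum(q for q in (binom_pmf(i, n, p) for i in range(n + 1)) if q <= cutoff)
    return min(1.0, total)

# Element picked 18 times in 1000 draws from a 100-element set (expected: 10).
print(binom_test_two_sided(18, 1000, 1 / 100))
```

A small p-value suggests the observed frequency P* is unlikely under the unbiased hypothesis P = 1/n; the significance threshold (for example 0.05) is a choice left to the reader.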
As others have noted, you want the Binomial distribution. Your question seems to imply an interest in a continuous approximation to it, though. It can actually be approximated by the normal distribution, and also by the Poisson distribution.
Is your distribution a discrete uniform distribution?

Resources