I was looking for a crate that would allow me to easily and randomly generate probability vectors, stochastic matrices or, in general, ndarrays that are stochastic. For people not familiar with these concepts, a probability vector v is defined as follows
0 <= v[i] <= 1, for all i
sum(v[i]) = 1
Similarly, a stochastic matrix is a matrix where each column (or row) satisfies the conditions above.
More generally, a ndarray A would be stochastic if
0 <= A[i, j, k, ..., h] <= 1, for all indices
sum(A[i, j, k, ..., :]) = 1, for all indices
Here, ... just means other indices between k and the last index h. : is a notation to indicate all elements of that dimension.
Is there a crate that does this easily (i.e. you just need to call a function or something like that)? If not, how would you do it? I suppose one could just generate a random ndarray and then change the array by dividing the last dimension by the sum of the elements in that dimension, so, for a 1d array (a probability vector), we could do something like this
use ndarray::Array1;
use ndarray_rand::RandomExt;
use ndarray_rand::rand_distr::Uniform;
fn main() {
let mut a = Array1::random(10, Uniform::new(0.0, 1.0));
a = &a / a.sum();
println!("The sum is {:?}", a.sum());
}
But how would you do it for higher dimensional arrays? We could use a for loop an iterate over all indices, but that doesn't look like it would be efficient. I suppose there must be a way to do this operation in a vectorized form. Is there a function (in the standard library, in the ndarray crate or some other crate) that does this for us? Could we use ndarray-rand to do this without having to divide by the sum?
Requirements
Efficiency is not strictly necessary, but it would be nice.
I am more looking for a simple solution (no more complicated than what I wrote above).
Numerical stability would also be great (e.g. generating random integers and then dividing by the sum may be a better idea than generating random floats and then do the same thing).
I would like to use ndarrays and the related crate, but it's ok if you share also other solutions (which may be useful to others that don't use ndarrays)
I would argue that sampling with whatever distribution you have on hands (U(0,1), Exponential, abs Normal, ...) and then dividing by sum is the wrong way to go.
Start with distribution which has property values being in the [0...1] range and sum of values being equal to 1.
Fortunately, there is such distribution - Dirichlet distribution.
And, apparently, there is a Rust lib to do Dirichlet sampling. Cannot say anything about lib quality.
https://docs.rs/rand_distr/latest/rand_distr/struct.Dirichlet.html
UPDATE
Wrt sampling and then normalizing, problem is, noone knows what would be distribution of the RVs
U(0,1)/(U(0,1) + U(0,1) + ... + U(0,1))
Mean value? Median? Variance? Anything to say at all?
You could even construct it like
[U(0,1);Exp(2);|N(0,1)|;U(0,88);Exp(4.5);...] and as soon as you divide it by sum, values in the vector would be between 0 and 1 and summed to 1. Even less to say about properties of such RVs.
I assume you want to generate random vector/matrices for some purpose, like Monte Carlo etc. Dealing with known distribution with well-defined properties, mean values, variance looks like right way to go.
If I understand correctly, the Dirichlet distribution allows you to generate a probability vector, where the probabilities depend on the initial parameters that you pass, but you would still need to pass these parameters (manually)
Yes, concentration parameters. By default all ones, which makes RVs uniformly distributed in the simplex.
So, are you suggesting the Dirichlet distribution because it was designed on purpose to generate probability vectors?
I'm suggesting Dirichlet because by default it will produce uniformly in-the-simplex distributed RVs, summed to 1 and with well-known statistical properties, starting with PDF, CDF, mean, median, variance, ...
UPDATE II
For Dirichlet
PDF=Prod(xiai-1)/B(a)
So for the case where all ai=1
PDF = 1/B(a)
so given the constrains defining simplex Sum(xi)=1 this is as uniform as it gets.
Related
I'm trying to manipulate individual weights of different neural nets to see how their performance degrades. As part of these experiments, I'm required to sample randomly from their weight tensors, which I've come to understand as sampling with replacement (in the statistical sense). However, since it's high-dimensional, I've been stumped by how to do this in a fair manner. Here are the approaches and research I've put into considering this problem:
This was previously implemented by selecting a random layer and then selecting a random weight in that layer (ignore the implementation of picking a random weight). Since layers are different sizes, we discovered that weights were being sampled unevenly.
I considered what would happen if we sampled according to the numpy.shape of the tensor; however, I realize now that this encounters the same problem as above.
Consider what happens to a rank 2 tensor like this:
[[*, *, *],
[*, *, *, *]]
Selecting a row randomly and then a value from that row results in an unfair selection. This method could work if you're able to assert that this scenario never occurs, but it's far from a general solution.
Note that this possible duplicate actually implements it in this fashion.
I found people suggesting flattening the tensor and use numpy.random.choice to select randomly from a 1D array. That's a simple solution, except I have no idea how to invert the flattened tensor back into its original shape. Further, flattening millions of weights would be a somewhat slow implementation.
I found someone discussing tf.random.multinomial here, but I don't understand enough of it to know whether it's applicable or not.
I ran into this paper about resevoir sampling, but again, it went over my head.
I found another paper which specifically discusses tensors and sampling techniques, but it went even further over my head.
A teammate found this other paper which talks about random sampling from a tensor, but it's only for rank 3 tensors.
Any help understanding how to do this? I'm working in Python with Keras, but I'll take an algorithm in any form that it exists. Thank you in advance.
Before I forget to document the solution we arrived at, I'll talk about the two different paths I see for implementing this:
Use a total ordering on scalar elements of the tensor. This is effectively enumerating your elements, i.e. flattening them. However, you can do this while maintaining the original shape. Consider this pseudocode (in Python-like syntax):
def sample_tensor(tensor, chosen_index: int) -> Tuple[int]:
"""Maps a chosen random number to its index in the given tensor.
Args:
tensor: A ragged-array n-tensor.
chosen_index: An integer in [0, num_scalar_elements_in_tensor).
Returns:
The index that accesses this element in the tensor.
NOTE: Entirely untested, expect it to be fundamentally flawed.
"""
remaining = chosen_index
for (i, sub_list) in enumerate(tensor):
if type(sub_list) is an iterable:
if |sub_list| > remaining:
remaining -= |sub_list|
else:
return i joined with sample_tensor(sub_list, remaining)
else:
if len(sub_list) <= remaining:
return tuple(remaining)
First of all, I'm aware this isn't a sound algorithm. The idea is to count down until you reach your element, with bookkeeping for indices.
We need to make crucial assumptions here. 1) All lists will eventually contain only scalars. 2) By direct consequence, if a list contains lists, assume that it also doesn't contain scalars at the same level. (Stop and convince yourself for (2).)
We also need to make a critical note here too: We are unable to measure the number of scalars in any given list, unless the list is homogeneously consisting of scalars. In order to avoid measuring this magnitude at every point, my algorithm above should be refactored to descend first, and subtract later.
This algorithm has some consequences:
It's the fastest in its entire style of approaching the problem. If you want to write a function f: [0, total_elems) -> Tuple[int], you must know the number of preceding scalar elements along the total ordering of the tensor. This is effectively bound at Theta(l) where l is the number of lists in the tensor (since we can call len on a list of scalars).
It's slow. It's too slow compared to sampling nicer tensors that have a defined shape to them.
It begs the question: can we do better? See the next solution.
Use a probability distribution in conjunction with numpy.random.choice. The idea here is that if we know ahead of time what the distribution of scalars is already like, we can sample fairly at each level of descending the tensor. The hard problem here is building this distribution.
I won't write pseudocode for this, but lay out some objectives:
This can be called only once to build the data structure.
The algorithm needs to combine iterative and recursive techniques to a) build distributions for sibling lists and b) build distributions for descendants, respectively.
The algorithm will need to map indices to a probability distribution respective to sibling lists (note the assumptions discussed above). This does require knowing the number of elements in an arbitrary sub-tensor.
At lower levels where lists contain only scalars, we can simplify by just storing the number of elements in said list (as opposed to storing probabilities of selecting scalars randomly from a 1D array).
You will likely need 2-3 functions: one that utilizes the probability distribution to return an index, a function that builds the distribution object, and possibly a function that just counts elements to help build the distribution.
This is also faster at O(n) where n is the rank of the tensor. I'm convinced this is the fastest possible algorithm, but I lack the time to try to prove it.
You might choose to store the distribution as an ordered dictionary that maps a probability to either another dictionary or the number of elements in a 1D array. I think this might be the most sensible structure.
Note that (2) is truly the same as (1), but we pre-compute knowledge about the densities of the tensor.
I hope this helps.
According to https://spark.apache.org/docs/2.3.0/ml-features.html#tf-idf:
"HashingTF utilizes the hashing trick. A raw feature is mapped into an index (term) by applying a hash function. The hash function used here is MurmurHash 3."
...
"Since a simple modulo on the hashed value is used to determine the vector index, it is advisable to use a power of two as the feature dimension, otherwise the features will not be mapped evenly to the vector indices."
I tried to understand why using a power of two as the feature dimension will map words evenly and tried find some helpful documentation on the internet to understand it, but both attempts were not successful.
Does somebody know or have useful sources on why using the power two maps words evenly to vector indices?
The output of a hash function is b-bit, i.e., there are 2^b possible values to which a feature can be hashed. Additionally, we assume that the 2^b possible values appear uniformly at random.
If d is the feature dimension, an index for a feature f is determined as hash(f) MOD d. Again, hash(f) takes on 2^b possible values. It is easy to see that d has to be a power of two (i.e., a divisor of 2^b) itself in order for uniformity to be maintained.
For a counter-example, consider a 2-bit hash function and a 3-dimensional feature space. As per our assumptions, the hash function outputs 0, 1, 2, or 3 with probability 1/4 each. However, taking mod 3 results in 0 with probability 1/2, or 1 or 2 with probability 1/4 each. Therefore, uniformity is not maintained. On the other hand; if the feature space were 2-dimensional, it is easy to see that the result would be 0 or 1 with probability 1/2 each.
That is, I want to check if the linear system derived from a radiosity problem is convergent.
I also want to know is there any book/paper giving a proof on the convergence of the radiosity problem?
Thanks.
I assume you're solving B = (I - rho*F) B (based on the wikipedia article)
Gauss-Seidel and Jacobi iteration methods are both guaranteed to converge if the matrix is diagonally dominant (Gauss-Seidel is also guaranteed to converge if the matrix is symmetric and positive definite).
The rows of the F matrix (view factors) sum to 1, so if rho (reflectivity) is < 1, which physically it should be, the matrix will be diagonally dominant.
How do I compute the generalized mean for extreme values of p (very close to 0, or very large) with reasonable computational error?
As per your link, the limit for p going to 0 is the geometric mean, for which bounds are derived.
The limit for p going to infinity is the maximum.
I have been struggling with the same problem. Here is how I handled this:
Let gmean_p(x1,...,xn) be the generalized mean where p is real but not 0, and x1, ..xn nonnegative. For M>0, we have gmean_p(x1,...,xn) = M*gmean_p(x1/M,...,xn/M) of which the latter form can be exploited to reduce the computational error. For large p, I use M=max(x1,...,xn) and for p close to 0, I use M=mean(x1,..xn). In case M=0, just add a small positive constant to it. This did the job for me.
I suspect if you're interested in very large or small values of p, it may be best to do some form of algebraic manipulation of the generalized-mean formula before putting in numerical values.
For example, in the small-p limit, one can show that the generalized mean tends to the n'th root of the product x_1*x_2*...x_n. The higher order terms in p involve sums and products of log(x_i), which should also be relatively numerically stable to compute. In fact, I believe the first-order expansion in p has a simple relationship to the variance of log(x_i):
If one applies this formula to a set of 100 random numbers drawn uniformly from the range [0.2, 2], one gets a trend like this:
which here shows the asymptotic formula becoming pretty accurate for p less than about 0.3, and the simple formula only failing when p is less than about 1e-10.
The case of large p, is dominated by that x_i which has the largest magnitude (lets call that index i_max). One can rearrange the generalized mean formula to take the following form, which has less pathological behaviour for large p:
If this is applied (using standard numpy routines including numpy.log1p) to another 100 uniformly distributed samples over [0.2, 2.0], one finds that the rearranged formula agrees essentially exactly with the simple formula, but remains valid for much larger values of p for which the simple formula overflows when computing powers of x_i.
(Note that the left-hand plot has the blue curve for the simple formula shifted up by 0.1 so that one can see where it ends due to overflows. For p less than about 1000, the two curves would otherwise be indistinguishable.)
I think the answer here should be to use a recursive solution. In the same way that mean(1,2,3,4)=mean(mean(1,2),mean(3,4)), you can do this kind of recursion for generalized means. What this buys you is that you won't need to do as many sums of really large numbers and you decrease the likelihood of creating an overflow. Also, the other danger when working with floating point numbers is when adding numbers of very different magnitudes (or subtracting numbers of very similar magnitudes). So to avoid these kinds of rounding errors it might help to sort your data before you try and calculate the generalized mean.
Here's a hunch:
First convert all your numbers into a representation in base p. Now to raise to a power of 1/p or p, you just have to shift them --- so you can very easily do all powers without losing precision.
Work out your mean in base p, then convert the result back to base two.
If that doesn't work, an even less practical hunch:
Try working out the discrete Fourier transform, and relating that to the discrete Fourier transform of the input vector.
I need a random number generator that picks numbers over a specified range with a programmable mean.
For example, I need to pick numbers between 2 and 14 and I need the average of the random numbers to be 5.
I use random number generators a lot. Usually I just need a uniform distribution.
I don't even know what to call this type of distribution.
Thank you for any assistance or insight you can provide.
You might be able to use a binomial distribution, if you're happy with the shape of that distribution. Set n=12 and p=0.25. This will give you a value between 0 and 12 with a mean of 3. Just add 2 to each result to get the range and mean you are looking for.
Edit: As for implementation, you can probably find a library for your chosen language that supports non-uniform distributions (I've written one myself for Java).
A binomial distribution can be approximated fairly easily using a uniform RNG. Simply perform n trials and record the number of successes. So if you have n=10 and p=0.5, it's just like flipping a coin 10 times in a row and counting the number of heads. For p=0.25 just generate uniformly-distributed values between 0 and 3 and only count zeros as successes.
If you want a more efficient implementation, there is a clever algorithm hidden away in the exercises of volume 2 of Knuth's The Art of Computer Programming.
You haven't said what distribution you are after. Regarding your specific example, a function which produced a uniform distribution between 2 and 8 would satisfy your requirements, strictly as you have written them :)
If you want a non-uniform distribution of the random number, then you might have to implement some sort of mapping, e.g:
// returns a number between 0..5 with a custom distribution
int MyCustomDistribution()
{
int r = rand(100); // random number between 0..100
if (r < 10) return 1;
if (r < 30) return 2;
if (r < 42) return 3;
...
}
Based on the Wikipedia sub-article about non-uniform generators, it would seem you want to apply the output of a uniform pseudorandom number generator to an area distribution that meets the desired mean.
You can create a non-uniform PRNG from a uniform one. This makes sense, as you can imagine taking a uniform PRNG that returns 0,1,2 and create a new, non-uniform PRNG by returning 0 for values 0,1 and 1 for the value 2.
There is more to it if you want specific characteristics on the distribution of your new, non-uniform PRNG. This is covered on the Wikipedia page on PRNGs, and the Ziggurat algorithm is specifically mentioned.
With those clues you should be able to search up some code.
My first idea would be:
generate numbers in the range 0..1
scale to the range -9..9 ( x-0.5; x*18)
shift range by 5 -> -4 .. 14 (add 5)
truncate the range to 2..14 (discard numbers < 2)
that should give you numbers in the range you want.
You need a distributed / weighted random number generator. Here's a reference to get you started.
Assign all numbers equal probabilities,
while currentAverage not equal to intendedAverage (whithin possible margin)
pickedNumber = pick one of the possible numbers (at random, uniform probability, if you pick intendedAverage pick again)
if (pickedNumber is greater than intendedAverage and currentAverage<intendedAverage) or (pickedNumber is less than intendedAverage and currentAverage>intendedAverage)
increase pickedNumber's probability by delta at the expense of all others, conserving sum=100%
else
decrease pickedNumber's probability by delta to the benefit of all others, conserving sum=100%
end if
delta=0.98*delta (the rate of decrease of delta should probably be experimented with)
end while