I'm trying to manipulate individual weights of different neural nets to see how their performance degrades. As part of these experiments, I'm required to sample randomly from their weight tensors, which I've come to understand as sampling with replacement (in the statistical sense). However, since it's high-dimensional, I've been stumped by how to do this in a fair manner. Here are the approaches and research I've put into considering this problem:
This was previously implemented by selecting a random layer and then selecting a random weight in that layer (ignore the implementation of picking a random weight). Since layers are different sizes, we discovered that weights were being sampled unevenly.
I considered what would happen if we sampled according to the numpy.shape of the tensor; however, I realize now that this encounters the same problem as above.
Consider what happens to a rank 2 tensor like this:
[[*, *, *],
[*, *, *, *]]
Selecting a row randomly and then a value from that row results in an unfair selection. This method could work if you're able to assert that this scenario never occurs, but it's far from a general solution.
Note that this possible duplicate actually implements it in this fashion.
I found people suggesting flattening the tensor and using numpy.random.choice to select randomly from a 1D array. That's a simple solution, except I have no idea how to map an index in the flattened tensor back to a position in its original shape. Further, flattening millions of weights would be a somewhat slow implementation.
I found someone discussing tf.random.multinomial here, but I don't understand enough of it to know whether it's applicable or not.
I ran into this paper about reservoir sampling, but again, it went over my head.
I found another paper which specifically discusses tensors and sampling techniques, but it went even further over my head.
A teammate found this other paper which talks about random sampling from a tensor, but it's only for rank 3 tensors.
Any help understanding how to do this? I'm working in Python with Keras, but I'll take an algorithm in any form that it exists. Thank you in advance.
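For weight tensors that have a regular shape, the flatten-and-choose idea can be done without flattening anything: draw one integer over the total number of weights, find the layer it falls in, and convert the remainder to a multi-index with numpy.unravel_index. A minimal sketch, assuming the shapes are available as a list (for Keras, something like [w.shape for w in model.get_weights()] would be one way to get them); sample_weight is a made-up name:

```python
import numpy as np

def sample_weight(layer_shapes, rng):
    """Pick one weight uniformly over all layers; return (layer, index tuple)."""
    sizes = np.array([int(np.prod(s)) for s in layer_shapes])
    offsets = np.cumsum(sizes)                      # running total of weights per layer
    flat = int(rng.integers(offsets[-1]))           # one draw over every weight
    layer = int(np.searchsorted(offsets, flat, side="right"))
    local = flat - (int(offsets[layer - 1]) if layer else 0)
    # unravel_index turns the position inside the layer into a multi-index
    index = np.unravel_index(local, layer_shapes[layer])
    return layer, tuple(int(i) for i in index)

shapes = [(2, 3), (3, 4)]                           # hypothetical layer shapes
rng = np.random.default_rng(0)
draws = [sample_weight(shapes, rng) for _ in range(2000)]
```

Because the draw is over all weights at once, a layer is hit in proportion to its size, which is what the layer-then-weight approach was missing.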

Before I forget to document the solution we arrived at, I'll talk about the two different paths I see for implementing this:
(1) Use a total ordering on scalar elements of the tensor. This is effectively enumerating your elements, i.e. flattening them. However, you can do this while maintaining the original shape. Consider this pseudocode (in Python-like syntax):
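from typing import Tuple

def sample_tensor(tensor, chosen_index: int) -> Tuple[int, ...]:
    """Maps a chosen random number to its index in the given tensor.

    tensor: A ragged-array n-tensor (nested lists whose leaves are scalars).
    chosen_index: An integer in [0, num_scalar_elements_in_tensor).
    Returns: The index that accesses this element in the tensor.
    NOTE: Only lightly checked; it recounts every sub-list it skips.
    """
    def count(t):
        # Number of scalars at or below t (a scalar counts as one).
        return sum(count(s) for s in t) if isinstance(t, list) else 1

    remaining = chosen_index
    for i, sub_list in enumerate(tensor):
        size = count(sub_list)
        if remaining < size:
            # The chosen element lives in (or is) this entry.
            if isinstance(sub_list, list):
                return (i,) + sample_tensor(sub_list, remaining)
            return (i,)
        remaining -= size  # skip past this entry's scalars
    raise IndexError("chosen_index is out of range")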
First of all, I'm aware this isn't a sound algorithm. The idea is to count down until you reach your element, with bookkeeping for indices.
We need to make crucial assumptions here. 1) All lists will eventually contain only scalars. 2) By direct consequence, if a list contains lists, assume that it also doesn't contain scalars at the same level. (Stop and convince yourself for (2).)
We also need to make a critical note here: we are unable to measure the number of scalars in any given list, unless the list consists solely of scalars. In order to avoid measuring this magnitude at every point, my algorithm above should be refactored to descend first, and subtract later.
This algorithm has some consequences:
It's the fastest in its entire style of approaching the problem. If you want to write a function f: [0, total_elems) -> Tuple[int], you must know the number of preceding scalar elements along the total ordering of the tensor. This is effectively bound at Theta(l) where l is the number of lists in the tensor (since we can call len on a list of scalars).
It's slow. It's too slow compared to sampling nicer tensors that have a defined shape to them.
It raises the question: can we do better? See the next solution.
(2) Use a probability distribution in conjunction with numpy.random.choice. The idea here is that if we know ahead of time how the scalars are distributed across the sub-lists, we can sample fairly at each level while descending the tensor. The hard problem here is building this distribution.
I won't write pseudocode for this, but lay out some objectives:
This can be called only once to build the data structure.
The algorithm needs to combine iterative and recursive techniques to a) build distributions for sibling lists and b) build distributions for descendants, respectively.
The algorithm will need to map indices to a probability distribution respective to sibling lists (note the assumptions discussed above). This does require knowing the number of elements in an arbitrary sub-tensor.
At lower levels where lists contain only scalars, we can simplify by just storing the number of elements in said list (as opposed to storing probabilities of selecting scalars randomly from a 1D array).
You will likely need 2-3 functions: one that utilizes the probability distribution to return an index, a function that builds the distribution object, and possibly a function that just counts elements to help build the distribution.
This is also faster at O(n) where n is the rank of the tensor. I'm convinced this is the fastest possible algorithm, but I lack the time to try to prove it.
You might choose to store the distribution as an ordered dictionary that maps a probability to either another dictionary or the number of elements in a 1D array. I think this might be the most sensible structure.
Note that (2) is truly the same as (1), but we pre-compute knowledge about the densities of the tensor.
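The objectives above can be sketched roughly as follows. This is only one possible reading, not a tested implementation: instead of calling numpy.random.choice with probabilities at each level, it stores cumulative scalar counts per level and uses a single integer draw plus a binary search, which gives the same uniform distribution over scalars. The names build_counts and sample_index are made up for illustration, and the assumptions about list structure discussed earlier still apply.

```python
import random
from bisect import bisect_right
from itertools import accumulate

def build_counts(tensor):
    """Precompute scalar counts once.

    Returns (total, node): node is an int for a list of scalars, or a
    (cumulative_counts, child_nodes) pair for a list of lists.
    """
    if not tensor or not isinstance(tensor[0], list):
        return len(tensor), len(tensor)          # leaf list: store only its length
    totals, nodes = zip(*(build_counts(child) for child in tensor))
    cumulative = list(accumulate(totals))
    return cumulative[-1], (cumulative, list(nodes))

def sample_index(node, total, rng=random):
    """Draw one scalar uniformly and return its index tuple."""
    r = rng.randrange(total)
    index = []
    while not isinstance(node, int):
        cumulative, children = node
        i = bisect_right(cumulative, r)          # sibling that contains position r
        r -= cumulative[i - 1] if i else 0       # position relative to that sibling
        index.append(i)
        node = children[i]
    index.append(r)                              # position inside the scalar list
    return tuple(index)

ragged = [[1, 2, 3], [4, 5, 6, 7]]
total, node = build_counts(ragged)
print(sample_index(node, total))
```

Building the structure visits every list once; each sample afterwards only does one binary search per level of nesting.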
I hope this helps.


How to generate a random stochastic matrix or ndarray?

I was looking for a crate that would allow me to easily and randomly generate probability vectors, stochastic matrices or, in general, ndarrays that are stochastic. For people not familiar with these concepts, a probability vector v is defined as follows
0 <= v[i] <= 1, for all i
sum(v[i]) = 1
Similarly, a stochastic matrix is a matrix where each column (or row) satisfies the conditions above.
More generally, a ndarray A would be stochastic if
0 <= A[i, j, k, ..., h] <= 1, for all indices
sum(A[i, j, k, ..., :]) = 1, for all indices
Here, ... just means other indices between k and the last index h. : is a notation to indicate all elements of that dimension.
Is there a crate that does this easily (i.e. you just need to call a function or something like that)? If not, how would you do it? I suppose one could just generate a random ndarray and then change the array by dividing the last dimension by the sum of the elements in that dimension, so, for a 1d array (a probability vector), we could do something like this
use ndarray::Array1;
use ndarray_rand::RandomExt;
use ndarray_rand::rand_distr::Uniform;

fn main() {
    // Draw 10 values from U(0, 1), then rescale them so that they sum to 1.
    let mut a = Array1::random(10, Uniform::new(0.0, 1.0));
    a = &a / a.sum();
    println!("The sum is {:?}", a.sum());
}
But how would you do it for higher dimensional arrays? We could use a for loop and iterate over all indices, but that doesn't look like it would be efficient. I suppose there must be a way to do this operation in a vectorized form. Is there a function (in the standard library, in the ndarray crate or some other crate) that does this for us? Could we use ndarray-rand to do this without having to divide by the sum?
Efficiency is not strictly necessary, but it would be nice.
I am more looking for a simple solution (no more complicated than what I wrote above).
Numerical stability would also be great (e.g. generating random integers and then dividing by the sum may be a better idea than generating random floats and then doing the same thing).
I would like to use ndarrays and the related crate, but it's ok if you share also other solutions (which may be useful to others that don't use ndarrays)
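For readers who don't use ndarray, here is the "divide the last dimension by its sum" idea in vectorized form, written with NumPy rather than the Rust crate purely as an illustration. The function name random_stochastic is made up, and this is a sketch of the normalization step only; it says nothing about which distribution the result follows.

```python
import numpy as np

def random_stochastic(shape, rng=None):
    """Return an array of the given shape whose last axis sums to 1."""
    rng = np.random.default_rng() if rng is None else rng
    a = rng.uniform(0.0, 1.0, size=shape)
    # keepdims=True keeps the summed axis with length 1, so the division
    # broadcasts over the last axis without any explicit loop.
    return a / a.sum(axis=-1, keepdims=True)

A = random_stochastic((2, 3, 4), np.random.default_rng(0))
print(A.sum(axis=-1))   # every entry should be 1 up to rounding
```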
I would argue that sampling with whatever distribution you have on hand (U(0,1), Exponential, abs Normal, ...) and then dividing by the sum is the wrong way to go.
Start with a distribution whose values lie in the [0...1] range and sum to 1.
Fortunately, there is such a distribution: the Dirichlet distribution.
And, apparently, there is a Rust lib to do Dirichlet sampling. I cannot say anything about the lib's quality.
Wrt sampling and then normalizing, the problem is that no one knows what the distribution of the RVs would be
U(0,1)/(U(0,1) + U(0,1) + ... + U(0,1))
Mean value? Median? Variance? Anything to say at all?
You could even construct it like
[U(0,1);Exp(2);|N(0,1)|;U(0,88);Exp(4.5);...] and as soon as you divide it by the sum, the values in the vector would be between 0 and 1 and sum to 1. There is even less to say about the properties of such RVs.
I assume you want to generate random vectors/matrices for some purpose, like Monte Carlo etc. Dealing with a known distribution with well-defined properties (mean values, variance) looks like the right way to go.
If I understand correctly, the Dirichlet distribution allows you to generate a probability vector, where the probabilities depend on the initial parameters that you pass, but you would still need to pass these parameters (manually)
Yes, concentration parameters. By default all ones, which makes RVs uniformly distributed in the simplex.
So, are you suggesting the Dirichlet distribution because it was designed on purpose to generate probability vectors?
I'm suggesting Dirichlet because by default it will produce uniformly in-the-simplex distributed RVs, summed to 1 and with well-known statistical properties, starting with PDF, CDF, mean, median, variance, ...
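For Dirichlet with parameters a=(a1, ..., aK), the density on the simplex is
PDF(x1, ..., xK) = 1/B(a) * Prod(xi^(ai-1))
So for the case where all ai=1
PDF = 1/B(a)
so given the constraint defining the simplex, Sum(xi)=1, this is as uniform as it gets.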
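To make the suggestion concrete, here is a small sketch in NumPy (not the Rust lib mentioned above, whose API I haven't checked). It draws a stochastic ndarray directly from a Dirichlet with all-ones concentration; the shape (2, 3) and the vector length k are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 4                         # length of each probability vector (arbitrary)
alpha = np.ones(k)            # all ai = 1: flat density on the simplex
# size=(2, 3) draws 2*3 independent vectors, so the result has shape (2, 3, k)
# and every slice A[i, j, :] is a probability vector.
A = rng.dirichlet(alpha, size=(2, 3))
print(A.shape, A.sum(axis=-1))
```

Passing other concentration values in alpha shifts the mass toward or away from particular components, with the mean of component i equal to ai / Sum(ai).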

Does adding a list of Word2Vec embeddings give a meaningful representation?

I'm using a pre-trained word2vec model (word2vec-google-news-300) to get the embeddings for a given list of words. Please note that this is NOT a list of words that we get after tokenizing a sentence, it is just a list of words that describe a given image.
Now I'd like to get a single vector representation for the entire list. Does adding all the individual word embeddings make sense? Or should I consider averaging?
Also, I would like the vector to be of a constant size so concatenating the embeddings is not an option.
It would be really helpful if someone can explain the intuition behind considering either one of the above approaches.
Averaging is most typical, when someone is looking for a super-simple way to turn a bag-of-words into a single fixed-length vector.
You could try a simple sum, as well.
But note that the key difference between the sum and average is that the average divides by the number of input vectors. Thus they both result in a vector that's pointing in the exact same 'direction', just of different magnitude. And, the most-often-used way of comparing such vectors, cosine-similarity, is oblivious to magnitudes. So for a lot of cosine-similarity-based ways of later comparing the vectors, sum-vs-average will give identical results.
On the other hand, if you're comparing the vectors in other ways, like via euclidean-distances, or feeding them into other classifiers, sum-vs-average could make a difference.
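A quick way to convince yourself of the sum-vs-average point is to try it on stand-in data. The vectors below are random placeholders, not real word2vec embeddings, and cosine is a small helper written for this example.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1D vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
words = rng.normal(size=(5, 300))   # placeholders for five 300-d word vectors
query = rng.normal(size=300)        # placeholder vector to compare against

total = words.sum(axis=0)
mean = words.mean(axis=0)

# Same direction, so cosine similarity agrees up to rounding ...
print(cosine(total, query), cosine(mean, query))
# ... but the magnitudes differ, so euclidean distances do not.
print(np.linalg.norm(total - query), np.linalg.norm(mean - query))
```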
Similarly, some might try unit-length-normalizing all vectors before use in any comparisons. After such a pre-use normalization, then:
euclidean-distance (smallest to largest) & cosine-similarity (largest-to-smallest) will generate identical lists of nearest-neighbors
the average (or sum) will end up pointing in a different direction than it would without normalization, as the unit-normalization will have upped some vectors' magnitudes, and lowered others, changing their relative contributions to the average.
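Both effects of unit-length normalization can be checked on toy data. Again, these are random placeholder vectors with deliberately varied magnitudes, not real embeddings.

```python
import numpy as np

rng = np.random.default_rng(1)
# 50 placeholder vectors with magnitudes spread over roughly a 10x range
vecs = rng.normal(size=(50, 20)) * rng.uniform(0.5, 5.0, size=(50, 1))
unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

q = unit[0]
cos_rank = np.argsort(-(unit @ q))                       # most similar first
euc_rank = np.argsort(np.linalg.norm(unit - q, axis=1))  # closest first

raw_mean, unit_mean = vecs.mean(axis=0), unit.mean(axis=0)
direction_match = float(raw_mean @ unit_mean /
                        (np.linalg.norm(raw_mean) * np.linalg.norm(unit_mean)))
print(np.array_equal(cos_rank, euc_rank), direction_match)
```

For unit vectors the squared euclidean distance is 2 - 2 * cosine, which is why the two neighbor orderings agree, while direction_match coming out below 1 shows that the average of the normalized vectors points somewhere else than the average of the raw ones.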
What should you do? There's no universally right answer - depending on your dataset & goals, & the ways your downstream steps use the vectors, different choices might offer slight advantages in whatever final quality/desirability evaluation you perform. So it's common to try a few different permutations, along with varying other parameters.
The GoogleNews vectors were trained on news articles back around 2013; their word senses thus may not be optimal for an image-labeling task. If you have enough of your own data, or can collect it, training your own word-vectors might result in better results. (Both the use of domain-specific data, & the ability to tune training parameters based on your own evaluations, could offer benefits - especially when your domain is unique, or the tokens aren't typical natural-language sentences.)
There are other ways to create a single summary vector for a run-of-tokens, not just arithmetical-combo-of-word-vectors. One that's a small variation on the word2vec algorithm often goes by the name Doc2Vec (or 'Paragraph Vector') - it may also be worth exploring.
There are also ways to compare bags-of-tokens, leveraging word-vectors, that don't collapse the bag-of-tokens to a single fixed-length vector first - and while they're more expensive to calculate, sometimes offer better pairwise similarity/distance results than simple cosine-similarity. One such alternate comparison is called "Word Mover's Distance" - at some point, you may want to try that as well.

How to compute the iteration matrix for nth NLBGS iteration

I was wondering if there is a direct way of computing the iteration matrix for the nth Linear Block Gauss Seidel iteration within OpenMDAO?
thank you
If I understand you correctly, you are referring to the matrix-form of the Gauss Seidel algorithm where you take Ax=b, and break A up into the Diagonal (D), Lower (L) and Upper (U) parts, then use those parts to compute the next iterate.
Specifically, you compute [D-L]^-1. This, I believe, is what you are referring to as the "iteration matrix" (I am not familiar with this terminology, but based on the algorithm I'm comfortable making an educated guess).
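For reference, here is the matrix form on a small made-up system, outside of OpenMDAO entirely. It assumes the common textbook sign convention A = D - L - U, under which the update is x_next = [D-L]^-1 (U x + b) and the matrix G = [D-L]^-1 U is what textbooks usually call the iteration matrix; whether that is exactly what the question means is a guess on my part.

```python
import numpy as np

A = np.array([[4.0, -1.0, 0.0],
              [-1.0, 4.0, -1.0],
              [0.0, -1.0, 4.0]])       # small diagonally dominant example
b = np.array([1.0, 2.0, 3.0])

D = np.diag(np.diag(A))
L = -np.tril(A, k=-1)                  # sign convention: A = D - L - U
U = -np.triu(A, k=1)

G = np.linalg.solve(D - L, U)          # [D-L]^-1 U without forming the inverse
c = np.linalg.solve(D - L, b)

x = np.zeros_like(b)
for _ in range(50):
    x = G @ x + c                      # one Gauss-Seidel sweep in matrix form

print(x, max(abs(np.linalg.eigvals(G))))
```

The iteration converges when the spectral radius of G (the second printed value) is below 1.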
This formulation of the algorithm is useful to think about and a simple way to implement it, but OpenMDAO takes a different approach. The LBGS algorithm implemented in OpenMDAO is set up to work in a matrix-free manner. That means it only interacts with the linear operator methods solve_linear and apply_linear and never explicitly assembles the A matrix at all. Hence there isn't an opportunity to split A up into D, L, U.
Depending on the way you constructed the model, the A matrix you would need might or might not be there at all because OpenMDAO is capable of working in a completely matrix free context. However, if all of your components use the compute_partials or linearize methods to provide partial derivatives then the data you would need for the A matrix does exist in memory.
You'll have to dig for it a bit, and ironically the best place to see how to do that is in the direct solver which does actually require the matrix be formed to compute a factorization.
Also, in that code you'll see a function that can iteratively call the linear operator to construct a dense matrix even if the underlying components don't provide their partials directly. Please note that this approach for assembling the matrix is extremely slow and is not recommended for normal operations.
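The general idea behind building a dense matrix from a matrix-free operator looks roughly like this. It is a generic NumPy sketch, not OpenMDAO's code or API: assemble_dense and apply_operator are hypothetical names, and the operator here is just a stand-in second-difference stencil.

```python
import numpy as np

def assemble_dense(apply_operator, n):
    """Build the n x n matrix of a linear operator, one column per unit vector."""
    A = np.empty((n, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = 1.0
        A[:, j] = apply_operator(e)    # column j is the operator applied to e_j
    return A

def apply_operator(x):
    """Stand-in matrix-free operator: y = 2*x minus both neighbors."""
    y = 2.0 * x
    y[1:] -= x[:-1]
    y[:-1] -= x[1:]
    return y

A = assemble_dense(apply_operator, 4)
print(A)
```

This needs n operator applications for an n x n matrix, which is why it gets slow for large models.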

Quasi-Monte-Carlo vs. variable dimensionality?

I've been looking through the Matlab documentation on using quasi-random sampling of N-dimensional unit cubes. This represents a problem with N stochastic parameters. Based on the fact that it is a unit cube, I presume that I need to use the inverse CDF of each parameter to map from the [0,1] domain to the value range of each parameter.
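To illustrate the inverse-CDF mapping, here is a small standard-library Python sketch rather than Matlab. It uses a hand-written Halton sequence as the quasi-random source, and the two parameter distributions (an exponential and a normal) are invented purely for the example.

```python
import math
from statistics import NormalDist

def van_der_corput(n, base):
    """n-th element (n >= 1) of the van der Corput sequence in the given base."""
    q, denom = 0.0, 1.0
    while n:
        n, digit = divmod(n, base)
        denom *= base
        q += digit / denom
    return q

def halton(n_points, bases=(2, 3)):
    """First n_points of the Halton sequence, one coprime base per dimension."""
    return [[van_der_corput(i, b) for b in bases] for i in range(1, n_points + 1)]

def to_parameters(u):
    """Map one point of the unit square to two hypothetical parameters."""
    lifetime = -math.log(1.0 - u[0]) / 0.5                  # inverse CDF of Exp(rate=0.5)
    offset = NormalDist(mu=10.0, sigma=2.0).inv_cdf(u[1])   # inverse CDF of N(10, 2^2)
    return lifetime, offset

samples = [to_parameters(u) for u in halton(8)]
print(samples[0])
```

Starting the sequence at index 1 keeps every coordinate strictly inside (0, 1), which the inverse CDFs need.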
I would like to try this on a problem for which I now use Monte Carlo. Unfortunately, the problem I'm analyzing does not have a fixed number of dimensions. For each instantiation of the problem, I generate a variable number of widgets (say) using a Poisson distribution. Only after that do I randomly generate the parameters for each widget. That whole process yields one instance of the problem to be analyzed, so the number of parameters varies from one instance to the next.
Is this kind of problem still amenable to Quasi-Monte-Carlo?
What I did once was to take the highest possible dimension of the problem, d, generate a Sobol sequence in d dimensions, and use whatever number of points was necessary for a particular sampling. I would say it helped somewhat...
From talking to a much smarter colleague, we need to consider the various combinations of widget counts for each widget type. For example, if we have 2 of widget type #1, 4 of widget type #2, 1 of widget type #3, etc., that constitutes one combination. QMC can be applied to that one combination. We are assuming that the number of widget #i is independent of the number of widget #j for i<>j, so the probability of each combination is just the product of p(2 widgets of type #1), p(4 widgets of type #2), p(1 widget of type #3), etc. The individual probabilities are easy to get from their Poisson distributions (or their flat distributions, or whatever distribution is being used). If there are N widget types, this is just a joint PMF in N-space. This probability is then used to weight the QMC result for that particular combination. Note that even when the exact combination is nailed down, QMC is still needed because each widget is associated with 3 stochastic parameters.
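A rough sketch of the weighting step, assuming independent Poisson counts per widget type. Several things here are my own assumptions for illustration: the counts are truncated at max_count, the weights are renormalized by the probability mass actually covered, and qmc_estimate is a hypothetical callback standing in for the fixed-dimension QMC run for one combination.

```python
import math
from itertools import product

def poisson_pmf(k, lam):
    """Probability of exactly k events under Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def combination_weights(rates, max_count):
    """Joint PMF over count combinations, one independent Poisson per widget type."""
    weights = {}
    for counts in product(range(max_count + 1), repeat=len(rates)):
        weights[counts] = math.prod(
            poisson_pmf(k, lam) for k, lam in zip(counts, rates))
    return weights

def weighted_estimate(rates, max_count, qmc_estimate):
    """Combine per-combination QMC results using the joint PMF as weights."""
    weights = combination_weights(rates, max_count)
    covered = sum(weights.values())    # below 1 because counts are truncated
    return sum(w * qmc_estimate(c) for c, w in weights.items()) / covered

# Placeholder "QMC result": just the total widget count of the combination.
estimate = weighted_estimate((2.0, 1.0), 15, lambda counts: sum(counts))
print(estimate)
```

With the placeholder callback the result should land near the sum of the rates, which is a handy sanity check before plugging in the real per-combination QMC.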

SVD and singular / non-singular matrices

I need to use the SVD form of a matrix to extract concepts from a series of documents. My matrix is of the form A = [d1, d2, d3 ... dN] where di is a binary vector of M components. Then the svd decomposition gives me svd(A) = U x S x V' with S containing the singular values.
I use SVDLIBC to do the processing in nodejs (using a small module I wrote to use it). It seemed to work all well, but I noticed something quite weird in the running time behavior depending on the state of my matrix (where N, M are growing, but already above 1000 for each).
At first, I didn't consider filtering out identical document vectors, but now after some tests, it looks like adding a document twice sometimes speeds up the processing extraordinarily.
Do I have to make sure that each of the columns of A are pairwise-independent? Are they required to be all linearly independent? (I thought nope, since SVD just seems to be performing its job well even with some columns being exactly the same, it will simply show in the resulting decomposition which columns / rows are useless by having 0 components in U or V)
Since it sometimes takes way too much time to compute the SVD of my big matrix, I was trying to reduce its size by removing duplicate columns, but I found out that actually adding dummy duplicate vectors can make it way faster. Is that normal? What's happening?
Logically, I'd say that I want my matrix to contain as much information as possible, and thus
[A] Remove all same columns, and in the best case, maybe
[B] Remove linearly dependent columns.
Doing [A] seems pretty simple and not too expensive computationally: I could hash my vectors at construction to find the possibly identical vectors, and then spend time checking only those. But are there good computation techniques for [A] and [B]?
(For [A], I'd prefer not to have to check a new vector for equality against all previous vectors the brute-force way, and as for [B], I don't know any good way to check it / do it.)
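The hashing idea for [A] can be sketched in a few lines. This is Python rather than nodejs, only to show the shape of it; unique_columns is a made-up name, and it covers exact duplicates only, not the linear dependence asked about in [B].

```python
def unique_columns(columns):
    """Keep the first occurrence of each distinct column, using a hash set."""
    seen = set()
    unique = []
    for col in columns:
        key = tuple(col)        # binary vectors are hashable once made tuples
        if key not in seen:     # average cost O(M) per column, no pairwise scan
            seen.add(key)
            unique.append(col)
    return unique

docs = [[1, 0, 1], [0, 1, 1], [1, 0, 1], [0, 0, 0], [0, 1, 1]]
print(unique_columns(docs))
```

The set lookup compares full keys on a hash match, so no separate equality pass is needed afterwards.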
Added related question: about my second question, why would SVD's running time behavior change so massively by just adding one similar column? Is that a normal possible behavior, or does it mean I should look for a bug in SVDLIBC?
It is difficult to say where the problem is without samples of fast and slow input matrices. But, since one of the primary uses of the SVD is to provide a rotation that eliminates covariance, redundant (or the same) columns should not cause problems.
To answer your question about whether the slow behavior is a bug in the library you're using, I'd suggest trying to compute the SVD of the same matrix using another tool. For example, in Octave, compute an SVD of your matrix to compare runtimes:
tic; [U, S, V] = svd(A); toc
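If Octave isn't handy, the same kind of cross-check can be done with NumPy. The sketch below uses a small random binary matrix as a stand-in for the document matrix and shows what a duplicated column does to the decomposition: the rank stays the same and one extra singular value appears that is zero up to rounding.

```python
import numpy as np

rng = np.random.default_rng(0)
A = (rng.random((30, 8)) < 0.5).astype(float)   # stand-in binary document matrix
A_dup = np.hstack([A, A[:, :1]])                # append a copy of the first column

s = np.linalg.svd(A, compute_uv=False)
s_dup = np.linalg.svd(A_dup, compute_uv=False)

print(np.linalg.matrix_rank(A), np.linalg.matrix_rank(A_dup))
print(s_dup[-1])    # close to zero: the duplicate adds no new direction
```

Wrapping the two svd calls in a timer on your real matrices would show whether the runtime difference is specific to SVDLIBC.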
