Desired distribution of weights in word embedding vectors - nlp

I am training my own embedding vectors as I'm focused on an academic dataset (WOS); whether the vectors are generated via word2vec or fasttext doesn't particularly matter. Say my vectors are 150 dimensions each. I'm wondering what the desired distribution of weights within a vector ought to be, if you averaged across an entire corpus's vectors?
I did a few experiments while looking at the distributions of a sample of my vectors and came to these conclusions (uncertain as to how absolutely they hold):
If one trains their model with too few epochs then the vectors don't change significantly from their initiated values (easy to see if you start you vectors as weight 0 in every category). Thus if my weight distribution is centered around some point (typically 0) then I've under-trained my corpus.
If one trains their model with too few documents/over-trains then the vectors show significant correlation with each other (I typically visualize a random set of vectors and you can see stripes where all the vectors have weights that are either positive or negative).
What I imagine is a single "good" vector has various weights across the entire range of -1 to 1. For any single vector it may have significantly more dimensions near -1 or 1. However, the weight distribution of an entire corpus would balance out vectors that randomly have more values towards one end of the spectrum or another, so that the weight distribution of the entire corpus is approximately evenly distributed across the entire corpus. Is this intuition correct?

I'm unfamiliar with any research or folk wisdom about the desirable "weights of the vectors" (by which I assume you mean the individual dimensions).
In general, since the individual dimensions aren't strongly interpretable, I'm not sure you could say much about how any one dimension's values should be distributed. And remember, our intuitions from low-dimensional spaces (2d, 3d, 4d) often don't hold up in high-dimensional spaces.
I've seen two interesting, possibly relevant observations in research:
some have observed that the raw trained vectors for words with singular meanings tend to have a larger magnitude, and those with many meanings have smaller magnitudes. A plausible explanation for this would be that word-vectors for polysemous word-tokenss are being pulled in different directions for the multiple contrasting meanings, and thus wind up "somewhere in the middle" (closer to the origin, and thus of lower magnitude). Note, though, that most word-vector-to-word-vector comparisons ignore the magnitudes, by using cosine-similarity to only compare angles (or largely equivalently, by normalizing all vectors to unit length before comparisons).
A paper "All-but-the-Top: Simple and Effective Postprocessing for Word Representations" by Mu, Bhat, & Viswanath https://arxiv.org/abs/1702.01417v2 has noted that the average of all word-vectors that were trained together tends to biased in a certain direction from the origin, but that removing that bias (and other commonalities in the vectors) can result in improved vectors for many tasks. In my own personal experiments, I've observed that the magnitude of that bias-from-origin seems correlated with the number of negative samples chosen - and that choosing the extreme (and uncommon) value of just 1 negative sample makes such a bias negligible (but might not be best for overall quality or efficiency/speed of training).
So there may be useful heuristics about vector quality from looking at the relative distributions of vectors, but I'm not sure any would be sensitive to individual dimensions (except insofar as those happen to be the projections of vectors onto a certain axis).

Related

Does adding a list of Word2Vec embeddings give a meaningful represenation?

I'm using a pre-trained word2vec model (word2vec-google-news-300) to get the embeddings for a given list of words. Please note that this is NOT a list of words that we get after tokenizing a sentence, it is just a list of words that describe a given image.
Now I'd like to get a single vector representation for the entire list. Does adding all the individual word embeddings make sense? Or should I consider averaging?
Also, I would like the vector to be of a constant size so concatenating the embeddings is not an option.
It would be really helpful if someone can explain the intuition behind considering either one of the above approaches.
Averaging is most typical, when someone is looking for a super-simple way to turn a bag-of-words into a single fixed-length vector.
You could try a simple sum, as well.
But note that the key difference between the sum and average is that the average divides by the number of input vectors. Thus they both result in a vector that's pointing in the exact same 'direction', just of different magnitude. And, the most-often-used way of comparing such vectors, cosine-similarity, is oblivious to magnitudes. So for a lot of cosine-similarity-based ways of later comparing the vectors, sum-vs-average will give identical results.
On the other hand, if you're comparing the vectors in other ways, like via euclidean-distances, or feeding them into other classifiers, sum-vs-average could make a difference.
Similarly, some might try unit-length-normalizing all vectors before use in any comparisons. After such a pre-use normalization, then:
euclidean-distance (smallest to largest) & cosine-similarity (largest-to-smallest) will generate identical lists of nearest-neighbors
average-vs-sum will result in different ending directions - as the unit-normalization will have upped some vectors' magnitudes, and lowered others, changing their relative contributions to the average.
What should you do? There's no universally right answer - depending on your dataset & goals, & the ways your downstream steps use the vectors, different choices might offer slight advantages in whatever final quality/desirability evaluation you perform. So it's common to try a few different permutations, along with varying other parameters.
Separately:
The GoogleNews vectors were trained on news articles back around 2013; their word senses thus may not be optimal for an image-labeling task. If you have enough of your own data, or can collect it, training your own word-vectors might result in better results. (Both the use of domain-specific data, & the ability to tune training parameters based on your own evaluations, could offer benefits - especially when your domain is unique, or the tokens aren't typical natural-language sentences.)
There are other ways to create a single summary vector for a run-of-tokens, not just arithmatical-combo-of-word-vectors. One that's a small variation on the word2vec algorithm often goes by the name Doc2Vec (or 'Paragraph Vector') - it may also be worth exploring.
There are also ways to compare bags-of-tokens, leveraging word-vectors, that don't collapse the bag-of-tokens to a single fixed-length vector 1st - and while they're more expensive to calculate, sometimes offer better pairwise similarity/distance results than simple cosine-similarity. One such alternate comparison is called "Word Mover's Distance" - at some point,, you may want to try that as well.

What do negative vectors mean on word2vec?

I am doing a research on travel reviews and used word2vec to analyze the reviews. However, when I showed my output to my adviser, he said that I have a lot of words with negative vector values and that only words with positive values are considered logical.
What could these negative values mean? Is there a way to ensure that all vector values I will get in my analysis would be positive?
While some other word-modeling algorithms do in fact model words into spaces where dimensions are 0 or positive, and the individual positive dimensions might be clearly meaningful to humans, that is not the case with the original, canonical 'word2vec' algorithm.
The positive/negativeness of any word2vec word-vector – in a particular dimension, or in net magnitude – has no strong meaning. Meaningful words will be spread out in every direction from the origin point. Directions or neighborhoods in this space that loosely correlate to recognizable categories may appear anywhere, and skew with respect to any of the dimensional axes.
(Here's a related algorithm that does use non-negative constraints – https://www.cs.cmu.edu/~bmurphy/NNSE/. But most references to 'word2vec' mean the classic approach where dimensions usefully range over all reals.)

Are the features of Word2Vec independent each other?

I am new to NLP and studying Word2Vec. So I am not fully understanding the concept of Word2Vec.
Are the features of Word2Vec independent each other?
For example, suppose there is a 100-dimensional word2vec. Then the 100 features are independent each other? In other words, if the "sequence" of the features are shuffled, then the meaning of word2vec is changed?
Word2vec is a 'dense' embedding: the individual dimensions generally aren't independently interpretable. It's just the 'neighborhoods' and 'directions' (not limited to the 100 orthogonal axis dimensions) that have useful meanings.
So, they're not 'independent' of each other in a statistical sense. But, you can discard any of the dimensions – for example, the last 50 dimensions of all your 100-dimensional vectors – and you still have usable word-vectors. So in that sense they're still independently useful.
If you shuffled the order-of-dimensions, the same way for every vector in your set, you've then essentially just rotated/reflected all the vectors similarly. They'll all have different coordinates, but their relative distances will be the same, and if "going toward word B from word A" used to vaguely indicate some human-understandable aspect like "largeness", then even after performing your order-of-dimensions shuffle, "going towards word B from word A" will mean the same thing, because the vectors "thataway" (in the transformed coordinates) will be the same as before.
The first thing to understand here is that how word2Vec is formalized. Shifting away from traditional representations of words, the word2vec model tries to encode the meaning of the world into different features. For eg lets say every word in the english dictionary can be manifested in a set of say '4' features. The features could be , lets say "f1":"gender", "f2":"color","f3":"smell","f4":"economy".
So now when a word2vec vector is written , what it signifies is how much manifestation of a particular feature it has. Lets take an example to understand this. Consider a Man(V1) who is dark,not so smelly and is not very rich and is neither poor. Then the first feature ie gender is represented as 1 (since we are taking 1 as male and -1 as female). The second feature color is -1 here as it is exactly opposite to white (which we are taking as 1). Smell and economy are similary given 0.3 and 0.4 values.
Now consider another man(V2) who also has the same anatomy and social status like the first man. Then his word2vec vector would also be similar.
V1=>[1,-1,0.3,0.4]
V2=>[1,-1,0.4,0.3]
This kind of representation helps us represent words into features that are independent or orthogonal to each other.The orthogonality helps in finding similarity or dissimilarity based on some mathematical operation lets say cosine dot product.
The sequence of the number in a word2vec is important since every number represents the weight of a particular feature: gender, color,smell,economy. So shuffling the positions would result in a completely different vector

How are vectors calculated in doc2vec and what does the size parameter depict?

If I pass a Sentence containing 5 words to the Doc2Vec model and if the size is 100, there are 100 vectors. I'm not getting what are those vectors. If I increase the size to 200, there are 200 vectors for just a simple sentence. Please tell me how are those vectors calculated.
When using a size=100, there are not "100 vectors" per text example – there is one vector, which includes 100 scalar dimensions (each a floating-point value, like 0.513 or -1.301).
Note that the values represent points in 100-dimensional space, and the individual dimensions/axes don't have easily-interpretable meanings. Rather, it is only the relative distances and relative directions between individual vectors that have useful meaning for text-based applications, such as assisting in information-retrieval or automatic classification.
The method for computing the vectors is described in the paper 'Distributed Representation of Sentences and Documents' by Le & Mikolov. But, it is closely associated to the 'word2vec' algorithm, so understanding that 1st may help, such as via its first and second papers. If that style of paper isn't your style, queries like [word2vec tutorial] or [how does word2vec work] or [doc2vec intro] should find more casual beginning descriptions.

Under what conditions can two classes have different average values, yet be indistinguishable to a SVM?

I am asking because I have observed sometimes in neuroimaging that a brain region might have different average activation between two experimental conditions, but sometimes an SVM classifier somehow can't distinguish the patterns of activation between the two conditions.
My intuition is that this might happen in cases where the within-class variance is far greater than the between-class variance. For example, suppose we have two classes, A and B, and that for simplicity our data consists just of integers (rather than vectors). Let the data falling under class A be 0,0,0,0,0,10,10,10,10,10. Let the data falling under class B be 1,1,1,1,1,11,11,11,11,11. Here, A and B are clearly different on average, yet there's no decision boundary that would allow A and B to be distinguished. I believe this logic would hold even if our data consisted of vectors, rather than integers.
Is this a special case of some broader range of cases where an SVM would fail to distinguish two classes that are different on average? Is it possible to delineate the precise conditions under which an SVM classifier would fail to distinguish two classes that differ on average?
EDIT: Assume a linear SVM.
As described in the comments - there are no such conditions because SVM will separate data just fine (I am not talking about any generalisation here, just separating training data). For the rest of the answer I am assuming there are no two identical points with different labels.
Non-linear case
For a kernel case, using something like RBF kernel, SVM will always perfectly separate any training set, given that C is big enough.
Linear case
If data is linearly separable then again - with big enough C it will separate data just fine. If data is not linearly separable, cranking up C as much as possible will lead to smaller and smaller training error (of course it will not get 0 since data is not linearly separable).
In particular for the data you provided kernelized SVM will get 100%, and any linear model will get 50%, but it has nothing to do with means being different or variances relations - it is simply a dataset where any linear separator has at most 50% accuracy, literally every decision point, thus it has nothing to do with SVM. In particular it will separate them "in the middle", meaning that the decision point will be somewhere around "5".

Resources