What does N((1,0)^T, I) mean in relation to the Gaussian distribution? - statistics

Hi everyone, I am reading the book "The Elements of Statistical Learning" and came across the paragraph below, which I don't understand (it explains how the training data was generated):
We generated 10 means m_k from a bivariate Gaussian distribution N((1,0)^T, I) and labeled this class Blue. Similarly, 10 more were drawn from N((0,1)^T, I) and labeled class Orange. Then for each class we generated 100 observations as follows: for each observation, we picked an m_k at random with probability 1/10, and then generated a N(m_k, I/5), thus leading to a mixture of Gaussian clusters for each class.
I would appreciate it if you could explain the above paragraph, and especially N((0,1)^T, I).
By the way, the T in (0,1)^T stands for transpose.
Is this notation mathematically common, or is it related to a specific computer language?

In the paragraph, N stands for the Normal distribution; more specifically, in this case it stands for the multivariate normal distribution. It is not specific to any programming language. It comes from statistics and probability theory, but due to the numerous appealing properties and important applications of this probability distribution it is also widely used in programming, so you should be able to perform the described procedure in any language.
The part (0,1)^T is a vector of means. That is, we have in mind a random vector of length two, where the first element on average is 0, and the second one on average is 1.
"I" stands for the 2x2 identity matrix, whose role is that of the variance-covariance matrix. That is, the variance of both random vector components is 1 (the diagonal terms), while the off-diagonal entries are 0 and correspond to the covariance between the two random variables.

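The procedure from the quoted paragraph could be sketched in Python with NumPy roughly as follows. This is only an illustration, not the book's own code: it assumes the Blue means are centred at (1,0)^T (as in the title) and the Orange means at (0,1)^T, and the seed and the helper name sample_class are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility
I = np.eye(2)                   # 2x2 identity matrix used as the covariance

# Step 1: draw 10 cluster means per class from a bivariate Gaussian.
blue_means = rng.multivariate_normal(mean=[1, 0], cov=I, size=10)
orange_means = rng.multivariate_normal(mean=[0, 1], cov=I, size=10)

def sample_class(means, n_obs=100):
    """Pick one of the means uniformly at random, then draw from N(m_k, I/5)."""
    picks = rng.integers(0, len(means), size=n_obs)  # each mean has probability 1/10
    noise = rng.multivariate_normal(mean=[0, 0], cov=I / 5, size=n_obs)
    return means[picks] + noise  # adding N(0, I/5) noise to m_k gives N(m_k, I/5)

blue = sample_class(blue_means)
orange = sample_class(orange_means)
print(blue.shape, orange.shape)
```

Each class ends up as 100 two-dimensional points scattered around its 10 cluster means, which is the "mixture of Gaussian clusters" the paragraph describes.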
Related

Desired distribution of weights in word embedding vectors

I am training my own embedding vectors as I'm focused on an academic dataset (WOS); whether the vectors are generated via word2vec or fasttext doesn't particularly matter. Say my vectors are 150 dimensions each. I'm wondering what the desired distribution of weights within a vector ought to be, if you averaged across an entire corpus's vectors?
I did a few experiments while looking at the distributions of a sample of my vectors and came to these conclusions (uncertain as to how absolutely they hold):
If one trains their model with too few epochs, then the vectors don't change significantly from their initial values (easy to see if you start your vectors with weight 0 in every category). Thus, if my weight distribution is centered around some point (typically 0), then I've under-trained my corpus.
If one trains their model with too few documents, or over-trains it, then the vectors show significant correlation with each other (I typically visualize a random set of vectors, and you can see stripes where all the vectors have weights that are either positive or negative).
What I imagine is that a single "good" vector has various weights across the entire range of -1 to 1. Any single vector may have significantly more dimensions near -1 or 1. However, the weight distribution of an entire corpus would balance out vectors that randomly have more values towards one end of the spectrum or the other, so that the weight distribution of the entire corpus is approximately even across the entire range. Is this intuition correct?
I'm unfamiliar with any research or folk wisdom about the desirable "weights of the vectors" (by which I assume you mean the individual dimensions).
In general, since the individual dimensions aren't strongly interpretable, I'm not sure you could say much about how any one dimension's values should be distributed. And remember, our intuitions from low-dimensional spaces (2d, 3d, 4d) often don't hold up in high-dimensional spaces.
I've seen two interesting, possibly relevant observations in research:
Some have observed that the raw trained vectors for words with singular meanings tend to have a larger magnitude, and those with many meanings have smaller magnitudes. A plausible explanation for this would be that word-vectors for polysemous word-tokens are being pulled in different directions for the multiple contrasting meanings, and thus wind up "somewhere in the middle" (closer to the origin, and thus of lower magnitude). Note, though, that most word-vector-to-word-vector comparisons ignore the magnitudes, by using cosine-similarity to only compare angles (or largely equivalently, by normalizing all vectors to unit length before comparisons).
A paper "All-but-the-Top: Simple and Effective Postprocessing for Word Representations" by Mu, Bhat, & Viswanath https://arxiv.org/abs/1702.01417v2 has noted that the average of all word-vectors that were trained together tends to be biased in a certain direction from the origin, but that removing that bias (and other commonalities in the vectors) can result in improved vectors for many tasks. In my own personal experiments, I've observed that the magnitude of that bias-from-origin seems correlated with the number of negative samples chosen, and that choosing the extreme (and uncommon) value of just 1 negative sample makes such a bias negligible (but might not be best for overall quality or efficiency/speed of training).
So there may be useful heuristics about vector quality from looking at the relative distributions of vectors, but I'm not sure any would be sensitive to individual dimensions (except insofar as those happen to be the projections of vectors onto a certain axis).
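To make the "removing that bias (and other commonalities)" idea concrete, here is a rough sketch of that kind of postprocessing: subtract the mean vector, then remove the projections onto the top few principal directions. The function name, the choice of 2 components and the random toy data are assumptions for illustration, not values taken from the paper or from any real embedding.

```python
import numpy as np

def all_but_the_top(vectors, n_components=2):
    """Subtract the mean vector, then remove projections onto the top principal directions."""
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top = vt[:n_components]
    return centered - centered @ top.T @ top

rng = np.random.default_rng(0)
# Toy stand-in for trained word-vectors: random vectors sharing an offset from the origin.
toy = rng.normal(size=(1000, 50)) + 3.0
cleaned = all_but_the_top(toy)

# The shared offset shows up as a large mean norm before, and a near-zero one after.
print(np.linalg.norm(toy.mean(axis=0)), np.linalg.norm(cleaned.mean(axis=0)))
```

Whether this helps on a given task would still need to be checked empirically on real vectors.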

Are the features of Word2Vec independent of each other?

I am new to NLP and studying Word2Vec, so I do not fully understand the concept yet.
Are the features of Word2Vec independent of each other?
For example, suppose there is a 100-dimensional word2vec vector. Are the 100 features independent of each other? In other words, if the order of the features is shuffled, does the meaning of the word2vec vector change?
Word2vec is a 'dense' embedding: the individual dimensions generally aren't independently interpretable. It's just the 'neighborhoods' and 'directions' (not limited to the 100 orthogonal axis dimensions) that have useful meanings.
So, they're not 'independent' of each other in a statistical sense. But, you can discard any of the dimensions – for example, the last 50 dimensions of all your 100-dimensional vectors – and you still have usable word-vectors. So in that sense they're still independently useful.
If you shuffled the order-of-dimensions, the same way for every vector in your set, you've then essentially just rotated/reflected all the vectors similarly. They'll all have different coordinates, but their relative distances will be the same, and if "going toward word B from word A" used to vaguely indicate some human-understandable aspect like "largeness", then even after performing your order-of-dimensions shuffle, "going towards word B from word A" will mean the same thing, because the vectors "thataway" (in the transformed coordinates) will be the same as before.
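A small numerical check of the shuffling argument above, using random toy vectors in place of real word-vectors (an assumption made purely for illustration): applying the same permutation of dimensions to every vector leaves cosine similarities and Euclidean distances unchanged.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
vectors = rng.normal(size=(5, 100))   # toy stand-ins for 100-dimensional word-vectors
perm = rng.permutation(100)           # one fixed shuffle of the dimension order
shuffled = vectors[:, perm]           # applied identically to every vector

before = cosine(vectors[0], vectors[1])
after = cosine(shuffled[0], shuffled[1])
dist_before = np.linalg.norm(vectors[0] - vectors[1])
dist_after = np.linalg.norm(shuffled[0] - shuffled[1])
print(np.isclose(before, after), np.isclose(dist_before, dist_after))
```

The coordinates of each vector change, but every pairwise relationship is preserved, which is why the shuffled set is just as usable as the original.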
The first thing to understand here is how word2vec is formalized. Shifting away from traditional representations of words, the word2vec model tries to encode the meaning of a word into different features. For example, let's say every word in the English dictionary can be described by a set of, say, 4 features. The features could be, let's say, "f1": "gender", "f2": "color", "f3": "smell", "f4": "economy".
So when a word2vec vector is written out, what it signifies is how strongly the word manifests each feature. Let's take an example to understand this. Consider a man (V1) who is dark, not so smelly, and neither very rich nor poor. Then the first feature, i.e. gender, is represented as 1 (since we are taking 1 as male and -1 as female). The second feature, color, is -1 here, as it is exactly opposite to white (which we are taking as 1). Smell and economy are similarly given the values 0.3 and 0.4.
Now consider another man (V2) who has the same anatomy and a similar social status to the first man. Then his word2vec vector would also be similar.
V1=>[1,-1,0.3,0.4]
V2=>[1,-1,0.4,0.3]
This kind of representation helps us represent words as features that are independent of, or orthogonal to, each other. The orthogonality helps in finding similarity or dissimilarity based on some mathematical operation, say cosine similarity.
The order of the numbers in a word2vec vector is important, since every number represents the weight of a particular feature: gender, color, smell, economy. So shuffling the positions would result in a completely different vector.

Under what conditions can two classes have different average values, yet be indistinguishable to an SVM?

I am asking because I have observed sometimes in neuroimaging that a brain region might have different average activation between two experimental conditions, but sometimes an SVM classifier somehow can't distinguish the patterns of activation between the two conditions.
My intuition is that this might happen in cases where the within-class variance is far greater than the between-class variance. For example, suppose we have two classes, A and B, and that for simplicity our data consists just of integers (rather than vectors). Let the data falling under class A be 0,0,0,0,0,10,10,10,10,10. Let the data falling under class B be 1,1,1,1,1,11,11,11,11,11. Here, A and B are clearly different on average, yet there's no decision boundary that would allow A and B to be distinguished. I believe this logic would hold even if our data consisted of vectors, rather than integers.
Is this a special case of some broader range of cases where an SVM would fail to distinguish two classes that are different on average? Is it possible to delineate the precise conditions under which an SVM classifier would fail to distinguish two classes that differ on average?
EDIT: Assume a linear SVM.
As described in the comments - there are no such conditions because SVM will separate data just fine (I am not talking about any generalisation here, just separating training data). For the rest of the answer I am assuming there are no two identical points with different labels.
Non-linear case
For a kernel case, using something like RBF kernel, SVM will always perfectly separate any training set, given that C is big enough.
Linear case
If data is linearly separable then again - with big enough C it will separate data just fine. If data is not linearly separable, cranking up C as much as possible will lead to smaller and smaller training error (of course it will not get 0 since data is not linearly separable).
In particular, for the data you provided a kernelized SVM will get 100%, while no linear model can: a single threshold on the line classifies at most 15 of the 20 points correctly (75%), and one placed in the middle gets only 50%. This has nothing to do with the means being different or with the relation between the variances; it is simply a dataset that no linear separator can split, so it has nothing to do with SVM specifically. In particular, an SVM will separate them "in the middle", meaning that the decision point will be somewhere around "5".
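As a quick sanity check on the 1-D example from the question, the sketch below brute-forces every threshold rule (a linear classifier in one dimension is just a threshold plus an orientation). It is plain Python rather than an actual SVM, and the grid of thresholds is an arbitrary choice. It suggests that a threshold in the middle gets 50% and that the best single threshold reaches 75%, still short of perfect separation.

```python
A = [0] * 5 + [10] * 5   # class A from the question
B = [1] * 5 + [11] * 5   # class B from the question

def accuracy(threshold, b_above=True):
    """Accuracy of the 1-D rule 'predict B when x > threshold' (or the mirrored rule)."""
    correct = 0
    for x in A:
        predicted_b = (x > threshold) == b_above
        correct += not predicted_b
    for x in B:
        predicted_b = (x > threshold) == b_above
        correct += predicted_b
    return correct / (len(A) + len(B))

thresholds = [t / 2 for t in range(-2, 25)]   # -1.0, -0.5, ..., 12.0
best = max(accuracy(t, side) for t in thresholds for side in (True, False))
print(accuracy(5.0), best)   # prints: 0.5 0.75
```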

Representing classification confidence

I am working on a simple AI program that classifies shapes using an unsupervised learning method. Essentially, I use the number of sides and the angles between the sides to generate aggregate percentages relative to the ideal values of a shape. This helps me create some fuzziness in the result.
The problem is: how do I represent the degree of error or confidence in the classification? For example, a small rectangle that looks very much like a square would yield high membership values for both categories, but can I represent the degree of error?
Thanks
Your confidence depends on the model used. For example, if you are simply applying some rules based on the number of angles (or sides), you have some multi-dimensional representation of objects:
feature 0, feature 1, ..., feature m
Nice, statistical approach
You can define some kind of confidence intervals based on your empirical results, e.g. you can fit a multi-dimensional Gaussian distribution to your empirical observations of "rectangle objects", and once you get a new object you simply check the probability of such a value under your Gaussian distribution, and have your confidence (which would be quite well justified under the assumption that your "observation" errors are normally distributed).
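A minimal sketch of this statistical approach, assuming a made-up two-feature representation (number of sides, mean angle in degrees) and synthetic "rectangle" observations; both are hypothetical and only there to make the example run. It fits a Gaussian to the observations and compares the log-density of a typical object with that of an atypical one.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical training features for "rectangle" objects: (number of sides, mean angle).
rectangles = rng.normal(loc=[4.0, 90.0], scale=[0.1, 2.0], size=(200, 2))

# Fit the Gaussian: sample mean and sample covariance.
mean = rectangles.mean(axis=0)
cov = np.cov(rectangles, rowvar=False)

def log_density(x):
    """Log of the fitted multivariate Gaussian density at x."""
    diff = x - mean
    k = len(mean)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (diff @ np.linalg.solve(cov, diff) + logdet + k * np.log(2 * np.pi))

typical = np.array([4.0, 90.0])
odd = np.array([3.0, 60.0])    # looks more like a triangle
print(log_density(typical) > log_density(odd))
```

Note that a density value is not itself a probability in [0,1]; comparing it against the densities of the training objects is one way to turn it into a confidence statement.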
Distance based, simple approach
A less statistical approach would be to directly take your model's decision factor and compress it to the [0,1] interval. For example, if you simply measure the distance from some perfect shape to your new object in some metric (which yields results in [0,inf)), you could map it using some sigmoid-like function, e.g.
conf( object, perfect_shape ) = 1 - tanh( distance( object, perfect_shape ) )
For non-negative distances, the hyperbolic tangent will "squash" values to the [0,1] interval, and the only remaining thing to do would be to select some scaling factor (as it grows quite quickly).
Such an approach would be less valid in mathematical terms, but would be similar to the approach taken in neural networks.
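A small sketch of the distance-based confidence with a scaling factor; the parameter name scale and the sample distances are illustrative assumptions.

```python
import math

def confidence(distance, scale=1.0):
    """Map a distance in [0, inf) to a confidence in [0, 1]; a larger scale decays more slowly."""
    return 1.0 - math.tanh(distance / scale)

for d in (0.0, 0.5, 2.0, 10.0):
    print(d, round(confidence(d), 3), round(confidence(d, scale=5.0), 3))
```

With scale=1 the confidence is already close to 0 at a distance of 2, which is why the scale has to be matched to the typical distances in your metric.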
Relative approach
A more probabilistic approach could also be defined using your distance metric. If you have distances to each of your "perfect shapes", you can calculate the probability of an object being classified as some class under the assumption that classification is performed at random, with probability proportional to the inverse of the distance to the perfect shape.
dist(object, perfect_shape1) = d_1
dist(object, perfect_shape2) = d_2
dist(object, perfect_shape3) = d_3
...
conf(object, class_i) = inv( d_i ) / sum_j inv( d_j )
where
inv( d_i ) = max_j( d_j ) - d_i
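The relative confidence above could be computed along these lines. The function name and the handling of the case where all distances are equal (falling back to a uniform split) are assumptions added for the sketch.

```python
def relative_confidence(distances):
    """conf_i = inv(d_i) / sum_j inv(d_j), with inv(d_i) = max_j(d_j) - d_i."""
    d_max = max(distances)
    inv = [d_max - d for d in distances]
    total = sum(inv)
    if total == 0:                       # all distances equal: no class is preferred
        return [1.0 / len(distances)] * len(distances)
    return [v / total for v in inv]

print(relative_confidence([1.0, 1.0, 5.0]))   # prints: [0.5, 0.5, 0.0]
```

One consequence of this choice of inv is that the farthest class always receives confidence 0.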
Conclusions
The first two ideas can also be incorporated into the third one to make use of knowledge of all the classes. In your particular example, the third approach should result in a confidence of around 0.5 for both rectangle and square, while the first approach would give something closer to 0.01 (depending on how many such small objects you have in the "training" set). This shows the difference: the first two approaches show your confidence in classifying as a particular shape itself, while the third one shows relative confidence (so it can be low if and only if it is high for some other class, while the first two can simply answer "no classification is confident").
Building slightly on what lejlot has put forward; my preference would be to use the Mahalanobis distance with some squashing function. The Mahalanobis distance M(V, p) allows you to measure the distance between a distribution V and a point p.
In your case, I would use "perfect" examples of each class to generate the distribution V, and p is the object whose classification you want the confidence of. You can then use something along the lines of the following as your confidence measure.
1-tanh( M(V, p) )
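A rough sketch of this Mahalanobis-based confidence, where the distribution V is estimated from sample "perfect" examples. The two-feature representation and the synthetic samples are hypothetical.

```python
import numpy as np

def mahalanobis(samples, p):
    """Mahalanobis distance between point p and the distribution estimated from samples."""
    mean = samples.mean(axis=0)
    cov = np.cov(samples, rowvar=False)
    diff = p - mean
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))

rng = np.random.default_rng(0)
# Hypothetical "perfect" examples of one class, described by two features.
perfect = rng.normal(loc=[4.0, 90.0], scale=[0.1, 2.0], size=(200, 2))

near = np.array([4.0, 91.0])
far = np.array([3.0, 60.0])
print(1 - np.tanh(mahalanobis(perfect, near)), 1 - np.tanh(mahalanobis(perfect, far)))
```

Because the distance is measured in units of the class's own spread, features with very different scales (here a count and an angle) are handled without manual rescaling.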

'Probability' of a K-nearest neighbor like classification

I've a small set of data points (around 10) in a 2D space, and each of them has a category label. I wish to classify a new data point based on the existing data point labels and also associate a 'probability' of belonging to any particular label class.
Is it appropriate to label the new point based on the label of its nearest neighbor (like K-nearest neighbor with K=1)? To get the probability, I wish to permute all the labels, calculate the minimum distance between the unknown point and the rest each time, and find the fraction of cases where the minimum distance is less than or equal to the distance that was used to label it.
Thanks
The Nearest Neighbour method is already using the Bayes theorem to estimate the probability using the points in a ball containing your chosen K points. There is no need to transform, as the number of points in the ball of K points belonging to each label divided by the total number of points in that ball already is an approximation of the posterior probability of that label. In other words:
P(label|z) = P(z|label)P(label) / P(z) = K(label)/K
This is obtained by applying Bayes' rule to densities estimated from a subset of the data. In particular, using:
V P(z) = K/N (this gives you the probability of a point falling in a ball of volume V)
P(z) = K/(N V) (from above)
P(z|label) = K(label)/(N(label) V) (where K(label) and N(label) are the number of points of that class in the ball and the total number of samples of that class)
and
P(label) = N(label)/N.
Therefore, just pick a K, calculate the distances, find the K nearest points, and count how many of them carry each label; those counts divided by K are your probabilities.
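The K(label)/K estimate can be sketched in a few lines of plain Python. The 10 labelled points below are made up, and the function name is an arbitrary choice.

```python
import math
from collections import Counter

def knn_posterior(points, labels, query, k=3):
    """Estimate P(label | query) as K(label) / K among the k nearest points."""
    order = sorted(range(len(points)), key=lambda i: math.dist(points[i], query))
    counts = Counter(labels[i] for i in order[:k])
    return {label: n / k for label, n in counts.items()}

# Hypothetical 2-D data set of 10 labelled points.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
          (5, 5), (5, 6), (6, 5), (6, 6), (5.5, 5.5)]
labels = ["a"] * 5 + ["b"] * 5
print(knn_posterior(points, labels, (3.2, 3.2), k=3))
```

With only around 10 points the estimate is coarse: for K=3 the only possible values are 0, 1/3, 2/3 and 1, and for K=1 it is always 0 or 1.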
Roweis uses a probabilistic framework with KNN in his publication Neighbourhood Component Analysis. The idea is to use a "soft" nearest neighbour classification, where the probability that a point i uses another point j as its neighbour is defined by
p_ij = exp(-d_ij^2) / sum_{k != i} exp(-d_ik^2), with p_ii = 0,
where d_ij is the Euclidean distance between point i and j.
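A sketch of such "soft" neighbour probabilities, assuming the usual softmax-over-negative-squared-distances form p_ij = exp(-d_ij^2) / sum_{k != i} exp(-d_ik^2) with p_ii = 0, and computing distances in the original space rather than in a learned projection. The points and labels are made up. The probability of a class is then the sum of p_ij over the neighbours j carrying that label.

```python
import math

def soft_neighbour_probs(points, i):
    """p_ij = exp(-d_ij^2) / sum_{k != i} exp(-d_ik^2), with p_ii = 0."""
    weights = [0.0 if j == i else math.exp(-math.dist(points[i], points[j]) ** 2)
               for j in range(len(points))]
    total = sum(weights)
    return [w / total for w in weights]

points = [(0.0, 0.0), (0.5, 0.0), (0.0, 1.0), (2.0, 2.0)]
labels = ["a", "a", "b", "b"]
probs = soft_neighbour_probs(points, 0)

# Probability that point 0 picks a neighbour labelled "a".
p_a = sum(p for p, label in zip(probs, labels) if label == "a")
print(round(p_a, 3))
```

Unlike the hard K(label)/K count, every other point contributes here, with closer points weighted exponentially more.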
There are no probabilities for such a K-nearest classification method out of the box because, like an SVM, it is a discriminative classifier. A post-processing step should be used to learn probabilities on unseen data, with a model like logistic regression.
1. Learn the K-nearest classifier.
2. Train logistic regression on the distance and the average distance to the K nearest points for validation data.
See the LibSVM article for details.
Sort the distances to the 10 centres; they could be
1 5 6 ... — one near, others far
1 1 1 5 6 ... — 3 near, others far
... lots of possibilities.
You could combine the 10 distances to a single number, e.g. 1 - (nearest / average) ** p,
but that's throwing away information.
(Different powers p make the hills around the centres steeper or flatter.)
If your centres are really Gaussian hills though, take a look at
Multivariate kernel density estimation.
Added:
There are zillions of functions that go smoothly between 0 and 1,
but that doesn't make them probabilities of something.
"Probability" means either that chance, likelihood, is involved,
as in probability of rain;
or that you're trying to impress somebody.
Added again: scholar.google.com "(single|1) nearest neighbor classifier" gets > 300 hits;
"k nearest neighbor classifier" gets almost 3000.
It seems to me (non-expert) that, out of 10 different ways of mapping k-NN distances to labels,
each one might be better than the 9 others — for some data, with some error measure.
Anyway, you could try asking stats.stackexchange.com.
The answer is : it depends.
Imagine your labels are the surname of a person, and the X,Y coordinates represent some essential characteristics of the person's DNA sequence. Clearly, a closer DNA description increases the probability of having the same surname.
Now suppose the X,Y is the lat/long of the work office for that person. Working closer isn't related to label (surname) sharing.
So, it depends on the semantics of your labels and axes.
HTH!
