If the number of features is greater than the number of samples, will the data always be linearly separable? (SVM)

According to a corollary on VC dimension, any $n+1$ points in general position can be shattered by hyperplanes in $\mathbb{R}^n$. Hence, if the feature-space dimension is far greater than the number of samples, there will always be a hyperplane that separates the two classes.
So, if I have 50 samples and 100 features, they will always be linearly separable. Is this conclusion right?
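A quick way to sanity-check this empirically (a minimal sketch using scikit-learn; the random data, the labels, and the large C value approximating a hard margin are all illustrative):

import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(50, 100)            # 50 samples, 100 features (d > n)
y = rng.randint(0, 2, size=50)    # arbitrary binary labels

# With more dimensions than points (in general position), a linear
# SVM with a near-hard margin should separate any labelling perfectly.
clf = SVC(kernel="linear", C=1e6).fit(X, y)
print(clf.score(X, y))            # expected: 1.0 (training accuracy)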

Related

How to handle discontinuous input distributions in neural network

I am using Keras to set up neural networks.
As input data, I use vectors in which each coordinate is either 0 (feature not present or not measured) or a value that can range, for instance, between 5000 and 10000.
So my input value distribution is roughly a Gaussian centered around, say, 7500, plus a very thin peak at 0.
I cannot remove the vectors with 0 in some of their coordinates because almost all of them will have some 0s at some locations.
So my question is: how best to normalize the input vectors? I see two possibilities:
Just subtract the mean and divide by the standard deviation. The problem is that the mean is biased by the large number of meaningless 0s, and the standard deviation is overestimated, which washes out the fine changes in the meaningful measurements.
Compute the mean and standard deviation over the non-zero coordinates only, which is more meaningful. But then all the 0 values that correspond to unmeasured data come out as large negative values, which gives weight to meaningless data...
Does anyone have advice on how to proceed?
Thanks!
Instead of either option, represent each feature as two dimensions:
The first is the normalised value of the feature if it is non-zero (where normalisation is computed over the non-zero elements), and 0 otherwise.
The second is 1 iff the feature was 0, and 0 otherwise. This makes sure that a 0 in the first dimension, which could come either from a raw 0 or from a normalised value of 0, can still be discriminated.
You can think of this as encoding an extra feature saying "the other feature is missing". This way the scale of each feature is normalised, and all information is preserved.
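A minimal numpy sketch of this encoding (the function name and the toy matrix are just illustrative; scikit-learn's MissingIndicator can also produce the flag columns if you prefer):

import numpy as np

def encode_with_missing_indicator(X):
    """Expand each column into (normalised value, missing flag).

    Assumes 0 marks 'not measured'; normalisation statistics are
    computed over the non-zero entries of each column only.
    """
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    out = np.zeros((n, 2 * d))
    for j in range(d):
        col = X[:, j]
        observed = col != 0
        mu = col[observed].mean()
        sigma = col[observed].std() or 1.0
        out[observed, 2 * j] = (col[observed] - mu) / sigma  # value channel
        out[:, 2 * j + 1] = (~observed).astype(float)        # 1 = missing
    return out

X = np.array([[7300., 0.], [8100., 9500.], [0., 9900.]])
print(encode_with_missing_indicator(X))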

Desired distribution of weights in word embedding vectors

I am training my own embedding vectors as I'm focused on an academic dataset (WOS); whether the vectors are generated via word2vec or fasttext doesn't particularly matter. Say my vectors are 150 dimensions each. I'm wondering what the desired distribution of weights within a vector ought to be, if you averaged across an entire corpus's vectors?
I did a few experiments while looking at the distributions of a sample of my vectors and came to these conclusions (uncertain as to how absolutely they hold):
If one trains the model with too few epochs, the vectors don't change significantly from their initial values (easy to see if you start your vectors with weight 0 in every dimension). Thus if my weight distribution is centered around some point (typically 0), I've under-trained on my corpus.
If one trains the model with too few documents / over-trains, the vectors show significant correlation with each other (I typically visualize a random set of vectors, and you can see stripes where all the vectors have weights that are either positive or negative).
What I imagine is that a single "good" vector has weights spread across the entire range of -1 to 1. Any single vector may have significantly more dimensions near -1 or 1. However, averaged over an entire corpus, vectors that randomly lean towards one end of the spectrum or the other would balance out, so that the weight distribution of the whole corpus is roughly uniform across the range. Is this intuition correct?
I'm unfamiliar with any research or folk wisdom about the desirable "weights of the vectors" (by which I assume you mean the individual dimensions).
In general, since the individual dimensions aren't strongly interpretable, I'm not sure you could say much about how any one dimension's values should be distributed. And remember, our intuitions from low-dimensional spaces (2d, 3d, 4d) often don't hold up in high-dimensional spaces.
I've seen two interesting, possibly relevant observations in research:
Some have observed that the raw trained vectors for words with singular meanings tend to have a larger magnitude, while those with many meanings have smaller magnitudes. A plausible explanation is that word-vectors for polysemous word-tokens are being pulled in different directions for the multiple contrasting meanings, and thus wind up "somewhere in the middle" (closer to the origin, and thus of lower magnitude). Note, though, that most word-vector-to-word-vector comparisons ignore the magnitudes, by using cosine similarity to compare only angles (or, largely equivalently, by normalizing all vectors to unit length before comparison).
The paper "All-but-the-Top: Simple and Effective Postprocessing for Word Representations" by Mu, Bhat, & Viswanath (https://arxiv.org/abs/1702.01417v2) notes that the average of all word-vectors trained together tends to be biased in a certain direction from the origin, and that removing that bias (and other commonalities in the vectors) can result in improved vectors for many tasks. In my own experiments, I've observed that the magnitude of that bias-from-origin seems correlated with the number of negative samples chosen, and that choosing the extreme (and uncommon) value of just 1 negative sample makes the bias negligible (but might not be best for overall quality or for efficiency/speed of training).
So there may be useful heuristics about vector quality from looking at the relative distributions of vectors, but I'm not sure any would be sensitive to individual dimensions (except insofar as those happen to be the projections of vectors onto a certain axis).
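If you want to poke at this yourself, here is a small sketch along the lines of the Mu et al. post-processing: subtract the common mean (the bias-from-origin mentioned above) and then the projections onto the top principal components. Here `vectors` is assumed to be your (n_words, 150) array, and the default n_components is only illustrative (the paper suggests roughly dim/100):

import numpy as np
from sklearn.decomposition import PCA

def all_but_the_top(vectors, n_components=2):
    """Remove the common mean vector and the top principal components
    from a set of word vectors (after Mu, Bhat & Viswanath, 2017)."""
    mean = vectors.mean(axis=0)
    centred = vectors - mean                       # remove the bias from the origin
    pca = PCA(n_components=n_components).fit(centred)
    top = pca.components_                          # shape (n_components, dim)
    return centred - centred @ top.T @ top         # strip the dominant shared directions

# e.g. how far the average vector sits from the origin before post-processing:
# print(np.linalg.norm(vectors.mean(axis=0)))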

Using scipy.stats.entropy on gmm.predict_proba() values

Background so I don't throw out an XY problem -- I'm trying to check the goodness of fit of a GMM because I want statistical back-up for why I'm choosing the number of clusters I've chosen to group these samples. I'm checking AIC, BIC, entropy, and root mean squared error. This question is about entropy.
I've used kmeans to cluster a bunch of samples, and I want an entropy greater than 0.9 (stats and psychology are not my expertise and this problem is both). I have 59 samples; each sample has 3 features in it. I look for the best covariance type via
from sklearn import mixture

cv_types = ['spherical', 'tied', 'diag', 'full']   # the four covariance types compared
for cv_type in cv_types:
    for n_components in n_components_range:
        # Fit a Gaussian mixture with EM
        gmm = mixture.GaussianMixture(n_components=n_components,
                                      covariance_type=cv_type)
        gmm.fit(data3)
where the n_components_range is just [2] (later I'll check 2 through 5).
Then I take the GMM with the lowest AIC or BIC of the four, saved as best_eitherAB (not shown). I want to see whether the label assignments of the predictions are stable across runs (I want to run 1000 iterations), so I know I need to calculate the entropy, which requires class-assignment probabilities. So I predict the probabilities of the class assignments via the GMM's method,
probabilities = best_eitherAB.predict_proba(data3)
all_probabilities.append(probabilities)
After all the iterations, I have an array of 1000 arrays, each containing 59 rows (the sample size) by 2 columns (one per class). Each inner row of two sums to 1 to make a probability distribution.
Now, I'm not entirely sure what to do regarding the entropy. I can just feed the whole thing into scipy.stats.entropy,
entr = scipy.stats.entropy(all_probabilities)
and it spits out numbers: I get a 2-item numpy array for each of my samples. I could feed in just one of the 1000 runs and get a single small array of two items, or I could feed in just a single column and get a single value back. But I don't know what this is, and the numbers are between 1 and 3.
So my questions are: am I totally misunderstanding how I can use scipy.stats.entropy to calculate the stability of my classes? And if not, what's the best way to find a single-number entropy that tells me how good my model selection is?
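For reference, a minimal sketch of how scipy.stats.entropy treats these shapes (the toy array stands in for one iteration's (59, 2) probabilities; the axis keyword needs a reasonably recent SciPy). With the default axis=0 it normalises and reduces along the first axis, which is why feeding in the full stack of 1000 arrays returns a (59, 2) result:

import numpy as np
from scipy.stats import entropy

probabilities = np.array([[0.90, 0.10],
                          [0.50, 0.50],
                          [0.99, 0.01]])   # toy stand-in for one (n_samples, 2) array

# Default axis=0: each *column* is normalised and reduced, mixing samples together.
print(entropy(probabilities))              # one value per class column

# Per-sample entropy of the class-assignment distribution:
per_sample = entropy(probabilities, base=2, axis=1)
print(per_sample)                          # one value in [0, 1] per sample (2 classes, base 2)
print(per_sample.mean())                   # a single summary number, if that is what you need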

How to reduce data of unknown size into fixed-size data? Please read the details

Example:
Given n images, numbered 1 to n, where n is unknown in advance, I can calculate a property of every image which is a scalar quantity. Now I have to represent this property of all the images in a fixed-size vector (of size, say, 5 or 10).
One naive approach could be this vector: [avg, max, min, std_deviation].
I also want to include the effect of the relative positions of those images.
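A minimal numpy sketch of that naive summary, padded with two crude position-aware entries (the particular statistics chosen here are only illustrative):

import numpy as np

def fixed_size_summary(values):
    """Map a variable-length sequence of per-image scalars (ordered 1..n)
    to a fixed-size descriptor; the last two entries record where the
    max and min occur along the sequence (relative position)."""
    v = np.asarray(values, dtype=float)
    pos = np.linspace(0.0, 1.0, num=len(v))        # normalised image index
    return np.array([
        v.mean(),
        v.max(),
        v.min(),
        v.std(),
        pos[v.argmax()],                           # relative position of the maximum
        pos[v.argmin()],                           # relative position of the minimum
    ])

print(fixed_size_summary([0.2, 0.9, 0.4]))             # n = 3
print(fixed_size_summary([0.2, 0.9, 0.4, 0.1, 0.3]))   # n = 5, same output size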
What you are looking for is called feature extraction.
There are many techniques for this. For your purpose, try:
PCA
Auto-encoders
Convolutional Auto-encoders, 1 & 2
You could also look into conventional (older) methods like SIFT, HOG, or edge detection, but they will all need an extra step to bring them down to a smaller, fixed size.

Standardization for PCA/FA

I did a PCA/FA analysis with and without standardization and ended up with different results. For standardization, I just divided each input variable by its corresponding standard deviation; however, I did not subtract the mean (as in the case of z-scores). My question is: how important is it to subtract the mean in the case of PCA/FA?
I found on another blog that dividing by the standard deviation is another way of standardizing the data set. Is this superior to z-scores in any sense? Thanks.
By definition, principal components try to capture the highest variation in the data. The important point is that variation here is defined in terms of the 2nd norm, not the variance and not the standard deviation.
For example, the first principal component is the linear combination of the data in the direction given by $w_1 = \arg\max_{\|w\|=1} \|Xw\|_2$, where $X$ is the data matrix.
This matters a lot because:
Unlike the variance, the 2nd norm is sensitive to location: if you add a constant to a vector, its variance will not change, but its 2nd norm will.
Unlike the standard deviation, the 2nd norm is sensitive to scale: if a vector is multiplied by a constant factor, its 2nd norm scales by that factor.
There are at least two problems if an analysis is affected by the location and scale of the explanatory factors:
In reality, observations represent different phenomena, so they have different and incomparable scales and averages; for example, the variation and average of incomes are not comparable with the variation and average of ages in a sample population.
You do not want the model's results to change conceptually if, for example, incomes are quoted in cents as opposed to dollars, or measurements are taken in inches and feet as opposed to meters.
But plain PCA is sensitive to scale and location. For example, take a PCA analysis of two-dimensional standard normal variables with correlation 0.4, with the directions of the loading vectors drawn as red lines: the first principal component captures the highest variation in the joint data and correctly gives equal shares to each variable.
But things change dramatically if we move the population 2 units to the right (equivalent to increasing the mean of the first variable by 2 units).
Technically we have the same data as before, but now the first principal component basically captures the fact that the first variable has a non-zero mean.
Similarly, if the first variable is scaled by a factor of 2, it gets four times more weight than the second, driven simply by the fact that it now has four times the variance.
This shows the importance of normalizing the scale and removing the mean from the data before doing PCA.
That said, one can still come up with situations in which the relative location and scale of the explanatory factors carry useful information for the analysis and should not be wiped out of the data.
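A minimal numpy sketch of the experiment described above, using the 2nd-norm ("uncentred, unscaled") definition of the first component; the sample size and the shift/scale factors are just illustrative:

import numpy as np

rng = np.random.RandomState(0)
cov = [[1.0, 0.4], [0.4, 1.0]]
X = rng.multivariate_normal([0.0, 0.0], cov, size=5000)

def first_direction(data):
    """Direction w (||w|| = 1) maximising ||Xw||, i.e. plain uncentred PCA."""
    _, _, vt = np.linalg.svd(data, full_matrices=False)
    return vt[0]

print(first_direction(X))                # ~ +-(0.71, 0.71): equal shares (sign is arbitrary)
print(first_direction(X + [2.0, 0.0]))   # shift the first variable's mean by 2:
                                         #   the direction tilts strongly towards it
print(first_direction(X * [2.0, 1.0]))   # scale the first variable by 2:
                                         #   it now dominates the loading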
