Standardization for PCA/FA

Standardization for PCA/FA - statistics

I did PCA/FA analysis with and without standardization and end up with different results. For standardization, I just divided each input variable by its corresponding standard deviation. However, I have not subtracted the mean (as in case of Z-scores). My question is how important it is to subtract the mean in case of PCA/FA?
I found on another blog that dividing by std dev is another way of standardizing the data-set. Is this superior to z-scores in any sense? Thanks.

By definition, principal components try to capture highest variation in the data; The important point is that, variation in here is defined as the 2nd norm; not variance and not standard deviation;
For example the first principal component is the linear combination of data in the direction given by:
This matters a lot because
unlike variance, 2nd norm is sensitive to location; in other words, if you add a constant to a vector, the variance will not change but the 2nd norm will change;
unlike standard deviation, 2nd norm is sensitive to scale; i.e. if a vector is multiplied by a constant factor, 2nd norm will scale by that factor;
There are at least two problems if an analysis is impacted by location and scale of explanatory factors:
In reality, observations represent different phenomena, so they have different and incomparable scale and average; for example the variations and average income values are not comparable with variations and average age of a sample population;
You do not want the model results conceptually change if for example incomes are quoted in cents as opposed to dollars, or measurements are done in inches and feet as opposed to meters;
But, plain PCA is sensitive to scale and location; for example, this is a PCA analysis on two dimensional standard normal variables with correlation .4;
The red lines represents the direction of loading vectors; Obviously the first principal component is capturing the highest variation in the joint data, and correctly gives equal shares to each vector;
But things will change dramatically if we move the population 2 units to the right; (equivalent of increasing the average of the first vector by 2 units):
Technically we have the same data as before, but now the first principal component is basically capturing the fact that the first vector has non-zero mean;
Similarly, if the first vector is scaled by a factor of 2:
As can be seen, the first vector has got 4 times more weight than the second vector, simply driven by the fact that it has higher variance.
This shows the importance of normalizing scale and removing mean value from the data before doing PCA;
That said, still one can come up with certain situations that the relative location and scale of the explanatory factors have useful information in the analysis and they should not be wiped out of the data.

Related

Decision trees: information gain - bias against attributes - how and why is it say so?

I am confused to get the context on biases in the following line (marked in bold):
Information gain ratio biases the decision tree against considering attributes with a large number of distinct values which might lead to overfitting.

Did you mean Information gain, as information gain is bias towards variables with large distinct values and information gain ratio is tries to solve this by taking into account the number of branches that would result before making the split, It corrects information gain by taking the intrinsic information of a split into account.
Answer for why information gain is biased towards variables with large distinct values
Please note that information gain (IG) is biased toward variables with large number of distinct values not variables that have observations with large values. Before describing the reason of this condition, lets review the definition of IG.
Information gain is the amount of information that's gained by knowing the value of the attribute, which is the entropy of the distribution before the split minus the entropy of the distribution after it. The largest information gain is equivalent to the smallest entropy.
In other words, a variable with the highest number of distinct values probability can divide data to smaller chunks. Also, we know that lower number of observations in each chunk reduces probability of variation occurrence.
Using ID variable in splitting data is a common example for this issue. Since each individual sample has their own distinct value, selecting ID features leads to many clusters with one sample and entropy of zero. Therefore, a decision tree that works with IG, selects the ID as the first separator attribute. Indeed, entropy will approach to zero by selecting the ID feature. However, we are not interested to such a feature. We are more interested to features that highly explain the variation of dependent variable.
Please refer to this discussion where this point was initially written.

How can r-squared be negative when the correlation between prediction and truth is positive?

Trying to understand how the r-squared (and also explained variance) metrics can be negative (thus indicating non-existant forecasting power) when at the same time the correlation factor between prediction and truth (as well as slope in a linear-regression (regressing truth on prediction)) are positive

R Squared can be negative in a rare scenario.
R squared = 1 – (SSR/SST)
Here, SST stands for Sum of Squared Total which is nothing but how much does the predicted points get varies from the mean of the target variable. Mean is nothing but a regression line here.
SST = Sum (Square (Each data point- Mean of the target variable))
For example,
If we want to build a regression model to predict height of a student with weight as the independent variable then a possible prediction without much effort is to calculate the mean height of all current students and consider it as the prediction.
In the above diagram, red line is the regression line which is nothing but the mean of all heights. This mean calculated without much effort and can be considered as one of the worst method of prediction with poor accuracy. In the diagram itself we can see that the prediction is nowhere near to the original data points.
Now come to SSR,
SSR stands for Sum of Squared Residuals. This residual is calculated from the model which we build from our mathematical approach (Linear regression, Bayesian regression, Polynomial regression or any other approach). If we use a sophisticated approach rather than using a naive approach like mean then our accuracy will obviously increase.
SSR = Sum (Square (Each data point - Each corresponding data point in the regression line))
In the above diagram, let's consider that the blue line indicates a sophisticated model with large mathematical analysis. We can see that it has obviously higher accuracy than the red line.
Now come to the formula,
R Squared = 1- (SSR/SST)
Here,
SST will be large number because it a very poor model (red line).
SSR will be a small number because it is the best model we developed
after much mathematical analysis (blue line).
So, SSR/SST will be a very small number (It will become very small
whenever SSR decreases).
So, 1- (SSR/SST) will be large number.
So we can infer that whenever R Squared goes higher, it means the
model is too good.
This is a generic case but this cannot be applied in many cases where multiple independent variables are present. In the example, we had only one independent variable and one target variable but in real case, we will have 100's of independent variables for a single dependent variable. The actual problem is that, out of 100's of independent variables-
Some variables will have very high correlation with target variable.
Some variables will have very small correlation with target variable.
Also some independent variables will have no correlation at all.
So, RSquared is calculated on an assumption that the average line of the target which is perpendicular line of y axis is the worst fit a model can have at a maximum riskiest case. SST is the squared difference between this average line and original data points. Similarly, SSR is the squared difference between the predicted data points (by the model plane) and original data points.
SSR/SST gives a ratio how SSR is worst with respect to SST. If your model can somewhat build a plane which is a comparatively good than the worst, then in 99% cases SSR<SST. It eventually makes R squared as positive if you substitute it in the equation.
But what if SSR>SST ? This means that your regression plane is worse than the mean line (SST). In this case, R squared will be obviously negative. But it happens only at 1% of cases or smaller.
Answer was originally written in quora by me -
https://qr.ae/pNsLU8
https://qr.ae/pNsLUr

Under what conditions can two classes have different average values, yet be indistinguishable to a SVM?

I am asking because I have observed sometimes in neuroimaging that a brain region might have different average activation between two experimental conditions, but sometimes an SVM classifier somehow can't distinguish the patterns of activation between the two conditions.
My intuition is that this might happen in cases where the within-class variance is far greater than the between-class variance. For example, suppose we have two classes, A and B, and that for simplicity our data consists just of integers (rather than vectors). Let the data falling under class A be 0,0,0,0,0,10,10,10,10,10. Let the data falling under class B be 1,1,1,1,1,11,11,11,11,11. Here, A and B are clearly different on average, yet there's no decision boundary that would allow A and B to be distinguished. I believe this logic would hold even if our data consisted of vectors, rather than integers.
Is this a special case of some broader range of cases where an SVM would fail to distinguish two classes that are different on average? Is it possible to delineate the precise conditions under which an SVM classifier would fail to distinguish two classes that differ on average?
EDIT: Assume a linear SVM.

As described in the comments - there are no such conditions because SVM will separate data just fine (I am not talking about any generalisation here, just separating training data). For the rest of the answer I am assuming there are no two identical points with different labels.
Non-linear case
For a kernel case, using something like RBF kernel, SVM will always perfectly separate any training set, given that C is big enough.
Linear case
If data is linearly separable then again - with big enough C it will separate data just fine. If data is not linearly separable, cranking up C as much as possible will lead to smaller and smaller training error (of course it will not get 0 since data is not linearly separable).
In particular for the data you provided kernelized SVM will get 100%, and any linear model will get 50%, but it has nothing to do with means being different or variances relations - it is simply a dataset where any linear separator has at most 50% accuracy, literally every decision point, thus it has nothing to do with SVM. In particular it will separate them "in the middle", meaning that the decision point will be somewhere around "5".

Scale before PCA

I'm using PCA from sckit-learn and I'm getting some results which I'm trying to interpret, so I ran into question - should I subtract the mean (or perform standardization) before using PCA, or is this somehow embedded into sklearn implementation?
Moreover, which of the two should I perform, if so, and why is this step needed?

I will try to explain it with an example. Suppose you have a dataset that includes a lot features about housing and your goal is to classify if a purchase is good or bad (a binary classification). The dataset includes some categorical variables (e.g. location of the house, condition, access to public transportation, etc.) and some float or integer numbers (e.g. market price, number of bedrooms etc). The first thing that you may do is to encode the categorical variables. For instance, if you have 100 locations in your dataset, the common way is to encode them from 0 to 99. You may even end up encoding these variables in one-hot encoding fashion (i.e. a column of 1 and 0 for each location) depending on the classifier that you are planning to use. Now if you use the price in million dollars, the price feature would have a much higher variance and thus higher standard deviation. Remember that we use square value of the difference from mean to calculate the variance. A bigger scale would create bigger values and square of a big value grow faster. But it does not mean that the price carry significantly more information compared to for instance location. In this example, however, PCA would give a very high weight to the price feature and perhaps the weights of categorical features would almost drop to 0. If you normalize your features, it provides a fair comparison between the explained variance in the dataset. So, it is good practice to normalize the mean and scale the features before using PCA.

Before PCA, you should,
Mean normalize (ALWAYS)
Scale the features (if required)
Note: Please remember that step 1 and 2 are not the same technically.

This is a really non-technical answer but my method is to try both and then see which one accounts for more variation on PC1 and PC2. However, if the attributes are on different scales (e.g. cm vs. feet vs. inch) then you should definitely scale to unit variance. In every case, you should center the data.
Here's the iris dataset w/ center and w/ center + scaling. In this case, centering lead to higher explained variance so I would go with that one. Got this from sklearn.datasets import load_iris data. Then again, PC1 has most of the weight on center so patterns I find in PC2 I wouldn't think are significant. On the other hand, on center | scaled the weight is split up between PC1 and PC2 so both axis should be considered.

Representing classification confidence

I am working on a simple AI program that classifies shapes using unsupervised learning method. Essentially I use the number of sides and angles between the sides and generate aggregates percentages to an ideal value of a shape. This helps me create some fuzzingness in the result.
The problem is how do I represent the degree of error or confidence in the classification? For example: a small rectangle that looks very much like a square would yield night membership values from the two categories but can I represent the degree of error?
Thanks

Your confidence is based on used model. For example, if you are simply applying some rules based on the number of angles (or sides), you have some multi dimensional representation of objects:
feature 0, feature 1, ..., feature m
Nice, statistical approach
You can define some kind of confidence intervals, baesd on your empirical results, eg. you can fit multi-dimensional gaussian distribution to your empirical observations of "rectangle objects", and once you get a new object you simply check the probability of such value in your gaussian distribution, and have your confidence (which would be quite well justified with assumption, that your "observation" errors have normal distribution).
Distance based, simple approach
Less statistical approach would be to directly take your model's decision factor and compress it to the [0,1] interaval. For example, if you simply measure distance from some perfect shape to your new object in some metric (which yields results in [0,inf)) you could map it using some sigmoid-like function, eg.
conf( object, perfect_shape ) = 1 - tanh( distance( object, perfect_shape ) )
Hyperbolic tangent will "squash" values to the [0,1] interval, and the only remaining thing to do would be to select some scaling factor (as it grows quite quickly)
Such approach would be less valid in the mathematical terms, but would be similar to the approach taken in neural networks.
Relative approach
And more probabilistic approach could be also defined using your distance metric. If you have distances to each of your "perfect shapes" you can calculate the probability of an object being classified as some class with assumption, that classification is being performed at random, with probiability proportional to the inverse of the distance to the perfect shape.
dist(object, perfect_shape1) = d_1
dist(object, perfect_shape2) = d_2
dist(object, perfect_shape3) = d_3
...
inv( d_i )
conf(object, class_i) = -------------------
sum_j inv( d_j )
where
inv( d_i ) = max( d_j ) - d_i
Conclusions
First two ideas can be also incorporated into the third one to make use of knowledge of all the classes. In your particular example, the third approach should result in confidence of around 0.5 for both rectangle and circle, while in the first example it would be something closer to 0.01 (depending on how many so small objects would you have in the "training" set), which shows the difference - first two approaches show your confidence in classifing as a particular shape itself, while the third one shows relative confidence (so it can be low iff it is high for some other class, while the first two can simply answer "no classification is confident")

Building slightly on what lejlot has put forward; my preference would be to use the Mahalanobis distance with some squashing function. The Mahalanobis distance M(V, p) allows you to measure the distance between a distribution V and a point p.
In your case, I would use "perfect" examples of each class to generate the distribution V and p is the classification you want the confidence of. You can then use something along the lines of the following to be your confidence interval.
1-tanh( M(V, p) )

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string