Why do the principal components of the covariance matrix capture the maximum variance of the variables? - statistics

I am trying to understand PCA and have gone through several tutorials. So far I understand that the eigenvectors of a matrix indicate the directions in which vectors are rotated and scaled when multiplied by that matrix, in proportion to the eigenvalues. Hence the eigenvector associated with the largest eigenvalue defines the direction of maximum rotation. I understand that along the principal component the variation is maximal and the reconstruction error is minimal. What I do not understand is:
Why does finding the eigenvectors of the covariance matrix give the axes along which the original variables are better defined?
In addition to tutorials, I reviewed other answers here, including this and this, but I still do not understand it.

Your premise is incorrect. PCA (and eigenvectors of a covariance matrix) certainly don't represent the original data "better".
Briefly, the goal of PCA is to find some lower-dimensional representation of your data (X, which is in n dimensions) such that as much of the variation as possible is retained. The upshot is that this lower-dimensional representation is a subspace spanned by orthogonal directions, and it's the best k-dimensional representation (where k < n) of your data. We must find that subspace.
Another way to think about this: given a data matrix X, find a matrix Y such that Y is a k-dimensional projection of X. To find the best projection, we can just minimize the difference between X and Y, which in matrix-speak means minimizing ||X - Y||^2.
Since Y is just a projection of X onto lower dimensions, we can say Y = X*v*v^T, where the columns of v are orthonormal and v*v^T is a lower-rank projection matrix. Google rank if this doesn't make sense. We know X*v has lower dimension than X, but we don't know what directions v points in.
To find out, we look for the v such that ||X - X*v*v^T||^2 is minimized. For a unit-length v this is equivalent to maximizing ||X*v||^2 = v^T*X^T*X*v, and X^T*X is, once the columns of X are centered and up to division by the number of samples minus one, the sample covariance matrix of your data. This is mathematically why we care about the covariance of the data. Also, it turns out that the v that does this best is an eigenvector of that matrix, namely the one with the largest eigenvalue. There is one eigenvector for each dimension in the lower-dimensional projection/approximation. These eigenvectors are also orthogonal.
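The equivalence between minimizing the reconstruction error and maximizing the projected variance can be written out in a few lines. This is a sketch for the simplest case, assuming a single unit-length direction v (so k = 1), column-centered X, and the Frobenius norm; the multiplier \lambda is notation introduced here, not in the answer above.

```latex
\begin{aligned}
\|X - X v v^{T}\|_F^{2}
  &= \operatorname{tr}\!\big((X - X v v^{T})^{T}(X - X v v^{T})\big) \\
  &= \operatorname{tr}(X^{T}X) - 2\,v^{T}X^{T}X\,v + (v^{T}v)\,v^{T}X^{T}X\,v \\
  &= \|X\|_F^{2} - \|Xv\|^{2} \qquad (\text{using } v^{T}v = 1).
\end{aligned}

% \|X\|_F^2 does not depend on v, so minimizing the left side
% is the same as maximizing \|Xv\|^2 = v^T X^T X v subject to v^T v = 1.
% Setting the gradient of the Lagrangian to zero gives the eigenvector condition:
\nabla_v \big[ v^{T}X^{T}X\,v - \lambda\,(v^{T}v - 1) \big] = 0
\;\Longrightarrow\; X^{T}X\,v = \lambda v,
\qquad v^{T}X^{T}X\,v = \lambda .
```

So every stationary point is an eigenvector, the objective there equals its eigenvalue, and the maximum is attained at the eigenvector with the largest eigenvalue.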
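Remember, because the eigenvectors are orthogonal, the projections of the data onto any two of them have covariance 0. Now think of a matrix with non-zero diagonals and zeros in the off-diagonals. This is what the covariance matrix looks like once the data are expressed in the eigenvector basis: the diagonal entries are the eigenvalues, i.e. the variances along each eigenvector.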
Hopefully that helps bridge the connection between covariance matrix and how it helps to yield the best lower dimensional subspace.
Again, eigenvectors don't better define our original variables. The axes determined by applying PCA to a dataset are linear combinations of our original variables that exhibit maximum variance and produce the closest possible approximation to our original data (as measured by the L2 norm).
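To check these claims numerically, here is a minimal sketch using NumPy. The data set is synthetic and the helper names (projected_variance, reconstruction_error) are made up for illustration; it compares the top eigenvector of the sample covariance matrix against many random unit directions.

```python
import numpy as np

# Synthetic 2-D data (an assumption for illustration): correlated Gaussian samples.
rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=[[3.0, 1.5], [1.5, 1.0]], size=500)

# Center the columns so that Xc^T Xc / (m - 1) is the sample covariance matrix.
Xc = X - X.mean(axis=0)
m = Xc.shape[0]
C = Xc.T @ Xc / (m - 1)

# eigh returns eigenvalues in ascending order for symmetric matrices.
eigvals, eigvecs = np.linalg.eigh(C)
v = eigvecs[:, -1]  # unit eigenvector with the largest eigenvalue


def projected_variance(w):
    """Sample variance of the data projected onto the unit vector w."""
    return np.var(Xc @ w, ddof=1)


def reconstruction_error(w):
    """Squared Frobenius error of the rank-1 approximation Xc w w^T."""
    return np.sum((Xc - np.outer(Xc @ w, w)) ** 2)


# Compare against many random unit directions (cos t, sin t).
angles = rng.uniform(0.0, np.pi, size=1000)
random_dirs = np.column_stack([np.cos(angles), np.sin(angles)])
best_random_var = max(projected_variance(w) for w in random_dirs)
best_random_err = min(reconstruction_error(w) for w in random_dirs)

print("variance along v:", projected_variance(v), "largest eigenvalue:", eigvals[-1])
print("best variance among random directions:", best_random_var)
print("error for v:", reconstruction_error(v), "best random error:", best_random_err)
```

No random direction should beat v on either measure, and the variance along v should match the largest eigenvalue up to floating-point error.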

Related

How does probability come into play in a kNN algorithm?

kNN seems relatively simple to understand: you have your data points and you draw them in your feature space (in a feature space of dimension 2, it's the same as drawing points on an xy-plane graph). When you want to classify a new data point, you put it into the same feature space, find the nearest k neighbors, and see what their labels are, ultimately taking the label(s) with the most votes.
So where does probability come into play here? All I am doing is calculating the distance between two points and taking the label(s) of the closest neighbors.
For a new test sample you look at the K nearest neighbors and at their labels.
You count how many of those K samples fall in each of the classes, and divide the counts by K, the number of neighbors.
For example, let's say that you have 2 classes in your classifier and you use K=3 nearest neighbors, and the labels of those 3 nearest samples are (0,1,1). Then the probability of class 0 is 1/3 and the probability of class 1 is 2/3.
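The vote-counting estimate can be sketched in a few lines of plain Python. The function name knn_class_probabilities and the toy points are hypothetical, chosen so that the three nearest labels come out as in the example above.

```python
from collections import Counter
import math


def knn_class_probabilities(train_points, train_labels, query, k):
    """Estimate class probabilities as the fraction of the k nearest neighbors in each class."""
    # Sort training indices by Euclidean distance to the query point.
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    nearest_labels = [train_labels[i] for i in order[:k]]
    counts = Counter(nearest_labels)
    # Divide each vote count by k (the number of neighbors).
    return {label: counts.get(label, 0) / k for label in sorted(set(train_labels))}


# Toy data (hypothetical): two classes in a 2-D feature space.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
labels = [0, 1, 1, 0, 0]
probs = knn_class_probabilities(points, labels, query=(0.6, 0.6), k=3)
print(probs)  # class 0 gets 1/3 of the votes, class 1 gets 2/3
```

The predicted label is then simply the class with the highest estimated probability.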

Unsupervised Outlier detection

I have 6 points in each row and around 20k such rows. The points in each row lie on a curve, and the nature of the curve is the same for every row (say a sigmoidal curve, a straight line, etc.). The 6 points may have different x-values in each row. I also know a point (a,b) for each row that the curve should pass through. How should I go about finding the rows that may be anomalous or behave unexpectedly compared with the other rows? I was thinking of curve fitting, but I only have 6 points for each curve. All I know is that the majority of the rows have the same kind of curve, so perhaps I can fit a general curve to all the rows and use a distance threshold for outlier detection.
What happens if you just treat the 6 points as a 12 dimensional vector and run any of the usual outlier detection methods such as LOF and LoOP?
The relationship is easy to see: the squared Euclidean distance between two 12-dimensional vectors is the sum of the 6 squared Euclidean distances between corresponding points. So this will compare the similarities of these curves.
You can of course also define a complex distance function for LOF.
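The link between the 12-dimensional distance and the 6 point-to-point distances can be checked directly. This is a small sketch with two made-up rows; it assumes the points of each row are listed in a corresponding order, and the helper flatten is introduced here only for illustration.

```python
import math


def flatten(curve):
    """Turn a list of (x, y) points into one flat vector [x1, y1, x2, y2, ...]."""
    return [coord for point in curve for coord in point]


# Two hypothetical rows, each with 6 (x, y) points.
row_a = [(0, 0.0), (1, 0.1), (2, 0.5), (3, 0.9), (4, 1.0), (5, 1.0)]
row_b = [(0, 0.1), (1, 0.2), (2, 0.4), (3, 0.7), (4, 0.9), (5, 1.0)]

# Squared Euclidean distance between the two 12-dimensional vectors.
d12_squared = math.dist(flatten(row_a), flatten(row_b)) ** 2

# Sum of squared Euclidean distances between corresponding points.
pointwise_squared = sum(math.dist(p, q) ** 2 for p, q in zip(row_a, row_b))

print(d12_squared, pointwise_squared)
```

Both numbers agree up to rounding, so a distance-based outlier method run on the flattened vectors is effectively comparing the curves point by point.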

Possible to calculate the squared perimeter of a 2d triangle without square-root calls?

For a more efficient perimeter calculation*,
is it possible to avoid 3x length calculations (and therefore square-root calls)
when calculating a 2D triangle's squared perimeter?
To be clear, by squared perimeter I mean the actual perimeter multiplied by itself. The square root of this value would be the actual perimeter.
Just as the squared length is often used for comparisons.
* Efficient with floating point on today's typical CPUs.
There is no simple way to get the value (a+b+c)^2 for a triangle.
But it seems that using the perimeter, as in the linked topic, is not the best way.
Note also that using Heron's formula is not wise. The cross product is much simpler and faster.
Moreover, you need some criterion of triangle elongation. Why not compare values like this:
E = Max(AB^2, AC^2, BC^2) / CrossProduct(AB,AC)
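A minimal sketch of this elongation measure in Python follows. The function name elongation is made up here, and two details are assumptions not stated above: the absolute value of the cross product is used so that vertex order does not matter, and a degenerate (zero-area) triangle returns infinity.

```python
def elongation(a, b, c):
    """Longest squared edge divided by |AB x AC| (twice the area); uses no square roots."""
    abx, aby = b[0] - a[0], b[1] - a[1]
    acx, acy = c[0] - a[0], c[1] - a[1]
    bcx, bcy = c[0] - b[0], c[1] - b[1]

    longest_sq = max(abx * abx + aby * aby,
                     acx * acx + acy * acy,
                     bcx * bcx + bcy * bcy)

    # The 2-D cross product of AB and AC is twice the signed area of the triangle.
    cross = abs(abx * acy - aby * acx)
    if cross == 0.0:
        return float("inf")  # collinear points: infinitely elongated
    return longest_sq / cross


print(elongation((0, 0), (1, 0), (0, 1)))     # right isosceles triangle
print(elongation((0, 0), (10, 0), (5, 0.1)))  # long, thin triangle
```

Larger values mean a more elongated triangle, so the thin triangle scores far higher than the right isosceles one.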

Principal Component Analysis: why are we finding the max of the vectors?

I'm attempting to comprehend PCA but I'm stuck on a specific part. After it was referenced in the Harvard data science course I looked it up here: https://en.wikipedia.org/wiki/Principal_component_analysis
Under "Details", just below "First component", they say "the first loading vector w(1) thus has to satisfy", and I understand why the line below it is true.
The arg max where ||w|| = 1 implies we're finding the maximum value of the summation when w is a unit vector. But I don't understand why we want that, or how the values are expected to change, if we have a given matrix X. Unless we're trying to optimize which weights to dot with each row?
Or do we just do this to get it into Rayleigh quotient form, so we can then use the eigenvalues to find the largest eigenvector associated with the matrix? (which is also the largest vector)
And why do we want the largest vector in the first place? In our transformed axes are we just showing the largest variance in each dimension? Wouldn't we want to transform all points and attempt to see some correlation?
In a sense, the eigenvector with the largest eigenvalue is pointing in the direction of the maximum variance. The one with the second largest eigenvalue is pointing in the direction of maximum variance that is left after accounting for the first one. The eigenvector with the second largest eigenvalue will be orthogonal to the one with the largest eigenvalue. Take another look at the Wikipedia article you referenced, and look for the diagram in the upper right. The longer line is for the eigenvector with the largest eigenvalue and it points along the maximum variance in the data. The shorter line is the eigenvector that has the second largest eigenvalue, and it points along the maximum remaining variance, orthogonal to the first line.
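One way to see what the arg max over unit vectors is doing is to try every direction in 2-D and compare the winner with the eigenvectors. This is a sketch assuming NumPy and a made-up, column-centered data matrix X; the names objective, w_best, w1 and w2 are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data matrix X: 300 rows, 2 columns, then column-centered.
X = rng.normal(size=(300, 2)) @ np.array([[2.0, 0.0], [1.2, 0.5]])
X = X - X.mean(axis=0)


def objective(w):
    """Sum over rows of (x_i . w)^2, the quantity inside the arg max."""
    return np.sum((X @ w) ** 2)


# Every unit vector in 2-D is (cos t, sin t); scan t over half a circle.
thetas = np.linspace(0.0, np.pi, 3601)
candidates = np.column_stack([np.cos(thetas), np.sin(thetas)])
scores = np.array([objective(w) for w in candidates])
w_best = candidates[np.argmax(scores)]

# Eigenvectors of X^T X; eigh sorts eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
w1, w2 = eigvecs[:, 1], eigvecs[:, 0]  # largest first, then second largest

print("alignment of scan result with first eigenvector:", abs(w_best @ w1))
print("dot product of the two eigenvectors:", w1 @ w2)
print("objective at w1:", objective(w1), "largest eigenvalue:", eigvals[1])
```

The scan lands on (essentially) the first eigenvector, the objective there equals the largest eigenvalue, and the second eigenvector is orthogonal to the first.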

Multivariate Sample Variance and Covariance

When calculating sample variances and covariances from the data matrix, when do we divide by n and when do we divide by n-1?
When your sample is the population, divide by n. Otherwise, divide by n-1, because you lose one degree of freedom by estimating the mean from the same sample.
For large samples this should make hardly any difference.
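The two divisors can be compared with a short plain-Python sketch. The function covariance and its ddof parameter (the amount subtracted from n in the divisor) are names chosen here for illustration, and the data values are made up.

```python
def covariance(x, y, ddof):
    """Covariance of two equal-length sequences, dividing by len(x) - ddof."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    total = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return total / (n - ddof)


data = [2, 4, 4, 4, 5, 5, 7, 9]

# The variance is the covariance of a variable with itself.
population_var = covariance(data, data, ddof=0)  # divide by n
sample_var = covariance(data, data, ddof=1)      # divide by n - 1

print(population_var, sample_var)
```

The two results differ by the factor n/(n-1), which is 8/7 here and approaches 1 as the sample grows.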
