(Incremental)PCA's Eigenvectors are not transposed but should be? - scikit-learn

When we posted a homework assignment about PCA, we told the course participants to pick whichever way of calculating the eigenvectors they liked. They found multiple ways: eig, eigh (our favorite was svd). In a later task we told them to use the PCAs from scikit-learn - and were surprised that the results differed a lot more than we expected.
I toyed around a bit and we posted an explanation to the participants that either solution was correct and probably just suffered from numerical instabilities in the algorithms. However, recently I picked that file up again during a discussion with a co-worker and we quickly figured out that there's an interesting subtle change to make to get all results to be almost equivalent: Transpose the eigenvectors obtained from the SVD (and thus from the PCAs).
A bit of code to show this:
import numpy as np

def pca_eig(data):
    """Uses numpy.linalg.eig to calculate the PCA."""
    data = data.T @ data  # scatter matrix X^T X
    val, vec = np.linalg.eig(data)
    return val, vec
versus
def pca_svd(data):
    """Uses numpy.linalg.svd to calculate the PCA."""
    u, s, v = np.linalg.svd(data)
    return s ** 2, v
Does not yield the same result. Changing the return of pca_svd to s ** 2, v.T, however, works! It makes perfect sense following the definition on Wikipedia: the SVD of X is X = U Σ W^T, where
the right singular vectors W of X are equivalent to the eigenvectors of X^T X.
So to get the eigenvectors in the same (column) convention as np.linalg.eig, we need to transpose the output v of np.linalg.svd(...).
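A minimal sketch of that convention difference, with random stand-in data (not the data from the assignment): np.linalg.eig returns eigenvectors as the columns of vec, while np.linalg.svd returns the right singular vectors as the rows of v, so v.T puts them back into columns:

import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(100, 3)
X = X - X.mean(axis=0)                    # center the data

# Eigendecomposition of the scatter matrix: eigenvectors are the COLUMNS of vec.
val, vec = np.linalg.eig(X.T @ X)

# SVD of the data matrix: right singular vectors are the ROWS of v.
u, s, v = np.linalg.svd(X, full_matrices=False)

# Up to sign and ordering, the columns of vec match the columns of v.T.
order = np.argsort(val)[::-1]
print(np.allclose(np.abs(vec[:, order]), np.abs(v.T)))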
Unless there is something else going on? Anyway, the PCA and IncrementalPCA both show wrong results (or eig is wrong? I mean, transposing that yields the same equality), and looking at the code for PCA reveals that they are doing it as I did it initially:
U, S, V = linalg.svd(X, full_matrices=False)
# flip eigenvectors' sign to enforce deterministic output
U, V = svd_flip(U, V)
components_ = V
I created a little gist demonstrating the differences (nbviewer), the first with PCA and IncPCA as they are (also no transposition of the SVD), the second with transposed eigenvectors:
Comparison without transposition of SVD/PCAs (normalized data)
Comparison with transposition of SVD/PCAs (normalized data)
As one can clearly see, in the upper image the results are not really great, while the lower image only differs in some signs, thus mirroring the results here and there.
Is this really wrong and a bug in scikit-learn? More likely I am using the math wrong – but what is right? Can you please help me?

If you look at the documentation, it's pretty clear from the documented shape of components_, (n_components, n_features), that the eigenvectors are in the rows, not the columns.
The point of the sklearn PCA is that you can use the transform method to do the correct transformation.
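For illustration, a minimal sketch with random stand-in data: components_ has one eigenvector per row, and transform projects the centered data onto those rows:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(100, 3)

pca = PCA(n_components=2).fit(X)
print(pca.components_.shape)        # (2, 3): one eigenvector per ROW

# transform() is equivalent to projecting the centered data onto those rows.
Z = pca.transform(X)
Z_manual = (X - pca.mean_) @ pca.components_.T
print(np.allclose(Z, Z_manual))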

Related

Implementing alternative Fibonacci sequence

So I'm struggling with Question 3. I think the representation of L would be a function that goes something like this:
import numpy as np

def L(a, b):
    # L is 2x2 Matrix, that is
    return np.dot([[0, 1], [1, 1]], [a, b])

def fibPow(n):
    if n == 1:
        return L(0, 1)
    if n % 2 == 0:
        return np.dot(fibPow(n / 2), fibPow(n / 2))
    else:
        return np.dot(L(0, 1), np.dot(fibPow(n // 2), fibPow(n // 2)))
Given b, I'm pretty sure I'm wrong. What should I be doing? Any help would be appreciated. I don't think I'm supposed to use the golden ratio property of the Fibonacci series. What should my a and b be?
EDIT: I've updated my code. For some reason it doesn't work. L will give me the right answer, but my exponentiation seems to be wrong. Can someone tell me what I'm doing wrong?
With the edited code, you are almost there. Just don't cram everything into one function. That leads to subtle mistakes, which I think you may enjoy finding.
Now, L is not a function. As I said before, it is a matrix. And the core of the problem is to compute its nth power. Consider
L = [[0, 1], [1, 1]]

def nth_power(matrix, n):
    if n == 1:
        return matrix
    if n % 2 == 0:
        temp = nth_power(matrix, n // 2)
        return np.dot(temp, temp)
    else:
        temp = nth_power(matrix, n // 2)
        return np.dot(matrix, np.dot(temp, temp))

def fibPow(n):
    Ln = nth_power(L, n)
    return np.dot(Ln, [0, 1])[1]
The nth_power is almost identical to your approach, with some trivial optimization. You may optimize it further by eliminating recursion; a sketch of an iterative version follows below.
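For example, a minimal iterative sketch of exponentiation by squaring (my addition, not part of the original answer; np.int64 overflows past F(92), so switch to Python ints or dtype=object for larger n):

import numpy as np

def nth_power_iter(matrix, n):
    """Exponentiation by squaring, without recursion."""
    result = np.identity(2, dtype=np.int64)
    base = np.array(matrix, dtype=np.int64)
    while n > 0:
        if n % 2 == 1:                 # odd: fold the current base into the result
            result = np.dot(result, base)
        base = np.dot(base, base)      # square the base
        n //= 2
    return result

print(nth_power_iter([[0, 1], [1, 1]], 10))   # [[F(9), F(10)], [F(10), F(11)]] == [[34, 55], [55, 89]]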
First things first: there is no L(n, a, b). There is just L(a, b), a well defined linear operator which transforms a vector (a, b) into a vector (b, a+b).
Now a huge hint: a linear operator is a matrix (in this case, 2x2, and very simple). Can you spell it out?
Now, applying this matrix n times in a row to an initial vector (in this case, (0, 1)) is, by matrix magic, equivalent to applying the nth power of L once to the initial vector. This is what Question 2 is about.
Once you determine what this matrix looks like, fibPow reduces to computing its nth power and multiplying the result by (0, 1). To get O(log n) complexity, check out exponentiation by squaring; a quick numerical check of the matrix-power claim follows below.
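As a quick sanity check of that claim, a minimal sketch using numpy's built-in matrix power:

import numpy as np

L = np.array([[0, 1], [1, 1]])
v0 = np.array([0, 1])                        # (F(0), F(1))

for n in range(1, 8):
    # Applying L n times is the same as applying L**n once.
    print(np.linalg.matrix_power(L, n) @ v0)  # -> (F(n), F(n+1))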

How to create a multi-diagonal square matrix in Theano?

Is there a better way to create a multi-diagonal square matrix in theano than the following,
A = theano.tensor.nlinalg.AllocDiag(offset=0)(x)
A += theano.tensor.nlinalg.AllocDiag(offset=1)(x[:-1])
A += theano.tensor.nlinalg.AllocDiag(offset=-1)(x[1:])
where x is the vector I want on the diagonals? Each time I call AllocDiag()() a new Apply node is created, which is causing memory issues and inefficiencies.
I'm hoping there is a way similar to scipy where a list of vectors can be passed into the function with a corresponding list of offsets, see https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.sparse.diags.html.
Any assistance is much appreciated.
One way which doesn't require AllocDiag()() is to use theano.tensor.set_subtensor() with A[range(n), range(n)] to address the diagonal indexes, where A is an n*n matrix. Something like the following:
A = tt.set_subtensor(A0[range(n), range(n)], x)
A = tt.set_subtensor(A[range(n-1), range(1, n)], x[:-1])
A = tt.set_subtensor(A[range(1, n), range(n-1)], x[1:])
where A0 is the initial matrix, for example, a matrix of zeros.
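For reference, a minimal self-contained sketch of this approach (my reconstruction, assuming a fixed, known n):

import numpy as np
import theano
import theano.tensor as tt

# Build a tridiagonal matrix from a vector x by filling the three diagonals
# of a zero matrix.
n = 5
x = tt.dvector('x')
idx = tt.arange(n)
A0 = tt.zeros((n, n))
A = tt.set_subtensor(A0[idx, idx], x)                # main diagonal
A = tt.set_subtensor(A[idx[:-1], idx[1:]], x[:-1])   # first superdiagonal
A = tt.set_subtensor(A[idx[1:], idx[:-1]], x[1:])    # first subdiagonal

f = theano.function([x], A)
print(f(np.arange(1.0, n + 1.0)))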

Scipy.integrate gives odd results; are there best practices?

I am still struggling with scipy.integrate.quad.
Sparing all the details, I have an integral to evaluate. It is an integral over x of a product of functions, like so:
Z(k) = ∫ f(x) g(k/x) / |x| dx
I know for certain the range of integration is between two positive numbers. Oddly, when I pick a wide range that I know must contain all values of x that are positive - like integrating from 1 to 10,000,000 - it integrates fast and gives an answer which looks right. But when I figure out the exact limits - which I know, since f(x) is zero over a lot of the real line - and use those, I get another answer that is different. They aren't very different, though I know the second is more accurate.
After much fiddling I got it to work OK, but then needed to add in an exponentiation. I had this working in an OK way before I added the exponentiation (which is needed) - I was at least getting a 'smooth' answer for the computed function Z - but now the function that gets generated becomes more and more oscillatory and peculiar.
Any idea what is happening here? I know this code comes from an old Fortran library, so there must be some known issues, but I can't find references.
Here is the core code:
import numpy as np
from scipy.integrate import quad

def normal(x, mu, sigma):
    return (1.0 / ((2.0 * 3.14159 * sigma**2) ** 0.5)
            * np.exp(-(x - mu)**2 / (2 * sigma**2)))

def integrand(x, z, mu, sigma, f):
    return np.exp(normal(z/x, mu, sigma)) * getP(x, f._x, f._y) / abs(x)

for _z in range(int(z_min), int(z_max) + 1, 1000):
    z.append(_z)
    pResult = quad(integrand, lb, ub,
                   args=(float(_z), MU - SIGMA**2/2, SIGMA, X),
                   points=[100000.0],
                   epsabs=1, epsrel=.01)
    p.append(pResult[0])  # drop the error estimate of the tuple
By the way, getP() returns a linearly interpolated, piecewise continuous, but non-smooth function to give the integrator values that smoothly fit between the discrete 'buckets' of the histogram.
As with many numerical methods, it can be very sensitive to asymptotes, zeros, etc. The only choice is to keep giving it 'hints' if it will accept them.
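To make the 'hints' concrete, here is a minimal sketch with a made-up, sharply peaked integrand (not your Z(k)): the points argument tells quad where the integrand has features inside a finite interval, and splitting the range at known features and summing the pieces achieves much the same thing.

import numpy as np
from scipy.integrate import quad

def peaked(x):
    # A made-up integrand that is effectively zero outside 4990 < x < 5010.
    return np.exp(-(x - 5000.0) ** 2)

lb, ub = 1.0, 1.0e7
exact = np.sqrt(np.pi)      # integral over the whole real line

# Naive call over a huge interval: the initial sample points can miss the
# peak entirely, so the result is unreliable.
naive, _ = quad(peaked, lb, ub)

# Hint quad about where the action is via `points` (finite limits only).
hinted, _ = quad(peaked, lb, ub, points=[4990.0, 5010.0])

# Or split the range at the known feature and sum the pieces.
split = sum(quad(peaked, a, b)[0]
            for a, b in [(lb, 4990.0), (4990.0, 5010.0), (5010.0, ub)])

print(naive, hinted, split, exact)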

Expectation Maximization algorithm (Gaussian Mixture Model): ValueError: the input matrix must be positive semidefinite

I am trying to implement the Expectation Maximization algorithm (Gaussian Mixture Model) on a data set data=[[x,y],...]. I am using the mv_norm.pdf(data, mean, cov) function to calculate the cluster responsibilities. But after calculating new values of the covariance (cov matrix), after 6-7 iterations the cov matrix becomes singular, i.e. the determinant of cov is 0 (a very small value), and hence it gives the errors
ValueError: the input matrix must be positive semidefinite
and
raise np.linalg.LinAlgError('singular matrix')
Can someone suggest any solution for this?
# E-step: Compute cluster responsibilities, given cluster parameters
def calculate_cluster_responsibility(data, centroids, cov_m):
    pdfmain = [[] for i in range(0, len(data))]
    for i in range(0, len(data)):
        sum1 = 0
        pdfeach = [[] for m in range(0, len(centroids))]
        pdfeach[0] = 1/3. * mv_norm.pdf(data[i], mean=centroids[0],
                                        cov=[[cov_m[0][0][0], cov_m[0][0][1]],
                                             [cov_m[0][1][0], cov_m[0][1][1]]])
        pdfeach[1] = 1/3. * mv_norm.pdf(data[i], mean=centroids[1],
                                        cov=[[cov_m[1][0][0], cov_m[1][0][1]],
                                             [cov_m[1][1][0], cov_m[1][1][1]]])
        pdfeach[2] = 1/3. * mv_norm.pdf(data[i], mean=centroids[2],
                                        cov=[[cov_m[2][0][0], cov_m[2][0][1]],
                                             [cov_m[2][1][0], cov_m[2][1][1]]])
        sum1 += pdfeach[0] + pdfeach[1] + pdfeach[2]
        pdfeach[:] = [x / sum1 for x in pdfeach]
        pdfmain[i] = pdfeach
    global old_pdfmain
    if old_pdfmain == pdfmain:
        return
    old_pdfmain = copy.deepcopy(pdfmain)
    softcounts = [sum(i) for i in zip(*pdfmain)]
    calculate_cluster_weights(data, centroids, pdfmain, softcounts)
Initially, I've passed [[3,0],[0,3]] for each cluster covariance since expected number of clusters is 3.
Can someone suggest any solution for this?
The problem is that your data lies in some manifold of dimension strictly smaller than the input dimension. For example, your data might lie on a circle while you have 3-dimensional data. As a consequence, when your method tries to estimate a 3-dimensional ellipsoid (covariance matrix) that fits your data, it fails, since the optimal one is a 2-dimensional ellipse (the third dimension is 0).
How to fix it? You will need some regularization of your covariance estimator. There are many possible solutions, all in the M step, not the E step; the problem is with computing the covariance:
Simple solution: instead of doing something like cov = np.cov(X), add some regularizing term, like cov = np.cov(X) + eps * np.identity(X.shape[1]) with a small eps (see the sketch after this list).
Use a nicer estimator, like the LedoitWolf estimator from scikit-learn.
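A minimal sketch of both options, using random stand-in data for the points of one cluster:

import numpy as np
from sklearn.covariance import LedoitWolf

# X: the (n_samples, 2) points currently assigned to one cluster.
X = np.random.randn(200, 2)

# Option 1: add a small ridge to the empirical covariance.
eps = 1e-6
cov_reg = np.cov(X, rowvar=False) + eps * np.identity(X.shape[1])

# Option 2: use a shrinkage estimator.
cov_lw = LedoitWolf().fit(X).covariance_

print(np.linalg.det(cov_reg), np.linalg.det(cov_lw))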
Initially, I've passed [[3,0],[0,3]] for each cluster covariance since expected number of clusters is 3.
This makes no sense: the covariance matrix values have nothing to do with the number of clusters. You can initialize it with anything more or less reasonable.

Percentage diff b/t two strings of different lengths

I have a problem where I am trying to prevent repeats of a string. So far the best solution is to compare the strings for a percentage and check if it is above a certain fixed point.
I've looked up the Levenshtein distance, but so far I believe it does not accomplish my goal, since it compares strings of the same length. Both of my strings are more than likely to be of significantly different lengths (stack traces). I'm looking for content or word comparison rather than char-to-char comparison. A percentage answer is the most important part of this.
I assume someone has an algorithm or would be willing to point me in the right direction? Thank you for reading and even more so for helping!
An indirect example... think of them as being stacktraces in py.test form.
I have filepaths and am comparing them
/test/opt/somedir/blah/something
def do_something(self, x):
return x
SomeError: do_something in 'filepath' threw some exception or something
vs
/test/opt/somedir/blah2/somethingelse
def do_another_thing(self, y):
return y
SomeError: do_another_thing in 'different filepath' threw some exception
But also when you have the same filepath, but different errors. The traces are hundreds of lines long, so showing a full example isn't reasonable. This example is as close as I can get without the actual trace.
One way of going about this would be through applications of the Jaro-Winkler string similarity metric. Happily, this has a PyPI package.
Let's start off with three strings: your two examples, and the beginning of your question:
s1 = u'''
/test/opt/somedir/blah/something
def do_something(self, x):
return x
SomeError: do_something in 'filepath' threw some exception or something'''
s2 = u'''
/test/opt/somedir/blah2/somethingelse
def do_another_thing(self, y):
return y
SomeError: do_another_thing in 'different filepath' threw some exception'''
q = u'''
I have a problem where I am trying to prevent repeats of a string. So far the best solution is to compare the strings for a percentage and check if it is above a certain fixed point.'''
Then the similarities are:
>> jaro.jaro_metric(s1, s2)
0.8059572665529058
>> jaro.jaro_metric(s1, q)
0.6562121541167517
However, since you know something of the problem domain (it is a sequence of lines of stacktraces), you could do better by calculating line differences, perhaps:
import itertools
>> [jaro.jaro_metric(l1, l2) for l1, l2 in itertools.izip(s1.split('\n'), s2.split('\n'))]
[1.0,
0.9353471118177001,
0.8402824228911184,
0.9444444444444443,
0.8043725314852076]
So, you need to experiment with this, but you could try, given two stacktraces, calculating a "distance" matrix whose i-j entry is the similarity between the i-th line of the first trace and the j-th line of the second. (This is a bit computationally expensive.) See if there's a threshold on the percentage or number of entries obtaining very high scores.
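A minimal sketch of that matrix, reusing s1 and s2 from above; the 0.9 threshold and the way the matrix is reduced to a percentage are arbitrary illustrative choices:

import numpy as np
import jaro

lines1 = s1.strip().split('\n')
lines2 = s2.strip().split('\n')

# Entry (i, j) is the similarity between line i of the first trace
# and line j of the second.
sim = np.array([[jaro.jaro_metric(l1, l2) for l2 in lines2] for l1 in lines1])

# One possible percentage score: the share of lines in the first trace
# that have a very similar counterpart somewhere in the second.
threshold = 0.9
score = 100.0 * np.mean(sim.max(axis=1) > threshold)
print(sim.round(2))
print('%.0f%% similar' % score)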
