Singular matrix when implementing GMM from scratch - python-3.x

In the EM algorithm for Gaussian Mixture Models. We go through the above steps. 9.23 composes the E step and 9.24,9.25,9.26 composes the M step.
However, Implementing 9.25 always gave $\Sigma_K$ that were singular. No matter mean or sigma I chose, the algorithm would create a singular $\Sigma$ after the first iteration.
My implementation of M step:
for k in range(K):
mu[:,k] = np.sum(X*gamma[:,k][:,None],axis = 0)/N[k]
for k in range(K):
A = X - mu[:,k]
SigmaK = A.T#(A *gamma[:,k][:,None])
SigmaK /=self.N[k]
sigma[k] = SigmaK
pi = N*(1/N.sum())
Is there something particularly wrong with this implementation that would cause covariance to always be singular?

Related

How to define log-count ratio for multiclass text dataset (fastai)?

I am trying to follow Rachel Thomas path of sentiment classification with Naive Bayes. In the video she uses a binary dataset (pos. and neg. movie reviews). When it comes to apply Naive Bayes, this is what she does:
Defintion: log-count ratio r for each word f:
r = log (ratio of feature f in positive documents) / (ratio of feature f in negative documents)
where ratio of feature $f$ in positive documents is the number of times a positive document has a feature divided by the number of positive documents.
p1 = np.squeeze(np.asarray(x[y.items==positive].sum(0)))
p0 = np.squeeze(np.asarray(x[y.items==negative].sum(0)))
pr1 = (p1+1) / ((y.items==positive).sum() + 1)
pr0 = (p0+1) / ((y.items==negative).sum() + 1)
r = np.log(pr1/pr0)
--> it is very simple to apply the log-count-ratio to a dataset with 2 labels!
Problem:
My dataset is not binary! Lets assume I have 5 labels: label_1,...,label_5
How do I get the log-count ratio r for multilabel dataset?
My approach:
p4 = np.squeeze(np.asarray(x[y.items==label_5].sum(0)))
p3 = np.squeeze(np.asarray(x[y.items==label_4].sum(0)))
p2 = np.squeeze(np.asarray(x[y.items==label_3].sum(0)))
p1 = np.squeeze(np.asarray(x[y.items==label_2].sum(0)))
p0 = np.squeeze(np.asarray(x[y.items==label_1].sum(0)))
log-count-ratio:
pr1 = (p1+1) / ((y.items==label_2).sum() + 1)
pr1_not = (p1+1) / ((y.items!=label_2).sum() + 1)
r_1 = np.log(pr1/pr1_not)
log-count-ratio:
pr2 = (p2+1) / ((y.items==label_3).sum() + 1)
pr2_not = (p2+1) / ((y.items!=label_3).sum() + 1)
r_2 = np.log(pr2/pr2_not)
...
Is this correct? Does it mean I get multiple ratios?
Yes this is correct. The “negative class” is basically all the classes but the one you are considering. So yes, you will get multiple ratios (as many as number of classes you have).
From https://marvinlsj.github.io/2018/11/23/NBSVM%20for%20sentiment%20and%20topic%20classification/
, the log-count-ratio is derived from Posterior prob ratio which is good for comparing 2 classes to get insight into which is the most probable. I guess you're trying to do one-vs-one method for multi-class problem. This will end up with 5x4/2=10 pairs of ratios for classification. If you'd like to do classification only, we normally compute Posterior prob for each class and select the best. So in your case, you just select the best from sum(log(p1)), sum(log(p2)), ..., sum(log(p5)).

Solving vector second order differential equation while indexing into an array

I'm attempting to solve the differential equation:
m(t) = M(x)x'' + C(x, x') + B x'
where x and x' are vectors with 2 entries representing the angles and angular velocity in a dynamical system. M(x) is a 2x2 matrix that is a function of the components of theta, C is a 2x1 vector that is a function of theta and theta' and B is a 2x2 matrix of constants. m(t) is a 2*1001 array containing the torques applied to each of the two joints at the 1001 time steps and I would like to calculate the evolution of the angles as a function of those 1001 time steps.
I've transformed it to standard form such that :
x'' = M(x)^-1 (m(t) - C(x, x') - B x')
Then substituting y_1 = x and y_2 = x' gives the first order linear system of equations:
y_2 = y_1'
y_2' = M(y_1)^-1 (m(t) - C(y_1, y_2) - B y_2)
(I've used theta and phi in my code for x and y)
def joint_angles(theta_array, t, torques, B):
phi_1 = np.array([theta_array[0], theta_array[1]])
phi_2 = np.array([theta_array[2], theta_array[3]])
def M_func(phi):
M = np.array([[a_1+2.*a_2*np.cos(phi[1]), a_3+a_2*np.cos(phi[1])],[a_3+a_2*np.cos(phi[1]), a_3]])
return np.linalg.inv(M)
def C_func(phi, phi_dot):
return a_2 * np.sin(phi[1]) * np.array([-phi_dot[1] * (2. * phi_dot[0] + phi_dot[1]), phi_dot[0]**2])
dphi_2dt = M_func(phi_1) # (torques[:, t] - C_func(phi_1, phi_2) - B # phi_2)
return dphi_2dt, phi_2
t = np.linspace(0,1,1001)
initial = theta_init[0], theta_init[1], dtheta_init[0], dtheta_init[1]
x = odeint(joint_angles, initial, t, args = (torque_array, B))
I get the error that I cannot index into torques using the t array, which makes perfect sense, however I am not sure how to have it use the current value of the torques at each time step.
I also tried putting odeint command in a for loop and only evaluating it at one time step at a time, using the solution of the function as the initial conditions for the next loop, however the function simply returned the initial conditions, meaning every loop was identical. This leads me to suspect I've made a mistake in my implementation of the standard form but I can't work out what it is. It would be preferable however to not have to call the odeint solver in a for loop every time, and rather do it all as one.
If helpful, my initial conditions and constant values are:
theta_init = np.array([10*np.pi/180, 143.54*np.pi/180])
dtheta_init = np.array([0, 0])
L_1 = 0.3
L_2 = 0.33
I_1 = 0.025
I_2 = 0.045
M_1 = 1.4
M_2 = 1.0
D_2 = 0.16
a_1 = I_1+I_2+M_2*(L_1**2)
a_2 = M_2*L_1*D_2
a_3 = I_2
Thanks for helping!
The solver uses an internal stepping that is problem adapted. The given time list is a list of points where the internal solution gets interpolated for output samples. The internal and external time lists are in no way related, the internal list only depends on the given tolerances.
There is no actual natural relation between array indices and sample times.
The translation of a given time into an index and construction of a sample value from the surrounding table entries is called interpolation (by a piecewise polynomial function).
Torque as a physical phenomenon is at least continuous, a piecewise linear interpolation is the easiest way to transform the given function value table into an actual continuous function. Of course one also needs the time array.
So use numpy.interp1d or the more advanced routines of scipy.interpolate to define the torque function that can be evaluated at arbitrary times as demanded by the solver and its integration method.

Difficulty in understanding the function and the function parameters

I am trying to build a program on Minutiae based fingerprint verification and for pre-processing I need to orient the ridges and I found a related function on the internet but could not understand it properly.
I tried reading many research papers on minutiae extraction.
This is the function I found on the internet
def ridge_orient(im, gradientsigma, blocksigma, orientsmoothsigma):
rows,cols = im.shape;
#Calculate image gradients.
sze = np.fix(6*gradientsigma);
if np.remainder(sze,2) == 0:
sze = sze+1;
gauss = cv2.getGaussianKernel(np.int(sze),gradientsigma);
f = gauss * gauss.T;
fy,fx = np.gradient(f); #Gradient of Gaussian
#Gx = ndimage.convolve(np.double(im),fx);
#Gy = ndimage.convolve(np.double(im),fy);
Gx = signal.convolve2d(im,fx,mode='same');
Gy = signal.convolve2d(im,fy,mode='same');
Gxx = np.power(Gx,2);
Gyy = np.power(Gy,2);
Gxy = Gx*Gy;
#Now smooth the covariance data to perform a weighted summation of the data.
sze = np.fix(6*blocksigma);
gauss = cv2.getGaussianKernel(np.int(sze),blocksigma);
f = gauss * gauss.T;
Gxx = ndimage.convolve(Gxx,f);
Gyy = ndimage.convolve(Gyy,f);
Gxy = 2*ndimage.convolve(Gxy,f);
# Analytic solution of principal direction
denom = np.sqrt(np.power(Gxy,2) + np.power((Gxx - Gyy),2)) + np.finfo(float).eps;
sin2theta = Gxy/denom; # Sine and cosine of doubled angles
cos2theta = (Gxx-Gyy)/denom;
if orientsmoothsigma:
sze = np.fix(6*orientsmoothsigma);
if np.remainder(sze,2) == 0:
sze = sze+1;
gauss = cv2.getGaussianKernel(np.int(sze),orientsmoothsigma);
f = gauss * gauss.T;
cos2theta = ndimage.convolve(cos2theta,f); # Smoothed sine and cosine of
sin2theta = ndimage.convolve(sin2theta,f); # doubled angles
orientim = np.pi/2 + np.arctan2(sin2theta,cos2theta)/2;
return(orientim);
This code computes the structure tensor, a method to compute the local orientation in a robust way. It obtains, per pixel, a symmetric 2x2 matrix (tensor) (variables Gxx, Gyy and Gxy) whose eigenvectors indicate local orientation and whose eigenvalues indicate strength of local variation. Because it returns a matrix rather than a simple gradient vector, you can distinguish uniform regions from structures like a cross, neither of which has a strong gradient, but the cross has a strong local variation.
Some criticism about the code
fy,fx = np.gradient(f);
is a really bad way of obtaining a Gaussian derivative kernel. Here you just lose all the benefits of a Gaussian gradient, and instead compute a finite difference approximation to the gradient.
Next, the code doesn’t use the separability of the Gaussian to reduce computational complexity, which is also a shame.
In this blog post I go over the theory of Gaussian filtering and Gaussian derivatives, and in this other one I detail some practical aspects (using MATLAB, but this should readily extend to Python).

Defining new kernels to use in approximate_kernel.py

I am trying to test a new kernel method in Kernel Ridge Regression and want to do this by implementing the Fastfood transformation (https://arxiv.org/abs/1408.3060). I can write a function which computes this transform but it isn't playing nicely with the kernel ridge regression function in sklearn. As a result I have gone to the source code for sklearn kernel ridge regression (https://insight.io/github.com/scikit-learn/scikit-learn/blob/master/sklearn/kernel_ridge.py) and approximate_kernel.py (https://insight.io/github.com/scikit-learn/scikit-learn/blob/master/sklearn/kernel_approximation.py) in order to try and define this new kernel as a class definition in approximate_kernel.py. The problem is that I have no idea how to convert my construction to something which will work in the approximate_kernel KernelRidge programs. Would anybody be able to advise how best to do this please?
My construction for the fastfood transform is:
def fastfood_product(d):
'''
Constructs the fastfood matrix composition V = const*S*H*G*Pi*B where
S is a scaling matrix
H is Hadamard transform
G is a diagonal random Gaussian
Pi is a permutation matrix
B is a diagonal Rademacher matrix.
Inputs: n - dimensionality of the feature vectors for the kernel.
must be a power of two and be divisible by d. If not then can
pad the matrix with zeros but for simplicity assume this condition
is always met.
Output: V'''
S = np.zeros(shape=(d,d))
G = np.zeros_like(S)
B = np.zeros_like(S)
H = hadamard(d)
Pi = np.eye(d)
np.random.shuffle(Pi) # Permutation matrix
# Construct the simple matrices
np.fill_diagonal(B, 2*np.random.randint(low=0,high=2,size=(d,1)).flatten() - 1)
np.fill_diagonal(G, np.random.randn(G.shape[0],1)) # May want to change standard normal to arbitrary which will affect the scaling for V
np.fill_diagonal(S, np.linalg.norm(G,'fro')**(-0.5))
#print('Shapes of B {}, S {}, G {}, H{}, Pi {}'.format(B.shape, S.shape, G.shape, H.shape, Pi.shape))
V = d**(-0.5)*S.dot(H).dot(G).dot(Pi).dot(H).dot(B)
return V
def fastfood_feature_map(X, n):
'''Given a matrix X of data compute the fastfood transformation and feature mapping.
Input: X data of dimension d by m, n = the number of nonlinear basis functions to choose (power of 2)
Outputs: Phi - matrix of random features for fastfood kernel approximation.
Usage: Phi must be transposed for computation in the kernel ridge regression.
i.e solve ||Phi.T * w - b || + regulariser
Comments: This only uses a standard normal distribution but this could
be altered with different hyperparameters.'''
d,m = A.shape
V = fastfood_product(d)
Phi = n**(-0.5)*np.exp(1j*np.dot(V, X))
return Phi
I think the imports numpy as np and from linalg import hadamard will be necessary for the above.

ALS model - how to generate full_u * v^t * v?

I'm trying to figure out how an ALS model can predict values for new users in between it being updated by a batch process. In my search, I came across this stackoverflow answer. I've copied the answer below for the reader's convenience:
You can get predictions for new users using the trained model (without updating it):
To get predictions for a user in the model, you use its latent representation (vector u of size f (number of factors)), which is multiplied by the product latent factor matrix (matrix made of the latent representations of all products, a bunch of vectors of size f) and gives you a score for each product. For new users, the problem is that you don't have access to their latent representation (you only have the full representation of size M (number of different products), but what you can do is use a similarity function to compute a similar latent representation for this new user by multiplying it by the transpose of the product matrix.
i.e. if you user latent matrix is u and your product latent matrix is v, for user i in the model, you get scores by doing: u_i * v for a new user, you don't have a latent representation, so take the full representation full_u and do: full_u * v^t * v This will approximate the latent factors for the new users and should give reasonable recommendations (if the model already gives reasonable recommendations for existing users)
To answer the question of training, this allows you to compute predictions for new users without having to do the heavy computation of the model which you can now do only once in a while. So you have you batch processing at night and can still make prediction for new user during the day.
Note: MLLIB gives you access to the matrix u and v
The quoted text above is an excellent answer, however, I'm struggling to understand how to programmatically implement this solution. For example, the matrix u and v can be obtained with:
# pyspark example
# ommitted for brevity ... loading movielens 1M ratings
model = ALS.train(ratings, rank, numIterations, lambdaParam)
matrix_u = model.userFeatures()
print(matrix_u.take(2)) # take a look at the dataset
This returns:
[
(2, array('d', [0.26341307163238525, 0.1650490164756775, 0.118405282497406, -0.5976635217666626, -0.3913084864616394, -0.1379186064004898, -0.3866392970085144, -0.1768060326576233, -0.38342711329460144, 0.48550787568092346, -0.18867433071136475, -0.02757863700389862, 0.1410026103258133, 0.11498363316059113, 0.03958914801478386, 0.034536730498075485, 0.08427099883556366, 0.46969038248062134, -0.8230801224708557, -0.15124185383319855, 0.2566414773464203, 0.04326820373535156, 0.19077207148075104, 0.025207923725247383, -0.02030213735997677, 0.1696728765964508, 0.5714617967605591, -0.03885050490498543, -0.09797532111406326, 0.29186877608299255, -0.12768596410751343, -0.1582849770784378, 0.01933656632900238, -0.09131495654582977, 0.26577943563461304, -0.4543033838272095, -0.11789630353450775, 0.05775507912039757, 0.2891307771205902, -0.2147761881351471, -0.011787488125264645, 0.49508437514305115, 0.5610293745994568, 0.228189617395401, 0.624510645866394, -0.009683617390692234, -0.050237834453582764, -0.07940001785755157, 0.4686132073402405, -0.02288617007434368])),
(4, array('d', [-0.001666820957325399, -0.12487432360649109, 0.1252429485321045, -0.794727087020874, -0.3804478347301483, -0.04577340930700302, -0.42346617579460144, -0.27448347210884094, -0.25846347212791443, 0.5107921957969666, 0.04229479655623436, -0.10212298482656479, -0.13407345116138458, -0.2059325873851776, 0.12777331471443176, -0.318756639957428, 0.129398375749588, 0.4351944327354431, -0.9031049013137817, -0.29211774468421936, -0.02933369390666485, 0.023618215695023537, 0.10542935132980347, -0.22032295167446136, -0.1861676126718521, 0.13154461979866028, 0.6130356192588806, -0.10089754313230515, 0.13624103367328644, 0.22037173807621002, -0.2966669499874115, -0.34058427810668945, 0.37738317251205444, -0.3755438029766083, -0.2408779263496399, -0.35355791449546814, 0.05752146989107132, -0.15478627383708954, 0.3418906629085541, -0.6939512491226196, 0.4279302656650543, 0.4875738322734833, 0.5659542083740234, 0.1479463279247284, 0.5280753970146179, -0.24357643723487854, 0.14329688251018524, -0.2137598991394043, 0.011986476369202137, -0.015219110995531082]))
]
I can also do similar to get the v matrix:
matrix_v = model.productFeatures()
print(matrix_v.take(2)) # take a look at the dataset
This results in:
[
(2, array('d', [0.019985994324088097, 0.0673416256904602, -0.05697149783372879, -0.5434763431549072, -0.40705952048301697, -0.18632276356220245, -0.30776089429855347, -0.13178342580795288, -0.27466219663619995, 0.4183739423751831, -0.24422742426395416, -0.24130797386169434, 0.24116989970207214, 0.06833088397979736, -0.01750543899834156, 0.03404173627495766, 0.04333991929888725, 0.3577033281326294, -0.7044714689254761, 0.1438472419977188, 0.06652364134788513, -0.029888223856687546, -0.16717877984046936, 0.1027146726846695, -0.12836599349975586, 0.10197233408689499, 0.5053384900093079, 0.019304445013403893, -0.21254844963550568, 0.2705852687358856, -0.04169371724128723, -0.24098040163516998, -0.0683765709400177, -0.09532768279314041, 0.1006036177277565, -0.08682398498058319, -0.13584329187870026, -0.001340558985248208, 0.20587041974067688, -0.14007550477981567, -0.1831497997045517, 0.5021498203277588, 0.3049483597278595, 0.11236990243196487, 0.15783801674842834, -0.044139936566352844, -0.14372406899929047, 0.058535050600767136, 0.3777201473712921, -0.045475270599126816])),
(4, array('d', [0.10334215313196182, 0.1881643384695053, 0.09297363460063934, -0.457258403301239, -0.5272660255432129, -0.0989445373415947, -0.2053477019071579, -0.1644461452960968, -0.3771175146102905, 0.21405018866062164, -0.18553146719932556, 0.011830524541437626, 0.29562288522720337, 0.07959598302841187, -0.035378433763980865, -0.11786794662475586, -0.11603366583585739, 0.3776192367076874, -0.5124108791351318, 0.03971947357058525, -0.03365595266222954, 0.023278912529349327, 0.17436474561691284, -0.06317273527383804, 0.05118614062666893, 0.4375131130218506, 0.3281322419643402, 0.036590900272130966, -0.3759073317050934, 0.22429685294628143, -0.0728025734424591, -0.10945595055818558, 0.0728464275598526, 0.014129920862615108, -0.10701996833086014, -0.2496117204427719, -0.09409723430871964, -0.11898282915353775, 0.18940524756908417, -0.3211393356323242, -0.035668935626745224, 0.41765937209129333, 0.2636736035346985, -0.01290816068649292, 0.2824321389198303, 0.021533429622650146, -0.08053319901227951, 0.11117415875196457, 0.22975310683250427, 0.06993964314460754]))
]
However, I'm not sure how to progress from this to full_u * v^t * v
This new user is not the the matrix U, so you don't have its latent representation in 'k' factors, you only know its full representation, i.e., all its ratings. full_u here means all of the new user ratings in a dense format (not the sparse format ratings are) e.g.:
[0 2 0 0 0 1 0] if user u has rated item 2 with a 2 and item 6 with a 1.
then you can get v pretty much like you did and turning it to a matrix in numpy for instance:
pf = model.productFeatures()
Vt = np.matrix(np.asarray(pf.values().collect()))
then is is just a matter of multiplying:
full_u*Vt*Vt.T
Vt and V are transposed compared to the other answer but that's just a matter of convention.
Note that the Vt*Vt.T product is fixed, so if you are going to use this for multiple new users it will be computationally more efficient to pre-compute it. Actually for more than one user it would be better to put all their ratings in bigU (in the same format as my one new user example) and just do the matrix product:
bigU*Vt*Vt.T to get all the ratings for all the new users. Might still be worth checking that the product is done in the most efficient way in terms of number of operations.
Just a word of warning. People talk about the user and product matrices like they are left and right singular vectors. But as far as I understand, the method used to find U and V is an optimization of a straight squared error cost function, which makes none of the orthogonality guarantees of SVD.
In other words, think algebraically about what the above answer claims. We have a full ratings matrix R, an n by p matrix of ratings for n users over p products. We decompose it with k latent factors to approximate R = UV, where the rows of U, an n by k matrix, are the latent user representations, and the columns of V, a k by p matrix, are the latent product representations. In order to find latent user representations for a matrix R of entirely new users without refitting the model, we need:
R = U V
R V^{-1} = U V V^{-1}
R V^{-1} = U I_{k}
R V^{-1} = U
where I_{k} is the k dimensional identity matrix and V^{-1} is the p by k right inverse of V. The tip above assumes that V^{T} = V^{-1}. This is not guaranteed. And in general there is no guarantee that assuming this is true will give you anything but nonsense answers.
Let me know if I'm missing something in the optimization method behind MLLib's CF implementation. Is there a trick in the ALS model that guarantees orthogonality that I'm missing?

Resources