First, let me clarify that here "sparse PCA" means PCA with an L1 penalty and sparse loadings, not PCA on a sparse matrix.
I've read the paper on sparse PCA by Zou and Hastie, I've read the documentation on sklearn.decomposition.SparsePCA, and I know how to use PCA, but I can't seem to get the right result from SparsePCA.
Namely, when L1 penalty is 0, the result from SparsePCA is supposed to agree with PCA, but the loadings differ quite a lot. To make sure that I didn't mess up any hyperparameters, I used the same hyperparameters (convergence tolerance, maximum iterations, ridge penalty, lasso penalty...) in R with 'spca' from 'elasticnet', and R gave me the correct result. I'd rather not have to go through the source code of SparsePCA if anyone has experience using this function and could let me know if I made any mistakes.
Below is how I generated my dataset. It's a bit convoluted because I wanted a specific Markov Decision Process to test some reinforcement learning algorithms. Just treat it as some non-sparse dataset.
import numpy as np
from sklearn.decomposition import PCA, SparsePCA
import numpy.random as nr

def transform(data, TranType=None):
    if TranType == 'quad':
        data = np.minimum(np.square(data), 3)
    if TranType == 'cubic':
        data = np.maximum(np.minimum(np.power(data, 3), 3), -3)
    if TranType == 'exp':
        data = np.minimum(np.exp(data), 3)
    if TranType == 'abslog':
        data = np.minimum(np.log(abs(data)), 3)
    return data
def NewStateGen(OldS, A, TranType, m=0, sd=0.5, nsd=0.1, dim=64):
    # dim needs to be a multiple of 4, and preferably a multiple of 16.
    assert (dim == len(OldS) and dim % 4 == 0)
    TrueDim = dim // 4  # integer division so the value can be used in slices
    NewS = np.zeros(dim)
    # Generate new state according to action
    if A == 0:
        NewS[range(0, dim, 4)] = transform(OldS[0:TrueDim], TranType) + \
            nr.normal(scale=nsd, size=TrueDim)
        NewS[range(1, dim, 4)] = transform(OldS[0:TrueDim], TranType) + \
            nr.normal(scale=nsd, size=TrueDim)
        NewS[range(2, dim, 4)] = nr.normal(m, sd, size=TrueDim)
        NewS[range(3, dim, 4)] = nr.normal(m, sd, size=TrueDim)
        R = 2 * np.sum(transform(OldS[0:int(np.ceil(dim / 32.0))], TranType)) - \
            np.sum(transform(OldS[int(np.ceil(dim / 32.0)):(dim // 16)], TranType)) + \
    if A == 1:
        NewS[range(0, dim, 4)] = nr.normal(m, sd, size=TrueDim)
        NewS[range(1, dim, 4)] = nr.normal(m, sd, size=TrueDim)
        NewS[range(2, dim, 4)] = transform(OldS[0:TrueDim], TranType) + \
            nr.normal(scale=nsd, size=TrueDim)
        NewS[range(3, dim, 4)] = transform(OldS[0:TrueDim], TranType) + \
            nr.normal(scale=nsd, size=TrueDim)
        R = 2 * np.sum(transform(OldS[int(np.floor(dim / 32.0)):(dim // 16)], TranType)) - \
            np.sum(transform(OldS[0:int(np.floor(dim / 32.0))], TranType)) + \
    return NewS, R
def MDPGen(dim=64, rep=1, n=30, T=100, m=0, sd=0.5, nsd=0.1, TranType=None):
    X_all = np.zeros(shape=(rep*n*T, dim))
    Y_all = np.zeros(shape=(rep*n*T, dim+1))
    A_all = np.zeros(rep*n*T)
    R_all = np.zeros(rep*n*T)
    for j in range(rep*n):
        # Data for a single subject
        X = np.zeros(shape=(T+1, dim))
        A = np.zeros(T)
        R = np.zeros(T)
        NewS = np.zeros(dim)
        X[0] = nr.normal(m, sd, size=dim)
        for i in range(T):
            OldS = X[i]
            # Pick a random action
            A[i] = nr.randint(2)
            # Generate new state according to action
            X[i+1], R[i] = NewStateGen(OldS, A[i], TranType, m, sd, nsd, dim)
        Y = np.concatenate((X[1:(T+1)], R.reshape(T, 1)), axis=1)
        X = X[0:T]
        X_all[(j*T):((j+1)*T)] = X
        Y_all[(j*T):((j+1)*T)] = Y
        A_all[(j*T):((j+1)*T)] = A
        R_all[(j*T):((j+1)*T)] = R
    return {'X': X_all, 'Y': Y_all, 'A': A_all, 'R': R_all, 'rep': rep, 'n': n, 'T': T}
MDP = MDPGen(dim=64, rep=1, n=30, T=90, sd=0.5, nsd=0.1, TranType=None)
X = MDP.get('X').astype(np.float32)
Now I run PCA and SparsePCA. When the lasso penalty, 'alpha', is 0, SparsePCA is supposed to give the same result as PCA, which is not the case. The other hyperparameters are set with the default values from elasticnet in R. If I use the default from SparsePCA the result will still be incorrect.
PCA_model = PCA(n_components=64)
PCA_model.fit(X)  # the model must be fitted before transform() can be called
Z = PCA_model.transform(X)
SPCA_model = SparsePCA(n_components=64, alpha=0, ridge_alpha=1e-6, max_iter=200, tol=1e-3)
SPCA_model.fit(X)
SZ = SPCA_model.transform(X)
# Check the first 2 loadings from PCA and SPCA. They are supposed to agree.
print(PCA_model.components_[0:2])
print(SPCA_model.components_[0:2])
# Check the first 2 observations of transformed data. They are supposed to agree.
print(Z[0:2])
print(SZ[0:2])
When the lasso penalty is greater than 0, the result from SparsePCA is still quite different from what R gives me, and the latter is correct based on manual inspection and what I learned from the original paper. So, is SparsePCA broken, or did I miss anything?
As is often the case, there are many different formulations and implementations.
sklearn uses a different implementation with different characteristics.
Let's have a look at how they differ:
sklearn: (reference within user-guide)
Elasticnet: (Zou et al. paper)
So it seems sklearn is at least doing something different in regards to the l2-norm based component (it's missing).
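For reference, the two objectives being compared can be written out as below. This is transcribed from memory of the scikit-learn user guide and of the SPCA criterion in the Zou et al. paper, so treat the exact constants and the form of the constraints as assumptions and check them against those sources:

```latex
% scikit-learn SparsePCA (dictionary-learning form)
(U^*, V^*) = \operatorname*{arg\,min}_{U, V} \;
    \tfrac{1}{2} \lVert X - UV \rVert_{\mathrm{Fro}}^2 + \alpha \lVert V \rVert_{1,1}
\quad \text{subject to } \lVert U_k \rVert_2 \le 1 \text{ for all } 0 \le k < n_{\mathrm{components}}

% elasticnet spca (Zou et al.)
(\hat{A}, \hat{B}) = \operatorname*{arg\,min}_{A, B} \;
    \sum_{i=1}^{n} \lVert x_i - A B^{T} x_i \rVert^2
    + \lambda \sum_{j=1}^{k} \lVert \beta_j \rVert^2
    + \sum_{j=1}^{k} \lambda_{1,j} \lVert \beta_j \rVert_1
\quad \text{subject to } A^{T} A = I_{k \times k}
```

Written this way, the ridge term $\lambda \sum_j \lVert \beta_j \rVert^2$ of the second criterion has no counterpart in the first, and the orthogonality constraint on $A$ is replaced by a norm bound on the columns of $U$.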
This is by design as this is the basic form within the area of dictionary-learning: (algorithm-paper linked by sklearn used for implementation).
It is quite possible that this alternative formulation does not guarantee (or does not care at all about) emulating classic PCA when the sparsity parameter is zero. That is not really surprising, as these problems differ a lot with regard to optimization theory, and sparse PCA has to resort to some heuristic-based algorithm because the problem itself is NP-hard (ref). This idea is strengthened by the description of the equivalence theorem here:
The answers aren't different. First, I thought it may be the solvers, but checking for different solvers, I get almost identical loadings. See this:
MDP = MDPGen(dim=16, rep=1, n=30, T=90, sd=0.5, nsd=0.1, TranType=None)
X = MDP.get('X').astype(np.float32)
PCA_model = PCA(n_components=10, svd_solver='auto', tol=1e-6)
PCA_model.fit(X)  # components_ only exists after fitting
SPCA_model = SparsePCA(n_components=10, alpha=0, ridge_alpha=0)
SPCA_model.fit(X)
PC1 = PCA_model.components_[0]/np.linalg.norm(PCA_model.components_[0])
SPC1 = SPCA_model.components_[0].T/np.linalg.norm(SPCA_model.components_[0])
import pylab
I am attempting to translate a MATLAB function to Python from Timothy Sauer, Numerical Analysis, Second Edition, page 546, Program 12.8. The original function receives a square matrix and returns a matrix with the same eigenvalues but in upper Hessenberg form. It creates Householder reflectors to produce zeros in the off-diagonals of the matrix and performs similarity transformations on the original matrix to bring it to upper Hessenberg form.
My Python translation succeeds in obtaining the eigenvalues only for 3x3 matrices, not for 4x4 matrices. Would anyone know the cause of the error? I pasted my code with succeeding and failing cases below. Thank you.
import numpy as np
import math

norm = lambda v: math.sqrt(np.sum(v**2))

def upper_hessenberg(A):
    """
    Translated from Timothy Sauer, Numerical Analysis Second Edition, page 546, Program 12.8
    Input: Square Matrix, A
    Output: B, a Similar Matrix with Same Eigenvalues as A except in Upper Hessenberg form
            V, a matrix containing the reflectors used to produce zeros in the off diagonals
    """
    rows, columns = A.shape
    B = A[:,:].astype(float) #will store the similar matrix
    V = np.zeros(shape=(rows,columns),dtype=float) #will store the reflectors
    for column in range(columns-2): #start from the 1st column end at the third to last column
        row = column
        x = B[row+1: ,column] #decapitate the column
        reflection_of_x = np.zeros(len(x)) #first entry is the norm, followed by 0s
        if abs(norm(x)) <= np.finfo(float).eps: continue #if there are already 0s in the offdiagonals skip this column
        reflection_of_x[0] = norm(x)
        v = reflection_of_x - x # v, (the difference vector) represents the line connecting the original column to the reflection of the column (see Timothy Sauer Num Analysis 2nd Edition Figure 4.11 Householder reflector)
        v = v/norm(v) #normalize to length of 1 (unit vector)
        V[:len(v), column] = v #save the reflector in an upper triangular matrix called V
        #verify with x-2*(x @ v * v) should equal a vector with all zeros except the leading entry
        column_projections = np.outer(v , v @ B[row+1:, column:]) #project each col onto difference vector
        B[row+1:, column:] = B[row+1:, column:] - (2 * column_projections)
        row_projections = np.outer(v, B[row:, column + 1:] @ v).T #project each row onto difference vector
        B[row:, column + 1:] = B[row:, column + 1:] - (2 * row_projections)
    return V, B
# Algorithm succeeds only with 3x3 matrices
eigvectors = np.array([
eigvalues = np.array([
M = eigvectors @ eigvalues @ np.linalg.inv(eigvectors)
print("The expected eigvals :", np.linalg.eigvals(M))
V,B = upper_hessenberg(M)
print("For 3x3 matrices, The function successfully produces these eigvals",np.linalg.eigvals(B))
#But with 4x4 matrices it fails
eigvectors = np.array([
eigvalues = np.array([
M = eigvectors @ eigvalues @ np.linalg.inv(eigvectors)
print("The expected eigvals :", np.linalg.eigvals(M))
V,B = upper_hessenberg(M)
print("For 4x4 matrices, The function fails to obtain correct eigvals",np.linalg.eigvals(B))
Your error is that you try to be too efficient. While the last rows are indeed increasingly reduced with leading zeros, this is not the case for the last columns. So in row_projections you need to remove the limiter row: and use B[:, column + 1:] instead, in both the projection and the update.
You are using the unstable variant of the "improved" Householder reflector. The older version would use the larger of x_refl - x and x_refl + x by setting reflection_of_x[0] = -np.sign(x[0])*norm(x) (or remove all minus signs there).
The stable variant of the improved reflector would use the binomial trick in the normalization of x_refl - x if this difference becomes too small.
x_refl - x = [ norm(x) - x[0], - x[1:] ]
           = [ norm(x[1:])^2/(norm(x) + x[0]), - x[1:] ]

(x_refl - x)/norm(x_refl - x)
           = [ norm(x[1:]), - (norm(x)+x[0])*(x[1:]/norm(x[1:])) ] / sqrt(2*norm(x)*(norm(x)+x[0]))
While the parts may have wildly different scales, no catastrophic cancellation happens for x[0]>0.
See the discussion of the same algorithm from Golub/van Loan, 4th ed., for further details and opinions, and for the code from that book.
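Putting both points together, here is a minimal sketch of the reduction with the column limiter removed and the cancellation-free first entry of the difference vector. The function names `householder_unit_vector` and `hessenberg_reduce` are made up for this illustration, and the code is not a translation of Sauer's or Golub/van Loan's program:

```python
import numpy as np

def householder_unit_vector(x):
    """Return a unit vector v with (I - 2 v v^T) x = [norm(x), 0, ..., 0].

    Returns None when x[1:] is already zero, i.e. nothing has to be eliminated.
    """
    x = np.asarray(x, dtype=float)
    nx = np.linalg.norm(x)
    tail = np.linalg.norm(x[1:])
    if tail == 0.0:
        return None
    v = -x.copy()  # v = x_refl - x, so the tail entries are -x[1:]
    if x[0] > 0:
        # norm(x) - x[0] rewritten so that no cancellation occurs
        v[0] = tail**2 / (nx + x[0])
    else:
        v[0] = nx - x[0]
    return v / np.linalg.norm(v)

def hessenberg_reduce(A):
    """Return a matrix similar to A in upper Hessenberg form."""
    B = np.array(A, dtype=float)
    n = B.shape[0]
    for k in range(n - 2):
        v = householder_unit_vector(B[k + 1:, k])
        if v is None:
            continue
        # Left multiplication: only rows k+1.. change, and columns < k are already zero there.
        B[k + 1:, k:] -= 2.0 * np.outer(v, v @ B[k + 1:, k:])
        # Right multiplication: ALL rows change, so there is no row limiter here.
        B[:, k + 1:] -= 2.0 * np.outer(B[:, k + 1:] @ v, v)
    return B

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    M = rng.standard_normal((4, 4))
    print(np.sort_complex(np.linalg.eigvals(M)))
    print(np.sort_complex(np.linalg.eigvals(hessenberg_reduce(M))))
```

The two printed spectra should agree up to rounding for the 4x4 case that failed in the question.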
I have this similarity matrix plot of some documents. I want to sort the values of the matrix, which is a NumPy ndarray, to group colors, while maintaining their relative position (diagonal yellow line) and the labels as well.
path = "C:\\Users\\user\\Desktop\\texts\\dataset"
text_files = os.listdir(path)
#print (text_files)
tfidf_vectorizer = TfidfVectorizer()
documents = [open(f, encoding="utf-8").read() for f in text_files if f.endswith('.txt')]
sparse_matrix = tfidf_vectorizer.fit_transform(documents)
labels = []
for f in text_files:
if f.endswith('.txt'):
pairwise_similarity = sparse_matrix * sparse_matrix.T
pairwise_similarity_array = pairwise_similarity.toarray()
fig, ax = plt.subplots(figsize=(20,20))
cax = ax.matshow(pairwise_similarity_array, interpolation='spline16')
plt.title('News articles similarity matrix')
plt.xticks(range(23), labels, rotation=90);
plt.yticks(range(23), labels);
fig.colorbar(cax, ticks=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1])
Here is one possibility.
The idea is to use the information in the similarity matrix and put elements next to each other if they are similar. If two items are similar, they should also be similar with respect to other elements, i.e. have similar colors.
I start with the element which has the most in common with all other elements (this choice is a bit arbitrary) [a], and as the next element I choose, from the remaining elements, the one which is closest to the current one [b].
import numpy as np
import matplotlib.pyplot as plt
def create_dummy_sim_mat(n):
    sm = np.random.random((n, n))
    sm = (sm + sm.T) / 2
    sm[range(n), range(n)] = 1
    return sm

def argsort_sim_mat(sm):
    idx = [np.argmax(np.sum(sm, axis=1))]  # a
    for i in range(1, len(sm)):
        sm_i = sm[idx[-1]].copy()
        sm_i[idx] = -1
        idx.append(np.argmax(sm_i))  # b
    return np.array(idx)

n = 10
sim_mat = create_dummy_sim_mat(n=n)
idx = argsort_sim_mat(sim_mat)
sim_mat2 = sim_mat[idx, :][:, idx]  # apply reordering for rows and columns
# Plot results
fig, ax = plt.subplots(1, 2)
def ticks(_ax, ti, la):
ticks(_ax=ax[0], ti=range(n), la=range(n))
ticks(_ax=ax[1], ti=range(n), la=idx)
After meTchaikovsky's answer I also tested my idea on a clustered similarity matrix (see first image) this method works but is not perfect (see second image).
Because I use the similarity between two elements as approximation to their similarity to all other elements, it is quite clear why this does not work perfectly.
So instead of using the initial similarity to sort the elements one could calculate a second order similarity matrix which measures how similar the similarities are (sorry).
This measure describes better what you are interested in. If two rows / columns have similar colors they should be close to each other. The algorithm to sort the matrix is the same as before
def add_cluster(sm, c=3):
    idx_cluster = np.array_split(np.random.permutation(np.arange(len(sm))), c)
    for ic in idx_cluster:
        cluster_noise = np.random.uniform(0.9, 1.0, (len(ic),)*2)
        sm[ic[np.newaxis, :], ic[:, np.newaxis]] = cluster_noise

def get_sim_mat2(sm):
    return 1 / (np.linalg.norm(sm[:, np.newaxis] - sm[np.newaxis], axis=-1) + 1/n)
sim_mat = create_dummy_sim_mat(n=100)
add_cluster(sim_mat, c=4)
sim_mat2 = get_sim_mat2(sim_mat)
idx = argsort_sim_mat(sim_mat)
idx2 = argsort_sim_mat(sim_mat2)
sim_mat_sorted = sim_mat[idx, :][:, idx]
sim_mat_sorted2 = sim_mat[idx2, :][:, idx2]
# Plot results
fig, ax = plt.subplots(1, 3)
The results with this second method are quite good (see third image)
but I guess there exist cases where this approach also fails, so I would be happy about feedback.
I tried to explain it and did also link the ideas to the code with [a] and [b], but obviously I did not do a good job, so here is a second more verbose explanation.
You have n elements and an n x n similarity matrix sm where each cell (i, j) describes how similar element i is to element j. The goal is to order the rows / columns in such a way that one can see existing patterns in the similarity matrix. My idea to achieve this is really simple.
You start with an empty list and add elements one by one. The criterion for the next element is the similarity to the current element. If element i was added in the last step, I choose the element argmax(sm[i, :]) as the next one, ignoring the elements already added to the list. I ignore those elements by setting their values to -1.
You can use the function ticks to reorder the labels:
labels = np.array(labels)  # make labels a NumPy array so it can be indexed with a list
ticks(_ax=ax[0], ti=range(n), la=labels[idx])
@scleronomic's solution is very elegant, but it has one shortcoming: we cannot set the number of clusters in the sorted correlation matrix. Assume we are working with a set of variables, some of which are weakly correlated:
import string
import numpy as np
import pandas as pd
n_variables = 20
n_clusters = 10
n_samples = 100
names = list(string.ascii_lowercase)[:n_variables]
belongs_to_cluster = np.random.randint(0,n_clusters,n_variables)
latent = np.random.randn(n_clusters,n_samples)
variables = np.random.rand(n_variables,n_samples)
for ind in range(n_clusters):
    mask = belongs_to_cluster == ind
    # weakening the correlation
    if ind % 2 == 0:variables[mask] += latent[ind]*0.1
    else:variables[mask] += latent[ind]
df = pd.DataFrame({key:val for key,val in zip(names,variables)})
corr_mat = np.array(df.corr())
As you can see, there are 10 clusters of variables by construction; however, variables within clusters that have an even index are weakly correlated. If we only want to see roughly 5 clusters in the sorted correlation matrix, maybe we need to find another way.
Based on this post, which is the accepted answer to the question "Clustering a correlation matrix", sorting a correlation matrix into blocks means finding blocks where correlations within blocks are high and correlations between blocks are low. However, the solution provided by that accepted answer works best when we know how many blocks there are in the first place and, more importantly, when the sizes of the underlying blocks are the same, or at least similar. Therefore, I improved the solution with a new function sort_corr_mat:
def sort_corr_mat(corr_mat,clusters_guess):

    def _swap_rows(corr_mat, var1, var2):
        rs = corr_mat.copy()
        rs[var2, :],rs[var1, :]= corr_mat[var1, :],corr_mat[var2, :]
        cs = rs.copy()
        cs[:, var2],cs[:, var1] = rs[:, var1],rs[:, var2]
        return cs

    # analysis
    max_iter = 500
    best_score,current_score,best_count = -1e8,-1e8,0
    num_minimua_to_visit = 20

    best_corr = corr_mat
    best_ordering = np.arange(n_variables)
    for i in range(max_iter):
        for row1 in range(n_variables):
            for row2 in range(n_variables):
                if row1 == row2: continue
                option_ordering = best_ordering.copy()
                option_ordering[row1],option_ordering[row2] = best_ordering[row2],best_ordering[row1]
                option_corr = _swap_rows(best_corr,row1,row2)
                option_score = score(option_corr,n_variables,clusters_guess)
                if option_score > best_score:
                    best_corr = option_corr
                    best_ordering = option_ordering
                    best_score = option_score

        if best_score > current_score:
            best_count += 1
            current_corr = best_corr
            current_ordering = best_ordering
            current_score = best_score

        if best_count >= num_minimua_to_visit:
            return best_corr#,best_ordering

    return best_corr#,best_ordering
With this function and the corr_mat constructed in the first place, I compared the result obtained with my function (on the right) with that obtained with @scleronomic's solution (in the middle):
sim_mat_sorted = corr_mat[argsort_sim_mat(corr_mat), :][:, argsort_sim_mat(corr_mat)]
corr_mat_sorted = sort_corr_mat(corr_mat,clusters_guess=5)
# Plot results
fig, ax = plt.subplots(1,3,figsize=(18,6))
Clearly, @scleronomic's solution works much better and faster, but my solution offers more control over the pattern of the output.
I'm trying to build an NMF model for topic extraction. For re-training of the model, I've to pass a parameter to the nmf function, for which I need to pass the x co-ordinate from a given point that the algorithm returns, here is the code for reference:
no_features = 1000
no_topics = 9
print ('Old number of topics: ', no_topics)
tfidf_vectorizer = TfidfVectorizer(max_df = 0.95, min_df = 2, max_features = no_features, stop_words = 'english')
tfidf = tfidf_vectorizer.fit_transform(documents)
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
no_topics = tfidf.shape
print('New number of topics :', no_topics)
# nmf = NMF(n_components = no_topics, random_state = 1, alpha = .1, l1_ratio = .5, init = 'nndsvd').fit(tfidf)
On the third last line, the tfidf.shape returns a point (3,1000) to the variable 'no_topics', however I want that variable to be set to only the x co-ordinate, i.e (3).
How can I extract just the x co-ordinate from the point?
You can select the first value with no_topics[0]:
print('New number of topics : {}'.format(no_topics[0]))
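As a small self-contained illustration: `.shape` is an ordinary tuple, so it can be indexed or unpacked. The array `tfidf_like` and its 3 x 1000 size are stand-ins invented for this sketch, not the real tf-idf matrix:

```python
import numpy as np

# Stand-in for the tf-idf matrix: 3 rows by 1000 columns (made-up sizes).
tfidf_like = np.zeros((3, 1000))

shape = tfidf_like.shape      # a plain tuple: (rows, columns)
no_topics = shape[0]          # keep only the first entry
n_rows, n_cols = shape        # or unpack both entries at once

print('New number of topics : {}'.format(no_topics))
```

Storing `tfidf.shape[0]` directly in `no_topics` avoids carrying the whole tuple around when only the row count is needed.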
You can do a slicing on your numpy array tfidf with
topics = tfidf[0,:]
I have a model which is defined as:
m(x,z) = C1*x^2*sin(z)+C2*x^3*cos(z)
I have multiple data sets for different z (z=1, z=2, z=3), in which they give me m(x,z) as a function of x.
The parameters C1 and C2 have to be the same for all z values.
So I have to fit my model to the three data sets simultaneously otherwise I will have different values of C1 and C2 for different values of z.
Is this possible with scipy.optimize?
I can do it for just one value of z, but can't figure out how to do it for all z's.
For one z I just write this:
def my_function(x,C1,C1):
return C1*x**2*np.sin(z)+ C2*x**3*np.cos(z)
data = 'some/path/for/data/z=1'
x= data[:,0]
y= data[:,1]
from lmfit import Model
gmodel = Model(my_function)
result = gmodel.fit(y, x=x, C1=1.1)
How can I do it for multiple data sets (i.e. different z values)?
So what you want to do is a multi-dimensional fit (2-D in your case) to your data; that way you get a single set of C parameters that best describes the entire data set. I think the best way to do this is using scipy.optimize.curve_fit().
So your code would look something like this:
import scipy.optimize as optimize
import numpy as np
def my_function(xz, *par):
    """ Here xz is a 2D array, so in the form [x, z] using your variables, and *par is an array of arguments (C1, C2) in your case """
    x = xz[:,0]
    z = xz[:,1]
    return par[0] * x**2 * np.sin(z) + par[1] * x**3 * np.cos(z)

# generate fake data. You will presumably have this already
x = np.linspace(0, 10, 100)
z = np.linspace(0, 3, 100)
xx, zz = np.meshgrid(x, z)
xz = np.array([xx.flatten(), zz.flatten()]).T
fakeDataCoefficients = [4, 6.5]
fakeData = my_function(xz, *fakeDataCoefficients) + np.random.uniform(-0.5, 0.5, xx.size)

# Fit the fake data and return the set of coefficients that jointly fit the x and z
# points (and will hopefully be the same as the fakeDataCoefficients)
popt, _ = optimize.curve_fit(my_function, xz, fakeData, p0=fakeDataCoefficients)

# Print the results
print(popt)
When I do this fit I get precisely the fakeDataCoefficients I used to generate the function, so the fit works well.
So the conclusion is that you don't do 3 fits independently, setting the value of z each time, but instead you do a 2D fit which takes the values of x and z simultaneously to find the best coefficients.
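Because this particular model is linear in C1 and C2, the simultaneous fit can also be sketched with plain linear least squares instead of curve_fit. This swaps in `np.linalg.lstsq` as an alternative technique; the coefficients, noise level and x grid below are made up purely to generate test data:

```python
import numpy as np

rng = np.random.default_rng(0)
true_c1, true_c2 = 4.0, 6.5   # made-up values, used only to synthesize data

# Three synthetic data sets, one per z value, on a shared x grid.
x = np.linspace(0.0, 2.0, 50)
z_values = [1.0, 2.0, 3.0]
datasets = [
    true_c1 * x**2 * np.sin(z) + true_c2 * x**3 * np.cos(z)
    + rng.normal(scale=0.05, size=x.size)
    for z in z_values
]

# Stack all data sets. Each row of the design matrix holds the two basis
# terms x^2 sin(z) and x^3 cos(z) for one (x, z) pair.
design = np.vstack([
    np.column_stack([x**2 * np.sin(z), x**3 * np.cos(z)])
    for z in z_values
])
target = np.concatenate(datasets)

# One least-squares solve yields a single (C1, C2) shared by every z.
(c1, c2), *_ = np.linalg.lstsq(design, target, rcond=None)
print(c1, c2)
```

The stacking step is the same idea as the 2D fit: every observation contributes one residual, and the z value only changes which basis values appear in its row.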
Your code is incomplete and has a few syntax errors.
But I think that you want to build a model that concatenates the models for the different data sets, and then fit the concatenated data to that model. Within the context of lmfit (disclosure: author and maintainer), I often find it easier to use minimize() and an objective function for multiple data set fits rather than the Model class. Perhaps start with something like this:
import lmfit
import numpy as np
# define the model function for each dataset
def my_function(x, c1, c2, z=1):
    return c1*x**2*np.sin(z)+ c2*x**3*np.cos(z)

# Then write an objective function like this
def f2min(params, x, data2d, zlist):
    ndata, npts = data2d.shape
    residual = 0.0*data2d[:]
    for i in range(ndata):
        c1 = params['c1_%d' % (i+1)].value
        c2 = params['c2_%d' % (i+1)].value
        residual[i,:] = data2d[i,:] - my_function(x, c1, c2, z=zlist[i])
    return residual.flatten()

# now build that `data2d`, `zlist` and build the `Parameters`
data2d = []
zlist = []
x = None
for fname in dataset_names:
    d = np.loadtxt(fname) # or however you read / generate data
    if x is None: x = d[:, 0]
    data2d.append(d[:, 1])
    zlist.append(z_for_dataset(fname)) # or however ...

data2d = np.array(data2d) # turn list into nd array
ndata, npts = data2d.shape
params = lmfit.Parameters()
for i in range(ndata):
    params.add('c1_%d' % (i+1), value=1.0) # give a better starting value!
    params.add('c2_%d' % (i+1), value=1.0) # give a better starting value!
# now you're ready to do the fit and print out the results:
result = lmfit.minimize(f2min, params, args=(x, data2d, zlist))
That code is really a sketch and is all untested, but hopefully it will give you a good starting foundation.
How do I generate random matrices and multiply them in an efficient way?
This is what I've done:
mat1 = []
for i in range(0, order):
num1 = random.sample(range(1,10), order)
print("Result of Matrix Multiplication.")
for p in range(len(mat1)):
    for q in range(len(mat2[0])):
        for r in range(len(mat2)):
            res_matrix[p][q] += mat1[p][r] * mat2[r][q]
for res in res_matrix:
You can use list comprehension to generate res_matrix using
res_matrix = [[0 for i in range(order)] for j in range(order)]
Also, have you heard of numpy? It does this kind of computation (and many more) in an easy and very fast way. This is what your code would become with numpy:
import numpy as np
print("Generate 1st Matrix")
mat1 = np.random.randint(1, 10, size=(order, order))
print("Generate 2nd Matrix")
mat2 = np.random.randint(1, 10, size=(order, order))
res_matrix = mat1.dot(mat2)
print("Result of Matrix Multiplication.")
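To check that the vectorized product matches the loop version, here is a small self-contained sketch. The value of `order` and the random seed are arbitrary choices for the illustration, and it uses the `@` operator, which for 2-D arrays computes the same matrix product as `np.dot`:

```python
import numpy as np

order = 3  # hypothetical matrix size for the demonstration
rng = np.random.default_rng(42)
mat1 = rng.integers(1, 10, size=(order, order))
mat2 = rng.integers(1, 10, size=(order, order))

# Pure-Python triple loop, in the style of the question.
looped = [[0] * order for _ in range(order)]
for p in range(order):
    for q in range(order):
        for r in range(order):
            looped[p][q] += mat1[p][r] * mat2[r][q]

# The same product in a single vectorized call.
res_matrix = mat1 @ mat2

print(np.array_equal(res_matrix, np.array(looped)))
```

For small matrices the difference is negligible, but the triple loop does order**3 Python-level multiplications, so the vectorized call pulls ahead quickly as `order` grows.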