Efficient sparse matrix column change - python-3.x

I'm implementing an efficient PageRank algorithm so I'm using sparse matrices. I'm close, but there's one problem. I have a matrix where I want the sum of each column to be one. This is easy to implement, but the problem occurs when I get a matrix with a zero column.
In this case, I want to set each element in the column to be 1/(n-1) where n is the dimension of the matrix. I divide by n-1 and not n because I wish to keep the diagonals zero, always.
How can I implement this efficiently? My naive solution is to compute the sum of each column, find the indices of the columns that sum to zero, and replace each such column with the value 1/(n-1), like so:
# naive approach (too slow!)
# M is my nxn sparse matrix where each column sums to one
col_sums = M.sum(axis=0)
for i in range(n):
    if col_sums[0,i] == 0:
        # set entire column to 1/(n-1)
        M[:, i] = 1/(n-1)
        # make sure diagonal is zeroed
        M[i,i] = 0
My matrix M is very large and this method simply doesn't scale. How can I do this efficiently?

You can't add new nonzero values without reallocating and copying the underlying data structure. If you expect these zero columns to be very common (> 25% of the data) you should handle them in some other way, or you're better off with a dense array.
Otherwise try this:
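import scipy.sparse

M = scipy.sparse.rand(1000, 1000, density=0.001, format='csr')
# Empty sparse matrix that will hold only the fill values
nz_col_weights = scipy.sparse.csr_matrix(M.shape, dtype=M.dtype)
# Fill every column of M that has no stored entries with 1/(n-1)
nz_col_weights[:, M.getnnz(axis=0) == 0] = 1 / (M.shape[0] - 1)
# Keep the diagonal at zero
nz_col_weights.setdiag(0)
M += nz_col_weights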
This requires only two allocation operations.
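If zero columns are common, one way to handle them "in some other way" is to never store the filled columns and instead account for them inside the matrix-vector product that PageRank iterates. The following is a minimal sketch under assumptions, not part of the answer above: `matvec_with_dangling` is a hypothetical helper name, `x` is assumed to be a dense 1-D NumPy vector, and a column counts as zero when it has no stored entries.

```python
import numpy as np
import scipy.sparse as sp

def matvec_with_dangling(M, x):
    """Compute (M with zero columns filled by 1/(n-1), diagonal 0) @ x
    without materializing the filled columns."""
    n = M.shape[0]
    x = np.asarray(x, dtype=float)
    dangling = M.getnnz(axis=0) == 0       # True for columns with no stored entries
    s = x[dangling].sum()                  # total weight sitting on zero columns
    # A zero column j adds x[j] / (n - 1) to every row except row j,
    # so row i receives (s - x[i]) / (n - 1) if i is itself a zero column,
    # and s / (n - 1) otherwise.
    correction = (s - np.where(dangling, x, 0.0)) / (n - 1)
    return M @ x + correction
```

The sparse matrix is left untouched, so the extra cost per product is O(n) regardless of how many columns are zero.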

Related

Remove linearly dependent columns in (n x m) matrix of real numbers

I am working on a model for forecasting. My independent variables are contained in an (n x m) matrix, where n represents the number of observations and m represents the number of features. Each column contains one of the following data types: binary, integer, real. I want to remove the columns that are linearly dependent. Pulling resources from the internet, I have come up with the following function:
import numpy as np
import pandas as pd
from sklearn.utils.extmath import randomized_svd

def remove_linearly_dependent_columns(df):
    # Create a matrix from the DataFrame
    matrix = df.to_numpy()
    # Get the rank of the matrix
    rank = np.linalg.matrix_rank(matrix)
    # If rank is equal to the number of columns, then all columns are linearly independent
    if rank == matrix.shape[1]:
        return df
    # Otherwise, use randomized SVD to find linearly dependent columns
    _, s, vh = randomized_svd(matrix, n_components=rank, random_state=0)
    # Get a threshold for small singular values
    threshold = np.finfo(np.float64).eps * max(matrix.shape) * s[0]
    # Get the number of linearly independent columns
    num_independent = np.sum(s > threshold)
    # Select only the linearly independent columns
    independent_columns = vh[:num_independent].T
    # Create a new DataFrame with only the linearly independent columns
    independent_df = pd.DataFrame(data=matrix @ independent_columns, columns=df.columns[:num_independent], index=df.index)
    return independent_df
However, this function applies a transformation to the data, which is not what I need. I would need to obtain the original dataset, just without the columns that are linearly dependent. How could I solve this issue?
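One possible way to keep the original columns is to let a column-pivoted QR factorization pick them, since its pivots are indices into the original matrix. This is a sketch under assumptions, not a vetted solution: `independent_column_indices` and `tol` are names introduced here, the data is assumed to be numeric, and the default tolerance mirrors the eps * max(shape) * largest-value rule used in the function above.

```python
import numpy as np
from scipy.linalg import qr

def independent_column_indices(A, tol=None):
    """Return sorted indices of a set of linearly independent columns of A."""
    A = np.asarray(A, dtype=float)
    # Column-pivoted QR: A[:, piv] = Q @ R, with |diag(R)| non-increasing
    _, R, piv = qr(A, mode="economic", pivoting=True)
    diag = np.abs(np.diag(R))
    if diag.size == 0:
        return np.array([], dtype=int)
    if tol is None:
        tol = np.finfo(float).eps * max(A.shape) * diag[0]
    # Numerical rank: number of diagonal entries above the tolerance
    rank = int(np.sum(diag > tol))
    # The first `rank` pivots index original columns that span the column space
    return np.sort(piv[:rank])
```

With a DataFrame, `df.iloc[:, independent_column_indices(df.to_numpy())]` would then return the untransformed data restricted to the selected columns.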

Apply this function to a 2D numpy Matrix vector operations only

guys, I have this function
def averageRating(a,b):
    avg = (float(a)+float(b))/2
    return round(avg/25)*25
Currently, I am looping over my np array, which is just a 2D array of numerical values. What I want is to have "a" be the 1st array and "b" be the 2nd array, get the average per row, and return just an array with the resulting values. I have used mean but could not find a way to add the round() part, i.e. rounding the average to a multiple of 25 via round(avg/25)*25.
My goal is to get rid of the loop and replace it with vectorized operations, because looping is slow.
Sorry for the question; I'm new to Python and NumPy.
import numpy as np

def averageRating(a,b):
    avg = (np.average(a,axis=1) + np.average(b,axis=1))/2
    return np.round(avg,0)
This should do what you are looking for if I understand the question correctly. Specifying axis = 1 in np.average will give the average of the rows (axis = 0 would be the average of the columns). And the 0 in np.round will round to 0 decimal places, changing it will change the number of decimal places you round to. Hope that helps!
def averageRating(a, b):
    averages = []
    for i in range( len(a) ):
        averages.append( (a[i] + b[i]) / 2 )
    return averages
Given that your arrays are of equal length, this should be a simple solution.
This doesn't eliminate the use of for loops, however, it will be computationally cheaper than the current approach.
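A loop-free version of the question's original function can be sketched with NumPy's element-wise arithmetic. This assumes `a` and `b` are arrays of the same shape and that each average should be rounded to the nearest multiple of 25, as in the question's round(avg/25)*25; `average_rating` is a name chosen here for illustration. Note that `np.round`, like Python 3's `round`, rounds exact halves to the nearest even integer.

```python
import numpy as np

def average_rating(a, b):
    """Element-wise average of a and b, rounded to the nearest multiple of 25."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    avg = (a + b) / 2
    return np.round(avg / 25) * 25

ratings = average_rating([0, 50, 100, 10], [50, 100, 100, 20])
```

If a per-row result is wanted for 2D inputs, the same expression could be followed by a reduction such as `.mean(axis=1)` before rounding, depending on the intended semantics.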

Two regular loops with using given values for a parameter in MATLAB

I have an S1 (21x21) matrix and a W (21x21) matrix given. I define a matrix results with each element as a matrix as results = {W};
and then, I have two regular for loops such that it runs all the values in index1 and then goes to the second index; but each time it should take a specific value of k for example.
There are also two given vectors cos and ens each having dimension 21x1. Here is the code:
rowsP=21;
M=0;
beta=0.9;
p=0.5;
q=0.5;
k= [1:rowsP-1];
for j=1:rowsP-k
    for i=1:rowsP-k
        R(i,j) = ( S1(i,end-k) - cos(j+k) ) *ens(j)-0.001*M + ...
            beta*(p*results{k}(i,end-j)+q*results{k}(i+1,end-j));
        results{k+1}=fliplr(R);
    end
end
I am getting the error
Matrix Dimensions must agree.
So I am trying to calculate a matrix results each time using two for loops given results{1}=W (a given matrix) given k=1.
Then flipping the matrix left to right, I get results{2} which then helps to calculate R again but for k=2. And this is then repeated until k=21.
As you see, I keep dropping the last column of each successive R, the matrix results should be appended each time giving a row of 21 elements each cell having 21x21 matrix (the given matrix W) and then a matrix of 20x20 and then 19x19 and so on... until a matrix of 1x1. I am unable to solve the problem as Matlab only does 1 iteration and then does not compute the correct answer. I keep getting two cells in results with a 21x21 matrix (the one given) and the next 20x20 matrix.
I tried with another for loop for k, but in that case, for a given k, starting from k=1, it runs the whole code for j and then i, but it does not solve my problem.

element-wise multiplication in blockMatrix in Spark

I am wondering whether there is a good way to implement the following problem with Spark's BlockMatrix:
Suppose I have a matrix A of size n * m, where n is huge and m is not that large, represented by a BlockMatrix. I have another matrix B with only one column, of size n * 1, also a BlockMatrix. Now I want to multiply A by B in a way that is very close to element-wise multiplication: the first row of A is multiplied by the first element of B, the second row by the second element, and so on.
So the result will still be an n * m matrix. In effect, B acts as a set of weights applied to the rows of A.
I think the key is how to map the corresponding blocks together.
I know I can turn B into an n * n diagonal matrix, and then B * A gives the result. Multiplication is already supported in Spark, but this is not a good idea since it costs much more in communication.
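To make the intended operation concrete, here is a small NumPy illustration (not Spark code, and not a BlockMatrix solution) showing that scaling each row of A by the matching entry of B gives the same result as multiplying by the diagonal matrix built from B. The array values are made up for the example.

```python
import numpy as np

A = np.arange(12, dtype=float).reshape(4, 3)   # n x m data matrix (n=4, m=3)
b = np.array([1.0, 2.0, 0.5, 0.0])             # n weights, one per row of A

# Row-wise scaling via broadcasting: row i of A is multiplied by b[i]
scaled = A * b[:, None]

# The diagonal-matrix formulation from the question, for comparison
via_diag = np.diag(b) @ A
```

The broadcasting form touches only n * m values, whereas the diagonal form builds an n * n matrix, which is the overhead the question wants to avoid.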

masking a double over a string

This is a MATLAB question.
I have two matrices, one being a (5 x 1 double) :
1
2
3
1
3
And the second matrix being a (5 x 3 string), with spaces where no character appears :
a
bc
def
g
hij
I am trying to get a (5 x 1 string) output that contains the nth character from each line of matrix two, where n is the value in matrix one. I am unsure how to do this using a mask that would be able to handle much larger matrices. My target matrix would be the following:
a
c
f
g
j
Thank you very much for the help!!!
There are so many ways you can accomplish this task. I'll give you two.
Method #1 - Generate linear indices and access elements
Use sub2ind to generate a set of linear indices that correspond to the row and column locations you want to access in your matrix. You'll note that the column locations are the ones changing, but the row locations are always increasing by 1 as you want to access each row. As such, given your string matrix A, and your columns you want to access stored in ind, just do this:
A = ['a  '; 'bc '; 'def'; 'g  '; 'hij'];
ind = [1 2 3 1 3];
out = A(sub2ind(size(A), (1:numel(ind)).', ind(:)))
out =
a
c
f
g
j
Method #2 - Create a sparse matrix, convert to logical and access
Alternatively, you can create a sparse matrix through sparse where the rows of the non-zero entries run from 1 up to as many elements as you have in ind, and the columns are the values in ind.
S = sparse((1:numel(ind)).',ind(:),true,size(A,1),size(A,2));
A = A.'; out = A(S.');
Be mindful that you are trying to access each element in a row-major fashion, yet MATLAB will do this in a column-major format. As such, we would need to transpose our data matrix, and also take our sparse matrix and transpose that too. The end result should give you the same order as Method #1.
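For comparison, both ideas can be sketched in NumPy, which is a different environment from the MATLAB code above: indices are 0-based and boolean masks are read in row-major order, so no transpose is needed. The variable names below are chosen for this illustration only.

```python
import numpy as np

# 5 x 3 character matrix, padded with spaces as in the question
A = np.array([list("a  "), list("bc "), list("def"), list("g  "), list("hij")])
ind = np.array([1, 2, 3, 1, 3])        # 1-based column numbers from the question

rows = np.arange(len(ind))

# Analogue of Method #1: pair row k with column ind[k] - 1
out = A[rows, ind - 1]

# Analogue of Method #2: build a boolean mask and index with it
mask = np.zeros(A.shape, dtype=bool)
mask[rows, ind - 1] = True
out_mask = A[mask]                     # read row by row, one element per row
```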
