I am looking for a good way to implement the following operation with Spark's BlockMatrix.
Suppose I have an n * m matrix A, where n is huge and m is not that large, represented as a BlockMatrix. I have another matrix B with only one column (size n * 1), also a BlockMatrix. I want to multiply A by B in a way that is very close to element-wise multiplication: the first row of A is multiplied by the first element of B, the second row by the second element, and so on.
The result is still an n * m matrix. In effect, B acts as a per-row weight applied to A.
I think the key is how to map the corresponding blocks together.
I know I can turn B into an n * n diagonal matrix, and then B * A gives the result. Multiplication is already supported in Spark, but this is not a good idea since it costs much more in communication.
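To pin down the operation being asked for, here is a small single-machine NumPy sketch (not Spark code). It shows that scaling row i of A by the i-th entry of B gives the same result as multiplying by the n * n diagonal matrix, without ever building that matrix. The helper name scale_rows is hypothetical and only for illustration.

```python
import numpy as np

def scale_rows(A, b):
    """Multiply row i of A by b[i] (hypothetical helper, uses broadcasting)."""
    # Accept b as a flat vector or as an n x 1 column.
    b = np.asarray(b).reshape(-1, 1)
    return A * b

A = np.arange(6, dtype=float).reshape(3, 2)   # 3 x 2 example matrix
b = np.array([1.0, 10.0, 100.0])              # one weight per row

scaled = scale_rows(A, b)
# Same result as the diagonal-matrix product, which needs an n x n matrix.
same_as_diag = np.allclose(scaled, np.diag(b) @ A)
```

In a distributed setting the equivalent step would pair each block of A with the block of B covering the same row range, but that part is not shown here.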
I'm implementing an efficient PageRank algorithm so I'm using sparse matrices. I'm close, but there's one problem. I have a matrix where I want the sum of each column to be one. This is easy to implement, but the problem occurs when I get a matrix with a zero column.
In this case, I want to set each element in the column to be 1/(n-1) where n is the dimension of the matrix. I divide by n-1 and not n because I wish to keep the diagonals zero, always.
How can I implement this efficiently? My naive solution is to compute the sum of each column, find the column indices whose sum is zero, and replace the entire column with the value 1/(n-1), like so:
# naive approach (too slow!)
# M is my nxn sparse matrix where each column sums to one
col_sums = M.sum(axis=0)
for i in range(n):
    if col_sums[0,i] == 0:
        # set entire column to 1/(n-1)
        M[:, i] = 1/(n-1)
        # make sure diagonal is zeroed
        M[i,i] = 0
My M matrix is very very very large and this method simply doesn't scale. How can I do this efficiently?
You can't add new nonzero values without reallocating and copying the underlying data structure. If you expect these zero columns to be very common (> 25% of the data) you should handle them in some other way, or you're better off with a dense array.
Otherwise try this:
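import scipy.sparse
M = scipy.sparse.rand(1000, 1000, density=0.001, format='csr')
# empty sparse matrix that will hold the fill values for the all-zero columns
nz_col_weights = scipy.sparse.csr_matrix(M.shape, dtype=M.dtype)
# columns of M with no stored entries get 1/(n-1) everywhere
nz_col_weights[:, M.getnnz(axis=0) == 0] = 1.0 / (M.shape[0] - 1)
# keep the diagonal at zero
nz_col_weights.setdiag(0)
M += nz_col_weights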
This needs only two allocation operations.
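As a variation on the approach above, here is a minimal sketch that builds the fill values directly in COO form, so no element-wise assignment into a sparse matrix is needed. The function name fill_zero_columns is hypothetical, and the sketch assumes a column is "zero" when it has no stored entries.

```python
import numpy as np
import scipy.sparse as sp

def fill_zero_columns(M):
    """Return a copy of M where every empty column holds 1/(n-1) off the diagonal.

    Hypothetical helper: M is assumed square, and a column counts as zero
    when it has no stored entries.
    """
    M = sp.csc_matrix(M)
    n = M.shape[0]
    zero_cols = np.flatnonzero(M.getnnz(axis=0) == 0)
    # One (row, col) pair for every cell of every empty column ...
    rows = np.tile(np.arange(n), zero_cols.size)
    cols = np.repeat(zero_cols, n)
    # ... except the diagonal cell, which must stay zero.
    keep = rows != cols
    vals = np.full(int(keep.sum()), 1.0 / (n - 1))
    fill = sp.coo_matrix((vals, (rows[keep], cols[keep])), shape=M.shape)
    return (M + fill).tocsc()
```

The fill matrix stores n-1 values per empty column, so the caveat about many zero columns still applies: the result becomes dense quickly.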
scikit-learn's PolynomialFeatures preprocessor has an include_bias option. This essentially just adds a column of ones to the dataframe. I was wondering what the point of it is. Of course, you can set it to False. But theoretically, how does having or not having a column of ones alongside the generated polynomial features affect regression?
This is the explanation in the documentation, but I can't get anything useful out of it regarding why it should or should not be used.
include_bias : boolean
If True (default), then include a bias column, the feature in which
all polynomial powers are zero (i.e. a column of ones - acts as an
intercept term in a linear model).
Suppose you want to perform the following regression:
y ~ a + b x + c x^2
where x is a generic sample. The best coefficients a, b, c are computed via simple matrix algebra. First, let us denote with X = [1 | x | x^2] a matrix with N rows, where N is the number of samples. The first column is a column of 1s, the second column holds the values x_i for all samples i, and the third column holds the values x_i^2 for all samples i. Let us denote with B the column vector B = [a b c]^T. If y is the column vector of the N target values for all samples i, we can write the regression as
y ~ X B
The i-th row of this equation is y_i ~ [1 x_i x_i^2] [a b c]^T = a + b x_i + c x_i^2.
The goal of training a regression is to find B = [a b c]^T such that X B is as close as possible to y.
If you don't add a column of 1s, you are assuming a priori that a = 0, which might not be correct.
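A small NumPy sketch of this point, using made-up noise-free data with a=2, b=3, c=0.5 (these values are assumptions for illustration only). With the column of 1s, least squares recovers all three coefficients. Without it, the fit is forced through a=0 and cannot match the targets.

```python
import numpy as np

# Made-up, noise-free data generated from y = 2 + 3 x + 0.5 x^2.
x = np.linspace(-1.0, 1.0, 21)
y = 2.0 + 3.0 * x + 0.5 * x**2

# Design matrices with and without the column of 1s.
X_with = np.column_stack([np.ones_like(x), x, x**2])
X_without = np.column_stack([x, x**2])

# Least-squares solutions of X B ~ y.
coef_with, *_ = np.linalg.lstsq(X_with, y, rcond=None)
coef_without, *_ = np.linalg.lstsq(X_without, y, rcond=None)

# Largest absolute fitting error in each case.
err_with = np.abs(X_with @ coef_with - y).max()
err_without = np.abs(X_without @ coef_without - y).max()
```

Here err_with is essentially zero, while err_without is large because the model without the intercept predicts 0 at x = 0, where the true value is 2.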
In practice, when you write Python code, and you use PolynomialFeatures together with sklearn.linear_model.LinearRegression, the latter takes care by default of adding a column of 1s (since in LinearRegression the fit_intercept parameter is True by default), so you don't need to add it as well in PolynomialFeatures. Therefore, in PolynomialFeatures one usually keeps include_bias=False.
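The situation is different if you use statsmodels' OLS instead of LinearRegression: OLS does not add an intercept by default, so there the column of 1s has to come from somewhere, for example by keeping include_bias=True.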
Suppose you have Matrix A.
Suppose also that we have Matrix C
If A = B x C and we want to find the values of matrix B, which I believe should be 3x3 (correct me if I am wrong),
do we need to use matrix inversion here? I have not used linear algebra in many years.
I do not have any code yet, but if someone can provide a snippet that would be great.
This is a problem I have in image processing, where A and C hold RGB values.
The submitted matrices are just for illustration.
I am trying to solve this problem using Python numpy
I hope that someone can help with it.
Your matrix should be 5x5. As we are dealing with non-square matrices, you could use the generalized inverse of C to obtain B:
import numpy as np
np.random.seed(10)
A = np.random.randint(0,9,(5,3))
C = np.random.randint(0,9,(5,3))
B = np.matmul(A,np.linalg.pinv(C))
print(B)
Building on percusse's comment, you can do this with numpy.linalg.lstsq. However, lstsq performs matrix left division, whereas the situation in your question calls for right division.
You are solving for B with B = A / C, while lstsq solves problems of the type A \ C. We can convert the former into the latter by:
B = A / C = (C' \ A')'
The ' operator is the transpose. The above follows from linear algebra rules. Specifically, perform two transposes, ((A / C)')', since transposing a matrix twice gives back the original matrix. Then, knowing that (AC)' is equal to C'A' and that the inverse of the transpose is equal to the transpose of the inverse, you get the above relationship.
Therefore:
B = numpy.linalg.lstsq(C.T, A.T, rcond=None)[0].T
The output of lstsq is a tuple where the first element is the actual solution.
Take note that for your particular example, C is a rank-deficient matrix so you won't be able to reconstruct A properly from B and C.
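As a quick check of the two answers above, here is a minimal sketch comparing the pseudo-inverse route with the transposed lstsq route. The data is made up for illustration, and it assumes C has full column rank (unlike the rank-deficient C in the question) so that A can be reconstructed.

```python
import numpy as np

rng = np.random.default_rng(0)
C = rng.normal(size=(5, 3))        # made-up data, full column rank with probability 1
B_true = rng.normal(size=(5, 5))
A = B_true @ C                     # so that A = B x C has at least one exact solution

# Route 1: right-multiply by the generalized (Moore-Penrose) inverse of C.
B_pinv = A @ np.linalg.pinv(C)

# Route 2: right division via transposes, B = (C' \ A')'.
B_lstsq = np.linalg.lstsq(C.T, A.T, rcond=None)[0].T

# Maximum reconstruction error of A from the recovered B.
reconstruction_error = np.abs(B_pinv @ C - A).max()
```

Both routes return the minimum-norm solution, so they agree with each other, but B is not unique here: the recovered matrix reproduces A without necessarily equalling B_true.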
I am given an S1 (21x21) matrix and a W (21x21) matrix. I define a cell array results, each element of which is a matrix, as results = {W};
Then I have two regular for loops, so that the code runs through all the values of the first index and then moves on to the second index; each time it should use one specific value of k.
There are also two given vectors, cos and ens, each of dimension 21x1. Here is the code:
rowsP=21;
M=0;
beta=0.9;
p=0.5;
q=0.5;
k= [1:rowsP-1];
for j=1:rowsP-k
    for i=1:rowsP-k
        R(i,j) = ( S1(i,end-k) - cos(j+k) ) *ens(j)-0.001*M + ...
            beta*(p*results{k}(i,end-j)+q*results{k}(i+1,end-j));
        results{k+1}=fliplr(R);
    end
end
I am getting the error
Matrix Dimensions must agree.
So I am trying to calculate the cell array results using two for loops, starting from results{1}=W (a given matrix) with k=1.
Flipping the matrix left to right then gives results{2}, which is used to calculate R again, this time for k=2. This is repeated until k=21.
As you can see, I keep dropping the last column of each successive R. results should grow each time, ending up as a row of 21 cells: the first holds the given 21x21 matrix W, the next a 20x20 matrix, then 19x19, and so on down to a 1x1 matrix. I am unable to solve the problem, as MATLAB does only one iteration and does not compute the correct answer. I keep getting just two cells in results: the given 21x21 matrix and the next 20x20 matrix.
I also tried adding another for loop over k, but in that case, for each k starting from k=1, it runs the whole code for j and then i, and that does not solve my problem.
I tried to solve the problem of two dimensional search using a combination of Aho-Corasick and a single dimensional KMP, however, I still need something faster.
To elaborate, I have a matrix A of characters of size n1*n2 and I wish to find all occurrences of a smaller matrix B of size m1*m2 and I want that to be in O(n1*n2+m1*m2) if possible.
For example:
A = a b c b c b
    b c a c a c
    d a b a b a
    q a s d q a
and
B = b c b
    c a c
    a b a
The algorithm should return the indices of, say, the upper-left corner of each match, which in this case are (0,1) and (0,3). Notice that the occurrences may overlap.
There is an algorithm called the Baker-Bird algorithm that I just recently encountered that appears to be a partial generalization of KMP to two dimensions. It uses two algorithms as subroutines - the Aho-Corasick algorithm (which itself is a generalization of KMP), and the KMP algorithm - to efficiently search a two-dimensional grid for a pattern.
I'm not sure if this is what you're looking for, but hopefully it helps!
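The Baker-Bird idea described above can be sketched as follows. This is a minimal illustration rather than a tuned implementation: it assumes every row of the pattern has the same length, and the function names are hypothetical. Each text row is scanned with an Aho-Corasick automaton built from the pattern rows, and each column of the resulting row labels is then scanned with KMP.

```python
from collections import deque

def build_automaton(rows):
    """Aho-Corasick trie over the pattern rows; equal rows share one label."""
    goto, fail, label = [{}], [0], [None]
    row_ids, next_id = [], 0
    for row in rows:
        s = 0
        for ch in row:
            if ch not in goto[s]:
                goto.append({}); fail.append(0); label.append(None)
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        if label[s] is None:
            label[s] = next_id
            next_id += 1
        row_ids.append(label[s])
    queue = deque(goto[0].values())          # depth-1 states fail to the root
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            queue.append(t)
    return goto, fail, label, row_ids

def prefix_function(p):
    """Standard KMP failure table."""
    pi, k = [0] * len(p), 0
    for i in range(1, len(p)):
        while k and p[i] != p[k]:
            k = pi[k - 1]
        if p[i] == p[k]:
            k += 1
        pi[i] = k
    return pi

def find_2d(text, pattern):
    """Return (row, col) of the upper-left corner of every occurrence."""
    m1, m2 = len(pattern), len(pattern[0])
    goto, fail, label, row_ids = build_automaton(pattern)
    # Phase 1: labels[i][j] = id of the pattern row ending at text[i][j], if any.
    labels = []
    for row in text:
        s, out = 0, []
        for ch in row:
            while s and ch not in goto[s]:
                s = fail[s]
            s = goto[s].get(ch, 0)
            out.append(label[s])
        labels.append(out)
    # Phase 2: KMP down each column, searching for the sequence row_ids.
    pi, matches = prefix_function(row_ids), []
    for j in range(m2 - 1, len(text[0])):
        k = 0
        for i in range(len(text)):
            while k and labels[i][j] != row_ids[k]:
                k = pi[k - 1]
            if labels[i][j] == row_ids[k]:
                k += 1
            if k == m1:
                matches.append((i - m1 + 1, j - m2 + 1))
                k = pi[k - 1]
    return matches
```

Because all pattern rows have the same length, no pattern row can be a proper suffix of another, which is why the sketch gets away without Aho-Corasick output links. Matches are reported column by column, so sort the result if row-major order is needed.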