Scikit-learn PolynomialFeatures - what is the use of the include_bias option?

In scikit-learn's PolynomialFeatures preprocessor, there is an include_bias option. This essentially just adds a column of ones to the dataframe. I was wondering what the point of having this is. Of course, you can set it to False. But theoretically, how does having or not having a column of ones alongside the generated polynomial features affect regression?
This is the explanation in the documentation, but I can't seem to get anything useful out of it in relation to why it should be used or not:
include_bias : boolean
If True (default), then include a bias column, the feature in which
all polynomial powers are zero (i.e. a column of ones - acts as an
intercept term in a linear model).

Suppose you want to perform the following regression:
y ~ a + b x + c x^2
where x is a generic sample. The best coefficients a, b, c are computed via simple matrix calculus. First, let us denote with X = [1 | x | x^2] a matrix with N rows, where N is the number of samples. The first column is a column of 1s, the second column holds the values x_i for all samples i, and the third column holds the values x_i^2 for all samples i. Let us denote with B the column vector B = [a b c]^T. If Y is the column vector of the N target values, we can write the regression as
Y ~ X B
The i-th row of this equation is y_i ~ [1 x_i x_i^2] [a b c]^T = a + b x_i + c x_i^2.
The goal of training the regression is to find B = [a b c]^T such that X B is as close as possible to Y.
If you don't add a column of 1s, you are assuming a priori that a = 0, which might not be correct.
In practice, when you write Python code and use PolynomialFeatures together with sklearn.linear_model.LinearRegression, the latter takes care of adding the column of 1s by default (since in LinearRegression the fit_intercept parameter is True by default), so you don't need to add it in PolynomialFeatures as well. Therefore, in PolynomialFeatures one usually keeps include_bias=False.
The situation is different if you use statsmodels.OLS instead of LinearRegression: statsmodels does not add an intercept by default, so there you do need the column of 1s (for example via include_bias=True, or statsmodels.api.add_constant).
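For illustration, here is a minimal sketch of the usual scikit-learn setup (the data is made up; the pipeline step name follows make_pipeline's lowercased-class-name convention):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Made-up 1-D data following y = 2 + 3x + 0.5x^2.
x = np.linspace(-3, 3, 50).reshape(-1, 1)
y = 2 + 3 * x.ravel() + 0.5 * x.ravel() ** 2

# include_bias=False: LinearRegression(fit_intercept=True) supplies the
# intercept itself, so a column of 1s from PolynomialFeatures would be redundant.
model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),  # fit_intercept=True by default
)
model.fit(x, y)

print(model.named_steps["linearregression"].intercept_)  # ~2, the "a" term
print(model.named_steps["linearregression"].coef_)       # ~[3, 0.5], the b and c terms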

Related

Efficient sparse matrix column change

I'm implementing an efficient PageRank algorithm so I'm using sparse matrices. I'm close, but there's one problem. I have a matrix where I want the sum of each column to be one. This is easy to implement, but the problem occurs when I get a matrix with a zero column.
In this case, I want to set each element in the column to be 1/(n-1) where n is the dimension of the matrix. I divide by n-1 and not n because I wish to keep the diagonals zero, always.
How can I implement this efficiently? My naive solution is to determine the sum of each column, find the column indices that are zero, and replace each such column with the value 1/(n-1), like so:
# naive approach (too slow!)
# M is my n x n sparse matrix where each column sums to one
col_sums = M.sum(axis=0)
for i in range(n):
    if col_sums[0, i] == 0:
        # set entire column to 1/(n-1)
        M[:, i] = 1 / (n - 1)
        # make sure diagonal is zeroed
        M[i, i] = 0
My M matrix is very very very large and this method simply doesn't scale. How can I do this efficiently?
You can't add new nonzero values without reallocating and copying the underlying data structure. If you expect these zero columns to be very common (> 25% of the data), you should handle them some other way, or you're better off with a dense array.
Otherwise try this:
import scipy.sparse

M = scipy.sparse.rand(1000, 1000, density=0.001, format='csr')

# Sparse matrix holding the replacement weights for the all-zero columns.
zero_col_weights = scipy.sparse.csr_matrix(M.shape, dtype=M.dtype)
zero_col_weights[:, M.getnnz(axis=0) == 0] = 1 / (M.shape[0] - 1)
zero_col_weights.setdiag(0)  # keep the diagonal at zero
M += zero_col_weights
This has only two allocation operations.
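A quick sanity check of the result (a self-contained sketch; it repeats the fix so the zero-column mask can be recorded beforehand):

import numpy as np
import scipy.sparse

M = scipy.sparse.rand(1000, 1000, density=0.001, format='csr')
zero_cols = M.getnnz(axis=0) == 0  # record the mask *before* patching

zero_col_weights = scipy.sparse.csr_matrix(M.shape, dtype=M.dtype)
zero_col_weights[:, zero_cols] = 1 / (M.shape[0] - 1)
zero_col_weights.setdiag(0)
M += zero_col_weights

col_sums = np.asarray(M.sum(axis=0)).ravel()
assert np.allclose(col_sums[zero_cols], 1.0)   # (n-1) entries of 1/(n-1) each
assert (M.diagonal()[zero_cols] == 0).all()    # patched diagonals stay zero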

Deriving a matrix's values from two other matrices

Suppose you have a matrix A.
Suppose also that we have a matrix C.
If we have A = B x C and we want to find the values of matrix B, which I believe should be 3x3 (correct me if I am wrong):
Do we need to use matrix inversion here? I have not used algebra in many years.
I do not have any code yet, but if someone could provide a snippet, that would be great.
This is a problem I have in image processing, where A and C hold RGB values.
The matrices shown are just for illustration.
I am trying to solve this problem using Python and NumPy.
I hope that someone can help with it.
Your B matrix should be 5x5 (with A and C both 5x3). As we are dealing with non-square matrices, you could use the generalized inverse (pseudoinverse) of C to obtain B:
import numpy as np
np.random.seed(10)
A = np.random.randint(0, 9, (5, 3))
C = np.random.randint(0, 9, (5, 3))
B = np.matmul(A, np.linalg.pinv(C))
print(B)
Building on percusse's comment, you can do this with numpy.linalg.lstsq. However, lstsq performs matrix left division (it solves problems of the form C \ A), while your question calls for right division: you are solving for B in B = A / C. To convert it into a form lstsq accepts, transpose both sides:
B = A / C = (C' \ A')'
The ' operator is the transpose. The identity follows from linear algebra rules. Specifically, transpose twice: B = ((A / C)')', since transposing a matrix twice returns the matrix itself. Then, using the facts that (A C)' = C' A' and that the inverse of a transpose equals the transpose of the inverse, (A / C)' = (A C^-1)' = (C^-1)' A' = (C')^-1 A' = C' \ A', which gives the relationship above.
Therefore:
B = np.linalg.lstsq(C.T, A.T, rcond=None)[0].T
The output of lstsq is a tuple where the first element is the actual solution.
Take note that for your particular example, C is rank-deficient, so you won't be able to reconstruct A exactly from B and C.
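To see why rank deficiency blocks exact reconstruction, a small self-contained sketch (the matrices here are invented; C's third column is forced to be the sum of its first two):

import numpy as np

A = np.arange(15.0).reshape(5, 3)

# A deliberately rank-deficient C: column 3 = column 1 + column 2.
C = np.array([[1., 2., 3.],
              [4., 5., 9.],
              [6., 7., 13.],
              [8., 9., 17.],
              [2., 4., 6.]])
print(np.linalg.matrix_rank(C))  # 2, less than full rank 3

B = np.linalg.lstsq(C.T, A.T, rcond=None)[0].T  # best B in the least-squares sense
print(np.allclose(B @ C, A))  # False: A cannot be reconstructed exactly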

Element-wise multiplication in BlockMatrix in Spark

I am looking for a good way to implement the following problem with Spark's BlockMatrix:
Suppose I have a matrix A of size n * m, where n is huge and m is not that large, represented by a BlockMatrix. I have another matrix B with only one column, of size n * 1, also a BlockMatrix. Now I want to multiply these two matrices, A times B, in a way very close to element-wise multiplication: the first row of matrix A is multiplied by the first element of matrix B, and so on.
So the result will still be an n * m matrix. The operation is as if B were a weight applied to A.
I think the key is how to map the corresponding blocks together.
I know I can make B into an n * n diagonal matrix; then B * A gives the result, and multiplication is already supported in Spark. But this is not a good idea, since it costs much more in communication.
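One way to realize the block mapping, sketched below under the assumption that A and B are pyspark.mllib BlockMatrix objects with matching rowsPerBlock (the helper name row_scale is made up for illustration): join A's blocks with B's blocks on the row-block index and scale each block's rows locally, so only B's small column vector moves across the network.

from pyspark.mllib.linalg import Matrices
from pyspark.mllib.linalg.distributed import BlockMatrix

def row_scale(A, B):
    """Scale row i of A by entry i of the column vector B (both BlockMatrix)."""
    # .blocks is an RDD of ((row_block, col_block), sub_matrix);
    # re-key both sides by the row-block index.
    a_by_row = A.blocks.map(lambda kv: (kv[0][0], (kv[0][1], kv[1])))
    b_by_row = B.blocks.map(lambda kv: (kv[0][0], kv[1]))

    def scale(joined):
        row_block, ((col_block, a_blk), b_blk) = joined
        w = b_blk.toArray().ravel()               # weights for this block row
        scaled = a_blk.toArray() * w[:, None]     # broadcast over the columns
        return ((row_block, col_block),
                Matrices.dense(scaled.shape[0], scaled.shape[1],
                               scaled.ravel(order='F')))  # column-major values

    # Each block of A joins with exactly one block of B (same block row).
    return BlockMatrix(a_by_row.join(b_by_row).map(scale),
                       A.rowsPerBlock, A.colsPerBlock)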

masking a double over a string

This is a question about MATLAB...
I have two matrices, one being a (5 x 1 double):
1
2
3
1
3
And the second matrix being a (5 x 3 string), with spaces where no character appears:
a
bc
def
g
hij
I am trying to get an output such that a (5 x 1 string) is created, which takes the nth value from each line of matrix two, where n is the value in matrix one. I am unsure how to do this using a mask that would be able to handle much larger matrices. My target matrix would be the following:
a
c
f
g
j
Thank you very much for the help!!!
There are so many ways you can accomplish this task. I'll give you two.
Method #1 - Generate linear indices and access elements
Use sub2ind to generate a set of linear indices that correspond to the row and column locations you want to access in your matrix. Note that the column locations are the ones changing, while the row locations always increase by 1, as you want to access each row in turn. As such, given your string matrix A and the columns you want to access stored in ind, just do this:
A = ['a  '; 'bc '; 'def'; 'g  '; 'hij'];
ind = [1 2 3 1 3];
out = A(sub2ind(size(A), (1:numel(ind)).', ind(:)))
out =
a
c
f
g
j
Method #2 - Create a sparse matrix, convert to logical and access
Alternatively, you can create a sparse matrix through sparse, where the rows of the non-zero entries vary from 1 up to as many elements as there are in ind, and the columns are the values of ind:
S = sparse((1:numel(ind)).',ind(:),true,size(A,1),size(A,2));
A = A.'; out = A(S.');
Be mindful that you are trying to access each element in a row-major fashion, yet MATLAB stores and indexes data in column-major order. As such, we need to transpose our data matrix and also transpose the sparse mask. The end result gives the same order as Method #1.

How to curve fit data in Excel to a multi variable polynomial?

I have a simple set of data, 10 values that increase.
I want to fit them to a polynomial of the form:
Z = A1 + A2*X + A3*Y + A4*X^2 + A5*X*Y + A6*Y^2
where Z (the output) is the set of data above, A1 - A6 are the coefficients I am looking for,
X is the range of inputs (10 values, of course), and Y is for the moment a constant value.
How can I curve fit to this polynomial and not the standard 2nd order one that is created using 'trendline'?
Construct a Vandermonde matrix from your data points, find its inverse with MINVERSE, then apply it to the vector of Z values with MMULT. Note this gives an exact fit, which requires the number of data points to equal the number of coefficients.
Otherwise you could try polynomial regression (least squares), which again uses the Vandermonde matrix.
More math than Excel, really.
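For reference, a minimal least-squares sketch of the same fit in NumPy (the data here is invented; note that with Y held constant, the Y, X*Y and Y^2 columns are collinear with the others, so the coefficients are not uniquely determined):

import numpy as np

# Invented data: 10 increasing Z values, X = 1..10, Y held constant.
X = np.arange(1.0, 11.0)
Y = np.full(10, 2.0)
Z = 3 + 0.5 * X + 0.1 * X**2

# One design-matrix column per term: 1, X, Y, X^2, X*Y, Y^2.
D = np.column_stack([np.ones_like(X), X, Y, X**2, X * Y, Y**2])

# lstsq returns the minimum-norm least-squares solution for A1..A6;
# because Y is constant, several columns are collinear and rank(D) < 6.
coeffs, residuals, rank, _ = np.linalg.lstsq(D, Z, rcond=None)
print(rank)    # < 6 here
print(coeffs)  # fits Z, but split across the collinear columns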
