Multiclass vs. multilabel fitting - scikit-learn

In scikit-learn tutorials, I found the following paragraphs in the section 'Multiclass vs. multilabel fitting'.
I couldn't understand why the following codes generate the given results.
First
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import LabelBinarizer
X = [[1, 2], [2, 4], [4, 5], [3, 2], [3, 1]]
y = [0, 0, 1, 1, 2]
classif = OneVsRestClassifier(estimator=SVC(random_state=0))
classif.fit(X, y).predict(X)
array([0, 0, 1, 1, 2])
y = LabelBinarizer().fit_transform(y)
classif.fit(X, y).predict(X)
array([[1, 0, 0],
[1, 0, 0],
[0, 1, 0],
[0, 0, 0],
[0, 0, 0]])
Next
from sklearn.preprocessing import MultiLabelBinarizer
y = [[0, 1], [0, 2], [1, 3], [0, 2, 3], [2, 4]]
y = MultiLabelBinarizer().fit_transform(y)
classif.fit(X, y).predict(X)
array([[1, 1, 0, 0, 0],
[1, 0, 1, 0, 0],
[0, 1, 0, 1, 0],
[1, 0, 1, 0, 0],
[1, 0, 1, 0, 0]])

Label binarization in scikit-learn will transform your targets and represent them in a label indicator matrix. This label indicator matrix has the shape (n_samples, n_classes) and is composed as follows:
each row represents a sample
each column represents a class
each element is 1 if the sample is labeled with the class and 0 if not
In your first example, you have a target collection with 5 samples and 3 classes. That's why transforming y with LabelBinarizer results in a 5x3 matrix. In your case, [1, 0, 0] corresponds to class 0, [0, 1, 0] corresponds to class 1 and so forth. Notice that in each row there is only one element set to 1, since each sample can have one label only.
In your next example, you have a target collection with 5 samples and 5 classes. That's why transforming y with MultiLabelBinarizer results in a 5x5 matrix. In your case, [1, 1, 0, 0, 0] corresponds to the multilabel [0, 1], [0, 1, 0, 1, 0] corresponds to the multilabel [1, 3] and so forth. The key difference to the first example is that each row can have multiple elements set to 1, because each sample can have multiple labels/classes.
The predicted values you get follow the very same pattern. They are however not equivalent to the original values in y since your classification model has obviously predicted different values. You can check this with the inverse_transform() of the binarizers:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
y = np.array([[0, 1], [0, 2], [1, 3], [0, 2, 3], [2, 4]])
y_bin = mlb.fit_transform(y)
# direct transformation
[[1 1 0 0 0]
[1 0 1 0 0]
[0 1 0 1 0]
[1 0 1 1 0]
[0 0 1 0 1]]
# prediction of your classifier
y_pred = np.array([[1, 1, 0, 0, 0],
[1, 0, 1, 0, 0],
[0, 1, 0, 1, 0],
[1, 0, 1, 0, 0],
[1, 0, 1, 0, 0]])
# inverting the binarized values to the original classes
y_inv = mlb.inverse_transform(y_pred)
# output
[(0, 1), (0, 2), (1, 3), (0, 2), (0, 2)]

Related

How to select rows from two different Numpy arrays conditionally?

I have two Numpy 2D arrays and I want to get a single 2D array by selecting rows from the original two arrays. The selection is done conditionally. Here is the simple Python way,
import numpy as np
a = np.array([4, 0, 1, 2, 4])
b = np.array([0, 4, 3, 2, 0])
y = np.array([[0, 0, 0, 0],
[0, 0, 0, 1],
[0, 0, 1, 0],
[0, 0, 1, 1],
[0, 0, 1, 0]])
x = np.array([[0, 0, 0, 0],
[1, 1, 1, 0],
[1, 1, 0, 0],
[1, 1, 1, 1],
[0, 0, 1, 0]])
z = np.empty(shape=x.shape, dtype=x.dtype)
for i in range(x.shape[0]):
z[i] = y[i] if a[i] >= b[i] else x[i]
print(z)
Looking at numpy.select, I tried, np.select([a >= b, a < b], [y, x], -1) but got ValueError: shape mismatch: objects cannot be broadcast to a single shape. Mismatch is between arg 0 with shape (5,) and arg 1 with shape (5, 4).
Could someone help me write this in a more efficient Numpy manner?
This should do the trick, but it would be helpful if you could show an example of your expected output:
>>> np.where((a >= b)[:, None], y, x)
array([[0, 0, 0, 0],
[1, 1, 1, 0],
[1, 1, 0, 0],
[0, 0, 1, 1],
[0, 0, 1, 0]])

accessing elements through sparse matrix in scipy

I have below code in python
# dense to sparse
from numpy import array
from scipy.sparse import csr_matrix
# create dense matrix
A = array([[1, 0, 0, 1, 0, 0], [0, 0, 2, 0, 0, 1], [0, 0, 0, 2, 0, 0]])
print(A)
# convert to sparse matrix (CSR method)
S = csr_matrix(A)
print(S)
# reconstruct dense matrix
B = S.todense()
print(B)
Above code when I have following statement I have
print(B[0])
I have following output:
[[1 0 0 1 0 0]]
How can I loop through the above values i.e, 1, 0, 0, 1, 0, 0, 0
In [2]: from scipy.sparse import csr_matrix
...: # create dense matrix
...: A = np.array([[1, 0, 0, 1, 0, 0], [0, 0, 2, 0, 0, 1], [0, 0, 0, 2, 0, 0]])
...: S = csr_matrix(A)
In [3]: A
Out[3]:
array([[1, 0, 0, 1, 0, 0],
[0, 0, 2, 0, 0, 1],
[0, 0, 0, 2, 0, 0]])
In [4]: S
Out[4]:
<3x6 sparse matrix of type '<class 'numpy.int64'>'
with 5 stored elements in Compressed Sparse Row format>
S.toarray() or S.A for short, makes a dense ndarray:
In [5]: S.A
Out[5]:
array([[1, 0, 0, 1, 0, 0],
[0, 0, 2, 0, 0, 1],
[0, 0, 0, 2, 0, 0]])
todense makes a np.matrix object, which is always 2d
In [6]: S.todense()
Out[6]:
matrix([[1, 0, 0, 1, 0, 0],
[0, 0, 2, 0, 0, 1],
[0, 0, 0, 2, 0, 0]])
In [7]: S.todense()[0]
Out[7]: matrix([[1, 0, 0, 1, 0, 0]])
In [9]: S.todense()[0][0]
Out[9]: matrix([[1, 0, 0, 1, 0, 0]])
To iterate by 'columns' we have to do something like:
In [10]: [S.todense()[0][:,i] for i in range(3)]
Out[10]: [matrix([[1]]), matrix([[0]]), matrix([[0]])]
In [11]: [S.todense()[0][0,i] for i in range(3)]
Out[11]: [1, 0, 0]
There is a shortcut for converting a 1d row np.matrix to a 1d ndarray:
In [12]: S.todense()[0].A1
Out[12]: array([1, 0, 0, 1, 0, 0])
Get a 1d array from a "row" of a ndarray is simpler:
In [14]: S.toarray()[0]
Out[14]: array([1, 0, 0, 1, 0, 0])
np.matrix is generally discouraged, as a remnant from a time when the transition from MATLAB was more important. Now that fact that sparse is modeled on np.matrix (but not subclassed) is the main reason for keeping np.matrix. Row and column sums of a sparse matrix return dense matrix.

How to get padding mask from input ids?

Considering a batch of 4 pre-processed sentences (tokenization, numericalizing and padding) shown below:
batch = torch.tensor([
[1, 2, 0, 0],
[4, 0, 0, 0],
[3, 5, 6, 7]
])
where 0 states for [PAD] token.
Thus, what would be an efficient approach to generate a padding masking tensor of the same shape as the batch assigning zero at [PAD] positions and assigning one to other input data (sentence tokens)?
In the example above it would be something like:
padding_masking=
tensor([
[1, 1, 0, 0],
[1, 0, 0, 0],
[1, 1, 1, 1]
])
The following is tested on pytorch 1.3.1.
pad_token_id = 0
batch = torch.tensor([
[1, 2, 0, 0],
[4, 0, 0, 0],
[3, 5, 6, 7]
])
pad_mask = ~(batch == pad_token_id)
print(pad_mask)
Output
tensor([[1, 1, 0, 0],
[1, 0, 0, 0],
[1, 1, 1, 1]], dtype=torch.uint8)
You can get your desired result with
padding_masking = batch > 0
If you want ints instead of booleans, use
padding_masking.type(torch.int)

Remove variable with zero variance

Could somebody help me in writing the code to remove the variable whose variance is zero in the data frame using python?
Removing features with low variance
X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
There are 3 boolean features here, each with 6 instances. Suppose we wish to remove those that are constant in at least 80% of the instances. Some probability calculations show that these features will need to have variance lower than 0.8 * (1 - 0.8). Consequently, we can use Ref: Scikit link
from sklearn.feature_selection import VarianceThreshold
sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
sel.fit_transform(X)
Output will be:
array([[0, 1],
[1, 0],
[0, 0],
[1, 1],
[1, 0],
[1, 1]])

Is there a way to form sparse n-dimensional array in Python3?

I am pretty new to Python and have been wondering if there an easy way so that I could form a sparse n-dimensional array M in Python3 with following 2 conditions mainly required (along the lines of SciPy COO_Matrix):
M[dim1,dim2,dim3,...] = 1.0
Like SciPy COO_Matrix M: M.row, M.col, I may be able to get all the row and column indices for which non-zero entries exist in the matrix. In N-dimension, this generalizes to calling: M.1 for 1st dimension, M.2 for 2nd dimension and so on...
For 2-dimension (the 2 conditions):
1.
for u, i in data:
mat[u, i] = 1.0
2. def get_triplets(mat):
return mat.row, mat.col
Can these 2 conditions be generalized in N-dimensions? I searched and came across this:
sparse 3d matrix/array in Python?
But here 2nd condition is not satisfied: In other words, I can't get the all the nth dimensional indices in a vectorized format.
Also this:
http://www.janeriksolem.net/sparray-sparse-n-dimensional-arrays-in.html works for python and not python3.
Is there a way to implement n-dimensional arrays with above mentioned 2 conditions satisfied? Or I am over-complicating things? I appreciate any help with this :)
In the spirit of coo format I could generate a 3d sparse array representation:
In [106]: dims = 2,4,6
In [107]: data = np.zeros((10,4),int)
In [108]: data[:,-1] = 1
In [112]: for i in range(3):
...: data[:,i] = np.random.randint(0,dims[i],10)
In [113]: data
Out[113]:
array([[0, 2, 3, 1],
[0, 3, 4, 1],
[0, 0, 1, 1],
[0, 3, 0, 1],
[1, 1, 3, 1],
[1, 0, 2, 1],
[1, 1, 2, 1],
[0, 2, 5, 1],
[0, 1, 5, 1],
[0, 1, 2, 1]])
Does that meet your requirements? It's possible there are some duplicates. sparse.coo sums duplicates before it converts the array to dense for display, or to csr for calculations.
The corresponding dense array is:
In [130]: A=np.zeros(dims, int)
In [131]: for row in data:
...: A[tuple(row[:3])] += row[-1]
In [132]: A
Out[132]:
array([[[0, 1, 0, 0, 0, 0],
[0, 0, 1, 0, 0, 1],
[0, 0, 0, 1, 0, 1],
[1, 0, 0, 0, 1, 0]],
[[0, 0, 1, 0, 0, 0],
[0, 0, 1, 1, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0]]])
(no duplicates in this case).
A 2d sparse matrix using a subset of this data is
In [118]: sparse.coo_matrix((data[:,3],(data[:,1],data[:,2])),(4,6)).A
Out[118]:
array([[0, 1, 1, 0, 0, 0],
[0, 0, 2, 1, 0, 1],
[0, 0, 0, 1, 0, 1],
[1, 0, 0, 0, 1, 0]])
That's in effect the sum over the first dimension.
I'm assuming that
M[dim1,dim2,dim3,...] = 1.0
means the non-zero elements of the array must have a data value of 1.
Pandas has a sparse data series and data frame format. That allows for a non-zero 'fill' value. I don't know if the multi-index version can be thought of as higher than 2d or not. There have been a few SO questions about converting the Pandas sparse arrays to/from the scipy sparse.
Convert Pandas SparseDataframe to Scipy sparse csc_matrix
http://pandas-docs.github.io/pandas-docs-travis/sparse.html#interaction-with-scipy-sparse

Resources