Confusion matrix in linear regression - python-3.x

I have actual values and predicted values.
Actual:
33.3663, 38.2561, 28.6362, 35.6252
Predicted:
28.9721, 35.6161, 27.9561, 22.6272
I want to apply a confusion matrix to find the accuracy.

Solution
First, a confusion matrix is not defined for continuous values. You can, however, still use one after converting the continuous values to classes. See https://datascience.stackexchange.com/questions/46019/continuous-variable-not-supported-in-confusion-matrix
from sklearn.metrics import confusion_matrix
expected = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
predicted = [1, 0, 0, 1, 0, 0, 1, 1, 1, 0]
results = confusion_matrix(expected, predicted)
print(results)
Output
[[4 2]
[1 3]]
Reference
https://machinelearningmastery.com/confusion-matrix-machine-learning/
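As a minimal sketch of the "convert to classes" idea applied to the values from the question: the snippet below first computes two common regression error metrics, which work on continuous values directly, and then bins the values into two classes and tallies a confusion matrix by hand. The cut-off of 30.0 and the helper name to_class are assumptions chosen purely for illustration; the tally uses the same convention as the sklearn output above (rows are actual classes, columns are predicted classes).

```python
import math

actual = [33.3663, 38.2561, 28.6362, 35.6252]
predicted = [28.9721, 35.6161, 27.9561, 22.6272]

# Regression error metrics operate on the continuous values as they are.
errors = [a - p for a, p in zip(actual, predicted)]
mae = sum(abs(e) for e in errors) / len(errors)
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))

# Hypothetical cut-off (an assumption): values >= 30.0 become class 1.
CUTOFF = 30.0

def to_class(value, cutoff=CUTOFF):
    return 1 if value >= cutoff else 0

actual_cls = [to_class(v) for v in actual]
predicted_cls = [to_class(v) for v in predicted]

# 2x2 confusion matrix: rows = actual class, columns = predicted class.
matrix = [[0, 0], [0, 0]]
for a, p in zip(actual_cls, predicted_cls):
    matrix[a][p] += 1

# Accuracy is the share of samples on the diagonal.
accuracy = sum(matrix[i][i] for i in range(2)) / len(actual)

print(mae, rmse)
print(matrix, accuracy)
```

The resulting accuracy depends entirely on where the cut-off is placed, so for a regression model an error metric such as MAE or RMSE is usually the more informative number.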

Related

torchmetrics represent uncertainty

I am using torchmetrics to calculate metrics such as F1 score, Recall, Precision and Accuracy in a multilabel classification setting. With randomly initialized weights, the softmax output (i.e. prediction) might look like this with a batch size of 8:
import torch
y_pred = torch.tensor([[0.1944, 0.1931, 0.2184, 0.1968, 0.1973],
[0.2182, 0.1932, 0.1945, 0.1973, 0.1968],
[0.2182, 0.1932, 0.1944, 0.1973, 0.1969],
[0.2182, 0.1931, 0.1945, 0.1973, 0.1968],
[0.2184, 0.1931, 0.1944, 0.1973, 0.1968],
[0.2181, 0.1932, 0.1941, 0.1970, 0.1976],
[0.2183, 0.1932, 0.1944, 0.1974, 0.1967],
[0.2182, 0.1931, 0.1945, 0.1973, 0.1968]])
With the correct labels (one-hot encoded):
y_true = torch.tensor([[0, 0, 1, 0, 1],
[0, 1, 0, 0, 1],
[0, 1, 0, 0, 1],
[0, 0, 1, 1, 0],
[0, 0, 1, 1, 0],
[0, 1, 0, 1, 0],
[0, 1, 0, 1, 0],
[0, 0, 1, 0, 1]])
And I can calculate the metrics by taking argmax:
import torchmetrics
torchmetrics.functional.f1_score(y_pred.argmax(-1), y_true.argmax(-1))
output:
tensor(0.1250)
The first prediction happens to be correct while the rest are wrong. However, none of the predictive probabilities are above 0.3, which means that the model is generally uncertain about the predictions. I would like to encode this and say that the f1 score should be 0.0 because none of the predictive probabilities are above a 0.3 threshold.
Is this possible with torchmetrics or sklearn library?
Is this common practice?
You need to threshold your predictions before passing them to your torchmetrics metrics:
t0, t1, mask_gt = batch
mask_pred = self.forward(t0, t1)
loss = self.criterion(mask_pred.squeeze().float(), mask_gt.squeeze().float())
mask_pred = torch.sigmoid(mask_pred).squeeze()
mask_pred = torch.where(mask_pred > 0.5, 1, 0)
# integers to comply with metrics input type
mask_pred = mask_pred.long()
mask_gt = mask_gt.long()
f1_score = self.f1(mask_pred, mask_gt)
precision = self.precision_(mask_pred, mask_gt)
recall = self.recall(mask_pred, mask_gt)
jaccard = self.jaccard(mask_pred, mask_gt)
The torchmetrics objects are defined as follows:
self.f1 = F1Score(num_classes=2, average='macro', mdmc_average='samplewise')
self.recall = Recall(num_classes=2, average='macro', mdmc_average='samplewise')
self.precision_ = Precision(num_classes=2, average='macro', mdmc_average='samplewise') # self.precision exists in torch.nn.Module. Hence '_' symbol
self.jaccard = JaccardIndex(num_classes=2)
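For the softmax/argmax case in the question, one possible way to encode "uncertain means wrong" is to replace low-confidence predictions with an abstain label before scoring. The sketch below uses NumPy to stay dependency-light; the ABSTAIN value of -1 and the variable names are assumptions for illustration, and the same steps can be written on torch tensors with Tensor.max and torch.where.

```python
import numpy as np

# Softmax outputs and targets copied from the question.
y_pred = np.array([[0.1944, 0.1931, 0.2184, 0.1968, 0.1973],
                   [0.2182, 0.1932, 0.1945, 0.1973, 0.1968],
                   [0.2182, 0.1932, 0.1944, 0.1973, 0.1969],
                   [0.2182, 0.1931, 0.1945, 0.1973, 0.1968],
                   [0.2184, 0.1931, 0.1944, 0.1973, 0.1968],
                   [0.2181, 0.1932, 0.1941, 0.1970, 0.1976],
                   [0.2183, 0.1932, 0.1944, 0.1974, 0.1967],
                   [0.2182, 0.1931, 0.1945, 0.1973, 0.1968]])
y_true = np.array([[0, 0, 1, 0, 1],
                   [0, 1, 0, 0, 1],
                   [0, 1, 0, 0, 1],
                   [0, 0, 1, 1, 0],
                   [0, 0, 1, 1, 0],
                   [0, 1, 0, 1, 0],
                   [0, 1, 0, 1, 0],
                   [0, 0, 1, 0, 1]])

THRESHOLD = 0.3  # confidence cut-off taken from the question
ABSTAIN = -1     # hypothetical label meaning "no confident prediction"

true_cls = y_true.argmax(axis=-1)   # first maximal index per row
confidence = y_pred.max(axis=-1)    # highest probability per row

# Keep the argmax class only where the model is confident enough.
pred_cls = np.where(confidence >= THRESHOLD, y_pred.argmax(axis=-1), ABSTAIN)

plain_acc = float((y_pred.argmax(axis=-1) == true_cls).mean())
thresholded_acc = float((pred_cls == true_cls).mean())
print(plain_acc, thresholded_acc)
```

Without the threshold, 1 of 8 predictions is correct, which matches the 0.1250 reported in the question; with it, every prediction abstains and the score drops to 0.0. A metric implementation may not accept -1 as a class index, in which case mapping abstentions to an extra class index is one workaround.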

Count Unique elements in pytorch Tensor

Suppose I have the following tensor: y = torch.randint(0, 3, (10,)). How would you go about counting the 0's 1's and 2's in there?
The only way I can think of is by using collections.Counter(y) but was wondering if there was a more "pytorch" way of doing this. A use case for example would be when building the confusion matrix for predictions.
You can use torch.unique with the return_counts option:
>>> x = torch.randint(0, 3, (10,))
tensor([1, 1, 0, 2, 1, 0, 1, 1, 2, 1])
>>> x.unique(return_counts=True)
(tensor([0, 1, 2]), tensor([2, 6, 2]))
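For the confusion-matrix use case mentioned in the question, a counting approach along the following lines may help. This is a sketch in NumPy with made-up predictions; torch.bincount also takes a minlength argument, so the same idea should carry over to tensors.

```python
import numpy as np

y_true = np.array([1, 1, 0, 2, 1, 0, 1, 1, 2, 1])
y_pred = np.array([1, 0, 0, 2, 1, 0, 2, 1, 2, 1])  # made-up predictions
n_classes = 3

# Per-class counts; minlength keeps a slot for classes that never occur.
counts = np.bincount(y_true, minlength=n_classes)

# Encode each (true, predicted) pair as a single integer, count the
# integers, then reshape so rows are true classes and columns predictions.
pair_ids = y_true * n_classes + y_pred
confusion = np.bincount(pair_ids, minlength=n_classes ** 2).reshape(n_classes, n_classes)

print(counts)     # [2 6 2]
print(confusion)
```

Unlike unique with return_counts, bincount reports a zero for a class that is absent from the data, which keeps the confusion matrix square.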

Multiclass vs. multilabel fitting

In scikit-learn tutorials, I found the following paragraphs in the section 'Multiclass vs. multilabel fitting'.
I couldn't understand why the following code generates the given results.
First
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import LabelBinarizer
X = [[1, 2], [2, 4], [4, 5], [3, 2], [3, 1]]
y = [0, 0, 1, 1, 2]
classif = OneVsRestClassifier(estimator=SVC(random_state=0))
classif.fit(X, y).predict(X)
array([0, 0, 1, 1, 2])
y = LabelBinarizer().fit_transform(y)
classif.fit(X, y).predict(X)
array([[1, 0, 0],
[1, 0, 0],
[0, 1, 0],
[0, 0, 0],
[0, 0, 0]])
Next
from sklearn.preprocessing import MultiLabelBinarizer
y = [[0, 1], [0, 2], [1, 3], [0, 2, 3], [2, 4]]
y = MultiLabelBinarizer().fit_transform(y)
classif.fit(X, y).predict(X)
array([[1, 1, 0, 0, 0],
[1, 0, 1, 0, 0],
[0, 1, 0, 1, 0],
[1, 0, 1, 0, 0],
[1, 0, 1, 0, 0]])
Label binarization in scikit-learn will transform your targets and represent them in a label indicator matrix. This label indicator matrix has the shape (n_samples, n_classes) and is composed as follows:
each row represents a sample
each column represents a class
each element is 1 if the sample is labeled with the class and 0 if not
In your first example, you have a target collection with 5 samples and 3 classes. That's why transforming y with LabelBinarizer results in a 5x3 matrix. In your case, [1, 0, 0] corresponds to class 0, [0, 1, 0] corresponds to class 1 and so forth. Notice that in each row there is only one element set to 1, since each sample can have one label only.
In your next example, you have a target collection with 5 samples and 5 classes. That's why transforming y with MultiLabelBinarizer results in a 5x5 matrix. In your case, [1, 1, 0, 0, 0] corresponds to the multilabel [0, 1], [0, 1, 0, 1, 0] corresponds to the multilabel [1, 3] and so forth. The key difference to the first example is that each row can have multiple elements set to 1, because each sample can have multiple labels/classes.
The predicted values you get follow the very same pattern. They are however not equivalent to the original values in y since your classification model has obviously predicted different values. You can check this with the inverse_transform() of the binarizers:
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
# plain list of lists: the rows differ in length, so np.array would be ragged
y = [[0, 1], [0, 2], [1, 3], [0, 2, 3], [2, 4]]
y_bin = mlb.fit_transform(y)
# direct transformation
[[1 1 0 0 0]
[1 0 1 0 0]
[0 1 0 1 0]
[1 0 1 1 0]
[0 0 1 0 1]]
# prediction of your classifier
y_pred = np.array([[1, 1, 0, 0, 0],
[1, 0, 1, 0, 0],
[0, 1, 0, 1, 0],
[1, 0, 1, 0, 0],
[1, 0, 1, 0, 0]])
# inverting the binarized values to the original classes
y_inv = mlb.inverse_transform(y_pred)
# output
[(0, 1), (0, 2), (1, 3), (0, 2), (0, 2)]
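To make the indicator-matrix layout concrete, here is a rough pure-Python sketch of what the binarization and its inverse amount to. It is not scikit-learn's implementation; the function names binarize and invert are hypothetical.

```python
def binarize(label_sets):
    """Return (classes, indicator matrix) for a list of label collections."""
    classes = sorted({label for labels in label_sets for label in labels})
    column = {label: j for j, label in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in label_sets]
    for i, labels in enumerate(label_sets):
        for label in labels:
            matrix[i][column[label]] = 1  # sample i carries this label
    return classes, matrix

def invert(classes, matrix):
    """Map each indicator row back to the tuple of labels it encodes."""
    return [tuple(c for c, flag in zip(classes, row) if flag) for row in matrix]

y = [[0, 1], [0, 2], [1, 3], [0, 2, 3], [2, 4]]
classes, y_bin = binarize(y)

# Prediction of the classifier, copied from the example above.
y_pred = [[1, 1, 0, 0, 0],
          [1, 0, 1, 0, 0],
          [0, 1, 0, 1, 0],
          [1, 0, 1, 0, 0],
          [1, 0, 1, 0, 0]]

decoded = invert(classes, y_pred)
# Rows where the prediction differs from the true indicator row.
mismatched = [i for i, (t, p) in enumerate(zip(y_bin, y_pred)) if t != p]
print(decoded)
print(mismatched)
```

The last two samples are the ones where the predicted label set differs from the true one, which is why the predicted matrix does not equal the directly transformed y.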

Not able to use Stratified-K-Fold on multi label classifier

The following code is used to do KFold validation, but I am unable to train the model because it throws the error
ValueError: Error when checking target: expected dense_14 to have shape (7,) but got array with shape (1,)
My target Variable has 7 classes. I am using LabelEncoder to encode the classes into numbers.
After seeing this error, I switched to MultiLabelBinarizer to encode the classes, and now I get the following error
ValueError: Supported target types are: ('binary', 'multiclass'). Got 'multilabel-indicator' instead.
The following is the code for KFold validation
import numpy as np
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=10, shuffle=True)
scores = np.zeros(10)
idx = 0
for index, (train_indices, val_indices) in enumerate(skf.split(X, y)):
    print("Training on fold " + str(index+1) + "/10...")
    # Generate batches from indices
    xtrain, xval = X[train_indices], X[val_indices]
    ytrain, yval = y[train_indices], y[val_indices]
    model = None
    model = load_model()  # defined above
    scores[idx] = train_model(model, xtrain, ytrain, xval, yval)
    idx += 1
print(scores)
print(scores.mean())
I don't know what to do. I want to use Stratified K Fold on my model. Please help me.
MultiLabelBinarizer returns, for each sample, a vector whose length is your number of classes.
If you look at how StratifiedKFold splits your dataset, you will see that it only accepts a one-dimensional target variable, whereas you are trying to pass a target variable with dimensions [n_samples, n_classes].
A stratified split basically preserves your class distribution. If you think about it, that does not make a lot of sense for a multi-label classification problem, where a sample can belong to several classes at once.
If you want to preserve the distribution in terms of the different combinations of classes in your target variable, then the answer here explains two ways in which you can define your own stratified split function.
UPDATE:
The logic is something like this:
Assume you have n classes and your target variable is a combination of these n classes. You then have (2^n) - 1 possible combinations (not counting the all-zeros vector). You can now create a new target variable that treats each combination as a new label.
For example, if n=3, you will have 7 unique combinations:
1. [1, 0, 0]
2. [0, 1, 0]
3. [0, 0, 1]
4. [1, 1, 0]
5. [1, 0, 1]
6. [0, 1, 1]
7. [1, 1, 1]
Map all your labels to this new target variable. You can now look at your problem as simple multi-class classification, instead of multi-label classification.
Now you can use StratifiedKFold directly, with y_new as your target. Once the splits are done, you can map your labels back.
Code sample:
import numpy as np
np.random.seed(1)
y = np.random.randint(0, 2, (10, 7))
y = y[np.where(y.sum(axis=1) != 0)[0]]
OUTPUT:
array([[1, 1, 0, 0, 1, 1, 1],
[1, 1, 0, 0, 1, 0, 1],
[1, 0, 0, 1, 0, 0, 0],
[1, 0, 0, 1, 0, 0, 0],
[1, 0, 0, 0, 1, 1, 1],
[1, 1, 0, 0, 0, 1, 1],
[1, 1, 1, 1, 0, 1, 1],
[0, 0, 1, 0, 0, 1, 1],
[1, 0, 1, 0, 0, 1, 1],
[0, 1, 1, 1, 1, 0, 0]])
Label encode your class vectors:
from sklearn.preprocessing import LabelEncoder
def get_new_labels(y):
    y_new = LabelEncoder().fit_transform([''.join(str(l)) for l in y])
    return y_new
y_new = get_new_labels(y)
OUTPUT:
array([7, 6, 3, 3, 2, 5, 8, 0, 4, 1])
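The "map the labels, split, then map back" step can also be sketched without LabelEncoder by giving each distinct row its own integer. This is only an illustration under stated assumptions: the small y below is made up, and the names combo_ids and id_to_combo are hypothetical.

```python
# Made-up multi-label targets: 4 samples, 7 classes.
y = [
    [1, 0, 0, 1, 0, 0, 0],
    [1, 0, 0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0, 1, 1],
    [1, 1, 0, 0, 1, 0, 1],
]

# Assign one integer per distinct label combination, in order of first
# appearance (LabelEncoder would instead number them in sorted order).
combo_ids = {}
y_new = [combo_ids.setdefault(tuple(row), len(combo_ids)) for row in y]

# y_new is one-dimensional, so it is the kind of target that could be
# passed to skf.split(X, y_new) (assumption: X and skf as in the question).

# Map the integer labels back to the original indicator rows.
id_to_combo = {label: combo for combo, label in combo_ids.items()}
restored = [list(id_to_combo[label]) for label in y_new]

print(y_new)
print(restored == y)
```

One caveat of this approach is that rare combinations may end up with too few samples for the requested number of folds, so it tends to work best when the number of distinct combinations is small relative to the dataset.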

Is there a way to form sparse n-dimensional array in Python3?

I am pretty new to Python and have been wondering if there is an easy way to form a sparse n-dimensional array M in Python3 that meets the following 2 conditions (along the lines of SciPy COO_Matrix):
M[dim1,dim2,dim3,...] = 1.0
Like SciPy COO_Matrix M: M.row, M.col, I may be able to get all the row and column indices for which non-zero entries exist in the matrix. In N-dimension, this generalizes to calling: M.1 for 1st dimension, M.2 for 2nd dimension and so on...
For 2-dimension (the 2 conditions):
1.
for u, i in data:
    mat[u, i] = 1.0
2.
def get_triplets(mat):
    return mat.row, mat.col
Can these 2 conditions be generalized in N-dimensions? I searched and came across this:
sparse 3d matrix/array in Python?
But here the 2nd condition is not satisfied: in other words, I can't get all the nth-dimensional indices in a vectorized format.
Also this:
http://www.janeriksolem.net/sparray-sparse-n-dimensional-arrays-in.html works for python and not python3.
Is there a way to implement n-dimensional arrays with the above-mentioned 2 conditions satisfied? Or am I over-complicating things? I appreciate any help with this :)
In the spirit of coo format I could generate a 3d sparse array representation:
In [106]: dims = 2,4,6
In [107]: data = np.zeros((10,4),int)
In [108]: data[:,-1] = 1
In [112]: for i in range(3):
     ...:     data[:,i] = np.random.randint(0,dims[i],10)
In [113]: data
Out[113]:
array([[0, 2, 3, 1],
[0, 3, 4, 1],
[0, 0, 1, 1],
[0, 3, 0, 1],
[1, 1, 3, 1],
[1, 0, 2, 1],
[1, 1, 2, 1],
[0, 2, 5, 1],
[0, 1, 5, 1],
[0, 1, 2, 1]])
Does that meet your requirements? It's possible there are some duplicates. sparse.coo sums duplicates before it converts the array to dense for display, or to csr for calculations.
The corresponding dense array is:
In [130]: A=np.zeros(dims, int)
In [131]: for row in data:
     ...:     A[tuple(row[:3])] += row[-1]
In [132]: A
Out[132]:
array([[[0, 1, 0, 0, 0, 0],
[0, 0, 1, 0, 0, 1],
[0, 0, 0, 1, 0, 1],
[1, 0, 0, 0, 1, 0]],
[[0, 0, 1, 0, 0, 0],
[0, 0, 1, 1, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0]]])
(no duplicates in this case).
A 2d sparse matrix using a subset of this data is
In [118]: sparse.coo_matrix((data[:,3],(data[:,1],data[:,2])),(4,6)).toarray()
Out[118]:
array([[0, 1, 1, 0, 0, 0],
[0, 0, 2, 1, 0, 1],
[0, 0, 0, 1, 0, 1],
[1, 0, 0, 0, 1, 0]])
That's in effect the sum over the first dimension.
I'm assuming that
M[dim1,dim2,dim3,...] = 1.0
means the non-zero elements of the array must have a data value of 1.
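Under that assumption, the two conditions from the question could be met with a small dictionary-backed class along these lines. This is a minimal pure-Python sketch, not an existing library API; the class name SparseND and its coords method are hypothetical.

```python
class SparseND:
    """Tiny COO-style n-dimensional sparse array (illustrative only)."""

    def __init__(self, shape):
        self.shape = tuple(shape)
        self._data = {}  # maps an index tuple to its stored value

    def __setitem__(self, index, value):
        if len(index) != len(self.shape):
            raise IndexError("wrong number of indices")
        if any(not 0 <= i < n for i, n in zip(index, self.shape)):
            raise IndexError("index out of bounds")
        self._data[tuple(index)] = value

    def __getitem__(self, index):
        # Entries that were never set read as zero.
        return self._data.get(tuple(index), 0.0)

    def coords(self):
        """Return one list of indices per dimension, like coo's row/col."""
        if not self._data:
            return tuple([] for _ in self.shape)
        return tuple(list(axis) for axis in zip(*self._data))

# Condition 1: assignment with an n-dimensional index.
M = SparseND((2, 4, 6))
for idx in [(0, 2, 3), (0, 3, 4), (1, 1, 3)]:
    M[idx] = 1.0

# Condition 2: all indices of the non-zero entries, one list per dimension.
dim0, dim1, dim2 = M.coords()
print(dim0, dim1, dim2)
```

Note one design difference from the coo format: assigning to the same index twice overwrites the value here, whereas sparse.coo sums duplicates.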
Pandas has a sparse data series and data frame format. That allows for a non-zero 'fill' value. I don't know if the multi-index version can be thought of as higher than 2d or not. There have been a few SO questions about converting the Pandas sparse arrays to/from the scipy sparse.
Convert Pandas SparseDataframe to Scipy sparse csc_matrix
http://pandas-docs.github.io/pandas-docs-travis/sparse.html#interaction-with-scipy-sparse

Resources