I am facing an issue in a multi-label, multi-class classification task. I have a dataset of 33000 samples, each labelled over 104 classes. The dataset consists of 16500 samples with labels such as [1, 0, 1, 0, 0, …], [0, 1, 1, 0, 1, …], [1, 0, 0, 0] (each label has at least one element equal to 1) and 16500 samples with labels such as [0, 0, 0, …], [0, 0, 0, …] (all elements in all labels are 0). When calculating pos_count for each class, pos_count_0 for class 0 is the number of 1s that appear in the first position across all labels in my dataset. For class 1, pos_count_1 is the number of 1s in the second position, and so on. Is the pos_weight of class 0 then (33000-pos_count_0)/pos_count_0, and the pos_weight of class 1 (33000-pos_count_1)/pos_count_1? I am a little confused about how neg_count and pos_count for a class are calculated.
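To make the counting concrete, here is a minimal sketch on a toy label matrix. It assumes pos_weight is meant in the sense of the pos_weight argument of torch.nn.BCEWithLogitsLoss (negatives divided by positives, per class); the toy labels and the guard against classes with no positives are assumptions for illustration, not part of the question.

```python
import numpy as np

# Toy stand-in for the real (33000, 104) label matrix: 4 samples, 3 classes.
labels = np.array([
    [1, 0, 1],
    [0, 1, 1],
    [0, 0, 0],   # all-zero labels still count as negatives for every class
    [0, 0, 0],
])

n_samples = labels.shape[0]

# pos_count[c] = number of samples with a 1 in column c
pos_count = labels.sum(axis=0)

# neg_count[c] = every other sample, including the all-zero ones
neg_count = n_samples - pos_count

# Assumption: clamp the denominator to avoid dividing by zero
# for a class that never appears.
pos_weight = neg_count / np.maximum(pos_count, 1)

print(pos_count, neg_count, pos_weight)
```

With the real data, n_samples would be 33000, so the formula in the question matches this sketch: neg_count for a class is simply 33000 minus its pos_count.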
Is there a function or a set of arguments that I can use in order to calculate Precision and Recall for a multi-label problem?
Note that with multi-label I mean that each sample can be classified into more than one class.
The following is not returning what I would expect:
import torch
from torchmetrics import Precision
target = torch.tensor([
[0, 0, 1, 1, 0], # Sample 1 belongs to class 2 and 3 (zero-indexed)
[0, 0, 1, 0, 0], # Sample 2 belongs to class 2 (zero-indexed)
])
preds = torch.tensor([
[0, 0, 0, 0, 0], # Sample 1 predicted to belong to no class
[0, 0, 0, 0, 0], # Sample 2 predicted to belong to no class
])
metric = Precision(num_classes=5, mdmc_average="samplewise")
print(metric(preds, target))
It returns: tensor(0.7000), but it should be 0% since there are no True Positives.
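As a cross-check that does not depend on any metric library, here is a small NumPy sketch that computes precision by hand for the same tensors. Treating 0/0 as 0 is an assumption made here. The last line shows that 0.7 equals the fraction of matching entries (7 of 10), which suggests, without confirming, that the inputs were interpreted as multi-class rather than multi-label data.

```python
import numpy as np

target = np.array([
    [0, 0, 1, 1, 0],
    [0, 0, 1, 0, 0],
])
preds = np.array([
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
])

# Per-class true positives and predicted positives (column-wise)
tp = ((preds == 1) & (target == 1)).sum(axis=0)
predicted_pos = (preds == 1).sum(axis=0)

# Assumption: a class with no predicted positives gets precision 0
per_class_precision = np.divide(
    tp, predicted_pos,
    out=np.zeros(tp.shape, dtype=float),
    where=predicted_pos > 0,
)

# Micro-averaged precision over all classes
total_pred = predicted_pos.sum()
micro_precision = tp.sum() / total_pred if total_pred > 0 else 0.0

# Fraction of entries where preds and target agree (zeros included)
elementwise_agreement = (preds == target).mean()

print(per_class_precision, micro_precision, elementwise_agreement)
```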
I have 10 classes, and my y_test has shape (1000, 10) and it looks like this:
array([[0, 0, 1, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
...,
[0, 0, 0, ..., 0, 0, 1],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0]], dtype=int64)
If I use the following where i is the class number
fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_pred[:, i])
should y_pred be
y_pred = model.predict(x_test)
OR
y_pred = np.argmax(model.predict(x_test), axis=1)
lb = LabelBinarizer()
lb.fit(y_test)
y_pred = lb.transform(y_pred)
The first option gives me something like this:
[[6.87280996e-11 6.28617670e-07 9.96915460e-01 ... 3.08361766e-03
3.47333212e-14 2.83545876e-09]
[7.04240659e-30 1.51786850e-07 8.49807921e-28 ... 6.62584656e-33
6.97696034e-19 1.01019222e-20]
[2.97537670e-14 2.67199534e-24 2.85646610e-19 ... 2.19898160e-15
7.03626012e-22 7.56072279e-18]
...
[1.63774752e-15 1.32784101e-06 1.23182635e-05 ... 3.60217566e-14
6.01247484e-05 2.61179358e-01]
[2.09420733e-35 6.94865276e-10 1.14242395e-22 ... 5.08080394e-22
1.20934697e-19 1.77760468e-17]
[1.68334747e-13 8.53335252e-04 4.40571597e-07 ... 1.70050384e-06
1.48684137e-06 2.93400045e-03]]
with shape (1000,10).
where the latter option gives
[[0 0 1 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
...
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]]
with shape (1000,10)
Which way is the correct approach? In other words, what should y_pred be when passed to sklearn.metrics.roc_curve()?
Forgot to mention: using the first option gives me extremely high (almost 1) AUC values for all classes, whereas the second option seems to generate reasonable AUC values.
The ROC curves using the two options are below, which one looks more correct?
There is nothing wrong with the first option, and that's what the documentation asks for:
y_score : ndarray of shape (n_samples,)
Target scores, can either be probability estimates of the positive class, confidence values, or non-thresholded measure of decisions (as returned by “decision_function” on some classifiers).
Also, the first graph looks like a ROC curve, while the second looks odd.
And finally, ROC curves are meant to study "different classification thresholds". That means you need predictions "as probabilities" (confidences), not as 0s and 1s.
When you take an argmax, you throw away the probabilities/confidences, which makes it impossible to study thresholds.
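A small sketch of why this matters, written with plain NumPy rather than sklearn so the threshold sweep is visible. The helper roc_points and the toy scores are hypothetical, and the hard labels come from a 0.5 cut-off rather than the argmax in the question, but the effect is the same: hard labels leave only two distinct "scores".

```python
import numpy as np

def roc_points(y_true, y_score):
    """Return one (fpr, tpr) pair per distinct score, highest threshold first."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    n_pos = (y_true == 1).sum()
    n_neg = (y_true == 0).sum()
    points = []
    for thr in np.unique(y_score)[::-1]:
        pred = y_score >= thr  # predict positive at or above the threshold
        tpr = (pred & (y_true == 1)).sum() / n_pos
        fpr = (pred & (y_true == 0)).sum() / n_neg
        points.append((float(fpr), float(tpr)))
    return points

# Toy binary column, standing in for y_test[:, i] and y_pred[:, i]
y_true = np.array([0, 0, 1, 1, 0, 1])
probs = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.70])
hard = (probs >= 0.5).astype(int)  # thresholded 0/1 predictions

soft_points = roc_points(y_true, probs)  # one point per distinct probability
hard_points = roc_points(y_true, hard)   # only thresholds 1 and 0 remain

print(len(soft_points), len(hard_points))
```

With probabilities, every distinct score contributes a point on the curve. With hard labels, the curve collapses to a single operating point plus the trivial (1, 1) corner.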
I have a BxCxd tensor of coordinates and want to repeat each row in the following way:
[[[1,0,0],[0,1,0],[0,0,1]]] -> [[[1,0,0],[1,0,0],[0,1,0],[0,1,0],[0,0,1],[0,0,1]]]
In the above example each row is repeated 2 times. What's especially important is the ordering. Each row in the first tensor should appear k times in the second one before the next row appears.
I tried the following code:
print(x.size())
params = x.repeat_interleave(self.k, dim=-1).permute(0,2,1)
In the above snippet, x is of size 32x128x4 before repeat_interleave. With self.k = 64 I would expect the result to be a 32x8192x4 tensor, however the result I am getting is 32x256x128 which does not make sense to me. What am I missing here?
I think you want to repeat along dim=1 (the row dimension) rather than dim=-1, and without the permute:
t.repeat_interleave(2, dim=1)
Output:
tensor([[[1, 0, 0],
[1, 0, 0],
[0, 1, 0],
[0, 1, 0],
[0, 0, 1],
[0, 0, 1]]])
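To see where the 32x256x128 shape in the question comes from, here is a shape-only sketch. It uses np.repeat as a stand-in, on the assumption that it behaves like torch.repeat_interleave along a given axis (the PyTorch docs describe the two as similar); the zero-filled array is just a placeholder for x.

```python
import numpy as np

k = 64
x = np.zeros((32, 128, 4))  # placeholder with the same shape as x in the question

# Mirrors the snippet in the question: repeat along the LAST axis, then swap axes 1 and 2
wrong = np.repeat(x, k, axis=-1).transpose(0, 2, 1)

# Repeat along axis 1 (the rows), no transpose needed
right = np.repeat(x, k, axis=1)

print(wrong.shape)  # last axis grows 4 -> 256, then axes 1 and 2 are swapped
print(right.shape)  # row axis grows 128 -> 8192

# Ordering check on the small example: each row appears twice before the next one
t = np.array([[[1, 0, 0], [0, 1, 0], [0, 0, 1]]])
out = np.repeat(t, 2, axis=1)
print(out)
```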
I have actual values and predicted values.
Actual:
33.3663, 38.2561, 28.6362, 35.6252
Predicted:
28.9721, 35.6161, 27.9561, 22.6272
I want to apply a confusion matrix to find the accuracy.
Solution
First, a confusion matrix is not defined for continuous values. You can still use one by first converting the continuous values to classes; see https://datascience.stackexchange.com/questions/46019/continuous-variable-not-supported-in-confusion-matrix
from sklearn.metrics import confusion_matrix
expected = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
predicted = [1, 0, 0, 1, 0, 0, 1, 1, 1, 0]
results = confusion_matrix(expected, predicted)
print(results)
Output
[[4 2]
[1 3]]
Reference
https://machinelearningmastery.com/confusion-matrix-machine-learning/
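Applied to the values in the question, a rough sketch could look like the following. The cut-off of 30.0 is purely hypothetical (the question does not define any classes), and the regression errors are included only as an alternative when no sensible cut-off exists.

```python
import numpy as np

actual = np.array([33.3663, 38.2561, 28.6362, 35.6252])
predicted = np.array([28.9721, 35.6161, 27.9561, 22.6272])

# Option 1: keep the values continuous and report regression errors
mae = np.abs(actual - predicted).mean()
rmse = np.sqrt(((actual - predicted) ** 2).mean())

# Option 2: bin into classes first. THRESHOLD is a hypothetical cut-off.
THRESHOLD = 30.0
actual_cls = (actual >= THRESHOLD).astype(int)
predicted_cls = (predicted >= THRESHOLD).astype(int)

# 2x2 confusion matrix: rows are actual classes, columns are predicted classes
cm = np.zeros((2, 2), dtype=int)
for a, p in zip(actual_cls, predicted_cls):
    cm[a, p] += 1

accuracy = np.trace(cm) / cm.sum()

print(mae, rmse)
print(cm)
print(accuracy)
```

The accuracy obtained this way depends entirely on the chosen cut-off, so it is worth reporting the cut-off alongside it.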
The following code is used to do KFold validation, but I am unable to train the model because it throws the error
ValueError: Error when checking target: expected dense_14 to have shape (7,) but got array with shape (1,)
My target Variable has 7 classes. I am using LabelEncoder to encode the classes into numbers.
After seeing this error, I switched to MultiLabelBinarizer to encode the classes, and I now get the following error
ValueError: Supported target types are: ('binary', 'multiclass'). Got 'multilabel-indicator' instead.
The following is the code for KFold validation
import numpy as np
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=10, shuffle=True)
scores = np.zeros(10)
idx = 0
for index, (train_indices, val_indices) in enumerate(skf.split(X, y)):
    print("Training on fold " + str(index+1) + "/10...")
    # Generate batches from indices
    xtrain, xval = X[train_indices], X[val_indices]
    ytrain, yval = y[train_indices], y[val_indices]
    model = None
    model = load_model()  # defined above
    scores[idx] = train_model(model, xtrain, ytrain, xval, yval)
    idx += 1
print(scores)
print(scores.mean())
I don't know what to do. I want to use Stratified K Fold on my model. Please help me.
MultiLabelBinarizer returns, for each sample, a vector whose length is the number of classes.
If you look at how StratifiedKFold splits your dataset, you will see that it only accepts a one-dimensional target variable, whereas you are trying to pass a target variable with dimensions [n_samples, n_classes].
A stratified split basically preserves your class distribution. And if you think about it, that does not make a lot of sense if you have a multi-label classification problem.
If you want to preserve the distribution in terms of the different combinations of classes in your target variable, then the answer here explains two ways in which you can define your own stratified split function.
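A side note on the first error in the question, offered as a sketch only. It assumes the task is really single-label with 7 classes and that the network ends in a 7-unit output trained on one-hot targets; the question does not show the model, so both are assumptions. In that case the integer labels from LabelEncoder can be kept for the split, and a one-hot copy used for training. The label values below are hypothetical.

```python
import numpy as np

n_classes = 7

# Hypothetical LabelEncoder output: one integer class id per sample (1-D)
y_int = np.array([0, 3, 6, 2, 2, 5, 1, 4])

# One-hot copy with shape (n_samples, 7), matching a 7-unit output layer
y_onehot = np.eye(n_classes, dtype=int)[y_int]

# Idea: pass y_int to skf.split(X, y_int) so the splitter sees a 1-D target,
# then index y_onehot with the returned train/validation indices for training.
print(y_int.shape, y_onehot.shape)
```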
UPDATE:
The logic is something like this:
Assume you have n classes and your target variable is a combination of these n classes. You then have up to (2^n) - 1 combinations (not including all 0s). You can now create a new target variable by treating each combination as a new label.
For example, if n=3, you will have 7 unique combinations:
1. [1, 0, 0]
2. [0, 1, 0]
3. [0, 0, 1]
4. [1, 1, 0]
5. [1, 0, 1]
6. [0, 1, 1]
7. [1, 1, 1]
Map all your labels to this new target variable. You can now treat your problem as simple multi-class classification instead of multi-label classification.
Now you can directly use StratifiedKFold with y_new as your target. Once the splits are done, you can map your labels back.
Code sample:
import numpy as np
np.random.seed(1)
y = np.random.randint(0, 2, (10, 7))
y = y[np.where(y.sum(axis=1) != 0)[0]]
OUTPUT:
array([[1, 1, 0, 0, 1, 1, 1],
[1, 1, 0, 0, 1, 0, 1],
[1, 0, 0, 1, 0, 0, 0],
[1, 0, 0, 1, 0, 0, 0],
[1, 0, 0, 0, 1, 1, 1],
[1, 1, 0, 0, 0, 1, 1],
[1, 1, 1, 1, 0, 1, 1],
[0, 0, 1, 0, 0, 1, 1],
[1, 0, 1, 0, 0, 1, 1],
[0, 1, 1, 1, 1, 0, 0]])
Label encode your class vectors:
from sklearn.preprocessing import LabelEncoder
def get_new_labels(y):
    y_new = LabelEncoder().fit_transform([''.join(str(l)) for l in y])
    return y_new
y_new = get_new_labels(y)
OUTPUT:
array([7, 6, 3, 3, 2, 5, 8, 0, 4, 1])
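To round this off, here is a sketch of the splitting step itself. It uses a hypothetical label matrix in which every combination occurs four times, because combinations that occur fewer times than n_splits cannot be spread evenly across folds (most combinations in the random example above occur only once). np.unique with return_inverse is used here as an alternative to the LabelEncoder helper; X is a placeholder.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical multi-label targets: 3 label combinations, 4 samples each
patterns = np.array([[1, 0, 0], [0, 1, 1], [1, 1, 1]])
y = np.repeat(patterns, 4, axis=0)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features

# One integer id per distinct row (same role as y_new above)
_, y_new = np.unique(y, axis=0, return_inverse=True)
y_new = y_new.reshape(-1)  # make sure the ids are 1-D

skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
folds = []
for train_idx, val_idx in skf.split(X, y_new):
    # Split on the combination ids, but train on the original multi-label rows
    ytrain, yval = y[train_idx], y[val_idx]
    folds.append((train_idx, val_idx))
    print(np.bincount(y_new[val_idx]))  # combinations per validation fold
```

Since the indices refer to the original arrays, there is no explicit "mapping back" step: indexing y with train_idx and val_idx already returns the multi-label rows.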