how to increase precision as well as recall in svm for highly imbalanced data set - python-3.x

I have a loan data set of shape (116058, 29) that is highly imbalanced. How can I improve the precision and recall scores?
The target column is m13:
Counter({1: 636, 0: 115422})
I used the following to split the data into train and test sets:
X_train,X_test,y_train,y_test = train_test_split(X,y,train_size = 0.8,random_state = 100,stratify = y)
and then used an SVM for classification:
svc = SVC(class_weight = {1:0.95,0:0.05},kernel='rbf')
svc.fit(X_train,y_train)
y_pred = svc.predict(X_test)
I got a precision of 0.54 and a recall of 0.55.
I also tried a grid search with different values of C and gamma, but the configuration above gave the best result:
svc = SVC(class_weight = {1:0.95,0:0.05},kernel='rbf')
svc.fit(X_train,y_train)
y_pred = svc.predict(X_test)
Is there any way to improve the precision as well as the recall score?

First of all, let me comment on the baseline of your prediction. If I understand you correctly, you have 636 samples of class 1 and 115422 of class 0.
Imagine you built a prediction model that always predicts class 0. Your precision would be (if class 0 is your positive class):
115422/(115422+636) = 0.9945
and your recall (if class 0 is your positive class) would be:
1
If class 1 is your positive class, the precision would be 0 (and so would the recall).
As you can see, tuning this is quite a task; there are whole books about the topic, and it will be very hard. But your target should be to predict class 1 correctly: the goal should be to identify every class-1 sample with your algorithm. For example, you could try to optimize for sensitivity (recall of class 1); here are some metrics you could target: https://en.wikipedia.org/wiki/Precision_and_recall
What you definitely should do is make sure that your train and test sets both contain samples of class 1 (your stratify=y already takes care of this).
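As a first step, look at the per-class numbers rather than a single averaged score; a minimal check (reusing the y_test and y_pred from your code) could be:
from sklearn.metrics import classification_report, confusion_matrix

# The row for class 1 shows precision/recall on the rare class you actually care about.
print(classification_report(y_test, y_pred, digits=3))
print(confusion_matrix(y_test, y_pred))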

Related

How to customize threshold PyTorch

I have trained ResNet50 for binary image classification.
I want to decrease the number of false negatives by reducing the threshold value.
How can I do that?
To decrease the number of false negatives (FN), i.e. increase the recall (since recall = TP / (TP + FN)), you should increase the positive weight (the weight of the positive class) above 1. For example, nn.BCEWithLogitsLoss allows you to provide the pos_weight option:
pos_weight > 1 increases the recall, pos_weight < 1 increases the precision.
For example, if a dataset contains 100 positive and 300 negative examples of a single class, then pos_weight for the class should be equal to 300/100 = 3. The loss would act as if the dataset contains 3*100 = 300 positive examples.
As a side note, the explicit expression for the binary cross entropy with logits (where "with logits" should rather be understood as "from logits") is:
>>> z = torch.sigmoid(q)
>>> loss = -(w_p*p*torch.log(z) + (1-p)*torch.log(1-z))
Above q are the raw logit values while w_p is the weight of the positive instance.
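To make the side note concrete, here is a small hedged sketch (the counts and tensors are made up) showing nn.BCEWithLogitsLoss with pos_weight and checking it against the explicit expression above:
import torch
import torch.nn as nn

# Hypothetical imbalance: 100 positive vs. 300 negative samples -> pos_weight = 3
w_p = torch.tensor([3.0])
criterion = nn.BCEWithLogitsLoss(pos_weight=w_p, reduction='none')

q = torch.randn(4, 1)                      # raw logits from the model
p = torch.randint(0, 2, (4, 1)).float()    # binary targets

loss_builtin = criterion(q, p)

# The same quantity written out with the explicit expression above
z = torch.sigmoid(q)
loss_manual = -(w_p * p * torch.log(z) + (1 - p) * torch.log(1 - z))

print(torch.allclose(loss_builtin, loss_manual))  # expected: True (up to numerical precision)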

Confusion about positive class and sklearn metric pos_label=0

I have a binary classification problem for detecting AO/Non-AO images, using PyTorch for this purpose.
First, I load the data using the ImageFolder utility.
The dataset's class-to-index mapping in dataset.class_to_idx is {'AO': 0, 'Non-AO': 1}.
So my 'positive class' AO is assigned the label 0, and my 'negative class' Non-AO is assigned the label 1.
Then I train and validate the model without any issues.
When I come to testing, I need to calculate some metrics on the test data.
Here is where I am confused.
[Method A]
fpr, tpr, thresholds = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)
[Method B]
# because 0 is my actual 'positive' class for this problem
fpr, tpr, thresholds = roc_curve(y_true, y_score, pos_label=0)
roc_auc = auc(fpr, tpr)
Now, this second curve is basically the mirror of the first one along the diagonal, right?
And I think that it can't be the correct curve, because I checked the accuracy of the model by directly comparing y_true and y_pred to get the following accuracies.
Apart from this, here is what my confusion matrix looks like.
So, my first question is, am I doing something wrong? What is the significance of the curve from Method B? Can I say that Method A gives me the correct ROC curve for my classification task? If not, then how do I proceed for getting the correct curve?
What do true positive, true negative, or any of the other terms signify for my confusion matrix? Does the matrix consider 0 : AO as negative and 1 : Non-AO as positive (I think so, yes), or vice versa?
If 0 is indeed being considered negative when I actually want 0 to be considered positive, how can I make changes to reflect this in the matrix (because I am using the matrix later to calculate other metrics like specificity, sensitivity, etc.)?
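(Not a full answer, but as an illustration of the convention: scikit-learn's confusion_matrix orders rows and columns by sorted label value, so with labels 0 and 1 the bottom-right cell counts true 1 / predicted 1; the labels argument lets you reorder it so that your class 0 sits in that position, for example:)
from sklearn.metrics import confusion_matrix

# Default order: rows/columns follow sorted labels [0, 1].
print(confusion_matrix(y_true, y_pred))

# With labels=[1, 0] the matrix is reordered so that label 0 ('AO') takes the
# bottom-right ("positive") position when reading off TP/FP/FN/TN.
print(confusion_matrix(y_true, y_pred, labels=[1, 0]))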

Multiclass semantic segmentation model evaluation

I am doing a project on multiclass semantic segmentation. I have formulated a model that outputs pretty decent segmented images as the loss decreases. However, I cannot evaluate the model's performance with metrics such as mean IoU or the Dice coefficient.
In the case of binary semantic segmentation it was easy to just set a threshold of 0.5 to classify the outputs as object or background, but that does not work for multiclass semantic segmentation. Could you please tell me how to measure model performance on the aforementioned metrics? Any help will be highly appreciated!
By the way, I am using PyTorch framework and CamVid dataset.
If anyone is interested in this answer, please also look at this issue. The author of the issue points out that mIoU can be computed in a different way (and that method is more accepted in the literature), so consider that before using this implementation for any formal publication.
Basically, the other method suggested by the issue poster is to separately accumulate the intersections and unions over the entire dataset and divide them only at the final step. The method in the original answer below computes the intersection and union for a batch of images, divides them to get the IoU for the current batch, and then takes the mean of the per-batch IoUs over the entire dataset.
This original method is problematic because the final mean IoU varies with the batch size. The mIoU from the method mentioned in the issue, on the other hand, does not vary with batch size, since the separate accumulation makes the batch size irrelevant (though a larger batch size can definitely speed up the evaluation).
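For reference, a minimal sketch of that accumulate-then-divide approach (illustrative only; it assumes a data loader yielding (image, label) batches and a model that outputs per-class logits):
import torch

def miou_accumulated(loader, model, num_classes=19, device='cpu'):
    # Accumulate per-class intersection and union counts over the whole
    # dataset, then divide only once at the end.
    inter = torch.zeros(num_classes, dtype=torch.long)
    union = torch.zeros(num_classes, dtype=torch.long)
    model.eval()
    with torch.no_grad():
        for image, label in loader:
            pred = model(image.to(device)).argmax(dim=1).cpu().reshape(-1)
            label = label.reshape(-1)
            for c in range(num_classes):
                p, t = pred == c, label == c
                inter[c] += (p & t).sum()
                union[c] += (p | t).sum()
    present = union > 0  # skip classes that never appear in predictions or targets
    return (inter[present].float() / union[present].float()).mean().item()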
Original answer:
Given below is an implementation of mean IoU (Intersection over Union) in PyTorch.
import numpy as np
import torch
import torch.nn.functional as F

def mIOU(label, pred, num_classes=19):
    pred = F.softmax(pred, dim=1)
    pred = torch.argmax(pred, dim=1).squeeze(1)
    iou_list = list()
    present_iou_list = list()

    pred = pred.view(-1)
    label = label.view(-1)
    # Note: Following for loop goes from 0 to (num_classes-1)
    # and ignore_index is num_classes, thus ignore_index is
    # not considered in computation of IoU.
    for sem_class in range(num_classes):
        pred_inds = (pred == sem_class)
        target_inds = (label == sem_class)
        if target_inds.long().sum().item() == 0:
            iou_now = float('nan')
        else:
            intersection_now = (pred_inds[target_inds]).long().sum().item()
            union_now = pred_inds.long().sum().item() + target_inds.long().sum().item() - intersection_now
            iou_now = float(intersection_now) / float(union_now)
            present_iou_list.append(iou_now)
        iou_list.append(iou_now)
    return np.mean(present_iou_list)
Your model's prediction will have one channel per class, so first take the softmax (if your model does not apply it already), followed by argmax to get the index with the highest probability at each pixel. Then, we calculate the IoU for each class (and take the mean over them at the end).
We can reshape both the prediction and the label as 1-D vectors (I read that it makes the computation faster). For each class, we first identify the indices of that class using pred_inds = (pred == sem_class) and target_inds = (label == sem_class). The resulting pred_inds and target_inds will have 1 at pixels labelled as that particular class while 0 for any other class.
Then, there is a possibility that the target does not contain that particular class at all. This would make that class's IoU invalid, since the class is not present in the target. So you assign such classes a NaN IoU (so you can identify them later) and do not involve them in the calculation of the mean.
If the particular class is present in the target, then pred_inds[target_inds] will give a vector of 1s and 0s where indices with 1 are those where prediction and target are equal and zero otherwise. Taking the sum of all elements of this will give us the intersection.
If we add all the elements of pred_inds and target_inds, we'll get the union + intersection of pixels of that particular class. So, we subtract the already calculated intersection to get the union. Then, we can divide the intersection and union to get the IoU of that particular class and add it to a list of valid IoUs.
At the end, you take the mean of the entire list to get the mIoU. If you want the Dice Coefficient, you can calculate it in a similar fashion.
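As a hedged illustration of that last point, a Dice variant of the same per-class loop (using the same imports as the mIoU function above, with Dice = 2*I / (|P| + |T|)) might look like this:
def mean_dice(label, pred, num_classes=19):
    pred = torch.argmax(F.softmax(pred, dim=1), dim=1).view(-1)
    label = label.view(-1)
    dice_list = []
    for sem_class in range(num_classes):
        pred_inds = (pred == sem_class)
        target_inds = (label == sem_class)
        if target_inds.long().sum().item() == 0:
            continue  # class absent from the target, skip it (like the NaN IoUs above)
        intersection = (pred_inds & target_inds).long().sum().item()
        denom = pred_inds.long().sum().item() + target_inds.long().sum().item()
        dice_list.append(2.0 * intersection / denom)
    return np.mean(dice_list)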

SVM model averaging in sklearn

I would like to average the scores of two different SVMs trained on different samples of the same classes.
# The data share the same labels: x_1[1] has y_1[1] and x_2[1] has y_2[1],
# where y_2[1] == y_1[1]
Dataset_1=(x_1,y)
Dataset_2=(x_2,y)
test_data=(test_sample,test_labels)
We have 50 classes, and they are the same for dataset_1 and dataset_2:
list(set(y_1))=list(set(y_2))
What I have tried:
from sklearn.svm import SVC
clf_1 = SVC(kernel='linear', random_state=42).fit(x_1, y)
clf_2 = SVC(kernel='linear', random_state=42).fit(x_2, y)
How can I average the clf_1 and clf_2 scores before calling:
predict(test_sample)
That is what I would like to do.
Not sure I understand your question; to simply average the scores as in a typical ensemble, you should first get prediction probabilities from each model separately, and then just take their average:
pred1 = clf_1.predict_proba(test_sample)
pred2 = clf_2.predict_proba(test_sample)
pred = (pred1 + pred2)/2
In order to get prediction probabilities instead of hard classes, you should initialize the SVC using the additional argument probability=True.
Each row of pred will be an array of length 50, as many as your classes, with each element representing the probability that the sample belongs to the respective class.
After averaging, simply take the argmax of pred - just be sure that the order of the returned probabilities is OK; according to the docs:
The columns correspond to the classes in sorted order, as they appear in the attribute classes_
As I am not exactly sure what this means, run some checks with predictions on your training set, to be sure that the order is correct.
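For instance, a minimal continuation of the snippet above (assuming both SVCs were created with probability=True) that maps the averaged probabilities back to class labels in classes_ order:
import numpy as np

# Both classifiers were fit on the same y, so their classes_ arrays should match;
# classes_ also gives the column order of predict_proba.
assert (clf_1.classes_ == clf_2.classes_).all()

# Pick the class with the highest averaged probability for each test sample.
labels = clf_1.classes_[np.argmax(pred, axis=1)]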

spark ml 2.0 - Naive Bayes - how to determine threshold values for each class

I am using NB for document classification and am trying to understand the thresholds parameter to see how it can help optimize the algorithm.
Spark ML 2.0 thresholds doc says:
Param for Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values >= 0. The class with largest value p/t is predicted, where p is the original probability of that class and t is the class' threshold.
0) Can someone explain this better? What goal can it achieve? My general idea is that if you have a threshold of 0.7, then at least one class prediction probability should be greater than 0.7; if not, the prediction should return empty, meaning it is classified as 'uncertain' or the prediction column is just left empty. How is the p/t function going to achieve that when you still pick the category with the maximum probability?
1) Which probability does it adjust? The default column 'probability' is actually the conditional probability, and 'rawPrediction' is the confidence, according to the documentation. I believe the threshold will adjust 'rawPrediction', not the 'probability' column. Am I right?
2) Here is what some of my probability and rawPrediction vectors look like. How do I set threshold values based on these so I can filter out uncertain classifications? The probabilities are between 0 and 1, but rawPrediction seems to be on a log scale here.
Probability:
[2.233368649314982E-15,1.6429456680945863E-9,1.4377313514127723E-15,7.858651849363202E-15]
rawPrediction:
[-496.9606736723107,-483.452183395287,-497.40111830218746]
Basically, I want the classifier to leave the prediction column empty if it doesn't have any class probability greater than 0.7.
Also, how do I classify something as uncertain when more than one category has very close scores, e.g. 0.812, 0.800, 0.799? Picking the max is something I may not want here; instead I would classify it as "uncertain" or leave it empty, so I can do further analysis and treatment of those documents, or train another model for them.
I haven't played with it, but the intent is to supply different threshold values for each class. I've extracted this example from the docstring:
>>> model = nb.fit(df)
>>> result = model.transform(test0).head()
>>> result.prediction
1.0
>>> result.probability
DenseVector([0.42..., 0.57...])
>>> result.rawPrediction
DenseVector([-1.60..., -1.32...])
>>> nb = nb.setThresholds([0.01, 10.00])
>>> model3 = nb.fit(df)
>>> result = model3.transform(test0).head()
>>> result.prediction
0.0
If I understand correctly, the effect was to transform [0.42, 0.58] into [.42/.01, .58/10] = [42, 0.058], switching the prediction ("largest p/t") from column 1 (the first result.prediction above) to column 0 (the last one). However, I couldn't find the logic in the source. Anyone?
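In plain Python, the "largest p/t" rule from the docs works out like this for the numbers in the example (a toy sketch, not Spark code):
p = [0.42, 0.58]      # class-conditional probabilities from result.probability
t = [0.01, 10.00]     # per-class thresholds passed to setThresholds
scores = [pi / ti for pi, ti in zip(p, t)]   # [42.0, 0.058]
prediction = scores.index(max(scores))       # 0, i.e. the class with the largest p/t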
Stepping back: I do not see a built-in way to do what you want: be agnostic if no class dominates. You will have to add that with something like:
import numpy as np

def weak(probs, threshold=.7, epsilon=.01):
    # Weak if no class clears the threshold, or consecutive probs differ by < epsilon.
    probs = np.asarray(probs)
    return np.all(probs < threshold) or np.max(np.diff(probs)) < epsilon

>>> cases = [[.5,.5],[.5,.7],[.7,.705],[.6,.1]]
>>> for case in cases:
...     print('{!s:15} - {}'.format(case, weak(case)))
[0.5, 0.5] - True
[0.5, 0.7] - False
[0.7, 0.705] - True
[0.6, 0.1] - True
(Notice I haven't checked whether probs is a legal probability distribution.)
Alternatively, if you are not actually making a hard decision, use the predicted probabilities and a metric like Brier score, log loss, or info gain that accounts for the calibration as well as the accuracy.
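If you go that route, scikit-learn has ready-made implementations of two of those metrics (shown purely as an illustration; in Spark you would compute the equivalent over the probability column):
from sklearn.metrics import brier_score_loss, log_loss

# Toy binary example: true labels and predicted probabilities of class 1.
y_true = [0, 1, 1, 0]
y_prob = [0.1, 0.9, 0.8, 0.35]

print(brier_score_loss(y_true, y_prob))  # mean squared error of the probabilities
print(log_loss(y_true, y_prob))          # negative log-likelihood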
