I am working on DNA sequence data and using a CNN in PyTorch. My dataset is hugely imbalanced:
positive class samples (~500)
negative class samples (~150,000)
So I am using WeightedRandomSampler to oversample and balance the classes before feeding them to the data loader.
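(For reference, a typical WeightedRandomSampler setup looks roughly like the sketch below; train_labels and train_dataset are placeholders for my label array and Dataset object.)

import numpy as np
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

labels = np.asarray(train_labels)                    # 0 = negative, 1 = positive
class_counts = np.bincount(labels)                   # e.g. [150000, 500]
sample_weights = 1.0 / class_counts[labels]          # rarer class -> larger weight

sampler = WeightedRandomSampler(
    weights=torch.as_tensor(sample_weights, dtype=torch.double),
    num_samples=len(sample_weights),                 # one "epoch" worth of draws
    replacement=True,                                # required to oversample the minority
)
train_loader = DataLoader(train_dataset, batch_size=64, sampler=sampler)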
I use 5-fold cross-validation. In a few test runs I could get a decent ROC AUC value, but the PR-AUC value seems to be really low.
For fold 1:
roc auc 0.9667848699763594
precision auc 0.055329116326074484
For fold 2:
roc auc 0.8476321207961566
precision auc 0.03307627288669479
For fold 3:
roc auc 0.9528898540612085
precision auc 0.05020178518546394
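(The per-fold ROC AUC and PR-AUC can be computed with scikit-learn roughly like this; y_true and y_score are placeholders for a fold's labels and predicted scores.)

from sklearn.metrics import roc_auc_score, average_precision_score

roc_auc = roc_auc_score(y_true, y_score)            # ranking quality over all thresholds
pr_auc = average_precision_score(y_true, y_score)   # the usual stand-in for PR-AUC
print("roc auc", roc_auc, "precision auc", pr_auc)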
I suspect that there are a lot of false negatives. Since the positive class (~500 samples) is tiny compared to the negative class (~150,000 samples), the model learns the negative class better and predicts most of the test samples as negative.
I tried weighting the positive class in the loss:

import torch
import torch.nn as nn

# pos_weight > 1 increases the loss contribution of positive samples
weight = torch.FloatTensor([50.0]).to(device)
criterion = nn.BCEWithLogitsLoss(pos_weight=weight)
By doing this, almost all samples are predicted as positive.
I tried adaptive learning rates as well, but the precision-recall values do not seem to improve.
Can someone guide me and suggest ideas to improve the precision and recall values?
Thanks!
Related
I am currently trying to figure out whether there is a way to get the 95% CI of the AUC in Python. Currently, I have a ypred list that contains the highest-probability class prediction among the 4 classes I have (so either 0/1/2/3 at each position) and a yactual list which contains the actual label at each position. How exactly do I go about bootstrapping samples for multiple classes?
Edit: Currently the way I am calculating the AUC is by doing a one-vs-all scheme, where I take the AUC for each class versus the rest and average those 4 values to get the final AUC.
Performing a one-vs-all classification scheme for each class and reporting the AUC per class was good enough.
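A percentile-bootstrap sketch for the 95% CI of that averaged one-vs-rest AUC (assuming yactual and ypred are the label lists from the question; swapping in per-class probabilities for ypred would give a smoother estimate):

import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize

def bootstrap_auc_ci(y_true, y_pred, n_classes=4, n_boot=1000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    classes = np.arange(n_classes)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))    # resample with replacement
        yt, yp = y_true[idx], y_pred[idx]
        if len(np.unique(yt)) < n_classes:                 # skip degenerate resamples
            continue
        yt_bin = label_binarize(yt, classes=classes)
        yp_bin = label_binarize(yp, classes=classes)
        # average of the one-vs-rest AUCs, mirroring the scheme in the question
        aucs.append(np.mean([roc_auc_score(yt_bin[:, k], yp_bin[:, k]) for k in range(n_classes)]))
    return np.percentile(aucs, [100 * alpha / 2, 100 * (1 - alpha / 2)])

lo, hi = bootstrap_auc_ci(yactual, ypred)
print(lo, hi)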
I want to evaluate a logistic regression model (binary event) using two measures:
1. model.score and the confusion matrix, which give me 81% classification accuracy
2. the ROC curve (AUC), which gives back a value of 50%
Are these two results in contradiction? Is that possible? I'm missing something but still can't find it.
from sklearn.metrics import accuracy_score, confusion_matrix, roc_curve, roc_auc_score

y_pred = log_model.predict(X_test)
accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
y_test.count()  # size of the test set
print(cm)
# note: roc_curve returns (fpr, tpr, thresholds), in that order
fpr, tpr, _ = roc_curve(y_test, y_pred, drop_intermediate=False)
roc = roc_auc_score(y_test, y_pred)
The accuracy score is calculated under the assumption that a class is selected if its predicted probability is above 50%. This means you are looking at only one case (one working point) out of many. Say you would like to classify an instance as '0' even if its probability is only 30% (this can make sense if one of your classes is more important to you and its a-priori probability is very low). In that case you would get a very different confusion matrix, with a different accuracy ([TP+TN]/[ALL]).

The ROC AUC score examines all of these working points and gives you an estimate of your overall model. A score of 50% means the model is no better than randomly assigning classes according to their a-priori probabilities. You would like the ROC AUC to be much higher before saying you have a good model.
So in the case above, you can say that your model does not have good predictive strength. As a matter of fact, a better "prediction" would be to predict everything as "1" - in your case that would already yield an accuracy above 99%.
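To make the "working points" concrete, here is a small sketch (not from the original answer) that sweeps a few thresholds for the model in the question, reusing log_model, X_test and y_test as defined above, and prints the accuracy and confusion matrix at each one:

from sklearn.metrics import accuracy_score, confusion_matrix

proba = log_model.predict_proba(X_test)[:, 1]    # P(y=1 | x) for each test sample

for threshold in (0.3, 0.5, 0.7):
    y_pred_t = (proba >= threshold).astype(int)  # classify as 1 above this cutoff
    print("threshold", threshold, "accuracy", accuracy_score(y_test, y_pred_t))
    print(confusion_matrix(y_test, y_pred_t))

Each threshold is one working point; the ROC curve aggregates all of them.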
I need to train a Logistic Regression in Sklearn with different profit-loss weights between classes.
The positive class is a loss, meaning that each time a positive case occurs it costs the company, say, $1,000. This applies to both true positive and false negative cases (all actual positives).
On the other hand, each negative case (both true negatives and false positives) makes the company gain $50.
The question is: how do I train, say, a logistic regression classifier in sklearn so that it maximizes the profit?
A further complication is that the positive and negative classes are unbalanced: the positives represent about 5% of the overall sample size while the negatives represent about 95%.
Thanks for helping
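One common approximation, assuming the $1,000 / $50 figures above, is to weight each class by the money at stake per sample via class_weight, so mistakes on the costly class are penalized 20 times more. A sketch; X_train and y_train are placeholders for your data:

from sklearn.linear_model import LogisticRegression

# weight each class by the per-sample monetary stake: $1,000 for positives, $50 for negatives
model = LogisticRegression(class_weight={1: 1000, 0: 50}, max_iter=1000)
model.fit(X_train, y_train)

An alternative is to keep the model as-is and instead choose the decision threshold that maximizes expected profit on a validation set, scoring each candidate threshold with the $1,000 / $50 values directly.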
So I am approaching a classification problem with the logistic regression algorithm, and I obtain class "1" as the prediction for every sample in the test set. The set is very imbalanced: it has over 200k inputs and roughly 92% are from class "1". Logistic regression generally assigns an input to class "1" if P(Y=1|X) > 0.5. Since all of the observations in the test set are being classified into class 1, I thought that maybe there is a way to change this threshold and set it, for example, to 0.75 so that only observations with P(Y=1|X) > 0.75 are classified as class 1, and otherwise class 0. How do I implement this in Python?
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_curve, roc_auc_score, classification_report

model = LogisticRegression(penalty='l2', C=1)
model.fit(X_train, y_train)

score = accuracy_score(y_test, model.predict(X_test))
fpr, tpr, thresholds = roc_curve(y_test, model.predict_proba(X_test)[:, 1])
roc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
cr = classification_report(y_test, model.predict(X_test))
PS: Since all observations from the test set are classified as class 1, the F1 score and recall for the minority class in the classification report are 0. Maybe changing the threshold will solve this problem.
A thing you might want to try is balancing the classes instead of changing the threshold. Scikit-learn supports this via the class_weight parameter. For example, you could try model = LogisticRegression(penalty='l2', class_weight='balanced', C=1). Look at the documentation for more details:
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
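If you still want to move the threshold itself rather than reweight the classes, a minimal sketch using predict_proba (reusing model and X_test from the question; 0.75 is just the example cutoff mentioned above):

proba = model.predict_proba(X_test)[:, 1]          # P(Y=1 | X) for each test sample
custom_threshold = 0.75
y_pred_custom = (proba > custom_threshold).astype(int)

You can then pass y_pred_custom to classification_report or confusion_matrix exactly as before.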
One option of the SVM classifier (SVC) is probability, which is False by default. The documentation does not say what it does; looking at the libsvm source code, it seems to do some sort of cross-validation.
This option does not exist for LinearSVC or OneClassSVM.
I need to calculate AUC scores for several SVM models, including these last two. Should I calculate the AUC score using decision_function(X) as the thresholds?
Answering my own question.
Firstly, it is a common "myth" that you need probabilities to draw the ROC curve. No - you need some kind of score in your model whose threshold you can vary, and the ROC curve is drawn by varying that threshold. The point of the ROC curve is, of course, to see how well your model reproduces the hypothesis by seeing how well it orders the observations.
In the case of SVM, there are two ways I see people drawing ROC curves for them:
using the distance to the decision boundary, as I mentioned in my own question
using the bias term as your threshold in the SVM: http://researchgate.net/post/How_can_I_plot_determine_ROC_AUC_for_SVM. In fact, if you use SVC(probability=True), then probabilities will be calculated for you in this manner, using CV, which you can then use to draw the ROC curve. But as mentioned in the link I provide, it is much faster to draw the ROC curve directly by varying the bias.
I think #2 is the same as #1 if we are using a linear kernel, as in my own case, because varying the bias is varying the distance in this particular case.
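For completeness, a minimal sketch of option #1 with scikit-learn (make_classification is only there to have some data to fit; roc_auc_score accepts raw decision scores as the ranking):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

svm = LinearSVC(C=1.0, max_iter=10000).fit(X_train, y_train)

# signed distance to the separating hyperplane; ROC/AUC only needs a score
# that orders the observations, not calibrated probabilities
scores = svm.decision_function(X_test)
print(roc_auc_score(y_test, scores))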
In order to calculate AUC using sklearn, you need a predict_proba method on your classifier; this is what the probability parameter on SVC enables (you are correct that it's calculated using cross-validation). From the docs:
probability : boolean, optional (default=False)
Whether to enable probability estimates. This must be enabled prior to calling fit, and will slow down that method.
You can't use the decision function directly to compute AUC, since it's not a probability. I suppose you could scale the decision function to take values in the range [0,1], and compute AUC, however I'm not sure what statistical properties this will have; you certainly won't be able to use it to compare with ROC calculated using probabilities.