Sklearn logistic regression optimized on a profit-loss matrix with unbalanced classes

I need to train a Logistic Regression in sklearn with different profit/loss weights between the classes.
The positive class is a loss: each time a positive occurs, it costs the company, say, $1,000. This applies to both True Positive and False Negative cases.
On the other hand, each negative case (both True Negative and False Positive) makes the company gain $50.
The question is: how do I train, say, a Logistic Regression classifier in sklearn so that it maximizes the profit?
A further complication is that the classes are unbalanced: the Positives represent 5% of the overall sample size while the Negatives represent 95%.
Thanks for helping
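One way this kind of cost structure is often encoded in scikit-learn is by weighting the classes (or individual samples) in proportion to the money at stake rather than training on raw counts. The sketch below only illustrates that idea under the figures given in the question; X_train and y_train are assumed names, the class labels are assumed to be 0/1, and the exact weights are placeholders, not a definitive recipe.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Figures from the question: a positive costs ~$1,000, a negative earns ~$50,
# so getting positives right matters roughly 20x more than getting negatives right.
COST_POS, GAIN_NEG = 1000.0, 50.0

# Option 1: bake the ratio into the loss via class weights
# (this also counteracts the 5% / 95% class imbalance).
clf = LogisticRegression(class_weight={1: COST_POS, 0: GAIN_NEG}, max_iter=1000)
clf.fit(X_train, y_train)

# Option 2: equivalent per-sample weights passed at fit time.
sample_weight = np.where(y_train == 1, COST_POS, GAIN_NEG)
clf_sw = LogisticRegression(max_iter=1000)
clf_sw.fit(X_train, y_train, sample_weight=sample_weight)

Instead of (or in addition to) re-weighting, a plain model can also be kept and the decision threshold of predict_proba tuned on a validation set to maximize expected profit; with a 95/5 split, the default 0.5 cut-off is rarely the profit-maximizing working point.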

Related

Loss function negative log likelihood giving loss despite perfect accuracy

I am debugging a sequence-to-sequence model and purposely tried to perfectly overfit a small dataset of ~200 samples (sentence pairs of lengths between 5 and 50). I am using negative log-likelihood loss in PyTorch. I get a low loss (~1e-5), but the accuracy on the same dataset is only 33%.
I trained the model on 3 samples as well and obtained 100% accuracy, yet the training loss was still nonzero (again around ~1e-5). I was under the impression that negative log-likelihood only produces a loss when there is a mismatch between the predicted and target labels?
Is a bug in my code likely?
There is no bug in your code.
The way things usually work in deep nets is that the network predicts logits (i.e., unnormalized log-likelihoods). These logits are then transformed into probabilities using softmax (or a sigmoid function), and cross-entropy is finally evaluated on the predicted probabilities.
The advantage of this approach is that it is numerically stable and easy to train with. On the other hand, because of the softmax you can never have "perfect" 0/1 probabilities for your predictions: even when your network has perfect accuracy, it will never assign probability 1 to the correct prediction, only something close to one. As a result, the loss will always be positive (albeit small).
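A tiny PyTorch illustration of this point (the numbers are made up): the logits below give 100% accuracy, yet the cross-entropy loss is small but strictly positive, because softmax never outputs an exact 1.

import torch
import torch.nn.functional as F

# Two samples, two classes; the correct class always has the larger logit,
# so argmax gives 100% accuracy.
logits = torch.tensor([[12.0, 0.0],
                       [0.0, 12.0]])
targets = torch.tensor([0, 1])

print(F.softmax(logits, dim=1))          # close to 1.0, but never exactly 1.0
print(F.cross_entropy(logits, targets))  # ~6e-06: tiny, but not zero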

ROC-AUC is high but PR-AUC value is very low

I am working on DNA sequence data and using a CNN in PyTorch. My dataset is hugely imbalanced:
positive class samples (~500)
negative class samples (~150,000)
So I am using WeightedRandomSampler to oversample the minority class and balance the classes before feeding them to the data loader.
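(For reference, a WeightedRandomSampler setup for this kind of imbalance typically looks roughly like the sketch below; labels, train_dataset and the batch size are assumed names, not code from the original post.)

import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

# labels: 0/1 tensor for the whole training set (~500 positives, ~150,000 negatives)
class_counts = torch.bincount(labels)            # [n_negative, n_positive]
class_weights = 1.0 / class_counts.float()       # rarer class gets a larger weight
sample_weights = class_weights[labels]           # one weight per sample

sampler = WeightedRandomSampler(sample_weights,
                                num_samples=len(sample_weights),
                                replacement=True)
loader = DataLoader(train_dataset, batch_size=64, sampler=sampler)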
I use 5-fold cross-validation. In a few test runs I could get a decent ROC AUC value, but the PR-AUC value seems to be really low.
For fold 1:
roc auc 0.9667848699763594
precision auc 0.055329116326074484
For fold 2:
roc auc 0.8476321207961566
precision auc 0.03307627288669479
For fold 3:
roc auc 0.9528898540612085
precision auc 0.05020178518546394
I suspect that there are a lot of false negatives. Since the positive class samples (~500) are very few compared to the negative class samples (~150,000), the model learns the negative class better and predicts most of the test samples as negative.
I tried weighting the positive class using:
import torch
import torch.nn as nn

# weight positive examples 50x in the loss (`device` is defined elsewhere)
weight = [50.0]
class_weight = torch.FloatTensor(weight).to(device)
criterion = nn.BCEWithLogitsLoss(pos_weight=class_weight)
By doing this, almost all samples are predicted as positive.
I tried adaptive learning rates as well, but the precision and recall values do not seem to improve.
Can someone guide me and suggest ideas to improve the precision and recall values?
Thanks!
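For reference, per-fold numbers like the ones quoted above are typically computed along these lines (a sketch using sklearn.metrics; y_true and y_score stand for one fold's validation labels and predicted probabilities, and "precision auc" is read here as the area under the precision-recall curve, i.e. average precision):

from sklearn.metrics import roc_auc_score, average_precision_score

# y_true: 0/1 labels for one validation fold, y_score: predicted probabilities
roc_auc = roc_auc_score(y_true, y_score)
pr_auc = average_precision_score(y_true, y_score)   # PR-AUC / average precision
print("roc auc", roc_auc)
print("precision auc", pr_auc)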

Model evaluation: model.score vs. ROC curve (AUC indicator)

I want to evaluate a logistic regression model (binary event) using two measures:
1. model.score and the confusion matrix, which give me 81% classification accuracy
2. the ROC curve (using AUC), which gives back a 50% value
Are these two results in contradiction? Is that possible?
I must be missing something, but I still can't find it.
from sklearn.metrics import accuracy_score, confusion_matrix, roc_curve, roc_auc_score

y_pred = log_model.predict(X_test)
accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
y_test.count()
print(cm)
# roc_curve returns (fpr, tpr, thresholds), in that order
fpr, tpr, _ = roc_curve(y_test, y_pred, drop_intermediate=False)
roc = roc_auc_score(y_test, y_pred)
The accuracy score is calculated under the assumption that a class is selected if its predicted probability is above 50%. This means you are looking at only one working point out of many. Let's say you would like to assign an instance to one of the classes already at a 30% predicted probability (this may make sense if that class is more important to you and its a priori probability is very low). In that case you would get a very different confusion matrix, with a different accuracy ([TP+TN]/[ALL]).
The ROC AUC score examines all of these working points and gives you an estimate of the overall model. A score of 50% means the model is no better than randomly assigning classes according to their a priori probabilities. You would like the ROC AUC to be much higher before saying you have a good model.
So in the above case you can say that your model does not have much predictive strength. As a matter of fact, a better "prediction" would be to predict everything as "1": in your case that would lead to an accuracy above 99%.
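To make the "working point" idea concrete, here is a small sketch (using log_model, X_test and y_test from the question) that scores the model at several probability thresholds instead of only at the default 0.5:

from sklearn.metrics import confusion_matrix

# probability of class 1 for each test sample, rather than hard 0/1 predictions
proba = log_model.predict_proba(X_test)[:, 1]

for threshold in (0.3, 0.5, 0.7):                # three different working points
    y_hat = (proba >= threshold).astype(int)
    cm = confusion_matrix(y_test, y_hat)
    accuracy = cm.trace() / cm.sum()             # (TP + TN) / ALL
    print(threshold, accuracy)
    print(cm)

Each threshold gives a different confusion matrix and accuracy; the ROC AUC summarizes the model across all of them at once, which is why it can disagree with the accuracy measured only at the 0.5 cut-off.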

Keras: how to figure out the Null hypothesis?

I am training a deep neural net using Keras. One of the reported metrics is called val_acc, and I get around 70% for it. How do I know whether this is good or bad? The network is a binary classifier, so I am trying to predict a 1 or a 0. The data itself is about 65% 0's and 35% 1's. Is my 70% val_acc any good?
Accuracy is not always the right metric for evaluating a classifier. For example, it could be more important for you to classify the 1s correctly than the 0s (as in fraud detection), or the other way around. So you may be more interested in a classifier with higher precision (positive predictive value) or higher recall (sensitivity); in other words, false positives may be more expensive for you than false negatives, or vice versa. If you have some idea of the costs of misclassification (i.e., for FPs and FNs), you can compute the specific threshold that is optimal for the 0/1 decision instead of using the default 0.5. You can also use ROC curves and AUC to assess the performance of your classifier (the higher the AUC, the better). Finally, you may want to consider the kappa statistic to see how useful/effective your classifier is.
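For reference, always predicting the majority class would already give about 65% accuracy on this data, so 70% val_acc is only a modest improvement over that trivial baseline. And to make the cost-based threshold mentioned above concrete: if a false positive costs C_FP and a false negative costs C_FN, then with calibrated probabilities the expected-cost-minimizing rule is to predict the positive class whenever its probability exceeds C_FP / (C_FP + C_FN). A minimal sketch with made-up costs:

# hypothetical misclassification costs (not taken from the question)
C_FP = 1.0     # cost of a false positive
C_FN = 10.0    # cost of a false negative

# predict positive when its expected cost is lower:
#   C_FP * (1 - p) < C_FN * p   =>   p > C_FP / (C_FP + C_FN)
optimal_threshold = C_FP / (C_FP + C_FN)
print(optimal_threshold)   # 1/11, roughly 0.09, well below the default 0.5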

Test error lower than training error

Would appreciate your input on this. I am constructing a regression model with the help of genetic programming.
If my RMSE on test data is (much) lower than my RMSE on training data for a 1:5 test-to-train split of the data, should I be worried?
The test data is drawn randomly without replacement from a set of 24 data points. The model was built using a genetic programming technique, so the number of features, the modelling framework, etc. vary as I minimize the training RMSE regularized by the number of nodes in the GP tree.
Is the model underfitted? Or should I have minimized MSE instead of RMSE? (I thought it would make no difference, since MSE is non-negative and the minimum of MSE coincides with the minimum of RMSE, assuming the optimizer is good enough to find it.)
Thanks
So your model is trained on 20 out of 24 data points and tested on the 4 remaining data points?
To me it sounds like you need (much) more data, so that you can have larger train and test sets. I'm not surprised by the discrepancy you see on the test set: with only 4 test points the test RMSE is extremely noisy, and the model has very little data to learn from in the first place. As a rule of thumb, in machine learning you can never have enough data. Would it be possible to gather a larger dataset?
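As an aside on the RMSE vs. MSE part of the question: since the square root is strictly increasing, minimizing RMSE and minimizing MSE select the same model, so switching the objective would change nothing. A quick numerical check (illustrative numbers only):

import numpy as np

# made-up MSE values for a handful of candidate models
mse_values = np.array([4.0, 2.5, 3.1, 2.9])
rmse_values = np.sqrt(mse_values)

# sqrt is monotone increasing, so both criteria pick the same candidate
assert np.argmin(mse_values) == np.argmin(rmse_values)
print(np.argmin(mse_values))   # index 1 in both cases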
