I'm working on a multi-class classification problem and want to make predictions with high precision for a single class only (i.e. to predict less often, but correctly).
I've highlighted the total number of predictions and the true positive cases for class-1. Any suggestions on how to tune the model for high precision?
PS: The results for the other classes don't matter; we are only focusing on the precision of class-1. Please find the results below.
One possible issue here is class imbalance. If the classes in your dataset have unequal numbers of samples, the model you have developed might be biased towards a particular class. Keeping a similar sample size for all the classes may increase your precision/accuracy. Hope this helps.
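For reference, the quantity the question is about (true positives for class-1 divided by the total number of class-1 predictions) can be computed as in the minimal sketch below. The function name `class_precision` and the toy labels are made up for illustration and are not the asker's data.

```python
def class_precision(y_true, y_pred, target):
    """Precision for one class: correct predictions of `target` / all predictions of `target`."""
    predicted = sum(1 for p in y_pred if p == target)
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if p == target and t == target)
    # If the class is never predicted, precision is undefined; None is returned here.
    return true_pos / predicted if predicted else None


# Made-up labels for a 3-class problem (illustration only).
y_true = [1, 1, 2, 0, 1, 2, 0, 1]
y_pred = [1, 0, 1, 0, 1, 2, 1, 1]

# Class 1 is predicted 5 times and 3 of those predictions are correct.
print(class_precision(y_true, y_pred, target=1))
```

Tracking this single number while rebalancing the data makes it easier to see whether a change actually helps class-1 rather than overall accuracy.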
I have a multilabel classification problem, which I am trying to solve with CNNs in PyTorch. I have 80,000 training examples and 7,900 classes; every example can belong to multiple classes at the same time, and the mean number of classes per example is 130.
The problem is that my dataset is very imbalanced. For some classes, I have only ~900 examples, which is around 1%. For "overrepresented" classes I have ~12,000 examples (15%). When I train the model I use BCEWithLogitsLoss from PyTorch with the positive weights parameter. I calculate the weights the same way as described in the documentation: the number of negative examples divided by the number of positives.
As a result, my model overestimates almost every class. For both minor and major classes I get almost twice as many predictions as true labels, and my AUPRC is just 0.18. That is still much better than no weighting at all, since in that case the model predicts everything as zero.
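The weighting rule described above (negatives divided by positives, per class) can be sketched in plain Python as follows. The helper name `pos_weights`, the set-based label format, and the fallback value for classes without positives are assumptions made for this illustration; they are not taken from the asker's code.

```python
def pos_weights(label_rows, num_classes):
    """Per-class weight = (# negative examples) / (# positive examples)."""
    n = len(label_rows)
    positives = [0] * num_classes
    for row in label_rows:
        for c in row:  # each row is the set of class indices active for one example
            positives[c] += 1
    # A class without positives would divide by zero; 1.0 is an arbitrary fallback.
    return [(n - p) / p if p else 1.0 for p in positives]


# Toy data: 4 examples, 3 classes (illustration only).
labels = [{0, 2}, {2}, {1, 2}, {1}]
weights = pos_weights(labels, num_classes=3)
print(weights)
```

With the figures quoted in the question, a class with ~900 positives out of 80,000 gets a weight of about 87.9, and a class with ~12,000 positives gets about 5.7. In PyTorch the resulting list would typically be converted to a tensor and passed as the `pos_weight` argument of `torch.nn.BCEWithLogitsLoss`.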
So my question is, how do I improve the performance? Is there anything else I can do? I tried different batch sampling techniques (to oversample minority class), but they don’t seem to work.
I would suggest either one of these strategies:
Focal Loss
A very interesting approach for dealing with un-balanced training data through tweaking of the loss function was introduced in
Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He and Piotr Dollar Focal Loss for Dense Object Detection (ICCV 2017).
They propose modifying the binary cross-entropy loss so that it decreases the loss and gradient of easily classified examples while "focusing the effort" on examples where the model makes gross errors.
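A minimal sketch of the idea for a single binary prediction is shown below, using the form -alpha_t * (1 - p_t)^gamma * log(p_t). The function names are hypothetical, the values gamma=2 and alpha=0.25 are used here only as commonly quoted defaults, and `p` is assumed to lie strictly between 0 and 1.

```python
import math


def binary_cross_entropy(p, y):
    """Standard BCE for one prediction p = P(y = 1)."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)


def binary_focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Focal loss for one prediction: -alpha_t * (1 - p_t)**gamma * log(p_t)."""
    p_t = p if y == 1 else 1.0 - p              # probability assigned to the true label
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing factor
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# A positive example that is classified easily (p = 0.95) versus badly (p = 0.10).
for p in (0.95, 0.10):
    print(p, binary_cross_entropy(p, 1), binary_focal_loss(p, 1))
```

Relative to cross-entropy, the easy example is scaled by 0.25 * 0.05^2 while the hard one is scaled by 0.25 * 0.9^2, so the hard example dominates the total loss much more strongly.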
Hard Negative Mining
Another popular approach is to do "hard negative mining"; that is, propagate gradients only for part of the training examples - the "hard" ones.
see, e.g.:
Abhinav Shrivastava, Abhinav Gupta and Ross Girshick Training Region-based Object Detectors with Online Hard Example Mining (CVPR 2016)
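As a rough illustration of the selection step, the sketch below keeps only the highest-loss examples of a batch and averages the loss over them. The function names and the `keep_fraction` parameter are made up for this example; it is not the procedure from the cited paper, only the general "keep the hard ones" idea.

```python
def hard_example_indices(losses, keep_fraction=0.25):
    """Return the indices of the highest-loss examples in a batch."""
    k = max(1, int(len(losses) * keep_fraction))
    ranked = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return sorted(ranked[:k])


def mined_batch_loss(losses, keep_fraction=0.25):
    """Average loss over the hard examples only; the others contribute nothing."""
    hard = hard_example_indices(losses, keep_fraction)
    return sum(losses[i] for i in hard) / len(hard)


# Per-example losses for a made-up batch of 8 examples.
batch_losses = [0.02, 1.30, 0.05, 0.01, 0.90, 0.03, 0.04, 0.02]
print(hard_example_indices(batch_losses))  # [1, 4]
print(mined_batch_loss(batch_losses))
```

In a real training loop the same selection would be applied to a per-example loss tensor before calling backward, so that gradients flow only through the selected examples.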
@Shai has provided two strategies developed in the deep learning era. I would like to offer some additional traditional machine learning options: over-sampling and under-sampling.
The main idea of both is to produce a more balanced dataset by sampling before you start training. Note that you will probably face some problems, such as losing data diversity (under-sampling) and overfitting the training data (over-sampling), but it might be a good starting point.
See the wiki link for more information.
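A minimal sketch of random over- and under-sampling is given below. It assumes one label per example, so it would need adapting for the multilabel setting in the question; the function name `random_resample` and its parameters are hypothetical.

```python
import random


def random_resample(samples, labels, mode="over", seed=0):
    """Balance classes by duplicating (over) or dropping (under) random examples."""
    rng = random.Random(seed)
    by_class = {}
    for s, y in zip(samples, labels):
        by_class.setdefault(y, []).append(s)
    sizes = [len(items) for items in by_class.values()]
    target = max(sizes) if mode == "over" else min(sizes)
    out = []
    for y, items in by_class.items():
        if len(items) >= target:
            # Under-sampling: keep a random subset without replacement.
            chosen = rng.sample(items, target)
        else:
            # Over-sampling: add random duplicates until the target size is reached.
            chosen = items + rng.choices(items, k=target - len(items))
        out.extend((s, y) for s in chosen)
    rng.shuffle(out)
    return out


samples = list(range(12))
labels = [0] * 10 + [1] * 2
print(len(random_resample(samples, labels, mode="over")))   # 20
print(len(random_resample(samples, labels, mode="under")))  # 4
```

Resampling should be applied to the training split only, so that validation and test data keep the original class distribution.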
I am using a Keras sequential model for binary classification, but my data is unbalanced. I have 2 feature columns and 1 output column (1/0). I have 10,000 rows of data, of which only 20 have output 1; all others are 0. I then extended the data size to 40,000, and still only 20 rows have output 1. Since the data is unbalanced (0 dominates 1), which neural network would be better for correct prediction?
First of all, two features is a really small number. Neural networks are highly non-linear models with a very large number of degrees of freedom, so if you try to train a network with more than just a couple of neurons it will overfit even with balanced classes. You can find models better suited to low dimensionality, such as Support Vector Machines, in the scikit-learn library.
Now, about unbalanced data: the most common techniques are undersampling and oversampling. Undersampling basically means training your model several times on a fraction of the dataset that contains the non-dominant class and a random sample of the dominant class, so that the ratio is acceptable, whereas oversampling consists of generating artificial data to balance the classes. In most cases undersampling works better.
Also, when working with unbalanced data it's quite important to choose the right metric based on what matters most for the problem (is minimizing false positives more important than minimizing false negatives, etc.).
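The repeated undersampling described above could look roughly like the sketch below: every round keeps all rows of the rare class and draws a fresh random sample of the dominant class. The generator name `balanced_subsets` and the `ratio` and `rounds` parameters are assumptions for illustration.

```python
import random


def balanced_subsets(majority, minority, rounds=5, ratio=1.0, seed=0):
    """Yield training subsets: all minority rows plus a random majority sample."""
    rng = random.Random(seed)
    # Number of majority rows per subset, relative to the minority size.
    n_major = min(len(majority), int(len(minority) * ratio))
    for _ in range(rounds):
        yield rng.sample(majority, n_major) + list(minority)


# Row indices standing in for the 9,980 zeros and 20 ones from the question.
zeros = list(range(9980))
ones = list(range(9980, 10000))
subsets = list(balanced_subsets(zeros, ones, rounds=3))
print([len(s) for s in subsets])  # [40, 40, 40]
```

One model would then be trained per subset and the predictions combined, for example by averaging the predicted probabilities.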
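To see why the metric matters, the sketch below scores a model that always predicts 0 on data shaped like the question's (20 positives in 10,000 rows). The helper `binary_metrics` is hypothetical, and returning 0.0 when precision or recall is undefined is a convention chosen here.

```python
def binary_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall) for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    # Undefined ratios (no predicted or no actual positives) are reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


y_true = [1] * 20 + [0] * 9980
always_zero = [0] * 10000
print(binary_metrics(y_true, always_zero))  # (0.998, 0.0, 0.0)
```

Accuracy looks excellent even though not a single positive is found, which is why precision, recall, or a combination of the two is usually more informative here.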
I am using the LogisticRegression classifier to classify documents. The results are good (macro-avg. f1 = 0.94). I apply an extra step to the prediction results (predict_proba) to check whether a classification is "confident" enough (e.g. >0.5 confidence for the first class, >0.2 distance in confidence to the second class, etc.). Otherwise, the sample is discarded as "unknown".
The score that is most significant for me is the number of samples that, despite this additional step, are assigned to the wrong class. This score is unfortunately too high (~0.03). In many of these cases, the classifier is very confident (0.8 - 0.9999!) that it chose the correct class.
Changing parameters (C, class_weight, min_df, tokenizer) has so far only led to a small decrease in this score, but a significant decrease in correct classifications. However, looking at several samples and the most discriminative features of the respective classes, I cannot understand where this high confidence comes from. I would assume it is possible to discard most of these samples without discarding significantly more correct samples.
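The rejection step described above might be sketched like this for one row of `predict_proba` output. The function name `confident_label` and its parameter names are hypothetical; the thresholds 0.5 and 0.2 are the example values from the question, and at least two classes are assumed.

```python
def confident_label(probs, labels, min_top=0.5, min_margin=0.2):
    """Return the top label if it is confident enough, otherwise "unknown"."""
    ranked = sorted(zip(probs, labels), reverse=True)
    (p1, label1), (p2, _) = ranked[0], ranked[1]
    # Require both a high top probability and a clear gap to the runner-up.
    if p1 > min_top and (p1 - p2) > min_margin:
        return label1
    return "unknown"


classes = ["invoice", "letter", "report"]  # made-up class names
print(confident_label([0.70, 0.20, 0.10], classes))  # invoice
print(confident_label([0.55, 0.40, 0.05], classes))  # unknown (margin too small)
print(confident_label([0.45, 0.20, 0.35], classes))  # unknown (top too low)
```

A rule like this only filters on the reported probabilities, so confidently wrong predictions such as the 0.8 to 0.9999 cases mentioned in the question pass through it unchanged.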
Is there a way to debug/analyze such cases? What could be the reason for these high confidence values?
Would appreciate your input on this. I am constructing a regression model with the help of genetic programming.
If my RMSE on test data is (much) lower than my RMSE on training data for a 1:5 ratio of data, should I be worried?
The test data is drawn randomly without replacement from a set of 24 data points. The model was built using a genetic programming technique, so the number of features, the modeling framework, etc. vary as I minimize the training RMSE regularized by the number of nodes in the GP tree.
Is the model underfitted? Or should I have minimized MSE instead of RMSE (I thought it would be the same as MSE is positive and the minimum of MSE would coincide with the minimum of RMSE assuming the optimizer is good enough to find the minimum)?
Tks
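On the MSE versus RMSE point, a small check is sketched below. Without a penalty the two criteria rank models identically, because the square root is monotonic. With a complexity penalty they can disagree; the additive penalty per node, its weight, and the two candidate models are all assumptions, since the question does not say how the regularization is applied.

```python
import math

# Hypothetical candidates: (training MSE, number of nodes in the GP tree).
candidates = {"small_tree": (4.0, 3), "large_tree": (1.0, 8)}
penalty = 0.5  # assumed additive weight per node


def best(score):
    """Name of the candidate that minimizes score(mse, nodes)."""
    return min(candidates, key=lambda name: score(*candidates[name]))


# Without a penalty both criteria pick the same model.
print(best(lambda m, n: m), best(lambda m, n: math.sqrt(m)))

# With an additive penalty the two criteria can disagree.
print(best(lambda m, n: m + penalty * n))             # large_tree
print(best(lambda m, n: math.sqrt(m) + penalty * n))  # small_tree
```

So the choice between MSE and RMSE is harmless on its own, but once a node-count penalty is added it changes how strongly fit is traded against tree size.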
So your model is trained on 20 out of 24 data points and tested on the 4 remaining data points?
To me it sounds like you need (much) more data, so that you can have larger training and test sets. I'm not surprised by the low performance on your test set, as it seems your model wasn't able to learn from so little data. As a rule of thumb, in machine learning you can never have enough data. Is it possible to gather a larger dataset?