Hard Negative Mining in CNNs Leading to an Imbalanced Sample - conv-neural-network

I am working on deep learning for object detection (binary classification). I have read from various sources that it is preferable to have balanced training data. Could someone link me to papers that substantiate this claim?
I ask because I am performing hard negative mining to augment my training set of negative samples. This, however, will lead to an imbalanced training set, as it will result in more negative than positive samples. Is there any way I can alleviate the problem of imbalanced training data? And is hard negative mining a good idea in CNNs?

Related

Why does my ELMo-CNN model give worse performance than Word2vec?

I want to compare the performance of ELMo and word2vec as word embeddings in a CNN model by classifying 4,000 tweets into five class labels, but the results show that ELMo gives worse performance than word2vec.
I used ELMoformanylangs for ELMo and pretrained word2vec on 1 million tweets.
[Loss curves of the word2vec-CNN and ELMo-CNN models - figures not shown]
The curves show that both models are overfitting, but why does ELMo do worse than word2vec?
From the elmoformanylangs project you've linked, it looks like your generic ELMo model was trained "on a set of 20-million-words data randomly sampled from the raw text released by the shared task (wikidump + common crawl)".
Given that many tweets are longer than 20 words, your 1-million-tweet training set for word2vec might be more training data than was used for the ELMo model. And, coming from actual tweets, it may also reflect the words and word senses used in tweets better than generic wikidump/common-crawl text.
Given that, I'm not sure why you'd have expected the ELMo approach to necessarily be better.
But also, as you've noted, the fact that your classifier performs worse with more training is highly indicative of extreme overfitting. You may want to fix that before attempting to reason any further about the relative merits of the different approaches. (When both classifiers are massively broken, exactly why one's brokenness is a bit better than the other's should be a fairly moot point. After they're both fixed to do as well as they can, the remaining difference may be interesting to choose between, or to understand deeply.)

Multilabel classification with class imbalance in Pytorch

I have a multilabel classification problem, which I am trying to solve with CNNs in PyTorch. I have 80,000 training examples and 7,900 classes; every example can belong to multiple classes at the same time, and the mean number of classes per example is 130.
The problem is that my dataset is very imbalanced. For some classes I have only ~900 examples, which is around 1%; for the "overrepresented" classes I have ~12,000 examples (15%). When I train the model I use BCEWithLogitsLoss from PyTorch with its pos_weight parameter. I calculate the weights the same way as described in the documentation: the number of negative examples divided by the number of positives.
As a result, my model overestimates almost every class. For both minor and major classes I get almost twice as many predictions as true labels, and my AUPRC is just 0.18. Even so, that is much better than no weighting at all, since without weights the model predicts everything as zero.
So my question is: how do I improve the performance? Is there anything else I can do? I tried different batch sampling techniques (to oversample the minority classes), but they don't seem to work.
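For reference, here is a minimal sketch of the weighting setup described above; the shapes and placeholder data are assumptions for illustration, not the asker's actual code:

```python
import torch
import torch.nn as nn

# Toy multilabel target matrix (small numbers for illustration):
# one row per example, one column per class, 1 where the label applies.
num_examples, num_classes = 1000, 50
targets = (torch.rand(num_examples, num_classes) < 0.1).float()

# pos_weight as described in the PyTorch docs: negatives / positives, per class.
positives = targets.sum(dim=0)
negatives = num_examples - positives
pos_weight = negatives / positives.clamp(min=1.0)  # avoid division by zero

criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

# Logits would come from the CNN; a random batch stands in here.
logits = torch.randn(32, num_classes)
loss = criterion(logits, targets[:32])
```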
I would suggest either of these strategies:
Focal Loss
A very interesting approach to dealing with unbalanced training data by tweaking the loss function was introduced in
Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He and Piotr Dollár, Focal Loss for Dense Object Detection (ICCV 2017).
They propose to modify the binary cross-entropy loss in a way that decreases the loss and gradient for easily classified examples, "focusing the effort" on examples where the model makes gross errors.
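As an illustration only, here is a minimal PyTorch sketch of a binary focal loss in the spirit of the paper (a paraphrase with the commonly cited defaults alpha=0.25, gamma=2, not the authors' code):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss on raw logits, following the idea of Lin et al. (2017)."""
    # Per-element binary cross entropy, kept unreduced so it can be re-weighted.
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    # p_t is the model's probability for the true class of each element.
    p_t = p * targets + (1 - p) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    # (1 - p_t)^gamma down-weights easy, well-classified examples.
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()

# Usage: logits and targets of the same shape, targets in {0, 1}.
logits = torch.randn(8, 5)
targets = torch.randint(0, 2, (8, 5)).float()
loss = focal_loss(logits, targets)
```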
Hard Negative Mining
Another popular approach is "hard negative mining": propagating gradients only for a subset of the training examples, the "hard" ones.
See, e.g.:
Abhinav Shrivastava, Abhinav Gupta and Ross Girshick Training Region-based Object Detectors with Online Hard Example Mining (CVPR 2016)
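A heavily simplified sketch of the idea (not the paper's region-based implementation): compute per-example losses and backpropagate only through the hardest fraction of the batch.

```python
import torch
import torch.nn.functional as F

def ohem_bce_loss(logits, targets, keep_ratio=0.25):
    """Online hard example mining over a batch: keep only the hardest examples."""
    # Per-example loss, no reduction yet.
    losses = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    per_example = losses.view(losses.size(0), -1).mean(dim=1)
    # Keep the examples with the largest loss (the "hard" ones).
    k = max(1, int(keep_ratio * per_example.size(0)))
    hard_losses, _ = torch.topk(per_example, k)
    return hard_losses.mean()

logits = torch.randn(32, 1)
targets = torch.randint(0, 2, (32, 1)).float()
loss = ohem_bce_loss(logits, targets)
```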
@Shai has provided two strategies developed in the deep learning era. I would like to offer some additional, more traditional machine learning options: over-sampling and under-sampling.
The main idea of both is to produce a more balanced dataset by resampling before you start training. Note that you will probably face some problems, such as losing data diversity (under-sampling) or overfitting the training data (over-sampling), but they can be a good starting point.
See the wiki link for more information.
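As a small illustration, this is what plain over- and under-sampling could look like with the imbalanced-learn package on a toy binary problem; the data and class counts are placeholders:

```python
import numpy as np
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Placeholder imbalanced binary dataset: 950 negatives, 50 positives.
X = np.random.randn(1000, 20)
y = np.array([0] * 950 + [1] * 50)

# Over-sampling: duplicate minority examples until the classes are balanced.
X_over, y_over = RandomOverSampler(random_state=0).fit_resample(X, y)

# Under-sampling: discard majority examples until the classes are balanced.
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)

print(np.bincount(y_over), np.bincount(y_under))  # e.g. [950 950] and [50 50]
```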

Should sentiment analysis training data be evenly distributed?

If I am training a sentiment classifier on a tagged dataset where most documents are negative, say ~95%, should the classifier be trained with the same distribution of negative comments? If not, what would be other options to "normalize" the dataset?
You don't say what type of classifier you have, but in general you don't have to normalize the distribution of the training set. That said, more data is usually better, and you should always do blind tests to prevent over-fitting.
In your case you will have a strong classifier for negative comments and, unless you have a very large sample size, a weaker positive classifier. If your sample size is large enough it won't really matter, since you will hit a point where you might start over-fitting your negative data anyway.
In short, it's impossible to say for sure without knowing the actual algorithm, the size of the datasets, and the diversity within the data.
Your best bet is to carve off something like 10% of your training data (randomly) and just see how the classifier performs after being trained on the 90% subset.
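A minimal scikit-learn sketch of that kind of blind hold-out check; the classifier and the synthetic 95/5 data are stand-ins, not a recommendation of a specific model:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Placeholder data with the ~95/5 negative/positive split described above.
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)

# Carve off ~10% as a blind test set; stratify keeps the class ratio intact.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, stratify=y, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# Per-class precision/recall shows whether the rare class is being neglected.
print(classification_report(y_test, clf.predict(X_test)))
```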

Test error lower than training error

I would appreciate your input on this. I am constructing a regression model with the help of genetic programming.
If my RMSE on test data is (much) lower than my RMSE on training data, with a roughly 1:5 test-to-train split, should I be worried?
The test data is drawn randomly, without replacement, from a set of 24 data points. The model was built using a genetic programming technique, so the number of features, the modeling framework, etc. vary as I minimize the training RMSE regularized by the number of nodes in the GP tree.
Is the model underfitted? Or should I have minimized MSE instead of RMSE? (I thought it would make no difference, since MSE is non-negative and its minimum coincides with the minimum of RMSE, assuming the optimizer is good enough to find it.)
Thanks
So your model is trained on 20 out of 24 data points and tested on the 4 remaining data points?
To me it sounds like you need (much) more data, so you can have larger train and test sets. I'm not surprised by the low performance on your test set, as it seems your model wasn't able to learn from so little data. As a rule of thumb, in machine learning you can never have enough data. Would it be possible to gather a larger dataset?

Find the 1 gem in 1000 with a neural net? Something else?

There's something I don't understand about neural networks. I've tried to use them with financial data analysis and audio pitch classification. In both cases, I need a classifier that can detect the significant item from among the many. My audio application literally has one positive hit for every thousand negative hits. I run the network trainer, and it learns that it's a pretty darn fine guess to just go with the negative. Is there some other algorithm for detecting the rare gem? Is there some form of neural network training that is especially suited for this type of problem? I can change the range on my positive data to be equivalent to the sum of the negative values, but I don't understand how that fits in with the preferred range of zero to one on the typical neural network.
Here are two possible suggestions:
Balance your training set
Even if the real-world data contains 1000x as many negatives as positives, your training data does not have to. You can modify your training data set to increase the proportion of positives in your training set. That will improve the recall (more true positives), but also worsen the precision (also more false positives). So, you'd have to experiment with the ideal proportion of positives to negatives in the training set.
This paper discusses this approach in more detail: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2243711/pdf/procamiasymp00003-0260.pdf
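One way to sketch this rebalancing, assuming the examples sit in NumPy arrays (the 1000:1 class ratio mirrors the question; the target ratio is something you would tune experimentally):

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder data: 1 positive per 1000 negatives, as in the question.
X = rng.standard_normal((100_100, 16))
y = np.array([1] * 100 + [0] * 100_000)

def rebalance(X, y, neg_per_pos=3):
    """Keep all positives, subsample negatives to a chosen negative:positive ratio."""
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    keep_neg = rng.choice(neg_idx, size=neg_per_pos * len(pos_idx), replace=False)
    idx = rng.permutation(np.concatenate([pos_idx, keep_neg]))
    return X[idx], y[idx]

# Try a few ratios and compare precision/recall on an untouched validation set.
X_bal, y_bal = rebalance(X, y, neg_per_pos=3)
```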
Anomaly detection
... on the other hand, if you have too few positive examples to train the neural network with a more balanced training set, then perhaps you could try anomaly detection. With anomaly detection, you train your algorithm (e.g., a neural network) to recognize what negative data points look like. Then, any data point that looks different from normal gets flagged as positive.
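As one concrete illustration of that idea, here is a sketch using scikit-learn's IsolationForest as a stand-in for a learned model of "normal" (this substitutes a non-neural-network detector for the approach described; the data is synthetic):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Train only on "normal" (negative) examples.
X_negatives = rng.standard_normal((10_000, 8))
detector = IsolationForest(contamination=0.001, random_state=0).fit(X_negatives)

# At prediction time, points that look unlike the negatives come back as -1.
X_new = rng.standard_normal((1_000, 8))
flagged_as_positive = detector.predict(X_new) == -1
print(flagged_as_positive.sum(), "candidate positives flagged")
```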
