sklearn HistGradientBoostingClassifier with large unbalanced data - scikit-learn

I've been using Sklearn HistGradientBoostingClassifier to classify some data. My experiment is multi-class classification with single label predictions (20 labels).
My experience shows two cases. The first case is the measurement of the accuracy of these algorithms without data augmentation (around unbalanced 3,000 samples). The second case is the measurement of accuracy with data augmentation (around 12,000 unbalanced samples). I am using default parameters.
In the first case, the HistGradientBoostingClassifier shows an accuracy of around 86.0%. However, with data augmentation, results show weak accuracy, around 23%.
I am wondering if this accuracy was coming from unbalanced datasets, but since there are no features to fix unbalanced datasets for the HistGradientBoostingClassifier algorithm within the Sklearn library, I cannot verify that fact.
Do some people have the same kind of problem with large dataset and HistGradientBoostingClassifier?
Edit: I tried other algorithms with the same data split, and the results seems normal (accuracy around 5% more w/ data augmentation). I am wondering why I am only getting this with HistGradientBoostingClassifier.

Accuracy is a poor metric when dealing with imbalanced data. Suppose I have 90:10 class 0 and class 1. A DummyClassifier that only predicts class 0 will achieve 90% accuracy.
You'll have to look at precision, recall, f1, confusion matrix, and not just accuracy alone.

I have found something that could be the reason of the lack of accuracy while using HistGradientBoostingClassifier algorithm with default parameters on augmented dataset of roughly 12,000 samples.
I compared HistGradientBoostingClassifier and LightGBM algorithms on the same data split (HistGradientBoostingClassifier from sklearn is an implementation of Microsoft's LightGBM.). HistGradientBoostingClassifier shows a weak accuracy of 24.7% and LightGBM a strong one 87.5%.
As I can read on sklearn's and Microsoft's docs, HistGradientBoostingClassifier "cannot handle properly" unbalanced dataset while LightGBM can. The latter has this parameter: class_weigth (dict, 'balanced' or None, optional (default=None)) (found on that page)
My hypothesis is that, for the time being, the dataset becomes more unbalanced with augmentation and, without any feature for the HistGradientBoostingClassifier algorithm to handle unbalanced data, the algorithm is misled.
Also, as mentioned by Hanafi Haffidz in comments the algorithm could tend to overfit with default parameters.

Related

Multilabel text classification with BERT and highly imbalanced training data

I'm trying to train a multilabel text classification model using BERT. Each piece of text can belong to 0 or more of a total of 485 classes. My model consists of a dropout layer and a linear layer added on top of the pooled output from the bert-base-uncased model from Hugging Face. The loss function I'm using is the BCEWithLogitsLoss in PyTorch.
I have millions of labeled observations to train on. But the training data are highly unbalanced, with some labels appearing in less than 10 observations and others appearing in more than 100K observations! I'd like to get a "good" recall.
My first attempt at training without adjusting for data imbalance produced a micro recall rate of 70% (good enough) but a macro recall rate of 45% (not good enough). These numbers indicate that the model isn't performing well on underrepresented classes.
How can I effectively adjust for the data imbalance during training to improve the macro recall rate? I see we can provide label weights to BCEWithLogitsLoss loss function. But given the very high imbalance in my data leading to weights in the range of 1 to 1M, can I actually get the model to converge? My initial experiments show that a weighted loss function is going up and down during training.
Alternatively, is there a better approach than using BERT + dropout + linear layer for this type of task?
In your case it might be helpful to balance the labels in the training data. You have a lot of data, so you could afford to loose a part of it by balancing. But before you do this, I recommend to read this answer about balancing classes in traing data.
If you really only care about recall, you could try to tune your model maximizing recall.

Model underfitting

I have trained a model and it took me quite a while to find the correct hyperparameters.
The model has now been trained for 15h and it seems to to its job quite well.
When I observed the training and validation loss though, the training loss is somewhat higher than the validation loss. (red curve: training, green: validation)
I use dropout to regularize my model and as far as I have understood, droput is is only applied during training which might be the reason.
Now Iam wondering if I have trained a valid model?
It doesn't seem like the model is heavily underfitted?
Thanks in advance for any advice,
cheers,
M
First, check whether you have good data set, i.e., if it is a classification, then get equal number of images for all classes and get it from same source not from different sources. And regularization, dropout are used for overfitting/High variance so don't worry about these.
Then, I think your model is doing good when you trained your model the initial error between them are different but as you increased the epochs then they both got into some steady path. So it is good. And may be reason for this is as I mentioned above or you should try shuffle them then using train_test_split for getting better distribution of training and validation sets.
A plot of learning curves shows a good fit if:
The plot of training loss decreases to a point of stability.
The plot of validation loss decreases to a point of stability and has a small gap with the training loss.
In your case these conditions are satisfied.
Still if you want to deal with High Bias/underfitting then here are few methods:
Train bigger models
Train longer. Use better optimization techniques
Try different Neural Network Architecture and also hyper parameters
And also you can use cross-validation or GridSearchCV for finding better optimizer or hyper parameters but it may take really long because you have to train it on different parameters each time considering your time which is 15 hours then it might be very long but you will find better parameters and then train on it.
Above all I think your model is doing okay.
If your model underfits, its performance will be lower, similar as in the case of overfitting, because actually it can not learn effectively to get the optimal result, i.e the proper function to fit the given distribution. So you have to use less regularization technique e.g. less dropout to get the optimal result.
Furthermore the sampling can also be crucial, because there can be training-validation subsets where your model performs well on validation set and less effective on training set and vice-versa. This is one of the reason why we use crossvalidation and different sampling methods e.g. stratified k-fold.

How to improve validation accuracy in training convolutional neural network?

I am training a CNN model(made using Keras). Input image data has around 10200 images. There are 120 classes to be classified. Plotting the data frequency, I can see that sample data for every class is more or less uniform in terms of distribution.
Problem I am facing is loss plot for training data goes down with epochs but for validation data it first falls and then goes on increasing. Accuracy plot reflects this. Accuracy for training data finally settles down at .94 but for validation data its around 0.08.
Basically its case of over fitting.
I am using learning rate of 0.005 and dropout of .25.
What measures can I take to get better accuracy for validation? Is it possible that sample size for each class is too small and I may need data augmentation to have more data points?
Hard to say what could be the reason. First you can try classical regularization techniques like reducing the size of your model, adding dropout or l2/l1-regularizers to the layers. But this is more like randomly guessing the models hyperparameters and hoping for the best.
The scientific approach would be to look at the outputs for your model and try to understand why it produces these outputs and obviously checking your pipeline. Did you had a look at the outputs (are they all the same)? Did you preprocess the validation data the same way as the training data? Did you made a stratified train/test-split, i.e. keeping the class distribution the same in both sets? Is the data shuffles when you feed it to your model?
In the end you have about ~85 images per class which is really not a lot, compare CIFAR-10 resp. CIFAR-100 with 6000/600 images per class or ImageNet with 20k classes and 14M images (~500 images per class). So data augmentation could be beneficial as well.

Is there class weight (or alternative way) for GradientBoostingClassifier in Sklearn when dealing with VotingClassifier or Grid search?

I'm using GradientBoostingClassifier for my unbalanced labeled datasets. It seems like class weight doesn't exist as a parameter for this classifier in Sklearn. I see I can use sample_weight when fit but I cannot use it when I deal with VotingClassifier or GridSearch. Could someone help?
Currently there isn't a way to use class_weights for GB in sklearn.
Don't confuse this with sample_weight
Sample Weights change the loss function and your score that you're trying to optimize. This is often used in case of survey data where sampling approaches have gaps.
Class Weights are used to correct class imbalances as a proxy for over \ undersampling. There is no direct way to do that for GB in sklearn (you can do that in Random Forests though)
Very late, but I hope it can be useful for other members.
In the article of Zichen Wang in towardsdatascience.com, the point 5 Gradient Boosting it is told:
For instance, Gradient Boosting Machines (GBM) deals with class imbalance by constructing successive training sets based on incorrectly classified examples. It usually outperforms Random Forest on imbalanced dataset For instance, Gradient Boosting Machines (GBM) deals with class imbalance by constructing successive training sets based on incorrectly classified examples. It usually outperforms Random Forest on imbalanced dataset.
And a chart shows that the half of the grandient boosting model have an AUROC over 80%. So considering GB models performances and the way they are done, it seems not to be necessary to introduce a kind of class_weight parameter as it is the case for RandomForestClassifier in sklearn package.
In the book Introduction To Machine Learning with Pyhton written by Andreas C. Müller and Sarah Guido, edition 2017, page 89, Chapter 2 *Supervised Learning, section Ensembles of Decision Trees, sub-section Gradient boosted regression trees (gradient boosting machines):
They are generally a bit more sensitive to
parameter settings than random forests, but can provide better accuracy if the parameters are set correctly.
Now if you still have scoring problems due to imbalance proportions of categories in the target variable, it is possible you should see if your data should be splited to apply different models on it, because they are not as homogeneous as it seems to be. I mean it may have a variable you have not in your dataset train (an hidden variable clearly) that influences a lot the model results, then it is difficult even for the greater GB to give correct scoring because it misses a huge information that you cannot make appear in the matrix to compute sometimes for many reasons.
Some updates:
I found, by random, there are libraries that implement it as parameters of their gradient boosting instance objects. It is the case of H2O where for the parameter balance_classes it is told:
Balance training data class counts via over/under-sampling (for
imbalanced data).
Type: bool (default: False).
If you want to keep with sklearn you should do as HakunaMaData told: over/under-sampling because that's what other libraries finally do when the parameter exist.

Test error lower than training error

Would appreciate your input on this. I am constructing a regression model with the help of genetic programming.
If my RMSE on test data is (much) lower than my RMSE on training data for a 1:5 ratio of data, should I be worried?
The test data is drawn randomly without replacement from a set of 24 data points. The model was built using genetic programming technique so the number of features, modeling framework etc vary as I minimize the training RMSE regularized by the number of nodes in the GP tree.
Is the model underfitted? Or should I have minimized MSE instead of RMSE (I thought it would be the same as MSE is positive and the minimum of MSE would coincide with the minimum of RMSE assuming the optimizer is good enough to find the minimum)?
Tks
So your model is trained on 20 out of 24 data points and tested on the 4 remaining data points?
To me it sounds like you need (much) more data, so you can have a larger train and test sets. I'm not surprised by the low performance on your test set as it seems that your model wasn't able to learn from such few data. As a rule of thumb, for machine learning you can never have enough data. Is it a possibility to gather a larger dataset?

Resources