Test error lower than training error - statistics

Would appreciate your input on this. I am constructing a regression model with the help of genetic programming.
If my RMSE on test data is (much) lower than my RMSE on training data for a 1:5 ratio of data, should I be worried?
The test data is drawn randomly without replacement from a set of 24 data points. The model was built using genetic programming technique so the number of features, modeling framework etc vary as I minimize the training RMSE regularized by the number of nodes in the GP tree.
Is the model underfitted? Or should I have minimized MSE instead of RMSE (I thought it would be the same as MSE is positive and the minimum of MSE would coincide with the minimum of RMSE assuming the optimizer is good enough to find the minimum)?
Tks

So your model is trained on 20 out of 24 data points and tested on the 4 remaining data points?
To me it sounds like you need (much) more data, so you can have a larger train and test sets. I'm not surprised by the low performance on your test set as it seems that your model wasn't able to learn from such few data. As a rule of thumb, for machine learning you can never have enough data. Is it a possibility to gather a larger dataset?

Related

Multilabel text classification with BERT and highly imbalanced training data

I'm trying to train a multilabel text classification model using BERT. Each piece of text can belong to 0 or more of a total of 485 classes. My model consists of a dropout layer and a linear layer added on top of the pooled output from the bert-base-uncased model from Hugging Face. The loss function I'm using is the BCEWithLogitsLoss in PyTorch.
I have millions of labeled observations to train on. But the training data are highly unbalanced, with some labels appearing in less than 10 observations and others appearing in more than 100K observations! I'd like to get a "good" recall.
My first attempt at training without adjusting for data imbalance produced a micro recall rate of 70% (good enough) but a macro recall rate of 45% (not good enough). These numbers indicate that the model isn't performing well on underrepresented classes.
How can I effectively adjust for the data imbalance during training to improve the macro recall rate? I see we can provide label weights to BCEWithLogitsLoss loss function. But given the very high imbalance in my data leading to weights in the range of 1 to 1M, can I actually get the model to converge? My initial experiments show that a weighted loss function is going up and down during training.
Alternatively, is there a better approach than using BERT + dropout + linear layer for this type of task?
In your case it might be helpful to balance the labels in the training data. You have a lot of data, so you could afford to loose a part of it by balancing. But before you do this, I recommend to read this answer about balancing classes in traing data.
If you really only care about recall, you could try to tune your model maximizing recall.

Large dataset - ANN

I am trying to classify around 400K data with 13 attributes. I have used python sklearn's SVM package, but it didn't work, and then I learned that SVM's are not suitable for large dataset classification. Then I used the (sklearn) ANN using the following MLPClassifier:
MLPClassifier(solver='adam', alpha=1e-5, random_state=1,activation='relu', max_iter=500)
and trained the system using 200K samples, and tested the model on the remaining ones. The classification worked well. However, my concern is that the system is over trained or overfit. Can you please guide me on the number of hidden layers and node sizes to make sure that there is no overfit? (I have learned that the default implementation has 100 hidden neurons. Is it ok to use the default implementation as is?)
To know if your are overfitting you have to compute:
Training set accuracy
Test set accuracy
Once you have calculated this scores, compare it. If training set score is much better than your test set score, then you are overfitting. This means that your model is "memorizing" your data, instead of learning from it to make future predictions.
If you are overfitting with Neuronal Networks you probably have to reduce the number of layers and reduce the number of neurons per layer. There isn't any strict rule that says the number of layer or neurons you need depending on you dataset size. Every dataset can behaves completely different with the same dataset size.
So, to conclude, if you are overfitting, you would have to evaluate your model accuracy using different parameters of layers and number of neurons, and, then, observe with which values you obtain the best results. There are some methods you can use to find the best parameters, is like gridsearchCV.

impact of labels' distribution in deep learning for regression problem

I am trying to train a CNN model for a regression problem, after that, I categorize predicted labels into 4 classes and check some accuracy metrics. In confusion matrix accuracy of class 2,3 are around 54% and accuracy of class 1,4 are more than 90%. labels are between 0-100 and classes are 1: 0-45,2: 45-60, 3:60-70, 4:70-100. I do not know where the problem comes from Is it because of the distribution of labels in the training set and what is the solution! Regards...
I attached the plot in the following link.
Training set target distribution
It's not a good idea to create classes that way. Giving to some classes a smaller window of values (i.e. you predict 2 for 15 values and 1 for 45 values), it is intrinsically more difficult for your model to predict class 2, and the best thing the model will learn during training will be to avoid class 2 as much as possible.
You may confirm this having a look at False Negatives for classes 2 and 3, if they are too many, it might be due to this.
The best thing to do would be categorizing your output space in equal portions, and trusting your model will learn which classes are less frequent, without trying to force that proportion by yourself.
If you don't have good results, it means you have to improve your model in other ways, maybe using data augmentation to get a uniform distribution of training samples may help.
If this doesn't sound convincing for you, try to have a look at this paper:
https://papers.nips.cc/paper/95-alvinn-an-autonomous-land-vehicle-in-a-neural-network.pdf
In end-to-end models for autonomous driving, neural networks have to predict classes indicating the steering angle. The distribution of these values is highly imbalanced as most of the time the car is going straight. Despite this, the best models do not discriminate against some classes to adapt to data distribution.
Good luck!

Model underfitting

I have trained a model and it took me quite a while to find the correct hyperparameters.
The model has now been trained for 15h and it seems to to its job quite well.
When I observed the training and validation loss though, the training loss is somewhat higher than the validation loss. (red curve: training, green: validation)
I use dropout to regularize my model and as far as I have understood, droput is is only applied during training which might be the reason.
Now Iam wondering if I have trained a valid model?
It doesn't seem like the model is heavily underfitted?
Thanks in advance for any advice,
cheers,
M
First, check whether you have good data set, i.e., if it is a classification, then get equal number of images for all classes and get it from same source not from different sources. And regularization, dropout are used for overfitting/High variance so don't worry about these.
Then, I think your model is doing good when you trained your model the initial error between them are different but as you increased the epochs then they both got into some steady path. So it is good. And may be reason for this is as I mentioned above or you should try shuffle them then using train_test_split for getting better distribution of training and validation sets.
A plot of learning curves shows a good fit if:
The plot of training loss decreases to a point of stability.
The plot of validation loss decreases to a point of stability and has a small gap with the training loss.
In your case these conditions are satisfied.
Still if you want to deal with High Bias/underfitting then here are few methods:
Train bigger models
Train longer. Use better optimization techniques
Try different Neural Network Architecture and also hyper parameters
And also you can use cross-validation or GridSearchCV for finding better optimizer or hyper parameters but it may take really long because you have to train it on different parameters each time considering your time which is 15 hours then it might be very long but you will find better parameters and then train on it.
Above all I think your model is doing okay.
If your model underfits, its performance will be lower, similar as in the case of overfitting, because actually it can not learn effectively to get the optimal result, i.e the proper function to fit the given distribution. So you have to use less regularization technique e.g. less dropout to get the optimal result.
Furthermore the sampling can also be crucial, because there can be training-validation subsets where your model performs well on validation set and less effective on training set and vice-versa. This is one of the reason why we use crossvalidation and different sampling methods e.g. stratified k-fold.

Why does more features in a random forest decrease accuracy dramatically?

I am using sklearn's random forests module to predict values based on 50 different dimensions. When I increase the number of dimensions to 150, the accuracy of the model decreases dramatically. I would expect more data to only make the model more accurate, but more features tend to make the model less accurate.
I suspect that splitting might only be done across one dimension which means that features which are actually more important get less attention when building trees. Could this be the reason?
Yes, the additional features you have added might not have good predictive power and as random forest takes random subset of features to build individual trees, the original 50 features might have got missed out. To test this hypothesis, you can plot variable importance using sklearn.
Your model is overfitting the data.
From Wikipedia:
An overfitted model is a statistical model that contains more parameters than can be justified by the data.
https://qph.fs.quoracdn.net/main-qimg-412c8556aacf7e25b86bba63e9e67ac6-c
There are plenty of illustrations of overfitting, but for instance, this 2d plot represents the different functions that would have been learned for a binary classification task. Because the function on the right has too many parameters, it learns wrongs data patterns that don't generalize properly.

Resources