Training accuracy high, but prediction on the training set is low - Keras

I am using Keras 2.2.4 to train a text-based emotion recognition model for a three-category classification task.
I put accuracy in the model's compile metrics. While training, the accuracy shown in the progress bar was normal, around 86%, and the loss was around 0.35, so I believed the model was working properly.
After training, I found that prediction on the test set was pretty bad: accuracy was only around 0.44. My instinct was that it might be an overfitting issue. However, out of curiosity, I ran the training set through model prediction, and the accuracy was just as bad as on the test set.
This result suggests it is not an overfitting issue, and I cannot come up with any possible reason why this happens. Also, the difference between the accuracy reported while training on the training set and the accuracy on the same training set after training is even more confusing.
Has anyone encountered the same situation, and what might the problem be?
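A quick way to narrow this down is to re-evaluate the model on the exact arrays that were passed to fit(). A minimal diagnostic sketch, assuming x_train/y_train are the same preprocessed arrays used for training:

# Re-evaluate on the exact training arrays. If this disagrees badly with
# the accuracy printed during fit(), suspect a preprocessing mismatch or
# layers that behave differently at inference time (e.g. dropout, batch
# norm) rather than overfitting.
loss, acc = model.evaluate(x_train, y_train, verbose=0)
print('training-set accuracy after training:', acc)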

Related

What is the proper way to save a fitted CNN model for the MNIST dataset?

I developed a simple CNN model for the MNIST dataset and got 98% validation accuracy. But after saving the model through Keras as model.h5 and evaluating the saved model in another Jupyter session, the performance of the model is poor and the predictions are random.
What needs to be done to get the same accuracy after saving and loading the model in a different Jupyter notebook session?
(Consider sharing your code/results so the community can help you better).
I'm assuming you're using Tensorflow/Keras, so model.save('my_model.h5') after your model.fit(...) should save the model, including the trained parameters (by default it also saves the optimizer state, i.e. the data needed to resume training, though that shouldn't affect the prediction capabilities of the model).
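A minimal sketch of the save-and-reload round trip, assuming a tf.keras model and that x_test/y_test are preprocessed exactly as in the original session:

from tensorflow import keras

# In the training session: persist architecture + weights in one file.
model.save('my_model.h5')

# In the new Jupyter session: restore the model before predicting.
model = keras.models.load_model('my_model.h5')
loss, acc = model.evaluate(x_test, y_test)  # should match the original run

If the reloaded model predicts at random, the usual culprit is preprocessing that differs between sessions (e.g. forgetting to rescale pixel values) rather than the save itself.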
A number of things could cause a generalization gap like that, but...
Case 1: having a high training/validation accuracy and a low test (prediction) accuracy typically means your model overfit on the given training data.
I suggest adding some regularization to your training phase (dropout layers, cutout augmentation, L1/L2, etc.), training for fewer epochs or with early stopping, or using cross-validation/data reshuffling to rule out overfitting.
Case 2: low intrinsic dataset variance, but unless you're using a subset of MNIST, this is unlikely. Make sure you are properly splitting your training/validation/test sets.
Again, it could be a number of issues, but these are the most common causes of poor model generalization. Post your code (specifying the architecture, optimizer, hyperparameters, data preprocessing, and test data used) so the answers can be more relevant to your problem.
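As a concrete illustration of the regularization suggestions above, here is a hedged sketch in tf.keras; the layer sizes, L2 strength, and dropout rate are illustrative placeholders, not a recommendation:

from tensorflow import keras
from tensorflow.keras import layers, regularizers

# Dropout plus L2 weight decay on a small MNIST-style classifier.
model = keras.Sequential([
    layers.Flatten(input_shape=(28, 28)),
    layers.Dense(128, activation='relu',
                 kernel_regularizer=regularizers.l2(1e-4)),
    layers.Dropout(0.5),
    layers.Dense(10, activation='softmax'),
])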

After the 18th epoch, the training loss and validation loss suddenly rise from 0.17 to 0.5 and never come down again; training has effectively stopped

I'm training a deep learning model using tensorflow.keras. The loss function is triplet loss. The optimizer is Adam, with a learning rate of 0.00005.
Initially the training loss was 0.38 and it started converging slowly. At the 17th epoch the val_loss became 0.1705. Then suddenly, in the 18th epoch, both the training loss and val_loss became 0.5, and this continued for 5-6 epochs. The loss values didn't change.
Any insight on this behavior would be great. Thank you.
For some models, the loss goes down during training, but at some point an update is applied to a variable that drastically changes the model's output. If this is the case, sometimes the model is not able to recover from this new state, and the best it can do from then on is to output 0.5 for every input (I assume you are trying to do a binary classification task).
Why can these erroneous updates happen even though they are bad for your model? Because updates are made by gradient descent, which uses only the first derivative. This means the model knows how it needs to change a specific variable, but only very close to that variable's current value. If your learning rate is too high, an update can be too big, and a single gradient descent step can be very bad for your model's performance.
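If a too-large step is the suspect, one common mitigation is to clip the gradient norm so a single bad batch cannot move the weights too far. A minimal sketch, assuming triplet_loss is the existing loss function defined elsewhere (the clipnorm value is illustrative):

from tensorflow import keras

# Clip the global gradient norm; combined with the already-small learning
# rate, this damps destructive updates from a single bad batch.
optimizer = keras.optimizers.Adam(learning_rate=5e-5, clipnorm=1.0)
model.compile(optimizer=optimizer, loss=triplet_loss)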
I read the first case as training loss much greater than validation loss; that is underfitting.
I read the second case as training loss much less than validation loss; that is overfitting.
I'm not an expert, but my assumptions have been:
Typically, validation loss should be similar to, but slightly higher than, training loss. As long as validation loss is lower than or even equal to training loss, one should keep training.
If training loss is decreasing without an increase in validation loss, then again, keep training.
If validation loss starts increasing, it is time to stop (see the early-stopping sketch after this list).
If the overall accuracy is still not acceptable, review the mistakes the model is making and think about what you could change:
More data? More / different data augmentations? Generative data?
A different architecture?
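The "stop when validation loss starts increasing" rule above is exactly what Keras's EarlyStopping callback automates. A minimal sketch, assuming x_train/y_train and x_val/y_val are already prepared (the patience value is illustrative):

from tensorflow import keras

# Stop once val_loss has not improved for 3 consecutive epochs, and roll
# the weights back to the best epoch seen.
early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=3,
                                           restore_best_weights=True)
model.fit(x_train, y_train, validation_data=(x_val, y_val),
          epochs=100, callbacks=[early_stop])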
Underfitting - validation and training error both high
Overfitting - validation error high, training error low
Good fit - validation error low, slightly higher than the training error
Unknown fit - validation error low, training error 'high'

Can I use BERT as a feature extractor without any fine-tuning on my specific dataset?

I'm trying to solve a multilabel classification task with 10 classes, using a relatively balanced training set of ~25K samples and an evaluation set of ~5K samples.
I'm using the huggingface transformers model:
model = transformers.BertForSequenceClassification.from_pretrained(...)
and obtain quite nice results (ROC AUC = 0.98).
However, I'm seeing some odd behavior that I can't make sense of.
I add the following lines of code:
for param in model.bert.parameters():
    param.requires_grad = False
while making sure that the other layers of the model are learned, that is:
[param[0] for param in model.named_parameters() if param[1].requires_grad == True]
gives
['classifier.weight', 'classifier.bias']
Training the model when configured like so, yields some embarrassingly poor results (ROC AUC = 0.59).
I was working under the assumption that an out-of-the-box pre-trained BERT model (without any fine-tuning) should serve as a relatively good feature extractor for the classification layers. So, where did I go wrong?
From my experience, you are going wrong in your assumption
an out-of-the-box pre-trained BERT model (without any fine-tuning) should serve as a relatively good feature extractor for the classification layers.
I have noticed similar experiences when trying to use BERT's output layer as a word embedding value with little-to-no fine-tuning, which also gave very poor results; and this also makes sense, since you effectively have 768*num_classes connections in the simplest form of output layer. Compared to the millions of parameters of BERT, this gives you an almost negligible amount of control over intense model complexity. However, I also want to cautiously point to overfitted results when training your full model, although I'm sure you are aware of that.
The entire idea of BERT is that it is very cheap to fine-tune your model, so to get ideal results, I would advise against freezing any of the layers. The one instance in which it can be helpful to freeze at least some layers is the embedding component, depending on the model's vocabulary size (~30k for BERT-base).
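If you do want to try that, here is a hedged sketch of freezing only the embedding component while leaving the encoder and classifier trainable, following huggingface's BertForSequenceClassification module layout:

# Freeze only the input embeddings; the encoder layers and the classifier
# head remain trainable.
for param in model.bert.embeddings.parameters():
    param.requires_grad = False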
I think the following will help in demystifying the odd behavior I reported here earlier –
First, as it turned out, when freezing the BERT layers (and using an out-of-the-box pre-trained BERT model without any fine-tuning), the number of training epochs required for the classification layer is far greater than that needed when allowing all layers to be learned.
For example,
Without freezing the BERT layers, I’ve reached:
ROC AUC = 0.98, train loss = 0.0988, validation loss = 0.0501 # end of epoch 1
ROC AUC = 0.99, train loss = 0.0484, validation loss = 0.0433 # end of epoch 2
Overfitting, train loss = 0.0270, validation loss = 0.0423 # end of epoch 3
Whereas, when freezing the BERT layers, I’ve reached:
ROC AUC = 0.77, train loss = 0.2509, validation loss = 0.2491 # end of epoch 10
ROC AUC = 0.89, train loss = 0.1743, validation loss = 0.1722 # end of epoch 100
ROC AUC = 0.93, train loss = 0.1452, validation loss = 0.1363 # end of epoch 1000
The (probable) conclusion that arises from these results is that working with an out-of-the-box pre-trained BERT model as a feature extractor (that is, freezing its layers) while learning only the classification layer suffers from underfitting.
This is demonstrated in two ways:
First, after running 1000 epochs, the model still hasn’t finished learning (the training loss is still higher than the validation loss).
Second, after running 1000 epochs, the loss values are still higher than the values achieved with the non-frozen version as early as the 1st epoch.
To sum it up, @dennlinger, I think I completely agree with you on this:
The entire idea of BERT is that it is very cheap to fine-tune your model, so to get ideal results, I would advise against freezing any of the layers.

Sentiment analysis using images

I am trying sentiment analysis of images.
I have 4 classes - hilarious, funny, very funny, not funny.
I tried pre-trained models like VGG16/19 and DenseNet201, but my model is overfitting: training accuracy is above 95% while test accuracy is around 30%.
Can someone suggest what else I can try?
Training images - 6K
You can try the following to reduce overfitting:
Implement early stopping: compute the validation loss at each epoch and use a patience threshold to decide when to stop.
Implement cross-validation: refer to the Cross-validation section in https://cs231n.github.io/classification/#val
Use batch normalisation: it normalises layer activations to zero mean and unit variance, which improves model generalisation.
Use dropout (alone or together with batch norm): it randomly zeroes some activations, which incentivises the network to use all of its neurons.
Also, if your dataset isn't too challenging, make sure you don't use an overly complex model and overkill the task.
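For the ~6K-image case above, a common recipe is to freeze the pre-trained base entirely and train only a small regularised head. A hedged sketch with tf.keras (the input size, dropout rate, and optimizer are illustrative assumptions):

from tensorflow import keras
from tensorflow.keras import layers

# Keep the ImageNet features fixed; with only ~6K images, fine-tuning the
# whole network is a classic recipe for overfitting.
base = keras.applications.VGG16(include_top=False, weights='imagenet',
                                input_shape=(224, 224, 3))
base.trainable = False

model = keras.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dropout(0.5),                     # regularise the head
    layers.Dense(4, activation='softmax'),   # 4 sentiment classes
])
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])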

Model underfitting

I have trained a model and it took me quite a while to find the correct hyperparameters.
The model has now been trained for 15h and it seems to do its job quite well.
When I observed the training and validation loss, though, the training loss was somewhat higher than the validation loss (red curve: training, green: validation).
I use dropout to regularize my model, and as far as I have understood, dropout is only applied during training, which might be the reason.
Now I am wondering: have I trained a valid model? It doesn't seem like the model is heavily underfitted, does it?
Thanks in advance for any advice,
cheers,
M
First, check whether you have a good dataset, i.e., if it is a classification task, get an equal number of images for all classes, and get them from the same source rather than from different sources. Regularization and dropout are used against overfitting/high variance, so don't worry about those here.
That said, I think your model is doing fine: when you trained it, the initial gap between the two errors was larger, but as the epochs increased they both settled onto a steady path, which is good. The reason may be what I mentioned above, or you could try shuffling the data and then using train_test_split to get a better distribution across the training and validation sets.
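A minimal sketch of that suggestion with scikit-learn, assuming x and y hold the full dataset; stratify keeps the class proportions identical in both splits:

from sklearn.model_selection import train_test_split

# Shuffle, then split so that the training and validation sets share the
# same class distribution.
x_train, x_val, y_train, y_val = train_test_split(
    x, y, test_size=0.2, shuffle=True, stratify=y, random_state=42)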
A plot of learning curves shows a good fit if:
The plot of training loss decreases to a point of stability.
The plot of validation loss decreases to a point of stability and has a small gap with the training loss.
In your case these conditions are satisfied.
Still, if you want to deal with high bias/underfitting, here are a few methods:
Train bigger models
Train longer and use better optimization techniques
Try different neural network architectures and hyperparameters
You can also use cross-validation or GridSearchCV to find a better optimizer or better hyperparameters, but it may take really long: you would have to train on each parameter combination, and given that a single run takes 15 hours, the search could be very slow. Still, you would end up with better parameters to train on.
Above all I think your model is doing okay.
If your model underfits, its performance will be lower, similarly to the case of overfitting, because it cannot learn effectively enough to reach the optimal result, i.e. the proper function to fit the given distribution. In that case you have to use less regularization, e.g. less dropout, to get the optimal result.
Furthermore, the sampling can also be crucial: there can be training/validation splits where your model performs well on the validation set and less well on the training set, and vice versa. This is one of the reasons why we use cross-validation and different sampling methods, e.g. stratified k-fold.
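A short sketch of stratified k-fold with scikit-learn, assuming x and y are numpy arrays holding the full dataset; every fold preserves the overall class proportions:

from sklearn.model_selection import StratifiedKFold

# 5 folds; each validation fold mirrors the overall class distribution.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in skf.split(x, y):
    x_train, x_val = x[train_idx], x[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]
    # build, train, and evaluate a fresh model on this fold ...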
