Accuracy of fine-tuning BERT varied significantly based on epochs for intent classification task - nlp

I used BERT base uncased to produce embeddings and applied simple cosine similarity for intent classification on my dataset (around 400 classes and 2200 utterances, train:test=80:20). The base BERT model reaches about 60% accuracy on the test set, but fine-tuning for different numbers of epochs gave me quite unpredictable results.
These are my settings:
max_seq_length=150
train_batch_size=16
learning_rate=2e-5
These are my experiments:
base model accuracy=0.61
epochs=2.0 accuracy=0.30
epochs=5.0 accuracy=0.26
epochs=10.0 accuracy=0.15
epochs=50.0 accuracy=0.20
epochs=75.0 accuracy=0.92
epochs=100.0 accuracy=0.93
I don't understand why it behaved like this. I expected that fine-tuning for any number of epochs wouldn't be worse than the base model, because I fine-tuned and ran inference on the same dataset. Is there anything I misunderstand or should watch out for?

Well, generally you won't be able to feed in all the data in your training set at once (I am assuming you have a huge dataset), so you have to split it into mini-batches. As a result, the accuracy that is displayed is strongly influenced by the last mini-batch, or the last training step of the epoch.
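To make the point above concrete, here is a minimal sketch comparing the accuracy of the last mini-batch with the accuracy over the whole dataset. The function names and the toy predictions are hypothetical and not taken from the question; the only assumption is that each mini-batch yields a list of predicted intent ids and a list of true intent ids.

```python
def batch_accuracy(predictions, labels):
    """Fraction of predictions in one mini-batch that match the labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def dataset_accuracy(batches):
    """Accuracy over every example in every mini-batch (weighted by batch size)."""
    correct = 0
    total = 0
    for predictions, labels in batches:
        correct += sum(p == y for p, y in zip(predictions, labels))
        total += len(labels)
    return correct / total


# Hypothetical (predictions, labels) pairs for three mini-batches of intent ids.
batches = [
    ([0, 1, 2, 3], [0, 1, 2, 3]),  # 4 of 4 correct
    ([1, 1, 0, 2], [1, 1, 0, 3]),  # 3 of 4 correct
    ([2, 0], [3, 1]),              # 0 of 2 correct (small final batch)
]

last_batch_acc = batch_accuracy(*batches[-1])
full_acc = dataset_accuracy(batches)
print(f"last mini-batch accuracy: {last_batch_acc:.2f}")
print(f"full-dataset accuracy:    {full_acc:.2f}")
```

If the number you are tracking comes from a single batch, it can be far from the full-dataset figure, so it may be worth checking how the reported accuracy is actually computed.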

Related

Multilabel text classification with BERT and highly imbalanced training data

I'm trying to train a multilabel text classification model using BERT. Each piece of text can belong to 0 or more of a total of 485 classes. My model consists of a dropout layer and a linear layer added on top of the pooled output from the bert-base-uncased model from Hugging Face. The loss function I'm using is the BCEWithLogitsLoss in PyTorch.
I have millions of labeled observations to train on. But the training data are highly unbalanced, with some labels appearing in less than 10 observations and others appearing in more than 100K observations! I'd like to get a "good" recall.
My first attempt at training without adjusting for data imbalance produced a micro recall rate of 70% (good enough) but a macro recall rate of 45% (not good enough). These numbers indicate that the model isn't performing well on underrepresented classes.
How can I effectively adjust for the data imbalance during training to improve the macro recall rate? I see we can provide label weights to the BCEWithLogitsLoss loss function. But given the very high imbalance in my data, leading to weights in the range of 1 to 1M, can I actually get the model to converge? My initial experiments show that the weighted loss goes up and down during training.
Alternatively, is there a better approach than using BERT + dropout + linear layer for this type of task?
In your case it might be helpful to balance the labels in the training data. You have a lot of data, so you could afford to lose part of it by balancing. But before you do this, I recommend reading this answer about balancing classes in training data.
If you really only care about recall, you could try to tune your model to maximize recall.
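The gap between micro and macro recall described in the question can be reproduced on toy data. The sketch below is only an illustration with made-up labels: one frequent label that the model mostly finds and one rare label that it misses entirely. The helper names are hypothetical.

```python
def recall_per_label(y_true, y_pred, num_labels):
    """Per-label recall; a label with no positive examples yields None."""
    recalls = []
    for k in range(num_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[k] == 1 and p[k] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[k] == 1 and p[k] == 0)
        recalls.append(tp / (tp + fn) if (tp + fn) else None)
    return recalls


def micro_recall(y_true, y_pred, num_labels):
    """Recall computed after pooling true positives and false negatives of all labels."""
    tp = fn = 0
    for k in range(num_labels):
        tp += sum(1 for t, p in zip(y_true, y_pred) if t[k] == 1 and p[k] == 1)
        fn += sum(1 for t, p in zip(y_true, y_pred) if t[k] == 1 and p[k] == 0)
    return tp / (tp + fn)


def macro_recall(y_true, y_pred, num_labels):
    """Unweighted mean of per-label recalls (labels without positives are skipped)."""
    recalls = [r for r in recall_per_label(y_true, y_pred, num_labels) if r is not None]
    return sum(recalls) / len(recalls)


# Made-up data: label 0 is frequent (8 positives), label 1 is rare (2 positives).
y_true = [[1, 0]] * 8 + [[0, 1]] * 2
# The model finds 7 of the 8 frequent positives and none of the rare ones.
y_pred = [[1, 0]] * 7 + [[0, 0]] * 3

micro = micro_recall(y_true, y_pred, 2)
macro = macro_recall(y_true, y_pred, 2)
print(f"micro recall: {micro:.4f}")
print(f"macro recall: {macro:.4f}")
```

Micro recall is dominated by the frequent label, while macro recall gives every label the same weight, which is why rare labels pull it down so strongly.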

What is the proper way to save the fitted CNN model for MNIST dataset?

I developed a simple CNN model for the MNIST dataset and I got 98% validation accuracy. But after saving the model through Keras as model.h5 and evaluating the saved model in another Jupyter session, the performance of the model is poor and the predictions are random.
What needs to be done to get the same accuracy after saving the model and loading it in a different Jupyter notebook session?
(Consider sharing your code/results so the community can help you better).
I'm assuming you're using TensorFlow/Keras, so model.save('my_model.h5') after your model.fit(...) should save the model, including the trained parameters (transient training data such as gradients are not saved, which shouldn't affect the prediction capabilities of the model).
A number of things could cause a generalization gap like that, but...
Case 1: having a high training/validation accuracy and a low test (prediction) accuracy typically means your model overfit on the given training data.
I suggest adding some regularization to your training phase (dropout layers, cutout augmentation, L1/L2, etc.), training for fewer epochs or using early stopping, or using cross-validation or reshuffling the data to rule out overfitting.
Case 2: low intrinsic dataset variance, but unless you're using a subset of MNIST, this is unlikely. Make sure you are properly splitting your training/validation/test sets.
Again, it could be a number of issues, but these are the most common causes of poor model generalization. Post your code (specifying the architecture, optimizer, hyperparameters, data preprocessing, and test data used) so the answers can be more relevant to your problem.

Large dataset - ANN

I am trying to classify around 400K samples with 13 attributes. I have used Python sklearn's SVM package, but it didn't work, and then I learned that SVMs are not suitable for large dataset classification. Then I used the (sklearn) ANN using the following MLPClassifier:
MLPClassifier(solver='adam', alpha=1e-5, random_state=1,activation='relu', max_iter=500)
and trained the system using 200K samples, and tested the model on the remaining ones. The classification worked well. However, my concern is that the system is over trained or overfit. Can you please guide me on the number of hidden layers and node sizes to make sure that there is no overfit? (I have learned that the default implementation has 100 hidden neurons. Is it ok to use the default implementation as is?)
To know if you are overfitting, you have to compute:
Training set accuracy
Test set accuracy
Once you have calculated these scores, compare them. If the training set score is much better than the test set score, then you are overfitting. This means that your model is "memorizing" your data instead of learning from it to make future predictions.
If you are overfitting with neural networks, you probably have to reduce the number of layers and the number of neurons per layer. There isn't any strict rule for the number of layers or neurons you need for a given dataset size. Two datasets of the same size can behave completely differently.
So, to conclude, if you are overfitting, you would have to evaluate your model's accuracy with different numbers of layers and neurons, and then observe which values give the best results. There are methods you can use to find the best parameters, such as GridSearchCV.
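As a rough illustration of the comparison described above, the sketch below takes training and test accuracies for a few candidate architectures, discards those whose train/test gap looks too large, and keeps the one with the best test accuracy. The accuracy values, the 0.05 gap threshold, and the function names are all made up for the example; in practice the scores would come from fitting the model with each configuration (GridSearchCV automates that search with cross-validation).

```python
def overfitting_gap(train_accuracy, test_accuracy):
    """Difference between training and test accuracy; a large gap suggests overfitting."""
    return train_accuracy - test_accuracy


def pick_best(results, max_gap=0.05):
    """Pick the configuration with the best test accuracy among those with an acceptable gap.

    results maps a hidden-layer configuration to (train_accuracy, test_accuracy).
    If every configuration exceeds max_gap, fall back to all of them.
    """
    acceptable = {
        config: scores
        for config, scores in results.items()
        if overfitting_gap(*scores) <= max_gap
    }
    pool = acceptable or results
    return max(pool, key=lambda config: pool[config][1])


# Hypothetical scores for three hidden-layer configurations.
results = {
    (100,): (0.93, 0.91),      # small gap
    (200, 200): (0.99, 0.88),  # large gap: likely overfitting
    (20,): (0.85, 0.84),       # small gap, but lower accuracy overall
}

best = pick_best(results)
print("selected hidden layers:", best)
```

The threshold is a judgment call: what counts as "much better" on the training set depends on the problem and on how noisy the test score is.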

Sentiment analysis using images

I am trying sentiment analysis of images.
I have 4 classes: hilarious, funny, very funny, not funny.
I tried pre-trained models like VGG16/19 and DenseNet201, but my model is overfitting: training accuracy is more than 95% while testing accuracy is around 30%.
Can someone suggest what else I can try?
Training images - 6K
You can try the following to reduce overfitting:
Implement Early Stopping: compute the validation loss at each epoch and stop once it has not improved for a set number of epochs (the patience threshold)
Implement Cross Validation: refer to Section Cross-validation in https://cs231n.github.io/classification/#val
Use Batch Normalisation: normalises the activations of layers to unit variance and zero mean, which can improve model generalisation
Use Dropout (either on its own or together with batch norm): randomly zeros some activations to incentivise use of all neurons
Also, if your dataset isn't too challenging, make sure you don't use overly complex models and overkill the task.
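The early-stopping suggestion above can be sketched in a framework-independent way. This is a minimal illustration, assuming you record one validation loss per epoch; the function name, the `min_delta` parameter, and the loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch), both 0-indexed.

    Training stops at the first epoch where the validation loss has failed to
    improve by more than min_delta for `patience` consecutive epochs. If that
    never happens, stop_epoch is the last epoch in the list.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch


# Made-up validation losses: improvement stalls after epoch 2.
val_losses = [0.90, 0.70, 0.60, 0.62, 0.65, 0.63, 0.70, 0.80]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(f"stop at epoch {stop_epoch}, restore weights from epoch {best_epoch}")
```

In a real training loop you would also keep a copy of the weights from the best epoch and restore them after stopping.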

Transformers architecture for machine translation

I have adapted the base transformer model for my corpus of aligned Arabic-English sentences. So far the model has trained for 40 epochs, and accuracy (SparseCategoricalAccuracy) is improving by about 0.0004 per epoch.
I estimate that good results require a final accuracy of around 0.5, but the accuracy after 40 epochs is only 0.0592.
I am running the model on the tesla 2 p80 GPU. Each epoch is taking ~2690 sec.
This implies I need at least 600 epochs and training time would be 15-18 days.
Should I continue with the training or is there something wrong in the procedure as the base transformer in the research paper was trained on an ENGLISH-FRENCH corpus?
Key highlights:
Byte-pair encoding of sentences
Maxlen_len =100
batch_size= 64
No pre-trained embeddings were used.
Do you mean a Tesla K80 on an AWS p2.xlarge instance?
If that is the case, these GPUs are very slow. You should use p3 instances on AWS with V100 GPUs. You will get around a 6-7 times speedup.
Check out this for more details.
Also, if you are not using the standard model and have made some changes to the model or dataset, then try to tune the hyperparameters. The simplest thing is to try decreasing the learning rate and see if you get better results.
Also, first try to run the standard model with the standard dataset to benchmark the time taken in that case, and then make your changes and proceed. See when the model starts converging in the standard case. I feel that it should already give some results after 40 epochs.
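For reference, the time estimate in the question can be redone as a quick calculation. This is only a back-of-the-envelope sketch: it assumes accuracy keeps growing linearly at the observed rate, which is unlikely to hold in practice, and it treats the 6-7 times speedup mentioned above as a plain division of the epoch time. The function names are hypothetical.

```python
import math

SECONDS_PER_DAY = 86400


def epochs_needed(current_accuracy, target_accuracy, gain_per_epoch):
    """Additional epochs required under a (naive) linear extrapolation."""
    return math.ceil((target_accuracy - current_accuracy) / gain_per_epoch)


def training_days(epochs, seconds_per_epoch, speedup=1.0):
    """Wall-clock days for the given number of epochs, optionally with a faster GPU."""
    return epochs * seconds_per_epoch / speedup / SECONDS_PER_DAY


# Figures taken from the question.
remaining = epochs_needed(current_accuracy=0.0592, target_accuracy=0.5, gain_per_epoch=0.0004)
days_for_600 = training_days(600, seconds_per_epoch=2690)
days_for_remaining = training_days(remaining, seconds_per_epoch=2690)
days_with_speedup = training_days(remaining, seconds_per_epoch=2690, speedup=6.0)

print(f"epochs still needed (linear extrapolation): {remaining}")
print(f"days for 600 epochs: {days_for_600:.1f}")
print(f"days for the extrapolated epochs: {days_for_remaining:.1f}")
print(f"same, assuming a 6x faster GPU: {days_with_speedup:.1f}")
```

Either way the run takes weeks at the current rate, which supports checking the setup against the standard model and dataset before committing to it.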

Resources