Training and validation accuracy are both close to 0.87, but on the test set the evaluate() function gives fluctuating results depending on the batch_size value: test accuracy varies from 0.5 to 0.66. Does the batch_size for evaluate() have to be the same as the one used in fit()?

I don't see how the batch size parameter of the evaluate function can change the accuracy of your model. Only the batch size used during training can affect the performance of your model (see this). Are you testing the same trained model in your different tests? If you are testing a newly trained model every time, that explains the variation in accuracy you observe (because of the random initialization of the weights, for example).
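As a rough illustration of why the evaluation batch size should not matter for a fixed model, here is a minimal sketch. It assumes accuracy is accumulated as a count of correct predictions over all samples; the helper batched_accuracy and the random labels are made up for the example and are not Keras code.

```python
import numpy as np

def batched_accuracy(y_true, y_pred, batch_size):
    """Accumulate correct predictions batch by batch, then divide by the sample count."""
    correct = 0
    for start in range(0, len(y_true), batch_size):
        stop = start + batch_size
        correct += int(np.sum(y_true[start:stop] == y_pred[start:stop]))
    return correct / len(y_true)

# Made-up labels and predictions standing in for a fixed, already trained model.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
y_pred = rng.integers(0, 2, size=1000)

# The same predictions scored with several different batch sizes.
accuracies = {bs: batched_accuracy(y_true, y_pred, bs) for bs in (1, 7, 32, 1000)}
print(accuracies)
```

If the numbers differ between runs in practice, the cause is more likely a different model or different data per run than the batch size itself.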


Data augmentation affects convergence speed

Data augmentation is surely a great regularization method, and it improves my accuracy on the unseen test set. However, I do not understand why it reduces the convergence speed of the network? I know each epoch takes a longer time to train since image transformations are applied on the fly. But why does it affect the convergence? For my current setup, the network hits a 100% training accuracy after 5 epochs without data augmentation (and clearly overfits) - with data augmentation, it takes 23 epochs to hit 95% training accuracy and never seems to hit 100%.
Any links to research papers or comments on the reasoning behind this?
I guess you are evaluating accuracy on the train set, right? And it is a mistake...
Without augmentation your network simply overfits. You have a predefined number of images, for instance, 1000, and your network during training can easily memorize dataset labels. And you are evaluating the model on the fixed (not augmented) dataset.
When you are training your network with data augmentation, basically, you are training a model on a dataset of infinite size. You are doing augmentation on the fly, which means that the model "sees" new images every time, and it cannot memorize them perfectly with 100% accuracy. And you are evaluating the model on the augmented (infinite) dataset.
When you train your model with and without augmentation, you evaluate it on the different datasets, so it is not correct to compare their accuracy.
Piece of advice:
Do not look at train set accuracy; it is simply misleading when you use augmentations. Instead, evaluate your model on the test set (or validation set), which is not augmented. By doing this, you'll see the real accuracy increase for your model.
P.S. If you want to find out more about image augmentations, I really recommend you check this guide - https://notrocketscience.blog/complete-guide-to-data-augmentation-for-computer-vision/

Can I use BERT as a feature extractor without any finetuning on my specific data set?

I'm trying to solve a multilabel classification task of 10 classes with a relatively balanced training set consisting of ~25K samples and an evaluation set consisting of ~5K samples.
I'm using the huggingface:
model = transformers.BertForSequenceClassification.from_pretrained(...
and obtain quite nice results (ROC AUC = 0.98).
However, I'm witnessing some odd behavior which I can't make sense of.
I add the following lines of code:
for param in model.bert.parameters():
    param.requires_grad = False
while making sure that the other layers of the model are learned, that is:
[name for name, param in model.named_parameters() if param.requires_grad]
['classifier.weight', 'classifier.bias']
Training the model configured this way yields some embarrassingly poor results (ROC AUC = 0.59).
I was working under the assumption that an out-of-the-box pre-trained BERT model (without any fine-tuning) should serve as a relatively good feature extractor for the classification layers. So, where did I go wrong?
From my experience, you are going wrong in your assumption
an out-of-the-box pre-trained BERT model (without any fine-tuning) should serve as a relatively good feature extractor for the classification layers.
I have had similar experiences when trying to use BERT's output layer as a word embedding value with little-to-no fine-tuning, which also gave very poor results. This makes sense, since you effectively have only 768*num_classes connections in the simplest form of output layer. Compared to the millions of parameters of BERT, this gives you an almost negligible amount of control over a very complex model. However, I also want to cautiously point to possibly overfitted results when training your full model, although I'm sure you are aware of that.
The entire idea of BERT is that it is very cheap to fine-tune your model, so to get ideal results, I would advise against freezing any of the layers. The one instance in which it can be helpful to freeze at least some layers would be the embedding component, depending on the model's vocabulary size (~30k for BERT-base).
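To put the 768*num_classes figure in perspective, here is a small back-of-the-envelope sketch. The hidden size of 768 and the 10 classes come from this thread; the total of roughly 110 million parameters for BERT-base is the commonly quoted approximate figure, used here as an assumption rather than an exact count.

```python
hidden_size = 768   # width of the BERT-base output fed to the classifier
num_classes = 10    # number of labels in the question

# A single linear classification layer: one weight per connection plus one bias per class.
head_params = hidden_size * num_classes + num_classes

# Approximate size of BERT-base (assumption: the commonly quoted ~110M figure).
bert_base_params = 110_000_000

# Share of all parameters that can still be updated once the BERT layers are frozen.
trainable_fraction = head_params / (head_params + bert_base_params)

print(head_params)
print(f"{trainable_fraction:.6%}")
```

With the encoder frozen, fewer than one in ten thousand parameters can still be updated.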
I think the following will help in demystifying the odd behavior I reported here earlier –
First, as it turned out, when freezing the BERT layers (and using an out-of-the-box pre-trained BERT model without any fine-tuning), the number of training epochs required for the classification layer is far greater than that needed when allowing all layers to be learned.
For example,
Without freezing the BERT layers, I’ve reached:
ROC AUC = 0.98, train loss = 0.0988, validation loss = 0.0501 # end of epoch 1
ROC AUC = 0.99, train loss = 0.0484, validation loss = 0.0433 # end of epoch 2
Overfitting, train loss = 0.0270, validation loss = 0.0423 # end of epoch 3
Whereas, when freezing the BERT layers, I’ve reached:
ROC AUC = 0.77, train loss = 0.2509, validation loss = 0.2491 # end of epoch 10
ROC AUC = 0.89, train loss = 0.1743, validation loss = 0.1722 # end of epoch 100
ROC AUC = 0.93, train loss = 0.1452, validation loss = 0.1363 # end of epoch 1000
The (probable) conclusion that arises from these results is that working with an out-of-the-box pre-trained BERT model as a feature extractor (that is, freezing its layers) while learning only the classification layer suffers from underfitting.
This is demonstrated in two ways:
First, after running 1000 epochs, the model still hasn’t finished learning (the training loss is still higher than the validation loss).
Second, after running 1000 epochs, the loss values are still higher than the values achieved with the non-frozen version as early as the first epoch.
To sum it up, #dennlinger, I think I completely agree with you on this:
The entire idea of BERT is that it is very cheap to fine-tune your model, so to get ideal results, I would advise against freezing any of the layers.

Loss function negative log likelihood giving loss despite perfect accuracy

I am debugging a sequence-to-sequence model and purposely tried to perfectly overfit a small dataset of ~200 samples (sentence pairs of length between 5 and 50). I am using negative log-likelihood loss in PyTorch. I get a low loss (~1e-5), but the accuracy on the same dataset is only 33%.
I also trained the model on 3 samples and obtained 100% accuracy, yet the loss during training was still nonzero (in the same region of ~1e-5). I was under the impression that negative log-likelihood only gives a loss if there is a mismatch between the predicted and target labels.
Is a bug in my code likely?
There is no bug in your code.
The way things usually work in deep nets is that the network predicts logits (i.e., unnormalized log-probabilities). These logits are then transformed into probabilities using soft-max (or a sigmoid function). Cross-entropy is finally evaluated on the predicted probabilities.
The advantage of this approach is that it is numerically stable and easy to train with. On the other hand, because of the soft-max you can never have "perfect" 0/1 probabilities for your predictions: even when your network has perfect accuracy it will never assign probability 1 to the correct prediction, only something "close to one". As a result, the loss will always be positive (albeit small).
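A minimal sketch of this effect in plain Python, assuming a single sample with three classes. The logits are made up for illustration and the softmax helper is written out by hand rather than taken from PyTorch.

```python
import math

def softmax(logits):
    """Numerically stable soft-max: shift by the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for one sample; class 0 is the target and clearly wins.
logits = [8.0, 0.0, -1.0]
target = 0

probs = softmax(logits)
predicted = probs.index(max(probs))   # argmax gives the predicted class
loss = -math.log(probs[target])       # negative log-likelihood of the target

print(predicted == target)  # the prediction is correct
print(probs[target])        # close to 1, but not exactly 1
print(loss)                 # small, but strictly positive
```

The prediction is right, yet the loss stays above zero because the other classes still receive a small share of the probability mass.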

For a regression model, why does the validation set passed to model.fit give a different metric result than model.evaluate?

I have a regression model with Euclidean distance as the loss function and RMSE as the evaluation metric (lower is better). When I pass my train and test sets to model.fit I get train_rmse and test_rmse values that make sense to me. But when I pass the test set to model.evaluate after loading the weights of the trained model, I get different results, approximately twice the values from model.fit. I am aware of the difference that should appear between the train evaluation and the test evaluation, as I know from Keras that:
the training loss is the average of the losses over each batch of training data. Because your model is changing over time, the loss over the first batches of an epoch is generally higher than over the last batches. On the other hand, the testing loss for an epoch is computed using the model as it is at the end of the epoch, resulting in a lower loss.
But here I am talking about the result for the test set passed to model.fit, which I believed is evaluated on the final model. The Keras documentation says this about the validation argument, which is where I pass the test set:
validation_data: Data on which to evaluate the loss and any model metrics at the end of each epoch. The model will not be trained on this data.
When I searched for the problem I found several issues:
1- Some people (like here) report that the issue is with the model itself if it has batch normalization layers, or if you do transfer learning and freeze some BN layers (like here). My model has BN layers, and I did not freeze any layer. Also, I used the same model for a multi-class classification problem (not regression), and there the results for the test set in model.fit and model.evaluate were the same.
2- Other people say that this is related to either the prediction or the metric calculation (like here): they found that the difference comes from y_true and y_pred having different dtypes, for example one being float32 and the other float64, which changes the metric calculation. When they unified the dtypes the problem was fixed.
I believed that the last case applied to me, since in the regression task my labels are now tf.float32. My y_true labels are already cast to tf.float32 through the tfrecord, so I tried casting y_pred to tf.float32 before the RMSE calculation, and I still get the difference in the results.
So my questions are:
Why is there this difference in the results?
Which should I rely on for the test set: the model.fit result or model.evaluate?
I know that for training loss and accuracy, Keras keeps a running average over the batches, and I know that for model.evaluate, these metrics are calculated by passing over the whole dataset once with the final model. But how are the validation loss and accuracy calculated for the validation set passed to model.fit?
The problem was a shape mismatch between y_true and y_pred. I save the y_true label in the tfrecords as a single float value, so it ends up with shape [batch_size], while the regression model outputs predictions of shape [batch_size, 1]. The result of tf.subtract(y_true, y_pred) in the RMSE equation is then broadcast to a matrix of shape [batch_size, batch_size]. After taking the mean of that matrix you would never guess anything is wrong: the code does not throw any error, but the RMSE calculation is incorrect. I am still working on making the shapes consistent but haven't found a good solution yet.
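A minimal sketch of this broadcasting pitfall, using NumPy as a stand-in for the TensorFlow ops (both follow the same broadcasting rules for this case). The values are made up, and flattening the predictions is only one possible way to align the shapes; in TensorFlow something like tf.squeeze(y_pred, axis=-1) or tf.reshape(y_true, [-1, 1]) should play the same role.

```python
import numpy as np

# Made-up batch of 4 labels and 4 perfect predictions.
y_true = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)          # shape (4,)
y_pred = np.array([[1.0], [2.0], [3.0], [4.0]], dtype=np.float32)  # shape (4, 1)

# Silent broadcasting: (4,) minus (4, 1) yields a (4, 4) matrix of all pairwise differences.
diff_broadcast = y_true - y_pred
rmse_wrong = float(np.sqrt(np.mean(diff_broadcast ** 2)))

# Aligning the shapes first gives the intended element-wise differences.
diff_aligned = y_true - y_pred.reshape(-1)
rmse_right = float(np.sqrt(np.mean(diff_aligned ** 2)))

print(diff_broadcast.shape, rmse_wrong)  # a 4x4 matrix and a nonzero "RMSE"
print(diff_aligned.shape, rmse_right)    # a length-4 vector and an RMSE of zero
```

Even though every prediction is exact, the broadcast version reports a nonzero error, which matches the kind of silent inflation described above.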

Why is the loss used to calculate accuracy in Keras model.evaluate()?

It may be a stupid question but:
I noticed that the choice of the loss function modifies the accuracy obtained during evaluation.
I thought that the loss was used only during training. Of course the quality of the model's predictions depends on it, but not the accuracy, i.e. the number of right predictions over the total number of samples.
I didn't explain myself correctly.
My question comes because I recently trained a model with binary_crossentropy loss and the accuracy coming from model.evaluate() was 96%.
But it wasn't correct!
I checked "manually" and the model was getting only 44% of the total predictions right. Then I changed to categorical_crossentropy and then the accuracy was correct.
I have found the problem. metrics=['accuracy'] chooses the accuracy metric automatically from the cost function. So using binary_crossentropy shows binary accuracy, not categorical accuracy. Using categorical_crossentropy automatically switches to categorical accuracy, and now it is the same as calculated manually using model1.predict().
Keras chooses the performance metric to use based on your loss function. When you use binary_crossentropy it also uses binary_accuracy, which is computed differently from categorical_accuracy. For single-label classification with more than one output neuron, you should use categorical_crossentropy.
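To make the difference concrete, here is a minimal NumPy sketch of the two metrics as they are usually defined: binary accuracy thresholds every output at 0.5 and compares element by element, while categorical accuracy compares the argmax per sample. The data is made up and the arithmetic is an illustration of the definitions, not Keras's own code.

```python
import numpy as np

# Made-up one-hot labels for 4 samples and 4 classes (sample i belongs to class i).
y_true = np.eye(4)

# Made-up model output: every sample gets the same prediction, favouring class 0.
y_pred = np.tile([0.6, 0.2, 0.1, 0.1], (4, 1))

# Binary accuracy: threshold each output at 0.5 and count matching entries.
binary_acc = float(np.mean((y_pred > 0.5) == (y_true > 0.5)))

# Categorical accuracy: the predicted class is the argmax of each row.
categorical_acc = float(np.mean(np.argmax(y_pred, axis=1) == np.argmax(y_true, axis=1)))

print(binary_acc)       # 0.625
print(categorical_acc)  # 0.25
```

Only one sample in four is classified correctly, but the element-wise metric is inflated by all the zeros that are trivially "predicted" right.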
The model tries to minimize the loss function chosen. It adjusts the weights to do this. A different loss function results in different weights.
Those weights determine how many correct predictions are made over the total number of samples. So it is correct behavior to see that the loss function chosen will affect the model accuracy.
