BERT Ensemble shared linear layer worse than individual model

I fine-tuned two bert-base models, initialized with different weights, on the same dataset. I then tried to combine the two fine-tuned models via a shared linear layer. Supposing there is no problem in my code, is it possible for this combination to perform worse during training, and hence on a test set, than the individual models? This is my situation.

No. The shared linear layer is essentially an ensemble method, which combines two "weak" models into a single stronger model. The parameters for this combination are learned to optimize performance on the training set, so unless the shared layer is designed in such a way that it doesn't actually use the input features, its training performance should in principle be at least as good as that of the better of the two ensembled models. This is because, at a minimum, the shared layer can learn to output exactly the result of the better model and ignore the other one. Of course, it is still possible to see worse test performance, since the distribution of the test data may differ.
Some possible causes of your issue:
Initialization that leaves training stuck in a poor local optimum
A different activation function in the shared layer
Other hyperparameter settings
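To make the "copy the better model" argument concrete, here is a minimal sketch. It assumes the shared layer is applied to the concatenated logits of the two models, which the question does not actually state; LinearEnsembleHead and the random tensors are hypothetical stand-ins for the real fine-tuned BERT outputs.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


class LinearEnsembleHead(nn.Module):
    """Combine the logits of two (frozen) models with one shared linear layer."""

    def __init__(self, num_labels: int):
        super().__init__()
        self.combine = nn.Linear(2 * num_labels, num_labels)
        # Start as "copy model A, ignore model B": W = [I | 0], b = 0.
        with torch.no_grad():
            self.combine.weight.zero_()
            self.combine.weight[:, :num_labels] = torch.eye(num_labels)
            self.combine.bias.zero_()

    def forward(self, logits_a: torch.Tensor, logits_b: torch.Tensor) -> torch.Tensor:
        return self.combine(torch.cat([logits_a, logits_b], dim=-1))


# Hypothetical stand-ins for the logits of the two fine-tuned models.
num_labels = 3
logits_a = torch.randn(8, num_labels)
logits_b = torch.randn(8, num_labels)

head = LinearEnsembleHead(num_labels)
combined = head(logits_a, logits_b)  # equals logits_a before any training step
```

With this initialization the ensemble starts at exactly the training loss of model A, so if it still ends up worse after training, the optimization setup (learning rate, initialization, activation) is a more likely suspect than the idea itself.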

Related

Improve the neural network by analyzing the loss curve

I built a network based on LSTM and tuned its parameters. The results are shown in the figure and are not impressive.
How can I tell what is wrong? Is the dataset bad, or is the network not well built?

Since the validation loss decreased initially and later increased, what you're experiencing is overfitting.
Since the training loss kept decreasing, your model has fit the training set too closely and no longer generalizes well, which is why the validation loss increased.
To avoid overfitting, you need to regularize your model. You can use L1 or L2 regularization. Additionally, you can try dropout in your model.
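As a rough sketch of what that could look like in PyTorch (the question does not show its code, so the SmallLSTM class, layer sizes and hyperparameter values below are all assumptions): dropout goes into the model, and L2 regularization is applied through the optimizer's weight_decay argument.

```python
import torch
import torch.nn as nn


class SmallLSTM(nn.Module):
    """Hypothetical LSTM regressor with dropout; all sizes are illustrative."""

    def __init__(self, input_size: int = 8, hidden_size: int = 32, dropout: float = 0.3):
        super().__init__()
        # Dropout between stacked LSTM layers only takes effect with num_layers > 1.
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers=2,
                            dropout=dropout, batch_first=True)
        self.drop = nn.Dropout(dropout)
        self.out = nn.Linear(hidden_size, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        seq_out, _ = self.lstm(x)
        # Predict from the last time step of the sequence.
        return self.out(self.drop(seq_out[:, -1]))


model = SmallLSTM()
# weight_decay in Adam adds an L2 penalty on the parameters.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

# One training step on random placeholder data.
x = torch.randn(4, 10, 8)  # (batch, time steps, features)
y = torch.randn(4, 1)
loss = nn.functional.mse_loss(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```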
Now coming to your question:
If the dataset is of good quality, i.e. it is well annotated and contains features that can actually predict the target, then the dataset and the model together determine the quality of the predictions.
Since you're using RNNs, which have a large number of parameters, make sure the dataset is large enough to keep the RNN from overfitting. If the available dataset is small, start with a small network with fewer parameters and gradually scale the model up until you're satisfied with the prediction scores.
You can also refer to this: https://towardsdatascience.com/rnn-training-tips-and-tricks-2bf687e67527

How can/should we weight classes in HuggingFace token classification (entity recognition)?

I'm training a token classification (AKA named entity recognition) model with the HuggingFace Transformers library, with a customized data loader.
Like most NER datasets (I'd imagine?) there's a pretty significant class imbalance: A large majority of tokens are other - i.e. not an entity - and of course there's a little variation between the different entity classes themselves.
As we might expect, my "accuracy" metrics are getting distorted quite a lot by this: It's no great achievement to get 80% token classification accuracy if 90% of your tokens are other... A trivial model could have done better!
I can calculate some additional and more insightful evaluation metrics - but it got me wondering... Can/should we somehow incorporate these weights into the training loss? How would this be done using a typical *ForTokenClassification model e.g. BertForTokenClassification?

This is actually a really interesting question, since it seems the library does not (yet) intend for you to modify the losses inside the models yourself. Specifically for BertForTokenClassification, I found this code segment:
loss_fct = CrossEntropyLoss()
# ...
loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
To actually change the loss computation and add other parameters, e.g., the weights you mention, you can go one of two ways:
You can modify a copy of transformers locally, and install the library from there, which makes this only a small change in the code, but potentially quite a hassle to change parts during different experiments, or
You return your logits (which is the case by default), and calculate your own loss outside of the actual forward pass of the huggingface model. In this case, you need to be aware of any potential propagation from the loss calculated within the forward call, but this should be within your power to change.
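The second option can be sketched as follows. This is only an illustration with random tensors standing in for the model's logits; weighted_token_loss and the weight values are made up for the example, and it assumes padding and special tokens are labelled -100, the default ignore_index of CrossEntropyLoss.

```python
import torch
from torch.nn import CrossEntropyLoss


def weighted_token_loss(logits: torch.Tensor,
                        labels: torch.Tensor,
                        class_weights: torch.Tensor) -> torch.Tensor:
    """Class-weighted cross-entropy over tokens.

    logits: (batch, seq_len, num_labels), labels: (batch, seq_len).
    Positions labelled -100 are ignored.
    """
    num_labels = logits.size(-1)
    loss_fct = CrossEntropyLoss(weight=class_weights, ignore_index=-100)
    return loss_fct(logits.reshape(-1, num_labels), labels.reshape(-1))


torch.manual_seed(0)
# Random placeholders for the logits a *ForTokenClassification model would return.
logits = torch.randn(2, 5, 3)
labels = torch.tensor([[0, 0, 1, 0, -100],
                       [0, 2, 0, 0, -100]])
# Hypothetical weights: down-weight the frequent "other" class (index 0).
class_weights = torch.tensor([0.2, 1.0, 1.0])

loss = weighted_token_loss(logits, labels, class_weights)
```

If you train with the Trainer class, overriding its compute_loss method in a subclass is a common place to call such a function. If you do not pass labels to the model, it does not compute its own internal loss.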

When and Whether should we normalize the ground-truth labels in the multi-task regression models?

I am working on a multi-task regression model. However, the ground-truth labels of the different tasks are on different scales, so I wonder whether it is necessary to normalize the targets. Otherwise, the MSE of the large-scale tasks will be far larger than that of the others. The figure below shows part of my targets. You can see that columns like ASA_m2_c have much higher values than some others.
First, I have already tried some weighted-loss techniques to balance how much the model focuses on each task during backpropagation. The results show that this didn't perform well.
Secondly, I have seen a great many discussions about normalizing the input data, but hardly any about normalizing the labels. This is partly because most people's problems are single-task classification problems. I do know that PyTorch (via torchvision) provides a convenient way to normalize vision datasets with transforms.Normalize, but that still operates on the input rather than the labels.
Similar questions: https://forums.fast.ai/t/normalizing-your-dataset/49799
https://discuss.pytorch.org/t/ground-truth-label-normalization/26981/19
PyTorch - How should you normalize individual instances
Moreover, I think it might be helpful to provide some details of my model architecture. The input is first fed into a feature extractor and then several generators use the shared output representation from that extractor to predict different targets.
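To make the question concrete, per-task target standardization could look roughly like the sketch below. It is not taken from any answer; TargetScaler and the example values are made up. The statistics are computed on the training targets only, and predictions are mapped back to the original units afterwards.

```python
import numpy as np


class TargetScaler:
    """Standardize each target column using training-set statistics."""

    def fit(self, y_train: np.ndarray) -> "TargetScaler":
        self.mean_ = y_train.mean(axis=0)
        # Small epsilon guards against division by zero for constant columns.
        self.std_ = y_train.std(axis=0) + 1e-8
        return self

    def transform(self, y: np.ndarray) -> np.ndarray:
        return (y - self.mean_) / self.std_

    def inverse_transform(self, y_scaled: np.ndarray) -> np.ndarray:
        return y_scaled * self.std_ + self.mean_


# Made-up targets: one task around 500, another between 0 and 1.
y_train = np.array([[500.0, 0.1],
                    [520.0, 0.9],
                    [480.0, 0.5],
                    [510.0, 0.3]])

scaler = TargetScaler().fit(y_train)
y_scaled = scaler.transform(y_train)        # train the model against these
y_back = scaler.inverse_transform(y_scaled)  # map predictions back to real units
```

After scaling, every task contributes an MSE of comparable magnitude, which is the effect the weighted-loss attempts were aiming for.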

I've been working on a Multi-Task Learning problem where one head has an output of ~500 and another between 0 and 1.
I've tried uncertainty weighting, but in vain. So I'd be grateful if you could give me a little clue about your studies (if there has been any progress).
Thanks.

Model underfitting

I have trained a model and it took me quite a while to find the correct hyperparameters.
The model has now been trained for 15h and it seems to do its job quite well.
When I looked at the training and validation loss, though, the training loss was somewhat higher than the validation loss. (red curve: training, green: validation)
I use dropout to regularize my model and, as far as I understand, dropout is only applied during training, which might be the reason.
Now I am wondering whether I have trained a valid model.
It doesn't seem like the model is heavily underfitted?
Thanks in advance for any advice,
cheers,
M

First, check whether you have a good dataset, i.e., if it is a classification task, get an equal number of images for each class, and get them from the same source rather than from different sources. Regularization and dropout are used against overfitting (high variance), so don't worry about these.
Then, I think your model is doing well: the training and validation errors start out different, but as the number of epochs increases they both settle onto a steady path. So it is good. The reason for the remaining gap may be what I mentioned above; you could also try shuffling the data before using train_test_split to get a better distribution of training and validation sets.
A plot of learning curves shows a good fit if:
The plot of training loss decreases to a point of stability.
The plot of validation loss decreases to a point of stability and has a small gap with the training loss.
In your case these conditions are satisfied.
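If you want to check these two conditions on logged losses rather than by eye, a rough sketch could look like this. The function name, the window and both tolerances are arbitrary assumptions and need adjusting to the scale of your loss.

```python
def looks_like_good_fit(train_loss, val_loss, window=5, stable_tol=0.02, gap_tol=0.1):
    """Return True if both curves have flattened out and end close together.

    window, stable_tol and gap_tol are illustrative values, not standard ones.
    """
    if len(train_loss) < window or len(val_loss) < window:
        raise ValueError("need at least `window` epochs of losses")

    def is_stable(curve):
        # Stable means the last `window` values stay within a narrow band.
        tail = curve[-window:]
        return max(tail) - min(tail) <= stable_tol

    gap = abs(train_loss[-1] - val_loss[-1])
    return is_stable(train_loss) and is_stable(val_loss) and gap <= gap_tol


# Made-up curves: both flatten out, with a small gap between them.
train = [1.0, 0.6, 0.4, 0.31, 0.30, 0.30, 0.295, 0.295]
val = [0.9, 0.5, 0.35, 0.27, 0.26, 0.26, 0.255, 0.255]
# Made-up overfitting curve: validation loss rises again.
overfit_val = [0.9, 0.5, 0.35, 0.40, 0.48, 0.55, 0.63, 0.70]

good = looks_like_good_fit(train, val)
bad = looks_like_good_fit(train, overfit_val)
```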
If you still want to deal with high bias (underfitting), here are a few methods:
Train bigger models
Train longer. Use better optimization techniques
Try a different neural network architecture and different hyperparameters
You can also use cross-validation or GridSearchCV to find a better optimizer or better hyperparameters, but this may take a very long time, because the model has to be retrained for each parameter setting and a single training run already takes you 15 hours. In return, you are likely to find better parameters to train with.
Overall, I think your model is doing okay.

If your model underfits, its performance will be lower, much as with overfitting, because it cannot learn effectively enough to reach the optimal result, i.e. the proper function to fit the given distribution. In that case you should use less regularization, e.g. less dropout.
Furthermore, sampling can also be crucial, because there can be training/validation splits where your model performs well on the validation set and less well on the training set, and vice versa. This is one of the reasons why we use cross-validation and different sampling methods, e.g. stratified k-fold.
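On the dropout point raised in the question: the sketch below, with a made-up model and random data, shows that dropout is active in train() mode and switched off in eval() mode. For a fair comparison with the validation loss, it can therefore help to also compute the training-set loss in eval() mode.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Made-up model and data, only to show the effect of the two modes.
model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(64, 1))
x = torch.randn(256, 10)
y = torch.randn(256, 1)


def mse(net: nn.Module, inputs: torch.Tensor, targets: torch.Tensor) -> float:
    """Loss on a fixed batch without tracking gradients."""
    with torch.no_grad():
        return nn.functional.mse_loss(net(inputs), targets).item()


model.train()  # dropout active: a new random mask on every forward pass
loss_train_1 = mse(model, x, y)
loss_train_2 = mse(model, x, y)

model.eval()  # dropout disabled: the output is deterministic
loss_eval_1 = mse(model, x, y)
loss_eval_2 = mse(model, x, y)
```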

Difference of filters in convolutional neural network

When creating a convolutional neural network (CNN) (e.g. as described in
https://cs231n.github.io/convolutional-networks/) the input layer is connected with one or several filters, each representing a feature map. Here, each neuron in a filter layer is connected with just a few neurons of the input layer.
In the most simple case each of my n filters has the same dimensionality and uses the same stride.
My (closely related) questions are:
How is it ensured that the filters learn different features, although they are trained on the same patches?
Does the learned feature of a filter depend on the randomly assigned values (for weights and biases) when initializing the network?

I'm not an expert, but I can speak a bit to your questions. To be honest, it sounds like you already have the right idea: it's specifically the initial randomization of weights/biases in the filters that fosters their tendencies to learn different features (although I believe randomness in the error backpropagated from higher layers of the network can play a role as well).
As #user2717954 indicated, there is no guarantee that the filters will learn unique features. However, each time the error of a training sample or batch is backpropagated to a given convolutional layer, the weights and biases of each filter are slightly modified to improve the overall accuracy of the network. Since the initial weights and biases are all different in each filter, it's possible (and likely given a suitable model) for most of the filters to eventually stabilize to values representing a robust set of unique features.
In addition to proper randomization of weights, this also demonstrates why it's crucial to use convolutional layers with an adequate number of filters. Without enough filters, the network is fundamentally limited such that there are important, useful patterns at the given layer of abstraction that simply can't be represented by the network.
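The role of random initialization can be checked with a small experiment. This is a toy sketch with made-up layer sizes and random data: if all filters start with identical weights and feed into identical downstream weights, they receive identical gradients and can never diverge, whereas randomly initialized filters get different gradients.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(4, 1, 8, 8)  # random placeholder images
target = torch.randn(4, 1)


def filter_grads(identical_init: bool) -> torch.Tensor:
    """Return the gradient of the loss w.r.t. each of 3 conv filters."""
    conv = nn.Conv2d(1, 3, kernel_size=3)
    head = nn.Linear(3, 1)
    if identical_init:
        with torch.no_grad():
            # Copy filter 0 into every filter and make the downstream weights equal.
            first = conv.weight[0:1].clone()
            conv.weight.copy_(first.expand_as(conv.weight))
            conv.bias.fill_(0.0)
            head.weight.fill_(0.1)
    feats = torch.relu(conv(x)).mean(dim=(2, 3))  # global average pool -> (4, 3)
    loss = nn.functional.mse_loss(head(feats), target)
    loss.backward()
    return conv.weight.grad


same = filter_grads(identical_init=True)     # all three filter gradients match
random = filter_grads(identical_init=False)  # filter gradients differ
```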
