Delete and Reinitialize pretrained BERT weights / parameters - nlp

I tried to fine-tune BERT for a classification downstream task.
Now I loaded the model again and ran into the following warning:
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight']
This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
I already deleted and reinstalled transformers==4.6.0 but nothing helped.
I thought that passing the parameter "force_download=True" might bring the original weights back, but that did not help either.
Shall I continue and ignore the warning? Is there a way to delete the model checkpoints such that, when the model is downloaded again, the weights are restored?
Thanks in advance!

As long as you're fine-tuning a model for a downstream task, this warning can be ignored. The weights listed in the warning (everything under cls.*) belong to the pretraining heads: the masked-language-modeling head (cls.predictions.*) and the next-sentence-prediction head (cls.seq_relationship.*). They aren't going to be useful for downstream tasks, which get their own head that needs to be fine-tuned.
Hugging Face leaves these weights out because you're using bert-base-uncased, which is a BertForPreTraining checkpoint, and you've created a BertModel from it, which has no such heads. The warning is there to make sure you understand the difference between using the pretrained model directly and fine-tuning it for a different task.
On that note, if you plan to work on a classification task, I'd recommend using their BertForSequenceClassification class instead.
TL;DR you can ignore it as long as you're finetuning.
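To make the warning concrete, here is a simplified sketch of the bookkeeping behind it. It is not the transformers implementation: the key lists are shortened stand-ins and compare_keys is a hypothetical helper. The only idea it shows is that loading compares the parameter names stored in the checkpoint with the names the target model defines.

```python
# Simplified illustration of how a loader can decide which checkpoint
# weights are "not used" and which model weights are "newly initialized".
# compare_keys is a hypothetical helper, not part of transformers.

def compare_keys(checkpoint_keys, model_keys):
    """Return (unused, missing): names only in the checkpoint, names only in the model."""
    checkpoint_keys, model_keys = set(checkpoint_keys), set(model_keys)
    unused = sorted(checkpoint_keys - model_keys)
    missing = sorted(model_keys - checkpoint_keys)
    return unused, missing

# Shortened stand-in for a pretraining checkpoint: encoder plus both pretraining heads.
checkpoint = [
    "embeddings.word_embeddings.weight",
    "encoder.layer.0.attention.self.query.weight",
    "cls.predictions.bias",
    "cls.seq_relationship.weight",
]

# A bare encoder (like BertModel) defines no head at all.
bare_encoder = [
    "embeddings.word_embeddings.weight",
    "encoder.layer.0.attention.self.query.weight",
]

# A classifier adds a fresh head that the checkpoint has never seen.
classifier = bare_encoder + ["classifier.weight", "classifier.bias"]

unused, missing = compare_keys(checkpoint, bare_encoder)
print("bare encoder -> not used:", unused, "| newly initialized:", missing)

unused_clf, missing_clf = compare_keys(checkpoint, classifier)
print("classifier   -> not used:", unused_clf, "| newly initialized:", missing_clf)
```

In the bare-encoder case nothing is newly initialized, so every weight the model actually has comes from the pretrained checkpoint.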

Hi, thanks for your answer! I was not very specific in the description. I first fine-tuned BERT for a downstream task, and afterwards, in a different notebook, I just wanted the usual pretrained BERT to work with its embeddings.
I had connected things that were not related at all. I thought that by fine-tuning the BERT parameters on the downstream task I had changed the parameters for all my 'bert_base_uncased' models, and that this was why I got the warning, even when I just wanted the usual embeddings from the standard pretrained BERT.
I have kind of "solved" the problem or at least I found a solution:
One Conda environment for downstream task classification: conda install -c conda-forge transformers
One Conda environment for just getting the embeddings: conda install -c conda-forge/label/cf202003 transformers
Maybe this is an Apple/Mac-specific thing. I do not know why I run into this problem but nobody else does ^^
Anyway thanks for your answer!


Hugging face transformer: model bio_ClinicalBERT not trained for any of the task?

This may be the most beginner question of all :sweat:.
I just started learning about NLP and Hugging Face. The first thing I'm trying to do is to apply one of the bioBERT models to some clinical note data and see how it does, before moving on to fine-tuning the model. It looks like "emilyalsentzer/Bio_ClinicalBERT" is the closest model for my data.
But as I try to use it for any of the analyses I always get this warning.
Some weights of the model checkpoint at emilyalsentzer/Bio_ClinicalBERT were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight']
From chapter 2 of the Hugging Face course I understand what this means:
This is because BERT has not been pretrained on classifying pairs of sentences, so the head of the pretrained model has been discarded and a new head suitable for sequence classification has been added instead. The warnings indicate that some weights were not used (the ones corresponding to the dropped pretraining head) and that some others were randomly initialized (the ones for the new head). It concludes by encouraging you to train the model, which is exactly what we are going to do now.
So I went on to test which NLP task I can use "emilyalsentzer/Bio_ClinicalBERT" for, out of the box.
from transformers import pipeline, AutoModel
checkpoint = "emilyalsentzer/Bio_ClinicalBERT"
nlp_task = ['conversational', 'feature-extraction', 'fill-mask', 'ner',
            'question-answering', 'sentiment-analysis', 'text-classification',
            'zero-shot-classification']
for task in nlp_task:
    process = pipeline(task=task, model=checkpoint)
And I got the same warning message for all the NLP tasks, so it appears that I am advised not to use the model for any of them. This really confuses me. The original bio_clinicalBERT model paper stated that they had good results on a few different tasks, so certainly the model was trained for those tasks. I also have a similar issue with other models, i.e. the blog or research paper said a model obtained good results on a specific task, but when I tried to apply it with pipeline it gave the warning message. Is there any reason why the head layers were not included in the model?
I only have a few hundred clinical notes (also unannotated :frowning_face:), so it doesn't look like it's big enough for training. Is there any way I could use the model on my data without training?
Thank you for your time.
This Bio_ClinicalBERT model is trained for the Masked Language Model (MLM) task. This task is basically used for learning the semantic relations of the tokens in the language/domain. For downstream tasks, you can fine-tune the model's head with your small dataset, or you can use a fine-tuned model like Bio_ClinicalBERT-finetuned-medicalcondition, which is a fine-tuned version of the same model. You can find all the fine-tuned models on Hugging Face by searching 'bio-clinicalBERT' as in the link.
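One way to see why only some tasks work out of the box is to ask which head each task needs and whether the checkpoint contains it. The sketch below is a simplified check, not a transformers API: the HEAD_PREFIX table and head_is_pretrained are illustrative names. The prefixes are the parameter names that BERT task models use for their heads, and the checkpoint keys are abbreviated from the warning in the question.

```python
# Which tasks find their head in an MLM-pretrained BERT checkpoint?
# HEAD_PREFIX and head_is_pretrained are illustrative helpers only.

# Parameter-name prefix of the head each task's BERT model class expects.
HEAD_PREFIX = {
    "fill-mask": "cls.predictions.",       # masked-language-modeling head
    "text-classification": "classifier.",  # sequence classification head
    "ner": "classifier.",                  # token classification head
    "question-answering": "qa_outputs.",   # span prediction head
}

# Abbreviated head keys, taken from the warning message in the question.
checkpoint_keys = [
    "cls.predictions.bias",
    "cls.predictions.transform.dense.weight",
    "cls.predictions.decoder.weight",
    "cls.seq_relationship.weight",
]

def head_is_pretrained(task, keys):
    """True if at least one checkpoint key belongs to the head this task needs."""
    prefix = HEAD_PREFIX[task]
    return any(key.startswith(prefix) for key in keys)

status = {task: head_is_pretrained(task, checkpoint_keys) for task in HEAD_PREFIX}
for task, ok in status.items():
    print(f"{task:20s} head in checkpoint: {ok}")
```

Under these assumptions only fill-mask has a trained head. Feature extraction needs no head at all, so the encoder's embeddings can also be used on unannotated notes without further training.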

Pytorch Lightning Inference

I trained a model using PyTorch Lightning and especially appreciated the ease of using multiple GPUs. Now, after training, how can I still make use of Lightning's GPU features to run inference on a test set and store/export the predictions?
The documentation on inference does not target that.
Thanks in advance.
You can implement the validation_epoch_end on your LightningModule which is called "at the end of the validation epoch with the outputs of all validation steps". For this to work you also need to define validation_step on that same module.
Once this is done, you can run validation using your trainer and a given dataloader by calling:
trainer.validate(pl_module, dataloaders=validation_dataloader)
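The contract between the two hooks can be sketched in plain Python. This is a stand-in, not Lightning code: PredictionCollector and run_validation are hypothetical names, and run_validation only mimics the hook order described above (one validation_step call per batch, then one validation_epoch_end call with all step outputs). In a real project the class would subclass LightningModule and the "model" would be a network rather than a doubling function.

```python
# Plain-Python stand-in for the validation_step / validation_epoch_end contract.
# PredictionCollector and run_validation are hypothetical; no Lightning import.

class PredictionCollector:
    def __init__(self):
        self.predictions = []  # filled at the end of the validation epoch

    def forward(self, batch):
        # Placeholder "model": doubles every input value.
        return [2 * x for x in batch]

    def validation_step(self, batch, batch_idx):
        # Return whatever should be collected for this batch.
        return {"batch_idx": batch_idx, "preds": self.forward(batch)}

    def validation_epoch_end(self, outputs):
        # `outputs` holds the return values of all validation_step calls.
        self.predictions = [p for out in outputs for p in out["preds"]]


def run_validation(module, dataloader):
    """Mimics the hook order: one step per batch, then one epoch-end call."""
    outputs = [module.validation_step(batch, i) for i, batch in enumerate(dataloader)]
    module.validation_epoch_end(outputs)


module = PredictionCollector()
run_validation(module, dataloader=[[1, 2], [3, 4], [5]])
print(module.predictions)  # [2, 4, 6, 8, 10]
```

Once the predictions are gathered in validation_epoch_end, they can be written to disk there with whatever serializer suits the project.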

Why some weights of GPT2Model are not initialized?

I am using the GPT2 pre-trained model for a research project and when I load the pre-trained model with the following code,
from transformers.models.gpt2.modeling_gpt2 import GPT2Model
gpt2 = GPT2Model.from_pretrained('gpt2')
I get the following warning message:
Some weights of GPT2Model were not initialized from the model checkpoint at gpt2 and are newly initialized: ['h.0.attn.masked_bias', 'h.1.attn.masked_bias', 'h.2.attn.masked_bias', 'h.3.attn.masked_bias', 'h.4.attn.masked_bias', 'h.5.attn.masked_bias', 'h.6.attn.masked_bias', 'h.7.attn.masked_bias', 'h.8.attn.masked_bias', 'h.9.attn.masked_bias', 'h.10.attn.masked_bias', 'h.11.attn.masked_bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
From my understanding, it says that the weights of the above layers are not initialized from the pre-trained model. But we all know that attention layers ('attn') are so important in GPT2 and if we can not have their actual weights from the pre-trained model, then what is the point of using a pre-trained model?
I really appreciate it if someone could explain this to me and tell me how I can fix this.
The masked_bias was added by the Hugging Face community as a speed improvement compared to the original implementation. It should not negatively impact performance, as the original weights are loaded properly. Check this PR for further information.
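To see why a freshly created masked_bias is harmless, it helps to look at what such a value does. The sketch below is a plain-Python illustration of causal masking, assuming masked_bias is a fixed large negative constant (-1e4 here) rather than a learned parameter. causal_attention_row is a hypothetical helper, not the transformers implementation.

```python
import math

# Assumed constant: a large negative value that replaces masked attention scores.
MASKED_BIAS = -1e4

def causal_attention_row(scores, position):
    """Softmax over one query's scores, hiding keys after `position`."""
    masked = [s if j <= position else MASKED_BIAS for j, s in enumerate(scores)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Query at position 1 may attend to keys 0 and 1, but not to keys 2 and 3.
weights = causal_attention_row([0.5, 1.5, 3.0, 2.0], position=1)
print(weights)
```

Since such a constant carries no learned information, creating it at load time does not change what the model computes, which is consistent with the answer above.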

Porting pre-trained keras models and run them on IPU

I am trying to port two pre-trained Keras models to the IPU machine. I managed to load and run them using IPUStrategy.scope, but I don't know if I am doing it the right way. I have my pre-trained models in .h5 file format.
I load them this way:
def first_model():
    model = tf.keras.models.load_model("./model1.h5")
    return model
After searching your file I couldn't find any load methods to load my pre-trained models, and this is why I used tf.keras.models.load_model().
Then I use this code to run:
cfg = ipu.utils.auto_select_ipus(cfg, 1)
strategy = ipu.ipu_strategy.IPUStrategy()
with strategy.scope():
    model = first_model()
    print('compile attempt\n')
    model.compile("sgd", "categorical_crossentropy", metrics=["accuracy"])
    print('compilation completed\n')
    print('running attempt\n')
    res = model.predict(input_img)[0]
    print('run completed\n')
You can see the output here: link
So I have some difficulty understanding how, and whether, the system is working properly.
Basically, model.compile won't compile my model, but when I call model.predict the system first compiles and then runs. Why is that happening? Is there another way to run pre-trained Keras models on an IPU chip?
Another question I have is whether it's possible to load a pre-trained Keras model inside an ipu.keras.model and then further train and evaluate it and save it for future use?
One last question I have is about the compilation part of the graph. Is there a way to avoid recompilation of the graph every time I use model.predict() in a different strategy.scope()?
I use tensorflow2.1.2 wheel
Thank you for your time
To add some context, the Graphcore TensorFlow wheel includes a port of Keras for the IPU, available as tensorflow.python.ipu.keras. You can access the API documentation for IPU Keras at this link. This module contains IPU-specific optimised replacements for the TensorFlow Keras classes Model and Sequential, plus further high-performance, multi-IPU classes, e.g. PipelineModel and PipelineSequential.
As per your specific issue, you are right when you mention that there are no IPU-specific ways to load pre-trained Keras models at present. I would encourage you, as you appear to have access to IPUs, to reach out to Graphcore Support. When doing so, please attach your pre-trained Keras model model1.h5 and a self-contained reproducer of your code.
Switching topic to the recompilation question: using an executable cache prevents recompilation. You can set that up with the environment variable TF_POPLAR_FLAGS='--executable_cache_path=./cache'. I'd also recommend taking a look at the following resources:
This tutorial gathers several considerations around recompilation and how to avoid it when using TensorFlow2 on the IPU.
The Graphcore TensorFlow documentation here explains how to use the pre-compile mode on the IPU.
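The executable cache flag can also be set from Python instead of the shell. Here is a minimal sketch, assuming the cache directory ./cache from the answer above and assuming the variable is set before TensorFlow is imported, which is the safe ordering. add_poplar_flag is a hypothetical helper that keeps any flags already present in TF_POPLAR_FLAGS.

```python
import os

def add_poplar_flag(flag, env=os.environ):
    """Append `flag` to TF_POPLAR_FLAGS unless it is already set."""
    current = env.get("TF_POPLAR_FLAGS", "")
    if flag not in current.split():
        env["TF_POPLAR_FLAGS"] = (current + " " + flag).strip()
    return env["TF_POPLAR_FLAGS"]

# Set the flag first; import TensorFlow only afterwards (not done here).
flags = add_poplar_flag("--executable_cache_path=./cache")
print(flags)
```

Appending rather than overwriting matters when other Poplar flags are already exported in the environment.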

AllenNLP Multi-Task Model: Keep encoder weights for new heads

I have trained an (AllenNLP) multi-task model. I would like to keep the encoder/backbone weights and continue training with new heads on new datasets. How can I do that with AllenNLP?
I have two basic ideas for how to do that:
1. I followed this AllenNLP tutorial to load the trained model and then, instead of just making predictions, I wanted to change the configuration and the model heads to continue training on the new datasets... but I am kinda lost in how to do that.
2. I guess it should be possible to (a) save the state dict of the previously trained encoder in a file and then (b) point to those weights in the configuration file for the new model (instead of pointing to "bert-base-cased" weights, for example). But looking at the PretrainedTransformerEmbedder class I don't see how I could pass my own model weights to that class.
As an additional question: Is it also possible to save the weights of the heads separately and initialize new heads with those weights?
Any help is appreciated :)
Your second idea is the preferred way, which you can accomplish by using a PretrainedModelInitializer. See the CopyNet model for an example of how to add this to your model.
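Conceptually, initializing from a previous model amounts to copying the parameters whose names match the encoder and leaving the new heads at their fresh initialization. The sketch below shows that idea with plain dictionaries standing in for state dicts. All parameter names (_backbone.*, _heads.*) and the helpers split_by_prefix and init_from are hypothetical and are not AllenNLP APIs.

```python
# Plain dictionaries stand in for state dicts; values stand in for tensors.
# Parameter names and helper functions are hypothetical.

def split_by_prefix(state, prefix):
    """Split a state dict into (matching, rest) by parameter-name prefix."""
    matching = {k: v for k, v in state.items() if k.startswith(prefix)}
    rest = {k: v for k, v in state.items() if not k.startswith(prefix)}
    return matching, rest

def init_from(new_state, pretrained):
    """Overwrite parameters of new_state that also exist in pretrained."""
    merged = dict(new_state)
    merged.update({k: v for k, v in pretrained.items() if k in new_state})
    return merged

# Previously trained multi-task model: shared encoder plus an old head.
old_model = {"_backbone.layer0.weight": 0.7, "_heads.old_task.weight": 0.3}
encoder_weights, head_weights = split_by_prefix(old_model, "_backbone.")

# New model: same encoder, different head, everything freshly initialized.
new_model = {"_backbone.layer0.weight": 0.0, "_heads.new_task.weight": 0.0}
new_model = init_from(new_model, encoder_weights)
print(new_model)
```

The separately kept head_weights cover the additional question in the same way: they could be passed to init_from for a new model whose head uses matching parameter names.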
