biobert for keras version of huggingface transformers - keras

(also posted in https://github.com/dmis-lab/biobert/issues/98)
Hi, does anyone know how to load biobert as a keras layer using the huggingface transformers (version 2.4.1)? I tried several possibilities but none of these worked. All that I found out is how to use the pytorch version but I am interested in the keras layer version. Below are two of my attempts (I saved the biobert files into folder "biobert_v1.1_pubmed").
Attempt 1:
biobert_model = TFBertModel.from_pretrained('bert-base-uncased')
biobert_model.load_weights('biobert_v1.1_pubmed/model.ckpt-1000000')
Error message:
AssertionError: Some objects had attributes which were not restored:
: ['tf_bert_model_4/bert/embeddings/word_embeddings/weight']
: ['tf_bert_model_4/bert/embeddings/position_embeddings/embeddings']
(and many more lines like above...)
Attempt 2:
biobert_model = TFBertModel.from_pretrained("biobert_v1.1_pubmed/model.ckpt-1000000", config='biobert_v1.1_pubmed/bert_config.json')
Error message:
NotImplementedError: Weights may only be loaded based on topology into Models when loading TensorFlow-formatted weights (got by_name=True to load_weights).
Any help appreciated! My experience with huggingface's transformers library is almost zero. I also tried to load the following two models but it seems they only support the pytorch version.
https://huggingface.co/monologg/biobert_v1.1_pubmed
https://huggingface.co/adamlin/NCBI_BERT_pubmed_mimic_uncased_base_transformers

Might be a bit late but I have found a not so elegant fix to this problem. The tf bert models in the transformers library can be loaded with a PyTorch save file.
Step 1: Convert the TF checkpoint to a PyTorch save file with the following command (more here: https://github.com/huggingface/transformers/blob/master/docs/source/converting_tensorflow_models.rst)
transformers-cli convert --model_type bert \
--tf_checkpoint=./path/to/checkpoint_file \
--config=./bert_config.json \
--pytorch_dump_output=./pytorch_model.bin
Step 2: Make sure to combine the following files in a directory
config.json - bert config file (must be renamed from bert_config.json!)
pytorch_model.bin - the one we just converted
vocab.txt - bert vocab file
Step 3: Load model from the directory we just created
model = TFBertModel.from_pretrained('./pretrained_model_dir', from_pt=True)
There is actually also an argument "from_tf" which, according to the documentation should work with tf style checkpoints but I can't get it to work. See: https://huggingface.co/transformers/main_classes/model.html#transformers.PreTrainedModel.from_pretrained
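Steps 2 and 3 above can be sketched in Python as follows. This is a minimal sketch, not part of the original answer: the helper name assemble_model_dir is hypothetical, and it assumes the converted pytorch_model.bin sits in the same folder as bert_config.json and vocab.txt. The final load call is the one from Step 3 and is left as a comment because it needs transformers and TensorFlow installed.

```python
import shutil
from pathlib import Path


def assemble_model_dir(src_dir, out_dir):
    """Copy the three files from_pretrained expects into out_dir.

    Hypothetical helper: assumes src_dir holds bert_config.json, vocab.txt
    and the pytorch_model.bin produced by the conversion in Step 1.
    """
    src, out = Path(src_dir), Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # bert_config.json must be renamed to config.json
    shutil.copyfile(src / "bert_config.json", out / "config.json")
    shutil.copyfile(src / "pytorch_model.bin", out / "pytorch_model.bin")
    shutil.copyfile(src / "vocab.txt", out / "vocab.txt")
    return out


# Usage (requires transformers and TensorFlow):
# from transformers import TFBertModel
# model_dir = assemble_model_dir("biobert_v1.1_pubmed", "pretrained_model_dir")
# model = TFBertModel.from_pretrained(str(model_dir), from_pt=True)
```

Copying rather than moving keeps the original download intact, so the conversion can be repeated if something goes wrong.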

Related

Delete and Reinitialize pre-trained BERT weights / parameters

I tried to fine-tune BERT for a classification downstream task.
Now I loaded the model again and I run into the following warning:
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight']
This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
[Screen Shot][1]
[1]: https://i.stack.imgur.com/YJZVc.png
I already deleted and reinstalled transformers==4.6.0 but nothing helped.
I thought maybe through the parameter "force_download=True" it might get the original weights back but nothing helped.
Shall I continue and ignore the warning? Is there a way to delete the model checkpoints so that when the model is downloaded again the weights are back to the originals?
Thanks in advance!
Best,
Alex
As long as you're fine-tuning a model for a downstream task, this warning can be ignored. The weights it lists (the cls.* entries) belong to the pretraining heads, which aren't going to be useful for downstream tasks; a task-specific head is fine-tuned instead.
Huggingface does not use them because bert-base-uncased is a BertForPreTraining checkpoint and you've created a BertModel from it. The warning is there to make sure you understand the difference between using the pretrained model directly and fine-tuning it for a different task.
On that note, if you plan to work on a classification task, I'd recommend using their BertForSequenceClassification class instead.
TL;DR you can ignore it as long as you're finetuning.
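To see why the warning lists only cls.* entries, it can help to compare the parameter names stored in a checkpoint with the names the target model defines. The sketch below is illustrative only: compare_keys is a hypothetical helper, the key names are shortened examples, and this is not the actual loading code in transformers.

```python
def compare_keys(checkpoint_keys, model_keys):
    """Return (unused, missing) parameter names, each sorted.

    unused:  in the checkpoint but not defined by the model ("not used")
    missing: defined by the model but absent from the checkpoint
    """
    checkpoint_keys, model_keys = set(checkpoint_keys), set(model_keys)
    unused = sorted(checkpoint_keys - model_keys)
    missing = sorted(model_keys - checkpoint_keys)
    return unused, missing


# Shortened, illustrative key lists (a real checkpoint has many more entries).
pretraining_checkpoint = [
    "bert.embeddings.word_embeddings.weight",
    "bert.encoder.layer.0.attention.self.query.weight",
    "cls.predictions.bias",
    "cls.seq_relationship.weight",
]
encoder_only_model = [
    "bert.embeddings.word_embeddings.weight",
    "bert.encoder.layer.0.attention.self.query.weight",
]

unused, missing = compare_keys(pretraining_checkpoint, encoder_only_model)
print(unused)   # the pretraining-head weights the encoder has no slot for
print(missing)  # empty: every encoder weight is present in the checkpoint
```

An empty missing list is the reassuring part: it means every weight of the encoder itself was found in the checkpoint.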
Hi, thanks for your answer! I was not very specific in the description. I first fine-tuned BERT for a downstream task, and afterwards, in a different notebook, I just wanted the usual pre-trained BERT to work with its embeddings.
I had connected things that were not related at all. I thought that by fine-tuning the BERT parameters on the downstream task I had changed the parameters for all my 'bert_base_uncased' models, and that this was why I got the warning, even when I just wanted the usual embeddings from the standard pre-trained BERT.
I have kind of "solved" the problem or at least I found a solution:
One Conda environment for downstream task classification: conda install -c conda-forge transformers
One Conda environment for just getting the embeddings: conda install -c conda-forge/label/cf202003 transformers
Maybe this is an Apple/Mac-specific thing; I do not know why I run into this problem but nobody else does ^^
Anyway thanks for your answer!
Best,
Alex

Porting pre-trained Keras models and running them on IPU

I am trying to port two pre-trained Keras models to the IPU machine. I managed to load and run them using IPUStrategy.scope, but I don't know if I am doing it the right way. I have my pre-trained models in .h5 file format.
I load them this way:
def first_model():
    model = tf.keras.models.load_model("./model1.h5")
    return model
After searching your ipu.keras.models.py file I couldn't find any load methods to load my pre-trained models, and this is why I used tf.keras.models.load_model().
Then I use this code to run:
cfg=ipu.utils.create_ipu_config()
cfg=ipu.utils.auto_select_ipus(cfg, 1)
ipu.utils.configure_ipu_system(cfg)
ipu.utils.move_variable_initialization_to_cpu()
strategy = ipu.ipu_strategy.IPUStrategy()
with strategy.scope():
    model = first_model()
    print('compile attempt\n')
    model.compile("sgd", "categorical_crossentropy", metrics=["accuracy"])
    print('compilation completed\n')
    print('running attempt\n')
    res = model.predict(input_img)[0]
    print('run completed\n')
You can see the output here: link
So I have some difficulty understanding how, and whether, the system is working properly.
Basically, model.compile won't compile my model, but when I use model.predict the system first compiles and then runs. Why is that happening? Is there another way to run pre-trained Keras models on an IPU chip?
Another question I have is whether it's possible to load a pre-trained Keras model inside an ipu.keras.model and then use model.fit/evaluate to further train and evaluate it, and then save it for future use.
One last question I have is about the compilation part of the graph. Is there a way to avoid recompilation of the graph every time I use model.predict() in a different strategy.scope()?
I use the TensorFlow 2.1.2 wheel.
Thank you for your time
To add some context, the Graphcore TensorFlow wheel includes a port of Keras for the IPU, available as tensorflow.python.ipu.keras. You can access the API documentation for IPU Keras at this link. This module contains IPU-specific optimised replacements for the TensorFlow Keras classes Model and Sequential, plus more high-performance, multi-IPU classes, e.g. PipelineModel and PipelineSequential.
As per your specific issue, you are right when you mention that there are no IPU-specific ways to load pre-trained Keras models at present. I would encourage you, as you appear to have access to IPUs, to reach out to Graphcore Support. When doing so, please attach your pre-trained Keras model model1.h5 and a self-contained reproducer of your code.
Switching topic to the recompilation question: using an executable cache prevents recompilation. You can set that up with the environment variable TF_POPLAR_FLAGS='--executable_cache_path=./cache'. I'd also recommend taking a look at the following resources:
this tutorial gathers several considerations around recompilation and how to avoid it when using TensorFlow2 on the IPU.
Graphcore TensorFlow documentation here explains how to use the pre-compile mode on the IPU.
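The executable cache setting mentioned above can also be applied from inside a Python script rather than the shell. The following is a minimal sketch: enable_executable_cache is a hypothetical helper name, and it assumes the variable has to be set before TensorFlow is imported and the IPU system is configured.

```python
import os


def enable_executable_cache(cache_dir="./cache"):
    """Add --executable_cache_path to TF_POPLAR_FLAGS, keeping existing flags."""
    flag = f"--executable_cache_path={cache_dir}"
    existing = os.environ.get("TF_POPLAR_FLAGS", "")
    # Leave the variable alone if a cache path was already configured
    if "--executable_cache_path" not in existing:
        os.environ["TF_POPLAR_FLAGS"] = f"{existing} {flag}".strip()
    return os.environ["TF_POPLAR_FLAGS"]


# Assumption: call this before TensorFlow is imported and the IPU is configured.
enable_executable_cache("./cache")
# import tensorflow as tf
# from tensorflow.python import ipu
```

Appending to any existing value, instead of overwriting it, keeps other Poplar flags that may already be set in the environment.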

scikit learn upgrade causes failure when old models are loaded

I trained some data science models with scikit-learn v0.19.1. The models are stored in a pickle file. After upgrading to the latest version (v0.23.1), I get the following error when I try to load them:
File "../../Utils/WebsiteContentSelector.py", line 100, in build_page_selector
page_selector = pickle.load(pkl_file)
AttributeError: Can't get attribute 'DeprecationDict' on <module 'sklearn.utils.deprecation' from '/usr/local/lib/python3.6/dist-packages/sklearn/utils/deprecation.py'>
Is there a way to upgrade without retraining all my models (which is very expensive)?
You used a new version of sklearn to load a model which was trained by an old version of sklearn.
So, the options are:
Retrain the model with current version of sklearn if you have the training script and data
Or fall back to the lower sklearn version reported in the warning message
Depending on the kind of sklearn model used: if it is a simple regression model, what you probably need are the actual weights and bias (or intercept) values.
You can check these values in your model:
model.classes_
model.coef_
model.intercept_
They are numpy arrays and can be pickled easily. You also need the same parameters that were passed to the model constructor, for example:
tol
max_iter
and so on. With these, the same model created with the same parameters in the upgraded version can take the stored weights and intercept.
In this way, no retraining is needed and you can use the upgraded sklearn.
When lib versions are not backward compatible you can do the following:
Downgrade sklearn back to the original version
Load each model, extract and store its coefficients (which are model-specific - check documentation)
Upgrade sklearn, load coefficients and init models with them, save models
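The downgrade, extract, upgrade, restore procedure above can be sketched like this. It is a minimal sketch, not a general migration tool: export_params and restore_params are hypothetical helper names, and it assumes a simple linear model whose whole fitted state is the attributes you list (other estimators carry more state, so check the documentation for each one).

```python
import json


def export_params(model, attr_names, init_params):
    """Serialize constructor arguments and fitted attributes to a JSON string."""
    fitted = {}
    for name in attr_names:
        value = getattr(model, name)
        # numpy arrays provide tolist(); plain Python values pass through unchanged
        fitted[name] = value.tolist() if hasattr(value, "tolist") else value
    return json.dumps({"init_params": init_params, "fitted": fitted})


def restore_params(model_cls, payload, to_array=lambda value: value):
    """Rebuild a model from export_params output without retraining."""
    data = json.loads(payload)
    model = model_cls(**data["init_params"])
    for name, value in data["fitted"].items():
        setattr(model, name, to_array(value))
    return model


# Old environment (assumed usage, e.g. with a fitted LogisticRegression `clf`):
#   payload = export_params(clf, ["classes_", "coef_", "intercept_"],
#                           {"tol": clf.tol, "max_iter": clf.max_iter})
# New environment:
#   import numpy as np
#   from sklearn.linear_model import LogisticRegression
#   clf = restore_params(LogisticRegression, payload, to_array=np.asarray)
```

JSON is used here instead of pickle so the stored file does not depend on any sklearn internals. It is worth comparing predictions from the old and the restored model on a few samples before relying on the result.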

Extracting fixed vectors from BioBERT without using terminal command?

If we want to use weights from a pretrained BioBERT model, we can execute the following terminal command after downloading all the required BioBERT files.
os.system('python3 extract_features.py \
--input_file=trial.txt \
--vocab_file=vocab.txt \
--bert_config_file=bert_config.json \
--init_checkpoint=biobert_model.ckpt \
--output_file=output.json')
The above command reads an individual file containing the text, reads the textual content from it, and then writes the extracted vectors to another file. The problem with this is that it cannot be scaled easily to very large datasets containing thousands of sentences/paragraphs.
Is there a way to extract these features on the go (using an embedding layer), as can be done for word2vec vectors in PyTorch or TF1.3?
Note: BioBERT checkpoints do not exist for TF2.0, so I guess there is no way it could be done with TF2.0 unless someone generates TF2.0 compatible checkpoint files.
I will be grateful for any hint or help.
You can get the contextual embeddings on the fly, but the total time spent on getting the embeddings will always be the same. There are two options for how to do it: 1. import BioBERT into the Transformers package and use it in PyTorch (which I would do) or 2. use the original codebase.
1. Import BioBERT into the Transformers package
The most convenient way of using pre-trained BERT models is the Transformers package. It was primarily written for PyTorch, but also works with TensorFlow. It does not have BioBERT out of the box, so you need to convert it from TensorFlow format yourself. There is a convert_tf_checkpoint_to_pytorch.py script that does that. People had some issues with this script and BioBERT (they seem to be resolved).
After you convert the model, you can load it like this.
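import torch
from transformers import BertModel, BertTokenizer
# Load tokenizer and model from the directory with the converted model
tokenizer = BertTokenizer.from_pretrained('directory_with_converted_model')
model = BertModel.from_pretrained('directory_with_converted_model')
model.eval()  # disable dropout for feature extraction
# Encode the sentence and wrap the ids in a batch dimension
input_ids = torch.tensor([tokenizer.encode("Cool biomedical tetra-hydro-sentence.", add_special_tokens=True)])
# Call the model in a standard PyTorch way; the first output holds the per-token embeddings
with torch.no_grad():
    embeddings = model(input_ids)[0]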
2. Use the BioBERT codebase directly
You can get the embeddings on the go by reusing the code in extract_features.py. On lines 346-382, they initialize the model. You get the embeddings by calling estimator.predict(...).
For that, you need to format the input. First, you need to format the string (using the code on lines 326-337) and then call convert_examples_to_features on it.

Word2Vec word not found with Gensim but shows up on TensorFlow embedding projector?

I've recently started experimenting with pre-trained word embeddings to enhance the performance of my LSTM model on a NLP task. In this case, I looked into Google's Word2Vec. Based on online tutorials, I first downloaded Word2Vec with wget https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz and used python's gensim package to query the embeddings, using the following code.
from gensim.models import KeyedVectors
if __name__ == "__main__":
    model = KeyedVectors.load_word2vec_format("./data/word2vec/GoogleNews-vectors-negative300.bin", binary=True)
    print(model["bosnia"])
However, after noticing that many common words weren't found in the model, I started to wonder if something was awry. I tried searching for bosnia in the embedding repo, as shown above, but it wasn't found. So, I went on the TensorFlow embedding projector, loaded the Word2Vec model, and searched for bosnia - it was there.
So, my question is: why is this happening? Was the version of Word2Vec I downloaded not complete? Or is gensim unable to load all words into memory and therefore omitting some?
You should check the length of the downloaded file(s), to ensure it's as expected (in case it was truncated or incompletely downloaded).
You should double-check that you're using the same file in both places, and also checking the exact same token (eg 'bosnia' vs 'Bosnia') via both paths. (None of the 5 options in the https://projector.tensorflow.org/ drop-down correspond to the GoogleNews 300-d, 3-million-token dataset, and the load button doesn't appear to support word2vec .bin files, so I'm not sure how that could be used to cross-check what's in that file.)
(There aren't any known bugs in gensim's load_word2vec_format() that would explain it missing vectors that are actually present.)
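The "exact same token" check suggested above can be sketched as follows. This is a minimal sketch under stated assumptions: find_case_variants is a hypothetical helper, and the toy vocabulary only stands in for the loaded vectors. The helper relies on nothing more than the `in` operator, which a set of words and a gensim KeyedVectors object both support.

```python
def find_case_variants(vocab, token):
    """Return the case variants of token that are present in vocab.

    vocab can be any container supporting `in`, e.g. a set of words or a
    gensim KeyedVectors object.
    """
    candidates = [token, token.lower(), token.upper(), token.capitalize(), token.title()]
    seen, found = set(), []
    for candidate in candidates:
        # Skip duplicates such as lower() returning the token unchanged
        if candidate not in seen:
            seen.add(candidate)
            if candidate in vocab:
                found.append(candidate)
    return found


# Toy vocabulary standing in for the loaded vectors (illustrative only).
toy_vocab = {"Bosnia", "the", "NATO"}
print(find_case_variants(toy_vocab, "bosnia"))
```

With the real model, passing the KeyedVectors object as vocab shows whether a word is stored only under a differently cased key.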
