FitBERT is a useful package, but I have a small doubt about BERT development for masked word prediction, as below: I trained a BERT model on a custom corpus using Google's scripts (create_pretraining_data.py, run_pretraining.py, extract_features.py, etc.). As a result I got a vocab file, a .tfrecord file, a .json file, and checkpoint files.
Now, how do I use those files with your package to predict a masked word in a given sentence?
From the tensorflow documentation:
A TFRecord file stores your data as a sequence of binary strings. This means you need to specify the structure of your data before you write it to the file. Tensorflow provides two components for this purpose: tf.train.Example and tf.train.SequenceExample. You have to store each sample of your data in one of these structures, then serialize it and use a tf.python_io.TFRecordWriter to write it to disk.
This document, along with the tensorflow documentation, explains quite well how to use those file types.
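For illustration, here is a minimal sketch of writing one sample with the TF 1.x API mentioned above ('text' and 'label' are hypothetical feature names):

import tensorflow as tf

# Wrap one sample in a tf.train.Example
example = tf.train.Example(features=tf.train.Features(feature={
    'text': tf.train.Feature(bytes_list=tf.train.BytesList(value=[b'hello world'])),
    'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[1])),
}))

# Serialize it and write it to disk (tf.python_io is the TF 1.x module; TF 2.x uses tf.io)
with tf.python_io.TFRecordWriter('sample.tfrecord') as writer:
    writer.write(example.SerializeToString())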
To use FitBERT directly through the library, you can follow the examples on the project's GitHub page.
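For reference, a minimal usage sketch along the lines of the project's README (the sentence and options are illustrative; FitBert() can also be pointed at your own fine-tuned model via its constructor arguments):

from fitbert import FitBert

fb = FitBert()  # downloads a pre-trained BERT model by default

# The ***mask*** token marks the word to predict
masked_string = "Why Bert, you're looking ***mask*** today!"
options = ['buff', 'handsome', 'strong']

ranked_options = fb.rank(masked_string, options=options)  # options ranked by fit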
I am using BERT word embeddings for a sentence classification task with 3 labels. I am using Google Colab for coding. My problem is that I have to execute the embedding part every time I restart the kernel. Is there any way to save these word embeddings once they are generated? It takes a lot of time to generate those embeddings.
The code I am using to generate the BERT word embeddings is:
[get_features(text) for text in text_list]
Here, get_features is a function which returns the word embedding for each entry in my list text_list.
I read that converting the embeddings into numpy arrays and then using np.save can do it, but I don't actually know how to code it.
You can save your embeddings data to a numpy file by following these steps:
import numpy as np

all_embeddings = here_is_your_function_return_all_data()
all_embeddings = np.array(all_embeddings)  # convert to a numpy array
np.save('embeddings.npy', all_embeddings)  # write it to disk
If you're working in Google Colab, you can then download the file to your local computer. Whenever you need it, just upload it and load it:
all_embeddings = np.load('embeddings.npy')
That's it.
By the way, you can also save the file directly to Google Drive.
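For instance, a minimal sketch using Colab's Drive integration (the path under My Drive is just an example):

from google.colab import drive
import numpy as np

# Mount Google Drive into the Colab runtime (this prompts for authorization)
drive.mount('/content/drive')

# Save and reload the embeddings on Drive so they survive kernel restarts
np.save('/content/drive/My Drive/embeddings.npy', all_embeddings)
all_embeddings = np.load('/content/drive/My Drive/embeddings.npy')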
If we want to use the weights from the pretrained BioBERT model, we can execute the following command (from Python) after downloading all the required BioBERT files:
import os

os.system('python3 extract_features.py \
    --input_file=trial.txt \
    --vocab_file=vocab.txt \
    --bert_config_file=bert_config.json \
    --init_checkpoint=biobert_model.ckpt \
    --output_file=output.json')
The above command reads a single file containing the text and writes the extracted vectors to another file. The problem is that this approach cannot easily be scaled to very large datasets containing thousands of sentences/paragraphs.
Is there a way to extract these features on the go (using an embedding layer), like it can be done for word2vec vectors in PyTorch or TF1.3?
Note: BioBERT checkpoints do not exist for TF2.0, so I guess there is no way it could be done with TF2.0 unless someone generates TF2.0 compatible checkpoint files.
I will be grateful for any hint or help.
You can get the contextual embeddings on the fly, but the total time spent on getting the embeddings will always be the same. There are two options: 1. import BioBERT into the Transformers package and use it in PyTorch (which I would do), or 2. use the original codebase.
1. Import BioBERT into the Transformers package
The most convenient way of using pre-trained BERT models is the Transformers package. It was primarily written for PyTorch, but works with TensorFlow as well. It does not have BioBERT out of the box, so you need to convert it from the TensorFlow format yourself. There is a convert_tf_checkpoint_to_pytorch.py script that does that. People have had some issues with this script and BioBERT (these seem to be resolved).
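A rough sketch of the conversion call, in the same os.system style as the command in the question (the flag names follow the script's argparse options but may differ across transformers versions, so check the script's --help):

import os

os.system('python convert_tf_checkpoint_to_pytorch.py \
    --tf_checkpoint_path=biobert_model.ckpt \
    --bert_config_file=bert_config.json \
    --pytorch_dump_path=directory_with_converted_model/pytorch_model.bin')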
After you convert the model, you can load it like this.
import torch
from transformers import BertTokenizer, BertModel

# Load tokenizer and model from the directory with the converted checkpoint
tokenizer = BertTokenizer.from_pretrained('directory_with_converted_model')
model = BertModel.from_pretrained('directory_with_converted_model')

# Call the model in a standard PyTorch way: encode to token ids, wrap in a batch tensor
input_ids = torch.tensor([tokenizer.encode("Cool biomedical tetra-hydro-sentence.", add_special_tokens=True)])
embeddings = model(input_ids)[0]  # last-layer hidden states, shape (1, seq_len, hidden_size)
2. Use the BioBERT codebase directly
You can get the embeddings on the go, basically using the code from extract_features.py. On lines 346-382, they initialize the model; you get the embeddings by calling estimator.predict(...).
For that, you need to format the input: first format the string (using the code on lines 326-337) and then call convert_examples_to_features on it, roughly as sketched below.
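A rough sketch under those assumptions; the names follow BERT's extract_features.py, and it presumes those helpers are importable and that estimator has been built as on lines 346-382:

import tokenization
from extract_features import InputExample, convert_examples_to_features, input_fn_builder

# Build the same tokenizer the checkpoint was trained with
tokenizer = tokenization.FullTokenizer(vocab_file='vocab.txt', do_lower_case=False)

# Wrap a raw string the way read_examples() (lines 326-337) does
examples = [InputExample(unique_id=0, text_a="Some biomedical sentence.", text_b=None)]
features = convert_examples_to_features(examples=examples, seq_length=128, tokenizer=tokenizer)

# 'estimator' is the tf.estimator.Estimator initialized on lines 346-382
input_fn = input_fn_builder(features=features, seq_length=128)
for result in estimator.predict(input_fn, yield_single_examples=True):
    top_layer = result["layer_output_0"]  # vectors from the first requested layer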
I have two pretrained word embeddings: Glove.840b.300.txt and custom_glove.300.txt
One is pretrained by Stanford and the other is trained by me.
Both have different vocabularies. To reduce OOV, I'd like to add the words that appear in file2 but not in file1 to file1.
How do I do that easily?
This is how I load and save the files in gensim 3.4.0.
from gensim.models.keyedvectors import KeyedVectors

# Load vectors in word2vec text format, then re-save them as plain text
model = KeyedVectors.load_word2vec_format('path/to/thefile')
model.save_word2vec_format('path/to/GoogleNews-vectors-negative300.txt', binary=False)
I don't know an easy way.
In particular, word-vectors that weren't co-trained together won't have compatible/comparable coordinate-spaces. (There's no one right place for a word – just a relatively-good place compared to the other words that are in the same model.)
So you can't just append the missing words from another model: you'd need to transform them into compatible locations. Fortunately, it seems to work to use some set of shared anchor words, present in both word-vector sets, to learn a transformation, and then apply it to the words you want to move over.
There's a class, TranslationMatrix, and a demo notebook in gensim showing this process for language translation (an application mentioned in the original word2vec papers). You could conceivably use this, combined with the ability to append extra vectors to a gensim KeyedVectors instance, to create a new set of vectors with a superset of the words in either of your source models.
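A rough sketch of that idea, assuming both files are already in word2vec text format and a gensim 3.x version where KeyedVectors has an add() method (treat this as an outline, not tested code):

from gensim.models.keyedvectors import KeyedVectors
from gensim.models.translation_matrix import TranslationMatrix

big = KeyedVectors.load_word2vec_format('Glove.840b.300.txt')
custom = KeyedVectors.load_word2vec_format('custom_glove.300.txt')

# Anchor words present in both vocabularies, paired with themselves
shared = [w for w in custom.vocab if w in big.vocab]
word_pairs = [(w, w) for w in shared]

# Learn a linear map from the custom space into the big space
tm = TranslationMatrix(custom, big, word_pairs=word_pairs)
tm.train(word_pairs)

# Project the words missing from the big model and append them
missing = [w for w in custom.vocab if w not in big.vocab]
projected = custom[missing].dot(tm.translation_matrix)
big.add(missing, projected)  # called add_vectors() in gensim 4.x

big.save_word2vec_format('merged_glove.300.txt', binary=False)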
I've noticed that in every Keras CNN tutorial or example I've seen, the input images are numbered, e.g.:
dog0001.jpg
dog0002.jpg
dog0003.jpg
...
Is this necessary?
I'm working with an image dataset with fairly random filenames (the classes come from the directory name), e.g.:
picture_A2.jpg
image41110.jpg
cellofinterest9A.jpg
I actually want to keep the filenames because they mean something to me, but do I need to append sequential numbers to my image files?
No, they can have any names; it really depends on how you load your data. In your case, you can use flow_from_directory to generate the training data, and indeed the directory name will be the associated class; this is part of ImageDataGenerator.
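For illustration, a minimal sketch (the directory path and target size are hypothetical):

from keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rescale=1.0 / 255)

# Each subdirectory of 'data/train' becomes one class; the image
# filenames inside the class folders can be anything
train_generator = datagen.flow_from_directory(
    'data/train',
    target_size=(150, 150),  # all images are resized, whatever their names or sizes
    batch_size=32,
    class_mode='categorical')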
I have read lots of examples regarding doc2vec, but I couldn't find an answer to this. As a real example, I want to build a model with doc2vec and then train some ML models with it. After that, how can I get the vector of a raw string from the exact trained Doc2Vec model? I need to feed my ML model a vector of the same size and meaning.
There is a collection of example Jupyter (aka IPython) notebooks in the gensim docs/notebooks directory. You can view them online at:
https://github.com/RaRe-Technologies/gensim/tree/develop/docs/notebooks
But they'll be in your gensim installation directory, if you can find that for your current working environment.
Those that include doc2vec in their name demonstrate the use of the Doc2Vec class. The most basic intro operates on the 'Lee' corpus that's bundled with gensim for use in its unit tests. (It's really too small for real Doc2Vec success, but by forcing smaller models and many training iterations the notebook just barely manages to get some consistent results.) See:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb
It includes a section on inferring a vector for a new text (in the same notebook).
Note that inference is performed on a list of string tokens, not a raw string. And those tokens should have been preprocessed/tokenized the same way as the original training data for the model, so that the vocabularies are compatible. (Any unknown words in a new text are silently ignored.)
Note also that, especially on short texts, it often helps to provide a much-larger-than-default value for the optional steps parameter of infer_vector(): say 50 or 200 rather than the default 5. It may also help to provide a starting alpha parameter more like the training default of 0.025 than the method default of 0.1.
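A minimal sketch of that inference step (the model path is hypothetical, and simple_preprocess stands in for whatever tokenization you used during training):

from gensim.models.doc2vec import Doc2Vec
from gensim.utils import simple_preprocess

# 'my_doc2vec.model' is a placeholder for your saved model path
model = Doc2Vec.load('my_doc2vec.model')

# Tokenize the raw string the same way the training corpus was tokenized
tokens = simple_preprocess("a raw string I want a vector for")

# More inference passes and a training-like alpha often help on short texts
vector = model.infer_vector(tokens, steps=50, alpha=0.025)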