If we want to use the weights from a pretrained BioBERT model, we can execute the following terminal command (here wrapped in os.system) after downloading all the required BioBERT files:
import os

os.system('python3 extract_features.py \
--input_file=trial.txt \
--vocab_file=vocab.txt \
--bert_config_file=bert_config.json \
--init_checkpoint=biobert_model.ckpt \
--output_file=output.json')
The above command reads a single file containing the text and writes the extracted vectors to another file. The problem is that this approach cannot easily be scaled to very large datasets containing thousands of sentences/paragraphs.
Is there a way to extract these features on the fly (using an embedding layer), the way it can be done for word2vec vectors in PyTorch or TF1.3?
Note: BioBERT checkpoints do not exist for TF2.0, so I guess there is no way this could be done with TF2.0 unless someone generates TF2.0-compatible checkpoint files.
I will be grateful for any hint or help.
You can get the contextual embeddings on the fly, but the total time spent on getting the embeddings will always be the same. There are two ways to do it: 1. import BioBERT into the Transformers package and use it in PyTorch (which I would do), or 2. use the original codebase.
1. Import BioBERT into the Transformers package
The most convenient way of using pre-trained BERT models is the Transformers package. It was primarily written for PyTorch, but it also works with TensorFlow. It does not include BioBERT out of the box, so you need to convert it from the TensorFlow format yourself. There is a convert_tf_checkpoint_to_pytorch.py script that does that. People had some issues with this script and BioBERT, but they seem to be resolved.
After you convert the model, you can load it like this:
import torch
from transformers import BertTokenizer, BertModel

# Load the tokenizer and model from the directory with the converted checkpoint
tokenizer = BertTokenizer.from_pretrained('directory_with_converted_model')
model = BertModel.from_pretrained('directory_with_converted_model')

# Call the model in a standard PyTorch way
input_ids = torch.tensor([tokenizer.encode("Cool biomedical tetra-hydro-sentence.", add_special_tokens=True)])
embeddings = model(input_ids)[0]  # last hidden state: (batch, seq_len, hidden_size)
2. Use the BioBERT codebase directly
You can get the embeddings on the fly using essentially the code from extract_features.py. On lines 346-382, they initialize the model, and you get the embeddings by calling estimator.predict(...).
For that, you need to format the input: first format the string (using the code on lines 326-337) and then call convert_examples_to_features on it.
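Roughly, a minimal sketch of that flow, assuming the helper names from extract_features.py and an estimator/tokenizer initialized as in that script:

# A minimal sketch, assuming InputExample, convert_examples_to_features,
# input_fn_builder, tokenizer and estimator are available as defined and
# initialized in extract_features.py (the estimator setup is on lines 346-382)
examples = [InputExample(unique_id=0, text_a="Cool biomedical sentence.", text_b=None)]
features = convert_examples_to_features(examples=examples, seq_length=128, tokenizer=tokenizer)
input_fn = input_fn_builder(features=features, seq_length=128)
for result in estimator.predict(input_fn, yield_single_examples=True):
    token_vectors = result["layer_output_0"]  # one row of hidden states per input token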
Related
FitBERT is a useful package, but I have a small question about using BERT for masked-word prediction: I trained a BERT model on a custom corpus using Google's scripts (create_pretraining_data.py, run_pretraining.py, extract_features.py, etc.), and as a result I got a vocab file, a .tfrecord file, a .json file, and checkpoint files.
Now, how do I use those files with your package to predict a masked word in a given sentence?
From the tensorflow documentation:
A TFRecord file stores your data as a sequence of binary strings. This means you need to specify the structure of your data before you write it to the file. Tensorflow provides two components for this purpose: tf.train.Example and tf.train.SequenceExample. You have to store each sample of your data in one of these structures, then serialize it and use a tf.python_io.TFRecordWriter to write it to disk.
This document, along with the tensorflow documentation, explains quite well how to use those file types.
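For instance, a minimal sketch of serializing one sample with the TF1.x API quoted above (the feature names here are just placeholders):

import tensorflow as tf

# Describe one sample with tf.train.Example; "tokens" and "label" are
# placeholder feature names
example = tf.train.Example(features=tf.train.Features(feature={
    "tokens": tf.train.Feature(bytes_list=tf.train.BytesList(value=[b"hello", b"world"])),
    "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[1])),
}))

# Serialize it and write it to disk
with tf.python_io.TFRecordWriter("sample.tfrecord") as writer:
    writer.write(example.SerializeToString())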
To use FitBERT directly through the library, you can follow the examples on the project's GitHub page.
So I trained a CNN using Keras with the samplewise_std_normalization preprocessing option, which is available through ImageDataGenerator.
But when I try to predict on a single image, the CNN (obviously) does not work, because I did not apply this preprocessing when loading that one image.
Does anybody know what the code should look like to load a single image, apply this preprocessing function, and feed it into the CNN?
Those preprocessing options are meant only for training; it is a technique called data augmentation, and it is not meant to be used while testing.
Even then, if you want to augment the images, you can use the ImgAug library.
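That said, if you do want to reproduce the generator's samplewise normalization on a single image at prediction time, ImageDataGenerator exposes a standardize method. A minimal sketch (my_model, the image path, and the target size are placeholders):

import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator, load_img, img_to_array

# Recreate a generator with the same preprocessing flags used for training
gen = ImageDataGenerator(samplewise_center=True, samplewise_std_normalization=True)

# Load one image, apply the same per-sample normalization, and predict
x = img_to_array(load_img('image.jpg', target_size=(224, 224)))
x = gen.standardize(x)  # (x - mean(x)) / std(x), computed per sample
pred = my_model.predict(np.expand_dims(x, axis=0))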
I am using Torchtext in an NLP project. I have a pretrained embedding in my system, which I'd like to use. Therefore, I tried:
my_field.vocab.load_vectors(my_path)
But, apparently, this only accepts names from a short list of pre-accepted embeddings, for some reason. In particular, I get this error:
Got string input vector "my_path", but allowed pretrained vectors are ['charngram.100d', 'fasttext.en.300d', ..., 'glove.6B.300d']
I found some people with similar problems, but the solutions I have found so far amount to "change Torchtext's source code", which I would rather avoid if at all possible.
Is there any other way I can work with my pretrained embedding? A solution that allows using another Spanish pretrained embedding would also be acceptable.
Some people seem to think it is not clear what I am asking. So, if the title and final question are not enough: "I need help using a pre-trained Spanish word-embedding in Torchtext".
It turns out there is a relatively simple way to do this without changing Torchtext's source code. Inspiration came from this Github thread.
1. Create numpy word-vector tensor
You need to load your embedding so you end up with a numpy array with dimensions (number_of_words, word_vector_length):
my_vecs_array[word_index] should return your corresponding word vector.
IMPORTANT. The indices (word_index) for this array MUST be taken from Torchtext's word-to-index dictionary (field.vocab.stoi). Otherwise Torchtext will point to the wrong vectors!
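For example, a minimal sketch assuming a word2vec-style text file ('my_vectors.txt', one "word v1 v2 ..." entry per line; the file name is hypothetical) and an already-built my_field.vocab:

import numpy as np

word_vector_length = 300  # must match your embedding's dimensionality
my_vecs_array = np.zeros((len(my_field.vocab), word_vector_length), dtype=np.float32)
with open('my_vectors.txt', encoding='utf-8') as f:
    for line in f:
        word, *values = line.rstrip().split(' ')
        if word in my_field.vocab.stoi:  # index via Torchtext's word-to-index map
            my_vecs_array[my_field.vocab.stoi[word]] = np.asarray(values, dtype=np.float32)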
Don't forget to convert to tensor:
my_vecs_tensor = torch.from_numpy(my_vecs_array)
2. Load array to Torchtext
I don't think this step is strictly necessary given the next one, but it allows you to keep the Torchtext field with both the dictionary and the vectors in one place.
my_field.vocab.set_vectors(my_field.vocab.stoi, my_vecs_tensor, word_vector_length)
3. Pass weights to model
In your model you will declare the embedding like this:
my_embedding = torch.nn.Embedding(vocab_len, word_vector_length)
Then you can load your weights using:
my_embedding.weight = torch.nn.Parameter(my_field.vocab.vectors, requires_grad=False)
Use requires_grad=True if you want to train the embedding; use False if you want to freeze it.
EDIT: It looks like there is another way that is a bit easier! The improvement is that you can apparently pass the pre-trained word vectors directly during the vocabulary-building step, which takes care of steps 1-2 here.
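A minimal sketch of that route (the file name and cache directory are placeholders):

from torchtext.vocab import Vectors

# Wrap the custom word2vec-style file; Torchtext caches a processed copy
vectors = Vectors(name='my_vectors.txt', cache='./vector_cache')

# Build the vocabulary and attach the vectors in one step
my_field.build_vocab(my_dataset, vectors=vectors)
# my_field.vocab.vectors is now populated and correctly aligned with stoi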
I want to extract features using caffe and train an SVM on those features. I have gone through this link: http://caffe.berkeleyvision.org/gathered/examples/feature_extraction.html. This link shows how to extract features using caffenet, but I want to use the LeNet architecture here. I am unable to adapt this command for LeNet:
./build/tools/extract_features.bin models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel examples/_temp/imagenet_val.prototxt fc7 examples/_temp/features 10 leveldb
Also, after extracting the features, how do I train an SVM on them? I want to use python for this. For example, if I get features from this code:
features = net.blobs['pool2'].data.copy()
Then, how can I train an SVM on these features with my own class labels?
You have two questions here:
Extracting features using LeNet
Training an SVM
Extracting features using LeNet
To extract features from LeNet using the extract_features.bin script, you need the model file (.caffemodel) and the model definition for testing (.prototxt).
The signature of extract_features.bin is:
Usage: extract_features pretrained_net_param feature_extraction_proto_file extract_feature_blob_name1[,name2,...] save_feature_dataset_name1[,name2,...] num_mini_batches db_type [CPU/GPU] [DEVICE_ID=0]
So if you take the AlexNet train_val prototxt (https://github.com/BVLC/caffe/blob/master/models/bvlc_alexnet/train_val.prototxt) as an example, you can change it to the LeNet architecture and point it at your LMDB / LevelDB. That should get you most of the way there. Once you have done that, if you get stuck, you can update your question or post a comment here so we can help.
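For example, with the standard MNIST LeNet example the call could look like this (the checkpoint name, prototxt path, and the ip1 blob name are illustrative and must match your own files):
./build/tools/extract_features.bin examples/mnist/lenet_iter_10000.caffemodel examples/mnist/lenet_extract.prototxt ip1 examples/_temp/features 10 leveldb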
Training SVM on top of features
I highly recommend using Python's scikit-learn for training an SVM on the features. It is super easy to get started with, including reading in features saved from Caffe's format.
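A minimal sketch, assuming features is the array copied from net.blobs['pool2'].data as in the question, and labels is a hypothetical array with one class id per sample:

import numpy as np
from sklearn.svm import SVC

# Flatten each sample's feature maps into one vector: (num_samples, feature_dim)
X = features.reshape(features.shape[0], -1)

# Fit a linear SVM on the features and your own class labels
clf = SVC(kernel='linear')
clf.fit(X, labels)
predictions = clf.predict(X)  # in practice, predict on held-out features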
Very late reply, but it should help.
Not 100% what you want, but I have used the VGG-16 net to extract face features using caffe and performed an accuracy test on a small subset of the LFW dataset. Exactly what you need is in the code: it creates classes for training and testing and pushes them into an SVM for classification.
https://github.com/wajihullahbaig/VGGFaceMatching
I have trained a model in caffe and saved it in the ".caffemodel.h5" format. I want to parse it to extract the parameters and feed them to a lasagne model. How can I do that?
You will first have to replicate the architecture of your caffemodel in terms of Lasagne layers, either by hand or using a script that does this (AFAIK, a script that takes a caffe protobuf and translates it to lasagne exists in sklearn-theano).
By hand, you need to open the HDF5 file using e.g. pytables, read out the parameter arrays, and then replicate the architecture in Lasagne.
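A minimal sketch (using h5py rather than pytables; the file name is a placeholder, and the exact group layout of a .caffemodel.h5 should be inspected first):

import h5py

with h5py.File('model.caffemodel.h5', 'r') as f:
    # Walk the file to discover where each layer's parameters live
    f.visit(print)
    # Once the path is known, read an array, e.g. (hypothetical path):
    # W = f['data/conv1/0'][...]

# Lasagne parameters are Theano shared variables, so after building the
# matching layer by hand you can load the weights with set_value, e.g.:
# my_conv1.W.set_value(W.astype('float32'))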