Use pretrained embedding in Spanish with Torchtext - nlp

I am using Torchtext in an NLP project. I have a pretrained embedding in my system, which I'd like to use. Therefore, I tried:
my_field.vocab.load_vectors(my_path)
But, apparently, this only accepts the names of a short list of pre-accepted embeddings, for some reason. In particular, I get this error:
Got string input vector "my_path", but allowed pretrained vectors are ['charngram.100d', 'fasttext.en.300d', ..., 'glove.6B.300d']
I found some people with similar problems, but the solutions I can find so far are "change Torchtext source code", which I would rather avoid if at all possible.
Is there any other way in which I can work with my pretrained embedding? A solution that lets me use another Spanish pretrained embedding would also be acceptable.
Some people seem to think it is not clear what I am asking. So, if the title and final question are not enough: "I need help using a pre-trained Spanish word-embedding in Torchtext".

It turns out there is a relatively simple way to do this without changing Torchtext's source code. Inspiration taken from this GitHub thread.
1. Create numpy word-vector tensor
You need to load your embedding so you end up with a numpy array with dimensions (number_of_words, word_vector_length):
my_vecs_array[word_index] should return your corresponding word vector.
IMPORTANT: The indices (word_index) for this array MUST be taken from Torchtext's word-to-index dictionary (field.vocab.stoi). Otherwise Torchtext will point to the wrong vectors!
Don't forget to convert to tensor:
my_vecs_tensor = torch.from_numpy(my_vecs_array)
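For reference, here is a minimal sketch of step 1, assuming the Spanish vectors sit in a plain-text file with one "word value value ..." line per word (the file name and the 300-dimension size are just placeholders):
import numpy as np
import torch

EMB_DIM = 300  # placeholder word-vector length

# Load the pretrained file into a plain dict: word -> vector
pretrained = {}
with open("my_spanish_vectors.txt", encoding="utf-8") as f:
    for line in f:
        word, *values = line.rstrip().split(" ")
        pretrained[word] = np.array(values, dtype=np.float32)

# Fill the array row by row, using Torchtext's own word-to-index mapping,
# so my_vecs_array[word_index] matches field.vocab.stoi; unknown words stay zero
my_vecs_array = np.zeros((len(my_field.vocab), EMB_DIM), dtype=np.float32)
for word, word_index in my_field.vocab.stoi.items():
    if word in pretrained:
        my_vecs_array[word_index] = pretrained[word]

my_vecs_tensor = torch.from_numpy(my_vecs_array)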
2. Load array to Torchtext
I don't think this step is strictly necessary because of the next one, but it lets you keep the Torchtext field with both the dictionary and the vectors in one place.
my_field.vocab.set_vectors(my_field.vocab.stoi, my_vecs_tensor, word_vectors_length)
3. Pass weights to model
In your model you will declare the embedding like this:
my_embedding = torch.nn.Embedding(vocab_len, word_vect_len)
Then you can load your weights using:
my_embedding.weight = torch.nn.Parameter(my_field.vocab.vectors, requires_grad=False)
Use requires_grad=True if you want to train the embedding, and requires_grad=False if you want to freeze it.
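As a side note, recent PyTorch versions also provide nn.Embedding.from_pretrained, which wraps the same weight assignment in a single call (same names as above; freeze=True corresponds to requires_grad=False):
my_embedding = torch.nn.Embedding.from_pretrained(my_field.vocab.vectors, freeze=True)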
EDIT: It turns out there is another way that is a bit easier! The improvement is that apparently you can pass the pre-trained word vectors directly during the vocabulary-building step, which takes care of steps 1-2 here.
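Roughly, that would look something like the following sketch (legacy torchtext Field/vocab API; the file path, cache directory and my_dataset are placeholders):
from torchtext.vocab import Vectors

# Point Vectors at the local pretrained file; torchtext loads and caches it
spanish_vectors = Vectors(name="my_spanish_vectors.txt", cache="./vector_cache")

# Passing vectors here replaces steps 1-2: the vocab is built and
# my_field.vocab.vectors is filled with the matching rows in one go
my_field.build_vocab(my_dataset, vectors=spanish_vectors)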

Related

How to convert small dataset into word embeddings instead of one-hot encoding?

I have a dataset of 33 words that are a mix of verbs and nouns, e.g. father, sing, etc. I have tried converting them to one-hot encodings, but for my use case it has been suggested to look into word2vec embeddings. I have looked at gensim and GloVe but am struggling to make them work.
How could I convert my data into an embedding, such that two words that are semantically closer have a smaller distance between their respective vectors? How may this be achieved, and is there any helpful material on the subject?
Since your dataset is quite small, and I'm assuming it doesn't contain any jargon, it's best to use a pre-trained model in order to save on training time.
With gensim, it's as simple as:
import gensim.downloader as api
wv = api.load('word2vec-google-news-300')
The 'word2vec-google-news-300' model has been pre-trained on a part of the Google News Dataset and generalizes well enough to most tasks. Following this, you can create word embeddings/vectors like so:
vec = wv['father']
And, finally, for computing word similarity:
similarity_score = wv.similarity('father', 'sing')
Lastly, one major limitation of Word2Vec is its inability to deal with words that are OOV (out of vocabulary). For such cases, it's best to train a custom model on your corpus.
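For completeness, here is a small sketch of how you could guard against OOV words and inspect neighbours with the same wv object (attribute names follow gensim 4.x; older versions expose wv.vocab instead of wv.key_to_index):
words = ["father", "sing", "xyzzy"]  # placeholder sample from your 33 words

for word in words:
    if word in wv.key_to_index:  # membership check avoids a KeyError on OOV words
        print(word, wv.most_similar(word, topn=3))
    else:
        print(word, "is out of vocabulary for this model")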

When doing pre-training of a transformer model, how can I add words to the vocabulary?

Given a trained DistilBERT language model for a given language, taken from the Huggingface hub, I want to pre-train the model on a specific domain, and I want to add new words that are:
definitely non-existent in the original training set
and impossible to handle via word-piece tokenization - basically you can think of these words as "codes" that are a normalized form of a named entity
Consider that:
I would like to avoid learning a new tokenizer: I am fine with adding the new words and then letting the model learn their embeddings via pre-training
the number of the "words" is way larger than the number of "unused" tokens in the "stock" vocabulary
The only advice that I have found is the one reported here:
Append it to the end of the vocab, and write a script which generates a new checkpoint that is identical to the pre-trained checkpoint, but with a bigger vocab where the new embeddings are randomly initialized (for initialization we used tf.truncated_normal_initializer(stddev=0.02)). This will likely require mucking around with some tf.concat() and tf.assign() calls.
Do you think this is the only way to achieve my goal?
If yes, I do not have any idea of how to write this "script": does anyone have some hints on how to proceed (sample code, documentation, etc.)?
As per my comment, I'm assuming that you go with a pre-trained checkpoint, if only to "avoid learning a new tokenizer."
Also, the solution works with PyTorch, which might be more suitable for such changes. I haven't checked TensorFlow (which is mentioned in one of your quotes), so no guarantees that this works across frameworks.
To solve your problem, let us divide this into two sub-problems:
Adding the new tokens to the tokenizer, and
Re-sizing the token embedding matrix of the model accordingly.
The first can actually be achieved quite simply by using .add_tokens(). I'm referencing the slow tokenizer's implementation of it (because it's in Python), but from what I can see, this also exists for the faster Rust-based tokenizers.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
# Will return an integer corresponding to the number of added tokens
# The input could also be a list of strings instead of a single string
num_new_tokens = tokenizer.add_tokens("dennlinger")
You can quickly verify that this worked by looking at the encoded input ids:
print(tokenizer("This is dennlinger."))
# 'input_ids': [101, 2023, 2003, 30522, 1012, 102]
The index 30522 now corresponds to the new token with my username, so we can check the first part. However, if we look at the function docstring of .add_tokens(), it also says:
Note, when adding new tokens to the vocabulary, you should make sure to also resize the token embedding matrix of the model so that its embedding matrix matches the tokenizer.
In order to do that, please use the PreTrainedModel.resize_token_embeddings method.
Looking at this particular function, the description is a bit confusing, but we can get a correctly resized matrix (with randomly initialized weights for the new tokens) by simply passing the previous vocabulary size plus the number of new tokens:
from transformers import AutoModel
model = AutoModel.from_pretrained("distilbert-base-uncased")
model.resize_token_embeddings(model.config.vocab_size + num_new_tokens)
# Test that everything worked correctly
model(**tokenizer("This is dennlinger", return_tensors="pt"))
EDIT: Notably, .resize_token_embeddings() also takes care of any associated weights; this means, if you are pre-training, it will also adjust the size of the language modeling head (which should have the same number of tokens), or fix tied weights that would be affected by an increased number of tokens.
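If you then want to run your domain pre-training on the enlarged model, one simple way to keep everything in sync is to save the tokenizer and the resized model together (the output directory is a placeholder):
# Saving both to the same directory keeps the vocab size and the
# embedding matrix consistent when they are reloaded for pre-training
tokenizer.save_pretrained("./distilbert-domain")
model.save_pretrained("./distilbert-domain")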

get closest vector from unknown vector with gensim

I am currently implementing a natural text generator for a school project. I have a dataset of sentences of predetermined length and key words, which I convert into vectors with gensim and GoogleNews-vectors-negative300.bin.gz. I train a recurrent neural network to create a list of vectors that I compare to the list of vectors of the real sentence, so I try to get as close as possible to the "real" vectors.
My problem happens when I have to convert the vectors back into words: my vectors aren't necessarily in the Google set. So I would like to know if there is an efficient way to get the vector in the Google set that is closest to an output vector.
I work with Python 3 and TensorFlow.
Thanks a lot, feel free to ask any questions about the project
Charles
The gensim method .most_similar() (on KeyedVectors & similar classes) will also accept raw vectors as the 'origin' from which to search.
Just be sure to explicitly name the positive parameter - a list of target words/vectors to combine to find the origin point.
For example:
from gensim.models import KeyedVectors

# binary=True is needed because the GoogleNews file is in binary word2vec format
gvecs = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)
target_vec = gvecs['apple']
similars = gvecs.most_similar(positive=[target_vec,])
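Since the output of your network is a raw vector rather than a word, gensim's similar_by_vector is even more direct, as it skips the positive= wrapping (a minimal sketch; my_rnn_output stands in for your model's 300-dimensional output):
# my_rnn_output: a 300-dimensional numpy array produced by your network
closest_words = gvecs.similar_by_vector(my_rnn_output, topn=5)
print(closest_words)  # list of (word, cosine similarity) pairs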

Transformers PreTrainedTokenizer add_tokens Functionality

Referring to the documentation of the awesome Transformers library from Huggingface, I came across the add_tokens functions.
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
num_added_toks = tokenizer.add_tokens(['new_tok1', 'my_new-tok2'])
model.resize_token_embeddings(len(tokenizer))
I tried the above by adding previously absent words to the default vocabulary. However, keeping all else constant, I noticed a decrease in accuracy of the fine-tuned classifier making use of this updated tokenizer. I was able to replicate similar behavior even when just 10% of the previously absent words were added.
My questions
Am I missing something?
Instead of whole words, is the add_tokens function expecting masked tokens, for example : '##ah', '##red', '##ik', '##si', etc.? If yes, is there a procedure to generate such masked tokens?
Any help would be appreciated.
Thanks in advance.
If you add tokens to the tokenizer, you indeed make the tokenizer tokenize the text differently, but this is not the tokenization BERT was trained with, so you are basically adding noise to the input. The word embeddings are not trained and the rest of the network never saw them in context. You would need a lot of data to teach BERT to deal with the newly added words.
There are also some ways to compute an embedding for a single word such that it does not hurt BERT, like in this paper, but it seems pretty complicated and should not make any difference.
BERT uses a word-piece-based vocabulary, so it should not really matter if the words are present in the vocabulary as a single token or get split into multiple wordpieces. The model probably saw the split word during pre-training and will know what to do with it.
Regarding the ##-prefixed tokens, those are tokens that can only appear as a continuation (suffix) of another wordpiece. E.g., walrus gets split into ['wal', '##rus'], and you need both of those wordpieces to be in the vocabulary, but not ##wal or rus.
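You can see the wordpiece behaviour for yourself by tokenizing a word that is not in the vocabulary as a single token (a quick sketch; the exact split depends on the checkpoint's vocabulary):
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("walrus"))              # e.g. ['wal', '##rus']
print(tokenizer.tokenize("walrus is swimming"))  # the rest of the sentence tokenizes as usual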

Setting max length of char n-grams for fastText

I want to compare word2vec and fastText models based on this comparison tutorial:
https://github.com/jayantj/gensim/blob/fast_text_notebook/docs/notebooks/Word2Vec_FastText_Comparison.ipynb
According to this, the semantic accuracy of the fastText model increases when we set the max length of char n-grams to zero, such that fastText starts to behave almost like word2vec; it ignores the n-grams.
However, I cannot find any information on how to set this parameter while loading a fastText model. Any ideas on how to do this?
The parameter is set at training time: the model is built using that parameter and depends on it for interpretation. So you wouldn't typically change it when loading an already-trained model, and there's no API in gensim (or the original FastText) for changing the setting on an already-trained model.
(By looking at the source and tampering with the loaded model state directly, you might be able to approximate the effect of ignoring char-ngrams that had been trained – but that'd be a novel mode, not at all like the no-ngrams-trained mode evaluated in the notebook you've linked. It might generate interesting, or awful, results – no way to tell without trying it.)
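For completeness, the training-time choice looks roughly like this in gensim: setting max_n below min_n disables character n-grams entirely, which is the word2vec-like mode compared in the notebook (the corpus and dimensions are placeholders; vector_size is called size in gensim < 4.0):
from gensim.models import FastText

# sentences: your tokenized corpus, e.g. a list of lists of strings
model_no_ngrams = FastText(
    sentences=sentences,
    vector_size=100,
    min_n=3,
    max_n=0,  # max_n < min_n => no char n-grams, word2vec-like behaviour
)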
